April 13, 2008 4:22 AM
Search strategies to explore the invisible web
Google has announced a new search strategy to explore the invisible web. The invisible web comprises web documents that have not yet been indexed. Google's strategy is to retrieve such information from web sites that use HTML forms, by submitting input data to the forms' HTML input controls. As mentioned in the post announcing this strategy, the crawler will respect a web site's wishes by processing its robots.txt file and the 'ROBOTS' META tag, and it will avoid HTTP requests that require processing of user information.
Another search engine that explores the invisible web is RSSMicro, which crawls XML feeds and collects syndication data.
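As a rough sketch of the behaviour described above, the following Python fragment builds GET URLs from candidate form inputs and keeps only those that a site's robots.txt permits. The robots.txt content, the 'ExampleBot' user agent, the example.com URLs and the helper name allowed_form_urls are all assumptions made for illustration. It does not check the 'ROBOTS' META tag, and it is not Google's implementation, which the announcement does not describe at this level of detail.

```python
from urllib.parse import urlencode
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for an example site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def allowed_form_urls(base_url, action, candidate_inputs, robots_txt,
                      agent="ExampleBot"):
    """Build GET URLs for a form and keep those that robots.txt permits."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse in memory, no network access
    urls = []
    for inputs in candidate_inputs:
        url = f"{base_url}{action}?{urlencode(inputs)}"
        if parser.can_fetch(agent, url):
            urls.append(url)
    return urls


if __name__ == "__main__":
    guesses = [{"q": "rss"}, {"q": "atom"}]
    for url in allowed_form_urls("http://example.com", "/search",
                                 guesses, ROBOTS_TXT):
        print(url)
```

Only forms submitted with GET are covered here, since their results are addressable by URL; a form whose action lies under /private/ would yield no URLs at all.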
"In other words, a new web is evolving which has its roots in XML based feeds rather than HTML pages. Some may refer to Web 2.0 or 3.0 but nevertheless the data is mainly comprised of user generated XML feeds. Unknowingly, we might be building a new web which is a duplicate of the original in terms of the content but is being transformed into XML" RSSMicro.
XML feeds provide a good amount of semantically annotated data. The normative metadata they use is defined in syndication specifications such as RSS 1.0, RSS 2.0 and Atom. Because of this normative metadata it is possible to collect keywords for intelligent search; for example, the 'category' element in the RSS 2.0 and Atom specifications carries the keywords used in a post. As mentioned on the RSSMicro web site, they collect "millions of keywords" and apply clustering algorithms to answer user queries. The normative metadata included in an RSS feed is not present in the web page of the news item published in that feed. The difference can be observed by comparing the source of the RSS feed file with the source of any post published on the feed: the 'category' element present in the XML feed does not appear in the blog web page; instead, the keywords are included as anchor text of HTML 'a' elements with @rel=tag.
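The difference between the two sources can be sketched in Python as follows. The sample feed, the sample page and the function names are invented for illustration; a real feed or blog template may be structured differently.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical RSS 2.0 feed with one item carrying two keywords.
RSS_FEED = """<rss version="2.0"><channel><title>Example blog</title>
<item><title>Example post</title>
<category>invisible web</category>
<category>RSS</category>
</item></channel></rss>"""

# Hypothetical blog page carrying the same keywords as rel="tag" anchors.
HTML_PAGE = """<html><body>
<a href="/tags/invisible-web" rel="tag">invisible web</a>
<a href="/tags/rss" rel="tag">RSS</a>
<a href="/about">About</a>
</body></html>"""


def feed_keywords(rss_text):
    """Return the text of every 'category' element in an RSS 2.0 feed."""
    root = ET.fromstring(rss_text)
    return [c.text.strip() for c in root.iter("category") if c.text]


class RelTagParser(HTMLParser):
    """Collect the anchor text of 'a' elements whose rel contains 'tag'."""

    def __init__(self):
        super().__init__()
        self.keywords = []
        self._in_tag_link = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            rel = dict(attrs).get("rel") or ""
            self._in_tag_link = "tag" in rel.split()

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_tag_link = False

    def handle_data(self, data):
        if self._in_tag_link and data.strip():
            self.keywords.append(data.strip())


def page_keywords(html_text):
    parser = RelTagParser()
    parser.feed(html_text)
    parser.close()
    return parser.keywords


if __name__ == "__main__":
    print(feed_keywords(RSS_FEED))
    print(page_keywords(HTML_PAGE))
```

The feed exposes the keywords through a dedicated element, whereas on the page they have to be recovered from link text by relying on the rel=tag convention.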
XML underlies the common serializations of semantic web languages such as RDF (as RDF/XML) and OWL, while RDFa carries the same data model in the attributes of XHTML markup. With RDFa and microformats it is possible to annotate web pages so that normative metadata can also be found in the HTML. Duplication of information can be avoided by using the same normative metadata in both web syndication and web content publishing tools. RSS 1.0 addresses this issue to some extent by including Dublin Core metadata in the RDF description of items: keywords can be annotated with the 'dc:subject' property in both the RSS 1.0 XML feed and the HTML web page. Since this duplication is performed by the application that generates the feed, there is no loss of effort or information. Most syndication feeds include only the first few lines of a blog post, while more semantically annotated data may be embedded in the full content, for example through vocabularies such as vCard, iCal and FOAF.
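A minimal sketch of a publishing tool emitting the same keywords into both outputs might look like this. The function names are hypothetical, and the fragments assume that the enclosing feed and page already bind the 'dc' prefix to the Dublin Core namespace http://purl.org/dc/elements/1.1/.

```python
from xml.sax.saxutils import escape


def feed_subjects(keywords):
    """Render keywords as dc:subject elements for an RSS 1.0 item."""
    return "\n".join(f"<dc:subject>{escape(k)}</dc:subject>" for k in keywords)


def page_subjects(keywords):
    """Render the same keywords as RDFa-annotated spans for the HTML page."""
    return "\n".join(
        f'<span property="dc:subject">{escape(k)}</span>' for k in keywords
    )


if __name__ == "__main__":
    keywords = ["invisible web", "RSS"]  # single source for both outputs
    print(feed_subjects(keywords))
    print(page_subjects(keywords))
```

Because both fragments are generated from one keyword list, the feed and the page cannot drift apart.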
Most blog tools also include links to related, popular or previous posts, categories, archives and tag clouds on the blog web page. No HTML form has to be filled in to reach this data. A search engine such as RSSMicro, which reads XML files, can collect only a limited amount of data from a single syndication feed. More data can be collected by a search engine that crawls all the @href and @src links on the web page, since this covers the tag cloud, archives and categories. At this point in Web 2.0, the XML file crawlers have an advantage because of the normative metadata made available by syndication formats such as RSS 1.0, RSS 2.0 and Atom.
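The link-following approach can be illustrated with a small collector for @href and @src values. This is only a sketch: the sample markup, the base URL and the names LinkCollector and collect_links are assumptions, and a real crawler would also need fetching, politeness rules and duplicate handling across pages.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect absolute URLs from every href and src attribute on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                url = urljoin(self.base_url, value)  # resolve relative links
                if url not in self.links:            # keep first occurrence only
                    self.links.append(url)


def collect_links(html_text, base_url):
    collector = LinkCollector(base_url)
    collector.feed(html_text)
    collector.close()
    return collector.links


if __name__ == "__main__":
    page = ('<a href="/tags/rss">RSS</a>'
            '<a href="archive/2008/04/">April 2008</a>'
            '<img src="logo.png">')
    for link in collect_links(page, "http://example.com/blog/"):
        print(link)
```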
Conclusion: The indexing of semantic data by Yahoo!, of HTML forms by Google and of XML data by RSSMicro are steps into the invisible web. The invisible web now encompasses not only documents that have not been indexed but also information (i.e. data plus context) that has not been found. Web content published with AJAX tools holds data that can be explored with a combination of these search strategies. RDF/XML derived from a web page by a GRDDL transformation and stored on the web server under P3P controls could also address the data accessibility issues of AJAX applications.
Update: To view the RDFa embedded in this page:
- Drag the 'RDFa Highlight' bookmarklet from here to the bookmarks bar.
- Open this blog post in the browser.
- Now click the 'RDFa Highlight' bookmark; the embedded RDFa is highlighted with a red rectangular border around each object.
- The RDF triple can be viewed by hovering the mouse over the rectangle.


