April 14, 2008

ITGumbo: spicing IT up

IT Copywrite

Technology and application of technology.

ebizQ presents ITGumbo: a spicy blog network where vendors and IT professionals share ideas about creating Business Agility.

Text mining for intelligent content & context search

Text mining is a technique used to mine written information resources for new, previously undiscovered information. It is similar to data mining, a technique used to mine information resources for statistical data. The difference is that the latter is used for structured information such as XML documents, while text mining targets unstructured text. A semantic web schema facilitates deep information search. To extract meaning from the unstructured, hidden information that has been published on the internet for years, that information must first be organized systematically.

Some text mining technologies are:

  • Pattern matching
  • Topic-tracking
  • Summarization
  • Categorization
  • Clustering
  • Concept linkage
  • Information visualization
  • Question answering

Pattern matching – used for information extraction; it identifies key phrases and relationships in a document. The information extraction process yields <subject, predicate, object> triples. This method is used to find an information resource based on semantic data.
Example:
Find a document on a given topic created by this author. The triples formed are: <document, topic, given_name>, <document, author, this_name>. The pattern matching search is made on the topic and author properties of all documents for the given values.
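
The topic-and-author search can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the in-memory triple list, the document ids, and the names `match` and `find_documents` are invented here and do not come from any particular tool.

```python
from typing import List, Optional, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

# Hypothetical triple store; ids, topics, and authors are made up for illustration.
TRIPLES: List[Triple] = [
    ("doc1", "topic", "text mining"),
    ("doc1", "author", "alice"),
    ("doc2", "topic", "text mining"),
    ("doc2", "author", "bob"),
    ("doc3", "topic", "usability"),
    ("doc3", "author", "alice"),
]


def match(triples: List[Triple],
          subject: Optional[str] = None,
          predicate: Optional[str] = None,
          obj: Optional[str] = None) -> List[Triple]:
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (subject is None or t[0] == subject)
        and (predicate is None or t[1] == predicate)
        and (obj is None or t[2] == obj)
    ]


def find_documents(triples: List[Triple], topic: str, author: str) -> Set[str]:
    """Documents that satisfy both <doc, topic, topic> and <doc, author, author>."""
    on_topic = {s for s, _, _ in match(triples, predicate="topic", obj=topic)}
    by_author = {s for s, _, _ in match(triples, predicate="author", obj=author)}
    return on_topic & by_author


print(find_documents(TRIPLES, "text mining", "alice"))  # {'doc1'}
```

Each pattern is matched separately and the results are intersected, which mirrors the two triples in the example.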

Topic-tracking – an alert service for news updates on a given topic. This technique is used to track newly published information for specific keywords. More advanced topic-tracking applications can infer the user's preferences from past reading history and click-through information.
Example:
Personalized feeds can be provided based on user preferences such as clicks on tag words in the tag cloud. A topic-tracking application can also be used to track a competitor on the web. A reader can subscribe to an alert service on "competitor brand" to get all the news about it.
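
A keyword alert of this kind might look like the following sketch. It assumes subscriptions are stored as a plain dictionary of reader names and keyword phrases; `tokenize`, `alerts_for`, and the sample readers are hypothetical names chosen for the example.

```python
import re
from typing import Dict, List, Set


def tokenize(text: str) -> Set[str]:
    """Lower-case the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def alerts_for(item: str, subscriptions: Dict[str, List[str]]) -> List[str]:
    """Return the readers to notify about a newly published item.

    A keyword phrase matches when all of its words occur somewhere in the
    item (the words need not be adjacent, which is a simplification).
    """
    words = tokenize(item)
    notified = []
    for reader, keywords in subscriptions.items():
        if any(tokenize(keyword) <= words for keyword in keywords):
            notified.append(reader)
    return notified


# Hypothetical subscriptions: reader -> tracked keyword phrases.
SUBSCRIPTIONS = {
    "reader1": ["acme"],
    "reader2": ["text mining"],
}

print(alerts_for("Acme launches new search tool", SUBSCRIPTIONS))  # ['reader1']
```

Inferring preferences from reading history would replace the fixed keyword lists with keywords derived from clicks, but the matching step stays the same.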

Summarization – sentence extraction and position information methods are used to summarize long documents. Summarization applications create automated summaries so that the reader does not need to read the complete document. The computer extracts sentences from important positions in the document, such as the case study, the conclusion, tables and figures.
Example:
The summary of the reference article for this blog post can be formed by including statements like "Text mining is ..." from the introduction and "... future is ..." from the conclusion, while the headings provide the list of text mining technologies.
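
Position-based extraction can be illustrated with a very small sketch. It assumes the document is already split into paragraphs and simply takes the opening sentence of the first and last paragraphs; the function names and the naive sentence splitter are choices made for this example, not a description of any real summarizer.

```python
import re
from typing import List


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def summarize_by_position(paragraphs: List[str]) -> List[str]:
    """Pick the first sentence of the introduction and of the conclusion."""
    if not paragraphs:
        return []
    picked: List[str] = []
    for paragraph in (paragraphs[0], paragraphs[-1]):
        sentences = split_sentences(paragraph)
        if sentences and sentences[0] not in picked:
            picked.append(sentences[0])
    return picked


# Invented sample text for illustration.
ARTICLE = [
    "Text mining is a technique. It has many uses.",
    "Several methods exist. Each has trade-offs.",
    "The future is promising. More work remains.",
]

print(summarize_by_position(ARTICLE))
# ['Text mining is a technique.', 'The future is promising.']
```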

Categorization – documents are assigned to categories using pre-defined tags.
Example:
All blog posts with the “usability” tag will be categorized in the usability tag category.
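
As a minimal sketch, tag-based categorization amounts to grouping posts by the pre-defined tags they carry. The post titles, the tag set, and the `categorize` helper below are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def categorize(posts: Iterable[Tuple[str, List[str]]],
               known_tags: Set[str]) -> Dict[str, List[str]]:
    """Group post titles under each pre-defined tag they carry.

    Tags that are not in known_tags are ignored.
    """
    categories: Dict[str, List[str]] = defaultdict(list)
    for title, tags in posts:
        for tag in tags:
            if tag in known_tags:
                categories[tag].append(title)
    return dict(categories)


# Hypothetical posts: (title, tags).
POSTS = [
    ("Form design tips", ["usability", "design"]),
    ("Server tuning", ["performance"]),
]

print(categorize(POSTS, {"usability", "performance"}))
# {'usability': ['Form design tips'], 'performance': ['Server tuning']}
```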

Clustering – a technique that groups similar documents. The grouping is done by the clustering engine when it encounters a document for the first time. Clustering is also used to group documents into sub-categories.
Example:
All blog posts on women's apparel, cars with a feminine design, etc., may be clustered into apparel and cars sub-categories under the women category. The blog post author may assign only the “women” tag, and the clustering is done automatically.
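
One simple way to group documents as they arrive is to compare each new document with the vocabulary of the existing clusters. The sketch below uses Jaccard word overlap with an arbitrary threshold; both are assumptions made for this example, and real clustering engines typically use more robust similarity measures.

```python
import re
from typing import List, Set


def words(text: str) -> Set[str]:
    """Lower-cased set of alphabetic words in the text."""
    return set(re.findall(r"[a-z]+", text.lower()))


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


class Clusterer:
    """Assigns each new document to the most similar cluster, or starts a new one."""

    def __init__(self, threshold: float = 0.2) -> None:
        self.threshold = threshold          # illustrative value, not tuned
        self.clusters: List[List[str]] = []
        self._vocab: List[Set[str]] = []    # words seen in each cluster

    def add(self, doc: str) -> int:
        """Place the document in a cluster and return that cluster's index."""
        doc_words = words(doc)
        best, best_score = -1, 0.0
        for i, vocab in enumerate(self._vocab):
            score = jaccard(doc_words, vocab)
            if score > best_score:
                best, best_score = i, score
        if best >= 0 and best_score >= self.threshold:
            self.clusters[best].append(doc)
            self._vocab[best] |= doc_words
            return best
        self.clusters.append([doc])
        self._vocab.append(set(doc_words))
        return len(self.clusters) - 1


clusterer = Clusterer()
for post in ["summer dresses for women",
             "winter dresses and coats for women",
             "compact cars with feminine design",
             "feminine design in small cars"]:
    print(clusterer.add(post), post)
```

With these invented posts, the two apparel posts share one cluster and the two car posts share another.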

Concept linkage – a linking technique similar to clustering but more advanced. It links documents that share common concepts. This technique has been found useful in research projects.
Example:
A concept is formed that a particular color is liked by women. To form this concept, a statistical search is done for the most popular color in women's apparel, and then another search is done for the most popular car color among women.
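
The color example can be sketched as two frequency searches whose results are compared. The data and the function names are invented for illustration, and treating "most frequent value" as the statistical search is an assumption of this sketch.

```python
from collections import Counter
from typing import List, Optional


def most_popular(values: List[str]) -> Optional[str]:
    """Most frequent value in the list, or None if the list is empty."""
    if not values:
        return None
    return Counter(values).most_common(1)[0][0]


def linked_concept(apparel_colors: List[str], car_colors: List[str]) -> Optional[str]:
    """Return the color that tops both searches, or None if they disagree."""
    apparel_top = most_popular(apparel_colors)
    car_top = most_popular(car_colors)
    if apparel_top is not None and apparel_top == car_top:
        return apparel_top
    return None


# Hypothetical search results: colors mentioned in each document set.
APPAREL = ["red", "blue", "red", "pink"]
CARS = ["silver", "red", "red"]

print(linked_concept(APPAREL, CARS))  # red
```

When both searches agree, the shared result links the two document sets under one concept; when they disagree, no concept is formed.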

Information visualization – a technique that organizes the results of text mining in a visual map. The different categories of the search result are presented in a hierarchical form.
Example:
A search on “women” is presented in a map. The user can directly click on a sub-category.
[Image: may02.gif]

Question answering – text mining used to provide answers to a natural language query. The natural language query is translated into ontologies, and the resulting triples are then used to search the answer database.
Example:
Question: Who is the author of article “Tapping the power of text mining”?
The document URI matching <document, title, “Tapping the power of text mining”> is found in the document database, and then the name value from the <document, author, name> triple is returned.
Answer: name
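
The two-step lookup can be sketched as follows. The regular expression only handles questions of the form shown in the example, and the document id and the `placeholder_name` value are stand-ins invented for this sketch; a real system would need a far richer query translation step.

```python
import re
from typing import List, Optional, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

# Hypothetical document database; the id and author value are placeholders.
TRIPLES: List[Triple] = [
    ("doc42", "title", "Tapping the power of text mining"),
    ("doc42", "author", "placeholder_name"),
]

# Matches: Who is the <property> of [the] article "<title>"?
QUESTION = re.compile(
    r'who is the (\w+) of (?:the )?article ["“](.+?)["”]',
    re.IGNORECASE,
)


def answer(question: str, triples: List[Triple]) -> Optional[str]:
    """Translate the question into triple patterns and look up the answer."""
    found = QUESTION.search(question)
    if not found:
        return None
    prop, title = found.group(1).lower(), found.group(2)
    # Step 1: find documents matching <document, title, title>.
    documents = [s for s, p, o in triples if p == "title" and o == title]
    # Step 2: return the object of <document, prop, ?> for the first hit.
    for document in documents:
        for s, p, o in triples:
            if s == document and p == prop:
                return o
    return None


print(answer('Who is the author of article "Tapping the power of text mining"?', TRIPLES))
# placeholder_name
```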

The reference for this post is the article “Tapping the power of text mining”.

An intelligent tool can identify new topics for categorization or clustering. New categories may be created on the fly based on the count of a particular word in a document. Idle-time or run-time clustering can be performed on the document database by finding common words and ontologies, so that more entries on a given topic can be found automatically. A combination of text mining technologies can produce powerful search tools. Factor is a technology developed by NRC Canada and Nstein Technologies to decode and detect relationships in unstructured information.
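
Creating categories on the fly from word counts might be sketched like this. The stop-word list and the `min_count` threshold are illustrative assumptions, and `candidate_categories` is a name made up for the example.

```python
import re
from collections import Counter
from typing import List

# Tiny illustrative stop-word list; a real tool would use a fuller one.
STOPWORDS = {"the", "a", "an", "of", "and", "is", "to", "in", "for", "on"}


def candidate_categories(text: str, min_count: int = 3) -> List[str]:
    """Words that occur at least min_count times, excluding stop words."""
    counts = Counter(
        word for word in re.findall(r"[a-z]+", text.lower())
        if word not in STOPWORDS
    )
    return sorted(word for word, n in counts.items() if n >= min_count)


# Invented sample document.
SAMPLE = "Mining text is fun. Text mining tools mine text."

print(candidate_categories(SAMPLE))               # ['text']
print(candidate_categories(SAMPLE, min_count=2))  # ['mining', 'text']
```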
