Passion of IT

Lucene: information retrieval

How is an internet search engine built?

How can I search through large amounts of information spread across many documents or web pages?

You can search this information using an information retrieval system.

Information retrieval

An information retrieval system executes queries against unstructured text documents. A raw file, however, is not ready to be searched: the user types only a few keywords, which may appear in scattered positions in the document, and the document may contain synonyms of those keywords instead of the keywords themselves. For this reason a preprocessing step is necessary.

The most important preprocessing operations are the following:

  • Stemming: Replacing words with their stems. For instance the English stem “bikes” is replaced with “bike”; now query “bike” can find both documents containing “bike” and those containing “bikes”.
  • Stop Words Filtering: Common words like “the”, “and” and “a” rarely add any value to a search. Removing them shrinks the index size and increases performance. It may also reduce some “noise” and actually improve search quality.
  • Text Normalization: Stripping accents and other character markings can make for better searching.
  • Synonym Expansion: Adding in synonyms at the same token position as the current word can mean better matching when users search with words in the synonym set.
  • Frequent-term extraction: identifying the most frequent terms in the document.
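To make the first two operations concrete, here is a minimal, self-contained sketch in plain Java. This is not Lucene's implementation: the three-word stop list and the one-rule stemmer are hypothetical simplifications for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class PreprocessSketch {

    // Hypothetical three-word stop list; real analyzers ship much larger ones.
    static final Set<String> STOP_WORDS = Set.of("the", "and", "a");

    // Naive one-rule stemmer: strip a trailing "s". Real stemmers
    // (e.g. the Porter stemmer) apply many more suffix rules.
    static String stem(String token) {
        if (token.length() > 3 && token.endsWith("s")) {
            return token.substring(0, token.length() - 1);
        }
        return token;
    }

    // Lowercase, tokenize on non-word characters, drop stop words, stem.
    static List<String> preprocess(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.toLowerCase().split("\\W+")) {
            if (raw.isEmpty() || STOP_WORDS.contains(raw)) {
                continue; // stop-word filtering
            }
            tokens.add(stem(raw)); // stemming
        }
        return tokens;
    }

    public static void main(String[] args) {
        // "bikes" and "bike" both become "bike"; "the" and "and" are dropped,
        // so a query for "bike" would match documents containing either form.
        System.out.println(preprocess("The bikes and the bike"));
    }
}
```

After preprocessing, both "bikes" and "bike" map to the same token, which is exactly why the query "bike" can find documents containing either form.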

The information retrieval system executes the operations above and saves the processed document, together with a pointer to the original document, in an index file.

When the user searches for a keyword, that keyword is looked up in the index file and the results are shown to the user ordered by ranking (level of relevance). By clicking on a link, the user can open the original document.
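The index-and-lookup flow just described can be sketched with a toy in-memory inverted index. This is a hypothetical simplification: a real engine like Lucene persists the index to disk and uses a far more sophisticated scoring formula than raw term frequency.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class InvertedIndexSketch {

    // term -> (document id -> term frequency): the "index file", in memory.
    private final Map<String, Map<Integer, Integer>> index = new HashMap<>();

    // Tokenize the document and record, per term, how often it occurs.
    public void addDocument(int docId, String text) {
        for (String token : text.toLowerCase().split("\\W+")) {
            if (token.isEmpty()) continue;
            index.computeIfAbsent(token, t -> new HashMap<>())
                 .merge(docId, 1, Integer::sum);
        }
    }

    // Look up the keyword and return document ids ordered by a naive
    // relevance score (raw term frequency, highest first).
    public List<Integer> search(String keyword) {
        Map<Integer, Integer> postings =
                index.getOrDefault(keyword.toLowerCase(), Map.of());
        List<Integer> result = new ArrayList<>(postings.keySet());
        result.sort((a, b) -> postings.get(b) - postings.get(a));
        return result;
    }

    public static void main(String[] args) {
        InvertedIndexSketch ix = new InvertedIndexSketch();
        ix.addDocument(1, "Lucene index");
        ix.addDocument(2, "Lucene Lucene search");
        System.out.println(ix.search("lucene")); // doc 2 ranks above doc 1
    }
}
```

In a real system each posting would also carry a pointer to the original document, so the user can open it from the result list.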

 

Apache Lucene

Lucene is a framework composed of many libraries for executing queries on text documents, like an information retrieval system.

The principal classes are the following:

  • org.apache.lucene.document.Document: stores the principal information of a document (for example title, content, or creation date), called fields, in the index. When executing a query, it is necessary to specify which field of the document is searched
  • org.apache.lucene.analysis.standard.StandardAnalyzer: the analyzer that processes the documents to extract the information to save into the index; it tokenizes the text, lowercases the tokens, and filters stop words. Stemming and synonym expansion require language-specific analyzers (such as EnglishAnalyzer) or a custom filter chain
  • org.apache.lucene.search.Query: represents the query to execute (an IndexSearcher runs it against the index)
  • org.apache.lucene.search.ScoreDoc: represents a single search hit (document id plus relevance score), used to rank the results
  • org.apache.poi.extractor.ExtractorFactory: from Apache POI rather than Lucene; very useful for extracting text from Office documents like Word or Excel so that they can be indexed and queried

My Project

I tested all of the libraries above for searching information in Word and Excel documents, and they work well: Lucene creates the index files, executes the query, and shows the results ordered by ranking.

There are many analyzers:

  • org.apache.lucene.analysis.standard.StandardAnalyzer
  • org.apache.lucene.analysis.en.EnglishAnalyzer
  • org.apache.lucene.analysis.it.ItalianAnalyzer

Each analyzer performs different preprocessing, because English stop words differ from Italian stop words, and the same holds for synonyms and stemming.

The preprocessing step is not executed by simply constructing the analyzer; it is necessary to use the method analyzer.tokenStream.

When I execute the following code:


for (File f : queue) {
    FileReader fr = new FileReader(f);
    TokenStream ts = analyzer.tokenStream("contents",
            ExtractorFactory.createExtractor(f).getText());
    doc.add(new TextField("contents", ts));
    ts.reset();
    ts.close();
    writer.addDocument(doc);
}

the console shows the following error:

java.lang.IllegalStateException: TokenStream contract violation: reset()/close() call missing, reset() called multiple times, or subclass does not call super.reset(). Please see Javadocs of TokenStream class for more information about the correct consuming workflow.
at org.apache.lucene.analysis.Tokenizer$1.read(Tokenizer.java:111)

This is not the correct way to add preprocessed tokens to the indexed document: when a TextField is built from a TokenStream, the IndexWriter itself consumes the stream (calling reset() and close()) when writer.addDocument(doc) is invoked, so calling ts.reset() and ts.close() beforehand violates the TokenStream contract.

When I add the field of the document as a plain string, there are no exceptions and the query is executed correctly, but when I pass a TokenStream the previous exception occurs. The usual approach is to pass the raw extracted text to the TextField and configure the analyzer on the IndexWriterConfig, so that the preprocessing happens automatically during indexing.

 

In my opinion, another way to perform the four preprocessing operations above is to do it manually, by building some simple dictionaries:

  • Stop-word dictionary: a simple XML file with all the words to be removed while documents are processed to build the index; the same words are also removed from the user queries
  • Stem/synonym dictionary: a simple database table with two columns: one column holding the base term as the key, the other holding the terms to replace.
    • For example BASE TERM: eat; TERMS TO REPLACE: ate, eating, eats, have a meal
    • I use a table that is loaded into a class when the application starts, by executing the query “select * from stem_synonym_dictionary”
    • When I process a document or a query and I find “eating” or “have a meal”, I replace it with “eat” (a simple search inside the string)
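A minimal sketch of this stem/synonym replacement, using an in-memory map in place of the database table (the class and method names are hypothetical; only the table name and the example entry come from the text above):

```java
import java.util.HashMap;
import java.util.Map;

public class StemSynonymDictionary {

    // In-memory copy of the hypothetical stem_synonym_dictionary table:
    // base term -> comma-separated variants. In the scheme above this would
    // be loaded once at startup with "select * from stem_synonym_dictionary".
    private final Map<String, String> baseToVariants = new HashMap<>();

    public void addEntry(String baseTerm, String variantsCsv) {
        baseToVariants.put(baseTerm, variantsCsv);
    }

    // Replace every variant found in the text with its base term
    // (the "simple search inside the string" described above).
    public String normalize(String text) {
        String result = text;
        for (Map.Entry<String, String> entry : baseToVariants.entrySet()) {
            for (String variant : entry.getValue().split(",")) {
                result = result.replace(variant.trim(), entry.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        StemSynonymDictionary dict = new StemSynonymDictionary();
        dict.addEntry("eat", "ate,eating,eats, have a meal");
        System.out.println(dict.normalize("I was eating before I have a meal"));
    }
}
```

Note that plain substring replacement has pitfalls: replacing "ate" would also corrupt unrelated words such as "later", so a production version should match whole tokens only.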

 
