org.kit.furia.index
Class AbstractIRIndex<O extends org.ajmm.obsearch.OB>

java.lang.Object
  extended by org.kit.furia.index.AbstractIRIndex<O>
Type Parameters:
O - The basic unit in which all the information is divided. In the case of natural language documents, this would be a word.
All Implemented Interfaces:
IRIndex<O>
Direct Known Subclasses:
FIRIndexShort

public abstract class AbstractIRIndex<O extends org.ajmm.obsearch.OB>
extends java.lang.Object
implements IRIndex<O>

AbstractIRIndex holds the basic functionality for an Information Retrieval system that works on OB objects (please see www.obsearch.net). Using a distance function d, each query is rewritten in terms of the closest elements that are already in the database. Once this transformation is performed, an information retrieval system (Apache's Lucene) performs the matching.
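
The query transformation described above can be illustrated with a minimal sketch. It is not the OBSearch or Furia implementation: the class name QueryNormalizationSketch, the integer "objects", the absolute-difference distance and the linear scan are all assumptions made for illustration (a real index would use OBSearch to find the closest element). The result is a map from the id of the closest database element to the number of query objects that were mapped to it, which is the shape of the normalizedQuery maps used by this class.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class QueryNormalizationSketch {

    /** Hypothetical distance function d: absolute difference of two integers. */
    static int distance(int a, int b) {
        return Math.abs(a - b);
    }

    /**
     * Replaces every query object by the id (list position) of its closest
     * database object and counts how often each id is hit.
     */
    static Map<Integer, Integer> normalize(List<Integer> query, List<Integer> database) {
        Map<Integer, Integer> termFrequencies = new HashMap<>();
        for (int q : query) {
            int bestId = -1;
            int bestDistance = Integer.MAX_VALUE;
            // Linear scan used only for illustration (assumption).
            for (int id = 0; id < database.size(); id++) {
                int d = distance(q, database.get(id));
                if (d < bestDistance) {
                    bestDistance = d;
                    bestId = id;
                }
            }
            if (bestId >= 0) {
                termFrequencies.merge(bestId, 1, Integer::sum);
            }
        }
        return termFrequencies;
    }

    public static void main(String[] args) {
        List<Integer> database = Arrays.asList(10, 20, 30);
        List<Integer> query = Arrays.asList(11, 12, 29);
        // 11 and 12 are closest to element 0, 29 is closest to element 2.
        System.out.println(normalize(query, database));
    }
}
```

The resulting id-to-frequency map plays the role of a bag of words that a text search engine can match.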

Since:
0
Author:
Arnoldo Jose Muller Molina

Nested Class Summary
protected static class AbstractIRIndex.FieldName
          Lucene has the concept of fields within a document.
protected  class AbstractIRIndex.Word
          Represents an OB object.
 
Field Summary
protected  org.apache.lucene.index.IndexReader indexReader
          This object is used to read different data from the index.
protected  org.apache.lucene.index.IndexWriter indexWriter
          This object is used to add elements to the index.
protected  float mSetScoreThreshold
          At least the given naive mset score must be obtained to consider a term in the result.
protected  org.apache.lucene.search.Searcher searcher
          This object is used to search the index.
protected  float setScoreThreshold
          At least the given naive set score must be obtained to consider a term in the result.
protected  boolean validationMode
          Tells whether or not the index is in validation mode.
 
Constructor Summary
AbstractIRIndex(java.io.File dbFolder)
          Creates a new IR index if none is available in the given path.
 
Method Summary
protected  ResultCandidate calculateSimilarity(org.apache.lucene.document.Document document, java.util.Map<java.lang.Integer,java.lang.Integer> normalizedQuery, float score)
          Calculates the ResultCandidate between a normalized query and a Lucene document.
 void close()
          Closes the databases.
protected  java.util.PriorityQueue<AbstractIRIndex.Word> createPriorityQueue(java.util.Map<java.lang.Integer,java.lang.Integer> words)
          Creates a PriorityQueue from a word->tf map.
 int delete(java.lang.String documentName)
          Deletes the given string document from the database.
 void freeze()
          Freezes the index.
 float getMSetScoreThreshold()
          The M-set score threshold is the minimum naive score for multi-sets that the index will accept.
 float getSetScoreThreshold()
          The Set score threshold is the minimum naive score for Sets that the index will accept.
 int getSize()
          Returns the number of documents in this DB.
 void insert(Document<O> document)
          Inserts a new document into the database.
 boolean isValidationMode()
          Tells whether or not the index is in validation mode.
protected  java.util.List<ResultCandidate> processQueryResults(java.util.Map<java.lang.Integer,java.lang.Integer> normalizedQuery, short n, Document query)
           
 void setMSetScoreThreshold(float setScoreThreshold)
          The M-set score threshold is the minimum naive score for multi-sets that the index will accept.
 void setSetScoreThreshold(float setScoreThreshold)
          The Set score threshold is the minimum naive score for Sets that the index will accept.
 void setValidationMode(boolean validationMode)
          Sets whether or not the index is in validation mode.
 boolean shouldSkipDoc(Document<O> x)
          Returns true if the document corresponding to x's name exists in the DB.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 
Methods inherited from interface org.kit.furia.IRIndex
getIndex, getWordsSize
 

Field Detail

indexWriter

protected org.apache.lucene.index.IndexWriter indexWriter
This object is used to add elements to the index.


indexReader

protected org.apache.lucene.index.IndexReader indexReader
This object is used to read different data from the index.


searcher

protected org.apache.lucene.search.Searcher searcher
This object is used to search the index.


mSetScoreThreshold

protected float mSetScoreThreshold
At least the given naive mset score must be obtained to consider a term in the result.


setScoreThreshold

protected float setScoreThreshold
At least the given naive set score must be obtained to consider a term in the result.


validationMode

protected boolean validationMode
Tells whether or not the index is in validation mode.

Constructor Detail

AbstractIRIndex

public AbstractIRIndex(java.io.File dbFolder)
                throws java.io.IOException
Creates a new IR index if none is available in the given path.

Parameters:
dbFolder - The folder in which Lucene's files will be stored
Throws:
java.io.IOException - If the given directory does not exist or if some other IO error occurs
Method Detail

delete

public int delete(java.lang.String documentName)
           throws IRException
Description copied from interface: IRIndex
Deletes the given string document from the database. If more than one document has the same name, all of those documents will be erased.

Specified by:
delete in interface IRIndex<O extends org.ajmm.obsearch.OB>
Returns:
The number of documents deleted.
Throws:
IRException - If something goes wrong with the IR engine or with OBSearch.
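
The delete-by-name contract above (every document with the given name is removed, and the count of removed documents is returned) can be sketched with a toy in-memory model. The class DeleteByNameSketch and its list of names are hypothetical and only mimic the documented behaviour; they do not touch Lucene or OBSearch.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

final class DeleteByNameSketch {

    /** Toy storage: only document names are kept (assumption). */
    private final List<String> documentNames = new ArrayList<>();

    void insert(String documentName) {
        documentNames.add(documentName);
    }

    /** Removes every document with the given name and returns how many were removed. */
    int delete(String documentName) {
        int deleted = 0;
        Iterator<String> it = documentNames.iterator();
        while (it.hasNext()) {
            if (it.next().equals(documentName)) {
                it.remove();
                deleted++;
            }
        }
        return deleted;
    }

    int getSize() {
        return documentNames.size();
    }

    public static void main(String[] args) {
        DeleteByNameSketch index = new DeleteByNameSketch();
        index.insert("a.txt");
        index.insert("b.txt");
        index.insert("a.txt");
        System.out.println(index.delete("a.txt") + " deleted, " + index.getSize() + " left");
    }
}
```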

shouldSkipDoc

public boolean shouldSkipDoc(Document<O> x)
                      throws java.io.IOException
Returns true if the document corresponding to x's name exists in the DB. This method is intended to be used in validation mode only.

Specified by:
shouldSkipDoc in interface IRIndex<O extends org.ajmm.obsearch.OB>
Parameters:
x - The document whose name is looked up in the DB.
Returns:
true if the DB already contains a document with name x.getName()
Throws:
java.io.IOException
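
A possible way to use this check while loading data is sketched below. The sketch assumes that the method returns true when a document with the same name is already stored, which is what the method name and its description suggest. SkipExistingSketch, its set of names and its load method are hypothetical stand-ins, not part of Furia.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class SkipExistingSketch {

    /** Names of the documents stored so far (toy replacement for the real index). */
    private final Set<String> storedNames = new HashSet<>();

    /** Assumed semantics: true if a document with this name is already stored. */
    boolean shouldSkipDoc(String documentName) {
        return storedNames.contains(documentName);
    }

    void insert(String documentName) {
        storedNames.add(documentName);
    }

    /** Inserts only the documents that are not stored yet and returns their names. */
    List<String> load(List<String> documentNames) {
        List<String> inserted = new ArrayList<>();
        for (String name : documentNames) {
            if (shouldSkipDoc(name)) {
                continue; // already in the DB, nothing to do
            }
            insert(name);
            inserted.add(name);
        }
        return inserted;
    }

    public static void main(String[] args) {
        SkipExistingSketch index = new SkipExistingSketch();
        System.out.println(index.load(Arrays.asList("a", "b", "a")));
    }
}
```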

calculateSimilarity

protected ResultCandidate calculateSimilarity(org.apache.lucene.document.Document document,
                                              java.util.Map<java.lang.Integer,java.lang.Integer> normalizedQuery,
                                              float score)
Calculates the ResultCandidate between a normalized query and a Lucene document.

Returns:
A result candidate for the given document and normalized query.

getSize

public int getSize()
Returns the number of documents in this DB.

Specified by:
getSize in interface IRIndex<O extends org.ajmm.obsearch.OB>
Returns:

processQueryResults

protected java.util.List<ResultCandidate> processQueryResults(java.util.Map<java.lang.Integer,java.lang.Integer> normalizedQuery,
                                                              short n,
                                                              Document query)
                                                       throws IRException
Throws:
IRException

insert

public void insert(Document<O> document)
            throws IRException
Description copied from interface: IRIndex
Inserts a new document into the database.

Specified by:
insert in interface IRIndex<O extends org.ajmm.obsearch.OB>
Parameters:
document - The document to be inserted.
Throws:
IRException - If something goes wrong with the IR engine or with OBSearch.

freeze

public void freeze()
            throws IRException
Description copied from interface: IRIndex
Freezes the index. From this point data can be inserted, searched and deleted. The index might deteriorate at some point so every once in a while it is a good idea to rebuild the index. This method will also

Specified by:
freeze in interface IRIndex<O extends org.ajmm.obsearch.OB>
Throws:
IRException - If something goes wrong with the IR engine or with OBSearch.

close

public void close()
           throws IRException
Description copied from interface: IRIndex
Closes the databases. You *should* close the databases after using an IRIndex.

Specified by:
close in interface IRIndex<O extends org.ajmm.obsearch.OB>
Throws:
IRException - If something goes wrong with the IR engine or with OBSearch.

createPriorityQueue

protected java.util.PriorityQueue<AbstractIRIndex.Word> createPriorityQueue(java.util.Map<java.lang.Integer,java.lang.Integer> words)
                                                                     throws java.io.IOException
Creates a PriorityQueue from a word->tf map. (This code was borrowed from lucene-contrib.)

Parameters:
words - a map keyed on the word (an Integer) with the term frequency (an Integer) as the value.
Returns:
A priority queue ordered by the most important word.
Throws:
java.io.IOException
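
The idea of ranking the words of a word->tf map by importance can be sketched as follows. The weighting is an assumption: this page does not state how importance is computed, so the sketch uses a tf * idf score with a classic logarithmic idf. WordQueueSketch, ScoredWord and the documentFrequencies map are hypothetical names; ScoredWord stands in for AbstractIRIndex.Word.

```java
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

final class WordQueueSketch {

    /** Hypothetical stand-in for AbstractIRIndex.Word: a word id and its score. */
    static final class ScoredWord {
        final int id;
        final double score;

        ScoredWord(int id, double score) {
            this.id = id;
            this.score = score;
        }
    }

    /**
     * Builds a queue whose head is the highest scoring word.
     * Score = tf * idf, with idf = ln(numDocs / (df + 1)) + 1 (assumed weighting).
     */
    static PriorityQueue<ScoredWord> createQueue(Map<Integer, Integer> words,
                                                 Map<Integer, Integer> documentFrequencies,
                                                 int numDocs) {
        PriorityQueue<ScoredWord> queue = new PriorityQueue<>(
                Comparator.comparingDouble((ScoredWord w) -> w.score).reversed());
        for (Map.Entry<Integer, Integer> entry : words.entrySet()) {
            int df = documentFrequencies.getOrDefault(entry.getKey(), 0);
            if (df == 0) {
                continue; // the word does not occur in the index
            }
            double idf = Math.log((double) numDocs / (df + 1)) + 1.0;
            queue.add(new ScoredWord(entry.getKey(), entry.getValue() * idf));
        }
        return queue;
    }

    public static void main(String[] args) {
        Map<Integer, Integer> words = new HashMap<>();
        words.put(1, 2); // frequent in the query, common in the index
        words.put(2, 1); // rare in the index
        Map<Integer, Integer> documentFrequencies = new HashMap<>();
        documentFrequencies.put(1, 9);
        documentFrequencies.put(2, 1);
        PriorityQueue<ScoredWord> queue = createQueue(words, documentFrequencies, 10);
        System.out.println("most important word id: " + queue.peek().id);
    }
}
```

With this weighting a word that is rare in the index can outrank a word that merely occurs more often in the query.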

getMSetScoreThreshold

public float getMSetScoreThreshold()
Description copied from interface: IRIndex
The M-set score threshold is the minimum naive score for multi-sets that the index will accept.

Specified by:
getMSetScoreThreshold in interface IRIndex<O extends org.ajmm.obsearch.OB>
Returns:
Returns the current M-set score threshold.

setMSetScoreThreshold

public void setMSetScoreThreshold(float setScoreThreshold)
Description copied from interface: IRIndex
The M-set score threshold is the minimum naive score for multi-sets that the index will accept.

Specified by:
setMSetScoreThreshold in interface IRIndex<O extends org.ajmm.obsearch.OB>
Parameters:
setScoreThreshold - the new threshold

getSetScoreThreshold

public float getSetScoreThreshold()
Description copied from interface: IRIndex
The Set score threshold is the minimum naive score for Sets that the index will accept.

Specified by:
getSetScoreThreshold in interface IRIndex<O extends org.ajmm.obsearch.OB>
Returns:
Returns the current Set score threshold.

setSetScoreThreshold

public void setSetScoreThreshold(float setScoreThreshold)
Description copied from interface: IRIndex
The Set score threshold is the minimum naive score for Sets that the index will accept.

Specified by:
setSetScoreThreshold in interface IRIndex<O extends org.ajmm.obsearch.OB>
Parameters:
setScoreThreshold - the new threshold
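
How the two thresholds could filter candidates is sketched below. The exact definition of the naive scores is not given on this page, so the formulas are assumptions: the set score is taken as the fraction of distinct query words found in the document, and the multi-set score as the fraction of query word occurrences covered by the document. Requiring both thresholds at once is also an assumption. NaiveScoreSketch and its methods are hypothetical names.

```java
import java.util.HashMap;
import java.util.Map;

final class NaiveScoreSketch {

    /** Assumed set score: distinct query words present in the document / distinct query words. */
    static float setScore(Map<Integer, Integer> query, Map<Integer, Integer> document) {
        if (query.isEmpty()) {
            return 0f;
        }
        int shared = 0;
        for (Integer word : query.keySet()) {
            if (document.containsKey(word)) {
                shared++;
            }
        }
        return (float) shared / query.size();
    }

    /** Assumed multi-set score: sum of min(tf in query, tf in document) / sum of tf in query. */
    static float mSetScore(Map<Integer, Integer> query, Map<Integer, Integer> document) {
        int shared = 0;
        int total = 0;
        for (Map.Entry<Integer, Integer> entry : query.entrySet()) {
            total += entry.getValue();
            shared += Math.min(entry.getValue(), document.getOrDefault(entry.getKey(), 0));
        }
        return total == 0 ? 0f : (float) shared / total;
    }

    /** Accepts a candidate only if both scores reach their thresholds (assumption). */
    static boolean accept(Map<Integer, Integer> query, Map<Integer, Integer> document,
                          float setScoreThreshold, float mSetScoreThreshold) {
        return setScore(query, document) >= setScoreThreshold
                && mSetScore(query, document) >= mSetScoreThreshold;
    }

    public static void main(String[] args) {
        Map<Integer, Integer> query = new HashMap<>();
        query.put(1, 2);
        query.put(2, 1);
        query.put(3, 1);
        Map<Integer, Integer> document = new HashMap<>();
        document.put(1, 1);
        document.put(2, 1);
        System.out.println("set score:  " + setScore(query, document));
        System.out.println("mset score: " + mSetScore(query, document));
        System.out.println("accepted:   " + accept(query, document, 0.5f, 0.5f));
    }
}
```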

isValidationMode

public boolean isValidationMode()
Description copied from interface: IRIndex
Tells whether or not the index is in validation mode. In validation mode we assume that documents with the same name are equal. This helps us to add additional statistics on the performance of the scoring technique.

Specified by:
isValidationMode in interface IRIndex<O extends org.ajmm.obsearch.OB>
Returns:
true if this index is in validation mode.

setValidationMode

public void setValidationMode(boolean validationMode)
Description copied from interface: IRIndex
Sets whether or not the index is in validation mode. In validation mode we assume that documents with the same name are equal. This helps us to add additional statistics on the performance of the scoring technique.

Specified by:
setValidationMode in interface IRIndex<O extends org.ajmm.obsearch.OB>
Parameters:
validationMode - The new validation mode.
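
One statistic that the same-name assumption makes possible is sketched below: the fraction of queries whose best-ranked result has the same name as the query. Which statistics the index actually collects is not stated on this page, so this is only an illustration. ValidationStatsSketch, topOneAccuracy and the map of ranked result names are hypothetical.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class ValidationStatsSketch {

    /**
     * Input: query name -> names of the returned documents, best match first.
     * Output: fraction of queries whose best match carries the query's own name.
     */
    static double topOneAccuracy(Map<String, List<String>> rankedResultsByQuery) {
        if (rankedResultsByQuery.isEmpty()) {
            return 0.0;
        }
        int hits = 0;
        for (Map.Entry<String, List<String>> entry : rankedResultsByQuery.entrySet()) {
            List<String> ranked = entry.getValue();
            // Same name is treated as "same document" (validation mode assumption).
            if (!ranked.isEmpty() && ranked.get(0).equals(entry.getKey())) {
                hits++;
            }
        }
        return (double) hits / rankedResultsByQuery.size();
    }

    public static void main(String[] args) {
        Map<String, List<String>> results = new LinkedHashMap<>();
        results.put("a", Arrays.asList("a", "b"));        // correct document ranked first
        results.put("b", Arrays.asList("c", "b"));        // correct document ranked second
        results.put("c", Collections.<String>emptyList()); // nothing returned
        System.out.println(topOneAccuracy(results));
    }
}
```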


Copyright © 2008 Arnoldo Jose Muller Molina. All Rights Reserved.