pt.tumba.cage
Class Cage

java.lang.Object
  extended by pt.tumba.cage.Cage

public class Cage
extends java.lang.Object

Extracting named entities (names, places, dates, and other words and phrases that establish the meaning of a body of text) is critical to software systems that process large amounts of unstructured data coming from sources such as email, document files, and the Web. The purpose of named entity recognition is to locate certain types of phrases, and associate them with a category. This allows text analysis software to create a map of the concepts in a document.

Cage is a utility for extracting named entities from text using a morphological approach. That is, Cage works with language-specific features such as punctuation, capitalization, the words themselves, word forms, and affixes. In particular, it matches names against lexicons of persons, geographical places, and organizations, together with a mechanism for matching common patterns associated with these entities.
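
The lexicon-matching idea described above can be illustrated with a minimal, self-contained sketch. This is not Cage's implementation: the class `LexiconTagger`, its in-memory lexicon, and the tag names `PERSON` and `PLACE` are all hypothetical, and the pattern rules Cage applies alongside its lexicons are not modelled here.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative lexicon matcher (hypothetical; not part of Cage). */
final class LexiconTagger {
    // Hypothetical in-memory lexicon: entry text -> category used as the tag name.
    private final Map<String, String> lexicon = new LinkedHashMap<>();

    void add(String entry, String category) {
        lexicon.put(entry, category);
    }

    /** Wraps every whole-word lexicon match in CATEGORY start and end tags. */
    String tag(String text) {
        if (lexicon.isEmpty()) {
            return text;
        }
        // Try longer entries first so a multi-word name wins over a shorter one.
        List<String> entries = new ArrayList<>(lexicon.keySet());
        entries.sort((a, b) -> Integer.compare(b.length(), a.length()));

        StringBuilder alternation = new StringBuilder();
        for (String entry : entries) {
            if (alternation.length() > 0) {
                alternation.append('|');
            }
            alternation.append(Pattern.quote(entry));
        }
        Pattern pattern = Pattern.compile("\\b(?:" + alternation + ")\\b");

        Matcher matcher = pattern.matcher(text);
        StringBuffer out = new StringBuffer();
        while (matcher.find()) {
            String match = matcher.group();
            String category = lexicon.get(match);
            String tagged = "<" + category + ">" + match + "</" + category + ">";
            matcher.appendReplacement(out, Matcher.quoteReplacement(tagged));
        }
        matcher.appendTail(out);
        return out.toString();
    }
}
```

With the entries "Bruno Martins" (PERSON) and "Lisbon" (PLACE) added, calling `tag("Bruno Martins lives in Lisbon")` returns `<PERSON>Bruno Martins</PERSON> lives in <PLACE>Lisbon</PLACE>`. A single alternation keeps the sketch short; a real lexicon with many thousands of entries would more likely use a trie or hash lookup over token windows.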

This class encapsulates the required state and provides a command-line interface for the program.

Author:
Bruno Martins

Constructor Summary
Cage()
          Simple constructor for invocation by subclass constructors, typically implicit.
Cage(java.lang.String namesFile, java.lang.String placesFile, java.lang.String organizationsFile)
          Constructs a Cage from the given lexicon files.
 
Method Summary
 java.lang.String findEntities(java.lang.String sentence)
          Finds named entities in a given text block.
 NamedEntity[] getNamedEntities()
          Returns an array of all the entities found in the text.
 java.lang.String getText()
          Returns the input text to the named entity recognizer.
 java.lang.String getTextFinal()
          Returns the text given to the named entity recognizer, with the named entities surrounded by appropriate SGML tags.
 void load(java.lang.String namesFile, java.lang.String placesFile, java.lang.String organizationsFile)
          Loads the data from the lexicon files and corresponding pattern rule files.
static void main(java.lang.String[] args)
          Main method, used to test named entity recognition.
 void setRegularTextMode()
          Sets the text processing mode to handle regular text files.
 void setTeXMode()
          Sets the text processing mode to handle TeX/LaTeX files.
 void setText(java.lang.String text)
          Sets the text to be processed by the named entity recognizer.
 void setXMLMode()
          Sets the text processing mode to handle XML and SGML files.
 
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

Cage

public Cage()
Simple constructor for invocation by subclass constructors, typically implicit.


Cage

public Cage(java.lang.String namesFile,
            java.lang.String placesFile,
            java.lang.String organizationsFile)
     throws java.io.IOException
Constructs a Cage from the given lexicon files.

Parameters:
namesFile - Path for the names lexicon file.
placesFile - Path for the places lexicon file.
organizationsFile - Path for the organizations lexicon file.
Throws:
java.io.IOException
Method Detail

findEntities

public java.lang.String findEntities(java.lang.String sentence)
Finds named entities in a given text block. Text blocks must not contain punctuation, so the text should first be parsed by the DefaultWordFinder class.

Parameters:
sentence - The text block.
Returns:
A string with the text block, in which named entities are surrounded by appropriate SGML tags.
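
Because findEntities expects punctuation-free text blocks, the input has to be split before it is passed in. The API of DefaultWordFinder is not documented on this page, so the sketch below swaps in a plain regular-expression split as a stand-in. The class `BlockSplitter` is hypothetical, and treating every ASCII punctuation character as a block boundary is an assumption (it would, for example, also split on apostrophes and hyphens).

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative pre-processing step (hypothetical; stands in for DefaultWordFinder). */
final class BlockSplitter {
    /** Splits text into trimmed, non-empty blocks that contain no ASCII punctuation. */
    static List<String> split(String text) {
        List<String> blocks = new ArrayList<>();
        // \p{Punct} matches ASCII punctuation; runs of it act as one separator.
        for (String piece : text.split("\\p{Punct}+")) {
            String block = piece.trim();
            if (!block.isEmpty()) {
                blocks.add(block);
            }
        }
        return blocks;
    }
}
```

For the input "Bruno Martins, from Lisbon. He wrote Cage!" this yields the three blocks "Bruno Martins", "from Lisbon", and "He wrote Cage", each of which could then be handed to findEntities in turn.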

load

public void load(java.lang.String namesFile,
                 java.lang.String placesFile,
                 java.lang.String organizationsFile)
          throws java.io.IOException
Loads the data from the lexicon files and corresponding pattern rule files.

Parameters:
namesFile - Path for the names lexicon file.
placesFile - Path for the places lexicon file.
organizationsFile - Path for the organizations lexicon file.
Throws:
java.io.IOException - A problem occurred while reading the lexicons.

getNamedEntities

public NamedEntity[] getNamedEntities()
Returns an array of all the entities found in the text.

Returns:
An array with all the entities found in the text.

setText

public void setText(java.lang.String text)
Sets the text to be processed by the named entity recognizer.

Parameters:
text - The text to be processed.

getText

public java.lang.String getText()
Returns the input text to the named entity recognizer.

Returns:
The text fed into the named entity recognizer.

getTextFinal

public java.lang.String getTextFinal()
Returns the text given to the named entity recognizer, with the named entities surrounded by appropriate SGML tags.

Returns:
The text given to the named entity recognizer, with the named entities surrounded by appropriate SGML tags.

setTeXMode

public void setTeXMode()
Sets the text processing mode to handle TeX/LaTeX files.


setXMLMode

public void setXMLMode()
Sets the text processing mode to handle XML and SGML files.


setRegularTextMode

public void setRegularTextMode()
Sets the text processing mode to handle regular text files. This is the default processing mode.


main

public static void main(java.lang.String[] args)
                 throws java.lang.Exception
Main method, used to test named entity recognition.

Parameters:
args - The command-line input, tokenized.
Throws:
java.lang.Exception