Cage

pt.tumba.cage
Class Cage

java.lang.Object
  pt.tumba.cage.Cage

public class Cage
extends java.lang.Object

Extracting named entities (names, places, dates, and other words and phrases that establish the meaning of a body of text) is critical to software systems that process large amounts of unstructured data coming from sources such as email, document files, and the Web. The purpose of named entity recognition is to locate certain types of phrases, and associate them with a category. This allows text analysis software to create a map of the concepts in a document.

Cage is a utility for extracting named entities from text using a morphological approach. It works with language-specific surface features such as punctuation, capitalization, the words themselves, word forms, and affixes. In particular, it matches names against lexicons of persons, geographical places, and organizations, and combines this with a mechanism for matching common patterns associated with these entities.
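The lexicon-matching idea described above can be illustrated with a small sketch. This is not Cage's implementation: the class `LexiconTaggerSketch`, its `add` and `tag` methods, the in-memory lexicon, and the tag names `PERSON` and `PLACE` are all assumptions made for illustration. Cage reads its lexicons from files and also applies pattern rules, which this sketch omits.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of lexicon-based tagging; not part of pt.tumba.cage.
class LexiconTaggerSketch {

    // Assumed in-memory lexicon: phrase -> category name.
    private final Map<String, String> lexicon = new LinkedHashMap<>();

    void add(String phrase, String category) {
        lexicon.put(phrase, category);
    }

    // Wraps each whole-word, case-sensitive lexicon match in <CATEGORY>...</CATEGORY>.
    String tag(String text) {
        if (lexicon.isEmpty()) {
            return text;
        }
        // Try longer phrases first so that they win over their own prefixes.
        List<String> phrases = new ArrayList<>(lexicon.keySet());
        phrases.sort(Comparator.comparingInt(String::length).reversed());
        StringBuilder alternation = new StringBuilder();
        for (String phrase : phrases) {
            if (alternation.length() > 0) {
                alternation.append('|');
            }
            alternation.append(Pattern.quote(phrase));
        }
        Matcher m = Pattern.compile("\\b(?:" + alternation + ")\\b").matcher(text);

        // Single pass over the input, so inserted tags are never re-scanned.
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            String category = lexicon.get(m.group());
            out.append(text, last, m.start());
            out.append('<').append(category).append('>')
               .append(m.group())
               .append("</").append(category).append('>');
            last = m.end();
        }
        out.append(text, last, text.length());
        return out.toString();
    }

    public static void main(String[] args) {
        LexiconTaggerSketch tagger = new LexiconTaggerSketch();
        tagger.add("Bruno Martins", "PERSON");
        tagger.add("Lisbon", "PLACE");
        System.out.println(tagger.tag("Bruno Martins lives in Lisbon"));
    }
}
```

Matching is case-sensitive on purpose, since capitalization is one of the surface features the description mentions.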

This class encapsulates the required state information and provides a command-line interface for the program.

Author:: Bruno Martins

Constructor Summary
`Cage()` Simple constructor for invocation by subclass constructors, typically implicit.
`Cage(java.lang.String namesFile, java.lang.String placesFile, java.lang.String organizationsFile)` Constructs a Cage from the given names, places, and organizations lexicon files.

Method Summary
`java.lang.String`	`findEntities(java.lang.String sentence)` Finds named entities in a given text block.
`NamedEntity[]`	`getNamedEntities()` Returns an array of all the entities found in the text.
`java.lang.String`	`getText()` Returns the input text to the named entity recognizer.
`java.lang.String`	`getTextFinal()` Returns the text given to the named entity recognizer, with the named entities surrounded by appropriate SGML tags.
`void`	`load(java.lang.String namesFile, java.lang.String placesFile, java.lang.String organizationsFile)` Loads the data from the lexicon files and corresponding pattern rule files.
`static void`	`main(java.lang.String[] args)` Main method, used to test named entity recognition.
`void`	`setRegularTextMode()` Sets the text processing mode to handle regular text files.
`void`	`setTeXMode()` Sets the text processing mode to handle TeX/LaTeX files.
`void`	`setText(java.lang.String text)` Sets the text to be processed by the named entity recognizer.
`void`	`setXMLMode()` Sets the text processing mode to handle XML and SGML files.

Methods inherited from class java.lang.Object
`equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait`

Constructor Detail

Cage

public Cage()

Simple constructor for invocation by subclass constructors, typically implicit.

Cage

public Cage(java.lang.String namesFile,
            java.lang.String placesFile,
            java.lang.String organizationsFile)
     throws java.io.IOException

Constructs a Cage from the given names, places, and organizations lexicon files.

Parameters:: namesFile - Path for the names lexicon file; placesFile - Path for the places lexicon file; organizationsFile - Path for the organizations lexicon file
Throws:: java.io.IOException

Method Detail

findEntities

public java.lang.String findEntities(java.lang.String sentence)

Finds named entities in a given text block. Text blocks do not contain punctuation, so initial parsing should be done by the DefaultWordFinder class.

Parameters:: sentence - The text block.
Returns:: A string with the text block where named entities are surrounded by appropriate SGML tags.
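Because `findEntities` expects blocks without punctuation, the input has to be split first. The sketch below shows one naive way to do that with a regular expression. It is a stand-in for illustration only and does not reproduce how DefaultWordFinder works; the class `TextBlockSplitterSketch`, the method `splitIntoBlocks`, and the chosen punctuation set are assumptions.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical pre-processing step; not the DefaultWordFinder algorithm.
class TextBlockSplitterSketch {

    // Splits text on runs of common sentence punctuation and drops empty pieces.
    // Hyphens and apostrophes are kept so that names like "Jean-Pierre" stay whole.
    // Naive by design: abbreviations and decimal numbers are split as well.
    static List<String> splitIntoBlocks(String text) {
        List<String> blocks = new ArrayList<>();
        for (String piece : text.split("[.,;:!?()\"]+")) {
            String trimmed = piece.trim();
            if (!trimmed.isEmpty()) {
                blocks.add(trimmed);
            }
        }
        return blocks;
    }

    public static void main(String[] args) {
        for (String block : splitIntoBlocks("Bruno Martins lives in Lisbon, Portugal. He writes software!")) {
            // Each block could then be passed to Cage.findEntities(String).
            System.out.println(block);
        }
    }
}
```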

load

public void load(java.lang.String namesFile,
                 java.lang.String placesFile,
                 java.lang.String organizationsFile)
          throws java.io.IOException

Loads the data from the lexicon files and corresponding pattern rule files.

Parameters:: namesFile - Path for the names lexicon file.; placesFile - Path for the places lexicon file.; organizationsFile - Path for the organizations lexicon file.
Throws:: java.io.IOException - A problem occurred while reading the lexicons.

getNamedEntities

public NamedEntity[] getNamedEntities()

Returns an array of all the entities found in the text.

Returns:: An array with all the entities found in the text.

setText

public void setText(java.lang.String text)

Sets the text to be processed by the named entity recognizer.

Parameters:: text - The text to be processed.

getText

public java.lang.String getText()

Returns the input text to the named entity recognizer.

Returns:: The text fed into the named entity recognizer.

getTextFinal

public java.lang.String getTextFinal()

Returns the text given to the named entity recognizer, with the named entities surrounded by appropriate SGML tags.

Returns:: The text given to the named entity recognizer, with the named entities surrounded by appropriate SGML tags.
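When only the tagged string is available, the entities can be recovered by scanning for tag pairs. The sketch below assumes the tags have the simple form `<NAME>text</NAME>` with no attributes or nesting; this page does not document the exact tag format, so treat that as an assumption. The class `TaggedTextReaderSketch` and its `extract` method are hypothetical. Where a Cage instance is at hand, `getNamedEntities()` is the documented way to obtain the entities.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical reader for SGML-style tagged text; the tag format is assumed.
class TaggedTextReaderSketch {

    // Matches <TAG>content</TAG>, where the closing tag repeats the opening name.
    private static final Pattern ENTITY =
            Pattern.compile("<([A-Za-z][A-Za-z0-9_]*)>(.*?)</\\1>", Pattern.DOTALL);

    // Returns {tagName, content} pairs in order of appearance.
    static List<String[]> extract(String tagged) {
        List<String[]> found = new ArrayList<>();
        Matcher m = ENTITY.matcher(tagged);
        while (m.find()) {
            found.add(new String[] { m.group(1), m.group(2) });
        }
        return found;
    }

    public static void main(String[] args) {
        String tagged = "<PERSON>Bruno Martins</PERSON> lives in <PLACE>Lisbon</PLACE>";
        for (String[] entity : extract(tagged)) {
            System.out.println(entity[0] + ": " + entity[1]);
        }
    }
}
```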

setTeXMode

public void setTeXMode()

Sets the text processing mode to handle TeX/LaTeX files.

setXMLMode

public void setXMLMode()

Sets the text processing mode to handle XML and SGML files.

setRegularTextMode

public void setRegularTextMode()

Sets the text processing mode to handle regular text files. This is the default processing mode.

main

public static void main(java.lang.String[] args)
                 throws java.lang.Exception

Main method, used to test named entity recognition.

Parameters:: args - The command line input, tokenized.
Throws:: java.lang.Exception
