pt.tumba.cage
Class DefaultWordFinder

java.lang.Object
  extended by pt.tumba.cage.DefaultWordFinder
Direct Known Subclasses:
TeXWordFinder, XMLWordFinder

public class DefaultWordFinder
extends java.lang.Object

A word finder for normal text documents, which searches text for sequences of words and text blocks. This class also defines common methods and behaviour for the various word finding subclasses.

Author:
Bruno Martins
See Also:
StringTokenizer, BreakIterator, TeXWordFinder, XMLWordFinder

Constructor Summary
DefaultWordFinder()
          Constructor for DefaultWordFinder.
DefaultWordFinder(java.lang.String inText)
          Constructor for DefaultWordFinder.
 
Method Summary
 java.lang.String current()
          Returns the current word in the text.
 java.lang.String currentNGram(int n)
          Returns the current character N-gram from the input.
 java.lang.String currentSegment()
          Returns the current text segment from the input.
 java.lang.String currentWordGram(int n)
          Returns the current word N-gram from the input.
 java.lang.String getText()
          Returns the text associated with this DefaultWordFinder.
 boolean hasNext()
          Tests if there are more words available from the text.
 java.lang.String lookAhead()
          Returns the next word without advancing the tokenizer, checking if the character separating both words is an empty space.
 java.lang.String next()
          This method scans the text from the end of the last word, and returns a String corresponding to the next word.
 java.lang.String nextSegment()
          Returns the next text segment from the input.
 void replace(java.lang.String newWord)
          Replaces the current word in the text.
 void replaceBigram(java.lang.String newBigram)
          Replaces the current bigram (the current word and the next word, as returned by lookAhead()) in the text.
 void replaceSegment(java.lang.String newSegment)
          Replaces the current text segment.
 void setText(java.lang.String newText)
          Changes the text associated with this DefaultWordFinder.
static java.lang.String[] splitNGrams(java.lang.String text, int n)
          Splits a given String into an array with its constituent character n-grams.
static java.lang.String[] splitSegments(java.lang.String text)
          Splits a given String into an array with its constituent text segments.
static java.lang.String[] splitWordGrams(java.lang.String text, int n)
          Splits a given String into an array with its constituent word n-grams.
static java.lang.String[] splitWords(java.lang.String text)
          Splits a given String into an array with its constituent words.
 boolean startsSentence()
          Checks if the current word marks the beginning of a sentence.
 java.lang.String toString()
          Produces a string representation of this word finder by returning the associated text.
 
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Constructor Detail

DefaultWordFinder

public DefaultWordFinder(java.lang.String inText)
Constructor for DefaultWordFinder.

Parameters:
inText - A String with the input text to tokenize.

DefaultWordFinder

public DefaultWordFinder()
Constructor for DefaultWordFinder.

Method Detail

currentWordGram

public java.lang.String currentWordGram(int n)
Returns the current word N-gram from the input. An N-gram is defined as the word sequence between the current position and the next n words.

Parameters:
n - Number of consecutive words in each n-gram.
Returns:
A String with the current word N-gram.

currentNGram

public java.lang.String currentNGram(int n)
Returns the current character N-gram from the input. An N-gram is defined as the character sequence between the current position and the next n characters.

Parameters:
n - Number of consecutive characters in each n-gram.
Returns:
A String with the current character N-gram.

currentSegment

public java.lang.String currentSegment()
Returns the current text segment from the input. A segment is defined as the character sequence between the current position and the next non-alphanumeric character, considering also white spaces.

Returns:
A String with the current text segment.

nextSegment

public java.lang.String nextSegment()
Returns the next text segment from the input. A segment is defined as the character sequence between the current position and the next non-alphanumeric character, considering also white spaces. If there are no more segments to return, it returns null.

Returns:
A String with the next text segment.

replaceSegment

public void replaceSegment(java.lang.String newSegment)
Replaces the current text segment. After a call to this method, a call to currentSegment() returns the new text segment and a call to getText() returns the text supplied to this WordFinder with the current segment replaced.

Parameters:
newSegment - A String with the new text segment.

getText

public java.lang.String getText()
Returns the text associated with this DefaultWordFinder.

Returns:
A String with the text associated with this DefaultWordFinder.

setText

public void setText(java.lang.String newText)
Changes the text associated with this DefaultWordFinder.

Parameters:
newText - The new String with the input text to tokenize.

current

public java.lang.String current()
Returns the current word in the text.

Returns:
A String with the current word in the text.

hasNext

public boolean hasNext()
Tests if there are more words available from the text.

Returns:
true if and only if there is at least one word in the string after the current position, and false otherwise.

replace

public void replace(java.lang.String newWord)
Replaces the current word in the text. After a call to this method, a call to current() returns the new word and a call to getText() returns the text supplied to this WordFinder with the current word replaced.

Parameters:
newWord - A string with the replacement word.
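
The contract above (after replace(), current() yields the new word and getText() yields the edited text) can be illustrated with a small stand-alone sketch. This is not the DefaultWordFinder source: the class name ReplacingCursorSketch is hypothetical, and it assumes a word is a maximal run of letters or digits, which the real class may define differently.

```java
// Hypothetical stand-in, not pt.tumba.cage.DefaultWordFinder.
// Assumption: a "word" is a maximal run of letters or digits.
class ReplacingCursorSketch {
    private final StringBuilder text;
    private int start = 0; // start index of the current word
    private int end = 0;   // index just past the current word

    ReplacingCursorSketch(String input) {
        this.text = new StringBuilder(input);
    }

    // Advances to the next word, or returns null when none is left.
    String next() {
        int i = end;
        while (i < text.length() && !Character.isLetterOrDigit(text.charAt(i))) {
            i++;
        }
        if (i >= text.length()) {
            return null;
        }
        int j = i;
        while (j < text.length() && Character.isLetterOrDigit(text.charAt(j))) {
            j++;
        }
        start = i;
        end = j;
        return text.substring(start, end);
    }

    // Returns the current word (an empty string before the first next()).
    String current() {
        return text.substring(start, end);
    }

    // Replaces the current word in place and keeps the cursor on the
    // replacement, so scanning resumes right after it.
    void replace(String newWord) {
        text.replace(start, end, newWord);
        end = start + newWord.length();
    }

    String getText() {
        return text.toString();
    }
}
```

Moving the end index to the end of the replacement is the design choice that lets a following next() continue from the right place even when the new word is longer or shorter than the old one.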

replaceBigram

public void replaceBigram(java.lang.String newBigram)
Replaces the current bigram (the current word and the next word, as returned by lookAhead()) in the text. After a call to this method, a call to current() returns the bigram and a call to getText() returns the text supplied to this WordFinder with the current bigram replaced.

Parameters:
newBigram - A string with the replacement Bigram.

lookAhead

public java.lang.String lookAhead()
Returns the next word without advancing the tokenizer, checking if the character separating both words is an empty space. This is useful for getting bigrams from the text.

Returns:
The next word in the text, or null.
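
One possible reading of the description above is sketched here. The documentation does not say what happens when the separator is not a space, so this sketch assumes the method then returns null. The class LookAheadSketch, its static method, and the endOfCurrent parameter are hypothetical; the real lookAhead() takes no arguments and works on the finder's internal position.

```java
// Hypothetical stand-in, not pt.tumba.cage.DefaultWordFinder.
// Assumptions: a word is a run of letters or digits, and a following word
// only counts when exactly one space separates it from the current word.
class LookAheadSketch {
    // endOfCurrent is the index just past the current word in text.
    static String lookAhead(String text, int endOfCurrent) {
        // There must be a separator and at least one character after it.
        if (endOfCurrent < 0 || endOfCurrent + 1 >= text.length()) {
            return null;
        }
        // The separator must be a plain space.
        if (text.charAt(endOfCurrent) != ' ') {
            return null;
        }
        int start = endOfCurrent + 1;
        if (!Character.isLetterOrDigit(text.charAt(start))) {
            return null;
        }
        int stop = start;
        while (stop < text.length() && Character.isLetterOrDigit(text.charAt(stop))) {
            stop++;
        }
        // Nothing is mutated, so the caller's position does not advance.
        return text.substring(start, stop);
    }
}
```

Under these assumptions, "New York" forms a bigram because a single space separates the two words, while "York, USA" does not because a comma follows "York".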

startsSentence

public boolean startsSentence()
Checks if the current word marks the beginning of a sentence.

Returns:
true if the current word marks the beginning of a sentence and false otherwise.

toString

public java.lang.String toString()
Produces a string representation of this word finder by returning the associated text.

Overrides:
toString in class java.lang.Object

next

public java.lang.String next()
Scans the text from the end of the last word and returns a String corresponding to the next word. If there are no more words to return, it returns null.

Returns:
the next word.
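
The hasNext()/next() contract described above follows the usual cursor pattern. The sketch below shows that pattern with a hypothetical class, WordCursorSketch, that mirrors only the documented behaviour; it assumes words are runs of letters or digits and is not the actual implementation.

```java
// Hypothetical stand-in, not pt.tumba.cage.DefaultWordFinder.
// Assumption: a "word" is a maximal run of letters or digits.
class WordCursorSketch {
    private final String text;
    private int end = 0;     // index just past the last word returned
    private String current;  // last word returned by next(), or null

    WordCursorSketch(String text) {
        this.text = text;
    }

    // Skips non-word characters starting at 'from'.
    private int skipToWordStart(int from) {
        int i = from;
        while (i < text.length() && !Character.isLetterOrDigit(text.charAt(i))) {
            i++;
        }
        return i;
    }

    // True if at least one word remains after the current position.
    boolean hasNext() {
        return skipToWordStart(end) < text.length();
    }

    // Scans from the end of the last word; returns null when exhausted.
    String next() {
        int start = skipToWordStart(end);
        if (start >= text.length()) {
            return null;
        }
        int stop = start;
        while (stop < text.length() && Character.isLetterOrDigit(text.charAt(stop))) {
            stop++;
        }
        end = stop;
        current = text.substring(start, stop);
        return current;
    }

    String current() {
        return current;
    }

    public static void main(String[] args) {
        WordCursorSketch words = new WordCursorSketch("Hello, world!");
        while (words.hasNext()) {
            System.out.println(words.next());
        }
    }
}
```

A caller would typically loop on hasNext() as in main, or loop until next() returns null; both styles are consistent with the documented return values.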

splitWords

public static java.lang.String[] splitWords(java.lang.String text)
Splits a given String into an array with its constituent words.

Parameters:
text - A String.
Returns:
An array with the words extracted from the String.

splitSegments

public static java.lang.String[] splitSegments(java.lang.String text)
Splits a given String into an array with its constituent text segments.

Parameters:
text - A String.
Returns:
An array with the text segments extracted from the String.

splitWordGrams

public static java.lang.String[] splitWordGrams(java.lang.String text,
                                                int n)
Splits a given String into an array with its constituent word n-grams.

Parameters:
text - A String.
n - Number of consecutive words in each n-gram.
Returns:
An array with the word n-grams extracted from the String.
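
As an illustration of what a word n-gram split produces, here is a minimal sliding-window sketch. The class WordGramSketch and its method are hypothetical; the sketch assumes words are separated by whitespace and that each n-gram is joined with a single space, which may not match how splitWordGrams tokenizes or formats its output.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in, not pt.tumba.cage.DefaultWordFinder.
// Assumptions: whitespace-separated words, n-grams joined by one space.
class WordGramSketch {
    static String[] wordGrams(String text, int n) {
        if (n <= 0) {
            throw new IllegalArgumentException("n must be positive");
        }
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return new String[0];
        }
        String[] words = trimmed.split("\\s+");
        List<String> grams = new ArrayList<>();
        // Slide a window of n words over the word array.
        for (int i = 0; i + n <= words.length; i++) {
            grams.add(String.join(" ", Arrays.copyOfRange(words, i, i + n)));
        }
        return grams.toArray(new String[0]);
    }
}
```

With this sketch, the input "the quick brown fox" and n = 2 yields the three bigrams "the quick", "quick brown" and "brown fox"; an input with fewer than n words yields an empty array.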

splitNGrams

public static java.lang.String[] splitNGrams(java.lang.String text,
                                             int n)
Splits a given String into an array with its constituent character n-grams.

Parameters:
text - A String.
n - Number of consecutive characters in each n-gram.
Returns:
An array with the character n-grams extracted from the String.
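
A character n-gram split can be sketched the same way, with a window sliding over characters instead of words. The class CharGramSketch is hypothetical, and the sketch assumes every character (including spaces and punctuation) takes part in the n-grams; splitNGrams may treat word boundaries differently.

```java
// Hypothetical stand-in, not pt.tumba.cage.DefaultWordFinder.
// Assumption: all characters, including spaces, are part of the n-grams.
class CharGramSketch {
    static String[] charGrams(String text, int n) {
        if (n <= 0) {
            throw new IllegalArgumentException("n must be positive");
        }
        // A text of length L has L - n + 1 n-grams, or none if L < n.
        int count = Math.max(0, text.length() - n + 1);
        String[] grams = new String[count];
        for (int i = 0; i < count; i++) {
            grams[i] = text.substring(i, i + n);
        }
        return grams;
    }
}
```

For the input "word" and n = 2, this produces "wo", "or" and "rd".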