CAGEclass :: CApturing Geographic Entities

What is CAGEclass?

CAGEclass is a Java package implementing a research framework for experimenting with named entity recognition (NER) through lexicons and a variety of orthographic and contextual features. By "research" we mean that the software favors flexibility, sometimes at the cost of performance.

As it is, the system is particularly tailored to the process of finding geographical named entities. Lexicons and extraction rules are included for Portuguese and English texts (for the domains of names, places and organizations), although the system can be adapted for other languages and domains.
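To make the idea of combining a lexicon with an orthographic feature more concrete, here is a minimal sketch in Java. It is not taken from CAGEclass and does not use its API: the class name, the toy lexicon, and the three-token limit are all assumptions made for illustration. It looks up capitalized token sequences in a small list of place names and prefers the longest match at each position.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch of lexicon-based place finding; not part of the CAGEclass API. */
public class LexiconTaggerSketch {

    // Toy lexicon (an assumption): place names stored as space-joined token sequences.
    private static final Set<String> PLACE_LEXICON =
            new HashSet<>(Arrays.asList("Lisbon", "Porto", "New York"));

    // Longest name, in tokens, that the sketch will try to match (an assumption).
    private static final int MAX_NAME_TOKENS = 3;

    /** Orthographic feature: does the token start with an upper-case letter? */
    public static boolean isCapitalized(String token) {
        return !token.isEmpty() && Character.isUpperCase(token.charAt(0));
    }

    /** Returns the lexicon entries found in the text, longest match first at each position. */
    public static List<String> findPlaces(String text) {
        String[] tokens = text.trim().split("\\s+");
        List<String> found = new ArrayList<>();
        int i = 0;
        while (i < tokens.length) {
            int matched = 0;
            // Only start a lookup on capitalized tokens.
            if (isCapitalized(tokens[i])) {
                int longest = Math.min(MAX_NAME_TOKENS, tokens.length - i);
                for (int len = longest; len >= 1; len--) {
                    String candidate =
                            String.join(" ", Arrays.copyOfRange(tokens, i, i + len));
                    if (PLACE_LEXICON.contains(candidate)) {
                        found.add(candidate);
                        matched = len;
                        break;
                    }
                }
            }
            // Skip past a matched name, or advance by one token otherwise.
            i += Math.max(matched, 1);
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(findPlaces("She flew from Lisbon to New York last week"));
    }
}
```

The sketch splits on whitespace only and ignores punctuation and context. A real system would need proper tokenization and contextual rules to tell a place name from, say, a person with the same name.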

The recognized geographical entities can afterwards be used to classify documents according to geographical scopes (hence the name CAGEclass). An ontology of Portuguese geographical features is also included in the datasets, and the classification process is based on disambiguating between all the named entities associated with a document.

All functionality is available through the Java API. A command line interface is also provided for running and evaluating the named-entity recognition and scope assignment stages.

People

CAGEclass was developed at the XLDB group of the Department of Informatics of the Faculty of Sciences of the University of Lisbon in Portugal. It was created to support the research paper "Assigning Geographical Scopes to Web Pages".

CAGEclass was written by Bruno Martins.

Research

Recognizing and disambiguating geographical named entities in text is currently a hot research topic, as many Web documents are primarily relevant to geographically limited communities, and users often report the need for retrieval tools that take this into consideration.

CAGEclass is one of the components developed under the Geographic Reasoning for Search Engines (GREASE) project, which researches methods, algorithms and software architectures for geographical information extraction and retrieval. It is responsible for extracting named entities from text (using lexicons and orthographic/contextual features), and using these entities as features for classifying documents according to their geographical scope.

The performance of this software package is state of the art: in finding named entities, the system achieves an overall F1 score (using strict matching) of 57.2 on unseen evaluation data. On classifying pages according to geographical scopes, it achieves an overall F1 score of 67.3. Named entity extraction operates at roughly 100K words/second on standard desktop hardware running Sun's 1.4.2 JDK.
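For readers unfamiliar with the metric, the following sketch shows how an F1 score under strict matching is commonly computed. It is not CAGEclass's evaluation code: the class, the "start:end:type" encoding, and the sample entities are assumptions for illustration. Under strict matching, a predicted entity counts as correct only if its boundaries and its type both agree exactly with an annotated one. F1 is the harmonic mean of precision and recall.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical sketch of strict-match F1; not CAGEclass's evaluation code. */
public class StrictF1Sketch {

    /** Encodes an entity as "start:end:type", so strict matching becomes string equality. */
    public static String entity(int start, int end, String type) {
        return start + ":" + end + ":" + type;
    }

    /** Returns F1 in the range 0..1 (multiply by 100 for a percentage-style score). */
    public static double f1(Set<String> gold, Set<String> predicted) {
        // Correct predictions are those that also appear, exactly, in the gold set.
        Set<String> correct = new HashSet<>(predicted);
        correct.retainAll(gold);
        if (correct.isEmpty()) {
            return 0.0; // also avoids dividing by zero on empty inputs
        }
        double precision = (double) correct.size() / predicted.size();
        double recall = (double) correct.size() / gold.size();
        return 2 * precision * recall / (precision + recall);
    }

    public static void main(String[] args) {
        Set<String> gold = new HashSet<>(Arrays.asList(
                entity(0, 1, "PLACE"), entity(5, 7, "PLACE"), entity(10, 11, "ORG")));
        Set<String> predicted = new HashSet<>(Arrays.asList(
                entity(0, 1, "PLACE"),
                entity(5, 6, "PLACE"),    // wrong right boundary: not counted
                entity(10, 11, "ORG"),
                entity(20, 21, "PLACE")   // spurious entity
        ));
        // 2 correct out of 4 predicted and 3 gold: precision 1/2, recall 2/3.
        System.out.println(f1(gold, predicted));
    }
}
```

In the sample data the second prediction overlaps an annotated entity but misses its right boundary, so strict matching gives it no credit. A more lenient scheme that rewards partial overlaps would produce a higher score on the same output.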

Availability

CAGEclass is released under the BSD License, which basically states that you can do anything you like with it as long as you mention the authors and make it clear that the library is covered by the BSD License. It also exempts us from any liability, should this library eat your hard disc or kill your cat.

Source code, samples and detailed documentation are provided in the download. The Java API documentation is also available online.

The software is relatively easy to install and run. We encourage you to try it out and let us know of any problems you find. We would also be very happy to hear from people who are using this package.