Information Extraction

Proteus Project

Department of Computer Science
New York University


Information extraction involves processing text to identify selected information, such as particular types of names or specified classes of events.  For names, it is sufficient to find the name in the text and identify its type;  for events, we must extract the critical information about each event (the agent, objects, date, location, etc.) and place this information in a set of templates (data base).  The Proteus Project conducts a wide range of research related to information extraction, including name extraction, event extraction, and unsupervised learning methods, in several languages, and participates in extraction system evaluations.  Most of our work on extraction has been in English and Japanese, but we have also built a system for Spanish and are currently building a system for Chinese.
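
As a rough illustration of what "placing information in a template" can mean, the sketch below models one event as a record with the slots named above (agent, objects, date, location). The class name `EventTemplate`, its field names, and the sample values are assumptions made for this example; they are not taken from the Proteus system.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class EventTemplate:
    """One extracted event; every slot except the type may be unfilled."""
    event_type: str
    agent: Optional[str] = None
    objects: List[str] = field(default_factory=list)
    date: Optional[str] = None
    location: Optional[str] = None

    def filled_slots(self) -> dict:
        """Return only the slots for which the text supplied a value."""
        slots = {
            "agent": self.agent,
            "objects": self.objects or None,
            "date": self.date,
            "location": self.location,
        }
        return {name: value for name, value in slots.items() if value is not None}


# Hypothetical event; the company, person, and date are invented.
template = EventTemplate(
    event_type="executive_appointment",
    agent="Acme Corp.",
    objects=["Jane Doe"],
    date="1995-03-01",
)
print(template.filled_slots())
```

Leaving slots optional reflects the fact that a single report rarely states every detail of an event; here the location is simply left unfilled.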

Name Extraction

Although little attention is devoted to names in the linguistics literature, names (of people, places, organizations, etc.) are very common in most varieties of text, and successfully identifying and classifying names is essential for almost all text processing.  NYU was part of the organizing committee for Message Understanding Conference-6, which introduced the evaluation of name extraction as a separate task, and has actively pursued research on name extraction.  In particular, we have investigated a number of techniques for combining information developed by hand (including word lists and patterns) with statistics developed from corpora in which the names have been marked.  We have used decision trees (Proteus Project Memoranda 103, 115) and maximum entropy methods (Proteus Project Memoranda 104, 114, 132) and have obtained good results for both English and Japanese.  These methods, however, require large training corpora in which all the names have been marked -- an expensive proposition;  to reduce this cost, we have recently developed a name tagger which learns from unmarked corpora, using only a small seed set of sample names (Yangarber et al., Coling 2002).
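
The general idea of learning names from unmarked text with a small seed set can be sketched as a simple bootstrapping loop: known names reveal contexts that signal a name type, and those contexts in turn reveal new names. The code below is a toy under strong assumptions (the only context is the preceding word, and a context is trusted only if it has been seen with a single type); it is not the algorithm of the cited paper, and the function name `bootstrap_names` and the sample sentences are invented for illustration.

```python
from collections import defaultdict


def bootstrap_names(sentences, seeds, rounds=2):
    """sentences: lists of tokens; seeds: dict mapping a name to its type."""
    names = dict(seeds)
    for _ in range(rounds):
        # Step 1: record which types follow each left-context word.
        context_types = defaultdict(set)
        for tokens in sentences:
            for i in range(1, len(tokens)):
                if tokens[i] in names:
                    context_types[tokens[i - 1].lower()].add(names[tokens[i]])
        # Keep only contexts that were seen with exactly one type.
        contexts = {c: next(iter(t)) for c, t in context_types.items() if len(t) == 1}
        # Step 2: label unknown capitalized tokens that follow a trusted context.
        for tokens in sentences:
            for i in range(1, len(tokens)):
                word, left = tokens[i], tokens[i - 1].lower()
                if word[0].isupper() and word not in names and left in contexts:
                    names[word] = contexts[left]
    return names


# Invented, unmarked sample text and a two-name seed set.
sentences = [
    "He flew to Paris yesterday".split(),
    "She flew to Lima today".split(),
    "We met Dr. Smith there".split(),
    "They met Dr. Jones later".split(),
]
learned = bootstrap_names(sentences, {"Paris": "LOCATION", "Smith": "PERSON"})
print(learned)
```

In this small corpus the seeds "Paris" and "Smith" yield the contexts "to" and "dr.", which then label "Lima" as a location and "Jones" as a person. A realistic tagger would need richer contexts and some control over error propagation between rounds.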

A broad-coverage system must deal with many types of names -- not just people, places, and organizations.  We have therefore designed a rich hierarchy of name types (Proteus Project Memorandum 02-004) and have built a tagger to cover the range of names in this hierarchy.

Event Extraction

Event extraction is a complex task because an event may be described in so many different ways in text.  We address this complexity through an extraction system which incorporates name recognition, analysis of linguistic structure, identification of event patterns, reference resolution, and limited inference rules to combine information across sentences.  This system has been applied to a range of domains, including military messages and various news reports, extracting information about international terrorism, international joint ventures, the appointment of corporate executives, mergers and acquisitions, satellite launchings, natural disasters, and infectious disease outbreaks, among other topics.

The need to address a wide variety of different event types has led us to study how events of different types can best be represented in a data base structure (Proteus Project Memorandum 02-006).

Customizing an extraction system for a new domain requires considerable work:  defining new predicates, creating a concept hierarchy, and writing patterns for the events.  To facilitate this task, we have developed a unified graphical interface, PET, for this customization (Proteus Project Memoranda 96, 154).

Our largest effort to date has been an integrated system for providing access to reports on infectious disease outbreaks.  This system combines a web crawler (which searches for reports of outbreaks on a daily basis), an extraction engine, and a data base browser to examine the extracted events (Proteus Project Memorandum 02-001).  Through the browser, the user can examine a table of events, select and sort events by date, location, disease type, etc., and then view the documents reporting these events.
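
The select-and-sort behavior described for the browser can be approximated in a few lines. The sketch below assumes each extracted event is a flat record with `date`, `location`, `disease`, and `doc` fields; these field names, the helper `select_events`, and all of the rows are invented for illustration and do not describe the actual data base schema.

```python
from datetime import date

# Invented rows for illustration only; these are not real extracted events.
events = [
    {"date": date(2002, 5, 20), "location": "Region B", "disease": "cholera", "doc": "report_b.txt"},
    {"date": date(2002, 3, 14), "location": "Region A", "disease": "cholera", "doc": "report_a.txt"},
    {"date": date(2002, 4, 2), "location": "Region A", "disease": "measles", "doc": "report_c.txt"},
]


def select_events(rows, sort_key="date", **criteria):
    """Keep rows matching every field=value criterion, ordered by sort_key."""
    chosen = [r for r in rows if all(r.get(k) == v for k, v in criteria.items())]
    return sorted(chosen, key=lambda r: r[sort_key])


# Select cholera events, sort them by date, and list the source documents.
cholera_docs = [r["doc"] for r in select_events(events, disease="cholera")]
print(cholera_docs)
```

Keeping a pointer to the source document in each row is what lets a user move from a table entry back to the report that mentions the event.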

Discovery Methods

Before patterns, predicates, and word classes can be created for an extraction system in a new domain, extensive work is required to analyze the corpus in order to identify all the different forms in which a particular type of event can be expressed.  This remains a significant barrier to the porting of extraction systems to new domains.  To lower this barrier, we are conducting research on methods which automatically learn the patterns for a new domain, given a small set of seed patterns (Proteus Project Memoranda 150, 154) or a topic description (Proteus Project Memorandum 01-009).
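
One way to picture learning patterns from a small seed set is the loop sketched below: documents matching an accepted pattern are treated as relevant, and the candidate pattern that occurs most reliably in relevant documents is accepted next. This is a toy under stated assumptions: each document is reduced to a set of pattern strings, and the score (precision times the log of the hit count) is chosen for the example. It should not be read as the method of the cited memoranda, and `discover_patterns` and the sample data are hypothetical.

```python
import math


def discover_patterns(docs, seeds, rounds=2):
    """docs: list of sets of pattern strings; seeds: initial accepted patterns."""
    accepted = set(seeds)
    for _ in range(rounds):
        candidates = set().union(*docs) - accepted
        best, best_score = None, 0.0
        for pattern in sorted(candidates):  # sorted for deterministic tie-breaking
            containing = [d for d in docs if pattern in d]
            # A document counts as relevant if it matches any accepted pattern.
            hits = sum(1 for d in containing if d & accepted)
            score = (hits / len(containing)) * math.log(hits + 1)
            if score > best_score:
                best, best_score = pattern, score
        if best is None:  # no candidate occurs in any relevant document
            break
        accepted.add(best)
    return accepted


# Invented documents, each reduced to the patterns it contains.
docs = [
    {"company appoint person", "person resign"},
    {"company appoint person", "company name person"},
    {"person resign", "person succeed person"},
    {"company report profit"},
    {"company report profit", "weather be cold"},
]
found = discover_patterns(docs, {"company appoint person"}, rounds=2)
print(sorted(found))
```

With this data the first round accepts "company name person" and the second accepts "person resign". Running more rounds starts to pull in patterns through documents that are only indirectly related to the seed, which is why a practical system would need a stopping criterion or a confidence threshold.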


The development of information extraction over the past 15 years has been driven to a remarkable degree by a series of evaluations conducted by the U.S. Government.  The original evaluations were called the "Message Understanding Conferences" and were conducted from 1988 to 1998.  NYU participated in all these evaluations, and was involved in the design of the evaluations for the most recent Message Understanding Conferences, MUC-6 (1995) and MUC-7 (1998).  More recently, we have participated in the evaluations of the "Automated Content Extraction" program.