Department of Computer Science
New York University

Information extraction involves processing text to identify selected information,
such as particular types of names or specified classes of events.
For names, it is sufficient to find the name in the text and identify its
type; for events, we must extract the critical information about
each event (the agent, objects, date, location, etc.) and place this information
in a set of templates (data base). The Proteus Project conducts a
wide range of research related to information extraction, including name
extraction, event extraction, and unsupervised
learning methods, in several languages, and participates in extraction
system evaluations. Most of our work on extraction has been in
English and Japanese, but we have also built a system for Spanish and are
currently building a system for Chinese.
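The two kinds of output described above can be pictured as two record shapes: a name needs only a span and a type, while an event fills a template of slots. The sketch below is illustrative only; the class names, field names, and sample sentence are hypothetical and are not the Proteus system's actual schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class NameMention:
    """A name found in text: where it is and what type it has."""
    text: str
    start: int      # offset of the first character
    end: int        # offset one past the last character
    name_type: str  # e.g. "PERSON", "ORGANIZATION", "LOCATION"


@dataclass
class EventTemplate:
    """One record in the event data base; unfilled slots stay None."""
    event_type: str
    agent: Optional[str] = None
    objects: Optional[str] = None
    date: Optional[str] = None
    location: Optional[str] = None


def find_name(sentence: str, name: str, name_type: str) -> Optional[NameMention]:
    """Locate a known name string in a sentence (hypothetical helper)."""
    start = sentence.find(name)
    if start < 0:
        return None
    return NameMention(name, start, start + len(name), name_type)


# Invented example sentence, used only to populate the two structures.
sentence = "Alpha Holdings acquired Beta Systems in Boston on 3 May"
mention = find_name(sentence, "Beta Systems", "ORGANIZATION")
event = EventTemplate(
    event_type="acquisition",
    agent="Alpha Holdings",
    objects="Beta Systems",
    date="3 May",
    location="Boston",
)
```

For a name, the span and type are the whole answer; for an event, each slot must be filled from possibly different parts of the text, which is what makes event extraction the harder task.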

Although little attention is devoted to names in the linguistics literature,
names (of people, places, organizations, etc.) are very common in most
varieties of text, and successfully identifying and classifying names is
essential for almost all text processing. NYU was part of the organizing
committee for Message Understanding Conference-6, which introduced the
evaluation of name extraction as a separate task, and has actively pursued
research on name extraction. In particular, we have investigated
a number of techniques for combining information developed by hand (including
word lists and patterns) with statistics developed from corpora in which
the names have been marked. We have used decision trees (Proteus
Project Memoranda 103,
and maximum entropy methods (Proteus Project Memoranda 104,
132) and have obtained good results for both English and Japanese.
These methods, however, require large training corpora in which all the
names have been marked -- an expensive proposition; to reduce this
cost, we have recently developed a name tagger which learns from unmarked
corpora, using only a small seed set of sample names (Yangarber et al.,

A broad-coverage system must deal with many types of names -- not just
people, places, and organizations. We have therefore designed a rich
hierarchy of name types (Proteus Project Memorandum 02-004)
and have built a tagger to cover the range of names in this hierarchy.
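The idea of learning name types from an unmarked corpus and a small seed set can be illustrated with a generic bootstrapping loop. This is a simplified sketch written for this overview, not the algorithm of the tagger cited above: it assumes that the word to the left of a name is a useful clue, and the function name, corpus, and seeds are all invented.

```python
from collections import defaultdict


def bootstrap_names(sentences, seeds, rounds=2):
    """Grow a name -> type lexicon from seeds using left-context words."""
    lexicon = dict(seeds)
    for _ in range(rounds):
        # 1. Record which name types each left-context word occurs with.
        context_types = defaultdict(set)
        for tokens in sentences:
            for prev, tok in zip(tokens, tokens[1:]):
                if tok in lexicon:
                    context_types[prev.lower()].add(lexicon[tok])
        # 2. Keep only contexts seen with exactly one type.
        reliable = {
            ctx: next(iter(types))
            for ctx, types in context_types.items()
            if len(types) == 1
        }
        # 3. Label unknown capitalized tokens that follow a reliable context.
        for tokens in sentences:
            for prev, tok in zip(tokens, tokens[1:]):
                if (tok[0].isupper() and tok not in lexicon
                        and prev.lower() in reliable):
                    lexicon[tok] = reliable[prev.lower()]
    return lexicon


# Tiny invented corpus; no names are marked, only two seeds are given.
corpus = [
    "He flew to London yesterday".split(),
    "She flew to Tokyo today".split(),
    "We met Mr. Smith there".split(),
    "They met Mr. Tanaka later".split(),
    "A letter from Mr. Sato arrived in Osaka".split(),
]
seeds = {"London": "LOCATION", "Smith": "PERSON"}
lexicon = bootstrap_names(corpus, seeds)
```

Here "Tokyo", "Tanaka", and "Sato" are picked up because they share a context with a seed, while "Osaka" stays unlabeled because no seed ever followed "in". A practical tagger would need richer contexts and some control over error propagation.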

Event extraction is a complex task because an event may be described in
so many different ways in text. We address this complexity through
an extraction system which incorporates name recognition, analysis of linguistic
structure, identification of event patterns, reference resolution, and
limited inference rules to combine information across sentences.
This system has been applied to a range of domains, including military
messages and various news reports, extracting information about international
terrorism, international joint ventures, the appointment of corporate executives,
mergers and acquisitions, satellite launchings, natural disasters, and
infectious disease outbreaks, among other topics.

The need to address a wide variety of different event types has led
us to study how events of different types can best be represented in a
data base structure (Proteus Project Memorandum 02-006).
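To make the event-pattern stage concrete, the sketch below maps two different phrasings of an executive appointment onto the same set of slots. It is a toy illustration under strong assumptions: the regular expressions stand in for patterns over linguistic structure, the slot names are hypothetical, and name recognition, reference resolution, and cross-sentence inference are left out entirely.

```python
import re

# Crude stand-in for name recognition: a run of capitalized words.
NAME = r"[A-Z]\w+(?: [A-Z]\w+)*"

# Two hypothetical surface patterns for the same event type.
PATTERNS = [
    # "<org> appointed <person> as <post>"
    re.compile(rf"(?P<org>{NAME}) (?:appointed|named) "
               rf"(?P<person>{NAME}) as (?P<post>[a-z ]+)"),
    # "<person> was named <post> of <org>"
    re.compile(rf"(?P<person>{NAME}) was (?:appointed|named) "
               rf"(?P<post>[a-z ]+) of (?P<org>{NAME})"),
]


def extract_appointments(text):
    """Return one slot dictionary per sentence that matches a pattern."""
    events = []
    # Naive sentence split: break after a period followed by whitespace.
    for sentence in re.split(r"(?<=\.)\s+", text):
        for pattern in PATTERNS:
            match = pattern.search(sentence)
            if match:
                events.append(match.groupdict())
                break  # at most one event per sentence in this sketch
    return events


text = ("Acme Widgets appointed Jane Doe as president. "
        "Jane Roe was named chairman of Beta Systems.")
events = extract_appointments(text)
```

Both sentences yield records with the same three slots (`org`, `person`, `post`), even though the active and passive phrasings order the information differently.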

Customizing an extraction system for a new domain requires considerable
work: defining new predicates, creating a concept hierarchy, and writing
patterns for the events. To facilitate this task, we have developed
a unified graphical interface, PET, for this customization (Proteus Project
Memoranda 96, 154).

Our largest effort to date has been an integrated system for providing
access to reports on infectious disease outbreaks. This system combines
a web crawler (which searches for reports of outbreaks on a daily basis),
an extraction engine, and a data base browser to examine the extracted
events (Proteus Project Memorandum 02-001).
Through the browser, the user can examine a
table of events, select and sort events by date, location, disease
type, etc., and then view the documents reporting these events.
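The select-and-sort behavior of such a browser can be sketched in a few lines. The records, field names, and the `select_events` function below are invented for illustration and say nothing about how the actual browser or its data base is implemented.

```python
from datetime import date

# Invented sample records; "doc" points back to the source report.
events = [
    {"date": date(2002, 5, 2), "location": "Region B",
     "disease": "cholera", "doc": "report-2"},
    {"date": date(2002, 3, 14), "location": "Region A",
     "disease": "cholera", "doc": "report-1"},
    {"date": date(2002, 4, 1), "location": "Region A",
     "disease": "measles", "doc": "report-3"},
]


def select_events(events, sort_key="date", **criteria):
    """Keep events matching every criterion, then sort by one field."""
    chosen = [
        e for e in events
        if all(e.get(field) == value for field, value in criteria.items())
    ]
    return sorted(chosen, key=lambda e: e[sort_key])


# Select by disease, sorted by date (the default).
cholera = select_events(events, disease="cholera")
# Select by location, sorted by disease name.
region_a = select_events(events, sort_key="disease", location="Region A")
# From the selected rows, the user can follow "doc" to the source reports.
cholera_docs = [e["doc"] for e in cholera]
```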

Before patterns, predicates, and word classes can be created for an extraction
system in a new domain, extensive work is required to analyze the corpus
in order to identify all the different forms in which a particular type
of event can be expressed. This remains a significant barrier to
the porting of extraction systems to new domains. To lower this barrier,
we are conducting research on methods which automatically learn the patterns
for a new domain, given a small set of seed patterns (Proteus Project Memoranda
154) or a topic description
(Proteus Project Memorandum 01-009).
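Learning patterns from a small seed set can be sketched as an iterative loop: documents containing an accepted pattern are treated as relevant, and the candidate pattern most concentrated in relevant documents is accepted next. The code below is a simplified illustration and not the method of the cited memoranda; in particular, the scoring formula, the string form of the patterns, and the sample documents are assumptions made for this example.

```python
import math


def discover_patterns(docs, seed_patterns, rounds=2):
    """Grow a pattern set from seeds; each doc is a set of pattern strings."""
    accepted = set(seed_patterns)
    for _ in range(rounds):
        # Documents containing any accepted pattern count as relevant.
        relevant = [d for d in docs if d & accepted]
        candidates = set().union(*docs) - accepted
        best, best_score = None, 0.0
        for pattern in sorted(candidates):  # sorted for determinism
            total = sum(1 for d in docs if pattern in d)
            hits = sum(1 for d in relevant if pattern in d)
            # Assumed score: precision weighted by log of support.
            score = (hits / total) * math.log(1 + hits)
            if score > best_score:
                best, best_score = pattern, score
        if best is None:
            break  # no candidate occurs in any relevant document
        accepted.add(best)
    return accepted


# Invented documents, each reduced to the candidate patterns it contains.
docs = [
    {"company appoint person", "person resign"},
    {"company appoint person", "person succeed person"},
    {"person succeed person", "person resign"},
    {"company report profit", "person resign"},
    {"company report profit", "stock rise"},
]
learned = discover_patterns(docs, {"company appoint person"})
```

Starting from one seed, the first round accepts "person succeed person", which makes a third document relevant and lets the second round accept "person resign". Patterns that never co-occur with accepted ones, such as "company report profit", score zero and are never added.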

The development of information extraction over the past 15 years has been
driven to a remarkable degree by a series of evaluations conducted by the
U.S. Government. The original evaluations were called the "Message
Understanding Conferences" and were conducted from 1988 to 1998.
NYU participated in all these evaluations, and was involved in the design
of the evaluations for the most recent Message Understanding Conferences,
MUC-6 (1995) and MUC-7
(1998). More recently, we have participated in the evaluations of
the "Automatic Content Extraction" program.