Welcome to the Proteus Project
Members of the Proteus Project have been doing Natural Language
Processing (NLP) research at New York University since the 1960s.
Our long-term goal is to build systems that automatically find the
information you're looking for, pick out the most useful bits, and
present them in your preferred language, at the right level of
detail. One of our main challenges is to endow computers with
linguistic knowledge. The kinds of knowledge that we have attempted
to encode include vocabularies, morphology, syntax, semantics,
genre variation, and translational equivalence.
The Proteus Project members are
simultaneously scientists and engineers. We are driven by the quest
for knowledge, but we also love to build things that work (that
nobody else has built before). Consequently, our interests range
from the most basic research to immediately useful
resources and applications. The diversity of the project members is
reflected in the diversity of our work styles: Some of us prefer to
encode linguistic knowledge from introspection; others prefer to
build systems that can learn for themselves.
Within NLP, our main focus is on the problems of Information
Extraction. Information Extraction over a collection of texts
involves, first of all, figuring out what types of entities are
mentioned in the text and what relations are expressed between these
entities; this allows us to create a database schema to capture
information in the text. Once this is done, we create procedures
to extract this information from the text and populate a
database, with links back to the original text. Making the information
in the text explicit in this way enables much more powerful
text search and makes it possible to infer new information from
what the text already states.
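As a rough illustration of the two steps described above, the following minimal sketch defines a small schema and populates it, keeping a link from each extracted relation back to the span of text it came from. The table layout, the entity and relation type names (PERSON, ORGANIZATION, EMPLOYED_BY), the helper functions, and the sample sentence are all hypothetical, chosen only for this example; they are not the schema or code used by the project.

```python
import sqlite3

# Hypothetical schema: one table for entities, one for relations between them.
# Each relation records the document and character offsets it was extracted from.
SCHEMA = """
CREATE TABLE entity (
    id   INTEGER PRIMARY KEY,
    type TEXT NOT NULL,
    name TEXT NOT NULL
);
CREATE TABLE relation (
    id           INTEGER PRIMARY KEY,
    type         TEXT NOT NULL,
    arg1         INTEGER NOT NULL REFERENCES entity(id),
    arg2         INTEGER NOT NULL REFERENCES entity(id),
    doc_id       TEXT NOT NULL,
    start_offset INTEGER NOT NULL,
    end_offset   INTEGER NOT NULL
);
"""


def add_entity(db, etype, name):
    """Insert an entity and return its row id."""
    cur = db.execute("INSERT INTO entity (type, name) VALUES (?, ?)", (etype, name))
    return cur.lastrowid


def add_relation(db, rtype, arg1, arg2, doc_id, start, end):
    """Insert a relation together with a link back to its source span."""
    cur = db.execute(
        "INSERT INTO relation (type, arg1, arg2, doc_id, start_offset, end_offset) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        (rtype, arg1, arg2, doc_id, start, end),
    )
    return cur.lastrowid


def source_text(db, docs, relation_id):
    """Follow the link from a relation back to the original text."""
    doc_id, start, end = db.execute(
        "SELECT doc_id, start_offset, end_offset FROM relation WHERE id = ?",
        (relation_id,),
    ).fetchone()
    return docs[doc_id][start:end]


def employees_of(db, org_name):
    """Structured search: names of people with an EMPLOYED_BY relation to org_name."""
    rows = db.execute(
        "SELECT p.name FROM relation r "
        "JOIN entity p ON p.id = r.arg1 "
        "JOIN entity o ON o.id = r.arg2 "
        "WHERE r.type = 'EMPLOYED_BY' AND o.name = ? ORDER BY p.name",
        (org_name,),
    )
    return [name for (name,) in rows]


# Made-up document collection and a hand-entered extraction result.
docs = {"doc1": "Acme Corp. hired Jane Doe as chief executive."}
text = docs["doc1"]

db = sqlite3.connect(":memory:")
db.executescript(SCHEMA)
person = add_entity(db, "PERSON", "Jane Doe")
org = add_entity(db, "ORGANIZATION", "Acme Corp.")
rel = add_relation(db, "EMPLOYED_BY", person, org, "doc1", 0, len(text))
```

With the data in this form, a query such as employees_of(db, "Acme Corp.") searches over relations rather than keywords, and source_text(db, docs, rel) recovers the sentence that supports the answer.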
Building systems for Information Extraction forces us to address
the broad range of problems involved in natural language understanding:
analyzing sentence structure (parsing), analyzing the semantic
relation between sentences (paraphrase and implication), and
analyzing the structure of larger texts (reference resolution and
Information Extraction poses multiple learning problems:
first learning an appropriate information schema, and then
learning the rules that map the information in the text into this
schema. These problems are difficult because a given
relationship may be expressed in many ways, and may be
expressed in different ways in different domains. To
address these problems we use a variety of machine learning
methods, aided by suitable linguistic analysis.
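To make the difficulty concrete, here is a small sketch in which one relationship is expressed in three different ways. It uses hand-written regular expressions, not the machine learning methods mentioned above; the patterns, the EMPLOYED_BY reading, the function name, and the sample sentences are all assumptions made for illustration. Each new phrasing needs its own rule, which is part of why learning such mappings from data is attractive.

```python
import re

# Hypothetical, simplified notion of a name: one or more capitalized words.
NAME = r"[A-Z][a-z]+(?: [A-Z][a-z]+)*"

# Three hand-written patterns, each a different surface form of the same
# (person, organization) employment relationship.
PATTERNS = [
    re.compile(rf"(?P<org>{NAME}) hired (?P<person>{NAME})"),
    re.compile(rf"(?P<person>{NAME}) joined (?P<org>{NAME})"),
    re.compile(rf"(?P<person>{NAME}), an employee of (?P<org>{NAME})"),
]


def extract_employment(sentence):
    """Return (person, organization) pairs matched by any pattern."""
    pairs = []
    for pattern in PATTERNS:
        for match in pattern.finditer(sentence):
            pairs.append((match.group("person"), match.group("org")))
    return pairs


# Made-up sentences that state the same fact in different ways.
sentences = [
    "Acme hired Jane Doe last week.",
    "Jane Doe joined Acme in May.",
    "Jane Doe, an employee of Acme, spoke at the meeting.",
]
results = [extract_employment(s) for s in sentences]
```

All three sentences map to the same pair, while a phrasing the patterns do not anticipate (for example, "Acme brought Jane Doe on board") yields nothing.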
Most of our work on Information Extraction has focused
on 'current events', as reported in news articles or discussed
in blogs. We have recently extended our work to include
scientific papers and patents.
The Proteus Project has been supported by
grants and contracts from the National Science Foundation (NSF),
the Defense Advanced Research Projects Agency (DARPA), and other
government agencies, as well as the NTT Corporation
and Fujitsu Laboratories Ltd.