The Proteus Project

Computer Science Department
New York University

715 Broadway, 7th floor,
New York, NY
10003, USA
Tel:+1(212)998-3497
  or  +1(212)998-3003
Fax:+1(212)995-4123

general inquiries:
nlp-info

Welcome to the Proteus Project

Members of the Proteus Project have been doing Natural Language Processing (NLP) research at New York University since the 1960s. Our long-term goal is to build systems that automatically find the information you're looking for, pick out the most useful parts, and present them in your preferred language, at the right level of detail. One of our main challenges is to endow computers with linguistic knowledge. The kinds of knowledge that we have attempted to encode include vocabularies, morphology, syntax, semantics, genre variation, and translational equivalence.

The Proteus Project members are simultaneously scientists and engineers. We are driven by the quest for knowledge, but we also love to build things that work and that nobody else has built before. Consequently, our interests range from the most basic research to immediately useful resources and applications. The diversity of the project members is reflected in the diversity of our work styles: some of us prefer to encode linguistic knowledge from introspection; others prefer to build systems that can learn for themselves.

Within NLP, our main focus is on the problems of Information Extraction. Information Extraction over a collection of texts involves, first of all, figuring out what types of entities are mentioned in the text and what relations are expressed between these entities; this allows us to create a database schema to capture the information in the text. Once this is done, we create procedures to extract this information from the text and populate a database, with links back to the original text. Making the information in the text explicit in this way enables much more powerful text search and makes it possible to infer new information from what the text states.
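
As a rough illustration of the pipeline described above (schema, extraction procedure, populated database, links back to the text), here is a minimal sketch in Python. It is not the Proteus Project's own method: the `acquisition` table, the single hand-written pattern, and the two sample sentences are all hypothetical, and a regular expression stands in for real linguistic analysis.

```python
import re
import sqlite3

# Hypothetical schema: one relation type ("acquisition") between two
# organization entities, plus a link back to the source text.
SCHEMA = """
CREATE TABLE acquisition (
    buyer        TEXT NOT NULL,
    target       TEXT NOT NULL,
    doc_id       TEXT NOT NULL,
    start_offset INTEGER NOT NULL,
    end_offset   INTEGER NOT NULL
)
"""

# A capitalized word sequence is treated as an entity name (a crude assumption).
NAME = r"[A-Z][A-Za-z]+(?: [A-Z][A-Za-z]+)*"
PATTERN = re.compile(rf"(?P<buyer>{NAME}) (?:acquired|bought) (?P<target>{NAME})")


def extract(doc_id, text):
    """Yield one row per pattern match, with character offsets into the text."""
    for m in PATTERN.finditer(text):
        yield (m.group("buyer"), m.group("target"), doc_id, m.start(), m.end())


def populate(conn, docs):
    """Create the table and fill it from a mapping of doc_id -> text."""
    conn.execute(SCHEMA)
    for doc_id, text in docs.items():
        conn.executemany(
            "INSERT INTO acquisition VALUES (?, ?, ?, ?, ?)",
            extract(doc_id, text),
        )
    conn.commit()


# Made-up sample documents, for illustration only.
DOCS = {
    "doc1": "Last week Acme Corp acquired Widget Works for an undisclosed sum.",
    "doc2": "Analysts said Globex bought Initech in March.",
}

conn = sqlite3.connect(":memory:")
populate(conn, DOCS)
rows = conn.execute(
    "SELECT buyer, target, doc_id, start_offset, end_offset "
    "FROM acquisition ORDER BY doc_id"
).fetchall()

for buyer, target, doc_id, start, end in rows:
    # The stored offsets recover the exact span that supports each row.
    print(buyer, "->", target, "|", doc_id, "|", DOCS[doc_id][start:end])
```

Once the relation is stored in a table, a search such as "who bought Initech?" becomes an ordinary database query, and each answer can be traced to the sentence that supports it through the stored offsets.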

Building systems for Information Extraction forces us to address the broad range of problems involved in natural language understanding: analyzing sentence structure (parsing), analyzing the semantic relations between sentences (paraphrase and implication), and analyzing the structure of larger texts (reference resolution and discourse analysis).

Information Extraction poses multiple learning problems: first learning an appropriate information schema, and then learning the rules that map the information in the text into this schema. These problems are difficult because a given relationship may be expressed in many ways, and may be expressed in different ways in different domains. To address these problems we use a variety of machine learning methods, aided by suitable linguistic analysis.

Most of our work on Information Extraction has focused on 'current events', as reported in news articles or discussed in blogs. We have recently extended our work to include scientific papers and patents.


The Proteus Project has been supported by grants and contracts from the National Science Foundation (NSF), the Defense Advanced Research Projects Agency (DARPA), and other government agencies, as well as the NTT Corporation and Fujitsu Laboratories Ltd.