Proteus

OAK System

Proteus Project

Department of Computer Science
New York University


General

The OAK system is a total English analyzer. It consists of a sentence splitter, a tokenizer, a POS tagger, a stemmer, a chunker, a Named Entity (NE) tagger, a dependency analyzer, a parser, a function tagger and a regularizer. It mostly uses explicit rules rather than probabilistic scores, so that humans can modify the rules and, hopefully, improve the accuracy. Most of the rules are extracted using transformation-based or decision list learning methods, and they look like regular expressions. OAK accepts input at any level (text, plain sentences, or tokenized, POS-tagged, chunk-tagged, dependency-tagged, parsed, function-tagged or regularized sentences) and produces output at any of the same levels. It also handles several formats (plain, Penn Treebank tagged format, Penn Treebank combined format, plain stemmed format, stem with POS tag format, MUC format, Collins' parser format, Tipster format, SGML format). It can therefore be used as a filter or a simplifier as well as an analyzer.


Current Situation (as of February 29, 2004)


Availability

We would like to make this tool available to anyone for research purposes. However, it is still under development; if you really want it at this stage and are willing to cooperate with us, we may be able to provide it now. Please contact sekine@cs.nyu.edu.


Manual

The manual, still under construction, is available at http://nlp.cs.nyu.edu/oak/manual.html


Demo: Snapshot

Tokenizer
LINUX> oak -i SENTENCE -o TOKENIZED
Oak System (0.6)      March.13.2001   Satoshi Sekine (NYU)
-----
Loading Dictionary ... done
-----
> "I'm a boy."
" I 'm a boy . " 

Stemmer
LINUX> oak -i SENTENCE -o POSTAG -O STEM
Oak System (0.6)      March.13.2001   Satoshi Sekine (NYU)
-----
Loading Dictionary ... done
Loading POS tagger rule ...done
-----
> Tables aren't broken.
table be not break . 

POS tagger
LINUX> oak -i SENTENCE -o POSTAG -O PTB_TAG
Oak System (0.6)      March.13.2001   Satoshi Sekine (NYU)
-----
Loading Dictionary ... done
Loading POS tagger rule ...done
-----
> Prof. Sekine promised to create this program by December 2001.
Prof./NNP Sekine/NNP promised/VBD to/TO create/VB this/DT program/NN by/IN December/NNP 2001/CD ./.
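
For readers who want to post-process this output, the following is a minimal Python sketch that splits a line in the word/TAG format shown above into (word, tag) pairs. It is not part of OAK; the function name parse_ptb_tagged is hypothetical, and the sketch assumes that tokens are separated by whitespace and that the tag follows the last slash of each token.

```python
def parse_ptb_tagged(line):
    """Split a 'word/TAG word/TAG ...' line into (word, tag) pairs.

    Hypothetical helper, not part of OAK. Assumes whitespace-separated
    tokens with the tag after the last '/' of each token.
    """
    pairs = []
    for item in line.split():
        # rpartition splits on the LAST slash, so './.' becomes ('.', '.')
        word, _, tag = item.rpartition("/")
        pairs.append((word, tag))
    return pairs


line = ("Prof./NNP Sekine/NNP promised/VBD to/TO create/VB this/DT "
        "program/NN by/IN December/NNP 2001/CD ./.")
tagged = parse_ptb_tagged(line)
print(tagged[:3])
```

Splitting on the last slash rather than the first keeps a token intact if the word itself contains a slash.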

NE tagger
LINUX> oak -i SENTENCE -o NE -O MUC
Oak System (0.6)      March.13.2001   Satoshi Sekine (NYU)
-----
Loading Dictionary ... done
...
Loading NE rule ...done
-----
> Prof. Sekine promised to create this program by December 2001.
Prof. <ENAMEX TYPE=PERSON>Sekine</ENAMEX> promised to create this program by <TIMEX TYPE=DATE>December 2001</TIMEX>.
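
The MUC-style markup above can be read back with a small regular expression. The Python sketch below is only an illustration and is not part of OAK: the name extract_entities is hypothetical, and the pattern assumes non-nested ENAMEX, TIMEX or NUMEX elements with a single TYPE attribute, as in the sample line (only ENAMEX and TIMEX appear in the demo; NUMEX is included because it belongs to the MUC tag set).

```python
import re

# Assumed pattern: <TAG TYPE=VALUE>text</TAG>, with optional quotes
# around the value. Nested or multi-attribute tags are not handled.
MUC_TAG = re.compile(r'<(ENAMEX|TIMEX|NUMEX) TYPE="?([A-Z_]+)"?>(.*?)</\1>')


def extract_entities(text):
    """Return (type, surface string) pairs for each MUC tag found.

    Hypothetical helper, not part of OAK.
    """
    return [(m.group(2), m.group(3)) for m in MUC_TAG.finditer(text)]


sample = ("Prof. <ENAMEX TYPE=PERSON>Sekine</ENAMEX> promised to create "
          "this program by <TIMEX TYPE=DATE>December 2001</TIMEX>.")
entities = extract_entities(sample)
print(entities)
```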

Chunker
LINUX> oak -i SENTENCE -o CHUNK -O CONLL
Oak System (0.6)      March.13.2001   Satoshi Sekine (NYU)
-----
Loading Dictionary ... done
Loading POS tagger rule ...done
Loading chunker quadgram ...done
Loading chunker rule ...done
-----
> Prof. Sekine promised to create this program by December 2001.
Prof. NNP B-NP
Sekine NNP I-NP
promised VBD B-VP
to TO I-VP
create VB I-VP
this DT B-NP
program NN I-NP
by IN B-PP
December NNP B-NP
2001 CD I-NP
. . O
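
The B-/I-/O labels in the third column mark where each chunk begins and continues. As a minimal sketch of how such output might be grouped into phrases, the Python function below collects consecutive tokens into (chunk type, text) pairs. It is not part of OAK; the name bio_to_chunks is hypothetical, and the sketch assumes three whitespace-separated columns (word, POS tag, chunk label) per line, as in the output above.

```python
def bio_to_chunks(rows):
    """Group (word, pos, label) rows with B-/I-/O labels into chunks.

    Hypothetical helper, not part of OAK. An I- label whose type does
    not match the open chunk is treated as the start of a new chunk.
    """
    chunks = []
    current_type, current_words = None, []
    for word, _pos, label in rows:
        if label.startswith("I-") and current_type == label[2:]:
            # Continuation of the chunk that is currently open
            current_words.append(word)
            continue
        if current_words:
            # Any other label closes the open chunk
            chunks.append((current_type, " ".join(current_words)))
        if label == "O":
            current_type, current_words = None, []
        else:
            current_type, current_words = label[2:], [word]
    if current_words:
        chunks.append((current_type, " ".join(current_words)))
    return chunks


conll = """Prof. NNP B-NP
Sekine NNP I-NP
promised VBD B-VP
to TO I-VP
create VB I-VP
this DT B-NP
program NN I-NP
by IN B-PP
December NNP B-NP
2001 CD I-NP
. . O"""
rows = [line.split() for line in conll.splitlines()]
chunks = bio_to_chunks(rows)
for chunk_type, text in chunks:
    print(chunk_type, text)
```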


Please send any comments or questions about this page to sekine@cs.nyu.edu.