OAK System - Manual

Satoshi Sekine (New York University)
June 20, 2001

1. General

The OAK system is a comprehensive English analyzer, which consists of a sentence splitter, a tokenizer, a POS tagger, a stemmer, a chunker, a Named Entity (NE) tagger, a dependency analyzer, a parser, a function tagger and a regularizer. It mainly uses explicit rules rather than probabilistic scores, so that humans can modify the rules and, hopefully, improve the accuracy. The rules are mostly extracted using transformation or decision list learning methods, and they look like regular expressions. It accepts input at any level (text, plain sentence, tokenized, POS-tagged, chunk-tagged, dependency-tagged, parsed, function-tagged or regularized sentence) and produces output at any of the same levels. It can also handle several formats (plain, Penn Treebank's tagged format, Penn Treebank's combined format, plain stemmed format, stem with POS tag format, MUC format, Collins' parser format, Tipster format, SGML format). It can therefore be used as a filter or a simplifier, as well as an analyzer.
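The idea of "any level of input, any level of output" can be illustrated with a minimal Python sketch. It is not OAK's implementation. The level names are taken from the list above, but the strict linear ordering and the function name steps_needed are assumptions made only for illustration.

```python
from __future__ import annotations

# Level names follow the list in the text above.
# ASSUMPTION: the levels form a single linear order, as listed.
LEVELS = [
    "text",
    "sentence",
    "tokenized",
    "pos-tagged",
    "chunk-tagged",
    "dependency-tagged",
    "parsed",
    "function-tagged",
    "regularized",
]


def steps_needed(input_level: str, output_level: str) -> list[str]:
    """Return the levels that must be produced to get from input to output.

    An empty list means no further analysis is needed: the requested output
    is at or below the input level, which corresponds to using the system
    as a filter or simplifier rather than as an analyzer.
    """
    start = LEVELS.index(input_level)   # raises ValueError for unknown levels
    end = LEVELS.index(output_level)
    return LEVELS[start + 1 : end + 1]


if __name__ == "__main__":
    # Tokenized input, chunk-tagged output: two analysis steps remain.
    print(steps_needed("tokenized", "chunk-tagged"))
```

Under these assumptions, a request whose output level is lower than its input level needs no analysis steps at all, only the removal of annotation.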


2. Install


3. Level and Format

Level is the level of the input or output, and format is the format in which it is written. OAK currently supports 11 levels and 11 formats with some limited combinations (DETAIL, TIPSTER and SGML for all levels):

You can see some sample files by clicking the specified level and format above. These are the outputs for the text shown in the Text.


4. Option and Parameter

There are two ways to specify the dynamic settings of the system: command line options and a parameter file. Command line options override the settings specified in the parameter file.

In the parameter file, you can specify more settings, and in more detail. This is a sample parameter file. We are not going to explain the meaning of each parameter, but they are mostly self-evident.
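The precedence rule described above (command line options win over the parameter file) can be sketched in Python as follows. This is only an illustration of the merging logic. The "key = value" file syntax and the setting names input_level and output_level are hypothetical and are not taken from the actual oak.prm format.

```python
from __future__ import annotations


def parse_params(text: str) -> dict[str, str]:
    """Parse a HYPOTHETICAL 'key = value' parameter file into a dict.

    Blank lines and lines starting with '#' are skipped. The real oak.prm
    syntax may differ; this format is assumed for illustration only.
    """
    settings: dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings


def effective_settings(
    file_settings: dict[str, str], cli_settings: dict[str, str]
) -> dict[str, str]:
    """Merge settings so that command line values override file values."""
    merged = dict(file_settings)   # start from the parameter file
    merged.update(cli_settings)    # command line options take precedence
    return merged


if __name__ == "__main__":
    # Hypothetical setting names, used only to show the override behaviour.
    from_file = parse_params("input_level = text\noutput_level = parsed\n")
    from_cli = {"output_level": "pos-tagged"}
    print(effective_settings(from_file, from_cli))
```

Settings that appear only in the parameter file are kept unchanged, so the file can hold the detailed configuration while the command line adjusts individual values per run.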


5. How to run

This is the shell script used to create the sample files. You should prepare an "oak.prm" file in the current directory.


6. Knowledge


7. For Advanced Users