OAK System - Manual
Satoshi Sekine (New York University)
June 20, 2001
1. General
The OAK system is a complete English analyzer, which consists of
a sentence splitter, a tokenizer, a POS tagger, a stemmer,
a chunker, a Named Entity (NE) tagger, a dependency analyzer,
a parser, a function tagger and a regularizer.
It mostly uses explicit rules, rather than probabilistic scores,
so that humans can modify the rules and hopefully improve the accuracy.
The rules are mostly extracted by transformation-based or
decision list learning methods, and they look like
regular expressions.
It accepts input at any level (text, plain sentence,
tokenized, POS-tagged, chunk-tagged, dependency-tagged, parsed,
function-tagged or regularized sentence) and produces output at any
of the same levels.
It can also handle several formats (plain, Penn Treebank
tagged format, Penn Treebank combined format, plain stemmed format,
stem with POS tag format, MUC format, Collins' parser format,
Tipster format, SGML format).
It can therefore be used as a filter or a simplifier, as well as an analyzer.
2. Install
3. Level and Format
Level is the level of the input or output, and format is the format in
which it is written.
OAK currently supports 11 levels and 11 formats in a limited set of
combinations (DETAIL, TIPSTER and SGML are available for all levels):
- Text (PLAIN)
- Sentence (PLAIN)
- Tokenized sentence (PLAIN)
- POS tagged sentence (PTB_BRACKET, PTB_TAG, STEM, STEM_TAG, COLLINS)
- NE tagged sentence (PTB_BRACKET, PTB_TAG, MUC, CONLL)
- Chunk tagged sentence (PTB_BRACKET, PTB_TAG, CONLL)
- Chunk and NE tagged sentence (PTB_BRACKET, PTB_TAG)
- Dependency structure
- Parse tree (PTB_BRACKET)
- Function tagged parse tree (PTB_BRACKET)
- Regularized
Sample files are available for the levels and formats listed above.
Each one is the output for the text shown under Text (PLAIN).
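
The level and format combinations listed above can be captured in a small lookup table. The following Python sketch is only an illustration: the lowercase level keys (such as "text" or "pos") and the function name is_supported are hypothetical names chosen here, not identifiers used by OAK. The format names and their availability per level are copied from the list.

```python
# Formats the manual states are available at every level.
UNIVERSAL_FORMATS = {"DETAIL", "TIPSTER", "SGML"}

# Level keys are hypothetical labels for this sketch.
# The format sets are copied from the list of levels in this section.
LEVEL_FORMATS = {
    "text": {"PLAIN"},
    "sentence": {"PLAIN"},
    "token": {"PLAIN"},
    "pos": {"PTB_BRACKET", "PTB_TAG", "STEM", "STEM_TAG", "COLLINS"},
    "ne": {"PTB_BRACKET", "PTB_TAG", "MUC", "CONLL"},
    "chunk": {"PTB_BRACKET", "PTB_TAG", "CONLL"},
    "chunk_ne": {"PTB_BRACKET", "PTB_TAG"},
    "dependency": set(),   # no level-specific format is listed
    "parse": {"PTB_BRACKET"},
    "function": {"PTB_BRACKET"},
    "regularized": set(),  # no level-specific format is listed
}


def is_supported(level, fmt):
    """Return True if the manual lists `fmt` as available for `level`."""
    if level not in LEVEL_FORMATS:
        raise KeyError(f"unknown level: {level}")
    return fmt in UNIVERSAL_FORMATS or fmt in LEVEL_FORMATS[level]
```

A check like this could be run before invoking the analyzer, so that an unsupported combination is reported early.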
4. Option and Parameter
There are two ways to specify the dynamic settings of the system:
command line options and a parameter file.
Command line options override the settings in the
parameter file.
- -h : display help
- -p filename : parameter file
- -i specification : input level
- -o specification : output level
- -s specification : start level
- -I specification : input format
- -O specification : output format
- -r filename : input filename
- -w filename : output filename
- -b : batch mode
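
As an illustration of how these options combine, the following Python sketch builds an argument list from named settings. The flags are the ones listed above. Everything else is an assumption made for the example: the descriptive keys, the function name build_args, the program path "./oak", and the level values "IN_LEVEL" and "OUT_LEVEL", which are placeholders rather than documented specification names.

```python
# Flags come from the option list above; the descriptive keys are hypothetical.
FLAG_FOR = {
    "parameter_file": "-p",
    "input_level": "-i",
    "output_level": "-o",
    "start_level": "-s",
    "input_format": "-I",
    "output_format": "-O",
    "input_file": "-r",
    "output_file": "-w",
}


def build_args(program, batch=False, **settings):
    """Return an argument list for `program` using the flags listed above."""
    args = [program]
    for key, value in settings.items():
        if key not in FLAG_FOR:
            raise ValueError(f"unknown setting: {key}")
        args += [FLAG_FOR[key], str(value)]
    if batch:
        args.append("-b")  # batch mode takes no value
    return args


# "./oak", "IN_LEVEL" and "OUT_LEVEL" are placeholders, not documented names.
print(build_args("./oak", input_level="IN_LEVEL", output_level="OUT_LEVEL",
                 input_file="in.txt", output_file="out.txt"))
```

The resulting list could be passed to subprocess.run once the real program path and specification values are known.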
The parameter file lets you specify more settings, and in more detail,
than the command line.
This is a sample parameter file.
The meaning of each parameter is not explained here,
but most are self-evident.
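
The precedence rule stated in this section (command line options take priority over the parameter file) can be sketched as a simple merge. This is an illustration only: the setting names are hypothetical, and the sketch does not read OAK's actual parameter file format, which is not described in this manual.

```python
def effective_settings(param_file_settings, command_line_settings):
    """Merge two setting dicts; command line values win when both are given."""
    merged = dict(param_file_settings)
    for key, value in command_line_settings.items():
        if value is not None:  # None means the option was not given
            merged[key] = value
    return merged


# Hypothetical setting names and values, used only to show the precedence.
from_file = {"input_level": "LEVEL_A", "output_level": "LEVEL_B"}
from_cli = {"input_level": None, "output_level": "LEVEL_C"}
print(effective_settings(from_file, from_cli))
```

Settings that appear only in the parameter file are kept unchanged, which matches the idea that the file holds the detailed defaults.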
5. How to run
This is the shell script used to create the sample files.
An "oak.prm" file must be present in the current directory.
6. Knowledge
- Dictionary : Word dictionary, POS tag, frequency, stem and class information
headword '/' {pos'/'frequency['/'stem_word]}* ['/' {class}*]
- Class : Class file, used to create dictionary
class '/' pos '/' {word}*
- Stem : Stem file, used to create dictionary
- POS tagger rule : Regular expression based POS tagger rule
{'('lex class pos possible_pos')'}* > position=target_pos
- Chunker quad-gram : Quad-gram for chunker to make initial guess
'=' pos1 pos2 pos3 pos4\n
{frequency chunk1 chunk2 chunk3 chunk4\n}*
- Chunker rule : Regular expression based chunker rule
{'(' lex class pos current_chunk ')'}* > position=target_chunk
- NE Hierarchy : NE hierarchy definition
- NE dictionary : NE dictionary
#include filename
words '/' ne
- NE rule : Regular expression based NE rule
{'(' lex class pos current_ne ')'}* > {position=target_ne}*
- Head table : Head table to find head in constituent
cat {left-to-right|right-to-left} {cat}*
- Function tagger rule : Regular expression based function tagger rule
{'(' cat pos class lex ')'}*4 '(' {information}*4 ')' > function_tag
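
To make the head table format above more concrete, here is a minimal Python sketch of one common way such a table is used. It assumes whitespace-separated fields and a priority-based search (try each listed category in order, scanning the children in the given direction). The manual does not spell out these details, so both are assumptions, and the function names and sample rules are hypothetical.

```python
def parse_head_rule(line):
    """Parse 'cat direction cat1 cat2 ...' into (cat, direction, priorities)."""
    fields = line.split()
    if len(fields) < 2:
        raise ValueError(f"malformed head rule: {line!r}")
    cat, direction, priorities = fields[0], fields[1], fields[2:]
    if direction not in ("left-to-right", "right-to-left"):
        raise ValueError(f"unknown direction: {direction}")
    return cat, direction, priorities


def find_head(children, direction, priorities):
    """Return the index of the head child among `children` (category labels)."""
    if not children:
        raise ValueError("constituent has no children")
    order = list(range(len(children)))
    if direction == "right-to-left":
        order.reverse()
    # Assumed semantics: earlier categories in the rule take priority.
    for wanted in priorities:
        for i in order:
            if children[i] == wanted:
                return i
    # Assumed fallback: the first child in scan order.
    return order[0]


# Hypothetical rule written in the format shown above.
cat, direction, priorities = parse_head_rule("NP right-to-left NN NNS NP")
print(find_head(["DT", "NN", "NN"], direction, priorities))
```

With a right-to-left rule, the rightmost matching child is chosen, which is the usual behavior for English noun phrases.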
7. For Advanced Users