Proteus

Apple Pie Parser

Proteus Project

Department of Computer Science
New York University


General

The parser is a bottom-up probabilistic chart parser that finds the parse tree with the best score using a best-first search algorithm. The grammar of English included in the distribution is a semi-context-sensitive grammar with two non-terminals, automatically extracted from the Penn Treebank, a syntactically tagged corpus built at the University of Pennsylvania. The framework of the algorithm was reported at the International Workshop on Parsing Technologies in 1995.
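As a rough illustration of what "bottom-up chart parsing with best-first search" means, here is a minimal sketch in Python. It is not the APP implementation: the toy grammar, the lexicon, and the function name best_first_parse are all hypothetical, and the grammar is an ordinary binary-rule PCFG rather than the two-non-terminal grammar shipped with APP. Edges are taken from the agenda in order of decreasing probability, so the first complete edge that spans the sentence is the best-scoring parse under this toy grammar.

```python
import heapq
import math

# Hypothetical toy grammar (NOT the APP grammar): word -> [(tag, probability)].
LEXICON = {
    "the": [("DT", 1.0)],
    "dog": [("NN", 0.5)],
    "cat": [("NN", 0.5)],
    "saw": [("VBD", 1.0)],
}
# Binary rules: (parent, left child, right child, probability).
BINARY = [
    ("S", "NP", "VP", 1.0),
    ("NP", "DT", "NN", 1.0),
    ("VP", "VBD", "NP", 1.0),
]


def best_first_parse(words, goal="S"):
    """Return (probability, tree) for the best parse, or None if there is none."""
    agenda = []   # min-heap ordered by cost = -log(probability)
    chart = {}    # finished edges: (label, start, end) -> (cost, tree)
    counter = 0   # unique tie-breaker so the heap never compares trees

    # Seed the agenda with one edge per (word, tag) pair.
    for i, word in enumerate(words):
        for tag, p in LEXICON.get(word, []):
            heapq.heappush(agenda, (-math.log(p), counter, (tag, i, i + 1), (tag, word)))
            counter += 1

    while agenda:
        cost, _, edge, tree = heapq.heappop(agenda)
        if edge in chart:
            continue  # a better-scoring copy of this edge was finished earlier
        chart[edge] = (cost, tree)
        label, start, end = edge
        if label == goal and start == 0 and end == len(words):
            return math.exp(-cost), tree

        # Combine the new edge with every adjacent finished edge (bottom-up).
        for (olabel, ostart, oend), (ocost, otree) in list(chart.items()):
            for parent, left, right, p in BINARY:
                if left == label and right == olabel and end == ostart:
                    new_edge, new_tree = (parent, start, oend), (parent, tree, otree)
                elif left == olabel and right == label and oend == start:
                    new_edge, new_tree = (parent, ostart, end), (parent, otree, tree)
                else:
                    continue
                if new_edge not in chart:
                    new_cost = cost + ocost - math.log(p)
                    heapq.heappush(agenda, (new_cost, counter, new_edge, new_tree))
                    counter += 1
    return None
```

Calling best_first_parse("the dog saw the cat".split()) returns a probability of about 0.25 together with a nested tuple for the S tree. A word order the toy grammar cannot cover, such as ["dog", "the"], yields None.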

The key idea is the fully automatic acquisition of a grammar from a syntactically tagged corpus, instead of the manual or statistically aided manual grammar writing used in many conventional projects. Although this strategy has some problems, such as the availability of such a corpus and domain restrictions, the performance of the grammar is fairly good. The author believes the idea points to one of the promising directions for the future of natural language research.

The parser generates a syntactic tree in the same style as Penn Treebank (PTB) bracketing. Although the latest release of PTB (Version 2.0) has argument structure labels, this parser does not produce such labels. Also, APP only tries to make a parse tree that is as accurate as possible for reasonable sentences. Here, "reasonable sentences" means, for example, sentences in newspapers or well-written documents. Hence, it aims neither to parse understandable but ill-formed sentences (as in conversation) nor to reject completely ill-formed sentences. You may be surprised that the parser produces a parse tree for a sentence with number disagreement, or that it fails to parse a very simple English sentence correctly. This is a consequence of how APP is designed.
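For readers who want to post-process PTB-style bracketing, the following Python sketch reads a bracketed string into nested tuples. The sample sentence is a hypothetical illustration of the bracketing style, not actual APP output, and the helper names read_bracketing and leaves are made up for this example. Error handling is minimal.

```python
def read_bracketing(text):
    """Parse a PTB-style bracketed string into nested (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def node():
        nonlocal pos
        if pos >= len(tokens) or tokens[pos] != "(":
            raise ValueError("expected '('")
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while pos < len(tokens) and tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())      # nested constituent
            else:
                children.append(tokens[pos])  # a word
                pos += 1
        if pos >= len(tokens):
            raise ValueError("missing ')'")
        pos += 1  # consume ")"
        return (label, children)

    tree = node()
    if pos != len(tokens):
        raise ValueError("unexpected text after the tree")
    return tree


def leaves(tree):
    """Return the words of a tree in left-to-right order."""
    _, children = tree
    words = []
    for child in children:
        if isinstance(child, tuple):
            words.extend(leaves(child))
        else:
            words.append(child)
    return words


# Hypothetical example in PTB style (not produced by APP).
sample = "(S (NP (DT The) (NN dog)) (VP (VBZ barks)))"
```

With this sketch, leaves(read_bracketing(sample)) gives back the three words of the sentence in order.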

The author knows that the performance is not the best compared with the state-of-the-art parsers reported recently. However, the author also knows the main difference between this parser and those parsers: the use of lexical information. We are planning to incorporate this information into the parser and hope to release the new version soon.

External Terminology on top of PTB


Available by http

To get it over HTTP, click APP5.9.tar.gz.

Available by ftp

To get it over FTP, click APP5.9.tar.gz.

Then:

> gzip -d APP5.9.tar.gz
> tar xvf APP5.9.tar
---- This creates files under the directory APP5.9
---- Please read the "README" file

A Windows executable is now available

After you install the above distribution on your disk, get the executable and put it in the "bin" directory. You should then be able to run APP on Windows. This port was done by Mr. Shinichi Torihara at Keio University.

Update

APP Version 5.9 (April 4, 1997)

APP Version 5.8 (October 2, 1996)

Current In-house Best (Ver. 6.3)
Recall / precision are 79.55 / 77.18. It uses lexical bigram information. We hope to make further improvements before we distribute that version. However, if you would like to use this version, please contact me.
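The page does not say exactly how recall and precision are computed. Assuming they are the usual labeled bracketing scores (an assumption, not something stated here), the calculation can be sketched as follows. The function name bracket_scores and the sample constituents are hypothetical.

```python
from collections import Counter


def bracket_scores(gold, predicted):
    """Return (recall, precision) in percent.

    gold, predicted: lists of (label, start, end) constituents.
    Assumes labeled bracketing scores; the page does not specify the metric.
    """
    gold_counts = Counter(gold)
    pred_counts = Counter(predicted)
    # Multiset intersection: each gold bracket can be matched at most once.
    matched = sum((gold_counts & pred_counts).values())
    recall = 100.0 * matched / len(gold) if gold else 0.0
    precision = 100.0 * matched / len(predicted) if predicted else 0.0
    return recall, precision


# Hypothetical constituents for a three-word sentence.
gold = [("S", 0, 3), ("NP", 0, 2), ("VP", 2, 3)]
predicted = [("S", 0, 3), ("NP", 0, 1), ("NP", 1, 2), ("VP", 2, 3)]
recall, precision = bracket_scores(gold, predicted)
```

In this made-up example two of the three gold brackets are found, and two of the four predicted brackets are correct, so recall is two thirds and precision is one half.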


Plan for the Next Major Version Upgrade (Version 7)

We are planning the next major version upgrade. If you have any recommendations or suggestions, we would appreciate your cooperation. For details, click here.


Manual

manual.ps

Members

Satoshi Sekine
Ralph Grishman


Publications, Reports, and Related Papers

`A Corpus-based Probabilistic Grammar with Only Two Non-terminals'
Satoshi Sekine, Ralph Grishman
Fourth International Workshop on Parsing Technologies (1995)

`A New Direction for Sublanguage NLP'
Satoshi Sekine
International Conference on New Methods in Language Processing (1994)

`Automatic Sublanguage Identification for a New Text'
Satoshi Sekine
Second Annual Workshop on Very Large Corpora (1994)



Please send any comments or questions about this page by e-mail to sekine@cs.nyu.edu