Proteus Project

Department of Computer Science
New York University

Catherine Macleod, Adam Meyers and Ralph Grishman

COMLEX Syntax is a monolingual English dictionary consisting of 38,000 head words intended for use in natural language processing. This dictionary was developed by the Proteus Project at New York University under the auspices of the Linguistic Data Consortium (LDC). It contains exceptionally detailed syntactic information and is now a widely used lexical resource.

COMLEX Syntax, like other LDC products, is available to LDC members for both research and commercial use, with minimal legal restrictions on its usage. The first version of COMLEX Syntax was delivered to the LDC in May 1994, with extensions and corrections in subsequent years. The final version, delivered in December 1997, is available as LDC catalog item LDC98L21. Interested users should contact the LDC to obtain this dictionary; NYU is not permitted to distribute it directly.

The dictionary includes entries for approximately 21,000 nouns, 8,000 adjectives and 6,000 verbs, all of which are marked with a rich set of syntactic features and complements. Nouns have 9 possible features and 9 possible complements; adjectives have 7 features and 14 complements; verbs have 5 features and 92 complements; and adverbs have 11 positional classes and 12 features. For 750 frequent verbs, there are an additional 4 possible features and 32 possible complements. Other entries identify words as prepositions, cardinal numbers, etc. without further specification.

The noun, adjective and verb entries were created by a team of four linguistics graduate students, working half-time for approximately one year. Each ELF (enterer of lexical features) was provided with a menu-based entry program, written in Lisp using the Garnet GUI package, which provided access to a concordance based on approximately 90 MB of text. The elves entered features and complements for verbs based on: (1) the concordance; (2) hard-copy dictionaries; and (3) their individual knowledge.

Each lexical entry is organized as a typed feature structure, using a Lisp-style notation which can be mapped into other forms, e.g. Prolog, SGML-marked text, etc. Each feature-value list consists of a type symbol followed by zero or more keyword-value pairs. Each value may in turn be an atom, a string, a list of strings, a feature-value list, or a list of feature-value lists. Keywords identify orthography (:orth), inflected forms (e.g., :plural, :pastpart, etc.), features (:features), subcategorization/complements (:subc), and other information. Subcategorization labels are mostly self-explanatory, e.g., verbs marked with "np" take an "np" complement and verbs marked with "part-np" take a "particle + np" complement. Features include "apreq", which is marked on adjectives that can modify a numerically quantified NP, e.g., "the above-mentioned one hundred gorillas", where "above-mentioned" modifies the group of one hundred gorillas (each gorilla is not above-mentioned), and "ntitle", which is marked on nouns that occur as titles preceding names, e.g. "Prof. Mary Fitzburg".

We completed Version 2 of COMLEX Syntax in August of 1995. The two most significant changes were: (1) an improvement in the quality and coverage of COMLEX as the result of our own quality checks as well as feedback from users; and (2) the lexical entries for 750 common verbs now include a list of 100 tags, where each tag consists of one feature or complement, the name of the source (Brown Corpus, Wall Street Journal, etc.) and a pointer to a corpus file. (For illustration purposes, the sample entry below lists three tags rather than 100.) This corpus file is also available from the LDC. The tagging effort was significant for gathering statistics on the frequency of complements and features.
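
As a rough illustration of how such frequency statistics could be gathered from the tags, the following Python sketch counts labels for a single verb. It is not part of the COMLEX distribution. It assumes the tags have already been reduced to (byte number, source, label) tuples; the function names, the "wsj" source name and the last two tags are invented for the example, while the first three tags mirror the sample "build" entry shown further down.

```python
from collections import Counter

# Tags reduced to (byte_number, source, label) tuples. The first three mirror
# the tags in the sample "build" entry; the last two are invented for illustration.
TAGS = [
    (6918276, "brown", "np"),
    (6914461, "brown", "np"),
    (6858039, "brown", "np"),
    (1000, "wsj", "part-np"),    # hypothetical tag
    (2000, "wsj", "np-for-np"),  # hypothetical tag
]


def label_counts(tags):
    """Count how often each complement or feature label occurs."""
    return Counter(label for _byte, _source, label in tags)


def relative_frequencies(tags):
    """Convert label counts into fractions of all tags for the verb."""
    counts = label_counts(tags)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def counts_by_source(tags):
    """Count labels separately for each source corpus."""
    return Counter((source, label) for _byte, source, label in tags)


if __name__ == "__main__":
    print(label_counts(TAGS))
    print(relative_frequencies(TAGS))
    print(counts_by_source(TAGS))
```

With 100 tags per verb, the same counts would show which complements dominate in actual usage and whether the distribution differs between sources.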

We completed Version 3 of COMLEX Syntax in December of 1997. This latest version has been updated to include adverb classes. We also added diacritics to foreign words while retaining the unaccented versions, and made various other updates to correct and supplement our lexical entries. Some example lexical entries follow.

(verb           :orth "build"
                :subc ((np) (np-for-np) (part-np :adval ("up")))
                :TAGS ((TAG :BYTE-NUMBER 6918276 :SOURCE "brown" :LABEL (NP))
                       (TAG :BYTE-NUMBER 6914461 :SOURCE "brown" :LABEL (NP))
                       (TAG :BYTE-NUMBER 6858039 :SOURCE "brown" :LABEL (NP))))

(noun           :orth "assertion"
                :subc ((noun-that-s) (noun-be-that-s)))
(adverb         :orth "exceedingly"
                :modif ((PRE-COMPARATIVE) (PRE-QUANT) (PRE-ADJ) (PRE-ADV))
                :features ((DEGREE-ADV)))
(adjective      :orth "above-mentioned"
                :features ((apreq) (attributive)))
(verb           :orth "abbreviate"
                :subc ((np-pp :pval ("to")) (np) (np-np-pred) (np-as-np))
                :features ((vveryving :pastpart t)))
(noun           :orth "Prof."
                :features ((ntitle)))

We have compiled a set of Lisp utilities for use with COMLEX. The utilities run under Allegro Common Lisp and should run under most other Lisp implementations, though we cannot guarantee portability. Please let us know about any bugs. The utilities and their instructions are available through the link on this page.


