Corpus Working Group Web Page
This webpage includes wiki pages for working groups of SIGANN, the ACL Special Interest Group for Annotation. Some of these working groups were originally connected with annotation workshops held at past conferences, including The Linguistic Annotation Workshop (the LAW): A Merger of NLPXML 2007 and FLAC 2007, held on June 28-29, 2007 at ACL 2007 in Prague. This page also includes previous working groups connected with the Frontiers in Linguistically Annotated Corpora (FLAC) workshop in 2006. Given a major problem facing linguistic annotation, each working group attempts to: (1) build consensus where possible; and (2) outline differences of opinion as clearly as possible. This website will be used as a collaborative workspace to achieve these goals. Each group prepares a final report (for the proceedings of the conference workshop) and a final presentation (at the workshop). Over time, these same topics may be explored further, in which case updated reports will be presented. In addition, each topic may be linked to an additional web page which will provide more information and, in some cases, will serve as a repository of resources.
The following working groups have been formed so far. Reports are planned for the SharedCorpora and AnnotationBestPractices working groups at the Linguistic Annotation Workshop II (the LAW2), held in conjunction with LREC 2008 in Marrakech.
Description: This working group seeks to identify a limited number of representative corpora, suitable for annotation by the computational linguistics annotation community. This working group's website will serve as a repository for such corpora, as well as for annotation of these corpora. Our hope is that a wide variety of annotation will be undertaken on the same corpora, which would facilitate: (1) the comparison of annotation schemes; (2) the merging of information represented by various annotation schemes; (3) the emergence of NLP systems that use information in multiple annotation schemes; and (4) the adoption of various types of best practice in corpus annotation. Such best practices would include: (a) clearer demarcation of the phenomena being annotated and (b) the use of particular "test" corpora to determine whether a particular annotation task can feasibly achieve good agreement scores.
The SharedCorpora page includes various resources including: corpora to be annotated (if permitted by licensing restrictions); pointers to corpora with distribution restrictions; and annotation provided by working group participants.
The first working group report was given at the first Linguistic Annotation Workshop at ACL 2007. As part of this report, two corpora were investigated for shared annotation: (1) the open portion of the American National Corpus (OANC) and (2) the "controversial" subcorpus of Wikipedia XML. A 40K subcorpus of the OANC was identified as a target for annotation by multiple sites.
A second report is planned for the LAW 2 workshop at LREC 2008. The focus will be on collecting annotation of shared corpora and merging such annotation. An additional shared corpus (the Language Understanding Corpus) will be added to our list of shared corpora.
Description: Identification of best practices for representing and managing corpora and their linguistic annotations currently recognized by the NLP community, and an effort to reach consensus on comprehensive guidelines. Several practices have emerged in recent years as the preferred means for representing annotations, with an eye toward the increasing need for merging annotations that may be produced by different groups using different formats for representation. A wide range of issues must be addressed in order to reach this goal; as a first step, common practices and accepted strategies will be identified, followed by several iterations of group discussion and solicitation of input from the community. Some critical issues are: how to deal with variant segmentations, how to ensure a common data model so that annotations can be easily mapped into other formats, and what existing formats/frameworks can be exploited (e.g., UIMA, LAF, etc.). Insofar as best practice guidelines exist, annotators will be encouraged to implement them when annotating the Shared Corpora identified by the working group on that topic.
The first working group report is planned for the LAW 2 workshop at LREC 2008 under the direction of Nancy Ide.
Description: A working group on approaches to discourse coherence, especially as resulting from different interacting annotation layers, and its applications to computational linguistics.
The first working group report was given at the first Linguistic Annotation Workshop at ACL 2007. This took the form of a panel discussion chaired by Manfred Stede and Janyce Wiebe.
Description: A roadmap of the compatibility, in linguistic terms, of current annotation schemes with each other. Charts and tables are constructed to explain compatibilities and incompatibilities between annotation schemes (tracking parallels as well as gaps in paradigms and differences in underlying assumptions). In addition, some assumptions about interactions between phenomena will be outlined. For example, over the past 50 years, a partial alignment between surface and predicate/argument relations has been assumed in the linguistic literature. Nevertheless, some recent annotation studies have reported cases where this alignment is difficult or impossible. Furthermore, the division of annotated phenomena into "levels" (tiers, strata, etc.) has by no means been standardized. Rather, different annotation efforts have assumed different divisions and some have assumed no divisions. Our roadmap will include a discussion of how the different frameworks interact with different underlying notions of "level".
The first working group report was given at FLAC in 2006. The report was prepared by a large number of contributors: Adam Meyers, Nianwen Xue, Alex Chengyu Fang, Gerald Penn, Martha Palmer, James Pustejovsky, Ed Hovy, David Farwell, Bonnie Dorr, Erhard Hinrichs, Eva Hajicova, Janyce Wiebe, Tsai Jia-Lin, Boyan A. Onyshkevych, Massimo Poesio, Lori Levin, Lisa Ferro, Sandra Kübler, Andrew Dolbey, Karin Kipper Schuler, Edward Loper, and Heike Zinsmeister.
Description: A discussion of low-density languages and the problems associated with them. The languages that are most commonly annotated tend to be those with the largest populations or with recent histories of linguistic scholarship. Renewed interest in the annotation of low-density languages has arisen for a number of reasons, both theoretical and practical. While developing truly language-independent annotation schemata could have earth-shaking ramifications, the more modest goal of simply understanding the inherent differences between the annotation of low-density languages and the annotation of more popular languages would be an important first step. Practical concerns include resource limitations, segmentation issues and spelling variation. On the theoretical side, it may be that certain popular annotation frameworks would have to be modified significantly to account for some of the less-studied languages.
The first working group report was given at FLAC in 2006. The report was prepared by Baden Hughes and Mike Maxwell.
Pie in the Sky
Description: Two sentences were selected for common annotation and merging by hand, as an experiment to see how compatible different annotation schemes could be (if forced). The results were enormous feature structures of information, some of which fit well and some of which was rather forced to fit.
This was part of the Second Frontiers in Corpus Workshop in 2005. The linked website provides a detailed discussion.
Future exercises of the Pie in the Sky type are envisioned as part of the SharedCorpora working group.
