Proteus

Tagged and Cleaned Wikipedia (TC Wikipedia) and its Ngram

Javier Artiles Satoshi Sekine

Proteus Project
Department of Computer Science
New York University


General

Wikipedia is a relatively large and consistent resource for NLP researchers to work with. However, it is not straightforward even to extract the meaningful sentences and portions that are useful for research. To avoid duplication of this laborious effort, we are making our "Tagged and Cleaned Wikipedia" (TC Wikipedia) available to the community.

The static version (HTML documents) of the English Wikipedia was downloaded from http://static.wikipedia.org/downloads/2008-06/en/. We cleaned it up and tagged it with several tools. You can see what we did here.

Note that the files are provided as is: they are not tagged with 100% accuracy and are not 100% clean. Comments and suggestions are welcome. We will try hard to respond to them, but we cannot guarantee that we will always be able to respond or act on requests.

When you download the data, please notify sekine (at) cs (dt) nyu (dt) edu. You are obliged to follow our instructions if you are asked to delete the data.

License

All text is available under the terms of the GNU Free Documentation License. (See Wikipedia's Copyrights for details.) Wikipedia ® is a registered trademark of the Wikimedia Foundation, Inc., a U.S. registered 501(c)(3) tax-deductible nonprofit charity.


5 Data (See some samples below)