Proteus Project
Department of Computer Science
New York University
Wikipedia is a relatively large and consistent resource for NLP researchers. However, extracting meaningful sentences and the portions that are useful for research is not straightforward. To avoid duplicating this laborious effort, we are making our "Tagged and Cleaned Wikipedia" (TC Wikipedia) available to the community.
The static version (HTML documents) of the English Wikipedia was downloaded from http://static.wikipedia.org/downloads/2008-06/en/. We cleaned it up and tagged it with several tools. You can see what we did here.
Note that the files are provided as is; they are not tagged with 100% accuracy and are not 100% clean. Any comments and suggestions are welcome. We will try hard to respond to them, but we can't guarantee that we will always be able to respond or to work on requests.
When you download the data, please notify sekine (at) cs (dt) nyu (dt) edu. You are obliged to follow our instructions if you are asked to delete the data.
All text is available under the terms of the GNU Free Documentation License. (See Wikipedia's Copyrights for details.) Wikipedia ® is a registered trademark of the Wikimedia Foundation, Inc., a U.S. registered 501(c)(3) tax-deductible nonprofit charity.
In the data file, the columns are separated by a TAB character.
Note that if you add "http://en.wikipedia.org/wiki/" in front of the headword, you get the URL of the article in the original Wikipedia.
SAMPLE
| ID | Headword | Filename | Category | Redirection |
|---|---|---|---|---|
| 291071 | Bill Clinton | Bill_Clinton_75aa.html | ['Presidents of the United States', 'Americans of Scots-Irish descent', 'Living people', 'American saxophonists', 'American Rhodes scholars', 'Bill Clinton', 'People from Hope', 'Arkansas', 'Arkansas Democrats', 'Georgetown University alumni', 'Grammy Award winners', 'Time magazine Persons of the Year', 'American humanitarians', 'Honorary Fellows of University College', 'Oxford', 'Arkansas Attorneys General', 'Governors of Arkansas', 'Karlspreis Recipients', 'Impeached United States officials', 'Yale Law School alumni', 'United States presidential candidates', '1996', 'People associated with the University of Arkansas', 'American political scandals', 'Democratic Party (United States) presidential nominees', 'United States presidential candidates', '1992', 'People from Hot Springs', 'Arkansas', 'Americans of English descent', 'Arkansas lawyers', 'American memoirists', 'Spouses of United States Senators', 'Alumni of University College', 'Oxford', 'Grand Companions of the Order of Logohu', 'American legal academics', 'Baptists from the United States', 'Semi-protected', 'Lewinsky scandal figures', '1946 births'] | ['42nd President of the United States', 'Bill J. Clinton', 'Billl Clinton', "Bill Clinton's Post Presidency", "Bill Clinton's Post-Presidency", "Bill Clinton's sex scandals", 'Bill Jefferson Clinton', 'Bill clinton', 'Bill Clinton ', 'Billy Clinton', 'BillClinton', 'Bil Clinton', "Bill Clinton's Sex Scandals", 'Bill Blythe IV', 'Billary Clinton', "Buddy (Clinton's dog)", 'Bull Clinton', 'Clinton Gore Administration', "Clinton's Foreign Policy", 'Clinton, Bill', 'I never inhaled', 'Klin-ton', 'President Clinton', 'President Bill Clinton', 'Putting People First', 'William Jefferson Clinton', 'William Jefferson Blythe III', 'William J. Blythe', 'Willam Jefferson Blythe III', 'William Jefferson Bill Clinton', 'William J. Clinton', 'William J Clinton', 'William Jefferson Blythe IV', 'William clinton', 'William Blythe III', 'WilliamJeffersonClinton', 'William J. Blythe III', 'William Bill Clinton'] |
| 291072 | Bill Sammon | Bill_Sammon_5cc4.html | ['Year of birth missing (living people)', 'Fox News Channel', 'Miami University alumni', 'American newspaper reporters and correspondents', 'Political analysts', 'American journalist stubs', 'American journalists'] | - |
| 291073 | Bill Gunter | Bill_Gunter_6508.html | ['Articles to be expanded since March 2008', 'Gator Caucus', 'Florida Democrats', 'All articles to be expanded', 'American politician stubs', 'University of Florida alumni', 'Members of the United States House of Representatives from Florida'] | ['William Dawson Gunter', 'William D. Gunter', 'William Gunter, Jr.', 'William Gunter', 'William Dawson Gunter, Jr.', 'William D. Gunter, Jr.'] |
| 291074 | Bill Owen (actor) | Bill_Owen_(actor)_b508.html | ['English television actors', 'Infobox actor templates needing updating', 'Members of the Order of the British Empire', '1914 births', 'Pancreatic cancer deaths', 'English film actors', '1999 deaths'] | - |
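The record layout shown in the sample can be read with a few lines of Python. This is only a minimal sketch under stated assumptions: the function names `parse_index_line` and `_parse_list` are hypothetical, the Category and Redirection fields are assumed to be Python-style list literals (as they appear in the sample), `-` is assumed to mean "no entries", and replacing spaces with underscores when building the URL is our assumption about how Wikipedia addresses are formed, not something stated above.

```python
import ast
from urllib.parse import quote

# Prefix described in the text above (written here with a single ".org").
WIKI_PREFIX = "http://en.wikipedia.org/wiki/"


def _parse_list(field):
    """Turn a list-literal field into a Python list; '-' is assumed to mean empty."""
    field = field.strip()
    if field == "-":
        return []
    return ast.literal_eval(field)


def parse_index_line(line):
    """Split one TAB-separated record into its five columns."""
    doc_id, headword, filename, category, redirection = line.rstrip("\n").split("\t")
    return {
        "id": int(doc_id),
        "headword": headword,
        "filename": filename,
        "categories": _parse_list(category),
        "redirections": _parse_list(redirection),
        # Assumption: spaces become underscores; other characters are percent-encoded.
        "url": WIKI_PREFIX + quote(headword.replace(" ", "_")),
    }
```

A record with fewer or more than five columns raises `ValueError` in this sketch, which makes malformed lines easy to spot while exploring the file.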
There are two versions of the infobox data file: an HTML version and a text version. The HTML version is the same as the original Wikipedia file except that it contains the infobox information only.
SAMPLE of txt version
#start-infobox 291071 Bill_Clinton_75aa.html
William Jefferson Clinton Bill Clinton 42nd <a href="President_of_the_United_States">President of the United States</a> In office <a href="January_20">January 20</a>, <a href="1993">1993</a> - <a href="January_20">January 20</a>, <a href="2001">2001</a> Vice President Albert A. Gore, Jr. Preceded by George H. W. Bush Succeeded by George W. Bush 42nd <a href="Governor_of_Arkansas">Governor of Arkansas</a> In office <a href="January_11">January 11</a>, <a href="1983">1983</a> - <a href="December_12">December 12</a>, <a href="1992">1992</a> Lieutenant <a href="Winston_Bryant">Winston Bryant</a> (1983-1991) <a href="Jim_Guy_Tucker">Jim Guy Tucker</a> (1991-1992) Preceded by Frank D. White
... (skip 20 lines) ...
Website William J. Clinton Presidential Library
#end-infobox 291071 Bill_Clinton_75aa.html
#start-infobox 291073 Bill_Gunter_6508.html
William D. Gunter, Jr. Bill Gunter Member of the U.S. House of Representatives from Florida's 5th district In office <a href="1973">1973</a> - <a href="1975">1975</a> Preceded by Louis Frey, Jr. Succeeded by <a href="Richard_Kelly_(politician)">Richard Kelly</a> Born <a href="July_16">July 16</a>, <a href="1934">1934</a>(<a href="1934">1934</a>-07-16) Jacksonville, Florida Political party <a href="Democratic_Party_(United_States)">Democratic</a>
#end-infobox 291073 Bill_Gunter_6508.html
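The `#start-infobox` and `#end-infobox` markers in the text version make it possible to read the file one infobox at a time. The following is a rough sketch, not a description of how the file was produced: it assumes that each marker sits on its own line followed by the ID and the filename, and the helper names `iter_infoboxes` and `strip_links` are hypothetical.

```python
import re

# Matches opening and closing anchor tags such as <a href="1973"> and </a>.
_LINK = re.compile(r"</?a\b[^>]*>")


def strip_links(text):
    """Remove <a ...> and </a> tags, keeping the link text."""
    return _LINK.sub("", text)


def iter_infoboxes(lines):
    """Yield (doc_id, filename, body_lines) for each infobox block.

    Assumes every marker is on its own line: '#start-infobox ID FILENAME'.
    """
    current = None
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("#start-infobox"):
            _, doc_id, filename = line.split(None, 2)
            current = (int(doc_id), filename, [])
        elif line.startswith("#end-infobox"):
            if current is not None:
                yield current
            current = None
        elif current is not None:
            # Any line between the markers belongs to the infobox body.
            current[2].append(line)
```

Because `iter_infoboxes` accepts any iterable of lines, it can be given an open file object directly, so the whole file never has to be held in memory.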
Sentences that are judged to be content are extracted. The text is split into sentences with the LingPipe tool. The sentences are then tagged with the Stanford NE tagger, and the tokenized sentences are tagged with the Stanford POS tagger.
The format should be self-evident. Each line contains
SAMPLE
#s-doc 1 !!! !!!.html ['Dance-punk musical groups', '2000s music groups', ....
#s-infobox
#s-sent 1 1 !!! performing at the Flow Festival in Helsinki, Finland (2007)
! 0 - - SENT . O O O B-!!!_(album)
! 0 - - SENT . O O O I-!!!_(album)
! 0 - - SENT . O O O I-!!!_(album)
performing 1 perform perform VVG VBG B-VP O O O
at 1 - - IN IN B-PP O O O
the 1 - - DT DT B-NP O O O
Flow 1 - - NP NNP I-NP O O O
Festival 1 - - NP NNP I-NP O O O
in 1 - - IN IN B-PP O O O
Helsinki 1 - - NP NNP B-NP B-LOCATION B-CITY O
, 0 - - , , O O O O
Finland 1 - - NP NNP B-NP B-LOCATION B-COUNTRY O
( 1 - - ( -LRB- O O O O
2007 0 @card@ - CD CD B-NP O B-DATE B-2007
) 0 - - ) -RRB- O O O O
#e-sent 1 1
#s-sent 1 2 Background information
Background 0 - background NP NN B-NP O O O
information 1 - - NN NN I-NP O O O
#e-sent 1 2
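The `#s-sent` and `#e-sent` markers in the sample suggest a simple way to group token lines into sentences. The sketch below rests on several assumptions: one marker or one token per line, whitespace-separated fields, and the document number and sentence number following `#s-sent`. The column meanings are not spelled out here, so the hypothetical function `iter_sentences` keeps the fields of each token line as a raw list rather than naming them.

```python
def iter_sentences(lines):
    """Yield (doc_no, sent_no, text, tokens) for each #s-sent ... #e-sent block.

    Each entry of `tokens` is the list of whitespace-separated fields of one
    token line; in the sample above the first field is the surface token.
    """
    current = None
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("#s-sent"):
            parts = line.split(None, 3)
            text = parts[3] if len(parts) > 3 else ""
            current = (int(parts[1]), int(parts[2]), text, [])
        elif line.startswith("#e-sent"):
            if current is not None:
                yield current
            current = None
        elif current is not None:
            fields = line.split()
            if fields:
                current[3].append(fields)
        # Lines outside a sentence (e.g. #s-doc, #s-infobox) are skipped here.
```

For example, `[t[0] for t in tokens]` gives the surface tokens of a sentence, under the assumption that the first field is always the token.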
The format should be self-evident. Each line contains the frequency and the n-gram, delimited by a tab (\t). Only the unigram data is sorted by frequency.
SAMPLE (4gram)
1 the Natyas has traand
1 the Natya Shodh Sansthan
1 the Natya Shathra ,
1 the Naro - Fominsk
1 the Naro River the
1 the Naro language ,
2 the Naro language .
1 the Naro Space Center
1 the Naro Theater (
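Since each line holds a frequency and an n-gram separated by a tab, the data can be loaded into a counter with very little code. This is a minimal sketch: the function name `load_ngrams` is hypothetical, and it assumes that the tokens inside an n-gram are separated by single spaces, as they appear in the sample.

```python
from collections import Counter


def load_ngrams(lines):
    """Build a Counter that maps each n-gram (a tuple of tokens) to its frequency."""
    counts = Counter()
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            continue
        # The frequency comes first, then a tab, then the n-gram itself.
        freq, ngram = line.split("\t", 1)
        counts[tuple(ngram.split(" "))] += int(freq)
    return counts
```

Because only the unigram file is sorted by frequency, `counts.most_common(k)` is a convenient way to get the top `k` entries of the other files, provided the file fits in memory.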