Department of Computer Science
New York University
In natural language, the same fact or event can be expressed in many different ways. This is one of the most difficult obstacles for machines trying to understand natural language text. Recognizing paraphrases is an essential part of many natural language applications, such as Information Extraction, Information Retrieval, Question Answering, Summarization, and Machine Translation. If we want to process text reporting fact "X", we need to recognize that the expression in the document is one of the alternative ways in which "X" can be expressed. Creating paraphrase knowledge by hand, identifying all possible paraphrases of all possible facts or events, is an almost overwhelming task, because paraphrases are so common and many are domain specific. We have therefore begun to develop procedures that discover paraphrases from text.
It sounds intuitively easy to define what a "paraphrase" is, but if you look at
the data, you will find that it is not. For example, the following two sentences
can express the same fact only in a very special situation (in this case, the
person is a fugitive from the country, so the person's decision to go back to
the country certainly means that he is going to be arrested):
We define a "paraphrase" as a set of phrases that can express, for people with general knowledge, the same event or the same fact without additional explanation of the context. No special domain knowledge or background should be necessary. In most cases, judging whether two phrases are paraphrases should not depend on the particular entities involved. In the example, the two phrases can mean the same event only when the specific names of the person and the country are filled in. In other words, the set of phrases has to generalize over entities.
Link to the Paraphrase database by Hasegawa's method
The data includes 755 sets of paraphrases and 3,865 phrases in total. In each entry, the first number indicates the type frequency of the phrase, counted in terms of the number of ENE pairs, and the second number indicates the instance frequency of the phrase. We set a threshold of 2 on both frequencies for a phrase to be included in the database.

Sample
#
3 7 PERSON 's visit to CITY
3 5 PERSON flew to CITY
2 5 PERSON goes to CITY
2 2 PERSON traveled to CITY
2 2 PERSON visited CITY
#
6 37 COMPANY1 , a unit of COMPANY2
3 12 COMPANY1 , which is owned by COMPANY2
2 10 COMPANY2 , which owns COMPANY1
2 8 COMPANY1 's parent company , COMPANY2
#
3 4 POLITICAL_PARTY criticized PERSON
2 6 PERSON has been criticized by POLITICAL_PARTY
2 5 POLITICAL_PARTY accused PERSON
2 3 POLITICAL_PARTY have criticized PERSON
2 2 POLITICAL_PARTY contend that PERSON
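As a rough illustration of how entries in this format might be read, here is a minimal Python sketch. It assumes a layout with one entry per line, where a line starting with `#` opens a new paraphrase set and every other line holds the type frequency, the instance frequency, and the phrase, separated by whitespace. The names `Phrase`, `parse_paraphrase_sets`, and `meets_threshold` are hypothetical and are not part of the database or of any tool distributed with it.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Phrase:
    type_freq: int      # first number: type frequency (ENE pairs)
    instance_freq: int  # second number: instance frequency
    text: str           # phrase with entity-type slots, e.g. "PERSON flew to CITY"


def parse_paraphrase_sets(lines: Iterable[str]) -> List[List[Phrase]]:
    """Group entries into paraphrase sets; '#' starts a new set (assumed layout)."""
    sets: List[List[Phrase]] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith("#"):
            sets.append([])
            # Tolerate an entry that follows '#' on the same line.
            line = line[1:].strip()
            if not line:
                continue
        if not sets:
            sets.append([])
        type_freq, instance_freq, text = line.split(None, 2)
        sets[-1].append(Phrase(int(type_freq), int(instance_freq), text))
    return [group for group in sets if group]


def meets_threshold(phrase: Phrase, minimum: int = 2) -> bool:
    """Check the inclusion rule: both frequencies reach the threshold."""
    return phrase.type_freq >= minimum and phrase.instance_freq >= minimum


sample = """\
#
3 7 PERSON 's visit to CITY
3 5 PERSON flew to CITY
#
6 37 COMPANY1 , a unit of COMPANY2
2 10 COMPANY2 , which owns COMPANY1
"""

paraphrase_sets = parse_paraphrase_sets(sample.splitlines())
for index, group in enumerate(paraphrase_sets, start=1):
    for phrase in group:
        print(index, phrase.type_freq, phrase.instance_freq, phrase.text)
```

Keeping the two frequencies as separate integer fields makes it easy to re-apply a stricter threshold than 2 when only the better-attested phrases are wanted.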
# author
author 6 PERSON , the author of `` ACADEMIC
author 5 PERSON , author of `` ACADEMIC
author 4 PERSON being the first living author included in ACADEMIC
author 2 PERSON , the Connecticut-based author of `` ACADEMIC
# die
die 19 AGE , died of DISEASE
die 13 AGE died of DISEASE
die 10 AGE died of diarrhea , DISEASE
die 6 AGE , whose mother died of DISEASE
die 2 AGE died of malignant neoplasms , DISEASE
die 2 AGE , died from DISEASE
die 2 AGE , who died of DISEASE
die 2 AGE , has reportedly died of DISEASE
# buy acquire acquisition purchase unit merger merge parent billion pay sell takeover
acquire 65 COMPANY1 acquired COMPANY2
acquire 19 COMPANY1 is acquiring COMPANY2
acquire 13 COMPANY1 agreed to acquire COMPANY2
acquire 11 COMPANY1 agreed to be acquired by COMPANY2
acquire 9 COMPANY1 announced it was acquiring COMPANY2
acquire 8 COMPANY1 will acquire COMPANY2
...
acquisition 85 COMPANY1 's acquisition of COMPANY2
acquisition 7 COMPANY1 's pending acquisition of COMPANY2
acquisition 6 COMPANY1 completed its nearly $ 10 billion acquisition of COMPANY2
...
billion 8 COMPANY1 's $ 115 billion buyout of COMPANY2
billion 6 COMPANY1 for $ 48 billion , COMPANY2
billion 5 COMPANY1 's dlrs 25.6 billion 1997 takeover of COMPANY2
...
buy 72 COMPANY1 bought COMPANY2
buy 30 COMPANY1 is buying COMPANY2
buy 20 COMPANY1 agreed to buy COMPANY2
...