Paraphrase Discovery

Proteus Project

Department of Computer Science
New York University


In natural language, the same fact or event can be expressed in many different ways. This is one of the most difficult obstacles for machines trying to understand natural language text. Recognizing paraphrases is an essential part of many natural language applications, such as Information Extraction, Information Retrieval, Question Answering, Summarization, and Machine Translation. If we want to process text reporting a fact "X", we need to recognize that the expression in the document is one of the alternative ways in which "X" can be expressed. Creating by hand the paraphrase knowledge that identifies all possible paraphrases of all possible facts or events is an almost overwhelming task, because paraphrases are so common and many are domain specific. We have therefore begun to develop procedures that discover paraphrases from text.

What is "Paraphrase"?

It sounds intuitively easy to define what a "paraphrase" is, but a look at the data shows that it is not. For example, the following two phrases can express the same fact in a very special situation (here, the person is a fugitive from the country, so the person's decision to go back to the country certainly means that he is going to be arrested):

  • PERSON departed for COUNTRY
  • PERSON was arrested in COUNTRY

    We therefore need to define what we mean by "Paraphrase" in the context of the current research.

    We define "Paraphrase" as a set of phrases which, for people with general knowledge, can express the same event or the same fact without additional explanation of the context. No special domain knowledge or background is necessary. In most cases, judging whether two phrases are paraphrases does not depend on the particular entities involved. In the example above, the two phrases can mean the same event only when the specific names of a person and a country are filled in, so they do not qualify as paraphrases under this definition. In other words, a set of phrases has to generalize over entities.


    The remaining resources listed below will be available soon. Please e-mail me (sekine AT cs DOT nyu DOT edu) if you have comments or requests.

    Paraphrase database by Yusuke's method (HLT-02)

    Paraphrase database by Hasegawa's method (ACL-04)

    Link to the Paraphrase database by Hasegawa's method

    This data was created by first running Hasegawa's method on about 10 years of newspaper text. A human then did a quick clean-up, i.e. deleted expressions that are not paraphrases of any other phrase, and re-clustered any cluster that contained several sets of paraphrases. Keep in mind that the clean-up is not thorough, so please expect some errors. Within a set of phrases, we ignore differences in tense and aspect, so a set may contain both past tense and hypothetical expressions, which may contradict each other. A set may also contain very simple expressions, which in general can have different meanings. We built the database so that if you pick any two phrases in a paraphrase set, they can express one fact or event, ignoring minor details. Because of the way Hasegawa's algorithm works, similar sets of phrases appear repeatedly. The sets are therefore not disjoint: the same phrase can appear in more than one set. The number of times a phrase appears has no meaning. The method uses Extended Named Entities as the two anchors of each expression.

    The data includes 755 sets of paraphrases and 3,865 phrases in total. In each entry, the first number is the type frequency, i.e. the number of Extended Named Entity (ENE) pairs the phrase occurs with, and the second number is the instance frequency of the phrase. A phrase is included in the database only if both frequencies are at least 2.

    3	7	PERSON	's visit to	CITY
    3	5	PERSON	flew to	CITY
    2	5	PERSON	goes to	CITY
    2	2	PERSON	traveled to	CITY
    2	2	PERSON	visited	CITY
    6	37	COMPANY1	, a unit of	COMPANY2
    3	12	COMPANY1	, which is owned by	COMPANY2
    2	10	COMPANY2	, which owns	COMPANY1
    2	8	COMPANY1	's parent company ,	COMPANY2
    3	4	POLITICAL_PARTY	criticized	PERSON
    2	6	PERSON	has been criticized by	POLITICAL_PARTY
    2	3	POLITICAL_PARTY	have criticized	PERSON
    2	2	POLITICAL_PARTY	contend that	PERSON
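The entry layout described above can be read with a few lines of Python. The following is a minimal sketch, assuming each entry is a single line with five tab-separated fields (type frequency, instance frequency, first anchor, phrase, second anchor). The names `PhraseEntry`, `parse_entry` and `meets_threshold` are hypothetical and are not part of the database distribution.

```python
from typing import NamedTuple


class PhraseEntry(NamedTuple):
    """One line of the database (field names are illustrative)."""
    type_freq: int       # first number: type frequency over ENE pairs
    instance_freq: int   # second number: instance frequency of the phrase
    left: str            # first anchor, e.g. PERSON
    phrase: str          # the expression between the two anchors
    right: str           # second anchor, e.g. CITY


def parse_entry(line: str) -> PhraseEntry:
    """Parse one tab-separated entry; assumes exactly five fields."""
    fields = [f.strip() for f in line.strip().split("\t")]
    if len(fields) != 5:
        raise ValueError(f"expected 5 tab-separated fields, got {len(fields)}")
    type_freq, instance_freq, left, phrase, right = fields
    return PhraseEntry(int(type_freq), int(instance_freq), left, phrase, right)


def meets_threshold(entry: PhraseEntry, minimum: int = 2) -> bool:
    """Check the inclusion rule: both frequencies must reach the threshold."""
    return entry.type_freq >= minimum and entry.instance_freq >= minimum


# A few entries copied from the sample above, with explicit tabs.
SAMPLE = (
    "3\t7\tPERSON\t's visit to\tCITY\n"
    "2\t10\tCOMPANY2\t, which owns\tCOMPANY1\n"
    "2\t6\tPERSON\thas been criticized by\tPOLITICAL_PARTY\n"
)

entries = [parse_entry(line) for line in SAMPLE.splitlines() if line.strip()]
for e in entries:
    print(e.left, e.phrase, e.right, meets_threshold(e))
```

Keeping the anchors as separate fields preserves their order, which matters for entries such as "COMPANY2 , which owns COMPANY1", where the anchor order is reversed relative to the other phrases in the same set.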

    Paraphrase database by Sekine's method (IWP-05)

    Link to the Paraphrase database by Sekine's method
    This data was created using Sekine's method, but with a slightly different parameter setting from the one used in the paper, in order to get higher recall at the cost of lower precision. Unlike Hasegawa's paraphrase database, it has NOT been cleaned up by a human. It includes 19,975 sets of paraphrases with 191,572 phrases. The method uses Extended Named Entities as the two anchors of each expression.

    # author
    author  6       PERSON  , the author of ``      ACADEMIC
    author  5       PERSON  , author of ``  ACADEMIC
    author  4       PERSON  being the first living author included in       ACADEMIC
    author  2       PERSON  , the Connecticut-based author of ``    ACADEMIC
    # die
    die     19      AGE     , died of       DISEASE
    die     13      AGE     died of DISEASE
    die     10      AGE     died of diarrhea ,      DISEASE
    die     6       AGE     , whose mother died of  DISEASE
    die     2       AGE     died of malignant neoplasms ,   DISEASE
    die     2       AGE     , died from     DISEASE
    die     2       AGE     , who died of   DISEASE
    die     2       AGE     , has reportedly died of        DISEASE
    # buy acquire acquisition purchase unit merger merge parent billion pay sell takeover 
    acquire 65      COMPANY1        acquired        COMPANY2
    acquire 19      COMPANY1        is acquiring    COMPANY2
    acquire 13      COMPANY1        agreed to acquire       COMPANY2
    acquire 11      COMPANY1        agreed to be acquired by        COMPANY2
    acquire 9       COMPANY1        announced it was acquiring      COMPANY2
    acquire 8       COMPANY1        will acquire    COMPANY2
    acquisition     85      COMPANY1        's acquisition of       COMPANY2
    acquisition     7       COMPANY1        's pending acquisition of       COMPANY2
    acquisition     6       COMPANY1        completed its nearly $ 10 billion acquisition of      COMPANY2
    billion 8       COMPANY1        's $ 115 billion buyout of      COMPANY2
    billion 6       COMPANY1        for $ 48 billion ,      COMPANY2
    billion 5       COMPANY1        's dlrs 25.6 billion 1997 takeover of   COMPANY2
    buy     72      COMPANY1        bought  COMPANY2
    buy     30      COMPANY1        is buying       COMPANY2
    buy     20      COMPANY1        agreed to buy   COMPANY2
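The sample above suggests a simple grouping scheme: a line starting with "#" lists the keywords of a set, and the lines that follow are the phrases of that set. The sketch below reads such a file under several assumptions that the page does not confirm: fields are tab-separated, the first column is the keyword for the phrase, and the numeric second column is some kind of count. The function name `parse_sets` and the dictionary layout are hypothetical.

```python
def parse_sets(text: str) -> list:
    """Group entries under the '#' keyword line that precedes them."""
    sets = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("#"):
            # Assumed: a '#' line opens a new set and lists its keywords.
            current = {"keywords": line[1:].split(), "entries": []}
            sets.append(current)
            continue
        if current is None:
            raise ValueError("entry found before any '#' keyword line")
        fields = [f.strip() for f in line.split("\t")]
        if len(fields) != 5:
            raise ValueError(f"expected 5 tab-separated fields, got {len(fields)}")
        keyword, count, left, phrase, right = fields
        current["entries"].append((keyword, int(count), left, phrase, right))
    return sets


# Entries copied from the sample above, with explicit tabs.
SAMPLE = (
    "# author\n"
    "author\t6\tPERSON\t, the author of ``\tACADEMIC\n"
    "author\t5\tPERSON\t, author of ``\tACADEMIC\n"
    "# die\n"
    "die\t19\tAGE\t, died of\tDISEASE\n"
    "die\t13\tAGE\tdied of\tDISEASE\n"
)

paraphrase_sets = parse_sets(SAMPLE)
for s in paraphrase_sets:
    print(s["keywords"], len(s["entries"]))
```

Splitting on tabs rather than on whitespace is deliberate here, because the phrases themselves contain spaces and punctuation tokens.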

    Related Publications

  • S. Sekine, Automatic Paraphrase Discovery based on Context and Keywords between NE Pairs, IWP-05
  • T. Hasegawa, S. Sekine, R. Grishman, Paraphrase Acquisition using Unsupervised Relation Discovery (in Japanese), gengo-05
  • T. Hasegawa, S. Sekine, R. Grishman, Discovering Relations among Named Entities from Large Corpora, ACL-04
  • Y. Shinyama, S. Sekine, Paraphrase Acquisition for Information Extraction, IWP-03
  • Y. Shinyama, S. Sekine, K. Sudo and R. Grishman, Automatic Paraphrase Acquisition from News Articles, HLT-02