(a.k.a. Translation Models)

Department of Computer Science

New York University

Translation models describe the mathematical relationship between two or more languages. We call them models of translational equivalence because the main thing that they aim to predict is whether expressions in different languages have equivalent meanings.

A good translation model is the key to many trans-lingual applications, the most famous of which is machine translation (MT). (It was in the context of MT that these models were first called "translation models.") Other applications include cross-language information retrieval, computer-assisted language learning, various tools for translators, and bootstrapping of OCR for new languages. We are very interested in applying translation models in these ways, but we are also interested in the more fundamental question of how to engineer translation models that are more powerful and more reliable for an arbitrary application. This kind of engineering research is analogous to the way that designers of mechanical engines strive for power and efficiency without considering the kind of vehicle that the engine might drive.

These days, the better models of translational equivalence are built empirically. Instead of encoding the equivalence relation from introspection, computational linguists use machine learning techniques to induce it from "bitexts," i.e., pairs of texts that are translations of each other. The idea is that by looking at many pairs of texts that are translationally equivalent, computers should be able to figure out which expressions are translationally equivalent.
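To make the idea of inducing equivalence from bitexts concrete, here is a minimal sketch. It is not one of our published models: it simply counts how often words co-occur in paired sentences and scores each word pair with the Dice coefficient, a standard association measure. The four-sentence English-French bitext and the function names dice_scores and best_translation are invented for this illustration.

```python
from collections import Counter
from itertools import product

# Toy bitext: hypothetical English-French sentence pairs, invented for illustration.
BITEXT = [
    ("the house", "la maison"),
    ("the car", "la voiture"),
    ("a house", "une maison"),
    ("a car", "une voiture"),
]


def dice_scores(bitext):
    """Score every co-occurring (source, target) word pair with the Dice coefficient.

    Dice(s, t) = 2 * cooc(s, t) / (count(s) + count(t)), where all counts are
    numbers of sentence pairs in which the word (or word pair) appears.
    """
    src_count, tgt_count, pair_count = Counter(), Counter(), Counter()
    for src, tgt in bitext:
        # Use sets so that a word is counted at most once per sentence pair.
        src_words, tgt_words = set(src.split()), set(tgt.split())
        src_count.update(src_words)
        tgt_count.update(tgt_words)
        pair_count.update(product(src_words, tgt_words))
    return {
        (s, t): 2 * n / (src_count[s] + tgt_count[t])
        for (s, t), n in pair_count.items()
    }


def best_translation(word, scores):
    """Return the target word with the highest Dice score for a source word."""
    candidates = {t: v for (s, t), v in scores.items() if s == word}
    return max(candidates, key=candidates.get)


scores = dice_scores(BITEXT)
print(best_translation("house", scores))
print(best_translation("the", scores))
```

On this toy data, "house" always co-occurs with "maison" but only half the time with "la" or "une", so the correct pairing receives the highest score. Real bitexts are far noisier, which is why simple co-occurrence statistics like this one are only a starting point.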

The ideal model of translational equivalence should account for every aspect of language, from minute details like spelling variations in text and vowel shifts in speech, to subtle pragmatic factors like whether the speaker is being sarcastic. However, natural languages are very large and complex mathematical objects, and translation modeling is a very young research area. Consequently, the majority of published models concentrate on just one aspect of language, such as syntax or discourse. Most published models also suffer from poor predictive power, i.e. they are often wrong when tested on real-world texts. That is why translation services on the Web are often a better source of entertainment than of useful information. Reliable models for significant "layers" of language have just started to appear, such as our own Models of Translational Equivalence among Words.

Today's best translation models are essentially finite-state transducers (FSTs). The input to an MT system is always finite, so in principle it can be handled by an FST. However, modeling translational equivalence with FSTs is like approximating a complex function with line segments. It's very difficult to build finite-state MT systems that are elegant, adaptive, robust, and easily extendable. That is why finite-state models of translational equivalence for more than a narrow linguistic layer tend to compound their errors and collapse under their own weight.
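The following toy transducer is a sketch of why finite-state models grow unwieldy; it is not taken from any real MT system. The vocabulary, the state names, and the transduce function are all invented for illustration. To move an English adjective after its French noun, the machine must "remember" the adjective in its state, so every adjective it can reorder needs a state of its own.

```python
# Transition table: (state, input word) -> (next state, output words).
# All entries are hypothetical and cover only a tiny English-French fragment.
TRANSITIONS = {
    ("start", "the"): ("start", ["la"]),
    ("start", "house"): ("start", ["maison"]),
    # Hold the adjective in the state and emit nothing yet.
    ("start", "red"): ("held_red", []),
    ("held_red", "house"): ("start", ["maison", "rouge"]),
    ("start", "green"): ("held_green", []),
    ("held_green", "house"): ("start", ["maison", "verte"]),
}
ACCEPTING = {"start"}


def transduce(words):
    """Run the toy transducer over a list of input words and return the output words."""
    state, output = "start", []
    for word in words:
        key = (state, word)
        if key not in TRANSITIONS:
            raise ValueError(f"no transition from state {state!r} on {word!r}")
        state, emitted = TRANSITIONS[key]
        output.extend(emitted)
    if state not in ACCEPTING:
        raise ValueError(f"input ended in non-accepting state {state!r}")
    return output


print(" ".join(transduce("the red house".split())))
```

Handling one more adjective means adding one more state and one more transition per noun it can modify. Longer-distance reorderings multiply the states further, which is the sense in which the line-segment approximation becomes costly.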

The Proteus Project has recently invented a new strategy for integrating different layers of translational equivalence in a mathematically elegant way. Our approach is based on the new class of Generalized Multitext Grammars (GMTGs), which double as translation models. The structures generated by these grammars encompass several linguistic layers. By developing theoretically well-founded methods for inducing such grammars from data, we expect to make the different layers of equivalence reinforce each other. That is, by having our models account for a larger part of the equivalence relation in a consistent manner, each of the component layers will become more reliable. The result will be much more accurate machine translation.