Meaningful Machines has developed a novel set of methods that represent a breakthrough in the quality and extensibility of machine translation (MT). The company’s Spanish-to-English prototype, which is not yet fully enabled, already produces higher-quality translations than any other MT system known to the company and is expected to approach human-quality accuracy in the near term.
Meaningful Machines’ patent-pending methods are collectively called Context-Based Machine Translation™ or CBMT™ because of their ability to preserve the context of words and phrases when translating from source language to target language. The basis of the technology is a new class of algorithms that use long-range textual context to determine the meaning of words, phrases and sentences for translation and meaning extraction.
CBMT is also uniquely extensible to new language pairs because the only language resources it requires are a bilingual dictionary and a large corpus (body of text) in the target language. Accurate corpus-based MT that learns from monolingual text represents a new paradigm in MT and means that Meaningful Machines’ methods are extensible to virtually any language pair.
CBMT Compared to Traditional MT Approaches
Traditional MT Approaches
Current MT systems fall into two broad methodological schools, referred to as Rule-Based MT and Statistical MT.
Rule-Based MT. Rule-Based approaches stretch back to the 1950s. These systems require a bilingual dictionary and an extensive set of manually coded grammar and transfer rules between the source and target languages, developed by language experts and computer scientists. Such systems can take person-decades to produce. Since each language pair has its own grammatical inter-relationships and equivalence mappings, a separate MT system must be built from scratch for each pair. Rule-Based MT systems hit a quality ceiling because of the large number of language rules needed to capture the breadth and complexity of a language. First, it is impossible to anticipate and accommodate all uses and structures of a language. Second, as more rules are added, they begin to conflict with each other. Hence, such systems have trouble accurately resolving many written or spoken cases. Currently, most commercial MT systems are Rule-Based.
Statistical MT. Statistical MT systems eschew language rules in favor of computational learning. This school, which began in the late 1980s, represented the second major wave of MT research. These data-driven systems allow for faster development of new language pairs because they do not rely on extensive sets of rules, but they do require large quantities of training material in the form of professionally pre-translated text that is aligned between the source and target languages (called “parallel text” or “parallel corpora”). Statistical systems study parallel text using computationally intensive methods to determine the probability that the translation for source fragment X is target fragment Y, and the translated fragments are assembled in a way that attempts to maximize the target language probability of the assemblage. Once parallel text between source and target languages is in place, new language pairs can be developed in a relatively short timeframe. However, Statistical MT requires huge quantities of parallel text to produce even modest quality translations. Parallel text, an extremely limited resource, simply does not exist in such quantities for most language pairs. And even for language pairs where parallel text does exist in sufficient quantity for general translation, there will be domains (subject matter areas) where coverage is totally lacking. Most university and large company research groups are developing Statistical MT systems. Some hybrid systems, with Rule-Based elements, are also under development.
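The scoring idea described above can be illustrated with a minimal Python sketch. It is a toy, not any production Statistical MT system: the phrase table, the bigram probabilities, the UNSEEN smoothing constant, and the function names are all hypothetical values invented for this example. Each candidate assembly is scored by the probability that each source fragment translates to the chosen target fragment, combined with a target-language probability for the assembled sentence.

```python
import math
from itertools import product

# Hypothetical phrase table: P(target fragment | source fragment).
# In a real system these values would be estimated from parallel text.
PHRASE_TABLE = {
    "el banco": {"the bank": 0.6, "the bench": 0.4},
    "del río": {"of the river": 0.9, "from the river": 0.1},
}

# Hypothetical target-language bigram probabilities.
BIGRAM_PROB = {
    ("the", "bank"): 0.02,
    ("the", "bench"): 0.01,
    ("bank", "of"): 0.10,
    ("bench", "of"): 0.01,
    ("bank", "from"): 0.01,
    ("bench", "from"): 0.01,
    ("of", "the"): 0.30,
    ("from", "the"): 0.20,
    ("the", "river"): 0.02,
}
UNSEEN = 1e-4  # assumed fallback probability for bigrams not in the table


def lm_logprob(words):
    """Log-probability of a word sequence under the toy bigram model."""
    return sum(
        math.log(BIGRAM_PROB.get(pair, UNSEEN))
        for pair in zip(words, words[1:])
    )


def best_translation(source_fragments):
    """Try every combination of fragment translations and keep the best."""
    best_text, best_score = None, -math.inf
    options = [PHRASE_TABLE[frag].items() for frag in source_fragments]
    for combo in product(*options):
        translation_score = sum(math.log(prob) for _, prob in combo)
        text = " ".join(target for target, _ in combo)
        score = translation_score + lm_logprob(text.split())
        if score > best_score:
            best_text, best_score = text, score
    return best_text, best_score
```

Calling best_translation(["el banco", "del río"]) selects "the bank of the river" with these toy numbers. The sketch enumerates every combination, which is only workable for tiny inputs, and it ignores word reordering entirely.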
Meaningful Machines: Context-Based MT
CBMT requires neither extensive sets of rules (as Rule-Based systems do) nor large quantities of parallel text (as Statistical systems do). Instead, CBMT runs its algorithms on a bilingual dictionary and monolingual corpora (text in one language). It is the only corpus-based method that does not require parallel text, thus representing a significant breakthrough in MT and an entirely new MT paradigm. One MT expert, upon reading Meaningful Machines’ technical paper that was presented at the AMTA-2006 conference (Association for Machine Translation in the Americas), referred to CBMT as a “revolution” and stated, “the consequences of this development are vast.”
Quality Breakthrough. CBMT's Spanish-to-English prototype is producing higher quality translation than all other systems known to the company, and is expected to approach human-quality accuracy with additional language resources – i.e., a more complete dictionary and larger target corpus.
The key to CBMT’s high quality of translation is the system’s ability to accurately translate words and phrases in a manner that preserves the context of the original source text. Other systems lose semantic integrity because they either cannot determine context or can do so only over a short range. CBMT, by contrast, has several methods that determine and use long-range context, resulting in more accurate and coherent translations. CBMT’s methods are able to work with longer word-strings than other systems, and build and connect segments on the target side that are reinforced by the original context of the source text. The methods also allow for greater word reordering from source to target.
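One way to picture how connected target-side segments can reinforce each other is the following Python sketch. It is an illustration under stated assumptions, not Meaningful Machines' actual algorithm: the candidate word-strings in WINDOWS, the scoring by shared words, and the function names are all hypothetical. Each overlapping window of the source text has several candidate target word-strings, and the sketch picks the chain of candidates whose neighbors agree on the most shared words.

```python
def overlap(left, right):
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    for k in range(min(len(left), len(right)), 0, -1):
        if left[-k:] == right[:k]:
            return k
    return 0


def best_chain(windows):
    """Pick one candidate per window, maximizing total overlap between neighbors."""
    scores = [[0] * len(windows[0])]  # best total overlap ending at each candidate
    back = []                         # back[i-1][j]: best predecessor of candidate j in window i
    for i in range(1, len(windows)):
        row, pointers = [], []
        for cand in windows[i]:
            options = [
                scores[i - 1][k] + overlap(prev, cand)
                for k, prev in enumerate(windows[i - 1])
            ]
            k_best = max(range(len(options)), key=options.__getitem__)
            row.append(options[k_best])
            pointers.append(k_best)
        scores.append(row)
        back.append(pointers)
    # Walk back from the best final candidate to recover the chain.
    j = max(range(len(scores[-1])), key=scores[-1].__getitem__)
    total = scores[-1][j]
    chain = [windows[-1][j]]
    for i in range(len(windows) - 1, 0, -1):
        j = back[i - 1][j]
        chain.append(windows[i - 1][j])
    chain.reverse()
    return chain, total


def merge(chain):
    """Join the chosen candidates, writing each overlapping word only once."""
    words = list(chain[0])
    for prev, cand in zip(chain, chain[1:]):
        words.extend(cand[overlap(prev, cand):])
    return " ".join(words)


# Hypothetical candidates for three overlapping source windows.
WINDOWS = [
    [("the", "bank", "of"), ("the", "bench", "of")],
    [("bank", "of", "the"), ("shore", "of", "the")],
    [("of", "the", "river"), ("from", "the", "river")],
]
```

With these candidates, best_chain(WINDOWS) prefers the chain in which every neighbor pair shares two words, and merge turns that chain into "the bank of the river". The candidate "the bench of" is rejected because no neighboring word-string confirms it.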
Extensibility Breakthrough. Accurate corpus-based MT that learns from monolingual text means that CBMT is extensible to virtually any language pair, without the need for rule-based coding or parallel text. (A monolingual corpus, for our purposes, is electronic text in a single language, which includes all text in that language available on the Web.) To build out a new language pair, CBMT requires, at a minimum, a fully-inflected bilingual dictionary, a large corpus in the target language, and some algorithm customization based on the language pair (e.g., segmentation of source text for Chinese).
Given the resources that Meaningful Machines’ methods require (and don’t require), CBMT’s approach is far more extensible to new languages as well as different domains within a language than any other MT method. Also, CBMT is better suited for rare languages (e.g., Pashto, Tajiki, Uzbek) because of the lack of existing parallel text involving such languages. Monolingual corpora are far easier to obtain than parallel corpora – for instance by downloading and indexing portions of the Web. And for languages that are linguistically distant from each other (e.g., Chinese, Japanese, Thai into English), CBMT, with its longer word strings and overlapping confirmation process, is the best approach for preserving long-range context and for word disambiguation.
Additional Capabilities. CBMT has two additional novel capabilities that increase the quality and utility of its MT output.
- Phrasal Synonym and Association Generation. Given a word or phrase in any language, this algorithm can generate semantically related words or phrases such as synonyms, near-synonyms, class members, descriptions, and opposites. Synonym and near-synonym generation, in particular, provides great flexibility to CBMT when, for example, the system is processing a particular segment of text but cannot find a target word or word-string that properly connects with surrounding words (i.e., it doesn’t match the particular context). Instead of accepting a candidate with low confidence or coherence, the synonym and association builder can generate substitutable phrases, giving the system more alternatives and a much greater probability of finding good candidates. This technology also has important implications for other natural language applications such as search, text mining, natural language interfaces, and machine learning.
- Confidence Coding. CBMT’s translation confirmation process can be used to distinguish between segments of output translated with high confidence and segments translated with lower confidence. This capability is significant because it (1) saves considerable time and expense if post-editing is part of the workflow, since the human post-editor need only focus on the “low-confidence” text, and (2) allows decision makers such as intelligence analysts to rely on the system’s high-confidence output for triage and decision support.
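The post-editing workflow described under Confidence Coding can be sketched in a few lines of Python. This is a minimal illustration, not CBMT's interface: the Segment type, the numeric confidence scores, the 0.8 threshold, and the [[...]] markup are all assumptions made for the example, since the source does not say how confidence is represented.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    text: str
    confidence: float  # hypothetical score in [0, 1]; not a documented CBMT output


def mark_for_post_editing(segments, threshold=0.8):
    """Wrap low-confidence segments in [[...]] so a post-editor can find them."""
    parts = []
    for seg in segments:
        if seg.confidence < threshold:
            parts.append(f"[[{seg.text}]]")
        else:
            parts.append(seg.text)
    return " ".join(parts)


# Hypothetical translated output with made-up confidence scores.
SEGMENTS = [
    Segment("The minister arrived on Monday", 0.95),
    Segment("to discuss the water accord", 0.55),
    Segment("with regional leaders.", 0.90),
]
```

Here mark_for_post_editing(SEGMENTS) leaves the first and last segments untouched and brackets only the middle one, so the human editor's attention goes to the text the system is least sure about.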
Path to Human Quality MT
While CBMT is already achieving higher quality levels than other approaches, it continues to improve with additional language resources that are practically available (dictionary terms and monolingual target corpora). Given the practical and theoretical limitations of other systems, CBMT may represent the only true path to human quality. Statistical MT systems will always be limited by the amount of available parallel corpora, and these systems experience diminishing returns as more parallel corpora are added. In addition, Statistical methods do not account for long-distance context and consistency, so even significantly greater quantities of parallel text may not be sufficient for reaching human levels. Rule-Based MT systems also reach a quality ceiling because the many language rules needed to capture the breadth of a language begin to conflict with each other, and because it is impossible to anticipate and accommodate all uses of a language. CBMT, by contrast, has a practical path to high-quality and human-quality translation.
For More Information
For a technical description of some of CBMT’s methods, see our paper, Context-Based Machine Translation, which was accepted for presentation at the AMTA-2006 conference (Association for Machine Translation in the Americas) and published in the conference proceedings.