ScientificAmerican.com
  July 17, 2002
July 15, 2002
Multilingual Machines
By Charles Choi


While the human brain could never hope to juggle the mental arithmetic involved, computers can. This technique goes through giant databases of translations and breaks the many sentences apart. It then looks for words with a tendency to cluster together. For instance, in a sample English-to-German text, it notes that the phrase "kids love" is linked to 223 occurrences of "kinder lieben," 201 occurrences of "kinder moegen" and 12 occurrences of "kleine kinder." Since "kinder lieben" appears with the highest frequency, it becomes the preferred translation, although EliMT can also report the alternative translations if desired. Matches between entire sentences and other long word clusters are preferred over shorter building blocks, because words in longer matches are more often translated correctly in context.
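
The frequency ranking described above can be illustrated with a minimal Python sketch. It uses only the three counts quoted for "kids love"; the function name and data layout are assumptions made for illustration and do not describe EliMT's actual implementation.

```python
from collections import Counter

# Co-occurrence counts for the English phrase "kids love",
# taken from the figures quoted in the article.
candidates = Counter({
    "kinder lieben": 223,
    "kinder moegen": 201,
    "kleine kinder": 12,
})

def rank_translations(counts):
    """Return candidate translations ordered from most to least frequent."""
    return [phrase for phrase, _ in counts.most_common()]

ranked = rank_translations(candidates)
preferred, alternatives = ranked[0], ranked[1:]
print(preferred)      # the most frequent candidate
print(alternatives)   # the remaining candidates, kept as alternatives
```

The most frequent candidate is treated as the preferred translation and the rest are kept as alternatives, mirroring the description in the text.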

Statistical MT techniques first emerged about 12 years ago. But not every word cluster in the world is found in existing translations, and this problem of incomplete databases meant that statistical MT relied on rule-based MT to fill in the blanks. Abir's new system eschews rule-based systems altogether and depends entirely on a statistical solution that looks for overlaps between sentence fragments. While the phrase "kids love chocolate" is not present in the sample text, for example, the segment "love chocolate" is; there are 256 occurrences of "liebe schokolade" and 233 occurrences of "lieben schokolade." Even though the former occurs more often, its "liebe" doesn't overlap with the "lieben" of "kinder lieben," so the system goes for the next most popular candidate, the one with "lieben."
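
The overlap step can be sketched in Python as well. This is a simplified illustration under stated assumptions: the overlap is taken to be a single shared word at the boundary between two fragments, and the function name join_with_overlap is hypothetical. The article does not specify how EliMT actually detects overlaps.

```python
def join_with_overlap(left, right_counts):
    """Pick the most frequent right-hand candidate whose first word equals
    the last word of `left`, then merge the two on that shared word.

    Assumption: a one-word overlap is enough to join two fragments.
    """
    left_words = left.split()
    # Try candidates from most to least frequent.
    by_frequency = sorted(right_counts.items(), key=lambda kv: kv[1], reverse=True)
    for phrase, _count in by_frequency:
        right_words = phrase.split()
        if right_words[0] == left_words[-1]:
            return " ".join(left_words + right_words[1:])
    return None  # no candidate overlaps with the left fragment

# Counts for "love chocolate" as quoted in the article.
love_chocolate = {"liebe schokolade": 256, "lieben schokolade": 233}

result = join_with_overlap("kinder lieben", love_chocolate)
print(result)
```

Here "liebe schokolade" is skipped despite its higher count because "liebe" does not match the final word of "kinder lieben," so the lower-ranked "lieben schokolade" is used to complete the phrase.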

Higher Accuracy?

Carbonell says he believes EliMT could produce translations of higher accuracy than Systran in about 12 to 18 months--so much so that he applied for membership on Meaningful Machines' board after his assessment. Also, rather than waiting decades for language-pair rules to be developed, EliMT should be able to build a makeshift database quickly from translations entered in any language. "For 100 languages there are 9,900 language pairs, and while a shortcut for one pair is nice, shortcutting 9,900 pairs is essential," Carbonell says.

Another possible advantage of EliMT is that it could steadily refine itself in either a fully automated or a human-assisted manner as more data are entered, unlike rule-based systems, which require meticulous tinkering with the rules. In addition, EliMT should be able to recognize exactly where faults in its translation might lie, which would streamline the human editing process. "With other translations, all you know is that it's about 70 percent accurate, and you don't know which 70 percent that is," Abir says. "This system knows what it doesn't know."

Also, unlike other MT systems, EliMT could potentially draw on results from other language pairs to improve a translation by matching these segments--what Abir calls "blocks of meaning" or "the DNA of a language"--across different languages. How much this would actually help, however, needs further testing.

Right now the EliMT system is still in preparatory stages, although the company hopes to field comparative tests soon. The core database may prove to be an unwieldy hundreds of gigabytes in size, and translation takes a great deal of computing power, so the company at this point plans to operate a server through which customers would process translations. Still, Klein says he hopes eventually to enable real-time translation in applications such as e-mail, chat rooms and mobile devices. "Right now MT only accounts for 2 percent of the worldwide translation market, but we expect demand will go up once the supply--a near-human automated system--is finally there," he says.


Charles Choi is based in New York City.





More to Explore:
"Multilingualism on the Internet," by Bruno Oudet (Scientific American, March 1997), is available for purchase at the Scientific American Archive




 
1996-2002 Scientific American, Inc. All rights reserved.
Reproduction in whole or in part without permission is prohibited.