While the human brain could never hope to juggle the mental
arithmetic involved, computers can. This technique goes
through giant databases of translations and breaks the many
sentences apart. It then looks for words with a tendency to
cluster together. For instance, in a sample English-to-German
text, it notes that the phrase "kids love" is linked to 223
occurrences of "kinder lieben," 201 occurrences of "kinder
moegen" and 12 occurrences of "kleine kinder." Since "kinder
lieben" appears with the highest frequency, it will be the
preferred translation, although EliMT can also list alternative
translations if desired. Matches between entire sentences
and other long word clusters are preferred over shorter
building blocks, since words in longer matches are more often
translated correctly in context.
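
The frequency ranking described above can be illustrated with a minimal Python sketch. It uses only the counts quoted in the example; the function name rank_translations is hypothetical, and nothing here is meant to represent EliMT's actual code.

```python
from collections import Counter

# Counts quoted in the article's English-to-German example for "kids love".
candidates = Counter({
    "kinder lieben": 223,
    "kinder moegen": 201,
    "kleine kinder": 12,
})

def rank_translations(counts):
    """Return candidate phrases ordered from most to least frequent."""
    return [phrase for phrase, _ in counts.most_common()]

ranked = rank_translations(candidates)
best = ranked[0]           # the preferred translation
alternatives = ranked[1:]  # kept in case alternatives are wanted
print(best, alternatives)
```

The most frequent candidate becomes the preferred translation, and the rest remain available as alternatives.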
Statistical MT techniques first emerged about 12 years ago,
but because no collection of translations contains every
possible word cluster, statistical MT relied on rule-based MT
to fill in the blanks.
Abir's new system eschews rule-based systems altogether and
depends entirely on a statistical solution by looking for
overlaps between sentence fragments. While the phrase "kids
love chocolate" is not present in the sample text, for
example, the segment "love chocolate" is; there are 256
occurrences of "liebe schokolade" and 233 occurrences of
"lieben schokolade." Even though the former occurs more often,
its "liebe" doesn't overlap with the "lieben" in "kinder
lieben," so the system goes with the next most frequent
candidate, "lieben schokolade."
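
The overlap step can be sketched in Python as follows. This is a simplified illustration that assumes fragments overlap by exactly one word and that the preferred left-hand fragment is "kinder lieben"; the function name join_by_overlap is hypothetical and the real system's matching is surely more elaborate.

```python
def join_by_overlap(left, right_candidates):
    """Join `left` with the most frequent right-hand candidate whose
    first word matches the last word of `left`."""
    left_words = left.split()
    # Try candidates from most to least frequent.
    ordered = sorted(right_candidates.items(), key=lambda kv: kv[1], reverse=True)
    for phrase, _count in ordered:
        words = phrase.split()
        if words[0] == left_words[-1]:
            # Merge the two fragments, keeping the shared word once.
            return " ".join(left_words + words[1:])
    return None  # no candidate overlaps

# Counts quoted in the article for the segment "love chocolate".
love_chocolate = {"liebe schokolade": 256, "lieben schokolade": 233}

result = join_by_overlap("kinder lieben", love_chocolate)
print(result)
```

Here "liebe schokolade" is tried first because it is more frequent, but it is rejected because "liebe" does not match "lieben"; the second candidate overlaps and is chosen.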
Higher Accuracy?
Carbonell says he believes EliMT could produce translations
of higher accuracy than Systran in about 12 to 18 months--so
much so that he applied for membership on Meaningful Machines'
board after his assessment. Also, instead of waiting decades
for language-pair rules to be developed, EliMT should be able
to assemble a makeshift database quickly from translations
entered in any language. "For 100 languages there are
9,900 language pairs, and while a shortcut for one pair is
nice, shortcutting 9,900 pairs is essential," Carbonell says.
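
Carbonell's figure follows from counting each translation direction separately: every one of the 100 languages can be paired with each of the other 99. A short Python check, with a hypothetical function name, makes the arithmetic explicit.

```python
def directed_language_pairs(n):
    """Number of ordered (source, target) pairs among n languages."""
    return n * (n - 1)

pairs = directed_language_pairs(100)
print(pairs)  # 9900
```

If direction were ignored, the count would be half that, or 4,950 pairs.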
Another possible advantage of EliMT is that it could
steadily refine itself in either a fully automated or a
human-assisted manner as more data are entered, unlike
rule-based systems that require meticulous tinkering with the
rules. In addition, EliMT should recognize exactly where
faults in its translation might lie, which would streamline
human editing. "With other translations, all you know is
that it's about 70 percent accurate, and you don't know which
70 percent that is," Abir says. "This system knows what it
doesn't know."
Also, unlike other MT systems, EliMT could potentially improve
a translation by drawing on results from other language pairs,
matching these segments--what Abir calls "blocks of meaning"
or "the DNA of a language"--across different languages. The
extent to which this actually might be of assistance, however,
needs further testing.
Right now the EliMT system is still in preparatory stages,
although the company hopes to field comparative tests soon.
The core database may prove unwieldy at hundreds of
gigabytes, and translation takes a great deal of
computing power, so the company at this point plans on
operating a server through which customers would process
translations. Still, Klein says he hopes eventually to
enable real-time translation for applications such as e-mail,
chat rooms and mobile devices. "Right now MT only accounts for 2
percent of the worldwide translation market, but we expect
demand will go up once the supply--a near-human automated
system--is finally there," he says.
Charles Choi is based in New York City.