João Graça, co-founder and CTO of Unbabel, on what machine translation can teach us about the challenges still lying ahead for artificial intelligence.
Can you understand this sentence? Now try understanding the long and convoluted and unexpectedly – maybe never-ending, or maybe ending-sooner-than-you-think, but let’s hope it ends soon – nature of this alternative sentence.
The complexities of language can be an inconvenience to a reader. For today’s smartest machine learning algorithms they are a far bigger obstacle, and more translation challenges remain than advances in other fields would have you believe.
These challenges are a good demonstration of how much ground machines still have to cover before they catch up with human performance.
You say tomato
When it comes to translation, there are two categories of content. On one hand, you have “commodity” translation. Perhaps you want to point your phone at a menu and get a rough idea of what it is. Or you want to impress a colleague with a phrase from their local language.
Here, phrases are short, the content is often formal and errors aren’t life or death.
But on the other hand, you have interactions where context is key – understanding the intent of the writer or speaker, and the expectations of the reader or listener. Take any example where a business speaks to its customers – you had better hope you are speaking their language respectfully when they have a complaint or problem.
It’s not enough to solve the problem at a superficial level, and achieving comparably “human quality” communication still requires an enormous amount of research. This need for perfection is why most research is focused on this second area.
In the examples below, I discuss the challenges still ahead for the translation industry, and touch on what they mean for how we use machine learning tech more broadly.
Challenge 1: Long-distance lookups
Many of the biggest challenges are structural.
A good example is long-distance lookups. If you are translating a sentence word by word and the word order is the same in both languages, you are just solving “what is the correct equivalent of this for that?”
But once you start having to think about reordering the sentence, the problem space that has to be explored is exponentially larger. And in languages like Chinese and Japanese, you find verbs at the end of the sentence, potentially producing the longest distances possible.
The system needs to assess at least three reordering systems. This is why these languages are so hard: you have to cater to very different grammatical patterns, very different vocabularies, and differences in how many characters make up each word.
Here, you can see how expanding problem spaces create difficulties in an area the human brain handles with ease.
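To make the growth of that problem space concrete, here is a minimal Python sketch. It is an illustration only, not how any production translation engine works: it simply counts how many word orders exist for a sentence of a given length, and how many remain if each word may move at most a fixed number of positions. The function names and the `max_shift` parameter are hypothetical, chosen for this example.

```python
from itertools import permutations
from math import factorial


def count_orderings(n_words: int) -> int:
    """Number of possible orders for n_words distinct words (n factorial)."""
    return factorial(n_words)


def orderings_within_limit(n_words: int, max_shift: int) -> int:
    """Count orders in which no word moves more than max_shift positions.

    Brute-force enumeration, so only practical for short sentences.
    max_shift is a hypothetical knob for this illustration.
    """
    return sum(
        1
        for perm in permutations(range(n_words))
        if all(abs(pos - word) <= max_shift for pos, word in enumerate(perm))
    )


if __name__ == "__main__":
    for n in (3, 5, 8, 10):
        print(n, "words:", count_orderings(n), "possible orders")
    # Same word order in both languages: only one candidate to consider.
    print("no reordering:", orderings_within_limit(6, 0))
    # Allowing only short-range moves keeps the space small.
    print("short moves only:", orderings_within_limit(6, 1))
    # A verb that travels the whole sentence needs the full space.
    print("any move allowed:", orderings_within_limit(6, 5))
```

When the word order is shared, there is a single candidate. Once a word can travel the full length of the sentence, every ordering is in play, and the count grows factorially with sentence length.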
Challenge 2: Taxonomy
The second major area of complexity involves different formats of data.
For example, conversational language has a completely different structure from formal documents, and calls for different models. In areas like customer service translation, this makes a big difference. Nobody likes to feel like the representative of a company is being overly officious when handling their problem.
Therefore, any model that is able to learn from a volume of real human queries will have an advantage – and doubly so if it’s able to take it from a particular industry sector. Meanwhile, other models might be relying on news stories or generic online text, and output completely different results.
Similarly, in other machine learning challenges, the ability to learn from the most valuable and representative data can give a big advantage – or risk limiting taxonomical flexibility.
This brings us to context.
Challenge 3: Context
Most translation models still translate sentence by sentence, so they don’t take the context into account.
If they are translating a pronoun, they have no clue which pronoun to use in the translation. They will switch at random between formal and informal sentences. They don’t guarantee consistency of terminology – for instance, translating a legal term the same way throughout. There’s no way you can guarantee the whole document is correct.
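The terminology point can be illustrated with a small Python sketch. It assumes you already have sentence-aligned source and translation pairs plus a list of candidate renderings for each term you care about. The function name, the data layout and the sample sentences are all hypothetical, and the substring matching is deliberately crude.

```python
from collections import defaultdict


def find_inconsistent_terms(sentence_pairs, candidate_renderings):
    """Report source terms that were rendered in more than one way.

    sentence_pairs: list of (source_sentence, translated_sentence) strings.
    candidate_renderings: dict mapping a source term to the target-language
        renderings to look for (hypothetical glossary for this sketch).
    Returns a dict of term -> set of renderings actually found, restricted
    to terms with more than one rendering.
    """
    used = defaultdict(set)
    for source, translation in sentence_pairs:
        source_lower = source.lower()
        translation_lower = translation.lower()
        for term, renderings in candidate_renderings.items():
            # Naive substring match: "lease" would also match "release".
            if term.lower() not in source_lower:
                continue
            for rendering in renderings:
                if rendering.lower() in translation_lower:
                    used[term].add(rendering)
    return {term: found for term, found in used.items() if len(found) > 1}


if __name__ == "__main__":
    # Illustrative English-to-Portuguese pairs, translated one sentence at a time.
    pairs = [
        ("The lease begins in May.", "O arrendamento começa em maio."),
        ("The lease may be renewed.", "A locação pode ser renovada."),
    ]
    glossary = {"lease": ["arrendamento", "locação"]}
    print(find_inconsistent_terms(pairs, glossary))
```

Each sentence here may be acceptable on its own. The inconsistency only shows up when the document is checked as a whole, which is exactly the view a sentence-by-sentence model lacks.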
The other problem is that the content is not always in the same language. Sometimes it’s one sentence in Chinese, one sentence in English. The sentences are much shorter, so you probably have to look much further back for context. This reaches its extreme in “chat” interactions.
And the context problem is different than if you were translating an email. For example, if you are doing a legal document and the document is ten pages long, you would need to use the entire document for an accurate contextual translation.
This is next to impossible with current models – you have to find some way to summarise it. Otherwise, consistency is nearly impossible.
On the other hand, if you are translating for something like SEO, what you are actually translating is keywords that stand by themselves and don’t form a sentence. This means you turn to more dictionary-like translation, and use other words or the image associated with the keyword to disambiguate.
People think “Oh, we are in the age of unlimited data” but actually we are still enormously lacking in many ways.
Yes, we have a lot of data but often not enough relevant data.
Looking to the future
There will be many translation engines but what makes them different is their models.
The model is going to look at the data, predict patterns and assign them to different customers, and from there decide which voice, language, tone and so on to choose.
Current common public translation tools aren’t aware of this yet. They don’t even know which document a translation came from, let alone the speaker or their translation preferences.
This will bring in the next level of sophistication in this area. Machine learning, exercised against a use-specific corpus of language, will give fast and accurate translations, which can then be forwarded to humans to finalise and used for further learning.
Languages might still drive machines crazy – but with careful human thinking, we can teach them to persevere.