- The Practice of Translation
- Machine Translation
- Translation Evaluation Metrics
- Translation Data Sources
- Bi-text Extraction/Filtering
Intro
This time we’re going to be talking less about models and more about the actual phenomenon of translation itself and so translation i mean i barely need to introduce what translation is i’m sure everybody knows but you know basically it’s the conversion of content in one language into content in another language
for the purpose of making that content understandable to somebody who doesn’t know the original language of course and that can be anything from you know translating novels like harry potter to you know any other variety of translation that happens and this time i’m going to be talking about how this translation happens a little bit about machine translation and then also about machine translation data sources and empty evaluation
Translation vs Interpretation
There’s actually two types of conversion of language between different languages there are two main types one is translation another is interpretation and translation is basically the conversion of written word from one language to another and interpretation is conversion of spoken language from one language to another and they’re very different in the requirements for them also how people work and the type of type of people who work in these areas so translation in general is usually less time constrained you may still have a deadline for when your translation results are required but usually it’s on the order of you know like a day somebody gives you a translated document and they they say i want this back in a day whereas an interpreter may interpret while the content is being produced you know while i’m talking an interpreter may be interpreting my speech into another language that’s simultaneous interpretation you can also do consecutive interpretation which is basically like i speak for a while then the interpreter speaks for a while i speak for a while interpreter speaks for a while translators a high degree of accuracy and fluent very fluent natural output is necessary and partially because of this translators in may translate all kinds of things but very often they specialize in a single area like i’m a medical translator or i’m a patent translator or something like that on the other hand interpreters may specialize in an area but more commonly they’re generalists who can interpret lots of different things i actually worked as an interpreter and translator for about a year and a half and i did both of them because i was kind of in a position that was a little bit less specialized i worked at a local government in japan they additionally had a single translator in a single interpreter and it’s very interesting because personalities are also very different translators are are often somewhat introverted you know they like working on their own they really like being precise whereas interpreters have to be really good at talking because they spend their whole day talking they tend to be very extroverted you know like talking to people in general not just interpreting between languages so both of these jobs are very hard i found interpretation harder because of the time pressure in the constraints and it’s not it’s you know you can’t get 100 accuracy in a situation where you’re expected to interpret things so you just need to do as well as you can
Translation Industry
How much translation is done the language services market is 56.1 billion dollars that’s a lot of dollars and there’s 640 000 translators worldwide about 75 of them are working freelance and the other ones are employed by a specific organization interestingly europe owns 49 of the market share so 49 of the people about half the people are working in europe
and one interesting thing is you might think that as machine translation technology is getting better that was you know threatening translators in their jobs and you know reducing the value of the industry actually the con the contrary is true you know more translation is being done than ever both machine translation and human translation and the value of the industry is doing nothing but improving so in a way you know more language is being translated now than ever another thing is that the translation industry is becoming more technological so 88 of full-time translators use some variety of computer-aided translation which means that they’re using machine translation they’re using something like translation memory that allows you to look up other related translations the most common tool is something called Tranos and basically it has translation memory it has integrated empty it has terminology e management software other things like this and a lot of translation agencies require that you use Tranos in order to be able to work with them there’s also other ones that i’m less familiar with but basically if you have this idea of like human translators versus machine translators in reality it’s not the case anymore it’s now a hybrid of humans and machines working together some people like this some people don’t like this some people would prefer not to be told what to translate by their translation memory but you know in order to improve efficiency people now have to do this as part of their life
Why people don’t like translation
The reason why some people don’t like translation computer aided translation is i think they feel it stifles their creativity or it because they can do it more efficiently now the requirement is that they do do it more efficiently so they have less time to sit and think about the perfect translation and just have to crank out content basically so i think those are the main reasons why people are like resistant to technology but nonetheless 88 of people are using it so even if they don’t like it they’re using it because they you know have to in order to keep up with the amount they’re expected to produce a lot of people do like it so it’s not everybody yeah so from the modeling type of view the biggest difference is whether speech is your input or text is your input another big difference is interpretation sometimes you’re expected to do it in real time so that’s called simultaneous translation so you need to create the output before you’ve like read the whole document essentially that’s a very interesting topic it might be a topic that some people in this class want to work on if you like the speech you like the translation it’s a very hot topic right now cool so now what about difficulty in translation why is this hard so this is an example i i cannot read this example i inherited it with from someone else but it’s basically a
an old chinese poem and the reason why it’s difficult to translate in general is because there’s divergences in lexical information so words in structure so this is an alignment of the glosses in chinese with the words in english if you don’t know what a gloss is it’s basically a word by word translation in the same order as the original sentence so
this is what it looked like in chinese unfortunately i don’t have the chinese characters above for all the chinese speakers but you can kind of see what it looks like and basically if you look at the chinese it’s like daiyu alone on bed top think baochai which gets translated into as she lay there alone daiu’s thoughts turned about and you can see that the ordering is different between the two also you know some words exist in the chinese but don’t exist in the english even more so in the next sentence so
you need to get the words right you need to get the words in the right order and this is that non-trivial when you know the translations are different between the two languages
Translation Ease
Here’s another example from german which might be a little bit easier to parse if you’re not a chinese speaker so if you have
here you can see the gray gloss on top which is in the in city exploded car bomb and the the kind of canonical english translation that’s listed here is a car bomb exploded downtown so you can see that the word order changed also in in german some things like car bomb is a single word here it’s multiple words here there’s also a phenomenon called translation ease and what translation eases is it’s not exactly natural in the language you’re translating into but it’s a translation that is direct and kind of maintains the original characteristics of the original language there’s actually a fair amount of computational study of translations seeing how like which language you’re translating from affects the output and you can even take translation ease cluster it together and you get a very nice reproduction of the language family tree that the languages came from so basically the effects on translationese very strongly inherit the effects of of the original language and even you know are similar between similar languages etc so you can see there’s a very clear effect and here in the inner city there exploded a car bomb would be a very like literal translation that you can understand in english but it’s also not natural english it’s not like what an english speaker would produce
Lexical Ambiguity
Another issue is not just structure but also lexical ambiguities so this is an example from jarefsky and martin speech and language processing where basically you have leg and foot and paw and how they are how they are translated in different ways based on whether it’s a an animal leg a leg of a journey a leg of a human a leg of a chair a bird foot or a human foot etc etc
Literary Translation
I also thought i had another example but in order to handle all these things you’ll have to handle you know syntactic differences lexical differences
how idiomatic the output is so you know there’s lots of issues here and then if you start talking about literary literary translation it becomes even harder right you know you want to translate a poem or something like this as we had in the assignments for the discussion today and suddenly you need to think about rhyming you need to think about like beauty of the expressions that you’re using so lots of depth in doing translation any questions there before i talk about mt quickly yeah is there any criteria so the criteria that i would use to define translations there might be a different formal definition but i think this is basically right is it is language that only occurs because you’re translating from another language and would not normally be you know how you would how a native speaker would express the same content in that language so it’s kind of like structural or lexical influences of the original language of the produced language it doesn’t need to be like the effect of translation of the output could be very subtle and often for a good translator it is very subtle but it’s still like there i have no confidence i’m a native speaker of english i’m very good at japanese but i have no confidence that i can produce english that sounds like my natural english when i’m transmitting to japanese for example any other things okay cool under the machine translation so machine translation a checked is a three billion dollar market
they’re oh actually i should mention that some of the statistics i got here are from this very very nice blog of the translation industry in 2021 if you didn’t look at this on the page i would definitely take a look it summarizes a whole bunch of statistics and was insightful to me as well these statistics are also from there which is machine translation is a three billion dollar market now so it has about five percent of the market share of the language services industry overall the top providers of it are google amazon and l so google i think a lot of people know amazon and aws web services provide translation for a lot of businesses for example and deepel is a startup that many people might have heard of but it has actually very good translation accuracy
they haven’t revealed all of their secrets but one of the things is that they use like cleaner training data they have good training data cleaning strategies and they also consider context in a better way than other things like google do another thing about machine translation is these are the markets that machine translation is used in you can see that the most common ones are healthcare automotive and military and defense markets but it’s kind of spread out pretty widely including e-commerce and other things like that there’s a very interesting paper that examined the effect of translation on e-commerce that i don’t have in the references but i can share which demonstrated that when ebay i believe introduced automatic translation between spanish and english the number of sales from
latin america to the us basically started increasing immediately after that so you can see also that you know mt has real world consequences impact etc so these are the lists this is maybe a slightly old list of languages supported by google translate it’s pretty impressive you know at least 100 languages maybe it’s nearing 200 now one thing that i like to mention to people whenever you look at this list is just because there’s a hundred languages on this list doesn’t mean that mt is equally good for all of the languages on this list you know it may be obvious if you think about it a little bit but sometimes you think well you know it’s on this list google released a product for it it must work that’s definitely not the case and you know if you try to use it to understand articles you’ll see a very big very big difference
Human Translators
There are some reports that mt is now at the level of human professionals in some areas like for example news translation between very high resource languages like chinese and english. I didn’t believe this at first when microsoft first put out an article that said this essentially and i went in and like analyzed their data looked at their data and they actually their outputs are quite good and human translator outputs are not perfect either so for example let’s say you hire a human translator on a freelancing site and tell them i want you to translate these new news articles because i would like to create training data for a machine translation system if you do that the translator will say sure i want your money i want your money i will i’ll be happy to do that but they’re not going to be super motivated and if you say instead to the translator say i’m going to be asking you to translate these news articles for cnn and or the new york times and a hundred thousand people are going to read your article you’re pretty sure they’re going to do a good job right they’re not going to make a mistake so which human translator you’re trans comparing people to also makes a big difference in these these outputs and not even just which human translator but how motivated that human translator is so i have a feeling that mt systems and high resource languages are almost as good as moderately motivated good professional human translators but they’re not as good as somebody who’s translating for the new york times for like 100 000 people for example
Translation
oh yeah so so sorry here are my other examples this is from google translate i basically put in the first sorry i had animation on this until i had to switch computers but i put in the first sentence from the wikipedia article on translation and i’ll put it through google translate which said translation refers to the act of replacing what is expressed in the form of a in a language with the form of b that corresponds to that meaning specifically in natural language it refers to the act of converting a sentence in a in the source language into a sentence in another target language i did a few small edits here the the red stuff are my edits that i think would make it a little bit better but it’s pretty good
here’s another example of machine translation i entered a lexically ambiguous word kodo in japanese which can be either code chord or chord depending on you know the context and it does a pretty good job at disambiguating like electrical cord code for the program chord for the on the score but if i wrote as a musician i am good at reading chords it made a mistake with that in java i wrote a chord that displays the chord of a guitar that should have been code up here so you can see that it you know is not perfect for doing this as well
so you know translation is is hard even good things like google translate are not perfect but they’re pretty good in high resource languages anyway so why do these work i’m gonna be talking more about translation models next class but basically i’d like to go through a little bit of you know history into how
these were conceptualized and from 1968 there’s this famous thing called the vaqua triangle and basically what it is saying is there’s multiple ways to do translation you can go from words directly to words so you can basically replace words by words you can go up to the syntactic structure of languages and then generate from the syntactic structure you can go up to semantics in the language convert the semantics between the languages and go down and you can go up to something called an interlingua with lingua which is like universal semantics for all languages
Direct Transfer
so what does this look like direct transfer looks like a word by word translation so you would just be translating directly from word words to words syntactic transfer would be like analyzing the syntax of the sentence and then using that to translate you could also generate a syntax of the target sentence or translate from syntax of the source sentence to the target sentence and generate the output
Semantics
You could also have something like semantics which is a logical form which basically says well something was detonated what was detonated it was a bomb
like a car bomb it was that mediated downtown and that was in the past tense and then you generate the output based on that
Interlingua
You could do other things and then there’s also an interlingua which basically says nothing about the you know individual languages and instead is completely in kind of a form a logical form
so each of these methods has their own advantages and disadvantages the advantage of going directly is like let’s say we have a language like spanish and italian which are very very similar in words and structure and other things like this there you could basically do a word by word translation and do a pretty good job however if you have very different languages you know you’re going to have a lot of trouble you need to have very basically a very powerful model to allow you to do this and you know before neural networks we didn’t have any model that really did this very convincingly well
Neural Models
Neural models now you can kind of view them as models that map word by word they take in a whole bunch of words they generate a whole bunch of words but you can also view them maybe as a interlingua based model where you know they’re taking in words and they’re generating hidden vectors they correspond to the meaning of all of those words and then they’re generating from that you know like interlingua between the the languages so where exactly we lie now on the spot triangle is kind of you know unclear but you know it’s kind of an interesting question as well and you know maybe considering syntax or other things like that would help us generalize better in the resource languages for example
Data
I realize i have a lot of slides left and i don’t want to take all of the time. Maybe i’ll just go through the data part and leave the evaluation part till next time but are there any questions so far okay cool so i’d like to talk a little bit about data because data is very important for training our mt models and so basically all models including you know like google translate amazon dpel any of the things that people are using are using machine learning based methods and basically the way they work is they’re trained from parallel data sources where you have one language and then another language one language and then another language this is an example that you can actually do yourself if you’re like interested in trying a puzzle which is basically like take this parallel corpus and then try to translate this sentence at the bottom yourself and you’ll see that you need to like form associations between words you need to understand about what the syntax of the language looks like but it’s definitely possible from this this small parallel corpus
Parallelcorpora
No i think these are made up yeah i’m pretty sure
so where can we get parallel corpora basically it’s anywhere that translators are doing lots of translation this is an example from the united nations from a few days ago and you can see that this is translated into you know three languages here it’s actually six languages i believe it’s the official languages of the
Languages
another good source that we love using for like very low resource languages is the bible because the bible is translated into more languages than any other text as far as i know
Books
You can also get books of course or other other things this is harry potter and english and chinese Restaurants when you go to a restaurant if you’re not chinese you can go to your favorite chinese restaurant if you’re chinese you can go to your favorite indian restaurant i don’t know something else and get the menu and that’s a very good parallel purpose for you to try your your own learning skills on
Web Data
you can also harvest data from the web like you can harvest data from micro micro blogs twitter social media other things like this so this can give you you know more informal language and that’s why you know google doesn’t completely fall over when it tries to translate twitter for example
Opus
and the canonical place to get data for any of these models now is this place called opus and what opus does is it collects a whole bunch of open parallel corpora in many many different languages language pairs and across many domains so if you want to train models this is your best place to get it so to leave some time for the discussion i think i’ll move the evaluation part to tomorrow but are there any questions about stuff we talked about so far
Discussion Question:
Use Google translate to back-translate the text via a pivot language, e.g., “English → Spanish → English” or “English → L1 → L2 → English”, where L1 and L2 are typologically different from English and from each other.Compare the original text and its English back-translation, and share your observations. For example, (1) what information got lost in the process of translation? (2) are there translation errors associated with linguistic properties of pivot languages and with linguistic divergences across languages?
Try different pivot languages: can you provide insights about the quality of MT for those language pairs?
Resources:
- https://redokun.com/blog/translation-statistics
Citation
@online{bochman2022,
author = {Bochman, Oren},
title = {Translation and {Translation} {Data}},
date = {2022-02-01},
url = {https://orenbochman.github.io/notes-nlp/notes/cs11-737/cs11-737-w05-translation/},
langid = {en}
}