- Unsupervised MT
- Unsupervised Pre-training (LM, seq2seq)
Intro
I’m going to be talking about unsupervised machine translation, which is a very interesting topic overall. If you’ve done the reading, you can see that it’s practical, to a greater or lesser extent, for some varieties of text-to-text translation, but I think there are a lot of other applications as well, maybe including speech or other things that we’re going to be talking about in the future. The underlying technology is interesting and worth knowing about, both with respect to the techniques themselves and their limitations.
Conditional Text Generation
Conditional text generation is something we’ve talked about before. Basically, we’re generating text according to a specification, like I described in the seq2seq models class. We have our input x and our output y, where the task could be machine translation, image captioning, summarization, speech recognition, etc. The way we model this is with some variety of conditional language model, such as a seq2seq model like the ones you’re training for assignment two, with an encoder and a decoder. The way we traditionally estimate the model parameters is maximum likelihood estimation: we maximize the likelihood of the output given the input. Generally, this needs supervision in the form of parallel data, usually millions of parallel sentences.
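To make that training objective concrete, here is a minimal sketch of the conditional MLE loss with a toy GRU encoder-decoder; the vocabulary size, model dimensions, and random data are illustrative stand-ins, not the assignment’s actual setup.

```python
# Minimal sketch: maximize log p(y | x) with a toy GRU encoder-decoder.
import torch
import torch.nn as nn

VOCAB, EMB, HID = 1000, 64, 128  # illustrative sizes

class ToySeq2Seq(nn.Module):
    def __init__(self):
        super().__init__()
        self.src_emb = nn.Embedding(VOCAB, EMB)
        self.tgt_emb = nn.Embedding(VOCAB, EMB)
        self.encoder = nn.GRU(EMB, HID, batch_first=True)
        self.decoder = nn.GRU(EMB, HID, batch_first=True)
        self.proj = nn.Linear(HID, VOCAB)

    def forward(self, x, y_in):
        _, h = self.encoder(self.src_emb(x))          # encode the source x
        out, _ = self.decoder(self.tgt_emb(y_in), h)  # teacher-forced decoding
        return self.proj(out)                         # logits over target vocab

model = ToySeq2Seq()
x = torch.randint(0, VOCAB, (8, 12))   # a batch of "source" sentences
y = torch.randint(0, VOCAB, (8, 10))   # a batch of "target" sentences
logits = model(x, y[:, :-1])           # predict y_t from y_<t and x
# MLE: cross-entropy = -log p(y | x), averaged over target tokens
loss = nn.functional.cross_entropy(logits.reshape(-1, VOCAB), y[:, 1:].reshape(-1))
loss.backward()
```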
What if we don’t have parallel data?
What we’re going to ask about in this class is: what if we don’t have parallel data? To give a few examples: let’s say we have a photo of a person’s face and we want to automatically turn it into a painting to put on your wall, or turn it into a cartoon because you want a cartoon picture of yourself for your social media profile. Unfortunately, we don’t have tons and tons of data for this, but we do have tons of photos and tons of paintings, so we have lots of inputs x and lots of outputs y but very few pairs of x and y. We could also do other things like transferring images between genders or between ages (I think you’ve seen apps that might do this), transferring text from impolite to polite (correcting the formality), transferring a positive review to a negative review or vice versa, or doing something like machine translation. I actually modified this to give a few other examples, but the slides disappeared; some really interesting examples are an ancient language or a cipher where we don’t have any parallel text, but we want to decipher the old text and understand what it meant in a modern language. That’s another thing we could do with unsupervised translation.
Can’t we just collect/generate the data?
Another question is, “Couldn’t we just collect or generate data for these tasks?” To some extent the answer is yes, we could for some of them, but it could be too time-consuming or expensive, and it can also be difficult to specify what to generate or even to evaluate the quality of the generations. If we said “generate this text as if Joe Biden said it,” many people here don’t know what Joe Biden sounds like well enough to even do this in the first place. It’s difficult and under-specified, finding people who’d be able to do it would be hard, and because of this it often doesn’t result in good-quality datasets.
Unsupervised Translation
In unsupervised translation, the basic idea is that we have some seq2seq task, translation being the stereotypical example, but it could be any of the other ones I talked about. Instead of using monolingual data to improve an existing NMT system trained on parallel data, or merely reducing the amount of supervision, we’d like to ask: can we learn without any supervision whatsoever?
Outline
There are some core concepts in unsupervised MT, and these include initialization, iterative back-translation, bidirectional model sharing, and denoising autoencoding. From the point of view of unsupervised MT, in some cases people have also used older statistical machine translation techniques instead of neural machine translation techniques, because they are more robust.
I’ll talk a little bit about what statistical MT means, because we haven’t really talked about it here yet, and I’ll also explain why it is more robust and what we could do to improve the robustness of neural MT as well.
Step 1: Initialization
For step one, initialization: basically, a prerequisite for unsupervised MT is that we start out with an initial model that can do some sort of mapping between sequences in an appropriate way, so that we can use it to seed a downstream learning process that learns to translate. It basically adds a good prior over the space of solutions we want to reach. The way this is done is usually by using approximate translations of subwords, words, or phrases. We take advantage of the fact that the context of a word is often similar across languages, since each language refers to the same underlying physical world, and we rely on unsupervised word translation.
Initialization: Unsupervised Word Translation
I talked about this a little bit two classes ago; I also called it bilingual lexicon induction. The basic idea is that word embedding spaces in two languages are roughly isomorphic. What I mean by this is: if you take an embedding space from one language, like English, and an embedding space from another language, say Spanish, we can learn some function that isn’t overly complicated that allows us to map between these two embedding spaces. For example, we might run a model like word2vec, or any other word embedding induction technique, to embed individual words in each language, and then we learn a mapping between the spaces, like a matrix transformation Wx = y, maybe with some constraints such as W being orthogonal, which makes the mapping invertible so you can map back and forth between one embedding space and the other. We hope that by applying this transformation we end up with a space where, if a word from one embedding space and a word from the other are close together, they are similar semantically or syntactically. It’s hard to believe that this would actually work. I remember going to a presentation at ACL, I think in 2016, where this method was proposed, and I thought there was no way it could possibly work, because you’re assuming that you embed words, transform them in some way, and the distributional properties cause them to line up. In fact, it works better than you would think, and there are a couple of reasons for this. One reason is that in addition to distributional properties, a lot of the word embedding techniques used in these mappings also take into account subword information, and then there is the case of mapping between English and Spanish or English and German.
A lot of the words in such language pairs have similar spellings, and that can give an additional indication about whether the words are similar. Another reason is that words like “gato” and “cat” are both common, and common words tend to map to common words and uncommon words to uncommon words, so you’re implicitly using word frequency information in the mapping: frequent words in word embedding spaces often tend to have larger norms because they are updated more frequently, and that implicitly enters the calculation as well. So there are a bunch of things working in this method’s favor. Nonetheless, it doesn’t work perfectly; it works well enough to do something, and in the case of initialization that’s mainly what we’re looking for: something that starts us out in some sort of reasonable space.
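To make the mapping step concrete, here is a small sketch on synthetic data of the orthogonal (Procrustes) solution for W given a set of aligned word vectors; the dimensionality, the number of seed pairs, and the noise level are illustrative assumptions.

```python
# Orthogonal Procrustes: find orthogonal W minimizing ||W x_i - y_i|| over seed pairs.
import numpy as np

def procrustes(X, Y):
    """X, Y: (n_pairs, dim) arrays of aligned source/target word vectors."""
    U, _, Vt = np.linalg.svd(Y.T @ X)
    return U @ Vt  # orthogonal W such that W @ x_i is close to y_i

rng = np.random.default_rng(0)
dim, n = 300, 500
X = rng.normal(size=(n, dim))                          # "source-language" vectors
true_W = np.linalg.qr(rng.normal(size=(dim, dim)))[0]  # a hidden orthogonal map
Y = X @ true_W.T + 0.01 * rng.normal(size=(n, dim))    # noisy "target" vectors

W = procrustes(X, Y)
mapped = X @ W.T                 # source vectors mapped into the target space;
                                 # translate a word by nearest-neighbor lookup here
print(np.allclose(W @ W.T, np.eye(dim), atol=1e-6))    # W is (numerically) orthogonal
```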
Unsupervised Word Translation: Adversarial Training
What if the words in the two domains, or the two languages, are not one-to-one? To give an example, let’s say we’re mapping English to Turkish, where in English a single verb has, you know, a few conjugations and in Turkish it has something like 80. The answer is that it basically doesn’t work very well, so that’s another thing you need to be concerned about. But if the morphology is approximately the same, it can still do something, and the later steps of the unsupervised translation pipeline can also help disambiguate ambiguous words and other things like that. I saw another hand, yes: it’s a combination of words being frequent and words appearing in the same contexts. For example, proper names don’t work particularly well for this method, because if you have, say, a Japanese newspaper and an English newspaper, the proper names mentioned in the English newspaper tend to be English names and the ones in the Japanese newspaper tend to be Japanese names. However, proper names have a certain frequency profile and they tend to cluster together, so at the very least you should be able to get the fact that things in this cluster are proper names, and intuitively you can see how that would be more or less universal across languages, so you can get at least that close. There are other things too: if two languages have determiners, the determiners are almost always among the most common words, so you could map determiners between the languages and get them right most of the time. Still, completely unsupervised methods work kind of okay, but not really well. An auxiliary point is that there are very few languages in the world where you don’t have any dictionary whatsoever; if a language has no dictionary, it probably also doesn’t have enough written text to learn word embeddings from. So the completely unsupervised setting might not be needed, and you might instead want to do something with just a dictionary. Another way to learn this distribution matching is to use an adversarial objective, which basically tries to prevent a model from being able to distinguish between the mapped source vectors Wx and the target vectors y.
So the idea is that you want to move the two spaces so close together that a sophisticated neural network, some sort of discriminator, is not able to distinguish between them; that’s the actual mechanism for doing the distribution matching (a sketch of this is shown at the end of this section). Another thing that is commonly done, which I talked about two classes ago but isn’t included in the slides here, is that after an initial first pass you find several points that you’re very confident in, points that are mutual nearest neighbors of each other and don’t have other close mutual nearest neighbors, and you use those as pseudo-supervision. You then fit a supervised mapping that tries to pull those pairs as close together as possible while keeping others farther apart, and you do an iterative process where you gradually increase the number of words that are mapped together, which further improves accuracy. Now, if you have a small amount of supervision, say 50 words in a dictionary, you could use that to supervise the mapping directly without having to do the unsupervised distribution matching first. In fact, two papers were presented at basically the same time: one was a paper on completely unsupervised mapping, and another was a paper where numbers were the only thing used to cross the languages, because numbers tend to be written in Latin characters in many languages of the world, so you can use just those as supervision and that gets you a long way too. If you have a dictionary, even a small one, that usually gives you better results. A question: does that really work with numbers? The information in a number’s word embedding often doesn’t encode what the number actually is, just that it is a number. Basically, the idea is that any supervision is better than no supervision, and you’re still ideally going to have some sort of distribution matching component in your objective anyway. Another question: with this method, do we still ensure that W is orthogonal? That’s a really good question; if you have supervision you wouldn’t necessarily have to, but we’ve done some work on unsupervised embedding induction and it almost always helped us to keep W orthogonal, and it almost always helped us not to use anything more complicated than a linear transform. You would think you’d be able to throw a big neural network at it and it would do better, but even in supervised settings the problem is, I guess, too underspecified, and that didn’t help very much; not to say it could never help. Okay, so the next thing is we pull out our favorite data augmentation method, back-translation: we take French and back-translate monolingual data into English, and we take English and back-translate monolingual data into French. In the supervised setting we have parallel data to start from; here, what we can do instead is create pseudo-parallel data, which I’ll talk about in a second.
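Here is a rough sketch of the adversarial distribution matching described above, in the spirit of a MUSE-style setup; the random embeddings, network sizes, learning rates, and number of steps are illustrative assumptions, and the real method also re-orthogonalizes W and follows up with Procrustes refinement.

```python
# Adversarial matching: a discriminator tries to tell mapped source vectors W x
# from target vectors y, while the mapping W is trained to fool it.
import torch
import torch.nn as nn

dim = 300
W = nn.Linear(dim, dim, bias=False)                                  # the mapping
disc = nn.Sequential(nn.Linear(dim, 512), nn.ReLU(), nn.Linear(512, 1))
opt_w = torch.optim.SGD(W.parameters(), lr=0.1)
opt_d = torch.optim.SGD(disc.parameters(), lr=0.1)
bce = nn.BCEWithLogitsLoss()

src_emb = torch.randn(10000, dim)  # stand-ins for real monolingual word embeddings
tgt_emb = torch.randn(10000, dim)

for step in range(1000):
    xs = src_emb[torch.randint(0, 10000, (128,))]
    ys = tgt_emb[torch.randint(0, 10000, (128,))]
    # 1) discriminator step: mapped source gets label 0, real target gets label 1
    d_loss = (bce(disc(W(xs).detach()), torch.zeros(128, 1))
              + bce(disc(ys), torch.ones(128, 1)))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()
    # 2) mapping step: update W so mapped source vectors look like target vectors
    m_loss = bce(disc(W(xs)), torch.ones(128, 1))
    opt_w.zero_grad(); m_loss.backward(); opt_w.step()
```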
One-slide primer on phrase-based statistical MT
Next I’d like to explain how we apply these methods to a non-neural MT system, so here is a very brief, one-slide primer on phrase-based machine translation. This is what a lot of people used to do before neural MT came out, and it consists of three steps. First, the source input is segmented into phrases; these phrases can be any sequence of words, not necessarily linguistically motivated, so they don’t need to be a noun phrase or verb phrase or anything like that. Second, you take a dictionary, the phrase table, which has translations of the phrases into the target language, maybe 10 or so candidates for each phrase, and you replace each source phrase with a translation. Third, the phrases are reordered to get them into the correct order. This is a nice method for a few reasons, one being that it is basically guaranteed to cover every word in the input and, if the model is trained okay, not do anything really crazy. If you’re struggling with assignment two right now and your low-resource neural machine translation system is doing things like repeating the same word 300 times in a row, a phrase-based machine translation system would not do this. It might give you a bad output, but it wouldn’t repeat the same thing over and over again, or translate a sentence into an empty sentence, precisely because it has to cover all the words and can only use a fixed set of translations. Because of this it has a strong bias toward generating non-nonsense, but it’s also not as powerful as neural MT. Segmenting into phrases is easy, and reordering is not easy but possible; to translate each phrase, though, you need parallel data, and that’s a problem in unsupervised MT, of course.
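To make the three steps concrete, here is a toy sketch; the phrase table, the French sentence, and the greedy segmentation are made up purely for illustration, and a real decoder would search over segmentations and reorderings with a language model.

```python
# Toy phrase-based pipeline: segment into phrases, translate each phrase, reorder.
# Hypothetical French -> English phrase table: phrase -> candidate translations.
phrase_table = {
    "je suis": ["I am"],
    "un étudiant": ["a student"],
}

def segment(words, table, max_len=3):
    """Greedy left-to-right segmentation into known phrases (fallback: single word)."""
    i, phrases = 0, []
    while i < len(words):
        for n in range(max_len, 0, -1):
            cand = " ".join(words[i:i + n])
            if cand in table or n == 1:
                phrases.append(cand)
                i += n
                break
    return phrases

src = "je suis un étudiant".split()
phrases = segment(src, phrase_table)                         # 1) segment
translated = [phrase_table.get(p, [p])[0] for p in phrases]  # 2) translate each phrase
# 3) reorder: here the monotone order already happens to be correct
print(" ".join(translated))  # -> "I am a student"
```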
Unsupervised Statistical MT
The way unsupervised statistical MT can work, which is detailed in the papers by Artetxe et al. and Lample et al., is: you learn monolingual embeddings for unigrams, bigrams, and trigrams; you initialize phrase tables from cross-lingual mappings of these embeddings, so basically you initialize the phrases that could be used in a phrase-based machine translation system; and then you do training based on back-translation and iterate. What this means is that you take the phrase tables from the cross-lingual mappings, you estimate a language model on the target language, and then you translate all of the monolingual data; once you’ve translated all the monolingual data, you can feed it into your normal machine translation training pipeline. The key here is that you’re inducing each of these phrase translations just by mapping embeddings of unigrams, bigrams, and trigrams. What we can see here is that if we start out with the unsupervised phrase table and translate from French to English, we get a BLEU score of 17.5, and then as we add more iterations, translating and learning from the translated data over and over, the score gets a bit better every time. Basically, it’s an iterative process of recreating the data with back-translation in one direction, then the other direction, then training again; a sketch of this loop is shown below.
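Here is a schematic sketch of that iterative back-translation loop. The train() and translate() functions are trivial stand-ins (purely hypothetical) for a real SMT/NMT training pipeline and decoder; in the actual systems, the initial models would come from the induced phrase tables and a target-side language model.

```python
# Iterative back-translation with toy stand-in components.
def train(pseudo_parallel):
    """Stand-in trainer: the 'model' just memorizes its training pairs."""
    return dict(pseudo_parallel)

def translate(model, sentences):
    """Stand-in decoder: look up a translation if seen, otherwise copy the input."""
    return [model.get(s, s) for s in sentences]

mono_fr = ["je suis un étudiant", "la maison est grande"]   # French monolingual data
mono_en = ["I am a student", "the house is big"]            # English monolingual data

fr2en = train([])   # seed models (in reality: induced phrase tables / embeddings)
en2fr = train([])

for iteration in range(3):
    # 1) Back-translate English monolingual data into pseudo-French, then train
    #    the FR->EN model on (pseudo-French, real English) pairs.
    pseudo_fr = translate(en2fr, mono_en)
    fr2en = train(zip(pseudo_fr, mono_en))
    # 2) Do the same in the other direction.
    pseudo_en = translate(fr2en, mono_fr)
    en2fr = train(zip(pseudo_en, mono_fr))
    # Each round the pseudo-parallel data should get a bit cleaner, which is
    # what the rising BLEU scores across iterations reflect.
```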
Unsupervised Neural MT
For unsupervised neural MT, the exact same procedure could be applied, with the caveat that you can’t create a phrase table, so you need another method for learning the initial model, since you can’t induce a phrase table from embeddings. In addition to that procedure, there is one thing you can do to improve the neural MT model: you take the encoder-decoder model and use the same encoder and decoder for both languages, and you can also initialize the input embeddings with cross-lingual embeddings. The idea is that you train the model to output French, but the input is either French, so it’s an autoencoding objective, or English. If you have a French language token here, this is basically saying “I want you to generate French next,” so the model is essentially guaranteed to generate French; and because the input embeddings are initialized with bilingual, coordinated embeddings, the French and English inputs look very similar to the model. So the inputs look similar, and the model knows it should generate French; if we just train on the autoencoding objective on the bottom, which we can do with monolingual data alone, the hope and dream is that it nonetheless learns how to translate. So we have a couple of objectives; a sketch of the shared-model setup is shown below.
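Below is a minimal sketch of the shared-model idea; the token IDs, layer sizes, and random data are hypothetical. One seq2seq model serves both languages, and a target-language token at the start of decoding tells it which language to generate, so the same parameters handle both the autoencoding direction and the translation direction.

```python
# Shared encoder-decoder with a target-language token.
import torch
import torch.nn as nn

VOCAB = 1000
FR_TOK, EN_TOK = 1, 2  # special "generate French" / "generate English" token ids

shared_emb = nn.Embedding(VOCAB, 64)  # would be initialized from cross-lingual embeddings
encoder = nn.GRU(64, 128, batch_first=True)
decoder = nn.GRU(64, 128, batch_first=True)
proj = nn.Linear(128, VOCAB)

def forward(src_ids, tgt_lang_tok, tgt_in_ids):
    # Same encoder and decoder no matter which language the source is in.
    _, h = encoder(shared_emb(src_ids))
    # Prepend the language token so the decoder knows what to generate.
    lang = torch.full((src_ids.size(0), 1), tgt_lang_tok, dtype=torch.long)
    out, _ = decoder(shared_emb(torch.cat([lang, tgt_in_ids], dim=1)), h)
    return proj(out)

fr = torch.randint(3, VOCAB, (4, 10))
en = torch.randint(3, VOCAB, (4, 10))
# Autoencoding-style use: French in, French out (token <fr>)...
logits_ae = forward(fr, FR_TOK, fr[:, :-1])
# ...and translation-style use: English in, French out, with the same parameters.
logits_tr = forward(en, FR_TOK, fr[:, :-1])
```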
Unsupervised MT: Training Objective 1
The first objective is the denoising autoencoder objective. What we do is take the source sentence, corrupt it according to a corruption function C, so we move x to C(x), map it into a latent space, and then try to regenerate the original x; I’ll give an example of this in a moment. The other objective we can use in unsupervised NMT is back-translation: we translate the target to the source and use this as a quote-unquote “supervised” example for translating the source to the target.
One example of the noising objective would be to cross off individual words and try to reproduce the original sentence from the crossed-out version; we set their word embeddings to zero or something like this, and that would be going from x to C(x). Then we run a seq2seq model that takes in this noised input and generates the output, so that would be one example. In this particular case, what we could also do is translate the input into another language, and this translation could either be done with a pre-trained model that we already have from a previous iteration of unsupervised translation, or with some sort of heuristic, like mapping the words in the input; I think I have a slide about that coming up soon. The reason why it works, as I said before, is that cross-lingual embeddings and the shared encoder-decoder give the model a good starting point, where translating from one language is similar to translating from the other. Another objective that people use to further enforce this is an adversarial objective: you take an autoencoder example or a back-translation example and apply an adversarial objective to try to force the two encoder representations together. This is similar to what is used in unsupervised word translation, where you try to get the encoder vectors to be so similar that you fool a discriminator and it’s not able to distinguish between them. Yes, a question: masked LMs are actually a variety of denoising autoencoding; masking is basically a subset of denoising. And to the other question: here we’re not using any parallel data whatsoever, so all of the “parallel” data we have for back-translation is obtained through the model itself, not from real parallel data.
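Here is a minimal sketch of a corruption function C(x) in that spirit, roughly following the word-dropout-plus-local-shuffle noise used in the unsupervised NMT papers; the dropout probability and shuffle window below are illustrative.

```python
# Corruption C(x): drop some words and slightly shuffle the rest; the model is
# then trained to reconstruct the original x from C(x).
import random

def corrupt(tokens, p_drop=0.1, max_shift=3, seed=None):
    rng = random.Random(seed)
    # 1) word dropout: remove each token with probability p_drop
    kept = [t for t in tokens if rng.random() > p_drop]
    # 2) local shuffle: each kept token can move at most ~max_shift positions
    keys = [i + rng.uniform(0, max_shift) for i in range(len(kept))]
    return [t for _, t in sorted(zip(keys, kept))]

x = "the cat sat on the mat".split()
print(corrupt(x, seed=0))  # a slightly scrambled, possibly shortened sentence
```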
How does it work?
Basically, there are two ways to see why this works. The first is the autoencoding, or denoising autoencoding, objective. To take the example here, we have “I am a student,” and if you have bilingual embeddings, where embeddings of corresponding words end up close together, then the word embeddings look roughly the same across the two languages. So even though what you actually want is the translation on the bottom, the encoder vectors on the top also look similar, because each of the word embeddings looks similar. You can also randomly mask out words, basically to make the autoencoding problem more difficult, so the model needs to fill in more information; when you move from English with some words removed to French, the problem also gets harder and you need to infer the missing pieces, but because you’re training with denoising, the model learns to do that. Back-translation is basically what we talked about before, more or less the same. A question: wouldn’t this lead to more hallucination from the model? Basically yes; I think it’s going to be very hard to train a model that works perfectly through purely unsupervised translation. But one reason why back-translation works in the first place is that mistakes in your training data tend to be more random than the correct training data, so as you refine the model more and more and it gets better, the hallucinations, because they’re random, tend to get washed out, whereas the correct translations tend to be reinforced because they’re more consistent. It will still hallucinate and make mistakes, but the hope is that they get washed out.
Step 2: Back-translation
There was also a question about whether we directly use the parallel data shown on the slide during training. This slide is a little bit deceptive, because it shows the supervised back-translation case. In the unsupervised MT paradigm you don’t use any parallel data at all: you use a denoising autoencoding objective to seed your back-translator and then use that to generate data, so you’re never using any actually parallel data. You start out with a model trained using only monolingual data, for example monolingual English and French text.
Performance
This is a graph from the original paper. I think there’s a big caveat in this graph, so you need to be a little bit careful in interpreting it, but basically the horizontal lines are an unsupervised translation model that uses lots of monolingual data but no parallel data.
The non-horizontal lines are the supervised translation model, and basically what they’re showing is that with no parallel data they’re able to achieve scores at about the same level as a system trained on the order of 10^5 parallel sentences. A big caveat here is that they didn’t use any monolingual data in the supervised translation system; if they had, the supervised system would probably be a lot better. But still, it’s kind of interesting that you’re able to do anything at all with unsupervised translation, so as long as you’re aware of the caveats it’s an interesting graph.
So let me go to the open problems in unsupervised MT. This is exactly the problem that Ellen was pointing out: unsupervised machine translation works in ideal situations, basically when the languages are fairly similar, written with similar writing systems, and there are large monolingual datasets in the same domain that also match the test domain. In this particular case they were using data from the European Parliament, so the English and French, or English and German, data came from exactly the same distribution, and that really helps in inducing the lexicon or doing translation. When you have less related languages, truly low-resource languages, diverse domains, or less monolingual data, unsupervised machine translation performs less well. Reasons for this: small monolingual data in low-resource languages can result in bad embeddings, so if we don’t have lots of data in the low-resource language this won’t work because the embeddings will be too poor to get a good initialization; different word frequencies or morphology, like the English-Turkish example I talked about before; and different content, which makes things like back-translation less effective at bootstrapping a translation model. For example, if you’re trying to translate Twitter in one language and you have news text in the other language, back-translation is not going to be good at covering what is said on Twitter. In the interest of time I’ll skip that one, but there are some things that can be done to improve matters, starting with better initialization: recently people have been using things like cross-lingual or multilingual language models to improve the initialization of unsupervised translation models, with objectives like masked language modeling across languages.
Basically, what you can do is train a single multilingual language model as your encoder, or as both your encoder and your decoder, and use that to initialize the translation model. This is good because you’re initializing the whole model as opposed to just the input embeddings, and for various reasons it’s nice to have a single model trained on all the languages. To give one example, even in Chinese, which is written in an entirely different script than English, there are still lots of English words, so if you’re training a combined English-Chinese model on tons and tons of monolingual data, the English words that appear in both languages can help anchor things into the same semantic space. Another approach is masked sequence-to-sequence modeling: there are models like MASS, which has an encoder-decoder formulation of masked language modeling where you mask out a piece of the input and generate that masked-out piece as the output. More recently, a model that lots of people have been using is mBART, the multilingual BART model; the way it works is that you mask out spans of words in the input but then generate all of the words in the output. You train this on tons and tons of data and it becomes a pre-trained model that you can then use to initialize your downstream unsupervised NMT system. I had some material about unsupervised multilingual MT, but I’m going to skip over that in the interest of time.
How practical is this strict unsupervised scenario? One thing I can definitely say is that a lot of the techniques used in unsupervised MT are practical in semi-supervised learning scenarios where we have a little bit of training data: we can either train the model first with an unsupervised method and then fine-tune using the parallel corpus, or train the model using a parallel corpus and update it with iterative back-translation. Part of the reason this is particularly good is that there are very few languages in the world that have lots of monolingual data but no parallel data at all; for example, almost every language in the world that has any amount of monolingual text has parallel data from the Bible, but the Bible is very out of domain and very small. Because of that, using it to seed a model and then doing basically almost unsupervised translation seems like a practical way to do, say, news translation. Another area where these techniques are practical is when you have a task where you legitimately cannot get very much data whatsoever, one example being style transfer from informal to formal text, where there’s not very much data, especially across different languages.
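As a rough illustration of that mBART-style denoising setup (the mask ratio, span lengths, and mask token below are illustrative choices, not the exact mBART recipe): spans of the input are masked, and the seq2seq model is trained to produce the full original sentence as its output.

```python
# mBART-style span masking sketch: masked input in, full original sentence out.
import random

MASK = "<mask>"

def mask_spans(tokens, mask_ratio=0.35, mean_span=3, seed=None):
    rng = random.Random(seed)
    n_to_mask = int(round(mask_ratio * len(tokens)))
    out, i, masked = [], 0, 0
    while i < len(tokens):
        if masked < n_to_mask and rng.random() < 0.2:
            span = max(1, int(rng.gauss(mean_span, 1)))
            out.append(MASK)      # one mask token stands in for the whole span
            i += span
            masked += span
        else:
            out.append(tokens[i])
            i += 1
    return out

sentence = "the quick brown fox jumps over the lazy dog".split()
src = mask_spans(sentence, seed=1)  # encoder input: sentence with masked spans
tgt = sentence                      # decoder target: the full original sentence
print(src, "->", " ".join(tgt))
```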
Another example that has been used recently is translation between Java and Python, where there’s lots of Java and lots of Python but very little parallel data; there is some parallel data, but not very much. So the discussion question is: pick a low-resource language or dialect, research all of the monolingual and parallel data that you can find online for it, and consider whether unsupervised or semi-supervised MT methods would be helpful, and how you could best use the existing resources to set up unsupervised or semi-supervised MT for success on this language or dialect. There is a reference to a paper that you can take a look at to inform the discussion. Cool, any questions before we start the discussion?
Citation
@online{bochman2022,
author = {Bochman, Oren},
title = {Unsupervised {Machine} {Translation}},
date = {2022-02-17},
url = {https://orenbochman.github.io/notes-nlp/notes/cs11-737/cs11-737-w10-unsupervised-NMT/},
langid = {en}
}