Today's lecture is on multilingual question answering. This is the first time this lecture has been given, so please stop me with questions at any point.

Question answering has been a really popular topic in the last few years, so it hardly needs motivation, but for the last 50 or 60 years there has been a futuristic vision of humans asking computers questions; there is a famous example from a fictional movie where humans try to ask a computer what the meaning of life is. While we're not there yet, we are at a point today where people actually use question answering day to day. I'm guessing most of you have used a QA system regularly as part of your daily life.

Here are some QA systems or tasks you might be familiar with. The first is LUNAR; this one you might actually not know, but it was one of the first QA systems to actually be used, developed for NASA to answer questions about the Moon. Around 2011, IBM Watson, which was co-developed with Eric Nyberg's group at CMU, was introduced; this was the first QA system to really achieve human-level performance in some domain, and it beat some of the best Jeopardy! contestants in the world. Then in 2016 the SQuAD dataset was released, and this really spurred a huge amount of research into neural-network methods for question answering.

Do you notice anything in common among all these systems I just described? They are all in English, and that is really what we're going to talk about today: how can we take these advances, which have been tremendous, and apply them to languages that are not English? I'm curious: can you raise your hand if you've actually tried to use a digital assistant to answer a question in a language other than English, or about a topic outside the English-speaking world? That's actually a lot of people. And how well did it work? Okay, so it sounds like there are issues with the speech side of it and also with the question-answering and retrieval part.

This is a paper that was recently released by Graham and collaborators where they indexed different NLP tasks along two axes: one is the number of speakers, i.e., how much demand there is for the technology, and the y-axis is the quality of the technology in that language. For QA you see a huge drop-off after maybe the seventh or eighth most popular language, a dramatic drop in performance, and this is actually a lot worse than most NLP technologies; for something like parsing it's a much smoother curve. I think this is important because, to me, question answering is really about making information more easily accessible to people, and when you can't access information in other languages you are restricting information access for a large portion of the world.

There are two kinds of QA we'll talk about in this lecture. The first is open-retrieval QA, or open-domain QA, where you have a question like "Where did Beyoncé grow up?" and a corpus like Wikipedia that is presumed to contain the answer; you search your question against the corpus, get a document that might contain the answer, and then extract an answer from that document, Houston being the answer here.
The next kind is knowledge-graph question answering. Here you don't assume that your answer is in a corpus; rather, you have a knowledge graph that might be constructed by humans or automatically generated using information extraction, and you convert your question into a graph query against the knowledge graph and get the answer that way. Knowledge-graph question answering is especially useful in situations where you have structured data or you're querying against some kind of automatically generated data. In this example, "What will tomorrow's high temperature be?", it is very unlikely that any corpus will actually contain the answer, but a knowledge graph could be automatically generated from government data every day, and that is where this can be really valuable.

Before I jump into open-retrieval QA, are there any questions about what we've covered so far? Okay. We'll focus on open-retrieval QA, which in the multilingual setting has seen a really large amount of activity in the last few years, and then spend only about ten minutes at the end talking about knowledge-graph QA.

For open-retrieval QA, drilling into this example, we first want to extract the document from Wikipedia that is likely to contain the answer; this subtask is known as passage retrieval. After passage retrieval you have a document that is likely to contain the answer to your question, and the next step, called reading comprehension, is: given this evidence, what is the actual succinct answer?

Passage retrieval is a standard search-engine or information-retrieval problem: you are given a query and a set of documents, and you want to retrieve the most relevant document d for the query. When you frame it like this, the question is how to define the score function that scores a document given a question, and there has been a lot of work on how to define this score in some kind of optimal way. The most traditional approach, which I'm guessing a lot of you are already familiar with, is called TF-IDF. The motivation is that you want to choose documents that contain a lot of the terms in your query, but you might notice that not all query terms are equally valuable: in "Where did Beyoncé grow up?", words like "did" or "where" are very common and will occur in a lot of documents, so the IDF term, the denominator here, discounts terms that occur in many documents. Ultimately you are trying to find documents that contain a lot of terms that don't occur in every document.

TF-IDF was introduced about 50 years ago and is still very popular, but in the last 20 years a modification has become pretty much the standard baseline in passage retrieval, and this is called BM25. It is really conceptually the same thing as TF-IDF: on the left you have an IDF term, with the document frequency effectively discounting the importance of the term, and on the right you have your term weight. Where it really differs is in the observation that plain TF-IDF favors documents that are long, because long documents are more likely to have term matches for any query. So here you divide the document length by the average document length and use this to discount the term weight: if you have a document that's much longer than average, it had better have a much greater term weight in order for us to recommend it. BM25 has two hyperparameters, k and b, where k controls how much weight you give to the term-frequency component and b controls how much you care about the document-length normalization.
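To make the scoring concrete, here is a minimal sketch of a BM25 scoring function along the lines described above. The toy corpus, the whitespace tokenization, and the parameter values (k1=1.5, b=0.75, a common Lucene-style IDF variant) are illustrative assumptions, not something from the lecture.

```python
import math
from collections import Counter

def bm25_score(query_tokens, doc_tokens, doc_freq, n_docs, avg_doc_len, k1=1.5, b=0.75):
    """Score one document for one query: IDF-weighted, length-normalized term frequencies."""
    tf = Counter(doc_tokens)
    score = 0.0
    for term in query_tokens:
        df = doc_freq.get(term, 0)
        if df == 0:
            continue  # term never appears in the collection
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))  # discounts very common terms
        norm = 1 - b + b * len(doc_tokens) / avg_doc_len      # penalizes longer-than-average docs
        score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * norm)
    return score

# Toy collection; tokenization is naive whitespace splitting here.
docs = [["beyonce", "was", "born", "and", "raised", "in", "houston"],
        ["the", "moon", "is", "earths", "only", "natural", "satellite"]]
doc_freq = Counter(t for d in docs for t in set(d))
avg_len = sum(len(d) for d in docs) / len(docs)
query = ["where", "did", "beyonce", "grow", "up"]
print([round(bm25_score(query, d, doc_freq, len(docs), avg_len), 3) for d in docs])
```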
These hyperparameters are typically tuned against some kind of validation set for your language. One important note about BM25 that is really relevant to this class is that it doesn't require learning any statistical model; it is effectively just a heuristic that you can run on your language. This means you don't need any data: you can run it on English or on Tamil or whatever language you care about.

Just to highlight how amazingly well BM25 performs: you might have heard about OpenAI embeddings, the flashy new API from OpenAI that promises to encode any kind of text and do really effective search. BM25 was listed as their baseline, and they didn't really talk about this much in the paper, but it is surprising that their small model is actually worse than BM25 on most of the metrics, even though it was trained with self-supervision on a huge unlabeled dataset. BM25 requires very simple indexing that probably costs a couple of dollars, whereas the OpenAI model, which is actually worse than BM25, would cost about twenty thousand dollars to index a few million articles. So BM25 is a really great tool, and even compared with the state of the art it can be effective. I believe, though I don't remember exactly, that I'm comparing BM25, which is the top row, with the smallest version of the OpenAI API, and on three of the four metrics the OpenAI system performs worse; when you get to a much larger self-supervised model it improves.

(In response to an audience question:) that absolutely is a problem, and it is a good point: if you have a language that is very morphologically rich, then your token overlap is going to depend heavily on how well you tokenize, that is, how well you break up the morphemes in that language. Tokenization is orthogonal to the retrieval algorithm, but BM25 is going to be very sensitive to how well you tokenize that language.

Okay, so one problem with TF-IDF is that it will only recommend documents that contain a lot of the terms in the query. Let's look at this example again: "Where did Beyoncé grow up?" The document that we know contains the answer does contain the word "Beyoncé", but it doesn't contain "grow up"; rather, it contains "raised", which we know means the same thing. There is a vocabulary-mismatch problem here, and to solve it we can use standard pre-trained language models. A paper from a couple of years ago called Dense Passage Retrieval (DPR) has become the standard baseline for passage retrieval using neural networks. In this method you encode the query and the document into vectors using the CLS token, and then you can score the match between the query and the document as the dot product of these two vectors, so it is conceptually very simple.
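Here is a rough sketch of that dual-encoder scoring using the Hugging Face transformers library. The real DPR uses two separate question and passage encoders that have been fine-tuned for retrieval; using a single off-the-shelf multilingual BERT checkpoint here is an assumption for illustration only, and as noted below it will not work well without fine-tuning.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Any BERT-style encoder works for this sketch; the checkpoint choice is an assumption.
name = "bert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(name)
encoder = AutoModel.from_pretrained(name)

def cls_embed(texts):
    """Encode a batch of texts and return the [CLS] vector for each."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        out = encoder(**batch)
    return out.last_hidden_state[:, 0]  # the [CLS] token embedding

# Offline: encode the corpus once and cache the vectors.
passages = ["Beyoncé was born and raised in Houston, Texas.",
            "The Moon is Earth's only natural satellite."]
passage_vecs = cls_embed(passages)

# Online: encode the query and score every passage with a dot product
# (brute-force maximum inner product search).
query_vec = cls_embed(["Where did Beyoncé grow up?"])
scores = query_vec @ passage_vecs.T
print(passages[int(scores.argmax())])
```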
You might be wondering how this is practical, because running BERT against a huge corpus like all of the documents in Wikipedia is really expensive; it might take hours or days, so how can you do this every time you want to answer one query? The reason DPR is so effective is that you can index your search corpus with BERT offline, every night or just once, so that you have static vectors for all the documents. Then, when a new query comes into your search engine, you only need to encode the query with BERT, and that is just one sentence, so it can be quite fast. You then compare that query vector against your offline document vectors, and this comparison, known as maximum inner product search, has algorithms that do it really efficiently.

The major downside of dense passage retrieval is that if you take these BERT encoders as pre-trained from the original BERT without any fine-tuning, it often won't work very well. For it to work really effectively you want to train it, or fine-tune it, on a set of questions that have each been tagged with documents that are known to be relevant or not relevant to the query, and then use this to fine-tune your encoders. For this to work well you actually need quite a large dataset, which does not exist for most of the world's languages, and this motivates what comes later. Despite this, just to illustrate how popular DPR is today, here is the Papers with Code leaderboard for Natural Questions, which is a really popular QA dataset right now: the top six methods for passage retrieval on this dataset all use some variant of DPR, and the seventh uses BM25, but it is quite a bit worse here.

Before we move on to reading comprehension, are there any questions about passage retrieval? Okay. Once you've retrieved the passage that is likely to contain the answer, the next task is to extract a single answer from that passage, and this is known as reading comprehension. Sometimes, given a question like "Who is Beyoncé's sister?", the question cannot be answered by the passage, and in that situation you have to tell the user as much.

Having framed the problem, there are two kinds of reading comprehension we'll cover in this lecture. The first is extractive question answering, where you assume that the answer is contained as a span of the passage; in this example "Houston" is a substring of the passage. The next kind is a more challenging setting where the answer is freely generated using the passage as evidence. In this example, which is actually a harder question than before, "Is Beyoncé from the southern United States?", we know that Houston, Texas is in the southern United States and the answer is "yes, she is from the southern United States", but "yes" is not contained in this passage. These are the two problem settings we'll talk about, and we'll first establish a baseline for each.

For extractive QA there is an approach using BERT that has become quite popular; it is really simple but seems to do the job most of the time. As a refresher: you have a question and a passage that has already been retrieved; you concatenate the two together, pass them into BERT, and encode each token in the passage. The key is that because BERT is a contextual encoder, each token in the passage has been able to self-attend to the question as well, so you have a question-aware representation of each token. Then, for each token in the passage, you take that encoding and pass it through a simple linear classifier that predicts how likely it is that the token is the start or the end of the answer. One downside of this method is that you predict the start and the end of the answer independently, which doesn't actually make a lot of sense: you would expect the start index to come before the end index, and the end to be relatively close, not a thousand tokens away. Somehow, models just tend to learn this behavior most of the time, and it isn't too much of a problem. When it comes time to make a prediction, you choose the index with the maximum probability of being the start, the one with the maximum probability of being the end, and whatever is in between is the predicted answer span.
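A minimal sketch of this extractive baseline, again with transformers. The specific checkpoint (an English SQuAD-fine-tuned model) is an illustrative assumption; any extractive-QA checkpoint, including a multilingual one, plugs in the same way.

```python
import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

name = "distilbert-base-cased-distilled-squad"  # illustrative choice of checkpoint
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForQuestionAnswering.from_pretrained(name)

question = "Where did Beyoncé grow up?"
passage = "Beyoncé Giselle Knowles-Carter was born and raised in Houston, Texas."

# Concatenate question and passage; every passage token can attend to the question.
inputs = tokenizer(question, passage, return_tensors="pt")
with torch.no_grad():
    out = model(**inputs)

# Independent start/end classifiers over every token, then argmax decoding.
start = int(out.start_logits.argmax())
end = int(out.end_logits.argmax())
answer = tokenizer.decode(inputs["input_ids"][0, start:end + 1])
print(answer)  # ideally something like "Houston, Texas"
```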
Any questions? That previous method assumes the answer is a span in your passage, but what if you want to do free-form answer generation? In this setting another very simple, closely related baseline is again to concatenate the question and the passage and then use a sequence-to-sequence model, T5 being a very popular pre-trained one. You encode the question and passage together to get a joint representation of them, and then pass this to a decoder that generates the answer token by token. This is a classic encoder-decoder model, and this way you are directly generating an answer using a pre-trained model.
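A sketch of that generative baseline with a T5-style model. The checkpoint name and the "question: ... context: ..." prompt template are assumptions chosen for illustration; a model fine-tuned for QA would of course do better than an off-the-shelf one.

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

name = "google/flan-t5-small"  # any T5-style checkpoint; this choice is an assumption
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSeq2SeqLM.from_pretrained(name)

question = "Is Beyoncé from the southern United States?"
passage = "Beyoncé was born and raised in Houston, Texas."

# Encode question and passage jointly, then decode the answer token by token.
inputs = tokenizer(f"question: {question} context: {passage}", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=16)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))  # e.g. "yes"
```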
Returning to the goal of this class, which is multilinguality: most datasets in open-retrieval QA are only in English. Here is the set of most popular question-answering benchmarks on Papers with Code, and if you look at these 15 benchmarks, literally all of them are only in English. There are two questions we're going to ask in the rest of this lecture: first, can we support questions in a language where we may not have data? And second, can we search against a corpus whose language may be different from the language the question is asked in?

This introduces two related problem settings: multilingual QA and cross-lingual QA. In multilingual QA, your questions, the corpus you're searching against, and your answers are all in the same language, but it may be a language you don't have data for, or one where you want to leverage multilinguality in some way. Maybe you want to do Chinese question answering but you only have access to an English-language dataset like SQuAD; how can you do this? That's the first problem we'll talk about. The second, which is really just a more general version, is: what if you want to ask questions in one language, search against documents in another language, and maybe even return an answer in a third language? This is called cross-lingual QA. Usually the questions and answers are in the same language, because that satisfies most practical use cases. In this example you have a question in Tamil asking where Beyoncé grew up. Beyoncé is very popular in south India, but there may be less information about her on Tamil Wikipedia than in English, so here we want to search against English Wikipedia, get a document that contains the answer, and then return the answer to the user in Tamil. That is cross-lingual QA.

Now we'll talk about three approaches to solving both of these problems. The first is zero-shot transfer. You might remember from Patrick's lecture three lectures ago the different multilingual pre-trained models like mBERT and XLM-R. These models have basically been taught what many different languages look like, their statistics and their properties, through the pre-training tasks, but they have not been taught to do question answering specifically. The zero-shot transfer approach is to take one of these pre-trained multilingual models, fine-tune it on an English-language QA dataset like SQuAD, and then hope that by learning to do English QA, and based on its knowledge of what Tamil looks like, it can also do Tamil QA without any explicit supervision. We saw in Patrick's lecture that these models tend to perform well on things like classification using this exact approach, so does the same apply to QA? The answer is that it actually works perhaps a lot better than you might imagine: a model based on mBERT achieves around 45 exact-match accuracy on a multilingual QA dataset, which is perhaps serviceable for your application, but simply by adding just a few target-language examples, like four or ten, you can get a three-to-five-point improvement, which suggests there is actually a lot of room for improvement beyond this naive strategy.

(In response to a question:) that's right, it was fine-tuned on English QA. It means you take your multilingual model, fine-tune it on English QA, and then just use it: give it input in a new language and have it search, potentially against documents in a new language as well. You're never training it how to do Tamil QA, and it's pretty amazing that it works at all; it basically relies on the embedding spaces being similar enough. In this dataset that's the assumption they make, but you don't have to make that assumption: in theory this approach could also support cross-lingual QA.

The next approach is actually a lot older; it has been around since the 2000s, and Teruko from LTI was working on this 15 years ago. This approach is to use translation. Imagine you have a question in some language: a really obvious solution is to translate that question into English, use an off-the-shelf English QA system, which we know works quite well, get the answer in English, and then translate it back into the language of your choice. This is known as translate-test, and if you read the Ruder reading, which is one of the three readings for this class, he talks about this method and its downsides. There are a few issues with it, even though it is really simple and actually quite easy to engineer. First, you are relying heavily on an MT system to translate your question to and from English, and for a lot of languages the MT systems are just not that good, so you're going to see pretty serious error propagation. The second, more nefarious issue is that you are assuming all your questions can be answered using knowledge available in English. This is a very Anglo-centric view, and a lot of practical use cases in non-English-speaking countries are not going to be satisfied by it; we'll talk more about this second point at the end of the lecture. I'll actually just skip this next part.
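For completeness, here is the shape of the translate-test pipeline as pseudocode. The two helper functions are placeholders, not real library calls; they stand in for whatever MT system and English QA system you would actually use.

```python
def translate(text: str, src: str, tgt: str) -> str:
    # Placeholder: call whatever MT system you have for this language pair.
    raise NotImplementedError

def english_qa(question: str) -> str:
    # Placeholder: any off-the-shelf English open-retrieval QA system.
    raise NotImplementedError

def translate_test(question: str, lang: str) -> str:
    """Translate-test: question -> English -> English QA -> answer -> back to `lang`.
    Any MT error propagates into retrieval and answer extraction, and the method
    assumes the answer can be found in an English corpus."""
    en_question = translate(question, src=lang, tgt="en")
    en_answer = english_qa(en_question)
    return translate(en_answer, src="en", tgt=lang)
```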
The final approach we'll talk about builds off the first approach of using a multilingual encoder. This is a method for doing cross-lingual QA without doing any translation, to bypass some of the issues we just saw, and not only do we want no translation required, we also want to be able to search for and accumulate answers across multiple different languages: how can I combine Chinese, Japanese, and English Wikipedia together to get one right answer in the target language? This is the architecture. The first piece is a multilingual variant of dense passage retrieval, the neural passage-retrieval method we discussed earlier; this multilingual DPR simply uses a multilingual encoder to retrieve passages, in any language, from queries in any language. Next, given a question, here in Spanish, you can extract relevant passages from multiple languages, here Spanish and Hebrew. Then, given these multilingual evidence passages, we want to generate an answer in the target language, which here could be English, and to do this they leverage another multilingual pre-trained model, a multilingual generator, mT5, that conditions on the evidence and generates an answer in the language of choice. (In response to a question:) yes, I believe it can really be any multilingual generator; off the top of my head mT5 would be the one you'd use, but any of those should be able to support this architecture. The generator that produces the answer is not assuming anything about where in the passages the answer came from.

This is a really nice, clever approach when you look at it, but if you just tried it using off-the-shelf multilingual encoders and generators it wouldn't work, because you actually do need training data to make these things work, and that is super challenging: while there are datasets for multilingual passage retrieval, there is almost no data that can support generating an answer from evidence in different languages; that task doesn't really have any dataset right now. In theory that would be a huge issue, but they propose a solution that somehow works, which is iterative self-training. They start with a version of the DPR model that has not been fine-tuned, and then, given a bunch of questions for which you may not know the answer, you extract passages from Wikipedia, choose the ones that are likely to be correct using the generator, and treat those as pseudo-answers. You are effectively running your model to get pseudo-answers, adding them to your training set, retraining your model, and repeating this multiple times until you reach satisfactory performance.

We had a question earlier about what model they use here, so I looked it up: they used multilingual BERT, and they said they tried XLM-R and it didn't improve results. This is kind of interesting because it matches a general trend I've found: if you're deciding between mBERT and XLM-R, it seems that mBERT is often better at content-based modeling, like modeling entities and things like that, whereas XLM-R seems better at things like identifying word types, so part-of-speech tagging, dependency parsing, and reading and identifying spans in span-based question answering. So for something like looking up content within DPR, based on my previous experience, it makes sense to me that mBERT is better. Of course there are a whole lot of other models too, but it's worth remembering that some models are better at some things. (And to the other question:) yes, in the beginning I believe the encoder is not fine-tuned for any kind of sentence-encoding task; it is iteratively fine-tuned.
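A very rough sketch of the shape of that iterative self-training loop. Every function here is a placeholder and the selection criterion is simplified; this is not the authors' training code, only an outline of the idea under stated assumptions.

```python
# Placeholders standing in for the retriever, generator, and training routines.
def retrieve(retriever, question, k=20): ...
def generate_answer(generator, question, passages): ...
def looks_correct(answer, question): ...   # e.g. agreement with weak labels; an assumption
def finetune(model, examples): ...

def iterative_self_training(retriever, generator, questions, rounds=3):
    """Mine pseudo-answers with the current models, keep the plausible ones,
    retrain on them, and repeat until performance is satisfactory."""
    train_set = []
    for _ in range(rounds):
        for q in questions:
            passages = retrieve(retriever, q)                  # multilingual DPR
            answer = generate_answer(generator, q, passages)   # mT5-style generator
            if looks_correct(answer, q):                       # keep trusted pseudo-labels only
                train_set.append((q, passages, answer))
        retriever = finetune(retriever, train_set)
        generator = finetune(generator, train_set)
    return retriever, generator
```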
Okay, before we look at how effective this method is, we need to define a few metrics. By the way, there is a class at CMU on question answering, so if you're really interested in the nitty-gritty here, take that class with Eric Nyberg and Teruko. The three metrics they use, which are all pretty popular, are these. The first is the F1 score: given a generated answer and the actual answer, you compute the precision and recall of the token overlap and then compute the F1 score from those (there is a small code sketch of token-level F1 and exact match a bit further down). Exact match asks whether the generated answer is exactly the same as the true answer, which here it is not. BLEU score was defined in a previous lecture, so I won't go into it, but it is effectively n-gram overlap.

On XOR-QA, which is a cross-lingual QA dataset, they post quite strong performance, and I believe human-level performance is around 70. The fifth method here is exactly what we called translate-test earlier, so you can see how much better a fine-tuned neural method can be than that naive translation-based approach. (In response to a question:) they're using monolingual data, that's an interesting point. He mentions that in the iterative self-training they assume access to Wikipedia, which they use to help refine the pseudo-answers, and asks whether that access to information is kind of cheating relative to other systems that don't assume access to that kind of external knowledge. I think you can make an argument for that, but if you just view this as trying to solve the problem as effectively as possible, maybe it doesn't matter. I do want to highlight that the state of the art on a very similar English dataset, Natural Questions, which was actually collected in almost exactly the same way, is around 80, so even though this approach is effective and state of the art for this task, it is still a lot worse than English; there is a lot of room for improvement. (The CORA model? Yes, it's both the model and the approach. I think that's a good question.)

Within English we actually have a paper, "How Can We Know When Language Models Know?", and basically what the paper does is measure the calibration of generative, or extractive, language models that read and then generate an output. Calibration means that when the model says there is a 50 percent probability of the answer being correct, there actually is a 50 percent probability of it being correct; if the model is basically right about that probability, you can just choose what level of confidence you need before you output an answer. The models aren't necessarily very good at this out of the box, they can be way overconfident, but you can adjust them relatively easily to be well calibrated, and if they are well calibrated, that works. I'm pretty sure there's nothing on this in the multilingual setting, so it might be something for a class project.

I'm not sure about XQuAD, but I know that TyDi QA, which I'll mention later, does have unanswerable questions, and this is actually a great segue into datasets for multilingual QA. I'll just breeze through this. There are two datasets, XQuAD and MLQA, which are both extremely popular; they do just machine reading comprehension, with no retrieval, so you are given a passage and a question, and the answer is either contained or not contained in the passage. MLQA is larger, but both are based on taking an English-language QA dataset and translating it into the language of choice.
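Going back to the F1 and exact-match metrics defined above, here is a minimal sketch of how they are typically computed at the token level. The whitespace tokenization and lowercasing are simplifications; the official SQuAD-style scripts also strip punctuation and articles.

```python
from collections import Counter

def exact_match(prediction: str, gold: str) -> bool:
    return prediction.strip().lower() == gold.strip().lower()

def token_f1(prediction: str, gold: str) -> float:
    """Token-overlap F1 between a generated answer and the reference answer."""
    pred_tokens, gold_tokens = prediction.lower().split(), gold.lower().split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(token_f1("Houston Texas", "Houston"), exact_match("Houston Texas", "Houston"))
# ~0.667, False
```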
MKQA is an open-retrieval QA dataset based on Natural Questions, an English-language QA dataset, and they again translate all of the questions, answers, and corpora into the language of choice. TyDi QA is, I think, one of the most impressive ones here. It is another very large dataset, but here they collect questions natively in each language: they have users in, say, Tamil — actually I don't think Tamil is one of the languages — but in each language they have users generate natural questions in that language, and then they search against the Wikipedia of that language, and here they do have unanswerable questions. The reason I think TyDi QA is so important is that every other dataset translates English QA pairs into a target language, and there are two really big issues with this. The first is that the translated questions, even using a human translator, may not actually be natural in your target language: say you are translating an English question into Russian, which is mostly a free-word-order language; it is very likely that your translator will actually keep English word order, because that was the source language, and that leads to a version of Russian that is not actually that natural. The second is that the answers are assumed to be Anglo-centric: you assume they can be found in English Wikipedia. Here is an example of a question, "Which Indian lawyer advocated the preservation of Marina Beach in Chennai?" This question can be answered using Tamil Wikipedia — I actually checked — but it cannot be answered using English Wikipedia, so any kind of translation-based approach will simply not cover these kinds of questions. There is also a version of TyDi QA that supports cross-lingual QA: whenever you have an unanswerable question, they look it up in English if it exists there, and that lets you test cross-lingual retrieval as well. Several of these benchmarks are included in the XTREME multilingual benchmark.

Now, I'll just spend a couple of minutes — we're almost out of time — talking about knowledge-graph QA. In knowledge-graph QA the goal is, given a question, to convert it into a query against the structured knowledge graph. This query often takes the form of a query language like SPARQL, which is like a version of SQL for graph queries, and this skill can be concisely called semantic parsing. In terms of multilingual knowledge-graph QA, very little work has been done in this space; I would say I have only seen papers from the last year or two working on it, and there are two known methods, both very related to what we've already talked about for open-retrieval QA. The first is zero-shot transfer, where you take a multilingual encoder, train it for English KGQA, and then try to transfer it to another language. The second is translation-based adaptation, and in the paper by Zhou et al. they use this approach, but they make the observation that even if you don't have access to a good translator for your language, in knowledge-graph QA you may be able to get away with much more word-level, local translation, because in a knowledge graph, and in the questions that query it, you don't need as much paragraph-level or sentence-level global context. So here they use unsupervised bilingual lexicon induction to do word-level translation of an English dataset into a target-language dataset, and they find that this actually works pretty well.
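As a toy illustration of that word-level idea, here is what lexicon-based substitution looks like. The tiny English-to-Spanish dictionary and the fall-back behavior are made up for illustration; the actual method induces the lexicon without supervision and targets the languages in the paper, not Spanish specifically.

```python
# Toy bilingual lexicon; in the paper the lexicon is induced without supervision.
lexicon = {"who": "quién", "wrote": "escribió", "hamlet": "hamlet"}

def word_level_translate(question: str, lexicon: dict) -> str:
    """Translate a KGQA question word by word, leaving unknown words unchanged.
    Crude, but entity- and relation-heavy KGQA questions tolerate this better
    than free-form text that needs sentence-level context."""
    return " ".join(lexicon.get(w, w) for w in question.lower().split())

print(word_level_translate("Who wrote Hamlet", lexicon))  # "quién escribió hamlet"
```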
Even though there hasn't been a lot of work on methods for knowledge-graph QA, there have been a few datasets in the last couple of years; the slides will be posted and you can look at these in your own time. This actually creates an opportunity: in knowledge-graph QA there haven't been many methods developed, but there are datasets you can use, so I think there are a lot of interesting questions that could be answered with this data that people haven't really thought about.

Some other interesting open problems that you could think about for your final project, or for the near future: in cross-lingual QA we showed that there is still a huge gap between the state-of-the-art methods and either English-language QA or human-level performance in the target language, which suggests we are pretty far from having solved this problem in any meaningful capacity. Another point is that in multilingual QA there is a lot of information that is actually non-linguistic that you can use to make the task easier — you can look at images in Wikipedia, or fact tables and infoboxes — and not much work has tried to do this, so I think it is a really promising area for research. Lastly, building off Alan's lecture on Tuesday: if you go to India and ask people to interact with a Hindi-language QA system, they are very likely to ask questions in code-mixed Hindi and English, and I don't know of much work that has actually tried to solve this. So these are some open problems in multilingual QA.

For the discussion today we had three different readings; the first two were papers from conferences and the third was a kind of summary blog post from an EMNLP tutorial. The question is to think about a practical QA application in a language or domain of your choice and, given this application, to think about at least two of these things: who would actually find this application useful; what methods do you think could actually work for it; and, if you're interested in thinking like a researcher, what resources or strategies might you use to either improve the system's performance or evaluate how well it works.

Any last questions before we go? (On the question about mT5 versus XLM-R:) just because cross-lingual QA is the thing that's included in all the benchmarks, and XTREME especially is about zero-shot transfer. (On scoring passages from different languages:) you're asking how you can score passages from different languages against each other; I guess you can assume that your ranker can assign scores to passages from different languages in some kind of language-independent way. I think that would be the goal, and that's what it tries to do, and I think it actually works; it's a good question. I think it was actually developed here, as I said, so I think that's actually the best approach. That was QA that involved translated or transcribed text, and because it's transcribed there are additional challenges. I've worked a little bit on this, but I don't think there's a whole lot in terms of multilingual scope, though it would be super useful for this.
Outline
- Introduction to Question Answering (QA)
  - Humans have long envisioned asking computers questions.
  - QA is now part of daily life.
  - Examples of QA systems:
    - LUNAR
    - IBM Watson
    - SQuAD
  - Traditional QA systems are mostly English-language based.
- The Importance of Multilingual QA
  - Multilingual QA helps people access information easily.
  - There is a steep drop in QA technology quality after the top languages.
- Types of Question Answering
  - Open-Retrieval QA (aka Open Domain QA)
    - Given a question, the system finds an answer from a text corpus.
  - Knowledge Graph QA
    - Given a question, the system finds an answer from a knowledge graph.
- Open-Retrieval QA in Detail
  - Steps:
    - Passage Retrieval: Extracting the document from a corpus that likely contains the answer.
    - Reading Comprehension: Extracting the succinct answer from the retrieved document.
- Passage Retrieval Techniques
  - TF-IDF
    - Selects documents containing many query terms, discounting terms that occur frequently in many documents.
  - BM25
    - A modification of TF-IDF that discounts term weight based on document length.
    - Does not require a statistical model.
  - Dense Passage Retrieval (DPR)
    - Encodes queries and documents into vectors using models like BERT.
    - Indexes the search corpus offline.
    - Requires fine-tuning on a dataset of questions with relevant documents.
- Reading Comprehension Techniques
  - Extractive QA
    - The answer is assumed to be a span in the passage.
    - A common approach is to concatenate the question and passage, encode each token, and predict the start and end tokens of the answer.
  - Generative QA
    - The answer is freely generated using the passage as evidence.
    - A sequence-to-sequence model can be used to generate the answer token by token.
- Multilingual QA Problem Settings
  - Multilingual QA
    - Questions, corpus, and answers are all in the same language.
  - Cross-lingual QA
    - Questions, corpus, and answers can be in different languages.
- Approaches to Multilingual QA
  - Zero-Shot Transfer
    - Fine-tune a multilingual model on an English QA dataset and transfer it to a new language.
    - Adding a few target-language examples can improve performance.
  - Translation-Based Adaptation
    - Translate the question to English, use an English QA system, and translate the answer back.
    - “Translate-Train” involves translating the full training data to the target language.
  - Multilingual Retriever-Generator
    - Uses a multilingual DPR to retrieve passages in any language and a multilingual generator to produce the answer.
    - Requires iterative self-training due to lack of training data.
- Evaluation Metrics for QA
  - F1 Score
    - Calculates precision and recall of token overlap between the generated and actual answers.
  - Exact Match (EM)
    - Measures how often the generated answer exactly matches the true answer.
  - BLEU Score
    - N-gram overlap between the generated and actual answers.
- Multilingual QA Datasets
  - XQuAD and MLQA
    - Machine reading comprehension datasets based on translating English datasets.
  - MKQA
    - Open-retrieval QA dataset based on translating Natural Questions.
  - TyDi QA
    - QA pairs collected naturally in multiple languages.
  - XOR-QA
    - Based on TyDi QA with added English translations for unanswerable questions.
- Knowledge Graph QA
  - Involves converting a question into a query against a structured knowledge graph.
  - Methods include zero-shot transfer and translation-based adaptation.
- Open Problems in Multilingual QA
  - Cross-lingual QA performance is still far from English QA or human-level performance.
  - Leveraging non-linguistic cues (images, fact tables).
  - Answering code-mixed questions.
Papers
The papers mentioned in the context of multilingual question answering are:

- One Question Answering Model for Many Languages with Cross-lingual Dense Passage Retrieval (Asai et al. 2021)
- Improving Zero-Shot Cross-lingual Transfer for Multilingual Question Answering over Knowledge Graph (Zhou et al. 2021)
- Dense Passage Retrieval for Open-Domain Question Answering (Karpukhin et al. 2020)
- TyDi QA: A Benchmark for Information-Seeking Question Answering in Typologically Diverse Languages (Clark et al. 2020)
- A BERT Baseline for the Natural Questions (Alberti, Lee, and Collins 2019)
- Bootstrap Pattern Learning for Open-Domain CLQA (Shima and Mitamura 2010)
- Natural Questions: A Benchmark for Question Answering Research (Kwiatkowski et al. 2019)
- On the Cross-lingual Transferability of Monolingual Representations (Artetxe et al. 2019)
- RuBQ 2.0: An Innovated Russian Question Answering Dataset (Rybin et al. 2021)
- QALD-9-plus: A Multilingual Dataset for Question Answering over DBpedia and Wikidata Translated by Native Speakers (Perevalov et al. 2022)
- Multi-domain Multilingual Question Answering (Ruder 2021, blog post)
Citation
@online{bochman2022,
author = {Bochman, Oren},
title = {Multilingual {Q\&A}},
date = {2022-02-15},
url = {https://orenbochman.github.io/notes-nlp/notes/cs11-737/cs11-737-w12-qa/},
langid = {en}
}