- Unsupervised MT
- Unsupervised Pre-training (LM, seq2seq)
We are going to talk about another class of languages today: languages that are drifting into new languages, or maybe turning into new languages, and cases where languages are mixed, often called code switching. We are interested in all the languages that people actually speak, not just the official languages, which happen to be whatever countries define their languages to be.

Most people in the world are multilingual. It’s only people from the United Kingdom and America who are unusually monolingual, and they’re a really weird bunch compared to the rest of the planet. Most people grow up speaking multiple languages and go through life with them. They have languages for different purposes: languages they speak at home, languages they speak at university and in education, languages they speak in business. These are often quite different languages, and historically they can come from quite different groups. They can be colonial languages, or they can be other local languages that are maybe related to one another. The distinction is not always well defined. Sometimes people will just call it a dialect; sometimes people will just say, well, that’s city speech as opposed to rural speech. Sometimes people will speak two quite distinct languages at the same time, I mean not overlapping, but sharing words and sharing grammar between them.

We’re quite used to the idea of borrowing a word from another language when you’re speaking. But you also find that it’s very common among groups of people who know the languages, and know that the other people know the languages, to flip between the languages within the sentence and possibly even within the word. So you get two languages in one space. This can even happen within the word, especially where we have a rich morphology
language, where you’re borrowing a word and then doing something to it: adding some form of conjugation to it, or forming a plural from one language on a word from the other. It’s relatively common. We call this code switching or code mixing, and there are some groups of people on the planet who do it almost all the time and can’t even work out when they’re not doing it.

I’d like to point out, for any of you who only speak one language (and I’ll naively assume that if you only speak one language it’s English, because if it’s not English you won’t understand what I say), that monolinguals do this all the time too, because they flip dialects. They will say something in standard English, if we’re talking about English, and then they might say something in Scots dialect afterwards to get a laugh out of people, because they’re trying to do something with how they’re speaking. Flipping between dialects can in some sense be code switching. And even if you don’t do that, think, if you’re a monolingual person, about when you swear. There are some circumstances where it is acceptable to swear and some where it’s not, and depending on who you’re talking to and what you’re talking about, you may choose between swear words and other crude forms and a more standard form that you might use if you’re standing up to speak. It’s the same idea: there’s some mixture of language choice, vocabulary, and possibly grammar going on. This colloquial, formal, or swearing distinction is found in all languages, but in some communities it is an actual split between completely different languages.

Sometimes these forms of language come up out of necessity, where you have two groups of people who do not speak the same language and are trying to find a lingua franca, a way to communicate. What happens is that a new language
gets developed, and that new language may be a mixture of both languages. It’s often a simplification, because nobody is a native speaker, and it allows them to communicate. These are often called pidgins. They’re often regarded as not being official languages, but they are how people actually communicate, and over time they become quite formal and become the standard way that people communicate. Sometimes they actually have "Pidgin" in the name.

What often happens is that the pidgin becomes the thing that people speak, and after a while people start speaking it at home, and then, as they’re bringing up kids, the kids learn it natively. When the kids speak it, we usually call it, linguistically, a creole rather than a pidgin, because there are now native speakers of the language, and native speakers of a language are allowed to make other decisions about the direction of the language. It’s not fully defined, but we usually define creoles as having native speakers, even though they may be derived from other languages, maybe multiple languages, while pidgins are ones where there are no native speakers. So: creoles have native speakers; pidgins do not yet. That is what we mean linguistically when a language is called a creole or a pidgin. But the names we use for languages and dialects do not necessarily follow that rule: there are actual languages on the planet that are called a creole or a pidgin and might not be an actual pidgin or creole. They’re all derived and mixed from some other languages, but they’re on their way to being a more formal native language for a particular area.

I’m going to play some examples here. Typically pidgins are not written; they’re for speech communication, so the writing systems for them are somewhat non-standard, and therefore they can be quite difficult to read even though people can understand them quite
well. Think about dialects of English or, if you know them, dialects of Chinese that you’re not used to actually seeing written. Shanghainese might be understandable to you, but you don’t really know how to write it, and when you’re trying to read something somebody has written in it, it’s hard, because you’re just not used to seeing it.

The first one I’m going to play here is Jamaican Patois. Patois, or Patwa, is another word for a creole, and it is spoken in Jamaica. It has a certain amount of English in it, so you’ll sort of understand some of this. It doesn’t look anything like English, but this says "14 generations", and so you can begin to understand it, and the more you listen to it the more you’ll be able to pick up. There are some things that are not English based; they’re actually from other indigenous languages. But after a while you’ll learn to be able to understand Bob Marley when he sings his songs, him being one of the more famous international users of Jamaican Patois.

Similarly, in the same area, there is Haitian Creole, often just called Creole, which is the language spoken in Haiti. It is derived from French. Some people consider it not a real language and just a bad version of French; other people say no, it’s quite distinct. It is true that in education in Haiti people may be taught standard continental French as well, and they may be able to move into that, and when they’re writing they will often move into that, although Haitian Creole is typically written much more phonetically than continental French is. I’m not here to complain about how badly continental French is written, with all the last letters of words not being pronounced. I’m an English speaker; how can I ever complain about spelling? But here’s an example of Haitian Creole. If you know French you might be able to understand some of it, but probably not all of it. Do we have any French speakers? We do have some French speakers, yes? Right, okay. And you certainly can’t
read it, no, because it’s not written the same way as French at all, but you’d be able to attune to it faster than the rest of us. To me it sounds very French, but of course that doesn’t mean it is French.

Now, we all speak English, I’m assuming, and English is arguably a creole, and maybe even code switched, depending on how you look at it, between Saxon and Norman French. The fundamental grammar of English is mostly Germanic, and a lot of the lexicon and some parts of the grammar are actually from Norman French, very Latinate, and they’ve been mixed for some 800 years. Even if you’re a non-native speaker, you can generate things in English that sound more elaborate and formal, or more base and crude, and often what you’re doing in the formal case is using French or Latin forms rather than the Saxon, Germanic forms.

There are lots of cases where multiple words mean basically the same thing, and historically they come from the different people who were speaking these languages. The words for animals in English, especially animals we eat, are usually Germanic: things like swine or pig or cow. The words we have for these animals when we eat them as meat are quite distinct, because the rich people were eating the meat, and therefore they’re often Norman French words: pork and beef come from French. There are lots of things like that. But 800 years is quite a long time for a language, and not all languages survive that long and remain in some sense mutually intelligible, although English from 800 years ago is quite hard to understand unless you speak German.

So how do you differentiate between languages that are derived from other languages? Yeah, it’s really hard to do. You need linguists to come along and look at these distinctions and know something about the history as well. English is an Indo-European language, and
we can talk about what we think Proto-Indo-European was, which is the root language of most of Europe and much of the Middle East and at least northern India. There’s some history, a shared form that we can come up with. We’re not saying that in general these are creoles; we say that they’ve developed over time. Most creoles and pidgins are something that happened when invasion or trade happened, and therefore you have much more localized, and usually historical, events that we can actually identify. So that’s usually the distinction. We saw things in Russian where lots of things are Indo-European and lots of things are borrowed from Greek as well, in the same way that things are borrowed from Greek in English, often scientific terms. That is different from something like Tok Pisin, spoken in Papua New Guinea, which is a pidgin of English in a modified form.

These are linguistic terms, and we must be aware of that. They’re useful for linguists to talk with, but the name of a language and how it actually gets treated are another matter. Be aware that there are lots of non-trivial political aspects to the definitions of what these languages actually are. All of the different languages in China are called dialects. They are not called different languages, even though they are more different from each other than the European languages are, and that’s probably political. Arabic is usually treated as a single language with different dialects across the different Arabic-speaking countries, even though the difference between Moroccan Arabic and Iraqi Arabic is as large as the difference between English and Polish. They’re pretty distinct. There are common words, but they’re not the same language. But the Arabic-speaking world wants to keep it as a single language and then talk about dialects. So there’s a lot of politics going on here, and when you’re actually doing language
technologies, sometimes you’ll be looking at dialect distinctions or language distinctions. Just be aware that you should look at what the differences actually are rather than what these things are named, because it can be hard to fully define that.

Pidgins and creoles, because they are more recent, are often much more to do with speech communication than with writing. People might not actually write them down, or there’s no standard way of doing it. In Europe during the Middle Ages there was a written language, Latin, that everybody who could read and write would read and write, and then there was what’s called Vulgar Latin. "Vulgar" now means something that’s crude, something that’s not well spoken, but in fact all it meant was the common tongue, what people actually spoke, and it was drifting away from the written form of Latin. The same thing is happening today in Arabic: the written form is much more standardized, while the spoken forms differ.

This means that when you’re looking at these forms it can often be really hard to find examples, because when people write things down they go into their formal mode, and so you don’t often get colloquial forms. When people were writing things down in the Middle Ages, say in 1500, they wrote in Latin. Why would they write in English or French or German? That would just be weird: that’s not what we do; we write things in Latin so that everybody can read it. What’s the point of doing it in some local language? Often that’s still the case. Think of standard English: the difference between English dialects throughout the world disappears a little bit when you actually start writing things down. So you might not be able to find good examples of a pidgin or a creole, or of code switching, which might be the pre-pidgin, because people are not writing it down. Of course, that was until recently.
So the linguistic definition is: are there native speakers? The political definition is somewhat arbitrary. Haitian Creole is just called Creole: if you ask people in Haiti what language they speak, they’ll say Creole.

Code switching is when people who usually have good fluency in two languages are communicating and mixing them. It usually only happens in casual speech and casual text, which we get on social media, and it’s usually face-to-face. It’s usually not found in official documents, it’s usually not something the government supports, and it’s not taught in schools, but it’s how people actually communicate.

Often it’s defined in terms of what’s called a matrix language: there’s an underlying language being used, and then other things are mixed on top of it. For those of you who speak Hindi, you’re probably nowadays speaking Hinglish, because you’re mostly speaking with a Hindi matrix language with lots of English on top of it, in the same way that I said English is really a Saxon language with lots of Norman French inside it. I’m not talking about Indian English, which is its own English, and is actually the most common English on the planet, and so they get to define what the rest of us speak; "prepone" and "thrice" are perfectly reasonable English words, so they tell me. Now, it might be that the person speaking will flip into English, use English as the matrix language, and put Hindi into it. That might happen, and it’s quite useful to try to identify that. Having the notion of the matrix language helps us when we start trying to do analysis: what’s the word order, how do the function words work, is there morphology going on, does it get deleted, and so on.

Often there’ll be flips going on. Often people will use filler words in one language and
maybe flip into another. These are filler words like "you know" and "like" and "okay", and the reverse, a Hindi filler that isn’t an English word but often gets used by Indian speakers in English.

What we can do is try to measure the amount of switching that’s actually happening. If we have the data, we can talk about how often people are switching and how many words are in one language as opposed to the other. We can also look at which words are switched: is it content words, is it function words, is it whole strings in one language before you change into the other? That’s quite useful for working out the type of code switching that’s going on. It’s not the case that you have nothing but English, and then one word at the end in a different language, and therefore that’s code switching: you put "lah" on the end and now you get Singlish out of it. No, it’s usually some form of mixing, and defining that mixing can be relatively hard to do.

Depending on the languages being switched, it could have different definitions, especially if the languages have very similar grammar. For example, it’s quite common for Spanish speakers in the United States to speak both English and Spanish completely fluently and to speak this code-switched Spanglish form, and because the grammars of these two languages are very similar, it’s much more fluid than, say, what you might get in Hindi-English, where the grammars are really quite different: Hindi is verb final and pro-drop, and there are lots of things that differ between those languages.

Sometimes the languages being chosen are a big education language and a local language. That’s quite common, and people do that all over the planet, where they’ve got their own local dialect and they’ve got the one that’s spoken in the main city. I’m a Scottish English speaker, and certainly that is one thing I used to do when I lived in Scotland: depending on who I was talking to, I
might use Scots dialect and drift in and out of English. Sometimes one of those languages is the education language; it’s the colonial language, and using it shows that you have education, it shows that you’re sophisticated, and it shows that you’re being nice to the invaders. There are all of those complex reasons why you might actually want to do it.

Hinglish is a really good example. There are lots of Hinglish speakers on the planet, and it’s probably the one with the most example data. Hindi has definitely changed, even over the last 50 years, from being fairly purely Hindi to having much more borrowed English, to the extent that it’s hard for a Hindi speaker not to use English words in their Hindi. That is slightly different from a Japanese speaker using English words in their Japanese, because in Japanese most of the English words are directly borrowed words rather than aspects of the grammar, while in Hinglish there are many more grammar-related things going on.

Chinglish, Chinese-English, is relatively common in bilingual places. That includes places like Singapore, Hong Kong, and Carnegie Mellon University, where we have lots of Chinese speakers, as you’re probably aware, and where things are mixed all the time. That is something that is growing. These languages haven’t got similar grammar, but there’s lots of sharing going on. Spanglish I mentioned.

African-American English and standard English is also something that happens in North America. Many Black Americans will use both a standard language and African-American English, which is different from standard English: it’s got different grammar, different choice of words, and certainly different pronunciation of standard words, and they will mix between them depending on the situation they’re in. In almost all of these cases, people are very fluent in both languages and can speak each of them, but they’re just mixing it
because it’s more convenient.

When do they code switch? Well, there are lots of reasons, and that’s one of the things that would be quite interesting to look at further; hopefully in the discussion we’ll get a chance to do some of that. Sometimes it’s vocabulary. If you’re talking about Hindi food, you probably don’t want to translate it into English; you just want to use the Hindi name for it. And there can be a difference even though it’s the same thing. "Potato" in English means a certain type of potato and the dishes you would make with it, while "aloo" is the Hindi word for it, which you might use for Hindi dishes. So you’ve now got two words for sort of the same thing that you can choose between, and that happens quite a lot. It happens within English too, where we’ve got "doctor" and "physician", or "car" and "automobile". They mean the same thing, ish, but not exactly the same.

Sometimes we do it to show off to friends, either to be part of the group and be local, or to show that we have education or sophistication, and so there’ll be choices that happen with that. And sometimes we just do it because it’s easier and the other person knows it, and your brain’s not quite thinking at the right time, and you end up using the word from one language rather than the other. So there might not be good reasons for doing it.

There also seems to be something where people are more open about their sentiment in their local language than in the other language. This was a study on Hinglish, where people seemed to give more direct opinions in Hindi than they would if they were giving them in English. It might be a politeness thing; it’s sort of unclear. But there seems to be some evidence that as people go into their more native dialect they’re more likely to be more truthful, or maybe "truthful" is wrong, it’s not that they’re lying in the other one, but they’re
actually going to be more open about their opinions. This was looking at language choice in tweets.

There’s also this entrainment thing that happens: if somebody uses a term in one language, the person talking with them will use the same term back, unless they feel that they don’t want to be linked with the person on the other side. In fact there is a study that Julia Spence was doing where she could see that Ukrainian news reporters, in interviews that were all in English, would tend to use Ukrainian phrases to refer to things that were still fully understandable in Russian, but the choice was on the Ukrainian side when interviewing a Russian-speaking person. So it doesn’t mean that they will align; they might disalign, somewhat deliberately. There might also just be different semantics, subtle semantics, going on: one is a more general term and one is a more specific term between the two languages, and therefore it’s more convenient to be able to switch.

Why do we care about this? Everybody speaks English or Putonghua, so why do we care about these other languages? Why do we care how people actually communicate? Newspapers and Wikipedia are not going to have code-switched text or pidgin; I mean, they have a bit, but not really, because people are going to go into a formal, educated mode. So why should we care about doing NLP on it? Code switching is how people actually communicate. If you go and grab some Indian person’s phone, break their privacy, and look at all of the text messages they’ve sent, they’re all going to be in romanized Hinglish if they’re Hindi speakers, or almost all, because that’s how people actually communicate. We’re not interested in spying on people’s phones, but we are interested in how people actually communicate. It’s how people type questions into Google and Bing; they do code switching all the time. It’s how they talk to call centers, and it’s
how they write their actual opinions. They probably use code switching when they’re trying to establish group membership. So if you want to say something about cricket, you’d better say it in Hinglish, because then you’ll have shown that you’re a native person who knows about cricket.

So maybe people trust code-switched communication more. You might get a better response by telling people in Hinglish that they have to wear a mask rather than telling them in English, so there can be advantages from that, or telling them to wash their hands, or whatever. So Facebook, Amazon, Microsoft, and Apple all want to understand code-switched data. They’d like to be able to build systems that can actually handle it.

Why is it hard? Well, it’s hard because most of our data sets don’t contain code-switched data. In fact, most of the time we remove code-switched data from the data that we build these big pre-trained language models from. That’s not completely true; RoBERTa is actually a bit better from that point of view, but it’s still the case that it’s often looked upon as some mixed language that we’re not sure about, and we’re going to remove it.

It’s very noisy. When people write in code-switched form they don’t have a standardized writing form, so the spelling is going to be random. You’ve got much more variability in the possible ways a word could get written, and you’ve got less data. Unlike in English or French, there’s no standardized way of doing it, so you’re going to get much more noise already. We can deal with that, but it’s going to be harder.

Our favorite word embeddings are often confused, because what you now have is an English word and a Hindi word right next to each other. So the context for the English word has got some English in it and some Hindi, and the context for the Hindi word, which is romanized, not written in Devanagari, has got some English and Hindi around it, possibly in a
non-standard order. So you’ve already got really weird context when you’re looking at word embeddings for code-switched text. Everything gets worse. It doesn’t fail, but it all gets worse, and there aren’t good examples. If we went and collected enough Hinglish data, BERT and so on would be just fine, but we haven’t, and Hinglish is the one that’s got lots of data. Every single regional language in India, every single regional language in Southeast Asia, wherever English is involved because of some British Empire thing that happened before, will also have some form of code switching, and we just don’t have good data for it. Often it’s casual, it’s speech; it’s not in the well-written things. It really is in natural communication and social media.

Where do we find this data? Well, it’s hard to verify that we actually have it and hard to find out when it’s good. Twitter, YouTube, Reddit, and other social media are good places to find it, but it is very noisy. Collecting it is particularly hard, and you need people to be in the right environment to do it. If you take a bunch of people who normally code switch and put them in a Python programming environment, they’re probably just going to use English. Though that’s not completely true. One of my Telugu students was telling me that she was discussing a homework with her friend in Telugu, and all of the words were in English, because it was a machine learning homework, but it was all in Telugu word order. They were communicating for some time before she noticed: oh, I should go and tell Alan that I’m actually code switching in grammar but really not in words. I would have been able to understand pretty much the whole conversation, and maybe even answer the questions, but it was actually in Telugu, and it was using "stochastic gradient descent" type words that Telugu doesn’t seem to have a word for.

Also,
be aware that there are different types of code switching: there’s probably the one you do when you’re talking about homework, the one you use talking to your friends about what you’re going to do in the evening, and the one you use talking to your family. There are probably different types that we don’t really fully understand.

What about data? Well, we now have regular workshops. There are sort of two every year, sorry, one every year; one is focused on speech and one is focused on text. Sunayana Sitaram, who’s at Microsoft Research in Bangalore, India, wrote this paper on code-switched speech and language processing (I was one of the authors as well), which looked at all the data sets, but it’s somewhat out of date because it’s two years old. Thamar Solorio at the University of Houston also has good resources. People usually concentrate on Hinglish, Spanglish, and Chinglish, even though there’s a lot more going on on the planet than that.

What sort of tasks do we do? Well, language ID is typically one of the first things, either in text or in speech. Where we have a string of code-switched data, we’ve got to know which language is being used so we can do something about it. That can be quite hard to define, especially, take English and Spanish, where the word might be the same in both languages and the pronunciation is the only thing that’s different. For the Hindi speakers: the word "doctor" really is the Hindi word for doctor. It’s a borrowed word from English, it’s been there for 300 years, and it really is the standard word that you would use. I don’t think that’s code switching. But maybe "iPhone" is.

Then there’s speech recognition and speech synthesis. How well can we recognize this? It’s limited mostly by the fact that we don’t have good language models for these mixed languages, and therefore it’s hard to do. In synthesis, what you have to be careful about is that if you’re speaking in Hindi and come
up with an English word, you shouldn’t just speak it in standard American, because it’s jarring if you do that. Those of you who speak multiple languages and meet somebody in America who grew up here and speaks your language, say Chinese or Hindi: it’s always weird when you listen to them, because they speak in Chinese and then they drop this completely anglicized English word in there, and that’s just weird. That’s not how a Chinese speaker would do it, or a Hindi speaker, as opposed to somebody who is American-born Indian. So when you’re doing speech synthesis you’ve got to get the right amount of mix in the language so that when it’s spoken it sounds okay. When we listened to the examples of Hinglish and of Malaysian and English, all of the English words were pronounced not completely English, right? They were pronounced in a Hinglish way, and they were still the English words. That’s something you need to care about.

Spelling normalization is really quite interesting. Indians can spell, but they can’t spell when they write Hinglish, even the Hindi words. It’s utterly random. I mean, it’s not utterly random, but there’s an awful lot of noise in there: they’ll sometimes write phonetically, and they’ll sometimes just get it, well, I don’t know, wrong. So there’s noise that you have to deal with in some form.

It might be good to get part-of-speech tagging, and people try to do that, and look at how to do cross-lingual part-of-speech tagging; there are some data sets for that.

What sort of tasks are there at the higher level? Named entity recognition would be incredibly important. You want to know what people are talking about, especially people, places, entities, and so on; there’s going to be a cross-lingual aspect, and you want to make sure you capture that. Sentiment analysis is quite interesting because, as I said, it seems that people are maybe a little bit more open with their opinions in their native language than they would be if they were
writing a more formal English review of something. So what did they really think about the movie? Question answering matters too, because this happens often; Google and Bing especially are very much aware of that. Then there are natural language inference and dialogue processing: there are people looking at how we can get people to code-switch when they’re talking to a machine, and what the benefits are of being able to do that. There’s also this whole question of when we should do it. Should we do language generation code-switched or not? Is it an advantage? Does it depend on what we’re actually talking about? Is it good to fit in, or will it look bad to try to fit in? We all know that trying to fit in is much worse than not fitting in.

Code-switching techniques: most of them are very similar to the low-resource things. You’ve got to try to find appropriate data, you’ve got to try to bootstrap labeling, and you’ve got to try some interesting forms of data augmentation and some form of generation techniques. Finding new, reliable evaluation techniques is always important: how do you know you’re actually even getting better? These things are actually quite hard.

Where do we find it? Social media sites are probably the best place to get it. Code-switching is pretty common in many forms, so YouTube and Reddit. YouTube speech is really good, so you can usually find people speaking on YouTube and on forums in both Hindi and English; it’s relatively common. And actually the Google Indian English ASR is an English ASR that has been trained to recognize Hindi-English when it’s spoken, and they actually spent a lot of time doing it, but never advertised it. We started using the Google Indian English ASR to try to label things that were in Hinglish, and then we discovered it was incredibly good. So I mailed my friend there and said, “What’s going on?” and she said, “Oh yeah, we made it do this. We’ve been working on this for three years, but we’re not allowed to announce it.” So they can’t announce it, but it’s actually quite
interesting how well they’re doing that. Broadcast news, Bibles, etc. are usually all monolingual. In fact, they might even be weirdly more in the target language. For example, in the Hindi case you might even get news in Hindi where the Hindi speaker will go, “What does that word mean?” and they need to be told in English, because they don’t normally use that word apart from in English. So it’s Twitter, Weibo, or whatever the local equivalent is that’s appropriate for finding that data.

Bootstrapping labeling: in a course that we ran two years ago called Multilingual NLP (it’s a really good course, I recommend that you do it), one of the projects was looking at bootstrapping sentiment data cross-lingually. The idea is that if you have a sentiment labeler that’s only in English, could you use it to label code-switched data, start getting data that’s labeled with sentiment, positive or negative, and then use that data to retrain? We worked on various ways of doing this in the project and ended up publishing it. What we end up with is a system that at first only uses the English parts and can label things, but it learns some of the other words as we get more and more Hindi in it, such that as we do this bootstrapping it depends more on the Hindi side of the information rather than the English side. We also did it for, I can’t remember, it was Telugu or Tamil, one of those, which has similar mixing but is not the same language.

Data augmentation and generation: building a generator from limited data is something we’d like to be able to do. We have some data; can we generate some more? Most of the things that we do in NLP now are very word-oriented, so they really benefit from having much more surface-level word data, rather than, say, parse trees, which would be even harder to get. So if we could generate data, maybe by mixing and matching, the same thing that Graham was talking about
when we were talking about augmentation in machine translation, we’d like to be able to do that too. But it’s surprisingly hard to do. You build a classifier that tells you whether something is mixed or not, you generate some data, and your classifier goes, “Yep, that’s Hinglish,” because it’s got one Hindi word at the end, it’s “hai” at the end, and that’s it. But no, that’s not what good code-switching actually is, and that’s really hard to do well. You might think it’s trivial because you can recognize it, but when you start generating it you end up with things that match the classifier that’s looking for good Hinglish, and then you show it to a Hinglish speaker and they go, “Yeah, nobody would ever say that, that’s just weird.” So we’ve tried to do that in various ways, and not just us: lots of other people on the planet, often in Singapore and India, which are places where people care about code-switching, are trying to see if we can generate good data, see if we can make it indistinguishable from natural data, and then find out whether the new data that we’re generating is actually going to help, say, in improving word embeddings on some downstream tasks.

How do we evaluate? It’s actually quite hard, because we might be able to evaluate that something is code-switched, but we want to know whether it’s good and whether it’s going to help. So can we do things like engage people in conversation? Can we answer questions? Can we do web search? Things like dialogue understanding, summarization, and question answering are what we actually want to do. Microsoft India, in a paper, have come up with a somewhat standard benchmark called GLUECoS, which allows us to test whether the embeddings that we get are good on eight different tasks, and that’s something that we will often do. There may still be issues with things like GLUECoS, because we might overtrain to it, but it’s a good standardized way of looking at it for Hinglish. But that’s only for Hindi-English, and it doesn’t
deal even with just the other languages in India.

So, finally: code-switching, pidgins, and creoles. Multilingual use is much more varied than just one single language. Code-switching is mixing within utterances; that’s probably the best way to define it. A pidgin is a non-native mixed lingua franca: nobody is speaking it natively, so it’s a bunch of non-natives speaking it. Creoles are when a pidgin becomes a native language in its own right, which means kids are speaking it natively, and you’re probably talking about two or three generations before that’s really going to move through an area. The issues are still very hard. Even modeling monolingual casual speech is still pretty hard in most languages on the planet; we just don’t have enough data and enough context to know how to deal with that.

So the discussion point that I want you to do today: select two languages that have some level of code-switching. You’ll have examples of that somewhere, but it might be in a particular situation, including at CMU. We’d like to understand the different types of code-switching, that is, when people are doing different things in code-switching. These types might be:

- Matrix language decisions: which language do people choose when they’re speaking about different things?
- Cross-lingual morphological issues: how do plurals get done? What do you do with gender if you’re copying a word that doesn’t have gender into a language that does have gender?
- Cross-lingual style issues: politeness, register, etc.
- Low-level semantic distinctions and stylistic distinctions; there might be things about trying to be more city than rural, or others.

Okay, so we are running out of time, but let’s see what we can do. We’re not going to report back, but let’s split off into our groups and see if we can answer any of these questions.
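The bootstrapping idea described in the lecture (start from an English-only sentiment labeler, label code-switched text with it, then retrain so that the model gradually picks up the non-English words) can be sketched as a small self-training loop. This is only a toy illustration under stated assumptions: the lexicon-based scorer, the names `score` and `bootstrap`, the thresholds, and the romanized example sentences are all hypothetical, and this is not the method of the published project.

```python
from collections import Counter


def score(tokens, lexicon):
    """Sum the polarity of every token the lexicon knows (unknown tokens count as 0)."""
    return sum(lexicon.get(tok, 0.0) for tok in tokens)


def bootstrap(sentences, seed_lexicon, rounds=3, min_count=2, threshold=1.0):
    """Toy self-training: grow a polarity lexicon from an English-only seed.

    Each round labels sentences with the current lexicon, then adds words
    that occur at least `min_count` times in sentences of one polarity and
    never in sentences of the other polarity.
    """
    lexicon = dict(seed_lexicon)
    for _ in range(rounds):
        pos_counts, neg_counts = Counter(), Counter()
        for sent in sentences:
            tokens = sent.lower().split()
            s = score(tokens, lexicon)
            if s >= threshold:        # confidently positive
                pos_counts.update(set(tokens))
            elif s <= -threshold:     # confidently negative
                neg_counts.update(set(tokens))
        for word in set(pos_counts) | set(neg_counts):
            if word in lexicon:
                continue
            p, n = pos_counts[word], neg_counts[word]
            if p >= min_count and n == 0:
                lexicon[word] = 1.0
            elif n >= min_count and p == 0:
                lexicon[word] = -1.0
    return lexicon


# Hypothetical romanized Hindi-English examples, for illustration only.
sentences = [
    "movie was great bahut accha",
    "great songs aur acting accha",
    "terrible movie bahut bakwas",
    "story was terrible ekdum bakwas",
    "yeh film accha hai",
    "yeh film bakwas hai",
]
seed = {"great": 1.0, "terrible": -1.0}  # English-only seed labeler
learned = bootstrap(sentences, seed)

# The last two sentences contain no English sentiment word, but they can
# now be scored through the words learned during bootstrapping.
print(score("yeh film accha hai".split(), learned))
print(score("yeh film bakwas hai".split(), learned))
```

The design choice worth noting is the one-sided filter: a word is added only if it never appears with the opposite polarity, which keeps function words that occur in both kinds of sentence out of the lexicon. A real system would use a trained classifier and confidence estimates rather than a hand-written lexicon.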
Outline
- Language Change: Languages are always changing.
  - This can be due to a number of factors, including laziness/efficiency, emphasis/clarity, politeness, misunderstanding, group identity, and structural reasons.
- Language Contact: Language contact, where multiple languages are used in the same place at the same time, is a major driver of language change.
- Lexical Borrowing: Lexical borrowing is common in languages.
  - This involves identifying words that are similar across languages and likely to be mutual translations.
- Lexicon Structure: Languages have a core-periphery lexicon structure.
- Transliteration Models: These models, including FSTs and LSTMs, can help map lexicons across languages.
- Transliteration Evaluation: This can be done intrinsically (word accuracy, fuzziness, ranking) or through downstream evaluation like machine translation.
- Cognates and Loanwords: These are important in language contact, with phonological and morphological integration.
- Bilingual Lexicon Induction: This involves learning monolingual embeddings, finding alignment between embedding spaces, and finding nearest neighbors to induce a lexicon.
- Code-Switching: This usually occurs between fluent speakers in casual speech or text, often with a matrix language determining word order.
  - The amount of code-switching can be measured.
- Pidgins and Creoles
  - Pidgins are simplified mixes of multiple languages learned by non-native speakers.
  - Creoles develop when pidgins become native languages.
  - These terms can also be used politically.
- Major Code-Switching Dialects: Common examples include Hinglish, Chinglish, Spanglish, and African American English.
- When Code-Switching Occurs: This can be influenced by vocabulary coverage, “showing off,” sentiment, and entrainment.
- Why NLP of Non-Standard Language Matters: Code-switching is how people communicate and can define group membership.
- Challenges in Code-Switching NLP: Limited data, noise, non-standard spelling, and the confusion of word embeddings.
- Code-Switching Data: Social media sites like Twitter, YouTube, and Reddit are good sources.
- Code-Switching LT Tasks: These include language ID, speech recognition/synthesis, spelling normalization, POS tagging, named entity recognition, sentiment analysis, and question answering.
- Code-Switching Techniques: These are similar to low-resource language techniques, including finding appropriate data, bootstrapping labeling, and data augmentation.
- Mining Data: This involves labeling a small amount of data, building a classifier, and using it to label more data.
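- Bootstrapping Labeling: This involves using an existing labeler (for example, an English-only sentiment labeler) to label code-switched data, then retraining on that data so the model relies increasingly on the other language.
- Data Augmentation/Generation: This involves building a generator from limited data and using classifiers to distinguish real from generated data; it can also involve paraphrasing existing data or replacing named entities.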
- Evaluation Techniques: These include task evaluation of held-out data and using standard techniques.
- Discussion Points: The lesson includes discussion points such as picking a language and analyzing its influence on others, analyzing code-switching in another language, and considering cross-lingual morphological issues.
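
The outline notes that the amount of code-switching can be measured, without saying how. One measure in the style of the Code-Mixing Index compares the number of tokens in the dominant language with the number of all language-tagged tokens; counting switch points is a simple complement. The sketch below assumes token-level language tags are already available from a language ID step. The function names and the `other` tag for language-neutral tokens (punctuation, names, numbers) are illustrative choices, not something specified in the lesson.

```python
from collections import Counter


def code_mixing_index(tags, neutral="other"):
    """Return 100 * (1 - dominant / (n - u)) for one utterance.

    n is the number of tokens, u the number of language-neutral tokens,
    and dominant the token count of the most frequent language.
    Returns 0.0 when there are no language-tagged tokens.
    """
    lang_tags = [t for t in tags if t != neutral]
    if not lang_tags:
        return 0.0
    dominant = max(Counter(lang_tags).values())
    return 100.0 * (1.0 - dominant / len(lang_tags))


def switch_points(tags, neutral="other"):
    """Count positions where the language changes, ignoring neutral tokens."""
    lang_tags = [t for t in tags if t != neutral]
    return sum(1 for a, b in zip(lang_tags, lang_tags[1:]) if a != b)


# Hypothetical tags for a five-token utterance: three Hindi tokens,
# one English token, and one language-neutral token.
tags = ["hi", "hi", "en", "hi", "other"]
print(code_mixing_index(tags))
print(switch_points(tags))
```

A monolingual utterance scores 0 on both measures. The two can disagree: an utterance that is half English followed by half Hindi has a high index but only one switch point, which is why reporting both can be useful.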
Papers
- Danescu-Niculescu-Mizil et al. 2013
- Gibson 2019
- Knight & Graehl ’98
- Rosca & Breuel ’16
- Wu & Cotterell ’19
- Mann & Yarowsky ’01
- Dellert ’18
- Kondrak ’01
- Kondrak, Marcu & Knight ’03
- Bouchard-Côté et al. ’09
- Hall & Klein ’10
- Hall & Klein ’11
- Tsvetkov & Dyer ’16
- Soisalon-Soininen & Granroth-Wilding ’19
- Yip ’93
- Kang ’03
- Kenstowicz & Suchato ’06
- Benson ’59
- Friesner ’09
- Schwarzwald ’98
- Ojo ’77
- Schadeberg ’09
- Johnson ’14
- Haspelmath & Tadmor ’09
- Holden ’76
- Van Coetsem ’88
- Ahn & Iverson ’04
- Kawahara ’08
- Hock & Joseph ’09
- Calabrese & Wetzels ’09
- Kang ’11
- Rabeno ’97
- Repetti ’06
- Whitney ’81
- Moravcsik ’78
- Myers-Scotton ’02
- Guy ’90
- McMahon ’94
- Sankoff ’02
- Appel & Muysken ’05
- Rudra et al. 2016
- Khanuja et al. 2020, GLUECoS: An Evaluation Benchmark for Code-Switched NLP
- Common Amigos (Ahn et al. 2020, Parekh et al. 2020)
References
Citation
@online{bochman2022,
author = {Bochman, Oren},
title = {Code {Switching,} {Pidgins,} {Creoles}},
date = {2022-02-24},
url = {https://orenbochman.github.io/notes-nlp/notes/cs11-737/cs11-737-w11-code-switching/},
langid = {en}
}