- Example Sequence Classification/Labeling Tasks
- Overall Framework of Sequence Classification/Labeling
- Sequence Featurization Models (BiRNN, Self Attention, CNNs)
This time we'll be talking about text classification and sequence labeling. About half of the lecture portion of the class will be a review of some very basic things about building neural network models for NLP tasks in general.
If you participated in CS 11-711 Advanced NLP or a similar class, this will mostly be review, but I think it's important because other people are coming from other backgrounds, and we haven't covered these basics before, so it's good to have them in place before you dive into the code. In addition, I'll be talking about text classification and sequence labeling from a multilingual perspective, giving pointers to datasets, tasks, and other resources, so hopefully that will be useful too. Afterwards we'll walk through the assignment, including what is required for assignment one, the code, and so on. The assignment one description on the website is a little bit old at the moment, but we'll be updating it very shortly.
Text classification and sequence labeling
Text classification and sequence labeling are both very broad; I like to call them task categories. They aren't tasks in themselves, but categories of tasks that look very similar and can therefore be solved in similar ways.

Text classification: given input text X, predict a categorical label Y. This covers all kinds of things. In topic classification, "I like peaches and pears" gives us the topic food, while "I like Peaches and Herb" gives us the topic music, because Peaches & Herb is an old band. Language identification, a particularly important task in multilingual learning, takes in text and outputs the language it's written in; obviously the first example here is English and the second is Japanese. This becomes very interesting and difficult, as I'll elaborate later in the lecture. Another widely known example is sentiment analysis, which can be done at the sentence level or the document level, or even with respect to individual entities, but from a text classification point of view the inputs are sentences or documents: "I like peaches and pears" would be positive, "I hate peaches and pears" would be negative. There are many, many other tasks that fall into this category.

Sequence labeling, on the other hand, is: given input text X, predict an output label sequence Y, usually of equal length. This can be formulated as taking in a sequence of words and outputting one tag for each word. Part-of-speech tagging is one example: we take in words and output parts of speech. Another is lemmatization, which takes in words and outputs their base forms. This is sometimes used in English, but it's particularly useful in languages with richer morphology: if a language has lots and lots of conjugation, the surface words become very sparse, so identifying the underlying base form helps you understand what a word refers to. Another variety is morphological tagging (we'll talk about morphology in about two classes), which predicts various features of each word: in "He saw two birds", for example, "saw" is past tense and a finite verb form, "two" is a number of type cardinal, and "birds" is plural. This gets more complicated in languages with richer morphology, and there are other morphological features as well.

There are also span labeling tasks. These are sometimes treated as sequence labeling tasks, but they're actually a little bit different: given input text X, predict output spans and their labels Y. They include named entity recognition, where you want to identify the spans of named entities such as people or organizations and label them; here "Graham Neubig" is a person and "Carnegie Mellon University" is an organization. Another example is syntactic chunking, or shallow syntactic parsing, where you split the sentence up into noun phrases and verb phrases. There is also semantic role labeling.
Semantic role labeling identifies spans and the roles they play with respect to a predicate: this span is the actor, this is the predicate, this is a location, and so on. As you can see, all of these tasks have to do with identifying spans and labeling them in some way.

Span labeling can also be treated as a sequence labeling task: you predict beginning, inside, and outside (BIO) tags for each word. If we take a span labeling task where we want to identify a person and an organization, we convert it into a task with one tag per word: beginning-of-person, inside-person, then out, out, out (meaning no span), then beginning-of-organization, inside-organization, inside-organization. The good news is that if you want to do span identification, you can then just solve it with a sequence labeling model, which is a nice property of sequence labeling. There might be better ways to handle span identification, but this is one way to do it.

Another task that is slightly different, but can also be handled as sequence labeling, is text segmentation: given input text X, split it into segmented text Y. A very common example in many languages is tokenization, where you take something like "a well-conceived thought-exercise", with punctuation and intervening hyphens, and split it up into something closer to natural word boundaries, adding spaces around each piece of punctuation and so on. Another variety, necessary for some (though not that many) languages, is word segmentation. Japanese, for example, is written with no spaces between words, so you can't just split on whitespace with Python's split function; you have to actually find the word boundaries, and this is a non-trivial task. For one example phrase there is a "correct" segmentation, but you could also segment it differently by mistake: split the correct way, the phrase means foreigner voting rights, i.e. suffrage for foreign people, while split the wrong way it reads as something about a foreign government instead. Splitting one way gives one meaning and splitting the other way gives another, and that can mess up information retrieval systems, translation systems, anything you can think of. There is also morphological segmentation, which we'll talk about more later. For example, one word (I believe this example is Turkish; someone can correct me if I'm wrong, I forget where I got it) can be split in different ways, giving either "dog" plus a plural marker or a completely different verb form, so how you split it also determines whether it is a plural noun or a verb. Thanks for confirming that in the chat. So this is another issue you can run into.
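Going back to the BIO conversion described above, here is a minimal Python sketch of turning labeled spans into per-word tags; the function name and the (start, end, label) span format are illustrative assumptions, not taken from the course code.

```python
def spans_to_bio(tokens, spans):
    """Convert labeled spans into BIO tags.

    tokens: list of words, e.g. ["Graham", "Neubig", "teaches", "at", "CMU"]
    spans:  list of (start, end, label) with end exclusive, e.g. [(0, 2, "PER")]
    """
    tags = ["O"] * len(tokens)            # default: outside any span
    for start, end, label in spans:
        tags[start] = f"B-{label}"        # first word of the span
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"        # every subsequent word of the span
    return tags

# Example: two spans produce a B- tag at the first token of each span and I- inside.
print(spans_to_bio(["Graham", "Neubig", "teaches", "at", "CMU"],
                   [(0, 2, "PER"), (4, 5, "ORG")]))
# ['B-PER', 'I-PER', 'O', 'O', 'B-ORG']
```

A single-word span gets only a B- tag, and keeping B- and I- distinct is what lets two adjacent spans of the same type remain distinguishable.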
Cool. Are there any questions about this so far?

Q: Is word segmentation resolved with additional context in Japanese? Yeah, that's a really good question, and morphological analysis is similar. Having context is very important, because both of the morphological segmentations above are reasonable in different contexts; they're both things that could happen. Sure, one might be a lot more frequent, but there certainly are examples where the other would occur. It's also the case in Japanese, although that's a little more rare. One example, which I can put in the Zoom chat: if you split off the first character it means "American nuclear reactor", and if you split after the second character it means a train departing from Maibara, which is a place in Japan. Depending on whether you're reading a newspaper article or a train schedule, either of those would be correct, so yes, it is based on context.

Q: Can you explain the distinction between B and I? B is the tag for the first word in a span, and I is applied to every subsequent word in the span. A single-word span just gets a B; a multi-word span gets B, then I, I, I until the span finishes. Having these two different tags is necessary to distinguish cases where you have two spans in a row, for example a person immediately followed by another person.

Q: How do we constrain the model so that B is predicted before I for a particular entity? That's another really good question; maybe I'll talk about how we actually make predictions first and then come back to it.

Great, okay. So I'd like to talk a little bit about modeling for sequence labeling and classification. The first question is how we make predictions: given input text X, extract features H and predict labels Y. For text classification, we have something like "I like peaches", a feature extractor that extracts features from the whole sequence into a single vector, and then we make a prediction from that vector. For sequence labeling, the feature extractor extracts one vector of features for each word in the input, and we make a prediction for each word. Either way we need a feature extractor; it's just a matter of whether we extract a single vector for the whole sequence or one vector per word.

A very simple feature extractor for text classification is bag of words: we look up a single vector for each word and add them together, getting a vector that represents the number of times each word occurred in the sentence. We then feed this into a predictor, for example a matrix multiply that turns it into label scores, and use that to calculate label probabilities. To clarify, in bag of words each lookup is a one-hot vector: a single element is one and all the others are zero, so it represents the identity of the word and nothing else.

As a simple predictor we can use a linear transform and a softmax function: we take the extracted features, multiply them by a weight matrix, and add a bias, which essentially says how likely each label is a priori. The softmax then converts arbitrary scores into probabilities: we exponentiate the score in each element of the resulting vector and divide by a normalizer so that everything adds up to one, which gives us the probability of the final output.
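In symbols (standard notation, not copied from the slides), the linear + softmax predictor just described is

$$
\mathbf{s} = W \mathbf{h} + \mathbf{b},
\qquad
P(y = i \mid x) = \mathrm{softmax}(\mathbf{s})_i = \frac{\exp(s_i)}{\sum_{j} \exp(s_j)},
$$

where $\mathbf{h}$ is the extracted feature vector, $W$ is the weight matrix, and $\mathbf{b}$ is the bias encoding how likely each label is a priori.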
So after the linear transform and bias addition we might have a score vector, and the softmax turns it into a probability distribution. I think this should be pretty familiar to a lot of people. Any questions about this? If not, I have a little bit of a quiz.

We talked about text classification, where we extract features for the whole sequence by adding the word vectors together and then use that to predict label probabilities for the whole sequence. What do you think of a similar bag-of-words feature extractor used for sequence labeling? We extract one vector per word using the same kind of lookup and make predictions based on that vector; would that do anything reasonable?

A: Maybe not so reasonable, because it neglects word order. Yeah, that's a pretty good answer, but it's actually maybe not that bad. What it would end up doing is looking up "I" and predicting whatever the most frequent part-of-speech tag for "I" is, looking up "like" and predicting the most frequent tag for "like", and so on. It's a frequency-based model that looks up the majority class for each word, and for part-of-speech tagging in English, and even more so for some other languages, that actually isn't that bad: it gets high-80s accuracy, just because there's relatively little ambiguity. "Peaches" is always a plural noun; I can't think of a case where it wouldn't be. So even that is a moderately okay feature extractor, both for classification and for sequence labeling. However, it does neglect word order, so if you want to do better than majority class you need something more.

Another issue is that language is not just a bag of words, for classification or labeling. Consider "I don't love pears" versus "There's nothing I don't love about pears." If you just look at the words, "love" is the positive word, and then you have a negation, but negation by itself is not very indicative of positive or negative sentiment, so both of these are relatively hard to tackle with a bag-of-words model.

So we want a better featurizer to pull out features for our sequences, or for each word in our sequences. One example is a bag-of-n-grams model: instead of looking up each word, we look up each n-gram, so "don't love" would also become a feature, and "don't love" is a negative-leaning feature that might overpower the "love" feature and push the prediction towards negative. You could also come up with syntax-based features like subject-object pairs, or use neural networks: recurrent neural networks, convolutional neural networks, self-attention.
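Going back to the simplest version, here is a minimal PyTorch sketch of the bag-of-words classifier described above (sum of word embeddings, then a linear layer and softmax); the class name, vocabulary size, and label count are made up for illustration.

```python
import torch
import torch.nn as nn

class BoWClassifier(nn.Module):
    """Bag-of-words text classifier: sum word embeddings, then linear + softmax."""

    def __init__(self, vocab_size, num_labels, embed_dim=64):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)  # one vector per word type
        self.output = nn.Linear(embed_dim, num_labels)        # W h + b

    def forward(self, word_ids):
        # word_ids: LongTensor of word indices for one sentence, shape (seq_len,)
        h = self.embedding(word_ids).sum(dim=0)   # "bag": word order is ignored
        scores = self.output(h)                   # unnormalized label scores
        return torch.log_softmax(scores, dim=-1)  # log-probabilities over labels

# Toy usage with made-up indices: a 3-word sentence, 2 labels (e.g. positive/negative).
model = BoWClassifier(vocab_size=1000, num_labels=2)
log_probs = model(torch.tensor([5, 42, 7]))
```

Summing embeddings is the neural analogue of adding one-hot vectors: the model sees which words occurred and how often, but not their order.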
Until the mid-2010s or so, it was very common to use these kinds of handcrafted features, or at least handcrafted feature templates, and throw them into a support vector machine or a similar model to do text classification. Now it's much more common to use neural networks, which also have some really nice properties such as allowing transfer across languages, a very big part of this class, so we're going to focus mainly on neural models. This time I'll talk about recurrent neural networks because they're conceptually easy, but we'll also be talking about things like self-attention.

A neural network, as a lot of people know, is basically a computation graph that is parameterized in a certain way so that it can make predictions. The name comes from neurons in the brain, which take in information across synapses, fire, and send output onward, but in the current conception it's a mathematical object that takes an input and computes an output. In our case it's going to be our featurizer and our predictor: it will take in an input and output features for each word, or take in an input and output a score for each class.

When we define an expression, we represent it as a computation graph with nodes and edges. A single variable, say a vector, is represented as a node; a node can hold a scalar, vector, matrix, or tensor value. We can also have operations over these values: if we transpose the vector, that operation is drawn as an edge to another node that implements the transpose. An edge represents a function argument, and a node with an incoming edge is a function of that edge's tail node; a node knows how to compute its value and the value of its derivative with respect to each argument. Functions can be nullary (input values), unary, or binary; a matrix multiply, for example, gives us x-transpose times A. These are directed acyclic graphs that let us build up more and more complicated expressions, so they let us calculate features, make predictions, and so on, and we can name individual parts of the graph so that we can read off the values we care about, like the probability of a prediction.

The main algorithms are graph construction and forward propagation, which compute values: forward propagation starts from the input values of the graph and gradually moves through it to compute the final result. Then there is backpropagation, which processes the graph in reverse topological order, calculating the derivatives of the parameters with respect to the final value. That final value is usually the loss function, the value we want to minimize; in many cases this is the negative log likelihood of the predictions given the true values, and by minimizing the negative log likelihood we maximize the probability of getting the correct answer. The derivatives calculated through backpropagation are then used to update the parameters of the model. Backpropagation basically works by starting from the end of the graph and gradually moving back until we reach the parameters.
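As a tiny concrete illustration of forward and backward propagation (not from the course code), PyTorch builds the computation graph as the expression is evaluated, and `backward()` fills in the derivatives of the parameters:

```python
import torch

# Parameters we want gradients for (leaf nodes of the graph).
A = torch.randn(3, 3, requires_grad=True)
b = torch.randn(3, requires_grad=True)
x = torch.randn(3)  # input value, no gradient needed

# Forward propagation: each operation adds a node to the graph.
y = x @ A + b          # linear transform
loss = (y ** 2).sum()  # a stand-in for a real loss such as negative log-likelihood

# Backward propagation: derivatives of the loss w.r.t. every parameter.
loss.backward()
print(A.grad.shape, b.grad.shape)  # (3, 3) and (3,), ready for a gradient update
```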
So if A, b, and so on were parameters, we would backpropagate into those values, and once we have the derivatives we can use them to update the parameters and improve the likelihood of getting the correct answer. That's the five-minute intro to neural networks; if you're not very familiar with them, there are lots of good tutorials online. In particular we're going to be using PyTorch for examples in this class, which I think a lot of people are already familiar with, and our first assignment has the dual purpose of letting you get familiar with building models in PyTorch and learning about the interesting difficulties you have to deal with when you apply these models multilingually. If you've already taken a class on neural networks or used them widely in your work, all of this will be old news. If you haven't, definitely take advantage of the TA office hours, look at the examples, ask lots of questions, and we can forward you some online tutorials as well; this is your chance to catch up before the more involved assignments later. Actually, maybe I'll skip the next part.

To give an example of a type of neural network that can be used for featurizing a text classifier or a sequence labeler, we'll talk briefly about recurrent neural networks. Recurrent neural networks let us do precisely what was mentioned before as the issue with a bag-of-words model: handling short- or long-distance dependencies in language, handling word order, and so on. In language there are many dependencies that span whole sentences. Agreement is one example; there's not a whole lot of agreement in English, but there is gender agreement and number agreement between subjects and verbs: "he" and "himself" need to agree, "she" and "herself" need to agree, and "he" needs to agree with "does" (if the subject were "I", it would be "I do", not "I does"). Word order in general, which we talked about last class, is another. These are syntactic characteristics we need to handle, and there are also semantic characteristics: we need semantic congruency, for example between "reign" and "queen", or between "rain" and "clouds"; some combinations make sense based on our knowledge of the world and others don't, and we need to handle that as well.

Recurrent neural networks are one of the tools we can use to encode sequences, either to get a representation for each word or a representation for the whole sequence. What they do is look up the input at the current time step, transform it into features, and then feed the features from the previous time step into the computation for the next time step.
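Written as an equation (standard notation, not from the slides), the recurrence is

$$
\mathbf{h}_t = \mathrm{RNN}(\mathbf{x}_t, \mathbf{h}_{t-1}),
\qquad \text{for example} \qquad
\mathbf{h}_t = \tanh\left(W_x \mathbf{x}_t + W_h \mathbf{h}_{t-1} + \mathbf{b}\right),
$$

where $\mathbf{x}_t$ is the embedding of the word at time step $t$ and $\mathbf{h}_{t-1}$ carries the features computed at the previous time step.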
To give an example with "I like these pears": we feed "I" through the RNN function, together with an initial state that is a parameter of the model, and get a vector; then we feed the result of running "I" through the RNN, along with the input "like", to get the representation of "I like"; then with "these" we calculate the representation of "I like these"; and then with "pears" we calculate the representation of "I like these pears". It's a recursive function: each time we use the result of the previous step to calculate the result of the next step.

When we represent the sentence for text classification, we take the last vector in the sequence and make a prediction from it; that's useful for things like text classification, conditioned generation, and retrieval (we'll talk about the latter two later in the class, but text classification is the one for today). The RNN can also be used to represent words: if we want to predict a part-of-speech label for "I", "like", "these", and "pears", we can use the output immediately after inputting each word to make that prediction. This lets us pull in context from the left side when making the prediction, for things like sequence labeling, language modeling, and calculating representations for parsing.

To train an RNN, say for sequence labeling, we take the prediction at each position, calculate a loss such as the negative log likelihood using the true label at that position, and add the losses together to get the total loss for the sequence. This is one big computation graph for the whole sentence, and we do backpropagation from the total loss through the representations of the whole sentence. The parameters of the model are tied across time, so the derivatives are aggregated across all time steps; this is called backpropagation through time, and it lets you backpropagate through the whole sentence and optimize the probability of making the correct predictions for the whole sentence. What I mean by parameter tying is that the parameters are shared by the RNN function across the entire sentence, which is what allows you to apply the model to sentences of arbitrary length: a sentence of length 50, or 20, or 3 is handled by applying the same RNN function 50, 20, or 3 times.

When computing representations for things like sequence labeling, it's very common to use bidirectional RNNs: you run one recurrent neural network that steps from left to right through the time steps, and another that steps from right to left, then aggregate the information from both directions, concatenate it together, and make a prediction. The reason this is useful is that you never know whether the context needed to disambiguate a particular word will be on the left side or the right side, so this lets you pull in information from both.
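A minimal bidirectional featurizer for sequence labeling might look like the following PyTorch sketch; the baseline in assignment 1 is similar in spirit, but the class name and sizes here are illustrative assumptions rather than the actual course code.

```python
import torch
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    """Embed words, run a bidirectional LSTM, predict one tag per word."""

    def __init__(self, vocab_size, num_tags, embed_dim=100, hidden_dim=128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        # bidirectional=True runs one LSTM left-to-right and one right-to-left
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.output = nn.Linear(2 * hidden_dim, num_tags)  # forward + backward states concatenated

    def forward(self, word_ids):
        # word_ids: (batch, seq_len) word indices
        embeds = self.embedding(word_ids)   # (batch, seq_len, embed_dim)
        states, _ = self.lstm(embeds)       # (batch, seq_len, 2 * hidden_dim)
        return self.output(states)          # (batch, seq_len, num_tags) scores per word

# Toy usage: a batch of 2 sentences, each 5 tokens long, 17 POS tags.
model = BiLSTMTagger(vocab_size=1000, num_tags=17)
scores = model(torch.randint(0, 1000, (2, 5)))
```

The final linear layer takes the concatenated forward and backward states at each time step, which is exactly the "aggregate both directions, then predict" step described above.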
Okay, so that's an overview of a simple method for computing either a representation of the whole sentence or representations of individual words. To represent the whole sentence, we might concatenate the final state of the forward RNN and the final (leftmost) state of the backward RNN; to represent individual words, we concatenate the forward and backward vectors at each time step and make predictions from those. That lets us do text classification or sequence labeling. Are there any questions about this before I jump into the multilingual part? Okay, I guess not, so let's get to the multilingual part that is, of course, in the name of the class.

I'm going to talk about some text classification and sequence labeling tasks. Most of these are applicable to any language, so they're explored quite widely for English as well, but some of them are inherently multilingual. Language identification, for example, is inherently multilingual: as I mentioned before, it's the task of identifying the language a particular text is written in, and it's really important for a broad number of reasons. One is showing people content only in a language they speak: people doing search online will appreciate results in their own language more than in another. Another is creating datasets for things like machine translation or language modeling, where you only want data in a particular language. One of the largest language identification corpora was created by Ralph Brown here at the LTI; it's a benchmark covering 1152 languages drawn from a variety of free sources, and it's a fairly widely known dataset. If you want off-the-shelf tools for language identification, one relatively easy one to use is langid.py, which you can just download and use for 90-plus languages; another is the language identifier from the Chrome browser (I don't have a link here, but it's also pretty widely used). There's also a nice survey, a little bit old by now, called "Automatic Language Identification in Texts", which I can recommend if you're interested.

I seem to be missing a slide I thought I had added, so I'll just discuss the paper directly. This is a recent paper from 2020 by people at Google working on low-resource languages, called "Language ID in the Wild: Unexpected Challenges on the Path to a Thousand-Language Web-Text Corpus". It's not the only paper to point out that language identification doesn't work well, but it has some very interesting insights and very nice examples of the issues you run into when doing language identification on web text. They show sentences from the web that were supposedly in a given language according to a language identification model: a long run of raised-hand emoji got classified as one low-resource language; "why you lie / why you always lying", written in stylized Unicode characters, got classified as another; a misrendered PDF got classified as yet another; and there was text in a non-Unicode font that I'm not sure what it originally was.
That non-Unicode-font text got assigned a language too; boilerplate text got labeled as Balinese; English written in the Cherokee script got labeled accordingly; someone just wrote "meow" and it got classified as yet another language. So you can see that when people write in slightly non-standard ways, their text gets identified as all sorts of other things; even clearly standard English got labeled as a different language because it contained hints of words that often occur in that language. This demonstrates how difficult the task becomes once you start applying it to web text. I've had a similar experience working with Twitter data: there was a certain face, not an emoji but one of those faces written with regular characters, that happened to use Kannada characters, and so many tweets were recognized as Kannada just because they used that popular face.

While I'm at it, I can also introduce the OSCAR corpus: a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus. It has gotten better since it was first released in terms of noisiness, but when it first came out it was extremely noisy, precisely because language ID didn't work so well. So this is a really big problem you need to be aware of if you're starting out. Are there any questions about language ID before I move on?

Okay. There's also more standard text classification. As I said, text classification is a category of tasks rather than a task in itself; here are some representative datasets. These are mostly used for benchmarking multilingual models, as opposed to building anything directly useful, but sometimes you want to know how good your multilingual representations are, so they make good test beds. One example is the MLDoc corpus for multilingual document classification. There's also the PAWS-X corpus for paraphrase detection across languages, a sentence-pair classification task where you feed in two sentences. Cross-lingual natural language inference (XNLI) is textual entailment prediction, also a sentence-pair classification task. And there's cross-lingual sentiment classification between Chinese and English, which can also be used for benchmarking.

Another area is part-of-speech and morphological tagging. I'm not going to go into a lot of detail because it will be covered more when we talk about words, parts of speech, and morphology, but the main resource is the Universal Dependencies treebank. It contains syntactic (dependency) parses, but also parts of speech and morphological features for around 90 languages, with a standardized universal part-of-speech tag set and universal morphological feature set to keep things consistent across languages. It's one of the highest-quality multilingual corpora I'm aware of: well controlled and well conceived. There are also pretrained models, such as UDify and Stanza, that do syntactic analysis in many languages and are trained on these datasets, if you're interested in multilingual syntactic analysis.
Next, named entity recognition. There are different types of named entity recognition datasets. There's a gold-standard dataset from CoNLL 2002/2003 on language-independent named entity recognition, with human-annotated data in English, German, Spanish, and Dutch. I forgot to add one that just came out, which I helped out with a little: MasakhaNER, a named entity recognition dataset for African languages. It's nice because it's also manually labeled, but it covers African languages that have far fewer resources than English, German, Spanish, and Dutch, so it gives a better idea of how well we're doing on lower-resource languages. There's also the WikiAnn dataset for entity recognition and linking in 282 languages, which was extracted from Wikipedia using inter-page links. On the Carnegie Mellon University page on Wikipedia, for example, there are many links: Pittsburgh, Pennsylvania, the Mellon Institute of Industrial Research, Andrew Carnegie. All of these link to other pages, and if you look up the type of each page according to annotations that come with Wikipedia, you can tell that Pittsburgh is a city, the Mellon Institute of Industrial Research is an organization, and Andrew Carnegie is a person. Since Wikipedia exists in lots of languages, you can go to the Chinese edition, find the Chinese equivalents of Andrew Carnegie or Pittsburgh, and do the same thing; that's how the dataset is created in many different languages.

There are also several composite benchmarks for multilingual learning, which aggregate many different sequence labeling or classification tasks for testing multilingual models. One popular one is XTREME, a massively multilingual benchmark with around ten tasks across 40 languages. Another that came out around the same time is XGLUE, with 11 tasks over 19 languages. There's also a new version of XTREME called XTREME-R that just came out; I was a little bit involved in both, and in XTREME-R they swapped out some easy tasks, added some harder tasks, and added better analysis. You might consider looking at these for your class project. I would warn you that these benchmarks are very popular, and there are people with lots of compute competing on them, so it might be a challenge to keep up with the state of the art, but you could work on individual tasks, especially ones where generic models are not working as well, and still do a very good job, so definitely take a look for inspiration.

Okay, great, that's all I have. Are there any questions before we move on to the discussion period, which in this case is not a discussion but a presentation of the assignment? Any questions about datasets or tasks? Oh, sorry, earlier I said the homework was on NER, but it's actually on part-of-speech tagging; I apologize.

Before I go into that, I'd like to point out that starting next time we will indeed have discussion and reading assignments. The reading for next time is "Modeling Language Variation and Universals: A Survey on Typological Linguistics for Natural Language Processing".
Only sections one through three are required, but the whole survey is good, so if you don't mind reading 30 pages or so it's worth taking a look at the rest as well. Then, based on what you learned from the reading, try to think of some unique typological features of a language you know, regarding phonology, morphology, syntax, or pragmatics; you don't need to cover all of them, just one or two, and we'll have a discussion where everybody shares what they came up with. Today, though, we have the assignment one introduction, which the TAs are going to present.
"Are you speaking? Do I need to unmute?" "I don't think he's on mute." "Okay, there we go. Can you hear me now?" "Yep, great."

Hi, I'm one of the TAs, and my co-TA will give part of the introduction. This first assignment is meant to give you a practical introduction to multilingual part-of-speech tagging. As was mentioned briefly, parts of speech are just lexical categories, word classes, or tags: in the example sentence "He saw two birds", we assign the part of speech pronoun to "he", verb to "saw", and so on. You'll get a dataset of sentences in different languages, and you need to output the POS tags for each word in each sentence.

Aside from giving you a practical introduction to multilingual POS tagging, we want to give you an experimental approach to multilingual problems, such as investigating the challenges of low-resource languages, meaning languages with limited availability of labeled data, and to get you familiar with deep learning frameworks, AWS, and multilingual datasets. To do this assignment you'll need a machine with a GPU; you can use AWS or your own computer if it has a GPU, and you'll have to install some Python packages. The tricky part is probably the AWS setup: shortly I'll post instructions on Piazza on how to request AWS credit, the assignment will have further setup instructions, and all students should set up an AWS account using their Andrew email. We tried doing the assignment without a GPU, but a GPU is strongly recommended (the next assignments are not really doable without one); I trained the model on a very old MacBook Air and it took around three to four hours, and you need to retrain whenever you change the parameters, so do it with a GPU. One more thing to note: make sure you stop the AWS instance when you're not using it, because otherwise you will be billed continuously. The AWS setup should be fairly straightforward, and the instructors and TAs of the Intro to Deep Learning course have provided a very comprehensive AWS fundamentals playlist, which will be linked on the assignment handout page, so you can follow those steps if you run into difficulties.

For this assignment we're going to give you an archive containing all the code and data, and we'll post the link later. In the data directory you'll find training data for six languages. The format looks like this: each sentence is a block of lines, each line containing a word and its POS tag separated by a tab, with sentences separated by blank lines. You don't have to worry too much about this format, because our code handles it for you. For this homework we use a simple baseline BiLSTM model: an embedding layer followed by a bidirectional LSTM. The input to the model is a sequence of words, and the output is a sequence of POS tags.
Let's work through the files in the archive. The first file is config.json, which contains the hyperparameters used to train the model; you may have to change this file when you do the analysis for your report. The udpos.py file is where we implement reading the dataset; I don't think you'll have to modify it, so I won't go into detail. The model.py file is where we implement the bidirectional LSTM model; if you're familiar with PyTorch you can see it's a very simple model: it applies an embedding layer, then an LSTM, and finally predicts the tags with a fully connected layer. If you want to build a stronger model, this is the file you'll probably modify. Most of the complex work is done in main.py: it handles loading the data, does some preprocessing, assembles the samples into batches, and trains the model.

Let's go through its contents. The first thing it does is load the dataset with the function we defined, and then build a vocabulary for the input text and for the output POS tags. The reason we build this vocabulary is that in modern deep learning frameworks, an embedding layer maps the indices of symbols to vectors, so before we can use the embedding layer we need a mapping that assigns each word an index. We iterate through the training data to see which words occur, and for a similar reason we iterate through the dataset to find all the possible POS tags. Once we have the mapping from tokens to indices and the mapping from POS tags to indices, we define two functions that convert the training data into indices. With these two functions we can define the collate_batch function, whose purpose is to pack a bunch of text/label pairs into a single batch that will be used to train the model. It first converts the words in the samples to tensors by calling the function above, converts the tags to indices the same way, and then, most importantly, pads the sequences of tokens to the same length so that they can be stacked into a single tensor, which is later used to train the model.

Once we have the collate_batch function, we can define data loaders. A data loader is an iterator: we get batches from it by iterating over it. Here we define three data loaders, for the training set, the validation set, and the test set respectively, and then we use them to train the model. Training repeats the following process for the number of epochs given by the max_epochs hyperparameter: first train the model on the training data, then evaluate it on the validation set, and if the model achieves a better validation loss after this epoch, save it. At the end of the training process, you'll have the model with the minimum validation loss.
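The padding-and-stacking step described above is the core of any collate function; here is a hedged sketch of what it might look like using torch's pad_sequence. This is only an illustration, not the exact code in main.py, and PAD_ID is an assumed padding index.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

PAD_ID = 0  # assumed padding index for both words and tags

def collate_batch(samples):
    """samples: list of (word_ids, tag_ids) pairs, each a 1-D LongTensor."""
    words = [w for w, _ in samples]
    tags = [t for _, t in samples]
    # Pad every sentence to the length of the longest one so they stack into one tensor.
    word_batch = pad_sequence(words, batch_first=True, padding_value=PAD_ID)
    tag_batch = pad_sequence(tags, batch_first=True, padding_value=PAD_ID)
    return word_batch, tag_batch

# Usage: a dataset yielding (word_ids, tag_ids) pairs can be wrapped as
# loader = torch.utils.data.DataLoader(dataset, batch_size=32, shuffle=True,
#                                      collate_fn=collate_batch)
```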
As for the train function itself, it's also very simple: it iterates through the training data loader, makes predictions over the text, computes the loss by comparing its predictions to the ground-truth tags, uses that loss for backward propagation, and then calls the optimizer to update the parameters of the model. That's basically what our code does.

What you need to submit for this assignment is the code and a write-up. Part of the points come from running the code, making modifications to the model, and so on, but it's equally important to give a detailed explanation of the results you see, what you did, and why you think your changes affected the results. You'll submit this on Canvas.

And here is the part everyone really wants to see: how to get a good grade. There are several tiers. If you just run the code for the existing English model and run that model on the test set, you'll get a B. If you train the model on the different multilingual datasets and evaluate each on its respective test set, you'll get a B+. To get any kind of A, you need to write a report with detailed analysis; there are many ways to comment on the results: how performance varies across languages, which tags are most often mistaken for other tags (for example, are pronouns and nouns easily confused with each other?), and so on. A report detailing this kind of analysis will probably get you an A-. To get an A or above, you need to implement a non-trivial extension that improves the existing scores, and there are a lot of ways to do this: you can add a CNN input layer to capture character-level features, you can use pre-trained embeddings, and so on. It's essentially an experiment you design and run on your own, and we're excited to see your results.
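To make one of those extension ideas concrete, here is a hedged sketch of a character-level CNN input layer whose output could be concatenated with the word embeddings before the BiLSTM. This is only an illustration under assumed sizes and names, not a prescribed solution.

```python
import torch
import torch.nn as nn

class CharCNN(nn.Module):
    """Character-level features for each word: embed characters, convolve, max-pool."""

    def __init__(self, num_chars, char_dim=30, num_filters=50, kernel_size=3):
        super().__init__()
        self.char_embedding = nn.Embedding(num_chars, char_dim, padding_idx=0)
        self.conv = nn.Conv1d(char_dim, num_filters, kernel_size, padding=1)

    def forward(self, char_ids):
        # char_ids: (batch, seq_len, max_word_len) character indices per word
        batch, seq_len, word_len = char_ids.shape
        chars = self.char_embedding(char_ids.view(-1, word_len))  # (batch*seq_len, word_len, char_dim)
        conv_out = torch.relu(self.conv(chars.transpose(1, 2)))   # (batch*seq_len, num_filters, word_len)
        pooled = conv_out.max(dim=2).values                       # one vector per word
        return pooled.view(batch, seq_len, -1)                    # (batch, seq_len, num_filters)

# These word-level character features could then be concatenated with the word
# embeddings before the BiLSTM, e.g. torch.cat([word_embeds, char_feats], dim=-1).
```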
Lesson Outline
- Introduction to Text Classification and Sequence Labeling
- These are broad task categories, not tasks in themselves, that can be solved in similar ways.
- Text classification involves predicting a categorical label for an input text.
- Sequence labeling involves predicting an output label sequence of equal length to the input text.
- Text Classification Tasks
- Topic classification: Assigning a topic to a given text. For example, “I like peaches and pears” would be classified as “food,” while “I like Peaches and Herb” would be classified as “music” (Peaches & Herb being a musical duo).
- Language identification: Identifying the language in which a text is written. This task is particularly important in multilingual learning.
- Sentiment analysis: Determining the sentiment of a text (e.g., positive, negative, neutral) at the sentence or document level.
- Sequence Labeling Tasks
- Part-of-speech (POS) tagging: Assigning a part of speech tag to each word in a sentence. For example, “He saw two birds” would be tagged as “PRON VERB NUM NOUN”.
- Lemmatization: Identifying the base form of words, which is particularly useful in languages with rich morphology.
- Morphological tagging: Predicting various features of a word, such as tense, number, and verb form.
- Span Labeling
- Involves predicting output spans and labels for input text.
- Named entity recognition (NER): Identifying and labeling named entities like person, organization, or location. For example, “Graham Neubig” would be labeled as a “person” and “Carnegie Mellon University” as an “organization”.
- Syntactic chunking: Splitting a sentence into noun phrases and verb phrases.
- Semantic role labeling: Identifying the roles of different parts of a sentence, such as actor, predicate, and location.
- Span labeling can also be treated as a sequence labeling task by predicting beginning, inside, and outside (BIO) tags for each word in the span.
- Text Segmentation
- Splitting an input text into segmented text.
- Tokenization: Splitting text into tokens.
- Word segmentation: Identifying word boundaries in languages without spaces between words. For example, Japanese requires word segmentation since it does not use spaces.
- Morphological segmentation: Splitting words into morphemes.
- Modeling for Sequence Labeling and Classification
- The process involves extracting features from the input text and then predicting labels.
- A feature extractor can extract a single vector for the whole sequence (for text classification) or one vector for each word (for sequence labeling).
- A simple feature extractor is the bag-of-words model, which looks up a feature for each word and adds them together.
- A simple predictor is a linear transform with a softmax function, which converts scores into probabilities.
- More complex feature extractors include n-grams, syntax-based features, and neural networks.
- Neural networks, which use computation graphs, are now more common because they allow for transfer across languages.
- Neural Network Basics
- Neural networks are computation graphs with nodes and edges.
- Nodes represent values (scalars, vectors, matrices, or tensors); edges represent function arguments, so a node with incoming edges computes a function of its argument nodes.
- Forward propagation involves computing the value of each node in topological order, starting from the input values.
- Backpropagation is used to calculate the derivatives of the parameters with respect to a loss function and update the parameters to improve model performance.
- Recurrent Neural Networks (RNNs)
- RNNs are used to handle dependencies in language, such as word order, agreement, and semantic congruency.
- They process sequences by taking the input at the current time step and the features from the previous time step.
- RNNs can be used to represent sentences for tasks like text classification or represent words for tasks like sequence labeling.
- Training RNNs involves calculating a loss function, summing it up over a sequence, and doing backpropagation through time to update the parameters.
- Parameters are tied across time, allowing the network to handle sequences of arbitrary length.
- Bi-directional RNNs process the input sequence from both left to right and right to left.
- Multilingual Tasks and Datasets
- Language identification is a task that is inherently multilingual.
- Other multilingual text classification and sequence labeling tasks can be used for benchmarking multilingual models.
- MLDoc corpus is used for multilingual document classification.
- PAWS-X is a corpus for cross-lingual paraphrase detection (sentence-pair classification).
- Cross-lingual natural language inference (XNLI) is used for textual entailment prediction.
- Cross-lingual sentiment classification is another benchmark task.
- Universal Dependencies (UD) Treebank is a high-quality multilingual corpus that contains syntactic parses, POS tags, and morphological features for 90 languages.
- CoNLL 2002/2003 is a dataset for language-independent named entity recognition.
- WikiAnn is a dataset for entity recognition and linking in 282 languages.
- XTREME and XGLUE are composite benchmarks for multilingual learning, aggregating different sequence labeling and classification tasks.
- Assignment 1: Multilingual Part of Speech Tagging
- The goal is to give a practical introduction to multilingual part-of-speech tagging.
- Students will be given a dataset of sentences in different languages and must output the POS tags for each word.
- The assignment emphasizes an experimental approach to multilingual problems.
- A GPU is strongly recommended (later assignments essentially require one); AWS or a personal machine with a GPU can be used.
- Files containing code and data will be provided to students.
- Students need to submit code and a report detailing their analysis of results and model improvements.
- Grading criteria include running the code, training on multilingual datasets, and providing detailed analysis of the results.
- Students can improve their grade by making a non-trivial extension to improve the existing scores.
Reflection
Sequence labeling and text classification are broad task categories that can be solved using similar methods. Sequence labeling involves predicting an output label sequence of equal length to the input text, while text classification involves predicting a categorical label for an input text. These tasks are essential in natural language processing and can be used for a wide range of applications, such as sentiment analysis, named entity recognition, and part-of-speech tagging.
One point I wonder about is whether one or two formats are more common in industry or academia. For example, is BIO more common than IOB? And is there a standard format for representing part-of-speech tags? I would like to know more about the standard practices in the field.
More broadly, is there a format that is a good fit for other tasks like alignment annotation, TTS, and so on?
Papers
Here is a list of all the papers mentioned in the lesson:
- Jauhiainen et al. (2018) Automatic Language Identification in Texts: A Survey
- This is a survey that provides an overview of automatic language identification methods.
- Caswell et al. (2020) Language ID in the Wild: Unexpected Challenges on the Path to a Thousand Language Web Text Corpus
- This paper discusses challenges in language identification, particularly when applied to web text.
- Schwenk and Li (2018) MLDoc: A Corpus for Multilingual Document Classification in Eight Languages
- This paper introduces a corpus for multilingual document classification.
- Conneau et al. (2018) XNLI: Evaluating Cross-lingual Sentence Representations
- This paper presents a corpus for cross-lingual natural language inference, a task that involves textual entailment prediction.
- Yang et al. (2019) PAWS-X: Paraphrase Adversaries from Word Scrambling, Cross-lingual Version
- This paper describes a dataset for paraphrase detection across multiple languages.
- Ponti et al. (2019) Modeling language variation and universals: A survey on typological linguistics for natural language processing
- This is a survey on typological linguistics for natural language processing.
- Tjong Kim Sang (2002) and Tjong Kim Sang and De Meulder (2003) CoNLL 2002/2003 Language-Independent Named Entity Recognition datasets
- These papers introduce datasets for language-independent named entity recognition.
- Pan et al. (2017) Cross-lingual Name Tagging and Linking for 282 Languages (WikiAnn)
- This paper presents a dataset for entity recognition and linking extracted from Wikipedia.
- Hu et al. (2020) XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalization
- This paper introduces a benchmark for evaluating cross-lingual generalization across multiple tasks and languages.
- Liang et al. (2020) XGLUE: A New Benchmark Dataset for Cross-lingual Pre-training, Understanding and Generation
- This paper presents a benchmark dataset for cross-lingual pre-training, understanding, and generation.
These are all the papers explicitly mentioned in the sources. The lesson also refers to the Universal Dependencies Treebank, UDify, and Stanza as resources for multilingual NLP tasks, which are linked below.
Resources
References
Citation
@online{bochman2022,
author = {Bochman, Oren},
title = {Sequence {Labeling}},
date = {2022-01-20},
url = {https://orenbochman.github.io/notes-nlp/notes/cs11-737/cs11-737-w02-sequence-labeling/},
langid = {en}
}