Hi. My name is Lukasz, and I want to tell you in this video why we made the machine learning library Trax. This is a little bit of a personal story for me. I’ve been at Google for about seven years now. I’m a researcher on the Google Brain team, but before I was a researcher, I was a software engineer, and I worked on a lot of machine learning projects and frameworks. This journey, for me, ended in the Trax library. I believe Trax is currently the best library to learn and to productionize machine learning research and machine learning models, especially sequence models: models like the Transformer, and models that are used in natural language processing. The reasons I believe that come from the personal journey that led me here. I will tell you a little bit about myself and how I got here, and then I’ll tell you why I think Trax is currently the best thing to use for machine learning, especially in natural language processing.

My journey with machine learning and machine learning frameworks started around 2014-15, when we were making TensorFlow. TensorFlow, as you probably know, is a big machine learning system; it has about 100 million downloads by now. It was released in November 2015, and it was a very emotional moment for all of us when we released it. At that point, we were not sure if deep learning would become as big as it did, and we were not sure how many users there would be. What we wanted to build was a system that was primarily very fast, that could run distributed machine learning at large scale with fast training. The main focus was speed. Making the system easy to program was a secondary focus: it mattered, but it was not the most important thing.

After releasing TensorFlow, I worked on machine translation, especially on Google’s Neural Machine Translation system. This was the first system using deep sequence models that was used by the Google Translate team and actually released as a product. It handles all of Google’s translations these days; every language we have has a neural model. It started with LSTM and RNN models, and now it’s mostly Transformers. We released it in 2016, based on the TensorFlow framework. These models are amazing, much better than the previous phrase-based translation models, but they took a long time to train. They trained for days on clusters of GPUs at that time. This was not practical for anyone other than Google. It was only possible because we had the TensorFlow system, a large group of engineers who knew it very well, and we could train for days and days. That was great, but I felt it was not satisfactory, because no one else could do that. It could not be done at a university. You could not launch a startup doing it, because it was impossible unless you were Google, or maybe Microsoft, but no one else. I wanted to change that.

To do that, we created the Tensor2Tensor library. Tensor2Tensor, which was released in 2017, started with the thought that we should make deep learning research, especially for sequence models, widely accessible. This was not going to work with those large RNN models, but while writing the library, we created the Transformer model. The Transformer has taken NLP by storm because it allows you to train much faster: at that time within a few days, and now in less than a day, in a matter of hours on an 8-GPU system. You can create translation models that surpass any RNN model. The Tensor2Tensor library has already become widely used.
It’s used in production Google systems. It’s used by some very large companies in the world, and it has led to a number of startups that I know about that basically exist thanks to this library. You could say, well, this is done and this is good; but the problem is, it has become complicated, it’s not nice to learn, and it has become very hard to do new research in. Around 2018, we decided it was time to improve. As time moves on, we need to do even better, and this is how we created Trax.

Trax is a deep learning library focused on clear code and speed. Let me tell you why. If you think carefully about what you want from a deep learning library, there are really two things that matter: you want the programmers to be efficient, and you want the code to run fast. This is because what costs you is the time of the programmer and the money you need to pay to run your training code. Programmer time is very important and you need to use it efficiently, but in deep learning you’re training big models, and these cost money too. For example, using eight GPUs on demand from the cloud can cost almost $20 an hour, while a preemptible eight-core Cloud TPU costs only $1.40. In Trax, you can use one or the other without changing a single character in your code.

How does Trax make programmers efficient? Well, it was redesigned from the bottom up to be easy to debug and understand. You can literally read Trax code and understand what’s going on. This is not the case in some other libraries, and it is unluckily not the case anymore in TensorFlow. You could say, well, it used to be the case, but nowadays TensorFlow, even when we clean up the code, needs to be backwards compatible. It carries the weight of all these years of development, and these were crazy years of machine learning. There is a lot of baggage that it just has to carry because it’s backwards compatible. What we do in Trax is break backwards compatibility. This means you need to learn new things, and that carries some price. But what you get for that price is a newly, cleanly designed library which has full models, not just the primitives to build them: full models with dataset bindings. We regression-test these models daily, because we use the library ourselves, so we know every day that these models are running. It’s like a new programming language: it costs a little to learn, because it’s a new thing, but it makes your life much more efficient.

To make this point clear, consider the Adam optimizer, the most popular optimizer in machine learning these days. On the left, you see a screenshot from the paper that introduced Adam, and you see it’s about seven lines. Next to it is just a part of the Adam implementation in PyTorch, which is actually one of the cleanest ones, and you need to know much more: you need to know what parameter groups are, you need to know the special keys into these groups that index parameters by name, you need to do some state initialization and some conditionals, and other things. On the right, you see the Adam optimizer in TensorFlow Keras, and as you’ll see, it’s even longer. You need to apply it to resource variables and to non-resource variables, and you need to know what those are. The reason they exist is historical: currently we only use resource variables, but we have to support people who used the old non-resource variables too. There are a lot of things that in 2020 you actually don’t need anymore, but they have to stay and be maintained in the TensorFlow code.
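To make concrete what “about seven lines” means, here is a minimal sketch of the Adam update rule in NumPy-style Python, following Algorithm 1 of the Adam paper. The function name and signature are mine for illustration; Trax’s actual implementation is essentially this math, organized as an optimizer class, but this is not its code verbatim.

```python
import numpy as np

def adam_update(step, grads, params, m, v,
                lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
  """One Adam step, mirroring Algorithm 1 of the Adam paper (illustrative sketch)."""
  m = (1 - b1) * grads + b1 * m           # Update biased first-moment estimate.
  v = (1 - b2) * (grads ** 2) + b2 * v    # Update biased second-moment estimate.
  m_hat = m / (1 - b1 ** (step + 1))      # Bias-corrected first moment.
  v_hat = v / (1 - b2 ** (step + 1))      # Bias-corrected second moment.
  params = params - lr * m_hat / (np.sqrt(v_hat) + eps)  # Parameter update.
  return params, m, v
```

Called in a loop with step = 0, 1, 2, and so on, this reproduces the paper’s update; the point of the comparison is that a framework’s Adam can stay this close to the paper.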
While if you go to the Trax code, this is the full code of Adam in Trax. It’s very similar to the paper, and that’s the whole point. If you’re implementing a new paper, or if you’re learning and you want to find, in the code of the framework, where the equations from the paper are, you can really do that here. So that is the benefit of Trax. The price of this benefit is that you’re using a new thing. But there is a huge gain that comes to you when you’re actually debugging your code. When you’re debugging, you will hit lines that are in the framework, so you will actually need to understand those lines, which means you need to understand all of that PyTorch or all of that TensorFlow code if you use those frameworks. In Trax, you only need to understand the Trax lines. It’s much easier to debug, which makes programmers more efficient.

Now, this efficiency would not be worth that much if the code ran slowly. There are a lot of beautiful frameworks where you can program things in a few lines, but they run so slowly that they’re actually useless. Not so in Trax, because we use the just-in-time compiler technology that was built over the last six years for TensorFlow. It’s called XLA, and we use it underneath Trax. These teams have put tremendous effort into making this the fastest code on the planet. There is an industry competition called MLPerf. In 2020, JAX actually won this competition with the fastest Transformer ever benchmarked independently: the JAX Transformer ran in 0.26 of a minute, so in about 16 seconds, while the fastest TensorFlow Transformer on the same hardware took 0.35 minutes. So you see, that’s almost 50 percent slower. The fastest PyTorch, though this was not on TPU, took 0.62 minutes. Being two times faster is a significant gain. It’s not clear you’ll get the same gain on every model or on other hardware; there was a lot of work to tune it for this particular model and hardware. But in general, Trax runs fast. This means you’ll pay less for the TPUs and GPUs you’ll be running on in the cloud.

It’s also tested with TPUs on Colab. Colabs are the IPython notebooks that Google gives you for free. You can select a hardware accelerator, you can select TPU, and run the same code with no changes. It runs on GPU, TPU, or CPU, and on Colab you’re getting an eight-core TPU for free. So you can test your code there and then run it in the cloud for much cheaper than with other frameworks, and it really runs fast.

So these are the reasons to use Trax, and for me, Trax is also super fun. It’s super fun to learn and super fun to use, because we had the liberty to do things from scratch, using many years of experience. You can write models using combinators; this is a whole Transformer language model on the left (a small sketch of the style follows below). On the right, you can see a snippet from the README: this is everything you need to run a pre-trained model and get your translations (also sketched below). This gave us the opportunity to clean up the framework, clean up the code, and make sure it runs really fast. It’s a lot of fun to use. So I encourage you to come check it out. See how you can use Trax for your own machine learning endeavors, whether for research, for starting a startup, or for running it at a big company. I think Trax will be there for you.
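To give a flavor of the combinator style described above, here is a small sketch along the lines of the example in the Trax README. It is a tiny LSTM language model rather than the full Transformer from the slide, and the hyperparameters are placeholders.

```python
from trax import layers as tl

# Combinators: Serial feeds each layer's output into the next one.
model = tl.Serial(
    tl.Embedding(vocab_size=8192, d_feature=256),  # Tokens -> vectors.
    tl.LSTM(n_units=256),                          # Recurrent core.
    tl.Dense(8192),                                # Back to vocabulary logits.
    tl.LogSoftmax(),                               # Log-probabilities.
)
print(model)  # Printing a model shows its structure, layer by layer.
```

A full Transformer language model is written the same way, as one Serial of embedding, attention, and feed-forward layers, which is why the whole model fits on a single slide.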
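And for the translation snippet on the right, the pre-trained model example looks roughly like this; it is reproduced from memory of the Trax README, so treat the exact checkpoint and vocabulary paths (the public gs://trax-ml ones) as assumptions that may have moved.

```python
import trax

# Create a Transformer and initialize it from a published checkpoint.
model = trax.models.Transformer(
    input_vocab_size=33300,
    d_model=512, d_ff=2048,
    n_heads=8, n_encoder_layers=6, n_decoder_layers=6,
    max_len=2048, mode='predict')
model.init_from_file('gs://trax-ml/models/translation/ende_wmt32k.pkl.gz',
                     weights_only=True)

# Tokenize an input sentence (tokenize operates on streams).
sentence = 'It is nice to learn new things today!'
tokenized = list(trax.data.tokenize(iter([sentence]),
                                    vocab_dir='gs://trax-ml/vocabs/',
                                    vocab_file='ende_32k.subword'))[0]

# Decode from the Transformer.
tokenized = tokenized[None, :]  # Add a batch dimension.
tokenized_translation = trax.supervised.decoding.autoregressive_sample(
    model, tokenized, temperature=0.0)  # temperature=0.0: greedy decoding.

# De-tokenize back into text.
tokenized_translation = tokenized_translation[0][:-1]  # Drop batch and EOS.
translation = trax.data.detokenize(tokenized_translation,
                                   vocab_dir='gs://trax-ml/vocabs/',
                                   vocab_file='ende_32k.subword')
print(translation)
```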