This review is a bit of a mess: it has gone through at least three editions of this blog, and I have learned much about writing and structuring a review since then. I think it needs a bit of an overhaul; this latest version is a step in the right direction.
There are many other reviews of this paper, but I think covering it is a highly relevant follow-up to the NMT assignments, as well as to the earlier assignment on MT with KNN in the Classification and Vector Space Models course.
This is an attention paper for machine translation.
In the earlier work, cf. Bahdanau, Cho, and Bengio (2016), the authors used a bidirectional RNN with gated recurrent units (GRUs) as the encoder and a unidirectional RNN with GRUs as the decoder. The attention mechanism dynamically aligned source words with target words during decoding.
In this paper the authors used stacked LSTMs for both the encoder and the decoder. The paper proposed two simple and effective classes of attention mechanisms: a global approach that always attends to all source words and a local approach that only looks at a subset of source words at a time. It demonstrated the effectiveness of both approaches on the WMT translation tasks between English and German in both directions. The paper also introduced the input-feeding approach, which feeds attentional vectors as inputs to the next time steps to inform the model about past alignment decisions.
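As a rough illustration of input feeding, here is a minimal NumPy sketch (toy dimensions of my own choosing, not the authors' code) in which the previous attentional vector is concatenated with the previous target word's embedding to form the next decoder input:

```python
import numpy as np

emb_dim, hid_dim = 4, 6
embedding = np.random.randn(10, emb_dim)   # toy target vocabulary of 10 words
h_bar_prev = np.zeros(hid_dim)             # attentional vector from step t-1

def decoder_input(y_prev_id, h_bar_prev):
    # Input feeding: concatenate the previous target word's embedding with the
    # previous attentional vector, so the LSTM sees past alignment decisions.
    return np.concatenate([embedding[y_prev_id], h_bar_prev])

x_t = decoder_input(3, h_bar_prev)
print(x_t.shape)   # (10,) = emb_dim + hid_dim
```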
The paper is a must-read for anyone interested in neural machine translation and attention mechanisms in NLP. Not long after, the Transformer architecture was introduced in 2017, with attention becoming the backbone of the model.
Effective Approaches to Attention-based Neural Machine Translation
Abstract
An attentional mechanism has lately been used to improve neural machine translation (NMT) by selectively focusing on parts of the source sentence during translation. However, there has been little work exploring useful architectures for attention-based NMT. This paper examines two simple and effective classes of attentional mechanism: a global approach which always attends to all source words and a local one that only looks at a subset of source words at a time. We demonstrate the effectiveness of both approaches on the WMT translation tasks between English and German in both directions.
With local attention, we achieve a significant gain of 5.0 BLEU points over non-attentional systems that already incorporate known techniques such as dropout. Our ensemble model using different attention architectures yields a new state-of-the-art result in the WMT’15 English to German translation task with 25.9 BLEU points, an improvement of 1.0 BLEU points over the existing best system backed by NMT and an n-gram reranker. – (Luong, Pham, and Manning 2015)
Outline
- Introduction
- Lists the advantages of neural machine translation (NMT)
- Minimal domain knowledge
- Conceptual simplicity
- Ability to generalize well to long sequences
- Small memory footprint
- Easy implementation of decoders
- Discusses the concept of attention in neural networks
- Mentions different applications of attention in different tasks
- Highlights the application of attention mechanism in NMT by Bahdanau et al. (2015) and the lack of further exploration
- Presents the purpose of the paper, which is to design two novel types of attention-based models
- A global approach
- A local approach
- Presents the experimental evaluation of the proposed approaches on WMT translation tasks and its analysis
- Neural Machine Translation
- Describes the conditional probability of translating a source sentence to a target sentence
- Presents the two components of a basic NMT system an Encoder and a Decoder
- Discusses the use of recurrent neural network (RNN) architectures in NMT
- Presents the parameterization of the probability of decoding each word in the target sentence
- Presents the training objective used in NMT
- Attention-based Models
- Classifies the various attention-based models into two broad categories
- Global attention
- Local attention
- Presents the common process followed by both global and local attention models for deriving the context vector
- Describes the concatenation of target hidden state and source-side context vector for prediction
- Global Attention
- Describes the concept of global attention model
- Presents the derivation of a variable-length alignment vector by comparing the current target hidden state with each source hidden state
- Presents three different alternatives for the content-based function used in calculating the alignment vector
- Describes the location-based function used in early attempts to build attention-based models
- Describes the calculation of the context vector as the weighted average over all the source hidden states
- Presents a comparison of the proposed global attention approach to the model by Bahdanau et al. (2015)
- Simpler architecture
- Simpler computation path
- Use of multiple alignment functions
- Local Attention
- Describes the concept of local attention model
- Mentions the inspiration from the soft and hard attentional models
- Describes the selection of a small window of context and its advantages over soft and hard attention models
- Presents two variants of the local attention model
- Monotonic alignment
- Predictive alignment
- Describes the comparison to the selective attention mechanism by Gregor et al. (2015)
- Input-feeding Approach
- Describes the suboptimal nature of making independent attentional decisions in the global and local approaches
- Discusses the need for joint alignment decisions taking into account past alignment information
- Presents the input-feeding approach
- Mentions the effects of input-feeding approach
- Makes the model fully aware of previous alignment choices
- Creates a deep network spanning both horizontally and vertically
- Presents the comparison to other related works
- Use of context vectors by Bahdanau et al. (2015)
- Doubly attentional approach by Xu et al. (2015)
- Experiments
- Describes the evaluation setup and datasets used
- newstest2013 as development set
- newstest2014 and newstest2015 as test sets
- Mentions the use of case-sensitive BLEU for reporting translation performances
- Describes the two types of BLEU used
- Tokenized BLEU
- NIST BLEU
- Training Details
- Describes the data used for training NMT systems
- WMT’14 training data
- Presents the details of vocabulary size and filtering criteria used
- Discusses the architecture of the LSTM models and training settings
- Mentions the training speed and time
- English-German Results
- Discusses the different systems used for comparison
- Presents the progressive improvements achieved by
- Reversing the source sentence
- Using dropout
- Using global attention approach
- Using input-feeding approach
- Using local attention model with predictive alignments
- Notes the correlation between perplexity and translation quality
- Mentions the use of unknown replacement technique and the achievement of new SOTA result by ensembling 8 different models
- Describes the results of testing the models on newstest2015 and the establishment of new SOTA performance
- German-English Results
- Mentions the evaluation setup for the German-English translation task
- Presents the results highlighting the effectiveness of
- Attentional mechanism
- Input-feeding approach
- Content-based dot product function with dropout
- Unknown word replacement technique
- Analysis
- Describes the purpose of conducting extensive analysis
- Understanding of the learning process
- Ability to handle long sentences
- Choice of attentional architectures
- Alignment quality
- Learning Curves
- Presents the analysis of the learning curves for different models
- Notes the separation between non-attentional and attentional models
- Briefly mentions the effectiveness of input-feeding approach and local attention models
- Effects of Translating Long Sentences
- Briefly discusses the grouping of sentences based on lengths and computation of BLEU score per group
- Mentions the effectiveness of attentional models in handling long sentences
- Notes the superior performance of the best model across all sentence length buckets
- Choices of Attentional Architectures
- Presents the analysis of different attention models and alignment functions
- Highlights the poor performance of the location-based function
- Briefly mentions the performance of content-based functions
- Good performance of dot function for global attention
- Better performance of general function for local attention
- Notes the best performance of local attention model with predictive alignments
- Alignment Quality
- Briefly discusses the use of alignment error rate (AER) metric to evaluate the alignment quality
- Mentions the data used for evaluating alignment quality and the process of extracting one-to-one alignments
- Presents the results of AER evaluation and the comparison to Berkeley aligner
- Notes the better performance of local attention models compared to the global one
- Briefly discusses the AER of the ensemble
Dot-Product Attention
Dot-Product attention is the first of three attention mechanisms covered in the course and the simplest covered in this paper. Dot-Product attention is a good fit, in an engineering sense, for an encoder-decoder architecture on tasks where the source sequence is fully available at the start and the task is mapping or transforming the source sequence into an output sequence, as in alignment or translation.
`<eos>` marks the end of a sentence.
The first assignment in the course, using an encoder-decoder LSTM model with attention, is so similar to the setup discussed in this paper that I would not be surprised if it inspired it.
This is a review of the paper in which dot-product attention was introduced in 2015 by Minh-Thang Luong, Hieu Pham, and Christopher D. Manning, Effective Approaches to Attention-based Neural Machine Translation, which is available at Papers with Code. In this paper they tried to take the attention mechanism being used in other tasks and distill it to its essence, while at the same time finding a more general form.
They also came up with an interesting way to visualize the alignments produced by the attention mechanism.
So to recap: Luong et al. were focused on the alignment problem in NMT. They tried to tackle it using attention as a function of the content and as a function of its location, and they came up with a number of ways to distill and generalize the attention mechanism.
Attention was just another engineering technique to improve alignment; it had not yet taken center stage in the model, as it would in Attention Is All You Need (Vaswani et al. (2017)). I find it useful to understand the concepts and intuition that inspired these researchers to adapt attention, and how they came up with this form of attention.
The abstract begins with:
“An attentional mechanism has lately been used to improve neural machine translation (NMT) by selectively focusing on parts of the source sentence during translation.”
which was covered in the last lesson. The abstract continues with:
“This paper examines two simple and effective classes of attentional mechanism: a global approach which always attends to all source words and a local one that only looks at a subset of source words at a time.”
which introduces the two classes of attention, global and local, that the rest of the paper develops.
§2 Neural Machine Translation:
This section provides a summary of the NMT task using four equations. In particular, they note that in the decoder the conditional probability of the target given the source is of the form: log \space p(y \vert x) = \sum_{j=1}^m log \space p (y_j \vert y_{<j} , s)
where x is the source sentence and y is the target sentence. The probability of decoding each target word is parameterized as p (y_j \vert y_{<j} , s) = softmax (g(h_j))
Here, h_j is the RNN hidden unit, abstractly computed as: h_j = f(h_{j-1},s)
Our training objective is formulated as follows: J_t=\sum_{(x,y)\in D} -log \space p(y \vert x)
with D being our parallel training corpus.
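As a sanity check on the objective, here is a minimal NumPy sketch that assumes the decoder's per-step softmax outputs are already available (toy numbers, not the authors' code):

```python
import numpy as np

def sentence_nll(probs_per_step, target_ids):
    """-sum_j log p(y_j | y_<j, s) for one (x, y) pair, given per-step softmax outputs."""
    return -sum(np.log(p[t]) for p, t in zip(probs_per_step, target_ids))

# toy example: a 3-word target sentence over a 5-word vocabulary
probs = [np.array([0.10, 0.60, 0.10, 0.10, 0.10]),
         np.array([0.20, 0.20, 0.50, 0.05, 0.05]),
         np.array([0.05, 0.05, 0.10, 0.10, 0.70])]
targets = [1, 2, 4]
J = sentence_nll(probs, targets)   # sum this over all (x, y) in D for the full objective
print(J)
```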
§3 Overview of attention
Next they provide a recap of the attention mechanism to set their starting point:
> Specifically, given the target hidden state h_t and the source-side context vector c_t, we employ a simple concatenation layer to combine the information from both vectors to produce an attentional hidden state as follows:
\bar{h}_t = tanh(W_c[c_t;h_t])
The attentional vector \bar{h}_t is then fed through the softmax layer to produce the predictive distribution formulated as:
p(y_t \vert y_{<t}, x) = softmax(W_s\bar{h}_t)
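To make these two equations concrete, here is a minimal NumPy sketch with toy dimensions and random weights standing in for the learned parameters W_c and W_s:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

d, vocab = 4, 5                     # toy hidden size and target vocabulary size
h_t = np.random.randn(d)            # target hidden state
c_t = np.random.randn(d)            # source-side context vector
W_c = np.random.randn(d, 2 * d)     # combines [c_t; h_t] into the attentional hidden state
W_s = np.random.randn(vocab, d)     # maps the attentional hidden state to vocabulary logits

h_bar_t = np.tanh(W_c @ np.concatenate([c_t, h_t]))   # h_bar_t = tanh(W_c [c_t; h_t])
p_y_t = softmax(W_s @ h_bar_t)                         # p(y_t | y_<t, x) = softmax(W_s h_bar_t)
print(p_y_t.sum())                                     # ~1.0
```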
§3.1 Global attention
In §3.1 of the paper the variable-length alignment vector a_t is defined as:
\begin{align} a_t(s) & = align(h_t,\bar{h}_s) \newline & = \frac{ e^{score(h_t,\bar{h}_s)} }{ \sum_{s'} e^{score(h_t,\bar{h}_{s'})} } \newline & = softmax(score(h_t,\bar{h}_s)) \end{align}
where h_t and \bar{h}_s are the current target and source hidden states, and score(), referred to as a content-based function, is given in one of three alternative forms:
Dot product attention:
score(h_t,\bar{h}_s)=h_t^T\bar{h}_s
This form combines the source and target using a dot product. Geometrically, this is essentially a projection operation.
General attention:
score(h_t,\bar{h}_s)=h_t^TW_a\bar{h}_s
This form combines the source and target using a dot product after applying a learned weight matrix to the source. Geometrically, this is a projection of the target onto a linear transformation of the source, a form closely related to the scaled dot-product attention used later in transformers.
Concatenative attention:
score(h_t,\bar{h}_s)=v_a^T tanh(W_a [h_t;\bar{h}_s])
This one is a little puzzling: v_a^T is not explained further, but it appears to be a learned attention vector that is projected onto a linear transformation of the concatenated encoder and decoder hidden states. They also mention having considered a location-based function in early attempts:
a_t = softmax(W_a h_t)
which is just a linear transform of the hidden target state h_t.
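Putting the global-attention pieces together, here is a minimal NumPy sketch of the three content-based score functions, the alignment weights, and the context vector. The toy dimensions and random weights are my own stand-ins for the learned parameters, not the authors' code:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

d, n = 4, 6                           # toy hidden size and source length
h_t = np.random.randn(d)              # current target hidden state
H_s = np.random.randn(n, d)           # all source hidden states, one per row
W_a_gen = np.random.randn(d, d)       # learned weights for the "general" score
W_a_cat = np.random.randn(d, 2 * d)   # learned weights for the "concat" score
v_a = np.random.randn(d)              # learned vector for the "concat" score

score_dot = H_s @ h_t                                   # h_t^T h_bar_s
score_gen = H_s @ (W_a_gen.T @ h_t)                     # h_t^T W_a h_bar_s
score_cat = np.array([v_a @ np.tanh(W_a_cat @ np.concatenate([h_t, h_s]))
                      for h_s in H_s])                  # v_a^T tanh(W_a [h_t; h_bar_s])

a_t = softmax(score_dot)   # alignment weights a_t(s) over all source positions
c_t = a_t @ H_s            # context vector: weighted average of the source hidden states
```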
§3.2 Local Attention
In §3.2 they consider a local attention mechanism. This is a resource-saving modification of global attention, based on the simple idea of applying the attention mechanism within a fixed-size window.
We propose a local attentional mechanism that chooses to focus only on a small subset of the source positions per target word. This model takes inspiration from the tradeoff between the soft and hard attentional models proposed by Xu et al. (2015) to tackle the image caption generation task.
Our local attention mechanism selectively focuses on a small window of context and is differentiable. … In concrete details, the model first generates an aligned position p_t for each target word at time t. The context vector c_t
is then derived as a weighted average over the set of source hidden states within the window [p_t−D, p_t+D], where D is empirically selected. The big idea here is to use a fixed window size for this step to conserve resources when translating paragraphs or documents, a laudable notion for a time when LSTMs gobbled up resources in proportion to the sequence length.
They also talk about monotonic alignment, where p_t = t, and predictive alignment, where
p_t=S\cdot sigmoid(v_p^Ttanh(W_ph_t))
To favor source positions near p_t, the alignment weights are then reweighted with a Gaussian centered at p_t:
a_t(s)=align(h_t,\bar{h}_s)e^{(-\frac{(s-p_t)^2}{2\sigma^2})}
with align() as defined above and
\sigma=\frac{D}{2}
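Here is a minimal NumPy sketch of local attention with predictive alignment. The toy sizes and random weights are my own, and rounding p_t to pick the window is one reasonable reading of the paper rather than the authors' exact implementation:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

d, S, D = 4, 20, 3                 # toy hidden size, source length, window half-width
h_t = np.random.randn(d)           # current target hidden state
H_s = np.random.randn(S, d)        # source hidden states
W_p = np.random.randn(d, d)        # learned parameters for predictive alignment
v_p = np.random.randn(d)
sigma = D / 2.0

# predictive alignment: p_t = S * sigmoid(v_p^T tanh(W_p h_t)), so 0 <= p_t <= S
p_t = S / (1.0 + np.exp(-(v_p @ np.tanh(W_p @ h_t))))

# only the window [p_t - D, p_t + D] takes part in the attention computation
c = int(np.round(p_t))
lo, hi = max(0, c - D), min(S, c + D + 1)
window = H_s[lo:hi]
positions = np.arange(lo, hi)

# content-based weights (dot score here), reweighted by a Gaussian centered at p_t
a_t = softmax(window @ h_t) * np.exp(-((positions - p_t) ** 2) / (2 * sigma ** 2))
c_t = a_t @ window                 # context vector built from the local window only
```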
I found the rest of the paper of lesser interest:
- §5.4 on alignment quality
- some sample translations
- the references
- appendix A, which shows the visualization of alignment weights
The Paper
References
Citation
@online{bochman2021,
author = {Bochman, Oren},
title = {Effective {Approaches} to {Attention-based} {Neural}
{Machine} {Translation}},
date = {2021-05-08},
url = {https://orenbochman.github.io/notes-nlp/reviews/paper/2015-effective-approaches-to-attention-based-NMT/},
langid = {en}
}