This paper is mentioned in the tokenization lab in course 4, week 3.
Time permitting, I will try to dive deeper into this paper.
Podcast
Abstract
Linguistic similarity is multi-faceted. For instance, two words may be similar with respect to semantics, syntax, or morphology inter alia. Continuous word-embeddings have been shown to capture most of these shades of similarity to some degree. This work considers guiding word-embeddings with morphologically annotated data, a form of semisupervised learning, encouraging the vectors to encode a word’s morphology, i.e., words close in the embedded space share morphological features. We extend the log-bilinear model to this end and show that indeed our learned embeddings achieve this, using German as a case study. – (Cotterell and Schütze 2019)
Terminology
Here are some terms and concepts discussed in the paper, with explanations:
- Word embeddings
- These are vector representations of words in a continuous space, where similar words are located close to each other. The goal is to capture linguistic similarity, but the notion of “similarity” can vary (semantic, syntactic, morphological).
- Log-bilinear model (LBL)
- This is a language model that learns features along with weights, as opposed to using hand-crafted features. It uses context to predict the next word, and word embeddings fall out as low-dimensional representations of the context. The LBL is a generalization of the log-linear model.
- Morphology
- The study of word structure, including how words are formed from morphemes (the smallest units of meaning).
- Morphological tags
- These are annotations that describe the morphological properties of a word, such as case, gender, number, tense, etc. They can be very detailed for morphologically rich languages. For example, in German, a word might be tagged to indicate its part of speech, case, gender, and number.
- Morphologically rich languages
- Languages with a high morpheme-per-word ratio, where word-internal structure is important for processing. German and Czech are cited as examples.
- Morphologically impoverished languages
- Languages with a low morpheme-per-word ratio, such as English.
- Multi-task objective
- Training a model to perform multiple tasks simultaneously. In this case, the model is trained to predict both the next word and its morphological tag.
- Semi-supervised learning
- Training a model on a partially annotated corpus, using both labeled and unlabeled data. This approach is useful when large amounts of unannotated text are available.
- Contextual signature
- The words surrounding a given word. The context in which a word appears can be used to determine the word’s meaning and morphological properties.
- Hamming distance
- A measure of the difference between two binary strings, calculated as the number of positions at which the corresponding bits are different. In this context, it is used to compare morphological tags, which are represented as bit vectors.
- MORPHOSIM
- A metric for evaluating morphologically-driven embeddings that measures how morphologically similar a word is to its nearest neighbors in the embedding space. It is calculated as the average Hamming distance between the morphological tags of a word and its neighbors. Lower values are better, as they indicate that the nearest neighbors of each word are closer morphologically. A sketch of both Hamming distance and MorphoSim follows this list.
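Below is a minimal NumPy sketch of these two ideas, assuming each morphological tag is already encoded as a binary feature vector and the embeddings are rows of a matrix. The function names, the cosine-similarity neighbour search, and the unweighted average are my own illustrative choices, not necessarily the paper's exact formulation.

```python
import numpy as np

def hamming_distance(tag_a: np.ndarray, tag_b: np.ndarray) -> int:
    """Number of positions at which two binary tag vectors differ."""
    return int(np.sum(tag_a != tag_b))

def morphosim(embeddings: np.ndarray, tags: np.ndarray, k: int = 5) -> float:
    """Average Hamming distance between each word's tag vector and the tag
    vectors of its k nearest neighbours in embedding space (lower is better)."""
    # cosine similarity between every pair of word embeddings
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T
    np.fill_diagonal(sims, -np.inf)            # a word is not its own neighbour
    total = 0.0
    for i in range(len(embeddings)):
        neighbours = np.argsort(-sims[i])[:k]  # indices of the k nearest neighbours
        total += sum(hamming_distance(tags[i], tags[j]) for j in neighbours)
    return total / (len(embeddings) * k)
```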
Outline
- Introduction
- Presents the importance of capturing word morphology, especially for morphologically-rich languages.
- Highlights the multifaceted nature of linguistic similarity (semantics, syntax, morphology).
- Discusses the goal of the paper: to develop word embeddings that specifically encode morphological relationships.
- Related Work
- Discusses previous integration of morphology into language models, including factored language models and neural network-based approaches.
- Notes the role of morphology in computational morphology, particularly in morphological tagging for inflectionally-rich languages.
- Highlights the significance of distributional similarity in morphological analysis.
- Log-Bilinear Model
- Describes the log-bilinear model (LBL), a generalization of the log-linear model with learned features.
- Presents the LBL’s energy function and probability distribution in the context of language modeling.
- Morph-LBL
- Proposes a multi-task objective that jointly predicts the next word and its morphological tag.
- Describes the model’s joint probability distribution, incorporating morphological tag features.
- Discusses the use of semi-supervised learning, allowing training on partially annotated corpora.
- Evaluation
- Mentions the qualitative evaluation using t-SNE, showing clusters reflecting morphological and POS relationships.
- Introduces MorphoSim, a novel quantitative metric to assess the extent to which similar embeddings are morphologically related.
- Experiments and Results
- Presents experiments on the German TIGER corpus, comparing Morph-LBL with the original LBL and Word2Vec.
- Describes two experiments:
- Experiment 1: Evaluates the morphological information encoded in embeddings using a k-NN classifier (see the sketch after this outline).
- Experiment 2: Compares models using the MorphoSim metric to measure morphological similarity among nearest neighbors.
- Discusses the results, highlighting Morph-LBL’s superior performance in capturing morphological relationships, even without observing all tags during training.
- Conclusion and Future Work
- Summarizes the contributions of the paper: introducing Morph-LBL for inducing morphologically guided embeddings.
- Notes the model’s success in leveraging distributional signatures to capture morphology.
- Discusses future work on integrating orthographic features for further improvement.
- Mentions potential applications in morphological tagging and other NLP tasks.
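As a rough sketch of Experiment 1, a k-NN classifier can be trained on the learned embeddings to predict each word's morphological tag; held-out accuracy then indicates how much morphological information the embedding space encodes. The train/test split, the value of k, and the use of scikit-learn below are assumptions for illustration, not the paper's exact setup.

```python
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

def knn_tag_accuracy(embeddings, tag_labels, k=1, test_size=0.2, seed=0):
    """Predict morphological tags from word embeddings with a k-NN classifier
    and report held-out accuracy (higher means more morphology is encoded)."""
    X_train, X_test, y_train, y_test = train_test_split(
        embeddings, tag_labels, test_size=test_size, random_state=seed)
    clf = KNeighborsClassifier(n_neighbors=k)
    clf.fit(X_train, y_train)
    return clf.score(X_test, y_test)
```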
Log-Bilinear Model
p(w \mid h) = \frac{\exp\left(s_\theta(w,h)\right)}{\sum_{w'} \exp\left(s_\theta(w',h)\right)}
where w is a word, h is a history, and s_\theta is an energy function. Following the paper's notation, in the LBL we define s_\theta(w,h) = \left(\sum_{i=1}^{n-1} C_i r_{h_i}\right)^T q_w + b_w
where n-1 is the history length, r_{h_i} is the embedding of the i-th history word, C_i is a position-specific weight matrix, q_w is the target word's embedding, and b_w is a per-word bias.
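To make the moving parts concrete, here is a small NumPy sketch of the LBL energy and the resulting softmax over the vocabulary. The array names and shapes (R, Q, C, b) are my own choices for illustration, not the paper's notation.

```python
import numpy as np

# Assumed shapes (illustrative):
#   R : (V, d)        history-word embeddings r_w
#   Q : (V, d)        target-word embeddings q_w
#   C : (n-1, d, d)   position-specific context matrices C_i
#   b : (V,)          per-word biases b_w

def lbl_score(history_ids, w, R, Q, C, b):
    """Energy s_theta(w, h) = (sum_i C_i r_{h_i})^T q_w + b_w."""
    context = sum(C[i] @ R[h] for i, h in enumerate(history_ids))
    return context @ Q[w] + b[w]

def lbl_next_word_probs(history_ids, R, Q, C, b):
    """p(w | h): softmax of the energy over the whole vocabulary."""
    context = sum(C[i] @ R[h] for i, h in enumerate(history_ids))
    scores = Q @ context + b        # one score per vocabulary word
    scores -= scores.max()          # numerical stability
    exps = np.exp(scores)
    return exps / exps.sum()
```

For example, lbl_next_word_probs([2, 5], R, Q, C, b) returns a length-V probability vector for a two-word history.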
Morph-LBL
p(w, t \mid h) \propto \exp\left(\left(f_t^T S + \sum_{i=1}^{n-1} C_i r_{h_i}\right)^T q_w + b_w\right)
where f_t is a hand-crafted feature vector for a morphological tag t and S is an additional weight matrix.
Upon inspection, we see that
p(t \mid w, h) \propto \exp\left(f_t^T S\, q_w\right)
Hence, given a fixed embedding q_w for the word w, we can interpret S as the weights of a conditional log-linear model used to predict the tag t.
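As a sketch of that interpretation, the tag distribution for a fixed embedding q_w is just a softmax of f_t^T S q_w over the tag set. The matrix F stacking the hand-crafted tag feature vectors is my own naming for illustration, not the paper's.

```python
import numpy as np

def morph_lbl_tag_probs(q_w, F, S):
    """p(t | w, h) proportional to exp(f_t^T S q_w), computed for every tag at once.

    q_w : (d,)   embedding of the observed word w
    F   : (T, m) rows are the hand-crafted tag feature vectors f_t
    S   : (m, d) weight matrix shared with the joint Morph-LBL objective
    """
    scores = F @ (S @ q_w)   # f_t^T S q_w for each tag t
    scores -= scores.max()   # numerical stability before normalizing
    exps = np.exp(scores)
    return exps / exps.sum()
```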
The Paper
Resources
- Leipzig Glossing Rules, which provide a standard way to explain morphological features by example
- CMU Multilingual NLP 2020 (17): Morphological Analysis and Inflection
- Kann, Cotterell, and Schütze (2016) code
- UniMorph, the universal morphological database
References
Citation
@online{bochman2025,
author = {Bochman, Oren},
title = {Morphological {Word} {Embeddings}},
date = {2025-02-07},
url = {https://orenbochman.github.io/notes-nlp/reviews/paper/2019-morphological-embeddings/},
langid = {en}
}