Abstract
Canonical morphological segmentation aims to divide words into a sequence of standardized segments. In this work, we propose a character-based neural encoder-decoder model for this task. Additionally, we extend our model to include morpheme-level and lexical information through a neural reranker. We set the new state of the art for the task improving previous results by up to 21% accuracy. Our experiments cover three languages: English, German and Indonesian. – (Kann, Cotterell, and Schütze 2016)
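To make the task concrete, here is a small sketch contrasting surface and canonical segmentation on the word "achievability", a standard illustration of the distinction. The helper name `is_surface_segmentation` is made up for these notes. The point is only that surface segments concatenate back to the word, while canonical segments generally do not, because orthographic changes are undone.

```python
# Illustrative example: surface vs. canonical segmentation of one word.
word = "achievability"
surface = ["achiev", "abil", "ity"]     # substrings of the word as written
canonical = ["achieve", "able", "ity"]  # standardized (restored) morphemes


def is_surface_segmentation(word: str, segments: list[str]) -> bool:
    """Return True if the segments concatenate exactly to the word.

    Hypothetical helper for these notes, not something from the paper.
    """
    return "".join(segments) == word


print(is_surface_segmentation(word, surface))    # True
print(is_surface_segmentation(word, canonical))  # False
```

This is why canonical segmentation cannot be treated as pure boundary tagging over the input characters: the output may contain characters that never appear in the input.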
Outline
- Introduction
  - Discusses morphological segmentation and its applications in NLP.
  - Explains the difference between surface segmentation and canonical segmentation, providing an example.
  - Highlights the advantages of canonical segmentation and the algorithmic challenges it introduces.
  - Presents a neural encoder-decoder model for canonical segmentation and a neural reranker to incorporate linguistic structure.
- Neural Canonical Segmentation
  - Formally describes the canonical segmentation task, mapping a word to a canonical segmentation.
  - Explains the probabilistic approach to learn a distribution p(c | w).
  - Details the two parts of the model: an encoder-decoder RNN and a neural reranker.
  - Describes the neural encoder-decoder model based on Bahdanau et al. (2014), using a bidirectional gated RNN (GRU) as the encoder.
  - Explains how the decoder defines a conditional probability distribution over possible segmentations.
  - Explains the attention mechanism and how attention weights are computed.
  - Explains the neural reranker’s role in rescoring candidate segmentations from a sample set generated by the encoder-decoder.
  - Describes the reranking model’s ability to embed morphemes and incorporate character-level information.
- Related Work
  - Discusses various approaches to morphological segmentation.
  - Mentions unsupervised methods like LINGUISTICA and MORFESSOR.
  - Describes supervised approaches using conditional random fields (CRFs).
  - Distinguishes the approach from surface morphological segmentation methods using a window LSTM.
  - Relates the approach to other applications of recurrent neural network transduction models.
- Experiments
  - Describes the dataset used for comparison to earlier work.
  - Specifies the three languages used in the experiments: English, German, and Indonesian.
  - Notes the potential cause of the high error rate for German due to its orthographic changes.
  - Explains the data extraction process from CELEX, DerivBase, and MORPHIND analyzer for English, German, and Indonesian, respectively.
  - Details the training setup, including the use of an ensemble of five encoder-decoder models.
  - Describes the training of the reranking model, including sample set gathering and optimization.
  - Describes the baseline models used for comparison: JOINT model and a weighted finite-state transducer (WFST).
  - Outlines the evaluation metrics used: error rate, edit distance, and morpheme F1.
- Results
  - Presents the results of the canonical segmentation experiment, showing improvements over baselines with both the encoder-decoder and reranker.
  - Discusses the additional improvements achieved by the reranker due to access to morpheme embeddings and existing words.
  - Analyzes cases where the right answer is not in the samples and errors due to annotation problems.
  - Discusses cases where the encoder-decoder finds the right solution but assigns a higher probability to an incorrect analysis.
  - Explains how the reranker corrects some errors based on lexical information and morpheme embeddings.
  - Investigates whether segments unseen in the training set are a source of errors.
- Conclusion and Future Work
  - Summarizes the developed model consisting of an encoder-decoder and neural reranker for canonical morphological segmentation.
  - States the model’s improvement over baseline models.
  - Discusses the potential for further performance increase by improving the reranker.
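The three evaluation metrics named under Experiments (error rate, edit distance, and morpheme F1) can be sketched roughly as below. This is my own minimal reading of them, not the paper's evaluation script: the function names, the "+" separator, the use of multiset overlap for F1, and averaging per word are all assumptions.

```python
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insert, delete, substitute; cost 1 each)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def morpheme_f1(pred: list[str], gold: list[str]) -> float:
    """F1 over morphemes, with overlap counted as a multiset (assumption)."""
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


def evaluate(preds: list[list[str]], golds: list[list[str]], sep: str = "+") -> dict:
    """Average the three metrics over a list of words (per-word averaging assumed)."""
    n = len(golds)
    pairs = list(zip(preds, golds))
    return {
        # fraction of words whose full segmentation is not an exact match
        "error_rate": sum(p != g for p, g in pairs) / n,
        # mean edit distance between the joined segmentation strings
        "edit_distance": sum(levenshtein(sep.join(p), sep.join(g)) for p, g in pairs) / n,
        # mean per-word morpheme F1
        "morpheme_f1": sum(morpheme_f1(p, g) for p, g in pairs) / n,
    }


golds = [["achieve", "able", "ity"], ["achieve", "able", "ity"]]
preds = [["achieve", "able", "ity"], ["achiev", "able", "ity"]]
print(evaluate(preds, golds))
```

Error rate is all-or-nothing per word, while edit distance and morpheme F1 give partial credit, which is presumably why it helps to report all three together.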
Reflections
- Why has this not caught on?
- Why are people using byte pair encoding and morphological segmentation?
- What resources are needed to train this kind of model on a new language?
  - A dataset of words with canonical segmentation.
  - Access to a lexicon or a large corpus to determine if a canonical segment occurs as an independent word in the language. What if it is a bound morpheme that never appears alone, or part of a root-template morphological system? How should we verify that we are dealing with a morpheme and not a surface phonemic fragment?
- Can we do this without a canonical segmentation dataset? More specifically, can we induce morphology by processing surface forms of words, and induce the canonical morphological forms using one of three loss functions:
  - affix loss (prefix, stem, suffix)
  - template loss (root, template)
  - agglutinative loss (stem, suffix sequence)
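To think through the lexicon question above, here is a toy rescoring sketch. It is not the paper's neural reranker: it only adds a hand-weighted lexicon-membership bonus to each candidate's log-probability. The function `rerank`, the `lexicon_weight` parameter, the candidate probabilities, and the tiny lexicon are all invented for illustration.

```python
import math


def rerank(candidates, lexicon, lexicon_weight=1.0):
    """Pick the best candidate after adding a lexicon bonus to its log-probability.

    candidates: list of (segments, log_prob) pairs, e.g. samples from a
    sequence model. The bonus is the fraction of segments found in the
    lexicon, scaled by lexicon_weight (a hypothetical hand-set weight).
    """
    def score(item):
        segments, log_prob = item
        in_lexicon = sum(seg in lexicon for seg in segments)
        return log_prob + lexicon_weight * in_lexicon / len(segments)

    return max(candidates, key=score)[0]


# Made-up candidate list: the base model slightly prefers the wrong analysis.
candidates = [
    (["achiev", "able", "ity"], math.log(0.5)),
    (["achieve", "able", "ity"], math.log(0.4)),
]
lexicon = {"achieve", "able"}  # toy lexicon; "ity" is bound and never appears alone

print(rerank(candidates, lexicon))  # ['achieve', 'able', 'ity']
```

Note that the bound morpheme "ity" earns no credit here, which is exactly the concern raised above: a plain lexicon lookup cannot tell a legitimate bound morpheme from a surface fragment, so some other source of evidence would be needed for those.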
References
Kann, Katharina, Ryan Cotterell, and Hinrich Schütze. 2016. “Neural Morphological Analysis: Encoding-Decoding Canonical Segments.” In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, edited by Jian Su, Kevin Duh, and Xavier Carreras, 961–67. Austin, Texas: Association for Computational Linguistics. https://doi.org/10.18653/v1/D16-1097.
Citation
BibTeX citation:
@online{bochman2025,
author = {Bochman, Oren},
title = {Neural {Morphological} {Analysis:} {Encoding-Decoding}
{Canonical} {Segments}},
date = {2025-02-12},
url = {https://orenbochman.github.io/notes-nlp/reviews/paper/2016-neural-morphological-segmentation/},
langid = {en}
}
For attribution, please cite this work as:
Bochman, Oren. 2025. “Neural Morphological Analysis: Encoding-Decoding Canonical Segments.” February 12, 2025. https://orenbochman.github.io/notes-nlp/reviews/paper/2016-neural-morphological-segmentation/.