
Abstract
Canonical morphological segmentation aims to divide words into a sequence of standardized segments. In this work, we propose a character-based neural encoder-decoder model for this task. Additionally, we extend our model to include morpheme-level and lexical information through a neural reranker. We set the new state of the art for the task improving previous results by up to 21% accuracy. Our experiments cover three languages: English, German and Indonesian. – (Kann, Cotterell, and Schütze 2016)
Outline
- Introduction
  - Discusses morphological segmentation and its applications in NLP.
  - Explains the difference between surface segmentation and canonical segmentation, providing an example.
  - Highlights the advantages of canonical segmentation and the algorithmic challenges it introduces.
  - Presents a neural encoder-decoder model for canonical segmentation and a neural reranker to incorporate linguistic structure.
 
- Neural Canonical Segmentation
  - Formally describes the canonical segmentation task, mapping a word to a canonical segmentation.
  - Explains the probabilistic approach to learn a distribution p(c | w).
  - Details the two parts of the model: an encoder-decoder RNN and a neural reranker.
  - Describes the neural encoder-decoder model based on Bahdanau et al. (2014), using a bidirectional gated RNN (GRU) as the encoder.
  - Explains how the decoder defines a conditional probability distribution over possible segmentations.
  - Explains the attention mechanism and how attention weights are computed (see the sketch after this list).
  - Explains the neural reranker’s role in rescoring candidate segmentations from a sample set generated by the encoder-decoder.
  - Describes the reranking model’s ability to embed morphemes and incorporate character-level information.
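To make the architecture concrete, here is a minimal sketch of a character-level encoder-decoder with Bahdanau-style additive attention over a bidirectional GRU encoder, assuming PyTorch. The class name `CharSeg2Seg` and all dimensions are hypothetical; this illustrates the general architecture, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CharSeg2Seg(nn.Module):
    """Character-level seq2seq with additive attention (illustrative)."""

    def __init__(self, vocab_size, emb_dim=100, hid_dim=100):
        super().__init__()
        self.hid_dim = hid_dim
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # Bidirectional GRU encoder over the input word's characters.
        self.encoder = nn.GRU(emb_dim, hid_dim, bidirectional=True,
                              batch_first=True)
        # GRU decoder; its input is [previous char embedding; context].
        self.decoder = nn.GRU(emb_dim + 2 * hid_dim, hid_dim, batch_first=True)
        # Additive (Bahdanau-style) attention parameters.
        self.attn_W = nn.Linear(hid_dim, hid_dim, bias=False)
        self.attn_U = nn.Linear(2 * hid_dim, hid_dim, bias=False)
        self.attn_v = nn.Linear(hid_dim, 1, bias=False)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, src, tgt):
        # src: (B, S) input characters; tgt: (B, T) output characters,
        # assumed to start with BOS. Step t predicts tgt[:, t + 1], so
        # p(c | w) factorizes as prod_t p(c_t | c_<t, w).
        enc_states, _ = self.encoder(self.embed(src))            # (B, S, 2H)
        dec_h = torch.zeros(1, src.size(0), self.hid_dim, device=src.device)
        logits = []
        for t in range(tgt.size(1) - 1):
            # Attention weights: softmax over additive scores of the
            # current decoder state against every encoder state.
            scores = self.attn_v(torch.tanh(
                self.attn_W(dec_h[-1]).unsqueeze(1) + self.attn_U(enc_states)
            )).squeeze(-1)                                       # (B, S)
            alpha = F.softmax(scores, dim=-1)
            context = (alpha.unsqueeze(-1) * enc_states).sum(1)  # (B, 2H)
            step_in = torch.cat([self.embed(tgt[:, t]), context], dim=-1)
            dec_out, dec_h = self.decoder(step_in.unsqueeze(1), dec_h)
            logits.append(self.out(dec_out.squeeze(1)))
        return torch.stack(logits, dim=1)                        # (B, T-1, V)
```

The reranker is a separate model not shown here; it would embed the morphemes of each sampled segmentation and combine that score with the encoder-decoder's probability.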
 
- Related Work
  - Discusses various approaches to morphological segmentation.
  - Mentions unsupervised methods like LINGUISTICA and MORFESSOR.
  - Describes supervised approaches using conditional random fields (CRFs).
  - Distinguishes the approach from surface morphological segmentation methods using a window LSTM.
  - Relates the approach to other applications of recurrent neural network transduction models.
 
- Experiments
  - Describes the dataset used for comparison to earlier work.
  - Specifies the three languages used in the experiments: English, German, and Indonesian.
  - Notes that the high error rate for German may be due to its orthographic changes.
  - Explains the data extraction process from CELEX, DerivBase, and the MORPHIND analyzer for English, German, and Indonesian, respectively.
  - Details the training setup, including the use of an ensemble of five encoder-decoder models.
  - Describes the training of the reranking model, including sample set gathering and optimization.
  - Describes the baseline models used for comparison: a JOINT model and a weighted finite-state transducer (WFST).
  - Outlines the evaluation metrics used: error rate, edit distance, and morpheme F1 (see the sketch after this list).
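As a concrete reference for the metrics above, here is a sketch in plain Python assuming the common definitions: error rate as one minus exact-match accuracy, edit distance as the Levenshtein distance between predicted and gold segmentation strings, and morpheme F1 over the sets of predicted and gold morphemes. The paper's exact formulations may differ in detail.

```python
def error_rate(preds, golds):
    """Fraction of words whose predicted segmentation is not an exact match."""
    wrong = sum(p != g for p, g in zip(preds, golds))
    return wrong / len(golds)

def edit_distance(a, b):
    """Levenshtein distance between two segmentation strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def morpheme_f1(pred_segs, gold_segs):
    """F1 over predicted vs. gold morphemes for a single word."""
    pred, gold = set(pred_segs), set(gold_segs)
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(gold)
    return 2 * precision * recall / (precision + recall)

# Example: gold "un+believe+able" vs. prediction "un+belief+able"
print(edit_distance("un+believe+able", "un+belief+able"))                # 2
print(morpheme_f1(["un", "belief", "able"], ["un", "believe", "able"]))  # ~0.667
```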
 
- Results
  - Presents the results of the canonical segmentation experiment, showing improvements over baselines with both the encoder-decoder and reranker.
  - Discusses the additional improvements achieved by the reranker due to access to morpheme embeddings and existing words.
  - Analyzes cases where the right answer is not in the samples and errors due to annotation problems.
  - Discusses cases where the encoder-decoder finds the right solution but assigns a higher probability to an incorrect analysis.
  - Explains how the reranker corrects some errors based on lexical information and morpheme embeddings.
  - Investigates whether segments unseen in the training set are a source of errors.
 
- Conclusion and Future Work
  - Summarizes the developed model consisting of an encoder-decoder and neural reranker for canonical morphological segmentation.
  - States the model’s improvement over baseline models.
  - Discusses the potential for further performance increase by improving the reranker.
 
 
Reflections
- Why has this approach not caught on?
- Why are people using byte pair encoding rather than morphological segmentation?
- What resources are needed to train this kind of model on a new language?
  - A dataset of words with canonical segmentations.
  - Access to a lexicon or a large corpus to determine whether a canonical segment occurs as an independent word in the language. What if it is a bound morpheme that never appears alone, or part of a root-template morphological system? How should we verify that we are dealing with a morpheme and not a surface phonemic fragment?
 
- Can we do this without a canonical segmentation dataset? More specifically, can we induce morphology by processing surface forms of words and recover the canonical morphological forms using one of three loss functions (sketched below)?
  - affix loss (prefix, stem, suffix)
  - template loss (root, template)
  - agglutinative loss (stem, suffix sequence)
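As a purely speculative sketch of what these three losses might look like, the snippet below scores a candidate analysis by how cheaply it reconstructs the observed surface form, reusing the `edit_distance` helper from the metrics sketch above. Nothing here comes from the paper: every function name and signature is hypothetical, and a real system would need a differentiable reconstruction model rather than raw edit distance.

```python
def affix_loss(surface, prefix, stem, suffix):
    """Concatenative morphology: surface ~ prefix + stem + suffix."""
    return edit_distance(surface, prefix + stem + suffix)

def template_loss(surface, root, template):
    """Root-and-template morphology: 'C' slots in the template are
    filled left-to-right with the root consonants."""
    radicals = iter(root)
    realised = "".join(next(radicals, "") if ch == "C" else ch
                       for ch in template)
    return edit_distance(surface, realised)

def agglutinative_loss(surface, stem, suffixes):
    """Agglutinative morphology: surface ~ stem followed by a suffix chain."""
    return edit_distance(surface, stem + "".join(suffixes))

# Illustrative examples (Semitic root 'ktb'; Turkish 'evlerde' = in the houses):
print(template_loss("katab", "ktb", "CaCaC"))               # 0
print(agglutinative_loss("evlerde", "ev", ["ler", "de"]))   # 0
```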
 
 
The Paper
References
Kann, Katharina, Ryan Cotterell, and Hinrich Schütze. 2016. “Neural Morphological Analysis: Encoding-Decoding Canonical Segments.” In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, edited by Jian Su, Kevin Duh, and Xavier Carreras, 961–67. Austin, Texas: Association for Computational Linguistics. https://doi.org/10.18653/v1/D16-1097.
Citation
BibTeX citation:
@online{bochman2025,
  author = {Bochman, Oren},
  title = {Neural {Morphological} {Analysis:} {Encoding-Decoding}
    {Canonical} {Segments}},
  date = {2025-02-12},
  url = {https://orenbochman.github.io/notes-nlp/reviews/paper/2016-neural-morphological-segmentation/},
  langid = {en}
}
For attribution, please cite this work as:
Bochman, Oren. 2025. “Neural Morphological Analysis:
Encoding-Decoding Canonical Segments.” February 12, 2025. https://orenbochman.github.io/notes-nlp/reviews/paper/2016-neural-morphological-segmentation/.
