Learning Objectives
- Example Sequence Classification/Labeling Tasks
- Overall Framework of Sequence Classification/Labeling
- Sequence Featurization Models (BiRNN, Self Attention, CNNs); see the BiLSTM sketch after this list
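To make the featurization piece concrete, here is a minimal sketch of a BiRNN tagger, assuming PyTorch; the vocabulary size, tag-set size, and dimensions are placeholders rather than values prescribed by the assignment.

```python
# A minimal BiLSTM sequence-labeling sketch (assumes PyTorch is installed).
import torch
import torch.nn as nn


class BiLSTMTagger(nn.Module):
    """Embed tokens, featurize with a bidirectional LSTM, project to tag scores."""

    def __init__(self, vocab_size: int, tagset_size: int,
                 emb_dim: int = 100, hidden_dim: int = 128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        # The bidirectional LSTM produces 2 * hidden_dim features per token.
        self.encoder = nn.LSTM(emb_dim, hidden_dim, batch_first=True,
                               bidirectional=True)
        self.classifier = nn.Linear(2 * hidden_dim, tagset_size)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq_len) -> tag logits: (batch, seq_len, tagset_size)
        features, _ = self.encoder(self.embed(token_ids))
        return self.classifier(features)


if __name__ == "__main__":
    model = BiLSTMTagger(vocab_size=5000, tagset_size=18)  # e.g. 17 UPOS tags + padding
    dummy_batch = torch.randint(1, 5000, (2, 7))           # two sentences of 7 token ids
    print(model(dummy_batch).shape)                        # torch.Size([2, 7, 18])
```

Swapping the LSTM encoder for a self-attention or CNN encoder only changes the featurizer; the embedding layer and the per-token classification head stay the same.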
Some Ideas
- creating a multilingual POS tagger
- using a hierarchical model that does partial pooling to learn from multiple languages when working on a low-resource language (a sketch of this formulation follows the list)
- creating a surrogate, simulated language which:
- has parameters that correspond to the low-resource language, drawn from language databases such as WALS, Ethnologue, and Glottolog. The challenge then becomes how to generate the surrogate language from these parameters: one could try to craft realistic phonetic and morphological rules, or sidestep this complexity and use a simple mathematical construct to create data suitable for learning embeddings (see the generation sketch after the list).
- Use a phrase book as a template for generating texts in the surrogate languages. The outcome should be a dataset of translations of the phrase book into multiple languages. Note that it could also be feasible to generate multiple variants for both the source and target languages to avoid overfitting on the phrase book.
- a prior distribution that follows high-resource languages (i.e. the idea that high-frequency source words are more likely to be translated to high-frequency target words)
- a language model that is trained on the high-resource language and then used to generate the surrogate language
- a model that is trained on the surrogate language and then used to tag the low-resource language
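One conventional way to write the partial-pooling idea above, as a sketch rather than a commitment to specific distributions: each language $\ell$ gets its own parameters $\theta_\ell$, drawn from a shared prior whose hyperparameters are estimated from all languages together.

$$
\begin{aligned}
\mu, \tau &\sim \text{hyperprior} && \text{shared across all languages}\\
\theta_\ell &\sim \mathcal{N}(\mu, \tau^2) && \text{parameters of language } \ell\\
y_{\ell,i} &\sim p(y \mid x_{\ell,i}, \theta_\ell) && \text{tagged sentences in language } \ell
\end{aligned}
$$

Languages with little data are shrunk toward the cross-lingual mean, while well-resourced languages are driven mostly by their own data.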
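The phrase-book and frequency-matching ideas can be sketched together: map each source word to an invented surrogate word and carry the gold POS tags over, so the surrogate corpus inherits the source frequency distribution. Everything below (the toy phrase book, the CV word generator, the tag labels) is an invented placeholder, not part of the assignment data.

```python
# Sketch: generate a surrogate "language" from an annotated phrase book by
# replacing each source word with an invented word, keeping the POS tags.
import random
from collections import Counter

random.seed(0)

# Tiny stand-in for an annotated phrase book: (token, UPOS tag) pairs.
PHRASE_BOOK = [
    [("where", "ADV"), ("is", "AUX"), ("the", "DET"), ("station", "NOUN")],
    [("the", "DET"), ("station", "NOUN"), ("is", "AUX"), ("closed", "ADJ")],
    [("where", "ADV"), ("is", "AUX"), ("the", "DET"), ("market", "NOUN")],
]

CONSONANTS, VOWELS = "ptkmns", "aiou"


def invent_word(length: int = 2) -> str:
    """Invent a CV-syllable word; a stand-in for real morphophonological rules."""
    return "".join(random.choice(CONSONANTS) + random.choice(VOWELS)
                   for _ in range(length))


def build_lexicon(sentences):
    """Build a one-to-one source-to-surrogate lexicon."""
    freq = Counter(tok for sent in sentences for tok, _ in sent)
    ranked = [tok for tok, _ in freq.most_common()]
    surrogates = set()
    while len(surrogates) < len(ranked):
        surrogates.add(invent_word())
    # One-to-one mapping applied to the same sentences, so surrogate word
    # frequencies automatically mirror the source frequencies.
    return dict(zip(ranked, sorted(surrogates)))


def translate(sentences, lexicon):
    """Carry the source POS tags over to the surrogate tokens."""
    return [[(lexicon[tok], tag) for tok, tag in sent] for sent in sentences]


if __name__ == "__main__":
    lexicon = build_lexicon(PHRASE_BOOK)
    for sent in translate(PHRASE_BOOK, lexicon):
        print(sent)
```

Because the mapping is one-to-one over the same corpus, this is the simplest realization of the frequency-matching prior; richer variants could add morphology, word-order changes drawn from the typological parameters, or noise on both the source and surrogate sides.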