Typology: The Space of Languages – NLP Course Notes & Research

Video 1: Lesson Video

This week’s slides

Supplementary Figure 1

We’ve always sent dances to TikTok as a way to communicate and I can’t remember when we ever did anything else – Graham Neubig

Learning Objectives

How to quantify similarity between languages
Language families and genealogical similarity
Linguistic typology and typological similarity
WALS and other typological databases
Typology Prediction / Typology-based language transfer (Lin et al.)

Transcript

Introduction

I’m going to run through this; we have about half an hour. I’m going to talk about all the languages on the planet, basically, so nothing too big. Of course there are a lot of languages, and the question really is how we can try to structure that space. We could look at them directly, but we’d like to be able to gather them into groups by similarity, and that’s where we come up with a more organized, structural way of doing it. We call this linguistic typology. There are established ways to find similarities between languages, and that’s what we’re going to talk about here.

So let’s have a look. First, we’ve got all the languages on the planet, with a dot for each one. Now this is very hard to do, because languages are not points: they’re spread over multiple places, and those places overlap as well. But you can see that some places seem to have more languages per square inch than others. In the richer countries, the ones where there’s a common education system, travel is very easy, and they have a long history of fighting each other over borders, you’re more likely to find languages that cover whole countries. In places that have been around for a very long time but might not have the same definition of a border, there are more languages. Sometimes these languages are related to each other and sometimes they’re not; sometimes they’re from rival groups, and sometimes they’re just not related. Korean is not related to any of the other languages in its area. There are some relations to Japanese, but really Japanese and Korean are not like anything else, and therefore they’re more like each other only because they’re not like anything else in the area at all, even though both have substantial Chinese influence in their words, and English influence as well.

What’s the definition of a language? Well, that’s quite hard. There are different definitions, and it basically comes down to whoever defines it, but of course it can become quite political. We’ve talked about these things before. Are Urdu and Hindi different languages? Well, it depends what you’re talking about. From a political point of view, unquestionably. From a linguistic point of view, yeah, they’re sort of different. From a phonological point of view they’re not very different at all; they pretty much overlap. And we have lots of examples like that.

Defining Languages

There is the standard joke, but it’s a good joke and it’s very relevant: “a language is a dialect with an army”. There certainly are political decisions when it comes to defining where the border actually is between two languages. Often that border is quite weak, and people can float between one language and the next. So depending on the area, the history, the politics, and the historical aspects of ethnic groups, travel, etc., defining what the languages actually are is still pretty hard.

In India, say, there are something like 460 languages, which is a lot. Officially, from the government, I think it’s 21 if I remember correctly, which is about the same as Europe when it comes to official languages. But as you look closer you start getting distinctions between languages which might be relatively small, or might be so big that these people absolutely can’t understand each other. There may also be many people who speak multiple languages. Remember, it’s really only America and Britain where people speak one language and only one language, so it’s actually pretty rare in the world that people speak only one language.

Language Families

There are different language families. Linguists have decided this by looking at languages, especially at things like word choice and word overlap, which might vary between languages. There might be relationships that we can see: language families that actually share information like lexicons (so the choice of words), the grammatical aspects, and the morphology. Even though there could be differences, they might be more similar. Across Africa we can see regions of about six different major language families. And we can see geographical aspects: Madagascar is different, but that’s because there’s a big sea between it and the mainland, which makes interaction less likely and so makes it easier to keep that distinction. So you find sometimes that languages have borders that are quite geographical: mountains, rivers, and the sea, of course.

Online Languages

How many of these languages are online? We’re all very aware that English, European languages, Chinese, Japanese, and Korean are very well represented online. And how many people within a country are actually getting online nowadays? Often many of the richer people are online, and there can be quite substantial amounts of data online. But when you look at major areas of wealth that have computers and internet, there may be mixed languages and there may be a preferred language. You often discover that in India, even though there are lots of different native languages, people who are online are often using English, or they’re writing in a romanized form rather than their native script, because, well, history: it’s easier, they’re fluent in English, and there’s a lot of English influence there. In China almost everything is written in Putonghua, standard Mandarin, even though there are many other dialects there. So there’s not the same representation everywhere, because it depends on whether these people have access and whether they’re willing to talk in these languages. Often they consider these languages to be spoken only, or rarely written, and so their language of literacy is some colonial language, be that English, Spanish, French, etc., or Swahili or Arabic. It might be that it’s just easier for them to do that, and therefore there’s a tradition of communicating in one of the standard larger languages on the planet rather than their own language. So when we’re trying to do multilingual NLP, we’re actually caring about these smaller languages that may not be as much online.

Similarity

Okay, so how do we try to find similarity between languages? Well, obviously the clearest thing we can do is start looking at the words, to see whether there’s shared information. We usually don’t look at words like iPhone and computer, which are relatively modern; we look at words that have been in the language for a long time. This is often body parts, family relationships, core food aspects, water, air, etc. There’s actually a specific list called the Swadesh list, which is often used to compare similarities between languages. Now you want to be a little careful about the writing systems. Often there’s a writing system that writes things in a very different way, but you may discover there’s shared information between the languages in the phonetic form that’s not there in the written form, and therefore you can’t just use string match to find out whether it works or not.

There are a number of online groups that try to identify all of the languages on the planet. Ethnologue is one of the best at that. It comes out of the Summer Institute of Linguistics, which is a religious organization that cares about translation of the Bible, but they’re mostly quite independent in doing their list, and it’s definitely the most comprehensive list in the world. Glottolog tries to do a similar thing, but it also identifies the position of the language on the planet. That’s not the only influence on similarity, because people get on boats and cross oceans. We’re currently in the United States, and we all know that in this particular area English is not native at all. And is French native? No; French was here before English was, but there were earlier indigenous languages, definitely Iroquois, and possibly a language called Mingo is somewhat native. The name Allegheny is probably a Mingo word, probably a word for river, maybe northern river, because it’s the northern one of the obvious three rivers. But Glottolog tries to do this, and also gives downloadable spreadsheets that allow you to look at other aspects.

Now remember, people move. Huge populations have moved around the planet, which has influenced how languages have moved, and trade has also done a lot to produce shared aspects of languages, as well as the grammatical or linguistic history of the languages. We want to be careful about that when we look at the distribution. To take the example of Korean: there are lots of English words in Korean. They have a different pronunciation from the English, but they’re clearly derived from English. That doesn’t mean there’s any linguistic relationship between Korean and English; it has much more to do with English being the international language of the last hundred years, and Korean has picked up many of these more modern words because that’s a convenient way to do it.
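The word-overlap comparison described in this section can be sketched in a few lines. The example below is a minimal illustration, not a standard tool: it scores two languages by the average normalized edit distance between their words for the same core concepts. The word lists are a tiny hand-picked sample in rough romanized form (an assumption made for illustration, not real Swadesh data), and the function names are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def lexical_similarity(words_a: dict, words_b: dict) -> float:
    """Mean normalized similarity over the concepts both lists share."""
    shared = sorted(set(words_a) & set(words_b))
    if not shared:
        return 0.0
    total = 0.0
    for concept in shared:
        a, b = words_a[concept], words_b[concept]
        total += 1.0 - edit_distance(a, b) / max(len(a), len(b), 1)
    return total / len(shared)


# Illustrative entries only, written in a rough sound-based romanization
# so that script differences do not hide shared forms.
english = {"water": "water", "hand": "hand", "night": "nait", "mother": "mother"}
german = {"water": "vasser", "hand": "hant", "night": "naxt", "mother": "mutter"}
korean = {"water": "mul", "hand": "son", "night": "bam", "mother": "eomeoni"}

print(round(lexical_similarity(english, german), 2))
print(round(lexical_similarity(english, korean), 2))
```

Comparing romanized or phonetic forms rather than native orthography reflects the caution above about writing systems. A score like this says nothing about why words are similar, so loanwords (such as English words in Korean) would still inflate it.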

Genealogical similarities

Genealogical similarity: this is the language family, and it is usually defined by linguists who make decisions by looking at the linguistic properties historically. We used to do this in the animal kingdom before we could do DNA tests, and there are interesting errors that the DNA tests show up for animals: two animals have come from different families and moved together, because that’s a convenient co-evolution, but they’ve actually got quite different histories, and that might not be obvious. That’s going to happen with languages as well, where things end up being borrowed. Maybe things get simplified over time, and we can see some examples of this, but things also get more complex over time, and there’s lots of interesting borrowing among some of the major families that are out there.

Niger-Congo in Africa has a lot of different languages and covers 21% of the languages spoken on the planet. That’s a lot. If we look at something like Indo-European, that covers most of Europe, not all of Europe, and runs through the Middle East, at least the northern Middle East, Iran, and into northern India, and that’s a lot of people. But it’s only about 6.3% of the languages, because many of these languages are spoken by a very large number of people.

Also consider the links between languages. German and English have lots of common lexical items that people can sort of work out: if they know English or they know German, they can work out what the other one is. But sometimes it’s not immediately obvious that there’s a relationship between the languages, and very few of the words actually overlap. Lithuanian is sometimes identified as the most archaic, in the sense that it has kept more of the history of the original Indo-European. English is Indo-European too, but the relationship is not obvious at all to an English speaker.

Typological similarities

How do you work out these typological similarities? Well, this is one of the major things that linguists have been doing for a long time. They’ve been looking at linguistic properties to try to see what the similarities are between languages, and there are a number of books and studies that collect that information together from multiple research studies.

When we’re looking at phonology, so the actual pronunciations, the IPA is an excellent way to break down the possible ways that most languages on the planet do their pronunciation. We have a vowel space, a continuous vowel space, and each language splits it in different ways, and there are often drifts between different languages that are maybe even predictable. For consonants (things which are not vowels; that’s probably the best definition of them) there’s a lot about place of articulation and manner of articulation: places where we put constrictions, from the front of the mouth down to the back of the throat. We have things that deal with the lips, like p and b; things that deal with the teeth, like t; and things that deal with just behind the teeth. All of these may have different variations depending on the languages that we’re actually speaking, and languages make different distinctions. In English we may produce some of these but we don’t make distinctions between them; Korean, for example, has three different p’s that most English speakers would not distinguish. So when we’re looking at similarity between languages, we could look at the similarity in the phonology and how many phonemes are common between the different languages.

WALS

Now there’s actually something called WALS. WALS is a collection of all of these different typological variations over all of the languages. Back in the early 2000s a group of people started collecting papers and showing what the similarities are. For the most part you can go to the WALS website, select some particular feature, and see its distribution. Here we’ve got the distribution over the planet of different number bases. We count in all languages, some more than others. Sometimes we use decimal, and that’s probably related to the number of fingers we have, but some count in twenties, and that’s not really unusual. Even in English we have some residual twenties, and even in Chinese there’s some residual twenty-ness, where we have a specific word for 20; in English it’s “score”. Here’s a mapping of all of the languages. Now we’re not saying that these languages are related to each other; we’re saying that when we look at numbers, all of the languages with the blue dot are counting in decimal, basically, while those with the purple-pinkish dot are counting in twenties, and some don’t have good ways of doing that at all.

WALS is this excellent, detailed resource where you can go through and find different things. You can find out which languages refer to tea as “tea” and which refer to it as “chai”, which is quite interesting in itself. Some of the features are maybe a little light-hearted, and some are quite detailed, like default word order: in English we have subject-verb-object; Japanese is subject-object-verb. Historically, WALS originally came as a book, and this book is what’s called really, really big. I’m sorry, I had to take it from underneath my monitor, because it normally keeps my monitor at the right level. The book of course isn’t updated, but the website is, and over the years the website gets more and more, and people now actually think about registering the piece of work that they’re doing so that it covers what’s there.

Now, not everything is in WALS, because not all of the features have been studied in all of the languages. Most linguists are going to study something that’s interesting, so if they’re interested in something like voicing in consonants after long vowels, that’s only going to be interesting in some languages, and in other languages nobody’s going to study it. So the question is: can we predict the missing features from other factors? Often there’s information in there. For example, in linguistics, default word order seems to be less fixed when you have more morphology. If you’ve got morphology, it lets you identify who did what to whom better, and therefore word order doesn’t matter as much. So there are some predictable things. If somebody tells me about a language I don’t know and says it has really rich morphology, I’ll think maybe it’s got free word order. That’s not true for everything, but it’s more likely. If something doesn’t have lots of morphology, it probably has more fixed order.

Can we learn this from the data? We look at all of the features for all of the languages, find the missing ones, and make predictions, holding out the ones we do know to check, and that may allow us to make these predictions. Lots of people have tried to do that at various levels and been quite successful. In fact, the paper that we asked you to look at has already looked at how well some of these predictive approaches work, and of course some of them work better for some aspects than others. Sometimes you can do it purely unsupervised; sometimes you want supervised learning; sometimes you want to make these predictions and then explicitly ask somebody whether they’re correct or not.
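The predict-and-hold-out idea above can be illustrated with a toy example. This is a minimal sketch under stated assumptions, not how any published system works: the feature table is a small hand-entered simplification (not read from WALS), the function names are hypothetical, and the predictor simply copies the value from the language that agrees most on the other known features. It uses adposition type as the predictor for word order because that pairing is easy to enter by hand; the morphology and word-order tendency mentioned in the lecture could be encoded the same way.

```python
# Toy feature table. None marks a value nobody has recorded.
FEATURES = {
    "english":  {"word_order": "SVO", "adposition": "prep"},
    "french":   {"word_order": "SVO", "adposition": "prep"},
    "japanese": {"word_order": "SOV", "adposition": "post"},
    "turkish":  {"word_order": "SOV", "adposition": "post"},
    "hindi":    {"word_order": "SOV", "adposition": "post"},
    "korean":   {"word_order": None,  "adposition": "post"},
}


def agreement(a: dict, b: dict, skip: str) -> float:
    """Fraction of features (other than `skip`) known in both and equal."""
    shared = [f for f in a
              if f != skip and a[f] is not None and b.get(f) is not None]
    if not shared:
        return 0.0
    return sum(a[f] == b[f] for f in shared) / len(shared)


def predict(table: dict, target: str, feature: str):
    """Copy `feature` from the most similar language that has it recorded."""
    best_value, best_score = None, -1.0
    for name, feats in table.items():
        if name == target or feats.get(feature) is None:
            continue
        score = agreement(table[target], feats, skip=feature)
        if score > best_score:
            best_value, best_score = feats[feature], score
    return best_value


def holdout_accuracy(table: dict, feature: str) -> float:
    """Hide each known value in turn, predict it, and count the hits."""
    known = [n for n, f in table.items() if f.get(feature) is not None]
    hits = sum(predict(table, n, feature) == table[n][feature] for n in known)
    return hits / len(known)


print(predict(FEATURES, "korean", "word_order"))
print(holdout_accuracy(FEATURES, "word_order"))
```

With only two features and six languages the hold-out score is not meaningful; the point is the shape of the procedure: fill gaps from similar languages, and measure reliability on values that are already known.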

Typological databases

There are a number of these typological databases out there. WALS is only one; it’s quite good and quite famous, and a number of others have been derived from it, or derived from multiple ones, especially to fill in particular aspects of the features. These can be really useful when you’re doing multilingual modeling, because you want to know whether these features make a difference on your downstream prediction task. I’d like to be able to get these features, I’d like them to be reliable, and I’d like to know when the features are confident, when they’re missing, and when they’re going to be more important than others, so that maybe I look harder to find them.

LORELEI

Here at CMU a number of years ago we had a project called LORELEI, which we’ll mention a few times in this course, where we tried to build a specific vector that represents a language: lang2vec. Given the name of a language, or the code of a language, it gives a vector representation that tries to capture all of the aspects that would be relevant for feeding in as a prior when you’re building various language models. There was a bunch of work done on this, both building the vector and also trying to predict, and there are still people doing PhDs on it: Graham is still very much working in that way, and Aditi, one of our PhD students, is very much in that space.
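To make the idea of a language vector concrete, here is a small sketch that turns categorical typological features into a binary vector keyed by language code and compares two languages by cosine similarity. It does not use or reproduce the lang2vec API; the feature inventory, the table, and the function names are all assumptions chosen for illustration.

```python
import math

# Each (feature, value) pair becomes one dimension of the vector.
DIMENSIONS = [
    ("word_order", "SVO"),
    ("word_order", "SOV"),
    ("adposition", "prep"),
    ("adposition", "post"),
]

# Hand-entered toy table keyed by ISO 639-3 code.
LANGS = {
    "eng": {"word_order": "SVO", "adposition": "prep"},
    "jpn": {"word_order": "SOV", "adposition": "post"},
    "tur": {"word_order": "SOV", "adposition": "post"},
}


def to_vector(code: str) -> list:
    """One-hot encode a language's features; unknown values stay 0.0."""
    feats = LANGS[code]
    return [1.0 if feats.get(f) == v else 0.0 for f, v in DIMENSIONS]


def cosine(u: list, v: list) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


print(to_vector("eng"))
print(cosine(to_vector("jpn"), to_vector("tur")))
print(cosine(to_vector("eng"), to_vector("jpn")))
```

A vector like this can be concatenated to a model’s input as a prior, which is the use the lecture describes. A real resource would cover many more features and would have to decide how to represent missing values.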

Universals

Are there any universals, features that are there for all possible languages? The answer is yes, mostly, with some caveats around the edges. For example, all languages do seem to have vowels and consonants, but the boundary between vowels and consonants isn’t very well defined, so that might be an easy thing to fulfill. But it seems to be true, given that we’re all using the same vocal tract. Almost all languages have nouns and verbs. There are some languages where the distinction isn’t very strong, and it’s morphological variants that allow you to distinguish between them, but pretty much everything has nouns and verbs. Once you get to adjectives, the next major class, it’s not so clear, and we end up with a number of languages that use nouns as adjectives, maybe with some morphological variation. In English you can get away with quite complex nouns being used as adjectival forms, more so in American English than in British English, so we get quite complex noun compounds which are really some form of adjectival form, and that’s true across a number of languages.

There are other things that are very common across multiple language families or related languages, that are relatively interesting and identifiable, and that multiple unrelated languages also do. Spaces between words in the written form, for example. Or morphology that’s segmental, where we’re joining things together, as opposed to templatic morphology, where you have maybe a bunch of consonants and the vowels change inside them. Arabic and Hebrew and a number of other North African languages have that, and they are not necessarily all in the Semitic language family. So there are a number of things that are relatively common but are not going to be everywhere. There are also things like: if languages distinguish between voiced and unvoiced, which they don’t always, they’re going to care more about the voiced than the unvoiced ones.

Low resource languages

How do we deal with the low-resource aspect of this? If we were given a language and we only had a few features for it, how could we predict other things about it? Well, we can look at all the other languages that are similar, and we can find that most languages with these features also have those features. We might also ask, for that language, what’s the most closely related language, and just assume all of its features. That might not be true, but it’s probably better than assuming everything is English, because everything is certainly not English. A number of different groups throughout the world have been studying this; it’s quite a major area, looking at how to predict or obtain features for low-resource languages when you don’t have the data.

There are a number of different ways of doing such multilingual NLP. One thing you can do is say: I’m going to find a close-by language, assume everything about it, and maybe remove things that are not appropriate. Often that’s what happens. So imagine that you come to Europe, and there’s this island nation that has separated itself from the rest of Europe, pretending that it’s not part of Europe at all, and nobody knows anything about English. Eventually somebody braves the Channel and gets to England. How might they understand English? The answer is: well, it seems sort of Germanic, so let’s just pretend it’s German, or Dutch, which is even closer, and use everything that we have from Dutch, apply it to English, and maybe train a little from there. You would get much further with that than if you took Chinese or Hindi, or maybe even French: a lot of the words are borrowed from French, but the grammar of English is very much Germanic. So we take another language, which means taking an existing model and then trying to fine-tune it for the target language.

Multilingual BERT

Another way to do this is to take all languages, or all languages in some language family, and train everything together in some joint way; there are lots of different ways of doing that. Then we would have a model that is multilingual in a true sense. That’s actually how multilingual BERT is done: rather than having different BERTs for different languages and then adapting to the target one, we build one multilingual BERT. It’s sharing some information about all of these languages, and that might make it easier when you’re adapting to the target language. Whether it does is an open question. It probably depends on the amount of training data, the number of languages, how close the target is to these other languages, whether the writing system is the same or different, and whether the phonology is different. All of these things are going to be important, so there won’t be one answer for everything, but you should be aware of the different ways of doing it.

Why typology

Why do we care about typology at all? Can’t we just train from everything? The answer basically is that for most low-resource languages you just don’t have enough data, and even if you pretend that you have enough, it’s been shown in English that the more data you have, the better your models are going to be. Now, in general in machine learning, the more structured data you have, the easier it is to learn things. So if you have external data to help you when you’re dealing with small amounts of data, the model will usually, not always, learn better. So knowing the default word order, whether morphology is an issue or not, the types of grammatical structures, verb structures, and noun structures, the noun classes, or the politeness system: these will potentially help you get better results quicker when you’re building your model.

How to choose a transfer language

How do you choose a transfer language? Well, often people will go and ask a knowledgeable person which language is close, and you’ll get an answer, but that might not always be the right answer. There was a bunch of work done here a few years ago looking at that for particular tasks like translation. There are non-trivial aspects: the actual similarity of the language, the reliability of the language, the amount of data that you have, and whether the domain of that data is appropriate. Is it conversational? Is it newspaper text? Is it Bible text? Lin et al. did quite a lot of work on that. I remember the sort of default answer from that, if I am right about this, and I think it’s this one, was that Turkish is by default the best one over everything. That’s probably because Turkish is fairly well resourced and it’s influenced by a lot of other language families, so it has Arabic, it has English, it has Iranian, Hindi, Farsi, Sanskrit and so on in it, and it spreads over a vast amount of the world, going from Europe through the Turkic languages all the way into China.
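The trade-off described above (similarity, amount of data, domain match) can be sketched as a simple ranking. This is a hand-written toy score, not the method from the work mentioned in the lecture; the candidate names, the numbers, and the scoring formula are all hypothetical and chosen only to show how a large, moderately similar language can outrank a very similar but tiny one.

```python
import math
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    similarity: float    # 0..1, e.g. typological or lexical closeness to the target
    sentences: int       # amount of training data available
    domain_match: float  # 0..1, how well the data domain fits the task


def score(c: Candidate) -> float:
    """Toy score: closeness and domain fit, scaled by log of data size."""
    return c.similarity * c.domain_match * math.log10(1 + c.sentences)


def rank(candidates: list) -> list:
    """Return candidates sorted from most to least promising."""
    return sorted(candidates, key=score, reverse=True)


candidates = [
    Candidate("close_small", similarity=0.9, sentences=1_000, domain_match=1.0),
    Candidate("far_large", similarity=0.3, sentences=1_000_000, domain_match=1.0),
    Candidate("mid_large", similarity=0.6, sentences=1_000_000, domain_match=0.8),
]

for c in rank(candidates):
    print(c.name, round(score(c), 2))
```

In practice the weights between these factors would be learned from observed transfer results rather than fixed by hand, and the best choice can differ by task.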

Open research problems

One open problem is how to extract typological features automatically. If you give me a language, can I find out the default word order? That’s sort of hard. We’re going to get some of the way there, but sometimes it’s not obvious, and maybe it’s different in the written form compared to the spoken form, and you have to take care of that. But there is this thing called the Universal Dependencies treebank, which originally came out of Google, and so there are existing toolkits and datasets which try to give this information for many of the major languages and some of the minor languages as well. These resources are hard to build yourself, and therefore it’s always good to know about them and to build on top of them.

Multilingual aspects

There are lots of other things out there. If you want to learn about morphology or phonology, you can look at multilingual work at the computational linguistics conferences. There are lots of geographical groups looking specifically at, say, Indian languages or African languages; there are lots looking at low-resource languages; there are lots looking at languages with interesting morphology. So it’s often worth looking. Everybody who doesn’t know a language thinks “I’ll just train from an infinite amount of data”, but the answer is, well, you won’t have an infinite amount of data, and sometimes it’s quite hard to find data. If you discover that morphology is rich in the particular language, it might be worth doing morphological segmentation, and there may already be an existing morphological analyzer, or at least something to help with that.

Okay, so that’s a very quick view of typology and how we structure languages. It’s becoming more and more important in the computational world than it was before. Ten or fifteen years ago you’d see fewer papers about it, but now people really care about it because we are doing much more multilingual work. You were asked to read this particular paper, which was a survey paper looking at typology across different people trying to do predictions, how well they were doing, and the issues involved.

What we’re going to do now is split you off into groups, and in each group there will be a TA or an instructor. We want each of you to identify things which are unique, or very rare compared to other languages, that are important in the languages you know, as distinct from things which are not very interesting. Maybe whole classes of unrelated languages all use a romanized form to write, but that doesn’t make them related. We want things that are going to be unique from that point of view. Now, has someone set up the groups? Graham, have you set up the groups? Maybe we could take some questions, if people have questions.

Yes, sure. Yes, thank you, 22 Indian languages. Yeah. So what script you write things in is quite interesting, especially once you’re in a code-switching space. In India almost everybody, when they’re code-switching, will write in a romanized form. They’ll often call it English, but it’s not English; it’s the romanized form, so they’re writing both Hindi and English in romanized form. Whereas in Singapore, for example, where people can be equally fluent in Chinese and English, they actually use Hanzi for writing Chinese and English for writing English words. Most of that has to do with input methods, actually: how easy is it to type these things on a computer? And there are historical reasons. Partly because there were fewer Chinese speakers who spoke English, the Chinese input systems became better, while in India for the past 200 years the educated elite were all English-speaking, and therefore they were used to reading and writing in romanized form. Probably that’s got something to do with it. But information is definitely lost when you’re using a non-native script; for example, spelling goes completely out of the way in English when you’re doing it.

But remember, most scripts are not appropriate for their language. We are using a Latin script for a Germanic language in English. We use Kanji in Japanese for writing lots of things, and there are other scripts in Japanese for dealing with more native things. Hangul is native in Korea, but there are still lots of borrowed words, especially scientific words, that come from Chinese. So often the writing system, even for the native speakers, is not very appropriate. Often it’s just convention: this is the way we write it and we’ve always written it. And it may only have been since last year that people did it, but they think it’s a fact that we’ve done it forever. You know, basically: we’ve always sent dances to TikTok as a way to communicate, and I can’t remember when we ever did anything else.

Outline

Introduction to Linguistic Typology
- Typology is a way to structure and organize languages based on similarities.
- Languages are not fixed points, they exist across regions and often overlap.
- Language definition can be political, e.g. Hindi & Urdu.
- Remember the quip: “a language is a dialect with an army and a navy”.
- Defining the borders between languages is difficult.
Linguistic Diversity
- The world has approximately 7,000 languages.
- Some areas have higher concentrations of languages than others.
- India has around 460 languages.
- Africa has an estimated 1,500-2,000 languages from 6 language families.
- Most people in the world are multilingual.
- The U.S. and Britain are unusual in that many people speak only one language.
- Geographical features can create borders between languages.
Language Families
- Linguists identify language families by looking at shared features such as lexicon, grammar, and morphology.
- Examples of major language families include:
  - Niger-Congo (Africa).
  - Indo-European (Europe, Middle East, and Northern India).
- Languages within families share common linguistic information.
- Some languages, like Korean, may not be related to other languages in their area.
Identifying Similarities between Languages
- Methods for identifying similarities include:
  - Word overlap: looking at shared words, especially core vocabulary (body parts, family terms, food, water, air). A standard list of such concepts, the Swadesh list, is often used for this.
  - Phonetic form: considering phonetic similarities rather than just written forms.
  - Areal similarity: geographic proximity and influence between languages.
  - Genealogical similarity: language families based on linguistic history.
  - Typological similarity: classifying languages based on functional and structural properties.
- Resources such as Ethnologue and Glottolog try to identify and classify languages.
- The World Atlas of Language Structures (WALS) is a key resource for typological data.
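The word-overlap idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the four-concept word lists are a toy stand-in for a real Swadesh list, the Japanese forms are romanized, and the comparison uses spelling rather than phonetic (IPA) forms, which a more careful comparison would prefer. The function names `edit_distance` and `lexical_similarity` are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete a character of a
                            curr[j - 1] + 1,      # insert a character of b
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def lexical_similarity(words_a: dict, words_b: dict) -> float:
    """Mean normalized string similarity over the concepts both lists cover."""
    shared = sorted(set(words_a) & set(words_b))
    if not shared:
        return 0.0
    total = 0.0
    for concept in shared:
        a, b = words_a[concept], words_b[concept]
        total += 1.0 - edit_distance(a, b) / max(len(a), len(b))
    return total / len(shared)


# Toy core-vocabulary lists (orthographic forms; an assumption for brevity).
english = {"water": "water", "hand": "hand", "mother": "mother", "night": "night"}
german = {"water": "wasser", "hand": "hand", "mother": "mutter", "night": "nacht"}
japanese = {"water": "mizu", "hand": "te", "mother": "haha", "night": "yoru"}

if __name__ == "__main__":
    print("English-German:  ", round(lexical_similarity(english, german), 3))
    print("English-Japanese:", round(lexical_similarity(english, japanese), 3))
```

On these toy lists the English-German score comes out higher than the English-Japanese score, which matches the genealogical picture, but a score like this cannot by itself separate inherited vocabulary from borrowings.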
Typological Features
- Typology involves classifying languages based on shared formal characteristics.
- Examples of typological features include:
  - Phonology: How languages pronounce sounds; the International Phonetic Alphabet (IPA) helps in comparing phonological systems.
  - Numeral bases: Different languages use different counting systems, such as decimal or base-20.
  - Word order: Languages may have a default word order such as subject-verb-object (English) or subject-object-verb (Japanese).
- WALS provides a database of 192 attributes across 2,676 languages.
- Some features may be predictable based on others; for example, languages with rich morphology may have less fixed word order.
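One way to make typological similarity concrete is to treat each language as a vector of categorical features and count disagreements. The sketch below assumes a tiny hand-written table: the feature names and the `typological_distance` helper are hypothetical and are not WALS identifiers, and one Hindi value is deliberately left unknown to show how sparse databases are usually handled (compare only the features both languages have).

```python
from typing import Dict, Optional

Features = Dict[str, Optional[str]]

# Toy feature table; None marks a value treated as unknown for illustration.
languages: Dict[str, Features] = {
    "English": {"word_order": "SVO", "adposition": "prepositions",
                "numeral_base": "decimal"},
    "Japanese": {"word_order": "SOV", "adposition": "postpositions",
                 "numeral_base": "decimal"},
    "Hindi": {"word_order": "SOV", "adposition": "postpositions",
              "numeral_base": None},
}


def typological_distance(a: Features, b: Features) -> Optional[float]:
    """Fraction of disagreeing features among those known for both languages.

    Returns None when the two languages share no known feature.
    """
    shared = [f for f in a if a[f] is not None and b.get(f) is not None]
    if not shared:
        return None
    disagreements = sum(a[f] != b[f] for f in shared)
    return disagreements / len(shared)


if __name__ == "__main__":
    for x, y in [("English", "Japanese"), ("Japanese", "Hindi"), ("English", "Hindi")]:
        print(x, y, typological_distance(languages[x], languages[y]))
```

With only three features the numbers are coarse; the point is the handling of missing values, not the specific distances.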
Typological Databases
- WALS is a major typological database.
- URIEL is another typological database which includes phonology, morphosyntax, and lexical semantics.
  - URIEL has data for 8,070 languages and 284 attributes.
- These databases are useful in multilingual NLP, providing features for language models.
Linguistic Universals
- Most languages have vowels and consonants.
- Almost all languages distinguish between nouns and verbs.
  - Whether adjectives form a distinct word class is less clear across languages.
Multilingual Natural Language Processing (NLP)
- Typological features can help in multilingual NLP by providing structured data.
- Low-resource languages benefit from typological information due to the lack of available data.
- Methods in multilingual NLP:
  - Cross-lingual transfer: using a model from a resource-rich language on a resource-poor language.
  - Zero-shot learning: applying a model to a new language or domain with no additional training.
  - Few-shot learning: adapting a model using a few examples from the low-resource language or domain.
  - Joint multilingual learning: training a single model on multiple languages.
- Typological information can be used to select an appropriate transfer language.
- Both the amount of training data and the similarity between the languages are important.
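The trade-off between data size and language similarity can be illustrated with a hand-written ranking heuristic. This is a sketch under stated assumptions and is not the learned ranking model of Lin et al. (2019): the scoring formula, the `Candidate` class, the candidate names, and all numbers are invented for illustration.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    name: str
    distance: float   # distance to the target language, 0 (same) to 1 (very different)
    train_size: int   # number of training examples available


def score(c: Candidate, max_size: int) -> float:
    """Similarity weighted by log-scaled data size (an assumed heuristic)."""
    similarity = 1.0 - c.distance
    if max_size <= 0:
        return 0.0
    data = math.log1p(c.train_size) / math.log1p(max_size)
    return similarity * data


def rank_transfer_languages(candidates: List[Candidate]) -> List[Candidate]:
    """Return candidates sorted from most to least promising."""
    if not candidates:
        return []
    max_size = max(c.train_size for c in candidates)
    return sorted(candidates, key=lambda c: score(c, max_size), reverse=True)


# Hypothetical candidates with made-up distances and corpus sizes.
candidates = [
    Candidate("lang_a", distance=0.1, train_size=5_000),      # close, little data
    Candidate("lang_b", distance=0.6, train_size=1_000_000),  # distant, lots of data
    Candidate("lang_c", distance=0.2, train_size=200_000),    # fairly close, decent data
]

if __name__ == "__main__":
    for c in rank_transfer_languages(candidates):
        print(c.name)
```

In this toy setting the fairly close language with a moderate amount of data ranks first, ahead of both the closest language and the largest corpus. A learned ranker would fit that balance from observed transfer results instead of fixing it by hand.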
Open Research Problems
- How to automatically extract typological features from existing resources.
- How to accurately predict typological knowledge while controlling for biases.
- How to incorporate linguistic typology into models.
- How to alleviate negative transfer in multilingual models using typological knowledge.
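For the typology-prediction problem, a simple baseline is to fill in a missing feature by majority vote among the most similar languages. The sketch below assumes a small hand-written table and hypothetical names (`overlap_distance`, `predict_feature`); it is a nearest-neighbour baseline, not the representation-learning approach of Malaviya, Neubig, and Littell (2017), and it does nothing to control for genealogical or areal bias.

```python
from collections import Counter
from typing import Dict, Optional

Features = Dict[str, Optional[str]]

# Toy table of languages whose values are treated as known.
known: Dict[str, Features] = {
    "Japanese": {"word_order": "SOV", "adposition": "postpositions", "adj_noun": "AdjN"},
    "Turkish": {"word_order": "SOV", "adposition": "postpositions", "adj_noun": "AdjN"},
    "Korean": {"word_order": "SOV", "adposition": "postpositions", "adj_noun": "AdjN"},
    "English": {"word_order": "SVO", "adposition": "prepositions", "adj_noun": "AdjN"},
    "Spanish": {"word_order": "SVO", "adposition": "prepositions", "adj_noun": "NAdj"},
}

# Target language with the word-order value hidden.
hindi: Features = {"word_order": None, "adposition": "postpositions", "adj_noun": "AdjN"}


def overlap_distance(a: Features, b: Features) -> float:
    """Fraction of disagreeing features among those known for both languages."""
    shared = [f for f in a if a[f] is not None and b.get(f) is not None]
    if not shared:
        return 1.0  # nothing to compare: treat as maximally distant
    return sum(a[f] != b[f] for f in shared) / len(shared)


def predict_feature(target: Features, others: Dict[str, Features],
                    feature: str, k: int = 3) -> Optional[str]:
    """Majority vote for `feature` among the k nearest languages that have it."""
    labelled = [(overlap_distance(target, feats), name)
                for name, feats in others.items()
                if feats.get(feature) is not None]
    nearest = sorted(labelled)[:k]  # ties are broken alphabetically by name
    votes = Counter(others[name][feature] for _, name in nearest)
    return votes.most_common(1)[0][0] if votes else None


if __name__ == "__main__":
    print(predict_feature(hindi, known, "word_order"))
```

Because the hidden feature is None in the target, it is automatically excluded from the distance, so the prediction relies only on the other features.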
Further Resources
- Papers in computational linguistics conferences.
- Workshops such as SIGMORPHON, SIGTYP, and AfricaNLP.
Discussion
- Identifying unique typological features in languages.
- Considering aspects of phonology, morphology, syntax, semantics, and pragmatics.

Papers

Here is a list of the papers covered in the lesson:

Ponti et al. (2019) Modeling language variation and universals: A survey on typological linguistics for natural language processing
- This paper is a survey on typological linguistics for natural language processing. It is cited as a key resource for understanding how typological information can be used in NLP. The paper also explores modeling language variation and universals.
Lin et al. (2019) Choosing Transfer Languages for Cross-Lingual Learning
- This paper discusses how to choose appropriate transfer languages for cross-lingual learning in NLP. It is relevant to the topic of using typological features for low-resource languages.
Littell et al. (2017) URIEL and lang2vec: Representing Languages as Typological, Geographical, and Phylogenetic Vectors
- This paper introduces the URIEL typological database. It provides information on phonology, morphosyntax, and lexical semantics across many languages.
Malaviya, Neubig, and Littell (2017) Learning Language Representations for Typology Prediction
- This paper is related to the lang2vec representations derived from the URIEL database and explores how to learn language representations for predicting typological features.
Georgi, Xia, and Lewis (2010) Comparing Language Similarity across Genetic and Typologically-Based Groupings
- An example of research in the automatic prediction of typological features.

Choosing Transfer Languages for Cross-Lingual Learning

Georgi, Ryan, Fei Xia, and William Lewis. 2010. “Comparing Language Similarity Across Genetic and Typologically-Based Groupings.” In Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010), edited by Chu-Ren Huang and Dan Jurafsky, 385–93. Beijing, China: Coling 2010 Organizing Committee. https://aclanthology.org/C10-1044/.

Lin, Yu-Hsiang, Chian-Yu Chen, Jean Lee, Zirui Li, Yuyan Zhang, Mengzhou Xia, Shruti Rijhwani, et al. 2019. “Choosing Transfer Languages for Cross-Lingual Learning.” In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, edited by Anna Korhonen, David Traum, and Lluís Màrquez, 3125–35. Florence, Italy: Association for Computational Linguistics. https://doi.org/10.18653/v1/P19-1301.

Littell, Patrick, David R. Mortensen, Ke Lin, Katherine Kairis, Carlisle Turner, and Lori Levin. 2017. “URIEL and Lang2vec: Representing Languages as Typological, Geographical, and Phylogenetic Vectors.” In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, edited by Mirella Lapata, Phil Blunsom, and Alexander Koller, 8–14. Valencia, Spain: Association for Computational Linguistics. https://aclanthology.org/E17-2002/.

Malaviya, Chaitanya, Graham Neubig, and Patrick Littell. 2017. “Learning Language Representations for Typology Prediction.” In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, edited by Martha Palmer, Rebecca Hwa, and Sebastian Riedel, 2529–35. Copenhagen, Denmark: Association for Computational Linguistics. https://doi.org/10.18653/v1/D17-1268.

Ponti, Edoardo Maria, Helen O’Horan, Yevgeni Berzak, Ivan Vulić, Roi Reichart, Thierry Poibeau, Ekaterina Shutova, and Anna Korhonen. 2019. “Modeling Language Variation and Universals: A Survey on Typological Linguistics for Natural Language Processing.” Computational Linguistics 45 (3): 559–601. https://doi.org/10.1162/coli_a_00357.

Citation

BibTeX citation:

@online{bochman2022,
  author = {Bochman, Oren},
  title = {Typology: {The} {Space} of {Languages}},
  date = {2022-02-25},
  url = {https://orenbochman.github.io/notes-nlp/notes/cs11-737/cs11-737-w03-typology/},
  langid = {en}
}

For attribution, please cite this work as:

Bochman, Oren. 2022. “Typology: The Space of Languages.” February 25, 2022. https://orenbochman.github.io/notes-nlp/notes/cs11-737/cs11-737-w03-typology/.