import heapq


class Node:
    def __init__(self, char, freq):
        self.char = char
        self.freq = freq
        self.left = None
        self.right = None

    def __lt__(self, other):
        return self.freq < other.freq


def build_huffman_tree(chars_freq):
    """
    Builds the Huffman tree for given character frequencies.

    Args:
        chars_freq: A dictionary of characters and their frequencies.

    Returns:
        The root of the Huffman tree.
    """
    nodes = []
    for char, freq in chars_freq.items():
        heapq.heappush(nodes, Node(char, freq))
    while len(nodes) > 1:
        left = heapq.heappop(nodes)
        right = heapq.heappop(nodes)
        parent = Node(None, left.freq + right.freq)
        parent.left = left
        parent.right = right
        heapq.heappush(nodes, parent)
    return nodes[0]


def update_huffman_tree(root, updated_freqs):
    """
    Updates the Huffman tree with new character frequencies.

    Args:
        root: The root of the current Huffman tree.
        updated_freqs: A dictionary of characters and their updated frequencies.

    Returns:
        The root of the updated Huffman tree.
    """
    # 1. Extract leaf nodes and their frequencies
    leaf_nodes = []

    def get_leaf_nodes(node):
        if node is None:
            return
        if node.char is not None:
            leaf_nodes.append((node, node.freq))
        get_leaf_nodes(node.left)
        get_leaf_nodes(node.right)

    get_leaf_nodes(root)

    # 2. Update frequencies of leaf nodes
    for node, old_freq in leaf_nodes:
        new_freq = updated_freqs.get(node.char, old_freq)  # Use old freq if not updated
        node.freq = new_freq

    # 3. Rebuild the Huffman tree
    return build_huffman_tree({node.char: node.freq for node, _ in leaf_nodes})


def encode_char(root, char, code=''):
    """
    Encodes a character using Huffman codes.

    Args:
        root: The root of the Huffman tree.
        char: The character to encode.
        code: The current code (initially empty).

    Returns:
        The Huffman code for the character.
    """
    if root is None:
        return ''
    if root.char == char:
        return code
    left_code = encode_char(root.left, char, code + '0')
    if left_code != '':
        return left_code
    right_code = encode_char(root.right, char, code + '1')
    return right_code


def decode_char(root, code):
    """
    Decodes a Huffman code to get the character.

    Args:
        root: The root of the Huffman tree.
        code: The Huffman code to decode.

    Returns:
        The decoded character.
    """
    current = root
    for bit in code:
        if bit == '0':
            current = current.left
        else:
            current = current.right
        if current.char is not None:
            return current.char


def encode_message(root, message):
    """
    Encodes a message using Huffman codes.

    Args:
        root: The root of the Huffman tree.
        message: The message to encode.

    Returns:
        The encoded message.
    """
    encoded_message = ''
    for char in message:
        encoded_message += encode_char(root, char)
    return encoded_message


def decode_message(root, encoded_message):
    """
    Decodes a Huffman-encoded message.

    Args:
        root: The root of the Huffman tree.
        encoded_message: The encoded message.

    Returns:
        The decoded message.
    """
    decoded_message = ''
    current = root
    for bit in encoded_message:
        if bit == '0':
            current = current.left
        else:
            current = current.right
        if current.char is not None:
            decoded_message += current.char
            current = root
    return decoded_message


def print_tree(node, level=0):
    """
    Prints the Huffman tree in a visually appealing format.
    """
    if node is None:
        return
    print_tree(node.right, level + 1)
    print(" " * 4 * level + f"[{node.char or ''}: {node.freq}]")
    print_tree(node.left, level + 1)
Review of “Compositionality and Generalization in Emergent Languages”
Very exciting - this is a paper with a lot of interesting ideas. It comes with a lot of code in the form of a library called EGG, as well as many Jupyter notebooks. There is also a video of the talk at NeurIPS 2020.
In (Chaabouni et al. 2020) the authors take ideas from representation learning and apply them to emergent languages in deep networks. They arrive at a number of results.
Abstract
Natural language allows us to refer to novel composite concepts by combining expressions denoting their parts according to systematic rules, a property known as compositionality. In this paper, we study whether the language emerging in deep multi-agent simulations possesses a similar ability to refer to novel primitive combinations, and whether it accomplishes this feat by strategies akin to human-language compositionality. Equipped with new ways to measure compositionality in emergent languages inspired by disentanglement in representation learning, we establish three main results.
First, given sufficiently large input spaces, the emergent language will naturally develop the ability to refer to novel composite concepts.
Second, there is no correlation between the degree of compositionality of an emergent language and its ability to generalize.
Third, while compositionality is not necessary for generalization, it provides an advantage in terms of language transmission: The more compositional a language is, the more easily it will be picked up by new learners, even when the latter differ in architecture from the original agents. We conclude that compositionality does not arise from simple generalization pressure, but if an emergent language does chance upon it, it will be more likely to survive and thrive.
Outline
Here is the outline of the paper:
Introduction
- Describes a variant of Lewis signaling game used to study the emergence of reference to composite concepts in deep multi-agent simulations.
- Discusses two specific and intuitive compositionality strategies that capture common compositional structures in natural languages.
- Introduces two new compositionality measures, positional disentanglement (posdis) and bag-of-symbols disentanglement (bosdis), inspired by work on disentanglement in representation learning.
Measurements
- Describes the commonly used topographic similarity (topsim) metric.
- Introduces and defines two new measures of compositionality:
- posdis - positional disentanglement and
- bosdis - bag-of-symbols disentanglement.
- Explains how the new measures are similar to the Information Gap disentanglement measure used in representation learning.
- Illustrates the behavior of the three compositionality metrics on three miniature languages in the Appendix.
Given these two lists, the topographic similarity is defined as their negative Spearman ρ correlation (since we are correlating distances with similarities, negative values of correlation indicate topographic similarity of the two spaces). Intuitively, if similar objects share much of the message structure (e.g., common prefixes or suffixes), and dissimilar objects have little common structure in their respective messages, then the topographic similarity should be high, the highest possible value being 1 – (Lazaridou et al. 2018)
\mathit{topsim}=\rho\left(\left\{d\left(\mathbf{x}^{(i)}, \mathbf{x}^{(j)}\right), d\left(\mathbf{m}^{(i)}, \mathbf{m}^{(j)}\right)\right\}_{i, j=1}^{n}\right)
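As a rough sketch (not the EGG implementation), topsim can be estimated by correlating the pairwise distances of the two spaces. Here Hamming distance is assumed on both the attribute vectors and the messages, so the Spearman correlation is taken directly rather than negated, and scipy is assumed to be available; the toy language at the bottom is my own example.

# Rough sketch of topsim: Spearman correlation between pairwise input distances
# and pairwise message distances, assuming Hamming distance in both spaces.
from itertools import combinations
from scipy.stats import spearmanr

def hamming(u, v):
    return sum(a != b for a, b in zip(u, v))

def topsim(inputs, messages):
    pairs = list(combinations(range(len(inputs)), 2))
    d_x = [hamming(inputs[i], inputs[j]) for i, j in pairs]
    d_m = [hamming(messages[i], messages[j]) for i, j in pairs]
    rho, _ = spearmanr(d_x, d_m)
    return rho

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
messages = [('a', 'x'), ('a', 'y'), ('b', 'x'), ('b', 'y')]
print(topsim(inputs, messages))   # 1.0 for this perfectly compositional toy language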
positional disentanglement (posdis) metric measures whether symbols in specific positions tend to univocally refer to the values of a specific attribute. This order-dependent strategy is commonly encountered in natural language structures (and it is a pre-condition for sophisticated syntactic structures to emerge) – (Chaabouni et al. 2020)
\mathit{posdis}=\frac{1}{c_{len}} \sum_{j=1}^{c_{len}} \frac{\mathcal{I}(s_j,a^j_1)-\mathcal{I}(s_j,a^j_2)}{\mathcal{H}(s_j)}
- s_j: the j-th symbol of a message
- a^j_1: the attribute with the highest mutual information with s_j, a^j_1 = \arg\max_a \mathcal{I}(s_j; a)
- a^j_2: the attribute with the second-highest mutual information with s_j, a^j_2 = \arg\max_{a \neq a^j_1} \mathcal{I}(s_j; a)
- \mathcal{H}(s_j): the entropy of the j-th position (used as a normalizing term)
positions with zero entropy are ignored in the computation.
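To make the formula concrete, here is a small sketch of how posdis could be estimated from data. It is my own simplification, not the authors' code: messages are assumed to be tuples of symbols, inputs are tuples of attribute values, and the score is averaged over the positions that are not ignored.

# Sketch of posdis for a toy language: each message is a tuple of symbols,
# each input a tuple of attribute values.
from collections import Counter
from math import log2

def entropy(counts):
    total = sum(counts.values())
    return -sum(c / total * log2(c / total) for c in counts.values())

def mutual_information(xs, ys):
    # I(X;Y) = H(X) + H(Y) - H(X,Y), estimated from co-occurrence counts
    return (entropy(Counter(xs)) + entropy(Counter(ys))
            - entropy(Counter(zip(xs, ys))))

def posdis(messages, inputs):
    """messages: list of symbol tuples; inputs: list of attribute tuples."""
    score, positions = 0.0, 0
    for j in range(len(messages[0])):
        s_j = [m[j] for m in messages]
        h_sj = entropy(Counter(s_j))
        if h_sj == 0:            # positions with zero entropy are ignored
            continue
        mis = sorted((mutual_information(s_j, [x[a] for x in inputs])
                      for a in range(len(inputs[0]))), reverse=True)
        score += (mis[0] - mis[1]) / h_sj
        positions += 1
    return score / positions if positions else 0.0

# A perfectly positional toy language: position 0 encodes attribute 0, position 1 attribute 1
inputs = [(a, b) for a in range(3) for b in range(3)]
messages = [(a, b) for a, b in inputs]
print(posdis(messages, inputs))   # 1.0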
Posdis assumes that a language uses positional information to disambiguate symbols. However, we can easily imagine a language where symbols univocally refer to distinct input elements independently of where they occur, making order irrelevant. Hence, we also introduce bag-of-symbols disentanglement (bosdis). The latter maintains the requirement for symbols to univocally refer to distinct meanings, but captures the intuition of a permutation-invariant language, where only symbol counts are informative – (Chaabouni et al. 2020)
\mathit{bosdis}=\frac{1}{c_{voc}} \sum_{j=1}^{c_{voc}} \frac{\mathcal{I}(n_j,a^j_1)-\mathcal{I}(n_j,a^j_2)}{\mathcal{H}(n_j)}
- n_j a counter of the j-th symbol in a message
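Reusing entropy and mutual_information from the posdis sketch above, bosdis can be estimated the same way: instead of the symbol at position j, we use the count of symbol j in each message, so order is thrown away. Again this is my own simplification of the formula, with the average taken over symbols whose counts have non-zero entropy.

# Sketch of bosdis under the same assumptions as the posdis sketch above.
def bosdis(messages, inputs, vocab):
    """messages: list of symbol tuples; inputs: list of attribute tuples; vocab: iterable of symbols."""
    score, used = 0.0, 0
    for sym in vocab:
        n_j = [m.count(sym) for m in messages]   # symbol counts, order discarded
        h_nj = entropy(Counter(n_j))
        if h_nj == 0:
            continue
        mis = sorted((mutual_information(n_j, [x[a] for x in inputs])
                      for a in range(len(inputs[0]))), reverse=True)
        score += (mis[0] - mis[1]) / h_nj
        used += 1
    return score / used if used else 0.0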
Generalization Emerges “Naturally” if the Input Space is Large
- Presents an experiment showing that emergent languages are able to generalize to unseen combinations as long as input size is sufficiently large.
- Discusses how the results challenge claims in the recent literature that deep networks fail to generalize.
- Notes that the minimum channel capacity required for the emergence of a generalizing language is significantly larger than the minimum channel capacity required for a perfectly compositional language.
- Presents additional experiments in the Appendix analyzing the effects of agent capacity and input density on generalization.
Generalization Does Not Require Compositionality
- Presents results showing that there is no correlation between compositionality and generalization ability.
- Analyzes the language of a specific run with near-perfect generalization accuracy and medium posdis score.
- Discusses how the analyzed language uses a “leaky disentanglement” strategy where two positions largely specialize as predictors of two attributes, respectively, but a third more entangled position is still necessary for perfect communication.
- Briefly analyzes in the Appendix a language with near-perfect generalization accuracy and very low posdis score.
Compositionality and Ease of Transmission
- Discusses the hypothesis that compositional languages are easier to decode and transmit to new learners.
- Presents an experiment where new Receivers are trained on frozen Senders that achieved a high level of generalization accuracy.
- Finds that learning speed and generalization accuracy of new Receivers are strongly positively correlated with degree of compositionality.
- Mentions further experiments in the Appendix that replicate the ease-of-transmission analysis across various channel capacities.
Discussion
- Summarizes the main findings of the paper, highlighting the results that challenge common assumptions in the emergent language literature.
- Relates the findings to the ongoing debate on the origins of compositionality in natural language.
- Discusses the potential benefits of compositionality for developing languages that are quickly usable by wide communities of artificial agents.
- Highlights the connection between compositionality and disentanglement in representation learning.
My Thoughts
We have three main results:
- Generalization Emerges “Naturally” if the Input Space is Large
- No Correlation Between Compositionality and Generalization
- Compositionality increases Ease of Transmission
This follows from a simple analysis of the Lewis signaling game. The cost of coordination for a non-compositional language is exponential in the number of states. For a compositional language the cost can be reduced to an exponential in the size of the lexicon (the number of atomic signals), plus some constant factor for learning the grammar, provided the grammar is a small set of aggregation rules. (Usually one rule is sufficient to support compositionality.)
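A back-of-the-envelope illustration of this cost argument, with toy numbers of my own (not taken from the paper): count the candidate conventions the agents would have to search over.

# Toy illustration of the coordination-cost argument above.
from math import factorial

n_att, n_val = 2, 4
n_states = n_val ** n_att                             # 16 composite states

holistic_candidates = factorial(n_states)              # bijections states -> holistic signals
compositional_candidates = factorial(n_val) ** n_att   # per-attribute value -> atom maps

print(f"{holistic_candidates:.3e}")    # ~2.092e+13
print(compositional_candidates)         # 576 (plus one concatenation rule)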
The Lewis signaling game is more or less guaranteed to converge to a signaling system¹ with a suitable algorithm. Without a good algorithm the game is more likely to converge to a partial pooling equilibrium, where payoffs are less than 1 because one side or both fail to separate different states.
¹ One that has an expected payoff of 1.
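To make the convergence claim concrete, here is a minimal sketch of Herrnstein/Roth-Erev reinforcement in a 2-state, 2-signal Lewis game. This is my own toy implementation, not the paper's setup; in this smallest game it typically settles on one of the two signaling systems.

# Roth-Erev reinforcement in a minimal Lewis signaling game.
import random

n = 2
sender_w = [[1.0] * n for _ in range(n)]    # accumulated rewards: state -> signal
receiver_w = [[1.0] * n for _ in range(n)]  # accumulated rewards: signal -> act

def draw(weights):
    return random.choices(range(n), weights=weights)[0]

for _ in range(20000):
    state = random.randrange(n)
    signal = draw(sender_w[state])
    act = draw(receiver_w[signal])
    if act == state:                         # payoff 1 on success, 0 otherwise
        sender_w[state][signal] += 1
        receiver_w[signal][act] += 1

print(sender_w)    # one signal per state should clearly dominate in most runs
print(receiver_w)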
What this means is that, all things being equal, the compositional language will emerge much sooner than the non-compositional language, given the opportunity to do so. So why don't we see this?
- When the number of states isn't larger than the number of basic signals it is easier to learn a non-compositional language. So we should put a cap on the number of basic signals.
- The multiplier for learning grammar may be big, especially if we use a neural network, and more so if the grammar is complicated. So if we don't see compositionality, perhaps we need to make the grammar simpler or the state space bigger.
- This is theoretical - perhaps since we use a neural network rather than a tabular RL solution we need lots of data to learn anything. As we go through the epochs there may be enough rounds in the Lewis game for us to establish a convention without the need for compositionality.
- OK, let's say we have a fast learning algorithm and a big but manageable input space. Do we get compositionality? The answer is not necessarily.
Let's dive deeper into this last point:
Despite what ChatGPT will tell you if you ask, the Lewis signaling game only outputs simple languages: no complex signals, no grammar, no recursion and no compositionality. It has many possible equilibria, and none correspond to a language with complex signals or grammar.
For grammar and complex signals to emerge you need to tweak the Lewis signaling game. (Skyrms 2010) reports on a couple of papers which produce complex signals. Most of the interesting work came out after the book. Regardless, in those papers the agents were given a simple aggregation rule to follow: conjunction leads to a bag of words, concatenation leads to sequences. But what they don't seem to stress is that for complex signals we want a state that decomposes in a way we can match with our aggregation rule. Think group homomorphism. And there may be multiple decompositions, so think normal subgroups.
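A toy illustration of what such an aggregation rule looks like (my own example, not from Skyrms): the state decomposes into attributes, and the sender concatenates one atomic signal per attribute, so the receiver can invert each part independently.

# A concatenation-based aggregation rule over a factored state.
color_code = {'red': 'a', 'green': 'b', 'blue': 'c'}
shape_code = {'circle': 'x', 'square': 'y', 'triangle': 'z'}

def compositional_sender(state):
    color, shape = state                               # the state decomposes
    return color_code[color] + shape_code[shape]       # concatenation rule

def compositional_receiver(signal):
    inv_color = {v: k for k, v in color_code.items()}
    inv_shape = {v: k for k, v in shape_code.items()}
    return (inv_color[signal[0]], inv_shape[signal[1]])

print(compositional_sender(('green', 'triangle')))     # 'bz'
print(compositional_receiver('bz'))                    # ('green', 'triangle')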
There isn’t a how to guide to get the agents to use some arbitrary grammar. (Not AFAIK). There are a bunch of books and many papers but they don’t seem to have a the kind of recepies that are needed here. In my view most of these books look for the answers based on what they know rather what they need to know. They may have fascinating points but lead to greater confusion rather then more clarity.
One abstraction I came across is that a grammar is essentially a decision tree mapping the signals back into the state. Decision trees sound simple enough, but this tree is not given; it needs to be learned by trial and error, and the reward signals from successes are sparse. There might be a setting in which the sender can construct the tree and the receiver then just needs to learn it, but that requires the sender to have access to the distribution of states and sub-states. This distribution can be used to come up with a tree that is optimal for a given set of signals. If the sender and receiver don't have access to this distribution they can learn it, but my work indicates that learning the distribution to a high degree of accuracy requires more turns than my Lewis signaling algorithm needs, at least for simple signaling systems.
For a simple signaling system I developed two algorithms. The first learned a signaling system. The second first enumerated all signaling systems of a certain size and then selected one from those it believed were optimal; each new state would reduce the belief until the sender and receiver had a common belief. This may not scale, but it can adapt to new distributions seamlessly. I thought about a similar approach for working with complex grammars, but in this case I did not have an efficient way to enumerate all possible grammars. However, there seems to be a way to do this: instead of considering all decision trees, we can consider just Huffman trees. This means that the sender and receiver use the Lewis signaling game to learn a shared Huffman tree, and the outcome is that the tree should compress the state space. The only problem is that such a grammar is not likely to be compositional and would be very difficult for humans to learn.
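A minimal sketch of that idea, reusing the Huffman helpers listed at the top of this post; the state names and frequencies are assumed toy values, not anything from my experiments.

# Sender and receiver share a Huffman tree built from an (assumed) state
# distribution; the signal for a state is its path in the tree.
state_freqs = {'rain': 50, 'sun': 30, 'wind': 15, 'snow': 5}   # toy distribution
shared_tree = build_huffman_tree(state_freqs)

def sender(state):
    return encode_char(shared_tree, state)       # bit-string signal

def receiver(signal):
    return decode_char(shared_tree, signal)

signal = sender('wind')
print(signal, receiver(signal))                  # payoff 1 when the decoded state matches 'wind'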
So what we need is for the agents to learn the tree interactively. Three approaches come to mind: 1. Huffman coding, which builds the tree but doesn't update it to account for distributional shifts; 2. Vitter's algorithm for adaptive Huffman coding, which updates the tree as new states are seen; and 3. adaptive arithmetic coding, which is a generalization of adaptive Huffman coding.
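Continuing the sketch above: the update_huffman_tree helper in this post takes the simplest route and rebuilds the tree from new frequencies rather than adapting it in place as Vitter's algorithm would. The shifted frequencies below are again assumed toy values.

# Rebuild-on-shift (what the helpers in this post implement), as opposed to the
# in-place updates of adaptive Huffman coding.
shifted_freqs = {'rain': 10, 'sun': 30, 'wind': 15, 'snow': 60}   # assumed shift
shared_tree = update_huffman_tree(shared_tree, shifted_freqs)
print(sender('snow'))   # 'snow' now gets a short code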
One point to consider is that such a grammar is likely to provide a good compression of the state space, since these algorithms are also compression algorithms.
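This can be made precise with the standard bound on Huffman codes: if states are drawn with probabilities p(x) and \ell(x) is the length of the code assigned to state x, the expected signal length L sits within one bit of the entropy of the state distribution:

\mathcal{H}(X) \leq L = \sum_{x} p(x)\,\ell(x) < \mathcal{H}(X) + 1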
I would imagine the output of such a grammar to be a binary sequence. This suggests it would lead to an entangled representation with no discernible compositional structure.
Now there are reputedly many languages with very simple grammars. But the ones we are familiar with are not simple. They also have large lexicons. We need to put that aside and look for ways to work with simple grammars. It is quite possible to come up with two or three rules that can generate both morphology and a recursive syntax. It might be possible with one rule.
OK, let's briefly consider the other two points.
The paper, a video of the talk, the code, and a Colab notebook are available online.
# Example usage
chars_freq = {'a': 45, 'b': 13, 'c': 12, 'd': 10, 'e': 9, 'f': 5, 'g': 2, 'h': 1}
root = build_huffman_tree(chars_freq)
updated_freqs = {'a': 45, 'b': 55, 'c': 12, 'd': 10, 'e': 9, 'f': 5, 'g': 2, 'h': 1}

#message = "abcdef"

def test(root, freqs=None):
    if freqs is not None:
        root = update_huffman_tree(root, freqs)
    print_tree(root)
    for message in ["a", "ab", "aba", "abc", "abcd"]:
        print("Original message", message)
        encoded_message = encode_message(root, message)
        print("Encoded message:", encoded_message)
        decoded_message = decode_message(root, encoded_message)
        print("Decoded message:", decoded_message)

test(root)
test(root, updated_freqs)
                [e: 9]
            [: 17]
                    [f: 5]
                [: 8]
                        [g: 2]
                    [: 3]
                        [h: 1]
        [: 30]
            [b: 13]
    [: 52]
            [c: 12]
        [: 22]
            [d: 10]
[: 97]
    [a: 45]
Original message a
Encoded message: 0
Decoded message: a
Original message ab
Encoded message: 0110
Decoded message: ab
Original message aba
Encoded message: 01100
Decoded message: aba
Original message abc
Encoded message: 0110101
Decoded message: abc
Original message abcd
Encoded message: 0110101100
Decoded message: abcd
        [a: 45]
    [: 84]
                [c: 12]
            [: 22]
                [d: 10]
        [: 39]
                [e: 9]
            [: 17]
                    [f: 5]
                [: 8]
                        [g: 2]
                    [: 3]
                        [h: 1]
[: 139]
    [b: 55]
Original message a
Encoded message: 11
Decoded message: a
Original message ab
Encoded message: 110
Decoded message: ab
Original message aba
Encoded message: 11011
Decoded message: aba
Original message abc
Encoded message: 1101011
Decoded message: abc
Original message abcd
Encoded message: 11010111010
Decoded message: abcd
Citation
@online{bochman2025,
author = {Bochman, Oren},
title = {Compositionality and {Generalization} in {Emergent}
{Languages}},
date = {2025-01-01},
url = {https://orenbochman.github.io/posts/2024/2024-10-10-marco-baoni-composionality/paper2.html},
langid = {en}
}