TL;DR
In 2020 Marco Baroni gave a talk in which he discussed the role of compositionality in neural networks and its implications for language emergence. He also presented recent work by his group at Facebook on language emergence in deep networks, where two or more networks endowed with a communication channel are trained to solve a task jointly. Baroni argues that compositionality is not a necessary condition for good generalization in neural networks, and suggests that focusing on enhancing generalization directly may be more beneficial than worrying about the compositionality of emergent neural network languages.
This talk doesn't provide us with an Aristotelian definition of compositionality, but rather the pragmatic one Baroni used in his investigation.
The notion of language emergence in deep networks is fascinating yet vague. With these caveats in mind, there are a number of results that Baroni presents that are worth considering.
Though I was initially critical of this talk and speaker, I soon came to realize that this is just the kind of message that stimulates thought and discussion. It is a good talk and I would recommend it to anyone interested in the topic of language emergence in deep networks. 👍
My work on this subject shows that adding compositionality to the Lewis signaling game not only significantly increases the ratio of messages to simple signals, but also renders the overall system easier to master by making it more systematic and predictable. This is greatly affected by the form of aggregation used for signal composition: e.g. using a simple template can create a morphology that is automatic and therefore easy to master.
For example, if we add a parameter to the game that penalizes long signals and rewards early decoding, we can get a more efficient signaling system. This allows agents to learn a larger signaling system faster. It also has more subtle effects. Perhaps the most important one concerns the pooling equilibria of the Lewis game, which are far more common than the non-pooling equilibria we call signaling systems. If we reinterpret these partial pooling equilibria as defining categories of signals, which are further refined within the complex signal, they can lead to early decoding and thus to more efficient signaling systems.
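As a rough sketch of such a reward shaping (the cost and bonus terms and their values are my illustrative assumptions, not a published setup):

```python
def reward(correct: bool, signal_len: int, decode_step: int,
           len_cost: float = 0.05, early_bonus: float = 0.1) -> float:
    """Payoff for one round of a Lewis game variant that penalizes
    long signals and rewards decoding before the whole signal is read.
    (Illustrative parameterization.)"""
    r = 1.0 if correct else 0.0          # base payoff for coordination
    r -= len_cost * signal_len           # discourage long signals
    if correct:
        r += early_bonus * (signal_len - decode_step)  # reward early decoding
    return r
```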
In this subject we need to ask ourselves what the main research questions are and what definitions we can make.
- What is language emergence?
- What is compositionality?
- How are neural networks used here?
  - How were they set up?
  - What was the task's loss function?
  - What does generalization mean in this task?
The big question is: will learning an emergent language help agents achieve better generalization?
- There is also talk of compositionality in RL, where transfer learning is challenging.
- Hinton discussed the idea of capsules as a way to encode compositional structure in neural networks.
On page 16 of the slides we see that agents are using a learned signaling system to communicate very effectively about noise patches.
Skyrms has a couple of papers that can explain this using the idea of a template: agents evolve a signaling system under some regime (an environment with a distribution of states assigned by nature), and under some new regime they can repurpose the old signaling system instead of learning a new one from scratch.
- Does this mean they have not learned to signal? Clearly not.
- Does this mean their language is weird? Nope, their language is just a lexicon with a 1:1 mapping induced by what I call the frame game.
- So WTF?
Most people using the Lewis referential game don't really understand the Lewis game or game theory, e.g. deep RL algorithms typically only consider a single-player game.
The Lewis game has four types of equilibria. There are N! signaling systems giving the best payoffs; there are lots of suboptimal equilibria with homonyms, called partial pooling, that give lower payoffs; there are N complete pooling equilibria that yield almost zero payoff; and there are also an infinite number of mixed equilibria that are mixture distributions over the other solutions.
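A brute-force sanity check of this taxonomy for a tiny game, counting pure strategy profiles by expected payoff (N = 3 and a uniform distribution over states are my assumptions; mixed equilibria obviously can't be enumerated this way):

```python
import itertools
import math
from collections import Counter

N = 3  # states = signals = acts
states = range(N)

def expected_payoff(sender, receiver):
    """Uniform nature; payoff 1 iff the receiver's act matches the state."""
    return sum(receiver[sender[s]] == s for s in states) / N

payoffs = Counter()
for sender in itertools.product(states, repeat=N):        # state -> signal
    for receiver in itertools.product(states, repeat=N):  # signal -> act
        payoffs[expected_payoff(sender, receiver)] += 1

print(sorted(payoffs.items()))   # how many pure profiles hit each payoff level
print(math.factorial(N), "profiles reach payoff 1: the signaling systems")
```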
Learning a good equilibrium is easy if the framing game has high fidelity. If the framing game can lead to mistakes in identifying the correct state, then the RL algorithms may take longer to converge, or may converge to a bad equilibrium.
In most MARL settings researchers don't care about the coordination sub-task by itself, but about some other task.
- Let's call the Lewis game the coordination task.
- The Lewis game has a structure that leads to 1:1 mappings. If you want other mappings to be learned you need to do something else.
- Let's call the other task the frame task, and tacitly assume it has a reward structure.
- The framing game should not violate the payoff structure of the Lewis game. Lewis coordination games encode the agents' incentive to cooperate on coordination. If the symmetry of the payoffs is broken, the agents will try to coordinate on different equilibria (cf. the battle of the sexes, where expected payoffs drop from 1 to 2/n!), and if the payoff structure is eroded further, the incentives for cooperation evaporate and we never learn a language.
Recall that rationality in game theory assumes that players' decisions result from maximizing their own selfish payoff functions, conditional on their beliefs about the other players' optimal behavior. In evolutionary game theory we can restate this as: the decisions that will result in the greatest expected progeny (a.k.a. fitness), again given the state and the other agents' optimal behavior.
There are many ways rational agents can learn to coordinate:
- If each agent has some shared knowledge that allows them to enumerate the states in the same order, then they can use it to infer the same signaling system immediately.
- If only one new state–action pair is introduced, coordination takes only n−1 steps.
- If they both know the distribution of states, and no two states are equally likely, agents can infer a signaling system from the frequency ranking (see the sketch after this list).
  - This implies that if they can observe the states long enough, they can suddenly start signaling with high fidelity.
- The sender can dictate a system that he knows and the receiver can learn it from him, or
- the receiver can dictate a system that he knows and the sender can learn it from him.
- They can also learn a system stochastically, by trying out signals and actions.
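A minimal sketch of the frequency-ranking idea referenced above (the state names and probabilities are made up):

```python
def system_from_frequencies(state_probs):
    """Both agents independently rank states by probability and pair the
    k-th most likely state with the k-th signal. With no ties, both sides
    derive the same signaling system without any interaction."""
    ranked = sorted(state_probs, key=state_probs.get, reverse=True)
    sender = {state: signal for signal, state in enumerate(ranked)}
    receiver = {signal: state for signal, state in enumerate(ranked)}
    return sender, receiver

probs = {"eagle": 0.5, "snake": 0.3, "leopard": 0.2}  # illustrative
sender, receiver = system_from_frequencies(probs)
assert all(receiver[sender[s]] == s for s in probs)   # perfect coordination
```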
Thus if they both have a pretrained CLIP classifier, or they both learn the frame task using some classifier with shared weights, they can simply feed it random noise many times and learn an empirical distribution of its categories, which correspond to the states. They should then have a very good signaling system at their disposal…
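A sketch of that trick, assuming a frozen `classifier` callable that maps a batch of inputs to class indices (the function, its signature, and the shapes are hypothetical):

```python
import numpy as np

def shared_codebook(classifier, n_probes=10_000, input_shape=(64,)):
    """Both agents run the same frozen classifier on random noise and rank
    its categories by empirical frequency. Because the weights are shared,
    both arrive at the same ranking, which doubles as a shared ordering of
    states and hence a shared signaling system (cf. the sketch above)."""
    noise = np.random.randn(n_probes, *input_shape)
    labels = classifier(noise)             # assumed: array of class indices
    cats, counts = np.unique(labels, return_counts=True)
    return cats[np.argsort(-counts)]       # most frequent category first
```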
Unless you really know what you are doing, you won't get extra structure out of the Lewis game.
In this case:

- the frame game is learning to classify an image into their lexicon, and
- the Lewis game is to send and recover the state, i.e. pick the image with the same label.

If both agents share the same classifier, the frame game is a no-brainer for this task.

In the Lewis game agents need to send N signals and recover the state from N possible options; in the referential game they need to guess one of four. If the classifier is the same, then there is a very good chance that the distractors are labeled differently from the target, so the agents can usually succeed even with an imperfect signaling system.
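To put a number on "very good chance": if labels are drawn roughly uniformly and independently from N classes, the probability that none of the k distractors shares the target's label is (1 - 1/N)^k:

```python
# With N = 100 classes and k = 3 distractors, a shared classifier almost
# decides the referential game on its own.
N, k = 100, 3
print((1 - 1 / N) ** k)  # ≈ 0.970
```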
This section explains naive compositionality.
- “A L M” = Blue Banana
| Representation | Description |
|---|---|
| [ALM] : Blue Banana | Not compositional |
| [A L] : Blue, [M] : Banana | Compositional |
| [AL] : Blue, [LM] : Banana | Less compositional; an entangled representation |
- The idea is that tokenizing a sequential signal should yield independent semantics. A similar notion is that when messages are aggregated like sets, certain subsets have this property. (The definition is vague, and I would call such representations entangled with the aggregation.)
What we would like is:
- a definition of compositionality that takes a general notion of aggregation, not just sets and sequences (a toy sketch follows this list), e.g.
  - prefix/suffix morphology: v: [verb-prefix slot] and [slot verb-inflection]; n: [noun-prefix slot] and [slot noun-inflection]; adj: [adj-prefix slot] and [slot adj-inflection]; adv: [adv-prefix slot] and [slot adv-inflection]
  - one template: [v s o]
  - many templates:
    - n2: [adj n] | n is a noun phrase
    - v2: [v adv] | v is a verb phrase
    - s1: [v] e.g. run
    - s2: [v n2] e.g. hit, transitive
    - s3: [v|v2, n2, n2] where the verb slot is a v or a v2, and the subject and object are noun phrases (n or n2)
    - s: s1 | s2 | s3 is a sentence
    - AND [s, s] is a conjunction
    - OR [s, s] is a disjunction
    - NOT [s] is a negation
  - (templates can be just a prefix taking a fixed number of arguments)
  - the template prefix might be omitted if it can be inferred from the context
  - two-level templates (a morphological aggregation and a syntactic aggregation)
    - in a marked template we can discard the morphological markers
    - there may even be an unmarked morphological form with a null prefix
    - the template marker can be omitted if it can be inferred from the context…
  - recursive templates
- a definition of compositionality that plays well with embeddings
  - simple Lewis games don't have embeddings
  - complex ones (e.g. referential games over images) do
- a definition of compositionality that addresses context-sensitive grammars
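Below is a toy sketch of the template machinery wished for above; the specific templates and lexicon entries are my own illustrative assumptions:

```python
import random

# Illustrative two-level template grammar; symbols follow the wishlist above.
lexicon = {"n": ["banana", "leopard"], "adj": ["blue", "big"],
           "v": ["run", "hit"], "adv": ["fast", "slowly"]}

templates = {
    "n2": ["adj", "n"],   # noun phrase: adjective + noun
    "v2": ["v", "adv"],   # verb phrase: verb + adverb
    "s":  ["v2", "n2"],   # sentence: verb phrase + noun phrase
}

def expand(symbol):
    """Recursively expand a template symbol into a flat signal,
    sampling lexicon entries for the terminal slots."""
    if symbol in templates:
        return [tok for part in templates[symbol] for tok in expand(part)]
    return [random.choice(lexicon[symbol])]

print(" ".join(expand("s")))  # e.g. "hit fast blue banana"
```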
Naive compositionality is indeed naive. It does not capture basic aspects of what I consider to be compositionality.
Note that a sequence like:

`OR ANDN A B C ] Z`

where OR takes two arguments and ANDN takes a variable number of arguments until it sees a closing bracket, isn't naively compositional, as there are nested segments with different semantics. Also, the `]` is a structural marker rather than a symbol referring to any input element.

The above is a rather efficient way to encode complex signals.
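A minimal parser for such signals, under my reading of the arities above (OR is binary; ANDN is variadic, terminated by `]`):

```python
def parse(tokens):
    """Parse a prefix signal into a nested tuple. `tokens` is consumed
    left to right; atomic symbols parse as themselves."""
    tok = tokens.pop(0)
    if tok == "OR":                   # OR takes exactly two arguments
        return ("OR", parse(tokens), parse(tokens))
    if tok == "ANDN":                 # ANDN takes arguments until ']'
        args = []
        while tokens[0] != "]":
            args.append(parse(tokens))
        tokens.pop(0)                 # consume the closing bracket
        return ("ANDN", *args)
    return tok

print(parse("OR ANDN A B C ] Z".split()))
# -> ('OR', ('ANDN', 'A', 'B', 'C'), 'Z')
```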
Consider:

`A B C`

where A, B and C are dense disjunctive embeddings (OR aggregation within the signal), AND is used to aggregate between signals, and States 1:3 are the most common predation-warning states out of 100 states:

- A = [State 1, State 2, State 3]
- B = [State 1, Not State 2, State 3]
- C = [State 1, State 2, Not State 3]

These might be considered less compositional, as they are entangled; however, they have the advantage of allowing early partial decoding.
This section of the talk goes over the key results of (Chaabouni et al. 2020) from Baroni's group at Facebook AI Research. I haven't read the paper in great depth.
At this point I don't think that the main results are particularly surprising, nor that the claims made are applicable to other framing games.
First, given sufficiently large input spaces, the emergent language will naturally develop the ability to refer to novel composite concepts.
I believe that in the right settings we may see compositionality emerge in small Lewis games whose states are endowed with a structure amenable to compositionality, and that restricting the signal will drive this capability.
Second, there is no correlation between the degree of compositionality of an emergent language and its ability to generalize.
I believe that once a complex signaling system is learned it can be repurposed to solve other tasks, so a learned signaling system that is highly compositional is likely to generalize to other domains. I think that in time this can be demonstrated, but it requires lots of tinkering with the Lewis game to get it to support such structures, as well as new RL algorithms that can learn these structures quickly.
This might also be an apples-and-oranges comparison, as I may be thinking about the notions of generalization and compositionality differently than the authors of the paper, and I do not even see these as the top priorities in developing better signaling systems. (I think signaling systems should be salient, learnable, transferable to other domains, and translatable into human-readable forms.) Furthermore, I don't think the metrics used for compositionality are particularly useful. In lieu of better ones I listed all three metrics, even though only the positional one is discussed in the talk. Coming up with good metrics for novel ideas requires developing good intuition, and I don't have that yet for complex signaling systems.
Offhand, my best guess has to do with integrating the Lewis game with a contrastive loss that operates on structural components: keeping messages with similar structures close and messages with different structures far apart. This may also be in line with the notion of topographic similarity shown below.
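As a minimal sketch of that idea (the names and the hinge form are my assumptions, not something from the talk): given message embeddings and a boolean matrix marking which pairs share a structural template, a standard contrastive objective would be:

```python
import numpy as np

def structural_contrastive_loss(emb, same_structure, margin=1.0):
    """Pull embeddings of messages with the same structure together;
    push different-structure pairs at least `margin` apart
    (a standard contrastive hinge applied to structural similarity)."""
    loss, pairs = 0.0, 0
    for i in range(len(emb)):
        for j in range(i + 1, len(emb)):
            d = np.linalg.norm(emb[i] - emb[j])
            if same_structure[i][j]:
                loss += d ** 2                     # similar: minimize distance
            else:
                loss += max(0.0, margin - d) ** 2  # dissimilar: enforce margin
            pairs += 1
    return loss / pairs
```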
A second promising direction, from (Mu and Goodman 2022), is to use framing games in which agents reference sets of signals, or so-called concepts, which are more like equivalence classes over states.
A third direction comes from (Rita et al. 2022), where the authors consider how signaling systems that fail to generalize are overfitting, and how this can be mitigated by decomposing the objective into a co-adaptation loss and an information loss.
Third, while compositionality is not necessary for generalization, it provides an advantage in terms of language transmission: the more compositional a language is, the more easily it will be picked up by new learners.
Simple signaling systems require a large lexicon to be learned if there are many states. Complex signaling systems can shrink the lexicon by allowing agents to use aggregates of simple symbols, a.k.a. a morphology. A systematic aggregation means you can learn a combinatorial explosion of possible signals, so it is no surprise that compositionality can make signaling systems easier to learn. However, this is only really helpful if the structure of the signaling system is semantically a good match to the states… If it is not, then the learner just gets a massive list of words and needs to learn what each one means.
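The arithmetic behind the combinatorial claim (the sizes are illustrative):

```python
# With 10 atomic symbols and a systematic 3-slot template, a learner masters
# 10**3 = 1000 composite signals from only 10 * 3 = 30 slot-symbol pairings,
# whereas a holistic lexicon needs 1000 independent entries.
symbols, slots = 10, 3
print(symbols ** slots, "signals from", symbols * slots, "associations")
```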
For example, many languages have a gender system that likely originated from the group of nouns used to describe people. However, it was perpetuated into almost all nouns and ends up an arbitrary and confusing system that makes learning the language harder. Bantu languages like Ganda have 10 noun classes (cf. languages by type of grammatical gender). This is clearly reuse of a template, but not one that is easy to learn, as now there are 7x arbitrary distinctions that need to be learned.
In (Chaabouni et al. 2020) the authors define the following three metrics for compositionality:
Given these two lists, the topographic similarity is defined as their negative Spearman ρ correlation (since we are correlating distances with similarities, negative values of correlation indicate topographic similarity of the two spaces). Intuitively, if similar objects share much of the message structure (e.g., common prefixes or suffixes), and dissimilar objects have little common structure in their respective messages, then the topographic similarity should be high, the highest possible value being 1. – (Lazaridou et al. 2018)
$$
\mathit{topsim}=\rho\left(\left\{d\left(\mathbf{x}^{(i)}, \mathbf{x}^{(j)}\right), d\left(\mathbf{m}^{(i)}, \mathbf{m}^{(j)}\right)\right\}_{i, j=1}^{n}\right)
$$
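A sketch of the computation, using Hamming distance in both spaces (a common choice, but the metric is up to the experimenter); note that I correlate distances with distances here, so a high positive value indicates topographic similarity:

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

def topsim(objects, messages):
    """Topographic similarity: Spearman correlation between pairwise
    distances in attribute space and in message space."""
    d_obj = pdist(np.asarray(objects), metric="hamming")
    d_msg = pdist(np.asarray(messages), metric="hamming")
    return spearmanr(d_obj, d_msg).correlation

objects = [[0, 0], [0, 1], [1, 0], [1, 1]]   # attribute vectors
messages = [[0, 0], [0, 1], [1, 0], [1, 1]]  # a perfectly compositional code
print(topsim(objects, messages))             # 1.0
```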
positional disentanglement (posdis) metric measures whether symbols in specific positions tend to univocally refer to the values of a specific attribute. This order-dependent strategy is commonly encountered in natural language structures (and it is a pre-condition for sophisticated syntactic structures to emerge) – (Chaabouni et al. 2020)
$$
\mathit{posdis}=\frac{1}{c_{len}} \sum_{j=1}^{c_{len}} \frac{\mathcal{I}(s_j;a^j_1)-\mathcal{I}(s_j;a^j_2)}{\mathcal{H}(s_j)}
$$

where:
- $s_j$ is the $j$-th symbol of a message
- $a^j_1$ is the attribute with the highest mutual information with $s_j$: $a^j_1 = \arg\max_a \mathcal{I}(s_j; a)$
- $a^j_2$ is the attribute with the second-highest mutual information with $s_j$: $a^j_2 = \arg\max_{a \neq a^j_1} \mathcal{I}(s_j; a)$
- $\mathcal{H}(s_j)$ is the entropy of the $j$-th position (used as a normalizing term)
Positions with zero entropy are ignored in the computation.
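A sketch of posdis under my reading of the formula above (I average over the non-degenerate positions; `mutual_info_score` and the entropy here both use natural logs, which cancels in the ratio):

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def entropy(x):
    """Shannon entropy (natural log) of a discrete sample."""
    _, counts = np.unique(x, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log(p)).sum())

def posdis(messages, attributes):
    """Positional disentanglement: per position, the gap between the two
    largest mutual informations with any attribute, normalized by the
    position's entropy; zero-entropy positions are skipped."""
    messages, attributes = np.asarray(messages), np.asarray(attributes)
    scores = []
    for j in range(messages.shape[1]):
        s_j = messages[:, j]
        h = entropy(s_j)
        if h == 0:
            continue
        mis = sorted((mutual_info_score(s_j, attributes[:, a])
                      for a in range(attributes.shape[1])), reverse=True)
        scores.append((mis[0] - mis[1]) / h)
    return float(np.mean(scores)) if scores else 0.0

msgs = [[0, 0], [0, 1], [1, 0], [1, 1]]   # position j encodes attribute j
attrs = [[0, 0], [0, 1], [1, 0], [1, 1]]
print(posdis(msgs, attrs))                # 1.0 for a perfectly positional code
```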
Posdis assumes that a language uses positional information to disambiguate symbols. However, we can easily imagine a language where symbols univocally refer to distinct input elements independently of where they occur, making order irrelevant. Hence, we also introduce bag-of-symbols disentanglement (bosdis). The latter maintains the requirement for symbols to univocally refer to distinct meanings, but captures the intuition of a permutation-invariant language, where only symbol counts are informative – (Chaabouni et al. 2020)
$$
\mathit{bosdis}=\frac{1}{c_{voc}} \sum_{j=1}^{c_{voc}} \frac{\mathcal{I}(n_j;a^j_1)-\mathcal{I}(n_j;a^j_2)}{\mathcal{H}(n_j)}
$$

where:
- $n_j$ is a counter of the $j$-th symbol in a message
- Simple compositionality: the idea that the meaning of a complex signal is an aggregation of the meanings of its parts.
- Non-compositionality: when each symbol or sequence is mapped to a full state.
- Naïve compositionality: when the meaning of a bag or sequence of atomic symbols is read off the atoms, each referring to a single input element.
- Entanglement: when parts of a signal cannot be assigned independent meanings, because their semantics depend on the surrounding symbols or on the aggregation.
- By the end of the talk he more or less concludes that the compositionality of emergent neural network languages is not a necessary condition for good generalization, and that there is no reason to expect deep networks to find compositional languages more "natural" than highly entangled ones. Baroni concludes that if fast generalization is the goal, we should focus on enhancing this property directly, without worrying about the compositionality of emergent neural network languages.
Blurb from the talk
Compositionality is the property whereby linguistic expressions that denote new composite meanings are derived by a rule-based combination of expressions denoting their parts. Linguists agree that compositionality plays a central role in natural language, accounting for its ability to express an infinite number of ideas by finite means.
“Deep” neural networks, for all their impressive achievements, often fail to quickly generalize to unseen examples, even when the latter display a predictable composite structure with respect to examples the network is already familiar with. This has led to interest in the topic of compositionality in neural networks: can deep networks parse language compositionally? how can we make them more sensitive to compositional structure? what does “compositionality” even mean in the context of deep learning?
I would like to address some of these questions in the context of recent work on language emergence in deep networks, in which we train two or more networks endowed with a communication channel to solve a task jointly, and study the communication code they develop. I will try to be precise about what "compositionality" means in this context, and I will report the results of proof-of-concept and larger-scale experiments suggesting that (non-circular) compositionality is not a necessary condition for good generalization (of the kind illustrated in the figure). Moreover, I will show that often there is no reason to expect deep networks to find compositional languages more "natural" than highly entangled ones. I will conclude by suggesting that, if fast generalization is what we care about, we might as well focus directly on enhancing this property, without worrying about the compositionality of emergent neural network languages.
References
- https://cs.stanford.edu/people/karpathy/cnnembed/
- https://www.inverse.com/article/12664-google-s-alphago-supercomputer-wins-second-go-match-vs-lee-sedol
- https://hackernoon.com/deepmind-relational-networks-demystified-b593e408b643
Lazaridou et al. ICLR 2017 Multi-Agent Cooperation and the Emergence of (Natural) Language
Bouchacourt and Baroni (2018) How agents see things: On visual representations in an emergent language game
Are emergent languages compositional?
- Andreas ICLR 2019
- Choi et al ICLR 2018
- Havrylov & Titov NIPS 2017
- Kottur et al EMNLP 2017
- Mordatch & Abbeel AAAI 2018
- Resnick et al AAMAS 2020
A compositional language is one where it is easy to read out which parts of a linguistic expression refer to which components of the input
- Naïve compositionality
- a language is naïvely compositional if the atomic symbols in its expressions refer to single input elements, independently of either input or linguistic context
- Chaabouni, Kharitonov et al. ACL 2020 Compositionality and Generalization In Emergent Languages
Quantifying (one type of) naïve compositionality
Positional disentanglement measures a strong form of naïve compositionality: to what extent do symbols in a certain position univocally refer to different values of the same attribute.
Note: the paper has two other measures.
Do emergent languages support generalization?
Is compositionality needed for generalization?
- kind of obvious, particularly if parameters are shared; it is not necessary, but it should help
Lazaridou et al ICLR 2018
Kharitonov and Baroni: Emergent Language Generalization and Acquisition Speed are not Tied to Compositionality
Video
Slides and paper
Citation
@online{bochman2024,
author = {Bochman, Oren},
title = {Is Compositionality Overrated? {The} View from Language
Emergence},
date = {2024-09-01},
url = {https://orenbochman.github.io/posts/2024/2024-10-10-marco-baoni-composionality/summary.html},
langid = {en}
}