Compositionality in Lewis signaling games and MARL transfer learning.

compositionality
emergent languages
reinforcement learning
transfer learning
information theory
linguistics
Author

Oren Bochman

Published

Monday, October 14, 2024

Keywords

compositionality, emergent languages, reinforcement learning, transfer learning

TL;DR: Compositionality - A guide to the perplexed

Compositionality means different things to different people in different contexts, which is irksome to the student, leaves researchers prone to vagaries, and foists unexpected complications onto practitioners. Although I’m no Maimonides, I will assay to identify the different meanings, delineate their contextual boundaries, and establish a hierarchy relating the different facets of compositionality.

We start by considering Lewis signaling games, first with atomic signals and later with compound ones. I then consider transfer across tasks in multi-objective multi-agent reinforcement learning (MO-MARL), and also try to abstract these ideas into a formal mathematical form.

Motivation 1 - Synthetic and Emergent languages

When dealing with emergent languages, I believe that complex signaling systems that are more faithful representations of real-world states are superior to ones that map them arbitrarily. I say complex because simple signaling systems are highly symmetrical, but once we use aggregate signals we immediately get systems with different levels of desirable properties. The signaling system one ends up with may be path dependent and may not be optimal for the task at hand.

But what does it mean to be faithful? 1

1 This is not related to the mathematical notion of a group action discussed below; it just means that the signal is a good representation of the state.

We can most easily understand it using an example. If the state contains some space-time description, a signaling system that has a rule for describing space and a similar rule for time will require learning two rules and a much smaller lexicon, rather than learning no rules and a massive lexicon with an entry for each space-time combination deemed pertinent to communication. We also see how compositionality is intimately tied to learnability and faithfulness. In a compositional language we can allocate more of the lexicon to atomic and semantically orthogonal concepts and use the rules to create a whole space of possible meanings.
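A toy back-of-the-envelope comparison of the two lexicon sizes, assuming a state that factors into a “place” and a “time” component (the component names and sizes below are made up for illustration):

```python
# Toy comparison of lexicon sizes for a factored state space.
# Assumes a state is a (place, time) pair; the names and sizes are illustrative.

places = [f"place_{i}" for i in range(20)]
times = [f"time_{j}" for j in range(10)]

# Holistic language: one arbitrary signal per (place, time) combination.
holistic_lexicon_size = len(places) * len(times)        # 200 signals, no rules

# Compositional language: one signal per place, one per time,
# plus a single rule "place-signal followed by time-signal".
compositional_lexicon_size = len(places) + len(times)   # 30 signals + 1 rule

print(holistic_lexicon_size, compositional_lexicon_size)  # 200 vs 30
```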

Why is this a challenge? In the real world states are complex and there are many facets to being faithful. For example, the state might be a picture of a cat. The signaling system might have to learn to describe the color of the cat, the shape of the cat, the size of the cat, the position of the cat, the orientation of the cat, the texture of the cat, the breed of the cat, the mood of the cat, the age of the cat, the health of the cat. The list goes on and on. So as a language grows we shift from simple rules that capture parts of the state to a more abstract system that can capture the many facets of the state, with the added constraint that this abstract system must remain easy to learn via the same idea of composition. 2

2 One would imagine that, given a flexible template for a complex signaling game, the constraint on learnability would select for more compositional languages. In reality there are many impediments to learning such systems, so there are no guarantees that adding various constraints will lead to a paragon of complex signaling; we will more likely see something odd and obscure that is very hard to interpret. I’d like to point out that natural languages are rather hard for machines to learn, and for most humans even more so.

Desiderata fulfilled by the original Lewis signaling game

Natural languages temper a faithful one-to-one mapping of states to signals with abstractions that are easier to learn because they are general.

  • When learning using a Lewis signaling game, agents begin with a very simple semantic setup
  • there is a list of states and we want to be able to name them.
  • Agents learn a mapping not unlike a lexicon, which lists the different meanings of a token.
  • A good lexicon also lists things like prefixes, suffixes, and collocations, which are compositional elements of language.
  • A thesaurus lists synonyms, which are also compositional elements of language. Lewis games can capture synonymy by having multiple signals for the same state; we can call these partial pooling equilibria over signals, though they behave like separating equilibria in that they do not require the receiver to guess the state from the signal. While synonyms are clearly inefficient in a signaling system, once we add a time dimension synonyms for common ideas can diverge into more nuanced states as we learn more facets of the partial state they correspond to. We can think of this as X + w_1 a, X + w_2 b, ..., X + w_n n, where X is the common state, a, b, ..., n are different semantic atoms, and the weights w_i start at 0 and slowly increase, so the synonyms diverge into more meaningful versions.
  • Lewis games can also capture homonymy by having multiple states for the same signal; this is called a partial pooling equilibrium. These are useful if we consider them as the most informative partial signals that can be sent. (This may sound like a bit of a stretch, but it is the best way I have found to think about it.)

We can see how the Lewis game can capture at least three structural properties of language. In the literature the focus has been on signaling systems, that is, one-to-one mappings between states and signals; this corresponds to a large part of language which is unambiguous and has the list-like structure I discussed above. However, we can now see that algorithms could be designed to target a broader set of equilibria that facilitate the use of synonyms and homonyms. This is a more complex signaling system, more like a thesaurus and a dictionary combined.
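To ground the discussion, here is a minimal sketch of the original Lewis signaling game with atomic signals, learned by simple Roth-Erev style reinforcement. The number of states, the episode count, and the urn-style update are illustrative choices, not a canonical implementation.

```python
import random
from collections import defaultdict

# Minimal sketch of a Lewis signaling game with atomic signals, learned by
# Roth-Erev style reinforcement. All sizes and parameters are illustrative.

N_STATES = N_SIGNALS = N_ACTS = 4
EPISODES = 20_000

# Urn weights: sender maps states -> signals, receiver maps signals -> acts.
sender = defaultdict(lambda: [1.0] * N_SIGNALS)
receiver = defaultdict(lambda: [1.0] * N_ACTS)

def draw(weights):
    return random.choices(range(len(weights)), weights=weights)[0]

for _ in range(EPISODES):
    state = random.randrange(N_STATES)
    signal = draw(sender[state])
    act = draw(receiver[signal])
    if act == state:                      # shared payoff of 1 on success
        sender[state][signal] += 1.0
        receiver[signal][act] += 1.0

# Inspect the learned mapping; with luck it is a separating equilibrium
# (a one-to-one state -> signal -> act mapping), but as noted above the
# outcome is path dependent.
for s in range(N_STATES):
    best_signal = max(range(N_SIGNALS), key=lambda m: sender[s][m])
    best_act = max(range(N_ACTS), key=lambda a: receiver[best_signal][a])
    print(f"state {s} -> signal {best_signal} -> act {best_act}")
```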

Motivation 2 - Transfer learning in RL

Some modern RL algorithms are fantastic at finding optimal or superhuman policies for a single task. However, when we want to learn a new task we often have to start from scratch. This is because the policy is a complex function that is hard to decompose into simpler functions that can be reused. This is a problem of compositionality: if we could decompose the policy into simpler functions, we could reuse those functions and learn the new task faster. This is the idea behind transfer learning.

Recent research into combining LLMs with RL agents suggests that, with an expressive enough language and the kind of common-sense knowledge captured in such a language, agents may have enough structure to represent their task in terms of an abstraction amenable to transferring skills between tasks, which may significantly reduce the amount of learning required for a new task. So if a language is a compositional representation of the world, and the rewards can also be expressed as compositional functions of the state components, then agents may be able to leverage both structures.

Learning language in the microcosm of other games framing the Lewis signaling game may also be key to exploring this duality: RL algorithms that learn abstract representations along with transferable RL skills.

Games and constraints

Besides the Lewis signaling game there are:

  1. the Lewis reconstruction game, where the receiver needs to reconstruct the state (an image) from the signal and there is a reconstruction loss. The agents get a shared reward, but it is not 0 or 1; it is a continuous value. (Deep learning practitioners like continuous rewards since they can be differentiated and plugged into the backpropagation algorithm.)
  2. the Lewis referential game, AKA the classification game. The receiver needs to pick the correct state from a set of distractors and one or more target images. This is easier than the original game, as there are fewer states to choose from. However, selecting the state requires learning an image classifier, or even a CLIP-style classifier, and this is a harder task than just learning a mapping from states to signals. (It requires more steps to train if we start from scratch and learn one example at a time, as we do in the Lewis game. In this game I think that if the distractors come from a GAN there would be better opportunities for compositionality to emerge.) A toy version of the receiver’s scoring step is sketched after these notes.
  3. The set reference game (Mu and Goodman 2022), in which states are sets of images that share a rule
  4. The concept game (Mu and Goodman 2022)

Note: in both these tasks there are usually two modalities. Perception with multiple modalities may be key to developing the discriminative ability to ignore distractions and focus on the most salient parts of the state. Each modality has its own distractions and noise. This pushes the agents’ language to be expressive enough to be general purpose. On the other hand, the real world is four dimensional, and large parts of languages, like tenses and parts of the case systems, are about capturing this. In any case, if we can get an adversarial setup, ideally the adversary can learn to generate distractions in all modalities.

A four-dimensional world is one where the state is a sequence of three-dimensional attributes that evolve. Some are salient to one task, others to another, and most are distractions that should be ignored. Also, in this kind of game agents can more readily learn to distinguish between cheap talk and informative signals, because distractions are not just random noise but are generated by a model that is trying to fool the receiver.
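As a concrete illustration of the referential game’s continuous loss, here is a minimal sketch of the receiver’s scoring step. The random embeddings stand in for learned encoders, and the dimensions and candidate count are arbitrary assumptions.

```python
import numpy as np

# Toy sketch of the scoring step in a referential game: the receiver embeds the
# message and each candidate state, scores candidates by dot product, and is
# trained with a cross-entropy loss on the index of the true state.

rng = np.random.default_rng(0)
d = 8
msg_embedding = rng.normal(size=d)            # receiver's encoding of the message
candidates = rng.normal(size=(4, d))          # true state + 3 distractors
target = 0                                    # index of the true state

scores = candidates @ msg_embedding
probs = np.exp(scores - scores.max())
probs /= probs.sum()
loss = -np.log(probs[target])                 # differentiable, continuous loss
print(probs.round(3), float(loss))
```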

Constraints

I am inclined to think that by adding constraints, preferably encoded as incentives, some undesirable outcomes can be avoided.

  1. Laziness: the loss of a complex Lewis game should penalize agents for long messages and reward them for short ones; see also (Rita, Chaabouni, and Dupoux 2020), where this is called lazy communication. (A toy reward with this shape is sketched after this list.)
  2. Impulsivity: the loss of a complex Lewis game should reward early actions, i.e. impulsiveness, if they result in a correct action; see also (Rita, Chaabouni, and Dupoux 2020), where this is called impatience.
  3. I think that these could arise in a framing game of predation which multiplies the Lewis game outcomes with a bonus and a malus, or in which each atomic signal sent carries a risk of sudden death.
  4. Communication bottleneck, see (Kirby 2001): complex signals would need to arise if agents have to use a restricted communication channel. I think of this as a Shannon noisy channel over which one can only send a short sequence drawn from a limited set of symbols. 3
  5. Error correction: if there are errors then agents will benefit from being able to correct them. Injecting errors into the signals should incentivize agents to learn a redundant, more complex signaling system that can detect and correct errors. Together with the previous item, this forms the notion of a Shannon game, operating as a frame game for the Lewis game.
  6. under-parametrization (Kottur et al. 2017), (Galke, Ram, and Raviv 2022)
  7. population dynamics (Chaabouni et al. 2022), (Rita et al. 2022)
  8. memory restriction (Cogswell et al. 2019), (Cornish et al. 2017)
  9. partial observability: agents only see a fraction of the states at training time, perhaps one or two combinations of each partial state. They need to be able to use language to coordinate on the full state by pooling their partial observations. This is what we generally mean by generalization.4
Rita, Mathieu, Rahma Chaabouni, and Emmanuel Dupoux. 2020. “‘LazImpa’: Lazy and Impatient Neural Agents Learn to Communicate Efficiently.” arXiv Preprint arXiv:2010.01878.

3 This together with the previous constraints should encourage agents to learn to do source coding on the signals.

Kottur, Satwik, José MF Moura, Stefan Lee, and Dhruv Batra. 2017. “Natural Language Does Not Emerge ‘Naturally’ in Multi-Agent Dialog.” arXiv Preprint arXiv:1706.08502.
Galke, Lukas, Yoav Ram, and Limor Raviv. 2022. “Emergent Communication for Understanding Human Language Evolution: What’s Missing?” arXiv Preprint arXiv:2204.10590.
Chaabouni, Rahma, Florian Strub, Florent Altché, Eugene Tarassov, Corentin Tallec, Elnaz Davoodi, Kory Wallace Mathewson, Olivier Tieleman, Angeliki Lazaridou, and Bilal Piot. 2022. “Emergent Communication at Scale.” In International Conference on Learning Representations.
Rita, Mathieu, Florian Strub, Jean-Bastien Grill, Olivier Pietquin, and Emmanuel Dupoux. 2022. “On the Role of Population Heterogeneity in Emergent Communication.” https://arxiv.org/abs/2204.12982.
Cogswell, Michael, Jiasen Lu, Stefan Lee, Devi Parikh, and Dhruv Batra. 2019. “Emergence of Compositional Language with Deep Generational Transmission.” arXiv Preprint arXiv:1904.09067.
Cornish, Hannah, Rick Dale, Simon Kirby, and Morten H Christiansen. 2017. “Sequence Memory Constraints Give Rise to Language-Like Structure Through Iterated Learning.” PloS One 12 (1): e0168532.

4 The larger the signaling system, the more challenging it is to learn from a few examples, as agents are trying to learn a grammar, a lexicon, and many distributions in order to decode messages more effectively.
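A minimal sketch of reward shaping in the spirit of the laziness and impulsivity constraints above (items 1 and 2). This is not the loss from (Rita, Chaabouni, and Dupoux 2020); the additive form, the function name, and the coefficients are illustrative assumptions.

```python
def shaped_reward(correct: bool, message_len: int, steps_to_act: int,
                  length_cost: float = 0.05, patience_cost: float = 0.02) -> float:
    """Toy reward shaping in the spirit of 'lazy' and 'impatient' agents.

    Illustrative only: the coefficients and the additive form are assumptions,
    not the published LazImpa loss.
    """
    reward = 1.0 if correct else 0.0
    reward -= length_cost * message_len          # laziness: penalize long messages
    if correct:
        reward -= patience_cost * steps_to_act   # impulsivity: reward acting early
    return reward

print(shaped_reward(correct=True, message_len=3, steps_to_act=1))   # ~0.83
print(shaped_reward(correct=True, message_len=8, steps_to_act=5))   # ~0.50
```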

Functors: abstractions of function composition

In mathematics, composition of functions is one of their most fundamental and useful properties. When we think about compositionality in natural language and in machine learning, we are really trying to impose some version of this idea onto the problem, a point we almost always forget. But since mathematics is where these ideas are formalized, mathematics is where some of the best ideas are likely to be found.

In mathematics one is often more interested in functions that preserve a structure, which are called morphisms; the abstraction of this idea is the functor in category theory.

[Figure: functor]

However, Lewis games do not require us to use only simple symbols. Agents can play the game with more complex signals and states. This is where the notion of compositionality becomes more interesting. We can think of the Lewis game as a function from states to signals.

  • formal languages deal with the transformation of one set of symbols into another set of symbols. This lets us rewrite a message from basic symbols into one with more complex symbols, and allows us to use numbers to represent the different languages in the Chomsky hierarchy. This is probably not the first thing that people in this field consider. However, work starting with the simple formalism of the Lewis game quickly raises the question of how we can get a language in which subsequent signals can be used to break the symmetry that leads to the ambiguities associated with partial pooling equilibria. This is worth noting, as it might be the necessary abstraction to properly state the problem.

[Figure: The Chomsky hierarchy expresses greater expressivity.]
  • games have a tree representation called extensive form.
    • We can graft additional trees of states onto the Lewis game tree, before and after it, and thus get games whose equilibria are more in line with various notions of compositionality and other properties of natural languages.
    • If this seems extreme, it is worth noting that most of the time we are not interested in a coordination task per se but in some other framing task in which coordination is a means to an end. If this task can be learned by MARL then we already have this kind of extended tree with an embedded Lewis game tree. It is essential that some kind of harmony is maintained between the parts, or the equilibria of the Lewis game may not be part of an equilibrium of the bigger game. E.g. Lewis games are cooperative games where the agents are trying to coordinate on a single equilibrium; if the framing game is a zero-sum game, it may eliminate the incentive to coordinate encoded in the payoffs of the Lewis game. I don’t mean to say you can’t have a game with incentives both to cooperate and to compete, but when you do it is a subtle balance to maintain both without breaking either.

[Figure: syntax tree]
  • signals can be aggregated in different ways, and it is hard to generalize from conjunction to recursive structures.

The syntax of English, for example, is clearly compositional—that is, the meaning of a sentence is some function of the meanings of the parts of that sentence. — (Kirby 2001)

Kirby, Simon. 2001. “Spontaneous Evolution of Linguistic Structure-an Iterated Learning Model of the Emergence of Regularity and Irregularity.” IEEE Transactions on Evolutionary Computation 5 (2): 102–10.
| Root (Gold) | Tense (Red) | Person & Number (Green) | Group Action |
|-------------|-------------|-------------------------|--------------|
| áll | Present | 1st person singular | állok |
| áll | Present | 1st person plural | állunk |
| áll | Present | 2nd person singular | állsz |
| áll | Present | 2nd person plural | álltok |
| áll | Present | 3rd person singular | áll |
| áll | Present | 3rd person plural | állnak |
| áll | Past | 1st person singular | álltam |
| áll | Past | 1st person plural | álltunk |
| áll | Past | 3rd person singular | állt |
| áll | Past | 3rd person plural | álltak |
| áll | Future | 1st person singular | fogok állni |
| áll | Future | 1st person plural | fogunk állni |
| áll | Future | 2nd person singular | fogsz állni |
| áll | Future | 2nd person plural | fogtok állni |
| áll | Future | 3rd person singular | fog állni |
| áll | Future | 3rd person plural | fognak állni |
  • this is considered a regular verb in Hungarian

  • I have omitted many of the forms of the verb for simplicity

  • we can see that person and number have an entangled representation

  • present tense is unmarked

  • past tense is marked by an infix

  • future tense has its own template with an auxiliary verb and an infinitive

  • there is another point I’d like to make, and it has to do with making agents able to communicate with humans.

  • if the agent’s language has the same group actions (homomorphisms) to express the structural semantics of nouns, verbs, pronouns, etc., it should be much easier to then learn to converse in the homomorphic human language. The task boils down to learning functors that map agentic roots to natural roots and agentic rules to natural rules. The agentic language might be highly regular with a single verb template, while to use Hungarian it might need to learn 60+ verb templates. But this is much easier, I think, than learning Hungarian from scratch. (A toy sketch of such a mapping appears below.)

  • in reality, learning a few extra rules (e.g. phonemic adjustments and vowel harmony) might facilitate being able to communicate with Hungarian verbs.

Note though that we are no longer talking about learning Hungarian, but some underlying structure that is shared between Hungarian and the agents’ language.
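To make the functor idea concrete, here is a toy sketch: a perfectly regular “agentic” verb template is mapped onto the Hungarian surface forms from the table above by a plain lookup standing in for the functor. The slot names and code are my own assumptions, and the mapping ignores vowel harmony and every other real complication.

```python
# Toy sketch: a regular "agentic" verb template mapped onto Hungarian surface
# forms taken from the table above. The lookup below stands in for the
# "functor" from agentic rules to natural rules; it ignores vowel harmony
# and other phonemic adjustments.

AGENTIC_TEMPLATE = "{root}-{tense}-{person}{number}"      # e.g. "áll-PRES-1SG"

SURFACE = {  # (tense, person+number) -> surface template
    ("PRES", "1SG"): "{root}ok",   ("PRES", "1PL"): "{root}unk",
    ("PRES", "2SG"): "{root}sz",   ("PRES", "2PL"): "{root}tok",
    ("PRES", "3SG"): "{root}",     ("PRES", "3PL"): "{root}nak",
    ("PAST", "1SG"): "{root}tam",  ("PAST", "1PL"): "{root}tunk",
    ("PAST", "3SG"): "{root}t",    ("PAST", "3PL"): "{root}tak",
    ("FUT",  "1SG"): "fogok {root}ni",  ("FUT",  "1PL"): "fogunk {root}ni",
}

def agentic(root, tense, person, number):
    return AGENTIC_TEMPLATE.format(root=root, tense=tense, person=person, number=number)

def to_hungarian(root, tense, person, number):
    return SURFACE[(tense, person + number)].format(root=root)

print(agentic("áll", "PAST", "1", "SG"), "->", to_hungarian("áll", "PAST", "1", "SG"))
# áll-PAST-1SG -> álltam
```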

This idea of learning an underlying structure and a surface structure is one that can be further abstracted. We can have a morphological level, a syntactic level, and a logical level, all disentangled and separable, or we can have them all sitting in one level and possibly entangled.

Entanglement can arise from … concatenation, from coding the most common segments into shorter segments, erosion, and swapping to help with difficult phone sequences.

This suggests that we might want a sequence-bag or soft-sequence aggregator: a convention that has a preferred order but is indifferent to changes in the order so long as semantics are preserved.

Word order is another case in point:

şehirlileştiremediklerimizdensiniz means “you are one of those that we could not make into a city dweller” in archaic Turkish. The word order is the opposite of English or Hebrew, because Turkish is an OV language while English is a VO language. The word order is a surface structure that is not important to the meaning of the sentence. (Deutscher 2006)

Deutscher, G. 2006. The Unfolding of Language: An Evolutionary Tour of Mankind’s Greatest Invention. Henry Holt and Company. https://books.google.co.il/books?id=maz9oLIKZKkC.

An evolving list of desiderata for emergent languages

  1. mappings between states and signals

    1. morphosyntax mappings preserve partial states (Homomorphism of normal subgroups)
    2. mappings preserve semantic topologies (if a is close to b then f(a) should be close to f(b))
  2. Stability

    1. Regularity is stable (mappings used in syntax and morphology are stable over time)
    2. Irregularity is stable (mappings used in irregular verbs and nouns are also stable over time). In English we maintain many irregular borrowings from other languages and thus preserve their regularity, making such exceptions easier to learn too.
  3. Compositionality

  4. Brevity (source coding)

  5. Self-correcting (channel coding to detect errors and correct them through redundancies like agreement, vowel harmony, etc.)

  6. Learnability - how many things need to be coordinated; the complexity of the structures; the Hoeffding bound on the rate of learning distributions when there are errors; the Bonferroni correction for multiple learners.5

  7. Stable irregularities

  8. Zipfian distributions - the most common signals should be the shortest and the least common the longest. This is a form of source coding and would arise naturally from Huffman coding, except that this isn’t practical for several reasons. It could also arise out of laziness in the sender.

  9. Faithfulness

  10. Distributional stability

  11. Decidability - easy to disambiguate ambiguous signals from their context

  12. Expressivity - the ability to express a wide range of concepts

  13. Generalization - learning the language is possible from just a few signal state pairs.

5 Multiple learners follow a similar logic to a multiple hypothesis testing problem, with each learner postulating a different signaling system after each failure or success in a Lewis game, more so when learners get to observe each other’s actions and rewards.

Some metrics

  1. Compositionality
    1. Topographic similarity, cf. (Mu and Goodman 2022); a toy computation is sketched below
  2. Source coding
    1. Compression ratio
    2. Entropy
    3. Mutual information
  3. Error detection and correction
    1. Error rate
    2. Hamming distance
  4. Learnability
    1. Number of examples to learn a new word
    2. Number of examples to learn a new rule
Mu, Jesse, and Noah Goodman. 2022. “Emergent Communication of Generalizations.” https://arxiv.org/abs/2106.02668.
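A minimal sketch of how topographic similarity can be computed, assuming Hamming distances on toy meanings and messages; real implementations typically use edit distance on messages and task-specific meaning distances.

```python
from itertools import combinations
from scipy.stats import spearmanr

# Topographic similarity: the Spearman correlation between pairwise distances
# in meaning space and in signal space. The toy meanings/messages are illustrative.

meanings = [(0, 0), (0, 1), (1, 0), (1, 1)]   # attribute-value tuples
messages = ["aa",   "ab",   "ba",   "bb"]     # a perfectly compositional code

def meaning_dist(m1, m2):                      # Hamming distance on attributes
    return sum(a != b for a, b in zip(m1, m2))

def signal_dist(s1, s2):                       # Hamming distance on characters
    return sum(a != b for a, b in zip(s1, s2))

pairs = list(combinations(range(len(meanings)), 2))
d_meaning = [meaning_dist(meanings[i], meanings[j]) for i, j in pairs]
d_signal = [signal_dist(messages[i], messages[j]) for i, j in pairs]

rho, _ = spearmanr(d_meaning, d_signal)
print(f"topographic similarity: {rho:.2f}")    # 1.00 for this compositional code
```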

Another random thought or two:

VO and OV via symmetry breaking

If we use a Huffman-coding-like process to organize the order of the morphological and syntactic elements (effectively fixing the on-average most surprising partial signals before the next most surprising ones), we should get emergent languages that are rather similar to each other and fairly easy to learn, like Turkish and Japanese. However, at the start there is the question of how to apply the aggregations. If the action comes first we get VO languages; if it comes second we get OV languages. I think that V carries more entropy in predation and resource-gathering games, so VO should be more prevalent. However, once this decision is made, most practical algorithms will not be able to reverse it.
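A toy illustration of the “most surprising slot first” ordering: template slots are ranked by the entropy of their value distributions. The slot names and counts below are made up.

```python
import math
from collections import Counter

# Order template slots by the entropy of their value distributions,
# highest-entropy (most surprising on average) slot first.

def entropy(counts):
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Observed values per slot in some corpus of events (illustrative counts).
slots = {
    "verb":   Counter({"flee": 40, "eat": 30, "hide": 20, "groom": 10}),
    "object": Counter({"berry": 70, "grub": 20, "leaf": 10}),
    "place":  Counter({"tree": 90, "rock": 10}),
}

order = sorted(slots, key=lambda s: entropy(slots[s]), reverse=True)
print(order)   # ['verb', 'object', 'place'] for these counts
```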

Vowel Harmony

If agents backpropagate with topographic similarity in mind, and the basic signals (phonemes) are endowed with a similarity, they may end up with systems with vowel harmony and alternation of consonants to capture sets (normal subgroups) with greater similarity.

If these regular configurations also lead to better channel coding, the benefits should persist.

Compositionality in Lewis signaling games

So here is a sketch of an algorithm for learning a compositional language in a Lewis game.

We need a language designer. This can be the sender, the receiver, or implicit. Without loss of generality we can assume that the sender is the language designer.

The language designer needs 1. a ‘semantic’ metric to decide when two states are close or distant, and 2. a way to deconstruct states into atomic orthogonal/independent parts. I am thinking of normal subgroups.

Note that we can define the metric on the parts and aggregate them to get the metric on the whole. This is a form of compositionality.

More abstractly, we can just say that the state is an algebraic topological group.

So the language designer can use a template with n parts (one for each of the subgroups), ideally ordered by decreasing size, to prefix-code the substates. If there are duplicate sizes this will yield multiple equilibria, to be selected via spontaneous symmetry breaking.

The designer can now allocate the states to points in the topology. By picking the system with the lowest overall distances we get a regular, compositional language.

Since there are many ways to do this, the designer needs to coordinate with the receiver. However, since there is great regularity, they only need to coordinate on a minimal set in which each atomic substate appears once.
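A minimal sketch of this designer under strong simplifying assumptions: the state factors into named parts, each value of each part gets an arbitrary short atomic code, and a fixed template orders the parts by decreasing size. The part names, values, and codes are illustrative.

```python
from itertools import product

# Sketch of the "language designer" idea: states factor into independent parts,
# each part's values get their own atomic codes, and a fixed template
# (parts ordered by decreasing size) concatenates them into a compound signal.

parts = {                       # a state is one value from each part
    "shape": ["circle", "square", "triangle"],
    "color": ["red", "green"],
}

# Order template slots by decreasing number of values (ties -> an arbitrary
# choice, i.e. spontaneous symmetry breaking between equivalent languages).
template = sorted(parts, key=lambda p: len(parts[p]), reverse=True)

# Atomic code per value of each part; here just short distinct strings.
codes = {p: {v: f"{p[0]}{i}" for i, v in enumerate(parts[p])} for p in parts}

def signal(state):
    """Compound signal: concatenate the atomic codes in template order."""
    return "-".join(codes[p][state[p]] for p in template)

# The full compositional signaling system, built from |shape| + |color| atomic
# codes rather than |shape| * |color| arbitrary signals.
for shape, color in product(parts["shape"], parts["color"]):
    state = {"shape": shape, "color": color}
    print(state, "->", signal(state))
```

The receiver only needs to coordinate on the per-part code tables and the template order, a minimal set in which each atomic substate appears once.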

Citation

BibTeX citation:
@online{bochman2024,
  author = {Bochman, Oren},
  title = {Compositionality in {Lewis} Signaling Games and {MARL}
    Transfer Learning.},
  date = {2024-10-14},
  url = {https://orenbochman.github.io/posts/2024/2024-10-18-Compositon-A-Guide-For-The-Perplexed/what-is-composition.html},
  langid = {en}
}
For attribution, please cite this work as:
Bochman, Oren. 2024. “Compositionality in Lewis Signaling Games and MARL Transfer Learning.” October 14, 2024. https://orenbochman.github.io/posts/2024/2024-10-18-Compositon-A-Guide-For-The-Perplexed/what-is-composition.html.