Review of “Emergence of Linguistic Communication from Referential Games with Symbolic and Pixel Input”
This is the paper that Marco Baroni used to explain the emergence of languages in his talk “Is Compositionality Overrated?”.
In (Lazaridou et al. 2018) the authors look at the emergence of language using a deep reinforcement learning approach. They train reinforcement-learning neural network agents on referential communication games. They extend previous work, in which agents were trained in symbolic environments, by developing agents which are able to learn from raw pixel data, a more challenging and realistic input representation. They find that the degree of structure found in the input data affects the nature of the emerged protocols, and thereby corroborate the hypothesis that structured compositional language is most likely to emerge when agents perceive the world as being structured.
Abstract
The ability of algorithms to evolve or learn (compositional) communication protocols has traditionally been studied in the language evolution literature through the use of emergent communication tasks. Here we scale up this research by using contemporary deep learning methods and by training reinforcement-learning neural network agents on referential communication games. We extend previous work, in which agents were trained in symbolic environments, by developing agents which are able to learn from raw pixel data, a more challenging and realistic input representation. We find that the degree of structure found in the input data affects the nature of the emerged protocols, and thereby corroborate the hypothesis that structured compositional language is most likely to emerge when agents perceive the world as being structured.
Outline
So I went through the paper and outlined most of the methodology, etc. It's a bit long, but I think it is worth it. Here is a quick outline. I might come back and add more material later. The video above, though not directly about this paper, is a good explainer for understanding this more easily.
Introduction
- Explores the emergence of linguistic communication through referential games with symbolic and pixel inputs.
- Motivated by understanding the role of environmental conditions on emergent communication.
- Introduces the use of deep reinforcement learning agents to scale up traditional studies of language emergence.
Referential Games Framework
- Based on multi-agent cooperative reinforcement learning, inspired by the Lewis signaling game.
- Involves a speaker communicating a target object to a listener, who identifies it among distractors (see the sketch after this list).
- Differentiates between symbolic data (structured and disentangled) and pixel data (entangled).
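To make the setup concrete, here is a minimal sketch of one round of such a referential game. The `TableSpeaker` and `RandomListener` below are toy stand-ins of my own, not the paper's neural agents (which are trained with reinforcement learning); they just show the flow of target, message, candidates, and reward.

```python
import random

class TableSpeaker:
    """Toy speaker: maps each object to a fixed random message (a stand-in, not the paper's agent)."""
    def __init__(self, vocab_size=10, max_len=2):
        self.vocab_size, self.max_len, self.table = vocab_size, max_len, {}

    def speak(self, obj):
        if obj not in self.table:
            self.table[obj] = tuple(random.randrange(self.vocab_size) for _ in range(self.max_len))
        return self.table[obj]

class RandomListener:
    """Toy listener: guesses uniformly; a trained listener would score message-candidate pairs."""
    def choose(self, message, candidates):
        return random.randrange(len(candidates))

def play_round(objects, speaker, listener, num_distractors=4):
    target = random.choice(objects)
    distractors = random.sample([o for o in objects if o != target], num_distractors)
    message = speaker.speak(target)                # the speaker sees only the target
    candidates = distractors + [target]
    random.shuffle(candidates)
    choice = listener.choose(message, candidates)  # the listener sees the message and all candidates
    return 1.0 if candidates[choice] == target else 0.0  # shared reward for a correct pick

objects = list(range(20))
wins = sum(play_round(objects, TableSpeaker(), RandomListener()) for _ in range(1000))
print(f"success rate with a guessing listener: {wins / 1000:.2f}")  # about 1/5 with 4 distractors
```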
Study 1: Referential Game with Symbolic Data
- Uses disentangled input from the Visual Attributes for Concepts Dataset.
- Demonstrates that agents can learn compositional protocols when input is structured.
- Explores the effects of message length, showing improved communicative success and reduced ambiguity with longer messages.
- Investigates how context-dependent distractors impact language emergence and object confusability.
Study 2: Referential Game with Raw Pixel Data
- Employs synthetic scenes of geometric objects generated using the MuJoCo engine.
- Agents learn to process raw pixel input without pre-training, achieving significant communicative success (a sketch of such a pixel-input speaker follows this list).
- Highlights environmental pressures’ role in shaping emergent protocols, leading to overfitting and ad-hoc conventions.
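For the pixel setting, here is a rough PyTorch sketch of what a pixel-input speaker could look like: a small convolutional encoder feeding an LSTM that emits a discrete message symbol by symbol. The layer sizes, vocabulary size, and greedy decoding are placeholders of mine, not the authors' architecture; the actual agents sample symbols and are trained with reinforcement learning.

```python
import torch
import torch.nn as nn

class PixelSpeaker(nn.Module):
    """Sketch of a pixel-input speaker: conv encoder + LSTM message decoder (hyperparameters are placeholders)."""
    def __init__(self, vocab_size=20, max_len=5, hidden=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, hidden),
        )
        self.embed = nn.Embedding(vocab_size, hidden)
        self.rnn = nn.LSTMCell(hidden, hidden)
        self.out = nn.Linear(hidden, vocab_size)
        self.max_len = max_len

    def forward(self, image):
        h = self.encoder(image)                              # image embedding initialises the LSTM state
        c = torch.zeros_like(h)
        sym = torch.zeros(image.size(0), dtype=torch.long)   # conventional start symbol (index 0)
        message = []
        for _ in range(self.max_len):
            h, c = self.rnn(self.embed(sym), (h, c))
            sym = self.out(h).argmax(dim=-1)                 # greedy pick; training would sample instead
            message.append(sym)
        return torch.stack(message, dim=1)                   # (batch, max_len) symbol ids

speaker = PixelSpeaker()
msg = speaker(torch.rand(2, 3, 64, 64))                      # two fake RGB scenes
print(msg.shape)                                             # torch.Size([2, 5])
```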
Structural Properties of Emergent Protocols
- Examines the topographic similarity metric, correlating object similarity with message similarity (a sketch of how to compute it follows this list).
- Observes compositional signals in structured environments but instability and environmental overfitting with pixel input.
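The topographic similarity idea is easy to compute. The sketch below correlates pairwise distances between objects (Hamming distance over attribute vectors) with pairwise distances between the corresponding messages (edit distance), using Spearman's ρ. The paper's exact definition correlates object similarities with message distances and flips the sign, so take this as an illustration of the metric rather than a faithful reimplementation.

```python
import numpy as np
from itertools import combinations
from scipy.stats import spearmanr

def edit_distance(a, b):
    """Plain Levenshtein distance between two symbol sequences."""
    dp = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, y in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (x != y))
    return dp[-1]

def topographic_similarity(objects, messages):
    """Spearman correlation between pairwise object distances and pairwise message distances."""
    pairs = list(combinations(range(len(objects)), 2))
    obj_d = [np.sum(objects[i] != objects[j]) for i, j in pairs]   # Hamming over attribute vectors
    msg_d = [edit_distance(messages[i], messages[j]) for i, j in pairs]
    rho, _ = spearmanr(obj_d, msg_d)
    return rho

# toy check: a message that simply copies the attributes is perfectly compositional
objs = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
msgs = [tuple(o) for o in objs]
print(topographic_similarity(objs, msgs))  # 1.0
```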
Probe Models
- Analyzes the speaker’s visual representations using linear classifiers (see the probe sketch after this list).
- Finds that disentanglement is necessary for encoding object properties and effective communication.
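The probe models are essentially diagnostic classifiers fit on top of the speaker's frozen visual representations: if an attribute such as colour or shape is linearly decodable, the representation is at least somewhat disentangled with respect to it. Here is a generic scikit-learn sketch; the representations and labels are random toy data standing in for the real hidden states and attribute annotations.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def probe_attribute(representations, labels):
    """Fit a linear probe on frozen representations and report held-out accuracy."""
    X_tr, X_te, y_tr, y_te = train_test_split(representations, labels, test_size=0.25, random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return clf.score(X_te, y_te)

# toy stand-ins for the speaker's pre-message hidden states and colour labels
rng = np.random.default_rng(0)
reps = rng.normal(size=(400, 64))
colours = rng.integers(0, 5, size=400)           # 5 hypothetical colour classes
print(f"probe accuracy on random features: {probe_attribute(reps, colours):.2f}")  # around chance (0.2)
```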
Conclusion
- Demonstrates that structured input aids compositionality, while raw pixel input challenges protocol stability.
- Highlights the scalability of emergent communication studies with realistic data and deep learning techniques.
- Suggests future work to mitigate overfitting and promote generalization across diverse environments.
The paper
Citation
@online{bochman2025,
author = {Bochman, Oren},
title = {Emergence of {Linguistic} {Communication} from {Referential}
{Games} with {Symbolic} and {Pixel} {Input}},
date = {2025-01-01},
url = {https://orenbochman.github.io/posts/2024/2024-10-10-marco-baoni-composionality/paper3.html},
langid = {en}
}
Comments
Looking over this paper I did not see any unreasonable claims, nor much that I thought was wrong. Although Lazaridou has a number of criticisms of research in this area, this paper seems to be a good example of how to do it right.
One term I am not sure I like is pre-linguistic concepts; usually we call these the states. However, I think the term isn't bad at all. It suggests that we aren't looking just at states but at an item we want to talk about. This makes more sense particularly when we think about bitmaps of states - they are less like states.
One more point is that by adding vision learning we are adding a second game; call it a classification game. The agents need to succeed at the classification game, otherwise they are just guessing. It is worthwhile to consider, though, that just guessing with a good memory is enough to develop a signaling system.
There is a massive asymmetry between the sender and the receiver that does not exist in the original game. The sender can learn all images via a ground truth, while the receiver can only learn about the correct ones. As a framing game this needs to be reconsidered: the sender's vision should be evaluated on a scale between an agent with perfect vision and one with perfect blindness as baselines, and the same for the receiver.
The vision capability should be factored into the evaluation of the agent's learning of the signaling system.
The paper has many interesting ideas and shows methods for achieving them. In a number of areas I think one could do better, but I doubt the results would be very different.
One area that seems worth further investigation is CONCEPTUAL ALIGNMENT in appendix A. This seems to be related to semantic grounding: getting the agents' language concepts/semantics to align with the world, or with a second set of semantics, such as a human language.
What they consider here is much more specific: does the visual capacity learned by the agents provide them with a disentangled view of the world, one that is in line with the compositional structure of the state space they are observing (the pre-linguistic concepts)?
It seems that either the methodology is inadequate or there is a problem with alignment.
What might be done -