While reviewing the literature on lewis games and thier extensions I realized that there are a many ways that signaling systems can arise. Ofttimes people make assumption that the one they envision is the only one while in reality there are many nuanced ways that signaling systems can arise.
One fascinating aspect the Lewis signaling game is that although there are many theoretical equilibria intially the agents will fail to coordinate and they can only reach the optimal signaling systems after some iterations of the game in which they either evolve or use reinfocement learning to coordinate a themselves to a common signaling strategy. In the prisoners dillema agents can learn to coopertate if the game is iterated. In the Lewis signaling game agents can learn to coordinate on a signaling system if the game is iterated.
To reach a good signaling system requires some kind of algorithm as well some number of iterations. I don’t recall seeing a discussion of the the minimum or the expected number of iterations required to reach a signaling system under different algorithms. In other words most researchers have considered the complexity of coordination in signaling systems. This is actually a fairly simple problem to solve in the most common settings.
Another two point primerily addressed by the evolutionary game theory community who view evolution in terms of replicator dynamics is that of stability of equilibria and the notions of evolutionarily stable strategies. The first has to do with convergence of learning to an optimal signaling system. The second has to do with the ability of an equilibrium to resist invasion by a mutant strategy.
A related issues is that of enumerating different types of equilibriums in larger games. For basic lewis signaling games this is not very diffucult but once one imposes a structure on the state and requires complex signals to emerge we get to a point where it may be quite challenging to enumerate all the possible equilibria.
Another point of interest to me is to consider the emergence of grammer and of a morphology. In (Nowak1999?) The authors give a result for the emergence of garammar in a signaling system. This is that there are many more
I think it worth while to list them in this space — particularly as I believe that signaling systems are a key for transfer learning in reinforcement learning which together with learning to represent complex states may be the key to AGI.
Introduction
Listing number of different scenarios on how signaling systems can arise in the Lewis signaling games.
I will start with a story
Next add some dtails like some variants and look some basic analysis.
Finally I’ll try to place it into the context of MARL. Note that we will be dealing with partially observed multi agent RL. But each scenario can have a different setting.
In The book signals (Skyrms 2010) the author, Skryms, discusses how Lewis challenged the skepticism of his advisor Quine regarding the meaning and convention may arise via an arbitrary mechanism like symmetry breaking.
When I considered solving some additional issues surrounding the fundamentals of signaling systems I realized that I had a few different scenarios in mind and that writing them down with some semblance of formalism might be helpful. It turns out that indeed this turns out to be a stepping stone towards developing an optimal algorithms for learning signaling system in different rl settings.
Let’s face it under different settings the task of acquiring a signaling system can be easier or harder. In (Skyrms 2010) the author points out that at symmetry breaking all the different signaling systems that could be learned are equivalent. However if there is an asymmetry in the form of a non-uniform distribution of states or different signaling risks then we we might prefer some signaling systems over others and there might even be a unique optimal signaling system. Furthermore like in reality one would expect that with time distributions of states might change and the optimal signaling system might change as well.
Sender and Receivers consult an “Oracle” (perhaps in book form). The oracle tells them how to map states to action, the oracle provides the sender with a one to one mapping of states to actions and to the receiver with the transpose, a mapping of signals to actions. The sender and receiver can then use this information to infer the signaling system.
In many situation where agents share some experience or can consult the same oracle they can infer the same signaling system and avoid the cost of lengthy coordination required to reach a common signaling system. This is the easiest case and the most likely scenario for the evolution of signaling systems.
Two cases come to mind.
They have booth been observing the state space long enough to infer the distribution of states to a high degree of confidence.
They can listen to a third party who knows the distribution and learn to signal from them.
They can access a state classifier and send it random noise thus deriving an empirical distribution of states in the classifier (not nature) and use it to learn the signaling system.
coordinate via the state distributioncoordinate by imitationcoordinate via a classifier
Once a distribution of states in known it can be used to create huffman codes using 0 and 1. These signals are then ranked.
There is a distribution of the states of the word known to all players.
In the easiest case each state has a different probability of occurring. -It is easiest because all players can infer a canonical signal system from such a distribution of states.
They order states and corresponding actions in decreasing expected value. The canonical system is the one mapping between the states and the actions.
Thus the salience distribution breaks the symmetry of all viable signaling systems and leaves just one option.1
In each subsequently harder case there are two or more states with equal probability of occurring. These probabilistic symmetry of these states cannot be broken as before and require the use of coordination. The coordinators can break the symmetry by trial and error when that state arises. Once all the symmetries have been coordinated the players can infer the rest via the canonical signal system from the distribution of states.
In the worst case all states have equal probability of occurring. This is the hardest case because after each state signal pair the problem is still maximally symmetric. The players need to solve this by using trial and error.
1 This is notion of a most salient mapping acts as an optimal policy for agents who need to quickly avoid the long run costs of a non salient signaling system
MARL Formulation
In terms of the MARL formulation:
A PMDP has states S and actions A.
States are observed by agents of type S whose actions are signals
Actions are performed by agents of type R.
Rewards are assigned symmetrically to both All senders and receivers when the receiver action matches the sender observed state.
States can be uniformly distributed or be drawn from a distribution.
We like to call such a distribution the saliency distribution after Schelling notion of a focal point AKA (Schelling point) in his book The Strategy of Conflict. In a lewis signaling game there are n! signaling systems if there are n states, signals and actions. If the states are uniformly distributed then all signaling systems are equivalent. But if the states probabilities are monotonicaly distributed then there is a unique optimal signaling system which is precisely the Schelling point.
Since saliency
2. Learning the Saliency distribution.
Story: Creation of the Oracle of Bayes
In another tribe where agents are too busy with their routine to coordinate on a signaling system. But they vigilantly observing and tallied thier environment. all the agents sooner or later will record the same empirical distribution of states. Whenever a state’s probability emerges into ‘significance’ it becomes common knowledge which allows all to order it along with the others and to enumerate with its ‘canonical’ signal. As the states’s distribution evolves over time so does the signaling system.
Another point is to consider that if agents just observe states long enough they should eventually learn to approximate the state distribution. How long would this take ?
If there least common state has probability \alpha and the agents want to know the distribution with confidence \epsilon they would need, according to Hoeffding’s Inequality
K\ge\frac{log(2/\epsilon)}{2\alpha^2} \qquad \text{(samples to learn S)}
also recall that although there is no lower bound on \alpha when S\sim Uniform[N] the upper bound is 1/N
K\ge\frac{N^2log(2/\epsilon)}{2} \qquad \text{(samples to learn uniform S)}
Code
import math# Given valuesK =8# statesepsilon =0.34# confidence# Calculate time to learn the saliency distribution # N using the formula N >= (K^2 * log(2 / epsilon)) / 2N = (K**2* math.log(2/ epsilon)) /2print(f'Expected time {int(N)} to learn a {K} state distribution with confidence {epsilon}') # Expected time to learn a signaling system with N statesT = K * math.log(K)print(f'Expected time {int(T)} to learn a {K} signaling system ')
Expected time 56 to learn a 8 state distribution with confidence 0.34
Expected time 16 to learn a 8 signaling system
So learning a signaling systems is easier then learning the distribution of states. Once they they know how to signal states it is easy to use this system to communicate the distribution to all the receivers.
We have not put a cost on learning the signaling system. But if there was a cost associated with learning we could use it to model when agents would prefer to learn the signaling system or just wait until they can infer the distribution of states and infer they systems from that.
A third point is that if they are bayesian they could start to infer the signaling system after viewing a few stats and update thier system as they update their beliefs regarding the distribution of states.
Bringing Up Baby
Story: Bringing Up Baby
Here the sender is tha parent and the receiver the child. Each time the child learn a new action a new signal is added to the signaling system. Since the other signals are known the child can learn the new signal in a single step. This is another trivial case where learning is easy.
Hoppes Urn
Incremental Learning
In RL this is called incremental learning. We can also assign such signals to sequences of actions which we call capabilities. The child can learn a new capability in a single step. This is the most efficient way to learn a signaling system incrementally.
Skryms discusses two methods that agents can use to learn a signaling system incrementally. First is the Chinese restaurant process and the second is the Hoppe urn. He suggest that they are equivalent. I too came up with the Hoppe urn model - as I had already investigated how to codify the most common probability distributions as urn models.
Another way to make learning easier is to always have just one action in context when we need to learn. This allows the receiver to learn the signal system in a single step. It might work with a student learning to signal and act in tandem.
incremental learning with one new action
In this case urn used in learning have an Hoppe urn with a black stone indicating that a new state action pair is being learned. If the receiver learns the new signal action pair, the agents keep track of it otherwise the new signal and action are discarded.
Note that if the there is only one new state and action a suitable algorithm can learn it immediately. IF there is an exploration - this may cause an error.
We retain this mechanism and might use it for expanding a signaling systems incrementally in the presence of new data.
Note: if there are saliency distributions is being used a new signal would be the last signal in the saliency distribution or in the last group. Over time signals that are not in use might be discarded if thier saliency is bellow the minimum saliency threshold.
3. Ship of Fools
Story: Ship of Fools
Senders and Recievers lack all prior knowledge. They follow an optimal strategy for a related game the battle of the sexes. Is a state is uncoordinated senders will explore randomly pick a signal and recievers will randomly pick an action until they get a reward and exclude the signal action pair from exploration.
This strategy is not the best one for senders, but it is easier to anlyse.
If the state is T and there are N states, signals and actions then are N\times N choices for sender and recievers of which the ones with action A=T get a reward. So the expected reward is 1/N chance of getting a reward.
The expected rewards are 1/N but since the sender is randomizing each turn is independent. Can they do better?
3. The steady navigator
Indeed they can do better. If the sender picks a signal and sticks with it the receiver can eliminate an action each turn. This is the optimal strategy for this, the most common setting of the Lewis signaling game.
Story: The Steady navigator
Senders and Recievers lack all prior knowledge. For each new state, the sender picks a signal at random but if the state is the same as the last state the sender sticks to the same signal. The receiver must explore an action at random but if the signal is the same as the a previous seen signal the receiver will explore an an untested action for the signal until they get a reward.
Lets estimate the expected rewards under this strategy for a state T and N states, signals and actions.
Sender has 1 signal and
Since the sender sticks with the same signal the receiver can eliminate an action each turn.
Receiver has N choices intially with 1 correct choice so we has a expected chance of 1/N of getting a reward.
Next he can eliminate his first choice and has N-1 choices with 1 correct choice so we has a expected chance of 1/(N-1) of getting a reward.
And after k tries he has N-k+1 choices with 1 correct choice so we has a expected chance of 1/(N-k+1) of getting a reward.
In the worst case he will have to try all N actions but
The Expected number of steps
\begin{aligned}
\mathbb{E}[steps] &= \sum_{k=1}^{N} \frac{1}{P_{\text{success k}}} \times P_\text{failure up to k} \newline
&= \sum_{k=1}^{N} \frac{1}{{N-(k-1)}} \underbrace{\times \prod_{i=1}^{k-1} \frac{N-i}{N-i+1}}_{\text{telescopic product}} \newline
&= \sum_{k=1}^{N} \frac{1}{\cancel{{N-(k-1)}}} \times \frac{\cancel{{N-(k-1)}}}{N} \newline
\end{aligned}
MARL Formulation
This is basicaly an optimistic initialization strategy. The sender does not explore. The reciever intilizes all signal action pairs optimisticaly with value of 0.5. This way he will keep exploring untill he gets a reward of 1.0 At this point exploration ends.
So we can expect that the number of steps needed to learn to signal the state T is N. They should pick a signal for a state and stick with with it.
The Guru’s Prior
The Sender is a privileged elder who knows the distribution of the states, the associated risk and cost of signaling to sender and receiver and figures our the optimal signaling systems. As such he selects a specific signaling system. This means that students need to coordinate to this system.
This means that whenever the state s_i arises we will get signal sig_i=Send(s_i) rather then some random signal. This means that the student for a mistake the receiver can use a negative reinforcement for <sig_i,action_j> is the return is 0. This should allow the receiver to narrow down the actions chosen for the next time we he gets that signal.
This is second hardest learning scenario but also most realistic. We don’t want to have to learn a new language for every person we meet.
What could happen - the distribution of states could evolve over time.
The prophet’s prior
The sender knows the distribution of the states and how it evolves over time. He choses the currently optimal signaling system. The receivers must learn the signaling system but once a change in the state distribution is observed they will switch to the the new optimal signaling system.
Imagine a world with many predators troubling the signaler. To avoid becoming prey agents must send a risky signals to their neighbors. They should use the signaling with the least expected cost. This cost combines the predator risk and its frequency. Signals can be 1 or 0. 1 is risky and 0 is safe. As frequency of the predators change the optimal signaling system will change as well.
The Gurus’ Posterior
Here there are multiple gurus with knowledge of different distribution. Can they coordinate on the most salient signaling system with respect to thier common knowledge ?
This should be the signaling system that is most salient for a mixture distribution with weight w_i for each guru.
Lets perhaps assume that there are a very large N and a cutoff \epsilon probability for which the gurus won’t bother to include rare sates.
In the second setting two or more students must come up with any signaling systems as fast as possible.
Babylon Consensus
Multiple senders and receivers take shelter in common ground and need to arrive at a common signaling system.
They can want to learn the least costly signaling system in terms of learning.
They want to learn the most salient signaling system in terms of the distribution of states.
There is an agent who knows the current distribution of states and the optimal signaling system.
There isn’t such an agent but the senders want to use a
Cost of learning a second dialect
for each agent and for each signal that is different from the target signalaling system add a cost of 1.
C = \sum_{i=1}^{N} \sum_{j=1}^{M} \delta_{ij} \\
\tag{1}
where \delta_{ij} is 1 if the signal j is different from the target signal for state i and 0 otherwise.
POMDP
In this settings one or multiple senders only a partial state.
Again we consider a hypothetical case where the state describe predators and that it can be partitioned into disjoint parts like <type, proximity> or <type, proximity, number> or <type, proximity, number, direction>. This partioning is also at the basis of compositionality in signaling systems.
Skyryms first considers three different settings.
observation one of mutually exclusive partition: the case where each sender views one part of the partitioned state.
observation of all mutually exclusive partition the case where senders see all the parts of the state but don’t have a mechanism in place to coordinate who sends which part of the state.
observations of all mutually exclusive partition with coordination the case where one sender see all the parts of the state but lacks symbols to send the full state and needs to send each part. He must send the parts one at a time resulting in a sequence of signals.
In the first settings the receiver somehow knows that he should first aggregate the signals using a logical and then decode the state.
In the first settings
where the agent again observe the full state but don’t have a a coordination mechanism for picking differnt parts of the message.
They send a partial signal to the receiver who must infer the state and take the appropriate action. The receiver must
aggregate the messages
infer the state
take the appropriate action
note:
In the first case so long as each part of the state is a unique signal the state can be infered by the reciever using conjunction. The second case if more problematic and shows us a new way that some signaling systems can be better then others.
part the agent can’t infer the state better then chance. However reinforcement of random partition the senders can learn to send they both need to learn a decorelated partition for each state the state and send different parts of the state. The issues is if the semantics are composeable.
An issue here is that there is no guarantte that the senders will send the same part of the state at each turn. If the aggregation rules is conjunction, i.e. logical and, then the receiver will be able to decode the state so long as he gets all the pieces.
Bayesian Adversarial Signaling
There are multiple senders and each state is known to more than one sender. Each sender has a voracity parameter \nu, this is the probability that they send a faithful signal. At one extreme senders make small mistakes and at the other they are completely deceptive. At the extreme the agents have types (like knights and knaves) and the receivers must learn to classify the agents by type and then learn the signaling system. Agents need to learn a
Babbling Bayesian Babies
Babies in the babbling stage of language development are learning to signal. They are sending all possible phonemes and the parents and thier parents either respond or talk to each other. The babies are collecting the feedback and reinforecing both poitively and negatively until they only use the phonemes that are in the language of thier parents. They start with over 300 phonemes and end up with 40-50.
In this scenario the sender operates at random. Both the sender and the receiver must observe the rewards and reinfoce state signal action triplets.
The evolution of signaling systems
In this section I want to address some of the questions that drive my research on signaling systems.
When do we expect signaling systems to evolve?
When agents fitness is increasingly predicated on coordination or communication they will get a benefit for evolving signaling systems. I.e. a evolutionary pressure to communicate will lead to the evolution of signaling systems.
What are the main desiderata for signaling systems?
Here are some of the main desiderata for signaling systems:
Efficiency - the signaling system should be as short as possible.
Salience - the signaling system should be most salient for the distribution of states.
Cost - the signaling system should be as cheap as possible to learn and use.
Robustness - the signaling system should be robust to noise and deception.
Adaptability - the signaling system should be able to adapt to changes in the distribution of states.
Compositionality - the signaling system should be able to be combined with other RL activities to form
more complex signaling system.
more complex policies.
This is most clearly illustrated in:
The predation scenario where
Agent’s short term survival is predicated on their ability to respond to signals indicating the presence of predators by take the appropriate precautions. Of course signals need a source.
Agents can send a signals for the state they perceive or to stay mute.
Agents can repeat signals they receive or stay mute.
As predation increases, selection pressure may induce signaling systems to evolve.
The Dowery/Courtship scenario where:
The game can be cooperative or competitive.
In the competitive case only the fittest agents get a mate.
In the cooperative case all agents get to mate but some will mate more often, or with more desirable mates.
Agent must collect resources (e.g. a bill of goods for a dowery) before they can reproducing from a changing landscape.
Only the top n dowries will generate an offspring. (bills of goods slowly perish but the size and diversity of is important).
Alternatively only the agent that is the the best at courtship n times can generate an offspring. (this time there are smaller bills of good that quickly perish)
Resources are plentiful but evanescent.
Agent that can signal would be able to collect a dowery faster and increase thier fitness.
As competition increases benefits signaling systems should evolve.
This is interesting as the exploration/exploitation dilemma caps the rate at which agents can reproduce. Yet signaling will allow agents to over come this cap.
This is also a case where agents may get a benefit from sending false signals if the receiver is a serious contender. So that the receiver will waste time and resources.
The agents must learn to discriminate To handle deception agents may also develop a model of the mind of the sender to predict the likelihood of deception. They may also want to tally if the sender has been deceptive in the past.
Or
The Knights & Knaves scenario where:
Agents need to:
Classify agent by type. (knight or knave, monkey, insane, etc.) to interpret the semantics of their signals.
Assemble the state from messages with different semantics to recover the state of the world.
This scenario does assumes the agents have an underlying motivation to learn to signal.
And now add a selection pressure on the evolution of basic logic and semantics.
Agents that communicate can spend less time exploring and more time exploiting. . In this case the agents will evolve a signaling system that is most salient for the distribution of states. This is the most likely scenario for the evolution of signaling systems. The reason why agents might want to learn a signaling system is to maximize their fitness
What are the main parameters that affect the learning of signaling systems?
state distribution (these are the states of the world and signaling is used to share these states with others to maximize fitness - the expected progeny)
saliency distribution (weights for states ranking thier risk)
voracity of senders.
cost of signaling (risk of predation).
What are the different settings for learning signaling systems?
Some other questions within these contexts might be:
What are the number of signaling systems for a given number of states and actions?
What are the number of pooling equilibria for a given number of states and actions?
Let’s break these down by the degeneracy of the pooling equilibrium. This might suggest the minimal number of signals needed in an experiment to learn the signaling system. It might also suggest the thresholds of success for optimal signaling systems in different settings.
Can we estimate the regret for different RL algorithms ?
What is the expected signaling success for each of the above?
What is the expected and the mean number of steps to acquire a signaling system for a given number of states and actions under different settings?
How does having more senders or receivers affect the above?
What is the complexity of n-agents to come up with a common signaling system?
under full communication
under partial communication
How does locality affect the time to a universal signaling systems?
if there is full observability
if communications are one to one
if communication are different neighborhood, Von Neuman, Moore, hexagonal, other lattices, chains, rings, random graphs. (need to use optimal dynamics)
Another question that like a lemma on time needed for an agent to become experienced enough to setup an optimal signaling system?
Given distribution S of states with k states and some the rarest state s' having probability p(s') = \alpha what is the expected number of observations needed for agents to approximate the distribution of states to within some credible interval \epsilon<\alpha?
Note while there is no lower bound on alpha the upper bound is \alpha = 1/k for a uniform distribution of states. I think this is the Bayesian version of an empirical distribution. This would be a waiting time for becoming experienced.
After this waiting time a steady state distribution should be known to all agents.
Under partial observability the agents need to cooperate to learn the signaling system in a distributed manner. If the agents are on a grid or on a graph what are the bounds on coordination time for learning the signaling system - using a gossip protocol - i.e. each agent can only communicate with its neighbors - using a broadcast protocol - i.e. each agent can communicate with all other agents - using a random walk protocol - i.e. each agent can communicate with a random agent - using a central coordinator - i.e. each agent can communicate with a central coordinator - using an ambassador - i.e. each agent can communicate with an ambassador who can communicate with many other agents per Ramzey’s theory
While reviewing a paper of this subject I had realized that there are a number of hypothetical scenarios for signaling systems to arise.
In RL we have different setting for learning optimal strategies. Some of theres different scenarios can be framed in this form.
I wanted to list them here so I can reference them later
But thinking as I list these I notice that some provide an easy solutions to problems that others don’t.
One point of interest. If the agents are concerned with picking the right action for each state, they should collapse any states which share the same optimal action into a single signal. This will reduce the number of signals they must be learned and reduce the overall message length and cost of signaling. So in reality we should not be overly concerned with the number of actions exceeding the number of states.
When there are not enough signals agent need to learn to aggregate signals.
add
learning by evolution:
replicator dynamics with
agents have random signaling systems assigned and the systems with most payoffs is selected through population dynamics.
children learn thier parent matrix via sampling.
one parent (perfect and imperfect transmission)
two parents
pidgins via shared dictionaries
creoles shared grammars and dictionaries
adding some mutation - adding mutations to the childrerns signaling system.
based on paper by (Nowak and Krakauer)
learning via reinforcement learning
spontaneous symmetry breaking scenarios vs planning
If there are N signals, states and actions is there an advantage to planning a signaling system vs letting it evolve in terms of the number of steps needed to learn the signaling system?
random signaling means that each step is an independent trial.
Sender can send N signals and
Receiver can guess N Actions
So there are N^2 combinations per turn.
So there are Only the ones with A=T get a reward so there are N good combinations. So there is a N/N^2 = 1/N chance of getting a reward. So we can expect that the number of steps needed to learn to signal the state T is N.
planning means that the sender picks one signal and sticks to it. In this case Receiver gets to systematicaly eliminate an action every time.
sender has 1 signal and
receiver can guess N at first and N-1 at second and N-k-1 at kth turn.
So there are n+1/2
actions giving 1*N combinations and only ones with A=T get the payoff. So there is a 1/N chance of getting a reward. So we can expect that the number of steps needed to learn to signal the state T is N.
Thus planning is faster than random signaling.
random signaling means that there are (2n/n*n)^n = 2
is agent use positive reinforcement only then
are there conditions where the signaler/reciever gets to detrmines the signaling system?
if Sender sends random signals from L-{coordinated} R must guess the state From L-{coordinated}.
if S wants to switch X and Y ? and does so R get 0 . If R is epsilon greedy he will find the new semantics.
A meta protocol would require a code switching signal be “Swap X Y”
Source coding scenario errors in encoding & decoding - based on paper by (Nowak and Krakauer)
errors in the transmission channel based on paper by (Nowak and Krakauer)
risks - there are signals with monotonicaly increasing risk.
payoffs for signals are symmetric
cost associated with the risky signals are borne by the sender
if recievers can respond correctly after getting a partial message they get a bonus.
we can also consider sharing cost and rewards symmetrically. — creating a complex system with compositionality using self play