Prisoner's Dilemma with Q Learning

This simulation is an extension of the Iterated Prisoner's Dilemma model, using the Q Learning Library to create an agent that learns a strategy to play against its opponent.

Iterated Prisoner's Dilemma

During each match, two agents compete with each other according to the rules of the Prisoner's Dilemma.

Every round, each agent chooses to either cooperate ("c") or defect ("d") as its current move, and then plays it against the other agent. Based on the pair of moves, each agent is awarded a certain number of points. A match consists of a set number of rounds, and at the end of each match the agents' histories are cleared.
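
As a concrete example, here is one common payoff scheme and a helper that scores a single round. The point values are illustrative assumptions, not necessarily the ones this simulation uses:

```javascript
// Illustrative Prisoner's Dilemma payoffs (temptation > reward > punishment > sucker).
// These values are an assumption; the simulation's actual payoffs may differ.
const PAYOFFS = {
  c: { c: [3, 3], d: [0, 5] }, // I cooperate: both cooperate / opponent defects
  d: { c: [5, 0], d: [1, 1] }, // I defect: opponent cooperates / both defect
};

// Score one round given both agents' moves ("c" or "d").
function scoreRound(myMove, theirMove) {
  const [myScore, theirScore] = PAYOFFS[myMove][theirMove];
  return { myScore, theirScore };
}

// scoreRound("c", "d") -> { myScore: 0, theirScore: 5 }
```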

For more detail on the Iterated Prisoner's Dilemma, see https://axelrod.readthedocs.io/en/stable/reference/description.html.

Extending with Q Learning

The primary goal of this extension is to introduce a new strategy that uses reinforcement learning to determine which action to take each round. The Q Learning library is a good fit for this, and requires only a few edits to start working.

The Q learning agent looks back a fixed number of time steps (determined by lookback in globals) to choose its next action. Each round in the history has four possible joint outcomes, so the total number of possible states in the Q table is 4 ^ lookback.
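
To make the state count concrete, a state can be thought of as the sequence of the last lookback joint outcomes. Here is a minimal sketch of such an encoding; the key format and function name are assumptions, not the library's actual API:

```javascript
// Encode the last `lookback` rounds as a Q-table key. Each round is one of
// four joint outcomes ("cc", "cd", "dc", "dd"), giving 4 ** lookback keys.
// Assumes both histories contain at least `lookback` moves.
function encodeState(myHistory, theirHistory, lookback) {
  const rounds = [];
  for (let i = myHistory.length - lookback; i < myHistory.length; i++) {
    rounds.push(myHistory[i] + theirHistory[i]); // e.g. "cd"
  }
  return rounds.join("|"); // e.g. "cc|cd|dd" for lookback = 3
}

// With lookback = 2 there are 4 ** 2 = 16 possible states.
```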

This also means that for the first few rounds the agent chooses its action randomly, since it does not yet have enough history to reference.

Behaviors

The library behaviors are added to the agent's behavior chain once there is enough history for it to reference.

The two custom behaviors written for the Q agent are reward.js and strategy_q.js. reward.js looks at the previous round's moves, determines the score the agent received according to the rules of the Prisoner's Dilemma, and sets the next_q_state property on the agent.
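
A minimal sketch of what reward.js might look like. The payoff values, the reward field, and the history field names (my_moves, their_moves) are assumptions; only next_q_state comes from the description above:

```javascript
// reward.js -- a sketch. Payoff values and the my_moves / their_moves /
// reward field names are assumptions about the agent's state.
const behavior = (state, context) => {
  const payoffs = { cc: 3, cd: 0, dc: 5, dd: 1 }; // assumed values

  // Joint outcome of the previous round, e.g. "cd" = I cooperated, they defected.
  const last = state.my_moves.length - 1;
  const outcome = state.my_moves[last] + state.their_moves[last];

  // Record the reward for the library, and set the next Q state from the
  // last `lookback` rounds of history.
  state.reward = payoffs[outcome];

  const { lookback } = context.globals();
  const mine = state.my_moves.slice(-lookback);
  const theirs = state.their_moves.slice(-lookback);
  state.next_q_state = mine.map((m, i) => m + theirs[i]).join("|");
};
```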

strategy_q.js translates the action chosen for the agent into the actual move it will play. It is also responsible for adding the Q learning library behaviors once the agent has enough history to reference.
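
strategy_q.js might be sketched roughly as follows; the library behavior filename, the q_action field, and the action-to-move mapping are all assumptions about the library's interface:

```javascript
// strategy_q.js -- a sketch. The library behavior filename and the
// q_action field are assumptions about the Q Learning library's interface.
const behavior = (state, context) => {
  const { lookback } = context.globals();
  const libraryBehavior = "@hash/q-learning/learn.js"; // hypothetical name

  if (state.my_moves.length < lookback) {
    // Not enough history yet: play a random move.
    state.next_move = Math.random() < 0.5 ? "c" : "d";
    return;
  }

  // Once there is enough history, add the library behavior to the chain.
  if (!state.behaviors.includes(libraryBehavior)) {
    state.behaviors = [...state.behaviors, libraryBehavior];
  }

  // Translate the library's chosen action into the move the agent plays.
  state.next_move = state.q_action === 0 ? "c" : "d";
};
```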

Globals

Several global properties were also added to enable the Q Learning library to function properly. You can find the full list of required globals in the library's README; a rough sketch is shown below.
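
For orientation only, the globals for this setup might look something like the following. lookback comes from the description above; the other keys are generic Q-learning hyperparameters standing in for whatever names the library actually requires:

```json
{
  "lookback": 2,
  "learning_rate": 0.1,
  "discount_factor": 0.9,
  "epsilon": 0.1
}
```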

Analysis

The Reward per Episode graph in the Analysis tab shows the score our Q learning agent receives each match. As the agent learns how best to respond to its opponent's strategy, its total score improves.