Pavlov’s Arcade

This is the homepage for our EECS 349 Machine Learning project at Northwestern University. You can read more about our project below, read the extended abstract or skip to our interactive demo!

Authors

Angela Jiang angelejiang2014 at u.northwestern.edu
Motoki Mizoguchi motokimizoguchi2013 at u.northwestern.edu
Max New maxsnew at gmail.com

Introduction

Video games are designed to challenge human players. A new player starts with no knowledge of the game mechanics and learns by reward and punishment which strategies are good or bad. In our project, we apply this philosophy of gaming directly to the construction of an AI player. Just as a human would, the bot plays the arcade game repeatedly, being rewarded for advancing and punished for losing. Starting from a random strategy, the learner uses Q-learning, a reinforcement learning technique, to update its policy.

In reinforcement learning, the AI receives positive and negative rewards based on the outcomes of its actions. For any given state, the optimal move may not be clear from immediate rewards alone. Reinforcement learning algorithms allow the bot to learn from delayed rewards, so that it can choose moves that maximize future reward. In the case of Breakout, at every instant the bot must choose a move so that, in the future, the ball will hit the paddle and the game will not be lost.
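
As a concrete illustration, the standard tabular Q-learning update maintains a table of expected future reward for each state-action pair and nudges it toward the observed reward plus the discounted best value of the next state. The sketch below is a generic version of that update; the learning rate and discount factor shown are illustrative, not the values used in our project.

```python
# Minimal sketch of the tabular Q-learning update.
# ALPHA (learning rate) and GAMMA (discount factor) are illustrative values,
# not necessarily the ones used in our project.
from collections import defaultdict

ALPHA = 0.1   # how strongly a new observation overrides the old estimate
GAMMA = 0.9   # how much future reward is worth relative to immediate reward

ACTIONS = ["left", "stay", "right"]
q_table = defaultdict(float)  # maps (state, action) -> expected future reward

def q_update(state, action, reward, next_state):
    """Move Q(state, action) toward reward + discounted best future value."""
    best_next = max(q_table[(next_state, a)] for a in ACTIONS)
    old = q_table[(state, action)]
    q_table[(state, action)] = old + ALPHA * (reward + GAMMA * best_next - old)
```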

Breakout is a simple arcade game originally released in 1976 by Atari, Inc. In the game, the player controls a single paddle that sits at the bottom of the screen and can be moved left or right. Above the paddle are a number of bricks that the player attempts to eliminate. A ball is released into the game and bounces around the screen. The player has to prevent the ball from passing through the bottom of the screen, or they lose a life. The player advances the game by bouncing the ball into the bricks, which disappear when hit. When all of the bricks on the screen are eliminated, a new set of bricks appears and the game continues. The game ends when the player loses all of their lives or a time limit runs out. Check out the demo below to see the game in action.

Design

In order to run Q-learning, we need to consider the state of the Breakout game. A game of Breakout has a court containing bricks, a ball, and a paddle. The ball, the paddle, and each brick have x-y coordinates and a size, and the ball and the paddle also have velocities. The x-y coordinates are given in pixels, so they depend on the window size of the browser running the game, which may range from 300 to 800 pixels in each dimension. With just the paddle and the ball, the state space can be on the order of 10^9 to 10^11 states. In order to keep the state space small and consistent across screen sizes, we decided to normalize the x-y coordinates to the width of the smallest brick unit, which we call a chunk. The court was normalized to a field 15 chunks wide by 15 chunks tall, and the x and y values of the ball and the paddle were mapped onto this field. This mapping reduced the state space for these two objects by a factor of about 10^5.
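
A minimal sketch of that normalization might look like the following; the function and variable names are ours for illustration, and the rounding details may differ from the actual implementation.

```python
# Sketch of mapping raw pixel coordinates onto the 15x15 chunk grid
# described above. Names and exact rounding are illustrative.
GRID = 15  # the court is normalized to 15 chunks by 15 chunks

def to_chunks(x_px, y_px, court_width_px, court_height_px):
    """Convert pixel coordinates to discrete chunk coordinates in [0, GRID-1]."""
    cx = min(GRID - 1, int(x_px * GRID / court_width_px))
    cy = min(GRID - 1, int(y_px * GRID / court_height_px))
    return cx, cy

# Example: a ball at pixel (412, 230) in a 600x480 court lands in chunk (10, 7).
print(to_chunks(412, 230, 600, 480))
```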

[Figure: Breakout state space]

Implementation

Our game is adapted from an open-source Breakout implementation written in JavaScript. Instead of receiving commands from the keyboard, it queries our server for each move. The game is therefore synchronous with the server and only advances after each response. The Python server provides the JavaScript bot with its next move.

The game provides the server with a reward and the current state of the game, and receives a command to stay, move left, or move right by a single chunk. It also notifies the server when the game is won or lost. In each case, the server updates the Q-table value for the previous state with the observed reward and then determines the bot’s next move.
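
A minimal sketch of how the server side of this exchange might look, assuming the game posts a small message containing its discretized state, the observed reward, and a game-over flag. The field names and overall shape are our own illustration, reusing q_update from the sketch above and choose_action from the ε-greedy sketch below; the actual implementation may differ.

```python
# Hypothetical sketch of the server's per-move handling. The message fields
# ("state", "reward", "game_over") are illustrative, not the real protocol.
last_state, last_action = None, None

def handle_move_request(msg, learning=True):
    """Update the Q-table for the previous move, then pick the next one."""
    global last_state, last_action
    state = tuple(msg["state"])   # e.g. discretized ball and paddle positions
    reward = msg["reward"]        # positive for paddle hits, negative for a lost life

    if last_state is not None:
        q_update(last_state, last_action, reward, state)

    if msg.get("game_over"):
        last_state, last_action = None, None   # start fresh on the next game
        return "stay"

    action = choose_action(state, learning)
    last_state, last_action = state, action
    return action
```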

Before our server can provide the optimal moves to our bot, it must learn the strategy. Our server therefore has two modes, learning and testing. In learning mode, our server chooses moves based on an ε-greedy algorithm to allow for exploration. In testing mode, it provides the bot with the move that is expected to maximize its reward.
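
As an illustration, an ε-greedy choice over the three moves might look like the following sketch, which reuses the q_table from the update sketch above; the value of ε shown is illustrative, not our tuned value.

```python
import random

EPSILON = 0.1   # illustrative exploration rate
ACTIONS = ["left", "stay", "right"]

def choose_action(state, learning=True):
    """Explore with probability EPSILON while learning; otherwise exploit
    by returning the move with the highest Q-value for this state."""
    if learning and random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: q_table[(state, a)])
```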

Results

While learning is underway, we take snapshots of the Q-table at intervals of 50,000 and 1,000,000 moves trained on, and we then test on these intermediate tables and compare performance (a sketch of how such snapshots can be taken follows the list below). The performance metrics that we used were:

  1. Number of times the paddle hit the ball
  2. Levels beaten
  3. Time elapsed before losing
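
One simple way to take such snapshots is to periodically serialize the table to disk and reload it later for testing. The sketch below illustrates this; the interval and file names are illustrative rather than the exact ones we used.

```python
import pickle

SNAPSHOT_EVERY = 50_000   # illustrative snapshot interval

def maybe_snapshot(q_table, moves_trained):
    """Write the Q-table to disk every SNAPSHOT_EVERY moves so it can be
    reloaded later and evaluated as an intermediate learner."""
    if moves_trained % SNAPSHOT_EVERY == 0:
        with open(f"qtable_{moves_trained}.pkl", "wb") as f:
            pickle.dump(dict(q_table), f)

def load_snapshot(path):
    """Reload a saved table for testing mode."""
    with open(path, "rb") as f:
        return pickle.load(f)
```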

Each bot was allowed 100 lives, and the results are averages. Below, we look at the average number of paddle hits as our bot trains. We see that the number of moves the bot has trained on correlates with a larger number of paddle hits, which shows that the bot is indeed learning from our reward for paddle hits. In a level without bricks, we would expect an optimal learner to prolong the game indefinitely.

[Figure: Learner hits over the course of learning]

Next, we look at the average time elapsed per game in the figure below. The short duration per game is due to the game being sped up for testing purposes. There is less of a noticeable correlation here, which is evidence that our original assumption about the optimal strategy may not be correct.

[Figure: Learner time to death over the course of learning]

Conclusion

In our project, we applied reinforcement learning techniques to the construction of a Breakout bot. In our testing, we observe that our bot’s performance increases with training. However, because we see little correlation with other metrics of game strategy, we cannot determine whether our proposed strategy is optimal.

Demo

Below is a running demo of our project. The game is adapted from a version by Jake Gordon detailed here.

You can select a bot and it will play Breakout, recording wins and losses as it goes. You can speed the game up to see how the bots do over time, or slow it down to watch the bots react to the state of the game.

The bots include:

  1. A hand-written bot whose strategy is to stay under the ball at all times.
  2. Our initial learner with a random strategy.
  3. Three learners at various stages of learning (after 5,000,000, 10,000,000, and 15,000,000 moves).