fredag 21 juli 2017

Solving OpenAI Gym's CartPole-v0 environment

This is a cross-post of the write-up of my submission for OpenAI's CartPole problem (with some corrections and clarifications).

Introduction and background:

I'm a novice when it comes to Reinforcement Learning but it's a subject I've become obsessed with lately. I've taken a couple of courses and done some machine learning experiments before (more specifically, a supervised learned bot for Codingame's Coders Strike Back), but on reinforcement learning, pretty much all of which I've read is Andrej Karpathy’s Pong from pixels (which is also how I found the AI Gym, incidentally).

I did not use any framework for the neural network, it is based on my own rudimentary architecture for supervised learning.

As for environment, I used deeplearning4j's gym-java-client to be able to code in Java. I should also mention I had some issues with crashes in the simulator. This is why the evaluation is so short and probably why it converges so early (it forced me to optimize so it would succeed before crashing).
You can see the evaluation and write-up here: neph1's evaluation


I trained a logistic neural network with two hidden layers. Input nodes were the observations of the last 5 steps.
I believe I'm using a gradient policy approach.

Neural Network Layout:

The size of the network had to be carefully designed. Deviating from the described network by only one or two nodes can cause severe degradation in performance. See “Alternatives and Red Herrings” for information about other designs I tried.
20 input nodes (observations from 5 latest steps)
26 hidden nodes in first layer
14 hidden nodes in second layer
2 output nodes
All nodes are logistic. Layers are fully interconnected

Initial Weights:

All weights are randomly initialized to a value between -1.0 and 1.0.

Learning Rate:

Learning rate matters, but it is also quite forgiving. Anything between 0.15 and 0.7 works. But it seems keeping it around 0.25 is slightly better. This evaluation used 0.28 I believe.


The input each step is divided  by the (observed) bounds of that range. The exception is value 2 and 4 where I found that blowing these up (dividing by 0.05) led to a significant performance in learning. I expect this is due to enforcing a low velocity. Those values are then also capped to between -5 and 5. I believe that by blowing up the values I do something that RL purists might not be OK with, since I essentially tell the algorithm which values to focus on.


I tried a number of different scoring methods.
What worked best in the end was:
Compare position of each observed input with the one from the previous.
If better (closer to zero), set gradients to positive (1.0), otherwise negative (-1.0). I’ve tried more variable gradients, like using the score directly, but it didn’t work as well.

 float score = 0;   
   for (int i = 0; i < 4; i++) {   
    float delta = (Math.abs(observation[i + 4]) - Math.abs(observation[i]));   
    score += delta;   
   if (score > 0) {   
    score = 1f;   
   } else {   
    score = -1f;   
   float[] scores = lastAction == 0 ? TEMP_SCORE_LEFT : TEMP_SCORE_RIGHT;   
   decisions.add(new float[][]{lastInput, scores});   
   updateScore(3, score);   

Then also update the previous two inputs with a diminishing value in the updateScore method. The factor seems to be forgiving, 0.55 to 0.95 have been tried, where 0.8 seems to be a good level).

 private void updateScore(int turnsAgo, float scoreChange) {  
   int size = decisions.size();  
   int min = Math.min(size, turnsAgo);  
   for (int i = 0; i < min; i++) {  
     float[] scores = decisions.get(size - i - 1)[1];  
     // increase scores by scoreChange  
     scoreChange *= 0.8f;  


When it came to training and using the training data I tried two different models.
1. Keep neural network between episodes. Train using only the data from the latest episode.
2. Reinitialize network each episode (keep initial weights). Keep all training data and train on whole set (all episodes).
In both cases the training data is shuffled before use and the back propagated.
Apart from the obvious benefit of less total training time on model 1, model 2 also led to much better learning performance.

Soft Max:

For a long time I used Soft Max to determine the action. Once the network improved, it became obvious that a deterministic approach led to better outcome (just use whichever output is higher).


To observe performance I printed out the following information after each episode:
Episode number
Score for episode
Average over all episodes so far
Rolling average (this differed from between 10 and 100 episodes)
Highest value so far (less interesting once reaching highest score was frequent. Still interesting to see when it reaches 200 the first time).
Total reward

Alternatives and red herrings:

Possible to train with 6 time steps (but not at same level of performance)
24 input nodes
32 hidden nodes in first layer
16 hidden nodes in second layer
2 output nodes

One scoring method I tried was to compare relative movement with the previous step. The idea was that even if the model made a good move, if the velocity in the opposite direction was too high, it wouldn't get noticed when just watching the position of the rod and cradle. So, by comparing delta v of the previous move, I thought it might be possible to spot those things. But either my theory was wrong, or I didn't manage to find good values for the scoring.

Another thing I experimented with was comparing the current position(s) with that more than one step back. Again, I was not able to find any good way of factoring this information into the scoring. So instead I placed my bets on the network figuring this out for itself.

At one point when I was thinking about how to improve the learning, I thought that more tranining examples would be a good thing (it's one of the fundamental rules in supervised learning, after all). Since the observation space is symmetric (or so I thought), I should be able to mirror all observations and scorings and get twice as many training examples. Awesome! Right? It turned out that wasn't the case. I was not able to use the mirrored data. I still think this would be worth exploring further.

Further improvements:

I still think mirroring the training examples should be possible. Doubling the training data should improve it significantly.

Given the design of the network, it seems a Recurrent Neural Network (RNN) would be a good alternative.