I have a DQN agent that learns (the loss converges to 0), but unfortunately it learns a Q-value function in which the Q values for the 2 possible actions are very similar. It is also worth noting that the Q values change very little from one observation to the next.
Details:
The agent plays CartPole-v1 from OpenAI Gym, but uses the screen pixels as the observation rather than the 4 state values the environment provides.
The reward function I use gives a reward of 0.1 while the episode is running and -1 on game over (sketched, together with the epsilon schedule, just after these details).
The discount factor (gamma) is 0.95.
Epsilon is 1 for the first 3,200 actions (to populate some of the replay memory) and is then annealed over 100,000 steps down to 0.01.
The replay memory has a capacity of 10,000.
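For reference, here is a minimal sketch of the reward shaping and the epsilon schedule as described above. Linear annealing is my reading of "annealed over 100,000 steps", and the function names are placeholders, not my actual code:

```python
# Sketch of the reward shaping and epsilon schedule described above.
# Linear annealing is assumed; names are placeholders.

EPS_START, EPS_END = 1.0, 0.01
WARMUP_STEPS = 3200        # pure exploration while the replay memory fills
ANNEAL_STEPS = 100_000     # steps over which epsilon decays to EPS_END

def shaped_reward(done):
    """0.1 for every surviving step, -1 when the episode terminates."""
    return -1.0 if done else 0.1

def epsilon(step):
    """Epsilon-greedy schedule: 1.0 during warm-up, then annealed to 0.01."""
    if step < WARMUP_STEPS:
        return EPS_START
    frac = min(1.0, (step - WARMUP_STEPS) / ANNEAL_STEPS)
    return EPS_START + frac * (EPS_END - EPS_START)
```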
The architecture of the conv net (sketched in Keras after this list) is:
- input layer of size screen_pixels
- conv layer 1 with 32 filters, kernel (8,8), stride (4,4), ReLU activation and 'same' padding
- conv layer 2 with 64 filters, kernel (4,4), stride (2,2), ReLU activation and 'same' padding
- conv layer 3 with 64 filters, kernel (3,3), stride (1,1), ReLU activation and 'same' padding
- a flatten layer (to reshape the conv output so it can be fed into a fully connected layer)
- a fully connected layer with 512 nodes and ReLU activation
- An output fully connected layer with 2 nodes (the action space)
- The learning rate of the convolutional neural network is 0.0001
- The code is written in Keras and uses experience replay and double deep Q-learning (a sketch of the double-DQN update also follows this list)
- The original frame is reduced from (400, 600, 3) to (60, 84, 4) by greyscaling, resizing, cropping and then stacking 4 frames together before it is fed to the conv net (see the preprocessing sketch below)
- The target network is updated every 2 online network updates.
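A Keras sketch of the network described above. Only the layer sizes, strides, padding and learning rate come from the list; the Adam optimizer and MSE loss are assumptions on my part:

```python
# Sketch of the conv net described above. Input shape (60, 84, 4) comes from
# the preprocessing step; the optimizer and loss are assumptions.
from tensorflow.keras import layers, models, optimizers

def build_q_network(input_shape=(60, 84, 4), n_actions=2, lr=1e-4):
    model = models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(32, (8, 8), strides=(4, 4), padding="same", activation="relu"),
        layers.Conv2D(64, (4, 4), strides=(2, 2), padding="same", activation="relu"),
        layers.Conv2D(64, (3, 3), strides=(1, 1), padding="same", activation="relu"),
        layers.Flatten(),
        layers.Dense(512, activation="relu"),
        layers.Dense(n_actions),                 # linear Q-value outputs, one per action
    ])
    model.compile(optimizer=optimizers.Adam(learning_rate=lr), loss="mse")  # loss assumed
    return model
```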
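The preprocessing pipeline, roughly as I do it (greyscale, resize, crop, stack 4 frames). The intermediate resize size and the crop region here are only illustrative; the final (60, 84, 4) shape is the part taken from the description:

```python
# Sketch of the frame preprocessing: greyscale -> resize -> crop -> stack.
# Intermediate sizes and the crop region are guesses; only (60, 84, 4) is fixed.
from collections import deque
import numpy as np
import cv2

def preprocess(frame):
    """(400, 600, 3) RGB frame -> (60, 84) greyscale crop, scaled to [0, 1]."""
    grey = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)
    resized = cv2.resize(grey, (84, 110))          # cv2.resize takes (width, height)
    cropped = resized[25:85, :]                    # hypothetical crop down to 60 rows
    return cropped.astype(np.float32) / 255.0

frames = deque(maxlen=4)

def stacked_observation(frame):
    """Stack the last 4 preprocessed frames into a (60, 84, 4) observation."""
    frames.append(preprocess(frame))
    while len(frames) < 4:                         # pad at the start of an episode
        frames.append(frames[-1])
    return np.stack(frames, axis=-1)
```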
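And a sketch of the double-DQN update with the target network copied every 2 online updates, as I understand my own setup. `online` and `target` are Keras models like the one above; the batch arrays (integer `actions`, 0/1 float `dones`) come from the replay memory, and the variable names are placeholders:

```python
# Sketch of one double-DQN training step. Gamma and the target-update period
# match the description above; everything else is illustrative.
import numpy as np

GAMMA = 0.95
TARGET_UPDATE_EVERY = 2

def train_step(online, target, batch, step):
    states, actions, rewards, next_states, dones = batch
    # Double DQN: the online net selects the greedy next action,
    # the target net evaluates it.
    next_actions = np.argmax(online.predict(next_states, verbose=0), axis=1)
    next_q = target.predict(next_states, verbose=0)
    bootstrap = next_q[np.arange(len(actions)), next_actions]
    # Build targets: only the Q value of the action actually taken is changed.
    targets = online.predict(states, verbose=0)
    targets[np.arange(len(actions)), actions] = rewards + GAMMA * bootstrap * (1.0 - dones)
    online.train_on_batch(states, targets)
    # Copy the online weights into the target network every 2 online updates.
    if step % TARGET_UPDATE_EVERY == 0:
        target.set_weights(online.get_weights())
```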