
I have a DQN agent that learns (the loss converges to 0), but unfortunately it learns a Q-value function in which the Q-values for the two possible actions end up very similar. It is also worth noting that the Q-values change very little from one observation to the next.

Details:

  • The agent plays CartPole-v1 from OpenAI Gym, but uses the screen pixels as the observation rather than the 4 state values the environment provides

  • The reward function I use gives a reward of 0.1 on every step that is not game over and -1 on game over

  • The discount factor (gamma) is 0.95

  • Epsilon is 1 for the first 3200 actions (to populate some of the replay memory) and is then annealed over 100,000 steps down to 0.01

  • The replay memory has a capacity of 10,000 transitions

  • The architecture of the conv net is as follows (a Keras sketch is given after this list):

    • input layer of size screen_pixels
    • conv layer 1: 32 filters, kernel (8,8), stride (4,4), ReLU activation, 'same' padding
    • conv layer 2: 64 filters, kernel (4,4), stride (2,2), ReLU activation, 'same' padding
    • conv layer 3: 64 filters, kernel (3,3), stride (1,1), ReLU activation, 'same' padding
    • a flatten layer (to reshape the conv output so it can feed into a fully connected layer)
    • a fully connected layer with 512 nodes and ReLU activation
    • an output fully connected layer with 2 nodes (the action space)
  • The learning rate of the convolutional neural network is 0.0001
  • The code is written in Keras and uses experience replay and double deep Q-learning
  • The original image is reduced from (400, 600, 3) to (60, 84, 4) by greyscaling, resizing, cropping and then stacking 4 consecutive frames before feeding the result to the conv net
  • The target network is updated every 2 online network updates.
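
For reference, here is a minimal Keras sketch of the network described in the bullets above. The layer sizes, strides, padding and learning rate come from the question; the optimizer and loss choice (Adam with MSE) and the target-network sync at the end are illustrative assumptions, not necessarily the asker's actual code.

```python
# Minimal sketch of the described conv net (Adam + MSE are assumptions).
from tensorflow.keras import layers, models, optimizers

def build_q_network(input_shape=(60, 84, 4), n_actions=2, learning_rate=0.0001):
    model = models.Sequential([
        # three conv layers with 'same' padding and ReLU, as listed above
        layers.Conv2D(32, (8, 8), strides=(4, 4), padding="same", activation="relu",
                      input_shape=input_shape),
        layers.Conv2D(64, (4, 4), strides=(2, 2), padding="same", activation="relu"),
        layers.Conv2D(64, (3, 3), strides=(1, 1), padding="same", activation="relu"),
        layers.Flatten(),
        layers.Dense(512, activation="relu"),
        layers.Dense(n_actions),          # linear output: one Q-value per action
    ])
    model.compile(optimizer=optimizers.Adam(learning_rate=learning_rate), loss="mse")
    return model

online_net = build_q_network()
target_net = build_q_network()
target_net.set_weights(online_net.get_weights())  # synced every 2 online updates
```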

1 Answer


Providing a positive reward of 0.1 on every step as long as the game is not over may make the -1 game-over punishment almost irrelevant, particularly with the discount factor you are using.
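
As a rough back-of-the-envelope check (my own numbers, using the 0.1/-1 scheme and gamma = 0.95 from the question), the discounted return of an episode is dominated by the stream of +0.1 rewards, so the terminal -1 barely separates good behaviour from bad:

```python
# Compare discounted returns under the question's reward scheme (gamma = 0.95).
gamma = 0.95

def discounted_return(n_steps, step_reward=0.1, terminal_reward=-1.0):
    # +0.1 for n_steps survived steps, then -1 when the pole falls
    survive = sum(step_reward * gamma ** k for k in range(n_steps))
    return survive + terminal_reward * gamma ** n_steps

print(discounted_return(10))   # ~0.20 -> failing after only 10 steps still looks positive
print(discounted_return(50))   # ~1.77 -> the -1 at the end is almost invisible
print(0.1 / (1 - gamma))       # 2.0   -> infinite-horizon value of just surviving
```

Because every state's value sits in that same positive band, the Q-values of the two actions naturally end up very similar, which matches what you are observing.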

It is difficult to judge without looking at your source code, but I would initially suggest providing only a negative reward at the end of the episode and removing the positive per-step rewards.
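
A minimal sketch of that reward scheme inside the interaction loop, using the classic Gym API (the random action and plain-list replay buffer are stand-ins for your epsilon-greedy policy and replay memory, and the pixel preprocessing / frame stacking is omitted for brevity):

```python
# Sketch of the suggested scheme: 0 per step, -1 only on the terminal transition.
import gym

env = gym.make("CartPole-v1")
replay_memory = []                                # stand-in for the real replay buffer

state = env.reset()
done = False
while not done:
    action = env.action_space.sample()            # stand-in for the epsilon-greedy policy
    next_state, _, done, _ = env.step(action)     # ignore the environment's built-in +1 reward
    reward = -1.0 if done else 0.0                # punish only the game-over transition
    replay_memory.append((state, action, reward, next_state, done))
    state = next_state
```

With this scheme the Q-values stay at or below 0, and the learning signal becomes how soon (in discounted terms) the episode is expected to end, which gives the two actions much more reason to differ.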