
I have a DQN agent that learns (the loss converges to 0), but unfortunately it learns a Q-value function in which the Q-values for the two possible actions end up very similar. It is also worth noting that the Q-values change very little from one observation to the next.

Details:

  • The agent plays CartPole-v1 from OpenAI Gym, but uses the screen pixels as the observation rather than the 4 state values the environment provides

  • The reward function I use gives a reward of 0.1 on every step that is not game over and -1 on game over

  • The discount factor (gamma) is 0.95

  • Epsilon is 1 for the first 3200 actions (to populate some of the replay memory) and is then annealed over 100,000 steps down to 0.01

  • The replay memory has a capacity of 10,000 transitions

  • The architecture of the conv net is as follows (a Keras sketch is given after this list):

    • input layer of size screen_pixels
    • conv layer 1: 32 filters, kernel (8,8), stride (4,4), ReLU activation, 'same' padding
    • conv layer 2: 64 filters, kernel (4,4), stride (2,2), ReLU activation, 'same' padding
    • conv layer 3: 64 filters, kernel (3,3), stride (1,1), ReLU activation, 'same' padding
    • a flatten layer (to reshape the conv output so it can feed into a fully connected layer)
    • a fully connected layer with 512 nodes and ReLU activation
    • an output fully connected layer with 2 nodes (the action space)
  • The learning rate of the convolutional neural network is 0.0001
  • The code is written in Keras and uses experience replay and double deep Q-learning
  • The original image is reduced from (400, 600, 3) to (60, 84, 4) by greyscaling, resizing, cropping and then stacking 4 consecutive frames before feeding the result to the conv net
  • The target network is updated every 2 online network updates.
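
For reference, here is a minimal Keras sketch of the network described in the bullets above. The layer sizes, strides, padding and learning rate come from the question; the optimizer and loss choice (Adam with MSE) and the target-network sync at the end are illustrative assumptions, not necessarily the asker's actual code.

```python
# Minimal sketch of the described conv net (Adam + MSE are assumptions).
from tensorflow.keras import layers, models, optimizers

def build_q_network(input_shape=(60, 84, 4), n_actions=2, learning_rate=0.0001):
    model = models.Sequential([
        # three conv layers with 'same' padding and ReLU, as listed above
        layers.Conv2D(32, (8, 8), strides=(4, 4), padding="same", activation="relu",
                      input_shape=input_shape),
        layers.Conv2D(64, (4, 4), strides=(2, 2), padding="same", activation="relu"),
        layers.Conv2D(64, (3, 3), strides=(1, 1), padding="same", activation="relu"),
        layers.Flatten(),
        layers.Dense(512, activation="relu"),
        layers.Dense(n_actions),          # linear output: one Q-value per action
    ])
    model.compile(optimizer=optimizers.Adam(learning_rate=learning_rate), loss="mse")
    return model

online_net = build_q_network()
target_net = build_q_network()
target_net.set_weights(online_net.get_weights())  # synced every 2 online updates
```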

1 Answer


Providing a positive reward of 0.1 on every step as long as the game is not over may make the -1 game-over punishment almost irrelevant, particularly with the discount factor you are using.
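
As a rough back-of-the-envelope check (my own numbers, using the 0.1/-1 scheme and gamma = 0.95 from the question), the discounted return of an episode is dominated by the stream of +0.1 rewards, so the terminal -1 barely separates good behaviour from bad:

```python
# Compare discounted returns under the question's reward scheme (gamma = 0.95).
gamma = 0.95

def discounted_return(n_steps, step_reward=0.1, terminal_reward=-1.0):
    # +0.1 for n_steps survived steps, then -1 when the pole falls
    survive = sum(step_reward * gamma ** k for k in range(n_steps))
    return survive + terminal_reward * gamma ** n_steps

print(discounted_return(10))   # ~0.20 -> failing after only 10 steps still looks positive
print(discounted_return(50))   # ~1.77 -> the -1 at the end is almost invisible
print(0.1 / (1 - gamma))       # 2.0   -> infinite-horizon value of just surviving
```

Because every state's value sits in that same positive band, the Q-values of the two actions naturally end up very similar, which matches what you are observing.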

It is difficult to judge without looking at your source code, but I would initially suggest providing only a negative reward at the end of the episode and removing the positive per-step rewards.
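
A minimal sketch of that reward scheme inside the interaction loop, using the classic Gym API (the random action and plain-list replay buffer are stand-ins for your epsilon-greedy policy and replay memory, and the pixel preprocessing / frame stacking is omitted for brevity):

```python
# Sketch of the suggested scheme: 0 per step, -1 only on the terminal transition.
import gym

env = gym.make("CartPole-v1")
replay_memory = []                                # stand-in for the real replay buffer

state = env.reset()
done = False
while not done:
    action = env.action_space.sample()            # stand-in for the epsilon-greedy policy
    next_state, _, done, _ = env.step(action)     # ignore the environment's built-in +1 reward
    reward = -1.0 if done else 0.0                # punish only the game-over transition
    replay_memory.append((state, action, reward, next_state, done))
    state = next_state
```

With this scheme the Q-values stay at or below 0, and the learning signal becomes how soon (in discounted terms) the episode is expected to end, which gives the two actions much more reason to differ.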