I'm working on my bachelor thesis.

My topic is reinforcement learning. The Setup:

  • Unity3D (C#)
  • Own neural network framework

I confirmed that the network works by training it on a sine function. It can approximate it reasonably well: a few values don't reach their targets, but it's good enough. When trained on single values, it always converges.

Here is my problem:

I'm trying to teach my network the Q-value function of a simple ball-catching game. The network just has to catch a ball dropping from a random position at a random angle. The reward is +1 for a catch and -1 for a miss.

My network model has 1 hidden layer with 45 to 180 neurons (I tested several sizes in this range with no success).

It uses replay with 32 samples from a 100k memory and a learning rate of 0.0001. It learns for 50000 frames, then tests for 10000 frames. This happens 10 times. The inputs are PlatformPosX, BallPosX and BallPosY from the last 4 frames.
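For illustration, a replay memory of this kind could be sketched as follows. It is written in Python for brevity rather than the C# used in the project. The class name `ReplayMemory`, its method names, and the stored `next_state` field are assumptions made for the sketch, not the actual implementation.

```python
import random


class ReplayMemory:
    """Fixed-size ring buffer of transitions (hypothetical helper)."""

    def __init__(self, capacity=100_000):
        self.capacity = capacity
        self.buffer = []
        self.position = 0  # index of the slot that will be written next

    def store(self, state, action, reward, next_state, done):
        transition = (state, action, reward, next_state, done)
        if len(self.buffer) < self.capacity:
            self.buffer.append(transition)
        else:
            # Memory is full: overwrite the oldest transition.
            self.buffer[self.position] = transition
        self.position = (self.position + 1) % self.capacity

    def sample(self, batch_size=32):
        # Uniform sampling without replacement; returns fewer items
        # while the memory holds less than batch_size transitions.
        return random.sample(self.buffer, min(batch_size, len(self.buffer)))

    def __len__(self):
        return len(self.buffer)
```

Sampling uniformly from a large memory is meant to break the correlation between consecutive frames, which is why the batch is drawn at random instead of taking the 32 most recent transitions.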

Pseudocode:

  • Choose action (e-greedy)

  • Do action

  • Store state, action, CurrentReward, Done in memory

  • If in learning phase: replay
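The first step of the loop above, e-greedy action selection, might look like this minimal Python sketch. The function name `choose_action` and the assumption that the network returns one Q-value per discrete action are illustrative only.

```python
import random


def choose_action(q_values, epsilon):
    """Pick a random action with probability epsilon, else the greedy one.

    q_values: list with one estimated Q-value per discrete action
              (assumed network output format).
    """
    if random.random() < epsilon:
        return random.randrange(len(q_values))  # explore
    # Exploit: index of the largest Q-value (first one on ties).
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

During the test phase, epsilon would typically be set to 0 (or a very small value) so that the learned policy itself is evaluated.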

My problem is:

Its actions start clipping to either 0 or 1, sometimes with some variance. It never reaches an ideal policy, such as the platform simply following the ball.

EDIT: Sorry for the sparse info... My Q-function is trained towards the target Reward + Gamma * nextEstimated_Reward, so it's discounting.
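As a sketch of how such a target is usually computed in Q-learning (Python, with hypothetical names): it assumes that nextEstimated_Reward is the maximum Q-value over the actions in the next state, and that the stored Done flag switches off bootstrapping at the end of an episode. Neither detail is confirmed by the description above.

```python
def td_target(reward, next_q_values, done, gamma=0.99):
    """Training target for the Q-value of the action that was taken.

    next_q_values: estimated Q-values of all actions in the next state.
    done:          True if the episode ended with this transition.
    gamma:         discount factor (0.99 is an arbitrary example value).
    """
    if done:
        # Terminal transition: there is no future reward to bootstrap from.
        return reward
    return reward + gamma * max(next_q_values)
```

Only the output for the action actually taken is trained towards this target; the outputs for the other actions are left unchanged for that sample.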

1 Answer

Why would you possibly expect that to work?

Your training can barely approximate a 1-dimensional function, and now you expect it to solve a 12-dimensional function which involves a differential equation? You should first have verified whether your training converges at all for a multi-dimensional function with the chosen training parameters.

Your training, given the little detail you provided, also appears to be unsuitable. There is hardly a chance it ever successfully catches the ball, and even when it does, you are rewarding it mostly for random outputs. The only correlation between input and output is in the last few frames, when the pad can only reach the target in time through a limited set of possible actions.

Then there is the choice of inputs. Don't require your model to differentiate by itself. Relevant inputs would have been x, y, dx, dy, preferably with x and y relative to the pad position rather than in world coordinates. That should have a much better chance of converging, even if it only learned to keep x minimal.
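A rough sketch of that input transformation, in Python for brevity: the function name `make_features`, the parameter names, and the choice of a finite difference over two consecutive frames for dx and dy are assumptions for illustration.

```python
def make_features(platform_x, ball_prev, ball_curr, platform_y=0.0, dt=1.0):
    """Build [x, y, dx, dy] with x and y relative to the pad.

    ball_prev, ball_curr: (x, y) ball positions in two consecutive frames.
    platform_y:           pad height, assumed constant (0.0 by default).
    dt:                   time between the two frames (1.0 = per frame).
    """
    (x0, y0), (x1, y1) = ball_prev, ball_curr
    dx = (x1 - x0) / dt  # horizontal velocity from the frame difference
    dy = (y1 - y0) / dt  # vertical velocity from the frame difference
    rel_x = x1 - platform_x  # ball position relative to the pad
    rel_y = y1 - platform_y
    return [rel_x, rel_y, dx, dy]


# Pad at x=2, ball moved from (4, 10) to (5, 8) within one frame.
features = make_features(2.0, (4.0, 10.0), (5.0, 8.0))
```

This reduces the 12 raw inputs to 4, and the network no longer has to work out the velocity from the frame history on its own.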

Working with absolute world coordinates is pretty much bound to fail, as it would require the training to cover the entire range of possible input combinations, and the network to be big enough to store all of those combinations. Be aware that the network isn't learning the actual function; it's learning an approximation for every single possible set of inputs. Even if the ideal solution is actually just a linear equation, the non-linear properties of the activation function make it impossible to learn it in a generalized form for unbounded inputs.