I am trying to implement Actor-Critic learning atuomation algorithm that is not same as basic actor-critic algorithm, it's little bit changed.
Anyway, I used Adam optimizer and implemented with pytorch
when i backward TD-error for Critic first, there's no error. However, i backward loss for Actor, the error occured.
--------------------------------------------------------------------------- RuntimeError Traceback (most recent call last) in 46 # update Actor Func 47 optimizer_M.zero_grad() ---> 48 loss.backward() 49 optimizer_M.step() 50
~\Anaconda3\lib\site-packages\torch\tensor.py in backward(self, gradient, retain_graph, create_graph) 100 products. Defaults to
False
. 101 """ --> 102 torch.autograd.backward(self, gradient, retain_graph, create_graph) 103 104 def register_hook(self, hook):~\Anaconda3\lib\site-packages\torch\autograd__init__.py in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables) 88 Variable._execution_engine.run_backward( 89 tensors, grad_tensors, retain_graph, create_graph, ---> 90 allow_unreachable=True) # allow_unreachable flag 91 92
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation
above is the content of error
I tried to find inplace operation, but I haven't found in my written code. I think i don't know how to handle optimizer.
Here is main code:
for cur_step in range(1):
action = M_Agent(state, flag)
next_state, r = env.step(action)
# calculate TD Error
TD_error = M_Agent.cal_td_error(r, next_state)
# calculate Target
target = torch.FloatTensor([M_Agent.cal_target(TD_error)])
logit = M_Agent.cal_logit()
loss = criterion(logit, target)
# update value Func
optimizer_M.zero_grad()
TD_error.backward()
optimizer_M.step()
# update Actor Func
loss.backward()
optimizer_M.step()
Here is the agent network
# Actor-Critic Agent
self.act_pipe = nn.Sequential(nn.Linear(state, 128),
nn.ReLU(),
nn.Dropout(0.5),
nn.Linear(128, 256),
nn.ReLU(),
nn.Dropout(0.5),
nn.Linear(256, num_action),
nn.Softmax()
)
self.val_pipe = nn.Sequential(nn.Linear(state, 128),
nn.ReLU(),
nn.Dropout(0.5),
nn.Linear(128, 256),
nn.ReLU(),
nn.Dropout(0.5),
nn.Linear(256, 1)
)
def forward(self, state, flag, test=None):
temp_action_prob = self.act_pipe(state)
self.action_prob = self.cal_prob(temp_action_prob, flag)
self.action = self.get_action(self.action_prob)
self.value = self.val_pipe(state)
return self.action
I wanna update each network respectively.
and I wanna know that Basic TD Actor-Critic method uses TD error for loss?? or squared error between r+V(s') and V(s) ?