
I am trying to implement an Actor-Critic learning algorithm that is not the same as the basic actor-critic algorithm; it is slightly modified.

Anyway, I used the Adam optimizer and implemented it with PyTorch.

When I backward the TD error for the Critic first, there is no error. However, when I backward the loss for the Actor, the following error occurs:

    ---------------------------------------------------------------------------
    RuntimeError                              Traceback (most recent call last)
    <ipython-input> in <module>
         46         # update Actor Func
         47         optimizer_M.zero_grad()
    ---> 48         loss.backward()
         49         optimizer_M.step()
         50

    ~\Anaconda3\lib\site-packages\torch\tensor.py in backward(self, gradient, retain_graph, create_graph)
        100                 products. Defaults to False.
        101         """
    --> 102         torch.autograd.backward(self, gradient, retain_graph, create_graph)
        103
        104     def register_hook(self, hook):

    ~\Anaconda3\lib\site-packages\torch\autograd\__init__.py in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables)
         88     Variable._execution_engine.run_backward(
         89         tensors, grad_tensors, retain_graph, create_graph,
    ---> 90         allow_unreachable=True)  # allow_unreachable flag
         91
         92

    RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation

The above is the error output.

I tried to find the in-place operation, but I could not find one in my own code. I think I don't know how to handle the optimizer correctly.

Here is the main code:

    for cur_step in range(1):
        action = M_Agent(state, flag)
        next_state, r = env.step(action)

        # calculate TD Error
        TD_error = M_Agent.cal_td_error(r, next_state)

        # calculate Target
        target = torch.FloatTensor([M_Agent.cal_target(TD_error)])
        logit = M_Agent.cal_logit()
        loss = criterion(logit, target)

        # update value Func
        optimizer_M.zero_grad()
        TD_error.backward()
        optimizer_M.step()

        # update Actor Func
        optimizer_M.zero_grad()
        loss.backward()
        optimizer_M.step()

Here is the agent network:

    # Actor-Critic Agent
    self.act_pipe = nn.Sequential(nn.Linear(state, 128),
                                  nn.ReLU(),
                                  nn.Dropout(0.5),
                                  nn.Linear(128, 256),
                                  nn.ReLU(),
                                  nn.Dropout(0.5),
                                  nn.Linear(256, num_action),
                                  nn.Softmax()
                                  )

    self.val_pipe = nn.Sequential(nn.Linear(state, 128),
                                  nn.ReLU(),
                                  nn.Dropout(0.5),
                                  nn.Linear(128, 256),
                                  nn.ReLU(),
                                  nn.Dropout(0.5),
                                  nn.Linear(256, 1)
                                  )

    def forward(self, state, flag, test=None):

        temp_action_prob = self.act_pipe(state)
        self.action_prob = self.cal_prob(temp_action_prob, flag)
        self.action = self.get_action(self.action_prob)
        self.value = self.val_pipe(state)

        return self.action

I want to update each network separately.

Also, I want to know: does the basic TD actor-critic method use the TD error itself as the loss, or the squared error between r + V(s') and V(s)?


1 Answer


I think the problem is that you zero the gradients right before calling backward, after the forward propagation. Note that automatic differentiation needs the computation graph and the intermediate results produced during the forward pass.

So zero the gradients before your TD error and target calculations, not after you have finished your forward propagation:

    for cur_step in range(1):
        action = M_Agent(state, flag)
        next_state, r = env.step(action)

        optimizer_M.zero_grad()  # zero your gradients here

        # calculate TD Error
        TD_error = M_Agent.cal_td_error(r, next_state)

        # calculate Target
        target = torch.FloatTensor([M_Agent.cal_target(TD_error)])
        logit = M_Agent.cal_logit()
        loss = criterion(logit, target)

        # update value Func
        TD_error.backward()
        optimizer_M.step()

        # update Actor Func
        loss.backward()
        optimizer_M.step()

To answer your second question: the DDPG algorithm, for example, uses the squared error (see the paper).
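More generally, the common one-step TD actor-critic trains the critic on the squared TD error and the actor on the TD error times the negative log-probability of the chosen action. Here is a minimal sketch of those losses, assuming a value_net that outputs V(s), a policy_net that outputs action probabilities, and an integer action index; all of these names are illustrative, not your API:

    import torch

    def actor_critic_losses(policy_net, value_net, state, action, reward, next_state, gamma=0.99):
        value = value_net(state)                                 # V(s), keeps its graph
        with torch.no_grad():
            next_value = value_net(next_state)                   # V(s'), treated as a constant target
        td_error = reward + gamma * next_value - value           # delta = r + gamma * V(s') - V(s)

        critic_loss = td_error.pow(2).mean()                     # squared TD error for the critic

        log_prob = torch.log(policy_net(state).gather(-1, action))
        actor_loss = -(td_error.detach() * log_prob).mean()      # policy gradient weighted by delta

        return critic_loss, actor_loss

Note that the TD error is detached in the actor loss, so the policy update treats it as a fixed weight.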

One more recommendation: in deep actor-critic agents, large parts of the value and policy networks are often shared. You keep the same layers up to the last hidden layer and attach a single linear output for the value prediction and a softmax layer for the action distribution. This is especially useful with high-dimensional visual inputs, since it acts as a kind of multi-task learning, but you can try it nevertheless (as I see, you have a low-dimensional state vector). A rough sketch follows.
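For example, a shared-trunk version of your agent could look roughly like this. It is only an illustration for a discrete action space; the class name and constructor arguments are made up, not taken from your code.

    import torch.nn as nn

    # Sketch of a shared-trunk actor-critic (illustrative names only).
    class SharedActorCritic(nn.Module):
        def __init__(self, state_dim, num_action):
            super().__init__()
            # layers shared by actor and critic, up to the last hidden layer
            self.trunk = nn.Sequential(
                nn.Linear(state_dim, 128), nn.ReLU(),
                nn.Linear(128, 256), nn.ReLU(),
            )
            self.policy_head = nn.Sequential(nn.Linear(256, num_action),
                                             nn.Softmax(dim=-1))
            self.value_head = nn.Linear(256, 1)

        def forward(self, state):
            h = self.trunk(state)
            return self.policy_head(h), self.value_head(h)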