
Reinforcement learning basics: study notes

2022-07-06 13:52:00 I like the strengthened Xiaobai in Curie

   Reinforcement learning algorithms fall roughly into three categories: value-based, policy-based, and Actor-Critic, which combines the two. Here is a brief account of what I recently learned about AC.

I. Introduction to the environment

   The environment used here is gym's 'CartPole-v1'. It is almost identical to the 'CartPole-v0' environment from the previous article. The main differences are the maximum number of steps per episode (200 for v0, 500 for v1) and the reward threshold (195 for v0, 475 for v1), as shown in the figure below.

 [Figure: differences between CartPole-v0 and CartPole-v1]
   In this article I want to try an on-policy algorithm, so the maximum number of steps in a single episode is limited to 100.

   A detailed introduction to the 'CartPole-v0' environment is linked below.
   Link: OpenAI Gym classic control environments: CartPole (inverted pendulum)

II. A brief introduction to the algorithm

  1. Actor-Critic
       The algorithm has two components: the policy-related Actor network and the value-related Critic network. Because a stochastic policy is used here, the Actor network normalizes its outputs into action probabilities with a softmax function, while the Critic network estimates the state value V. In addition, the advantage function from A2C is used here.
     [Figure: Actor-Critic]
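
The two ingredients described above can be sketched in plain Python, without torch. This is only an illustration: the helper names `softmax` and `td_advantage` are hypothetical and do not appear in the listing below, and the advantage is approximated by the one-step TD error, which is one common choice rather than the only one.

```python
import math

def softmax(logits):
    """Turn raw action scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def td_advantage(r, v_s, v_next, done, gamma=0.98):
    """One-step advantage estimate: A = r + gamma * V(s') - V(s).

    The bootstrap term gamma * V(s') is dropped when the episode ended at s'.
    """
    target = r + (0.0 if done else gamma * v_next)
    return target - v_s

probs = softmax([1.0, 1.0])  # equal scores give equal probabilities
adv = td_advantage(r=1.0, v_s=0.5, v_next=1.0, done=False, gamma=0.5)
print(probs, adv)
```

A positive advantage means the action turned out better than the Critic expected, so the Actor raises its probability; a negative one lowers it.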
  2. On-Policy
       An on-policy algorithm is used here. Each episode runs for at most 100 steps and therefore produces up to 100 transitions. Once these transitions are stored, learning begins: the samples are used directly for one update and then cleared, so that the next episode collects fresh samples.
     [Figure: On-Policy]
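
The collect, learn, discard cycle described above might look like the following minimal sketch. The class name `RolloutBuffer` and its methods are hypothetical, and the transitions are placeholders rather than real environment data.

```python
class RolloutBuffer:
    """Stores transitions from the current policy and hands them out once."""

    def __init__(self):
        self.data = []

    def put(self, transition):
        self.data.append(transition)

    def drain(self):
        """Return all stored transitions and empty the buffer."""
        batch, self.data = self.data, []
        return batch

buffer = RolloutBuffer()
for t in range(100):
    # Placeholder transition: (state, action, reward, next_state, done)
    buffer.put((t, 0, 1.0, t + 1, False))

batch = buffer.drain()  # used for exactly one update
print(len(batch), len(buffer.data))
```

Emptying the buffer after each update is what keeps the method on-policy: every sample used for learning comes from the policy that is currently being improved.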
  3. AC (A2C) pseudocode:
     [Figure: AC (A2C) pseudocode]
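
For reference, the quantities computed in this post's `train_net` method can be written as follows. This is read off the code in this post and is not a transcription of the pseudocode figures. The symbol $d_t$ (1 if the episode ended at step $t$, otherwise 0) is notation introduced here; the code stores $1 - d_t$ as its `done` mask.

```latex
\begin{aligned}
y_t &= r_t + \gamma\, V(s_{t+1})\,(1 - d_t) \\
\delta_t &= y_t - V(s_t) \\
L_{\text{actor}} &= -\log \pi(a_t \mid s_t)\,\delta_t \\
L_{\text{critic}} &= \bigl(V(s_t) - y_t\bigr)^2
\end{aligned}
```

Here $y_t$ corresponds to `td_target` and $\delta_t$ to `delta`, with both treated as constants when the losses are differentiated. The code's actor update additionally adds a smooth-L1 value term to $L_{\text{actor}}$.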
  4. Implementation
       The implementation here follows an online tutorial, but the original source code only implements a policy-gradient method, so it is lightly modified here. In addition, a stochastic policy is used here, which by itself adds exploration. Unlike the earlier deterministic policy, actions are drawn with torch's sampling function, whose details I have not studied. The results are shown in the figure below: after training, the reward essentially converges to 100.
import gym
import numpy
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.distributions import Categorical
import matplotlib.pyplot as plt

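# Hyperparameters
learning_rate = 0.0002
gamma = 0.98
n_rollout = 100      # maximum number of steps per episode
MAX_EPISODE = 2000   # assumed value: used in main() but not defined in the original listing
RENDER = False

env = gym.make('CartPole-v1')
env = env.unwrapped  # strip the wrappers, including gym's built-in step limit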

#print("env.action_space :", env.action_space)
#print("env.observation_space :", env.observation_space)

n_features = env.observation_space.shape[0]
n_actions = env.action_space.n

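class ActorCritic(nn.Module):
    def __init__(self):
        super(ActorCritic, self).__init__()
        self.data = []  # transitions collected during the current episode
        hidden_dims = 256
        # Shared feature extractor. The end of this statement is cut off in the
        # original listing; the ReLU activation is an assumption.
        self.feature_layer = nn.Sequential(nn.Linear(n_features, hidden_dims),
                                           nn.ReLU())
        self.fc_pi = nn.Linear(hidden_dims, n_actions)  # policy head (Actor)
        self.fc_v = nn.Linear(hidden_dims, 1)           # value head (Critic)
        self.optimizer = optim.Adam(self.parameters(), lr=learning_rate)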

    def pi(self, x):
        x = self.feature_layer(x)
        x = self.fc_pi(x)
        prob = F.softmax(x, dim=-1)
        return prob

    def v(self, x):
        x = self.feature_layer(x)
        v = self.fc_v(x)
        return v

    def put_data(self, transition):
        self.data.append(transition)

    def make_batch(self):
        s_lst, a_lst, r_lst, s_next_lst, done_lst = [], [], [], [], []
        for transition in self.data:
            s, a, r, s_, done = transition
            s_lst.append(s)
            a_lst.append([a])
            r_lst.append([r / 100.0])  # scale rewards down
            s_next_lst.append(s_)
            done_mask = 0.0 if done else 1.0  # 0 stops bootstrapping at terminal states
            done_lst.append([done_mask])

        s_batch = torch.tensor(numpy.array(s_lst), dtype=torch.float)
        a_batch = torch.tensor(a_lst)
        r_batch = torch.tensor(numpy.array(r_lst), dtype=torch.float)
        s_next_batch = torch.tensor(numpy.array(s_next_lst), dtype=torch.float)
        done_batch = torch.tensor(numpy.array(done_lst), dtype=torch.float)
        self.data = []  # on-policy: discard the samples once they are batched
        return s_batch, a_batch, r_batch, s_next_batch, done_batch

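    def train_net(self):
        s, a, r, s_, done = self.make_batch()
        # One-step TD target; `done` is the mask from make_batch (0 at terminal states)
        td_target = r + gamma * self.v(s_) * done
        # TD error, used as the advantage estimate
        delta = td_target - self.v(s)

        def critic_learn():
            loss_func = nn.MSELoss()
            # Detach the target so gradients flow only through V(s)
            loss1 = loss_func(self.v(s), td_target.detach())
            self.optimizer.zero_grad()
            loss1.backward()
            self.optimizer.step()

        def actor_learn():
            pi = self.pi(s)
            pi_a = pi.gather(1, a)  # probability of the action actually taken
            loss = -torch.log(pi_a) * delta.detach() + F.smooth_l1_loss(self.v(s), td_target.detach())
            self.optimizer.zero_grad()
            loss.mean().backward()
            self.optimizer.step()

        # Reconstructed: the original listing omits the optimizer steps and these two calls
        critic_learn()
        actor_learn()
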

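def main():
    # Written against the classic gym API (gym < 0.26): reset() returns only
    # the observation and step() returns four values.
    model = ActorCritic()
    print_interval = 20
    score = 0.0
    avg_returns = []

    for n_epi in range(MAX_EPISODE):
        s = env.reset()

        for t in range(n_rollout):
            if RENDER:
                env.render()
            prob = model.pi(torch.from_numpy(s).float())
            m = Categorical(prob)
            a = m.sample().item()  # sample an action from the stochastic policy
            s_next, r, done, info = env.step(a)
            model.put_data((s, a, r, s_next, done))

            s = s_next
            score += r

            if done:
                break

        # Learn once per episode from the transitions just collected
        model.train_net()

        if n_epi % print_interval == 0 and n_epi != 0:
            avg_score = score / print_interval
            avg_returns.append(avg_score)
            print("# of episode :{}, avg score : {:.1f}".format(n_epi, avg_score))
            score = 0.0

    plt.plot(avg_returns)
    plt.ylabel('avg score')
    plt.savefig('./plt_ac.png', format='png')

if __name__ == '__main__':
    main()
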
 [Figure: average score during training]
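
On the sampling step used by the stochastic policy: conceptually, drawing an action from a categorical distribution can be done by inverse-CDF sampling, as in the sketch below. This illustrates the idea only and is not a description of how torch implements `Categorical.sample()`. The function name `sample_categorical` is hypothetical.

```python
import random

def sample_categorical(probs, rng=random):
    """Draw an index i with probability probs[i] (inverse-CDF sampling)."""
    u = rng.random()  # uniform in [0, 1)
    cumulative = 0.0
    for i, p in enumerate(probs):
        cumulative += p
        if u < cumulative:
            return i
    return len(probs) - 1  # guard against floating-point round-off

rng = random.Random(0)  # fixed seed so the run is repeatable
counts = [0, 0]
for _ in range(10_000):
    counts[sample_categorical([0.8, 0.2], rng)] += 1
print(counts)
```

Because every action with non-zero probability can still be drawn, the policy keeps exploring without a separate mechanism such as epsilon-greedy.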

