
Intensive reading: generative adversarial imitation learning

2022-06-22 10:58:00 Alex_ 12 hours a day 6 days a week

1. Background introduction

1.1. Background

This paper was published in 2016 by a research team at Stanford University. It has two authors: Jonathan Ho, who has an impressive track record and whose main research interests are unsupervised learning and reinforcement learning, and Stefano Ermon, an associate professor at Stanford whose main research directions are probabilistic modeling, generative learning, and reasoning.

1.2. Questions before reading

  1. Why did I choose this paper?

Because my graduation project is on autonomous driving, and the learned model has not reached the expected performance: first, the time cost of training is very high, and second, there is no way to make the model learn to obey traffic rules. Later I came across imitation learning in the book "Hands-on Reinforcement Learning", surveyed the current imitation learning algorithms, and finally chose this paper, i.e. imitation learning based on generative adversarial networks.

  2. What is the general task?

Autonomous driving is essentially a sequential decision-making problem, so reinforcement learning is used to let the model learn the optimal decisions.

  3. What problems exist in this direction, and what kind of problems are they?
  • Agents learn too slowly in complex scenes. (Training DeepMind's Rainbow algorithm takes about 34,200 GPU hours, i.e. more than 1,000 days on a single GPU, so parallel training is necessary.)
  • In the real world, reward functions are difficult to define: they depend on real-world rules, and if they are designed poorly the agent may find loopholes. (For example, an autonomous vehicle that simply stops moving.)
  4. Why do these problems arise?
  • At the beginning of learning, agents choose fairly random actions in order to explore the environment. For example, DQN chooses between exploration and exploitation according to a decaying probability, which means a lot of time is spent on exploration early on, and most of it is meaningless exploration.
  • The rules of the real world cannot be enumerated exhaustively, which makes the reward function hard to express formally; in code, it is hard to compute the reward with a single function. For autonomous driving, for instance, traffic rules are one kind of constraint: traffic lights, lane lines, sidewalks, and so on. It is impossible to design the reward function by following the Road Traffic Safety Law article by article, and these rules also change, since traffic rules differ between regions.
  5. How do the authors solve these problems?
  • First, imitation learning can learn quickly from expert experience. This reduces exploration time and prevents the agent from exploring obviously meaningless actions at the start; for example, when there is a wall ahead, the agent does not have to test what state results from crashing into it at full speed.
  • Second, implicit rules are contained in expert experience. In autonomous driving, although we cannot combine all the rules into a reward function, the agent can learn the rules by imitating an experienced driver's driving style. In effect, expert experience implicitly gives a relative ordering of rewards without defining the reward function explicitly.
  6. How do the authors verify that the solution works?
  • In the paper: the dataset size required for the model to reach expert-level performance.
  • Learning speed: whether the model needs less training time to reach the same performance.
  • Occupancy measure: in the same state, how often the action taken by the agent coincides with the action taken by the expert.

1.3. Reinforcement learning basics

First, a quick review of reinforcement learning.

Reinforcement learning is a paradigm in which a machine achieves its goal by interacting with the environment. First, the machine perceives the environment's current state, then makes an action decision through its own computation and applies this action to the environment. The environment changes accordingly and sends the corresponding reward and the next state back to the machine; the machine then perceives the new environment state in the next round of interaction, and so on.

[Figure: the agent-environment interaction loop]
In reinforcement learning, the machine is called an agent (Agent), similar to the "model" in supervised learning. However, reinforcement learning emphasizes that the agent not only perceives its surrounding environment but also changes the environment directly through its decisions, rather than merely producing predictive signals.

There are some basic concepts in reinforcement learning:

  1. Stochastic process

If we leave the agent aside for a moment, the environment itself is generally dynamic: it evolves as certain factors change, i.e. it is a stochastic process.

A stochastic process is a quantitative description of the dynamics of a sequence of random phenomena; its object of study is a random phenomenon that evolves over time.
Example: a car arrives at a crossroads, where it can go straight, turn left, turn right, or make a U-turn. This is a stochastic process; a quantitative description means determining the probability of going straight, turning left, turning right, and making a U-turn.
The value of the random phenomenon at some time $t$ is a vector, denoted $S_t$; all possible states form the state set $S$.

For a stochastic process, the key elements are the states and the conditional probability distribution of state transitions. The state $S_t$ at time $t$ usually depends on the states before time $t$, so the probability that the next state is $S_{t+1}$ can be written as $P(S_{t+1} \mid S_1, \dots, S_t)$.

  2. The Markov property

If we now add the Markov property, namely that the state at a given moment depends only on the state at the previous moment, the formula simplifies to $P(S_{t+1} \mid S_t) = P(S_{t+1} \mid S_1, \dots, S_t)$.

The Markov property means that the next state depends only on the current state and is not affected by past states. That does not mean it has nothing to do with history: although the state at time $t+1$ is only related to the state at time $t$, the state at time $t$ in turn contains information about the state at time $t-1$; through this chain, historical information is passed on to the present.

So the Markov property actually simplifies the computation: as long as the current state is known, none of the historical states are needed.

  3. Markov process

A stochastic process with the Markov property is called a Markov process, usually represented by the tuple $(S, P)$, where $S$ is a finite set of states and $P$ is the state transition matrix.

Suppose there are $n$ states. The state transition matrix defines the transition probabilities between all pairs of states; the element in row $i$, column $j$ is the probability of moving from state $s_i$ to state $s_j$:
$$\mathcal{P}=\left[\begin{array}{ccc} P\left(s_{1} \mid s_{1}\right) & \cdots & P\left(s_{n} \mid s_{1}\right) \\ \vdots & \ddots & \vdots \\ P\left(s_{1} \mid s_{n}\right) & \cdots & P\left(s_{n} \mid s_{n}\right) \end{array}\right]$$
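As a small aside (not from the paper), the following minimal sketch samples a trajectory from such a transition matrix; it assumes numpy and a hypothetical 3-state chain.

import numpy as np

# Hypothetical 3-state transition matrix: row i gives P(next state | current state i)
P = np.array([
    [0.7, 0.2, 0.1],
    [0.3, 0.5, 0.2],
    [0.0, 0.4, 0.6],
])

def sample_chain(P, s0=0, horizon=10, rng=np.random.default_rng(0)):
    """Sample a state trajectory from a Markov chain with transition matrix P."""
    states = [s0]
    for _ in range(horizon):
        s = states[-1]
        # The next state depends only on the current state (Markov property)
        states.append(rng.choice(len(P), p=P[s]))
    return states

print(sample_chain(P))  # e.g. [0, 0, 1, 1, 2, 2, ...]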

  4. Markov reward process

If a reward function $r$ and a discount factor $\gamma$ are added to a Markov process, we get a Markov reward process, described by the tuple $(S, P, r, \gamma)$.

The reward function $r$ gives the reward for transitioning into a state $s$. For example, suppose a task in the autonomous driving setting is: first wash the car, then refuel, and finally go home. Then the car receives a reward when it arrives at the car wash, another reward when it arrives at the gas station, and a final reward when it gets home.

The discount factor $\gamma$ lies in $[0, 1)$, closed on the left and open on the right. Discounting is introduced because future rewards carry some uncertainty, and sometimes we want to receive rewards as soon as possible, so long-term rewards are discounted. The closer $\gamma$ is to 1, the more the agent cares about long-term cumulative rewards; the closer it is to 0, the more it cares about short-term rewards.

In a Markov reward process, the expected return of a state is called the value of that state, and the values of all states form the value function. The input of the value function is a state and the output is the value (a score) of that state.

Take the car wash / refuel / go home example again: the value of washing the car is not as large as that of refuelling, because it does not matter much whether the car gets washed, but you cannot get home without refuelling; and the value of refuelling is not as large as that of getting home, because getting home is the ultimate goal.
So, for instance, $r(\text{wash car}) = 1$, $r(\text{refuel}) = 3$, $r(\text{arrive home}) = 5$.
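Continuing the sketch above (still hypothetical, reusing P and sample_chain and mapping the three states to wash car / refuel / arrive home), the value of each state can be estimated by averaging discounted returns over sampled trajectories, which is exactly the "expected return" definition of value.

gamma = 0.9
r = np.array([1.0, 3.0, 5.0])  # hypothetical rewards: wash car, refuel, arrive home

def discounted_return(states, r, gamma):
    """G = r(S_1) + gamma * r(S_2) + gamma^2 * r(S_3) + ..."""
    return sum(gamma ** t * r[s] for t, s in enumerate(states))

def estimate_value(P, r, s0, gamma, n_episodes=2000, horizon=50):
    """Monte Carlo estimate of V(s0): the average discounted return starting from s0."""
    rng = np.random.default_rng(0)
    total = 0.0
    for _ in range(n_episodes):
        traj = sample_chain(P, s0=s0, horizon=horizon, rng=rng)
        total += discounted_return(traj[1:], r, gamma)  # reward is received on entering a state
    return total / n_episodes

print([round(estimate_value(P, r, s, gamma), 2) for s in range(3)])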

  5. Markov decision process

In all of the stochastic processes above, the environment changes spontaneously. If there is an external stimulus, namely the agent's action, then the next state of the environment is determined jointly by the current state and the agent's action.

A Markov decision process adds the agent's actions on top of a Markov reward process and is described by the tuple $(S, P, A, r, \gamma)$, where $A$ is the set of actions; the reward function and the state transition function now also depend on $A$.

The action taken by the agent is determined by a policy $\pi$: $\pi(a \mid s) = P(A_t = a \mid S_t = s)$, the probability of taking action $a$ when the input state is $s$.

The value function splits into a state-value function and an action-value function.
State-value function: the expected return obtained by starting from state $s$ and following policy $\pi$, $V^{\pi}(s) = \mathbb{E}_{\pi}[G_t \mid S_t = s]$.
Action-value function: when using policy $\pi$, the expected return of executing action $a$ in the current state $s$, $Q^{\pi}(s, a) = \mathbb{E}_{\pi}[G_t \mid S_t = s, A_t = a]$.

  6. Occupancy measure

Different policies have different value functions, because the probability distribution over the states visited by the agent is different.

So we can define the occupancy measure: the probability that the state-action pair $(s, a)$ is visited when executing policy $\pi$.
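To make this concrete, a rough Monte Carlo estimate of the discounted occupancy measure can be obtained by rolling out the policy and accumulating discounted visit counts for every (s, a) pair. This is a hypothetical sketch, assuming a small environment with hashable (discrete) states, the classic gym step API, and a policy(s) callable; none of these names come from the paper.

from collections import defaultdict

def estimate_occupancy(env, policy, gamma=0.99, n_episodes=100, horizon=200):
    """Monte Carlo estimate of the normalized discounted occupancy measure rho_pi(s, a)."""
    counts = defaultdict(float)
    total = 0.0
    for _ in range(n_episodes):
        s = env.reset()
        for t in range(horizon):
            a = policy(s)
            counts[(s, a)] += gamma ** t
            total += gamma ** t
            s, _, done, _ = env.step(a)
            if done:
                break
    # Normalize so the estimates sum to 1 over all visited (s, a) pairs
    return {sa: c / total for sa, c in counts.items()}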

2. Ideas of the paper

2.1. Imitation learning

In reinforcement learning, data is obtained through the interaction between the agent and the environment; unlike supervised learning it needs no labels, but it does depend on how the reward function is set. In many real-world situations the reward function is not given, and if we design it naively we cannot guarantee that the policy trained by reinforcement learning meets the actual requirements.

Take the control of an autonomous vehicle as an example. The observation is the current perception of the environment, and the action is the planning of the next concrete path. If the reward function simply sets the reward for driving forward without collision to +1 and the reward for a collision to -10, then the agent is likely to learn to find a spot and stay put. This is why many reward functions for autonomous vehicles are carefully designed and tuned.

Suppose there is an expert agent whose policy can be regarded as optimal. Imitation learning then trains the agent's policy by directly imitating the state-action data the expert produces while interacting with the environment, without using any reward signal provided by the environment.

2.2. Behavior cloning and inverse reinforcement learning

Current imitation learning methods in the literature can be roughly divided into three categories:

  1. Behavior cloning (Behavior Cloning, BC): directly use supervised learning. Take the expert data tuples $(s, a)$, use $s$ as the sample input and $a$ as the label; the learning objective is $\theta^{*} = \arg \min_{\theta} \mathbb{E}_{(s,a) \sim B}[\mathcal{L}(\pi_{\theta}(s), a)]$

where $B$ is the expert dataset and $\mathcal{L}$ is the loss function of the supervised learning framework. If the actions are discrete, the loss can be the negative log-likelihood (maximum likelihood estimation); if the actions are continuous, the loss can be the mean squared error (a minimal sketch of both choices appears after the pros and cons below).

Advantages: ① simple to implement; ② a decent policy can be learned quickly.
Disadvantages: ① needs a large amount of data; ② weak generalization: it cannot handle situations not covered by the expert experience; ③ the agent fits the expert trajectories, which may themselves be biased, so it may pick up small idiosyncratic habits and cannot judge which knowledge is important; ④ in supervised learning a bad prediction only affects that one sample, but in sequential decision-making a bad action in the current state affects all subsequent states, i.e. compounding error.
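For illustration, here is a minimal self-contained sketch of the two loss choices in PyTorch, using random dummy data (this is not the paper's code; the actual behavior cloning implementation appears in the code section below).

import torch as th
import torch.nn as nn
import torch.nn.functional as F

batch, state_dim, action_dim = 32, 4, 2
policy_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, action_dim))
expert_states = th.randn(batch, state_dim)

# Discrete actions: treat the network output as logits and maximize the likelihood
# of the expert action (cross-entropy, i.e. negative log-likelihood)
expert_actions_discrete = th.randint(0, action_dim, (batch,))
bc_loss_discrete = F.cross_entropy(policy_net(expert_states), expert_actions_discrete)

# Continuous actions: treat the network output as the action itself and use MSE
expert_actions_continuous = th.randn(batch, action_dim)
bc_loss_continuous = F.mse_loss(policy_net(expert_states), expert_actions_continuous)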

  2. Inverse reinforcement learning (Inverse Reinforcement Learning, IRL): assume that the environment's reward function should give the expert policy the highest reward, learn this underlying reward (or cost) function first, and finally run forward reinforcement learning on it to obtain the imitation policy.

Concretely, suppose we have an expert policy $\pi_E$ that we hope to recover with inverse reinforcement learning; this amounts to finding an optimal cost function within a function class $\mathcal{C}$.

The optimization objective of inverse reinforcement learning is $\underset{c \in \mathcal{C}}{\operatorname{maximize}}\left(\min_{\pi \in \Pi} -H(\pi) + \mathbb{E}_{\pi}[c(s, a)]\right) - \mathbb{E}_{\pi_{E}}[c(s, a)]$, where $\pi_E$ is the expert policy, $\pi$ is the policy we want to learn, and $H(\pi) \triangleq \mathbb{E}_{\pi}[-\log \pi(a \mid s)]$ is the $\gamma$-discounted causal entropy of policy $\pi$.

In other words, $\mathbb{E}_{\pi_{E}}[c(s, a)]$ is the expected cost over trajectories obtained by the expert policy $\pi_E$ interacting with the environment, and likewise $\mathbb{E}_{\pi}[c(s, a)]$ is the expected cost over the state-action pairs produced by the learned policy. The maximum causal entropy term is introduced so that, in states that do not appear in the observed expert trajectories, the policy acts as randomly as possible. (A small sketch of estimating these terms appears after the pros and cons below.)

Therefore, IRL looks for a cost function $c \in \mathcal{C}$ that assigns low cost to the expert policy and higher cost to other policies, from which the cost (reward) function behind the expert policy can be inferred. Essentially, the cost function learned by inverse reinforcement learning explains the expert's actions; it does not directly tell the agent how to act.

Advantage: it learns a cost function and fits the expert trajectories first, so there is no compounding-error problem.
Disadvantage: high computational cost, since reinforcement learning must be run repeatedly in the inner loop.
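To make the expectation and entropy terms above concrete, they can be estimated from sampled (s, a) batches roughly as follows. This is a hypothetical sketch that ignores the γ-discounting of the causal entropy for simplicity; cost_net and policy_net are assumed networks, not objects from the paper.

import torch as th

def expected_cost(cost_net, states, actions):
    """Monte Carlo estimate of E[c(s, a)] over sampled state-action pairs."""
    return cost_net(states, actions).mean()

def causal_entropy(policy_net, states, actions):
    """Estimate of H(pi) = E_pi[-log pi(a|s)] from states and actions sampled by pi."""
    probs = policy_net(states)                             # (batch, action_dim)
    log_pi = th.log(probs.gather(1, actions.view(-1, 1)))  # log pi(a|s) of the taken actions
    return -log_pi.mean()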

2.3. Generative adversarial imitation learning

  3. Generative adversarial imitation learning (Generative Adversarial Imitation Learning, GAIL): learn the policy directly from data in the style of a generative adversarial network, bypassing the intermediate step of inverse reinforcement learning.

The essence of generative adversarial imitation learning is to imitate the occupancy measure of the expert policy: it tries to make the occupancy measure of the learned policy's state-action pairs in the environment, $\rho_{\pi}(s, a)$, match the occupancy measure of the expert policy, $\rho_{\pi_E}(s, a)$.

To this end, the policy needs to interact with the environment. The GAIL algorithm has a discriminator and a policy. The policy $\pi$ plays the role of the generator in a generative adversarial network: given a state, it outputs the action to take in that state. The discriminator $D$ takes the state-action pair $(s, a)$ as input and outputs a real number between 0 and 1, indicating whether the discriminator believes the state-action pair comes from the agent policy or from the expert policy.

The discriminator $D$ aims to push its output on expert data as close to 0 as possible and its output on the agent policy's data as close to 1 as possible, so that the two sets of data can be told apart. The discriminator's loss function is therefore $L(\phi) = -\mathbb{E}_{\rho_{\pi}}[\log D_{\phi}(s, a)] - \mathbb{E}_{\rho_{E}}[\log(1 - D_{\phi}(s, a))]$, where $\phi$ denotes the parameters of the discriminator $D$.

The goal of the agent policy is that the trajectories it generates through interaction are mistaken by the discriminator for expert trajectories, so the discriminator's output can be used as a reward signal for training the agent policy. That is, the agent policy observes a state $s$ in the environment and takes action $a$; this state-action pair $(s, a)$ is fed into the discriminator $D$, which outputs $D(s, a)$, and the reward is then set to $r(s, a) = -\log D(s, a)$. Finally, through this continual adversarial process, the data distribution generated by the agent policy gets closer and closer to the true expert data distribution.
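As an illustration, the discriminator loss and the reward handed to the agent might be written as follows in PyTorch (a self-contained sketch with random placeholder tensors; the small network D here is hypothetical and merely stands in for the Discriminator defined in the code section below).

import torch as th
import torch.nn as nn

state_dim, action_dim, batch = 4, 2, 64
D = nn.Sequential(nn.Linear(state_dim + action_dim, 128), nn.ReLU(),
                  nn.Linear(128, 1), nn.Sigmoid())   # D(s, a) in (0, 1)

agent_sa = th.randn(batch, state_dim + action_dim)   # (s, a) pairs from the learned policy
expert_sa = th.randn(batch, state_dim + action_dim)  # (s, a) pairs from the expert

# Discriminator loss: push D towards 1 on agent data and towards 0 on expert data
d_loss = -th.log(D(agent_sa)).mean() - th.log(1 - D(expert_sa)).mean()

# Reward given to the agent: the more expert-like a pair looks (D close to 0), the larger the reward
rewards = -th.log(D(agent_sa)).detach()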

2.4. Experimental results

In the paper, GAIL is evaluated on 9 control tasks in gym. First, an expert policy is trained with TRPO; then datasets with different numbers of trajectories are sampled from that expert policy; finally three algorithms are compared. The first is behavior cloning, i.e. supervised training directly on the given state-action pairs; the second is feature expectation matching, an inverse reinforcement learning algorithm; the third is GAIL. The three models use the same neural network architecture, each run starts from a random initialization, and they are given exactly the same amount of environment interaction.
[Figure: performance versus number of expert trajectories on the benchmark tasks]
The experimental results are shown in the figure. The y-axis is the model's performance, with the expert policy scaled to 1 and random actions to 0; the x-axis is the number of expert trajectories in the dataset, i.e. the dataset size.

In 8 of the experiments GAIL performs very well. One can see that the performance of behavior cloning depends heavily on the amount of expert policy data and improves as the dataset grows, which indicates that behavior cloning uses the expert data relatively inefficiently. Inverse reinforcement learning does better than behavior cloning, and GAIL's performance is essentially on par with the expert policy.

We can also notice that although behavior cloning improves with more data, it never surpasses the expert, whereas GAIL actually exceeds expert-level performance on several tasks. Why is that? First, behavior cloning is supervised training, so the quality of the dataset caps the model: its performance can only approach that ceiling, never exceed it. Second, generative adversarial imitation learning does not only use the expert's data; the generator also produces new data, so its ceiling can be higher than the expert's.

To give an analogy: behavior cloning is like taking a limit in mathematics, where you can approach the limit value arbitrarily closely but never reach it, whereas generative adversarial learning is like learning from a teacher: once you have learned everything the teacher knows, you can continue to learn on your own and eventually surpass the teacher, as the saying goes, the waves behind push on the waves ahead. Still, the advantage over the expert can only be modest: once you have passed the teacher, the teacher cannot keep teaching you, and you have to keep looking for better teachers, much like going from primary school to junior high, then senior high, then university, always seeking better teachers to learn from.

A small idea: since the performance of generative adversarial imitation learning can exceed the expert's, could the data generated after surpassing the expert be added back into the expert trajectories? That would amount to being your own teacher and continually surpassing yourself.

[Figure: results on the Reacher task]

On the Reacher task, behavior cloning is more sample-efficient than GAIL. The authors then tuned hyperparameters aggressively, but in the end still did not match behavior cloning, and the paper does not explain why; it is presumably related to the nature of this particular task.

3. Code practice

Step 1. Generate expert data

An expert policy is trained with the PPO algorithm. First, a policy network is defined.

import torch as th
import torch.nn as nn
import torch.nn.functional as F

# device is used throughout the following code
device = th.device("cuda" if th.cuda.is_available() else "cpu")

class PolicyNet(nn.Module):
	def __init__(self, state_dim, hidden_dim, action_dim):
		super(PolicyNet, self).__init__()
		self.fc1 = nn.Linear(state_dim, hidden_dim)
		self.fc2 = nn.Linear(hidden_dim, action_dim)

	def forward(self, x):
		x = F.relu(self.fc1(x))
		return F.softmax(self.fc2(x), dim=1)

Then a value network is defined.

class ValueNet(nn.Module):
	def __init__(self, state_dim, hidden_dim):
		super(ValueNet, self).__init__()
		self.fc1 = nn.Linear(state_dim, hidden_dim)
		self.fc2 = nn.Linear(hidden_dim, 1)

	def forward(self, x):
		x = F.relu(self.fc1(x))
		return self.fc2(x)

The expert policy is then trained with PPO; here the clipped (truncated) version of PPO is used.

class PPO:
	""" PPO Algorithm , Use truncation  """

	def __init__(self, state_dim, hidden_dim, action_dim, actor_lr, critic_lr, lambda_, epochs, eps, gamma):
		self.actor = PolicyNet(state_dim, hidden_dim, action_dim).to(device)
		self.critic = ValueNet(state_dim, hidden_dim).to(device)
		self.actor_optimizer = th.optim.Adam(self.actor.parameters(), lr=actor_lr)
		self.critic_optimizer = th.optim.Adam(self.critic.parameters(), lr=critic_lr)
		self.gamma = gamma
		self.lambda_ = lambda_
		self.epochs = epochs
		self.eps = eps

	def take_action(self, state):
		state = th.tensor([state], dtype=th.float).to(device)
		probs = self.actor(state)
		action_dist = th.distributions.Categorical(probs)
		action = action_dist.sample()
		return action.item()

	def update(self, transition_dict):
		states = th.tensor(transition_dict["states"], dtype=th.float).to(device)
		actions = th.tensor(transition_dict["actions"], dtype=th.float).view(-1, 1).to(device)
		rewards = th.tensor(transition_dict["rewards"], dtype=th.float).view(-1, 1).to(device)
		next_states = th.tensor(transition_dict["next_states"], dtype=th.float).to(device)
		dones = th.tensor(transition_dict["dones"], dtype=th.float).view(-1, 1).to(device)
		td_target = rewards + self.gamma * self.critic(next_states) * (1 - dones).to(device)
		td_delta = td_target - self.critic(states)
		advantage = compute_advantage(self.gamma, self.lambda_, td_delta).to(device)
		old_log_probs = th.log(self.actor(states).gather(1, actions.type(th.int64))).detach()

		for _ in range(self.epochs):
			log_probs = th.log(self.actor(states).gather(1, actions.type(th.int64)))
			ratio = th.exp(log_probs - old_log_probs)
			surr1 = ratio * advantage
			surr2 = th.clamp(ratio, 1 - self.eps, 1 + self.eps) * advantage
			actor_loss = th.mean(-th.min(surr1, surr2))
			critic_loss = th.mean(F.mse_loss(self.critic(states), td_target.detach()))
			self.actor_optimizer.zero_grad()
			self.critic_optimizer.zero_grad()
			actor_loss.backward()
			critic_loss.backward()
			self.actor_optimizer.step()
			self.critic_optimizer.step()
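The update method above calls compute_advantage, which is not defined in the post. A common choice consistent with the gamma and lambda_ arguments is generalized advantage estimation (GAE); the following is my guess at the intended helper, not code from the original post.

import numpy as np

def compute_advantage(gamma, lambda_, td_delta):
    """Generalized advantage estimation from the per-step TD errors."""
    td_delta = td_delta.detach().cpu().numpy()
    advantages, advantage = [], 0.0
    for delta in td_delta[::-1]:                # accumulate backwards in time
        advantage = gamma * lambda_ * advantage + delta
        advantages.append(advantage)
    advantages.reverse()
    return th.tensor(np.array(advantages), dtype=th.float)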

After training, the final return can reach close to 500.

[Figure: training curve of the PPO expert policy]
Finally, expert trajectories are sampled from the trained PPO expert policy.
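The sampling step itself is not shown in the post; a possible sketch, assuming the classic gym API and the trained agent from above (env, ppo_agent and the episode count are placeholder choices):

def sample_expert_data(env, agent, n_episodes=1):
    """Roll out the trained expert policy and record its (state, action) pairs."""
    states, actions = [], []
    for _ in range(n_episodes):
        state = env.reset()
        done = False
        while not done:
            action = agent.take_action(state)
            states.append(state)
            actions.append(action)
            state, reward, done, _ = env.step(action)
    return np.array(states), np.array(actions)

expert_s, expert_a = sample_expert_data(env, ppo_agent, n_episodes=1)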

Step 2. Behavior cloning

The implementation of behavior cloning is actually very simple: it is just a policy network. First it is trained with the expert data as input, then the agent is tested directly in the environment.

class BehaviorClone:
    def __init__(self, state_dim, hidden_dim, action_dim, lr):
        self.policy = PolicyNet(state_dim, hidden_dim, action_dim).to(device)
        self.optimizer = th.optim.Adam(self.policy.parameters(), lr=lr)

    def learn(self, states, actions):
        states = th.tensor(states, dtype=th.float).to(device)
        actions = th.tensor(actions).view(-1, 1).to(device)
        log_probs = th.log(self.policy(states).gather(1, actions.type(th.int64)))
        bc_loss = th.mean(-log_probs)  #  Maximum likelihood estimation 

        self.optimizer.zero_grad()
        bc_loss.backward()
        self.optimizer.step()

    def take_action(self, state):
        state = th.tensor([state], dtype=th.float).to(device)
        probs = self.policy(state)
        action_dist = th.distributions.Categorical(probs)
        action = action_dist.sample()
        return action.item()
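A possible training loop for the behavior cloning agent (a sketch: the batch size, iteration count, learning rate and the expert_s / expert_a arrays from Step 1 are assumed):

bc_agent = BehaviorClone(state_dim, hidden_dim, action_dim, lr=1e-3)

for i in range(1000):
    # Sample a random mini-batch of expert (state, action) pairs and fit the policy to it
    idx = np.random.randint(low=0, high=expert_s.shape[0], size=64)
    bc_agent.learn(expert_s[idx], expert_a[idx])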

Step 3. Generative adversarial imitation learning

Finally, generative adversarial imitation learning. A discriminator also has to be defined. Here I simply concatenate the state vector and the action vector, i.e. the state-action pair, pass it through two fully connected layers, and finally a Sigmoid outputs a probability between 0 and 1.

class Discriminator(nn.Module):
    def __init__(self, state_dim, hidden_dim, action_dim):
        super(Discriminator, self).__init__()
        self.fc1 = th.nn.Linear(state_dim + action_dim, hidden_dim)
        self.fc2 = th.nn.Linear(hidden_dim, 1)

    def forward(self, x, a):
        cat = th.cat([x, a], dim=1)
        x = F.relu(self.fc1(cat))
        return th.sigmoid(self.fc2(x))
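The code above stops at the discriminator; the training step that ties it to the PPO agent is not shown in the post. Below is one possible sketch of a GAIL class (an assumption-laden outline rather than the original code): it trains the discriminator with the loss from section 2.3, uses -log D(s, a) as the reward, and hands the relabelled transitions to the PPO agent from Step 1, which acts as the generator. One-hot encoding the discrete actions before concatenation is an implementation choice.

class GAIL:
    def __init__(self, agent, state_dim, action_dim, hidden_dim, lr_d):
        self.discriminator = Discriminator(state_dim, hidden_dim, action_dim).to(device)
        self.discriminator_optimizer = th.optim.Adam(self.discriminator.parameters(), lr=lr_d)
        self.agent = agent              # the PPO agent plays the role of the generator
        self.action_dim = action_dim

    def learn(self, expert_s, expert_a, agent_s, agent_a, next_s, dones):
        expert_states = th.tensor(expert_s, dtype=th.float).to(device)
        agent_states = th.tensor(agent_s, dtype=th.float).to(device)
        # Discrete actions are one-hot encoded before being concatenated with the state
        expert_actions = F.one_hot(th.tensor(expert_a).to(th.int64), self.action_dim).float().to(device)
        agent_actions = F.one_hot(th.tensor(agent_a).to(th.int64), self.action_dim).float().to(device)

        expert_prob = self.discriminator(expert_states, expert_actions)
        agent_prob = self.discriminator(agent_states, agent_actions)
        # Discriminator loss from section 2.3: agent pairs -> 1, expert pairs -> 0
        d_loss = nn.BCELoss()(agent_prob, th.ones_like(agent_prob)) + \
                 nn.BCELoss()(expert_prob, th.zeros_like(expert_prob))
        self.discriminator_optimizer.zero_grad()
        d_loss.backward()
        self.discriminator_optimizer.step()

        # r(s, a) = -log D(s, a): the more expert-like the pair, the higher the reward
        rewards = -th.log(agent_prob).detach().cpu().numpy()
        transition_dict = {"states": agent_s, "actions": agent_a, "rewards": rewards,
                           "next_states": next_s, "dones": dones}
        self.agent.update(transition_dict)

In an outer loop one would repeatedly roll out the current policy, collect the (state, action, next_state, done) transitions, and call gail.learn(expert_s, expert_a, agent_s, agent_a, next_s, dones).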

Full code: https://gitcode.net/weixin_43336281/reinforcementlearning/-/tree/master/%E6%A8%A1%E4%BB%BF%E5%AD%A6%E4%B9%A0

Copyright notice: this article was written by [Alex_ 12 hours a day 6 days a week]; please include a link to the original when reposting: https://yzsam.com/2022/173/202206221009582236.html