Reinforcement Learning (II): SARSA, with Code Rewriting
2022-07-29 01:41:00 【wweweiweiweiwei】
This article covers the second classical reinforcement learning algorithm: SARSA.
The name SARSA is simply the combination of State, Action, Reward, State_, Action_, because all five of these values are used when computing the Q value. Here State_ is the state at the next time step and Action_ is the action at the next time step. The difference from Q-learning is that SARSA's Q value calculation uses the action at the next time step, while Q-learning does not; Q-learning only uses SARS, the first four.
SARSA and Q-learning are very similar, but there are differences. SARSA is an on-policy algorithm, while Q-learning is off-policy. Also, SARSA's Q value calculation uses the next-step action A_, which is a defining feature of SARSA: the agent really will take action A_ at the next step, whereas in Q-learning it may not. Later, if you modify the Q-learning code from Reinforcement Learning (I) following my changes, you will see this clearly. If we treat the green triangles as a cliff where the agent dies on contact, the Q-learning agent is not afraid of death: it quickly trains to convergence and finds the "best path". The SARSA agent is different; it is more "afraid of death", so after "dying" a few times it starts avoiding the obstacles, or hesitates near them. You can see this step clearly when running the code, which is why SARSA takes longer to train than Q-learning.
In practice, though, SARSA can be better suited than Q-learning, because in real life we do not get as many chances for trial and error as Q-learning requires.
Here is a comparison of the two pseudocodes:

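In text form, the two update loops look roughly like this (my own paraphrase of the standard pseudocode, not a transcription of the figure above):

Q-learning (off-policy):
    initialize Q(s, a); for each episode: initialize S
    repeat until S is terminal:
        choose A from S using the policy derived from Q (e.g. epsilon-greedy)
        take action A, observe R and S'
        Q(S, A) <- Q(S, A) + alpha * [R + gamma * max_a Q(S', a) - Q(S, A)]
        S <- S'

SARSA (on-policy):
    initialize Q(s, a); for each episode: initialize S, choose A from S using the policy derived from Q
    repeat until S is terminal:
        take action A, observe R and S'
        choose A' from S' using the policy derived from Q
        Q(S, A) <- Q(S, A) + alpha * [R + gamma * Q(S', A') - Q(S, A)]
        S <- S'; A <- A'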
Here is how to modify the Q-learning code from Reinforcement Learning (I); it is very simple.
The main function:
if __name__ == "__main__":
    env = Env()
    agent = QLearningAgent(actions=list(range(env.n_actions)))
    for episode in range(1000):  # run 1000 episodes
        state = env.reset()
        # choose the first action before entering the loop;
        # str() converts the state into a string so it can be used as a Q-table key
        action = agent.get_action(str(state))
        while True:
            env.render()
            # take the action, then choose the next action from the next state
            next_state, reward, done = env.step(action)
            next_action = agent.get_action(str(next_state))
            # update the Q table
            agent.learn(str(state), action, reward, str(next_state),next_action)
            state = next_state  # update the state and the action
            action = next_action
            env.print_value_all(agent.q_table)
            # when the episode ends, stop and start a new round of training
            if done:
                break
The main change is that the call that gets the action for the current state is moved to just before the while loop, and a line that gets the next action A_ is added inside the loop. The agent.learn call is also modified: the original Q-learning agent.learn took four inputs, and now it takes five. Compare the two below:
agent.learn(str(state), action, reward, str(next_state)) #Q-learning
agent.learn(str(state), action, reward, str(next_state),next_action) #SARSA
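The get_action method itself does not need to change. For completeness, here is a minimal epsilon-greedy sketch of what it might look like; the class name, the default epsilon, and the Q-table initialization are my own assumptions for the sketch, not necessarily identical to the original code:

import random
from collections import defaultdict

class SARSAAgent:
    def __init__(self, actions, epsilon=0.1):
        self.actions = actions                  # e.g. [0, 1, 2, 3]
        self.epsilon = epsilon                  # 10% random exploration, as mentioned below
        self.learning_rate = 0.01               # assumed values
        self.discount_factor = 0.9
        self.q_table = defaultdict(lambda: [0.0] * len(actions))

    def get_action(self, state):
        # epsilon-greedy: explore with probability epsilon, otherwise pick a best action
        if random.random() < self.epsilon:
            return random.choice(self.actions)
        state_q = self.q_table[state]
        max_q = max(state_q)
        # break ties randomly among equally good actions
        return random.choice([a for a, q in enumerate(state_q) if q == max_q])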
Of course, if this call is changed, the learn function must be changed accordingly:
def learn(self, state, action, reward, next_state, next_action):
    # look up the current Q value for this state-action pair in the Q table
    current_q = self.q_table[state][action]
    # Bellman-style update: reward plus the discounted Q value of the next state-action pair
    new_q = reward + self.discount_factor * self.q_table[next_state][next_action]
    self.q_table[state][action] += self.learning_rate * (new_q - current_q)
Here, next_action is added to the function signature, and the main change is in how new_q is calculated from self.q_table. Compare the two versions:
new_q = reward + self.discount_factor * max(self.q_table[next_state]) #Q-learning
new_q = reward + self.discount_factor * self.q_table[next_state][next_action] #SARSA
This follows directly from the Q-learning and SARSA algorithms; you can work through it alongside the pseudocode above.
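To make the difference concrete, here is a tiny self-contained example with a made-up Q table (the numbers are only for illustration):

# a toy Q table: two states, three actions each (made-up values)
q_table = {'s0': [0.0, 0.0, 0.0], 's1': [0.2, 0.5, 0.1]}
reward = 1.0
discount_factor = 0.9
next_state = 's1'
next_action = 2  # suppose the agent actually chose action 2 in the next state

new_q_qlearning = reward + discount_factor * max(q_table[next_state])          # 1.0 + 0.9 * 0.5 = 1.45
new_q_sarsa = reward + discount_factor * q_table[next_state][next_action]      # 1.0 + 0.9 * 0.1 = 1.09
print(new_q_qlearning, new_q_sarsa)

Q-learning always backs up the best-looking next action (0.5 here), while SARSA backs up the action the agent actually took (0.1 here), which is exactly why the SARSA agent learns to be more cautious near the cliff.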
Here are the results of running SARSA:

At the beginning, SARSA constantly tries and fails, falling into the triangles, and of course it also sometimes reaches the circle. After running for a while, the SARSA agent tends to wander around the four white squares in the upper-left corner, exactly the "fear of death" behavior described above. But because we give the actions 10% randomness, the agent can still escape those four squares, after which it will most likely reach the circle, though entering the triangle areas is not ruled out either. In the end the training was very successful, and you can see that the two squares in the lower-right corner were never explored at all.
Finally, a note: I have only recently started learning reinforcement learning, and I still need to review the classical algorithms. I write these articles first for myself, and second for other beginners like me, so that we can discuss anything that is unclear. Later articles will probably spend less time on classical algorithms, since the experts on Bilibili have already explained them very clearly; I will mainly share things when I have my own code implementation or modification. Thank you!
Update:
The green Q-learning pseudocode in the figure has a small issue. The term marked in orange, γmax(s', a'), should be γmax(s', a), because Q-learning's Q value calculation does not use the action actually taken at the next moment, so it should be a rather than a'.