Reinforcement Learning (II): SARSA, with Code Rewriting
2022-07-29 01:41:00 【wweweiweiweiwei】
This article covers the second classical reinforcement learning algorithm: SARSA.
The name SARSA is simply the combination of State, Action, Reward, State_, Action_, because these five values are used when computing the Q value; State_ is the state at the next time step and Action_ is the action at the next time step. The difference from Q-learning is that SARSA's Q-value calculation uses the next-step action, while Q-learning does not; Q-learning only uses SARS, the first four.
SARSA is very similar to Q-learning, but there are differences. SARSA is an online, on-policy algorithm, while Q-learning is an offline, off-policy algorithm. Also, when computing SARSA's Q value we use the next-step action A_, which is a defining feature of SARSA: the agent must actually take action A_ at the next time step, whereas Q-learning has no such requirement. Later, if you modify the Q-learning code from Reinforcement Learning (I) following my method, you will see this in practice. If we treat the green triangles as a cliff where the agent dies on contact, the Q-learning agent is not afraid of death: it quickly converges during training and finds the "best path". The SARSA agent is different. It is more "afraid of death", so after "dying" a few times it will avoid the obstacles or hesitate, which you can see clearly when running the code. As a result, SARSA takes longer to train than Q-learning in this article.
In practice, though, SARSA can be better than Q-learning, because in real life we do not get as many chances to try and fail as Q-learning assumes.
Here is a comparison of the two pieces of pseudocode:

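To make the comparison concrete, the two update rules can also be written out in the notation of this post (α is the learning rate, γ the discount factor, and S_, A_ are the next state and next action); this is just a restatement of what the pseudocode and the code below express:

Q-learning: Q(S, A) ← Q(S, A) + α [ R + γ · max_a Q(S_, a) − Q(S, A) ]
SARSA:      Q(S, A) ← Q(S, A) + α [ R + γ · Q(S_, A_) − Q(S, A) ]

The only difference is the target: Q-learning takes the maximum over all actions a in the next state, while SARSA plugs in the action A_ that the agent will actually take.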
Here is how to modify the Q-learning code from Reinforcement Learning (I); it is very simple.
The main function:
if __name__ == "__main__":
    env = Env()
    agent = QLearningAgent(actions=list(range(env.n_actions)))

    for episode in range(1000):  # run 1000 episodes
        state = env.reset()
        # the agent generates the current action; str() converts the state into a string key
        action = agent.get_action(str(state))

        while True:
            env.render()

            next_state, reward, done = env.step(action)
            # get the next action A_ for the next state
            next_action = agent.get_action(str(next_state))

            # update the Q table
            agent.learn(str(state), action, reward, str(next_state), next_action)

            state = next_state  # state update
            action = next_action
            env.print_value_all(agent.q_table)

            # when the end is reached, stop the game and start a new round of training
            if done:
                break
The main changes are: obtaining the current action for the current state is moved outside the while loop; a line that obtains the next action A_ is added inside the loop; and the agent.learn call is modified. The original Q-learning agent.learn takes four inputs, while this one takes five. Compare the two calls below:
agent.learn(str(state), action, reward, str(next_state)) #Q-learning
agent.learn(str(state), action, reward, str(next_state),next_action) #SARSA
Of course, once the call is changed, the learn function must be changed accordingly:
def learn(self, state, action, reward, next_state, next_action):
    # look up the current Q value for this state-action pair in the Q table
    current_q = self.q_table[state][action]
    # Bellman equation update: the SARSA target uses the action actually taken at the next step
    new_q = reward + self.discount_factor * self.q_table[next_state][next_action]
    self.q_table[state][action] += self.learning_rate * (new_q - current_q)
Here the function definition also gains a next_action parameter, and in the computation of new_q the main change is the self.q_table term. Compare the two versions:
new_q = reward + self.discount_factor * max(self.q_table[next_state]) #Q-learning
new_q = reward + self.discount_factor * self.q_table[next_state][next_action] #SARSA
This comes directly from the Q-learning and SARSA algorithms; you can analyze it together with their pseudocode.
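As a quick sanity check of the difference, here is a tiny self-contained example; the toy q_table and the numbers are made up purely for illustration and are not from the original project:

# hypothetical toy Q table: state -> list of Q values, one per action
q_table = {
    "s0": [0.0, 0.5, 0.2, 0.1],
    "s1": [1.0, 0.3, 0.8, 0.0],
}

discount_factor = 0.9
reward = 1.0
next_state = "s1"
next_action = 1  # the action the agent actually takes next (its Q value is 0.3)

# Q-learning target: best action in the next state
new_q_qlearning = reward + discount_factor * max(q_table[next_state])        # 1.0 + 0.9 * 1.0 = 1.9
# SARSA target: the action actually taken in the next state
new_q_sarsa = reward + discount_factor * q_table[next_state][next_action]    # 1.0 + 0.9 * 0.3 = 1.27

print(new_q_qlearning, new_q_sarsa)  # 1.9 1.27

When the next action happens to be an exploratory (random) one with a low Q value, the SARSA target is pulled down accordingly, which is one way to see why the SARSA agent behaves more cautiously near the cliff.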
Finally, here are the results of SARSA:

At the beginning, SARSA keeps trying and failing: it falls into the triangles, and of course it also reaches the circle area. After running for a while, though, the SARSA agent tends to linger in the four white squares in the upper-left corner, exactly the "fear of death" mentioned earlier. But because we give the actions 10% randomness, the agent can still leave those four squares; it will then reach the circle area with high probability, and entering the triangle area is not ruled out either. Even so, the final training result was very good, and you can see that the two squares in the lower-right corner were never explored at all.
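For reference, the 10% randomness mentioned above usually comes from ε-greedy action selection. The get_action implementation is not shown in this article, so the following is only a minimal sketch of what it might look like, assuming an exploration rate of 0.1 and a q_table that maps a state string to a list of Q values:

import random

EPSILON = 0.1  # assumed exploration rate, matching the 10% randomness mentioned above

def get_action(q_values, epsilon=EPSILON):
    # q_values: list of Q values for every action in the current state
    if random.random() < epsilon:
        # exploration: pick a completely random action
        return random.randrange(len(q_values))
    # exploitation: pick a greedy action, breaking ties randomly
    max_q = max(q_values)
    best_actions = [a for a, q in enumerate(q_values) if q == max_q]
    return random.choice(best_actions)

# example: in a state with Q values [0.0, 0.5, 0.2, 0.1], this picks action 1
# about 90% of the time and a uniformly random action about 10% of the time
print(get_action([0.0, 0.5, 0.2, 0.1]))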
Finally, a note: I have only recently started learning reinforcement learning, and I still need to review the classical algorithms. I write these articles firstly for my own study, and secondly to show them to other beginners like me, so that we can discuss anything that is unclear. Later articles will probably spend less time on the classical algorithms, since the teachers on Bilibili have already explained them very clearly; I will mainly share things when I have my own code implementation or modification. Thank you!
Update:
There is a small problem with Q-learning in the green pseudocode diagram. The term marked in orange, γmax(s', a'), is the issue: it should be γmax(s', a), because Q-learning's Q-value calculation does not use the action taken at the next time step, so it should be a, not a'.