Reinforcement Learning (II): SARSA, with Code Rewriting
2022-07-29 01:41:00 【wweweiweiweiwei】
This article covers the second classical reinforcement learning algorithm: SARSA.
The name SARSA is simply the combination of State, Action, Reward, State_, Action_, because these five values are used when computing the Q value; State_ is the state at the next time step and Action_ is the action at the next time step. The difference from Q-learning is that SARSA's Q-value calculation uses the next-step action, while Q-learning does not; Q-learning only uses SARS, the first four.
SARSA is very similar to Q-learning, but there are differences. SARSA is an online, on-policy algorithm, while Q-learning is an offline, off-policy algorithm. Also, when computing SARSA's Q value we use the next-step action A_, which is a defining feature of SARSA: the agent must actually take action A_ at the next time step, whereas Q-learning has no such requirement. Later, if you modify the Q-learning code from Reinforcement Learning (I) following my method, you will see this in practice. If we treat the green triangles as a cliff where the agent dies on contact, the Q-learning agent is not afraid of death: it quickly converges during training and finds the "best path". The SARSA agent is different. It is more "afraid of death", so after "dying" a few times it will avoid the obstacles or hesitate, which you can see clearly when running the code. As a result, SARSA takes longer to train than Q-learning in this article.
In practice, though, SARSA can be better than Q-learning, because in real life we do not get as many chances to try and fail as Q-learning assumes.
Here is a comparison of the two pieces of pseudocode:

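To make the comparison concrete, the two update rules can also be written out in the notation of this post (α is the learning rate, γ the discount factor, and S_, A_ are the next state and next action); this is just a restatement of what the pseudocode and the code below express:

Q-learning: Q(S, A) ← Q(S, A) + α [ R + γ · max_a Q(S_, a) − Q(S, A) ]
SARSA:      Q(S, A) ← Q(S, A) + α [ R + γ · Q(S_, A_) − Q(S, A) ]

The only difference is the target: Q-learning takes the maximum over all actions a in the next state, while SARSA plugs in the action A_ that the agent will actually take.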
Here is how to modify the Q-learning code from Reinforcement Learning (I); it is very simple.
The main function:
if __name__ == "__main__":
    env = Env()
    agent = QLearningAgent(actions=list(range(env.n_actions)))

    for episode in range(1000):  # run 1000 episodes
        state = env.reset()
        # the agent generates the current action; str() converts the state into a string key
        action = agent.get_action(str(state))

        while True:
            env.render()

            next_state, reward, done = env.step(action)
            # get the next action A_ for the next state
            next_action = agent.get_action(str(next_state))

            # update the Q table
            agent.learn(str(state), action, reward, str(next_state), next_action)

            state = next_state  # state update
            action = next_action
            env.print_value_all(agent.q_table)

            # when the end is reached, stop the game and start a new round of training
            if done:
                break
The main changes are: obtaining the current action for the current state is moved outside the while loop; a line that obtains the next action A_ is added inside the loop; and the agent.learn call is modified. The original Q-learning agent.learn takes four inputs, while this one takes five. Compare the two calls below:
agent.learn(str(state), action, reward, str(next_state)) #Q-learning
agent.learn(str(state), action, reward, str(next_state),next_action) #SARSA
Of course, once the call is changed, the learn function must be changed accordingly:
def learn(self, state, action, reward, next_state, next_action):
    # look up the current Q value for this state-action pair in the Q table
    current_q = self.q_table[state][action]
    # Bellman equation update: the SARSA target uses the action actually taken at the next step
    new_q = reward + self.discount_factor * self.q_table[next_state][next_action]
    self.q_table[state][action] += self.learning_rate * (new_q - current_q)
Here the function definition also gains a next_action parameter, and in the computation of new_q the main change is the self.q_table term. Compare the two versions:
new_q = reward + self.discount_factor * max(self.q_table[next_state]) #Q-learning
new_q = reward + self.discount_factor * self.q_table[next_state][next_action] #SARSA
This comes directly from the Q-learning and SARSA algorithms; you can analyze it together with their pseudocode.
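As a quick sanity check of the difference, here is a tiny self-contained example; the toy q_table and the numbers are made up purely for illustration and are not from the original project:

# hypothetical toy Q table: state -> list of Q values, one per action
q_table = {
    "s0": [0.0, 0.5, 0.2, 0.1],
    "s1": [1.0, 0.3, 0.8, 0.0],
}

discount_factor = 0.9
reward = 1.0
next_state = "s1"
next_action = 1  # the action the agent actually takes next (its Q value is 0.3)

# Q-learning target: best action in the next state
new_q_qlearning = reward + discount_factor * max(q_table[next_state])        # 1.0 + 0.9 * 1.0 = 1.9
# SARSA target: the action actually taken in the next state
new_q_sarsa = reward + discount_factor * q_table[next_state][next_action]    # 1.0 + 0.9 * 0.3 = 1.27

print(new_q_qlearning, new_q_sarsa)  # 1.9 1.27

When the next action happens to be an exploratory (random) one with a low Q value, the SARSA target is pulled down accordingly, which is one way to see why the SARSA agent behaves more cautiously near the cliff.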
Finally, here are the results of SARSA:

At the beginning, SARSA keeps trying and failing: it falls into the triangles, and of course it also reaches the circle area. After running for a while, though, the SARSA agent tends to linger in the four white squares in the upper-left corner, exactly the "fear of death" mentioned earlier. But because we give the actions 10% randomness, the agent can still leave those four squares; it will then reach the circle area with high probability, and entering the triangle area is not ruled out either. Even so, the final training result was very good, and you can see that the two squares in the lower-right corner were never explored at all.
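For reference, the 10% randomness mentioned above usually comes from ε-greedy action selection. The get_action implementation is not shown in this article, so the following is only a minimal sketch of what it might look like, assuming an exploration rate of 0.1 and a q_table that maps a state string to a list of Q values:

import random

EPSILON = 0.1  # assumed exploration rate, matching the 10% randomness mentioned above

def get_action(q_values, epsilon=EPSILON):
    # q_values: list of Q values for every action in the current state
    if random.random() < epsilon:
        # exploration: pick a completely random action
        return random.randrange(len(q_values))
    # exploitation: pick a greedy action, breaking ties randomly
    max_q = max(q_values)
    best_actions = [a for a, q in enumerate(q_values) if q == max_q]
    return random.choice(best_actions)

# example: in a state with Q values [0.0, 0.5, 0.2, 0.1], this picks action 1
# about 90% of the time and a uniformly random action about 10% of the time
print(get_action([0.0, 0.5, 0.2, 0.1]))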
Finally, a note: I have only recently started learning reinforcement learning, and I still need to review the classical algorithms. I write these articles firstly for my own study, and secondly to show them to other beginners like me, so that we can discuss anything that is unclear. Later articles will probably spend less time on the classical algorithms, since the teachers on Bilibili have already explained them very clearly; I will mainly share things when I have my own code implementation or modification. Thank you!
Update:
There is a small problem with Q-learning in the green pseudocode diagram. The term marked in orange, γmax(s', a'), is the issue: it should be γmax(s', a), because Q-learning's Q-value calculation does not use the action taken at the next time step, so it should be a, not a'.