
Q-learning notes

2022-06-30 12:35:00 【Show brother invincible】

Emmm, I got pushed into learning reinforcement learning.
The idea behind reinforcement learning is actually easy to grasp: by interacting with the environment over and over, the agent corrects its behavior and learns which action to take next in each state so as to maximize its return.
I strongly recommend this Zhihu blogger:
https://www.zhihu.com/column/c_1215667894253830144
It explains things in plain language that I could actually follow. Everything else I found by searching jumped straight to formulas and theory, which left me completely lost... (Once you understand the process, the formulas turn out to be not that hard to follow.)
Let's look at the Q-Learning algorithm flow first and then go through it piece by piece. Here is the flowchart from Mofan Python:
[Figure: Q-Learning algorithm flowchart, from Mofan Python]

The first thing to note is that you need a basic Q table to start from. Without one, the agent has nothing to guide it toward the next state s'. This step corresponds to the first line of the flowchart, Initialize.
Next, episode: I looked it up, and an episode is a set of steps, that is, every step from the start of a game to its end. s is the initial state of the game.
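As a rough sketch of what "having a Q table" means, the snippet below builds one as a plain list of rows, with one row per state and one column per action. The function name `init_q_table`, the `scale` parameter, and the table size (6 states, 2 actions) are my own assumptions for illustration, not something taken from the flowchart.

```python
import random

def init_q_table(n_states, n_actions, scale=0.0, seed=None):
    """Return a Q table as a list of rows: q[state][action] -> estimated value."""
    rng = random.Random(seed)
    # scale=0.0 gives an all-zero table; scale>0 gives small random starting values.
    return [[rng.uniform(-scale, scale) for _ in range(n_actions)]
            for _ in range(n_states)]

# Hypothetical sizes: 6 states, 2 actions per state.
q = init_q_table(n_states=6, n_actions=2)
print(len(q), len(q[0]))
```

Either starting point (all zeros or small random values) is just a placeholder; the numbers only become meaningful after training.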
Next comes the question of off-policy versus on-policy.
For the definitions of the two, I referred to this article:
[Figure: screenshot of the definitions from the referenced article]

The difference between off-policy and on-policy is whether the policy used to generate data is the same as the policy used when updating the Q table to maximize return. Take Q-Learning as an example: when you actually play the game, the action you choose is of course the one with the largest trained Q(s,a) value. This is called the target policy.

 Target policy: the policy the agent is trying to learn

But the initial Q table is just given at random, and it takes many rounds of training to converge. So when taking an action during training, we need to be able to cover all the possible actions in a given state. The policy that does this is called the

 Behavior policy: the policy the agent uses to interact with the environment, that is, the policy that generates its behavior

When the two are the same, the algorithm is on-policy; when they differ, it is off-policy.
Now consider training. The agent follows an epsilon-greedy policy: with some probability it picks the action with the largest value in the current Q table, but not always. It can also pick another action, in which case the states and actions that follow will be different too. This is what lets it explore different actions.
As it keeps playing, the Q table gradually converges. When it is time to actually play, the agent follows the target policy based on the Q table in order to get a larger return.
So Q-Learning is an off-policy algorithm, because the policies used in these two stages are different.
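To see where the two policies show up in Q-Learning, it helps to write out the standard update rule. The symbols here are not from the notes above, so treat them as assumptions: \alpha is the learning rate, \gamma is the discount factor, r is the reward, and s' is the next state.

```latex
\[
Q(s,a) \leftarrow Q(s,a) + \alpha \left[\, r + \gamma \max_{a'} Q(s',a') - Q(s,a) \,\right]
\]
```

The action a on the left is whatever the behavior policy actually played in state s. The \max_{a'} term, on the other hand, assumes the best-looking action in s', which is exactly what the target policy would pick, whether or not the agent really takes that action next.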
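The epsilon-greedy choice described above can be sketched in a few lines. This is only a minimal illustration: the function name `epsilon_greedy` and the idea of passing in one row of the Q table (the values for the current state) are my own assumptions.

```python
import random

def epsilon_greedy(q_row, epsilon, rng=random):
    """Pick an action index from one row of the Q table (the values for one state)."""
    if rng.random() < epsilon:
        # Explore: any action, chosen uniformly at random.
        return rng.randrange(len(q_row))
    # Exploit: the action with the largest current Q value (first one on ties).
    return max(range(len(q_row)), key=lambda a: q_row[a])

# Hypothetical Q values for one state with three actions.
action = epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.1)
print(action)
```

With `epsilon=0.0` this is the pure greedy choice, and with `epsilon=1.0` it is pure random exploration; values in between trade one off against the other.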
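Putting the pieces together, here is a small sketch of the two stages on a made-up environment. Everything specific in it is an assumption for illustration: a 6-state corridor where the agent starts at state 0 and gets a reward of 1 for reaching state 5, plus the values chosen for `ALPHA`, `GAMMA`, and `EPSILON`. Training plays epsilon-greedy (behavior policy), the update uses the max over the next state's values (target policy), and `play` then follows the pure greedy policy.

```python
import random

N_STATES = 6          # hypothetical corridor: states 0..5, state 5 is the goal
ACTIONS = (-1, +1)    # action 0 moves left, action 1 moves right
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1   # assumed learning rate, discount, exploration rate

def step(state, action):
    """Toy environment: returns (next_state, reward, done)."""
    next_state = min(max(state + ACTIONS[action], 0), N_STATES - 1)
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done

def greedy(q_row, rng):
    """Highest Q value, with ties broken at random."""
    best = max(q_row)
    return rng.choice([a for a, v in enumerate(q_row) if v == best])

def train(episodes=300, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            # Behavior policy: epsilon-greedy decides what is actually played.
            if rng.random() < EPSILON:
                a = rng.randrange(len(ACTIONS))
            else:
                a = greedy(q[s], rng)
            s2, r, done = step(s, a)
            # Target policy: the update assumes the greedy action in s2,
            # whatever the behavior policy ends up choosing there.
            target = r if done else r + GAMMA * max(q[s2])
            q[s][a] += ALPHA * (target - q[s][a])
            s = s2
    return q

def play(q, max_steps=50):
    """Play one game with the pure greedy policy; returns the visited states."""
    s, path = 0, [0]
    for _ in range(max_steps):
        a = max(range(len(ACTIONS)), key=lambda i: q[s][i])
        s, _, done = step(s, a)
        path.append(s)
        if done:
            break
    return path

q = train()
print(play(q))
```

In this toy setup the greedy run walks straight from state 0 to the goal, even though training itself included random moves to the left along the way.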

Copyright notice
This article was written by [Show brother invincible]. Please include the original link when reposting. Thanks.
https://yzsam.com/2022/181/202206301005059201.html
