Q-learning notes
2022-06-30 12:35:00 【Show brother invincible】
Emmmmm, I have been forced into learning reinforcement learning.
The idea of reinforcement learning is actually easy to understand: by constantly interacting with the environment, the agent's behavior is corrected, so that it learns which action to take next in each state in order to maximize its return.
I strongly recommend this Zhihu blogger:
https://www.zhihu.com/column/c_1215667894253830144
It explains things in plain language and really made me understand. When I searched elsewhere, everything opened with formulas and theory, which left me thoroughly confused... (Once you understand the process, the formulas turn out not to be so hard to follow.)
Let's first look at the algorithm flow of Q-Learning and then go through it step by step. Here is the flowchart from Mofan Python:
The first thing to note is that you need a basic Q-table to start with. Without it you have nothing, and the agent has no guidance on how to move to the next state s'. This step corresponds to the first line, Initialize.
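The Initialize step can be sketched as follows. This is a minimal illustration, not the flowchart's own code: the sizes (6 states, 2 actions) and the name `init_q_table` are assumptions, and starting from zeros is just one common choice (small random values are another).

```python
def init_q_table(n_states, n_actions, initial_value=0.0):
    """Build a Q-table: one row per state, one column per action."""
    # Each row is a separate list so that updating one state
    # never changes the values stored for another state.
    return [[initial_value] * n_actions for _ in range(n_states)]

# Hypothetical toy sizes, chosen only for illustration.
q_table = init_q_table(n_states=6, n_actions=2)
print(len(q_table), len(q_table[0]))
```

Looking up `q_table[s][a]` then gives the current estimate of Q(s,a).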
Next comes the episode. I looked it up: it is a set of steps, that is, every step from the start of a game to its end. s is the initial state of the game.
Next is the question of off-policy versus on-policy.
For the definitions of the two, I referred to this article:
The difference between off-policy and on-policy is whether the policy used to generate data is the same as the policy used when updating the Q-table to maximize return. Take Q-Learning as an example: when you actually play the game, you of course choose the action with the largest trained Q(s,a) value. This is called the target policy.
Target policy: the policy the agent is trying to learn.
However, as mentioned, the initial Q-table is given at random and needs many rounds of training to converge. So during training, when taking an action, the agent must be able to try every possible action in a given state. This is called the behavior policy.
Behavior policy: the policy the agent uses to interact with the environment, that is, the policy that generates its behavior.
When the two are the same, the algorithm is on-policy; when they differ, it is off-policy.
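One way to see the distinction is to compare the value each kind of algorithm moves Q(s,a) toward. The sketch below is an illustration under assumptions: SARSA, which these notes do not cover, is used only as the usual on-policy contrast, and the function names and the discount factor `gamma` are hypothetical.

```python
def q_learning_target(q_table, reward, next_state, gamma):
    """Off-policy target: assumes the best action is taken in next_state,
    whatever the behavior policy actually does there."""
    return reward + gamma * max(q_table[next_state])

def sarsa_target(q_table, reward, next_state, next_action, gamma):
    """On-policy target: uses the action the behavior policy really chose."""
    return reward + gamma * q_table[next_state][next_action]

# Made-up values: in state 1, action 0 is worth 1.0 and action 1 is worth 5.0.
q_table = [[0.0, 0.0], [1.0, 5.0]]
print(q_learning_target(q_table, reward=1.0, next_state=1, gamma=0.9))
print(sarsa_target(q_table, reward=1.0, next_state=1, next_action=0, gamma=0.9))
```

If the behavior policy happens to explore action 0 in the next state, the two targets differ: the Q-Learning target still uses the larger value.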
Now consider training. The agent follows an epsilon-greedy policy: with a certain probability it picks the action with the largest value in the current Q-table, but not always. It can also pick other actions, in which case the subsequent states and actions will differ. This is what lets it explore different actions.
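A minimal sketch of epsilon-greedy action selection, assuming the Q-values of the current state are given as a plain list `q_row` and that `epsilon` is the probability of exploring (both names are illustrative):

```python
import random

def epsilon_greedy(q_row, epsilon, rng=random):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))                     # explore
    return max(range(len(q_row)), key=lambda a: q_row[a])    # exploit

# With epsilon = 0 the choice is purely greedy.
print(epsilon_greedy([0.1, 0.7, 0.3], epsilon=0.0))
```

Note that `max` returns the first best action when several values tie; breaking ties at random is a common alternative when the table starts out all equal.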
By playing again and again, the Q-table gradually converges. When it is time to actually play, the agent follows the target policy based on the Q-table in order to obtain a larger return.
Therefore Q-Learning is an off-policy algorithm, because the policies used in these two stages are different: epsilon-greedy for generating data, greedy with respect to the Q-table for the policy being learned.
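Putting the pieces together, here is a small runnable sketch of the whole loop. Everything specific in it is an assumption made for illustration: the toy environment (a corridor of 6 cells with the goal at the right end), the reward of 1 at the goal, the hyperparameters, and all function names. It is not the code behind the flowchart.

```python
import random

N_STATES = 6          # corridor cells 0..5; cell 5 is the goal (hypothetical toy task)
ACTIONS = (0, 1)      # 0 = left, 1 = right
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1  # illustrative hyperparameters

def step(state, action):
    """Toy environment: move one cell; reward 1 only on reaching the goal."""
    next_state = max(state - 1, 0) if action == 0 else state + 1
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done

def greedy(q_row, rng):
    """Target policy: best-valued action, ties broken at random."""
    best = max(q_row)
    return rng.choice([a for a in ACTIONS if q_row[a] == best])

def behave(q_row, rng):
    """Behavior policy: epsilon-greedy exploration."""
    if rng.random() < EPSILON:
        return rng.choice(ACTIONS)
    return greedy(q_row, rng)

def train(episodes=300, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]    # Initialize the Q-table
    for _ in range(episodes):
        state, done = 0, False                   # s = initial state of the game
        while not done:
            action = behave(q[state], rng)       # data comes from the behavior policy
            next_state, reward, done = step(state, action)
            # The target assumes the greedy (target) policy from next_state on.
            target = reward + (0.0 if done else GAMMA * max(q[next_state]))
            q[state][action] += ALPHA * (target - q[state][action])
            state = next_state
    return q

q_table = train()
# "Playing" stage: follow the target policy read off the learned Q-table.
policy = [greedy(q_table[s], random.Random(0)) for s in range(N_STATES - 1)]
print(policy)
```

In this toy task the learned greedy policy should be to move right in every cell. The off-policy part is visible in `train`: actions are chosen by `behave`, but the update target uses `max(q[next_state])`.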