1. Reinforcement Learning Fundamentals
2022-07-05 20:18:00 【C--G】
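Probability Theory Refresher
Random Variable
A coin toss is a random event: heads and tails each occur with probability 0.5. By convention, uppercase X denotes the random variable and lowercase x denotes an observed value.
Probability Density Function (PDF)
The probability density function describes how likely a random variable is to take a value near a given point.
Gaussian distribution
Discrete probability distribution
For a continuous random variable, the density integrates to 1 over its domain; for a discrete random variable, the probabilities of all possible values sum to 1.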
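The normalization property above can be checked numerically. The following is a minimal sketch, assuming a standard Gaussian for the continuous case and the fair coin for the discrete case; the helper names `gaussian_pdf` and `integrate` are made up for this illustration.

```python
import math

def gaussian_pdf(x, mu=0.0, sigma=1.0):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def integrate(f, lo, hi, steps=10_000):
    """Trapezoidal-rule approximation of the integral of f over [lo, hi]."""
    width = (hi - lo) / steps
    total = 0.5 * (f(lo) + f(hi))
    for i in range(1, steps):
        total += f(lo + i * width)
    return total * width

# Continuous case: the Gaussian density integrates to (approximately) 1.
continuous_total = integrate(gaussian_pdf, -8.0, 8.0)

# Discrete case: the probabilities of a fair coin sum to 1.
coin = {"heads": 0.5, "tails": 0.5}
discrete_total = sum(coin.values())
```

The integration range of eight standard deviations on either side is an arbitrary choice; the mass outside it is negligible.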
Random Sampling
Random sampling draws an observed value from a distribution according to its probabilities.
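As a rough illustration of random sampling, the sketch below draws coin tosses with Python's standard `random.choices`. The seed and the sample size are arbitrary assumptions made so the run is repeatable.

```python
import random
from collections import Counter

rng = random.Random(0)  # fixed seed so the run is repeatable

outcomes = ["heads", "tails"]
probabilities = [0.5, 0.5]

# Draw 10,000 observed values x of the random variable X.
samples = rng.choices(outcomes, weights=probabilities, k=10_000)

# Empirical frequency of each outcome; these approach 0.5 as the sample grows.
frequencies = {outcome: count / len(samples) for outcome, count in Counter(samples).items()}
```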

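Reinforcement Learning Basics
Reinforcement learning terminology

state: the situation the agent currently observes
action: the move the agent makes
agent: the entity that takes actions
policy: a probability density function over actions
It gives the probability of each action in a given state. A stochastic policy is closer to reality and makes the agent's behavior harder to predict.
reward: the feedback the agent receives
Rewards have to be designed for the task at hand. For example: +1 for collecting a coin, +10000 for clearing the game, -10000 when Mario dies, and 0 when nothing happens. The goal of reinforcement learning is to increase the reward obtained.
state transition: the move from one state to the next
State transitions are random. Only the environment knows the state-transition probability density function; the player does not.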
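The reward scheme and the idea of a policy as a probability table can be written down directly. This is a sketch only: the event names, the action probabilities and the sample episode are hypothetical, while the reward values are the ones listed above.

```python
def reward(event):
    """Reward scheme from the Mario example; any other event counts as 'nothing happened'."""
    table = {"coin": 1, "win": 10000, "die": -10000}
    return table.get(event, 0)

# Hypothetical policy for one state: the probability of each action.
policy = {"left": 0.2, "right": 0.1, "up": 0.7}

def is_valid_distribution(probs, tol=1e-9):
    """A policy must assign non-negative probabilities that sum to 1."""
    return all(p >= 0 for p in probs.values()) and abs(sum(probs.values()) - 1.0) <= tol

# Hypothetical sequence of events in one episode and the reward it accumulates.
events = ["nothing", "coin", "coin", "win"]
total_reward = sum(reward(e) for e in events)
```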
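Overview

The agent takes an action, the environment's state changes and a reward is returned to the agent, and the agent learns from that reward.
- Sources of randomness in reinforcement learning
Randomness of the action: the action is sampled from the policy function.
Randomness of the state: the next state is sampled from the state-transition function.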
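The interaction loop and its two sources of randomness can be sketched with a toy example. Everything here is an assumption made for illustration: a five-position corridor, a fixed policy that prefers moving right, and a transition that succeeds with probability 0.8. No learning takes place; the sketch only shows where the random draws happen.

```python
import random

ACTIONS = ["left", "right"]

def policy(state, rng):
    """Toy stochastic policy: action randomness comes from this sampling step."""
    return rng.choices(ACTIONS, weights=[0.3, 0.7], k=1)[0]

def environment_step(state, action, rng):
    """Toy environment on positions 0..4: state randomness comes from this sampling step.

    The intended move succeeds with probability 0.8; otherwise the agent stays put.
    Reaching position 4 ends the episode with reward 1; every other step gives 0.
    """
    move = 1 if action == "right" else -1
    if rng.random() < 0.8:
        state = min(4, max(0, state + move))
    done = state == 4
    reward = 1 if done else 0
    return state, reward, done

def run_episode(rng, max_steps=100):
    """Run the observe -> act -> receive reward loop and record (state, action, reward)."""
    state, trajectory = 0, []
    for _ in range(max_steps):
        action = policy(state, rng)
        next_state, reward, done = environment_step(state, action, rng)
        trajectory.append((state, action, reward))
        state = next_state
        if done:
            break
    return trajectory

trajectory = run_episode(random.Random(0))
```

Running `run_episode` with different seeds gives different trajectories, even though the policy and the environment are unchanged.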
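- How an AI plays a game

Observe state s1; the agent uses the policy function to choose and execute action a1; the environment generates a new state s2 and returns reward r1 to the agent; the agent then uses the policy function again to execute action a2; and the cycle repeats.
- Rewards and Returns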

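- Return
return: the cumulative future reward.
Ut is the sum of the rewards from Rt through Rn, the reward at the end of the game. A reward received now should carry more weight than one received later: 80 yuan today is worth more than 100 yuan tomorrow, for example.
γ: the discount rate used for the discounted return, a value between 0 and 1. With discounting, Ut = Rt + γ·Rt+1 + γ^2·Rt+2 + ... + γ^(n-t)·Rn.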
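A small sketch of the discounted return, assuming a finite list of rewards from time t to the end of the game. The reward values and the choice of γ = 0.5 are hypothetical.

```python
def discounted_return(rewards, gamma):
    """Compute R_t + gamma*R_{t+1} + gamma^2*R_{t+2} + ... for a finite reward list."""
    total = 0.0
    # Work backwards so that each step applies U_k = R_k + gamma * U_{k+1}.
    for r in reversed(rewards):
        total = r + gamma * total
    return total

rewards = [1, 0, 0, 10]  # hypothetical rewards R_t .. R_n

undiscounted = discounted_return(rewards, gamma=1.0)  # plain sum of rewards
discounted = discounted_return(rewards, gamma=0.5)    # later rewards count for less
```

With γ = 1 every reward has the same weight; with γ = 0.5 the final reward of 10 is multiplied by 0.5^3 before it is added.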
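- Randomness of the return

The return at time t depends on the rewards from time t through time n. Each reward depends on the state and action at that step, so the return also depends on the states and actions.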
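- Value Function
action-value function

For Ut, St and At have been observed, while St+1 through Sn and At+1 through An are random variables.
The distribution of St+1 depends on St and At, and the distribution of At+1 depends on St+1.
state-value function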
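The formulas for this part did not survive in the text. What follows are the usual textbook definitions of the two value functions, written in the notation used above; they are offered as a reference sketch rather than a copy of the original figures. The sum in the last line assumes a discrete action set.

```latex
\begin{aligned}
U_t &= R_t + \gamma R_{t+1} + \gamma^2 R_{t+2} + \cdots \\
Q_\pi(s_t, a_t) &= \mathbb{E}\left[\, U_t \mid S_t = s_t,\ A_t = a_t \,\right] \\
Q^{*}(s_t, a_t) &= \max_{\pi} Q_\pi(s_t, a_t) \\
V_\pi(s_t) &= \mathbb{E}_{A \sim \pi(\cdot \mid s_t)}\left[\, Q_\pi(s_t, A) \,\right]
            = \sum_{a} \pi(a \mid s_t)\, Q_\pi(s_t, a)
\end{aligned}
```

The expectation in the action-value function averages over the random future states and actions, which is why it depends only on the observed state, the observed action and the policy.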



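- AI controls the agent

There are two ways to choose an action. With a policy function π(a|s), which gives the probability of each action in state s, the agent samples an action from it. With the optimal action-value function Q*(s,a), which scores every action, the agent picks the action with the highest score.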
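The two ways of choosing an action can be sketched as follows. Both tables are hypothetical numbers for a single state s; in practice they would come from a learned policy function or a learned action-value function.

```python
import random

# Hypothetical values for one state s; neither table comes from the original post.
policy_probs = {"left": 0.2, "right": 0.1, "up": 0.7}   # pi(a | s)
q_star = {"left": 370.0, "right": -21.0, "up": 610.0}   # Q*(s, a)

def act_with_policy(probs, rng):
    """Policy-based control: sample an action a ~ pi(. | s)."""
    actions = list(probs)
    return rng.choices(actions, weights=[probs[a] for a in actions], k=1)[0]

def act_with_q(q_values):
    """Value-based control: choose a = argmax over a of Q*(s, a)."""
    return max(q_values, key=q_values.get)

sampled_action = act_with_policy(policy_probs, random.Random(0))
greedy_action = act_with_q(q_star)
```

The policy-based choice is random and can differ between runs, while the value-based choice always picks the highest-scoring action, here `"up"`.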
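Evaluating Reinforcement Learning
OpenAI Gym
OpenAI Gym is a toolkit of standard environments commonly used to test and compare reinforcement learning algorithms.

Summary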
