1. Reinforcement Learning Fundamentals
2022-07-05 20:18:00 【C--G】
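Probability Theory Refresher
Random Variable
A coin flip is a random event: heads and tails each occur with probability 0.5. By convention, an uppercase X denotes the random variable and a lowercase x denotes an observed value.
Probability Density Function (PDF)
The probability density function describes how likely a random variable is to take a value near a given point.
Gaussian distribution
Discrete probability distribution
For a continuous random variable, the density integrates to 1 over its domain; for a discrete random variable, the probabilities of all possible values sum to 1.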
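The normalization property can be checked numerically. The sketch below is only an illustration: it assumes a standard Gaussian for the continuous case and the fair coin from above for the discrete case, and the helper names `gaussian_pdf` and `integrate` are made up for this example.

```python
import math

def gaussian_pdf(x, mu=0.0, sigma=1.0):
    """Density of a Gaussian with mean mu and standard deviation sigma."""
    coeff = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return coeff * math.exp(-0.5 * ((x - mu) / sigma) ** 2)

def integrate(f, lo, hi, steps=100_000):
    """Approximate the integral of f on [lo, hi] with the trapezoidal rule."""
    h = (hi - lo) / steps
    total = 0.5 * (f(lo) + f(hi))
    for i in range(1, steps):
        total += f(lo + i * h)
    return total * h

# Continuous case: the density integrates to (approximately) 1.
continuous_total = integrate(gaussian_pdf, -10.0, 10.0)

# Discrete case: the probabilities of all outcomes sum to 1.
coin = {"heads": 0.5, "tails": 0.5}
discrete_total = sum(coin.values())

print(continuous_total, discrete_total)
```

The integration range of [-10, 10] is an arbitrary choice; the Gaussian tails outside it are negligible.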
Random Sampling
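Random sampling draws an observed value x of the random variable X according to its probability distribution.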
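As a small illustration of random sampling, the sketch below draws repeated coin flips with Python's standard `random` module. The seed and the sample size are arbitrary choices made for this example.

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable

outcomes = ["heads", "tails"]
probs = [0.5, 0.5]

# Draw 10,000 observed values x of the random variable X.
samples = rng.choices(outcomes, weights=probs, k=10_000)

# The empirical frequency should be close to the true probability 0.5.
heads_freq = samples.count("heads") / len(samples)
print(heads_freq)
```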

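Reinforcement Learning Basics
Key Reinforcement Learning Terms

state: the situation the agent currently observes
action: what the agent does in a given state
agent: the entity that takes actions
policy: the policy function π(a|s), a probability density function over actions
It gives the probability of each action in the current state. A stochastic policy is closer to real situations and makes the agent's behavior harder to predict.
reward: the feedback the agent receives
Rewards have to be designed for the task at hand. For example: +1 for collecting a coin, +10000 for clearing the game, -10000 when Mario dies, and 0 when nothing happens. The goal of reinforcement learning is to increase the total reward obtained.
state transition: the move from the current state to the next state
State transitions are random. The state-transition probability density function is known only to the environment, not to the player.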
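The reward scheme and the stochastic policy described above can be sketched as follows. The reward values come from the example in the text; the event names, the action names, and the action probabilities are hypothetical and chosen only for illustration.

```python
import random

# Reward values from the example in the text; the event names are made up.
REWARDS = {"coin": 1, "win": 10_000, "die": -10_000, "nothing": 0}

def reward(event):
    """Return the reward for a (hypothetical) game event."""
    return REWARDS[event]

# Hypothetical stochastic policy for one state: pi(a | s) over three actions.
policy_for_state = {"left": 0.2, "right": 0.1, "up": 0.7}

def sample_action(action_probs, rng):
    """Sample one action according to the given action probabilities."""
    actions = list(action_probs)
    weights = [action_probs[a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]

rng = random.Random(0)
action = sample_action(policy_for_state, rng)
print(action, reward("coin"))
```

Because the action is sampled rather than fixed, repeated calls in the same state can return different actions.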
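Overview

The agent takes an action; the environment's state changes and the environment returns a reward to the agent; the agent learns from the reward.
- Sources of randomness in reinforcement learning
Randomness of the action: the action is sampled from the policy function
Randomness of the state: the next state is sampled from the state-transition function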
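The interaction between agent and environment, with both sources of randomness, can be sketched with a toy example. Everything here is an assumption made for illustration: a one-dimensional world with positions 0 to `GOAL`, a policy that moves right with probability 0.8, and an environment in which a move succeeds with probability 0.9.

```python
import random

ACTIONS = ["left", "right"]
GOAL = 3  # hypothetical goal position

def policy(state, rng):
    """Randomness of the action: move right with probability 0.8."""
    return rng.choices(ACTIONS, weights=[0.2, 0.8], k=1)[0]

def step(state, action, rng):
    """Randomness of the state: the move succeeds with probability 0.9."""
    move = 1 if action == "right" else -1
    if rng.random() < 0.9:
        next_state = max(0, state + move)
    else:
        next_state = state  # the agent slips and stays where it is
    done = next_state >= GOAL
    reward = 10 if done else 0
    return next_state, reward, done

def run_episode(seed, max_steps=100):
    """Run one episode and record (state, action, reward) at every step."""
    rng = random.Random(seed)
    state, trajectory = 0, []
    for _ in range(max_steps):
        action = policy(state, rng)
        next_state, reward, done = step(state, action, rng)
        trajectory.append((state, action, reward))
        state = next_state
        if done:
            break
    return trajectory

trajectory = run_episode(seed=0)
for s, a, r in trajectory:
    print(s, a, r)
```

Running `run_episode` with different seeds gives different trajectories, since both the actions and the state transitions are sampled.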
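- How AI plays a game

The agent observes state s1 and uses the policy function to choose action a1. The environment then generates a new state s2 and returns reward r1 to the agent. The agent uses the policy function again to choose action a2, and the cycle repeats until the game ends.
- Rewards and Returns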

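- Return
return: the cumulative future reward
U_t is the sum of the rewards from R_t up to R_n at the end of the game: U_t = R_t + R_{t+1} + ... + R_n. A present reward should carry more weight than a later one; for instance, 80 yuan today is more tangible than 100 yuan tomorrow.
γ: the discount rate, between 0 and 1. The discounted return is U_t = R_t + γ·R_{t+1} + γ²·R_{t+2} + ...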
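A minimal sketch of the return computation, assuming the usual discounted form U_t = R_t + γ·R_{t+1} + γ²·R_{t+2} + ... for a finite episode. The function name `discounted_return` and the reward list are made up for this example.

```python
def discounted_return(rewards, gamma):
    """Compute U_t for rewards [R_t, R_{t+1}, ..., R_n] and discount rate gamma."""
    u = 0.0
    # Work backwards through the episode: U_k = R_k + gamma * U_{k+1}.
    for r in reversed(rewards):
        u = r + gamma * u
    return u

rewards = [1, 1, 1]

undiscounted = discounted_return(rewards, gamma=1.0)  # plain sum of rewards
discounted = discounted_return(rewards, gamma=0.5)    # later rewards count less

print(undiscounted, discounted)
```

With γ = 1 the result is the plain sum of the rewards; with γ below 1, rewards further in the future contribute less.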
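- Randomness of the return

The return at time t depends on the rewards from time t through n. Each reward depends on the state and the action, so the return also depends on the states and actions.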
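- Value Function
action-value function

For U_t, the values of S_t and A_t have been observed, while S_{t+1}, ..., S_n and A_{t+1}, ..., A_n are still random variables.
The distribution of S_{t+1} depends on S_t and A_t; the distribution of A_{t+1} depends on S_{t+1}.
The action-value function takes the expectation of U_t over these random variables: Q_π(s_t, a_t) = E[U_t | S_t = s_t, A_t = a_t].
state-value function
The state-value function additionally takes the expectation over the action: V_π(s_t) = E_A[Q_π(s_t, A)], with A drawn from π(·|s_t).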
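For a discrete action set, the relation between the two value functions can be sketched as below. It assumes the standard definition V_π(s) = Σ_a π(a|s)·Q_π(s, a); the action names and all numbers are hypothetical.

```python
# Hypothetical numbers for a single state s with three actions.
pi_s = {"left": 0.2, "right": 0.1, "up": 0.7}  # pi(a | s)
q_s = {"left": 1.0, "right": 3.0, "up": 2.0}   # Q_pi(s, a)

def state_value(action_probs, action_values):
    """V_pi(s): the action values averaged under the policy's probabilities."""
    return sum(action_probs[a] * action_values[a] for a in action_probs)

v_s = state_value(pi_s, q_s)
print(v_s)
```

Q_π scores one particular action in a state, while V_π scores the state itself under the policy.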



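- How AI controls the agent

There are two ways to choose actions. With a policy function π(a|s), the agent observes state s and samples its action from π(·|s). With the optimal action-value function Q*(s, a), the agent scores every action in state s and chooses the one with the highest score.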
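The two ways of choosing an action can be contrasted in a short sketch. The probabilities and Q* scores below are hypothetical, and the function names are made up for this example.

```python
import random

def act_with_policy(action_probs, rng):
    """Policy-based control: sample the action from pi(. | s)."""
    actions = list(action_probs)
    weights = [action_probs[a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]

def act_with_q(action_values):
    """Value-based control: pick the action with the highest Q*(s, a)."""
    return max(action_values, key=action_values.get)

# Hypothetical values for one state s.
pi_s = {"left": 0.2, "right": 0.1, "up": 0.7}
q_star_s = {"left": 370.0, "right": -21.0, "up": 610.0}

rng = random.Random(0)
sampled_action = act_with_policy(pi_s, rng)
greedy_action = act_with_q(q_star_s)
print(sampled_action, greedy_action)
```

The first choice is random and may differ between calls; the second is deterministic for fixed scores.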
Evaluating Reinforcement Learning
OpenAI Gym



Summary
