[Reinforcement learning notes] Common symbols in reinforcement learning
2022-06-25 08:35:00 【Allenpandas】
| Symbol | Meaning |
|---|---|
| $\doteq$ | Equality by definition |
| $\approx$ | Approximately equal to |
| $\epsilon$ | Probability of taking a random action under an $\epsilon$-greedy policy |
| $\gamma$ | Discount factor |
| $\lambda$ | Decay rate for eligibility traces |
| $\leftarrow$ | Assignment |
| $s$, $s'$ | States |
| $a$ | Action |
| $r$ | Reward |
| $t$ | Discrete time step |
| $\pi$ | Policy (decision rule) |
| $\pi(s)$ | Action taken in state $s$ under the deterministic policy $\pi$ |
| $\pi(a \mid s)$ | Probability of taking action $a$ in state $s$ under the stochastic policy $\pi$ |
| $A_t$ | Action at time $t$ |
| $S_t$ | State at time $t$, typically determined stochastically by $S_{t-1}$ and $A_{t-1}$ |
| $R_t$ | Reward at time $t$, typically determined stochastically by $S_{t-1}$ and $A_{t-1}$ |
| $G_t$ | Return following time $t$ (the cumulative discounted reward from $t$ onward) |
| $p(s', r \mid s, a)$ | Probability of transitioning to state $s'$ with reward $r$, from state $s$ taking action $a$ |
| $p(s' \mid s, a)$ | Probability of transitioning to state $s'$, from state $s$ taking action $a$ |
| $r(s, a)$ | Expected immediate reward from state $s$ after taking action $a$ |
| $r(s, a, s')$ | Expected immediate reward on the transition from state $s$ to state $s'$ under action $a$ |
| $v_\pi(s)$ | Value of state $s$ under policy $\pi$ (expected return) |
| $v_*(s)$ | Value of state $s$ under the optimal policy |
| $q_\pi(s, a)$ | Value of taking action $a$ in state $s$ under policy $\pi$ |
| $q_*(s, a)$ | Value of taking action $a$ in state $s$ under the optimal policy |
| $V$, $V_t$ | Estimate of the state-value function |
| $Q$, $Q_t$ | Estimate of the action-value function |
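
To make a few of these symbols concrete, here is a minimal Python sketch. It assumes the usual discounted-return definition $G_t \doteq R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots$ and a tabular list of action-value estimates $Q(s, \cdot)$. The function names `discounted_return` and `epsilon_greedy` are made up for illustration and do not come from any library.

```python
import random


def discounted_return(rewards, gamma):
    """Return G_0 for the reward sequence R_1, R_2, ... and discount factor gamma."""
    g = 0.0
    # Work backwards with the recursion G_t = R_{t+1} + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


def epsilon_greedy(q_values, epsilon, rng=random):
    """Choose an action index given the estimates Q(s, a) for one state s."""
    # With probability epsilon, explore: pick an action uniformly at random.
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    # Otherwise exploit: pick the action with the highest estimated value.
    return max(range(len(q_values)), key=lambda a: q_values[a])


if __name__ == "__main__":
    print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))
    print(epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.1))
```

With `epsilon=0.0` the selection is purely greedy, and with `epsilon=1.0` it is uniformly random. Ties in the greedy branch resolve to the lowest action index here, which is only one of several reasonable choices.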