Reinforcement learning basics: study notes
2022-07-06 13:52:00 【I like the strengthened Xiaobai in Curie】
Comparison of Q-learning and SARSA in reinforcement learning
I am a beginner in multi-agent reinforcement learning. Recently I have been studying the basics of reinforcement learning, and I am recording them here in case I forget.
1. Q-learning
Q-learning is the most basic reinforcement learning algorithm. It stores the state-action values Q(s,a) in a Q-table, so it can be used for problems with a small state space; when the state space is high-dimensional, it has to be combined with a neural network and extended into the DQN algorithm to handle such problems.
- Value-based
- Off-Policy
I have read many blog posts about on-policy versus off-policy and still had not quite understood the difference, which left me confused. A couple of days ago I read a blogger's answer that finally gave me a deeper understanding; the link is attached here.
Link: What is the difference between on-policy and off-policy?
When Q-learning updates, although the data was produced by the current policy, the policy being updated is not the one that generated the data (note the max in the update formula). It can be understood like this: the max operation picks the action with the largest Q value in the new state to update the Q-table, but the action actually executed in the episode is not necessarily that one, so Q-learning is off-policy.
- Pseudo code
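For reference (the pseudo-code figure from the original post is not reproduced here), the Q-learning update rule that the max above refers to is

$Q(s,a) \leftarrow Q(s,a) + \alpha \big[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \big]$

which is exactly the q_target / q_pred computation in the code below.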
- Implementation
The environment used here is the treasure-hunt game from the teacher's tutorial. It is maintained as a list rendered like --#--T: the last cell T is the treasure and # marks the player's current position. Walking to the rightmost cell finds the treasure and ends the game.
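For illustration, a tiny sketch of the rendering convention used by update_env below (the player marker '#' overlaid on the strip):

env_list = ['-'] * 5 + ['T']   # five path cells plus the treasure at the far right
env_list[2] = '#'              # suppose the player is currently at state 2
print(''.join(env_list))       # prints: --#--T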
The code implementation is based on a blogger's post whose link I can no longer find.
import numpy as np
import pandas as pd
import time

N_STATES = 6        # length of the 1-D world (6 states)
ACTIONS = [-1, 1]   # two actions, -1: move left, 1: move right
epsilon = 0.9       # greedy rate: act greedily with probability 0.9
alpha = 0.1         # learning rate
gamma = 0.9         # discount factor
max_episodes = 10   # maximum number of episodes
fresh_time = 0.3    # pause between moves when rendering

# Q-table: one row per state, one column per action
q_table = pd.DataFrame(np.zeros((N_STATES, len(ACTIONS))), columns=ACTIONS)

# choose action: act randomly while the state is still unexplored (or with
# probability 1 - epsilon); otherwise pick the action with the largest Q value
def choose_action(state, table):
    state_actions = table.iloc[state, :]
    if np.random.uniform() > epsilon or (state_actions == 0).all():
        action = np.random.choice(ACTIONS)
    else:
        action = state_actions.idxmax()   # idxmax returns the column label (-1 or 1)
    return action

def get_env_feedback(state, action):
    # new state = current state + move
    new_state = state + action
    reward = 0
    # moving right gets closer to the treasure: +0.5 reward
    if action > 0:
        reward += 0.5
    # moving left gets farther from the treasure: -0.5 reward
    if action < 0:
        reward -= 0.5
    # the next step reaches the treasure: give the highest reward, +1
    if new_state == N_STATES - 1:
        reward += 1
    # walking off the left edge gives the lowest reward, -1;
    # note that new_state must be clamped here, otherwise indexing fails
    if new_state < 0:
        new_state = 0
        reward -= 1
    return new_state, reward

def update_env(state, epoch, step):
    env_list = ['-'] * (N_STATES - 1) + ['T']
    if state == N_STATES - 1:
        # reached the treasure
        print("")
        print("epoch=" + str(epoch) + ", step=" + str(step), end='')
        time.sleep(2)
    else:
        env_list[state] = '#'
        print('\r' + ''.join(env_list), end='')
        time.sleep(fresh_time)

def q_learning():
    for epoch in range(max_episodes):
        step = 0    # number of moves in this episode
        state = 0   # initial state
        update_env(state, epoch, step)
        while state != N_STATES - 1:
            cur_action = choose_action(state, q_table)
            new_state, reward = get_env_feedback(state, cur_action)
            q_pred = q_table.loc[state, cur_action]
            if new_state != N_STATES - 1:
                q_target = reward + gamma * q_table.loc[new_state, :].max()
            else:
                q_target = reward
            q_table.loc[state, cur_action] += alpha * (q_target - q_pred)
            state = new_state
            update_env(state, epoch, step)
            step += 1
    return q_table

q_learning()
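To see what was learned, here is a small sketch (not part of the original post) that prints the final Q-table and the greedy action per state; given the reward shaping above, the right action (1) should end up with the larger Q value in every state the agent visits:

print(q_table)                  # learned Q(s, a) for every state-action pair
print(q_table.idxmax(axis=1))   # greedy action per state (the terminal row stays all-zero)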
2. SARSA
SARSA is also one of the most basic algorithms in reinforcement learning, and it likewise stores Q(s,a) in a Q-table. It is called SARSA because one transition contains the quintuple (s, a, r, s', a'), i.e. S-A-R-S-A.
- Value-based
- On-Policy
Comparing with Q-learning, we can see that the data used here is produced by the current policy, and when the Q value is updated it uses the Q value of the new state and the new action, and that new action will then actually be executed (note there is no max), so SARSA is on-policy.
- Pseudo code
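For comparison with the Q-learning rule above, the SARSA update bootstraps from the action a' that was actually chosen (and will be executed) in the next state, with no max:

$Q(s,a) \leftarrow Q(s,a) + \alpha \big[ r + \gamma\, Q(s',a') - Q(s,a) \big]$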
- Implementation
The implementation makes simple changes to the Q-learning code above: based on the new state, another action is chosen and then actually executed; in addition, when updating the Q value, the target is computed directly from the Q value of that new state-action pair.
import numpy as np
import pandas as pd
import time

N_STATES = 6        # length of the 1-D world (6 states)
ACTIONS = [-1, 1]   # two actions, -1: move left, 1: move right
epsilon = 0.9       # greedy rate: act greedily with probability 0.9
alpha = 0.1         # learning rate
gamma = 0.9         # discount factor
max_episodes = 10   # maximum number of episodes
fresh_time = 0.3    # pause between moves when rendering

# Q-table: an all-zero table of shape (N_STATES, len(ACTIONS))
q_table = pd.DataFrame(np.zeros((N_STATES, len(ACTIONS))), columns=ACTIONS)

# choose action: greedy with probability 0.9, random with probability 0.1
# (and random while the state is still unexplored), to keep some exploration
def choose_action(state, table):
    state_actions = table.iloc[state, :]
    if np.random.uniform() > epsilon or (state_actions == 0).all():
        action = np.random.choice(ACTIONS)
    else:
        action = state_actions.idxmax()   # idxmax returns the column label (-1 or 1)
    return action

def get_env_feedback(state, action):
    # new state = current state + move
    new_state = state + action
    reward = 0
    # moving right gets closer to the treasure: +0.5 reward
    if action > 0:
        reward += 0.5
    # moving left gets farther from the treasure: -0.5 reward
    if action < 0:
        reward -= 0.5
    # the next step reaches the treasure: give the highest reward, +1
    if new_state == N_STATES - 1:
        reward += 1
    # walking off the left edge gives the lowest reward, -1;
    # note that new_state must be clamped here, otherwise indexing fails
    if new_state < 0:
        new_state = 0
        reward -= 1
    return new_state, reward

# render the environment
def update_env(state, epoch, step):
    env_list = ['-'] * (N_STATES - 1) + ['T']
    if state == N_STATES - 1:
        # reached the treasure
        print("")
        print("epoch=" + str(epoch) + ", step=" + str(step), end='')
        time.sleep(2)
    else:
        env_list[state] = '#'
        print('\r' + ''.join(env_list), end='')
        time.sleep(fresh_time)

# update the Q-table with the SARSA rule
def sarsa():
    for epoch in range(max_episodes):
        step = 0    # number of moves in this episode
        state = 0   # initial state
        update_env(state, epoch, step)
        cur_action = choose_action(state, q_table)
        while state != N_STATES - 1:
            new_state, reward = get_env_feedback(state, cur_action)
            new_action = choose_action(new_state, q_table)
            q_pred = q_table.loc[state, cur_action]
            if new_state != N_STATES - 1:
                q_target = reward + gamma * q_table.loc[new_state, new_action]
            else:
                q_target = reward
            q_table.loc[state, cur_action] += alpha * (q_target - q_pred)
            state, cur_action = new_state, new_action
            update_env(state, epoch, step)
            step += 1
    return q_table

sarsa()
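Put side by side, the entire difference between the two implementations is the TD target computed inside the loop (both lines copied from the code above):

# Q-learning (off-policy): bootstrap from the best action available in the new state
q_target = reward + gamma * q_table.loc[new_state, :].max()
# SARSA (on-policy): bootstrap from the action actually chosen for the new state
q_target = reward + gamma * q_table.loc[new_state, new_action]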
This is my first blog post, so my understanding may have problems; please point out any mistakes.