Notes on reinforcement learning basics
2022-07-06 13:52:00 【I like the strengthened Xiaobai in Curie】
A comparison of Q-learning and SARSA in reinforcement learning
I am a beginner in multi-agent reinforcement learning. Recently I have been studying the fundamentals of reinforcement learning, and I am recording them here in case I forget.
1. Q-learning
Q-learning is the most basic reinforcement learning algorithm. It stores the state-action values Q(s, a) in a Q table, which works for problems with a small state space. When the state space is high-dimensional, the table has to be combined with a neural network, extending Q-learning into the DQN algorithm.
- Value-based
- Off-Policy
I had read many blog posts about on-policy vs. off-policy without really grasping the difference and remained confused. A couple of days ago I read an answer by a blogger that finally gave me a deeper understanding; the link is attached here.
Link: What is the difference between on-policy and off-policy?
When Q-learning performs an update, the data it uses is generated by the current policy, but the policy being updated is not the one that generated that data (note the max in the update formula). It can be understood like this: the max operation picks the action with the larger Q value in the next state to update the Q table, but the action actually executed in the episode is not necessarily that one, so Q-learning is off-policy.
- Pseudocode
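The heart of the pseudo-code is the standard one-step update rule (in the code below this corresponds to the q_target / q_pred computation):

Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right]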

- Implementation
The environment used here is the treasure-hunt game from the teacher's tutorial. It is maintained as a list rendered like "-#---T": the last position T is the treasure, # marks the player's current position, and walking to the rightmost cell finds the treasure and ends the game.
The code implementation follows a blogger's post whose link I can no longer find.
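For reference, with the settings below, update_env renders the world like this as the agent walks right from the start (# is the agent, T is the treasure; reaching state 5 is terminal and prints the episode summary instead):

#----T
-#---T
--#--T
---#-T
----#T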
import numpy as np
import pandas as pd
import time

N_STATES = 6         # 6 states: the length of the one-dimensional world
ACTIONS = [-1, 1]    # two actions: -1 = move left, 1 = move right
epsilon = 0.9        # greedy rate: probability of exploiting the current Q values
alpha = 0.1          # learning rate
gamma = 0.9          # discount factor for future rewards
max_episodes = 10    # maximum number of episodes
fresh_time = 0.3     # pause between moves when rendering

# Q table: one row per state, one column per action, initialized to zeros
q_table = pd.DataFrame(np.zeros((N_STATES, len(ACTIONS))), columns=ACTIONS)
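
# For orientation: the freshly created q_table is all zeros, one row per state (0-5)
# and one column per action (-1 = left, 1 = right):
#      -1    1
# 0   0.0  0.0
# 1   0.0  0.0
# 2   0.0  0.0
# 3   0.0  0.0
# 4   0.0  0.0
# 5   0.0  0.0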

# choose action: explore randomly when the state is still unexplored or with probability
# 1 - epsilon; otherwise pick the action with the largest Q value
def choose_action(state, table):
    state_actions = table.iloc[state, :]
    if np.random.uniform() > epsilon or (state_actions == 0).all():
        action = np.random.choice(ACTIONS)
    else:
        action = state_actions.idxmax()  # label of the best action (-1 or 1), not its position
    return action

def get_env_feedback(state, action):
    # new state = current state + move
    new_state = state + action
    reward = 0
    # moving right gets closer to the treasure: +0.5
    if action > 0:
        reward += 0.5
    # moving left gets farther from the treasure: -0.5
    if action < 0:
        reward -= 0.5
    # the next step reaches the treasure: give the highest reward, +1
    if new_state == N_STATES - 1:
        reward += 1
    # moving left from the leftmost cell: give the lowest reward, an extra -1;
    # the new state must also be clipped back to 0, otherwise indexing fails later
    if new_state < 0:
        new_state = 0
        reward -= 1
    return new_state, reward
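
# Reward scheme implied by the rules above:
#   move right into a non-terminal cell    : +0.5
#   move right onto the treasure (state 5) : +0.5 + 1 = +1.5
#   move left while staying in bounds      : -0.5
#   move left from state 0 (hit the wall)  : -0.5 - 1 = -1.5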

def update_env(state, epoch, step):
    env_list = ['-'] * (N_STATES - 1) + ['T']
    if state == N_STATES - 1:
        # reached the treasure: report the episode and the number of steps taken
        print("")
        print("epoch=" + str(epoch) + ", step=" + str(step), end='')
        time.sleep(2)
    else:
        env_list[state] = '#'
        print('\r' + ''.join(env_list), end='')
        time.sleep(fresh_time)

def q_learning():
    for epoch in range(max_episodes):
        step = 0    # number of moves in this episode
        state = 0   # initial state
        update_env(state, epoch, step)
        while state != N_STATES - 1:
            cur_action = choose_action(state, q_table)
            new_state, reward = get_env_feedback(state, cur_action)
            q_pred = q_table.loc[state, cur_action]
            if new_state != N_STATES - 1:
                # bootstrap from the best action in the new state (the off-policy max)
                q_target = reward + gamma * q_table.loc[new_state, :].max()
            else:
                # terminal state: no future value
                q_target = reward
            q_table.loc[state, cur_action] += alpha * (q_target - q_pred)
            state = new_state
            update_env(state, epoch, step)
            step += 1
    return q_table

q_learning()
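
As a quick sanity check (a small addition on top of the code above; greedy_policy is just an illustrative name), the greedy action per state can be read off the learned table. With the reward shaping used here, every non-terminal state should end up preferring action 1 (move right), though the exact Q values differ from run to run because of the ε-greedy exploration:

greedy_policy = q_table.idxmax(axis=1)  # column label (-1 or 1) with the largest Q value per state
print(greedy_policy)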
2. SARSA
SARSA is also one of the most basic reinforcement learning algorithms, and it likewise stores Q(s, a) in a Q table. It is called SARSA because a single transition contains the quintuple (s, a, r, s', a'), hence the name.
- Value-based
- On-Policy
Comparing with Q-learning makes the difference clear: the data here is also produced by the current policy, but when the Q value is updated, the target is based on the Q value of the new state and the new action, and that new action really is executed next (note that there is no max). Therefore SARSA is on-policy.
- Pseudocode
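The heart of the SARSA pseudo-code is the standard one-step update:

Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma\, Q(s',a') - Q(s,a) \right]

where a' is the action actually selected in s' by the current ε-greedy policy and then executed.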

- Implementation
The implementation makes only small changes to the Q-learning code: after reaching the new state, another action is chosen and that action is the one executed next; in addition, when updating the Q value, the target uses the Q value of that specific (new state, new action) pair directly.
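As a minimal, self-contained sketch of that one-line difference (a toy Q table filled with random values, not the treasure-hunt table below; the random new_action simply stands in for choose_action):

import numpy as np
import pandas as pd

# toy Q table with the same layout as below: 6 states, actions -1 (left) and 1 (right)
q = pd.DataFrame(np.random.rand(6, 2), columns=[-1, 1])
reward, gamma, new_state = 0.5, 0.9, 3

# Q-learning target: bootstrap from the best action available in the new state (off-policy max)
q_target_qlearning = reward + gamma * q.loc[new_state, :].max()

# SARSA target: bootstrap from the action that will actually be taken next (on-policy, no max)
new_action = np.random.choice([-1, 1])   # stand-in for choose_action(new_state, q)
q_target_sarsa = reward + gamma * q.loc[new_state, new_action]

print(q_target_qlearning, q_target_sarsa)

Everything else (the environment and the ε-greedy action selection) is shared between the two algorithms.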
import numpy as np
import pandas as pd
import time

N_STATES = 6         # 6 states: the length of the one-dimensional world
ACTIONS = [-1, 1]    # two actions: -1 = move left, 1 = move right
epsilon = 0.9        # greedy rate: probability of exploiting the current Q values
alpha = 0.1          # learning rate
gamma = 0.9          # discount factor for future rewards
max_episodes = 10    # maximum number of episodes
fresh_time = 0.3     # pause between moves when rendering

# generate an empty (N_STATES, len(ACTIONS)) Q table
q_table = pd.DataFrame(np.zeros((N_STATES, len(ACTIONS))), columns=ACTIONS)

# choose action: act greedily with probability 0.9; with probability 0.1 (or when the
# state is still unexplored) pick a random action to keep exploring
def choose_action(state, table):
    state_actions = table.iloc[state, :]
    if np.random.uniform() > epsilon or (state_actions == 0).all():
        action = np.random.choice(ACTIONS)
    else:
        action = state_actions.idxmax()  # label of the best action (-1 or 1), not its position
    return action

def get_env_feedback(state, action):
    # new state = current state + move
    new_state = state + action
    reward = 0
    # moving right gets closer to the treasure: +0.5
    if action > 0:
        reward += 0.5
    # moving left gets farther from the treasure: -0.5
    if action < 0:
        reward -= 0.5
    # the next step reaches the treasure: give the highest reward, +1
    if new_state == N_STATES - 1:
        reward += 1
    # moving left from the leftmost cell: give the lowest reward, an extra -1;
    # the new state must also be clipped back to 0, otherwise indexing fails later
    if new_state < 0:
        new_state = 0
        reward -= 1
    return new_state, reward

# render the environment
def update_env(state, epoch, step):
    env_list = ['-'] * (N_STATES - 1) + ['T']
    if state == N_STATES - 1:
        # reached the treasure: report the episode and the number of steps taken
        print("")
        print("epoch=" + str(epoch) + ", step=" + str(step), end='')
        time.sleep(2)
    else:
        env_list[state] = '#'
        print('\r' + ''.join(env_list), end='')
        time.sleep(fresh_time)

# SARSA: update the Q table from (s, a, r, s', a') transitions
def sarsa():
    for epoch in range(max_episodes):
        step = 0    # number of moves in this episode
        state = 0   # initial state
        update_env(state, epoch, step)
        cur_action = choose_action(state, q_table)
        while state != N_STATES - 1:
            new_state, reward = get_env_feedback(state, cur_action)
            new_action = choose_action(new_state, q_table)
            q_pred = q_table.loc[state, cur_action]
            if new_state != N_STATES - 1:
                # bootstrap from the action that will actually be executed next (no max)
                q_target = reward + gamma * q_table.loc[new_state, new_action]
            else:
                # terminal state: no future value
                q_target = reward
            q_table.loc[state, cur_action] += alpha * (q_target - q_pred)
            state, cur_action = new_state, new_action
            update_env(state, epoch, step)
            step += 1
    return q_table

sarsa()
This is my first blog post, so my understanding may well contain mistakes; corrections are welcome.