Reinforcement Learning Basics: Study Notes
2022-07-06 13:52:00 【I like the strengthened Xiaobai in Curie】
A Comparison of Q-learning and SARSA in Reinforcement Learning
I am a beginner in multi-agent reinforcement learning and have recently been studying the fundamentals of RL. I am recording them here in case I forget.
1. Q-learning
Q-learning is the most basic reinforcement learning algorithm. It stores the state-action values Q(s,a) in a table, so it can be used for problems with a small state space. When the state space is high-dimensional, it needs to be combined with a neural network, which extends it into the DQN algorithm.
- Value-based
- Off-Policy
I read many blog posts about on-policy vs. off-policy but still could not quite grasp the difference and remained confused. A couple of days ago I read a blogger's answer that finally gave me a deeper understanding; the link is attached here.
Link: What is the difference between on-policy and off-policy?
When Q-learning updates, although the data used is produced by the current policy, the policy being updated is not the one that generated the data (note the max in the update formula). It can be understood like this: the max operation selects the action with the larger Q-value to update the Q-table, but that action may not be the one actually executed in the episode, so Q-learning is off-policy.
- Pseudocode
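In symbols, the standard tabular Q-learning update (the form the code below implements) is:

```latex
Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]
```

The max over a' is exactly the max discussed above: the target bootstraps from the greedy next action, which is not necessarily the action the behaviour policy will actually take.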
- Implementation
The environment used here is the treasure-hunt game from the teacher's tutorial. It is maintained as a list, e.g. `-#---T`: the last cell `T` is the treasure and `#` is the player's current position. Walk to the rightmost cell to find the treasure and the game is over.
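For example (a minimal sketch of the rendering, matching the implementation below), with a 6-cell world and the player in cell 1 the display string is built like this:

```python
N_STATES = 6                                # length of the one-dimensional world
env_list = ['-'] * (N_STATES - 1) + ['T']   # treasure in the last cell
state = 1                                   # player's current cell
env_list[state] = '#'
print(''.join(env_list))  # -#---T
```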
The code implementation follows a blogger's post; I can no longer find the link.
```python
import numpy as np
import pandas as pd
import time

N_STATES = 6        # 6 states: the length of the one-dimensional world
ACTIONS = [-1, 1]   # two actions, -1: left, 1: right
epsilon = 0.9       # greedy probability (explore with probability 1 - epsilon)
alpha = 0.1         # learning rate
gamma = 0.9         # discount factor
max_episodes = 10   # maximum number of episodes
fresh_time = 0.3    # interval between moves when rendering

# Q-table: one row per state, one column per action
q_table = pd.DataFrame(np.zeros((N_STATES, len(ACTIONS))), columns=ACTIONS)

# Choose an action: pick randomly when exploring or when the state has not
# been explored yet; otherwise pick the action with the largest Q-value.
def choose_action(state, table):
    state_actions = table.iloc[state, :]
    if np.random.uniform() > epsilon or (state_actions == 0).all():
        action = np.random.choice(ACTIONS)
    else:
        # idxmax() returns the column label (the action itself),
        # not the positional index that argmax() would return
        action = state_actions.idxmax()
    return action

def get_env_feedback(state, action):
    # new state = current state + move
    new_state = state + action
    reward = 0
    # moving right gets closer to the treasure: +0.5
    if action > 0:
        reward += 0.5
    # moving left moves away from the treasure: -0.5
    if action < 0:
        reward -= 0.5
    # the next step reaches the treasure: highest reward, +1
    if new_state == N_STATES - 1:
        reward += 1
    # moving left from the leftmost cell: lowest reward, -1;
    # note that new_state must also be clamped here, otherwise
    # indexing the Q-table would raise an error
    if new_state < 0:
        new_state = 0
        reward -= 1
    return new_state, reward

# Render the environment
def update_env(state, epoch, step):
    env_list = ['-'] * (N_STATES - 1) + ['T']
    if state == N_STATES - 1:
        # reached the treasure
        print("")
        print("epoch=" + str(epoch) + ", step=" + str(step), end='')
        time.sleep(2)
    else:
        env_list[state] = '#'
        print('\r' + ''.join(env_list), end='')
        time.sleep(fresh_time)

def q_learning():
    for epoch in range(max_episodes):
        step = 0    # number of moves so far
        state = 0   # initial state
        update_env(state, epoch, step)
        while state != N_STATES - 1:
            cur_action = choose_action(state, q_table)
            new_state, reward = get_env_feedback(state, cur_action)
            q_pred = q_table.loc[state, cur_action]
            if new_state != N_STATES - 1:
                # off-policy target: bootstrap with the max over next actions
                q_target = reward + gamma * q_table.loc[new_state, :].max()
            else:
                q_target = reward
            q_table.loc[state, cur_action] += alpha * (q_target - q_pred)
            state = new_state
            update_env(state, epoch, step)
            step += 1
    return q_table

q_learning()
```
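As an aside on the greedy branch of choose_action: with the action labels -1 and 1 used as DataFrame columns, the lookup must use pandas idxmax() (which returns the column label, i.e. the action itself) rather than argmax() (which returns the positional index 0 or 1). A minimal check:

```python
import numpy as np
import pandas as pd

# A tiny Q-table with action labels -1 (left) and 1 (right) as columns.
q = pd.DataFrame(np.zeros((3, 2)), columns=[-1, 1])
q.loc[0, 1] = 0.5  # moving right from state 0 has the higher value

# idxmax() returns the column LABEL, which is what the greedy
# branch needs to index the Q-table on the next update.
best = q.loc[0, :].idxmax()
print(best)  # 1, i.e. the "move right" action
```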
2. SARSA
SARSA is also one of the most basic algorithms in reinforcement learning, and it likewise stores Q(s,a) in a Q-table. It is called SARSA because one transition contains the quintuple (s, a, r, s', a'): State, Action, Reward, next State, next Action.
- Value-based
- On-Policy
Comparing with Q-learning, we can see that the data used here is produced by the current policy, and when the Q-value is updated, the update is based on the Q-value of the new state and the new action, and that new action will actually be executed (note there is no max), so SARSA is on-policy.
- Pseudocode
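The SARSA update differs from the Q-learning one only in the bootstrapping term: it uses the Q-value of the next action actually chosen rather than the max:

```latex
Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma\, Q(s', a') - Q(s, a) \right]
```

where a' is the action selected (and later executed) in state s' by the same epsilon-greedy policy.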
- Implementation
This makes simple changes to the Q-learning code above: here, based on the new state, another action is chosen, and that action will be executed. In addition, when updating the Q-value, the update uses the corresponding Q(s', a') directly.
```python
import numpy as np
import pandas as pd
import time

N_STATES = 6        # 6 states: the length of the one-dimensional world
ACTIONS = [-1, 1]   # two actions, -1: left, 1: right
epsilon = 0.9       # greedy probability (explore with probability 1 - epsilon)
alpha = 0.1         # learning rate
gamma = 0.9         # discount factor
max_episodes = 10   # maximum number of episodes
fresh_time = 0.3    # interval between moves when rendering

# Q-table of shape (N_STATES, len(ACTIONS)), initialised to zeros
q_table = pd.DataFrame(np.zeros((N_STATES, len(ACTIONS))), columns=ACTIONS)

# Choose an action: greedy with probability 0.9, random with probability 0.1
# (or when the state has not been explored yet) to keep exploring.
def choose_action(state, table):
    state_actions = table.iloc[state, :]
    if np.random.uniform() > epsilon or (state_actions == 0).all():
        action = np.random.choice(ACTIONS)
    else:
        # idxmax() returns the column label (the action itself)
        action = state_actions.idxmax()
    return action

def get_env_feedback(state, action):
    # new state = current state + move
    new_state = state + action
    reward = 0
    # moving right gets closer to the treasure: +0.5
    if action > 0:
        reward += 0.5
    # moving left moves away from the treasure: -0.5
    if action < 0:
        reward -= 0.5
    # the next step reaches the treasure: highest reward, +1
    if new_state == N_STATES - 1:
        reward += 1
    # moving left from the leftmost cell: lowest reward, -1;
    # new_state must also be clamped here, otherwise indexing fails
    if new_state < 0:
        new_state = 0
        reward -= 1
    return new_state, reward

# Render the environment
def update_env(state, epoch, step):
    env_list = ['-'] * (N_STATES - 1) + ['T']
    if state == N_STATES - 1:
        # reached the treasure
        print("")
        print("epoch=" + str(epoch) + ", step=" + str(step), end='')
        time.sleep(2)
    else:
        env_list[state] = '#'
        print('\r' + ''.join(env_list), end='')
        time.sleep(fresh_time)

# Update the Q-table with the SARSA rule
def sarsa():
    for epoch in range(max_episodes):
        step = 0    # number of moves so far
        state = 0   # initial state
        update_env(state, epoch, step)
        cur_action = choose_action(state, q_table)
        while state != N_STATES - 1:
            new_state, reward = get_env_feedback(state, cur_action)
            new_action = choose_action(new_state, q_table)
            q_pred = q_table.loc[state, cur_action]
            if new_state != N_STATES - 1:
                # on-policy target: bootstrap with the next action
                # that will actually be executed
                q_target = reward + gamma * q_table.loc[new_state, new_action]
            else:
                q_target = reward
            q_table.loc[state, cur_action] += alpha * (q_target - q_pred)
            state, cur_action = new_state, new_action
            update_env(state, epoch, step)
            step += 1
    return q_table

sarsa()
```
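To make the one-line difference between the two targets concrete, here is a small standalone sketch (the numbers are made up for illustration): given the same reward and the same next-state Q-values, Q-learning always bootstraps from the max, while SARSA bootstraps from whichever action the epsilon-greedy policy actually picked next.

```python
import pandas as pd

gamma = 0.9
reward = 0.5

# Hypothetical Q-values for the next state's two actions (left: -1, right: 1).
next_q = pd.Series([0.2, 0.8], index=[-1, 1])

# Q-learning (off-policy): the target takes the max over next actions,
# regardless of which action the behaviour policy executes next.
q_learning_target = reward + gamma * next_q.max()   # 0.5 + 0.9 * 0.8

# SARSA (on-policy): the target uses the action actually taken next.
# Suppose the epsilon-greedy policy happened to explore and chose left (-1):
new_action = -1
sarsa_target = reward + gamma * next_q[new_action]  # 0.5 + 0.9 * 0.2
```

When exploration picks a non-greedy action, the two targets diverge, which is exactly why Q-learning is off-policy and SARSA is on-policy.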
This is my first blog post, so my understanding may contain mistakes. Corrections are welcome.