Reinforcement learning basics: study notes
2022-07-06 13:52:00 【I like the strengthened Xiaobai in Curie】
A comparison of Q-learning and Sarsa in reinforcement learning
I am a beginner in multi-agent reinforcement learning. Recently I have been studying the basics of reinforcement learning, and I am recording them here in case I forget.
1. Q-learning
Q-learning is one of the most basic reinforcement learning algorithms. It stores the state-action value Q(s,a) in a Q-table, so it can be used for problems with a small state space. When the state space is high-dimensional, it needs to be combined with a neural network, which extends it into the DQN algorithm.
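For reference, the quantity the table stores is the standard action-value function, i.e. the expected discounted return from taking action a in state s and then following policy π:

$$Q^{\pi}(s,a) = \mathbb{E}_{\pi}\!\left[\sum_{k=0}^{\infty}\gamma^{k} r_{t+k+1} \,\middle|\, s_t = s,\ a_t = a\right]$$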
- Value-based
- Off-Policy
I read many blog posts about on-policy vs. off-policy and still could not quite grasp the difference; it was quite confusing. A couple of days ago I read a blogger's answer that finally gave me a deeper understanding, and a link is attached here.
Link: What is the difference between on-policy and off-policy?
When Q-learning updates, although the data used is produced by the current policy, the policy being updated is not the one that generated this data (note the max in the update formula). It can be understood this way: the max operation picks the action with the larger Q value to update the Q-table, but the action actually executed in the episode may not be that one, so Q-learning is off-policy.
- Pseudo code
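The core update behind the pseudo code and the implementation below is the standard Q-learning rule; the max over next actions is exactly the off-policy part discussed above:

$$Q(s_t,a_t) \leftarrow Q(s_t,a_t) + \alpha\left[r_{t+1} + \gamma \max_{a} Q(s_{t+1},a) - Q(s_t,a_t)\right]$$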
- Implementation
The environment used here is the treasure-hunt game from the teacher's tutorial, maintained as a character list such as --#--T: the last cell T is the treasure and # marks the player's current position (here, state 2). Reach the rightmost cell to find the treasure and the episode ends.
The code implementation follows a blogger's post; I can no longer find the link.
import numpy as np
import pandas as pd
import time
N_STATES = 6  # 6 states: length of the one-dimensional world
ACTIONS = [-1, 1]  # two actions: -1 = left, 1 = right
epsilon = 0.9  # probability of acting greedily (explore with probability 1 - epsilon)
alpha = 0.1  # learning rate
gamma = 0.9  # discount factor
max_episodes = 10  # maximum number of episodes
fresh_time = 0.3  # refresh interval between moves (seconds)
# Q-table: one row per state, one column per action, initialised to zero
q_table = pd.DataFrame(np.zeros((N_STATES, len(ACTIONS))), columns=ACTIONS)
# choose action: act randomly when exploring or when the state has not been visited yet,
# otherwise pick the action with the largest Q value
def choose_action(state, table):
    state_actions = table.iloc[state, :]
    # explore with probability 1 - epsilon, or when all Q values for this state are still zero
    if np.random.uniform() > epsilon or (state_actions == 0).all():
        action = np.random.choice(ACTIONS)
    else:
        action = state_actions.idxmax()  # greedy: column label (-1 or 1) with the largest Q value
    return action
def get_env_feedback(state, action):
    # new state = current state + move
    new_state = state + action
    reward = 0
    # moving right gets closer to the treasure: +0.5 reward
    if action > 0:
        reward += 0.5
    # moving left moves away from the treasure: -0.5 reward
    if action < 0:
        reward -= 0.5
    # the next step reaches the treasure: give the largest reward +1
    if new_state == N_STATES - 1:
        reward += 1
    # moving left from the leftmost cell gets the lowest reward -1;
    # also clamp the new state back to 0, otherwise indexing the table would fail
    if new_state < 0:
        new_state = 0
        reward -= 1
    return new_state, reward
def update_env(state, epoch, step):
    env_list = ['-'] * (N_STATES - 1) + ['T']
    if state == N_STATES - 1:
        # reached the treasure
        print("")
        print("epoch=" + str(epoch) + ", step=" + str(step), end='')
        time.sleep(2)
    else:
        env_list[state] = '#'
        print('\r' + ''.join(env_list), end='')
        time.sleep(fresh_time)
def q_learning():
    for epoch in range(max_episodes):
        step = 0   # number of moves taken in this episode
        state = 0  # initial state
        update_env(state, epoch, step)
        while state != N_STATES - 1:
            cur_action = choose_action(state, q_table)
            new_state, reward = get_env_feedback(state, cur_action)
            q_pred = q_table.loc[state, cur_action]
            if new_state != N_STATES - 1:
                # off-policy target: bootstrap on the best action available in the new state
                q_target = reward + gamma * q_table.loc[new_state, :].max()
            else:
                q_target = reward  # terminal state: no bootstrapping
            q_table.loc[state, cur_action] += alpha * (q_target - q_pred)
            state = new_state
            step += 1
            update_env(state, epoch, step)
    return q_table
q_learning()
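As a quick check (a small extra snippet of mine, not from the referenced implementation), the greedy policy can be read off the learned table with idxmax over the action columns; on this toy task every non-terminal state should end up preferring the right action (+1).
print(q_table)
# greedy action (column label) per state; ties in never-visited rows resolve to the first column (-1)
print("greedy policy:", q_table.idxmax(axis=1).tolist())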
2. Sarsa
Sarsa is also one of the most basic algorithms in reinforcement learning, and it likewise stores Q(s,a) in a Q-table. It is called Sarsa because a single transition contains the quintuple (s, a, r, s', a'), hence the name.
- Value-based
- On-Policy
Comparing with Q-learning, we can see that the data used here is produced by the current policy, and when updating the Q value the target is based on the Q value of the new state and the new action, and that new action will actually be executed (note that there is no max), so Sarsa is on-policy.
- Pseudo code
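The corresponding Sarsa rule bootstraps on the action a_{t+1} that is actually selected and executed in the next state, which is what the q_target line in the code below computes:

$$Q(s_t,a_t) \leftarrow Q(s_t,a_t) + \alpha\left[r_{t+1} + \gamma\, Q(s_{t+1},a_{t+1}) - Q(s_t,a_t)\right]$$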
- Implementation
This is the Q-learning code with a few simple changes: given the new state, another action is chosen and that action is then actually executed; in addition, when updating the Q value, the target uses the Q value of that specific state-action pair directly.
import numpy as np
import pandas as pd
import time
N_STATES = 6  # 6 states: length of the one-dimensional world
ACTIONS = [-1, 1]  # two actions: -1 = left, 1 = right
epsilon = 0.9  # probability of acting greedily (explore with probability 1 - epsilon)
alpha = 0.1  # learning rate
gamma = 0.9  # discount factor
max_episodes = 10  # maximum number of episodes
fresh_time = 0.3  # refresh interval between moves (seconds)
# Q-table: an (N_STATES, len(ACTIONS)) table of Q values initialised to zero
q_table = pd.DataFrame(np.zeros((N_STATES, len(ACTIONS))), columns=ACTIONS)
# choose action: with probability 0.9 act greedily, with probability 0.1 pick a random
# action (or act randomly when the state has not been visited yet), to keep exploring
def choose_action(state, table):
    state_actions = table.iloc[state, :]
    # explore with probability 1 - epsilon, or when all Q values for this state are still zero
    if np.random.uniform() > epsilon or (state_actions == 0).all():
        action = np.random.choice(ACTIONS)
    else:
        action = state_actions.idxmax()  # greedy: column label (-1 or 1) with the largest Q value
    return action
def get_env_feedback(state, action):
    # new state = current state + move
    new_state = state + action
    reward = 0
    # moving right gets closer to the treasure: +0.5 reward
    if action > 0:
        reward += 0.5
    # moving left moves away from the treasure: -0.5 reward
    if action < 0:
        reward -= 0.5
    # the next step reaches the treasure: give the largest reward +1
    if new_state == N_STATES - 1:
        reward += 1
    # moving left from the leftmost cell gets the lowest reward -1;
    # also clamp the new state back to 0, otherwise indexing the table would fail
    if new_state < 0:
        new_state = 0
        reward -= 1
    return new_state, reward
# maintain / render the environment
def update_env(state, epoch, step):
    env_list = ['-'] * (N_STATES - 1) + ['T']
    if state == N_STATES - 1:
        # reached the treasure
        print("")
        print("epoch=" + str(epoch) + ", step=" + str(step), end='')
        time.sleep(2)
    else:
        env_list[state] = '#'
        print('\r' + ''.join(env_list), end='')
        time.sleep(fresh_time)
# update the Q-table with the Sarsa rule
def sarsa():
    for epoch in range(max_episodes):
        step = 0   # number of moves taken in this episode
        state = 0  # initial state
        update_env(state, epoch, step)
        cur_action = choose_action(state, q_table)
        while state != N_STATES - 1:
            new_state, reward = get_env_feedback(state, cur_action)
            new_action = choose_action(new_state, q_table)
            q_pred = q_table.loc[state, cur_action]
            if new_state != N_STATES - 1:
                # on-policy target: bootstrap on the action that will actually be taken next
                q_target = reward + gamma * q_table.loc[new_state, new_action]
            else:
                q_target = reward  # terminal state: no bootstrapping
            q_table.loc[state, cur_action] += alpha * (q_target - q_pred)
            state, cur_action = new_state, new_action
            step += 1
            update_env(state, epoch, step)
    return q_table
sarsa()
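To make the comparison explicit, the only substantive difference between the two training loops above is the bootstrap target (both lines are quoted from the listings):
# Q-learning (off-policy): bootstrap on the best action available in the new state
q_target = reward + gamma * q_table.loc[new_state, :].max()
# Sarsa (on-policy): bootstrap on the action that is actually chosen and executed next
q_target = reward + gamma * q_table.loc[new_state, new_action]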
This is my first blog post and my understanding may contain mistakes; corrections are welcome.