[Deep learning] Maze task learning I (implementing random movement of an agent)
2022-06-29 06:10:00 【electrochemjy】
Deep reinforcement learning with a maze, part I
These are my study notes on deep reinforcement learning. I start with a maze task to learn the basic ideas behind the reinforcement learning process.
【Maze task progression】
Stage 1: Implement an agent that searches the maze randomly until it reaches the goal.
Stage 2: Make the agent head directly for the goal (policy iteration).
Stage 3: Value iteration: assign values to the agent's states and actions, then pick the most valuable ones (which requires learning the correct values).
PS: This post records stage 1 only.
Build a maze
# Libraries used
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import animation
%matplotlib inline
# Draw the initial state of the maze
def plot():
    fig=plt.figure(figsize=(5,5))
    ax=plt.gca()
    # Draw the walls
    plt.plot([1,1],[0,1],color='red',linewidth=3)
    plt.plot([1,2],[2,2],color='red',linewidth=2)
    plt.plot([2,2],[2,1],color='red',linewidth=2)
    plt.plot([2,3],[1,1],color='red',linewidth=2)
    # Label the states S0 to S8
    plt.text(0.5,2.5,'S0',size=14,ha='center')
    plt.text(1.5,2.5,'S1',size=14,ha='center')
    plt.text(2.5,2.5,'S2',size=14,ha='center')
    plt.text(0.5,1.5,'S3',size=14,ha='center')
    plt.text(1.5,1.5,'S4',size=14,ha='center')
    plt.text(2.5,1.5,'S5',size=14,ha='center')
    plt.text(0.5,0.5,'S6',size=14,ha='center')
    plt.text(1.5,0.5,'S7',size=14,ha='center')
    plt.text(2.5,0.5,'S8',size=14,ha='center')
    plt.text(0.5,2.3,'START',ha='center')
    plt.text(2.5,0.3,'END',ha='center')
    # Set the drawing range and hide the ticks and tick labels
    ax.set_xlim(0,3)
    ax.set_ylim(0,3)
    plt.tick_params(axis='both',which='both',bottom=False,top=False,labelbottom=False,right=False,left=False,labelleft=False)
    # Mark the current position S0 with a green circle
    line,=ax.plot([0.5],[2.5],marker="o",color='g',markersize=60)
    # Show the figure
    plt.show()
    return fig  # return the figure so that fig=plot() below is not None
# Check the function (this is a static display)
fig=plot()

Implementing the agent
The rule that defines how an agent behaves is called a policy. It is written pi_theta(s, a), which reads "the probability of taking action a in state s under the policy pi defined by the parameters theta".
In this task, the state s is the position of the agent in the maze, and the action a is a move the agent can make from that state (up, right, down or left). The parameters theta determine the probability of taking each action in state s.
The initial parameters for the maze task can therefore be written as a matrix with one row per state and one column per action.
# Initial parameters of the maze. 1 means the agent can move in that direction,
# np.nan means there is a wall and it cannot. Columns are [up, right, down, left].
theta_0=np.array([[np.nan,1,1,np.nan], #S0
                  [np.nan,1,np.nan,1], #S1
                  [np.nan,np.nan,1,1], #S2
                  [1,1,1,np.nan], #S3
                  [np.nan,np.nan,1,1], #S4
                  [1,np.nan,np.nan,np.nan], #S5
                  [1,np.nan,np.nan,np.nan], #S6
                  [1,1,np.nan,np.nan], #S7
                  ]) # S8 is the goal, so it needs no policy
# Convert the theta values of the open directions into ratios and use them as probabilities
def simple_convert_into_pi_from_theta(theta):
    '''Simply compute the ratio of each entry within its row'''
    [m,n]=theta.shape # number of rows (states) and columns (actions) of theta
    pi=np.zeros((m,n))
    for i in range(0,m):
        pi[i,:]=theta[i,:]/np.nansum(theta[i,:]) # divide by the row sum, ignoring nan
    pi=np.nan_to_num(pi) # convert nan to 0, because the probability of moving into a wall is 0
    return pi
# Initial policy
pi_0=simple_convert_into_pi_from_theta(theta_0)
print("Initial policy pi_0 =",pi_0)
Initial policy pi_0 = [[0.         0.5        0.5        0.        ]
 [0.         0.5        0.         0.5       ]
 [0.         0.         0.5        0.5       ]
 [0.33333333 0.33333333 0.33333333 0.        ]
 [0.         0.         0.5        0.5       ]
 [1.         0.         0.         0.        ]
 [1.         0.         0.         0.        ]
 [0.5        0.5        0.         0.        ]]
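The row-by-row loop in the conversion function can also be written without a loop. The following is only a sketch of that idea: the name `convert_theta_to_pi_vectorized` is made up for this example and does not come from the original notes. It divides every row by its nan-ignoring sum in one step, then checks that each row of the result is a valid probability distribution.

```python
import numpy as np

# Same theta_0 as above: rows are S0..S7, columns are [up, right, down, left].
theta_0 = np.array([
    [np.nan, 1, 1, np.nan],       # S0
    [np.nan, 1, np.nan, 1],       # S1
    [np.nan, np.nan, 1, 1],       # S2
    [1, 1, 1, np.nan],            # S3
    [np.nan, np.nan, 1, 1],       # S4
    [1, np.nan, np.nan, np.nan],  # S5
    [1, np.nan, np.nan, np.nan],  # S6
    [1, 1, np.nan, np.nan],       # S7
])


def convert_theta_to_pi_vectorized(theta):
    """Hypothetical helper: normalize every row of theta without a Python loop."""
    # Sum of the open directions in each state; keepdims keeps shape (m, 1) for broadcasting.
    row_sums = np.nansum(theta, axis=1, keepdims=True)
    # nan entries (walls) stay nan after the division and are then mapped to probability 0.
    return np.nan_to_num(theta / row_sums)


pi_check = convert_theta_to_pi_vectorized(theta_0)
# Every row should sum to 1, otherwise it cannot be used as the p argument of a random choice.
assert np.allclose(pi_check.sum(axis=1), 1.0)
```

The row-sum check is worth keeping in any variant, because `np.random.choice` raises an error when the probabilities passed as `p` do not sum to 1.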
# Move the agent randomly according to its current state
# Given the state index s, find the state reached after one move
def get_next_s(pi,s):
    direction = ["up", "right", "down", "left"]
    next_direction=np.random.choice(direction,p=pi[s,:]) # pick a direction at random with probabilities pi[s,:]; s is the agent's state (0-8)
    # Determine the next state from the chosen action
    if next_direction=='up':
        s_next=s-3 # moving up decreases the state number by 3
    elif next_direction=="right":
        s_next = s + 1 # moving right increases the state number by 1
    elif next_direction=="down":
        s_next = s + 3 # moving down increases the state number by 3
    elif next_direction=="left":
        s_next = s - 1 # moving left decreases the state number by 1
    return s_next
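The chain of direction checks in `get_next_s` encodes a fixed offset for each action, so the same step can be expressed as a table lookup. This is a minimal sketch under that reading, not the original implementation: `ACTION_OFFSETS` and `get_next_s_lookup` are hypothetical names, and it uses NumPy's `default_rng` generator with a seed (an assumption made here so that runs are repeatable) instead of the global `np.random.choice`.

```python
import numpy as np

# Initial policy printed above: rows are S0..S7, columns are [up, right, down, left].
pi_0 = np.array([
    [0.0, 0.5, 0.5, 0.0],
    [0.0, 0.5, 0.0, 0.5],
    [0.0, 0.0, 0.5, 0.5],
    [1 / 3, 1 / 3, 1 / 3, 0.0],
    [0.0, 0.0, 0.5, 0.5],
    [1.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.0],
])

# Change in state index for [up, right, down, left] on the 3x3 grid numbered row by row.
ACTION_OFFSETS = [-3, 1, 3, -1]


def get_next_s_lookup(pi, s, rng):
    """Hypothetical variant of get_next_s: sample an action index, then add its offset."""
    action = rng.choice(len(ACTION_OFFSETS), p=pi[s, :])
    return s + ACTION_OFFSETS[int(action)]


rng = np.random.default_rng(seed=0)  # seeded only to make the example repeatable
print(get_next_s_lookup(pi_0, 5, rng))  # S5 can only move up, so this prints 2
```

Because the policy assigns probability 0 to every direction blocked by a wall, the lookup never produces a state outside 0 to 8 as long as `pi` was derived from a correct `theta_0`.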
# Keep moving the agent until it reaches the goal
def goal_maze(pi):
    s=0 # start at S0
    state_history=[0] # list that records the agent's trajectory
    while True:
        next_s=get_next_s(pi,s)
        state_history.append(next_s) # record the new state in the trajectory
        if next_s==8: # stop once the goal S8 is reached
            break
        else:
            s=next_s
    return state_history
# Trajectory of the agent from the start state S0 to the goal state S8
state_history=goal_maze(pi_0)
print("Path from S0 to S8:",state_history) # varies from run to run
print("Number of states visited:",len(state_history)) # varies from run to run; includes S0, so the number of moves is one less
# The agent moves randomly according to the probabilities, so the trajectory may differ on every run
Path from S0 to S8: [0, 3, 6, 3, 4, 7, 8]
Number of states visited: 7
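Since the trajectory differs on every run, a single run says little about how the random policy behaves overall. One way to get a feel for it is to repeat the episode many times and look at the distribution of move counts. The following is a rough sketch under stated assumptions: `run_episode`, the seed and the choice of 1000 episodes are all made up for this example.

```python
import numpy as np

# Initial policy printed above: rows are S0..S7, columns are [up, right, down, left].
pi_0 = np.array([
    [0.0, 0.5, 0.5, 0.0],
    [0.0, 0.5, 0.0, 0.5],
    [0.0, 0.0, 0.5, 0.5],
    [1 / 3, 1 / 3, 1 / 3, 0.0],
    [0.0, 0.0, 0.5, 0.5],
    [1.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.0],
])

OFFSETS = [-3, 1, 3, -1]  # state-index change for [up, right, down, left]


def run_episode(pi, rng, start=0, goal=8):
    """Hypothetical helper: random walk from start to goal, returning the visited states."""
    s = start
    history = [s]
    while s != goal:
        action = int(rng.choice(len(OFFSETS), p=pi[s]))
        s = s + OFFSETS[action]
        history.append(s)
    return history


rng = np.random.default_rng(seed=0)  # seeded only so the experiment is repeatable
# Number of moves per episode: one less than the number of recorded states.
step_counts = [len(run_episode(pi_0, rng)) - 1 for _ in range(1000)]
print("fewest moves :", min(step_counts))
print("average moves:", np.mean(step_counts))
print("most moves   :", max(step_counts))
```

Reading `theta_0`, the shortest route is S0, S3, S4, S7, S8, which takes four moves, so no episode can finish in fewer than that. How far the average sits above four gives a baseline against which the later stages can be compared.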
This completes stage 1 of the maze task.