Fundamentals of reinforcement learning: OpenAI Gym environment setup demo
2022-06-12 04:52:00 【sinat_28371…】
1. Gym Introduction
Gym is a simulation platform for researching and developing reinforcement learning algorithms. It requires no prior knowledge about the agent and consists of the following two parts:
- The Gym open-source library: a collection of test problems. When you test a reinforcement learning algorithm, the test problem is the environment; for example, when a robot plays a game, the set of environments includes the game's screens. These environments share a common interface, which lets users write general-purpose algorithms.
- The OpenAI Gym service: a site and API (for example, for the classic control problem CartPole-v0) that let users compare the results of their algorithms.
2. Gym installation
Gym works with Python 3.5+ and can be conveniently installed with pip:
pip install gym
If you need to install gym from source, you can run:
git clone https://github.com/openai/gym
cd gym
pip install -e .
Running pip install -e .[all] performs a full installation including all environments. This requires some additional dependencies, including cmake and a recent version of pip.
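A quick way to check that the installation works is to import gym and create a built-in environment. A minimal sanity check (the version attribute is only printed for reference) might look like this:
import gym

print(gym.__version__)           # installed gym version
env = gym.make('CartPole-v0')    # create a built-in environment
print(env.reset())               # initial observation of the environment
env.close()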
3. Gym usage demo
Simply put, OpenAI Gym provides interfaces to many problems and environments (or games). The user does not need to know much about a game's internal implementation; testing and simulation only require calling the interface. Below, the classic control problem CartPole-v0 is used as an example to get a quick feel for Gym's characteristics.
# Import the gym library
import gym
# Create the environment to use
env = gym.make('CartPole-v0')
# Initialize the environment
env.reset()
# Run the environment for 1000 time steps
for _ in range(1000):
    env.render()
    observation, reward, done, info = env.step(env.action_space.sample())  # take a random action
    if done:
        env.reset()
env.close()
Running this code opens a window that renders the CartPole animation.
As the code above shows, the core interface of gym is Env. As the unified environment interface, Env provides the following core methods (a minimal custom-environment sketch appears after the next paragraph):
- reset(self): resets the environment's state and returns the initial observation.
- step(self, action): advances the environment by one time step and returns observation, reward, done, info.
- render(self, mode='human', close=False): redraws one frame of the environment. The default mode is usually human-friendly, e.g. it pops up a window.
- close(self): closes the environment and releases its resources.
The code above first imports the gym library, then creates the CartPole-v0 environment and resets its state. The for loop runs 1000 time steps of control: env.render() refreshes the rendered frame at every step, a random action (0 or 1) is applied to the current state, and whenever the environment returns done as True the environment is reset. After the final iteration the simulation environment is closed.
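To make the Env interface concrete, below is a minimal custom environment sketch. It is purely illustrative (the class name MinimalEnv and the 10-step episode length are invented for this sketch, not taken from the original post); section 8 shows a full, registered example.
import gym
from gym import spaces

class MinimalEnv(gym.Env):
    """A do-nothing environment that only illustrates the four core methods."""

    def __init__(self):
        self.action_space = spaces.Discrete(2)       # two possible actions: 0 or 1
        self.observation_space = spaces.Discrete(1)  # a single dummy observation
        self.steps = 0

    def reset(self):
        self.steps = 0
        return 0                                     # the initial observation

    def step(self, action):
        self.steps += 1
        done = self.steps >= 10                      # end the episode after 10 steps
        return 0, 0.0, done, {}                      # observation, reward, done, info

    def render(self, mode='human'):
        print("step", self.steps)

    def close(self):
        pass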
4. Observations
env.step() in the code above simulates a single step. In Gym, env.step() returns four values:
- observation (object): the environment's observation after the step has been executed, e.g. pixel data from a camera, the joint angles of a robot, or the current state of a board game;
- reward (float): the reward the agent obtains for the previous action. The range of reward values differs between environments, but the goal of reinforcement learning is always to maximize the total reward;
- done (boolean): indicates whether the environment needs to be reset with env.reset(). In most cases, done being True means the current episode (or trial) has ended; for example, when the robot falls over or falls off the table, the current episode should be terminated and the environment reset;
- info (dict): diagnostic information for debugging. It is not used in standard evaluations of the agent; we will come back to it when needed.
In a Gym simulation, reset() must be called at the start of every episode to obtain the initial observation, and the done flag then decides whether to move on to the next episode. So the more appropriate way to structure the loop is to follow the done flag:
import gym
env = gym.make('CartPole-v0')
for i_episode in range(20):
    observation = env.reset()
    for t in range(100):
        env.render()
        print(observation)
        action = env.action_space.sample()
        observation, reward, done, info = env.step(action)
        if done:
            print("Episode finished after {} timesteps".format(t+1))
            break
env.close()
A snippet of the output is shown below:
[ 0.04025062 -0.04312649 0.00186348 0.02288173]
[ 0.03938809 -0.23827512 0.00232111 0.31615203]
[ 0.03462259 -0.43343005 0.00864416 0.60956605]
[ 0.02595398 -0.23843 0.02083548 0.31961824]
[ 0.02118538 -0.43384239 0.02722784 0.6187984 ]
[ 0.01250854 -0.23911113 0.03960381 0.33481376]
[ 0.00772631 -0.43477369 0.04630008 0.63971794]
[-0.00096916 -0.63050954 0.05909444 0.94661444]
[-0.01357935 -0.43623107 0.07802673 0.67306909]
[-0.02230397 -0.24227538 0.09148811 0.40593731]
[-0.02714948 -0.43856752 0.09960686 0.72600415]
[-0.03592083 -0.24495361 0.11412694 0.46625881]
[-0.0408199 -0.05161354 0.12345212 0.21161588]
[-0.04185217 0.14154693 0.12768444 -0.03971694]
[-0.03902123 -0.05515279 0.1268901 0.29036807]
[-0.04012429 -0.25183418 0.13269746 0.6202239 ]
[-0.04516097 -0.05879065 0.14510194 0.37210296]
[-0.04633679 0.13400401 0.152544 0.12846047]
[-0.04365671 -0.06293669 0.15511321 0.46511532]
[-0.04491544 -0.25987115 0.16441551 0.80239106]
[-0.05011286 -0.45681992 0.18046333 1.14195086]
[-0.05924926 -0.65378152 0.20330235 1.48536419]
Episode finished after 22 timesteps
The output above shows that each observation in this run is an array of four numbers. This array is the state representation specific to the CartPole environment; its elements are, in order:
- the position of the cart on the track
- the cart velocity
- the angle of the pole with the vertical
- the rate of change of that angle (the pole's angular velocity)
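As a small sketch (assuming the classic gym API, where reset() returns just the observation array), the four components can be unpacked by name:
import gym

env = gym.make('CartPole-v0')
obs = env.reset()  # length-4 array: [cart position, cart velocity, pole angle, pole angular velocity]
cart_position, cart_velocity, pole_angle, pole_angular_velocity = obs
print("cart position:        ", cart_position)
print("cart velocity:        ", cart_velocity)
print("pole angle:           ", pole_angle)
print("pole angular velocity:", pole_angular_velocity)
env.close()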
5. Spaces
Every action that is executed is randomly sampled from the environment's action space, but what exactly are these actions? A Gym environment exposes two members, the action space action_space and the observation space observation_space. Both are of type Space and describe the format and valid range of actions and observations. Here is a code example:
import gym
env = gym.make('CartPole-v0')
print(env.action_space)
print(env.observation_space)
The output is:
Discrete(2)
Box(-3.4028234663852886e+38, 3.4028234663852886e+38, (4,), float32)
From the program's output we can see that:
- action_space is a Discrete space. From the source in discrete.py, its range is a set of n non-negative integers {0, 1, …, n-1}; in the CartPole-v0 example the action space is {0, 1}.
- observation_space is a Box space. From the source in box.py, it represents an n-dimensional box, which is why the observation printed in the previous section is an array of length 4. Each element of the array has an upper and a lower bound.
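Both spaces can also be inspected and sampled directly. The following sketch uses the standard members of the classic gym Space classes:
import gym

env = gym.make('CartPole-v0')

# Discrete(2): the valid actions are the integers 0 and 1
print(env.action_space.n)            # 2
print(env.action_space.sample())     # a random valid action
print(env.action_space.contains(0))  # True

# Box(4,): per-dimension lower and upper bounds of the observation
print(env.observation_space.low)
print(env.observation_space.high)
print(env.observation_space.shape)   # (4,)
env.close()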
6. Rewards
In gym's CartPole environment (env), after every action that moves the cart left or right, env returns a reward of +1. CartPole-v0 ends the episode once 200 reward has been accumulated, while for CartPole-v1 the limit is 500. The maximum-reward (episode-length) threshold can be changed through the registry described in the next section.
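These per-environment limits live on the environment's spec object. Below is a small sketch (assuming the classic gym registry attributes max_episode_steps and reward_threshold) that also accumulates the return of one random episode:
import gym

env = gym.make('CartPole-v0')
print(env.spec.max_episode_steps)  # 200 for CartPole-v0, 500 for CartPole-v1
print(env.spec.reward_threshold)   # the score at which the task counts as solved

# Accumulate the total reward of one episode with random actions
observation = env.reset()
total_reward = 0.0
done = False
while not done:
    observation, reward, done, info = env.step(env.action_space.sample())
    total_reward += reward
print("episode return:", total_reward)
env.close()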
7. The registry
Gym is a large collection of reinforcement learning simulation environments, all wrapped behind the common interface described above and exposed to the user. All registered environments can be listed with the following code:
from gym import envs
print(envs.registry.all())
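Each entry returned by the registry is an EnvSpec whose id field holds the environment name, so the list can be filtered. For example (a sketch using the classic registry API):
from gym import envs

# Collect the ids of all registered CartPole variants
cartpole_ids = [spec.id for spec in envs.registry.all() if 'CartPole' in spec.id]
print(cartpole_ids)  # e.g. ['CartPole-v0', 'CartPole-v1']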
8. Registering a custom environment
Gym supports adding user-created environments to the registry: before they can be created with gym.make(), they must be registered with register() at startup. If you want to register your own environment, suppose you define it with the following structure:
myenv/
    __init__.py
    myenv.py
i. myenv.py contains the class for your own environment. In __init__.py, add the following code:
from gym.envs.registration import register

register(
    id='MyEnv-v0',
    # the first 'myenv' is the package (folder) name, the second 'myenv' is the module (file) name,
    # and MyEnv is the class defined in that file
    entry_point='myenv.myenv:MyEnv',
)
ii. Using your own environment:
import gym
import myenv  # be sure to import your own package; this step is easy to forget
env = gym.make('MyEnv-v0')
iii. Put the myenv directory on your PYTHONPATH, or start python from its parent directory.
A complete example follows. Directory structure:
myenv/
    __init__.py
    my_hotter_colder.py
-------------------
__init__.py file :
-------------------
from gym.envs.registration import register

register(
    id='MyHotterColder-v0',
    entry_point='myenv.my_hotter_colder:MyHotterColder',
)
-------------------
my_hotter_colder.py file :
-------------------
import gym
from gym import spaces
from gym.utils import seeding
import numpy as np
class MyHotterColder(gym.Env):
    """Hotter Colder

    The goal of hotter colder is to guess closer to a randomly selected number.

    After each step the agent receives an observation of:
    0 - No guess yet submitted (only after reset)
    1 - Guess is lower than the target
    2 - Guess is equal to the target
    3 - Guess is higher than the target

    The reward is calculated as:
    (min(action, self.number) + self.range) / (max(action, self.number) + self.range)

    Ideally an agent will be able to recognise the 'scent' of a higher reward and
    increase the rate at which it guesses in that direction until the reward reaches
    its maximum.
    """

    def __init__(self):
        self.range = 1000   # +/- range the randomly selected number can fall in
        self.bounds = 2000  # action space bounds
        self.action_space = spaces.Box(low=np.array([-self.bounds]), high=np.array([self.bounds]))
        self.observation_space = spaces.Discrete(4)
        self.number = 0
        self.guess_count = 0
        self.guess_max = 200
        self.observation = 0
        self.seed()
        self.reset()

    def seed(self, seed=None):
        self.np_random, seed = seeding.np_random(seed)
        return [seed]

    def step(self, action):
        assert self.action_space.contains(action)
        if action < self.number:
            self.observation = 1
        elif action == self.number:
            self.observation = 2
        elif action > self.number:
            self.observation = 3
        reward = ((min(action, self.number) + self.bounds) / (max(action, self.number) + self.bounds)) ** 2
        self.guess_count += 1
        done = self.guess_count >= self.guess_max
        return self.observation, reward[0], done, {"number": self.number, "guesses": self.guess_count}

    def reset(self):
        self.number = self.np_random.uniform(-self.range, self.range)
        self.guess_count = 0
        self.observation = 0
        return self.observation
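Assuming the package layout above (so that importing myenv runs the register() call in __init__.py), the new environment can then be driven like any built-in one. A short random-guess rollout might look like this:
import gym
import myenv  # importing the package triggers register()

env = gym.make('MyHotterColder-v0')
observation = env.reset()
for _ in range(5):
    action = env.action_space.sample()  # a 1-element float array within [-2000, 2000]
    observation, reward, done, info = env.step(action)
    print(observation, reward, info)
    if done:
        break
env.close()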
9. OpenAI Gym Evaluation platform
Users can record an algorithm's performance in an environment and upload it, together with their model code as a Gist, to generate an evaluation report; a short video of the model playing the game is also recorded. Every environment has a leaderboard for comparing the performance of different models.
Recording the results for upload works as follows:
import gym
from gym import wrappers

env = gym.make('CartPole-v0')
env = wrappers.Monitor(env, '/tmp/cartpole-experiment-1')
for i_episode in range(20):
    observation = env.reset()
    for t in range(100):
        env.render()
        print(observation)
        action = env.action_space.sample()
        observation, reward, done, info = env.step(action)
        if done:
            print("Episode finished after {} timesteps".format(t+1))
            break
Wrapping the environment with the Monitor wrapper records your model's performance under the path you specify. Results of different models in the same environment can be written to the same path.
After registering on the official website, you can find your API_Key on your personal page and then upload the recorded results to OpenAI Gym:
import gym
gym.upload('/tmp/cartpole-experiment-1', api_key='YOUR_API_KEY')
This returns a link to the evaluation. Opening the link shows an evaluation report for the model in that environment, together with the recorded video. Every time results are uploaded, OpenAI Gym evaluates them.
You can create a Github Gist with a write-up of the results and upload it, or pass the write-up directly via the writeup parameter of upload:
import gym
gym.upload('/tmp/cartpole-experiment-1', writeup='https://gist.github.com/gdb/b6365e79be6052e7531e7ba6ea8caf23', api_key='YOUR_API_KEY')
The evaluation automatically computes a score and generates a nicely formatted page.
In most environments the goal is to minimize the number of steps needed to reach a threshold level of performance; the threshold differs from environment to environment. In some environments it is not clear what the threshold should be, and the goal is instead to maximize final performance. In the CartPole environment, the threshold is related to the number of frames for which the pole stays upright.
If there are any mistakes or questions in this post, please add WeChat 1755337994 and let me know. Thank you!