
Fundamentals of Reinforcement Learning: OpenAI Gym Environment Construction Demo


1. Gym Introduction

Gym is a simulation platform for researching and developing reinforcement learning algorithms. No prior knowledge of the agent is required to use it. It consists of the following two parts:

  • Gym open source library: a collection of test problems. When you test a reinforcement learning algorithm, each test problem is an environment; for example, when a robot plays a game, the environment is the game's screen. These environments share a common interface, which lets users design general-purpose algorithms.
  • OpenAI Gym service: a site and API (for example, for the classic control problem CartPole-v0) that allows users to compare the performance of their algorithms.

2. Gym Installation

Gym can be installed easily with pip in a Python 3.5+ environment:

pip install gym

If you need to install gym from source, you can run:

git clone https://github.com/openai/gym
cd gym
pip install -e .

Running pip install -e .[all] performs a complete installation with all environments. This requires some additional dependencies, including cmake and a recent version of pip.
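
A quick way to verify that the installation worked is to import the package and print its version (a minimal check; the version string will of course differ depending on the release you installed):

import gym

# If the import succeeds, gym is installed; print the installed version
print(gym.__version__)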

3. Gym Usage Demo

Simply put, OpenAI Gym provides interfaces to many problems and environments (or games). Users do not need to know much about a game's internal implementation; they can test and simulate simply by calling the interface. Next, we take the classic control problem CartPole-v0 as an example to get a brief feel for Gym's characteristics.

# Import the gym package
import gym
# Create the environment to use
env = gym.make('CartPole-v0')
# Initialize the environment
env.reset()

# Iterate over the environment for 1000 time steps
for _ in range(1000):
    env.render()
    observation, reward, done, info = env.step(env.action_space.sample())  # take a random action
    if done:
        env.reset()
env.close()

Running this code pops up a window showing the cart-pole animation.

As can be seen from the code above, the core interface of gym is Env. As a unified environment interface, Env contains the following core methods:

  • reset(self): resets the state of the environment and returns the initial observation.
  • step(self, action): advances the environment by one time step and returns observation, reward, done, info.
  • render(self, mode='human', close=False): redraws one frame of the environment. The default mode is human-friendly, e.g., popping up a window.
  • close(self): closes the environment and frees its resources.

The demo code above first imports the gym library, then creates the CartPole-v0 environment and resets its state. The for loop runs for 1000 time steps; at each step, env.render() refreshes the display, a random action (0 or 1) is applied to the current state, and when the environment returns done as True the environment is reset. After the loop ends, the simulation environment is closed.
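
To make the interface concrete, here is a minimal sketch of a custom Env subclass that implements the four methods listed above. It is a hypothetical toy environment (the class name and its trivial dynamics are made up for illustration), not part of Gym itself:

import gym
from gym import spaces

class MinimalEnv(gym.Env):
    """A toy environment whose state is simply the last action taken."""

    def __init__(self):
        self.action_space = spaces.Discrete(2)       # two possible actions: 0 or 1
        self.observation_space = spaces.Discrete(2)  # the observation mirrors the action
        self.state = 0

    def reset(self):
        # Reset the internal state and return the initial observation
        self.state = 0
        return self.state

    def step(self, action):
        # Advance one time step and return (observation, reward, done, info)
        self.state = action
        return self.state, 1.0, False, {}

    def render(self, mode='human'):
        # "Draw" one frame of the environment; here we simply print the state
        print('state:', self.state)

    def close(self):
        # Release any resources; nothing to clean up in this toy example
        pass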

4. Observations

In the code above, env.step() simulates each step. In Gym, env.step() returns 4 values:

  • observation (object): the environment's observation after the current step is executed, e.g., pixel data from a camera, the joint angles of a robot, or the current state of a board game;
  • reward (float): the reward the agent obtains after performing the previous action. The range of reward values differs between environments, but the goal of reinforcement learning is always to maximize the total reward;
  • done (boolean): indicates whether the environment needs to be reset with env.reset(). In most cases, done being True means the current episode (or trial) has ended. For example, when the pole falls over or the cart runs off the track, the current episode should be terminated and the environment reset;
  • info (dict): diagnostic information for debugging. This info is not used in standard agent evaluations; we will come back to it when it is needed.

In a Gym simulation, reset() must be called at the beginning of every episode to obtain the initial observation; the done flag then determines when to move on to the next episode. So the more appropriate pattern is to follow the done flag, as in the following code.

import gym
env = gym.make('CartPole-v0')
for i_episode in range(20):
    observation = env.reset()
    for t in range(100):
        env.render()
        print(observation)
        action = env.action_space.sample()
        observation, reward, done, info = env.step(action)
        if done:
            print("Episode finished after {} timesteps".format(t+1))
            break
env.close()

A snippet of the code's output is shown below:

[ 0.04025062 -0.04312649  0.00186348  0.02288173]
[ 0.03938809 -0.23827512  0.00232111  0.31615203]
[ 0.03462259 -0.43343005  0.00864416  0.60956605]
[ 0.02595398 -0.23843     0.02083548  0.31961824]
[ 0.02118538 -0.43384239  0.02722784  0.6187984 ]
[ 0.01250854 -0.23911113  0.03960381  0.33481376]
[ 0.00772631 -0.43477369  0.04630008  0.63971794]
[-0.00096916 -0.63050954  0.05909444  0.94661444]
[-0.01357935 -0.43623107  0.07802673  0.67306909]
[-0.02230397 -0.24227538  0.09148811  0.40593731]
[-0.02714948 -0.43856752  0.09960686  0.72600415]
[-0.03592083 -0.24495361  0.11412694  0.46625881]
[-0.0408199  -0.05161354  0.12345212  0.21161588]
[-0.04185217  0.14154693  0.12768444 -0.03971694]
[-0.03902123 -0.05515279  0.1268901   0.29036807]
[-0.04012429 -0.25183418  0.13269746  0.6202239 ]
[-0.04516097 -0.05879065  0.14510194  0.37210296]
[-0.04633679  0.13400401  0.152544    0.12846047]
[-0.04365671 -0.06293669  0.15511321  0.46511532]
[-0.04491544 -0.25987115  0.16441551  0.80239106]
[-0.05011286 -0.45681992  0.18046333  1.14195086]
[-0.05924926 -0.65378152  0.20330235  1.48536419]
Episode finished after 22 timesteps

The above results show that in this episode, the printed observation is an array of four values. This format is specific to the CartPole environment.

Among them:

  • the first value is the position of the cart on the track (cart position);
  • the second is the velocity of the cart (cart velocity);
  • the third is the angle of the pole with the vertical (pole angle);
  • the fourth is the rate of change of that angle (pole angular velocity).
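
As a small illustration, the four components can be unpacked by name right after env.reset() or env.step() (the variable names below are chosen for readability; they are not defined by Gym itself):

import gym

env = gym.make('CartPole-v0')
obs = env.reset()
# Unpack the 4-element CartPole observation into named variables
cart_position, cart_velocity, pole_angle, pole_angular_velocity = obs
print(cart_position, cart_velocity, pole_angle, pole_angular_velocity)
env.close()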

5. Spaces

Every executed action is randomly selected from the environment's action space, but what exactly are these actions? In a Gym environment there are two attributes: the action space action_space and the observation space observation_space. Both are defined as Space objects, which describe the format and valid range of actions and observations. Here is a code example:

import gym
env = gym.make('CartPole-v0')
print(env.action_space)
print(env.observation_space)

Discrete(2)
Box(-3.4028234663852886e+38, 3.4028234663852886e+38, (4,), float32)

From the program's output we can see that:

  • action_space is a Discrete type. From the discrete.py source code, its range is a set of n non-negative integers {0, 1, …, n-1}. In the CartPole-v0 example, the action space is {0, 1}.
  • observation_space is a Box type. From the box.py source code, it represents an n-dimensional box, which is why the observation printed in the previous section is an array of length 4. Each element of the array has an upper and a lower bound.
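
These Space objects can also be inspected directly. The following sketch prints the number of discrete actions, the shape and bounds of the observation box, and checks whether a value is a valid action (attribute names follow the classic gym spaces API):

import gym

env = gym.make('CartPole-v0')
print(env.action_space.n)             # number of discrete actions: 2
print(env.observation_space.shape)    # shape of the observation array: (4,)
print(env.observation_space.low)      # lower bound of each observation component
print(env.observation_space.high)     # upper bound of each observation component
print(env.action_space.sample())      # a random valid action, 0 or 1
print(env.action_space.contains(1))   # True: 1 is a valid action
env.close()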

6. Reward

In gym's CartPole environment (env), after each action that pushes the cart left or right, env returns a reward of +1. In CartPole-v0 the episode ends after 200 reward has been accumulated, while in CartPole-v1 the limit is 500. This maximum reward (episode length) threshold can be modified through the registry described in the next section.
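
These per-environment limits are stored in the environment's registered spec and can be queried directly. A small sketch, assuming the classic gym registration API (for CartPole-v0 the printed values are typically 200 steps and a solving threshold of 195.0):

import gym

env = gym.make('CartPole-v0')
print(env.spec.max_episode_steps)   # maximum number of steps before the episode is truncated
print(env.spec.reward_threshold)    # average reward regarded as "solving" the environment
env.close()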

7. The registry

Gym is a large collection of reinforcement learning simulation environments, encapsulated behind a common interface exposed to users. The following code lists all available environments:

from gym import envs
print(envs.registry.all())
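
envs.registry.all() returns the registered EnvSpec objects, so the list can also be filtered by id, for example to find all CartPole variants (a small convenience sketch):

from gym import envs

# Print only the ids of environments whose name contains 'CartPole'
cartpole_ids = [spec.id for spec in envs.registry.all() if 'CartPole' in spec.id]
print(cartpole_ids)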

8. Register your own environment

Gym supports adding user-created environments to the registry: the environment is registered with register() at import time and then created with gym.make(). If you want to register your own environment, suppose you define it with the following structure:

myenv/
    __init__.py
    myenv.py

i. myenv.py contains the class for our own environment. In __init__.py, enter the following code:

from gym.envs.registration import register
register(
    id='MyEnv-v0',
    entry_point='myenv.myenv:MyEnv',  # the first myenv is the package (folder) name, the second is the module (file) name, and MyEnv is the class inside that file
)
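
As mentioned in the reward section, the episode limits live in the registry, so register() also accepts optional arguments for them. A hedged sketch, assuming the classic registration keywords max_episode_steps and reward_threshold (the id 'MyEnv-v1' is only an illustrative name):

from gym.envs.registration import register

register(
    id='MyEnv-v1',
    entry_point='myenv.myenv:MyEnv',
    max_episode_steps=200,    # episode is truncated after this many steps
    reward_threshold=195.0,   # average reward regarded as "solving" the environment
)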

ii. Using our own environment:

import gym
import myenv  # be sure to import your own environment package; this step is easy to overlook
env = gym.make('MyEnv-v0')

iii. Put the myenv directory on PYTHONPATH, or start python from its parent directory.

Directory structure:
myenv/
    __init__.py
    my_hotter_colder.py
-------------------
__init__.py file:
-------------------
from gym.envs.registration import register
register(
    id='MyHotterColder-v0',
    entry_point='myenv.my_hotter_colder:MyHotterColder',
)
-------------------
my_hotter_colder.py file:
-------------------
import gym
from gym import spaces
from gym.utils import seeding
import numpy as np

class MyHotterColder(gym.Env):
    """Hotter Colder
    The goal of hotter colder is to guess closer to a randomly selected number

    After each step the agent receives an observation of:
    0 - No guess yet submitted (only after reset)
    1 - Guess is lower than the target
    2 - Guess is equal to the target
    3 - Guess is higher than the target

    The rewards is calculated as:
    (min(action, self.number) + self.range) / (max(action, self.number) + self.range)

    Ideally an agent will be able to recognise the 'scent' of a higher reward and
    increase the rate in which is guesses in that direction until the reward reaches
    its maximum
    """
    def __init__(self):
        self.range = 1000  # +/- value the randomly select number can be between
        self.bounds = 2000  # Action space bounds

        self.action_space = spaces.Box(low=np.array([-self.bounds]), high=np.array([self.bounds]))
        self.observation_space = spaces.Discrete(4)

        self.number = 0
        self.guess_count = 0
        self.guess_max = 200
        self.observation = 0

        self.seed()
        self.reset()

    def seed(self, seed=None):
        self.np_random, seed = seeding.np_random(seed)
        return [seed]

    def step(self, action):
        assert self.action_space.contains(action)

        if action < self.number:
            self.observation = 1

        elif action == self.number:
            self.observation = 2

        elif action > self.number:
            self.observation = 3

        # action comes from a Box space, so it is a length-1 numpy array;
        # the reward approaches 1 as the guess approaches the target number
        reward = ((min(action, self.number) + self.bounds) / (max(action, self.number) + self.bounds)) ** 2

        self.guess_count += 1
        done = self.guess_count >= self.guess_max

        # reward is a length-1 array here, so reward[0] extracts the scalar value
        return self.observation, reward[0], done, {"number": self.number, "guesses": self.guess_count}

    def reset(self):
        self.number = self.np_random.uniform(-self.range, self.range)
        self.guess_count = 0
        self.observation = 0
        return self.observation
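
Once the package above is importable, the custom environment can be driven exactly like the built-in ones. A short usage sketch (it assumes the myenv package and the MyHotterColder-v0 registration shown above):

import gym
import myenv  # importing the package runs the register() call in __init__.py

env = gym.make('MyHotterColder-v0')
observation = env.reset()
for _ in range(10):
    action = env.action_space.sample()            # a random guess within the bounds
    observation, reward, done, info = env.step(action)
    print(observation, reward, info)
    if done:
        break
env.close()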

9. OpenAI Gym Evaluation platform

Users can record an algorithm's performance in an environment and upload it, or upload their own model as a Gist, to generate an evaluation report; short videos of the model playing the game can also be recorded. Every environment has a leaderboard for comparing the performance of different models.

The recording method is as follows:

import gym
from gym import wrappers
env = gym.make('CartPole-v0')
env = wrappers.Monitor(env, '/tmp/cartpole-experiment-1')
for i_episode in range(20):
    observation = env.reset()
    for t in range(100):
        env.render()
        print(observation)
        action = env.action_space.sample()
        observation, reward, done, info = env.step(action)
        if done:
            print("Episode finished after {} timesteps".format(t+1))
            break

Wrapping your environment with the Monitor wrapper records your model's performance under the path you specify. Performance data from different models in the same environment can be written to the same path.
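
One practical note: if the target directory already contains recordings from an earlier run, the classic Monitor wrapper may refuse to overwrite them; in that case it can be created with force=True (an optional argument of the old gym Monitor API, mentioned here as an assumption about your gym version):

import gym
from gym import wrappers

env = gym.make('CartPole-v0')
# force=True clears previous recordings in the directory instead of raising an error
env = wrappers.Monitor(env, '/tmp/cartpole-experiment-1', force=True)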

After registering on the official website, you can find your API_Key on your personal page; you can then upload your results to OpenAI Gym:

import gym
gym.upload('/tmp/cartpole-experiment-1', api_key='YOUR_API_KEY')

We then get the upload results.

Opening the link shows the evaluation report of the current model in that environment, along with a short recorded video.

Every time results are uploaded, OpenAI Gym evaluates them.

You can create a GitHub Gist to write up the results, or pass the write-up directly via the writeup parameter of upload:

import gym
gym.upload('/tmp/cartpole-experiment-1', writeup='https://gist.github.com/gdb/b6365e79be6052e7531e7ba6ea8caf23', api_key='YOUR_API_KEY')

The evaluation automatically computes a score and generates a nicely formatted results page.

In most environments, the goal is to minimize the number of steps needed to reach a threshold level of performance. Different environments have different thresholds, and in some environments it is not clear what the threshold should be; there the goal is simply to maximize final performance. In the CartPole environment, the threshold is based on how many frames the pole stays upright.

