
Fundamentals of Reinforcement Learning: OpenAI Gym Environment Construction Demo


1. Gym Introduction

Gym is a simulation platform for researching and developing reinforcement learning algorithms. No prior knowledge of the agent is required to use it. It consists of the following two parts:

  • Gym open source library: a collection of test problems. When you test a reinforcement learning algorithm, each test problem is an environment; for example, when a robot plays a game, the environment is the game's screen. These environments share a common interface, which lets users design general-purpose algorithms.
  • OpenAI Gym service: a site and API (for example, for the classic control problem CartPole-v0) that allows users to compare the performance of their algorithms.

2. Gym Installation

Gym can be installed easily with pip in a Python 3.5+ environment:

pip install gym

If you need to install gym from source, you can run:

git clone https://github.com/openai/gym
cd gym
pip install -e .

Running pip install -e .[all] performs a complete installation with all environments. This requires some additional dependencies, including cmake and a recent version of pip.
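
A quick way to verify that the installation worked is to import the package and print its version (a minimal check; the version string will of course differ depending on the release you installed):

import gym

# If the import succeeds, gym is installed; print the installed version
print(gym.__version__)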

3. Gym Usage Demo

Simply put, OpenAI Gym provides interfaces to many problems and environments (or games). Users do not need to know much about a game's internal implementation; they can test and simulate simply by calling the interface. Next, we take the classic control problem CartPole-v0 as an example to get a brief feel for Gym's characteristics.

# Import the gym package
import gym
# Create the environment to use
env = gym.make('CartPole-v0')
# Initialize the environment
env.reset()

# Iterate over the environment for 1000 time steps
for _ in range(1000):
    env.render()
    observation, reward, done, info = env.step(env.action_space.sample())  # take a random action
    if done:
        env.reset()
env.close()

Running this code pops up a window showing the cart-pole animation.

As can be seen from the code above, the core interface of gym is Env. As a unified environment interface, Env contains the following core methods:

  • reset(self): resets the state of the environment and returns the initial observation.
  • step(self, action): advances the environment by one time step and returns observation, reward, done, info.
  • render(self, mode='human', close=False): redraws one frame of the environment. The default mode is human-friendly, e.g., popping up a window.
  • close(self): closes the environment and frees its resources.

The demo code above first imports the gym library, then creates the CartPole-v0 environment and resets its state. The for loop runs for 1000 time steps; at each step, env.render() refreshes the display, a random action (0 or 1) is applied to the current state, and when the environment returns done as True the environment is reset. After the loop ends, the simulation environment is closed.
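
To make the interface concrete, here is a minimal sketch of a custom Env subclass that implements the four methods listed above. It is a hypothetical toy environment (the class name and its trivial dynamics are made up for illustration), not part of Gym itself:

import gym
from gym import spaces

class MinimalEnv(gym.Env):
    """A toy environment whose state is simply the last action taken."""

    def __init__(self):
        self.action_space = spaces.Discrete(2)       # two possible actions: 0 or 1
        self.observation_space = spaces.Discrete(2)  # the observation mirrors the action
        self.state = 0

    def reset(self):
        # Reset the internal state and return the initial observation
        self.state = 0
        return self.state

    def step(self, action):
        # Advance one time step and return (observation, reward, done, info)
        self.state = action
        return self.state, 1.0, False, {}

    def render(self, mode='human'):
        # "Draw" one frame of the environment; here we simply print the state
        print('state:', self.state)

    def close(self):
        # Release any resources; nothing to clean up in this toy example
        pass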

4. Observations

In the code above, env.step() simulates each step. In Gym, env.step() returns 4 values:

  • observation (object): the environment's observation after the current step is executed, e.g., pixel data from a camera, the joint angles of a robot, or the current state of a board game;
  • reward (float): the reward the agent obtains after performing the previous action. The range of reward values differs between environments, but the goal of reinforcement learning is always to maximize the total reward;
  • done (boolean): indicates whether the environment needs to be reset with env.reset(). In most cases, done being True means the current episode (or trial) has ended. For example, when the pole falls over or the cart runs off the track, the current episode should be terminated and the environment reset;
  • info (dict): diagnostic information for debugging. This info is not used in standard agent evaluations; we will come back to it when it is needed.

In a Gym simulation, reset() must be called at the beginning of every episode to obtain the initial observation; the done flag then determines when to move on to the next episode. So the more appropriate pattern is to follow the done flag, as in the following code.

import gym
env = gym.make('CartPole-v0')
for i_episode in range(20):
    observation = env.reset()
    for t in range(100):
        env.render()
        print(observation)
        action = env.action_space.sample()
        observation, reward, done, info = env.step(action)
        if done:
            print("Episode finished after {} timesteps".format(t+1))
            break
env.close()

A snippet of the code's output is shown below:

[ 0.04025062 -0.04312649  0.00186348  0.02288173]
[ 0.03938809 -0.23827512  0.00232111  0.31615203]
[ 0.03462259 -0.43343005  0.00864416  0.60956605]
[ 0.02595398 -0.23843     0.02083548  0.31961824]
[ 0.02118538 -0.43384239  0.02722784  0.6187984 ]
[ 0.01250854 -0.23911113  0.03960381  0.33481376]
[ 0.00772631 -0.43477369  0.04630008  0.63971794]
[-0.00096916 -0.63050954  0.05909444  0.94661444]
[-0.01357935 -0.43623107  0.07802673  0.67306909]
[-0.02230397 -0.24227538  0.09148811  0.40593731]
[-0.02714948 -0.43856752  0.09960686  0.72600415]
[-0.03592083 -0.24495361  0.11412694  0.46625881]
[-0.0408199  -0.05161354  0.12345212  0.21161588]
[-0.04185217  0.14154693  0.12768444 -0.03971694]
[-0.03902123 -0.05515279  0.1268901   0.29036807]
[-0.04012429 -0.25183418  0.13269746  0.6202239 ]
[-0.04516097 -0.05879065  0.14510194  0.37210296]
[-0.04633679  0.13400401  0.152544    0.12846047]
[-0.04365671 -0.06293669  0.15511321  0.46511532]
[-0.04491544 -0.25987115  0.16441551  0.80239106]
[-0.05011286 -0.45681992  0.18046333  1.14195086]
[-0.05924926 -0.65378152  0.20330235  1.48536419]
Episode finished after 22 timesteps

The above results show that in this episode, the printed observation is an array of four values. This format is specific to the CartPole environment.

Among them:

  • the first value is the position of the cart on the track (cart position);
  • the second is the velocity of the cart (cart velocity);
  • the third is the angle of the pole with the vertical (pole angle);
  • the fourth is the rate of change of that angle (pole angular velocity).
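
As a small illustration, the four components can be unpacked by name right after env.reset() or env.step() (the variable names below are chosen for readability; they are not defined by Gym itself):

import gym

env = gym.make('CartPole-v0')
obs = env.reset()
# Unpack the 4-element CartPole observation into named variables
cart_position, cart_velocity, pole_angle, pole_angular_velocity = obs
print(cart_position, cart_velocity, pole_angle, pole_angular_velocity)
env.close()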

5. Spaces

Every executed action is randomly selected from the environment's action space, but what exactly are these actions? In a Gym environment there are two attributes: the action space action_space and the observation space observation_space. Both are defined as Space objects, which describe the format and valid range of actions and observations. Here is a code example:

import gym
env = gym.make('CartPole-v0')
print(env.action_space)
print(env.observation_space)

Discrete(2)
Box(-3.4028234663852886e+38, 3.4028234663852886e+38, (4,), float32)

From the program's output we can see that:

  • action_space is a Discrete type. From the discrete.py source code, its range is a set of n non-negative integers {0, 1, …, n-1}. In the CartPole-v0 example, the action space is {0, 1}.
  • observation_space is a Box type. From the box.py source code, it represents an n-dimensional box, which is why the observation printed in the previous section is an array of length 4. Each element of the array has an upper and a lower bound.
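
These Space objects can also be inspected directly. The following sketch prints the number of discrete actions, the shape and bounds of the observation box, and checks whether a value is a valid action (attribute names follow the classic gym spaces API):

import gym

env = gym.make('CartPole-v0')
print(env.action_space.n)             # number of discrete actions: 2
print(env.observation_space.shape)    # shape of the observation array: (4,)
print(env.observation_space.low)      # lower bound of each observation component
print(env.observation_space.high)     # upper bound of each observation component
print(env.action_space.sample())      # a random valid action, 0 or 1
print(env.action_space.contains(1))   # True: 1 is a valid action
env.close()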

6. Reward

In gym's CartPole environment (env), after each action that pushes the cart left or right, env returns a reward of +1. In CartPole-v0 the episode ends after 200 reward has been accumulated, while in CartPole-v1 the limit is 500. This maximum reward (episode length) threshold can be modified through the registry described in the next section.
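
These per-environment limits are stored in the environment's registered spec and can be queried directly. A small sketch, assuming the classic gym registration API (for CartPole-v0 the printed values are typically 200 steps and a solving threshold of 195.0):

import gym

env = gym.make('CartPole-v0')
print(env.spec.max_episode_steps)   # maximum number of steps before the episode is truncated
print(env.spec.reward_threshold)    # average reward regarded as "solving" the environment
env.close()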

7. The registry

Gym is a large collection of reinforcement learning simulation environments, encapsulated behind a common interface exposed to users. The following code lists all available environments:

from gym import envs
print(envs.registry.all())
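
envs.registry.all() returns the registered EnvSpec objects, so the list can also be filtered by id, for example to find all CartPole variants (a small convenience sketch):

from gym import envs

# Print only the ids of environments whose name contains 'CartPole'
cartpole_ids = [spec.id for spec in envs.registry.all() if 'CartPole' in spec.id]
print(cartpole_ids)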

8. Register your own environment

Gym supports adding user-created environments to the registry: the environment is registered with register() at import time and then created with gym.make(). If you want to register your own environment, suppose you define it with the following structure:

myenv/
    __init__.py
    myenv.py

i. myenv.py contains the class for our own environment. In __init__.py, enter the following code:

from gym.envs.registration import register
register(
    id='MyEnv-v0',
    entry_point='myenv.myenv:MyEnv',  # the first myenv is the package (folder) name, the second is the module (file) name, and MyEnv is the class inside that file
)
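
As mentioned in the reward section, the episode limits live in the registry, so register() also accepts optional arguments for them. A hedged sketch, assuming the classic registration keywords max_episode_steps and reward_threshold (the id 'MyEnv-v1' is only an illustrative name):

from gym.envs.registration import register

register(
    id='MyEnv-v1',
    entry_point='myenv.myenv:MyEnv',
    max_episode_steps=200,    # episode is truncated after this many steps
    reward_threshold=195.0,   # average reward regarded as "solving" the environment
)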

ii. Using our own environment:

import gym
import myenv  # be sure to import your own environment package; this step is easy to overlook
env = gym.make('MyEnv-v0')

iii. Put the myenv directory on PYTHONPATH, or start python from its parent directory.

Directory structure:
myenv/
    __init__.py
    my_hotter_colder.py
-------------------
__init__.py file:
-------------------
from gym.envs.registration import register
register(
    id='MyHotterColder-v0',
    entry_point='myenv.my_hotter_colder:MyHotterColder',
)
-------------------
my_hotter_colder.py file:
-------------------
import gym
from gym import spaces
from gym.utils import seeding
import numpy as np

class MyHotterColder(gym.Env):
    """Hotter Colder
    The goal of hotter colder is to guess closer to a randomly selected number

    After each step the agent receives an observation of:
    0 - No guess yet submitted (only after reset)
    1 - Guess is lower than the target
    2 - Guess is equal to the target
    3 - Guess is higher than the target

    The rewards is calculated as:
    (min(action, self.number) + self.range) / (max(action, self.number) + self.range)

    Ideally an agent will be able to recognise the 'scent' of a higher reward and
    increase the rate in which is guesses in that direction until the reward reaches
    its maximum
    """
    def __init__(self):
        self.range = 1000  # +/- value the randomly select number can be between
        self.bounds = 2000  # Action space bounds

        self.action_space = spaces.Box(low=np.array([-self.bounds]), high=np.array([self.bounds]))
        self.observation_space = spaces.Discrete(4)

        self.number = 0
        self.guess_count = 0
        self.guess_max = 200
        self.observation = 0

        self.seed()
        self.reset()

    def seed(self, seed=None):
        self.np_random, seed = seeding.np_random(seed)
        return [seed]

    def step(self, action):
        assert self.action_space.contains(action)

        if action < self.number:
            self.observation = 1

        elif action == self.number:
            self.observation = 2

        elif action > self.number:
            self.observation = 3

        # action comes from a Box space, so it is a length-1 numpy array;
        # the reward approaches 1 as the guess approaches the target number
        reward = ((min(action, self.number) + self.bounds) / (max(action, self.number) + self.bounds)) ** 2

        self.guess_count += 1
        done = self.guess_count >= self.guess_max

        # reward is a length-1 array here, so reward[0] extracts the scalar value
        return self.observation, reward[0], done, {"number": self.number, "guesses": self.guess_count}

    def reset(self):
        self.number = self.np_random.uniform(-self.range, self.range)
        self.guess_count = 0
        self.observation = 0
        return self.observation
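
Once the package above is importable, the custom environment can be driven exactly like the built-in ones. A short usage sketch (it assumes the myenv package and the MyHotterColder-v0 registration shown above):

import gym
import myenv  # importing the package runs the register() call in __init__.py

env = gym.make('MyHotterColder-v0')
observation = env.reset()
for _ in range(10):
    action = env.action_space.sample()            # a random guess within the bounds
    observation, reward, done, info = env.step(action)
    print(observation, reward, info)
    if done:
        break
env.close()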

9. OpenAI Gym Evaluation platform

Users can record an algorithm's performance in an environment and upload it, or upload their own model as a Gist, to generate an evaluation report; short videos of the model playing the game can also be recorded. Every environment has a leaderboard for comparing the performance of different models.

The recording method is as follows:

import gym
from gym import wrappers
env = gym.make('CartPole-v0')
env = wrappers.Monitor(env, '/tmp/cartpole-experiment-1')
for i_episode in range(20):
    observation = env.reset()
    for t in range(100):
        env.render()
        print(observation)
        action = env.action_space.sample()
        observation, reward, done, info = env.step(action)
        if done:
            print("Episode finished after {} timesteps".format(t+1))
            break

Wrapping your environment with the Monitor wrapper records your model's performance under the path you specify. Performance data from different models in the same environment can be written to the same path.
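
One practical note: if the target directory already contains recordings from an earlier run, the classic Monitor wrapper may refuse to overwrite them; in that case it can be created with force=True (an optional argument of the old gym Monitor API, mentioned here as an assumption about your gym version):

import gym
from gym import wrappers

env = gym.make('CartPole-v0')
# force=True clears previous recordings in the directory instead of raising an error
env = wrappers.Monitor(env, '/tmp/cartpole-experiment-1', force=True)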

After registering on the official website, you can find your API_Key on your personal page; you can then upload your results to OpenAI Gym:

import gym
gym.upload('/tmp/cartpole-experiment-1', api_key='YOUR_API_KEY')

We then get the upload results.

Opening the link shows the evaluation report of the current model in that environment, along with a short recorded video.

Every time results are uploaded, OpenAI Gym evaluates them.

You can create a GitHub Gist to write up the results, or pass the write-up directly via the writeup parameter of upload:

import gym
gym.upload('/tmp/cartpole-experiment-1', writeup='https://gist.github.com/gdb/b6365e79be6052e7531e7ba6ea8caf23', api_key='YOUR_API_KEY')

The evaluation automatically computes a score and generates a nicely formatted results page.

In most environments, the goal is to minimize the number of steps needed to reach a threshold level of performance. Different environments have different thresholds, and in some environments it is not clear what the threshold should be; there the goal is simply to maximize final performance. In the CartPole environment, the threshold is based on how many frames the pole stays upright.

