ChainerRL is a deep reinforcement learning library built on top of Chainer.

Overview

ChainerRL

ChainerRL is a deep reinforcement learning library that implements various state-of-the-art deep reinforcement learning algorithms in Python using Chainer, a flexible deep learning framework.

Installation

ChainerRL is tested with Python 3.6. For other requirements, see requirements.txt.

ChainerRL can be installed via PyPI:

pip install chainerrl

It can also be installed from the source code:

python setup.py install

Refer to Installation for more information on installation.

Getting started

You can try the ChainerRL Quickstart Guide first, or check the ready-made examples for Atari 2600 and OpenAI Gym.

For more information, you can refer to ChainerRL's documentation.

Algorithms

Algorithm Discrete Action Continuous Action Recurrent Model Batch Training CPU Async Training
DQN (including DoubleDQN etc.) ✓ (NAF) x
Categorical DQN x x
Rainbow x x
IQN x x
DDPG x x
A3C ✓ (A2C)
ACER x
NSQ (N-step Q-learning) ✓ (NAF) x
PCL (Path Consistency Learning) x
PPO x
TRPO x
TD3 x x x
SAC x x x

The following algorithms have been implemented in ChainerRL:

The following useful techniques have also been implemented in ChainerRL:

Visualization

ChainerRL has a set of accompanying visualization tools to help developers understand and debug their RL agents. With these tools, the behavior of ChainerRL agents can be easily inspected from a browser UI.

Environments

Environments that support the subset of OpenAI Gym's interface (reset and step methods) can be used.
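
As a rough illustration of that subset, the sketch below defines a toy environment with only reset and step, plus a plain loop that drives it. ParityEnv and run_episode are hypothetical names made up for this example, not part of ChainerRL or Gym, and the observations are plain integers for brevity; a real agent would typically expect array observations.

```python
class ParityEnv:
    """Toy environment exposing only reset() and step() (hypothetical example).

    The observation is the current time step; the action 0 or 1 is rewarded
    when it matches the parity of that time step.
    """

    def __init__(self, episode_len=5):
        self.episode_len = episode_len
        self.t = 0

    def reset(self):
        """Start a new episode and return the first observation."""
        self.t = 0
        return self.t

    def step(self, action):
        """Apply an action and return (observation, reward, done, info)."""
        reward = 1.0 if action == self.t % 2 else 0.0
        self.t += 1
        done = self.t >= self.episode_len
        return self.t, reward, done, {}


def run_episode(env, policy):
    """Run one episode with a policy function and return the sum of rewards."""
    obs = env.reset()
    done = False
    total_reward = 0.0
    while not done:
        obs, reward, done, _ = env.step(policy(obs))
        total_reward += reward
    return total_reward
```

The four-element return value of step follows the Gym convention used elsewhere on this page (obs, reward, done, info).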

Contributing

Any kind of contribution to ChainerRL would be highly appreciated! If you are interested in contributing to ChainerRL, please read CONTRIBUTING.md.

License

MIT License.

Citations

To cite ChainerRL in publications:

@InProceedings{fujita2019chainerrl,
  author = {Fujita, Yasuhiro and Kataoka, Toshiki and Nagarajan, Prabhat and Ishikawa, Takahiro},
  title = {ChainerRL: A Deep Reinforcement Learning Library},
  booktitle = {Workshop on Deep Reinforcement Learning at the 33rd Conference on Neural Information Processing Systems},
  location = {Vancouver, Canada},
  month = {December},
  year = {2019}
}
Comments
  • Fix CI errors due to pyglet, zipp, mock, and gym

    • Use newer gym because old gym does not work with pyglet>=1.4.
    • Modify some example scripts that are not compatible with newer gym.
    • Use zipp==1.0.0 in flexCI because new zipp does not work with Python 3.5.
    • Remove mock from dependency as we do not support py2 now.
    test 
    opened by muupan 14
  • Rainbow Scores

    The previous Rainbow scores were generated with an incorrect hyperparameter. This PR aims to rectify that. We reran Rainbow using the correct hyperparameters, and have the new results here.

    This resolves issue #466

    example 
    opened by prabhatnagarajan 14
  • Recurrent and batched TRPO

    ~Merge #431 before this PR.~

    • Precompute action distributions for importance weighting and KL divergence as #430 did
    • Support recurrent models as #431 did
    • Support batch training
    enhancement 
    opened by muupan 13
  • Fix weight normalization inside prioritized experience replay

    This PR fixes the wrong computation of min_probability in PrioritizedReplayBuffer when normalize_by_max == 'batch', which is the default behavior. The previous behavior was unexpectedly the same as normalize_by_max == 'memory'.
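
To make the difference between the two modes concrete, here is a minimal sketch of importance-sampling weights normalized by their maximum, assuming the standard prioritized-replay form w_i = (N * p_i)^(-beta) divided by max_j w_j, which simplifies to (p_i / p_min)^(-beta). The function name and arguments are hypothetical and do not reproduce ChainerRL's internal code; only the meaning of 'batch' versus 'memory' is taken from the description above.

```python
def importance_weights(sampled_probs, all_probs, beta, normalize_by_max="batch"):
    """Return importance weights scaled so that the reference maximum is 1.

    sampled_probs: sampling probabilities of the transitions in the minibatch.
    all_probs: sampling probabilities of every transition in the buffer.
    """
    if normalize_by_max == "batch":
        # The largest weight belongs to the least likely sample in the batch.
        min_probability = min(sampled_probs)
    elif normalize_by_max == "memory":
        # The largest weight belongs to the least likely item in the buffer.
        min_probability = min(all_probs)
    else:
        raise ValueError("normalize_by_max must be 'batch' or 'memory'")
    return [(p / min_probability) ** -beta for p in sampled_probs]
```

With 'batch', the largest weight in every minibatch is exactly 1; with 'memory', all weights in a minibatch can be smaller than 1 when the least likely transition was not sampled.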

    bug 
    opened by muupan 11
  • Drop python2 support

    ~Resolves #463~ Resolves #392 Resolves #467

    • Travis CI with python 2 is stopped
    • future is removed from dependencies
    • import from __future__ is removed
    • hacks for python 2 are removed (#392)
    • switched from ppa:jonathanf to ppa:cran for installing ffmpeg since the former is no longer available
    • remove use of six
    • remove fastcache, funcsigs, statistics as dependencies
    enhancement 
    opened by muupan 11
  • Implicit quantile networks (IQN)

    Please merge ~#350~ ~#356~ first.

    This PR resolves #282

    For its scores, see recent comments below. For some games it achieves same-level scores as the paper's. For some other games, it still underperforms the paper, but I don't know why.

    • [x] Add tests for QuantileDiscreteActionValue
    • [x] Add tests for quantile huber loss
    • [x] Try i=1,...,n (from Appendix) instead of i=0,...,n-1 (from eq. 4) (the former is correct, confirmed by Georg Ostrovski)
    enhancement 
    opened by muupan 11
  • guide on how to use LSTM version of DDPG on gym environments

    I am trying to run DDPG with the gym Pendulum-v0 environment. However I am getting this error:

    TypeError: The batch size of x must be equal to or less thanthe size of the previous state h.

    This is my code:

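    # Imports assumed from the names used below
    import chainer
    from chainer import optimizers
    import chainerrl
    from chainerrl import explorers, policy
    from chainerrl import q_functions as q_func_
    from chainerrl.agents.ddpg import DDPG, DDPGModel
    import gym
    import numpy as np

    env = gym.make('Pendulum-v0')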
    obs_size = env.observation_space.shape[0]
    n_actions = env.action_space.shape[0]
    
    q_func = q_func_.FCLSTMSAQFunction(obs_size, n_actions, n_hidden_channels=50, n_hidden_layers=2)
    pi = policy.FCLSTMDeterministicPolicy(n_input_channels=obs_size, n_hidden_channels=50, n_hidden_layers=2, 
                                          action_size=n_actions, 
                                          min_action=env.action_space.low, 
                                          max_action=env.action_space.high, 
                                          bound_action=True
                                         )
    model = DDPGModel(policy=pi, q_func=q_func)
    opt_a = optimizers.Adam(alpha=1e-4)
    opt_c = optimizers.Adam(alpha=1e-3)
    opt_a.setup(model['policy'])
    opt_c.setup(model['q_function'])
    opt_a.add_hook(chainer.optimizer.GradientClipping(1.0), 'hook_a')
    opt_c.add_hook(chainer.optimizer.GradientClipping(1.0), 'hook_c')
    
    ou_sigma = (env.action_space.high - env.action_space.low) * 0.2
    explorer = explorers.AdditiveOU(sigma=ou_sigma)
    
    replay_buffer = chainerrl.replay_buffer.ReplayBuffer(capacity=5 * 10 ** 5)
    
    phi = lambda x: x.astype(np.float32, copy=False)
    
    agent = DDPG(model, opt_a, opt_c, replay_buffer, gamma=0.995, explorer=explorer, 
                 replay_start_size=5000, target_update_method='soft', 
                 target_update_interval=1, update_interval=1,
                 soft_update_tau=1e-2, n_times_update=1, 
                 gpu=0, minibatch_size=200, phi=phi)
    
    n_episodes = 200
    max_episode_len = 200
    for i in range(1, n_episodes + 1):
        obs = env.reset()
        reward = 0
        done = False
        R = 0  # return (sum of rewards)
        t = 0  # time step
        while not done and t < max_episode_len:
            # Uncomment to watch the behaviour
            # env.render()
            action = agent.act_and_train(obs, reward)
            obs, reward, done, _ = env.step(action)
            R += reward
            t += 1
        if i % 10 == 0:
            print('episode:', i,
                  '\nR:', R,
                  '\nstatistics:', agent.get_statistics())
        agent.stop_episode_and_train(obs, reward, done)
    print('Finished.')
    

    Here is the full output of the initial run and the error:

    episode: 10
    R: -1069.3354146961874
    statistics: [('average_q', -0.1465160510604003), ('average_actor_loss', 0.0), ('average_critic_loss', 0.0)]
    episode: 20
    R: -1583.6140918088897
    statistics: [('average_q', -0.16802258113631832), ('average_actor_loss', 0.0), ('average_critic_loss', 0.0)]

    TypeError                                 Traceback (most recent call last)
         10 # Uncomment to watch the behaviour
         11 # env.render()
    ---> 12 action = agent.act_and_train(obs, reward)
         13 obs, reward, done, _ = env.step(action)
         14 R += reward

    ~\AppData\Local\Continuum\anaconda3\envs\chainer\lib\site-packages\chainerrl\agents\ddpg.py in act_and_train(self, obs, reward)
        335 self.last_action = action
        336
    --> 337 self.replay_updater.update_if_necessary(self.t)
        338
        339 return self.last_action

    ~\AppData\Local\Continuum\anaconda3\envs\chainer\lib\site-packages\chainerrl\replay_buffer.py in update_if_necessary(self, iteration)
        543 else:
        544 transitions = self.replay_buffer.sample(self.batchsize)
    --> 545 self.update_func(transitions)

    ~\AppData\Local\Continuum\anaconda3\envs\chainer\lib\site-packages\chainerrl\agents\ddpg.py in update(self, experiences, errors_out)
        263
        264 batch = batch_experiences(experiences, self.xp, self.phi, self.gamma)
    --> 265 self.critic_optimizer.update(lambda: self.compute_critic_loss(batch))
        266 self.actor_optimizer.update(lambda: self.compute_actor_loss(batch))
        267

    ~\AppData\Local\Continuum\anaconda3\envs\chainer\lib\site-packages\chainer\optimizer.py in update(self, lossfun, *args, **kwds)
        862 if lossfun is not None:
        863 use_cleargrads = getattr(self, '_use_cleargrads', True)
    --> 864 loss = lossfun(*args, **kwds)
        865 if use_cleargrads:
        866 self.target.cleargrads()

    ~\AppData\Local\Continuum\anaconda3\envs\chainer\lib\site-packages\chainerrl\agents\ddpg.py in <lambda>()
        263
        264 batch = batch_experiences(experiences, self.xp, self.phi, self.gamma)
    --> 265 self.critic_optimizer.update(lambda: self.compute_critic_loss(batch))
        266 self.actor_optimizer.update(lambda: self.compute_actor_loss(batch))
        267

    ~\AppData\Local\Continuum\anaconda3\envs\chainer\lib\site-packages\chainerrl\agents\ddpg.py in compute_critic_loss(self, batch)
        208 # Estimated Q-function observes s_t and a_t
        209 predict_q = F.reshape(
    --> 210 self.q_function(batch_state, batch_actions),
        211 (batchsize,))
        212

    ~\AppData\Local\Continuum\anaconda3\envs\chainer\lib\site-packages\chainerrl\q_functions\state_action_q_functions.py in __call__(self, x, a)
        105 h = F.concat((x, a), axis=1)
        106 h = self.nonlinearity(self.fc(h))
    --> 107 h = self.lstm(h)
        108 return self.out(h)
        109

    ~\AppData\Local\Continuum\anaconda3\envs\chainer\lib\site-packages\chainer\link.py in __call__(self, *args, **kwargs)
        292 # forward is implemented in the child classes
        293 forward = self.forward # type: ignore
    --> 294 out = forward(*args, **kwargs)
        295
        296 # Call forward_postprocess hook

    ~\AppData\Local\Continuum\anaconda3\envs\chainer\lib\site-packages\chainer\links\connection\lstm.py in forward(self, x)
        296 msg = ('The batch size of x must be equal to or less than'
        297 'the size of the previous state h.')
    --> 298 raise TypeError(msg)
        299 elif h_size > batch:
        300 h_update, h_rest = split_axis.split_axis(

    TypeError: The batch size of x must be equal to or less thanthe size of the previous state h.

    opened by junhuang-ifast 10
  • Prioritized Double IQN

    We should first merge Double IQN.

    Here is a comparison against Double IQN

    | Game | DoubleIQN | Prioritized Double IQN |
    | ------------- |:-------------:|:-------------:|
    | Asterix | 507353.8 | 738166.66 |
    | Bowling | 80.33 | 72.72 |
    | Hero | 28564.58 | 35293.26 |
    | MontezumaRevenge | 5.55 | 3.79 |
    | Qbert | 29531.1 | 25763.95 |
    | Seaquest | 30870.0 | 31905.0 |
    | Venture | 719.51 | 1369.84 |
    | VideoPinball | 731942.25 | 717376.0 |

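    On the 8 games listed here, Prioritized Double IQN scores higher on 4 (Asterix, Hero, Seaquest, Venture) and lower on 4 (Bowling, MontezumaRevenge, Qbert, VideoPinball).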

    enhancement 
    opened by prabhatnagarajan 10
  • Replicate Prioritized Experience Replay's reported performance improvements

    Missing details

    • "all weights w_i were scaled so that max_i w_i = 1". Is max_i w_i computed over a minibatch or the whole buffer?
    • What is the value of epsilon that is added to absolute TD errors?
    opened by muupan 9
  • show error message

    When a QFunction is constructed and its model parameters are not yet initialized (this happens when some Link is instantiated with in_size=None), copy_param fails because the target_link's param is None.

    This PR shows a user-friendly error message that tells the user the cause of this error.

    I'd like comments on what kind of message would be better.

    I'm not sure how much performance degrades by checking for this type error.
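
The idea can be sketched with plain dictionaries standing in for links: before copying, check whether either side still holds None and raise an error that names the parameter. This is only an illustration under that assumption; copy_param_checked and the dict-of-lists layout are hypothetical and do not reproduce ChainerRL's copy_param.

```python
def copy_param_checked(target_params, source_params):
    """Copy parameter values by name, failing early with a readable message.

    Both arguments map parameter names to lists of floats, or to None when
    the parameter has not been initialized yet (hypothetical layout).
    """
    for name, source_value in source_params.items():
        target_value = target_params[name]
        if source_value is None or target_value is None:
            raise TypeError(
                "Parameter '{}' is not initialized. Initialize the model, "
                "for example by giving explicit input sizes, before "
                "copying parameters.".format(name))
        # Copy in place so that existing references to the target stay valid.
        target_value[:] = source_value
```

The check itself is a pair of identity comparisons per parameter, so its cost in this sketch is small next to the copy.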

    enhancement 
    opened by corochann 8
  • A2C

    Add A2C, described in the following paper and blog post: https://arxiv.org/abs/1708.05144 https://blog.openai.com/baselines-acktr-a2c/ A2C is a synchronous, deterministic variant of Asynchronous Advantage Actor Critic (A3C).

    enhancement 
    opened by iory 8
  • Create a copy of agent in evaluation for chainerrl/experiments/train_agent_async.py

    I am a ChainerRL user, and my code breaks if the same agent is used for evaluation (note that the training and evaluation envs could be different), so I request that evaluation use a copy of the agent.

    opened by CyanFi 0
  • Regarding the output of DQN

    I would like to use DQN to control a multi-degree-of-freedom robot arm. However, the DQN in this library has only one output if the action space is discrete. In the case of multiple degrees of freedom, do we have to use a neural network for each joint?

    opened by KitanoFumiya 1
  • Can I get the distortion data in MuJoCo?

    I am doing research on reinforcement learning using MuJoCo. I am training my own robot arm, and I want data on the distortion (strain) of the robot arm's links. Does anyone know how to obtain strain data?

    opened by KitanoFumiya 0
  • Question about quantile huber loss function in IQN

    Hello, I have one question. In the IQN paper, the condition in the quantile Huber loss function is delta_{ij} < 0.

    But the chainerrl IQN code uses delta_{ij} > 0. I think this inequality sign is not correct.

    I’m sorry for poor English.
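
For reference, the quantile Huber loss as commonly written in the quantile-regression literature is |tau - 1{delta < 0}| * L_kappa(delta) / kappa, where delta is the TD error (target minus prediction) and L_kappa is the Huber loss. The sketch below is a scalar illustration of that formula only; the function names are hypothetical and it is not the ChainerRL implementation. If delta is instead defined as prediction minus target, the same loss is obtained by flipping the inequality in the indicator.

```python
def huber(delta, kappa=1.0):
    """Huber loss: quadratic for |delta| <= kappa, linear beyond it."""
    abs_delta = abs(delta)
    if abs_delta <= kappa:
        return 0.5 * delta * delta
    return kappa * (abs_delta - 0.5 * kappa)


def quantile_huber_loss(delta, tau, kappa=1.0):
    """Quantile Huber loss for one TD error, delta = target - prediction."""
    indicator = 1.0 if delta < 0 else 0.0
    return abs(tau - indicator) * huber(delta, kappa) / kappa
```

With tau = 0.25, a negative error is weighted by 0.75 and a positive one by 0.25, so overestimates are penalized more for low quantiles.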

    bug 
    opened by ShogoAkiyama 1
  • Segmentation fault in Docker after importing chainerrl

    I've encountered a segmentation fault when using chainerrl with a GPU in Docker. The error occurs if I import chainerrl first and then call cuda.get_device(args).use(). The quick fix my colleague and I found is to call cuda.get_device(args).use() first, then import chainerrl. Both scenarios are shown below.

    [email protected]:/home# python3
    Python 3.7.6 (default, Dec 19 2019, 23:50:13) 
    [GCC 7.4.0] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> from chainer import cuda
    >>> import chainerrl
    >>> cuda.get_device(1).use()
    Segmentation fault (core dumped)
    
    [email protected]:/home# python3
    Python 3.7.6 (default, Dec 19 2019, 23:50:13) 
    [GCC 7.4.0] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> from chainer import cuda
    >>> cuda.get_device(1).use()
    >>> import chainerrl
    >>> 
    

    Dockerfile

    FROM nvidia/cuda:10.1-cudnn7-devel-ubuntu18.04
    RUN apt-get update
    
    RUN apt-get install -y wget git build-essential cmake libxerces-c-dev libfox-1.6-dev libgdal-dev libproj-dev libgl2ps-dev swig && rm -rf /var/lib/apt/lists/*
    
    RUN apt update \
    	&& apt install software-properties-common -y \
    	&& add-apt-repository ppa:deadsnakes/ppa -y \
    	&& apt-get update \
    	&& apt install python3.7 -y \
    	&& update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.6 1 \
    	&& update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.7 2
    
    RUN apt-get install python3.7-dev python3-pip python3-wheel python3-setuptools -y
    
    RUN git clone https://github.com/chainer/chainerrl.git \
    	&& cd chainerrl \
    	&& python3 setup.py install
    
    RUN pip3 install cupy-cuda101
    

    My current computer configurations:

    [email protected]:/home# nvidia-smi
    Thu Jan 16 21:07:09 2020       
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |===============================+======================+======================|
    |   0  TITAN RTX           On   | 00000000:3B:00.0 Off |                  N/A |
    | 63%   81C    P2   246W / 280W |   5777MiB / 24220MiB |     99%      Default |
    +-------------------------------+----------------------+----------------------+
    |   1  TITAN RTX           On   | 00000000:AF:00.0  On |                  N/A |
    | 41%   59C    P8    30W / 280W |   1422MiB / 24217MiB |      9%      Default |
    +-------------------------------+----------------------+----------------------+
                                                                                   
    +-----------------------------------------------------------------------------+
    | Processes:                                                       GPU Memory |
    |  GPU       PID   Type   Process name                             Usage      |
    |=============================================================================|
    +-----------------------------------------------------------------------------+
    [email protected]:/home# nvcc -V
    nvcc: NVIDIA (R) Cuda compiler driver
    Copyright (c) 2005-2019 NVIDIA Corporation
    Built on Sun_Jul_28_19:07:16_PDT_2019
    Cuda compilation tools, release 10.1, V10.1.243
    [email protected]:/home/DRL_Traffic_Corridor#
    
    [email protected]:/home# lscpu
    Architecture:        x86_64
    CPU op-mode(s):      32-bit, 64-bit
    Byte Order:          Little Endian
    CPU(s):              32
    On-line CPU(s) list: 0-31
    Thread(s) per core:  2
    Core(s) per socket:  8
    Socket(s):           2
    NUMA node(s):        2
    Vendor ID:           GenuineIntel
    CPU family:          6
    Model:               85
    Model name:          Intel(R) Xeon(R) Silver 4110 CPU @ 2.10GHz
    Stepping:            4
    CPU MHz:             800.012
    CPU max MHz:         3000.0000
    CPU min MHz:         800.0000
    BogoMIPS:            4200.00
    Virtualization:      VT-x
    L1d cache:           32K
    L1i cache:           32K
    L2 cache:            1024K
    L3 cache:            11264K
    NUMA node0 CPU(s):   0-7,16-23
    NUMA node1 CPU(s):   8-15,24-31
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single pti ssbd mba ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts pku ospke md_clear flush_l1d
    
    [email protected]:/home# uname -a
    Linux 9d606b33d95a 5.0.0-37-generic #40~18.04.1-Ubuntu SMP Thu Nov 14 12:06:39 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
    

    This is as far as we could debug this issue. It was quite a problem until we found the import-order fix. We are just curious why importing chainerrl first causes a segmentation fault.

    opened by russelltankl 2
Releases (v0.8.0)
  • v0.8.0 (Feb 14, 2020)

    Announcement

    This release will probably be the final major update under the name of ChainerRL. The development team is planning to switch its backend from Chainer to PyTorch and continue its development as OSS.

    Important enhancements

    • Soft Actor-Critic (https://arxiv.org/abs/1812.05905) with benchmark results is added.
      • Agent class: chainerrl.agents.SoftActorCritic
      • Example and benchmark results (MuJoCo): https://github.com/chainer/chainerrl/tree/v0.8.0/examples/mujoco/reproduction/soft_actor_critic
      • Example (Roboschool Atlas): https://github.com/chainer/chainerrl/tree/v0.8.0/examples/atlas
    • Trained models of benchmark results are now downloadable. See READMEs of examples.
      • For Atari envs: DQN, IQN, Rainbow, A3C
      • For MuJoCo envs: DDPG, PPO, TRPO, TD3, Soft Actor-Critic
    • DQN-based agents now support recurrent models in a new, more efficient interface.
      • Example: https://github.com/chainer/chainerrl/tree/v0.8.0/examples/atari/train_drqn_ale.py
    • TRPO now supports recurrent models and batch training.
    • A variant of IQN with double Q-learning is added.
      • Agent class: chainerrl.agents.DoubleIQN.
      • Example: https://github.com/chainer/chainerrl/tree/v0.8.0/examples/atari/train_double_iqn.py
    • IQN now supports prioritized experience replay.

    Important bugfixes

    • The bug that the update of CategoricalDoubleDQN was the same as that of CategoricalDQN is fixed.
    • The bug that batch training with N-step or episodic replay buffers does not work is fixed.
    • The bug that weight normalization in PrioritizedReplayBuffer with normalize_by_max == 'batch' is wrong is fixed.

    Important destructive changes

    • Support of Python 2 is dropped. ChainerRL is now only tested with Python 3.5.1+.
    • The interface of DQN-based agents to use recurrent models has changed. See the DRQN example: https://github.com/chainer/chainerrl/tree/v0.8.0/examples/atari/train_drqn_ale.py

    All updates

    Enhancements

    • Recurrent DQN families with a new interface (#436)
    • Recurrent and batched TRPO (#446)
    • Add Soft Actor-Critic agent (#457)
    • Code to collect demonstrations from an agent. (#468)
    • Monitor with ContinuingTimeLimit support (#491)
    • Fix B007: Loop control variable not used within the loop body (#502)
    • Double IQN (#503)
    • Fix B006: Do not use mutable data structures for argument defaults. (#504)
    • Splits Replay Buffers into separate files in a replay_buffers module (#506)
    • Use chainer.grad in ACER (#511)
    • Prioritized Double IQN (#518)
    • Add policy loss to TD3's logged statistics (#524)
    • Adds checkpoint frequencies for serial and batch Agents. (#525)
    • Add a deterministic mode to IQN for stable tests (#529)
    • Use Link.cleargrads instead of Link.zerograds in REINFORCE (#536)
    • Use cupyx.scatter_add instead of cupy.scatter_add (#537)
    • Avoid cupy.zeros_like with numpy.ndarray (#538)
    • Use get_device_from_id since get_device is deprecated (#539)
    • Releases trained models for all reproduced agents (#565)

    Documentation

    • Typo fix in Replay Buffer Docs (#507)
    • Fixes typo in docstring for AsyncEvaluator (#508)
    • Improve the algorithm list on README (#509)
    • Add Explorers to Documentation (#514)
    • Fixes syntax errors in ReplayBuffer docs. (#515)
    • Adds policies to the documentation (#516)
    • Adds demonstration collection to experiments docs (#517)
    • Adds List of Batch Agents to the README (#543)
    • Add documentation for Q-functions and some missing details in docstrings (#556)
    • Add comment on environment version difference (#582)
    • Adds ChainerRL Bibtex to the README (#584)
    • Minor Typo Fix (#585)

    Examples

    • Rename examples directories (#487)
    • Adds training times for reproduced Mujoco results (#497)
    • Adds additional information to Grasping Example README (#501)
    • Fixes a comment in PPO example (#521)
    • Rainbow Scores (#546)
    • Update train_a3c.py (#547, thanks @xinyuewang1!)
    • Update train_a3c.py (#548, thanks @xinyuewang1!)
    • Improves formatting of IQN training times (#549)
    • Corrects Scores in Examples (#552)
    • Removes GPU option from README (#564)
    • Releases trained models for all reproduced agents (#565)
    • Add an example script for RoboschoolAtlasForwardWalk-v1 (#577)
    • Corrects Rainbow Results (#580)
    • Adds proper A3C scores (#581)

    Testing

    • Add CI configs (#478)
    • Specify ubuntu 16.04 for Travis CI and modify a dependency accordingly (#520)
    • Remove a trailing space of DoubleIQN (#526)
    • Add a deterministic mode to IQN for stable tests (#529)
    • Fix import error when chainer==7.0.0b3 (#531)
    • Make test_monitor.py work on flexCI (#533)
    • Improve parameter distributions used in TestGaussianDistribution (#540)
    • Increase flexCI's time limit to 20min (#550)
    • decrease amount of decimal digits required to 4 (#554)
    • Use attrs<19.2.0 with pytest (#569)
    • Run slow tests with flexCI (#575)
    • Typo fix in CI comment. (#576)
    • Adds time to DDPG Tests (#587)
    • Fix CI errors due to pyglet, zipp, mock, and gym (#592)

    Bugfixes

    • Fix a bug in batch_recurrent_experiences regarding next_action (#528)
    • Fix ValueError in SARSA with GPU (#534)
    • fix function call (#541)
    • Pass env_id to replay_buffer methods to fix batch training (#558)
    • Fixes Categorical Double DQN Error. (#567)
    • Fix weight normalization inside prioritized experience replay (#570)
  • v0.7.0 (Jun 28, 2019)

    Important enhancements

    • Rainbow (https://arxiv.org/abs/1710.02298) with benchmark results is added. (thanks @seann999!)
      • Agent class: chainerrl.agents.CategoricalDoubleDQN
      • Example and benchmark results: https://github.com/chainer/chainerrl/tree/v0.7.0/examples/atari/rainbow
    • TD3 (https://arxiv.org/abs/1802.09477) with benchmark results is added.
      • Agent class: chainerrl.agents.TD3
      • Example and benchmark results: https://github.com/chainer/chainerrl/tree/v0.7.0/examples/mujoco/td3
    • PPO now supports recurrent models.
      • Example: https://github.com/chainer/chainerrl/tree/v0.7.0/examples/ale/train_ppo_ale.py (with --recurrent option)
      • Results: https://github.com/chainer/chainerrl/pull/431
    • DDPG now supports batch training
      • Example: https://github.com/chainer/chainerrl/tree/v0.7.0/examples/gym/train_ddpg_batch_gym.py

    Important bugfixes

    • The bug that some examples use the same random seed across envs for env.seed is fixed.
    • The bug that batch training with n-step return and/or recurrent models is not successful is fixed.
    • The bug that examples/ale/train_dqn_ale.py uses LinearDecayEpsilonGreedy even when NoisyNet is used is fixed.
    • The bug that examples/ale/train_dqn_ale.py does not use the value specified by --noisy-net-sigma is fixed.
    • The bug that chainerrl.links.to_factorized_noisy does not work correctly with chainerrl.links.Sequence is fixed.

    Important destructive changes

    • chainerrl.experiments.train_agent_async now requires eval_n_steps (number of timesteps for each evaluation phase) and eval_n_episodes (number of episodes for each evaluation phase) to be explicitly specified, with one of them being None.
    • examples/ale/dqn_phi.py is removed.
    • chainerrl.initializers.LeCunNormal is removed. Use chainer.initializers.LeCunNormal instead.
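
The eval_n_steps / eval_n_episodes rule above (both given explicitly, exactly one of them None) can be sketched as a small validation helper. check_eval_args is a hypothetical name used only for illustration; it does not claim to mirror how ChainerRL performs this check internally.

```python
def check_eval_args(eval_n_steps, eval_n_episodes):
    """Validate that exactly one evaluation budget is set.

    Returns ("steps", n) or ("episodes", n) for the budget that is not None.
    """
    if (eval_n_steps is None) == (eval_n_episodes is None):
        raise ValueError(
            "Specify exactly one of eval_n_steps and eval_n_episodes "
            "and set the other to None.")
    if eval_n_steps is not None:
        return "steps", eval_n_steps
    return "episodes", eval_n_episodes
```

Requiring an explicit None for the unused argument makes the chosen evaluation protocol visible at the call site.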

    All updates

    Enhancement

    • Rainbow (#374)
    • Make copy_param support scalar parameters (#410)
    • Enables batch DDPG agents to be trained. (#416)
    • Enables asynchronous time-based evaluations of agents. (#420)
    • Removes obsolete dqn_phi file (#424)
    • Add Branched and use it to simplify train_ppo_batch_gym.py (#427)
    • Remove LeCunNormal since Chainer has it from v3 (#428)
    • Precompute log probability in PPO (#430)
    • Recurrent PPO with a stateless recurrent model interface (#431)
    • Replace Variable.data with Variable.array (again) (#434)
    • Make IQN work with tuple observations (#435)
    • Add VectorStackFrame to reduce memory usage in train_dqn_batch_ale.py (#443)
    • DDPG example that reproduces the TD3 paper (#452)
    • TD3 agent (#453)
    • update requirements.txt and setup.py for gym (#461)
    • Support gym>=0.12.2 by stopping to use underscore methods in gym wrappers (#462)
    • Add warning about numpy 1.16.0 (#476)

    Documentation

    • Link to abstract pages on ArXiv (#409)
    • fixes typo (#412)
    • Fixes file path in grasping example README (#422)
    • Add links to references (#425)
    • Fixes minor grammar mistake in A3C ALE example (#432)
    • Add explanation of examples/atari (#437)
    • Link to chainer/chainer, not pfnet/chainer (#439)
    • Link to chainer/chainer(rl), not pfnet/chainer(rl) (#440)
    • fix & add docstring for FCStateQFunctionWithDiscreteAction (#441)
    • Fixes a typo in train_agent_batch Documentation. (#444)
    • Adds Rainbow to main README (#447)
    • Fixes Docstring in IQN (#451)
    • Improves Rainbow README (#458)
    • very small fix: add missing doc for eval_performance. (#459)
    • Adds IQN Results to readme (#469)
    • Adds IQN to the documentation. (#470)
    • Adds reference to mujoco folder in the examples README (#474)
    • Fixes incorrect comment. (#490)

    Examples

    • Rainbow (#374)
    • Create an IQN example aimed at reproducing the original paper and its evaluation protocol. (#408)
    • Benchmarks DQN example (#414)
    • Enables batch DDPG agents to be trained. (#416)
    • Fixes scores for Demon Attack (#418)
    • Set observation_space of kuka env correctly (#421)
    • Fixes error in setting explorer in DQN ALE example. (#423)
    • Add Branched and use it to simplify train_ppo_batch_gym.py (#427)
    • A3C Example for reproducing paper results. (#433)
    • PPO example that reproduces the "Deep Reinforcement Learning that Matters" paper (#448)
    • DDPG example that reproduces the TD3 paper (#452)
    • TD3 agent (#453)
    • Apply noisy_net_sigma parameter (#465)

    Testing

    • Use Python 3.6 in Travis CI (#411)
    • Increase tolerance of TestGaussianDistribution.test_entropy since sometimes it failed (#438)
    • make FrameStack follow original spaces (#445)
    • Split test_examples.sh (#472)
    • Fix Travis error (#492)
    • Use Python 3.6 for ipynb (#493)

    Bugfixes

    • bugfix (#360, thanks @corochann!)
    • Fixes error in setting explorer in DQN ALE example. (#423)
    • Make sure the agent sees when episodes end (#429)
    • Pass env_id to replay buffer methods to correctly support batch training (#442)
    • Add VectorStackFrame to reduce memory usage in train_dqn_batch_ale.py (#443)
    • Fix a bug of unintentionally using same process indices (#455)
    • Make cv2 dependency optional (#456)
    • fix ScaledFloatFrame.observation_space (#460)
    • Apply noisy_net_sigma parameter (#465)
    • Match EpisodicReplayBuffer.sample with ReplayBuffer.sample (#485)
    • Make to_factorized_noisy work with sequential links (#489)
  • v0.6.0 (Feb 28, 2019)

    Important enhancements

    • Implicit Quantile Network (IQN) https://arxiv.org/abs/1806.06923 agent is added: chainerrl.agents.IQN.
    • Training DQN and its variants with N-step returns is supported.
    • Resetting env with done=False via info dict is supported. When env.step returns an info dict with info['needs_reset']=True, the env is reset. This feature is useful for implementing a continuing env.
    • Evaluation with a fixed number of timesteps is supported (except async training). This evaluation protocol is popular in Atari benchmarks.
      • examples/atari/dqn now implements the same evaluation protocol as the Nature DQN paper.
    • An example script of training a DoubleDQN agent for a PyBullet-based robotic grasping env is added: examples/grasping.
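The needs_reset signal described above can be illustrated with a toy environment. This is a minimal sketch, not ChainerRL's own training loop: ContinuingEnv and run are hypothetical names, and only the info['needs_reset'] key comes from the release notes.

```python
class ContinuingEnv:
    """Hypothetical toy env that never terminates but requests periodic resets."""

    def __init__(self, horizon=3):
        self.horizon = horizon
        self.t = 0

    def reset(self):
        self.t = 0
        return 0  # dummy observation

    def step(self, action):
        self.t += 1
        obs, reward, done = self.t, 1.0, False
        # Ask for a reset without marking the state as terminal.
        info = {'needs_reset': self.t >= self.horizon}
        return obs, reward, done, info


def run(env, n_steps):
    """Minimal loop that resets on either `done` or info['needs_reset']."""
    resets = 0
    env.reset()
    for _ in range(n_steps):
        _, _, done, info = env.step(0)
        if done or info.get('needs_reset', False):
            env.reset()
            resets += 1
    return resets
```

Because done stays False, a learner can still bootstrap from the next state when the episode is cut, which is the point of signalling the reset separately.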

    Important bugfixes

    • The bug that PPO's obs_normalizer was not saved is fixed.
    • The bug that NonbiasWeightDecay didn't work with newer versions of Chainer is fixed.
    • The bug that argv argument was ignored by chainerrl.experiments.prepare_output_dir is fixed.

    Important destructive changes

    • train_agent_with_evaluation and train_agent_batch_with_evaluation now require eval_n_steps (number of timesteps for each evaluation phase) and eval_n_episodes (number of episodes for each evaluation phase) to be explicitly specified, with one of them being None.
    • train_agent_with_evaluation's max_episode_len argument is renamed to train_max_episode_len.
    • ReplayBuffer.sample now returns a list of lists of N experiences to support N-step returns.
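To show what a "list of lists of N experiences" allows, the sketch below computes a discounted N-step reward sum for each sampled entry. It assumes each experience is a dict with 'reward' and 'is_state_terminal' keys; those key names and the n_step_return helper are assumptions for illustration, and the bootstrap term from the final next state is left out.

```python
def n_step_return(transitions, gamma):
    """Discounted reward sum over consecutive transitions, stopping at a terminal one."""
    ret = 0.0
    for i, tr in enumerate(transitions):
        ret += (gamma ** i) * tr['reward']
        if tr['is_state_terminal']:
            break
    return ret


# A hypothetical sampled batch: each entry is a list of up to N experiences.
batch = [
    [
        {'reward': 1.0, 'is_state_terminal': False},
        {'reward': 2.0, 'is_state_terminal': False},
        {'reward': 4.0, 'is_state_terminal': False},
    ],
    [
        {'reward': 1.0, 'is_state_terminal': True},
    ],
]
returns = [n_step_return(exps, gamma=0.5) for exps in batch]
```

Code written against the earlier flat return value would need to index into each inner list, for example taking its first element when N is 1.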

    All updates

    Enhancement

    • Implicit quantile networks (IQN) (#288)
    • Adds N-step learning for DQN-based agents. (#317)
    • Replaywarning (#321)
    • Close envs in async training (#343)
    • Allow envs to send a 'needs_reset' signal (#356)
    • Changes variable names in train_agent_with_evaluation (#358)
    • Use chainer.dataset.concat_examples in batch_states (#366)
    • Implements Time-based evaluations (#367)

    Documentation

    • Add long description for pypi (#357, thanks @ljvmiranda921!)
    • A small change to the installation documentation (#369)
    • Adds a link to the ChainerRL visualizer from the main repository (#370)
    • adds implicit quantile networks to readme (#393)
    • Fix DQN.update's docstring (#394)

    Examples

    • Grasping example (#371)
    • Adds Deepmind Scores to README in DQN Example (#383)

    Testing

    • Fix TestTrainAgentAsync (#363)
    • Use AbnormalExitCodeWarning for nonzero exitcode warnings (#378)
    • Avoid random test failures due to asynchronousness (#380)
    • Drop hacking (#381)
    • Avoid gym 0.11.0 in Travis (#396)
    • Stabilize and speed up A3C tests (#401)
    • Reduce ACER's test cases and maximum timesteps (#404)
    • Add tests of IQN examples (#405)

    Bugfixes

    • Avoid UnicodeDecodeError in setup.py (#365)
    • Save and load obs_normalizer of PPO (#377)
    • Make NonbiasWeightDecay work again (#390)
    • bug fix (#391, thanks @tappy27!)
    • Fix episodic training of DDPG (#399)
    • Fix PGT's training (#400)
    • Fix ResidualDQN's training (#402)
  • v0.5.0(Nov 15, 2018)

    Important enhancements

    • Batch synchronized training using multiple environment instances and a single GPU is supported for some agents:
      • A2C (added as chainerrl.agents.A2C)
      • PPO
      • DQN and other agents that inherit from DQN, except SARSA
    • examples/ale/train_dqn_ale.py now follows the "Tuned DoubleDQN" setting by default, and supports prioritized experience replay as an option
    • examples/atari/train_dqn.py is added as a basic example of applying DQN to Atari.

    Important bugfixes

    • A bug in chainerrl.agents.CategoricalDQN that deteriorates performance is fixed
    • A bug in atari_wrappers.LazyFrame that unnecessarily increases memory usage is fixed

    Important destructive changes

    • chainerrl.replay_buffer.PrioritizedReplayBuffer and chainerrl.replay_buffer.PrioritizedEpisodicReplayBuffer are updated:
      • become FIFO (First In, First Out), reducing memory usage in Atari games
      • compute priorities more closely following the paper
    • eval_explorer argument of chainerrl.experiments.train_agent_* is dropped (use chainerrl.wrappers.RandomizeAction for evaluation-time epsilon-greedy)
    • Interface of chainerrl.agents.PPO has changed a lot
    • Support of Chainer v2 is dropped
    • Support of gym<0.9.7 is dropped
    • Support of loading chainerrl<=0.2.0's replay buffer is dropped
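The move from an eval_explorer argument to an environment wrapper can be pictured as follows. This is a self-contained sketch of the idea only: RandomActionWrapper, EchoEnv, and their parameters are hypothetical and do not reproduce the signature of chainerrl.wrappers.RandomizeAction.

```python
import random


class RandomActionWrapper:
    """Hypothetical wrapper that replaces the action with a random one at a given rate."""

    def __init__(self, env, sample_action, random_fraction, seed=None):
        self.env = env
        self.sample_action = sample_action  # callable: rng -> random action
        self.random_fraction = random_fraction
        self.rng = random.Random(seed)

    def reset(self):
        return self.env.reset()

    def step(self, action):
        # With probability random_fraction, ignore the agent's greedy action.
        if self.rng.random() < self.random_fraction:
            action = self.sample_action(self.rng)
        return self.env.step(action)


class EchoEnv:
    """Hypothetical env whose observation is simply the action it received."""

    def reset(self):
        return 0

    def step(self, action):
        return action, 0.0, False, {}
```

Putting the randomness in the evaluation env means the agent itself always acts greedily at evaluation time, so the training helpers no longer need a separate explorer.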

    All updates

    Enhancement

    • A2C (#149, thanks @iory!)
    • Add wrappers to cast observations (#160)
    • Fix on flake8 3.5.0 (#214)
    • Use ()-shaped array for scalar loss (#219)
    • FIFO prioritized replay buffer (#277)
    • Update Policy class to inherit ABCMeta (#280, thanks @uidilr!)
    • Batch PPO Implementation (#295, thanks @ljvmiranda921!)
    • Mimic the details of prioritized experience replay (#301)
    • Add ScaleReward wrapper (#304)
    • Remove GaussianPolicy and obsolete policies (#305)
    • Make random access queue sampling code cleaner (#309)
    • Support gym==0.10.8 (#324)
    • Batch A2C/PPO/DQN (#326)
    • Use RandomizeAction wrapper instead of Explorer in evaluation (#328)
    • remove duplicate lines (typo) (#329, thanks @monado3!)
    • Merge consecutive with statements (#333)
    • Use Variable.array instead of Variable.data (#336)
    • Remove code for Chainer v2 (#337)
    • Implement getitem for ActionValue (#339)
    • Count updates of DQN (#341)
    • Move Atari Wrappers (#349)
    • Render wrapper (#350)

    Documentation

    • fixes minor typos (#306)
    • fixes typo (#307)
    • Typos (#308)
    • fixes readme typo (#310)
    • Adds partial list of paper implementations with links to the main README (#311)
    • Adds another paper to list (#312)
    • adds some instructions regarding testing for potential contributors (#315)
    • Remove duplication of DQN in docs (#334)
    • nit on grammar of a comment: (#354)

    Examples

    • Tuned DoubleDQN with prioritized experience replay (#302)
    • adds some descriptions to parseargs arguments (#319)
    • Make clip_eps positive (#340)
    • updates env in ddpg example (#345)
    • Examples (#348)

    Testing

    • Fix Travis CI errors (#318)
    • Parse Chainer version with packaging.version (#322)
    • removes tests for old replay buffer (#347)

    Bugfixes

    • Fix the error caused by inexact delta_z (#314)
    • Stop caching the result of numpy.concatenate in LazyFrames (#332)
  • v0.4.0(Jul 23, 2018)

    Important enhancements

    • TRPO (trust region policy optimization) is added: chainerrl.agents.TRPO.
    • C51 (categorical DQN) is added: chainerrl.agents.CategoricalDQN.
    • NoisyNet is added: chainerrl.links.FactorizedNoisyLinear and chainerrl.links.to_factorized_noisy.
    • Python 3.7 is supported
    • Examples were improved in terms of logging and random seed setting

    Important destructive changes

    • The async module is renamed to async_ for Python 3.7 support.

    All updates

    Enhancements

    • TRPO agent (#204)
    • Use numpy random (#206)
    • Add gpus argument for chainerrl.misc.set_random_seed (#207)
    • More check on nesting AttributeSavingMixin (#208)
    • show error message (#210, thanks @corochann!)
    • Add an option to set whether the agent is saved every time the score is improved (#213)
    • Make tests check exit status of subprocesses (#215)
    • make ReplayBuffer.load() compatible with v0.2.0. (#216, thanks @mr4msm!)
    • Add requirements-dev.txt (#222)
    • Align act and act_and_train's signature to the Agent interface (#230, thanks @lyx-x!)
    • Support dtype arg of spaces.Box (#231)
    • Set outdir to results and add help strings (#248)
    • Categorical DQN (C51) (#249)
    • Remove DiscreteActionValue.sample_epsilon_greedy_actions (#259)
    • Remove DQN.compute_q_values (#260)
    • Enable to change batch_states in PPO (#261, thanks @kuni-kuni!)
    • Remove unnecessary declaration and substitution of 'done' in the train_agent function (#271, thanks @uidilr!)

    Documentation

    • Update the contribution guide to use pytest (#220)
    • Add docstring to ALE and fix seed range (#234)
    • Fix docstrings of DDPG (#241)
    • Update the algorithm section of README (#246)
    • Add CategoricalDQN to README (#252)
    • Remove unnecessary comments from examples/gym/train_categorical_dqn_gym.py (#255)
    • Update README.md of examples/ale (#275)

    Examples

    • Fix OMP_NUM_THREADS setting (#235)
    • Improve random seed setting in ALE examples (#239)
    • Improve random seed setting for all examples (#243)
    • Use gym and atari wrappers instead of chainerrl.envs.ale (#253)
    • Remove unused args from examples/ale/train_categorical_dqn_ale.py and examples/ale/train_dqn_ale.py (#256)
    • Remove unused --profile argument (#258)
    • Hyperlink DOI against preferred resolver (#266, thanks @katrinleinweber!)

    Testing

    • Fix import chainer.testing.condition (#200)
    • Use pytest (#209)
    • Fix PCL tests (#211)
    • Test loading v0.2.0 replay buffers (#217)
    • Use assertRaises instead of expectedFailure (#218)
    • Improve travis script (#242)
    • Run autopep8 in travis ci (#247)
    • Switch autopep8 and hacking (#257)
    • Use hacking 1.0 (#262)
    • Fix a too long line of PPO (#264)
    • Update to hacking 1.1.0 (#274)
    • Add tests of DQN's loss functions (#279)

    Bugfixes

    • gym 0.9.6 is not working with python2 (#226)
    • Tiny fix: argument passing in SoftmaxDistribution (#228, thanks @lyx-x!)
    • Add docstring to ALE and fix seed range (#234)
    • except both Exception and KeyboardInterrupt (#250, thanks @uenoku!)
    • Switch autopep8 and hacking (#257)
    • Modify async to async_ to support Python 3.7 (#286, thanks @mmilk1231!)
    • Noisy network fixes (#287, thanks @seann999!)
  • v0.3.0(Dec 8, 2017)

    Important enhancements

    • Both Chainer v2 and v3 are now supported
    • PPO (Proximal Policy Optimization) has been added: chainerrl.agents.PPO
    • Replay buffers have been made faster

    Important destructive changes

    • Episodic replay buffers' __len__ now counts the number of transitions, not episodes
    • ALE's grayscale conversion formula has been corrected
    • FCGaussianPolicyWithFixedCovariance now has a nonlinearity before the last layer

    All updates

    Enhancements

    • Add RMSpropAsync and NonbiasWeightDecay to optimizers/__init__.py (#113)
    • Use init_scope (#116)
    • Remove ALE dependency (#121)
    • Support environments without git command (#124)
    • Add PPO agent (#126)
    • add .gitignore (#127, thanks @knorth55!)
    • Use faster queue for replay buffers (#131)
    • Use F.matmul instead of F.batch_matmul (#141)
    • Add a utility function to draw a computational graph (#166)
    • Improve MLPBN (#171)
    • Improve StateActionQFunctions (#172)
    • Improve deterministic policies (#173)
    • Fix InvertGradients (#185)
    • Remove unused functions in DQN (#188)
    • Warn about negative exit code of child processes (#194)

    Documentation

    • Add animation gifs (#107)
    • Synchronize docs version with package version (#111)
    • Add logo (#136)
    • [policies/gaussian_policy] Improve docstring (#140, thanks @iory!)
    • Improve docstrings (#142)
    • Fix a typo (#146)
    • Fix a broken link to travis ci (#153)
    • Add PPO to README as an implemented algorithm (#168)
    • Improve the docstring of AdditiveGaussian (#170)
    • Add docstring on eval_max_episode_len (#177)
    • Add docstring to DuelingDQN (#187)
    • Suppress Sphinx' warning in the docstring of PCL (#198)

    Example

    • fix typo (#122)
    • Use Chain.init_scope in the quick start (#148)
    • Draw computational graphs in train_dqn_ale.py (#192)
    • Draw computational graphs in train_dqn_gym.py (#195)
    • Draw computational graphs in train_a3c_ale.py (#197)

    Testing

    • Add CHAINER_VERSION config to CI (#143)
    • Specify --outdir on 2nd test (#154)
    • Return dict for info of env.step (#162)
    • Fix import error in tests (#180)
    • Mark TestBiasCorrection as slow (#181)
    • Add tests for SingleActionValue (#191)

    Bugfixes

    • Fix save/load in EpisodicReplayBuffer (#130)
    • Fix REINFORCE's missing initialization of t (#133)
    • Fix episodic buffer __len__ (#155)
    • Remove duplicated import of explorers (#163)
    • Fix missing nonlinearity before the last layer (#165)
    • Use bytestrings to write git outputs (#178)
    • Patches to envs.ALE (#182)
    • Fix QuadraticActionValue and add tests (#190)
  • v0.2.0(Jun 8, 2017)

    Enhancements:

    • Agents
      • REINFORCE #81
    • Training helper functions
      • Hook functions #85
      • Add more columns to scores.txt: episodes, max and min #78
      • Improve naming of the output directories #72 #77
      • Use logger instead of print #60
      • Make train_agent_async's eval_interval optional #93
    • Misc
      • Use Gumbel-Max trick for categorical sampling in GPU #88 #104
      • Remove test arguments from links (use chainer.config instead) #100
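For readers unfamiliar with the Gumbel-Max trick mentioned above, here is a minimal pure-Python sketch. It is not the library's GPU implementation; gumbel_max_sample is a hypothetical helper that adds independent Gumbel(0, 1) noise to each logit and returns the argmax, which is distributed according to softmax(logits).

```python
import math
import random


def gumbel_max_sample(logits, rng):
    """Draw one index from softmax(logits) via the Gumbel-Max trick."""
    best_index, best_value = 0, -math.inf
    for i, logit in enumerate(logits):
        # rng.random() lies in [0, 1); clamp away from 0 so log() is defined.
        u = max(rng.random(), 1e-12)
        gumbel_noise = -math.log(-math.log(u))
        perturbed = logit + gumbel_noise
        if perturbed > best_value:
            best_index, best_value = i, perturbed
    return best_index


rng = random.Random(0)
# Probabilities proportional to exp(logit): here 1 : 3, i.e. 0.25 and 0.75.
samples = [gumbel_max_sample([math.log(1.0), math.log(3.0)], rng) for _ in range(4000)]
fraction_of_ones = sum(samples) / len(samples)
```

The trick needs only elementwise noise and an argmax, with no normalisation or cumulative sum, which is why it maps well to batched array code.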

    Fixes:

    • Fix argument names #86
    • Fix option names #71
    • Fix the issue that average_loss is not updated #95

    Dependency changes:

    • Switch to Chainer v2 #100

    Changes that can affect performance:

    • train_agent_async won't decay learning rate by default any more. Use hook functions instead.
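A hook that restores linear learning-rate decay might look like the sketch below. The hook signature (env, agent, step), the FakeOptimizer class, and make_linear_lr_decay_hook are all assumptions made for illustration; check the hook interface of the version you use before relying on it.

```python
class FakeOptimizer:
    """Hypothetical optimizer exposing a mutable `lr` attribute."""

    def __init__(self, lr):
        self.lr = lr


def make_linear_lr_decay_hook(optimizer, total_steps, start_lr, stop_lr=0.0):
    """Return a hook that linearly interpolates lr from start_lr to stop_lr."""

    def hook(env, agent, step):
        # Clamp so the rate stays at stop_lr once total_steps is exceeded.
        frac = min(step, total_steps) / total_steps
        optimizer.lr = start_lr + (stop_lr - start_lr) * frac

    return hook


optimizer = FakeOptimizer(lr=1.0)
lr_hook = make_linear_lr_decay_hook(optimizer, total_steps=100, start_lr=1.0)
lr_hook(None, None, 50)  # halfway through the schedule
```

Keeping the schedule in a hook rather than inside the training function leaves the default behaviour at a constant learning rate and makes any decay explicit.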
  • v0.1.0(Mar 27, 2017)

    Enhancements:

    • SARSA #39
    • Boltzmann explorer #40
    • ACER for continuous actions #29
    • PCL #45 #57
    • Prioritized Replay #44 #57

    Fixes:

    • Fix spelling: s/updator/updater/ #48