Ranger deep learning optimizer rewrite to use newest components

Overview

Ranger21 - integrating the latest deep learning components into a single optimizer


Ranger, with its RAdam + Lookahead core, is now approaching two years old.
(Original publication, Aug 2019: New deep learning optimizer Ranger.)
In the interim, a number of new developments have happened, including the rise of Transformers for vision.

Thus, Ranger21 (as in 2021) is a rewrite with multiple new additions reflecting some of the most impressive papers of the past year. The focus for Ranger21 is that these internals are parameterized and, where possible, automated, so that you can easily test and leverage some of the newest concepts in AI training and tune the optimizer for your respective dataset.

Latest Simple Benchmark comparison (Image classification, dog breed subset of ImageNet, ResNet-18):

Ranger 21:
Accuracy: 74.02% Validation Loss: 15.00

Adam:
Accuracy: 64.84% Validation Loss: 17.19

Net result: 14.15% higher accuracy (relative) with Ranger21 vs Adam, for the same number of training epochs.

Ranger21 Status:

April 27 PM - Ranger21 now training on ImageNet! Starting work on benchmarking Ranger21 on ImageNet. Due to cost, will train to 40 epochs on ImageNet and compare with same setup with 40 epochs using Adam to have a basic "gold standard" comparison. Training is underway now, hope to have results end of this week.

April 26 PM - added smarter auto warmup based on Dickson Neoh's report (tested with only 5 epochs), and first pip install setup thanks to @BrianPugh!
The warmup structure for Ranger21 is based on the paper by Ma/Yarats, which uses the beta2 param to compute the default warmup. However, that assumes a longer training run. @DNH on the fastai forums tested with 5 epochs, which meant training never got past the warmup phase.
Thus a check has been added for the warmup percentage relative to the total training time; it will automatically fall back to 30% (settable via warmup_pct_default) to account for shorter training runs.
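
As a rough sketch (not the actual Ranger21 code), the fallback logic described above could look like this; warmup_pct_default is the real setting name mentioned above, the other names are illustrative:

```
# Rough sketch of the auto-warmup fallback described above; illustrative only.

def compute_warmup_iterations(beta2, total_iterations, warmup_pct_default=0.30):
    # Ma/Yarats-style default: warm up for roughly 2 / (1 - beta2) iterations.
    beta_based = int(2.0 / (1.0 - beta2))  # e.g. beta2 = 0.999 -> 2000 iterations
    if beta_based > warmup_pct_default * total_iterations:
        # Short training run: the beta2-derived warmup would eat most (or all)
        # of training, so fall back to a fixed percentage of total steps.
        return int(warmup_pct_default * total_iterations)
    return beta_based

print(compute_warmup_iterations(0.999, 500))     # 150 (30% fallback for a short run)
print(compute_warmup_iterations(0.999, 50000))   # 2000 (beta2-based warmup)
```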

  • First pip install for Ranger21, thanks to @BrianPugh! In the next week or two the focus will be on making Ranger21 easier to install and use rather than adding new optimizer features, and thanks to @BrianPugh a basic pip install is already underway:
```
git clone https://github.com/lessw2020/Ranger21.git
cd Ranger21
python -m pip install -e .
```

or installed directly from GitHub:

```
python -m pip install git+https://github.com/lessw2020/Ranger21.git
```

April 25 PM - added guard for potential key error issue. An update is checked in that adds an additional guard to prevent a key error reported earlier today during the lookahead step. This should correct it, but since it could not be reproduced locally, please update to the latest code and raise an issue if you encounter it. Thanks!

April 25 - Fixed warmdown calculation error, moved to linear warmdown, new high on benchmark: Found an error in the warmdown calculations. Fixed it and also moved to a linear warmdown. This resulted in another new high on the simple benchmark, with the results now moved above so they don't get lost in the updates section.
Note that the warmdown now decays from the full lr down to a minimal lr (default 3e-5), rather than declining to 0 as before.
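
For reference, a minimal sketch of a linear warmdown of this shape (names and the warmdown start fraction are illustrative, not the exact Ranger21 code):

```
def warmdown_lr(step, total_steps, base_lr, min_lr=3e-5, warmdown_start_pct=0.72):
    # Stay flat at base_lr until the warmdown phase begins, then decay linearly
    # from base_lr down to min_lr (rather than to 0) by the end of training.
    start = int(warmdown_start_pct * total_steps)
    if step <= start:
        return base_lr
    progress = (step - start) / max(1, total_steps - start)  # 0 -> 1 over warmdown
    return base_lr - progress * (base_lr - min_lr)
```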

Note that you can display the lr curves directly by simply using:

```
import matplotlib.pyplot as plt

lr_curve = optimizer.tracking_lr  # per-epoch lr values tracked by Ranger21
plt.plot(lr_curve)
plt.show()
```

Ranger21 internally tracks the lr per epoch for this type of review. Additional updates include adding a 'clear_cache' to reset the cached lookahead params; the lookahead processing has also been moved to its own function and some naming conventions cleaned up. Will use item_active=True/False rather than the prior using_item=True/False to keep the code simpler, as item properties are now alphabetically grouped rather than cluttered into the using_item layout.
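
For context, the cached lookahead step works roughly like this (a sketch assuming a list of cached slow-weight buffers, not the exact Ranger21 function):

```
import torch

@torch.no_grad()
def lookahead_step(fast_params, slow_params, alpha=0.5):
    # Every k optimizer steps: pull the cached "slow" weights toward the current
    # "fast" weights, then restart the fast weights from the interpolated point.
    for fast, slow in zip(fast_params, slow_params):
        slow += alpha * (fast - slow)
        fast.copy_(slow)
```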
April 24 - New record on benchmark with NormLoss, Lookahead, PosNeg momentum, stable decay, etc. all combined: NormLoss and Lookahead integrated into Ranger21 set a new high on our simple benchmark (ResNet-18, subset of ImageWoof).
Best Accuracy = 73.41 Best Val Loss = 15.06

For comparison, using plain Adam on this benchmark:
Adam Only Accuracy = 64.84 Best Adam Val Loss = 17.19

In other words, Ranger21 currently delivers 12.5%+ higher accuracy than Adam for the same number of training epochs.

Basically it shows that the integration of all these various new techniques is paying off, as combining them currently delivers better results than any one of them alone on top of Adam.

New code checked in - adds Lookahead and of course norm loss. The settings are also now callable via .show_settings() as an easy way to check them.
(screenshot: Ranger21_424_settings)

Given that the extensive settings may become overwhelming, planning to create config file support to make it easy to save out settings for various architectures, and ideally to have 'best settings' recipes for CNNs, Transformers for image/video, GANs, etc.

April 23 - Norm loss will be added; initial benchmarking in progress for several features. A new soft regularizer, norm loss, was recently published in this paper on arXiv: https://arxiv.org/abs/2103.06583v1

It's in the spirit of weight decay, but approaches it in a unique manner by nudging the weights towards the oblique manifold. This means that, unlike weight decay, it can actually push smaller weights up towards the norm-1 property, whereas weight decay only pushes down. Their paper also shows norm loss is less sensitive to hyperparameters such as batch size, unlike regular weight decay.
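
A minimal sketch of the norm-loss idea (my reading of the paper, not the Ranger21 implementation): each neuron/filter is penalized for having a weight norm away from 1.

```
import torch

def norm_loss(weight, eps=1e-8):
    # Treat dim 0 as the output (neuron/filter) dimension and penalize (1 - ||w||)^2:
    # weights below norm 1 are pushed up, weights above norm 1 are pushed down.
    w = weight.reshape(weight.shape[0], -1)
    norms = w.norm(p=2, dim=1) + eps
    return ((1.0 - norms) ** 2).sum()
```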

One of the lead authors was kind enough to share their TF implementation, and it has been reworked into PyTorch form and integrated into Ranger21. Initial testing set a new best validation loss on my very basic benchmark. Thus, norm loss will be available with the next code update.

Also did some initial benchmarking to set vanilla Adam as a baseline, and ablation-style testing with positive-negative momentum. Pos-neg momentum alone is a big improvement over vanilla Adam, and looking forward to mapping out the contributions and synergies between all of the new features being rolled into Ranger21, including norm loss, adaptive gradient clipping, GC, etc.

April 18 PM - Adaptive gradient clipping added, thanks to the suggestion and code from @kayuksel. AGC is used in NFNets to replace BN. For our use case here, it provides a smarter gradient clipping algorithm than the usual hard clipping and should better stabilize training.
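
A hedged sketch of the AGC idea (a per-tensor version for brevity; the NFNets paper applies it unit-wise, and this is not the exact code added to Ranger21):

```
import torch

@torch.no_grad()
def adaptive_clip_grad(parameters, clip_factor=0.01, eps=1e-3):
    # Rescale a gradient whenever its norm is too large relative to the norm
    # of the parameter it updates, instead of clipping to a fixed value.
    for p in parameters:
        if p.grad is None:
            continue
        max_norm = clip_factor * p.norm().clamp(min=eps)
        grad_norm = p.grad.norm()
        if grad_norm > max_norm:
            p.grad.mul_(max_norm / (grad_norm + 1e-6))
```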

Here's how the Ranger21 settings output looks at the moment: (screenshot: ranger21_settings)

April 18 AM - Chebyshev fractals added, cosine warmdown (cosine decay) added
Chebyshev performed reasonably well, but still needs more work before it can be recommended, so it defaults to off at the moment. There are two papers providing support for using Chebyshev, one of which is: https://arxiv.org/abs/2010.13335v1
Cosine warmdown has been added, so the default lr schedule for Ranger21 is linear warmup, a flat run at the provided lr, and then cosine decay of the lr starting at the X% passed in (default is 0.65).
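
As an illustration of that schedule shape (linear warmup, flat run, cosine warmdown; the 0.65 default comes from the description above, everything else is a sketch rather than the Ranger21 code):

```
import math

def scheduled_lr(step, total_steps, base_lr, warmup_steps, warmdown_start_pct=0.65):
    if step < warmup_steps:                              # linear warmup
        return base_lr * (step + 1) / warmup_steps
    decay_start = int(warmdown_start_pct * total_steps)
    if step < decay_start:                               # flat run at base_lr
        return base_lr
    progress = (step - decay_start) / max(1, total_steps - decay_start)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))  # cosine warmdown
```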

April 17 - building benchmark dataset(s). As a cost-effective way of testing Ranger21 and its various options, currently taking a subset of ImageNet categories and building out, at the high level, an "ImageSubNet50" and also a few sub-category datasets. These are similar in spirit to ImageNette and ImageWoof, but with a few relative improvements, including pre-sizing to 224x224 for speed of training/testing. The first sub-dataset in progress is ImageBirds, which includes:
n01614925 bald eagle
n01616318 vulture
n01622779 grey owl

n01806143 peacock
n01833805 hummingbird

This is a medium-fine classification problem and will be used as the first test for this type of benchmarking. Ideally, will make a separate repo for ImageBirds shortly to make it available for people to use, though hosting the dataset poses a cost problem...

April 12 - positive-negative momentum added, MadGrad core checked in. Testing over the weekend showed that positive-negative momentum works really well, and even better with GC.
The code is a bit messy at the moment because Adaiw was also tested, but it did not do that well, so it was removed and positive-negative momentum added. Pos-neg momentum is a new technique that adds parameter-based, anisotropic noise to the gradient, which helps it settle into flatter minima and escape saddle points. In other words, better results.
Link to their excellent paper: https://arxiv.org/abs/2103.17182
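
A very rough sketch of the positive-negative momentum update (my paraphrase of the paper's SGD-style variant; Ranger21's integration with the Adam/MadGrad cores differs):

```
import math
import torch

@torch.no_grad()
def pnm_update(param, grad, momenta, step, lr, beta=0.9, beta0=1.0):
    # Two momentum buffers are refreshed on alternating steps; the effective
    # update combines them with a positive and a negative weight, which injects
    # anisotropic, parameter-dependent noise into the update direction.
    cur, prev = momenta[step % 2], momenta[(step + 1) % 2]
    cur.mul_(beta ** 2).add_(grad, alpha=1 - beta ** 2)
    noise_norm = math.sqrt((1 + beta0) ** 2 + beta0 ** 2)
    update = ((1 + beta0) * cur - beta0 * prev) / noise_norm
    param.add_(update, alpha=-lr)
```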

You can toggle the MadGrad engine on or off with the use_madgrad = True/False flag: (screenshot: ranger21_use_madgrad_toggle)

April 10 - MadGrad core engine integrated. MadGrad has been added in a way that lets you select either MadGrad or Adam as the core 'engine' for the optimizer.
Thus, you'll be able to simply toggle which engine to use, as well as the various enhancements (warmup, stable weight decay, gradient centralization), and quickly find the best optimization setup for your specific dataset.

Still testing things and then will update the code here... Gradient centralization good for both - first findings are that gradient centralization definitely improves MadGrad (just like it does with the Adam core), so GC will be on by default for both engines.
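
For reference, gradient centralization itself is a small transform (a sketch of the idea from the GC paper, not the Ranger21 code): re-center each multi-dimensional gradient to zero mean before the update.

```
import torch

@torch.no_grad()
def centralize_gradient(grad):
    # For multi-dimensional weights, re-center the gradient to zero mean across
    # every dimension except the output (dim 0) before the update step.
    if grad.dim() > 1:
        grad.sub_(grad.mean(dim=tuple(range(1, grad.dim())), keepdim=True))
    return grad
```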

(screenshot: madgrad_added_ranger21)

LR selection is very different between the MadGrad and Adam core engines:

One item - the starting lr for MadGrad is very different (typically higher) than with Adam. Have done some testing with automated LR scheduling (HyperExplorer and ABEL), but that will be added later if it proves successful. If you simply plug your usual Adam LRs into MadGrad you won't be impressed :)

Note that AdamP projection was also tested as an option, but impact was minimal, so will not be adding it atm.

April 6 - Ranger21 alpha ready - automatic warmup added. Seeing impressive results with only 3 features implemented.
Stable weight decay + GC + automated linear warmup seem to synergize very nicely. Thus, if you are feeling adventurous, Ranger21 is basically alpha-usable. Recommend using the default warmup (automatic by default), but test lr and weight decay.
Ranger21 will output the settings at init to make it clear what you are running with: (screenshot: Ranger21_initialization)

April 5 - stable weight decay added. Quick testing shows nice results with 1e-4 weight decay on a subset of ImageNet.
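
A hedged sketch of the stable weight decay idea (my reading of the paper, not the exact Ranger21 code): the decay is normalized by the average Adam denominator so its effective strength stays stable as gradient scales change.

```
import torch

@torch.no_grad()
def stable_weight_decay(params, exp_avg_sqs, lr, weight_decay, eps=1e-8):
    # Global mean of sqrt(v) across every parameter element, used to normalize
    # the decay (instead of the fixed lr * wd * w decay of AdamW).
    denom_mean = torch.cat([v.sqrt().flatten() for v in exp_avg_sqs]).mean().clamp(min=eps)
    for p in params:
        p.mul_(1 - lr * weight_decay / denom_mean)
```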

Current feature set planned:

1 - Feature complete - automated linear and exponential warmup in place of RAdam. This is based on the findings of https://arxiv.org/abs/1910.04209v3

2 - Feature in progress - MadGrad core engine. This is based on my own testing with Vision Transformers as well as the compelling MadGrad paper: https://arxiv.org/abs/2101.11075v1

3 - Feature complete - stable weight decay instead of AdamW-style or Adam-style decay: needs more testing, but the paper is very compelling: https://arxiv.org/abs/2011.11152v3

4 - Feature complete - gradient centralization will be continued - as always, you can turn it on or off. https://arxiv.org/abs/2004.01461v2

5 - Lookahead may be brought forward - unclear how much it will help with the new MadGrad core, which already leverages dual averaging, but it will probably be included as a testable param.

6 - Feature in progress - dual optimization engines - both Adam and MadGrad cores will be present so that one can quickly test with either MadGrad or Adam (or AdamP) with the flip of a param.

If you have ideas/feedback, feel free to open an issue.

Installation

Until this is up on PyPI, it can be installed either by cloning the repo:

```
git clone https://github.com/lessw2020/Ranger21.git
cd Ranger21
python -m pip install -e .
```

or installed directly from GitHub:

```
python -m pip install git+https://github.com/lessw2020/Ranger21.git
```
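
Once installed, basic usage looks roughly like the sketch below. The constructor arguments shown reflect features described in this README (the use_madgrad toggle, an epoch-based schedule), but the exact parameter names and the import path are assumptions; check ranger21.py for the authoritative signature and defaults.

```
import torch
from ranger21 import Ranger21  # import path may differ depending on packaging

model = torch.nn.Linear(10, 2)

optimizer = Ranger21(
    model.parameters(),
    lr=1e-3,
    weight_decay=1e-4,          # stable weight decay strength
    use_madgrad=False,          # True switches to the MadGrad core engine
    num_epochs=40,              # assumed: used to size the warmup/warmdown schedule
    num_batches_per_epoch=100,  # assumed: iterations per epoch for the schedule
)
```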
Owner

Less Wright - Principal Software Engineer at Audere; PM/Test/Dev at Microsoft; Software Architect at X10 Wireless