SmallInitEmb - LayerNorm(SmallInit(Embedding)) in a Transformer to improve convergence

Overview

SmallInitEmb

LayerNorm(SmallInit(Embedding)) in a Transformer

I find that when training a transformer, the embedding matrix moves slowly, hence it's difficult for the model to jump out of the initial noisy embedding.

(initial embedding)
[[-0.0073  0.0062 -0.0261 ...  0.0086  0.0107 -0.008 ] ... ]
 (after 1 step, the directions of the embedding vectors are not moved much because the numbers change by ~LR = ~4e-4)
[[-0.0069  0.0066 -0.0265 ...  0.009   0.0111 -0.0084] ... ]

So I propose initializing the embedding matrix to tiny values, and put another LayerNorm after it (before all the SA & FFN layers):

if isinstance(module, (nn.Embedding)):
    nn.init.uniform_(module.weight, a=-1e-4, b=1e-4) # SmallInit(Emb)
...
if self.config.USE_SMALL_EMB and self.layer_id == 0:
    x = self.lnPre(x) # LN(SmallInit(Emb))
x = x + self.att(self.ln1(x))
x = x + self.ffn(self.ln2(x))

And then you get improved convergence (especially for BPE models) because the model can quickly jump out of the tiny initial embedding (small changes after 1 step -> significant changes of directions -> significant changes after LayerNorm).

Loss curve comparison: https://wandb.ai/blinkdl/SmallEmbTest

(the gap between LayerNorm(SmallEmb)) and baseline persists after more training)

Moreover, you can directly train PostLN models without warmup with SmallInit(Emb)

if isinstance(module, (nn.Embedding)):
    nn.init.uniform_(module.weight, a=-1e-4, b=1e-4) # SmallInit(Emb)
...
x = self.ln1(x) # this plays the same role as the lnPre in the above PreLN code
x = x + self.att(x)
x = self.ln2(x)
x = x + self.ffn(x)
(note you shall have another LN after the final ffn)
Owner
PENG Bo
http://zhihu.com/people/bopengbopeng
PENG Bo
DrNAS: Dirichlet Neural Architecture Search

This paper proposes a novel differentiable architecture search method by formulating it into a distribution learning problem. We treat the continuously relaxed architecture mixing weight as random va

Xiangning Chen 37 Jan 03, 2023
The official pytorch implementation of our paper "Is Space-Time Attention All You Need for Video Understanding?"

TimeSformer This is an official pytorch implementation of Is Space-Time Attention All You Need for Video Understanding?. In this repository, we provid

Facebook Research 1k Dec 31, 2022
TensorFlow implementation of Deep Reinforcement Learning papers

Deep Reinforcement Learning in TensorFlow TensorFlow implementation of Deep Reinforcement Learning papers. This implementation contains: [1] Playing A

Taehoon Kim 1.6k Jan 03, 2023
A framework for annotating 3D meshes using the predictions of a 2D semantic segmentation model.

Semantic Meshes A framework for annotating 3D meshes using the predictions of a 2D semantic segmentation model. Paper If you find this framework usefu

Florian 40 Dec 09, 2022
【ACMMM 2021】DSANet: Dynamic Segment Aggregation Network for Video-Level Representation Learning

DSANet: Dynamic Segment Aggregation Network for Video-Level Representation Learning (ACMMM 2021) Overview We release the code of the DSANet (Dynamic S

Wenhao Wu 46 Dec 27, 2022
A fast, scalable, high performance Gradient Boosting on Decision Trees library, used for ranking, classification, regression and other machine learning tasks for Python, R, Java, C++. Supports computation on CPU and GPU.

Website | Documentation | Tutorials | Installation | Release Notes CatBoost is a machine learning method based on gradient boosting over decision tree

CatBoost 6.9k Jan 04, 2023
Determined: Deep Learning Training Platform

Determined: Deep Learning Training Platform Determined is an open-source deep learning training platform that makes building models fast and easy. Det

Determined AI 2k Dec 31, 2022
Official pytorch implement for “Transformer-Based Source-Free Domain Adaptation”

Official implementation for TransDA Official pytorch implement for “Transformer-Based Source-Free Domain Adaptation”. Overview: Result: Prerequisites:

stanley 54 Dec 22, 2022
Face and Body Tracking for VRM 3D models on the web.

Kalidoface 3D - Face and Full-Body tracking for Vtubing on the web! A sequal to Kalidoface which supports Live2D avatars, Kalidoface 3D is a web app t

Rich 257 Jan 02, 2023
Human Pose Detection on EdgeTPU

Coral PoseNet Pose estimation refers to computer vision techniques that detect human figures in images and video, so that one could determine, for exa

google-coral 476 Dec 31, 2022
CONditionals for Ordinal Regression and classification in tensorflow

Condor Ordinal regression in Tensorflow Keras Tensorflow Keras implementation of CONDOR Ordinal Regression (aka ordinal classification) by Garrett Jen

9 Jul 31, 2022
PyTorch common framework to accelerate network implementation, training and validation

pytorch-framework PyTorch common framework to accelerate network implementation, training and validation. This framework is inspired by works from MML

Dongliang Cao 3 Dec 19, 2022
Official implementation for “Unsupervised Low-Light Image Enhancement via Histogram Equalization Prior”

Unsupervised Low-Light Image Enhancement via Histogram Equalization Prior. The code will release soon. Implementation Python3 PyTorch=1.0 NVIDIA GPU+

FengZhang 34 Dec 04, 2022
(AAAI2020)Grapy-ML: Graph Pyramid Mutual Learning for Cross-dataset Human Parsing

Grapy-ML: Graph Pyramid Mutual Learning for Cross-dataset Human Parsing This repository contains pytorch source code for AAAI2020 oral paper: Grapy-ML

54 Aug 04, 2022
Scribble-Supervised LiDAR Semantic Segmentation, CVPR 2022 (ORAL)

Scribble-Supervised LiDAR Semantic Segmentation Dataset and code release for the paper Scribble-Supervised LiDAR Semantic Segmentation, CVPR 2022 (ORA

102 Dec 25, 2022
MDETR: Modulated Detection for End-to-End Multi-Modal Understanding

MDETR: Modulated Detection for End-to-End Multi-Modal Understanding Website • Colab • Paper This repository contains code and links to pre-trained mod

Aishwarya Kamath 770 Dec 28, 2022
Sketch-Based 3D Exploration with Stacked Generative Adversarial Networks

pix2vox [Demonstration video] Sketch-Based 3D Exploration with Stacked Generative Adversarial Networks. Generated samples Single-category generation M

Takumi Moriya 232 Nov 14, 2022
Final project for Intro to CS class.

Financial Analysis Web App https://share.streamlit.io/mayurk1/fin-web-app-final-project/webApp.py 1. Project Description This project is a technical a

Mayur Khanna 1 Dec 10, 2021
PyTorch implementation of our CVPR2021 (oral) paper "Prototype Augmentation and Self-Supervision for Incremental Learning"

PASS - Official PyTorch Implementation [CVPR2021 Oral] Prototype Augmentation and Self-Supervision for Incremental Learning Fei Zhu, Xu-Yao Zhang, Chu

67 Dec 27, 2022
UFT - Universal File Transfer With Python

UFT 2.0.0 UFT (Universal File Transfer) is a CLI tool , which can be used to upl

Merwin 1 Feb 18, 2022