Escaping the Gradient Vanishing: Periodic Alternatives of Softmax in Attention Mechanism

Last update: Sep 06, 2021

Overview

Period-alternatives-of-Softmax

Experimental demo for our paper 'Escaping the Gradient Vanishing: Periodic Alternatives of Softmax in Attention Mechanism'.

We suggest replacing the exponential function in Softmax with periodic functions. In experiments on a simple demo modeled on LeViT, our method alleviates the gradient vanishing problem and yields substantial improvements over Softmax and its variants.
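As a rough illustration of the idea (not the repository's code), the sketch below contrasts plain Softmax with two periodic variants in pure Python. The formulas are assumptions inferred from the function names only: `sin_softmax` is taken to be Softmax applied to sin(x), and `sin_2_max` to be sin²(x) normalized by its sum. The actual definitions in demo.py may differ, for example in scaling or phase shifts. The helper `softmax_self_grad` is hypothetical and only shows how the diagonal gradient term shrinks when Softmax saturates.

```python
import math

def softmax(scores):
    """Standard softmax: exp(x_i) / sum_j exp(x_j), shifted by the max for stability."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sin_softmax(scores):
    """Assumed form: softmax applied to sin(x), so the inputs to exp stay in [-1, 1]."""
    return softmax([math.sin(s) for s in scores])

def sin_2_max(scores, eps=1e-12):
    """Assumed form: sin(x_i)^2 / sum_j sin(x_j)^2 (eps guards against a zero sum)."""
    sq = [math.sin(s) ** 2 for s in scores]
    total = sum(sq) + eps
    return [v / total for v in sq]

def softmax_self_grad(p):
    """Diagonal softmax Jacobian entries: d p_i / d x_i = p_i * (1 - p_i)."""
    return [pi * (1.0 - pi) for pi in p]

if __name__ == "__main__":
    large = [12.0, 1.0, -3.0]  # widely spread scores push softmax toward one-hot
    p = softmax(large)
    print("softmax     :", p)
    print("softmax grad:", softmax_self_grad(p))
    print("sin_softmax :", sin_softmax(large))
    print("sin_2_max   :", sin_2_max(large))
```

With widely spread scores, plain Softmax puts almost all mass on one entry, so every `p_i * (1 - p_i)` term is close to zero. The periodic variants keep the pre-normalization values bounded, which is the property the paper relies on.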

Note: Create your own 'dataset' folder. For datasets other than CIFAR-10, CIFAR-100 and Tiny-ImageNet, you may need to modify the demo.py file.

Functions available:

softmax, norm_softmax
sinmax, norm_sinmax
cosmax, norm_cosmax
sin_2_max, norm_sin_2_max
sin_2_max_move, norm_sin_2_max_move
sirenmax, norm_sirenmax
sin_softmax, norm_sin_softmax

Modes available:

search:
        Randomly searches for a suitable learning rate and weight decay, and records the results in
        Attention_test/*functions/lr_wd_search.txt
run:
        Trains the demo and creates four .npy files in the root directory:
        (1) 'record_val_acc.npy': validation accuracy, recorded every 100 iterations;
        (2) 'record_train_acc.npy': training accuracy, recorded every batch;
        (3) 'record_loss.npy': training loss, recorded every batch;
        (4) 'kq_value.npy': Q.K values, recorded *before scaling*.
att_run:
        Same as the run mode, except:
        (1) No kq_value record is written;
        (2) Every 5 epochs, a test image is fed in and the attention score map of each head in each layer is recorded.
            The maps are saved in 'Attention_test/attention_maps.npy'.
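
The record files listed above can be inspected after training with NumPy. The following is a minimal sketch, not part of the repository: the helper `load_records` is hypothetical, and the array shapes and dtypes are not documented here, so the script only prints whatever it finds. It assumes the files are plain numeric arrays; if they were saved as object arrays, `np.load` would need `allow_pickle=True`.

```python
import os
import numpy as np

# File names as listed for the run and att_run modes above.
RECORD_FILES = [
    "record_val_acc.npy",
    "record_train_acc.npy",
    "record_loss.npy",
    "kq_value.npy",
    os.path.join("Attention_test", "attention_maps.npy"),
]

def load_records(root="."):
    """Load whichever record files exist under `root`; missing ones are skipped."""
    records = {}
    for name in RECORD_FILES:
        path = os.path.join(root, name)
        if os.path.exists(path):
            records[name] = np.load(path)
    return records

if __name__ == "__main__":
    # Print the shape and dtype of each record that was found.
    for name, arr in load_records().items():
        print(name, arr.shape, arr.dtype)
```

Run it from the repository root after a `run` or `att_run` session; files that the chosen mode does not produce are simply omitted from the result.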


Owner

slwang9353
