Examples of using sparse attention, as in "Generating Long Sequences with Sparse Transformers"

Overview

Status: Archive (code is provided as-is, no updates expected)

Update August 2020: For an example repository that achieves state-of-the-art modeling performance on CIFAR-10 using Sparse Transformers, please see https://github.com/openai/distribution_augmentation

Sparse Attention

This repository contains the sparse attention primitives used in Sparse Transformers (see blog and paper). Specifically, it includes the following:

  1. A faster implementation of normal attention (the upper triangle is not computed, and many operations are fused).
  2. An implementation of "strided" and "fixed" attention, as in the Sparse Transformers paper.
  3. A simple recompute decorator, which can be adapted for usage with attention.

We hope this code can further accelerate research into sparse attention.

An example Transformer implementation which is close to the version we use internally can be found at https://github.com/openai/blocksparse/blob/master/examples/transformer/enwik8.py.

Overview of kernels

The repository contains fused implementations of the attention operation, which takes in Q, K, V matrices (all of dimensionality batch, time, dim) representing the queries, keys, and values for a sequence. For every query element, a weighted sum of the values is returned, where the weightings are determined by the scaled matrix product of Q and K^T.

The kernels allow specification of block sparsity in the QK^T matrix. This means you define a pattern of 0/1s on a [time/blocksize, time/blocksize] matrix of blocks, and the values where it is 0 will not be computed, and not be included in the softmax calculation. Additionally, one can define "callbacks" on the computed blocks, which will further mask out values in any given block from the softmax (though the matrix product will still be computed for those elements).

Block sizes of {8, 16, 32, 64} are supported, and slight advantages in speed may be seen from using larger blocks.

Prerequisites

For fp32 and blocksize 32, any NVIDIA GPU past Kepler can be used (i.e. compute capability beyond 3.5).

For fp16 and blocksize 8, 16, 32, 64, a GPU with Tensor Cores (e.g. the V100 GPU, compute capability >= 7.0) is required.

The primary dependency is the OpenAI blocksparse package.

With CUDA 10 and tensorflow-gpu, you can install blocksparse with pip install blocksparse.

For other setups, you must install blocksparse from source, and directions can be found in the root of the repository.

Examples

Run the following on a non-V100 GPU:

python attention.py

On a V100 GPU:

python attention.py fp16

General usage

An example can be found at the bottom of attention.py.

full_attn_tf = attention_impl(q, k, v, heads=4, attn_mode="all", recompute=True)
full_attn_bs = blocksparse_attention_impl(q, k, v, heads=4, attn_mode="all", recompute=True)

# first step of strided attention
local_attn_bs = blocksparse_attention_impl(q, k, v, heads=4, attn_mode="local", local_attn_ctx=32, recompute=True)
local_attn_tf = attention_impl(q, k, v, heads=4, attn_mode="local", local_attn_ctx=32, recompute=True)

# second step of strided attention
strided_attn_bs = blocksparse_attention_impl(q, k, v, heads=4, attn_mode="strided", local_attn_ctx=32, recompute=True)
strided_attn_tf = attention_impl(q, k, v, heads=4, attn_mode="strided", local_attn_ctx=32, recompute=True)

# # the 'fixed' attention pattern
fixed = blocksparse_attention_impl(q, k, v, heads=4, attn_mode="fixed", local_attn_ctx=128, num_verts=4, vertsize=1, recompute=True)

Referencing this work

If you find this helpful in your work, you can consider citing the following:

@article{child2019sparsetransformer,
  title={Generating Long Sequences with Sparse Transformers},
  author={Child, Rewon and Gray, Scott and Radford, Alec and Sutskever, Ilya},
  journal={URL https://openai.com/blog/sparse-transformers},
  year={2019}
}
Owner
OpenAI
OpenAI
DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism (SVS & TTS); AAAI 2022

DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism This repository is the official PyTorch implementation of our AAAI-2022 paper, in

Jinglin Liu 829 Jan 07, 2023
Finally, some decent sample sentences

tts-dataset-prompts This repository aims to be a decent set of sentences for people looking to clone their own voices (e.g. using Tacotron 2). Each se

hecko 19 Dec 13, 2022
Modeling cumulative cases of Covid-19 in the US during the Covid 19 Delta wave using Bayesian methods.

Introduction The goal of this analysis is to find a model that fits the observed cumulative cases of COVID-19 in the US, starting in Mid-July 2021 and

Alexander Keeney 1 Jan 05, 2022
Making text a first-class citizen in TensorFlow.

TensorFlow Text - Text processing in Tensorflow IMPORTANT: When installing TF Text with pip install, please note the version of TensorFlow you are run

1k Dec 26, 2022
Code repository of the paper Neural circuit policies enabling auditable autonomy published in Nature Machine Intelligence

Code repository of the paper Neural circuit policies enabling auditable autonomy published in Nature Machine Intelligence

9 Jan 08, 2023
This repository describes our reproducible framework for assessing self-supervised representation learning from speech

LeBenchmark: a reproducible framework for assessing SSL from speech Self-Supervised Learning (SSL) using huge unlabeled data has been successfully exp

49 Aug 24, 2022
Large-scale pretraining for dialogue

A State-of-the-Art Large-scale Pretrained Response Generation Model (DialoGPT) This repository contains the source code and trained model for a large-

Microsoft 1.8k Jan 07, 2023
This is a GUI program that will generate a word search puzzle image

Word Search Puzzle Generator Table of Contents About The Project Built With Getting Started Prerequisites Installation Usage Roadmap Contributing Cont

11 Feb 22, 2022
Gold standard corpus annotated with verb-preverb connections for Hungarian.

Hungarian Preverb Corpus A gold standard corpus manually annotated with verb-preverb connections for Hungarian. corpus The corpus consist of the follo

RIL Lexical Knowledge Representation Research Group 3 Jan 27, 2022
Longformer: The Long-Document Transformer

Longformer Longformer and LongformerEncoderDecoder (LED) are pretrained transformer models for long documents. ***** New December 1st, 2020: Longforme

AI2 1.6k Dec 29, 2022
Treemap visualisation of Maya scene files

Ever wondered which nodes are responsible for that 600 mb+ Maya scene file? Features Fast, resizable UI Parsing at 50 mb/sec Dependency-free, single-f

Marcus Ottosson 76 Nov 12, 2022
BERN2: an advanced neural biomedical namedentity recognition and normalization tool

BERN2 We present BERN2 (Advanced Biomedical Entity Recognition and Normalization), a tool that improves the previous neural network-based NER tool by

DMIS Laboratory - Korea University 99 Jan 06, 2023
Spokestack is a library that allows a user to easily incorporate a voice interface into any Python application with a focus on embedded systems.

Welcome to Spokestack Python! This library is intended for developing voice interfaces in Python. This can include anything from Raspberry Pi applicat

Spokestack 133 Sep 20, 2022
Espial is an engine for automated organization and discovery of personal knowledge

Live Demo (currently not running, on it) Espial is an engine for automated organization and discovery in knowledge bases. It can be adapted to run wit

Uzay-G 159 Dec 30, 2022
🤖 Basic Financial Chatbot with handoff ability built with Rasa

Financial Services Example Bot This is an example chatbot demonstrating how to build AI assistants for financial services and banking with Rasa. It in

Mohammad Javad Hossieni 4 Aug 10, 2022
Application for shadowing Chinese.

chinese-shadowing Simple APP for shadowing chinese. With this application, it is very easy to record yourself, play the sound recorded and listen to s

Thomas Hirtz 5 Sep 06, 2022
Speech to text streamlit app

Speech to text Streamlit-app! 👄 This speech to text recognition is powered by t

Charly Wargnier 9 Jan 01, 2023
WIT (Wikipedia-based Image Text) Dataset is a large multimodal multilingual dataset comprising 37M+ image-text sets with 11M+ unique images across 100+ languages.

WIT (Wikipedia-based Image Text) Dataset is a large multimodal multilingual dataset comprising 37M+ image-text sets with 11M+ unique images across 100+ languages.

Google Research Datasets 740 Dec 24, 2022
[EMNLP 2021] Mirror-BERT: Converting Pretrained Language Models to universal text encoders without labels.

[EMNLP 2021] Mirror-BERT: Converting Pretrained Language Models to universal text encoders without labels.

Cambridge Language Technology Lab 61 Dec 10, 2022
VoiceFixer VoiceFixer is a framework for general speech restoration.

VoiceFixer VoiceFixer is a framework for general speech restoration. We aim at the restoration of severly degraded speech and historical speech. Paper

Leo 174 Jan 06, 2023