Stanford PhD students propose ultra-fast, memory-saving attention: GPT-2 training sped up 3.5×, BERT sets a speed record
2022-06-09 13:49:00 【QbitAI】
Bai Jiao, reporting from Aofeisi
QbitAI | WeChat official account QbitAI
Flash is all you need!

Recently, an ultra-fast, memory-saving attention algorithm called FlashAttention has taken off.
By being aware of reads and writes to GPU memory (IO-awareness), FlashAttention runs 2-4× faster than PyTorch's standard attention while requiring only 5%-20% of its memory.

And that's not all:
Training BERT is 15% faster than the MLPerf 1.1 training record;
Training GPT-2 is 3.5× faster;
Training Transformers is faster than existing baselines.
Netizens were impressed: "Great job! This work is very useful to me."

Let's take a look at what this research is about~
FlashAttention
The paper proposes an IO-aware exact attention algorithm.
As Transformers get larger and deeper, they are still slow and memory-hungry on long sequences, because the time and memory complexity of self-attention are quadratic in the sequence length.
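To see where that quadratic cost comes from, here is a minimal PyTorch sketch of standard attention (illustrative only, not the paper's code): the full N×N score matrix has to be materialized in GPU memory before the softmax.

```python
import math
import torch

def standard_attention(Q, K, V):
    # Q, K, V: (batch, heads, N, d), where N is the sequence length.
    d = Q.shape[-1]
    # S and P are (batch, heads, N, N): their size grows quadratically with N.
    S = Q @ K.transpose(-2, -1) / math.sqrt(d)   # attention scores
    P = torch.softmax(S, dim=-1)                 # attention weights
    return P @ V                                 # output: (batch, heads, N, d)
```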
Existing approximate attention methods try to trade off model quality to reduce this computational complexity, but they have a key limitation: they often fail to deliver an actual wall-clock speedup in training.
The researchers argue that attention algorithms should instead be made IO-aware, that is, they should account for reads and writes between levels of GPU memory, such as the large but slow HBM (High Bandwidth Memory) and the small but fast SRAM.
Against this background, they propose FlashAttention, which relies on two specific techniques for acceleration: computing attention block by block (tiling) and recomputing attention in the backward pass, fusing all attention operations into a single CUDA kernel.

FlashAttention uses tiling to prevent the large N×N attention matrix (dashed box) from being materialized in GPU HBM. In the outer loop (red arrows), FlashAttention iterates over blocks of the K and V matrices and loads them into SRAM.
Within each block, FlashAttention loops over blocks of the Q matrix (blue arrows), loading them into SRAM and writing the output of the attention computation back to HBM.
This yields an attention algorithm that is both memory-efficient and fast in wall-clock time, requiring far fewer HBM accesses than the standard attention algorithm.
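Below is a simplified, single-head PyTorch sketch of the tiled forward pass just described. It follows the spirit of the paper's Algorithm 1 but omits masking, dropout, the backward-pass recomputation, and all CUDA-level details; the block sizes are illustrative. The key point is the running ("online") softmax statistics m and l, which let the output be accumulated block by block without ever forming the full N×N matrix.

```python
import math
import torch

def flash_attention_forward(Q, K, V, Br=64, Bc=64):
    # Q, K, V: (N, d) for a single head; Br/Bc are illustrative block sizes.
    N, d = Q.shape
    scale = 1.0 / math.sqrt(d)
    O = torch.zeros_like(Q)                                  # output accumulator ("HBM")
    l = torch.zeros(N, dtype=Q.dtype, device=Q.device)       # running softmax denominators
    m = torch.full((N,), float("-inf"), dtype=Q.dtype, device=Q.device)  # running row maxima

    for j in range(0, N, Bc):               # outer loop: K/V blocks (loaded into "SRAM")
        Kj, Vj = K[j:j + Bc], V[j:j + Bc]
        for i in range(0, N, Br):           # inner loop: Q blocks (loaded into "SRAM")
            Qi = Q[i:i + Br]
            S = (Qi @ Kj.T) * scale                          # small (Br, Bc) tile, never N x N
            m_tile = S.max(dim=-1).values
            P = torch.exp(S - m_tile[:, None])
            l_tile = P.sum(dim=-1)
            m_new = torch.maximum(m[i:i + Br], m_tile)
            l_new = (torch.exp(m[i:i + Br] - m_new) * l[i:i + Br]
                     + torch.exp(m_tile - m_new) * l_tile)
            # Rescale what was accumulated so far and add this tile's contribution.
            O[i:i + Br] = ((l[i:i + Br] * torch.exp(m[i:i + Br] - m_new))[:, None] * O[i:i + Br]
                           + torch.exp(m_tile - m_new)[:, None] * (P @ Vj)) / l_new[:, None]
            m[i:i + Br], l[i:i + Br] = m_new, l_new
    return O
```

On random inputs, this should match the standard attention sketch above up to floating-point error, which is what "exact attention" means here: the speed and memory savings come from the access pattern, not from approximation.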

Faster than existing attention algorithms
The researchers evaluated the impact of FlashAttention on Transformer training, covering training time and model accuracy as well as the runtime and memory efficiency of the attention itself.
First, training speed: FlashAttention beats the MLPerf 1.1 BERT speed record by 15%.

For GPT-2, it is 3× faster than the HuggingFace implementation and 1.8× faster than Megatron's standard Transformer; FlashAttention also speeds up the LRA (long-range arena) benchmark by 2.4×.

On model quality, FlashAttention scales Transformers to longer sequences, and quality improves as a result.
Long-context language modeling:
As shown in the figure, with FlashAttention, GPT-2 can use a 4× longer context while still training 30% faster than the optimized Megatron-LM implementation, and it achieves 0.7 better perplexity (lower perplexity means a better language model).
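For reference, perplexity is simply the exponential of the average per-token cross-entropy loss, which is why lower is better (a generic illustration, not the paper's evaluation code):

```python
import torch

def perplexity(mean_cross_entropy: torch.Tensor) -> torch.Tensor:
    # Perplexity = exp(average cross-entropy per token); a perfectly confident,
    # always-correct model would reach 1.0, and higher values mean more "confusion".
    return torch.exp(mean_cross_entropy)
```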

Long document classification:
Training Transformers on longer sequences improves performance on the MIMIC-III and ECtHR datasets; for example, a sequence length of 16K scores 4.3 points higher than a length of 512 on MIMIC.

MIMIC-III contains discharge summaries of ICU patients, each annotated with multiple labels; ECtHR contains legal cases from the European Court of Human Rights. Both datasets consist of very long text documents.
In addition, FlashAttention enables the first Transformer model to achieve better-than-random performance on the Path-X and Path-256 tasks.

Finally, the researchers benchmarked the runtime and memory footprint of FlashAttention and block-sparse FlashAttention, comparing them against various attention baselines on an A100 GPU with 40GB of HBM.

The results show that FlashAttention runs up to 3× faster than the PyTorch attention implementation, and it remains faster than approximate and sparse attention even on short sequences; block-sparse FlashAttention is faster than all existing attention implementations across all sequence lengths.
In terms of memory efficiency, FlashAttention is up to 20× more memory-efficient than the PyTorch attention baseline.

At a sequence length of 64K, where every other algorithm has run out of GPU memory, FlashAttention is still 2× more memory-efficient than Linformer.
First-authored by Stanford PhD students

The research comes from the computer science departments of Stanford University and the State University of New York at Buffalo. The co-first authors are two Stanford computer science PhD students, Tri Dao and Dan Fu.

Interested readers can learn more via the links below~
Paper link:
https://arxiv.org/abs/2205.14135
GitHub link:
https://github.com/HazyResearch/flash-attention
Reference link:
https://twitter.com/tri_dao/status/1531437619791290369
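For those who want to try the GitHub repo above, here is a hypothetical usage sketch. The import path and function signature are assumptions based on recent flash-attn releases (the API has changed across versions), so check the repository README for what matches your installed version; the kernels require fp16/bf16 tensors on a CUDA GPU.

```python
import torch
from flash_attn import flash_attn_func  # assumed import path; see the repo README

batch, seqlen, nheads, headdim = 2, 1024, 8, 64
# FlashAttention expects (batch, seqlen, nheads, headdim) tensors in fp16/bf16 on CUDA.
q = torch.randn(batch, seqlen, nheads, headdim, dtype=torch.float16, device="cuda")
k = torch.randn_like(q)
v = torch.randn_like(q)

out = flash_attn_func(q, k, v, causal=True)  # output: (batch, seqlen, nheads, headdim)
print(out.shape)
```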
— End —