Stanford PhD students propose ultra-fast, memory-saving attention: GPT-2 training sped up 3.5×, BERT sets a speed record
2022-06-09 13:49:00 【QbitAI】
Bai Jiao, reporting from Aofeisi
QbitAI | WeChat official account QbitAI
Flash is all you need!

Recently, an ultra-fast, memory-saving attention algorithm called FlashAttention has taken off.
By being aware of reads and writes to GPU memory (IO-awareness), FlashAttention runs 2-4× faster than PyTorch's standard attention while requiring only 5%-20% of its memory.

And that's not all:
Training BERT is 15% faster than the MLPerf 1.1 training record;
Training GPT-2 is 3.5× faster;
Training Transformers is faster than existing baselines.
Netizens were impressed: "Great job! This work is very useful to me."

Let's take a look at what this research is about~
FlashAttention
The paper proposes an IO-aware exact attention algorithm.
As Transformers get larger and deeper, they are still slow and memory-hungry on long sequences, because the time and memory complexity of self-attention are quadratic in the sequence length.
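To see where that quadratic cost comes from, here is a minimal PyTorch sketch of standard attention (illustrative only, not the paper's code): the full N×N score matrix has to be materialized in GPU memory before the softmax.

```python
import math
import torch

def standard_attention(Q, K, V):
    # Q, K, V: (batch, heads, N, d), where N is the sequence length.
    d = Q.shape[-1]
    # S and P are (batch, heads, N, N): their size grows quadratically with N.
    S = Q @ K.transpose(-2, -1) / math.sqrt(d)   # attention scores
    P = torch.softmax(S, dim=-1)                 # attention weights
    return P @ V                                 # output: (batch, heads, N, d)
```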
Existing approximate attention methods try to trade off model quality to reduce this computational complexity, but they have a key limitation: they often fail to deliver an actual wall-clock speedup in training.
The researchers argue that attention algorithms should instead be made IO-aware, that is, they should account for reads and writes between levels of GPU memory, such as the large but slow HBM (High Bandwidth Memory) and the small but fast SRAM.
Against this background, they propose FlashAttention, which relies on two specific techniques for acceleration: computing attention block by block (tiling) and recomputing attention in the backward pass, fusing all attention operations into a single CUDA kernel.

FlashAttention uses tiling to prevent the large N×N attention matrix (dashed box) from being materialized in GPU HBM. In the outer loop (red arrows), FlashAttention iterates over blocks of the K and V matrices and loads them into SRAM.
Within each block, FlashAttention loops over blocks of the Q matrix (blue arrows), loading them into SRAM and writing the output of the attention computation back to HBM.
This yields an attention algorithm that is both memory-efficient and fast in wall-clock time, requiring far fewer HBM accesses than the standard attention algorithm.
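Below is a simplified, single-head PyTorch sketch of the tiled forward pass just described. It follows the spirit of the paper's Algorithm 1 but omits masking, dropout, the backward-pass recomputation, and all CUDA-level details; the block sizes are illustrative. The key point is the running ("online") softmax statistics m and l, which let the output be accumulated block by block without ever forming the full N×N matrix.

```python
import math
import torch

def flash_attention_forward(Q, K, V, Br=64, Bc=64):
    # Q, K, V: (N, d) for a single head; Br/Bc are illustrative block sizes.
    N, d = Q.shape
    scale = 1.0 / math.sqrt(d)
    O = torch.zeros_like(Q)                                  # output accumulator ("HBM")
    l = torch.zeros(N, dtype=Q.dtype, device=Q.device)       # running softmax denominators
    m = torch.full((N,), float("-inf"), dtype=Q.dtype, device=Q.device)  # running row maxima

    for j in range(0, N, Bc):               # outer loop: K/V blocks (loaded into "SRAM")
        Kj, Vj = K[j:j + Bc], V[j:j + Bc]
        for i in range(0, N, Br):           # inner loop: Q blocks (loaded into "SRAM")
            Qi = Q[i:i + Br]
            S = (Qi @ Kj.T) * scale                          # small (Br, Bc) tile, never N x N
            m_tile = S.max(dim=-1).values
            P = torch.exp(S - m_tile[:, None])
            l_tile = P.sum(dim=-1)
            m_new = torch.maximum(m[i:i + Br], m_tile)
            l_new = (torch.exp(m[i:i + Br] - m_new) * l[i:i + Br]
                     + torch.exp(m_tile - m_new) * l_tile)
            # Rescale what was accumulated so far and add this tile's contribution.
            O[i:i + Br] = ((l[i:i + Br] * torch.exp(m[i:i + Br] - m_new))[:, None] * O[i:i + Br]
                           + torch.exp(m_tile - m_new)[:, None] * (P @ Vj)) / l_new[:, None]
            m[i:i + Br], l[i:i + Br] = m_new, l_new
    return O
```

On random inputs, this should match the standard attention sketch above up to floating-point error, which is what "exact attention" means here: the speed and memory savings come from the access pattern, not from approximation.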

Faster than existing attention algorithms
The researchers evaluated the impact of FlashAttention on Transformer training, covering training time and model accuracy as well as the runtime and memory efficiency of the attention itself.
First, training speed: FlashAttention beats the MLPerf 1.1 BERT speed record by 15%.

For GPT-2, it is 3× faster than the HuggingFace implementation and 1.8× faster than Megatron's standard Transformer; FlashAttention also speeds up the LRA (long-range arena) benchmark by 2.4×.

On model quality, FlashAttention scales Transformers to longer sequences, and quality improves as a result.
Long-context language modeling:
As shown in the figure, with FlashAttention, GPT-2 can use a 4× longer context while still training 30% faster than the optimized Megatron-LM implementation, and it achieves 0.7 better perplexity (lower perplexity means a better language model).
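For reference, perplexity is simply the exponential of the average per-token cross-entropy loss, which is why lower is better (a generic illustration, not the paper's evaluation code):

```python
import torch

def perplexity(mean_cross_entropy: torch.Tensor) -> torch.Tensor:
    # Perplexity = exp(average cross-entropy per token); a perfectly confident,
    # always-correct model would reach 1.0, and higher values mean more "confusion".
    return torch.exp(mean_cross_entropy)
```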

Long document classification:
Training Transformers on longer sequences improves performance on the MIMIC-III and ECtHR datasets; for example, a sequence length of 16K scores 4.3 points higher than a length of 512 on MIMIC.

MIMIC-III contains discharge summaries of ICU patients, each annotated with multiple labels; ECtHR contains legal cases from the European Court of Human Rights. Both datasets consist of very long text documents.
In addition, FlashAttention enables the first Transformer model to achieve better-than-random performance on the Path-X and Path-256 tasks.

Finally, the researchers benchmarked the runtime and memory footprint of FlashAttention and block-sparse FlashAttention, comparing them against various attention baselines on an A100 GPU with 40GB of HBM.

The results show that FlashAttention runs up to 3× faster than the PyTorch attention implementation, and it remains faster than approximate and sparse attention even on short sequences; block-sparse FlashAttention is faster than all existing attention implementations across all sequence lengths.
In terms of memory efficiency, FlashAttention is up to 20× more memory-efficient than the PyTorch attention baseline.

At a sequence length of 64K, where every other algorithm has run out of GPU memory, FlashAttention is still 2× more memory-efficient than Linformer.
First-authored by Stanford PhD students

The research comes from the computer science departments of Stanford University and the State University of New York at Buffalo. The co-first authors are two Stanford computer science PhD students, Tri Dao and Dan Fu.

Interested readers can learn more via the links below~
Paper link:
https://arxiv.org/abs/2205.14135
GitHub link:
https://github.com/HazyResearch/flash-attention
Reference link:
https://twitter.com/tri_dao/status/1531437619791290369
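For those who want to try the GitHub repo above, here is a hypothetical usage sketch. The import path and function signature are assumptions based on recent flash-attn releases (the API has changed across versions), so check the repository README for what matches your installed version; the kernels require fp16/bf16 tensors on a CUDA GPU.

```python
import torch
from flash_attn import flash_attn_func  # assumed import path; see the repo README

batch, seqlen, nheads, headdim = 2, 1024, 8, 64
# FlashAttention expects (batch, seqlen, nheads, headdim) tensors in fp16/bf16 on CUDA.
q = torch.randn(batch, seqlen, nheads, headdim, dtype=torch.float16, device="cuda")
k = torch.randn_like(q)
v = torch.randn_like(q)

out = flash_attn_func(q, k, v, causal=True)  # output: (batch, seqlen, nheads, headdim)
print(out.shape)
```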
— End —