当前位置:网站首页>Teach the big model to skip the "useless" layer and improve the reasoning speed × 3. The performance remains unchanged, and the new method of Google MIT is popular
Teach the big model to skip the "useless" layer and improve the reasoning speed × 3. The performance remains unchanged, and the new method of Google MIT is popular
2022-07-26 15:50:00 【QbitAl】
Xiao Xiao From the Aofei temple
qubits | official account QbitAI
The big language model is cool , But the reasoning speed is too slow ?
and , Increase the model volume , The reasoning effect is not necessarily better than before .
To solve this problem , Google MIT The researchers of have proposed a new framework CALM, Let it decide the amount of calculation .
If CALM Be aware of certain layers “ not essential ”, Then it will skip these layers when calculating .
The paper was po After going online , There was an immediate fire :

One netizen said , We just need such a more intelligent and adaptive model , obviously CALM The decoder of has done :

Directly use the middle layer to output the results
CALM Full name Confident Adaptive Language Modeling, That is, confidence adaptive large language model .
The first mock exam is based on Transformer framework , In order to speed up its calculation , The researchers proposed a name “ Exit ahead of time ”(early exiting) Methods , Let the model according to different inputs , dynamic Decide how many layers of network to calculate .
in other words , In the process of calculation , The model does not need to go through each layer of calculation before outputting the results , Instead, you can directly use the features of the middle layer to output token, So as to reduce the calculation of the model .

therefore , How the model determines “ sign out ” What's the timing of this ?
This requires the training model to learn to judge by itself .
among ,Yfull It is the result of standard model output ,Yearly It's a model “ Exit ahead of time ” The output result of . In order to make Yearly Is much better , We need to make it as close as possible to Yfull bring into correspondence with .

Of course , Different tasks have different requirements for text output consistency , For example, the requirements for the generated results are not so strict ( You can generate more diverse statements ) The task of , about Yfull and Yearly The consistency requirements are not so high .
Therefore, the authors also give two different formulas in the paper , It can be selected according to the actual situation :

In practice , The paper sets a local token Degree of confidence , To check its impact on the whole generation sequence .
The model is in the process of decoding , Calculate the confidence of each level c, And combine it with reaching “ Exit ahead of time ” The threshold of λ comparison , If c Greater than λ, Then model “ Exit ahead of time ”.

therefore , What is the actual test effect of such a model ?
Inductive translation QA The task performance is good
The paper is in CNN/DM、WMT and SQuAD Three data sets were tested .

among ,CNN/DM Is a news article data set , You need to output a few sentences to summarize the results of the article ;WMT15 EN-FR Is a machine translation data set , It is mainly the result of French English sentences ;Open-book SQUAD 1.1 Is a question based on Wikipedia QA Data sets .

According to yizuo Tal Schuster Introduce , stay Maintain the same performance Under the circumstances ,CALM The average number of decoder layers used is lower than before 3 times .

For this paper , Some netizens agree : Models really don't need to always “ Think deeply for a long time ”, Sometimes the correct answer can be deduced from several layers .

According to the author , This idea of accelerating decoding , Apply to any Seq2seq Model .

The authors introduce
The authors of this paper are 8 personal , From Google and MIT CSAIL, There are two main principals ,Tal Schuster and Adam Fisch.

Tal Schuster Doctor graduated from MIT, Currently, he is a senior researcher of Google , The research direction is the robustness of machine learning models 、 Improved reliability and efficiency .

Adam Fisch,MIT Ph.D. student , Bachelor's degree from Princeton University , The research direction is machine learning to quantify uncertainty 、 Less sample learning, etc .

Small partners interested in reasoning acceleration of large language models , You can stamp the address of the paper to learn more .
Address of thesis :
https://arxiv.org/abs/2207.07061
Reference link :
https://twitter.com/TalSchuster/status/1547966142412513282
边栏推荐
- Reflection, enumeration, and lambda expressions
- What is the transport layer protocol tcp/udp???
- Enterprise digital transformation needs in-depth research, and it cannot be transformed for the sake of transformation
- FOC电机控制基础
- 【EXPDP导出数据】expdp导出23行记录,且不包含lob字段的表,居然用时48分钟,请大家帮忙看看
- 数智转型,管理先行|JNPF全力打造“全生命周期管理”平台
- Zhaoqi science and technology innovation high-end talent project was introduced and implemented, mass entrepreneurship and innovation competition was organized, and online live roadshow was broadcast
- Tool skill learning (I): pre skills -makfile, make,.Mk
- Practical task scheduling platform (scheduled task)
- QT is the most basic layout, creating a window interface
猜你喜欢
![[leetcode daily question] - 268. Missing numbers](/img/96/dcf18522257dea7cb7e52f5bb7ced3.png)
[leetcode daily question] - 268. Missing numbers
![[5 minutes paper] Pointer network](/img/9a/66edc27f08f245447cc6b8867d2383.png)
[5 minutes paper] Pointer network

PS + PL heterogeneous multicore case development manual for Ti C6000 tms320c6678 DSP + zynq-7045 (3)

反射、枚举以及lambda表达式

This article explains in detail the discovery and processing of bigkey and hotkey in redis

Vs2019debug mode too laggy can't enter the breakpoint
![[leetcode daily question] - 121. The best time to buy and sell stocks](/img/51/ae7c4d903a51d97b70d5e69c6fffaa.png)
[leetcode daily question] - 121. The best time to buy and sell stocks

桌面应用布局图

API 版本控制【 Eolink 翻译】

深度学习中图像增强技术的综合综述
随机推荐
我们被一个 kong 的性能 bug 折腾了一个通宵
Digital warehouse: iqiyi digital warehouse platform construction practice
A comprehensive review of image enhancement technology in deep learning
组件化开发基本规范、localStorage 和 sessionStorage、对象数据转基本值、原型链使用
OSPF comprehensive experiment
中金财富证券安全吗 开户要多久
八叉树建立地图并实现路径规划导航
Chapter 7 supporting CORS in rest services
一文详解 Redis 中 BigKey、HotKey 的发现与处理
Enterprise digital transformation needs in-depth research, and it cannot be transformed for the sake of transformation
Practical task scheduling platform (scheduled task)
[static code quality analysis tool] Shanghai daoning brings you sonarource/sonarqube download, trial and tutorial
FOC learning notes - coordinate transformation and simulation verification
QCF for deep packet inspection paper summary
LeetCode_ Prefix and_ Hash table_ Medium_ 525. Continuous array
FTP protocol
SAP ABAP 守护进程的实现方式
中金财富炒股安全吗 手续费最便宜的证券公司
认识JS基础与浏览器引擎
单例模式