Teaching large models to skip "useless" layers triples inference speed with no loss in performance: a new method from Google and MIT draws attention
2022-07-26 15:50:00 【QbitAl】
Xiao Xiao, reporting from Aofei Temple
QbitAI | Official account: QbitAI
Large language models are impressive, but is their inference too slow?
What's more, making a model bigger does not necessarily make its inference results better.
To address this, researchers from Google and MIT have proposed a new framework, CALM, which lets the model decide for itself how much computation to spend.
If CALM judges that certain layers are not essential, it skips them during computation.
As soon as the paper was posted online, it took off:

One commenter said that this kind of smarter, more adaptive model is exactly what we need, and that CALM's decoder has clearly achieved it:

Outputting results directly from intermediate layers
CALM stands for Confident Adaptive Language Modeling.
The model is based on the Transformer architecture. To speed up its computation, the researchers proposed a method called "early exiting", which lets the model dynamically decide how many layers of the network to compute depending on the input.
In other words, the model does not have to pass through every layer before producing an output. It can instead use the features of an intermediate layer to output a token directly, which reduces the amount of computation.
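The idea of producing a token from an intermediate layer can be sketched with a toy NumPy decoder stack. This is only an illustration under stated assumptions, not the paper's implementation: the layers are random residual transforms, and the names `layer_weights`, `output_head`, and `predict_token` are hypothetical. It also assumes a single output head is shared by every exit point.

```python
import numpy as np

rng = np.random.default_rng(0)
HIDDEN, VOCAB, NUM_LAYERS = 8, 5, 6

# Toy stand-ins for decoder layers: each is a random residual tanh transform.
layer_weights = [rng.normal(scale=0.5, size=(HIDDEN, HIDDEN)) for _ in range(NUM_LAYERS)]
# One output projection shared by every exit point (an assumption of this sketch).
output_head = rng.normal(size=(HIDDEN, VOCAB))


def predict_token(hidden, exit_layer):
    """Run only the first `exit_layer` layers, then project to the vocabulary."""
    for weight in layer_weights[:exit_layer]:
        hidden = hidden + np.tanh(hidden @ weight)
    logits = hidden @ output_head
    return int(np.argmax(logits)), exit_layer


start = rng.normal(size=HIDDEN)
full_token, full_cost = predict_token(start, NUM_LAYERS)  # standard path: all layers
early_token, early_cost = predict_token(start, 2)         # early exit after 2 layers
print(f"full:  token {full_token} after {full_cost} layers")
print(f"early: token {early_token} after {early_cost} layers")
```

Here the exit layer is fixed by hand. The point is only that the same projection can be applied to any intermediate hidden state, so stopping early saves the cost of the remaining layers.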

So how does the model decide when to exit?
This requires training the model to make that judgment on its own.
Here, Yfull is the output of the standard model and Yearly is the output when the model exits early. For Yearly to be good, it needs to be as consistent with Yfull as possible.

Of course, different tasks place different demands on the consistency of the output text. For tasks where the generated result does not need to be as strict (where more diverse sentences are acceptable), the consistency requirement between Yfull and Yearly is not as high.
For this reason, the authors give two different formulations in the paper, to be chosen according to the situation:

In practice, the paper defines a local, per-token confidence measure and examines how it affects the generated sequence as a whole.
During decoding, the model computes a confidence score c at each layer and compares it with the early-exit threshold λ. If c is greater than λ, the model exits early.
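The per-layer comparison of c against λ can be sketched as follows. This is a minimal sketch under assumptions, not the authors' code: the function `early_exit_decode` is hypothetical, and the confidence measure (the largest softmax probability) is chosen here for illustration, since the article does not say how c is computed.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    exp = np.exp(logits - np.max(logits))
    return exp / exp.sum()


def early_exit_decode(hidden, layers, output_head, threshold):
    """Return (token, layers_used), stopping at the first layer where c > threshold."""
    if not layers:
        raise ValueError("at least one layer is required")
    for depth, layer in enumerate(layers, start=1):
        hidden = layer(hidden)
        probs = softmax(hidden @ output_head)
        confidence = float(np.max(probs))  # assumed confidence measure c
        if confidence > threshold:         # c > lambda: exit early
            break
    return int(np.argmax(probs)), depth


# Toy setup: each "layer" doubles the hidden state, so confidence grows with depth.
toy_layers = [lambda h: 2.0 * h] * 4
toy_head = np.eye(2)
toy_hidden = np.array([1.0, 0.0])

print(early_exit_decode(toy_hidden, toy_layers, toy_head, threshold=0.9))
print(early_exit_decode(toy_hidden, toy_layers, toy_head, threshold=1.0))
```

In this toy setup a lower threshold lets the loop stop after fewer layers, while a threshold that can never be exceeded falls back to running the full stack. That is the trade-off between computation and consistency that λ controls.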

So how does such a model perform in actual tests?
Strong results on summarization, translation, and QA tasks
The paper ran tests on three datasets: CNN/DM, WMT, and SQuAD.

CNN/DM is a dataset of news articles, where the task is to output a few sentences summarizing each article. WMT15 EN-FR is a machine translation dataset consisting mainly of English-French sentence pairs. Open-book SQuAD 1.1 is a question answering (QA) dataset based on Wikipedia.

According to first author Tal Schuster, CALM uses on average about one third as many decoder layers as before while maintaining the same performance.

Some commenters agreed with the paper: models really do not always need to "think long and hard", and sometimes the correct answer can be derived after just a few layers.

According to the authors, this approach to accelerating decoding applies to any Seq2seq model.

About the authors
The paper has 8 authors, from Google and MIT CSAIL, with two lead authors: Tal Schuster and Adam Fisch.

Tal Schuster received his Ph.D. from MIT and is currently a senior researcher at Google. His research focuses on the robustness, reliability, and efficiency of machine learning models.

Adam Fisch is a Ph.D. student at MIT with a bachelor's degree from Princeton University. His research covers uncertainty quantification in machine learning, few-shot learning, and related topics.

Readers interested in accelerating inference for large language models can follow the paper link below to learn more.
Paper link:
https://arxiv.org/abs/2207.07061
Reference link :
https://twitter.com/TalSchuster/status/1547966142412513282