当前位置:网站首页>Why is it necessary to scale the attention before softmax (why divide by the square root of d_k)
Why is it necessary to scale the attention before softmax (why divide by the square root of d_k)
2022-07-29 04:17:00 【ytusdc】



from math import exp
from matplotlib import pyplot as plt
import numpy as np
f = lambda x: exp(x * 2) / (exp(x) + exp(x) + exp(x * 2))
x = np.linspace(0, 100, 100)
y_3 = [f(x_i) for x_i in x]
plt.plot(x, y_3)
plt.show()
The resulting graph is shown below :




be :

1、self-attention Do you have to express it like this ?
Unwanted , Can depict Correlation , Similarity and other modeling methods are ok . Better be fast , Model is easy to learn , Expressive enough .
2、 There are other ways not to divide by the root dk Do you ?
Yes , ditto , As long as the gradient of each layer of parameters can be kept within the training sensitive range , Don't be too big , Don't be too small . Then this network is easier to train . There are ways , A better initialization method , Be similar to google Of T5 Model , Just do it during initialization .
Reference article :
transformer Medium attention Why? scaled?
self-attention Why divide by the root d_k_tyler The blog of -CSDN Blog
边栏推荐
猜你喜欢

Why are there so many unknowns when opengauss starts?

rman不标记过期备份

MySQL gets the maximum value record by field grouping

9. Delay queue

为什么opengauss启动的时候这么多的unknown?

Implementation of jump connection of RESNET (pytorch)

从淘宝,天猫,1688,微店,京东,苏宁,淘特等其他平台一键复制商品到拼多多平台(批量上传宝贝详情接口教程)

Record of problems encountered in ROS learning

HC06 HC05 BT

2021 sist summer camp experience + record post of School of information, Shanghai University of science and technology
随机推荐
Value transmission and address transmission of C language, pointer of pointer
"Weilai Cup" 2022 Niuke summer multi school training camp 2H
Not for 63 days. The biggest XOR
Common components of solder pad (2021.4.6)
The return value of the function is the attention of the pointer, the local variables inside the static limit sub function, and how the pointer to the array represents the array elements
9. Delay queue
UnicodeDecodeError: ‘ascii‘ codec can‘t decode byte 0x90 in position 614: ordinal not in range(128)
Change the value of the argument by address through malloc and pointer
Won't you just stick to 69 days? Merge range
Code or script to speed up the video playback of video websites
Introduction and examples of parameters in Jenkins parametric construction
The function "postgis_version" cannot be found when installing PostGIS
不会就坚持67天吧 平方根
不会就坚持60天吧 神奇的字典
SQL server how to judge when the parameter received by the stored procedure is of type int?
Model tuning, training model trick
不会就坚持65天吧 只出现一次的数字
不会就坚持63天吧 最大的异或
The structure pointer must be initialized, and the pointer must also be initialized
Shielding ODBC load balancing mode in gbase 8A special scenarios?