Why is it necessary to scale the attention before softmax (why divide by the square root of d_k)
2022-07-29 04:17:00 【ytusdc】
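from math import exp
from matplotlib import pyplot as plt
import numpy as np

# Softmax probability of the largest logit in the vector [x, x, 2x]
f = lambda x: exp(x * 2) / (exp(x) + exp(x) + exp(x * 2))
x = np.linspace(0, 100, 100)
y_3 = [f(x_i) for x_i in x]
plt.plot(x, y_3)
plt.xlabel("x")
plt.ylabel("softmax([x, x, 2x])[2]")
plt.show()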
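The resulting plot (figure not preserved here): since f(x) = 1 / (1 + 2e^{-x}), the curve starts at 1/3 at x = 0, is already above 0.98 at x = 5, and stays flat at 1 for the rest of the range. Once the logits are large, softmax puts almost all of the probability on the largest one.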
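The saturation of this curve can also be checked numerically. The sketch below is not from the original post; the names `p_last` and `dp_last` are made up for illustration. It rewrites the same quantity in closed form and evaluates its derivative, which is what backpropagation would see, to show how quickly the gradient shrinks as the logits grow.

```python
import numpy as np

def p_last(x):
    """Softmax probability of the last entry of [x, x, 2x].

    Algebraically identical to exp(2x) / (2*exp(x) + exp(2x)),
    rewritten as 1 / (1 + 2*exp(-x)) to avoid large exponentials.
    """
    return 1.0 / (1.0 + 2.0 * np.exp(-x))

def dp_last(x):
    """Derivative of p_last with respect to x: 2u / (1 + 2u)^2 with u = exp(-x)."""
    u = np.exp(-x)
    return 2.0 * u / (1.0 + 2.0 * u) ** 2

# The probability approaches 1 while its gradient decays roughly like 2*exp(-x).
for x in (0.0, 2.0, 5.0, 10.0, 20.0):
    print(f"x = {x:5.1f}   p = {p_last(x):.6f}   dp/dx = {dp_last(x):.3e}")
```

With gradients this small, the parameters that produced the logits receive almost no update, which is the motivation for keeping the logits in a moderate range before softmax.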
1. Does self-attention have to be expressed this way?
No. Any formulation that models correlation or similarity between positions will do. Ideally it is fast to compute, easy for the model to learn, and expressive enough.
2. Are there alternatives to dividing by the square root of d_k?
Yes. As above, what matters is that the gradients of each layer's parameters stay in a range the training is sensitive to, neither too large nor too small; a network with that property is easier to train. One alternative is a better initialization scheme, as in Google's T5 model, which takes care of the scaling at initialization.
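The two options in answer 2 can be compared with a small simulation. This is a minimal sketch under stated assumptions, not code from the original post or from T5: queries and keys are drawn as independent zero-mean Gaussians, and the helper `dot_variance` and its parameters `q_std` and `scale` are hypothetical names. Under these assumptions the raw dot product has variance close to d_k, while either dividing by the square root of d_k or shrinking the standard deviation of the queries at initialization brings it back to about 1.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 10_000  # number of random (q, k) pairs per estimate

def dot_variance(d_k, q_std=1.0, scale=1.0):
    """Empirical variance of scale * (q . k) for random q and k of size d_k.

    q entries ~ N(0, q_std^2), k entries ~ N(0, 1); both are assumptions
    made for illustration only.
    """
    q = rng.normal(0.0, q_std, size=(n_samples, d_k))
    k = rng.normal(0.0, 1.0, size=(n_samples, d_k))
    scores = scale * np.sum(q * k, axis=1)
    return scores.var()

for d_k in (16, 64, 256):
    raw = dot_variance(d_k)                              # no scaling
    scaled = dot_variance(d_k, scale=1.0 / np.sqrt(d_k))  # divide by sqrt(d_k)
    init = dot_variance(d_k, q_std=1.0 / np.sqrt(d_k))    # smaller init for q
    print(f"d_k = {d_k:3d}   raw = {raw:7.2f}   scaled = {scaled:.3f}   init = {init:.3f}")
```

The third column only illustrates the idea of moving the scaling into the initialization; the exact scheme used by a given model may differ.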
References:
Why is the attention in the Transformer scaled?
Why self-attention divides by the square root of d_k (tyler's blog, CSDN)