Why is it necessary to scale the attention before softmax (why divide by the square root of d_k)
2022-07-29 04:17:00 【ytusdc】
from math import exp

import numpy as np
from matplotlib import pyplot as plt

# Softmax weight of the largest logit when the three logits are [x, x, 2x].
f = lambda x: exp(x * 2) / (exp(x) + exp(x) + exp(x * 2))

x = np.linspace(0, 100, 100)
y_3 = [f(x_i) for x_i in x]
plt.plot(x, y_3)
plt.show()
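The resulting graph is shown below:
[Figure: f(x) plotted for x from 0 to 100. f(0) = 1/3, and the curve saturates close to 1 within the first few units of x.]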
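To make the saturation concrete, here is a minimal sketch that evaluates the same three logits [x, x, 2x] and also reports the gradient of the largest softmax output with respect to its own logit, which for softmax equals p * (1 - p). The helper names `softmax` and `largest_logit_grad` are my own, not from the original post. The link to d_k is the usual argument: if the components of q and k have roughly unit variance, the dot product q·k has variance of about d_k, so large d_k pushes logits into this saturated region unless they are scaled down.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())  # subtract the max to avoid overflow
    return e / e.sum()

def largest_logit_grad(x):
    """Return (p, dp/dz) for the largest logit of [x, x, 2x].

    For softmax, the derivative of an output p_i with respect to its
    own logit z_i is p_i * (1 - p_i).
    """
    p = softmax([x, x, 2 * x])[-1]
    return p, p * (1.0 - p)

# As x grows, p approaches 1 and the gradient shrinks toward 0.
for x in (0.0, 1.0, 5.0, 10.0, 20.0):
    p, g = largest_logit_grad(x)
    print(f"x={x:5.1f}  p={p:.6f}  grad={g:.3e}")
```

Once the logits are a few units apart, the gradient through softmax is close to zero, which is the training problem that dividing by the square root of d_k is meant to avoid.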
1. Does self-attention have to be expressed this way?
No. Any formulation that models correlation or similarity between tokens will work. Ideally it should be fast to compute, easy for the model to learn, and expressive enough.
2. Are there ways to avoid dividing by the square root of d_k?
Yes, for the same reason as above. What matters is that the gradients of each layer's parameters stay within a range where training is effective, neither too large nor too small; a network with that property is easier to train. One alternative is a better initialization scheme: Google's T5 model, for example, handles the scaling at initialization time.
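The second answer can be illustrated with a small numerical sketch. It compares three setups: unscaled dot products, dot products divided by the square root of d_k, and unscaled dot products whose query projection is initialized with a smaller standard deviation so that the scale is folded into the weights. The sizes, the Gaussian inputs, and the specific standard deviations are assumptions chosen for illustration; this is not T5's actual code, only the general idea the answer describes.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_k, n = 512, 64, 2000  # illustrative sizes (assumed, not from the article)

x_q = rng.standard_normal((n, d_model))  # inputs to the query projection
x_k = rng.standard_normal((n, d_model))  # inputs to the key projection

def logit_std(w_q_std, w_k_std, scale):
    """Std of q.k logits for Gaussian projection weights with the given stds."""
    w_q = rng.standard_normal((d_model, d_k)) * w_q_std
    w_k = rng.standard_normal((d_model, d_k)) * w_k_std
    q, k = x_q @ w_q, x_k @ w_k
    logits = np.sum(q * k, axis=1) * scale  # one q.k pair per row
    return logits.std()

base = d_model ** -0.5  # gives q and k roughly unit-variance components

unscaled = logit_std(base, base, 1.0)              # std grows like sqrt(d_k)
scaled = logit_std(base, base, d_k ** -0.5)        # explicit 1/sqrt(d_k) factor
folded = logit_std(base * d_k ** -0.5, base, 1.0)  # factor folded into W_q init

print(f"unscaled: {unscaled:.2f}")
print(f"scaled:   {scaled:.2f}")
print(f"folded:   {folded:.2f}")
```

Under these assumptions the explicit scaling and the folded initialization both bring the logit standard deviation to about 1, while the unscaled version sits near sqrt(d_k) = 8. The two are only equivalent at initialization; during training the weights move and nothing forces the logits to stay in range.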
Reference articles:
Why is the attention in the Transformer scaled?
Why does self-attention divide by the square root of d_k (tyler's blog, CSDN)