Transformer's understanding
2022-07-28 06:11:00 【Alan and fish】
1. Two-head attention mechanism
- Introducing two heads

The two-head attention mechanism simply takes the earlier q, k, and v and subdivides each into two groups, so that the attended information is more fine-grained. The weight W^q is split into two weight matrices, W^{q,1} and W^{q,2}; the input a^i is then multiplied by each of them in turn, yielding q^{i,1} and q^{i,2}.

- Computing α

At this step, the first head of q is matched with the first head of each k, and the second head of q with the second head of each k, giving two sets of attention scores, α^1 and α^2.

- Computing b

The remaining steps are the same as in single-head attention. The difference is that multi-head attention introduces several heads, so the information is more finely subdivided; multiple computations are required, and the results are more precise.
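The two-head computation described above can be sketched as follows. This is a minimal NumPy sketch, not an optimized implementation: the projection weights are hypothetical random matrices, and each head computes its own α scores and output b before the heads are concatenated.

```python
import numpy as np

def two_head_attention(X, d_head, seed=0):
    """Minimal two-head self-attention over a sequence X of shape (n, d_model).

    Each head has its own projection weights (random placeholders here),
    computes its own attention scores alpha, and the per-head outputs b
    are concatenated at the end.
    """
    rng = np.random.default_rng(seed)
    n, d_model = X.shape
    outputs = []
    for head in range(2):
        # Per-head projection matrices W^{q,h}, W^{k,h}, W^{v,h}
        Wq = rng.standard_normal((d_model, d_head))
        Wk = rng.standard_normal((d_model, d_head))
        Wv = rng.standard_normal((d_model, d_head))
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        # alpha: each q of this head attends only to the k's of the same head
        scores = Q @ K.T / np.sqrt(d_head)
        e = np.exp(scores - scores.max(axis=-1, keepdims=True))
        alpha = e / e.sum(axis=-1, keepdims=True)  # row-wise softmax
        outputs.append(alpha @ V)                  # b for this head
    return np.concatenate(outputs, axis=-1)        # shape (n, 2 * d_head)

out = two_head_attention(np.ones((4, 8)), d_head=4)
print(out.shape)  # (4, 8)
```

With more heads, the same loop simply runs more times and concatenates more slices, which is how the "subdivide the information further" idea scales.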
2. Introducing positional information
The attention mechanism has a flaw: it carries no positional information. So we introduce a position matrix with a one-hot structure.
The weight matrix W is split into W^I and W^P, which multiply the input x and the one-hot position vector p respectively, yielding a^i and the positional embedding e^i.
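The split-weight trick above can be checked numerically. In this sketch (all sizes and weights are hypothetical), multiplying the concatenation [x; p] by the split matrix W = [W^I | W^P] gives exactly the same result as adding a positional embedding e^i = W^P p to a^i = W^I x:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_pos, d_out = 6, 10, 4            # hypothetical sizes

WI = rng.standard_normal((d_out, d_model))  # acts on the input x
WP = rng.standard_normal((d_out, n_pos))    # acts on the one-hot position p

x = rng.standard_normal(d_model)
i = 3                                       # this token's position
p = np.zeros(n_pos)
p[i] = 1.0                                  # one-hot position vector

# Multiplying the concatenated vector by the split weight matrix ...
W = np.concatenate([WI, WP], axis=1)
out_concat = W @ np.concatenate([x, p])

# ... equals adding a positional embedding e^i = WP @ p to a^i = WI @ x
out_sum = WI @ x + WP[:, i]

print(np.allclose(out_concat, out_sum))  # True
```

This is why "concatenate a one-hot position, then project" is equivalent to "project, then add a learned positional vector."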
3. A visual understanding of the Transformer architecture

Take machine translation as an example: the input is first encoded and then decoded to obtain the desired output. The Transformer is exactly such an encode-decode process.
The input x is combined with one-hot-encoded positional information and then enters a multi-head self-attention layer. The encoder's result then serves as input to the decoder: the decoder's input first passes through a masked multi-head attention layer, then through another self-attention layer, and finally through a series of operations to produce the final output.
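The decoder's masked attention mentioned above can be sketched with a causal mask: position i is only allowed to attend to positions ≤ i, so the decoder cannot peek at future tokens. This is a minimal NumPy sketch, using a large negative constant to block future positions before the softmax:

```python
import numpy as np

def masked_softmax(scores):
    """Row-wise softmax with a causal mask: token i can only attend to
    tokens at positions <= i, hiding future positions from the decoder."""
    n = scores.shape[0]
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)  # True above the diagonal
    scores = np.where(mask, -1e9, scores)             # block future positions
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

alpha = masked_softmax(np.zeros((4, 4)))
# With all-zero scores, row i spreads attention uniformly over positions 0..i
print(np.round(alpha, 2))
```

Each row still sums to 1, but the entries above the diagonal are (numerically) zero.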
During encoding, an Add & Norm layer is also applied. The Norm here is Layer Normalization; the difference from Batch Norm is that Layer Norm normalizes "horizontally" across the features of each sample, while Batch Norm normalizes "vertically" across the batch.
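The "horizontal vs vertical" distinction can be made concrete with a tiny example (a minimal sketch on a 2×3 matrix, omitting the learnable scale and shift parameters that real implementations add):

```python
import numpy as np

X = np.array([[1., 2., 3.],
              [4., 5., 6.]])  # shape (batch=2, features=3)

# Layer Norm: normalize each sample across its own features ("horizontal")
ln = (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)

# Batch Norm: normalize each feature across the batch ("vertical")
bn = (X - X.mean(axis=0, keepdims=True)) / X.std(axis=0, keepdims=True)

print(ln)  # every row has mean 0
print(bn)  # every column has mean 0
```

The Transformer uses Layer Norm because each token's statistics are computed from that token alone, independent of batch size.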
4. Seeing the effect of the attention mechanism through visualization

As shown in the figure:
In this sentence, "it" is a pronoun, and here "it" refers to "animal", so "it" depends most strongly on "animal"; the connection between them is drawn darker.
5. Comparing the effects of single-head and multi-head attention

The green connections above show multi-head attention, and the red ones below show single-head attention. As the figure shows, multi-head attention attends to more of the information.