Q&A:Transformer, Bert, ELMO, GPT, VIT
2022-07-03 20:14:00 【Zhou Zhou, Zhou Dashuai】
In the rainy weather of the South, going out has become a luxury; but even if winter is long and dull, the real spring will still arrive quietly.
An opening like that is rare for me, so why the flowery words today? Because I have finally recovered from a cold! So here is a summary of my recent research work. There are many topics I don't dare touch yet, the water is too deep for me to handle, so I will just write up the common questions and answers.
One. Q&A: Transformer
1. Why does the Transformer use a multi-head attention mechanism?
You can think of it this way: when we do self-attention, we use a query q to find the relevant keys k. But "relevance" can take many different forms and has many different definitions, so perhaps a single q is not enough; there should be several different q's, each responsible for a different kind of relevance. Put more formally: multiple heads let the Transformer attend to information from different subspaces and capture richer feature information.
2. Why does the Transformer compute attention with a dot product rather than with addition?
K and Q are multiplied (dot-product) to obtain an attention score matrix, which is then used to weight V. K and Q are computed with two different weight matrices, $W^K$ and $W^Q$, which can be understood as projections onto different spaces. Because of these projections onto different spaces, the expressive power increases, and the attention score matrix computed this way generalizes better.
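A minimal sketch of the projections and the dot-product score matrix (single head, randomly initialized weights; PyTorch is used here only for illustration and is not part of the original post):

```python
import torch

torch.manual_seed(0)
seq_len, d_model, d_k = 4, 8, 8

x = torch.randn(seq_len, d_model)      # token embeddings
W_q = torch.randn(d_model, d_k)        # query projection
W_k = torch.randn(d_model, d_k)        # key projection (a different matrix, i.e. a different space)
W_v = torch.randn(d_model, d_k)        # value projection

Q, K, V = x @ W_q, x @ W_k, x @ W_v
scores = Q @ K.T                       # dot-product attention score matrix, [seq_len, seq_len]
weights = torch.softmax(scores / d_k ** 0.5, dim=-1)
out = weights @ V                      # V weighted by the attention scores
```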
3. Why does attention need to be scaled before the softmax?
See this Zhihu answer to "Why is attention in the transformer scaled?":
https://www.zhihu.com/question/339723385/answer/782509914
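The short version is a variance argument: for queries and keys with zero-mean, unit-variance components, the dot product has variance d_k, so without scaling the logits are large and the softmax saturates, which kills the gradients. A quick numeric check (my own illustration, not from the linked answer):

```python
import torch

torch.manual_seed(0)
d_k = 512
q = torch.randn(10000, d_k)            # queries with zero-mean, unit-variance components
k = torch.randn(10000, d_k)            # keys with zero-mean, unit-variance components

dots = (q * k).sum(dim=-1)             # raw dot products
print(dots.std())                      # ~sqrt(d_k) ~ 22.6: very peaked, saturated softmax
print((dots / d_k ** 0.5).std())       # ~1 after scaling: softmax stays well-behaved
```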
4. How do the encoder and decoder in the Transformer interact?
I covered how the encoder and decoder pass information between each other in the earlier post Why transformer? (Three). In short: q comes from the decoder, while k and v come from the encoder; this step is called cross attention. For details, see Why transformer? (Three):
https://blog.csdn.net/m0_57541899/article/details/122761220?spm=1001.2014.3001.5501
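A minimal sketch of that wiring (single head, random weights; the only point is where q, k and v come from):

```python
import torch

torch.manual_seed(0)
d_model = 8
enc_out = torch.randn(6, d_model)       # encoder output (source sequence, length 6)
dec_hidden = torch.randn(3, d_model)    # decoder hidden states (target sequence, length 3)

W_q = torch.randn(d_model, d_model)
W_k = torch.randn(d_model, d_model)
W_v = torch.randn(d_model, d_model)

q = dec_hidden @ W_q                    # queries come from the decoder
k = enc_out @ W_k                       # keys come from the encoder
v = enc_out @ W_v                       # values come from the encoder
attn = torch.softmax(q @ k.T / d_model ** 0.5, dim=-1)
context = attn @ v                      # [3, d_model]: one context vector per decoder position
```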
5. Why does the Transformer block use LayerNorm instead of BatchNorm? (See also the layer normalization vs. batch normalization question in the VIT section below.)
Two. Q&A: Bert and its family
1. What exactly does Bert do?
What Bert does is this: feed a word sequence into Bert, and it hands back one embedding for every word. As for what the architecture inside Bert looks like, as mentioned before, Bert's architecture is the same as that of the Transformer encoder. What is inside a Transformer encoder? Self-attention layers, of course. And what does a self-attention layer do? It takes a sequence as input and outputs a sequence. So the whole Bert pipeline is: input a word sequence, output the embedding of each corresponding word.
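The post stays at the conceptual level, but the "word sequence in, one embedding per token out" view maps directly onto, for example, the Hugging Face Transformers library (not used in the original post, shown here only as an illustration):

```python
# Illustration only: the original post does not use this library.
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("potato chips are delicious", return_tensors="pt")
outputs = model(**inputs)

# one embedding per input token (plus the special [CLS] and [SEP] tokens): [1, seq_len, 768]
print(outputs.last_hidden_state.shape)
```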
2. How is this neural network Bert trained?
The first training method is the masked language model (Masked Language Model, MLM); the second is next sentence prediction (Next Sentence Prediction, NSP). I described both in more detail in the earlier post Bert and its family——Bert:
https://blog.csdn.net/m0_57541899/article/details/122789735?spm=1001.2014.3001.5501
3. What are the advantages and disadvantages of Bert's masking method?
Bert's masking scheme: of the 15% of tokens selected for masking, 80% of the time the token is replaced with [MASK], 10% of the time it is replaced with a random token, and the remaining 10% of the time it is left unchanged. Advantages: (1) among the 15% of selected tokens, predicting the correct word after it has been replaced by a random token (the 10% case) is essentially a text-correction task, which gives the Bert model a certain amount of error-correction ability; (2) keeping the original token with 10% probability alleviates the mismatch between fine-tuning inputs and pre-training inputs (pre-training inputs contain [MASK] tokens, while fine-tuning inputs are complete sentences, so the inputs would otherwise not match). Disadvantage: for words made up of two or more consecutive tokens, random masking breaks the correlation between those consecutive tokens, making it hard for the model to learn word-level semantic information.
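A minimal sketch of the 15% / 80-10-10 selection rule described above, assuming integer token ids and a hypothetical `mask_id`; this follows the description in the answer, not any official implementation:

```python
import torch

def bert_mask(token_ids, mask_id, vocab_size, select_prob=0.15):
    """Return (corrupted_ids, labels) following the 15% / 80-10-10 rule."""
    ids = token_ids.clone()
    labels = torch.full_like(ids, -100)              # -100: position not selected for prediction

    selected = torch.rand(ids.shape) < select_prob   # pick ~15% of positions
    labels[selected] = ids[selected]                 # the model must predict the original token here

    r = torch.rand(ids.shape)
    mask_80 = selected & (r < 0.8)                   # 80%: replace with [MASK]
    rand_10 = selected & (r >= 0.8) & (r < 0.9)      # 10%: replace with a random token
    # remaining 10%: keep the original token unchanged

    ids[mask_80] = mask_id
    ids[rand_10] = torch.randint(vocab_size, (int(rand_10.sum()),))
    return ids, labels
```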
4. Once the model is trained, how do we actually use Bert?
An intuitive idea is to train the Bert model together with your downstream task. So how is Bert combined with the downstream task? Four examples are given in Bert and its family——Bert for reference:
https://blog.csdn.net/m0_57541899/article/details/122789735?spm=1001.2014.3001.5501
5. What loss functions correspond to Bert's two pre-training tasks?
Bert's loss function has two parts: one from the masked language model (Masked Language Model, MLM) and one from next sentence prediction (Next Sentence Prediction, NSP). Jointly learning the two tasks lets the embeddings Bert learns carry both token-level information and sentence-level semantic information. Concretely, the overall loss is the sum of two cross-entropy terms, $L = L_{\text{MLM}} + L_{\text{NSP}}$, where $L_{\text{MLM}}$ is the cross-entropy over the masked positions and $L_{\text{NSP}}$ is the cross-entropy of the binary IsNext/NotNext classification.
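A minimal sketch of how the two terms are combined, assuming hypothetical MLM and NSP head outputs (PyTorch, illustration only):

```python
import torch
import torch.nn.functional as F

# hypothetical head outputs: batch of 2 sequences of length 5, vocabulary of 100
mlm_logits = torch.randn(2, 5, 100)      # per-token vocabulary logits from the MLM head
mlm_labels = torch.randint(100, (2, 5))  # original ids at masked positions
mlm_labels[:, 2:] = -100                 # pretend only the first two positions were masked

nsp_logits = torch.randn(2, 2)           # IsNext / NotNext logits from the NSP head
nsp_labels = torch.tensor([0, 1])

loss_mlm = F.cross_entropy(mlm_logits.view(-1, 100), mlm_labels.view(-1), ignore_index=-100)
loss_nsp = F.cross_entropy(nsp_logits, nsp_labels)
loss = loss_mlm + loss_nsp               # joint pre-training objective
```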
6. What exactly does ELMO do?
What ELMO does is described in the earlier post Bert and its family——ELMO:
https://blog.csdn.net/m0_57541899/article/details/122775529?spm=1001.2014.3001.5501
7. What exactly does GPT do?
You have probably guessed that I am about to drop another link, right? Yes, you guessed it.
Three. Q&A: VIT
There are already plenty of excellent blog posts about the vision transformer, so here are just a few common questions.
1. What is the difference between layer normalization and batch normalization?
Layer normalization is even simpler than BN: it takes one vector as input and outputs another vector, with no need to consider the batch. It computes the mean and std across the different dimensions within the same feature vector (a single sample), whereas BN computes the mean and std over the same dimension across different feature vectors (across the batch).
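A quick way to see the difference is which axis the statistics are taken over; a small PyTorch check (shapes are illustrative only):

```python
import torch
import torch.nn as nn

x = torch.randn(4, 16)        # [batch, features]

ln = nn.LayerNorm(16)         # per sample: mean/std over the 16 features of each row
bn = nn.BatchNorm1d(16)       # per feature: mean/std over the 4 samples of each column

y_ln = ln(x)
y_bn = bn(x)

print(y_ln.mean(dim=1))       # ~0 for every sample (row-wise normalization)
print(y_bn.mean(dim=0))       # ~0 for every feature (column-wise normalization)
```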
2. Why is a CLS token needed?
By analogy with Bert: Bert inserts a [CLS] symbol in front of the text and uses the output vector corresponding to that symbol as the semantic representation of the whole passage for text classification. Intuitively, compared with the words/characters already present in the text, this symbol, which carries no obvious semantic information of its own, can fuse the semantic information of the words/characters in the text more "fairly".
3. The essence of Multi-head Attention
Sometimes a single q is not enough; there should be several different q's, each responsible for a different kind of relevance. The essence is: with the total number of parameters unchanged, the same q, k, v are mapped into different subspaces of the original high-dimensional space for the attention computation. In other words, the model looks for relationships within the sequence from different angles, and a final concat integrates the associations captured in the different subspaces.
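A minimal sketch of the split-into-subspaces-then-concat view, assuming d_model = 8 and 2 heads so that each head works in a 4-dimensional subspace (random weights, illustration only):

```python
import torch

torch.manual_seed(0)
seq_len, d_model, num_heads = 5, 8, 2
d_head = d_model // num_heads              # each head gets a lower-dimensional subspace

x = torch.randn(seq_len, d_model)
W_qkv = torch.randn(d_model, 3 * d_model)  # same total parameter count as one big head
q, k, v = (x @ W_qkv).chunk(3, dim=-1)

# reshape into heads: [num_heads, seq_len, d_head]
q = q.view(seq_len, num_heads, d_head).transpose(0, 1)
k = k.view(seq_len, num_heads, d_head).transpose(0, 1)
v = v.view(seq_len, num_heads, d_head).transpose(0, 1)

attn = torch.softmax(q @ k.transpose(-2, -1) / d_head ** 0.5, dim=-1)
heads = attn @ v                           # each head captures its own kind of relevance

# concat the per-head outputs back to [seq_len, d_model]
out = heads.transpose(0, 1).reshape(seq_len, d_model)
```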
4. VIT principle analysis
Theoretical analysis:
Code implementation:
For a standard Transformer module, the required input is a sequence of tokens (vectors), i.e. a two-dimensional matrix [num_token, token_dim] = [196, 768].
In the code implementation this is done directly with a convolution layer. Taking ViT-B/16 as an example, the convolution kernel size is 16x16, the stride is 16, and the number of kernels is 768: [224, 224, 3] -> [14, 14, 768] -> [196, 768].
Before the input goes into the TransformerEncoder, a [class] token and a Position Embedding have to be added. Concatenating the [class] token: Cat([1, 768], [196, 768]) -> [197, 768]; adding the Position Embedding: [197, 768] -> [197, 768].
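A minimal sketch of that ViT-B/16 input pipeline (the projection and embeddings are randomly initialized here, shown only to verify the shapes):

```python
import torch
import torch.nn as nn

img = torch.randn(1, 3, 224, 224)                  # [B, C, H, W]

# patch embedding as a 16x16 convolution with stride 16 and 768 output channels
patch_embed = nn.Conv2d(3, 768, kernel_size=16, stride=16)
x = patch_embed(img)                               # [1, 768, 14, 14]
x = x.flatten(2).transpose(1, 2)                   # [1, 196, 768]: one 768-d token per patch

cls_token = nn.Parameter(torch.zeros(1, 1, 768))   # learnable [class] token
pos_embed = nn.Parameter(torch.zeros(1, 197, 768)) # learnable position embedding

x = torch.cat([cls_token, x], dim=1)               # [1, 197, 768]
x = x + pos_embed                                  # ready for the Transformer encoder
print(x.shape)
```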