Q&A:Transformer, Bert, ELMO, GPT, VIT
2022-07-03 20:14:00 【Zhou Zhou, Zhou Dashuai】
In the rainy weather of the South, going out has become a luxury; but even if winter is long and dull, a real spring will still arrive quietly.
An opening like this is rare here, so why the wordplay today? Because I have finally recovered from a cold! So here is a summary of my recent research reading. In many places I dare not dig too deep (the water is too deep and I cannot hold my footing), so I will just write up the common questions and answers.
I. Q&A: Transformer
1. Why does Transformer use the multi-head attention mechanism?
You can think of it this way: in self-attention, we use q to find the relevant k. But "relevance" can take many different forms and many different definitions, so perhaps a single q is not enough; there should be several different q, each responsible for a different kind of relevance. Put more formally: multiple heads let the Transformer attend to information from different representation subspaces and capture richer feature information.
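As an illustration, here is a minimal PyTorch sketch of multi-head self-attention (the class name and the `embed_dim`/`num_heads` values are chosen for the example, not taken from any particular codebase):

```python
import torch
import torch.nn as nn


class MultiHeadSelfAttention(nn.Module):
    def __init__(self, embed_dim=768, num_heads=12):
        super().__init__()
        assert embed_dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads
        # One projection per role; each head gets its own slice of q/k/v,
        # i.e. its own notion of "relevance".
        self.q_proj = nn.Linear(embed_dim, embed_dim)
        self.k_proj = nn.Linear(embed_dim, embed_dim)
        self.v_proj = nn.Linear(embed_dim, embed_dim)
        self.out_proj = nn.Linear(embed_dim, embed_dim)

    def forward(self, x):                      # x: [batch, seq_len, embed_dim]
        B, N, C = x.shape
        # Project, then split the channel dimension into heads: [B, heads, N, head_dim]
        q = self.q_proj(x).view(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).view(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(x).view(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        # Scaled dot-product attention, computed independently per head.
        scores = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5   # [B, heads, N, N]
        attn = scores.softmax(dim=-1)
        out = attn @ v                                               # [B, heads, N, head_dim]
        # Concatenate the heads back together and mix them with a final projection.
        out = out.transpose(1, 2).reshape(B, N, C)
        return self.out_proj(out)
```

Calling `MultiHeadSelfAttention()(torch.randn(2, 10, 768))` returns a tensor of the same shape, with each position now mixing the information gathered by all the heads.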
2. Why does Transformer compute attention with a dot product instead of addition?
K and Q are combined by a dot product to obtain an attention score matrix, which is then used to weight V. K and Q are produced with different projection matrices, which can be understood as projections into different spaces. Because of these different projections the expressive power increases, and the attention score matrix computed this way generalizes better.
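To make the comparison in the question concrete, here is a hedged sketch of the two scoring functions (dot-product versus additive scoring in the spirit of Bahdanau-style attention; the shapes and layer names are illustrative):

```python
import torch
import torch.nn as nn

d_k = 64
q = torch.randn(10, d_k)          # 10 query vectors
k = torch.randn(12, d_k)          # 12 key vectors

# Dot-product scoring: a single matrix multiplication, easy to batch on GPUs.
dot_scores = q @ k.T / d_k ** 0.5                                   # [10, 12]

# Additive scoring: project q and k, add, pass through tanh, then a learned vector.
W_q, W_k = nn.Linear(d_k, d_k, bias=False), nn.Linear(d_k, d_k, bias=False)
v = nn.Linear(d_k, 1, bias=False)
add_scores = v(torch.tanh(W_q(q)[:, None, :] + W_k(k)[None, :, :])).squeeze(-1)  # [10, 12]
```

Both produce an attention score matrix of the same shape; the dot-product form reduces to one matrix multiplication, which is one practical reason it is preferred in Transformer.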
3. Why does attention need to be scaled before the softmax?
In short: when the dimension d_k is large, the dot products grow in magnitude and push the softmax into saturated regions where gradients are tiny; dividing by sqrt(d_k) keeps the variance of the scores roughly constant. For a detailed discussion see the Zhihu answer "Why is the attention in Transformer scaled?": https://www.zhihu.com/question/339723385/answer/782509914
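A small numeric sketch of why the scaling matters (the dimensions here are arbitrary):

```python
import torch

torch.manual_seed(0)
for d_k in (16, 256, 4096):
    q = torch.randn(1000, d_k)
    k = torch.randn(1000, d_k)
    raw = (q * k).sum(-1)                 # unscaled dot products
    scaled = raw / d_k ** 0.5             # what Transformer actually feeds to softmax
    print(d_k, raw.std().item(), scaled.std().item())

# The std of the raw dot products grows like sqrt(d_k), so the softmax would saturate;
# after dividing by sqrt(d_k) the std stays close to 1 regardless of d_k.
```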
4. How do the encoder and the decoder in Transformer interact?
This was covered in detail in the earlier post Why transformer? (Part 3), which explains how messages are passed between the encoder and the decoder. Briefly: q comes from the decoder, while k and v come from the encoder; this step is called cross attention. For details see: Why transformer? (Part 3) https://blog.csdn.net/m0_57541899/article/details/122761220?spm=1001.2014.3001.5501
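A minimal sketch of this cross-attention step (single head, no masking; the shapes are illustrative):

```python
import torch
import torch.nn as nn

d_model = 512
W_q = nn.Linear(d_model, d_model)   # applied to the decoder states
W_k = nn.Linear(d_model, d_model)   # applied to the encoder output
W_v = nn.Linear(d_model, d_model)   # applied to the encoder output

enc_out = torch.randn(1, 20, d_model)   # encoder output, 20 source tokens
dec_in = torch.randn(1, 7, d_model)     # decoder states, 7 target tokens so far

q = W_q(dec_in)                          # q comes from the decoder
k, v = W_k(enc_out), W_v(enc_out)        # k and v come from the encoder
attn = (q @ k.transpose(-2, -1) / d_model ** 0.5).softmax(dim=-1)   # [1, 7, 20]
cross = attn @ v                         # each target position reads from the source
```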
5. Why do Transformer blocks use LayerNorm instead of BatchNorm? (The difference between the two is discussed under question 1 of the VIT section below.)
II. Q&A: Bert and its family
1. What exactly does Bert do?
What Bert does is this: feed a word sequence into Bert, and for every word it spits out an embedding. As for what the architecture inside Bert looks like: as mentioned before, Bert uses the same architecture as the Transformer encoder. What is inside a Transformer encoder? Self-attention layers, of course. And what does a self-attention layer do? It takes a sequence as input and outputs a sequence. So the whole Bert pipeline is: input a word sequence, output the corresponding embedding for each word.
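This "sequence in, one embedding per token out" behaviour is easy to see in code. A minimal sketch using the HuggingFace `transformers` library (assuming it and the `bert-base-uncased` checkpoint are available; not part of the original post):

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("Bert maps every token to an embedding.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One contextual embedding per input token (including [CLS] and [SEP]).
print(inputs["input_ids"].shape)            # e.g. torch.Size([1, 10])
print(outputs.last_hidden_state.shape)      # e.g. torch.Size([1, 10, 768])
```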
2. How is this network, Bert, trained?
The first training task is the masked language model (Masked Language Model, MLM); the second is next sentence prediction (Next Sentence Prediction, NSP). Both are described in more detail in the earlier post Bert and its family——Bert: https://blog.csdn.net/m0_57541899/article/details/122789735?spm=1001.2014.3001.5501
3. What are the advantages and disadvantages of Bert's masking strategy?
Bert's masking strategy: among the 15% of words selected for masking, 80% of the time the word is replaced with [MASK], 10% of the time it is replaced with a random word, and the remaining 10% of the time it is left unchanged. Advantages: (1) among the selected 15%, replacing a word with a random one with 10% probability while still asking the model to predict the correct word is essentially a text-correction task, which gives Bert some text-correction ability; (2) keeping the original word with 10% probability alleviates the input mismatch between fine-tuning and pre-training (pre-training sees [MASK] tokens, while fine-tuning sees complete sentences, so the inputs would otherwise not match). Disadvantage: for words made up of two or more consecutive tokens, random masking breaks the correlation between those consecutive tokens, which makes it hard for the model to learn word-level semantics.
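A hedged sketch of the 15% / 80-10-10 selection logic described above, simplified to ignore special tokens; the `mask_token_id` and `vocab_size` defaults are placeholders:

```python
import torch

def bert_style_mask(input_ids, mask_token_id=103, vocab_size=30522):
    """Return (corrupted_ids, labels); labels are -100 where no prediction is needed."""
    labels = input_ids.clone()
    # Pick 15% of positions to predict.
    selected = torch.rand(input_ids.shape) < 0.15
    labels[~selected] = -100                     # ignored by the loss later

    roll = torch.rand(input_ids.shape)
    corrupted = input_ids.clone()
    # 80% of the selected positions -> [MASK]
    corrupted[selected & (roll < 0.8)] = mask_token_id
    # 10% -> a random token (this is the text-correction-like part)
    random_ids = torch.randint(vocab_size, input_ids.shape)
    replace = selected & (roll >= 0.8) & (roll < 0.9)
    corrupted[replace] = random_ids[replace]
    # remaining 10% -> keep the original token (reduces pretrain/fine-tune mismatch)
    return corrupted, labels

ids = torch.randint(1000, (1, 12))
x, y = bert_style_mask(ids)
```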
4. Now that we have the model, how do we actually use Bert?
An intuitive idea is to train the Bert model together with your downstream task. So how is Bert combined with the downstream task? Four examples are given in Bert and its family——Bert for reference: https://blog.csdn.net/m0_57541899/article/details/122789735?spm=1001.2014.3001.5501
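As one concrete, hypothetical example of "training Bert together with the downstream task": a sentence-classification head can be stacked on the [CLS] embedding and the whole model fine-tuned end to end. A minimal sketch, again assuming the HuggingFace `transformers` library:

```python
import torch
import torch.nn as nn
from transformers import BertModel

class BertClassifier(nn.Module):
    def __init__(self, num_labels=2):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.head = nn.Linear(self.bert.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]      # embedding of the [CLS] token
        return self.head(cls)

# Fine-tuning = optimising both the new head and (usually) the Bert weights
# on the downstream labels, typically with a small learning rate such as 2e-5.
```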
5. What loss functions correspond to Bert's two pre-training tasks?
Bert's loss function consists of two parts: one comes from the masked language model (Masked Language Model, MLM), the other from next sentence prediction (Next Sentence Prediction, NSP). Joint training on these two tasks makes the embeddings Bert learns carry both token-level information and sentence-level semantic information. In its standard form the overall loss is simply the sum of the two cross-entropy terms, L = L_MLM + L_NSP, where L_MLM is the negative log-likelihood over the masked positions and L_NSP is the negative log-likelihood of the binary is-next label.
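A hedged sketch of how the two parts are usually combined in code (the logit shapes are illustrative; `-100` marks positions ignored by the MLM loss, matching the masking sketch above):

```python
import torch
import torch.nn as nn

ce = nn.CrossEntropyLoss(ignore_index=-100)

# mlm_logits: [batch, seq_len, vocab_size], mlm_labels: [batch, seq_len]
# nsp_logits: [batch, 2],                   nsp_labels: [batch]
def bert_pretrain_loss(mlm_logits, mlm_labels, nsp_logits, nsp_labels):
    loss_mlm = ce(mlm_logits.view(-1, mlm_logits.size(-1)), mlm_labels.view(-1))
    loss_nsp = ce(nsp_logits, nsp_labels)
    return loss_mlm + loss_nsp            # L = L_MLM + L_NSP
```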
6. What exactly does ELMO do?
In one sentence: ELMO trains bidirectional LSTM language models and uses their hidden states as contextualized word embeddings, so the same word gets a different embedding depending on the sentence it appears in. For details see:
Bert and its family——ELMO: https://blog.csdn.net/m0_57541899/article/details/122775529?spm=1001.2014.3001.5501
7. What exactly does GPT do?
You have probably guessed that another link is coming, and you would be right. In short, GPT is a stack of Transformer decoder blocks trained as an autoregressive language model: it predicts the next token from the tokens on its left, and the pre-trained model is then fine-tuned (or prompted) for downstream tasks.
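For completeness, a minimal sketch of GPT used as an autoregressive language model, assuming the HuggingFace `transformers` library and the public `gpt2` checkpoint (this is an illustration, not part of the original post):

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

input_ids = tokenizer("The Transformer architecture", return_tensors="pt").input_ids
with torch.no_grad():
    # Generate continuations one token at a time, always conditioning on the left context.
    output_ids = model.generate(input_ids, max_length=30, do_sample=False)
print(tokenizer.decode(output_ids[0]))
```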
III. Q&A: VIT
There are already plenty of excellent blog posts about the vision transformer, so here are just a few common questions.
1. What is the difference between layer normalization and batch normalization?
Layer normalization is simpler than BN: it takes one vector in and puts one vector out, without considering the batch at all; it computes the mean and std over the different dimensions within the same sample (the same token). BN instead computes the mean and std of the same dimension across different samples in the batch.
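A small sketch of the difference on a [batch, seq_len, feature] tensor (shapes are illustrative):

```python
import torch
import torch.nn as nn

x = torch.randn(8, 16, 768)                   # [batch, seq_len, feature_dim]

# LayerNorm: statistics are computed per token, over its 768 features,
# independently of the batch and of the other positions in the sequence.
ln = nn.LayerNorm(768)
y_ln = ln(x)

# BatchNorm: statistics are computed per feature, over the batch (and here the
# sequence) dimension, so every example in the batch influences every other one.
bn = nn.BatchNorm1d(768)
y_bn = bn(x.transpose(1, 2)).transpose(1, 2)  # BatchNorm1d expects [batch, channels, length]
```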
2. Why is there a CLS token?
By analogy with Bert: Bert inserts a [CLS] symbol in front of the text and uses the output vector at that position as the semantic representation of the whole text for classification. The intuition is that, compared with the words/characters already present in the text, this symbol carries no obvious semantic information of its own, so it can fuse the semantic information of the words in the text more "fairly".
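A hedged sketch of how the learnable class token is prepended and later read out for classification (the dimensions follow ViT-B/16):

```python
import torch
import torch.nn as nn

embed_dim, num_classes = 768, 1000
cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))   # learnable, shared by all images
head = nn.Linear(embed_dim, num_classes)

patch_tokens = torch.randn(4, 196, embed_dim)             # [batch, 196 patches, 768]
x = torch.cat([cls_token.expand(4, -1, -1), patch_tokens], dim=1)   # [4, 197, 768]

# ... x then goes through the Transformer encoder ...
# Classification only looks at the output at position 0, i.e. the class token.
logits = head(x[:, 0])
```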
3. The essence of Multi-head Attention
Sometimes a single q is not enough; there should be several different q, each responsible for a different kind of relevance. The essence is: with the total number of parameters unchanged, the same q, k, v are mapped into different subspaces of the original high-dimensional space, attention is computed in each subspace, and the relationships captured in the different subspaces are finally concatenated and integrated. In other words, multi-head attention looks for relationships between sequence elements from several different angles (see the multi-head sketch under question 1 of the Transformer section).
4. VIT Principle analysis
Theoretical analysis: see the excellent blog posts mentioned above.
Code implementation:
The standard Transformer module requires a sequence of tokens (vectors) as input, i.e. a two-dimensional matrix [num_token, token_dim] = [196, 768].
In the code implementation this is done directly with a convolution layer. Taking ViT-B/16 as an example, the convolution kernel size is 16x16 with stride 16 and 768 kernels: [224, 224, 3] -> [14, 14, 768] -> [196, 768].
Before the input enters the Transformer encoder, the [class] token and the position embedding must be added. Concatenating the [class] token: Cat([1, 768], [196, 768]) -> [197, 768]; adding the position embedding: [197, 768] -> [197, 768].
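Putting the three steps above together, here is a minimal sketch of the ViT-B/16 input pipeline (patch embedding via a 16x16 stride-16 convolution, class token, position embedding); the module name is chosen for the example:

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2              # 14 * 14 = 196
        # The 16x16/stride-16 convolution does "cut into patches + linear projection" in one go.
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))

    def forward(self, x):                                        # x: [B, 3, 224, 224]
        x = self.proj(x)                                         # [B, 768, 14, 14]
        x = x.flatten(2).transpose(1, 2)                         # [B, 196, 768]
        cls = self.cls_token.expand(x.shape[0], -1, -1)          # [B, 1, 768]
        x = torch.cat([cls, x], dim=1)                           # [B, 197, 768]
        return x + self.pos_embed                                # position embedding is added, not concatenated

tokens = PatchEmbedding()(torch.randn(2, 3, 224, 224))           # [2, 197, 768]
```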