Notes on Teacher Li Hongyi's Machine Learning Course - 4.1 Self-attention
2022-06-10 06:10:00 【Ning Meng, Julie】
Note: This article is my notes on Teacher Li Hongyi's Machine Learning course, 2021/2022 edition (course website). The pictures in the text are from the course PPT. Comments and suggestions are welcome, thank you!
Lecture 4 - Sequence as Input
The previous lesson introduced the application of Deep Learning to image processing; this lesson introduces its application to natural language processing (NLP).
Let's review: in the models (Deep Neural Networks) analyzed so far, each input sample is a Vector or a Matrix (e.g. an image). A Matrix can actually be seen as a Vector reshaped from one dimension into two or more dimensions. Either way, when the input is a Vector or a Matrix, the dimension of every sample is fixed.

When a sample is a group of vectors, the number of vectors may differ from sample to sample. The dimension of a sample is then no longer fixed but variable. This kind of input is also called a Sequence or a vector set. Just as the sentences we speak vary in length, the length of a sequence is also variable. This raises a new question: how do we handle input of variable length? Don't worry, this lesson will explain.
Input as a Sequence
First, let's look at which applications take a sequence (Sequence) as input.
1. Text. A sentence contains many words, and each word can be represented by a vector, so a sentence is a vector set.
How do we represent a word as a vector?
Method 1: One-hot Encoding. Take a large vocabulary that contains all the words; each word gets its own unique representation.
Problem: the vocabulary is very large, so the resulting vector is a sparse, high-dimensional vector that takes up a lot of space. Moreover, this representation carries no semantic information. As shown in the figure below, cat and dog are both animals, yet one-hot encoding does not reflect their similarity.
Method 2: Word Embedding. Semantic information is added: as shown in the figure below, the animals dog, cat, rabbit are close to each other, and the plants tree, flower are close to each other. The resulting vectors also have fewer dimensions. (A small code sketch contrasting the two representations appears after the input examples below.)

2. Speech recognition: the speech signal is divided into frames, and the phonemes are recognized. Each frame is a vector, so a piece of speech is a vector set.

3. Graph: a social network is a Graph. Each node (user) can be represented by a vector describing its attributes, so the Graph is a vector set.

4. Molecular structure: in drug discovery and materials research, this can be used to analyze molecular properties. Each atom (node) is treated as a vector represented with One-hot Encoding, and the whole molecular structure is a vector set.
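Going back to the text example above, here is a minimal sketch contrasting the two word representations, assuming PyTorch and a made-up five-word vocabulary; it is only an illustration, not the exact representation used in the course.

```python
# One-hot encoding vs. word embedding (toy vocabulary, illustrative only).
import torch
import torch.nn as nn

vocab = ["cat", "dog", "rabbit", "tree", "flower"]
word_to_idx = {w: i for i, w in enumerate(vocab)}

# One-hot: dimension = vocabulary size; sparse, and no semantic similarity is encoded.
one_hot = torch.eye(len(vocab))
print(one_hot[word_to_idx["cat"]])          # tensor([1., 0., 0., 0., 0.])

# Word embedding: low-dimensional dense vectors learned from data, so that
# semantically related words (e.g. cat and dog) can end up close to each other.
embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=3)
cat_vec = embedding(torch.tensor(word_to_idx["cat"]))
print(cat_vec.shape)                        # torch.Size([3])
```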
There are three forms of output
1. The length of the output sequence is the same as that of the input sequence.
Each vector has a label: every vector has a corresponding label as output, so the output sequence has the same length as the input sequence. This is the focus of this lesson.
Examples: (1) Part-of-speech tagging (POS tagging). (2) Speech recognition (speech -> phoneme, i.e. vowels and consonants). (3) Predicting user behavior in a social network.

2. The length of the output sequence is 1.
The whole sequence has one label: the entire sequence outputs a single label.
Examples: (1) Sentiment analysis: judging the sentiment of a comment (positive, neutral, negative). (2) Speaker recognition. (3) Predicting molecular properties (such as hydrophilicity).

3. The model determines the length of the output sequence.
The model decides the number of labels itself, i.e. the output length is determined by the model. This case is also called a seq2seq task and will be introduced in Lecture 5.
Examples: (1) Translation: the translation may differ in length from the original. (2) Speech recognition: a piece of speech goes in and a passage of text comes out. Note that unlike the speech recognition example under the first output form, the output here is not phonemes (the smallest units of speech, i.e. vowels and consonants) but words, so the output length is not necessarily the same as the input length.

Self-attention principle
Next, let's look at the case where the input and output sequences have the same length (also called Sequence Labeling): how should we design the model?
First, an intuitive idea is to treat the vectors one by one, i.e. reuse the approach for vector input: a vector set is nothing more than one vector turned into many, so we simply feed each of these vectors into a Fully Connected Network.
However, you will soon find problems. For example, tag the parts of speech of this sentence: "I saw a saw." If each input is handled as an isolated vector, the machine will think the two occurrences of "saw" are the same, when in fact they are different: the first is a verb (to see), the second is a noun (the tool). The same is true in speech: one phoneme can correspond to different written forms. In other words, the vectors in a vector set are not independent; they are related to each other, so we have to take the context into account. I also mentioned this in my article on English learning methods: the meaning of a word should be memorized within its sentence (its context). I gave a few small examples of polysemy there; interested readers are welcome to read it: How to memorize words quickly and well?
Solution 1: Set a window.
Like the method used for speech recognition in Homework 2, set a window that covers the adjacent vectors. Doesn't this take the context into account?
Problem: this is unrealistic for a Sequence. Just think: how big should the window be? For a piece of text, we would first have to find the longest sentence and then open a window that large. The result is a large amount of computation, many parameters, and easy overfitting.
There is another way: blend the context information into every vector. The length of the vector set stays the same, but the vectors are no longer the original ones. This is the Self-attention approach.
Solution 2: Self-attention.
As shown in the figure below, the sequence (vector set) is fed into a Self-attention module, and every vector in the output sequence carries context information. These vectors are then fed into a Fully Connected Network, just as in the plain vector-input case. Compared with Solution 1, self-attention does not add extra parameters, yet the context of the whole sentence is taken into account.

Like the Convolutional Layers in a CNN, Self-attention + Fully Connected Layer blocks can also be stacked several times, as shown in the figure below (a rough code sketch follows the figure). Therefore, the input to a self-attention layer can be either the raw input or a hidden layer.

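As a rough illustration of this stacking, here is a minimal sketch assuming PyTorch, using the built-in nn.MultiheadAttention (with a single head) as the self-attention module followed by a fully connected layer; the layer sizes are arbitrary and this is not the course's own implementation.

```python
# Stacking Self-attention + Fully Connected layers (illustrative sizes).
import torch
import torch.nn as nn

class AttnFCBlock(nn.Module):
    def __init__(self, d_model=64):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=1,
                                          batch_first=True)
        self.fc = nn.Linear(d_model, d_model)

    def forward(self, x):                 # x: (batch, seq_len, d_model)
        ctx, _ = self.attn(x, x, x)       # self-attention: query = key = value = x
        return torch.relu(self.fc(ctx))   # fully connected layer applied to each vector

model = nn.Sequential(AttnFCBlock(), AttnFCBlock())   # two stacked blocks
out = model(torch.randn(2, 10, 64))                   # 2 sequences of length 10
print(out.shape)                                      # torch.Size([2, 10, 64])
```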
What does Self-attention do?
It finds the relationships between vectors. As shown in the figure below, for one vector we compute its relevance to every other vector in the sequence. We denote this relevance by $\alpha$.

How to compute $\alpha$: either Dot-product or Additive works, as shown in the figure below. Dot-product is more common and is what this example uses.

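The two scoring options can be sketched as follows; this is an illustration with made-up dimensions and random weights, and the additive form shown is the usual tanh-based variant, written from memory rather than copied from the slide.

```python
# Dot-product vs. additive relevance score between two vectors (illustrative only).
import torch

d_in, d_k = 8, 4
a1, a2 = torch.randn(d_in), torch.randn(d_in)              # two input vectors
W_q, W_k = torch.randn(d_k, d_in), torch.randn(d_k, d_in)  # stand-ins for learned weights
q, k = W_q @ a1, W_k @ a2                                  # query from a1, key from a2

# Dot-product: alpha = q . k   (the variant used in this lecture)
alpha_dot = torch.dot(q, k)

# Additive: add q and k, pass through tanh, then project with a learned vector w
w = torch.randn(d_k)
alpha_add = torch.dot(w, torch.tanh(q + k))

print(alpha_dot.item(), alpha_add.item())
```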
From the input $a^i$ we compute $q^i$ (query) and $k^i$ (key); $q^i$ is dot-producted with every $k^i$ in the sequence to obtain the $\alpha$, as shown in the figure below. $\alpha$ is also called the attention score. Note: $q^1$ is also multiplied with its own $k^1$. After a softmax, the $\alpha$ become the normalized results $\alpha'$.
Question: why is $q^1$ multiplied with its own $k^1$?

The normalized attention scores $\alpha'$ are then multiplied with the $v$ (value) to get $b$, as shown in the figure below. $b$ is the output of the input vector after the context information has been mixed in. Although the input and output sequences have the same length, each output vector now carries information from the other vectors in the sequence.
Question: why multiply with $v$? Couldn't we multiply with the input $a$ directly?

From the computation steps above we can see that computing one output vector $b^i$ does not depend on the previous results $b^1, \dots, b^{i-1}$; therefore $b^1, \dots, b^n$ can be computed in parallel.
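Here is a minimal sketch of the per-vector computation just described, producing $b^1$ from $a^1$ and the whole sequence; the dimensions are arbitrary and the random matrices stand in for the learned $W^q$, $W^k$, $W^v$.

```python
# Computing one output vector b^1 (query from a^1, keys/values from the whole sequence).
import torch

seq_len, d_in, d_k, d_v = 4, 8, 4, 4
a_seq = torch.randn(seq_len, d_in)                    # rows are a^1 ... a^4
W_q, W_k, W_v = (torch.randn(d, d_in) for d in (d_k, d_k, d_v))

q1 = W_q @ a_seq[0]                                   # query computed from a^1
K = a_seq @ W_k.T                                     # keys   k^1 ... k^4 (one per row)
V = a_seq @ W_v.T                                     # values v^1 ... v^4 (one per row)

alpha = K @ q1                                        # q^1 dotted with every k^i (its own k^1 included)
alpha_prime = torch.softmax(alpha, dim=0)             # normalized attention scores
b1 = alpha_prime @ V                                  # weighted sum of the values -> b^1
print(b1.shape)                                       # torch.Size([4])
```

Computing $b^2, b^3, b^4$ repeats exactly the same steps with $q^2, q^3, q^4$, which is why all of them can be computed in parallel.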
We have just seen how one vector of the input vector set is handled. Now let's go from the individual to the whole and see how the entire vector set (sequence) is processed. In the example shown in the figure below, the input sequence has length 4, i.e. 4 vectors $a^1, a^2, a^3, a^4$. From each $a^i$ we obtain the corresponding $q^i, k^i, v^i$.

Because $q \cdot k$ is a dot-product, we can stack the $k^i$ into a matrix, transpose it, and matrix-multiply it with the matrix of the $q^i$ to obtain all the $\alpha$ at once, forming the attention matrix $A$, as shown in the figure below:

Multiplying the $\alpha'$ with the $v$ gives the outputs $b$; this step can also be written as a matrix operation, as shown in the figure below:

Combining the above, we obtain the matrix-operation form of the whole Self-attention process, as shown in the figure below:

Two points to note:
(1) $W^q$, $W^k$, $W^v$ are parameters to be learned!
(2) The computation of the Attention Matrix shown in the figure is the most computationally intensive part: if the sequence length is L and the vector dimension is d, it requires L x d x L multiplications (see the code sketch below).
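In code, the matrix form summarized above can be sketched as follows, following the column-as-vector convention of the slides (each $a^i$ is a column); the dimensions are arbitrary and the W matrices are random stand-ins for the learned parameters.

```python
# Self-attention as matrix operations (columns are the vectors; illustrative sizes).
import torch

seq_len, d_in, d_k, d_v = 4, 8, 4, 4
A_in = torch.randn(d_in, seq_len)            # columns are a^1 ... a^4
W_q = torch.randn(d_k, d_in)
W_k = torch.randn(d_k, d_in)
W_v = torch.randn(d_v, d_in)

Q = W_q @ A_in                               # (d_k, L) queries
K = W_k @ A_in                               # (d_k, L) keys
V = W_v @ A_in                               # (d_v, L) values

Attn = K.T @ Q                               # (L, L) attention matrix: L x d_k x L multiplications
Attn_prime = torch.softmax(Attn, dim=0)      # normalize each column
O = V @ Attn_prime                           # (d_v, L): columns are the outputs b^1 ... b^4
print(O.shape)                               # torch.Size([4, 4])
```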
Improvements to Self-attention
1. Multi-head Self-attention
Sometimes we need to consider multiple kinds of relevance, which calls for more than one self-attention; hence Multi-head Self-attention. As shown in the figure below (the figure shows the 2-head case), the $(q, k, v)$ go from one group to multiple groups. Note: the operations are carried out within each group of $(q, k, v)$; there is no crossing between groups.

Compared with single-head self-attention, Multi-head Self-attention has one more step: the multiple outputs are combined into a single output.

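Here is a minimal from-scratch sketch of the 2-head case; the sizes are hypothetical, and the final combining matrix W_o corresponds to the extra step mentioned above.

```python
# 2-head self-attention: each head has its own (q, k, v) group and never crosses groups.
import torch

seq_len, d_model, n_heads = 4, 8, 2
d_head = d_model // n_heads
X = torch.randn(seq_len, d_model)                     # rows are the input vectors a^i

W_q = torch.randn(n_heads, d_head, d_model)           # one projection per head
W_k = torch.randn(n_heads, d_head, d_model)
W_v = torch.randn(n_heads, d_head, d_model)
W_o = torch.randn(d_model, n_heads * d_head)          # combines the heads into one output

heads = []
for h in range(n_heads):                              # each head attends independently
    Q, K, V = X @ W_q[h].T, X @ W_k[h].T, X @ W_v[h].T
    scores = torch.softmax(Q @ K.T, dim=-1)           # (seq_len, seq_len) per head
    heads.append(scores @ V)                          # (seq_len, d_head)

B = torch.cat(heads, dim=-1) @ W_o.T                  # concatenate, then project
print(B.shape)                                        # torch.Size([4, 8])
```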
2. Positional Encoding
Reviewing the self-attention computation, we find that it does not take positional information into account; only pairwise relevance is computed. For example, whether a word appears at the beginning, in the middle, or at the end of a sentence, the self-attention result is the same. However, the positional information in a Sequence is sometimes very important.
Solution: Positional Encoding. Add the positional information $e^i$ to the input $a^i$. The $e^i$ can be hand-designed (each black vertical box in the figure below is one $e^i$), or it can be learned from the data.

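As one concrete example of a hand-designed $e^i$, here is the sinusoidal positional encoding from the original Transformer paper; the lecture only says $e^i$ can be hand-designed or learned, so this is just one possible choice, not necessarily the one in the figure.

```python
# Sinusoidal positional encoding (one possible hand-designed e^i), added to the inputs.
import torch

def sinusoidal_positional_encoding(seq_len, d_model):
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)   # (L, 1) positions
    i = torch.arange(0, d_model, 2, dtype=torch.float32)            # even dimensions
    angle = pos / torch.pow(10000.0, i / d_model)                   # (L, d_model/2)
    e = torch.zeros(seq_len, d_model)
    e[:, 0::2] = torch.sin(angle)
    e[:, 1::2] = torch.cos(angle)
    return e                                                        # row i is e^i

a = torch.randn(10, 16)                                   # inputs a^i (arbitrary sizes)
a_with_pos = a + sinusoidal_positional_encoding(10, 16)   # e^i is simply added to a^i
```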
Applications of Self-attention
1. Application to NLP
Self-attention is widely used in NLP; for example, the famous Transformer and BERT both have Self-attention in their model architectures.
2. Application to speech recognition; improvement: Truncated Self-attention
To use Self-attention in speech processing, the frames can be treated as the vector set.
Problem: suppose a frame is taken every 10 ms; then 1 s of speech already gives 100 vectors. A sentence is very long, so computing the Attention Matrix requires a lot of computation.
Improvement: use Truncated Self-attention. As shown in the figure, the attention scores are not computed over the whole sequence; the computation is restricted to a certain neighborhood.
Reflection: Truncated Self-attention feels a bit like the receptive field of a CNN.

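One simple way to realize Truncated Self-attention is to mask the attention scores outside a fixed neighbourhood before the softmax; the sketch below does exactly that (window size and dimensions are illustrative). For clarity it still builds the full L x L score matrix; a real implementation would compute only the scores inside the window to actually save computation.

```python
# Truncated self-attention via masking: each frame attends only to nearby frames.
import torch

seq_len, d_model, window = 100, 16, 8          # e.g. 1 s of speech at 10 ms per frame
X = torch.randn(seq_len, d_model)
W_q, W_k, W_v = (torch.randn(d_model, d_model) for _ in range(3))

Q, K, V = X @ W_q, X @ W_k, X @ W_v
scores = Q @ K.T                               # full (L, L) score matrix (for clarity only)

idx = torch.arange(seq_len)
outside = (idx[:, None] - idx[None, :]).abs() > window   # True where |i - j| > window
scores = scores.masked_fill(outside, float("-inf"))      # ignore scores outside the window

B = torch.softmax(scores, dim=-1) @ V          # each b^i mixes only nearby vectors
print(B.shape)                                 # torch.Size([100, 16])
```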
3. Application to image processing; comparison with the CNN model
To use self-attention in image processing, as shown in the figure, each pixel of a W x H x D image is treated as a vector (its D channel values), and the image is then a vector set.

In fact, Self-attention can be seen as a more flexible CNN. Why?
With each pixel viewed as a vector, a CNN only looks at the relevance within the receptive field; that is, the vector at the center only looks at its neighboring vectors, as shown in the figure below. From the Self-attention point of view, this is Self-attention whose receptive field is not the whole sequence. Therefore, the CNN model is a simplified version of Self-attention.
On the other hand, the size of a CNN's receptive field is set manually, e.g. a kernel size of 3x3, whereas the process of computing the attention scores in Self-attention can be seen as learning the size and range of the receptive field. Compared with a CNN, self-attention's choice of receptive field is not limited to adjacent pixels; it can be selected from the whole image. Therefore, Self-attention is a complex version of the CNN model.

From the above analysis we can see that self-attention is more complex (more flexible) and has a larger |H|, so it requires a larger training set size N. As shown in the figure below, on the image recognition (Image Recognition) task, when the amount of data is relatively small (less data), the CNN model performs better; when the amount of data is relatively large (more data), the Self-attention model performs better.
Note: the "less data" setting below still means 10M (ten million) samples, which is not the small amount of data we might imagine, haha!

4. Self-attention vs. RNN
Before self-attention, the network architecture commonly used for sequence input was the RNN (Recurrent Neural Network). Let's look at the differences between the two.
First impression: an RNN can only look at the preceding vectors, while self-attention looks at the whole sentence.
That is not quite right: there are also bidirectional RNNs, which run both from the beginning of the sentence to the end and from the end back to the beginning, so they can also look at the whole sentence.
The real differences are:
(1) As shown in the figure below, it is hard for the last vector in an RNN to relate to the first vector: the output of the first vector has to be kept in memory all along. For self-attention this is easy: any two vectors in the whole Sequence can relate to each other directly; "the ends of the earth are as close as neighbors", so distance is not a problem.
(2) In an RNN, the previous output is used as the next input, so the computation must proceed sequentially and cannot be parallelized. Self-attention can be computed in parallel.

5. Self-attention for Graph
In a Graph, the edges can be used to simplify the attention computation: attention is only computed between nodes connected by an edge, and pairs without an edge are set to 0. This is one kind of Graph Neural Network (GNN).

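A minimal sketch of this idea: mask the attention matrix with the graph's adjacency matrix so that attention is only kept between nodes connected by an edge. The graph and dimensions below are made up for illustration.

```python
# Self-attention on a graph: no edge -> attention weight forced to 0.
import torch

n_nodes, d_model = 4, 8
X = torch.randn(n_nodes, d_model)                           # one vector per node
adj = torch.tensor([[1, 1, 0, 0],                           # hypothetical adjacency
                    [1, 1, 1, 0],                           # (self-loops included)
                    [0, 1, 1, 1],
                    [0, 0, 1, 1]], dtype=torch.bool)

W_q, W_k, W_v = (torch.randn(d_model, d_model) for _ in range(3))
Q, K, V = X @ W_q, X @ W_k, X @ W_v

scores = Q @ K.T
scores = scores.masked_fill(~adj, float("-inf"))            # no edge -> 0 after softmax
B = torch.softmax(scores, dim=-1) @ V
print(B.shape)                                              # torch.Size([4, 8])
```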
If you find this article helpful, please give it a like, thank you!
Follow me, Ning Meng Julie, so we can learn from each other and communicate more!
For more notes, please see the table of contents of my notes on Teacher Li Hongyi's Machine Learning course.
References
Teacher Li Hongyi, Machine Learning 2022:
Course website :https://speech.ee.ntu.edu.tw/~hylee/ml/2022-spring.php
video :https://www.bilibili.com/video/BV1Wv411h7kN