
Notes on Li Hongyi's Machine Learning Course - 4.1 Self-attention

2022-06-10 06:10:00 Ning Meng, Julie

Note: this article contains my notes on Teacher Li Hongyi's Machine Learning course (2021/2022 editions; see the course website). The figures in this article are from the course PPT. Comments and suggestions are welcome, thank you!

Lecture 4: Sequence as Input

The previous lesson introduced the application of deep learning to image processing; this lesson introduces its application to natural language processing (NLP).

Let's review: in the models (deep neural networks) analyzed so far, each input sample is a vector or a matrix (e.g., an image). A matrix can in fact be seen as a vector whose elements have been rearranged into more dimensions. Either way, when the input is a vector or a matrix, the dimension of every sample is fixed.

[Figure: input as a vector or matrix with fixed dimensions]

When a sample is a set of vectors, different samples may contain different numbers of vectors. The dimension of a sample is then no longer fixed but variable. This kind of input is also called a sequence or a vector set. Just as the sentences we speak vary in length, the length of a sequence is also variable. This raises a new question: how do we handle input of variable length? Don't worry, this lesson will tell you.

When the input is a sequence

First, let's look at which applications take a sequence as input.

1. Text. A sentence contains many words. Each word can be represented by a vector, so a sentence is a vector set.

How do we represent a word as a vector?

Method 1: One-hot Encoding. Take a large vocabulary that contains all the words; each word gets its own unique representation.

Problems: the vocabulary is very large, so the resulting vector is a sparse, high-dimensional vector that takes up a lot of space. Moreover, this representation carries no semantic information. As shown in the figure below, cat and dog are both animals, but one-hot encoding cannot show their similarity.

Method 2: Word Embedding. Semantic information is added: as shown in the figure below, the animals dog, cat, and rabbit end up close to each other, and the plants tree and flower end up close to each other. Moreover, the resulting vectors have far fewer dimensions.

[Figure: one-hot encoding vs. word embedding]
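To make the contrast concrete, here is a minimal sketch (my own illustration, not course code); the tiny vocabulary and the embedding dimension are made up, and a real embedding matrix would be learned rather than random:

```python
import numpy as np

# A tiny made-up vocabulary for illustration.
vocab = ["cat", "dog", "rabbit", "tree", "flower"]
word_to_idx = {w: i for i, w in enumerate(vocab)}

# One-hot: each word is a sparse vector as long as the vocabulary.
def one_hot(word):
    vec = np.zeros(len(vocab))
    vec[word_to_idx[word]] = 1.0
    return vec

print(one_hot("cat"))   # [1. 0. 0. 0. 0.], carries no notion of similarity

# Word embedding: each word maps to a dense, low-dimensional vector.
# In practice the embedding matrix E is learned; here it is random.
embedding_dim = 3
E = np.random.randn(len(vocab), embedding_dim)

def embed(word):
    return E[word_to_idx[word]]

print(embed("cat"))     # dense 3-dimensional vector
```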

2. Speech recognition: the audio signal is divided into frames, and the phonemes are recognized. Each frame is a vector, and a piece of audio is a vector set.

[Figure: audio frames as a vector set]

3. Graphs: a social network is a graph; each node (user) can be represented by a vector of attributes, so the graph is a vector set.

[Figure: a social network as a graph of node vectors]

4. Molecular structure: in drug and material development, models can be used to analyze molecular properties. Each atom (node) is treated as a vector, represented with one-hot encoding, and the whole molecular structure is a vector set.

There are three forms of output

1. The length of the output sequence is the same as that of the input sequence .

Each vector has a label: every input vector has a corresponding label as output, so the output sequence has the same length as the input sequence. This is the focus of this lesson.

Example applications: (1) Part-of-speech tagging (POS tagging). (2) Speech recognition (audio to phonemes, i.e., vowels and consonants). (3) Predicting user behavior in a social network.

[Figure: each input vector gets its own output label]

2. The output sequence length is 1.

The whole sequence has one label: the entire sequence produces a single label as output.

Example applications: (1) Sentiment analysis: judge the sentiment of a review (positive, neutral, negative). (2) Speaker recognition. (3) Judging molecular properties (such as hydrophilicity).

[Figure: the whole sequence outputs one label]

3. The model determines the length of the output sequence .

The model decides the number of labels itself: the model determines how many labels to output, i.e., the output length. This case is also called a seq2seq task and will be introduced in Lecture 5.

Example applications: (1) Translation: the translated text may differ in length from the original. (2) Speech recognition: the input is audio and the output is a piece of text. Note that, unlike the speech recognition example in the first output form, the output here is not the smallest speech unit (phonemes, i.e., vowels and consonants) but words, so the output length is not necessarily the same as the input length.

[Figure: the model decides the output length (seq2seq)]

Self-attention principle

Next, let's look at the case where the input and output sequences have the same length (also called Sequence Labeling). How should we design the model?

An intuitive first idea is to break the problem apart and reuse the treatment for single-vector input: a vector set is just one vector turned into many, so feed each of these vectors into a Fully Connected Network.

However, you will soon find problems. For example, tag the parts of speech of this sentence: "I saw a saw." If each word is processed as an isolated vector, the machine will treat the two occurrences of "saw" as the same, but they are different: the first is a verb (to see) and the second is a noun (a saw). The same holds for speech: one phonetic symbol can correspond to different written forms. In other words, the vectors in a vector set are not independent; they are related to each other, and we must take the context into account. I also mentioned this in my article on English learning methods: the meaning of a word should be memorized in its sentence (its context), and I gave a few small examples of polysemy there. Interested readers are welcome to read: How to memorize words quickly and well?

Solution 1: set up a window.

Like in Homework 2 (speech recognition), set a window that includes the adjacent vectors. Doesn't that take the context into account?

Problems: this is unrealistic for a sequence. Think about it: how large should the window be? For a piece of text, you would first have to find the longest sentence and then open a window that large. As a result, the computation is heavy, there are many parameters, and the model easily overfits.

There is another way: blend context information into every vector. The vector set keeps its length, but the vectors are no longer the original ones. This is the Self-attention method.

Solution 2: Self-attention.

As shown in the figure below, the sequence (vector set) is fed into a Self-attention module, and every vector in the output sequence carries context information. These vectors are then fed into a Fully Connected Network, just as in the single-vector case. Compared with Solution 1, self-attention does not blow up the number of parameters, yet the context of the whole sentence is taken into account.

[Figure: Self-attention module followed by a Fully Connected Network]

Like the convolutional layers in a CNN, Self-attention + Fully Connected layers can also be stacked several times, as shown in the figure below. Therefore, the input to a self-attention layer can be the raw input or a hidden layer.

[Figure: stacking Self-attention and Fully Connected layers]
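A minimal PyTorch sketch of this stacking idea (my own illustration, not the course code), assuming `torch.nn.MultiheadAttention` as the self-attention module and an arbitrary feature dimension of 64:

```python
import torch
import torch.nn as nn

class SelfAttentionBlock(nn.Module):
    """One self-attention layer followed by a fully connected layer."""
    def __init__(self, dim, num_heads=1):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.fc = nn.Linear(dim, dim)

    def forward(self, x):                      # x: (batch, seq_len, dim)
        out, _ = self.attn(x, x, x)            # query = key = value = x
        return torch.relu(self.fc(out))

# Stack several blocks, as in the figure: attention and FC layers alternate.
model = nn.Sequential(SelfAttentionBlock(64), SelfAttentionBlock(64))
x = torch.randn(2, 10, 64)                     # a batch of 2 sequences of length 10
y = model(x)                                   # same shape as x: (2, 10, 64)
```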

What does Self-attention do?

It finds the relationships between vectors. As shown in the figure below, for one vector, it computes the relevance between that vector and the other vectors in the sequence. We use $\alpha$ to denote this relevance.

[Figure: relevance $\alpha$ between one vector and the others]

How to compute $\alpha$: either Dot-product or Additive works, as shown in the figure below. Dot-product is more common, and it is the one used here.

[Figure: Dot-product and Additive ways of computing $\alpha$]

From the input $a^i$ we compute $q^i$ (query) and $k^i$ (key). $q^1$ is dot-multiplied with every $k^i$ in the sequence to obtain $\alpha$, as shown in the figure below; $\alpha$ is also called the attention score. Note: $q^1$ is also multiplied with its own $k^1$. The scores $\alpha$ are then passed through a softmax to obtain the normalized results $\alpha'$.

Question: why is $q^1$ also multiplied with its own $k^1$?

[Figure: computing attention scores $\alpha$ and normalized $\alpha'$]

The normalized attention scores $\alpha'$ are multiplied by $v$ (value) and summed, which gives $b$, as shown in the figure below. $b$ is the output of the input vector after context information has been incorporated. Although the input and output sequences have the same length, every output vector now carries information from the other vectors in the sequence.

Question: why use $v$? Can't we multiply by the input $a$ directly?

[Figure: weighting the values $v$ by $\alpha'$ to obtain $b$]
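Here is a minimal numpy sketch of the single-query computation described above (my own illustration; the dimensions and the parameter matrices $W^q, W^k, W^v$ are random stand-ins for what would actually be learned):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

d = 4                                   # vector dimension (illustrative)
a = np.random.randn(3, d)               # input sequence: a^1, a^2, a^3
Wq, Wk, Wv = (np.random.randn(d, d) for _ in range(3))   # learned in practice

q1 = a[0] @ Wq                          # query from a^1
k = a @ Wk                              # keys k^1, k^2, k^3 (including its own k^1)
v = a @ Wv                              # values v^1, v^2, v^3

alpha = k @ q1                          # attention scores: q^1 . k^i for every i
alpha_prime = softmax(alpha)            # normalized scores alpha'
b1 = alpha_prime @ v                    # output b^1 = sum_i alpha'_i * v^i
```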

From the steps above, computing one output vector $b^i$ does not depend on the previous results $b^1, \ldots, b^{i-1}$; therefore $b^1, \ldots, b^n$ can be computed in parallel.

So far we have seen how one vector in the input vector set is processed. Now let's move from the individual to the whole and see how the entire input vector set (sequence) is handled. In the example shown in the figure below, the input sequence has length 4: there are four vectors $a^1, a^2, a^3, a^4$. From each $a^i$ we obtain the corresponding $q^i, k^i, v^i$.

[Figure: obtaining $q^i, k^i, v^i$ from each input vector $a^i$]

Because $q \cdot k$ is a dot-product, we can stack the transposed $k^i$ and matrix-multiply them with the $q^i$ to obtain all the $\alpha$ at once, forming the attention matrix $A$, as shown in the figure below:

[Figure: forming the attention matrix $A$]

Multiplying $\alpha'$ by $v$ gives the output $b$; this operation can also be written as a matrix operation, as shown in the figure below:

[Figure: computing the outputs $b$ from $\alpha'$ and $v$ in matrix form]

Putting the above together, we obtain the matrix-form expression of the whole Self-attention process, as shown in the figure below:

[Figure: matrix-form expression of the whole Self-attention process]
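A minimal numpy sketch of this matrix form (my own illustration; it follows the common row-vector convention, so the inputs are rows of the matrix rather than columns as on the slides):

```python
import numpy as np

def softmax_rows(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

L, d = 4, 8                             # sequence length and vector dimension
A = np.random.randn(L, d)               # input vectors a^1..a^4 as rows
Wq, Wk, Wv = (np.random.randn(d, d) for _ in range(3))  # parameters to be learned

Q, K, V = A @ Wq, A @ Wk, A @ Wv        # (L, d) each
scores = Q @ K.T                        # attention matrix, (L, L): costs L x d x L multiplications
A_prime = softmax_rows(scores)          # normalize each query's scores
B = A_prime @ V                         # output sequence, (L, d)
```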

Things to note:

(1) $W^q$, $W^k$, $W^v$ are the parameters to be learned!

(2) The computation of the attention matrix shown in the figure is the most computationally expensive part. If the sequence length is $L$ and the vector dimension is $d$, it requires $L \times d \times L$ multiplications.

Improvements to Self-attention

1. Multi-head Self-attention

Sometimes we need to consider multiple kinds of relevance, which requires more than one self-attention; hence Multi-head Self-attention. As shown in the figure below (the 2-head case), the $(q, k, v)$ go from one group to multiple groups. Note: the operations are carried out within each group of $(q, k, v)$ separately; there is no crossing between groups.

[Figure: 2-head Self-attention]

Compared with single-head self-attention, Multi-head Self-attention has one extra step: the multiple head outputs are combined into a single output.

[Figure: combining the multiple head outputs into one output]
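A minimal sketch of the two-head case (my own illustration; each head has its own projections, and the combining matrix `Wo` is an assumption borrowed from the standard Transformer formulation rather than taken from the slides):

```python
import numpy as np

def softmax_rows(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def single_head(A, Wq, Wk, Wv):
    """One head: its own q, k, v and its own attention computation."""
    Q, K, V = A @ Wq, A @ Wk, A @ Wv
    return softmax_rows(Q @ K.T) @ V

L, d, d_head, heads = 4, 8, 4, 2
A = np.random.randn(L, d)
params = [tuple(np.random.randn(d, d_head) for _ in range(3)) for _ in range(heads)]
Wo = np.random.randn(heads * d_head, d)        # combines the heads into one output

head_outputs = [single_head(A, *p) for p in params]   # each (L, d_head), no cross-head mixing
B = np.concatenate(head_outputs, axis=-1) @ Wo        # final output, (L, d)
```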

2. Positional Encoding

Reviewing the self-attention computation, we find that self-attention does not take position into account; it only computes pairwise relevance. For example, no matter whether a word appears at the beginning, the middle, or the end of a sentence, self-attention gives the same result. However, sometimes the positional information in a sequence is important.

Solution: Positional Encoding. Add positional information $e^i$ to the input $a^i$. It can be hand-designed (each black column in the figure below is one $e^i$), or it can be learned from data.

[Figure: positional encoding vectors $e^i$ added to the inputs $a^i$]
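One common hand-designed choice is the sinusoidal encoding from the original Transformer paper; the sketch below (my own illustration, not necessarily the exact $e^i$ shown in the course figure) simply adds it to the inputs:

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d):
    """e^i[2j] = sin(i / 10000^(2j/d)), e^i[2j+1] = cos(i / 10000^(2j/d))."""
    pos = np.arange(seq_len)[:, None]          # positions i
    j = np.arange(0, d, 2)[None, :]            # even dimensions 2j
    angle = pos / np.power(10000.0, j / d)
    e = np.zeros((seq_len, d))
    e[:, 0::2] = np.sin(angle)
    e[:, 1::2] = np.cos(angle)
    return e

A = np.random.randn(4, 8)                              # input vectors a^i
A = A + sinusoidal_positional_encoding(4, 8)           # a^i + e^i before self-attention
```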

Applications of Self-attention

1. Application to NLP

Self-attention is widely used in NLP; famous models such as Transformer and BERT have Self-attention in their architectures.

2. Application to speech recognition; improvement: Truncated Self-attention

When Self-attention is used for speech, the frames can be treated as the vector set.

Problem: suppose a frame is taken every 10 ms; then 1 s of audio contains 100 vectors. A sentence is very long, so computing the attention matrix requires a lot of computation.

Improvement: use Truncated Self-attention. As shown in the figure, instead of computing attention scores over the whole sequence, the computation is limited to a neighborhood of a certain size.

Thought: Truncated Self-attention feels a bit like the receptive field in a CNN.

[Figure: Truncated Self-attention]
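A minimal sketch of the truncation idea (my own illustration): scores outside a fixed neighborhood are masked to $-\infty$ before the softmax, so each frame only attends to nearby frames; the window size is an arbitrary choice here:

```python
import numpy as np

def truncated_attention_scores(scores, window):
    """Keep score[i, j] only when |i - j| <= window; mask out the rest."""
    L = scores.shape[0]
    i, j = np.indices((L, L))
    return np.where(np.abs(i - j) <= window, scores, -np.inf)
    # apply softmax row-wise afterwards as usual

scores = np.random.randn(6, 6)            # Q @ K.T for a 6-frame sequence
print(truncated_attention_scores(scores, window=1))
```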

3. Application to image processing; comparison with the CNN model

When self-attention is used for image processing, as shown in the figure, each pixel of a W x H x D image is treated as a vector (its D channel values), so an image is a vector set.

[Figure: treating each pixel as a vector]

In fact, Self-attention can be seen as a more flexible CNN. Why?

Treat each pixel as a vector. A CNN only looks at the relevance within a receptive field, which can be understood as: centered on this vector, only the adjacent vectors are looked at, as shown in the figure below. From the perspective of Self-attention, this is Self-attention whose receptive field is not the whole sequence. Therefore, a CNN is a simplified version of Self-attention.

On the other hand, the size of a CNN's receptive field is set by hand, e.g., a kernel size of 3x3, whereas the process by which Self-attention computes the attention scores can be seen as learning and determining the size of the receptive field. Compared with a CNN, self-attention's choice of receptive field is not limited to adjacent pixels; it can be selected from the whole image. Therefore, Self-attention is a complex version of the CNN model.

[Figure: CNN receptive field vs. Self-attention]

From the above analysis, self-attention is more complex (more flexible) and its hypothesis space |H| is larger, so it requires a larger amount of training data N. As shown in the figure below, on an image recognition task, when the amount of data is relatively small (less data), the CNN model performs better; when the amount of data is relatively large (more data), the Self-attention model performs better.

Note: "less data" in the figure means 10M (ten million) images, which is not the small amount of data we might imagine, haha!

[Figure: CNN vs. Self-attention performance as the amount of data grows]

4. Self-attention vs. RNN

Before self-attention, the network architecture commonly used for sequence input was the RNN (Recurrent Neural Network). Let's look at the differences between the two.

First impression: an RNN can only look at the preceding vectors, while self-attention looks at the whole sentence.

That is not quite right: there are also bidirectional RNNs, which go from the beginning of the sentence to the end and from the end back to the beginning, so they can look at the whole sentence as well.

The real differences are:

(1) As shown in the figure below, it is hard for an RNN to relate the last vector to the first vector: the output of the first vector has to be kept in memory all the way. For self-attention this is easy: any two vectors in the sequence can relate to each other; "the ends of the earth feel like next door", and distance is not a problem.

(2) An RNN uses the previous output as the next input, so it must compute sequentially and cannot be parallelized, whereas self-attention can be computed in parallel.

[Figure: Self-attention vs. RNN]

5. Self-attention for Graph

In a graph, the edges can be used to simplify the attention computation: attention is only computed between nodes connected by an edge, and pairs of nodes without an edge are set to 0. This is one kind of Graph Neural Network (GNN).

[Figure: attention restricted to the edges of a graph]
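A minimal sketch of this edge-restricted attention (my own illustration; the 3-node adjacency matrix is made up): the adjacency matrix acts as a mask, so only connected nodes keep a nonzero attention weight:

```python
import numpy as np

def graph_attention_scores(scores, adjacency):
    """Keep score[i, j] only where an edge exists; others get zero weight."""
    masked = np.where(adjacency > 0, scores, -np.inf)   # -inf becomes 0 after softmax
    e = np.exp(masked - masked.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

adjacency = np.array([[1, 1, 0],
                      [1, 1, 1],
                      [0, 1, 1]])          # illustrative 3-node graph (self-loops kept)
scores = np.random.randn(3, 3)             # Q @ K.T over the node vectors
print(graph_attention_scores(scores, adjacency))
```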

If you find this article helpful, please like and support it, thank you!

Follow me, Ning Meng Julie; let's learn from each other and communicate more!

For more notes, please see the directory of my notes on Li Hongyi's Machine Learning course.

References

Li Hongyi, Machine Learning 2022:

Course website: https://speech.ee.ntu.edu.tw/~hylee/ml/2022-spring.php

Video: https://www.bilibili.com/video/BV1Wv411h7kN
