
Notes on Li Hongyi's Machine Learning Course - 4.1 Self-attention

2022-06-10 06:10:00 Ning Meng, Julie

Note: this article contains my notes on Teacher Li Hongyi's Machine Learning course (2021/2022 editions; see the course website). The figures in this article are from the course PPT. Comments and suggestions are welcome, thank you!

Lecture 4: Sequence as Input

The previous lesson introduced the application of deep learning to image processing; this lesson introduces its application to natural language processing (NLP).

Let's review: in the models (deep neural networks) analyzed so far, each input sample is a vector or a matrix (e.g., an image). A matrix can in fact be seen as a vector whose elements have been rearranged into more dimensions. Either way, when the input is a vector or a matrix, the dimension of every sample is fixed.

[Figure: input as a vector or matrix with fixed dimensions]

When a sample is a set of vectors, different samples may contain different numbers of vectors. The dimension of a sample is then no longer fixed but variable. This kind of input is also called a sequence or a vector set. Just as the sentences we speak vary in length, the length of a sequence is also variable. This raises a new question: how do we handle input of variable length? Don't worry, this lesson will tell you.

When the input is a sequence

First, let's look at which applications take a sequence as input.

1. Text. A sentence contains many words. Each word can be represented by a vector, so a sentence is a vector set.

How do we represent a word as a vector?

Method 1: One-hot Encoding. Take a large vocabulary that contains all the words; each word gets its own unique representation.

Problems: the vocabulary is very large, so the resulting vector is a sparse, high-dimensional vector that takes up a lot of space. Moreover, this representation carries no semantic information. As shown in the figure below, cat and dog are both animals, but one-hot encoding cannot show their similarity.

Method 2: Word Embedding. Semantic information is added: as shown in the figure below, the animals dog, cat, and rabbit end up close to each other, and the plants tree and flower end up close to each other. Moreover, the resulting vectors have far fewer dimensions.

[Figure: one-hot encoding vs. word embedding]
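To make the contrast concrete, here is a minimal sketch (my own illustration, not course code); the tiny vocabulary and the embedding dimension are made up, and a real embedding matrix would be learned rather than random:

```python
import numpy as np

# A tiny made-up vocabulary for illustration.
vocab = ["cat", "dog", "rabbit", "tree", "flower"]
word_to_idx = {w: i for i, w in enumerate(vocab)}

# One-hot: each word is a sparse vector as long as the vocabulary.
def one_hot(word):
    vec = np.zeros(len(vocab))
    vec[word_to_idx[word]] = 1.0
    return vec

print(one_hot("cat"))   # [1. 0. 0. 0. 0.], carries no notion of similarity

# Word embedding: each word maps to a dense, low-dimensional vector.
# In practice the embedding matrix E is learned; here it is random.
embedding_dim = 3
E = np.random.randn(len(vocab), embedding_dim)

def embed(word):
    return E[word_to_idx[word]]

print(embed("cat"))     # dense 3-dimensional vector
```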

2. Speech recognition: the audio signal is divided into frames, and the phonemes are recognized. Each frame is a vector, and a piece of audio is a vector set.

[Figure: audio frames as a vector set]

3. Graphs: a social network is a graph; each node (user) can be represented by a vector of attributes, so the graph is a vector set.

[Figure: a social network as a graph of node vectors]

4. Molecular structure: in drug and material development, models can be used to analyze molecular properties. Each atom (node) is treated as a vector, represented with one-hot encoding, and the whole molecular structure is a vector set.

There are three forms of output

1. The length of the output sequence is the same as that of the input sequence .

Each vector has a label: every input vector has a corresponding label as output, so the output sequence has the same length as the input sequence. This is the focus of this lesson.

Example applications: (1) Part-of-speech tagging (POS tagging). (2) Speech recognition (audio to phonemes, i.e., vowels and consonants). (3) Predicting user behavior in a social network.

[Figure: each input vector gets its own output label]

2. The output sequence length is 1.

The whole sequence has one label: the entire sequence produces a single label as output.

Example applications: (1) Sentiment analysis: judge the sentiment of a review (positive, neutral, negative). (2) Speaker recognition. (3) Judging molecular properties (such as hydrophilicity).

[Figure: the whole sequence outputs one label]

3. The model determines the length of the output sequence .

The model decides the number of labels itself: the model determines how many labels to output, i.e., the output length. This case is also called a seq2seq task and will be introduced in Lecture 5.

Example applications: (1) Translation: the translated text may differ in length from the original. (2) Speech recognition: the input is audio and the output is a piece of text. Note that, unlike the speech recognition example in the first output form, the output here is not the smallest speech unit (phonemes, i.e., vowels and consonants) but words, so the output length is not necessarily the same as the input length.

[Figure: the model decides the output length (seq2seq)]

Self-attention principle

Next, let's look at the case where the input and output sequences have the same length (also called Sequence Labeling). How should we design the model?

An intuitive first idea is to break the problem apart and reuse the treatment for single-vector input: a vector set is just one vector turned into many, so feed each of these vectors into a Fully Connected Network.

However, you will soon find problems. For example, tag the parts of speech of this sentence: "I saw a saw." If each word is processed as an isolated vector, the machine will treat the two occurrences of "saw" as the same, but they are different: the first is a verb (to see) and the second is a noun (a saw). The same holds for speech: one phonetic symbol can correspond to different written forms. In other words, the vectors in a vector set are not independent; they are related to each other, and we must take the context into account. I also mentioned this in my article on English learning methods: the meaning of a word should be memorized in its sentence (its context), and I gave a few small examples of polysemy there. Interested readers are welcome to read: How to memorize words quickly and well?

Solution 1: set up a window.

Like in Homework 2 (speech recognition), set a window that includes the adjacent vectors. Doesn't that take the context into account?

Problems: this is unrealistic for a sequence. Think about it: how large should the window be? For a piece of text, you would first have to find the longest sentence and then open a window that large. As a result, the computation is heavy, there are many parameters, and the model easily overfits.

There is another way: blend context information into every vector. The vector set keeps its length, but the vectors are no longer the original ones. This is the Self-attention method.

Solution 2: Self-attention.

As shown in the figure below, the sequence (vector set) is fed into a Self-attention module, and every vector in the output sequence carries context information. These vectors are then fed into a Fully Connected Network, just as in the single-vector case. Compared with Solution 1, self-attention does not blow up the number of parameters, yet the context of the whole sentence is taken into account.

[Figure: Self-attention module followed by a Fully Connected Network]

Like the convolutional layers in a CNN, Self-attention + Fully Connected layers can also be stacked several times, as shown in the figure below. Therefore, the input to a self-attention layer can be the raw input or a hidden layer.

[Figure: stacking Self-attention and Fully Connected layers]
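A minimal PyTorch sketch of this stacking idea (my own illustration, not the course code), assuming `torch.nn.MultiheadAttention` as the self-attention module and an arbitrary feature dimension of 64:

```python
import torch
import torch.nn as nn

class SelfAttentionBlock(nn.Module):
    """One self-attention layer followed by a fully connected layer."""
    def __init__(self, dim, num_heads=1):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.fc = nn.Linear(dim, dim)

    def forward(self, x):                      # x: (batch, seq_len, dim)
        out, _ = self.attn(x, x, x)            # query = key = value = x
        return torch.relu(self.fc(out))

# Stack several blocks, as in the figure: attention and FC layers alternate.
model = nn.Sequential(SelfAttentionBlock(64), SelfAttentionBlock(64))
x = torch.randn(2, 10, 64)                     # a batch of 2 sequences of length 10
y = model(x)                                   # same shape as x: (2, 10, 64)
```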

What does Self-attention do?

It finds the relationships between vectors. As shown in the figure below, for one vector, it computes the relevance between that vector and the other vectors in the sequence. We use $\alpha$ to denote this relevance.

[Figure: relevance $\alpha$ between one vector and the others]

How to compute $\alpha$: either Dot-product or Additive works, as shown in the figure below. Dot-product is more common, and it is the one used here.

[Figure: Dot-product and Additive ways of computing $\alpha$]

From the input $a^i$ we compute $q^i$ (query) and $k^i$ (key). $q^1$ is dot-multiplied with every $k^i$ in the sequence to obtain $\alpha$, as shown in the figure below; $\alpha$ is also called the attention score. Note: $q^1$ is also multiplied with its own $k^1$. The scores $\alpha$ are then passed through a softmax to obtain the normalized results $\alpha'$.

Question: why is $q^1$ also multiplied with its own $k^1$?

[Figure: computing attention scores $\alpha$ and normalized $\alpha'$]

The normalized attention scores $\alpha'$ are multiplied by $v$ (value) and summed, which gives $b$, as shown in the figure below. $b$ is the output of the input vector after context information has been incorporated. Although the input and output sequences have the same length, every output vector now carries information from the other vectors in the sequence.

Question: why use $v$? Can't we multiply by the input $a$ directly?

[Figure: weighting the values $v$ by $\alpha'$ to obtain $b$]
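Here is a minimal numpy sketch of the single-query computation described above (my own illustration; the dimensions and the parameter matrices $W^q, W^k, W^v$ are random stand-ins for what would actually be learned):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

d = 4                                   # vector dimension (illustrative)
a = np.random.randn(3, d)               # input sequence: a^1, a^2, a^3
Wq, Wk, Wv = (np.random.randn(d, d) for _ in range(3))   # learned in practice

q1 = a[0] @ Wq                          # query from a^1
k = a @ Wk                              # keys k^1, k^2, k^3 (including its own k^1)
v = a @ Wv                              # values v^1, v^2, v^3

alpha = k @ q1                          # attention scores: q^1 . k^i for every i
alpha_prime = softmax(alpha)            # normalized scores alpha'
b1 = alpha_prime @ v                    # output b^1 = sum_i alpha'_i * v^i
```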

From the steps above, computing one output vector $b^i$ does not depend on the previous results $b^1, \ldots, b^{i-1}$; therefore $b^1, \ldots, b^n$ can be computed in parallel.

So far we have seen how one vector in the input vector set is processed. Now let's move from the individual to the whole and see how the entire input vector set (sequence) is handled. In the example shown in the figure below, the input sequence has length 4: there are four vectors $a^1, a^2, a^3, a^4$. From each $a^i$ we obtain the corresponding $q^i, k^i, v^i$.

[Figure: obtaining $q^i, k^i, v^i$ from each input vector $a^i$]

Because $q \cdot k$ is a dot-product, we can stack the transposed $k^i$ and matrix-multiply them with the $q^i$ to obtain all the $\alpha$ at once, forming the attention matrix $A$, as shown in the figure below:

[Figure: forming the attention matrix $A$]

Multiplying $\alpha'$ by $v$ gives the output $b$; this operation can also be written as a matrix operation, as shown in the figure below:

[Figure: computing the outputs $b$ from $\alpha'$ and $v$ in matrix form]

Putting the above together, we obtain the matrix-form expression of the whole Self-attention process, as shown in the figure below:

[Figure: matrix-form expression of the whole Self-attention process]
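A minimal numpy sketch of this matrix form (my own illustration; it follows the common row-vector convention, so the inputs are rows of the matrix rather than columns as on the slides):

```python
import numpy as np

def softmax_rows(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

L, d = 4, 8                             # sequence length and vector dimension
A = np.random.randn(L, d)               # input vectors a^1..a^4 as rows
Wq, Wk, Wv = (np.random.randn(d, d) for _ in range(3))  # parameters to be learned

Q, K, V = A @ Wq, A @ Wk, A @ Wv        # (L, d) each
scores = Q @ K.T                        # attention matrix, (L, L): costs L x d x L multiplications
A_prime = softmax_rows(scores)          # normalize each query's scores
B = A_prime @ V                         # output sequence, (L, d)
```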

Things to note:

(1) $W^q$, $W^k$, $W^v$ are the parameters to be learned!

(2) The computation of the attention matrix shown in the figure is the most computationally expensive part. If the sequence length is $L$ and the vector dimension is $d$, it requires $L \times d \times L$ multiplications.

Improvements to Self-attention

1. Multi-head Self-attention

Sometimes we need to consider multiple kinds of relevance, which requires more than one self-attention; hence Multi-head Self-attention. As shown in the figure below (the 2-head case), the $(q, k, v)$ go from one group to multiple groups. Note: the operations are carried out within each group of $(q, k, v)$ separately; there is no crossing between groups.

[Figure: 2-head Self-attention]

Compared with single-head self-attention, Multi-head Self-attention has one extra step: the multiple head outputs are combined into a single output.

[Figure: combining the multiple head outputs into one output]
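A minimal sketch of the two-head case (my own illustration; each head has its own projections, and the combining matrix `Wo` is an assumption borrowed from the standard Transformer formulation rather than taken from the slides):

```python
import numpy as np

def softmax_rows(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def single_head(A, Wq, Wk, Wv):
    """One head: its own q, k, v and its own attention computation."""
    Q, K, V = A @ Wq, A @ Wk, A @ Wv
    return softmax_rows(Q @ K.T) @ V

L, d, d_head, heads = 4, 8, 4, 2
A = np.random.randn(L, d)
params = [tuple(np.random.randn(d, d_head) for _ in range(3)) for _ in range(heads)]
Wo = np.random.randn(heads * d_head, d)        # combines the heads into one output

head_outputs = [single_head(A, *p) for p in params]   # each (L, d_head), no cross-head mixing
B = np.concatenate(head_outputs, axis=-1) @ Wo        # final output, (L, d)
```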

2. Positional Encoding

Reviewing the self-attention computation, we find that self-attention does not take position into account; it only computes pairwise relevance. For example, no matter whether a word appears at the beginning, the middle, or the end of a sentence, self-attention gives the same result. However, sometimes the positional information in a sequence is important.

Solution: Positional Encoding. Add positional information $e^i$ to the input $a^i$. It can be hand-designed (each black column in the figure below is one $e^i$), or it can be learned from data.

[Figure: positional encoding vectors $e^i$ added to the inputs $a^i$]
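One common hand-designed choice is the sinusoidal encoding from the original Transformer paper; the sketch below (my own illustration, not necessarily the exact $e^i$ shown in the course figure) simply adds it to the inputs:

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d):
    """e^i[2j] = sin(i / 10000^(2j/d)), e^i[2j+1] = cos(i / 10000^(2j/d))."""
    pos = np.arange(seq_len)[:, None]          # positions i
    j = np.arange(0, d, 2)[None, :]            # even dimensions 2j
    angle = pos / np.power(10000.0, j / d)
    e = np.zeros((seq_len, d))
    e[:, 0::2] = np.sin(angle)
    e[:, 1::2] = np.cos(angle)
    return e

A = np.random.randn(4, 8)                              # input vectors a^i
A = A + sinusoidal_positional_encoding(4, 8)           # a^i + e^i before self-attention
```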

Applications of Self-attention

1. Application to NLP

Self-attention is widely used in NLP; famous models such as Transformer and BERT have Self-attention in their architectures.

2. Application to speech recognition; improvement: Truncated Self-attention

When Self-attention is used for speech, the frames can be treated as the vector set.

Problem: suppose a frame is taken every 10 ms; then 1 s of audio contains 100 vectors. A sentence is very long, so computing the attention matrix requires a lot of computation.

Improvement: use Truncated Self-attention. As shown in the figure, instead of computing attention scores over the whole sequence, the computation is limited to a neighborhood of a certain size.

Thought: Truncated Self-attention feels a bit like the receptive field in a CNN.

[Figure: Truncated Self-attention]
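A minimal sketch of the truncation idea (my own illustration): scores outside a fixed neighborhood are masked to $-\infty$ before the softmax, so each frame only attends to nearby frames; the window size is an arbitrary choice here:

```python
import numpy as np

def truncated_attention_scores(scores, window):
    """Keep score[i, j] only when |i - j| <= window; mask out the rest."""
    L = scores.shape[0]
    i, j = np.indices((L, L))
    return np.where(np.abs(i - j) <= window, scores, -np.inf)
    # apply softmax row-wise afterwards as usual

scores = np.random.randn(6, 6)            # Q @ K.T for a 6-frame sequence
print(truncated_attention_scores(scores, window=1))
```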

3. Application to image processing; comparison with the CNN model

When self-attention is used for image processing, as shown in the figure, each pixel of a W x H x D image is treated as a vector (its D channel values), so an image is a vector set.

[Figure: treating each pixel as a vector]

In fact, Self-attention can be seen as a more flexible CNN. Why?

Treat each pixel as a vector. A CNN only looks at the relevance within a receptive field, which can be understood as: centered on this vector, only the adjacent vectors are looked at, as shown in the figure below. From the perspective of Self-attention, this is Self-attention whose receptive field is not the whole sequence. Therefore, a CNN is a simplified version of Self-attention.

On the other hand, the size of a CNN's receptive field is set by hand, e.g., a kernel size of 3x3, whereas the process by which Self-attention computes the attention scores can be seen as learning and determining the size of the receptive field. Compared with a CNN, self-attention's choice of receptive field is not limited to adjacent pixels; it can be selected from the whole image. Therefore, Self-attention is a complex version of the CNN model.

[Figure: CNN receptive field vs. Self-attention]

From the above analysis, self-attention is more complex (more flexible) and its hypothesis space |H| is larger, so it requires a larger amount of training data N. As shown in the figure below, on an image recognition task, when the amount of data is relatively small (less data), the CNN model performs better; when the amount of data is relatively large (more data), the Self-attention model performs better.

Note: "less data" in the figure means 10M (ten million) images, which is not the small amount of data we might imagine, haha!

[Figure: CNN vs. Self-attention performance as the amount of data grows]

4. Self-attention vs. RNN

Before self-attention, the network architecture commonly used for sequence input was the RNN (Recurrent Neural Network). Let's look at the differences between the two.

First impression: an RNN can only look at the preceding vectors, while self-attention looks at the whole sentence.

That is not quite right: there are also bidirectional RNNs, which go from the beginning of the sentence to the end and from the end back to the beginning, so they can look at the whole sentence as well.

The real differences are:

(1) As shown in the figure below, it is hard for an RNN to relate the last vector to the first vector: the output of the first vector has to be kept in memory all the way. For self-attention this is easy: any two vectors in the sequence can relate to each other; "the ends of the earth feel like next door", and distance is not a problem.

(2) An RNN uses the previous output as the next input, so it must compute sequentially and cannot be parallelized, whereas self-attention can be computed in parallel.

[Figure: Self-attention vs. RNN]

5. Self-attention for Graph

In a graph, the edges can be used to simplify the attention computation: attention is only computed between nodes connected by an edge, and pairs of nodes without an edge are set to 0. This is one kind of Graph Neural Network (GNN).

[Figure: attention restricted to the edges of a graph]
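A minimal sketch of this edge-restricted attention (my own illustration; the 3-node adjacency matrix is made up): the adjacency matrix acts as a mask, so only connected nodes keep a nonzero attention weight:

```python
import numpy as np

def graph_attention_scores(scores, adjacency):
    """Keep score[i, j] only where an edge exists; others get zero weight."""
    masked = np.where(adjacency > 0, scores, -np.inf)   # -inf becomes 0 after softmax
    e = np.exp(masked - masked.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

adjacency = np.array([[1, 1, 0],
                      [1, 1, 1],
                      [0, 1, 1]])          # illustrative 3-node graph (self-loops kept)
scores = np.random.randn(3, 3)             # Q @ K.T over the node vectors
print(graph_attention_scores(scores, adjacency))
```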

If you find this article helpful, please like and support it, thank you!

Follow me, Ning Meng Julie; let's learn from each other and communicate more!

For more notes, please see the directory of my notes on Li Hongyi's Machine Learning course.

References

Li Hongyi, Machine Learning 2022:

Course website: https://speech.ee.ntu.edu.tw/~hylee/ml/2022-spring.php

Video: https://www.bilibili.com/video/BV1Wv411h7kN
