
An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale (a brief review of an ICLR 2021 paper)


Author: Stan Kriventsov
Compiled by: Flin
Source: Medium

In this post, I want to explain the significance of the new paper "An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale," submitted to the ICLR 2021 conference (anonymous so far).

In another article, I offer an example that uses this new model (called the Vision Transformer) together with PyTorch to make predictions on the standard MNIST dataset.

Deep learning (machine learning with neural networks that have more than one hidden layer) has existed since the 1960s, but what really brought it to the forefront was AlexNet in 2012: a convolutional network (simply put, a network that first looks for small patterns in each part of an image and then tries to combine them into an overall picture), designed by Alex Krizhevsky, which won that year's ImageNet image classification competition.

Over the next few years, deep computer vision went through a genuine revolution, with new convolutional architectures (GoogLeNet, ResNet, DenseNet, EfficientNet, and so on) appearing every year and setting new accuracy records on ImageNet and other benchmark datasets such as CIFAR-10 and CIFAR-100.

The figure below shows the progress since 2011 in the top accuracy (correctly predicting the content of an image on the first attempt) of machine-learning models on the ImageNet dataset.

However, the most interesting deep-learning developments of the past few years have come not in the field of images but in natural language processing (NLP), starting with the 2017 paper "Attention Is All You Need" by Ashish Vaswani et al.

The idea of attention is to have trainable weights that model the importance of every connection between different parts of the input sentence. Attention did for NLP roughly what convolutional networks did for computer vision: it dramatically improved machine-learning models across language tasks such as natural language understanding and machine translation.
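
To make the idea concrete, here is a minimal sketch of the scaled dot-product attention operation at the heart of the transformer, written in PyTorch (my own illustrative code, not from the paper; the tensor sizes are arbitrary):

```python
import torch

def scaled_dot_product_attention(q, k, v):
    # q, k, v: (batch, seq_len, dim). Each attention weight measures how
    # strongly one position in the sequence should attend to another.
    scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)  # (batch, seq, seq)
    weights = scores.softmax(dim=-1)  # trained via the layers producing q, k, v
    return weights @ v                # weighted sum of the values

# Illustrative sizes: a batch of 2 "sentences", 10 tokens, 64-dim embeddings
q = k = v = torch.randn(2, 10, 64)
out = scaled_dot_product_attention(q, k, v)
print(out.shape)  # torch.Size([2, 10, 64])
```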

What makes attention so effective for language data is that understanding human language often requires tracking long-range dependencies. We might start by saying "we arrived in New York" and later add "the weather in the city was fine." To any human reader it is clear that "the city" in the second sentence refers to New York, but for a model that only looks for patterns in nearby data (as a convolutional network does), that connection may be undetectable.

Long-range dependencies can be handled with recurrent networks such as LSTMs, which were in fact the top NLP models before transformers arrived, but even those models struggle to make such specific word-to-word connections.

The global attention in transformers measures the importance of every connection between any two words in a text, which explains its performance advantage. For sequential data types where attention matters less (for example, time-series data such as daily sales or stock prices), recurrent networks are still very competitive and may remain the best choice.

Although dependencies between distant items might seem especially significant in sequence models such as those used in NLP, they cannot be ignored in image tasks either: understanding the various parts of an image usually requires seeing them in the context of the whole picture.

Until now, the reason attention models have not performed well in computer vision has been the difficulty of scaling them: the number of attention weights grows as N², so the full set of attention weights between the pixels of a 1000x1000 image (a million pixels) would contain a trillion entries.

Perhaps more importantly, unlike words in a text, individual pixels carry little meaning by themselves, so connecting them through attention does not accomplish much.

The new paper proposes applying attention not to pixels but to small patches of the image (probably 16x16 pixels, as in the title, although the optimal patch size actually depends on the image size and content).

The figure above (taken from the paper) shows how the Vision Transformer works.

Each patch of the input image is flattened and passed through a linear projection matrix, and a position embedding (learned numbers that carry information about the patch's original location in the image) is added to it. This is necessary because a transformer processes all of its inputs regardless of order, so the position information helps the model evaluate the attention weights correctly. An extra class token is prepended to the input (at position 0 of the sequence) and serves as a placeholder for the class to be predicted in the classification task.
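
As a rough sketch of this input pipeline (my own illustrative PyTorch code, not the authors' implementation; the class name `PatchEmbedding` and the sizes are assumptions based on the ViT-Base configuration described in the paper):

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into patches, project each one linearly, add position
    embeddings, and prepend a learnable class token (position 0)."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, dim=768):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        # A conv with stride == kernel size is equivalent to flattening each
        # patch and applying one shared linear projection to all of them.
        self.proj = nn.Conv2d(in_chans, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))

    def forward(self, x):                  # x: (B, 3, 224, 224)
        x = self.proj(x)                   # (B, dim, 14, 14)
        x = x.flatten(2).transpose(1, 2)   # (B, 196, dim): one token per patch
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1)     # class token at sequence position 0
        return x + self.pos_embed          # learned position information

emb = PatchEmbedding()
tokens = emb(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 197, 768])
```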

Similar to the 2017 version, the transformer encoder is composed of multiple attention, normalization, and fully connected layers, with residual (skip) connections between them, as shown in the right half of the figure.
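
A minimal sketch of one such encoder block in PyTorch might look as follows (illustrative code, not the authors'; it uses the layer-norm-before-each-sub-layer ordering the paper describes, and the sizes are the assumed ViT-Base values):

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One transformer encoder layer: multi-head self-attention and an MLP,
    each wrapped with layer normalization and a residual (skip) connection."""
    def __init__(self, dim=768, heads=12, mlp_dim=3072):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_dim), nn.GELU(), nn.Linear(mlp_dim, dim))

    def forward(self, x):                   # x: (B, tokens, dim)
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # residual 1
        return x + self.mlp(self.norm2(x))                 # residual 2

block = EncoderBlock()
out = block(torch.randn(1, 197, 768))
print(out.shape)  # torch.Size([1, 197, 768])
```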

Within each attention layer, multiple heads can capture different patterns of connections. If you are interested in learning more about how transformers work, I recommend reading Jay Alammar's wonderful article on the subject.

A fully connected MLP head at the output provides the desired class prediction. Of course, as is standard today, the main model can be pretrained on a large image dataset, and the final MLP head can then be fine-tuned to a specific task using standard transfer-learning methods.
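
For illustration, that transfer-learning step might look like the sketch below, assuming a pretrained Vision Transformer is available through the timm library (the library, model name, and class count are my assumptions, not something the paper prescribes):

```python
import timm
import torch.nn as nn

# Load a ViT pretrained on a large dataset (requires the timm package;
# the model name is illustrative).
model = timm.create_model("vit_base_patch16_224", pretrained=True)

# Freeze the pretrained backbone...
for p in model.parameters():
    p.requires_grad = False

# ...and replace the MLP head with a fresh one for, say, 10 target classes.
model.head = nn.Linear(model.head.in_features, 10)
# Only the new head's parameters will now be updated during fine-tuning.
```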

One notable feature of the new model is that, according to the paper, not only does it reach the same prediction accuracy as convolutional methods with less computation, but its performance also keeps improving as it is trained on more and more data, more so than other models.

The authors trained the Vision Transformer on Google's private JFT-300M dataset containing 300 million images, reaching state-of-the-art accuracy on a number of benchmarks. One can expect this pretrained model to be released soon so that we can all try it out.

It is exciting to see such a new application of neural attention in the field of computer vision! Hopefully, the next few years will bring even greater progress built on this development!

Original link: https://medium.com/swlh/an-image-is-worth-16x16-words-transformers-for-image-recognition-at-scale-brief-review-of-the-8770a636c6a8

