One model, two tasks: image captioning and visual question answering, with VQA accuracy close to human level | demo available

2022-07-02 09:56:00 【QbitAl】

Ming Min, reporting from Aofei Temple
QbitAI | WeChat official account QbitAI

Now, hand an AI a picture and it can not only describe what it sees, it can also handle tricky questions people ask about it.

For example, show it a classic picture.

It can answer:

A man in a suit who is gesturing.

So what color are the man's eyes in the picture?

Blue.

I took a close look, and it's true!

This is a new result in the vision-language field: BLIP (Bootstrapping Language-Image Pre-training).

Previously, vision-language generation and vision-language understanding could only be handled separately. BLIP unifies the two tasks, letting the AI switch back and forth between image captioning and visual question answering.

It also outperforms previous SOTA methods across tasks, with VQA accuracy above 78%, approaching the human baseline (80.83%).

Without further ado, let's try it and see how capable this model is.

Trying the demo

BLIP offers two functions.

The first describes the content of a picture; the second answers questions about the picture.

After uploading a picture, you can pick either mode below it.

First, let's see how well it captions images.

After we upload a picture containing a child, a cat, and a dog, the model outputs:

A little boy is lying on the ground with a cat and a dog.

Now try asking a question:

Is there a fish in the picture?
BLIP: No.

As you can see, BLIP understands the picture well. How about a few more pictures?

When we upload the Mona Lisa, the model easily identifies it as a portrait of a woman, not a photo.

Even a spoof image of Altman does not trip BLIP up, and it gives a serious answer:

A man is holding a cake with candles.

Even when asked whether the cake is in the man's left hand or right hand, BLIP gives the correct answer:

Right hand.

That is genuinely impressive.

So what is the principle behind it? Let's take a look.

Learning from noisy image-text pairs

BLIP makes two main contributions.

First, it uses a multi-task model, MED (Multimodal mixture of Encoder-Decoder), which unifies pre-training across multiple tasks.

As the architecture diagram shows, MED has three main parts:

Unimodal encoder, trained with an image-text contrastive (ITC) loss to align the visual and textual representations.

Image-grounded text encoder, which uses conventional cross-attention layers to model the vision-language interaction and is trained with an image-text matching (ITM) loss to distinguish positive from negative image-text pairs.

Image-grounded text decoder, which replaces the bidirectional self-attention layers with causal self-attention layers and shares the same cross-attention layers and feed-forward networks with the encoder. The decoder is trained with a language modeling (LM) loss to output text captions.
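The switch from bidirectional to causal self-attention in the decoder can be pictured as a change of attention mask. The sketch below is only an illustration of that idea, not the BLIP implementation; the function names are hypothetical and the masks are plain Python lists instead of tensors.

```python
def bidirectional_mask(n):
    """Encoder-style mask: every token may attend to every position."""
    return [[True] * n for _ in range(n)]


def causal_mask(n):
    """Decoder-style mask: token i may attend only to positions 0..i."""
    return [[j <= i for j in range(n)] for i in range(n)]


def show(mask):
    """Render a mask as rows of 1/0 for quick inspection."""
    return "\n".join(" ".join("1" if m else "0" for m in row) for row in mask)


if __name__ == "__main__":
    print(show(causal_mask(4)))
```

With a causal mask, each generated word depends only on the words before it, which is what lets the same stack of layers produce a caption one token at a time.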

As a result, the model can perform image-text contrastive learning, image-text matching, and image-conditioned language generation.
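To make the contrastive objective concrete, here is a minimal sketch of a symmetric image-text contrastive loss over a batch. It is a generic InfoNCE-style formulation written in plain Python, assuming matched pairs sit on the diagonal of a similarity matrix; the function names and the temperature value of 0.07 are assumptions for illustration, and the loss used in the paper may include refinements not shown here.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax over a list of floats."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def itc_loss(sim, temperature=0.07):
    """Symmetric contrastive loss over an N x N image-text similarity matrix.

    sim[i][j] is the similarity of image i and text j; matched pairs
    are assumed to lie on the diagonal.
    """
    n = len(sim)
    scaled = [[s / temperature for s in row] for row in sim]
    # Image -> text: each row should put its mass on the diagonal entry.
    i2t = -sum(log_softmax(scaled[i])[i] for i in range(n)) / n
    # Text -> image: the same computation on the transposed matrix.
    cols = [list(col) for col in zip(*scaled)]
    t2i = -sum(log_softmax(cols[j])[j] for j in range(n)) / n
    return (i2t + t2i) / 2


aligned = [[1.0, 0.0], [0.0, 1.0]]   # matched pairs score highest
shuffled = [[0.0, 1.0], [1.0, 0.0]]  # mismatched pairs score highest

if __name__ == "__main__":
    print(itc_loss(aligned), itc_loss(shuffled))
```

The loss is small when each image is most similar to its own text and large when it is more similar to someone else's, which is the sense in which it aligns the two representations.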

Second, the researchers propose a new data bootstrapping method (CapFilt), which lets the model learn from noisy image-text pairs.

CapFilt consists of two parts: a captioner and a filter.

The captioner generates text that describes an image, and the filter removes noisy results.
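The captioner-and-filter loop can be sketched as follows. This is a toy illustration under stated assumptions: images are stood in for by sets of tags, `toy_captioner` and `toy_filter` are hypothetical stand-ins for the trained models, and running the filter over both the original web text and the generated caption is a choice made for this sketch.

```python
def capfilt(web_pairs, captioner, filter_fn):
    """Bootstrap a cleaner dataset from noisy (image, text) pairs.

    captioner(image) -> str          returns a synthetic caption
    filter_fn(image, text) -> bool   True if the text matches the image
    """
    kept = []
    for image, web_text in web_pairs:
        # Candidate texts: the original web text plus a generated caption.
        for text in (web_text, captioner(image)):
            if filter_fn(image, text):
                kept.append((image, text))
    return kept


def toy_captioner(image_tags):
    """Hypothetical captioner: names the tags that make up the 'image'."""
    return "a photo of " + " and ".join(sorted(image_tags))


def toy_filter(image_tags, text):
    """Hypothetical filter: keep a text if it mentions any image tag."""
    return bool(set(text.lower().split()) & set(image_tags))


web_pairs = [
    (frozenset({"dog", "cat"}), "click here for best prices"),
    (frozenset({"cake"}), "a man holding a cake"),
]
clean = capfilt(web_pairs, toy_captioner, toy_filter)

if __name__ == "__main__":
    for image, text in clean:
        print(sorted(image), "->", text)
```

Here the irrelevant web text for the first image is dropped while its generated caption is kept, so the image still contributes a usable training pair.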

In the examples below, the filter rejects the incorrect captions.

The study shows that the more diverse the captions the captioner produces, the better the final result.

Compared with previous SOTA methods, BLIP improves average recall@1 on image-text retrieval by 2.7%, CIDEr on image captioning by 2.8%, and the VQA score by 1.6%.
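For readers unfamiliar with the retrieval metric, recall@K is the fraction of queries whose correct item appears among the top K retrieved results. The sketch below assumes one correct item per query, which is a simplification (benchmark datasets can pair an image with several captions); the function and variable names are hypothetical.

```python
def recall_at_k(ranked_ids, gold_ids, k=1):
    """Fraction of queries whose gold item appears in the top-k results.

    ranked_ids[q] is the retrieval ranking for query q, best first;
    gold_ids[q] is the single correct item for that query.
    """
    hits = sum(
        1 for ranking, gold in zip(ranked_ids, gold_ids) if gold in ranking[:k]
    )
    return hits / len(gold_ids)


rankings = [
    ["t0", "t2", "t1"],  # gold t0 ranked first
    ["t2", "t1", "t0"],  # gold t1 ranked second
    ["t2", "t0", "t1"],  # gold t2 ranked first
]
gold = ["t0", "t1", "t2"]
r1 = recall_at_k(rankings, gold, k=1)
r2 = recall_at_k(rankings, gold, k=2)

if __name__ == "__main__":
    print(r1, r2)
```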

The corresponding author is a Tsinghua alumnus

The corresponding author of this study is Xu Zhuhong (Steven C.H. Hoi).

He currently works at Salesforce Research Asia. Previously, he was a professor at the School of Information Systems, National University of Singapore.

In 2002, Xu Zhuhong received his bachelor's degree from the Department of Computer Science at Tsinghua University. He received his master's and doctoral degrees in computer science and engineering from the University of Hong Kong in 2004 and 2006, respectively.

He was elected an IEEE Fellow in 2019. His main research areas include computer vision, NLP, and deep learning.

The first author is Junnan Li.

He is currently a senior research scientist at Salesforce Research Asia.

He graduated from the University of Hong Kong and holds a Ph.D. from the National University of Singapore.

His research interests are broad, including self-supervised learning, semi-supervised learning, weakly supervised learning, transfer learning, and vision-language.

The other two authors, Dongxu Li and Caiming Xiong, are also Chinese.

Paper:
https://arxiv.org/abs/2201.12086

Demo:
https://huggingface.co/spaces/akhaliq/BLIP

GitHub:
https://github.com/salesforce/BLIP
