One model, two tasks: image captioning and visual question answering. VQA accuracy approaches human level | Demo available
2022-07-02 09:56:00 【QbitAl】
Mingmin, from Aofei Temple
QbitAI | Official account QbitAI
Now you can hand an AI a picture and it will not only describe what it sees, but also handle tricky questions people ask about it.
For example, show it a classic image.

It answers:
A man in a suit who is gesturing.
So what color are the man's eyes in the picture?
Blue.

Look closely, and it's true!
This is a new result in the vision-language field: BLIP (Bootstrapping Language-Image Pre-training).
Its breakthrough is to unify two tasks that previously had to be handled separately, vision-language generation and vision-language understanding, so that the AI can switch back and forth between image captioning and visual question answering.
It also outperforms previous SOTA methods across tasks, with VQA accuracy above 78%, approaching the human baseline (80.83%).
Enough talk. Let's try it and see how capable the model really is.
Demo
BLIP offers two functions.
The first describes the content of an image; the second answers questions about the image.
After uploading an image, you can pick one of the two modes below it.

First, let's see how well it does at captioning.
After we uploaded a picture containing a child, a cat, and a dog, the model output:
A little boy lying on the ground with a cat and a dog.

Now try asking a question:
Is there a fish in the picture?
BLIP: No.

As you can see, BLIP understands the picture well. How about a few more images?
When we uploaded the Mona Lisa, the model easily recognized it as a painting of a woman rather than a photo.

Even a spoof image of Altman does not stump BLIP, which gives a perfectly serious answer:
A man carrying a cake with candles.

You can even ask it: is the cake in the man's left hand or right hand? BLIP gives the correct answer:
The right hand.

That is genuinely impressive.
So what is the principle behind it? Let's take a look.
Learning from noisy image-text pairs
BLIP makes two main contributions.
First, it uses a multi-task model, MED (multimodal mixture of encoder-decoder), which combines several pre-training tasks.

As the architecture diagram shows, MED consists of three main parts:
Unimodal encoder: trained with an image-text contrastive loss (ITC) to align the visual and textual representations.
Image-grounded text encoder: uses cross-attention layers to model vision-language interactions, and is trained with an image-text matching loss (ITM) to distinguish positive from negative image-text pairs.
Image-grounded text decoder: replaces the bidirectional self-attention layers with causal self-attention layers, and shares the same cross-attention layers and feed-forward networks with the encoder. The decoder is trained with a language modeling (LM) loss to generate captions.
As a result, the model can perform image-text contrastive learning, image-text matching, and image-conditioned language generation.
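To make the ITC objective more concrete, here is a minimal sketch of a symmetric image-text contrastive loss in plain Python. It is an illustration under stated assumptions, not BLIP's actual implementation: it uses the plain InfoNCE form with hard one-hot targets, the function names and the temperature of 0.07 are hypothetical, and any refinements used in the paper are left out.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cross_entropy_row(logits, target):
    """Cross-entropy of one row of logits against a single correct index."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target]


def itc_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss; pair i is (image_embs[i], text_embs[i]).

    temperature=0.07 is an assumed value for illustration only.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    # sim[i][j]: scaled cosine similarity between image i and text j
    sim = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
           for i in imgs]
    # image-to-text: each image should pick its own text among all texts
    i2t = sum(cross_entropy_row(sim[k], k) for k in range(n)) / n
    # text-to-image: each text should pick its own image among all images
    t2i = sum(cross_entropy_row([sim[r][k] for r in range(n)], k)
              for k in range(n)) / n
    return (i2t + t2i) / 2
```

With matched pairs that point in the same direction the loss is close to zero, and it grows when the pairs are shuffled, which is the alignment pressure the ITC term is meant to provide.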
Second, the researchers propose a new dataset bootstrapping method, CapFilt (Captioning and Filtering), which lets the model learn from noisy image-text pairs.
CapFilt consists of two parts: a captioner and a filter.
The captioner generates text describing an image, and the filter removes noisy results.

In the examples below, for instance, the filter rejects the incorrect text.

Experiments show that the more diverse the captions generated by the captioner, the better the final results.
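The captioner-plus-filter idea can be sketched as a small data-cleaning loop. This is only an illustration: `capfilt`, `captioner`, `match_score`, and the 0.5 threshold are hypothetical names and values, and the toy stand-ins below replace the real models. Following the BLIP paper, the sketch passes both the original web text and the generated caption through the filter.

```python
from typing import Callable, Iterable, List, Tuple

Pair = Tuple[str, str]  # (image identifier, text)


def capfilt(
    web_pairs: Iterable[Pair],
    captioner: Callable[[str], str],
    match_score: Callable[[str, str], float],
    threshold: float = 0.5,  # assumed cut-off, not taken from the paper
) -> List[Pair]:
    """Return image-text pairs that the filter judges to be matching."""
    kept: List[Pair] = []
    for image, web_text in web_pairs:
        # Candidates: the noisy web text plus one synthetic caption.
        candidates = [web_text, captioner(image)]
        for text in candidates:
            if match_score(image, text) >= threshold:
                kept.append((image, text))
    return kept


# Toy stand-ins so the sketch runs without any model.
def toy_captioner(image: str) -> str:
    return f"a photo of {image}"


def toy_match_score(image: str, text: str) -> float:
    return 1.0 if image in text else 0.0


pairs = [("cat", "buy cheap flights now"), ("dog", "my dog in the park")]
clean = capfilt(pairs, toy_captioner, toy_match_score)
```

In this toy run the unrelated web text for the cat image is dropped, while its synthetic caption and both texts for the dog image are kept.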
Compared with the previous SOTA, BLIP improves average recall@1 on image-text retrieval by 2.7%, CIDEr on image captioning by 2.8%, and the visual question answering score by 1.6%.
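For readers unfamiliar with the retrieval metric, recall@K is the fraction of queries whose correct match appears among the top K ranked candidates. The sketch below is a simplified illustration: the function name is hypothetical, and it assumes exactly one correct candidate per query (the candidate with the same index), which is simpler than benchmark setups where an image can have several reference captions.

```python
def recall_at_k(similarity, k=1):
    """Fraction of queries whose correct candidate ranks in the top k.

    similarity[i][j] is the score of query i against candidate j.
    Assumption for this sketch: the correct candidate for query i is index i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Candidate indices sorted from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)


scores = [
    [0.9, 0.1, 0.0],  # query 0: correct candidate ranked first
    [0.2, 0.3, 0.8],  # query 1: correct candidate ranked second
    [0.1, 0.2, 0.7],  # query 2: correct candidate ranked first
]
r1 = recall_at_k(scores, k=1)
r2 = recall_at_k(scores, k=2)
```

Here two of the three queries are answered correctly at rank 1, and all three are covered once the top two candidates are considered.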
The corresponding author is a Tsinghua alumnus
The corresponding author of this study is Xu Zhuhong (Steven C.H. Hoi).

He currently works at Salesforce Research Asia. Previously, he was a professor at the School of Information Systems, National University of Singapore.
In 2002, Xu Zhuhong received his bachelor's degree from the Department of Computer Science at Tsinghua University. He then obtained a master's degree in 2004 and a doctorate in 2006, both in computer science and engineering, from the University of Hong Kong.
He was elected an IEEE Fellow in 2019. His main research areas include computer vision, NLP, and deep learning.
The first author is Junnan Li.

He is now a senior research scientist at Salesforce Research Asia.
He graduated from the University of Hong Kong and received his Ph.D. from the National University of Singapore.
His research interests are broad, including self-supervised learning, semi-supervised learning, weakly supervised learning, transfer learning, and vision-language.
The other two authors are also Chinese: Dongxu Li and Caiming Xiong.
Paper:
https://arxiv.org/abs/2201.12086
Demo:
https://huggingface.co/spaces/akhaliq/BLIP
GitHub:
https://github.com/salesforce/BLIP