One model, two tasks: image captioning and visual question answering, with VQA accuracy close to human level | Playable demo
2022-07-02 09:56:00 【QbitAl】
Ming Min, reporting from Aofei Temple
QbitAI | WeChat official account QbitAI
Today, you can hand an AI a picture and it will not only describe what it sees but also handle tricky questions that people ask about it.
For example, show it a classic picture.
It can answer:
A man in a suit who is gesturing.
So what color are the man's eyes in the picture?
Blue.
On a closer look, that is indeed correct!
This is a new result in the vision-language field: BLIP (Bootstrapping Language-Image Pre-training).
It breaks with earlier work, in which vision-language generation and vision-language understanding could only be handled separately, by unifying the two tasks so that an AI can switch back and forth between captioning images and visual question answering.
Its performance on a range of tasks also beats previous SOTA methods: VQA accuracy exceeds 78%, approaching the human baseline (80.83%).
Without further ado, let's try it and see how capable this model is.
Trying the demo
BLIP offers two functions.
The first is describing the content of a picture; the second is answering questions about the picture.
After uploading a picture, you can try either mode below the picture.
First, let's see how well it describes pictures.
After we uploaded a picture containing a child, a cat, and a dog, the model output:
A little boy is lying on the ground with a cat and a dog.
Then try asking a question:
Is there a fish in the picture?
BLIP: NO.
As you can see, BLIP understands the picture quite well. So how about a few more pictures?
When we uploaded the Mona Lisa, the model easily identified it as a portrait of a woman rather than a photo.
Even a spoof image of Altman does not stump BLIP, which gave a perfectly serious answer:
A man is carrying a cake with candles.
Even when asked whether the cake is in the man's left hand or right hand, BLIP gives the correct answer:
Right hand.
That is genuinely impressive.
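For readers who would rather script the two modes than use the web page, here is a minimal sketch. It is not the code behind the hosted demo. It assumes the Hugging Face `transformers` and `Pillow` packages are installed, and the checkpoint names `Salesforce/blip-image-captioning-base` and `Salesforce/blip-vqa-base` are assumptions that do not come from the article. The function names are illustrative.

```python
def caption_image(image_path: str) -> str:
    """Image captioning: return a short description of the picture."""
    # Imports are local so that merely defining the function needs no heavy dependencies.
    from PIL import Image
    from transformers import BlipForConditionalGeneration, BlipProcessor

    name = "Salesforce/blip-image-captioning-base"  # assumed checkpoint name
    processor = BlipProcessor.from_pretrained(name)
    model = BlipForConditionalGeneration.from_pretrained(name)

    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=30)
    return processor.decode(output_ids[0], skip_special_tokens=True)


def answer_question(image_path: str, question: str) -> str:
    """Visual question answering: return a short answer about the picture."""
    from PIL import Image
    from transformers import BlipForQuestionAnswering, BlipProcessor

    name = "Salesforce/blip-vqa-base"  # assumed checkpoint name
    processor = BlipProcessor.from_pretrained(name)
    model = BlipForQuestionAnswering.from_pretrained(name)

    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, text=question, return_tensors="pt")
    output_ids = model.generate(**inputs)
    return processor.decode(output_ids[0], skip_special_tokens=True)


# Example usage (downloads model weights; "photo.jpg" is a hypothetical local file):
#   print(caption_image("photo.jpg"))
#   print(answer_question("photo.jpg", "Is there a fish in the picture?"))
```

Each function loads its own processor and model, which mirrors the two separate modes in the demo. In a longer script you would probably load the models once and reuse them.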
So what is the principle behind it? Let's take a look.
Learning from noisy image-text pairs
BLIP's contribution this time has two parts.
First, it uses a multi-task model, the multimodal mixture of encoder-decoder (MED), to integrate pre-training on multiple tasks.
As the architecture diagram shows, MED mainly consists of 3 parts:
Unimodal encoder: trained with an image-text contrastive (ITC) loss to align the visual and textual representations.
Image-grounded text encoder: uses cross-attention layers to model the exchange of vision-language information, and is trained with an image-text matching (ITM) loss to distinguish positive from negative image-text pairs.
Image-grounded text decoder: replaces the bidirectional self-attention layers with causal self-attention layers, and shares the same cross-attention layers and feed-forward networks with the encoder. The decoder is trained with a language modeling (LM) loss to generate text descriptions of images.
As a result, the model can perform image-text contrastive learning, image-text matching, and image-conditioned language generation.
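To make the idea of "aligning visual and textual representations" more concrete, here is a minimal NumPy sketch of a symmetric image-text contrastive loss. It is a generic InfoNCE-style formulation and not necessarily what the paper implements, which may include details omitted here. The function names, the temperature of 0.07, and the toy embeddings are all assumptions made for illustration.

```python
import numpy as np


def log_softmax(x: np.ndarray, axis: int) -> np.ndarray:
    # Subtract the max first for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))


def itc_loss(image_emb: np.ndarray, text_emb: np.ndarray, temperature: float = 0.07) -> float:
    """Symmetric contrastive loss for a batch where row i of each array is a matched pair."""
    # L2-normalise so that the dot product is a cosine similarity.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    sim = image_emb @ text_emb.T / temperature  # shape: (batch, batch)
    idx = np.arange(sim.shape[0])

    # The diagonal holds the true pairs; every other entry acts as a negative.
    loss_i2t = -log_softmax(sim, axis=1)[idx, idx].mean()  # image -> text
    loss_t2i = -log_softmax(sim, axis=0)[idx, idx].mean()  # text -> image
    return float((loss_i2t + loss_t2i) / 2)


rng = np.random.default_rng(0)
img = rng.normal(size=(4, 8))  # toy "image" embeddings
aligned = itc_loss(img, img + 0.01 * rng.normal(size=(4, 8)))  # texts close to their images
shuffled = itc_loss(img, img[::-1])  # texts paired with the wrong images
print(f"aligned pairs: {aligned:.3f}, mismatched pairs: {shuffled:.3f}")
```

The loss is low when each image embedding is closest to its own text embedding and high when the pairing is wrong, which is the pressure that pulls matching pairs together during pre-training.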
Second, the researchers proposed a new dataset bootstrapping method, CapFilt, which lets the model learn from noisy image-text pairs.
CapFilt mainly consists of two parts: a captioner and a filter.
The captioner generates text that describes an image, and the filter removes noisy results.
In the article's examples, for instance, it is the filter that rejects the incorrect text.
The study shows that the more diverse the captions produced by the captioner, the better the final result.
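The captioner-plus-filter loop can be sketched in a few lines of plain Python. This is an illustration of the data flow only: the `capfilt` function, the toy captioner, and the keyword-based filter are hypothetical stand-ins for the real models, and the sketch assumes the filter is applied to both the original web text and the generated caption.

```python
from typing import Callable, Iterable, List, Tuple

Pair = Tuple[str, str]  # (image_id, text)


def capfilt(
    web_pairs: Iterable[Pair],
    captioner: Callable[[str], str],
    matches: Callable[[str, str], bool],
) -> List[Pair]:
    """Build a cleaner dataset from noisy (image, web text) pairs."""
    cleaned: List[Pair] = []
    for image_id, web_text in web_pairs:
        # Candidate texts: the original web text plus one synthetic caption.
        candidates = [web_text, captioner(image_id)]
        for text in candidates:
            # Keep only the texts that the filter judges to match the image.
            if matches(image_id, text):
                cleaned.append((image_id, text))
    return cleaned


# Toy stand-ins for the real models (hypothetical data, for illustration only).
toy_captions = {"img1": "a dog lying on grass", "img2": "a chocolate cake on a table"}
toy_objects = {"img1": {"dog", "grass"}, "img2": {"cake", "table"}}


def toy_captioner(image_id: str) -> str:
    return toy_captions[image_id]


def toy_filter(image_id: str, text: str) -> bool:
    # Accept a text if it mentions at least one object known to be in the image.
    return any(word in toy_objects[image_id] for word in text.lower().split())


web = [("img1", "buy cheap flights now"), ("img2", "my birthday cake")]
result = capfilt(web, toy_captioner, toy_filter)
for pair in result:
    print(pair)
```

In this toy run the irrelevant web text for `img1` is dropped while its synthetic caption is kept, and both texts for `img2` survive.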
Compared with previous SOTA methods, BLIP improves average recall@1 on image-text retrieval by 2.7%, CIDEr on image captioning by 2.8%, and the VQA score by 1.6%.
The corresponding author is a Tsinghua alumnus
The corresponding author of this study is Xu Zhuhong (Steven C.H. Hoi).
He currently works at Salesforce Research Asia. Previously, he was a professor at the School of Information Systems, National University of Singapore.
In 2002, Xu Zhuhong received his bachelor's degree from the Department of Computer Science at Tsinghua University. He then obtained a master's degree and a doctorate in computer science and engineering from the University of Hong Kong in 2004 and 2006, respectively.
He was elected an IEEE Fellow in 2019. His main research areas include computer vision, NLP, and deep learning.
The first author is Junnan Li.
He is now a senior research scientist at Salesforce Research Asia.
He graduated from the University of Hong Kong and received his Ph.D. from the National University of Singapore.
His research interests are broad, including self-supervised learning, semi-supervised learning, weakly supervised learning, transfer learning, and vision-language.
The other two authors, Dongxu Li and Caiming Xiong, are also Chinese.
Paper:
https://arxiv.org/abs/2201.12086
Demo:
https://huggingface.co/spaces/akhaliq/BLIP
GitHub:
https://github.com/salesforce/BLIP