
Google proposes the large-scale pre-trained model CoCa, reaching 91% fine-tuned Top-1 accuracy on ImageNet and SOTA on multiple downstream tasks!

2022-06-10 13:02:00 Zhiyuan community

This article covers the paper "CoCa: Contrastive Captioners are Image-Text Foundation Models". Google Research proposes the large-scale pre-trained model CoCa, which reaches 91% fine-tuned Top-1 accuracy on ImageNet and achieves SOTA on multiple downstream tasks.

The details are as follows:

Exploring large-scale pre-trained foundation models is of great significance in computer vision, because these models can be quickly transferred to many downstream tasks. This paper proposes the Contrastive Captioner (CoCa), which pre-trains an image-text encoder-decoder foundation model jointly with a contrastive loss and a captioning loss, thereby combining the strengths of contrastive approaches such as CLIP and generative approaches such as SimVLM. Unlike a standard encoder-decoder Transformer, in which every decoder layer attends to the encoder outputs, CoCa omits cross-attention in the first half of the decoder layers to encode unimodal text representations, and cascades the remaining decoder layers, which cross-attend to the image encoder, to produce multimodal image-text representations.
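The joint objective described above can be sketched numerically: a CLIP-style symmetric contrastive loss on pooled image/text embeddings plus an autoregressive cross-entropy captioning loss, combined with per-loss weights. This is a minimal NumPy illustration, not the paper's implementation; the temperature and the loss weights `w_con`/`w_cap` here are placeholder assumptions, and the function names are invented for the sketch.

```python
import numpy as np

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    # CLIP-style symmetric InfoNCE: normalize, compare all pairs,
    # and treat the diagonal (matching image-text pairs) as targets.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature          # (B, B) similarity matrix
    labels = np.arange(len(logits))             # correct pairs on the diagonal

    def ce(l):
        l = l - l.max(axis=1, keepdims=True)    # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # average image->text and text->image directions
    return 0.5 * (ce(logits) + ce(logits.T))

def captioning_loss(token_logits, target_ids):
    # Autoregressive cross-entropy over caption tokens (teacher forcing).
    l = token_logits - token_logits.max(axis=-1, keepdims=True)
    logp = l - np.log(np.exp(l).sum(axis=-1, keepdims=True))
    return -logp[np.arange(len(target_ids)), target_ids].mean()

def coca_loss(img_emb, txt_emb, token_logits, target_ids,
              w_con=1.0, w_cap=2.0):
    # Weighted sum of the two pre-training losses; the weights here
    # are illustrative assumptions, not values taken from the paper.
    return (w_con * contrastive_loss(img_emb, txt_emb)
            + w_cap * captioning_loss(token_logits, target_ids))
```

In CoCa's design the unimodal text embedding feeds the contrastive term while the multimodal decoder output feeds the captioning term, so both objectives are computed in a single forward pass.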


Copyright notice
This article was created by [Zhiyuan community]; please include a link to the original when reposting.
https://yzsam.com/2022/161/202206101239169690.html