
UC San Diego | EViT: Expediting Vision Transformers via Token Reorganization (ICLR 2022)

2022-06-22 04:37:00 Zhiyuan community

Paper title: Not All Patches are What You Need: Expediting Vision Transformers via Token Reorganizations

Paper link: https://openreview.net/forum?id=BjyvwnXXVn_

Code link: https://github.com/youweiliang/evit

Affiliations: UC San Diego & The University of Hong Kong & Tencent AI Lab

Vision Transformers (ViTs) take all image patches as tokens and construct multi-head self-attention (MHSA) among them. Fully leveraging all of these image tokens leads to redundant computation, since not all tokens are attentive in MHSA; for example, tokens containing semantically meaningless or distractive image backgrounds do not contribute positively to ViT predictions. In this work, we propose to reorganize image tokens during the feed-forward process of ViT models, and we integrate this reorganization into ViT training. For each forward inference, we identify the attentive image tokens between the MHSA and FFN (feed-forward network) modules, guided by the corresponding class-token attention. We then reorganize the image tokens by preserving the attentive ones and fusing the inattentive ones, which expedites the subsequent MHSA and FFN computation. Our method, EViT, therefore improves ViTs from two perspectives. First, under the same number of input image tokens, it reduces the MHSA and FFN computation for efficient inference; for example, the inference speed of DeiT-S is increased by 50% while its ImageNet classification accuracy drops by only 0.3%. Second, at the same computational cost, it enables ViTs to take more image tokens as input to improve recognition accuracy, where the image tokens come from higher-resolution images; for example, we improve the ImageNet classification accuracy of DeiT-S by 1% at the same computational cost as a vanilla DeiT-S. Meanwhile, our method introduces no additional parameters into ViTs. Experiments on standard benchmarks demonstrate the effectiveness of our method.
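To make the token-reorganization step concrete, here is a minimal PyTorch sketch of the idea described above. This is not the official implementation: the function name, the `keep_rate` parameter, and the attention-weighted fusion details are illustrative assumptions, but the structure follows the abstract, i.e. rank image tokens by class-token attention, keep the attentive ones, and fuse the inattentive ones into a single token.

```python
import torch

def reorganize_tokens(x, cls_attn, keep_rate=0.7):
    """Illustrative sketch of EViT-style token reorganization.

    x:         (B, N, C) token embeddings, where x[:, 0] is the class token
    cls_attn:  (B, N-1) attention from the class token to each image token
               (e.g., averaged over heads), taken from the preceding MHSA
    keep_rate: fraction of image tokens kept as-is (assumed < 1 here,
               so at least one token is fused)
    """
    B, N, C = x.shape
    num_keep = max(1, int(keep_rate * (N - 1)))

    # Rank image tokens by class-token attention; top-k are "attentive".
    idx = cls_attn.argsort(dim=1, descending=True)
    keep_idx = idx[:, :num_keep]   # attentive tokens to preserve
    fuse_idx = idx[:, num_keep:]   # inattentive tokens to fuse

    img_tokens = x[:, 1:]  # image tokens, class token excluded
    def gather(t, i):
        return t.gather(1, i.unsqueeze(-1).expand(-1, -1, C))
    kept = gather(img_tokens, keep_idx)   # (B, num_keep, C)
    rest = gather(img_tokens, fuse_idx)   # (B, N-1-num_keep, C)

    # Fuse inattentive tokens into one token, weighted by their attention.
    w = cls_attn.gather(1, fuse_idx)
    w = (w / w.sum(dim=1, keepdim=True)).unsqueeze(-1)
    fused = (rest * w).sum(dim=1, keepdim=True)  # (B, 1, C)

    # New, shorter sequence: [class token, attentive tokens, fused token].
    return torch.cat([x[:, :1], kept, fused], dim=1)
```

In an actual ViT block this would run between the MHSA and FFN modules at selected layers, with `cls_attn` taken from the class-token row of the attention map, so that every subsequent MHSA and FFN operates on a shorter token sequence.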
