Vision transformers (ViTs) have found only limited practical use in processing images

Last update: Sep 10, 2022


Overview

CXV

Convolutional Xformers for Vision

Vision transformers (ViTs) have found only limited practical use in processing images, in spite of their state-of-the-art accuracy on certain benchmarks. The reasons for their limited use include their need for larger training datasets and more computational resources than convolutional neural networks (CNNs), owing to the quadratic complexity of their self-attention mechanism in the number of input tokens. We propose a linear attention-convolution hybrid architecture, Convolutional Xformers for Vision (CXV), to overcome these limitations. We replace the quadratic attention with linear attention mechanisms, such as Performer, Nyströmformer, and Linear Transformer, to reduce GPU usage. Convolutional sub-layers provide an inductive prior for image data, thereby eliminating the need for the class token and positional embeddings used by ViTs. CXV outperforms other architectures, including token mixers (e.g., ConvMixer, FNet, and MLP Mixer), transformer models (e.g., ViT, CCT, CvT, and hybrid Xformers), and ResNets, for image classification in scenarios with limited data and GPU resources.
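The saving that linear attention mechanisms rely on comes from reassociating the attention product. The sketch below is a minimal illustration, not the CXV code: it assumes the elu(x) + 1 feature map used by the Linear Transformer, uses plain Python lists instead of tensors, omits multiple heads and learned projections, and all function names are hypothetical. It computes the same output twice: once by evaluating every entry of the N x N query-key weight matrix (quadratic in the number of tokens N), and once as phi(Q) (phi(K)^T V), which costs O(N d^2) and never forms the N x N matrix.

```python
import math


def elu_plus_one(x):
    """Feature map phi(x) = elu(x) + 1 applied elementwise; every entry is > 0."""
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]


def quadratic_attention(Q, K, V, phi=elu_plus_one):
    """Explicit form: evaluates all N x N query-key weights, O(N^2 d) time."""
    fq = [phi(q) for q in Q]
    fk = [phi(k) for k in K]
    out = []
    for qi in fq:
        # Row i of the N x N weight matrix: phi(q_i) . phi(k_j) for every key j.
        weights = [sum(a * b for a, b in zip(qi, kj)) for kj in fk]
        total = sum(weights)
        out.append([sum(w * v[c] for w, v in zip(weights, V)) / total
                    for c in range(len(V[0]))])
    return out


def linear_attention(Q, K, V, phi=elu_plus_one):
    """Reassociated form phi(Q) (phi(K)^T V): O(N d^2) time, no N x N matrix."""
    fk = [phi(k) for k in K]
    d_feat, d_v = len(fk[0]), len(V[0])
    # S = phi(K)^T V (d_feat x d_v) and z = sum_j phi(k_j); both are computed
    # once and shared by every query.
    S = [[sum(f[a] * v[c] for f, v in zip(fk, V)) for c in range(d_v)]
         for a in range(d_feat)]
    z = [sum(f[a] for f in fk) for a in range(d_feat)]
    out = []
    for q in Q:
        fq = phi(q)
        denom = sum(fq[a] * z[a] for a in range(d_feat))
        out.append([sum(fq[a] * S[a][c] for a in range(d_feat)) / denom
                    for c in range(d_v)])
    return out
```

Because phi(K)^T V and the normalizer z are shared by all queries, doubling the number of image tokens roughly doubles the work of the linear form, whereas the explicit form quadruples it. Performer and Nyströmformer reach linear cost through different approximations of softmax attention, so this sketch stands in for the general idea only.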

Models:

CNV - Convolutional Nyströmformer for Vision
CPV - Convolutional Performer for Vision
CLTV - Convolutional Linear Transformer for Vision
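The three models differ in how they approximate attention. As an illustration of the Performer mechanism behind CPV, the sketch below shows positive random features: with projections w drawn from a standard Gaussian, phi(q) . phi(k) is an unbiased estimate of the softmax kernel exp(q . k), so softmax-like attention can be evaluated in the linear, reassociated form. This is a hedged sketch, not the repository's implementation: it uses i.i.d. Gaussian projections rather than the orthogonal ones Performer prefers for lower variance, and the helper names `draw_projections`, `positive_features`, and `approx_softmax_kernel` are hypothetical.

```python
import math
import random


def draw_projections(m, d, seed=0):
    """Draw m random vectors w_i ~ N(0, I_d) (i.i.d., not orthogonalized)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(m)]


def positive_features(x, W):
    """phi(x)_i = exp(w_i . x - |x|^2 / 2) / sqrt(m); every entry is > 0."""
    half_sq_norm = 0.5 * sum(v * v for v in x)
    scale = 1.0 / math.sqrt(len(W))
    return [scale * math.exp(sum(wj * xj for wj, xj in zip(w, x)) - half_sq_norm)
            for w in W]


def approx_softmax_kernel(q, k, W):
    """phi(q) . phi(k), an unbiased Monte Carlo estimate of exp(q . k)."""
    return sum(a * b for a, b in zip(positive_features(q, W),
                                     positive_features(k, W)))


if __name__ == "__main__":
    W = draw_projections(20000, 2, seed=0)
    q, k = [0.3, 0.2], [0.1, -0.4]
    exact = math.exp(sum(a * b for a, b in zip(q, k)))
    print(f"estimate={approx_softmax_kernel(q, k, W):.4f} exact={exact:.4f}")
```

In an attention layer, q and k would each be scaled by d^(-1/4) before the feature map, so that the estimate targets exp(q . k / sqrt(d)), the kernel inside standard softmax attention. The number of features m trades accuracy for cost: with i.i.d. projections, the variance of the estimate shrinks as 1/m.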

Owner

Cloudwalker
