Implementation of Perceiver, General Perception with Iterative Attention, in Pytorch


Perceiver - Pytorch


$ pip install perceiver-pytorch


import torch
from perceiver_pytorch import Perceiver

model = Perceiver(
    num_fourier_features = 6,    # number of fourier features, with original value (2 * K + 1)
    depth = 48,                  # depth of net, in paper, they went deep, making up for lack of attention
    num_latents = 6,             # number of latents, or induced set points, or centroids. different papers giving it different names
    cross_dim = 512,             # cross attention dimension
    latent_dim = 512,            # latent dimension
    cross_heads = 1,             # number of heads for cross attention. paper said 1
    latent_heads = 8,            # number of heads for latent self attention, 8
    cross_dim_head = 64,
    latent_dim_head = 64,
    num_classes = 1000,          # output number of classes
    attn_dropout = 0.,
    ff_dropout = 0.,
    weight_tie_layers = False    # whether to weight tie layers (optional, as indicated in the diagram)
)

img = torch.randn(1, 224 * 224) # 1 imagenet image, pixelized

model(img) # (1, 1000)


@misc{jaegle2021perceiver,
    title   = {Perceiver: General Perception with Iterative Attention},
    author  = {Andrew Jaegle and Felix Gimeno and Andrew Brock and Andrew Zisserman and Oriol Vinyals and Joao Carreira},
    year    = {2021},
    eprint  = {2103.03206},
    archivePrefix = {arXiv},
    primaryClass = {cs.CV}
}
  • Latent averaging to the logits?

    I read through the paper last night and came away confused about a few things. I looked through your code hoping for some clarity.

    One issue that doesn't seem to be explained in the paper (or I am missing it) is how the authors go from a set of latents to the logits used at the classification head. You implemented this by taking the mean of the latent set:

    Is this actually how the authors convert to logits?

    opened by neonbjb 7
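
The pooling step this issue asks about can be sketched in plain Python. This is only an illustration of "average the latents, then project to class logits" as described in the issue. The function name `mean_pool_logits` and the toy weights are hypothetical, and nothing here confirms how the paper's authors compute logits.

```python
def mean_pool_logits(latents, weight, bias):
    """Average a set of latent vectors, then apply a linear classifier.

    latents: list of num_latents vectors, each of length latent_dim
    weight:  list of num_classes rows, each of length latent_dim
    bias:    list of num_classes floats
    """
    num_latents = len(latents)
    latent_dim = len(latents[0])
    # Mean over the latent (set) axis gives one vector of size latent_dim.
    pooled = [sum(vec[d] for vec in latents) / num_latents for d in range(latent_dim)]
    # Linear projection from latent_dim to num_classes.
    return [sum(w * p for w, p in zip(row, pooled)) + b for row, b in zip(weight, bias)]


latents = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # 3 latents, latent_dim = 2
weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # 3 classes (toy values)
bias = [0.0, 0.0, 0.5]
logits = mean_pool_logits(latents, weight, bias)
print(logits)
```

Because the mean is taken over the set axis, the result does not depend on the order of the latents.
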
  • PerceiverAR?


    Hey @lucidrains - love this repo, and still trying to wrap my head around the various differences between Perceiver architectures. How hard would it be to extend PerceiverIO to PerceiverAR, and what fundamentally needs to change?

    opened by siddk 5
  • Not using the classification head in Perceiver

    Hi @lucidrains, thank you for your great work!

    I'd like to use the Perceiver (not PerceiverIO) without the classification head (average and projection). Do you think we could add an option to avoid using it? I can do a PR if you want.


    opened by gegallego 4
  • Decoder Attention Module needs a FF network as well in script


    According to the architectural details in the Perceiver IO paper, the decoder attention block contains a cross-attention block (4), which is already implemented in the script (Line 151), followed by a feedforward network, given by equation (6) in the paper, which is not present in the script. I am not aware of the repercussions of omitting the feedforward network in the decoder module, but it might be a good idea to have it in the implementation. Something like self.decoder_ff = PreNorm(FeedForward(queries_dim)) would do the job. Experimentally, the authors found that omitting equation (5) is helpful.

    opened by Hritikbansal 4
  • Positional encodings are already part of the input

    Hello! First of all, thank you for this implementation.

    My inputs already have the proper positional encoding as part of the channel axis. Would it be possible to add a feature to deactivate the default implementation of the positional encoding?

    Thank you!

    opened by Atlis 4
  • x = self.latents + self.pos_emb

    self.latents = nn.Parameter(torch.randn(num_latents, latent_dim))
    self.pos_emb = nn.Parameter(torch.randn(num_latents, latent_dim))
    x = self.latents + self.pos_emb

    I'm not very familiar with PyTorch, but does this make sense? I mean, what is intended when two trainable weight matrices are simply summed, and that is the only place where both latents and pos_emb appear? It looks like they could be replaced with a single matrix.

    opened by galchinsky 4
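
The redundancy raised in this issue can be checked numerically. The sketch below uses plain Python lists in place of the two parameters and a made-up quadratic loss (both assumptions for illustration). It shows that any loss that only sees the sum gives both parameters the same gradient, and that a single merged parameter reproduces the same output.

```python
def loss(latents, pos_emb):
    # Stand-in loss (hypothetical) that only sees x = latents + pos_emb.
    x = [l + p for l, p in zip(latents, pos_emb)]
    return sum(v * v for v in x)


def grad(fn, args, which, eps=1e-6):
    """Central finite-difference gradient of fn with respect to args[which]."""
    out = []
    for i in range(len(args[which])):
        hi = [list(a) for a in args]
        lo = [list(a) for a in args]
        hi[which][i] += eps
        lo[which][i] -= eps
        out.append((fn(*hi) - fn(*lo)) / (2 * eps))
    return out


latents = [0.3, -1.2, 0.7]
pos_emb = [1.0, 0.5, -0.4]

g_latents = grad(loss, [latents, pos_emb], 0)
g_pos_emb = grad(loss, [latents, pos_emb], 1)
print(g_latents)
print(g_pos_emb)

# One merged parameter represents exactly the same input to the network.
merged = [l + p for l, p in zip(latents, pos_emb)]
print(loss(merged, [0.0, 0.0, 0.0]) == loss(latents, pos_emb))
```

This only speaks to what the sum can represent. Initialization scale and optimizer state may still differ between one parameter and two.
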
  • Fourier encoding is not similar to the paper

    First of all, thanks for sharing the code!

    I have a follow-up question to #4.

    In the paper, the authors describe the encoding [sin(f_kπx_d), cos(f_kπx_d)], where f_k is a bank of frequencies spaced log-linearly between 1 and µ/2. Could you point out how you arrived at the 1/2**i scaling in the code?


    opened by cheneeheng 4
  • Fourier encoding should be for position coordinates instead of byte array

    The fourier_encode function as implemented takes as input a byte array x and directly encodes it with sin/cos before concatenating the result with the input.

    As I understand the NeRF position encodings, they encode the x/y/etc. position coordinates, and not a transformation of the data itself. From the Perceiver paper:

    We parametrize the frequency encoding to take the values [sin(f_kπx_d), cos(f_kπx_d)], where the frequencies f_k is the kth band of a bank of frequencies spaced log-linearly between 1 and µ/2... For example, by allowing the network to resolve the maximum frequency present in an input array, we can encourage it to learn to compare the values of bytes at any positions in the input array. x_d is the value of the input position along the dth dimension (e.g. for images d = 2 and for video d = 3). x_d takes values in [−1, 1] for each dimension. We concatenate the raw positional value x_d to produce the final representation of position. This results in a positional encoding of size d(2K + 1).

    NeRF position encoding examples:

    opened by eridgd 4
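
A minimal sketch of what this issue describes: encoding the position coordinates rather than the data values. The function name `fourier_position_features` is hypothetical, and the frequency bands are spaced linearly between 1 and max_freq / 2 as an assumption (the quoted passage says log-linearly). The point is only the d(2K + 1) feature layout per position.

```python
import itertools
import math


def linspace(start, stop, num):
    """Evenly spaced values from start to stop inclusive."""
    if num == 1:
        return [start]
    step = (stop - start) / (num - 1)
    return [start + i * step for i in range(num)]


def fourier_position_features(axis_sizes, max_freq, num_bands):
    """Return one feature vector of size d * (2K + 1) per grid position."""
    # Coordinates x_d in [-1, 1] along each of the d axes.
    axes = [linspace(-1.0, 1.0, n) for n in axis_sizes]
    # ASSUMPTION: K bands spaced linearly between 1 and max_freq / 2.
    freqs = linspace(1.0, max_freq / 2, num_bands)

    features = []
    for position in itertools.product(*axes):
        vec = []
        for x in position:
            vec.extend(math.sin(f * math.pi * x) for f in freqs)
            vec.extend(math.cos(f * math.pi * x) for f in freqs)
            vec.append(x)  # raw coordinate, concatenated as in the quoted passage
        features.append(vec)
    return features


feats = fourier_position_features(axis_sizes=(4, 4), max_freq=10.0, num_bands=6)
print(len(feats), len(feats[0]))  # 16 positions, 2 * (2 * 6 + 1) = 26 features each
```

These features would then be concatenated to each pixel's channels. The pixel values themselves are never passed through sin or cos.
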
  • Positional encoding frequency bands should be linearly spaced

    A small bug, but as alluded to in this comment by @marcdumon, it seems as though the frequency bands are indeed spaced linearly in the official JAX implementation.

    opened by djl11 2
  • Bug in fourier_encode (?)

    Thank you for this great implementation. I'm learning a lot from it!

    I think I found a problem in the fourier_encode method. In this line:

    the scales are always the same regardless of the value of the base parameter. Example:

    max_freq = 10, num_bands=6, base = 2
    => scales = [1.0000, 1.3797, 1.9037, 2.6265, 3.6239, 5.0000]
    max_freq = 10, num_bands=6, base = 10
    => scales = [1.0000, 1.3797, 1.9037, 2.6265, 3.6239, 5.0000]
    opened by marcdumon 2
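
One way the behavior reported here can arise is sketched below. The line in question is not shown in the issue, so the formula is an assumption: exponents spaced linearly from 0 to log_base(max_freq / 2), then raised to the power of base. Under that assumption the base cancels, since base ** (t * log(m) / log(base)) equals m ** t.

```python
import math


def logspace_scales(max_freq, num_bands, base):
    # ASSUMED form of the elided line: linear exponents up to log_base(max_freq / 2).
    stop = math.log(max_freq / 2) / math.log(base)
    exponents = [stop * i / (num_bands - 1) for i in range(num_bands)]
    # base ** (t * log(m) / log(base)) == m ** t, so base drops out.
    return [base ** e for e in exponents]


scales_2 = logspace_scales(max_freq=10, num_bands=6, base=2)
scales_10 = logspace_scales(max_freq=10, num_bands=6, base=10)
print([round(s, 4) for s in scales_2])
print([round(s, 4) for s in scales_10])
```

Both calls produce a geometric progression from 1 to max_freq / 2, consistent with the values quoted in the issue.
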
  • Attention softmax is applied to incorrect dimension?

    I am studying multi-head attention. While reading through [1], I found that the attention softmax is applied over the last dimension of the similarity tensor sim:

            q, k, v = map(lambda t: rearrange(t, 'b n (h d) -> (b h) n d', h = h), (q, k, v))
            sim = einsum('b i d, b j d -> b i j', q, k) * self.scale
            if exists(mask):
                ...  # masking logic omitted in this excerpt
            # attention, what we cannot get enough of
            attn = sim.softmax(dim = -1)

    If I understand correctly sim has the shape (b*h) n1 n2. The softmax is computed over the last dimension n2. Shouldn't the softmax be applied to matrices with all the similarity values of a single head (i.e. with shape n1, n2)?


    opened by breuderink 2
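
A small plain-Python sketch of what a softmax over the last axis does to a tensor shaped (b * h, n1, n2). The nested lists and toy values stand in for the sim tensor and are assumptions for illustration. Each query row is normalized over its n2 keys, separately for every head, which is the usual scaled dot-product attention weighting. The heads are already separated because the rearrange folds them into the leading axis.

```python
import math


def softmax(row):
    """Numerically stable softmax over a single list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


# sim has shape (b * h, n1, n2): here b * h = 2, n1 = 2 queries, n2 = 3 keys.
sim = [
    [[0.1, 0.2, 0.3], [1.0, 0.0, -1.0]],
    [[2.0, 2.0, 2.0], [0.5, 0.1, 0.9]],
]

# Softmax over the last axis only: one distribution over keys per query.
attn = [[softmax(row) for row in head] for head in sim]
row_sums = [[sum(row) for row in head] for head in attn]
print(row_sums)
```

Normalizing over the whole (n1, n2) matrix instead would make the queries compete with each other for attention mass, so each output would no longer be a weighted average of the values.
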
  • Issue defining base in fourier_encode for,,

    Hey Lucid, love the work, it appears you deprecated base in fourier_encode at

    But,, are still trying to define the base within the forward pass.

    Thanks again, keep up the great work.

    opened by TannerLaBorde 0
  • Audio + Text data?

    Can someone please guide me on how to process both audio and .txt data through Perceiver simultaneously for multimodal learning?

    Example code would be nice.


    opened by Sidz1812 1
  • just a suggestion

    Hi, I'd like to start by thanking you for such great work and so many great implementations. I have a small suggestion: for all your code/modules, consider adding if __name__ == "__main__": so that someone who just wants to use one file/module can easily try it without going through the whole implementation. For example, I am trying to use this; with an if __name__ == "__main__": block, I could easily run a random input and see how it works. This would greatly increase usability.

    Keep up the great work :)

    opened by seyeeet 4
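
The suggestion above amounts to the standard Python entry-point guard. A minimal sketch, where `scale` is a hypothetical stand-in for whatever a module actually exports:

```python
def scale(values, factor=2.0):
    """Toy stand-in (hypothetical) for a module's real functionality."""
    return [v * factor for v in values]


if __name__ == "__main__":
    # Runs only when the file is executed directly, not when it is imported.
    print(scale([1.0, 2.0, 3.0]))
```

Importing the module elsewhere leaves the smoke test silent, so the guard adds no side effects for library users.
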
  • What should I change if I want to use data with input size 720*184

    Thanks for sharing this code. I was wondering what I should change to be able to use data that can be converted into images with an input size of 720*184. Thanks in advance.

    opened by Oussamab21 0
  • Question regarding queries dimensionality in Perceiver IO

    Hi @lucidrains,

    I think I may be missing something - why do we define the Perceiver IO queries tensor to have a batch dimension (i.e. queries = torch.randn(1, 128, 32))? Was this just to make the code work nicely? Shouldn't we be using queries = torch.randn(128, 32)? I expect to use the same embedding for all of my batch elements, which is IIUC what your code is doing.

    opened by pcicales 3
Phil Wang
Working with Attention. It's all we need.