Towers of Babel: Combining Images, Language, and 3D Geometry for Learning Multimodal Vision

Overview

Download links and PyTorch implementation of "Towers of Babel: Combining Images, Language, and 3D Geometry for Learning Multimodal Vision", ICCV 2021.

Xiaoshi Wu, Hadar Averbuch-Elor, Jin Sun, Noah Snavely. ICCV 2021.

Project Page | Paper


The WikiScenes Dataset

  1. Image and Textual Descriptions: WikiScenes contains 63K captioned images of 99 cathedrals. We provide two versions for download:

    • Low-res version used in our experiments (maximum width set to 200 px, aspect ratio preserved): (1.9GB .zip file)
    • Higher-res version (longer dimension capped at 1200 px, aspect ratio preserved): (19.4GB .zip file)

    Licenses for the images are provided here: (LicenseInfo.json file)

    Data Structure

    WikiScenes is organized recursively, following the tree structure in Wikimedia. Each semantic category (e.g. cathedral) contains the following recursive structure:

    ----0 (e.g., "milano cathedral duomo milan milano italy italia")
    --------0 (e.g., "Exterior of the Duomo (Milan)")
    ----------------0 (e.g., "Duomo (Milan) in art - exterior")
    ----------------1
    ----------------...
    ----------------K0-0
    ----------------category.json
    ----------------pictures (contains all pictures in current hierarchy level)
    --------1
    --------...
    --------K0
    --------category.json
    --------pictures (contains all pictures in current hierarchy level)
    ----1
    ----2
    ----...
    ----N
    ----category.json
    

    category.json is a dictionary of the following format:

    {
        "max_index": SUB-DIR-NUMBER,
        "pairs":     {
                         CATEGORY-NAME: SUB-DIR-NAME
                     },
        "pictures":  {
                         PICTURE-NAME: {
                                           "caption": CAPTION-DATA,
                                           "url": URL-DATA,
                                           "properties": PROPERTIES
                                       }
                     }
    }
    

    where:

    1. SUB-DIR-NUMBER is the total number of subcategories.
    2. CATEGORY-NAME is the name of the category (e.g., "milano cathedral duomo milan milano italy italia").
    3. SUB-DIR-NAME is the name of the sub-folder (e.g., "0").
    4. PICTURE-NAME is the name of the .jpg file located within the pictures folder.
    5. CAPTION-DATA contains the caption, and URL-DATA contains the URL from which the image was scraped.
    6. PROPERTIES is a list of properties pre-computed for the image-caption pair (e.g., the estimated language of the caption).
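
    As a concrete illustration, the recursive layout above can be traversed with a short Python sketch. This is a minimal example, not part of the dataset tooling; the function name and root path below are illustrative:

    import json
    import os

    def collect_pairs(category_dir):
        """Recursively yield (image_path, caption) pairs for one category."""
        meta_path = os.path.join(category_dir, "category.json")
        if not os.path.isfile(meta_path):
            return
        with open(meta_path, "r", encoding="utf-8") as f:
            meta = json.load(f)
        # Pictures stored at the current hierarchy level.
        for picture_name, info in meta.get("pictures", {}).items():
            yield os.path.join(category_dir, "pictures", picture_name), info["caption"]
        # Recurse into subcategories ("pairs" maps CATEGORY-NAME to SUB-DIR-NAME).
        for sub_dir in meta.get("pairs", {}).values():
            yield from collect_pairs(os.path.join(category_dir, sub_dir))

    pairs = list(collect_pairs("data/cathedrals/0"))  # root path is illustrative
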
  2. Keypoint correspondences: We also provide keypoint correspondences between pixels of images from the same landmark: (982MB .zip file)

    Data Structure

     {
         "image_id" : {
                         "kp_id": (x, y),
                     }
     }
    

    where:

    1. image_id is the id of each image.
    2. kp_id is the id of a keypoint, which is unique across the whole dataset.
    3. (x, y) is the location of the keypoint in this image.
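
    Because each kp_id is unique across the whole dataset, inverting this mapping yields cross-image correspondences directly. A minimal sketch, assuming the JSON structure above (the filename is illustrative):

    import json
    from collections import defaultdict

    with open("correspondences.json", "r", encoding="utf-8") as f:
        data = json.load(f)

    # Invert the mapping: keypoint id -> {image_id: (x, y)}.
    observations = defaultdict(dict)
    for image_id, keypoints in data.items():
        for kp_id, xy in keypoints.items():
            observations[kp_id][image_id] = xy

    # Keypoints observed in two or more images link those images together.
    shared = {kp: obs for kp, obs in observations.items() if len(obs) > 1}
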
  3. COLMAP reconstructions: We provide the full 3D models used for computing keypoint correspondences: (1GB .zip file)

    To view these models, download and install COLMAP. The reconstructions are organized by landmarks. Each landmark folder contains all the reconstructions associated with that landmark. Each reconstruction contains 3 files:

    1. points3D.txt contains one line of data for each 3D point associated with the reconstruction. The format for each point is: POINT3D_ID, X, Y, Z, R, G, B, ERROR, TRACK[] as (IMAGE_ID, POINT2D_IDX).
    2. images.txt contains two lines of data for each image associated with the reconstruction. The format of the first line is: IMAGE_ID, QW, QX, QY, QZ, TX, TY, TZ, CAMERA_ID, NAME. The format of the second line is: POINTS2D[] as (X, Y, POINT3D_ID).
    3. cameras.txt contains one line of data for each camera associated with the reconstruction, in the following format: CAMERA_ID, MODEL, WIDTH, HEIGHT, PARAMS[].

    Please refer to COLMAP's tutorial for further instructions on how to view these reconstructions.
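
    If you only need programmatic access to the 3D points, the text format documented above can also be parsed directly. Below is a minimal sketch for points3D.txt (a hypothetical helper, not part of this repository; COLMAP also ships its own Python read/write scripts):

    def load_points3d(path):
        """Parse a COLMAP points3D.txt file into {point_id: point_data}."""
        points = {}
        with open(path, "r") as f:
            for line in f:
                if line.startswith("#") or not line.strip():
                    continue  # skip comments and blank lines
                elems = line.split()
                points[int(elems[0])] = {
                    "xyz": tuple(map(float, elems[1:4])),
                    "rgb": tuple(map(int, elems[4:7])),
                    "error": float(elems[7]),
                    # TRACK[] is a flat list of (IMAGE_ID, POINT2D_IDX) pairs.
                    "track": [(int(elems[i]), int(elems[i + 1]))
                              for i in range(8, len(elems), 2)],
                }
        return points
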

  4. Companion datasets for additional landmark categories: We provide download links for additional category types:

    Synagogues

    Images and captions (PENDING .zip file), correspondences (PENDING .zip file), reconstructions (PENDING .zip file)

    Mosques

    Images and captions (PENDING .zip file), correspondences (PENDING .zip file), reconstructions (PENDING .zip file)

Reproducing Results

  1. Minimum requirements. This project was originally developed with Python 3.6, PyTorch 1.0, and CUDA 9.0. Training requires at least one Titan X GPU (12 GB memory).

  2. Set up your Python environment. Clone the repository and install the dependencies:

    conda create -n <environment_name> --file requirements.txt -c conda-forge/label/cf202003
    conda activate <environment_name>
    conda install scikit-learn=0.21
    pip install opencv-python
    
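    As an optional sanity check, confirm that PyTorch and the GPU are visible from the new environment:

    python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
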
  3. Download the dataset. Download the data as detailed above, unzip it, and place it as follows: image and textual descriptions in <project>/data/, and the correspondence file in <project>.

  4. Download pre-trained models. Download the initial weights (pre-trained on ImageNet) for the backbone model and place them in <project>/models/weights/.

    Backbone | Initial Weights       | Comments
    ResNet50 | resnet50-19c8e357.pth | PyTorch official model
  5. Train on the WikiScenes dataset. See the instructions below. Note that the first run always takes longer due to pre-processing; some computations are cached afterwards.

Training, Inference and Evaluation

The directory launch contains template bash scripts for training, inference and evaluation.

Training. For each run, you need to specify two variables: EXP and RUN_ID. Running EXP=wiki RUN_ID=v01 ./launch/run_wikiscenes_resnet50.sh will create a directory ./logs/wikiscenes_corr/wiki/ with tensorboard events and save snapshots in ./snapshots/wikiscenes_corr/wiki/v01.

Inference.

If you want to run inference with our pre-trained model, first create a directory for it:

    mkdir -p ./snapshots/wikiscenes_corr/final/ours

Download our validation set and unzip it:

    unzip val_seg.zip

Run sh ./launch/infer_val_wikiscenes.sh to predict masks. You can find the predicted masks in ./logs/masks.

If you want to evaluate your own models, you will also need to specify:

  • EXP and RUN_ID, as used for training;
  • OUTPUT_DIR, the path where the masks will be saved;
  • SNAPSHOT, the model suffix in the format e000Xs0.000.
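
For example, assuming the evaluation scripts read these as environment variables in the same way as the training script (all values below are illustrative):

    EXP=wiki RUN_ID=v01 OUTPUT_DIR=./logs/masks SNAPSHOT=e024Xs-0.800 sh ./launch/infer_val_wikiscenes.sh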

Evaluation. To compute IoU of the masks, run sh ./launch/eval_seg.sh.

Pre-trained model

For testing, we provide our pre-trained ResNet50 model:

Backbone | Link
ResNet50 | model_enc_e024Xs-0.800.pth (157M)

Datasheet

We provide a datasheet for our dataset here.

License

The images in our dataset are provided by Wikimedia Commons under various free licenses. These licenses permit the use, study, derivation, and redistribution of these images, sometimes with restrictions (e.g., requiring attribution or copyleft). We provide full license text and attribution for all images, make no modifications to them, and release these images under their original licenses. The associated captions are provided as part of unstructured text in Wikimedia Commons, with rights to the original writers under the CC BY-SA 3.0 license. We modify these captions (as specified in our paper) and release such derivatives under the same license. We provide the rest of our dataset under a CC BY-NC-SA 4.0 license.

Citation

@inproceedings{Wu2021Towers,
 title={Towers of Babel: Combining Images, Language, and 3D Geometry for Learning Multimodal Vision},
 author={Wu, Xiaoshi and Averbuch-Elor, Hadar and Sun, Jin and Snavely, Noah},
 booktitle={ICCV},
 year={2021}
}

Acknowledgement

Our code is based on the implementation of Single-Stage Semantic Segmentation from Image Labels.
