Food Drinks and groceries Images Multi Lingual (FooDI-ML) dataset.

Overview

Foodi-ML dataset

This is the GitHub repository for the Food Drinks and groceries Images Multi Lingual (FooDI-ML) dataset. This dataset contains over 1.5M unique images and over 9.5M store names, product names, descriptions and collection sections gathered from the Glovo application. The data made available corresponds to food, drinks and groceries products from over 37 countries in Europe, the Middle East, Africa and Latin America. The dataset comprehends 33 languages, including 870k samples of languages of countries from Eastern Europe and West Asia such as Ukrainian and Kazakh, which have been so far underrepresented in publicly available visio-linguistic datasets. The dataset also includes widely spoken languages such as Spanish and English.

License

The FooDI-ML dataset is offered under the BY-NC-SA license.

1. Download the dataset

The FooDI-ML dataset is hosted in a S3 bucket in AWS. Therefore AWS CLI is needed to download it. Our dataset is composed of:

  • One DataFrame (glovo-foodi-ml-dataset) stored as a csv file containing all text information + image paths in S3. The size of this CSV file is 540 MB.
  • Set of images listed in the DataFrame. The disk space required to store all images is 316.1 GB.

1.1. Download AWS CLI

If you do not have AWS CLI already installed, please download the latest version of AWS CLI for your operating system.

1.2. Download FooDI-ML

  1. Run the following command to download the DataFrame in ENTER_DESTINATION_PATH directory. We provide an example as if we were going to download the dataset in the directory /mnt/data/foodi-ml/.

    aws s3 cp s3://glovo-products-dataset-d1c9720d/glovo-foodi-ml-dataset.csv ENTER_DESTINATION_PATH --no-sign-request

    Example: aws s3 cp s3://glovo-products-dataset-d1c9720d/glovo-foodi-ml-dataset.csv /mnt/data/foodi-ml/ --no-sign-request

  2. Run the following command to download the images in ENTER_DESTINATION_PATH/dataset directory (please note the appending of /dataset). This command will download the images in ENTER_DESTINATION_PATHdirectory.

    aws s3 cp --recursive s3://glovo-products-dataset-d1c9720d/dataset ENTER_DESTINATION_PATH/dataset --no-sign-request --quiet

    Example: aws s3 cp --recursive s3://glovo-products-dataset-d1c9720d/dataset /mnt/data/foodi-ml/dataset --no-sign-request --quiet

  3. Run the script rename_images.py. This script modifies the DataFrame column to include the paths of the images in the location you specified with ENTER_DESTINATION_PATH/dataset.

    pip install pandas
    python scripts/rename_images.py --output-dir ENTER_DESTINATION_PATH
    

Getting started

Our dataset is managed by the DataFrame glovo-foodi-ml-dataset.csv. This dataset contains the following columns:

  • country_code: This column comprehends 37 unique country codes as explained in our paper. These codes are:

    'ES', 'PL', 'CI', 'PT', 'MA', 'IT', 'AR', 'BG', 'KZ', 'BR', 'ME', 'TR', 'PE', 'SI', 'GE', 'EG', 'RS', 'RO', 'HR', 'UA', 'DO', 'KG', 'CR', 'UY', 'EC', 'HN', 'GH', 'KE', 'GT', 'CL', 'FR', 'BA', 'PA', 'UG', 'MD', 'NG', 'PR'

  • city_code: Name of the city where the store is located.

  • store_name: Name of the store selling that product. If store_name is equal to AS_XYZ, it represents an auxiliary store. This means that while the samples contained are for the most part valid, the store name can't be used in learning tasks

  • product_name: Name of the product. All products have product_name, so this column does not contain any NaN value.

  • collection_section: Name of the section of the product, used for organizing the store menu. Common values are "drinks", "our pizzas", "desserts". All products have collection_section associated to it, so this column does not have any NaN value in it.

  • product_description: A detailed description of the product, describing ingredients and components of it. Not all products of our data have description, so this column contains NaN values that must be removed by the researchers as a preprocessing step.

  • subset: Categorical variable indicating if the sample belongs to the Training, Validation or Test set. The respective values in the DataFrame are ["train", "val", "test"].

  • HIER: Boolean variable indicating if the store name can be used to retrieve product information (indicating if the store_name is not an auxiliary store (with code AS_XYZ)).

  • s3_path: Path of the image of the product in the disk location you chose.

Dataset Statistics

A notebook analyzing several dataset statistics is provided in notebooks/FooDI-ML Dataset Stats Analytics.ipynb.

Benchmark

To run the benchmark included in the original paper one must follow the procedure listed in the following link.

The hyperparameters of the model are included here link

Citation

This paper is under review. In the meanwhile you can cite it in arxiv: https://arxiv.org/abs/2110.02035

Pi-NAS: Improving Neural Architecture Search by Reducing Supernet Training Consistency Shift (ICCV 2021)

Π-NAS This repository provides the evaluation code of our submitted paper: Pi-NAS: Improving Neural Architecture Search by Reducing Supernet Training

Jiqi Zhang 18 Aug 18, 2022
Rax is a Learning-to-Rank library written in JAX

🦖 Rax: Composable Learning to Rank using JAX Rax is a Learning-to-Rank library written in JAX. Rax provides off-the-shelf implementations of ranking

Google 247 Dec 27, 2022
CNN visualization tool in TensorFlow

tf_cnnvis A blog post describing the library: https://medium.com/@falaktheoptimist/want-to-look-inside-your-cnn-we-have-just-the-right-tool-for-you-ad

InFoCusp 778 Jan 02, 2023
Differentiable scientific computing library

xitorch: differentiable scientific computing library xitorch is a PyTorch-based library of differentiable functions and functionals that can be widely

98 Dec 26, 2022
Research shows Google collects 20x more data from Android than Apple collects from iOS. Block this non-consensual telemetry using pihole blocklists.

pihole-antitelemetry Research shows Google collects 20x more data from Android than Apple collects from iOS. Block both using these pihole lists. Proj

Adrian Edwards 290 Jan 09, 2023
Video Autoencoder: self-supervised disentanglement of 3D structure and motion

Video Autoencoder: self-supervised disentanglement of 3D structure and motion This repository contains the code (in PyTorch) for the model introduced

157 Dec 22, 2022
Modeling CNN layers activity with Gaussian mixture model

GMM-CNN This code package implements the modeling of CNN layers activity with Gaussian mixture model and Inference Graphs visualization technique from

3 Aug 05, 2022
A set of Deep Reinforcement Learning Agents implemented in Tensorflow.

Deep Reinforcement Learning Agents This repository contains a collection of reinforcement learning algorithms written in Tensorflow. The ipython noteb

Arthur Juliani 2.2k Jan 01, 2023
Generating Anime Images by Implementing Deep Convolutional Generative Adversarial Networks paper

AnimeGAN - Deep Convolutional Generative Adverserial Network PyTorch implementation of DCGAN introduced in the paper: Unsupervised Representation Lear

Rohit Kukreja 23 Jul 21, 2022
Global-Local Context Network for Person Search

Global-Local Context Network for Person Search Abstract: Person search aims to jointly localize and identify a query person from natural, uncropped im

Peng Zheng 15 Oct 17, 2022
Liquid Warping GAN with Attention: A Unified Framework for Human Image Synthesis

Liquid Warping GAN with Attention: A Unified Framework for Human Image Synthesis, including human motion imitation, appearance transfer, and novel view synthesis. Currently the paper is under review

2.3k Jan 05, 2023
PyTorch implementation of neural style randomization for data augmentation

README Augment training images for deep neural networks by randomizing their visual style, as described in our paper: https://arxiv.org/abs/1809.05375

84 Nov 23, 2022
Multi-Task Deep Neural Networks for Natural Language Understanding

New Release We released Adversarial training for both LM pre-training/finetuning and f-divergence. Large-scale Adversarial training for LMs: ALUM code

Xiaodong 2.1k Dec 30, 2022
Tensorflow Tutorials using Jupyter Notebook

Tensorflow Tutorials using Jupyter Notebook TensorFlow tutorials written in Python (of course) with Jupyter Notebook. Tried to explain as kindly as po

Sungjoon 2.6k Dec 22, 2022
ReferFormer - Official Implementation of ReferFormer

The official implementation of the paper: Language as Queries for Referring Video Object Segmentation Language as Queries for Referring Video Object S

Jonas Wu 232 Dec 29, 2022
Convert Mission Planner (ArduCopter) Waypoint Missions to Litchi CSV Format to execute on DJI Drones

Mission Planner to Litchi Convert Mission Planner (ArduCopter) Waypoint Surveys to Litchi CSV Format to execute on DJI Drones Litchi doesn't support S

Yaros 24 Dec 09, 2022
GradAttack is a Python library for easy evaluation of privacy risks in public gradients in Federated Learning

GradAttack is a Python library for easy evaluation of privacy risks in public gradients in Federated Learning, as well as corresponding mitigation strategies.

129 Dec 30, 2022
This is the implementation of our work Deep Extreme Cut (DEXTR), for object segmentation from extreme points.

This is the implementation of our work Deep Extreme Cut (DEXTR), for object segmentation from extreme points.

Sergi Caelles 828 Jan 05, 2023
Code for the paper "Asymptotics of ℓ2 Regularized Network Embeddings"

README Code for the paper Asymptotics of L2 Regularized Network Embeddings. Requirements Requires Stellargraph 1.2.1, Tensorflow 2.6.0, scikit-learm 0

Andrew Davison 0 Jan 06, 2022
SOTA model in CIFAR10

A PyTorch Implementation of CIFAR Tricks 调研了CIFAR10数据集上各种trick,数据增强,正则化方法,并进行了实现。目前项目告一段落,如果有更好的想法,或者希望一起维护这个项目可以提issue或者在我的主页找到我的联系方式。 0. Requirement

PJDong 58 Dec 21, 2022