Data from "Datamodels: Predicting Predictions with Training Data"

Overview

Data from "Datamodels: Predicting Predictions with Training Data"

Here we provide the data used in the paper "Datamodels: Predicting Predictions with Training Data" (arXiv, Blog).

Note that all of the data below is stored on Amazon S3 using the “requester pays” option to avoid a blowup in our data transfer costs (we put estimated AWS costs below)---if you are on a budget and do not mind waiting a bit longer, please contact us at [email protected] and we can try to arrange a free (but slower) transfer.

Citation

To cite this data, please use the following BibTeX entry:

@inproceedings{ilyas2022datamodels,
  title = {Datamodels: Predicting Predictions from Training Data},
  author = {Andrew Ilyas and Sung Min Park and Logan Engstrom and Guillaume Leclerc and Aleksander Madry},
  booktitle = {ArXiv preprint arXiv:2202.00622},
  year = {2022}
}

Overview

We provide the data used in our paper to analyze two image classification datasets: CIFAR-10 and (a modified version of) FMoW.

For each dataset, the data consists of two parts:

  1. Training data for datamodeling, which consists of:
    • Training subsets or "training masks", which are the independent variables of the regression tasks; and
    • Model outputs (correct-class margins and logits), which are the dependent variables of the regression tasks.
  2. Datamodels estimated from this data using LASSO.

For each dataset, there are multiple versions of the data depending on the choice of the hyperparameter α, the subsampling fraction (this is the random fraction of training examples on which each model is trained; see Section 2 of our paper for more information).

Following table shows the number of models we trained and used for estimating datamodels (also see Table 1 in paper):

Subsampling α (%) CIFAR-10 FMoW
10 1,500,000 N/A
20 750,000 375,000
50 300,000 150,000
75 600,000 300,000

Training data

For each dataset and $\alpha$, we provide the following data:

# M is the number of models trained
/{DATASET}/data/train_masks_{PCT}pct.npy  # [M x N_train] boolean
/{DATASET}/data/test_margins_{PCT}pct.npy # [M x N_test] np.float16
/{DATASET}/data/test_margins_{PCT}pct.npy # [M x N_train] np.float16

(The files live in the Amazon S3 bucket madrylab-datamodels; we provide instructions for acces in the next section.)

Each row of the above matrices corresponds to one instance of model trained; each column corresponds to a training or test example. CIFAR-10 examples are organized in the default order; for FMoW, see here. For example, a train mask for CIFAR-10 has the shape [M x 50,000].

For CIFAR-10, we also provide the full logits for all ten classes:

/cifar/data/train_logits_{PCT}pct.npy  # [M x N_test x 10] np.float16
/cifar/data/test_logits_{PCT}pct.npy   # [M x N_test x 10] np.float16

Note that you can also compute the margins from these logits.

We include an addtional 10,000 models for each setting that we used for evaluation; the total number of models in each matrix is M as indicated in the above table plus 10,000.

Datamodels

All estimated datamodels for each split (train or test) are provided as a dictionary in a .pt file (load with torch.load):

/{DATASET}/datamodels/train_{PCT}pct.pt
/{DATASET}/datamodels/test_{PCT}pct.pt

Each dictionary contains:

  • weight: matrix of shape N_train x N, where N is either N_train or N_test depending on the group of target examples
  • bias: vector of length N, corresponding to biases for each datamodel
  • lam: vector of length N, regularization λ chosen by CV for each datamodel

Downloading

We make all of our data available via Amazon S3. Total sizes of the training data files are as follows:

Dataset, α (%) masks, margins (GB) logits (GB)
CIFAR-10, 10 245 1688
CIFAR-10, 20 123 849
CIFAR-10, 50 49 346
CIFAR-10, 75 98 682
FMoW, 20 25.4 -
FMoW, 50 10.6 -
FMoW, 75 21.2 -

Total sizes of datamodels data (the model weights) are 16.9 GB for CIFAR-10 and 0.75 GB for FMoW.

API

You can download them using the Amazon S3 CLI interface with the requester pays option as follows (replacing the fields {...} as appropriate):

aws s3api get-object --bucket madrylab-datamodels \
                     --key {DATASET}/data/{SPLIT}_{DATA_TYPE}_{PCT}.npy \
                     --request-payer requester \
                     [OUT_FILE]

For example, to retrieve the test set margins for CIFAR-10 models trained on 50% subsets, use:

aws s3api get-object --bucket madrylab-datamodels \
                     --key cifar/data/test_margins_50pct.npy \
                     --request-payer requester \
                     test_margins_50pct.npy

Pricing

The total data transfer fee (from AWS to internet) for all of the data is around $374 (= 4155 GB x 0.09 USD per GB).

If you only download everything except for the logits (which is sufficient to reproduce all of our analysis), the fee is around $53.

Loading data

The data matrices are in numpy array format (.npy). As some of these are quite large, you can read small segments without reading the entire file into memory by additionally specifying the mmap_mode argument in np.load:

X = np.load('train_masks_10pct.npy', mmap_mode='r')
Y = np.load('test_margins_10pct.npy', mmap_mode='r')
...
# Use segments, e.g, X[:100], as appropriate
# Run regress(X, Y[:]) using choice of estimation algorithm.

FMoW data

We use a customized version of the FMoW dataset from WILDS (derived from this original dataset) that restricts the year of the training set to 2012. Our code is adapted from here.

To use the dataset, first download WILDS using:

pip install wilds

(see here for more detailed instructions).

In our paper, we only use the in-distribution training and test splits in our analysis (the original version from WILDS also has out-of-distribution as well as validation splits). Our dataset splits can be constructed as follows and used like a PyTorch dataset:

from fmow import FMoWDataset

ds = FMoWDataset(root_dir='/mnt/nfs/datasets/wilds/',
                     split_scheme='time_after_2016')

transform_steps = [
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
    ]
transform = transforms.Compose(transform_steps)

ds_train = ds.get_subset('train', transform=transform)
ds_test = ds.get_subset('id_test', transform=transform)

The columns of matrix data described above is ordered according to the default ordering of examples given by the above constructors.

Owner
Madry Lab
Towards a Principled Science of Deep Learning
Madry Lab
Course files for "Ocean/Atmosphere Time Series Analysis"

time-series This package contains all necessary files for the course Ocean/Atmosphere Time Series Analysis, an introduction to data and time series an

Jonathan Lilly 107 Nov 29, 2022
A complete guide to start and improve in machine learning (ML)

A complete guide to start and improve in machine learning (ML), artificial intelligence (AI) in 2021 without ANY background in the field and stay up-to-date with the latest news and state-of-the-art

Louis-François Bouchard 3.3k Jan 04, 2023
A Powerful Serverless Analysis Toolkit That Takes Trial And Error Out of Machine Learning Projects

KXY: A Seemless API to 10x The Productivity of Machine Learning Engineers Documentation https://www.kxy.ai/reference/ Installation From PyPi: pip inst

KXY Technologies, Inc. 35 Jan 02, 2023
Code base of KU AIRS: SPARK Autonomous Vehicle Team

KU AIRS: SPARK Autonomous Vehicle Project Check this link for the blog post describing this project and the video of SPARK in simulation and on parkou

Mehmet Enes Erciyes 1 Nov 23, 2021
Repositório para o #alurachallengedatascience1

1° Challenge de Dados - Alura A Alura Voz é uma empresa de telecomunicação que nos contratou para atuar como cientistas de dados na equipe de vendas.

Sthe Monica 16 Nov 10, 2022
Merlion: A Machine Learning Framework for Time Series Intelligence

Merlion is a Python library for time series intelligence. It provides an end-to-end machine learning framework that includes loading and transforming data, building and training models, post-processi

Salesforce 2.8k Jan 05, 2023
MLOps pipeline project using Amazon SageMaker Pipelines

This project shows steps to build an end to end MLOps architecture that covers data prep, model training, realtime and batch inference, build model registry, track lineage of artifacts and model drif

AWS Samples 3 Sep 16, 2022
The easy way to combine mlflow, hydra and optuna into one machine learning pipeline.

mlflow_hydra_optuna_the_easy_way The easy way to combine mlflow, hydra and optuna into one machine learning pipeline. Objective TODO Usage 1. build do

shibuiwilliam 9 Sep 09, 2022
ML Kaggle Titanic Problem using LogisticRegrission

-ML-Kaggle-Titanic-Problem-using-LogisticRegrission here you will find the solution for the titanic problem on kaggle with comments and step by step c

Mahmoud Nasser Abdulhamed 3 Oct 23, 2022
🎛 Distributed machine learning made simple.

🎛 lazycluster Distributed machine learning made simple. Use your preferred distributed ML framework like a lazy engineer. Getting Started • Highlight

Machine Learning Tooling 44 Nov 27, 2022
Create large-scale ML-driven multiscale simulation ensembles to study the interactions

MuMMI RAS v0.1 Released: Nov 16, 2021 MuMMI RAS is the application component of the MuMMI framework developed to create large-scale ML-driven multisca

4 Feb 16, 2022
Pyomo is an object-oriented algebraic modeling language in Python for structured optimization problems.

Pyomo is a Python-based open-source software package that supports a diverse set of optimization capabilities for formulating and analyzing optimization models. Pyomo can be used to define symbolic p

Pyomo 1.4k Dec 28, 2022
scikit-learn: machine learning in Python

scikit-learn is a Python module for machine learning built on top of SciPy and is distributed under the 3-Clause BSD license. The project was started

neurodata 3 Dec 16, 2022
using Machine Learning Algorithm to classification AppleStore application

AppleStore-classification-with-Machine-learning-Algo- using Machine Learning Algorithm to classification AppleStore application. the first step : 1: p

Mohammed Hussien 2 May 02, 2022
Open MLOps - A Production-focused Open-Source Machine Learning Framework

Open MLOps - A Production-focused Open-Source Machine Learning Framework Open MLOps is a set of open-source tools carefully chosen to ease user experi

Data Revenue 590 Dec 28, 2022
In this Repo a simple Sklearn Model will be trained and pushed to MLFlow

SKlearn_to_MLFLow In this Repo a simple Sklearn Model will be trained and pushed to MLFlow Install This Repo is based on poetry python3 -m venv .venv

1 Dec 13, 2021
AutoX是一个高效的自动化机器学习工具,它主要针对于表格类型的数据挖掘竞赛。 它的特点包括: 效果出色、简单易用、通用、自动化、灵活。

English | 简体中文 AutoX是什么? AutoX一个高效的自动化机器学习工具,它主要针对于表格类型的数据挖掘竞赛。 它的特点包括: 效果出色: AutoX在多个kaggle数据集上,效果显著优于其他解决方案(见效果对比)。 简单易用: AutoX的接口和sklearn类似,方便上手使用。

4Paradigm 431 Dec 28, 2022
Simple structured learning framework for python

PyStruct PyStruct aims at being an easy-to-use structured learning and prediction library. Currently it implements only max-margin methods and a perce

pystruct 666 Jan 03, 2023
Automatically create Faiss knn indices with the most optimal similarity search parameters.

It selects the best indexing parameters to achieve the highest recalls given memory and query speed constraints.

Criteo 419 Jan 01, 2023
A Python implementation of the Robotics Toolbox for MATLAB

Robotics Toolbox for Python A Python implementation of the Robotics Toolbox for MATLAB® GitHub repository Documentation Wiki (examples and details) Sy

Peter Corke 1.2k Jan 07, 2023