Not All Unlabeled Data are Equal: Learning to Weight Data in Semi-supervised Learning
Overview
This code accompanies the paper: Not All Unlabeled Data are Equal: Learning to Weight Data in Semi-supervised Learning. Zhongzheng Ren*, Raymond A. Yeh*, Alexander G. Schwing. NeurIPS'20. (*equal contribution)
Setup
Important: ML_DATA is a shell environment variable that should point to the location where the datasets are installed. See the Install datasets section for more details.
Environment: this code is tested using python-3.7, anaconda3-5.0.1, cuda-10.0, cudnn-v7.6, tensorflow-1.15
Install dependencies
conda create -n semi-sup python=3.7
conda activate semi-sup
pip install -r requirements.txt
Make sure tf.test.is_gpu_available() == True after installation so that GPUs will be used.
Install datasets
export ML_DATA="path to where you want the datasets saved"
export PYTHONPATH=$PYTHONPATH:"path to this repo"
# Download datasets
CUDA_VISIBLE_DEVICES= ./scripts/create_datasets.py
cp $ML_DATA/svhn-test.tfrecord $ML_DATA/svhn_noextra-test.tfrecord
# Create unlabeled datasets
CUDA_VISIBLE_DEVICES= scripts/create_unlabeled.py $ML_DATA/SSL2/cifar10 $ML_DATA/cifar10-train.tfrecord
CUDA_VISIBLE_DEVICES= scripts/create_unlabeled.py $ML_DATA/SSL2/svhn $ML_DATA/svhn-train.tfrecord $ML_DATA/svhn-extra.tfrecord
CUDA_VISIBLE_DEVICES= scripts/create_unlabeled.py $ML_DATA/SSL2/svhn_noextra $ML_DATA/svhn-train.tfrecord
# Create semi-supervised subsets
for seed in 0 1 2 3 4 5; do
for size in 250 1000 4000; do
CUDA_VISIBLE_DEVICES= scripts/create_split.py --seed=$seed --size=$size $ML_DATA/SSL2/cifar10 $ML_DATA/cifar10-train.tfrecord
CUDA_VISIBLE_DEVICES= scripts/create_split.py --seed=$seed --size=$size $ML_DATA/SSL2/svhn $ML_DATA/svhn-train.tfrecord $ML_DATA/svhn-extra.tfrecord
CUDA_VISIBLE_DEVICES= scripts/create_split.py --seed=$seed --size=$size $ML_DATA/SSL2/svhn_noextra $ML_DATA/svhn-train.tfrecord
done
done
Running
Setup
All commands must be run from the project root. The following environment variables must be defined:
export ML_DATA="path to where you want the datasets saved"
export PYTHONPATH=$PYTHONPATH:"path to this repo"
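Before launching a run, it can help to confirm that both variables are actually set. The snippet below is a minimal sketch of such a check; the helper name `missing_env_vars` is hypothetical and is not part of this repo.

```python
import os

# The two variables this README asks you to export.
REQUIRED_VARS = ("ML_DATA", "PYTHONPATH")


def missing_env_vars(environ=None):
    """Return the required variables that are unset or empty.

    `environ` defaults to the real process environment; passing a dict
    makes the check easy to try out without touching your shell.
    """
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        print("Please export: " + ", ".join(missing))
    else:
        print("ML_DATA points to", os.environ["ML_DATA"])
```

Note that this only checks that the variables are non-empty; it does not verify that the datasets have been created under ML_DATA.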
Example
For example, to train a model with 32 filters on cifar10 shuffled with seed=1, 250 labeled samples, and 1000 validation samples:
# single-gpu
CUDA_VISIBLE_DEVICES=0 python main.py --filters=32 --dataset=cifar10.1@250-1000 --train_dir ./experiments
# multi-gpu: just pass more GPUs and the model automatically scales to them, here we assign GPUs 0-1 to the program:
CUDA_VISIBLE_DEVICES=0,1 python main.py --filters=32 --dataset=cifar10.1@250-1000 --train_dir ./experiments
Naming rule: ${dataset}.${seed}@${size}-${valid}
Available labelled sizes are 250, 1000, 4000.
For validation, available sizes are 1000, 5000.
Possible shuffling seeds are 1, 2, 3, 4, 5, and 0 for no shuffling (0 is not used in practice, since the data needs to be shuffled for gradient descent to work properly).
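The naming rule can be made concrete with a small helper that builds and parses dataset names. This is only a sketch that mirrors the rule and the sizes listed above; the function names are hypothetical, and the repo's own parsing may accept other forms.

```python
import re

# Values listed in this README.
LABELED_SIZES = (250, 1000, 4000)
VALID_SIZES = (1000, 5000)
SEEDS = (0, 1, 2, 3, 4, 5)

# ${dataset}.${seed}@${size}-${valid}
_NAME_RE = re.compile(
    r"^(?P<dataset>.+)\.(?P<seed>\d+)@(?P<size>\d+)-(?P<valid>\d+)$"
)


def make_dataset_name(dataset, seed, size, valid):
    """Build a name such as 'cifar10.1@250-1000', rejecting unlisted values."""
    if seed not in SEEDS:
        raise ValueError(f"unsupported seed: {seed}")
    if size not in LABELED_SIZES:
        raise ValueError(f"unsupported labeled size: {size}")
    if valid not in VALID_SIZES:
        raise ValueError(f"unsupported validation size: {valid}")
    return f"{dataset}.{seed}@{size}-{valid}"


def parse_dataset_name(name):
    """Split a name back into (dataset, seed, size, valid)."""
    match = _NAME_RE.match(name)
    if match is None:
        raise ValueError(f"not a valid dataset name: {name!r}")
    return (
        match["dataset"],
        int(match["seed"]),
        int(match["size"]),
        int(match["valid"]),
    )
```

For instance, `make_dataset_name("cifar10", 1, 250, 1000)` yields the name used in the example command, and `parse_dataset_name` recovers the four fields from it.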
Image classification
The hyper-parameters used in the paper:
# 2GPU setting is recommended
for seed in 1 2 3 4 5; do
for size in 250 1000 4000; do
CUDA_VISIBLE_DEVICES=0,1 python main.py --filters=32 \
--dataset=cifar10.${seed}@${size}-1000 \
--train_dir ./experiments --alpha 0.01 --inner_steps 512
done
done
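If you prefer to drive the sweep from Python (for example, to log or queue the jobs), the same loop can be written as a command generator. This is a sketch only: `sweep_commands` is a hypothetical helper, and it simply reproduces the flags shown in the shell loop above without running anything.

```python
import shlex


def sweep_commands(seeds=(1, 2, 3, 4, 5), sizes=(250, 1000, 4000), valid=1000):
    """Return one argv list per (seed, size) pair, mirroring the shell loop."""
    commands = []
    for seed in seeds:
        for size in sizes:
            commands.append([
                "python", "main.py", "--filters=32",
                f"--dataset=cifar10.{seed}@{size}-{valid}",
                "--train_dir", "./experiments",
                "--alpha", "0.01",
                "--inner_steps", "512",
            ])
    return commands


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    for argv in sweep_commands():
        print("CUDA_VISIBLE_DEVICES=0,1", " ".join(shlex.quote(a) for a in argv))
```

Each returned list can be passed to `subprocess.run` with `CUDA_VISIBLE_DEVICES` set in the environment if you want to execute the jobs sequentially.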
Flags
python main.py --help
# The following option might be too slow to be really practical.
# python main.py --helpfull
# So instead I use this hack to find the flags:
fgrep -R flags.DEFINE libml main.py
Monitoring training progress
You can point tensorboard to the training folder (by default it is --train_dir=./experiments) to monitor the training process:
tensorboard.sh --port 6007 --logdir ./experiments
Checkpoint accuracy
In the paper, we report the median accuracy of the last 20 checkpoints. It is computed with the following script:
# Following the previous example in which we trained cifar10.1@250-1000, extracting accuracy:
./scripts/extract_accuracy.py ./experiments/[email protected]/CTAugment_depth2_th0.80_decay0.990/FixMatch_alpha0.01_archresnet_batch64_confidence0.95_filters32_inf_warm0_inner_steps100_lr0.03_nclass10_repeat4_scales3_size_unlabeled49000_uratio7_wd0.0005_wu1.0
# The command above will create a stats/accuracy.json file in the model folder.
# The format is JSON so you can either see its content as a text file or process it to your liking.
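The statistic itself is simple to reproduce. The exact layout of stats/accuracy.json is not described here, so the sketch below assumes you have already extracted a plain list of test accuracies ordered from oldest to newest checkpoint; `median_of_last` is a hypothetical helper, not the repo's script.

```python
import statistics


def median_of_last(accuracies, k=20):
    """Median of the final `k` entries (or of all entries if fewer exist)."""
    if not accuracies:
        raise ValueError("no accuracies given")
    return statistics.median(accuracies[-k:])


if __name__ == "__main__":
    # Made-up numbers for illustration only: 30 checkpoints, 50.0 .. 79.0.
    fake_accuracies = [50.0 + i for i in range(30)]
    print(median_of_last(fake_accuracies))  # median of 60.0 .. 79.0 -> 69.5
```

Taking the median rather than the last value makes the reported number less sensitive to checkpoint-to-checkpoint noise at the end of training.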
Use your own data
- You first need to create *.tfrecord files for the labeled and unlabeled data; please check scripts/create_datasets.py and scripts/create_unlabeled.py for examples.
- Then you need to create the splits for semi-supervised learning; see scripts/create_split.py.
- Modify libml/data.py to support the new dataset. Specifically, check this function and this class.
- Tune hyper-parameters (e.g., learning rate, num_epochs, etc.) to achieve the best results.
Note: our algorithm involves approximating the inverse Hessian and computing per-example gradients. Therefore, running on a dataset with a large number of classes will be computationally heavy in terms of both speed and memory.
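To see why the number of classes matters, consider a toy linear softmax classifier. The sketch below is not the code used in this repo; it is a pure-Python illustration, under that toy assumption, of how keeping one gradient per example (instead of a single averaged gradient) requires storage proportional to batch size × number of classes × feature dimension.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def per_example_grads(weights, inputs, labels):
    """Cross-entropy gradients w.r.t. `weights`, one C x D matrix per example.

    weights: C rows of D floats; inputs: list of D-vectors; labels: class ids.
    """
    grads = []
    for x, y in zip(inputs, labels):
        logits = [sum(w * v for w, v in zip(row, x)) for row in weights]
        probs = softmax(logits)
        # d(loss) / d(W[c][d]) = (p_c - 1[c == y]) * x_d
        grads.append([
            [(p - (1.0 if c == y else 0.0)) * v for v in x]
            for c, p in enumerate(probs)
        ])
    return grads


def stored_floats(batch_size, num_classes, feature_dim):
    """Number of floats needed to hold every per-example gradient."""
    return batch_size * num_classes * feature_dim
```

With a batch of 64 and 128 features, going from 10 to 100 classes raises `stored_floats` from 81,920 to 819,200, a tenfold increase for this layer alone.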
License
Please check LICENSE
Citing this work
If you use this code for your research, please cite our paper.
@inproceedings{ren-ssl2020,
title = {Not All Unlabeled Data are Equal: Learning to Weight Data in Semi-supervised Learning},
author = {Zhongzheng Ren$^\ast$ and Raymond A. Yeh$^\ast$ and Alexander G. Schwing},
booktitle = {Neural Information Processing Systems (NeurIPS)},
year = {2020},
note = {$^\ast$ equal contribution},
}
Acknowledgement
The code is built based on: FixMatch (commit: 08d9b83)
FixMatch: Simplifying Semi-Supervised Learning with Consistency and Confidence. Kihyuk Sohn, David Berthelot, Chun-Liang Li, Zizhao Zhang, Nicholas Carlini, Ekin D. Cubuk, Alex Kurakin, Han Zhang, and Colin Raffel.
Contact
GitHub issues and PRs are preferred. Feel free to contact Jason Ren (zr5 AT illinois.edu) with any questions!