Highlights
Models
Multi-weight support API
TorchVision v0.13 offers a new Multi-weight support API for loading different weights into the existing model builder methods:
```python
from torchvision.models import *

# Old weights with accuracy 76.130%
resnet50(weights=ResNet50_Weights.IMAGENET1K_V1)

# New weights with accuracy 80.858%
resnet50(weights=ResNet50_Weights.IMAGENET1K_V2)

# Best available weights (currently alias for IMAGENET1K_V2)
# Note that these weights may change across versions
resnet50(weights=ResNet50_Weights.DEFAULT)

# Strings are also supported
resnet50(weights="IMAGENET1K_V2")

# No weights - random initialization
resnet50(weights=None)
```
Along with the weights, the new API bundles important details such as the preprocessing transforms and meta-data such as labels. Here is how to make the most out of it:
```python
from torchvision.io import read_image
from torchvision.models import resnet50, ResNet50_Weights

img = read_image("test/assets/encode_jpeg/grace_hopper_517x606.jpg")

# Step 1: Initialize model with the best available weights
weights = ResNet50_Weights.DEFAULT
model = resnet50(weights=weights)
model.eval()

# Step 2: Initialize the inference transforms
preprocess = weights.transforms()

# Step 3: Apply inference preprocessing transforms
batch = preprocess(img).unsqueeze(0)

# Step 4: Use the model and print the predicted category
prediction = model(batch).squeeze(0).softmax(0)
class_id = prediction.argmax().item()
score = prediction[class_id].item()
category_name = weights.meta["categories"][class_id]
print(f"{category_name}: {100 * score:.1f}%")
```
You can read more about the new API in the docs. To provide your feedback, please use this dedicated GitHub issue.
New architectures and model variants
Classification
The Swin Transformer and EfficientNetV2 are two popular classification models which are often used for downstream vision tasks. This release includes 6 pre-trained weights for their classification variants. Here is how to use the new models:
```python
import torch
from torchvision.models import *

image = torch.rand(1, 3, 224, 224)
model = swin_t(weights="DEFAULT").eval()
prediction = model(image)

image = torch.rand(1, 3, 384, 384)
model = efficientnet_v2_s(weights="DEFAULT").eval()
prediction = model(image)
```
In addition to the above, we also provide new variants of existing architectures such as ShuffleNetV2, ResNeXt and MNASNet. The accuracies of all the new pre-trained models on ImageNet-1K are shown below:
Model | Acc@1 | Acc@5
-- | -- | --
swin_t | 81.474 | 95.776
swin_s | 83.196 | 96.36
swin_b | 83.582 | 96.64
efficientnet_v2_s | 84.228 | 96.878
efficientnet_v2_m | 85.112 | 97.156
efficientnet_v2_l | 85.808 | 97.788
resnext101_64x4d | 83.246 | 96.454
resnext101_64x4d (quantized) | 82.898 | 96.326
shufflenet_v2_x1_5 | 72.996 | 91.086
shufflenet_v2_x1_5 (quantized) | 72.052 | 90.700
shufflenet_v2_x2_0 | 76.230 | 93.006
shufflenet_v2_x2_0 (quantized) | 75.354 | 92.488
mnasnet0_75 | 71.180 | 90.496
mnasnet1_3 | 76.506 | 93.522
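These new variants can be loaded through the same multi-weight builders as above. A minimal sketch, with resnext101_64x4d chosen arbitrarily from the table (the other builders work the same way):

```python
import torch
from torchvision.models import resnext101_64x4d

# shufflenet_v2_x1_5, shufflenet_v2_x2_0, mnasnet0_75 and mnasnet1_3
# follow exactly the same pattern.
model = resnext101_64x4d(weights="DEFAULT").eval()
prediction = model(torch.rand(1, 3, 224, 224))
```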
We would like to thank Hu Ye for contributing the Swin Transformer implementation to TorchVision.
[BETA] Object Detection and Instance Segmentation
We have introduced 3 new model variants for RetinaNet, FasterRCNN and MaskRCNN that include several post-paper architectural optimizations and improved training recipes. All models can be used similarly:
```python
import torch
from torchvision.models.detection import *

images = [torch.rand(3, 800, 600)]
model = retinanet_resnet50_fpn_v2(weights="DEFAULT")
# model = fasterrcnn_resnet50_fpn_v2(weights="DEFAULT")
# model = maskrcnn_resnet50_fpn_v2(weights="DEFAULT")
model.eval()

prediction = model(images)
```
Below we present the metrics of the new variants on COCO val2017. In parentheses we denote the improvement over the old variants:
Model | Box mAP | Mask mAP
-- | -- | --
retinanet_resnet50_fpn_v2 | 41.5 (+5.1) | -
fasterrcnn_resnet50_fpn_v2 | 46.7 (+9.7) | -
maskrcnn_resnet50_fpn_v2 | 47.4 (+9.5) | 41.8 (+7.2)
We would like to thank Ross Girshick, Piotr Dollar, Vaibhav Aggarwal, Francisco Massa and Hu Ye for their past research and contributions to this work.
New pre-trained weights
SWAG weights
The ViT and RegNet model variants offer new pre-trained SWAG (Supervised Weakly from hashtAGs) weights. The largest of these models achieves a whopping 88.6% top-1 accuracy on ImageNet-1K. We currently offer two versions of the weights: 1) weights fine-tuned end-to-end on ImageNet-1K (highest accuracy) and 2) frozen trunk weights with a linear classifier fit on ImageNet-1K (great for transfer learning). Below are the detailed accuracies of each model variant:
Model Weights | Acc@1 | Acc@5
-- | -- | --
RegNet_Y_16GF_Weights.IMAGENET1K_SWAG_E2E_V1 | 86.012 | 98.054
RegNet_Y_16GF_Weights.IMAGENET1K_SWAG_LINEAR_V1 | 83.976 | 97.244
RegNet_Y_32GF_Weights.IMAGENET1K_SWAG_E2E_V1 | 86.838 | 98.362
RegNet_Y_32GF_Weights.IMAGENET1K_SWAG_LINEAR_V1 | 84.622 | 97.48
RegNet_Y_128GF_Weights.IMAGENET1K_SWAG_E2E_V1 | 88.228 | 98.682
RegNet_Y_128GF_Weights.IMAGENET1K_SWAG_LINEAR_V1 | 86.068 | 97.844
ViT_B_16_Weights.IMAGENET1K_SWAG_E2E_V1 | 85.304 | 97.65
ViT_B_16_Weights.IMAGENET1K_SWAG_LINEAR_V1 | 81.886 | 96.18
ViT_L_16_Weights.IMAGENET1K_SWAG_E2E_V1 | 88.064 | 98.512
ViT_L_16_Weights.IMAGENET1K_SWAG_LINEAR_V1 | 85.146 | 97.422
ViT_H_14_Weights.IMAGENET1K_SWAG_E2E_V1 | 88.552 | 98.694
ViT_H_14_Weights.IMAGENET1K_SWAG_LINEAR_V1 | 85.708 | 97.73
The weights can be loaded normally as follows:
```python
from torchvision.models import *

model1 = vit_h_14(weights="IMAGENET1K_SWAG_E2E_V1")
model2 = vit_h_14(weights="IMAGENET1K_SWAG_LINEAR_V1")
```
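Since the LINEAR weights come from fitting a linear classifier on a frozen trunk, they are a natural starting point for transfer learning. Below is a minimal sketch rather than an official recipe: the 10-class head is purely illustrative, and it assumes the standard torchvision ViT head layout (`model.heads.head`):

```python
import torch.nn as nn
from torchvision.models import vit_b_16, ViT_B_16_Weights

# Load the SWAG linear-probe weights and freeze the trunk
model = vit_b_16(weights=ViT_B_16_Weights.IMAGENET1K_SWAG_LINEAR_V1)
for param in model.parameters():
    param.requires_grad = False

# Replace the classification head for a hypothetical 10-class task;
# only this layer will be trained.
model.heads.head = nn.Linear(model.heads.head.in_features, 10)
```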
The SWAG weights are released under the Attribution-NonCommercial 4.0 International license. We would like to thank Laura Gustafson, Mannat Singh and Aaron Adcock for their work and support in making the weights available to TorchVision.
Model Refresh
The release of the Multi-weight support API enabled us to refresh the most popular models and offer more accurate weights. On average, we improved the accuracy of each model by roughly 3 points. The new recipe was developed on top of ResNet50 and its details were covered in a previous blog post.
Model | Old Acc@1 | New Acc@1
-- | -- | --
efficientnet_b1 | 78.642 | 79.838
mobilenet_v2 | 71.878 | 72.154
mobilenet_v3_large | 74.042 | 75.274
regnet_y_400mf | 74.046 | 75.804
regnet_y_800mf | 76.42 | 78.828
regnet_y_1_6gf | 77.95 | 80.876
regnet_y_3_2gf | 78.948 | 81.982
regnet_y_8gf | 80.032 | 82.828
regnet_y_16gf | 80.424 | 82.886
regnet_y_32gf | 80.878 | 83.368
regnet_x_400mf | 72.834 | 74.864
regnet_x_800mf | 75.212 | 77.522
regnet_x_1_6gf | 77.04 | 79.668
regnet_x_3_2gf | 78.364 | 81.196
regnet_x_8gf | 79.344 | 81.682
regnet_x_16gf | 80.058 | 82.716
regnet_x_32gf | 80.622 | 83.014
resnet50 | 76.13 | 80.858
resnet50 (quantized) | 75.92 | 80.282
resnet101 | 77.374 | 81.886
resnet152 | 78.312 | 82.284
resnext50_32x4d | 77.618 | 81.198
resnext101_32x8d | 79.312 | 82.834
resnext101_32x8d (quantized) | 78.986 | 82.574
wide_resnet50_2 | 78.468 | 81.602
wide_resnet101_2 | 78.848 | 82.51
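The refreshed weights are exposed as the `IMAGENET1K_V2` entries of each weight enum (and via the `DEFAULT` alias), so picking them up is a one-line change. A minimal sketch using regnet_y_400mf as an example:

```python
from torchvision.models import regnet_y_400mf, RegNet_Y_400MF_Weights

old_model = regnet_y_400mf(weights=RegNet_Y_400MF_Weights.IMAGENET1K_V1)  # 74.046 Acc@1
new_model = regnet_y_400mf(weights=RegNet_Y_400MF_Weights.IMAGENET1K_V2)  # 75.804 Acc@1
```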
We would like to thank Piotr Dollar, Mannat Singh and Hugo Touvron for their past research and contributions to this work.
Ops and Transforms
New Augmentations, Layers and Losses
This release brings a bunch of new primitives which can be used to produce SOTA models. Some highlights include the addition of the AugMix data-augmentation method, the DropBlock layer, the cIoU/dIoU losses and many more. We would like to thank Aditya Oke, Abhijit Deo, Yassine Alouini and Hu Ye for contributing to the project and for helping us keep TorchVision relevant and fresh.
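As a rough illustration, here is a minimal sketch on dummy tensors; it assumes the AugMix transform and the drop_block2d / complete_box_iou_loss / distance_box_iou_loss functions exposed in this release, and the parameter values are illustrative only:

```python
import torch
from torchvision.transforms import AugMix
from torchvision.ops import drop_block2d, complete_box_iou_loss, distance_box_iou_loss

# AugMix data augmentation on a uint8 image tensor
augmix = AugMix(severity=3, mixture_width=3)
augmented = augmix(torch.randint(0, 256, (3, 224, 224), dtype=torch.uint8))

# DropBlock regularization applied to a feature map
features = torch.rand(1, 64, 56, 56)
regularized = drop_block2d(features, p=0.1, block_size=7, training=True)

# cIoU / dIoU losses on boxes in (x1, y1, x2, y2) format
boxes1 = torch.tensor([[10.0, 10.0, 50.0, 50.0]])
boxes2 = torch.tensor([[12.0, 15.0, 48.0, 60.0]])
ciou = complete_box_iou_loss(boxes1, boxes2, reduction="mean")
diou = distance_box_iou_loss(boxes1, boxes2, reduction="mean")
```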
Documentation
We completely revamped our models documentation to make it easier to browse, and added various key information such as supported image sizes and the image pre-processing steps of pre-trained weights. We now have a main model page with various summary tables of available weights, and each model has a dedicated page. Each model builder is also documented in its own page, with more details about the available weights, including accuracy, minimal image size, link to training recipes, and other valuable info. For comparison, our previous models docs are here. To provide feedback on the new documentation, please use the dedicated GitHub issue.
Backward-incompatible changes
The new Multi-weight support API replaced the legacy “pretrained” parameter of model builders. Both solutions are currently supported to maintain backwards compatibility but our intention is to remove the deprecated API in 2 versions. Migrating to the new API is very straightforward. The following method calls between the 2 APIs are all equivalent:
```python
from torchvision.models import resnet50, ResNet50_Weights

# Using pretrained weights:
resnet50(weights=ResNet50_Weights.IMAGENET1K_V1)
resnet50(weights="IMAGENET1K_V1")
resnet50(pretrained=True)  # deprecated
resnet50(True)  # deprecated

# Using no weights:
resnet50(weights=None)
resnet50()
resnet50(pretrained=False)  # deprecated
resnet50(False)  # deprecated
```
Deprecations
[models, models.quantization] Reinstate and deprecate model_urls and quant_model_urls (#5992)
[transforms] Deprecate int as interpolation argument type (#5974)
New Features
[models] New Multi-weight API support (#5618, #5859, #6047, #6026, #5848)
[models] Adding Swin Transformer architecture (#5491)
[models] Adding EfficientNetV2 architecture (#5450)
[models] Adding improved detection model weights: RetinaNet, MaskRCNN, FasterRCNN (#5756, #5773, #5763)
[models] Adding classification model weights: resnext101_64x4d, mnasnet0_75, mnasnet1_3 (#5935, #6019)
[models] Add SWAG model pretrained weights (#5714, #5722, #5732, #5793, #5721)
[ops] Adding IoU loss function variants: DIoU, CIoU (#5786, #5776)
[ops] Adding various ops and tests for ops (#6053, #5416, #5792, #5783)
[transforms] Adding AugMix transforms implementation (#5411)
[reference scripts] Support custom weight decay setting in classification reference script (#5671)
[transforms, reference scripts] Improve detection reference script: Scale Jitter, RandomShortestSize, FixedSizeCrop (#5435, #5610, #5607)
[ci] Add M1 support (#6167)
[ci] Add Python-3.10 (build and test) (#5420)
Improvements
[documentation] Complete new revamp of models documentation (#5821, #5876, #5899, #6025, #5885, #5884, #5886, #5891, #6023, #6009, #5852, #5831, #5832, #6003, #6013, #5856, #6004, #6005, #5878, #6012, #5894, #6002, #5854, #5864, #5920, #5869, #5871, #6021, #6006, #6016, #5905, #6028, #5915, #5924, #5977, #5918, #5921, #5934, #5936, #5937, #5933, #5949, #5988, #5962, #5963, #5975, #5900, #5917, #5895, #5901, #6033, #6032, #6030, #5904, #5661, #6035, #6049, #6036, #5908, #5907, #6044, #6039, #5874, #6151)
[documentation] Various documentation improvements (#5695, #5930, #5814, #5799, #5827, #5796, #5923, #5599, #5554, #5995, #5457, #6163, #6031, #6000, #5847, #6024)
[documentation] Add warnings in docs to document Beta APIs (#6115)
[datasets] improve GDrive downloads (#5704, #5645)
[datasets] indicate md5 checksum is not used for security (#5717)
[models] Add shufflenetv2 1.5 and 2.0 weights (#5906)
[models] Reduce unnecessary cuda sync in anchor_utils.py (#5515)
[models] Adding improved MobileNetV2 weights (#5560)
[models] Remove (N, T, H, W, C) => (N, T, C, H, W) from presets (#6058)
[models] add swin_s and swin_b variants and improved swin_t (#6048)
[models] Update ShuffleNetV2 annotations for x1_5 and x2_0 variants (#6022)
[models] Better error message in ViT (#5820)
[models, ops] Add private support for ciou and diou (#5984, #5685, #5690)
[models, reference scripts] Various improvements to detection recipe and models (#5715, #5444)
[transforms, tests] add functional vertical flip tests on segmentation mask (#5860)
[transforms] make _max_value jit-scriptable (#5623)
[transforms] Make ScaleJitter proportional (#5559)
[transforms] add tensor kernels for normalize and erase (#5462)
[transforms] Update transforms following PIL deprecation (#5898)
[transforms, models, datasets…] Replace asserts with exceptions (#5587, #5659)
[utils] add warning if font is not set in draw_bounding_boxes (#5785)
[utils] Throw warning for empty masks or box tensors on draw_segmentation_masks and draw_bounding_boxes (#5857)
[video] Add output_format to video datasets and readers (#6061)
[video, io] Better compatibility with FFMPEG 5.0 (#5644)
[video, io] Allow cuda device to be passed without the index for GPU decoding (#5505)
[reference scripts] Simplify EMA to use PyTorch's update_parameters (#5469)
[reference scripts] Reduce variance of evaluation in reference (#5819)
[reference scripts] Various improvements to RAFT training reference (#5590)
[tests] Speed up Model tests by 20% (#5574)
[tests] Make test suite fail on unexpected test success (#5556)
[tests] Skip big model in test to reduce memory usage in CI (#5903, #5902)
[tests] Improve test of backbone utils (#5552)
[tests] Validate against expected files on videos (#6077)
[ci] Support for CUDA 11.6 (#5803, #5862)
[ci] pre-download model weights in CI docs build (#5625)
Bug Fixes
[transforms] remove option to pass fill as str in transforms (#5632)
[transforms] Better handling for Pad's fill argument (#5596)
[transforms] [FBcode->GH] Fix accimage tests (#5545)
[transforms] Update _pil_constants.py (#6154) (#6156)
[transforms] Fix resize transform when size == small_edge_size and max_size isn't None (#5409)
[transforms] Fixed rotate transform with expand inconsistency (#5677)
[transforms] Fixed upstream issue with padding (#5875)
[transforms] Fix functional.adjust_gamma (#5427)
[models] Respect strict=False when loading detection models (#5841)
[models] Fix resnet norm initialization (#6082) (#6085)
[models] Use frozen BN only if pre-trained for detection models. (#5443)
[models] fix fcos gtarea calculation (#5816)
[models, onnx] Add topk min function for trace and onnx (#5310)
[models, tests] fix mobilenet norm layer test (#5643)
[reference scripts] Fix regression on Detection training script (#5985)
[datasets] do not re-download from GDrive if file is already present (#5805)
[datasets] Fix datasets: kinetics, Flowers102, VOC_2009, INaturalist 2021_train, caltech (#5578, #5775, #5425, #5844, #5789)
[documentation] Fixes device mismatch issue while building docs (#5428)
[documentation] Fix Accuracy meta-data on shufflenetv2 (#5896)
[documentation] fix typo in docstrings of some transforms (#5609)
[video, documentation] Fix append of audio_pts (#5488)
[io, tests] More robust check in tests for 16 bits images (#5652)
[video, io] Fix shape mismatch error in video reader (#5489)
[io] Address nvjpeg leak on CUDA < 11.6 issue (#5713, #5482)
[ci] Fixing issue with setup_env.sh in docker: resolve "unsafe directory" error (#6106) (#6109)
[ci] fix documentation version problems when new release is tagged (#5583)
[ci] Replace jcenter and fix version for android (#6046)
[tests] Add .float() before .mean() on test_backbone_utils.py because .mean() doesn't accept integer dtype (#6090) (#6091)
[tests] Fix keypointrcnn_resnet50_fpn flaky test (#5911)
[tests] Disable test_encode|write_jpeg_reference tests (#5910)
[mobile] Bump up LibTorchvision version number for Podspec to release Cocoapods (#5624)
[feature extraction] Add default tracer args for model feature extraction function (#5637)
[build] Fix libtorchvision.so not able to encode images by adding *_FOUND macros to CMakeLists.txt (#5547)
Code Quality
[dataset, models] Better deprecation message for voc2007 and SqueezeExcitation (#5391)
[datasets, reference scripts] Use Kinetics instead of Kinetics400 in references (#5787) (#5952)
[models] CleanUp DenseNet code (#5966)
[models] Minor Swin Transformer fixes (#6054)
[models, onnx] Use onnx function only in tracing mode (#5468)
[models] Refactor swin transformer so we can later reuse components for the 3D version (#6088) (#6100)
[models, tests] Fix minor issues with model tests. (#5576)
[transforms] Remove to_tensor() and ToTensor() usages (#5553)
[transforms] Refactor Augmentation Space calls to speed up. (#5402)
[transforms] Recoded _max_value method using a dictionary (#5566)
[transforms] Replace get_image_size/num_channels with get_dimensions (#5487)
[ops] Replace usages of atomicAdd with gpuAtomicAdd (#5823)
[ops] Fix unused variable warning in ps_roi_align_kernel.cu (#5408)
[ops] Remove custom ops interpolation with antialiasing (#5329)
[ops] Move Permute layer to ops. (#6055)
[ops] Remove assertions for generalized_box_iou (#5691)
[utils] Moving sequence_to_str to torchvision._utils (#5604)
[utils] Clarify TypeError message in make_grid (#5997)
[video, io] replace distutils.spawn with shutil.which per PEP632 in setup script (#5849)
[video, io] Move VideoReader out of init (#5495)
[video, io] Remove unnecessary initialisation in GPUDecoder (#5507)
[video, io] Remove unused member variable and argument in GPUDecoder (#5499)
[video, io] Improve test_video_reader (#5498)
[video, io] Update private attribute name for readability (#5484)
[video, tests] Improve test_videoapi (#5497)
[reference scripts] Minor updates to optical flow ref for consistency (#5654)
[reference scripts] Add barrier() after init_process_group() (#5475)
[ci] Delete stale packaging scripts (#5433)
[ci] remove explicit install of Pillow throughout CI (#5950)
[ci, test] remove unnecessary pytest install (#5739)
[ci, tests] Remove unnecessary PYTORCH_TEST_WITH_SLOW env (#5631)
[ci] Add .git-blame-ignore-revs to ignore specific commits in git blame (#5696)
[ci] Remove CUDA 11.1 support (#5477, #5470, #5451, #5978)
[ci] Minor linting improvement (#5880)
[ci] Remove Bandit and CodeQL jobs (#5734)
[ci] Various type annotation fixes / issues (#5598, #5970, #5943)
Contributors
We're grateful for our community, which helps us improve torchvision by submitting issues and PRs, and by providing feedback and suggestions. The following persons have contributed patches for this release:
Abhijit Deo, Aditya Oke, Andrey Talman, Anton Thomma, Behrooz, Bruno Korbar, Daniel Angelov, Dbhasin1, Drishti Bhasin, F-G Fernandez, Federico Pozzi, FG Fernandez, Georg Grab, Gouvernathor, Hu Ye, Jeffery (Zeyu) Zhao, Joao Gomes, kaijieshi, Kazuki Adachi, KyleCZH, kylematoba, LEGRAND Matthieu, Lezwon Castelino, Luming Tang, Matti Picus, Nicolas Hug, Nikita, Nikita Shulga, oxabz, Philip Meier, Prabhat Roy, puhuk, Richard Barnes, Sahil Goyal, satojkovic, Shijie, Shubham Bhokare, talregev, tcmyxc, Vasilis Vryniotis, vfdev, WuZhe, XiaobingZhang, Xu Zhao, Yassine Alouini, Yonghye Kwon, YosuaMichael, Yulv-git, Zhiqiang Wang