The lightweight PyTorch wrapper for high-performance AI research. Scale your models, not the boilerplate.

Overview

*Codecov coverage is above 90%, but build delays may show a lower figure.

PyTorch Lightning is just organized PyTorch

Lightning disentangles PyTorch code to decouple the science from the engineering.


Lightning Design Philosophy

Lightning imposes the following structure on your code, which makes it reusable and shareable:

  • Research code (the LightningModule).
  • Engineering code (you delete it; the Trainer handles it).
  • Non-essential research code (logging, etc.; this goes in Callbacks).
  • Data (use PyTorch DataLoaders or organize them into a LightningDataModule).

Once you do this, you can train on multiple GPUs, TPUs, CPUs, and even in 16-bit precision without changing your code!
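
To make the split concrete, here is a framework-free toy sketch. It is not Lightning's API or implementation: `ToyModule` plays the role of the research code, and the hypothetical `toy_fit` function stands in for the loop that a trainer would own. Every name below is invented for illustration.

```python
# Toy illustration only: plain Python, no PyTorch or Lightning involved.


class ToyModule:
    """Research code: the model, its loss, and its update rule."""

    def __init__(self):
        self.w = 0.0  # single learnable parameter

    def training_step(self, batch):
        # Fit y = w * x with squared error; return the loss and its gradient.
        x, y = batch
        error = self.w * x - y
        return error ** 2, 2 * error * x

    def apply_gradient(self, grad, lr=0.05):
        self.w -= lr * grad


def toy_fit(module, batches, max_epochs=50):
    """Engineering code: the loop a trainer would own."""
    loss = None
    for _ in range(max_epochs):
        for batch in batches:
            loss, grad = module.training_step(batch)
            module.apply_gradient(grad)
    return loss


model = ToyModule()
final_loss = toy_fit(model, [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)])
```

The point of the separation is that `toy_fit` never needs to change when the model does; only `ToyModule` carries problem-specific logic.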

Get started with our 2 step guide


Continuous Integration

Lightning is rigorously tested across multiple GPUs, TPUs, and CPUs, and against major Python and PyTorch versions.

Current build statuses

System / PyTorch ver.      1.7 (min. req.)   1.8 (LTS)      1.9    1.10 (latest)
Linux py3.7 [GPUs**]       -                 Build Status   -      -
Linux py3.7 [TPUs***]      -                 CircleCI       -      -
Linux py3.8 (with Conda)   Test              Test           Test   Test
Linux py3.{7,9}            Test              -              -      Test
OSX py3.{7,9}              Test              -              -      Test
Windows py3.{7,9}          Test              -              -      Test
Linux py3.6                Test              -              -      -
OSX py3.6                  Test              -              -      -
Windows py3.6              Test              -              -      -

  • ** tests run on two NVIDIA P100
  • *** tests run on Google GKE TPUv2/3. TPU py3.7 means we support Colab and Kaggle env.

How To Use

Step 0: Install

Simple installation from PyPI

pip install pytorch-lightning
Other installation options

Install with optional dependencies

pip install "pytorch-lightning[extra]"

Conda

conda install pytorch-lightning -c conda-forge

Install stable 1.5.x

The current status of 1.5 [stable] is as follows:

CI basic testing CI complete testing PyTorch & Conda TPU tests Docs check

Install future release from the source

pip install git+https://github.com/PytorchLightning/pytorch-lightning.git@release/1.5.x --upgrade

Install bleeding-edge - future 1.6

Install nightly from the source (no guarantees)

pip install https://github.com/PyTorchLightning/pytorch-lightning/archive/master.zip

or from testing PyPI

pip install -U -i https://test.pypi.org/simple/ pytorch-lightning

Step 1: Add these imports

import os
import torch
from torch import nn
import torch.nn.functional as F
from torchvision.datasets import MNIST
from torch.utils.data import DataLoader, random_split
from torchvision import transforms
import pytorch_lightning as pl

Step 2: Define a LightningModule (nn.Module subclass)

A LightningModule defines a full system (e.g., a GAN, autoencoder, BERT, or a simple image classifier).

class LitAutoEncoder(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(28 * 28, 128), nn.ReLU(), nn.Linear(128, 3))
        self.decoder = nn.Sequential(nn.Linear(3, 128), nn.ReLU(), nn.Linear(128, 28 * 28))

    def forward(self, x):
        # in lightning, forward defines the prediction/inference actions
        embedding = self.encoder(x)
        return embedding

    def training_step(self, batch, batch_idx):
        # training_step defines the train loop. It is independent of forward
        x, y = batch
        x = x.view(x.size(0), -1)
        z = self.encoder(x)
        x_hat = self.decoder(z)
        loss = F.mse_loss(x_hat, x)
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=1e-3)
        return optimizer

Note: training_step defines the training loop. forward defines how the LightningModule behaves during inference/prediction.

Step 3: Train!

dataset = MNIST(os.getcwd(), download=True, transform=transforms.ToTensor())
train, val = random_split(dataset, [55000, 5000])

autoencoder = LitAutoEncoder()
trainer = pl.Trainer()
trainer.fit(autoencoder, DataLoader(train), DataLoader(val))

Advanced features

Lightning has over 40 advanced features designed for professional AI research at scale.

Here are some examples:

Highlighted feature code snippets
from pytorch_lightning import Trainer

# 8 GPUs
# no code changes needed
trainer = Trainer(max_epochs=1, gpus=8)

# 256 GPUs
trainer = Trainer(max_epochs=1, gpus=8, num_nodes=32)
Train on TPUs without code changes
# no code changes needed
trainer = Trainer(tpu_cores=8)
16-bit precision
# no code changes needed
trainer = Trainer(precision=16)
Experiment managers
from pytorch_lightning import loggers

# tensorboard
trainer = Trainer(logger=loggers.TensorBoardLogger("logs/"))

# weights and biases
trainer = Trainer(logger=loggers.WandbLogger())

# comet
trainer = Trainer(logger=loggers.CometLogger())

# mlflow
trainer = Trainer(logger=loggers.MLFlowLogger())

# neptune
trainer = Trainer(logger=loggers.NeptuneLogger())

# ... and dozens more
EarlyStopping
from pytorch_lightning.callbacks import EarlyStopping

es = EarlyStopping(monitor="val_loss")
trainer = Trainer(callbacks=[es])
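
The idea behind monitoring `val_loss` can be sketched without Lightning. The `should_stop` helper and its `patience` and `min_delta` parameters below are assumptions made for illustration; they do not mirror the callback's actual implementation.

```python
def should_stop(val_losses, patience=3, min_delta=0.0):
    """Return True once the monitored loss has not improved for `patience` checks."""
    best = float("inf")
    checks_without_improvement = 0
    for loss in val_losses:
        if loss < best - min_delta:
            # New best value: remember it and reset the counter.
            best = loss
            checks_without_improvement = 0
        else:
            checks_without_improvement += 1
            if checks_without_improvement >= patience:
                return True
    return False


stalled = should_stop([1.0, 0.9, 0.95, 0.93, 0.92])
improving = should_stop([1.0, 0.9, 0.8, 0.7])
```

Here the first run stalls after its best value of 0.9, while the second keeps improving and is never stopped.
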
Checkpointing
from pytorch_lightning.callbacks import ModelCheckpoint

checkpointing = ModelCheckpoint(monitor="val_loss")
trainer = Trainer(callbacks=[checkpointing])
Export to torchscript (JIT) (production use)
# torchscript
autoencoder = LitAutoEncoder()
torch.jit.save(autoencoder.to_torchscript(), "model.pt")
Export to ONNX (production use)
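# onnx
import tempfile

with tempfile.NamedTemporaryFile(suffix=".onnx", delete=False) as tmpfile:
    autoencoder = LitAutoEncoder()
    # the sample must match the encoder's input size (28 * 28 flattened pixels)
    input_sample = torch.randn((1, 28 * 28))
    autoencoder.to_onnx(tmpfile.name, input_sample, export_params=True)
    os.path.isfile(tmpfile.name)  # True if the export wrote the file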

Pro-level control of training loops (advanced users)

For complex/professional level work, you have optional full control of the training loop and optimizers.

class LitAutoEncoder(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.automatic_optimization = False

    def training_step(self, batch, batch_idx):
        # access your optimizers; use_pl_optimizer=True (the default) returns Lightning-wrapped optimizers
        opt_a, opt_b = self.optimizers(use_pl_optimizer=True)

        loss_a = ...
        self.manual_backward(loss_a)
        opt_a.step()
        opt_a.zero_grad()

        loss_b = ...
        # retain_graph=True keeps the graph alive so backward can run a second time
        self.manual_backward(loss_b, retain_graph=True)
        self.manual_backward(loss_b)
        opt_b.step()
        opt_b.zero_grad()

Advantages over unstructured PyTorch

  • Models become hardware agnostic
  • Code is clear to read because engineering code is abstracted away
  • Easier to reproduce
  • You make fewer mistakes because Lightning handles the tricky engineering
  • Keeps all the flexibility (LightningModules are still PyTorch modules), but removes a ton of boilerplate
  • Lightning has dozens of integrations with popular machine learning tools.
  • Tested rigorously with every new PR. We test every combination of PyTorch and Python supported versions, every OS, multi GPUs and even TPUs.
  • Minimal running speed overhead (about 300 ms per epoch compared with pure PyTorch).

Examples

Hello world
Contrastive Learning
NLP
Reinforcement Learning
Vision
Classic ML

Community

The Lightning community is maintained by

  • 10+ core contributors who are a mix of professional engineers, research scientists, and Ph.D. students from top AI labs.
  • 480+ active community contributors.

Want to help us build Lightning and reduce boilerplate for thousands of researchers? Learn how to make your first contribution here

Lightning is also part of the PyTorch ecosystem which requires projects to have solid testing, documentation and support.

Asking for help

If you have any questions please:

  1. Read the docs.
  2. Search through existing Discussions, or add a new question.
  3. Join our Slack.

Funding

We're venture funded to make sure we can provide around the clock support, hire a full-time staff, attend conferences, and move faster through implementing features you request.


Grid AI

Grid AI is our platform for training models at scale on the cloud!

Sign up for our FREE community Tier here

To use grid, take your regular command:

python my_model.py --learning_rate 1e-6 --layers 2 --gpus 4

And change it to use the grid train command:

grid train --grid_gpus 4 my_model.py --learning_rate 'uniform(1e-6, 1e-1, 20)' --layers '[2, 4, 8, 16]'

The above command will launch (20 * 4) experiments each running on 4 GPUs (320 GPUs!) - by making ZERO changes to your code.
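
The experiment count follows from the Cartesian product of the two searched hyperparameters. The snippet below reproduces only that arithmetic in plain Python. It is not how Grid samples or schedules runs, and the `random.uniform` sampling is an assumption made for illustration.

```python
import itertools
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# 'uniform(1e-6, 1e-1, 20)': 20 learning rates drawn from [1e-6, 1e-1]
learning_rates = [random.uniform(1e-6, 1e-1) for _ in range(20)]
# '[2, 4, 8, 16]': four explicit layer counts
layer_options = [2, 4, 8, 16]

# One experiment per (learning_rate, layers) pair.
experiments = list(itertools.product(learning_rates, layer_options))
gpus_per_experiment = 4  # from --grid_gpus 4

num_experiments = len(experiments)                  # 20 * 4 = 80
total_gpus = num_experiments * gpus_per_experiment  # 80 * 4 = 320
```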

Comments
  • Cross validation feature

    🚀 Feature

    Cross-validation is a crucial model validation technique for assessing how the model generalizes to new data.

    Motivation

    Research papers usually require cross-validation. From my point of view, this kind of feature would simplify the work of researchers.

    Pitch

    I want to pass a parameter to the Trainer object to specify that I want to train the model on K-folds.

    In the case that nobody wants to make a PR, I can start working on that.
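
As a rough sketch of what training on K folds involves, the hypothetical helper `k_fold_indices` below splits sample indices into K contiguous validation folds, using the remaining indices for training. It only illustrates the splitting step under simple assumptions (no shuffling, no stratification) and is not a proposed Trainer API.

```python
def k_fold_indices(num_samples, k):
    """Yield (train_indices, val_indices) for each of k contiguous folds."""
    if not 2 <= k <= num_samples:
        raise ValueError("k must be between 2 and num_samples")
    indices = list(range(num_samples))
    base, extra = divmod(num_samples, k)
    start = 0
    for fold in range(k):
        # The first `extra` folds get one additional sample.
        size = base + (1 if fold < extra else 0)
        val = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, val
        start += size


folds = list(k_fold_indices(10, 3))
```

Each sample appears in exactly one validation fold, so K separate training runs together validate on the whole dataset.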

    feature help wanted good first issue discussion 
    opened by BraveDistribution 106
  • Improve typing coverage (4/n)

    🚀 Typing coverage

    Let's improve typing coverage of PyTorch Lightning together!

    I'm creating a new issue in order to increase visibility. There are three older issues (#7037, #5023, #4698) which became stale over time.

    Plan

    Currently, there are 55 files which are excluded from mypy checks so that our CI does not fail. The effort needed to complete the typing differs widely between these files. For this reason, we are introducing a difficulty estimate for each file so that community members can choose files appropriate to their skill level.

    Please, comment on this issue in order to reserve a particular file to work on. Once you do so, I will edit this top comment to avoid collisions. Once you think your work is finished, please open a PR referencing this issue which:

    • removes the corresponding line from pyproject.toml
    • and passes mypy checks with the corresponding line removed. You can test it locally by running mypy from root directory

    If you are struggling with pushing it over the finish line, open the PR anyway and someone from our team will help you to get it there. 🚀

    Please note that you may need to edit more than one file. This is fine, but keep in mind that the goal of your PR is to make the check pass for the chosen file. Also note that the difficulty is just an educated guess.

    For those of you who are not familiar with the process of contributing a PR, we have prepared a simple guide that will walk you through the necessary steps. You can do it! :rocket: :muscle:

    List of files and guesstimated difficulty

    Completed

    Difficulty 1 of 3

    • [x] pytorch_lightning/core/decorators.py #14044
    • [x] pytorch_lightning/profilers/advanced.py @nninept #13792
    ~- [ ] pytorch_lightning/profilers/base.py @LeeChanHyuk #13879~
    • [x] pytorch_lightning/loggers/base.py @JustinGoheen #13494
    • [x] pytorch_lightning/__setup__.py @CyprienRicque #13472
    ~- [ ] pytorch_lightning/distributed/dist.py @puhuk #13492~
    • [x] pytorch_lightning/strategies/single_device.py @CyprienRicque #13532
    • [x] pytorch_lightning/trainer/optimizers.py @gautierdag #13470
    • [x] pytorch_lightning/utilities/distributed.py @krishnakalyan3 #13678
    • [x] pytorch_lightning/callbacks/finetuning.py @ar90n #13516
    • [x] pytorch_lightning/loggers/mlflow.py @JustinGoheen ~~#13690~~ #13691
    • [x] pytorch_lightning/tuner/tuning.py @donlapark ~~#13616~~ #13631
    • [x] pytorch_lightning/strategies/single_tpu.py @CyprienRicque #13534
    • [x] pytorch_lightning/strategies/ddp2.py @CyprienRicque #13535
    • [x] pytorch_lightning/strategies/parallel.py @CyprienRicque #13556
    • [x] pytorch_lightning/loggers/csv_logs.py @JustinGoheen #13538
    • [x] pytorch_lightning/tuner/lr_finder.py @donlapark #13513 #13652
    • [x] pytorch_lightning/strategies/dp.py @CyprienRicque #13564
    • [x] pytorch_lightning/profilers/simple.py @krishnakalyan3 #14103
    • [x] pytorch_lightning/strategies/sharded_spawn.py @krishnakalyan3 #14102
    • [x] pytorch_lightning/demos/mnist_datamodule.py @alro923 #13929
    • [x] pytorch_lightning/demos/boring_classes.py @krishnakalyan3 #14201
    • [x] pytorch_lightning/tuner/batch_size_scaling.py @ar90n #13518

    Difficulty 2 of 3

    • [x] pytorch_lightning/loops/epoch/training_epoch_loop.py @himkt #13555
    • [x] pytorch_lightning/core/mixins/device_dtype_mixin.py @krishnakalyan3 #13704
    • [x] pytorch_lightning/loggers/comet.py @JustinGoheen #13689
    • [x] pytorch_lightning/loggers/tensorboard.py @JustinGoheen #13688
    • [x] pytorch_lightning/strategies/horovod.py @CyprienRicque #13570
    • [x] pytorch_lightning/callbacks/model_checkpoint.py @BongYang #13617
    • [x] pytorch_lightning/strategies/fully_sharded.py @BongYang #13941
    • [x] pytorch_lightning/loggers/neptune.py @JustinGoheen #13692
    • [x] pytorch_lightning/utilities/meta.py @nninept #13763 #13868
    • [x] pytorch_lightning/strategies/tpu_spawn.py @BongYang #13813
    • [x] pytorch_lightning/loggers/logger.py @JustinGoheen #13541
    • [x] pytorch_lightning/loggers/wandb.py @gautierdag #13483
    • [x] pytorch_lightning/callbacks/stochastic_weight_avg.py @donlapark #13685 #13860
    • [x] pytorch_lightning/strategies/strategy.py @CyprienRicque #13519
    • [x] pytorch_lightning/strategies/deepspeed.py @donlapark #13832
    • [x] pytorch_lightning/strategies/ddp_spawn.py @donlapark #13865
    • [x] pytorch_lightning/strategies/ipu.py @HalestormAI #13786
    • [x] pytorch_lightning/trainer/connectors/callback_connector.py @krishnakalyan3 #13750
    • [x] pytorch_lightning/strategies/ddp.py @lijm1358 #13885
    • [x] pytorch_lightning/core/saving.py @JustinGoheen #13932
    • [x] pytorch_lightning/callbacks/quantization.py @krishnakalyan3 #13782
    • [x] pytorch_lightning/strategies/sharded.py @lijm1358 #14184
    • [x] pytorch_lightning/core/datamodule.py @JustinGoheen #13693

    Difficulty 3 of 3

    ~- [ ] pytorch_lightning/trainer/callback_hook.py @JustinGoheen #13807 ~

    • [x] pytorch_lightning/core/module.py @JustinGoheen #13603
    • [x] pytorch_lightning/trainer/connectors/data_connector.py @JustinGoheen #13806
    • [x] pytorch_lightning/utilities/auto_restart.py @donlapark #13904
    • [x] pytorch_lightning/trainer/supporters.py @donlapark #14633
    • [x] pytorch_lightning/profilers/pytorch.py @krishnakalyan3 #14405
    • [x] pytorch_lightning/utilities/data.py @nandwalritik #13901
    • [x] pytorch_lightning/trainer/trainer.py [email protected] #13810~ @BongYang #14204
    • [x] pytorch_lightning/callbacks/progress/rich_progress.py @donlapark #14963

    cc @borda @justusschock @awaelchli @rohitgr7 @Borda @tchaton @aniketmaurya @kingjuno @alat-rights @carmocca @akihironitta @stancld as you were all involved in previous issues

    help wanted good first issue let's do it! code quality 
    opened by otaj 105
  • Code stuck on "initializing ddp" when using more than one gpu

    🐛 Bug

    I am trying to run a pytorch lightning model on a 4-GPU node. In my trainer, if I specify

    pl.Trainer(gpus=[0])
    

    It runs fine. However, once I add another GPU

    pl.Trainer(gpus=[0,1,2,3])
    

    I get this output:

    GPU available: True, used: True
    TPU available: False, using: 0 TPU cores
    LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
    initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/4
    initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/4
    initializing ddp: GLOBAL_RANK: 2, MEMBER: 3/4
    initializing ddp: GLOBAL_RANK: 3, MEMBER: 4/4

    And the model just hangs there forever. I have tried this with only 2 GPUs and get the same behavior.

    Any idea why this may happen? I have tried with both ddp and ddp_spawn.

    • PyTorch Version-- tried both 1.4 and 1.7
    • OS-- Linux
    • Installed with pip
    • Python version: 3.8.5
    • CUDA/cuDNN version: 10.1
    • GPU models and configuration: NVIDIA K80s
    bug help wanted distributed priority: 1 
    opened by JosephGatto 78
  • Implementing mAP

    What does this PR do?

    Implements mAP, as mentioned in #2552. I'm creating a draft pull request, as opposed to a regular pull request, to receive some feedback as well as guidance on some implementation details.

    Before submitting

    • [x] Was this discussed/approved via a Github issue? (no need for typos and docs improvements)
    • [x] Did you read the contributor guideline, Pull Request section?
    • [x] Did you make sure your PR does only one thing, instead of bundling different changes together? Otherwise, we ask you to create a separate PR for every change.
    • [x] Did you make sure to update the documentation with your changes?
    • [x] Did you write any new necessary tests?
    • [x] Did you verify new and existing tests pass locally with your changes?
    • [x] If you made a notable change (that affects users), did you update the CHANGELOG?

    PR review

    Anyone in the community is free to review the PR once the tests have passed. Before you start reviewing, make sure you have read the Review guidelines. In short, see the following bullet list:

    • [x] Is this pull request ready for review? (if not, please submit in draft mode)
    • [x] Check that all items from Before submitting are resolved
    • [x] Make sure the title is self explanatory and the description concisely explains the PR
    • [x] Add labels and milestones (and optionally projects) to the PR so it can be classified; bugfixes should be included in bug-fix release milestones (m.f.X) and features should be included in (m.X.b) releases.

    Did you have fun?

    Make sure you had fun coding 🙃

    feature has conflicts 
    opened by briankosw 68
  • Add Support for multiple train loaders

    Before submitting

    • [ ] Was this discussed/approved via a Github issue? (no need for typos and docs improvements)
    • [x] Did you read the contributor guideline, Pull Request section?
    • [ ] Did you make sure to update the docs?
    • [x] Did you write any new necessary tests?
    • [ ] If you made a notable change (that affects users), did you update the CHANGELOG?

    What does this PR do?

    When this is finished it adds support for drawing batches from multiple train loaders at once. If the loaders are specified as a Mapping (dict), the resulting batch will consist of one batch per loader under the same keys as the loaders like this:

    loaders = {"x": loader_x, "y": loader_y, "z": loader_z}
    

    will result in a batch like this:

    {"x": batch_from_loader_x, "y": batch_from_loader_y, "z": batch_from_loader_z}
    

    and loaders in a sequence will yield a sequence batch built from the separate batches in the correct order:

    loaders = [loader_0, loader_1, loader_2]
    

    will result in a batch like this:

    [batch_from_loader_0, batch_from_loader_1, batch_from_loader_2]
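
The batch layout described above can be sketched with plain iterables. The hypothetical `combined_batches` helper below is not the PR's implementation; in particular, it assumes all loaders have the same length, because `zip` stops at the shortest one.

```python
from collections.abc import Mapping


def combined_batches(loaders):
    """Yield one combined batch per step, mirroring the container type of `loaders`."""
    if isinstance(loaders, Mapping):
        keys = list(loaders)
        # Draw one batch from every loader, then re-attach the original keys.
        for batches in zip(*(loaders[key] for key in keys)):
            yield dict(zip(keys, batches))
    else:
        # Sequence of loaders: keep the batches in loader order.
        for batches in zip(*loaders):
            yield list(batches)


dict_batches = list(combined_batches({"x": [1, 2], "y": ["a", "b"]}))
list_batches = list(combined_batches([[1, 2], ["a", "b"]]))
```
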
    

    PR review

    Anyone in the community is free to review the PR once the tests have passed.
    If we didn't discuss your PR in Github issues there's a high chance it will not be merged.

    Did you have fun?

    Make sure you had fun coding 🙃

    feature help wanted ready priority: 0 design 
    opened by justusschock 67
  • Remove deprecated code after the 1.6 release

    Proposed refactor

    Remove deprecated code after the 1.6 release.

    NOTE: Please pick up a single item from the list (by commenting here in the issue) - and if there are no conflicts - we will happily assign you and put your name in front of the item in the list.

    Please note that unless mentioned, the classes are importable from pytorch_lightning, example: from pytorch_lightning import Trainer.

    • [x] LightningModule.summarize -> #12559
    • [x] pytorch_lightning.core.memory.LayerSummary -> #12593
    • [x] pytorch_lightning.core.memory.ModelSummary -> #12593
    • [x] pytorch_lightning.core.memory.get_gpu_memory_map -> #12644
    • [x] pytorch_lightning.core.memory.get_memory_profile -> #12659
    • [x] LightningModule.model_size -> #12641
    • [x] LightningDataModule.train_transforms -> #12662
    • [x] LightningDataModule.val_transforms -> #12763
    • [x] LightningDataModule.test_transforms -> #12773
    • [x] LightningDataModule.size -> #12780
    • [x] LightningDataModule.dims and LightningDataModule(dims=...) -> #12780
    • [x] LightningModule.get_progress_bar_dict -> #12839
    • [x] Trainer.progress_bar_dict -> #12839
    • [x] Trainer(prepare_data_per_node=...) -> #12536
    • [x] Trainer(stochastic_weight_avg=...) -> #12535
    • [x] Trainer(terminate_on_nan=...) and Trainer.terminate_on_nan -> #12553
    • [x] LightningModule.on_{train,val,test,predict}_dataloader -> #13033
    • [x] pytorch_lightning.loggers.TestTubeLogger -> #12859
    • [x] pytorch_lightning.Callback.on_keyboard_interrupt -> #13438
    • [x] Trainer(process_position=...) -> #13071
    • [x] Trainer(flush_logs_every_n_steps=...) -> #13074
    • [x] LightningModule.add_to_queue -> @shenoynikhil
    • [x] LightningModule.get_from_queue -> @shenoynikhil
    • [x] Trainer(progress_bar_refresh_rate=...) -> #12514
    • [x] LightningLoggerBase.close and pytorch_lightning.loggers.LoggerCollection.close -> #13149
    • [x] pytorch_lightning.distributed.dist.LightningDistributed #13549
    • [x] Trainer(checkpoint_callback=...) -> #13027
    • [x] Passing dataloader_idx to on_train_batch_start of pytorch_lightning.Callback and LightningModule -> #12769
    • [x] LightningModule.on_post_move_to_device #13548
    • [x] pytorch_lightning.core.decorators.parameter_validation #13514
    • [x] Trainer(accelerator="ddp_spawn") #12696
    • [x] Trainer(plugins="ddp_spawn") #12700
    • [x] Trainer(weights_summary="full"), Trainer(weights_summary=None), Trainer.weights_summary -> #13070
    • [x] Trainer(log_gpu_memory=...) -> #12657
    • [x] Trainer.slurm_job_id #13459
    • [x] pytorch_lightning.callbacks.gpu_stats.GPUStatsMonitor -> #12554
    • [x] pytorch_lightning.callbacks.gpu_stats.XLAStatsMonitor -> #12688
    • [x] pytorch_lightning.callbacks.progress.ProgressBar -> #12658
    • [x] Trainer(max_steps=None) and Trainer.fit_loop.max_steps = None #13591
    • [x] pytorch_lightning.callbacks.lr_monitor.LearningRateMonitor.lr_sch_names -> #13353
    • [x] KubeflowEnvironment.is_using_kubeflow, LSFEnvironment.is_using_lsf, TorchElasticEnvironment.is_using_torchelastic #13458
    • [x] pytorch_lightning.overrides.distributed.IndexBatchSamplerWrapper.batch_indices #13565
    • [x] pytorch_lightning.strategies.SingleDeviceStrategy.post_dispatch #13461
    • [x] pytorch_lightning.trainer.connectors.logger_connector.logger_connector.LoggerConnector.gpu_metrics

    Feel free to cross-check against the test file to ensure that the relevant test now fails (since the feature is no longer deprecated but removed).
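
A deprecation test typically asserts that calling the old API emits a warning, so once the API is removed that test fails and can be deleted with it. The sketch below shows the pattern with the standard `warnings` module only. The function names are hypothetical, and the project's actual tests may be written differently.

```python
import warnings


def get_memory_profile_toy():
    """Hypothetical deprecated function that warns before doing its work."""
    warnings.warn(
        "get_memory_profile_toy is deprecated and will be removed in a future release.",
        DeprecationWarning,
        stacklevel=2,
    )
    return {}


def emits_deprecation_warning(func):
    """Return True if calling `func` emits at least one DeprecationWarning."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # make sure no warning is filtered out
        func()
    return any(issubclass(w.category, DeprecationWarning) for w in caught)


still_deprecated = emits_deprecation_warning(get_memory_profile_toy)
```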

    Pitch

    All the deprecated features we have are tested here:

    https://github.com/PyTorchLightning/pytorch-lightning/blob/master/tests/deprecated_api/test_remove_1-7.py

    If you are interested in taking care of one item, post a comment here asking to take it. This avoids multiple people working on the same thing.

    Additional context

    See pull requests linked in #10312 for examples on how to contribute :) Or a recent pull request #12514.


    If you enjoy Lightning, check out our other projects! ⚡

    • Metrics: Machine learning metrics for distributed, scalable PyTorch applications.

    • Lite: enables pure PyTorch users to scale their existing code on any kind of device while retaining full control over their own loops and optimization logic.

    • Flash: The fastest way to get a Lightning baseline! A collection of tasks for fast prototyping, baselining, fine-tuning, and solving problems with deep learning.

    • Bolts: Pretrained SOTA Deep Learning models, callbacks, and more for research and production with PyTorch Lightning and PyTorch.

    • Lightning Transformers: Flexible interface for high-performance research using SOTA Transformers leveraging Pytorch Lightning, Transformers, and Hydra.

    cc @borda @justusschock @awaelchli @rohitgr7 @krshrimali

    good first issue refactor 
    opened by akihironitta 65
  • replace Hparams by init args

    Problem

    hparams was a temporary workaround for the fact that arguments passed by users were not stored automatically. Everyone hacks around it, it is not intuitive, and it makes the PL module somewhat less like a PyTorch module.

    end of hparams!

    This PR

    This PR removes that and instead:

    • Stores all the args passed in init automatically so checkpoints can have this information.
    • doesn’t store things like losses, etc... only primitives, lists, dicts, tuples and namespace
    • auto saves this info into checkpoints
    • it DOES NOT assign properties automatically
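
One way to picture the automatic capture is to bind the call arguments to the `__init__` signature and keep only values of savable types. The sketch below does that with the standard `inspect` module. `collect_init_args`, `SAVABLE_TYPES`, and `ToyModel` are hypothetical names, and this is not the PR's actual implementation.

```python
import inspect
from argparse import Namespace

# Assumed set of savable types: primitives, lists, dicts, tuples and Namespace.
SAVABLE_TYPES = (bool, int, float, str, list, dict, tuple, Namespace)


def collect_init_args(init, args, kwargs):
    """Bind the call arguments to `init`'s signature and keep only savable values."""
    signature = inspect.signature(init)
    self_name = next(iter(signature.parameters))
    bound = signature.bind(None, *args, **kwargs)  # None stands in for self
    bound.apply_defaults()
    return {
        name: value
        for name, value in bound.arguments.items()
        if name != self_name and isinstance(value, SAVABLE_TYPES)
    }


class ToyModel:
    def __init__(self, in_dim, out_dim=10, my_pretrained_nn_module=None):
        pass


saved = collect_init_args(ToyModel.__init__, (28 * 28,), {"my_pretrained_nn_module": object})
```

The class passed as `my_pretrained_nn_module` is filtered out, which matches the note that such objects have to be supplied again when loading from a checkpoint.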

    Backward compatibility

    • this PR is still backward compatible for people who want to continue using hparams directly.

    Summary

    Before:

    hparams = dict or Namespace
    
    class LitModel(pl.LightningModule):
        def __init__(self, hparams, my_pretrained_nn_module):
            super().__init__()
            self.hparams = hparams
            self.l1 = nn.Linear(hparams.in_dim, hparams.out_dim)
            self.feature_extractor = my_pretrained_nn_module()
    
    # old way had a ton of problems with this
    model = LitModel.load_from_checkpoint(PATH)
    

    New:

    class LitModel(pl.LightningModule):
        def __init__(self, in_dim, out_dim, my_pretrained_nn_module):
            super().__init__()
            self.in_dim = in_dim
            self.out_dim = out_dim
            
            # self.in_dim, etc were auto registered to the module
            self.l1 = nn.Linear(in_dim, out_dim)
            self.feature_extractor = my_pretrained_nn_module()
    
    # load from checkpoint still works as normal, but objects and such need to be specified
    model = LitModel.load_from_checkpoint(PATH, my_pretrained_nn_module=MyModule)
    
    # or can overwrite the old settings as well
    model = LitModel.load_from_checkpoint(PATH, in_dim=some_new_dim, my_pretrained_nn_module=MyModule)
    
    feature help wanted 
    opened by williamFalcon 63
  • Unify usage of multiple callbacks

    🚀 Feature

    A simplified API with callbacks, as e.g. Keras did: pass just a list of callbacks to be executed, and the Trainer will call them when needed instead of having them specified individually https://github.com/PyTorchLightning/pytorch-lightning/blob/b1040523b2180300574d961444b00abfa3c84195/pytorch_lightning/trainer/trainer.py#L65-L66

    mentioned also in https://github.com/PyTorchLightning/pytorch-lightning/issues/825#issuecomment-588226411
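
The proposal amounts to a dispatch loop: the trainer holds a list of callbacks and calls the same hook on each of them at fixed points. A framework-free sketch follows. `ToyTrainer`, `ToyCallback`, and the two hook names are assumptions for illustration, not the API that was eventually adopted.

```python
class ToyCallback:
    """Base class: every hook is a no-op unless a subclass overrides it."""

    def on_epoch_start(self, trainer):
        pass

    def on_epoch_end(self, trainer):
        pass


class EpochRecorder(ToyCallback):
    def __init__(self):
        self.events = []

    def on_epoch_start(self, trainer):
        self.events.append(("start", trainer.current_epoch))

    def on_epoch_end(self, trainer):
        self.events.append(("end", trainer.current_epoch))


class ToyTrainer:
    def __init__(self, callbacks=None, max_epochs=2):
        self.callbacks = list(callbacks or [])
        self.max_epochs = max_epochs
        self.current_epoch = 0

    def _call_hook(self, hook_name):
        # Dispatch the same hook to every registered callback, in order.
        for callback in self.callbacks:
            getattr(callback, hook_name)(self)

    def fit(self):
        for epoch in range(self.max_epochs):
            self.current_epoch = epoch
            self._call_hook("on_epoch_start")
            # ... the training work for one epoch would run here ...
            self._call_hook("on_epoch_end")


recorder = EpochRecorder()
ToyTrainer(callbacks=[recorder]).fit()
```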

    feature help wanted discussion 
    opened by Borda 60
  • Lose performance between 0.6.0 and 0.7.1

    🐛 Bug

    When I train exactly the same model with pl 0.7.1, I get worse performance compared to pl 0.6.0. I did a fresh install of Asteroid with both versions and ran exactly the same script on the same hardware. I get significantly worse performance with pl 0.7.1. Are there some known issues I should be aware of? In the meantime, I'll have to downgrade to 0.6.0.

    Environment

    PL 0.6.0

    Collecting environment information...
    PyTorch version: 1.4.0
    Is debug build: No
    CUDA used to build PyTorch: 10.1

    OS: Debian GNU/Linux 10 (buster)
    GCC version: (Debian 8.3.0-6) 8.3.0
    CMake version: version 3.14.0

    Python version: 3.6
    Is CUDA available: No
    CUDA runtime version: No CUDA
    GPU models and configuration: No CUDA
    Nvidia driver version: No CUDA
    cuDNN version: No CUDA

    Versions of relevant libraries:
    [pip3] numpy==1.18.1
    [pip3] pytorch-lightning==0.6.0
    [pip3] torch==1.4.0
    [pip3] torchvision==0.4.2
    [conda] blas 1.0 mkl
    [conda] mkl 2019.4 243
    [conda] mkl-include 2020.0 166
    [conda] mkl-service 2.3.0 py36he904b0f_0
    [conda] mkl_fft 1.0.14 py36ha843d7b_0
    [conda] mkl_random 1.1.0 py36hd6b4f25_0
    [conda] torch 1.3.1 pypi_0 pypi
    [conda] torchvision 0.4.2 pypi_0 pypi

    Diff between 0.6.0 and 0.7.1 envs

    diff env_0.7 env_0.6

    19c19
    < [pip3] pytorch-lightning==0.7.1
    ---
    > [pip3] pytorch-lightning==0.6.0
    
    help wanted 
    opened by mpariente 53
  • CUDA OOM when initializing DDP

    🐛 Bug

    Hey everyone,

    I am trying to train a model on our lab's GPU workstation (which has 10 GPUs, of which usually only 1 is in use) using Lightning and DDP. I have tried with several models (including the BoringModel) without success. In particular, I get a CUDA OOM error when DDP initializes. I tried BoringModel with the following Trainer configuration:

    trainer = Trainer(
            default_root_dir=os.getcwd(),
            limit_train_batches=1,
            limit_val_batches=1,
            max_epochs=1,
            weights_summary=None,
            gpus=2,
            accelerator="ddp",
            auto_select_gpus=True
    )
    

    And the output I get is the following:

    GPU available: True, used: True
    TPU available: False, using: 0 TPU cores
    LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7,8,9]
    LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7,8,9]
    initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/2
    initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/2
    Traceback (most recent call last):
      File "boring_model.py", line 138, in <module>
        run_test()
      File "boring_model.py", line 133, in run_test
        trainer.fit(model, train_data, val_data)
      File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 444, in fit
        results = self.accelerator_backend.train()
      File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 148, in train
        results = self.ddp_train(process_idx=self.task_idx, model=model)
      File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 238, in ddp_train
        self.init_ddp_connection(
      File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 215, in init_ddp_connection
        torch_distrib.init_process_group(
      File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 442, in init_process_group
        barrier()
      File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1947, in barrier
        work = _default_pg.barrier()
    RuntimeError: CUDA error: out of memory
    Traceback (most recent call last):
      File "/home/edoardo.debenedetti/projects/gans-mia-unlearning/boring_model.py", line 138, in <module>
        run_test()
      File "/home/edoardo.debenedetti/projects/gans-mia-unlearning/boring_model.py", line 133, in run_test
        trainer.fit(model, train_data, val_data)
      File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 444, in fit
        results = self.accelerator_backend.train()
      File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 148, in train
        results = self.ddp_train(process_idx=self.task_idx, model=model)
      File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 238, in ddp_train
        self.init_ddp_connection(
      File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 215, in init_ddp_connection
        torch_distrib.init_process_group(
      File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 442, in init_process_group
        barrier()
      File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1947, in barrier
        work = _default_pg.barrier()
    RuntimeError: Broken pipe
    

    The script with the BoringModel I run on our workstation is in this gist.

    However, this doesn't happen on Colab using your BoringModel notebook (my version can be found here).

    I also tried running the same notebook locally. On the first attempt, the result is the following:

    GPU available: True, used: True
    TPU available: False, using: 0 TPU cores
    LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7,8,9]
    initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/1
    ---------------------------------------------------------------------------
    RuntimeError                              Traceback (most recent call last)
    <ipython-input-11-1f9f6fbe4f6c> in <module>
    ----> 1 test_x(tmpdir)
    
    <ipython-input-10-d400f0366266> in test_x(tmpdir)
         16 
         17     # Train the model ⚡
    ---> 18     trainer.fit(model, train, val)
         19 
         20     trainer.test(test_dataloaders=test)
    
    ~/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py in fit(self, model, train_dataloader, val_dataloaders, datamodule)
        442         self.call_hook('on_fit_start')
        443 
    --> 444         results = self.accelerator_backend.train()
        445         self.accelerator_backend.teardown()
        446 
    
    ~/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py in train(self)
        146         model = self.trainer.model
        147 
    --> 148         results = self.ddp_train(process_idx=self.task_idx, model=model)
        149         if 'WORLD_SIZE' in os.environ:
        150             del os.environ['WORLD_SIZE']
    
    ~/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py in ddp_train(self, process_idx, model)
        236         # where to store ip_table
        237         model.trainer = self.trainer
    --> 238         self.init_ddp_connection(
        239             self.trainer.global_rank,
        240             self.trainer.world_size,
    
    ~/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py in init_ddp_connection(self, global_rank, world_size, is_slurm_managing_tasks)
        213                 f"initializing ddp: GLOBAL_RANK: {global_rank}, MEMBER: {global_rank + 1}/{world_size}"
        214             )
    --> 215             torch_distrib.init_process_group(
        216                 torch_backend, rank=global_rank, world_size=world_size
        217             )
    
    ~/miniconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py in init_process_group(backend, init_method, timeout, world_size, rank, store, group_name)
        440     # process groups including global variables are updated correctly on all
        441     # ranks.
    --> 442     barrier()
        443 
        444 def _new_process_group_helper(world_size,
    
    ~/miniconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py in barrier(group, async_op)
       1945     if group == GroupMember.WORLD:
       1946         _check_default_pg()
    -> 1947         work = _default_pg.barrier()
       1948     else:
       1949         work = group.barrier()
    
    RuntimeError: CUDA error: out of memory
    

    On the second attempt, though, it works as expected (i.e. the model trains with no errors, even with multiple GPUs)! So in the script, I tried attempting the fit twice, as in the notebook:

    try:
        trainer.fit(model, train_data, val_data)
    except:
        trainer.fit(model, train_data, val_data)
    

    As a result, I get this stack trace:

    GPU available: True, used: True
    TPU available: False, using: 0 TPU cores
    LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7,8,9]
    LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7,8,9]
    initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/2
    initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/2
    Traceback (most recent call last):
      File "boring_model.py", line 135, in run_test
        trainer.fit(model, train_data, val_data)
      File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 444, in fit
        results = self.accelerator_backend.train()
      File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 148, in train
        results = self.ddp_train(process_idx=self.task_idx, model=model)
      File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 238, in ddp_train
        self.init_ddp_connection(
      File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 215, in init_ddp_connection
        torch_distrib.init_process_group(
      File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 442, in init_process_group
        barrier()
      File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1947, in barrier
        work = _default_pg.barrier()
    RuntimeError: CUDA error: out of memory
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
      File "boring_model.py", line 143, in <module>
        run_test()
      File "boring_model.py", line 137, in run_test
        trainer.fit(model, train_data, val_data)
      File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 444, in fit
        results = self.accelerator_backend.train()
      File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 148, in train
        results = self.ddp_train(process_idx=self.task_idx, model=model)
      File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 275, in ddp_train
        model = self.configure_ddp(model, device_ids)
      File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 292, in configure_ddp
        model = self.ddp_plugin.configure_ddp(model, device_ids)
      File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/plugins/ddp_plugin.py", line 59, in configure_ddp
        model = LightningDistributedDataParallel(
      File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 410, in __init__
        self._sync_params_and_buffers(authoritative_rank=0)
      File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 417, in _sync_params_and_buffers
        self._distributed_broadcast_coalesced(
      File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 978, in _distributed_broadcast_coalesced
        dist._broadcast_coalesced(
    RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1603729009598/work/torch/lib/c10d/ProcessGroupNCCL.cpp:784, invalid usage, NCCL version 2.7.8
    Traceback (most recent call last):
      File "/home/edoardo.debenedetti/projects/gans-mia-unlearning/boring_model.py", line 135, in run_test
        trainer.fit(model, train_data, val_data)
      File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 444, in fit
        results = self.accelerator_backend.train()
      File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 148, in train
        results = self.ddp_train(process_idx=self.task_idx, model=model)
      File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 238, in ddp_train
        self.init_ddp_connection(
      File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 215, in init_ddp_connection
        torch_distrib.init_process_group(
      File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 442, in init_process_group
        barrier()
      File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1947, in barrier
        work = _default_pg.barrier()
    RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1603729009598/work/torch/lib/c10d/ProcessGroupNCCL.cpp:784, invalid usage, NCCL version 2.7.8
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
      File "/home/edoardo.debenedetti/projects/gans-mia-unlearning/boring_model.py", line 143, in <module>
        run_test()
      File "/home/edoardo.debenedetti/projects/gans-mia-unlearning/boring_model.py", line 137, in run_test
        trainer.fit(model, train_data, val_data)
      File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 444, in fit
        results = self.accelerator_backend.train()
      File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 148, in train
        results = self.ddp_train(process_idx=self.task_idx, model=model)
      File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 275, in ddp_train
        model = self.configure_ddp(model, device_ids)
      File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 292, in configure_ddp
        model = self.ddp_plugin.configure_ddp(model, device_ids)
      File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/plugins/ddp_plugin.py", line 59, in configure_ddp
        model = LightningDistributedDataParallel(
      File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 410, in __init__
        self._sync_params_and_buffers(authoritative_rank=0)
      File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 417, in _sync_params_and_buffers
        self._distributed_broadcast_coalesced(
      File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 978, in _distributed_broadcast_coalesced
        dist._broadcast_coalesced(
    RuntimeError: Broken pipe
    

    Expected behavior

    The models should train without issues.

    Environment

    • CUDA:
      • GPU:
        • TITAN V
        • TITAN V
        • TITAN V
        • TITAN V
        • TITAN V
        • TITAN V
        • TITAN V
        • TITAN V
        • TITAN V
        • TITAN V
      • available: True
      • version: 10.1
    • Packages:
      • numpy: 1.19.2
      • pyTorch_debug: True
      • pyTorch_version: 1.7.0
      • pytorch-lightning: 1.0.6
      • tqdm: 4.52.0
    • System:
      • OS: Linux
      • architecture:
        • 64bit
        • ELF
      • processor: x86_64
      • python: 3.8.5
      • version: #1 SMP Fri Oct 18 17:15:30 UTC 2019

    Additional context

    I tried installing torch, torchvision and pl with both Conda and pip in fresh environments, but the problem persists.

    This also happens if I select (free) GPUs manually by specifying them in the gpus flag as a List[int]. Interestingly, if I run this tutorial notebook by PyTorch that uses vanilla PyTorch DDP, I have no issues whatsoever. One final interesting fact: with accelerator="dp" I have no issues.

    Thanks in advance!

    bug help wanted distributed 
    opened by dedeswim 51
  • NCCL error using DDP and PyTorch 1.7

    🐛 Bug

    Getting this error when attempting to use ddp with the "getting started" autoencoder example:

    Stack Trace:

    GPU available: True, used: True
    TPU available: False, using: 0 TPU cores
    LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2]
    LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1,2]
    initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/2
    initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/2
    Traceback (most recent call last):
      File "01_getting_started_autoencoder.py", line 66, in <module>
        modle, trainer = cli_main()
      File "01_getting_started_autoencoder.py", line 60, in cli_main
        trainer.fit(model, train_dl)
      File "/home/user/anaconda3/envs/playground-pl/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 440, in fit
    Traceback (most recent call last):
      File "/home/user/development/_training/ml/pl-playground/01_getting_started_autoencoder.py", line 66, in <module>
        results = self.accelerator_backend.train()
      File "/home/user/anaconda3/envs/playground-pl/lib/python3.7/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 138, in train
        results = self.ddp_train(process_idx=self.task_idx, model=model)
      File "/home/user/anaconda3/envs/playground-pl/lib/python3.7/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 231, in ddp_train
        self.trainer.is_slurm_managing_tasks
      File "/home/user/anaconda3/envs/playground-pl/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 213, in init_ddp_connection
        torch_backend, rank=global_rank, world_size=world_size
      File "/home/user/anaconda3/envs/playground-pl/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 442, in init_process_group
        barrier()
      File "/home/user/anaconda3/envs/playground-pl/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 1947, in barrier
        modle, trainer = cli_main()
      File "/home/user/development/_training/ml/pl-playground/01_getting_started_autoencoder.py", line 60, in cli_main
        trainer.fit(model, train_dl)
      File "/home/user/anaconda3/envs/playground-pl/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 440, in fit
        results = self.accelerator_backend.train()
      File "/home/user/anaconda3/envs/playground-pl/lib/python3.7/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 138, in train
        work = _default_pg.barrier()
    RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:784, invalid usage, NCCL version 2.7.8
        results = self.ddp_train(process_idx=self.task_idx, model=model)
      File "/home/user/anaconda3/envs/playground-pl/lib/python3.7/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 231, in ddp_train
        self.trainer.is_slurm_managing_tasks
      File "/home/user/anaconda3/envs/playground-pl/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 213, in init_ddp_connection
        torch_backend, rank=global_rank, world_size=world_size
      File "/home/user/anaconda3/envs/playground-pl/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 442, in init_process_group
        barrier()
      File "/home/user/anaconda3/envs/playground-pl/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 1947, in barrier
        work = _default_pg.barrier()
    RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:784, invalid usage, NCCL version 2.7.8
    

    To Reproduce

    Follow the code in the getting started question with these parameters to Trainer:

    model = LitAutoEncoder()
    trainer = pl.Trainer(gpus='1,2', distributed_backend='ddp')
    trainer.fit(model, train_dl)
    

    Expected behavior

    For it to train on multiple GPUs :)

    Environment

    • PyTorch Version: 1.7
    • OS (e.g., Linux): Ubuntu 18.04
    • How you installed PyTorch (conda, pip, source): pip
    • Build command you used (if compiling from source): n/a
    • Python version: 3.7
    • CUDA/cuDNN version: 10.2/7.6.5
    • GPU models and configuration: 2 1080Tis
    • Any other relevant information: n/a
    bug help wanted priority: 0 distributed 3rd party 
    opened by ohmeow 51
  • Sync with master changes

    What does this PR do?

    git cherry-pick ca88f813a440bef61e611dc3c40343a2774cd21a..f9ae89f075c67832c1a68f62f85c709bc683454b
    

    Does your PR introduce any breaking changes? If yes, please list them.

    None

    opened by carmocca 1
  • Fix LR scheduler behaviour with AMP

    What does this PR do?

    When training with native AMP and an LR scheduler, we get the warning below. It indicates that an LR scheduler step was taken even though the optimizer step was skipped (skipped steps are expected at the beginning of training with native AMP):

    /usr/local/lib/python3.8/dist-packages/torch/optim/lr_scheduler.py:138: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`.  Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
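
    For illustration only, here is a minimal sketch of one common way to avoid this warning in a plain PyTorch loop. It is not necessarily the approach this PR takes. The function name step_with_amp is hypothetical; scaler is assumed to behave like torch.cuda.amp.GradScaler (scale, step, update, get_scale), and the sketch relies on the scaler lowering its scale factor after a skipped step.

```python
def step_with_amp(scaler, optimizer, scheduler, loss):
    """Run one AMP optimization step.

    Returns True if the optimizer step ran, False if the scaler skipped it.
    The LR scheduler is stepped only when the optimizer step ran.
    """
    scaler.scale(loss).backward()

    scale_before = scaler.get_scale()
    scaler.step(optimizer)  # skipped internally when gradients contain inf/NaN
    scaler.update()         # the scale factor is reduced after a skipped step
    optimizer.zero_grad()

    # Assumption: a lower scale after update() means the step was skipped.
    stepped = scaler.get_scale() >= scale_before
    if stepped:
        scheduler.step()
    return stepped
```

    The check compares the scale before and after update() because the scaler does not report a skipped step directly.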
    

    Fixes #16228 #5558

    Does your PR introduce any breaking changes? If yes, please list them.

    No

    Before submitting

    • [x] Was this discussed/approved via a GitHub issue? (not for typos and docs)
    • [x] Did you read the contributor guideline, Pull Request section?
    • [x] Did you make sure your PR does only one thing, instead of bundling different changes together?
    • [ ] Did you make sure to update the documentation with your changes? (if necessary)
    • [ ] Did you write any new necessary tests? (not for typos and docs)
    • [ ] Did you verify new and existing tests pass locally with your changes?
    • [ ] Did you list all the breaking changes introduced by this pull request?
    • [ ] Did you update the CHANGELOG? (not for typos, docs, test updates, or minor internal changes/refactors)

    PR review

    Anyone in the community is welcome to review the PR. Before you start reviewing, make sure you have read the review guidelines. In short, see the following bullet-list:

    • [x] Is this pull request ready for review? (if not, please submit in draft mode)
    • [x] Check that all items from Before submitting are resolved
    • [x] Make sure the title is self-explanatory and the description concisely explains the PR
    • [x] Add labels and milestones (and optionally projects) to the PR so it can be classified
    pl 
    opened by milesial 0
  • Unexpected eternal CPU RAM growth during training

    Bug description

    Description

    With each training batch, CPU memory grows a little until I run out of memory (total RAM: 376 GB). After that I cannot even SSH into the machine that was serving my Jupyter notebook server, nor see any errors it produced. I cannot understand where or why memory is being cached forever. It seems to be in the dataloader, though.
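
    As a diagnostic aid (not part of the original report), the following sketch uses Python's standard tracemalloc module to find which source lines accumulate memory across repeated steps. The helper name top_memory_growth is hypothetical. One caveat: tracemalloc only sees allocations routed through Python's allocator hooks, so memory held by C libraries that bypass them will not show up.

```python
import tracemalloc


def top_memory_growth(run_step, n_steps=5, limit=3):
    """Call run_step(i) n_steps times and return the source lines whose
    traced allocations changed the most, largest change first."""
    tracemalloc.start()
    before = tracemalloc.take_snapshot()
    for i in range(n_steps):
        run_step(i)
    after = tracemalloc.take_snapshot()
    tracemalloc.stop()
    # compare_to() sorts by absolute size difference, largest first.
    return after.compare_to(before, "lineno")[:limit]
```

    A possible use is to create an iterator over the training dataloader, pass a function that fetches the next batch as run_step, and print the returned entries.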

    I have written a minimal example code shared below.

    Setup

    The cluster I use is managed by SLURM and uses a Lustre filesystem. Training is performed on an NVIDIA GPU. My Python 3.8.15 comes from an official Debian bullseye Docker image, which I pull and run with Singularity (I have to, as my cluster does not allow Docker).

    Other information

    With this setup the bug was only noticeable with the Xarray-based dataset. I still wanted to post it here because, in my real-case scenario, I tried changing the data to NPZ format and the same problem happened, so it does not seem to be an Xarray problem. Also, sometimes the bug just doesn't happen for some minutes. I don't know why. I have been experiencing this behavior for some months now :S

    Workaround

    One workaround we use is loading data with more workers, so that the memory is freed when the epoch ends, I suppose because the workers have exited. So, for the time being, I force fewer batches to be trained per epoch.

    How to reproduce the bug

    Dependencies

    • numpy==1.22.4 (also tested on 1.23.5)
    • xarray==2022.11.0 (also tested on 2022.12.0)
    • h5netcdf==1.0.2 (also tested on 1.1.0)
    • torch==1.13.0 (also tested on 1.13.1)
    • pytorch-lightning==1.6.5 (also tested on 1.8.6)
    • scipy==1.8.1

    Code

    # Imports
    
    from pathlib import Path
    import numpy as np
    import xarray as xr
    import torch
    import pytorch_lightning as pl
    from scipy.ndimage import gaussian_filter
    
    # Generating data
    
    data_dir = Path.cwd() / "data"
    
    def generate_fake_sample(shape):
        x = np.random.normal(size=shape)
        x = gaussian_filter(x, 8)
        x_min, x_max = x.min(), x.max()
        x = (x - x_min) / (x_max - x_min)
        x = x.astype(np.float32)
        y = (x > .7).astype(np.int8)
        return {"x": x, "y": y}
    
    # Here 2 equivalent datasets are defined, based on
    # - Xarray's NetCDF4 files
    # - Numpy's NPZ files
    
    class XarrayDataset(torch.utils.data.Dataset):
        
        def __init__(self,
                     data_dir=None,
                     transform=None,
                     shape=(128, 128, 128),
                     name="sample",
                     size=1):
            """
            Dataset made of NC files.
            """
            self.data_dir = Path(data_dir)
            self.data_path = self.data_dir / f"{name}.nc"
            self.transform = transform
            self.shape = shape
            self.size = size
            self.prepared = False
            
        def __len__(self):
            return self.size
        
        def __getitem__(self, idx):        
            ds = xr.open_dataset(self.data_path)
            
            sample = dict(ds.data_vars)        
            for var, val in sample.items():
                sample[var] = torch.from_numpy(val.data)
            ds.close()
            
            if self.transform:
                sample = self.transform(sample)
            return sample
            
        def prepare_data(self):
            if self.data_path.exists():
                return
            self.data_dir.mkdir(exist_ok=True)
    
            sample = generate_fake_sample(self.shape)
            ds = xr.Dataset({
                var: xr.DataArray(arr) for var, arr in sample.items()
            })
            ds.to_netcdf(self.data_path)
            ds.close()
    
    
    class NpzDataset(torch.utils.data.Dataset):
        
        def __init__(self,
                     data_dir=None,
                     transform=None,
                     shape=(128, 128, 128),
                     name="sample",
                     size=1):
            """
            Dataset made of NPZ files.
            """
            self.data_dir = Path(data_dir)
            self.data_path = self.data_dir / f"{name}.npz"
            self.transform = transform
            self.shape = shape
            self.size = size
            self.prepared = False
            
        def __len__(self):
            return self.size
        
        def __getitem__(self, idx):
            if not self.prepared:
                self.prepare_data()
            
            npz = np.load(self.data_path)
            
            sample = dict(npz)
            for var, val in sample.items():
                sample[var] = torch.from_numpy(val)
    
            if self.transform:
                sample = self.transform(sample)
            return sample
            
        def prepare_data(self):
            if self.data_path.exists():
                return
            self.data_dir.mkdir(exist_ok=True)
            
            sample = generate_fake_sample(self.shape)
            np.savez_compressed(self.data_path, **sample)
    
    # Choose dataset kind
    
    ChosenDataset = XarrayDataset
    #ChosenDataset = NpzDataset
    
    # Transforms
    
    class ComposedTransform():
        
        def __init__(self, transforms):
            self.transforms = transforms
            
        def __call__(self, sample):
            for transform in self.transforms:
                sample = transform(sample)
            return sample
    
    class AddAxisTransform():
        
        def __init__(self, keys=(), axis=0):
            self.keys = keys
            self.axis = axis
        
        def __call__(self, sample):
            for key in self.keys:
                sample[key] = torch.unsqueeze(sample[key], axis=self.axis)
            return sample
            
    class ConcatenateTransform():
        
        def __init__(self, key_groups, axis=0):
            self.key_groups = key_groups
            self.axis = axis
        
        def __call__(self, sample):
            for new_key, old_keys in self.key_groups.items():
                old_tensors = [sample[k] for k in old_keys]
                new_tensor = torch.concat(old_tensors, axis=self.axis)
                for old_key in np.unique(old_keys):
                    del sample[old_key]
                sample[new_key] = new_tensor
            return sample
    
    # Model
    
    SimpleModel = torch.nn.Sequential(
        torch.nn.Conv3d(in_channels=2, out_channels=1, kernel_size=1),
        torch.nn.Sigmoid(),
    )
    
    # Lightning module
    
    class TrainableSegmenter(pl.LightningModule):
        def __init__(self, data_dir):
            super().__init__()
            self.save_hyperparameters()
            
            self.data_dir = Path(data_dir)
            
            self.model = SimpleModel
            self.loss = torch.nn.CrossEntropyLoss()
            
            self.learning_rate = 1e-2
            self.batch_size = 64
            
            self.loader_cpus = 0
            self.prefetch_factor = 2
            self.persistent_workers = False
            self.pin_memory = False
            
            self.shuffle = True
            self.drop_last = True
            
            self.pre_processing_transforms = [
                AddAxisTransform(keys=["x", "y"], axis=0),
                # Although strange, this is similar to what I need to accomplish in my case
                ConcatenateTransform({"x": ["x", "x"], "y": ["y"]}, axis=0),
            ]
            self.augmentation_transforms = []
    
        def forward(self, x):
            return self.model(x)
    
        def configure_optimizers(self):
            return torch.optim.Adam(self.parameters(), lr=self.learning_rate)
    
        def training_step(self, batch, batch_idx):
            X = batch["x"]
            Y = batch["y"]
            Y_pred_proba = self.forward(X)
            loss = self.loss(Y_pred_proba, Y.to(torch.float16))
            
            self.log_dict({"train_loss": loss}, on_epoch=True)     
            return loss
    
        def validation_step(self, batch, batch_idx):
            X = batch["x"]
            Y = batch["y"]
            Y_pred_proba = self.forward(X)
            loss = self.loss(Y_pred_proba, Y.to(torch.float16))
            
            self.log_dict({"val_loss": loss}, on_epoch=True)
        
        def prepare_data(self):
            self.train_dataset = ChosenDataset(
                data_dir=self.data_dir,
                name="train",
                shape=(128, 128, 128),
                size=900,
                transform=ComposedTransform([
                    *self.pre_processing_transforms,
                    *self.augmentation_transforms,
                ]),
            )
            
            self.val_dataset = ChosenDataset(
                data_dir=self.data_dir,
                name="val",
                shape=(128, 128, 128),
                size=100,
                transform=ComposedTransform([
                    *self.pre_processing_transforms,
                ]),
            )
            
            self.train_dataset.prepare_data()
            self.val_dataset.prepare_data()
        
        def train_dataloader(self):                        
            return torch.utils.data.DataLoader(
                self.train_dataset,
                batch_size=self.batch_size,
                prefetch_factor=self.prefetch_factor,
                num_workers=self.loader_cpus,
                drop_last=self.drop_last,
                pin_memory=self.pin_memory,
                persistent_workers=self.persistent_workers,
                shuffle=self.shuffle,
            )
    
        def val_dataloader(self):
            return torch.utils.data.DataLoader(
                self.val_dataset,
                batch_size=self.batch_size,
                prefetch_factor=self.prefetch_factor,
                num_workers=self.loader_cpus,
                drop_last=self.drop_last,
                pin_memory=self.pin_memory,
                persistent_workers=self.persistent_workers,
                shuffle=False,
            )
    
    # Training
    
    pl.seed_everything(0)
    pl_model = TrainableSegmenter(data_dir=data_dir)
    
    trainer = pl.Trainer(
        accelerator="gpu",
        devices=[0],
        max_epochs=100,
    )
    
    trainer.fit(pl_model)
    

    Error messages and logs

    Cannot read errors because the computer crashes

    Environment

    Singularity (from Debian Bullseye based Python 3.8.15 Docker image)
    - PyTorch Lightning Version: 1.6.5, 1.8.6
    - PyTorch Version: 1.13.0, 1.13.1
    - Python version: 3.8.15
    - OS: Linux
    - CUDA/cuDNN version: CUDA 11.2 / cuDNN 8.2
    - GPU models and configuration: Tesla V100 (32 GB VRAM)
    - How you installed Lightning: pip
    

    More info

    No response

    Edit

    Added trials using newer library versions.

    needs triage 
    opened by marcosrdac 1
  • Introduce training and eval modes for the strategy

    Description & Motivation

    Some of our strategies need to be set up differently depending on whether the model is being trained or just evaluated (.fit vs .test).

    Some examples:

    https://github.com/Lightning-AI/lightning/blob/859a228a915893f14759dda6d35f6050fe6df382/src/pytorch_lightning/strategies/ddp.py#L170-L173

    https://github.com/Lightning-AI/lightning/blob/859a228a915893f14759dda6d35f6050fe6df382/src/pytorch_lightning/strategies/ddp.py#L357-L363

    https://github.com/Lightning-AI/lightning/blob/859a228a915893f14759dda6d35f6050fe6df382/src/pytorch_lightning/strategies/ipu.py#L146-L149

    This requires us to pass in the entire Trainer instance just to check the stage. Unifying this with the base strategies from Lite is not possible, because of the dependency on Trainer.

    Pitch

    Introduce strategy.train() and strategy.eval() with a similar mechanism as in nn.Module.train/eval(). This could be a method or a boolean attribute. The call to change this state would happen in the training loop/the trainer.

    For now, this proposal would only be for the strategy definition in Trainer. This helps us standardize the interface for both strategies, while maintaining the flexibility of adding Trainer-specific logic outside Lite.
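
    A minimal sketch of what such a mode switch could look like, modeled on nn.Module.train/eval. The class SketchStrategy and its describe_setup method are hypothetical names used for illustration only; they are not part of Lightning's API.

```python
class SketchStrategy:
    """Hypothetical strategy that tracks its own training/eval mode."""

    def __init__(self) -> None:
        # Mirror nn.Module: a freshly created object starts in training mode.
        self.training = True

    def train(self, mode: bool = True) -> "SketchStrategy":
        # Switch to training mode (or eval mode if mode=False) and return self.
        self.training = mode
        return self

    def eval(self) -> "SketchStrategy":
        # Shorthand for train(False), as in nn.Module.eval().
        return self.train(False)

    def describe_setup(self) -> str:
        # Stage-dependent setup can read the flag instead of a Trainer reference.
        if self.training:
            return "wrap model for training"
        return "skip wrapping for evaluation"


strategy = SketchStrategy()
fit_setup = strategy.train().describe_setup()
test_setup = strategy.eval().describe_setup()
```

    In this sketch the loop (or the trainer) would call strategy.train() before fitting and strategy.eval() before testing, so the strategy never has to inspect the Trainer to learn the stage.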

    Alternatives

    Alternatively, the state could be passed in to the individual methods of the strategy.

    Additional context

    No response

    cc @borda @justusschock @carmocca

    feature strategy 
    opened by awaelchli 0
  • Fixed error when using W&B project name from environment variables

    Fixed error when using W&B project name from environment variables

    What does this PR do?

    This PR fixes the error caused by the new change of having a default project argument in the WandbLogger. This change would override the project name from environment variables. The fix is to check if the project argument is "lightning_logs" and whether WANDB_PROJECT exists in os.environ. In case both conditions are true, project is set to os.environ["WANDB_PROJECT"]
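
    The check described above can be sketched as a small standalone function. The name resolve_wandb_project and the environ parameter are assumptions made for this illustration; the actual change lives inside WandbLogger and may be structured differently.

```python
import os
from typing import Mapping, Optional

DEFAULT_PROJECT = "lightning_logs"


def resolve_wandb_project(project: str, environ: Optional[Mapping[str, str]] = None) -> str:
    """Return the project name, letting WANDB_PROJECT replace the default."""
    env = os.environ if environ is None else environ
    # Override only when the default value is still in place AND the
    # environment variable is set; any other explicit argument wins.
    if project == DEFAULT_PROJECT and "WANDB_PROJECT" in env:
        return env["WANDB_PROJECT"]
    return project


print(resolve_wandb_project("lightning_logs", {"WANDB_PROJECT": "my-project"}))
```

    One consequence of comparing against the default string is that a user who explicitly passes "lightning_logs" cannot be told apart from one who passed nothing, so the environment variable takes precedence in both cases.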

    Fixes #16028

    Does your PR introduce any breaking changes? If yes, please list them.

    None

    Before submitting

    • [x] Was this discussed/approved via a GitHub issue? (not for typos and docs)
    • [x] Did you read the contributor guideline, Pull Request section?
    • [x] Did you make sure your PR does only one thing, instead of bundling different changes together?
    • [x] Did you make sure to update the documentation with your changes? (if necessary)
    • [x] Did you write any new necessary tests? (not for typos and docs)
    • [x] Did you verify new and existing tests pass locally with your changes?
    • [x] Did you list all the breaking changes introduced by this pull request?
    • [x] Did you update the CHANGELOG? (not for typos, docs, test updates, or minor internal changes/refactors)

    PR review

    Anyone in the community is welcome to review the PR. Before you start reviewing, make sure you have read the review guidelines. In short, see the following bullet-list:

    • [x] Is this pull request ready for review? (if not, please submit in draft mode)
    • [x] Check that all items from Before submitting are resolved
    • [x] Make sure the title is self-explanatory and the description concisely explains the PR
    • [x] Add labels and milestones (and optionally projects) to the PR so it can be classified

    Did you have fun?

    Make sure you had fun coding 🙃

    pl 
    opened by manangoel99 0
Releases(1.8.6)
  • 1.8.6(Dec 21, 2022)

    App

    Added

    • Added partial support for fastapi Request annotation in configure_api handlers (#16047)
    • Added a nicer UI with URL and examples for the autoscaler component (#16063)
    • Enabled users to have more control over scaling out/in intervals (#16093)
    • Added more datatypes to the serving component (#16018)
    • Added work.delete method to delete the work (#16103)
    • Added display_name property to LightningWork for the cloud (#16095)
    • Added ColdStartProxy to the AutoScaler (#16094)
    • Added status endpoint, enable ready (#16075)
    • Implemented ready for components (#16129)

    Changed

    • The default start_method for creating Work processes locally on macOS is now 'spawn' (previously 'fork') (#16089)
    • The utility lightning.app.utilities.cloud.is_running_in_cloud now returns True during the loading of the app locally when running with --cloud (#16045)
    • Updated Multinode Warning (#16091)
    • Updated app testing (#16000)
    • Changed overwrite to True (#16009)
    • Simplified messaging in cloud dispatch (#16160)
    • Added annotations endpoint (#16159)

    Fixed

    • Fixed PythonServer messaging "Your app has started" (#15989)
    • Fixed auto-batching so that requests that arrive after the batch interval but are still in the queue get batched (#16110)
    • Fixed a bug where AutoScaler would fail with min_replica=0 (#16092)
    • Fixed a non-thread safe deepcopy in the scheduler (#16114)
    • Fixed HTTP Queue sleeping for 1 sec by default if no delta was found (#16114)
    • Fixed the endpoint info tab not showing up in the AutoScaler UI (#16128)
    • Fixed an issue where an exception would be raised in the logs when using a recent version of streamlit (#16139)
    • Fixed e2e tests (#16146)

    Full Changelog: https://github.com/Lightning-AI/lightning/compare/1.8.5.post0...1.8.6

    Source code(tar.gz)
    Source code(zip)
    lightning-1.8.6-py3-none-any.whl(1.70 MB)
    lightning-1.8.6.tar.gz(1.42 MB)
    lightning-app-1.8.6.tar.gz(1.11 MB)
    lightning-lite-1.8.6.tar.gz(93.07 KB)
    lightning_app-1.8.6-py3-none-any.whl(1.18 MB)
    lightning_lite-1.8.6-py3-none-any.whl(133.47 KB)
    pytorch-lightning-1.8.6.tar.gz(562.70 KB)
    pytorch_lightning-1.8.6-py3-none-any.whl(781.50 KB)
  • 1.8.5.post0(Dec 16, 2022)

    App

    • Fixed install/upgrade - removing single quote (#16079)
    • Fixed bug where components that are re-instantiated several times failed to initialize if they were modifying self.lightningignore (#16080)
    • Fixed a bug where apps that had previously been deleted could not be run again from the CLI (#16082)

    PyTorch

    • Add function to remove checkpoint to allow override for extended classes (#16067)

    Full Changelog: https://github.com/Lightning-AI/lightning/compare/1.8.5...1.8.5.post0

    Source code(tar.gz)
    Source code(zip)
    lightning-1.8.5.post0-py3-none-any.whl(1.69 MB)
    lightning-1.8.5.post0.tar.gz(1.41 MB)
    lightning-app-1.8.5.post0.tar.gz(1.11 MB)
    lightning-lite-1.8.5.post0.tar.gz(93.19 KB)
    lightning_app-1.8.5.post0-py3-none-any.whl(1.18 MB)
    lightning_lite-1.8.5.post0-py3-none-any.whl(133.54 KB)
    pytorch-lightning-1.8.5.post0.tar.gz(563.05 KB)
    pytorch_lightning-1.8.5.post0-py3-none-any.whl(781.74 KB)
  • 1.8.5(Dec 15, 2022)

    App

    Added

    • Added Lightning{Flow,Work}.lightningignores attributes to programmatically ignore files before uploading to the cloud (#15818)
    • Added a progress bar while connecting to an app through the CLI (#16035)
    • Support running on multiple clusters (#16016)
    • Added guards to cluster deletion from cli (#16053)
    • Added creation of the default .lightningignore that ignores venv (#16056)

    Changed

    • Cleanup cluster waiting (#16054)

    Fixed

    • Fixed DDPStrategy import in app framework (#16029)
    • Fixed AutoScaler raising an exception when non-default cloud compute is specified (#15991)
    • Fixed and improved the login flow (#16052)
    • Fixed the debugger detection mechanism for the lightning App in VSCode (#16068)

    PyTorch

    • some minor cleaning

    Full Changelog: https://github.com/Lightning-AI/lightning/compare/1.8.4.post0...1.8.5

    Source code(tar.gz)
    Source code(zip)
    lightning-1.8.5-py3-none-any.whl(1.69 MB)
    lightning-1.8.5.tar.gz(1.41 MB)
    lightning-app-1.8.5.tar.gz(1.11 MB)
    lightning-lite-1.8.5.tar.gz(93.15 KB)
    lightning_app-1.8.5-py3-none-any.whl(1.18 MB)
    lightning_lite-1.8.5-py3-none-any.whl(133.46 KB)
    pytorch-lightning-1.8.5.tar.gz(562.80 KB)
    pytorch_lightning-1.8.5-py3-none-any.whl(781.60 KB)
  • 1.8.4.post0(Dec 9, 2022)

    App

    • Fixed MultiNode Component to use separate cloud computes (#15965)
    • Fixed Registration for CloudComputes of Works in L.app.structures (#15964)
    • Fixed a bug where auto-upgrading to the latest lightning via the CLI could get stuck in a loop (#15984)

    PyTorch

    • Fixed the XLAProfiler not recording anything due to mismatching of action names (#15885)

    Full Changelog: https://github.com/Lightning-AI/lightning/compare/1.8.4...1.8.4.post0

    Source code(tar.gz)
    Source code(zip)
    lightning-1.8.4.post0-py3-none-any.whl(1.69 MB)
    lightning-1.8.4.post0.tar.gz(1.41 MB)
    lightning-app-1.8.4.post0.tar.gz(1.10 MB)
    lightning-lite-1.8.4.post0.tar.gz(93.12 KB)
    lightning_app-1.8.4.post0-py3-none-any.whl(1.17 MB)
    lightning_lite-1.8.4.post0-py3-none-any.whl(133.50 KB)
    pytorch-lightning-1.8.4.post0.tar.gz(562.79 KB)
    pytorch_lightning-1.8.4.post0-py3-none-any.whl(781.56 KB)
  • 1.8.4(Dec 8, 2022)

    App

    Added

    • Add code_dir argument to tracer run (#15771)
    • Added the CLI command lightning run model to launch a LightningLite accelerated script (#15506)
    • Added the CLI command lightning delete app to delete a lightning app on the cloud (#15783)
    • Added a CloudMultiProcessBackend which enables running a child App from within the Flow in the cloud (#15800)
    • Utility for pickling work object safely even from a child process (#15836)
    • Added AutoScaler component (#15769)
    • Added the property ready of the LightningFlow to inform when the Open App should be visible (#15921)
    • Added private work attribute _start_method to customize how to start the works (#15923)
    • Added a configure_layout method to the LightningWork which can be used to control how the work is handled in the layout of a parent flow (#15926)
    • Added the ability to run a Lightning App or Component directly from the Gallery using lightning run app organization/name (#15941)
    • Added automatic conversion of list and dict of works and flows to structures (#15961)

    Changed

    • The MultiNode components now warn the user when running with num_nodes > 1 locally (#15806)
    • Cluster creation and deletion now wait by default (#15458)
    • Running an app without a UI locally no longer opens the browser (#15875)
    • Show a message when BuildConfig(requirements=[...]) is passed but a requirements.txt file is already present in the Work (#15799)
    • Show a message when BuildConfig(dockerfile="...") is passed but a Dockerfile file is already present in the Work (#15799)
    • Dropped name column from cluster list (#15721)
    • Apps without UIs no longer activate the "Open App" button when running in the cloud (#15875)
    • Wait for full file to be transferred in Path / Payload (#15934)

    Removed

    • Removed the SingleProcessRuntime (#15933)

    Fixed

    • Fixed SSH CLI command listing stopped components (#15810)
    • Fixed bug when launching apps on multiple clusters (#15484)
    • Fixed Sigterm Handler causing thread lock which caused KeyboardInterrupt to hang (#15881)
    • Fixed MPS error for multinode component (defaults to cpu on mps devices now as distributed operations are not supported by pytorch on mps) (#15748)
    • Fixed the work not being stopped on success when passed directly to the LightningApp (#15801)
    • Fixed the PyTorch Inference locally on GPU (#15813)
    • Fixed the enable_spawn method of the WorkRunExecutor (#15812)
    • Fixed require/import decorator (#15849)
    • Fixed a bug where using L.app.structures would cause multiple apps to be opened and fail with an error in the cloud (#15911)
    • Fixed PythonServer generating noise on M1 (#15949)
    • Fixed multiprocessing breakpoint (#15950)
    • Fixed detection of a Lightning App running in debug mode (#15951)
    • Fixed ImportError on Multinode if package not present (#15963)

    Lite

    • Fixed shuffle=False having no effect when using DDP/DistributedSampler (#15931)

    PyTorch

    Changed

    • Direct support for compiled models (#15922)

    Fixed

    • Fixed issue with unsupported torch.inference_mode() on hpu backends (#15918)
    • Fixed LRScheduler import for PyTorch 2.0 (#15940)
    • Fixed fit_loop.restarting to be False for lr finder (#15620)
    • Fixed torch.jit.script-ing a LightningModule causing an unintended error message about deprecated use_amp property (#15947)

    Full Changelog: https://github.com/Lightning-AI/lightning/compare/1.8.3...1.8.4

    Source code(tar.gz)
    Source code(zip)
    lightning-1.8.4-py3-none-any.whl(1.69 MB)
    lightning-1.8.4.tar.gz(1.41 MB)
    lightning-app-1.8.4.tar.gz(1.10 MB)
    lightning-lite-1.8.4.tar.gz(93.09 KB)
    lightning_app-1.8.4-py3-none-any.whl(1.17 MB)
    lightning_lite-1.8.4-py3-none-any.whl(133.42 KB)
    pytorch-lightning-1.8.4.tar.gz(562.35 KB)
    pytorch_lightning-1.8.4-py3-none-any.whl(781.21 KB)
  • 1.8.3(Nov 23, 2022)

    App

    Changed

    • Deduplicate top-level lightning CLI command groups (#15761)
      • lightning add ssh-key CLI command has been transitioned to lightning create ssh-key
      • lightning remove ssh-key CLI command has been transitioned to lightning delete ssh-key
    • Set Torch inference mode for prediction (#15719)
    • Improved LightningTrainerScript start-up time (#15751)
    • Disable XSRF protection in StreamlitFrontend to support upload in localhost (#15684)

    Fixed

    • Fixed debugging with VSCode IDE (#15747)
    • Fixed setting property to the LightningFlow (#15750)

    Lite

    Changed

    • Temporarily removed support for Hydra multi-run (#15737)

    PyTorch

    Changed

    • Temporarily removed support for Hydra multi-run (#15737)
    • Switch from tensorboard to tensorboardx in TensorBoardLogger (#15728)

    Full Changelog: https://github.com/Lightning-AI/lightning/compare/1.8.2...1.8.3

    Source code(tar.gz)
    Source code(zip)
    lightning-1.8.3-py3-none-any.whl(1.61 MB)
    lightning-1.8.3.tar.gz(1.33 MB)
    lightning-app-1.8.3.tar.gz(1.03 MB)
    lightning-lite-1.8.3.tar.gz(92.98 KB)
    lightning_app-1.8.3-py3-none-any.whl(1.10 MB)
    lightning_lite-1.8.3-py3-none-any.whl(133.32 KB)
    pytorch-lightning-1.8.3.tar.gz(561.37 KB)
    pytorch_lightning-1.8.3-py3-none-any.whl(780.14 KB)
  • 1.8.2(Nov 18, 2022)

    App

    Added

    • Added title and description to ServeGradio (#15639)
    • Added a friendly error message when attempting to run the default cloud compute with a custom base image configured (#14929)

    Changed

    • Improved support for running apps when dependencies aren't installed (#15711)
    • Changed the root directory of the app (which gets uploaded) to be the folder containing the app file, rather than any parent folder containing a .lightning file (#15654)
    • Enabled MultiNode Components to support state broadcasting (#15607)
    • Prevent artefactual "running from outside your current environment" error (#15647)
    • Rename failed -> error in tables (#15608)

    Fixed

    • Fixed race condition to over-write the frontend with app infos (#15398)
    • Fixed bi-directional queues sending delta with Drive Component name changes (#15642)
    • Fixed CloudRuntime works collection with structures and accelerated multi node startup time (#15650)
    • Fixed catimage import (#15712)
    • Parse all lines in app file looking for shebangs to run commands (#15714)

    Lite

    Fixed

    • Fixed the automatic fallback from LightningLite(strategy="ddp_spawn", ...) to LightningLite(strategy="ddp", ...) when on an LSF cluster (#15103)

    PyTorch

    Fixed

    • Make sure save_dir can be empty str (#15638)
    • Fixed the automatic fallback from Trainer(strategy="ddp_spawn", ...) to Trainer(strategy="ddp", ...) when on an LSF cluster (#15103)

    Full Changelog: https://github.com/Lightning-AI/lightning/compare/1.8.1...1.8.2

    Source code(tar.gz)
    Source code(zip)
    lightning-1.8.2-py3-none-any.whl(1.61 MB)
    lightning-1.8.2.tar.gz(1.33 MB)
    lightning-app-1.8.2.tar.gz(1.03 MB)
    lightning-lite-1.8.2.tar.gz(93.06 KB)
    lightning_app-1.8.2-py3-none-any.whl(1.10 MB)
    lightning_lite-1.8.2-py3-none-any.whl(133.40 KB)
    pytorch-lightning-1.8.2.tar.gz(561.06 KB)
    pytorch_lightning-1.8.2-py3-none-any.whl(779.93 KB)
  • 1.8.1(Nov 10, 2022)

    App

    Added

    • Added the start method to the work (#15523)
    • Added a MultiNode Component to run distributed computation with any framework (#15524)
    • Exposed RunWorkExecutor to the work and provided default ones for the MultiNode Component (#15561)
    • Added a start_with_flow flag to the LightningWork which can be disabled to prevent the work from starting at the same time as the flow (#15591)
    • Added support for running Lightning App with VSCode IDE debugger (#15590)
    • Added bi-directional delta updates between the flow and the works (#15582)
    • Added --setup flag to lightning run app CLI command allowing for dependency installation via app comments (#15577)
    • Auto-upgrade / detect environment mis-match from the CLI (#15434)
    • Added Serve component (#15609)

    Changed

    • Changed flow.flows to be recursive to align its behavior with flow.works (#15466)
    • The params argument in TracerPythonScript.run no longer prepends -- automatically to parameters (#15518)
    • Only check versions / env when not in the cloud (#15504)
    • Periodically sync database to the drive (#15441)
    • Slightly safer multi node (#15538)
    • Reuse existing commands when running connect more than once (#15471)

    Fixed

    • Fixed writing app name and id in connect.txt file for the command CLI (#15443)
    • Fixed missing root flow among the flows of the app (#15531)
    • Fixed a bug with the Multi Node Component and added some examples (#15557)
    • Fixed a bug where payload would take a very long time locally (#15557)
    • Fixed an issue with the lightning CLI taking a long time to error out when the cloud is not reachable (#15412)

    Lite

    Fixed

    • Fix an issue with the SLURM srun detection causing permission errors (#15485)
    • Fixed the import of lightning_lite causing a warning 'Redirects are currently not supported in Windows or MacOs' (#15610)

    PyTorch

    Fixed

    • Fixed TensorBoardLogger not validating the input array type when logging the model graph (#15323)
    • Fixed an attribute error in ColossalAIStrategy at import time when torch.distributed is not available (#15535)
    • Fixed an issue when calling fs.listdir with file URI instead of path in CheckpointConnector (#15413)
    • Fixed an issue with the BaseFinetuning callback not setting the track_running_stats attribute for batch normalization layers (#15063)
    • Fixed an issue with WandbLogger(log_model=True|'all') raising an error and not being able to serialize tensors in the metadata (#15544)
    • Fixed the gradient unscaling logic when using Trainer(precision=16) and fused optimizers such as Adam(..., fused=True) (#15544)
    • Fixed model state transfer in multiprocessing launcher when running multi-node (#15567)
    • Fixed manual optimization raising AttributeError with Bagua Strategy (#12534)
    • Fixed the import of pytorch_lightning causing a warning 'Redirects are currently not supported in Windows or MacOs' (#15610)

    Full Changelog: https://github.com/Lightning-AI/lightning/compare/1.8.0...1.8.1

    Source code(tar.gz)
    Source code(zip)
    lightning-1.8.1-py3-none-any.whl(1.59 MB)
    lightning-1.8.1.tar.gz(1.31 MB)
    lightning-app-1.8.1.tar.gz(1.03 MB)
    lightning-lite-1.8.1.tar.gz(92.91 KB)
    lightning_app-1.8.1-py3-none-any.whl(1.09 MB)
    lightning_lite-1.8.1-py3-none-any.whl(133.31 KB)
    pytorch-lightning-1.8.1.tar.gz(560.75 KB)
    pytorch_lightning-1.8.1-py3-none-any.whl(779.69 KB)
  • 1.8.0.post1(Nov 2, 2022)

    What's Changed

    • Implement freeze batchnorm with freezing track running stats by @PososikTeam in https://github.com/Lightning-AI/lightning/pull/15063
    • Pkg: fix parsing versions by @Borda in https://github.com/Lightning-AI/lightning/pull/15401
    • Remove pytest as a requirement to run app by @manskx in https://github.com/Lightning-AI/lightning/pull/15449

    New Contributors

    • @PososikTeam made their first contribution in https://github.com/Lightning-AI/lightning/pull/15063

    Full Changelog: https://github.com/Lightning-AI/lightning/compare/1.8.0...1.8.0.post1

    Source code(tar.gz)
    Source code(zip)
    lightning-1.8.0.post1-py3-none-any.whl(1.57 MB)
    lightning-1.8.0.post1.tar.gz(1.29 MB)
    lightning-app-1.8.0.post1.tar.gz(1022.57 KB)
    lightning-lite-1.8.0.post1.tar.gz(92.34 KB)
    lightning_app-1.8.0.post1-py3-none-any.whl(1.06 MB)
    lightning_lite-1.8.0.post1-py3-none-any.whl(133.24 KB)
    pytorch-lightning-1.8.0.post1.tar.gz(558.13 KB)
    pytorch_lightning-1.8.0.post1-py3-none-any.whl(777.42 KB)
  • 1.8.0(Nov 1, 2022)

    The core team is excited to announce the release of Lightning 1.8 :zap:

    Lightning v1.8 is the culmination of work from 52 contributors who have worked on features, bug fixes, and documentation for a total of over 550 commits since v1.7.

    Highlights

    Colossal-AI

    Colossal-AI focuses on improving efficiency when training large-scale AI models with billions of parameters. With the new Colossal-AI strategy in Lightning 1.8, you can train existing models like GPT-3 with up to half as many GPUs as usually needed. You can also train models up to twice as big with the same number of GPUs, saving you significant cost. Here is how you use it:

    # Select the strategy with good defaults
    trainer = Trainer(strategy="colossalai")
    
    # or tune parameters to your liking
    from lightning.pytorch.strategies import ColossalAIStrategy
    
    trainer = Trainer(strategy=ColossalAIStrategy(placement_policy="cpu", ...))
    

    You can find Colossal-AI's benchmarks with Lightning on GPT-2 here.

    Under the hood, Colossal-AI implements different parallelism algorithms that are especially interesting for the development of SOTA transformer models:

    • Data Parallelism
    • Pipeline Parallelism
    • 1D, 2D, 2.5D, 3D Tensor Parallelism
    • Sequence Parallelism
    • Zero Redundancy Optimization

    Learn how to install and use Colossal-AI effectively with Lightning here.

    NOTE: This strategy is marked as experimental. Stay tuned for more updates in the future.

    Secrets for Lightning Apps

    Introducing encrypted secrets (#14612), a feature requested by Lightning App users :tada:!

    Encrypted secrets allow you to securely pass private data to your apps, like API keys, access tokens, database passwords, or other credentials, without exposing them in your code.

    1. Add a secret to your Lightning account in lightning.ai (read more here)

    2. Add an environment variable to your app to read the secret:

      # somewhere in your Flow or Work:
      GitHubComponent(api_token=os.environ["API_TOKEN"])
      
    3. Pass the secret to your app run with the following command:

      lightning run app app.py --cloud --secret API_TOKEN=github_api_token
      

    These secrets are encrypted and stored in the Lightning database. Nothing except your app can access the value.

    NOTE: This is an experimental feature.

    CLI Commands for Lightning Apps

    Introducing CLI commands for apps (#13602)! As a Lightning App builder, if you want to easily create a CLI for users to interact with your app, then this is for you.

    Here is an example where users can dynamically create notebooks from the CLI. All you need to do is implement the configure_commands hook on the LightningFlow:

    import lightning as L
    from commands.notebook.run import RunNotebook
    
    
    class Flow(L.LightningFlow):
        ...
    
        def configure_commands(self):
            # Return a list of dictionaries with commands:
            return [{"run notebook": RunNotebook(method=self.run_notebook)}]
    
    
    app = L.LightningApp(Flow())
    

    Once the app is running with lightning run app app.py, you can connect to the app with the following command:

    lightning connect {app name} -y
    

    and run the command that was configured:

    lightning run notebook --name=my_notebook_name
    

    For a full tutorial and running example, visit our docs.

    NOTE: This is an experimental feature.

    Auto-wrapping for FSDP Strategy

    In Lightning v1.7, we introduced an integration for PyTorch FSDP in the form of our FSDP strategy, which allows you to train huge models with billions of parameters sharded across hundreds of GPUs and machines.

    # Native FSDP implementation
    trainer = Trainer(strategy="fsdp_native")
    

    We are continuing to improve the support for this feature by adding automatic wrapping of layers for use cases where the model fits into CPU memory, but not into GPU memory (#14383).

    Here are some examples:

    Case 1: Model is so large that it does not fit into CPU memory. Construct your layers in the configure_sharded_model hook and wrap the large ones you want to shard across GPUs:

    class MassiveModel(LightningModule):
        ...
        
        # Create model here and wrap the large layers for sharding.
        # `wrap` is assumed to be imported, e.g. from torch.distributed.fsdp.wrap for native FSDP.
        def configure_sharded_model(self):
            for i, layer in enumerate(self.block):
                self.block[i] = wrap(layer)
            ...
    

    Case 2: Model fits into CPU memory, but not into GPU memory. In Lightning v1.8, you no longer need to do anything special here, as we can automatically wrap the layers for you using FSDP's policy:

    model = MassiveModel()
    trainer = Trainer(
        accelerator="gpu", 
        devices=8, 
        strategy="fsdp_native",  # or strategy="fsdp" for fairscale
        precision=16
    )
    
    # Automatically wraps the layers here:
    trainer.fit(model)
    

    Case 3: Model fits into GPU memory. No action required, use any strategy you want.

    Note: if you want to manually wrap layers for more control, you can still do that!

    Read more about FSDP and how layer wrapping works in our docs.

    New Tuner Callbacks

    In this release, we focused on Tuner improvements and introduced two new callbacks that can help you customize the batch size finder and learning rate finder as per your use case.

    Batch Size Finder (#11089)

    1. You can customize the BatchSizeFinder callback to run at different epochs. This feature is useful while fine-tuning models since you can't always use the same batch size after unfreezing the backbone.

      from lightning.pytorch.callbacks import BatchSizeFinder
      
      
      class FineTuneBatchSizeFinder(BatchSizeFinder):
          def __init__(self, milestones, *args, **kwargs):
              super().__init__(*args, **kwargs)
              self.milestones = milestones
      
          def on_fit_start(self, *args, **kwargs):
              return
      
          def on_train_epoch_start(self, trainer, pl_module):
              if trainer.current_epoch in self.milestones or trainer.current_epoch == 0:
                  self.scale_batch_size(trainer, pl_module)
      
      
      trainer = Trainer(callbacks=[FineTuneBatchSizeFinder(milestones=(5, 10))])
      trainer.fit(...)
      
    2. Run batch size finder for validate/test/predict.

      from lightning.pytorch.callbacks import BatchSizeFinder
      
      
      class EvalBatchSizeFinder(BatchSizeFinder):
          def __init__(self, *args, **kwargs):
              super().__init__(*args, **kwargs)
      
          def on_fit_start(self, *args, **kwargs):
              return
      
          def on_test_start(self, trainer, pl_module):
              self.scale_batch_size(trainer, pl_module)
      
      
      trainer = Trainer(callbacks=[EvalBatchSizeFinder()])
      trainer.test(...)
      

    Learning Rate Finder (#13802)

    You can now use the LearningRateFinder callback to run at different intervals. This feature is useful when fine-tuning models, for example.

    from lightning.pytorch.callbacks import LearningRateFinder
    
    
    class FineTuneLearningRateFinder(LearningRateFinder):
        def __init__(self, milestones, *args, **kwargs):
            super().__init__(*args, **kwargs)
            self.milestones = milestones
    
        def on_fit_start(self, *args, **kwargs):
            return
    
        def on_train_epoch_start(self, trainer, pl_module):
            if trainer.current_epoch in self.milestones or trainer.current_epoch == 0:
                self.lr_find(trainer, pl_module)
    
    trainer = Trainer(callbacks=[FineTuneLearningRateFinder(milestones=(5, 10))])
    trainer.fit(...)
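
    Both fine-tuning callbacks above use the same condition in on_train_epoch_start. A minimal sketch of which epochs that condition selects, written as a plain function so it runs without Lightning (finder_epochs is a hypothetical helper, not a library function):

```python
from typing import Iterable, List


def finder_epochs(milestones: Iterable[int], max_epochs: int) -> List[int]:
    """Epochs for which `epoch in milestones or epoch == 0` is true."""
    wanted = set(milestones)
    return [epoch for epoch in range(max_epochs) if epoch in wanted or epoch == 0]


epochs = finder_epochs(milestones=(5, 10), max_epochs=12)
print(epochs)  # [0, 5, 10]
```

    Epoch 0 is always included because the overridden on_fit_start returns early, so the first search would otherwise never run.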
    

    LightningCLI Improvements

    Even though the LightningCLI class is designed to help in the implementation of command line tools, there are instances when it might be more desirable to run directly from Python. In Lightning 1.8, you can now do this (#14596):

    from lightning.pytorch.cli import LightningCLI
    
    def cli_main(args):
        cli = LightningCLI(MyModel, ..., args=args)
        ...
    

    Anywhere in your program, you can now call the CLI directly:

    cli_main(["--trainer.max_epochs=100", "--model.encoder_layers=24"])
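
    The args parameter follows the same convention as parse_args(args) in Python's standard argparse module: passing a list bypasses the real command line. The sketch below uses argparse only as an analogy and does not reproduce LightningCLI's internals; the option names and defaults are illustrative.

```python
import argparse
from typing import List, Optional


def cli_main(args: Optional[List[str]] = None) -> argparse.Namespace:
    parser = argparse.ArgumentParser()
    parser.add_argument("--trainer.max_epochs", dest="max_epochs", type=int, default=1)
    parser.add_argument("--model.encoder_layers", dest="encoder_layers", type=int, default=12)
    # args=None makes argparse read sys.argv[1:]; an explicit list is parsed instead.
    return parser.parse_args(args)


config = cli_main(["--trainer.max_epochs=100", "--model.encoder_layers=24"])
print(config.max_epochs, config.encoder_layers)  # 100 24
```

    Calling the function with an explicit list makes it straightforward to drive the same entry point from tests or notebooks.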
    

    Learn about all features of the LightningCLI!

    Improvements to the SLURM Support

    Multi-node training on a SLURM cluster has been supported since the inception of Lightning Trainer, and has seen several improvements over time thanks to many community contributions. And we just keep going! In this release, we've added two quality of life improvements:

    • The preemption/termination signal is now configurable (#14626):

      import signal

      from lightning.pytorch.plugins.environments import SLURMEnvironment

      # the default signal is SIGUSR1
      trainer = Trainer(plugins=[SLURMEnvironment(requeue_signal=signal.SIGUSR1)])
      
      # customize it for your cluster
      trainer = Trainer(plugins=[SLURMEnvironment(requeue_signal=signal.SIGHUP)])
      
    • Automatic requeuing of jobs now also works for array jobs (#15040)! Array jobs are a convenient way to group/launch several scripts at once. When the SLURM scheduler interrupts your jobs, Lightning will save a checkpoint, resubmit a new job, and, once the scheduler allocates resources, the Trainer will resume from where it left off.
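
    To illustrate the general mechanism behind a configurable requeue signal, here is a sketch using only Python's standard signal module. It assumes a POSIX system and that it runs in the main thread; RequeueFlag and install_requeue_handler are hypothetical names, and Lightning's own handler does more (checkpointing and resubmitting the job).

```python
import signal


class RequeueFlag:
    """Records that the preemption signal arrived so a loop can react to it."""

    def __init__(self) -> None:
        self.requested = False

    def handle(self, signum, frame) -> None:
        # Keep the handler tiny: set a flag and let the main loop do the work.
        self.requested = True


def install_requeue_handler(requeue_signal: int) -> RequeueFlag:
    flag = RequeueFlag()
    # Register the handler for whichever signal the cluster is configured to send.
    signal.signal(requeue_signal, flag.handle)
    return flag


flag = install_requeue_handler(signal.SIGUSR1)
signal.raise_signal(signal.SIGUSR1)  # simulate the scheduler's warning signal
print(flag.requested)
```

    Taking the signal number as a parameter is what makes the handler portable across clusters that warn with SIGUSR1, SIGHUP, or another signal.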

    Read more about our SLURM integration here.

    Backward Incompatible Changes

    This section outlines notable changes that are not backward compatible with previous versions. The full list of changes and removals can be found in the CHANGELOG below.

    Callback hooks for loading and saving checkpoints

    The signature and behavior of the on_load_checkpoint and on_save_checkpoint callback hooks have changed (#14835):

    Before:

    def on_save_checkpoint(self, trainer, pl_module, checkpoint):
        ...
        # previously, we were able to return state here
        return state
    
    def on_load_checkpoint(self, trainer, pl_module, callback_state):
        # previously, only the state for this callback was passed in as argument
        ...
    

    Now:

    def on_save_checkpoint(self, trainer, pl_module, checkpoint):
        ...
        # returning a value here is no longer supported
        # you can modify the checkpoint dict directly
        return None
    
    
    def state_dict(self):
        ...
        # Now, return state from this new method
        return state
    
    
    def on_load_checkpoint(self, trainer, pl_module, checkpoint):
        # now, the full checkpoint dictionary is passed in, not just this callback's state
        ...
        
        
    def load_state_dict(self, state):
        # Now, the state for this callback gets passed to this new method
        ...
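
    A minimal round-trip sketch of the state_dict / load_state_dict pair. To stay runnable without Lightning installed, CounterCallback below is a plain stand-in class rather than a real Callback subclass, and record_batch is a hypothetical method standing in for a training hook.

```python
from typing import Any, Dict


class CounterCallback:
    """Stand-in for a Callback subclass that keeps a small piece of state."""

    def __init__(self) -> None:
        self.batches_seen = 0

    def record_batch(self) -> None:
        # In a real callback this would happen inside a training hook.
        self.batches_seen += 1

    def state_dict(self) -> Dict[str, Any]:
        # Return only this callback's own state.
        return {"batches_seen": self.batches_seen}

    def load_state_dict(self, state: Dict[str, Any]) -> None:
        # Receives exactly what state_dict() returned when the checkpoint was saved.
        self.batches_seen = state["batches_seen"]


original = CounterCallback()
for _ in range(3):
    original.record_batch()

restored = CounterCallback()
restored.load_state_dict(original.state_dict())
print(restored.batches_seen)  # 3
```

    Keeping the callback's state in these two methods leaves on_save_checkpoint and on_load_checkpoint free for edits to the checkpoint dictionary as a whole.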
    

    DataModule hooks for loading and saving checkpoints

    The on_save_checkpoint and on_load_checkpoint hooks on the LightningDataModule have been removed in favor of the state_dict and load_state_dict methods:

    -def on_save_checkpoint(self, checkpoint):
    -    checkpoint["banana"] = self.banana
    +def state_dict(self):
    +    return dict(banana=self.banana)
    
    
    -def on_load_checkpoint(self, checkpoint):
    -    self.banana = checkpoint["banana"]
    +def load_state_dict(self, state):
    +    self.banana = state["banana"]
    

    Callback hooks

    We removed some deprecated Callback hooks that were ambiguous to use (#14834):

    | Old name | New name |
    |---|---|
    | on_batch_start | on_train_batch_start |
    | on_batch_end | on_train_batch_end |
    | on_epoch_start | on_train_epoch_start |
    | on_epoch_start | on_validation_epoch_start |
    | on_epoch_start | on_test_epoch_start |
    | on_pretrain_routine_start | on_fit_start |
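
    To find callbacks that still define one of the removed hooks, a small check along these lines could be used. This is only a sketch: `REMOVED_HOOKS` copies the renames listed above, and `find_removed_hooks` is a hypothetical helper, not part of Lightning.

```python
from typing import Dict, List

# Old hook name -> replacement hook(s), copied from the table above.
REMOVED_HOOKS: Dict[str, List[str]] = {
    "on_batch_start": ["on_train_batch_start"],
    "on_batch_end": ["on_train_batch_end"],
    "on_epoch_start": [
        "on_train_epoch_start",
        "on_validation_epoch_start",
        "on_test_epoch_start",
    ],
    "on_pretrain_routine_start": ["on_fit_start"],
}


def find_removed_hooks(callback_cls: type) -> Dict[str, List[str]]:
    """Return the removed hooks that user code in `callback_cls` still defines."""
    found: Dict[str, List[str]] = {}
    for klass in callback_cls.__mro__:
        if klass.__module__.startswith("pytorch_lightning"):
            continue  # ignore anything defined by Lightning's own base classes
        for old_name, new_names in REMOVED_HOOKS.items():
            if old_name in vars(klass):
                found[old_name] = new_names
    return found
```

    Running `find_removed_hooks(MyCallback)` before upgrading returns an empty dict when nothing needs to be renamed.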

    Trainer Device Attributes

    We cleaned up the properties related to device indices (#14829).

    The attributes Trainer.{devices,gpus,num_gpus,ipus,tpu_cores,num_processes,root_gpu,data_parallel_device_ids} have been removed in favor of accelerator-agnostic attributes:

    trainer = Trainer(...)
    
    # access the number of devices the trainer uses on this machine ...
    print(trainer.num_devices)
    
    # ... or the device IDs
    print(trainer.device_ids)
    

    Setting the torch-distributed backend

    In previous versions of Lightning, switching between the "gloo" and "nccl" backends for multi-GPU, multi-node training was possible through setting an environment variable like so:

    PL_TORCH_DISTRIBUTED_BACKEND="gloo" python train.py
    

    But not all strategies support changing the backend in this way. From now on, the backend has to be set in the code (#14693):

    trainer = Trainer(strategy=DDPStrategy(process_group_backend="gloo"))
    

    The default remains "nccl", and you should choose "gloo" only for debugging purposes.
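
    Launch scripts that still export `PL_TORCH_DISTRIBUTED_BACKEND` can keep working if the training script reads the variable itself and passes the value on explicitly. The following is a sketch under that assumption; `resolve_backend` and `build_trainer` are hypothetical helpers, and only the two backends named above are accepted here.

```python
import os

# Only the two backends mentioned in the text are accepted in this sketch.
SUPPORTED_BACKENDS = ("nccl", "gloo")


def resolve_backend(default: str = "nccl") -> str:
    """Read the legacy environment variable in user code and validate it."""
    backend = os.environ.get("PL_TORCH_DISTRIBUTED_BACKEND", default).lower()
    if backend not in SUPPORTED_BACKENDS:
        raise ValueError(
            f"Unsupported backend {backend!r}; expected one of {SUPPORTED_BACKENDS}"
        )
    return backend


def build_trainer(**trainer_kwargs):
    """Create a Trainer whose DDP backend is set explicitly in code."""
    # Imported lazily so resolve_backend() works without Lightning installed.
    from pytorch_lightning import Trainer
    from pytorch_lightning.strategies import DDPStrategy

    strategy = DDPStrategy(process_group_backend=resolve_backend())
    return Trainer(strategy=strategy, **trainer_kwargs)
```

    With this in place, `PL_TORCH_DISTRIBUTED_BACKEND="gloo" python train.py` keeps its old meaning, but the decision is now visible in the script rather than hidden in the framework.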

    Logging with multiple loggers

    Logging with multiple loggers can be super useful (and super easy with Lightning). For example, you could be using one logger to record sensitive image logs to a hosted MLFlow server within your organization, and at the same time log loss curves online to WandB.

    # multiple loggers are passed as a list to the `logger` argument
    trainer = Trainer(
        logger=[WandbLogger(...), MLFlowLogger(...)]
    )
    

    Here are two major changes that apply when using multiple loggers in 1.8:

    • Checkpoints and profiler reports no longer go to a strange folder with a long, hard-to-remember name (#14325). From now on, these artifacts will land in the version folder of the first logger in the list.

    • The loggers used to be wrapped by a LoggerCollection object, so that when you accessed trainer.logger you could log to all of them simultaneously. However, this "magic" caused confusion and errors among users and we decided to simplify this (#14283):

      # now returns the first logger in the list
      print(trainer.logger)
      
      # access all loggers in a list with plural
      loggers = trainer.loggers
      
      for logger in loggers:
          logger.do_something()
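
    Code that relied on the old collection behavior can loop over the loggers explicitly. The helper below is a small sketch of that idea; `log_to_all` is a hypothetical name, and it assumes each logger exposes the usual `log_metrics(metrics, step=...)` method.

```python
from typing import Any, Iterable, Mapping, Optional


def log_to_all(
    loggers: Iterable[Any],
    metrics: Mapping[str, float],
    step: Optional[int] = None,
) -> None:
    """Send the same metrics to every logger in the list."""
    for logger in loggers:
        # Each logger gets its own copy so one cannot mutate another's input.
        logger.log_metrics(dict(metrics), step=step)
```

    Inside a LightningModule or callback this could be called as `log_to_all(trainer.loggers, {"val_loss": 0.1}, step=trainer.global_step)`.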
      

    Deprecations

    Why is Lightning deprecating APIs in every release?

    Many users have this question, and it is a fair one! Deprecations are a normal part of API evolution in all software. We continually improve Lightning, which means we make APIs like class names, methods, hooks and arguments clear, easy to remember, and general enough to adopt more functionality in the future. Sometimes we have to let old things go to build new and better products.

    Learn more about our deprecation window here.

    So far, we have followed the pattern of removing deprecated functionality and APIs after two minor versions of deprecation. From Lightning 1.8 onward, we will additionally convert warnings to error messages after the deprecation phase ends. This way, we can greatly improve the upgrade experience with helpful messages for users who skip more than two minor Lightning versions. Experimental features, which are marked as such in our documentation, are the exception to this rule.

    Here is a summary of major deprecations introduced in 1.8:

    | API | Removal version | Alternative |
    |---|---|---|
    | Argument Trainer(amp_level=...) | 1.10 | Trainer(plugins=[ApexMixedPrecisionPlugin(amp_level=...)]) |
    | Function unwrap_lightning_module | 1.10 | Strategy.lightning_module |
    | Function unwrap_lightning_module_sharded | 1.10 | Strategy.lightning_module |
    | Import pl.core.mixins.DeviceDtypeModuleMixin | 1.10 | No longer supported |
    | Argument LightningCLI(save_config_filename=...) | 1.10 | LightningCLI(save_config_kwargs=dict(config_filename=...)) |
    | Argument LightningCLI(save_config_overwrite=...) | 1.10 | LightningCLI(save_config_kwargs=dict(overwrite=...)) |
    | Argument LightningCLI(save_config_multifile=...) | 1.10 | LightningCLI(save_config_kwargs=dict(multifile=...)) |
    | Enum TrainerFn.TUNING | 1.10 | No longer supported |
    | Enum RunningStage.TUNING | 1.10 | No longer supported |
    | Attribute Trainer.tuning | 1.10 | No longer supported |
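
    For the three LightningCLI arguments, the migration is mechanical: each old argument becomes a key inside `save_config_kwargs`. The sketch below shows one way to rewrite a dictionary of keyword arguments accordingly; `migrate_cli_kwargs` is a hypothetical helper and the mapping is taken from the table above.

```python
from typing import Any, Dict

# Deprecated LightningCLI argument -> key inside save_config_kwargs.
_RENAMED = {
    "save_config_filename": "config_filename",
    "save_config_overwrite": "overwrite",
    "save_config_multifile": "multifile",
}


def migrate_cli_kwargs(kwargs: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of `kwargs` with deprecated arguments folded into save_config_kwargs."""
    migrated = {key: value for key, value in kwargs.items() if key not in _RENAMED}
    save_config_kwargs = dict(kwargs.get("save_config_kwargs") or {})
    for old_name, new_name in _RENAMED.items():
        if old_name in kwargs:
            save_config_kwargs[new_name] = kwargs[old_name]
    if save_config_kwargs:
        migrated["save_config_kwargs"] = save_config_kwargs
    return migrated
```

    The result can then be passed on as `LightningCLI(model_class, **migrate_cli_kwargs(old_kwargs))`, where `model_class` and `old_kwargs` stand for whatever the existing script uses.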

    CHANGELOG

    Lightning App

    Added
    • Added load_state_dict and state_dict hooks for LightningFlow components (#14100)
    • Added a --secret option to CLI to allow binding secrets to app environment variables when running in the cloud (#14612)
    • Added support for running the works without cloud compute in the default container (#14819)
    • Added an HTTPQueue as an optional replacement for the default redis queue (#14978)
    • Added support for configuring flow cloud compute (#14831)
    • Added support for adding descriptions to commands either through a docstring or the DESCRIPTION attribute (#15193)
    • Added a try / catch mechanism around request processing to avoid killing the flow (#15187)
    • Added a Database Component (#14995)
    • Added authentication to HTTP queue (#15202)
    • Added support to pass a LightningWork to the LightningApp (#15215)
    • Added support for getting CLI help for connected apps even if the app isn't running (#15196)
    • Added support for adding requirements to commands and installing them when missing when running an app command (#15198)
    • Added Lightning CLI Connection to be terminal session instead of global (#15241)
    • Added support for managing SSH-keys via CLI (#15291)
    • Added a JustPyFrontend to ease UI creation with https://github.com/justpy-org/justpy (#15002)
    • Added a layout endpoint to the Rest API and the ability to disable pulling or pushing to the state (#15367)
    • Added support for functions for configure_api and configure_commands to be executed in the Rest API process (#15098)
    • Added support to start lightning app on cloud without needing to install dependencies locally (#15019)
    Changed
    • Improved the show logs command to be standalone and re-usable (#15343)
    • Removed the --instance-types option when creating clusters (#15314)
    Fixed
    • Fixed an issue when using the CLI without arguments (#14877)
    • Fixed a bug where the upload files endpoint would raise an error when running locally (#14924)
    • Fixed the BYOC cluster region selector by hiding it from the help output, since only us-east-1 has been tested and is recommended (#15277)
    • Fixed a bug when launching an app on multiple clusters (#15226)
    • Fixed a bug with a default CloudCompute for Lightning flows (#15371)

    Lightning Trainer

    Added
    • Added support for requeueing SLURM array jobs (#15040)
    • Added native AMP support for ddp_fork (and associated alias strategies) with CUDA GPUs (#14983)
    • Added BatchSizeFinder callback (#11089)
    • Added LearningRateFinder callback (#13802)
    • Tuner now supports a new method argument which will determine when to run the BatchSizeFinder: one of fit, validate, test or predict (#11089)
    • Added prefix to log message in seed_everything with rank info (#14031)
    • Added support for auto wrapping for DDPFullyShardedNativeStrategy (#14252)
    • Added support for passing extra init-parameters to the LightningDataModule.from_datasets (#14185)
    • Added support for saving sharded optimizer state dict outside of DDPShardedStrategy (#14208)
    • Added support for auto wrapping for DDPFullyShardedStrategy (#14383)
    • Integrated the lightning_utilities package (#14475, #14537, #14556, #14558, #14575, #14620)
    • Added args parameter to LightningCLI to ease running from within Python (#14596)
    • Added WandbLogger.download_artifact and WandbLogger.use_artifact for managing artifacts with Weights and Biases (#14551)
    • Added an option to configure the signal SLURM sends when a job is preempted or requeued (#14626)
    • Added a warning when the model passed to LightningLite.setup() does not have all parameters on the same device (#14822)
    • The CometLogger now flags the Comet Experiments as being created from Lightning for analytics purposes (#14906)
    • Introduce ckpt_path="hpc" keyword for checkpoint loading (#14911)
    • Added a more descriptive error message when attempting to fork processes with pre-initialized CUDA context (#14709)
    • Added support for custom parameters in subclasses of SaveConfigCallback (#14998)
    • Added inference_mode flag to Trainer to let users enable/disable inference mode during evaluation (#15034)
    • Added LightningLite.no_backward_sync for control over efficient gradient accumulation with distributed strategies (#14966)
    • Added a sanity check that scripts are executed with the srun command in SLURM and that environment variables are not conflicting (#15011)
    • Added an error message when attempting to launch processes with python -i and an interactive-incompatible strategy (#15293)
    Changed
    • The Trainer.{fit,validate,test,predict,tune} methods now raise a useful error message if the input is not a LightningModule (#13892)
    • Raised a MisconfigurationException if batch transfer hooks are overridden with IPUAccelerator (#13961)
    • Replaced the unwrapping logic in strategies with direct access to unwrapped LightningModule (#13738)
    • Enabled on_before_batch_transfer for DPStrategy and IPUAccelerator (#14023)
    • When resuming training with Apex enabled, the Trainer will now raise an error (#14341)
    • Included torch.cuda rng state to the aggregate _collect_rng_states() and _set_rng_states() (#14384)
    • Changed trainer.should_stop to not stop in between an epoch and run until min_steps/min_epochs only (#13890)
    • The pyDeprecate dependency is no longer installed (#14472)
    • When using multiple loggers, by default checkpoints and profiler output now get saved to the log dir of the first logger in the list (#14325)
    • In Lightning Lite, state-dict access to the module wrapper now gets passed through to the original module reference (#14629)
    • Removed fall-back to LightningEnvironment when number of SLURM tasks does not correspond to number of processes in Trainer (#14300)
    • Aligned DDP and DDPSpawn strategies in setting up the environment (#11073)
    • Integrated the Lite Precision plugins into the PL Precision plugins - the base class in PL now extends the lightning_lite.precision.Precision base class (#14798)
      • The PrecisionPlugin.backward signature changed: The closure_loss argument was renamed to tensor
      • The PrecisionPlugin.{pre_,post_}backward signature changed: The closure_loss argument was renamed to tensor and moved as the first argument
      • The PrecisionPlugin.optimizer_step signature changed: The model, optimizer_idx and closure arguments need to be passed as keyword arguments now
    • Trainer queries the CUDA devices through NVML if available to avoid initializing CUDA before forking, which eliminates the need for the PL_DISABLE_FORK environment variable introduced in v1.7.4 (#14631)
    • The MLFlowLogger.finalize() now sets the status to FAILED when an exception occurred in Trainer, and sets the status to FINISHED on successful completion (#12292)
    • It is no longer needed to call model.double() when using precision=64 in Lightning Lite (#14827)
    • HPC checkpoints are now loaded automatically only in a SLURM environment when no specific value for ckpt_path has been set (#14911)
    • The Callback.on_load_checkpoint now gets the full checkpoint dictionary and the callback_state argument was renamed checkpoint (#14835)
    • Moved the warning about saving nn.Module in save_hyperparameters() to before the deepcopy (#15132)
    • To avoid issues with forking processes, from PyTorch 1.13 and higher, Lightning will directly use the PyTorch NVML-based check for torch.cuda.device_count and from PyTorch 1.14 and higher, Lightning will configure PyTorch to use a NVML-based check for torch.cuda.is_available. (#15110, #15133)
    • The NeptuneLogger now uses neptune.init_run instead of the deprecated neptune.init to initialize a run (#15393)
    Deprecated
    • Deprecated LightningDeepSpeedModule (#14000)
    • Deprecated amp_level from Trainer in favour of passing it explicitly via precision plugin (#13898)
    • Deprecated the calls to pytorch_lightning.utilities.meta functions in favor of built-in https://github.com/pytorch/torchdistx support (#13868)
    • Deprecated the unwrap_lightning_module and unwrap_lightning_module_sharded utility functions in favor of accessing the unwrapped LightningModule on the strategy directly (#13738)
    • Deprecated the pl_module argument in LightningParallelModule, LightningDistributedModule, LightningShardedDataParallel, LightningBaguaModule and LightningDeepSpeedModule wrapper classes (#13738)
    • Deprecated the on_colab_kaggle function (#14247)
    • Deprecated the internal pl.core.mixins.DeviceDtypeModuleMixin class (#14511, #14548)
    • Deprecated all functions in pytorch_lightning.utilities.xla_device (#14514, #14550)
      • Deprecated the internal inner_f function
      • Deprecated the internal pl_multi_process function
      • Deprecated the internal XLADeviceUtils.xla_available staticmethod
      • Deprecated the XLADeviceUtils.tpu_device_exists staticmethod in favor of pytorch_lightning.accelerators.TPUAccelerator.is_available()
    • Deprecated pytorch_lightning.utilities.distributed.tpu_distributed in favor of lightning_lite.accelerators.tpu.tpu_distributed (#14550)
    • Deprecated all functions in pytorch_lightning.utilities.cloud_io in favor of lightning_lite.utilities.cloud_io (#14515)
    • Deprecated the functions in pytorch_lightning.utilities.apply_func in favor of lightning_utilities.core.apply_func (#14516, #14537)
    • Deprecated all functions in pytorch_lightning.utilities.device_parser (#14492, #14753)
      • Deprecated the pytorch_lightning.utilities.device_parser.determine_root_gpu_device in favor of lightning_lite.utilities.device_parser.determine_root_gpu_device
      • Deprecated the pytorch_lightning.utilities.device_parser.parse_gpu_ids in favor of lightning_lite.utilities.device_parser.parse_gpu_ids
      • Deprecated the pytorch_lightning.utilities.device_parser.is_cuda_available in favor of lightning_lite.accelerators.cuda.is_cuda_available
      • Deprecated the pytorch_lightning.utilities.device_parser.num_cuda_devices in favor of lightning_lite.accelerators.cuda.num_cuda_devices
      • Deprecated the pytorch_lightning.utilities.device_parser.parse_cpu_cores in favor of lightning_lite.accelerators.cpu.parse_cpu_cores
      • Deprecated the pytorch_lightning.utilities.device_parser.parse_tpu_cores in favor of lightning_lite.accelerators.tpu.parse_tpu_cores
      • Deprecated the pytorch_lightning.utilities.device_parser.parse_hpus in favor of pytorch_lightning.accelerators.hpu.parse_hpus
    • Deprecated duplicate SaveConfigCallback parameters in LightningCLI.__init__: save_config_filename, save_config_overwrite and save_config_multifile. The new save_config_kwargs parameter should be used instead (#14998)
    • Deprecated TrainerFn.TUNING, RunningStage.TUNING and trainer.tuning property (#15100)
    • Deprecated custom pl.utilities.distributed.AllGatherGrad implementation in favor of PyTorch's (#15364)
    Removed
    • Removed the deprecated Trainer.training_type_plugin property in favor of Trainer.strategy (#14011)
    • Removed all deprecated training type plugins (#14011)
    • Removed the deprecated DDP2Strategy (#14026)
    • Removed the deprecated DistributedType and DeviceType enum classes (#14045)
    • Removed deprecated support for passing the rank_zero_warn warning category positionally (#14470)
    • Removed the legacy and unused Trainer.get_deprecated_arg_names() (#14415)
    • Removed the deprecated on_train_batch_end(outputs) format when multiple optimizers are used and TBPTT is enabled (#14373)
    • Removed the deprecated training_epoch_end(outputs) format when multiple optimizers are used and TBPTT is enabled (#14373)
    • Removed the experimental pytorch_lightning.utilities.meta functions in favor of built-in https://github.com/pytorch/torchdistx support (#13868)
    • Removed the deprecated LoggerCollection; Trainer.logger and LightningModule.logger now return the first logger when more than one gets passed to the Trainer (#14283)
    • Removed the deprecated trainer.lr_schedulers (#14408)
    • Removed the deprecated LightningModule.{on_hpc_load,on_hpc_save} hooks in favor of the general purpose hooks LightningModule.{on_load_checkpoint,on_save_checkpoint} (#14315)
    • Removed deprecated support for old torchtext versions (#14375)
    • Removed deprecated support for the old neptune-client API in the NeptuneLogger (#14727)
    • Removed the deprecated weights_save_path Trainer argument and Trainer.weights_save_path property (#14424)
    • Removed the following deprecated import paths (#14471)
      • pytorch_lightning.utilities.distributed.rank_zero_only in favor of pytorch_lightning.utilities.rank_zero.rank_zero_only
      • pytorch_lightning.utilities.distributed.rank_zero_debug in favor of pytorch_lightning.utilities.rank_zero.rank_zero_debug
      • pytorch_lightning.utilities.distributed.rank_zero_info in favor of pytorch_lightning.utilities.rank_zero.rank_zero_info
      • pytorch_lightning.utilities.warnings.rank_zero_warn in favor of pytorch_lightning.utilities.rank_zero.rank_zero_warn
      • pytorch_lightning.utilities.warnings.rank_zero_deprecation in favor of pytorch_lightning.utilities.rank_zero.rank_zero_deprecation
      • pytorch_lightning.utilities.warnings.LightningDeprecationWarning in favor of pytorch_lightning.utilities.rank_zero.LightningDeprecationWarning
    • Removed deprecated Trainer.num_processes attribute in favour of Trainer.num_devices (#14423)
    • Removed the deprecated Trainer.data_parallel_device_ids attribute in favour of Trainer.device_ids (#14422)
    • Removed the deprecated class TrainerCallbackHookMixin (#14401)
    • Removed the deprecated BaseProfiler and AbstractProfiler classes (#14404)
    • Removed the deprecated way to set the distributed backend via the environment variable PL_TORCH_DISTRIBUTED_BACKEND, in favor of setting the process_group_backend in the strategy constructor (#14693)
    • Removed deprecated callback hooks (#14834)
      • Callback.on_configure_sharded_model in favor of Callback.setup
      • Callback.on_before_accelerator_backend_setup in favor of Callback.setup
      • Callback.on_batch_start in favor of Callback.on_train_batch_start
      • Callback.on_batch_end in favor of Callback.on_train_batch_end
      • Callback.on_epoch_start in favor of Callback.on_{train,validation,test}_epoch_start
      • Callback.on_epoch_end in favor of Callback.on_{train,validation,test}_epoch_end
      • Callback.on_pretrain_routine_{start,end} in favor of Callback.on_fit_start
    • Removed the deprecated device attributes Trainer.{devices,gpus,num_gpus,ipus,tpu_cores} in favor of the accelerator-agnostic Trainer.num_devices (#14829)
    • Removed the deprecated LightningIPUModule (#14830)
    • Removed the deprecated Logger.agg_and_log_metrics hook in favour of Logger.log_metrics and the agg_key_funcs and agg_default_func arguments. (#14840)
    • Removed the deprecated precision plugin checkpoint hooks PrecisionPlugin.on_load_checkpoint and PrecisionPlugin.on_save_checkpoint (#14833)
    • Removed the deprecated Trainer.root_gpu attribute in favor of Trainer.strategy.root_device (#14829)
    • Removed the deprecated Trainer.use_amp and LightningModule.use_amp attributes (#14832)
    • Removed the deprecated callback hooks Callback.on_init_start and Callback.on_init_end (#14867)
    • Removed the deprecated Trainer.run_stage in favor of Trainer.{fit,validate,test,predict} (#14870)
    • Removed the deprecated SimpleProfiler.profile_iterable and AdvancedProfiler.profile_iterable attributes (#14864)
    • Removed the deprecated Trainer.verbose_evaluate (#14884)
    • Removed the deprecated Trainer.should_rank_save_checkpoint (#14885)
    • Removed the deprecated TrainerOptimizersMixin (#14887)
    • Removed the deprecated Trainer.lightning_optimizers (#14889)
    • Removed the deprecated TrainerDataLoadingMixin (#14888)
    • Removed the deprecated Trainer.call_hook in favor of Trainer._call_callback_hooks, Trainer._call_lightning_module_hook, Trainer._call_ttp_hook, and Trainer._call_accelerator_hook (#14869)
    • Removed the deprecated Trainer.{validated,tested,predicted}_ckpt_path (#14897)
    • Removed the deprecated device_stats_monitor_prefix_metric_keys (#14890)
    • Removed the deprecated LightningDataModule.on_save/load_checkpoint hooks (#14909)
    • Removed support for returning a value in Callback.on_save_checkpoint in favor of implementing Callback.state_dict (#14835)
    Fixed
    • Fixed an issue with LightningLite.setup() not setting the .device attribute correctly on the returned wrapper (#14822)
    • Fixed an attribute error when running the tuner together with the StochasticWeightAveraging callback (#14836)
    • Fixed MissingFieldException in offline mode for the NeptuneLogger() (#14919)
    • Fixed the wandb save_dir being overridden by a None dir when using the CLI (#14878)
    • Fixed a missing call to LightningDataModule.load_state_dict hook while restoring checkpoint using LightningDataModule.load_from_checkpoint (#14883)
    • Fixed torchscript error with containers of LightningModules (#14904)
    • Fixed reloading of the last checkpoint on run restart (#14907)
    • SaveConfigCallback instances should only save the config once to allow having the overwrite=False safeguard when using LightningCLI(..., run=False) (#14927)
    • Fixed an issue with terminating the trainer profiler when a StopIteration exception is raised while using an IterableDataset (#14940)
    • Do not update on-plateau schedulers when reloading from an end-of-epoch checkpoint (#14702)
    • Fixed Trainer support for PyTorch built without distributed support (#14971)
    • Fixed batch normalization statistics calculation in StochasticWeightAveraging callback (#14866)
    • Avoided initializing optimizers during deepspeed inference (#14944)
    • Fixed LightningCLI parse_env and description in subcommands (#15138)
    • Fixed an exception that would occur when creating a multiprocessing.Pool after importing Lightning (#15292)
    • Fixed a pickling error when using RichProgressBar together with checkpointing (#15319)
    • Fixed the RichProgressBar crashing when used with distributed strategies (#15376)
    • Fixed an issue with RichProgressBar not resetting the internal state for the sanity check progress (#15377)
    • Fixed an issue with DataLoader re-instantiation when the attribute is an array and the default value of the corresponding argument changed (#15409)

    Full commit list: https://github.com/PyTorchLightning/pytorch-lightning/compare/1.7.0...1.8.0

    Contributors

    Veteran

    @akihironitta @ananthsub @AndresAlgaba @ar90n @Atharva-Phatak @awaelchli @BongYang @Borda @carmocca @dependabot @donlapark @ethanwharris @Felonious-Spellfire @hhsecond @jerome-habana @JustinGoheen @justusschock @kaushikb11 @krishnakalyan3 @krshrimali @luca-medeiros @manangoel99 @manskx @mauvilsa @MrShevan @nicolai86 @nmiculinic @otaj @Queuecumber @rlizzo @rohitgr7 @rschireman @SeanNaren @speediedan @tchaton @tshu-w

    New

    @Birch-san @clementpoiret @HalestormAI @thongonary @alecmerdler @adam-lightning @yurijmikhalevich @lijm1358 @robert-s-lee @panos-is @kacperlukawski @alro923 @dmitsf @Anner-deJong @cschell @nishantb06 @Callidior @j0rd1smit @MarcSkovMadsen @KralaBenjamin @robertomest @daniel347x @pierocor @datumbox @nohalon @pritamsoni-hsr @nandwalritik @gilfree @ritsuki1227 @christopher-nguyen-re @JulesGM @jgbos @dconathan @jsr-p @NeoKish @Blaizzy @suyash-811 @alexkuzmik @ziyadsheeba @geoffrey-g-delhomme @amrutha1098 @AlessioQuercia @ver217 @Helias @zxvix @1SAA @fabiofumarola @luca3rd @kimpty @PaulLerner @rbracco @wouterzwerink

    If we forgot somebody or you have a suggestion, find support here.

    Did you know?

    Chuck Norris can write functions of infinite recursion ... and have them return.

  • App/0.7.0(Oct 20, 2022)

    [0.7.0] - 2022-10-20

    Added

    • Added a --secret option to CLI to allow binding Secrets to app environment variables when running in the cloud (#14612)
    • Added support for adding descriptions to commands either through a docstring or the DESCRIPTION attribute (#15193)
    • Added option to add custom meta tags to the UI container (#14915)
    • Added support to pass a LightningWork to the LightningApp (#15215)

    Changed

    • Allowed root path to run the app on /path (#14972)
  • app/0.6.3(Oct 7, 2022)

  • 1.7.7(Sep 22, 2022)

    [1.7.7] - 2022-09-22

    Fixed

    • Fixed the availability check for the neptune-client package (#14714)
    • Break HPU Graphs into two parts (forward + backward as one and optimizer as another) for better performance (#14656)
    • Fixed torchscript error with ensembles of LightningModules (#14657, #14724)
    • Fixed an issue with TensorBoardLogger.finalize creating a new experiment when none was created during the Trainer's execution (#14762)
    • Fixed TypeError on import when torch.distributed is not available (#14809)

    Contributors

    @awaelchli @Borda @carmocca @dependabot @otaj @raoakarsha

    If we forgot someone due to not matching commit email with GitHub account, let us know :)

  • app/0.6.2(Sep 22, 2022)

    [0.6.2] - 2022-09-22

    Changed

    • Improved Lightning App connect logic by disconnecting automatically (#14532)
    • Improved the error message when the LightningWork is missing the run method (#14759)
    • Improved the error message when the root LightningFlow passed to LightningApp is missing the run method (#14760)

    Fixed

    • Fixed a bug where the uploaded command file wasn't properly parsed (#14532)
    • Fixed an issue where custom property setters were not being used in the LightningWork class (#14259)
    • Fixed an issue where some terminals would display broken icons in the PL app CLI (#14226)

    Contributors

    @awaelchli, @borda, @pranjaldatta, @tchaton

    If we forgot someone due to not matching commit email with GitHub account, let us know :]

  • app/0.6.1(Sep 19, 2022)

    [0.6.1] - 2022-09-19

    Added

    • Add support to upload files to the Drive through an asynchronous upload_file endpoint (#14703)

    Changed

    • Application storage prefix moved from app_id to project_id/app_id (#14583)
    • Changed LightningCloud client calls to use keyword arguments instead of positional arguments (#14685)

    Fixed

    • Made the threadpool non-default in the LightningCloud client (#14757)
    • Resolved a bug where the state change detection using DeepDiff would not work with Path and Drive objects (#14465)
    • Resolved a bug where the wrong client was passed to collect cloud logs (#14684)
    • Resolved the memory leak issue with the Lightning Cloud package and bumped the requirements to use the latest version (#14697)
    • Fixed the 5000 log line limitation for Lightning AI BYOC cluster logs (#14458)
    • Fixed a bug where the uploaded command file wasn't properly parsed (#14532)
    • Resolved LightningApp(..., debug=True) (#14464)

    Contributors

    @dmitsf @hhsecond @tchaton @nohalon @krshrimali @pritamsoni-hsr @nmiculinic @ethanwharris @yurijmikhalevich @Felonious-Spellfire @otaj @Borda

    If we forgot someone due to not matching commit email with GitHub account, let us know :)

  • 1.7.6(Sep 13, 2022)

    [1.7.6] - 2022-09-13

    Changed

    • Improved the error messaging when passing Trainer.method(model, x_dataloader=None) with no module-method implementations available (#14614)

    Fixed

    • Reset the dataloaders on OOM failure in batch size finder to use the last successful batch size (#14372)
    • Fixed an issue to keep downscaling the batch size in case there hasn't been even a single successful optimal batch size with mode="power" (#14372)
    • Fixed an issue where self.log-ing a tensor would create a user warning from PyTorch about cloning tensors (#14599)
    • Fixed compatibility when torch.distributed is not available (#14454)
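
    The two batch size finder fixes above describe a search that doubles the batch size while runs succeed, falls back to the last successful size on failure, and keeps shrinking when no size has succeeded yet. A simplified sketch of that behavior follows; it is not Lightning's implementation, and `power_scale_batch_size` and the `fits` callback (standing in for "a trial run did not hit OOM") are hypothetical.

```python
from typing import Callable, Optional


def power_scale_batch_size(
    fits: Callable[[int], bool],
    init_size: int = 2,
    max_trials: int = 25,
) -> int:
    """Search for a workable batch size by doubling and, if needed, halving."""
    size = init_size
    last_successful: Optional[int] = None
    for _ in range(max_trials):
        if fits(size):
            last_successful = size
            size *= 2  # success: try twice as large
        elif last_successful is not None:
            return last_successful  # failure after a success: fall back
        elif size == 1:
            break  # nothing fits, not even a single sample
        else:
            size = max(1, size // 2)  # no success yet: keep downscaling
    if last_successful is None:
        raise RuntimeError("No batch size fit into memory")
    return last_successful
```

    In practice `fits` would wrap a short training run and catch out-of-memory errors; here it is left abstract so the control flow stays visible.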

    Contributors

    @akihironitta @awaelchli @Borda @carmocca @dependabot @krshrimali @mauvilsa @pierocor @rohitgr7 @wangraying

    If we forgot someone due to not matching commit email with GitHub account, let us know :)

  • app/0.6.0(Sep 8, 2022)

    [0.6.0] - 2022-09-08

    Added

    • Introduce lightning connect (#14452)
    • Adds PanelFrontend to easily create complex UI in Python (#13531)
    • Add support for Lightning App Commands through the configure_commands hook on LightningFlow and ClientCommand (#13602)
    • Add support for Lightning AI BYOC cluster management (#13835)
    • Add support to see Lightning AI BYOC cluster logs (#14334)
    • Add support to run Lightning apps on Lightning AI BYOC clusters (#13894)
    • Add support for listing Lightning AI apps (#13987)
    • Adds LightningTrainingComponent. LightningTrainingComponent orchestrates multi-node training in the cloud (#13830)
    • Add support for printing application logs using CLI lightning show logs <app_name> [components] (#13634)
    • Add support for Lightning API through the configure_api hook on the LightningFlow and the Post, Get, Delete, Put with HttpMethods (#13945)
    • Added a warning when configure_layout returns URLs configured with HTTP instead of HTTPS (#14233)
    • Add --app_args support from the CLI (#13625)

    Changed

    • Default values and parameter names for Lightning AI BYOC cluster management (#14132)
    • Run the flow only if the state has changed from the previous execution (#14076)
    • Increased DeepDiff's verbose level to properly handle dict changes (#13960)
    • Setup: added requirement freeze for the next major version (#14480)

    Fixed

    • Unification of app template: moved app.py to root dir for lightning init app <app_name> template (#13853)
    • Fixed an issue with lightning --version command (#14433)
    • Fixed imports of collections.abc for py3.10 (#14345)

    Contributors

    @adam-lightning, @awaelchli, @Borda, @dmitsf, @manskx, @MarcSkovMadsen, @nicolai86, @tchaton

    If we forgot someone due to not matching commit email with GitHub account, let us know :]

  • 1.7.5(Sep 7, 2022)

    [1.7.5] - 2022-09-06

    Fixed

    • Squeezed tensor values when logging with LightningModule.log (#14489)
    • Fixed WandbLogger.save_dir not being set after creation (#14326)
    • Fixed Trainer.estimated_stepping_batches when maximum number of epochs is not set (#14317)

    Contributors

    @carmocca @dependabot @robertomest @rohitgr7 @tshu-w

    If we forgot someone due to not matching commit email with GitHub account, let us know :)

  • 1.7.4(Aug 31, 2022)

    [1.7.4] - 2022-08-31

    Added

    • Added an environment variable PL_DISABLE_FORK that can be used to disable all forking in the Trainer (#14319)

    Fixed

    • Fixed LightningDataModule hparams parsing (#12806)
    • Reset epoch progress with batch size scaler (#13846)
    • Fixed restoring the trainer after using lr_find() so that the correct LR schedule is used for the actual training (#14113)
    • Fixed incorrect values after transferring data to an MPS device (#14368)

    Contributors

    @rohitgr7 @tanmoyio @justusschock @cschell @carmocca @Callidior @awaelchli @j0rd1smit @dependabot @Borda @otaj

  • 1.7.3(Aug 25, 2022)

    [1.7.3] - 2022-08-25

    Fixed

    • Fixed an assertion error when using a ReduceOnPlateau scheduler with the Horovod strategy (#14215)
    • Fixed an AttributeError when accessing LightningModule.logger and the Trainer has multiple loggers (#14234)
    • Fixed wrong num padding for RichProgressBar (#14296)
    • Added back support for logging in the configure_gradient_clipping hook after unintended removal in v1.7.2 (#14298)
    • Fixed an issue to avoid the impact of sanity check on reload_dataloaders_every_n_epochs for validation (#13964)

    Contributors

    @awaelchli @Borda @carmocca @dependabot @kaushikb11 @otaj @rohitgr7

  • app/0.5.7(Aug 22, 2022)

    [0.5.7] - 2022-08-22

    Changed

    • Release LAI docs as stable (#14250)
    • Compatibility for Python 3.10

    Fixed

    • Pinning starsessions to 1.x (#14333)
    • Parsed local package versions (#13933)

    Contributors

    @borda, @hhsecond, @manskx

    If we forgot someone due to not matching commit email with GitHub account, let us know :]

  • app/0.5.6(Aug 18, 2022)

  • 1.7.2(Aug 17, 2022)

    [1.7.2] - 2022-08-17

    Added

    • Added FullyShardedNativeNativeMixedPrecisionPlugin to handle precision for DDPFullyShardedNativeStrategy (#14092)
    • Added profiling to these hooks: on_before_batch_transfer, transfer_batch_to_device, on_after_batch_transfer, configure_gradient_clipping, clip_gradients (#14069)

    Changed

    • Updated compatibility for LightningLite to run with the latest DeepSpeed 0.7.0 (#13967)
    • Raised a MisconfigurationException if batch transfer hooks are overridden with IPUAccelerator (#13961)
    • The default project name in WandbLogger is now "lightning_logs" (#14145)
    • The WandbLogger.name property no longer returns the name of the experiment, and instead returns the project's name (#14145)

    Fixed

    • Fixed a bug that caused spurious AttributeError when multiple DataLoader classes are imported (#14117)
    • Fixed epoch-end logging results not being reset after the end of the epoch (#14061)
    • Fixed saving hyperparameters in a composition where the parent class is not a LightningModule or LightningDataModule (#14151)
    • Fixed the device placement when LightningModule.cuda() gets called without specifying a device index and the current cuda device was not 0 (#14128)
    • Avoided false positive warning about using sync_dist when using torchmetrics (#14143)
    • Avoid metadata.entry_points deprecation warning on Python 3.10 (#14052)
    • Avoid raising the sampler warning if num_replicas=1 (#14097)
    • Fixed resuming from a checkpoint when using Stochastic Weight Averaging (SWA) (#9938)
    • Avoided requiring the FairScale package to use precision with the fsdp native strategy (#14092)
    • Fixed an issue in which the default name for a run in WandbLogger would be set to the project name instead of a randomly generated string (#14145)
    • Fixed not preserving set attributes on DataLoader and BatchSampler when instantiated inside *_dataloader hooks (#14212)

    Contributors

    @adamreeve @akihironitta @awaelchli @Borda @carmocca @dependabot @otaj @rohitgr7

  • 1.7.1(Aug 9, 2022)

    [1.7.1] - 2022-08-09

    Fixed

    • Casted only floating point tensors to fp16 with IPUs (#13983)
    • Casted tensors to fp16 before moving them to device with DeepSpeedStrategy (#14000)
    • Fixed the NeptuneLogger dependency being unrecognized (#13988)
    • Fixed an issue where users would be warned about unset max_epochs even when fast_dev_run was set (#13262)
    • Fixed MPS device being unrecognized (#13992)
    • Fixed incorrect precision="mixed" being used with DeepSpeedStrategy and IPUStrategy (#14041)
    • Fixed dtype inference during gradient norm computation (#14051)
    • Fixed a bug that caused ddp_find_unused_parameters to be set to False, whereas the intended default is True (#14095)

    Contributors

    @adamjstewart @akihironitta @awaelchli @Birch-san @carmocca @clementpoiret @dependabot @rohitgr7

  • app/0.5.5(Aug 9, 2022)

    [0.5.5] - 2022-08-09

    Deprecated

    • Deprecate sheety API (#14004)

    Fixed

    • Resolved a bug where the work statuses will grow quickly and be duplicated (#13970)
    • Resolved a bug about a race condition when sending the work state through the caller_queue (#14074)
    • Fixed starting a Lightning App on the cloud when the repository name begins with "Lightning" (#14025)

    Contributors

    @manskx, @rlizzo, @tchaton

    If we forgot someone due to not matching commit email with GitHub account, let us know :]

  • 1.7.0(Aug 2, 2022)

    The core team is excited to announce the release of PyTorch Lightning 1.7 :zap:

    PyTorch Lightning 1.7 is the culmination of work from 106 contributors who have worked on features, bug-fixes, and documentation for a total of over 492 commits since 1.6.0.

    Highlights

    Apple Silicon Support

    For those using PyTorch 1.12 on M1 or M2 Apple machines, we have created the MPSAccelerator. MPSAccelerator enables accelerated GPU training on Apple’s Metal Performance Shaders (MPS) as a backend process.


    NOTE

    Support for this accelerator is currently marked as experimental in PyTorch. Because many operators are still missing, you may run into a few rough edges.


    import pytorch_lightning as pl
    
    # Selects the accelerator
    trainer = pl.Trainer(accelerator="mps")
    
    # Equivalent to
    from pytorch_lightning.accelerators import MPSAccelerator
    trainer = pl.Trainer(accelerator=MPSAccelerator())
    
    # Defaults to "mps" when run on M1 or M2 Apple machines
    # to avoid code changes when switching computers
    trainer = pl.Trainer(accelerator="gpu")
    

    Native Fully Sharded Data Parallel Strategy

    PyTorch 1.12 also added native support for Fully Sharded Data Parallel (FSDP). Previously, PyTorch Lightning enabled this by using the fairscale project. You can now choose between both options.


    NOTE

    Support for this strategy is marked as beta in PyTorch.


    import pytorch_lightning as pl
    
    # Native PyTorch implementation
    trainer = pl.Trainer(strategy="fsdp_native")
    
    # Equivalent to
    from pytorch_lightning.strategies import DDPFullyShardedNativeStrategy
    trainer = pl.Trainer(strategy=DDPFullyShardedNativeStrategy())
    
    # For reference, FairScale's implementation can be used with
    trainer = pl.Trainer(strategy="fsdp")
    

    A Collaborative Training strategy using Hivemind

    Collaborative Training solves the need for top-tier multi-GPU servers by allowing you to train across unreliable machines such as local ones or even preemptible cloud compute across the Internet.

    Under the hood, we use Hivemind, which provides decentralized training across the Internet.

    import pytorch_lightning as pl
    from pytorch_lightning.strategies import HivemindStrategy
    
    # Train on a single local GPU while collaborating with other peers
    trainer = pl.Trainer(
        strategy=HivemindStrategy(target_batch_size=8192),
        accelerator="gpu",
        devices=1,
    )
    

    For more information, check out the docs.

    Distributed support in Jupyter Notebooks

    So far, the only multi-GPU strategy supported in Jupyter notebooks (including Grid.ai, Google Colab, and Kaggle, for example) has been the Data-Parallel (DP) strategy (strategy="dp"). DP, however, has several limitations that often obstruct users' workflows. It can be slow, it's incompatible with TorchMetrics, it doesn't persist state changes on replicas, and it's difficult to use with non-primitive input and output structures.

    In this release, we've added support for Distributed Data Parallel in Jupyter notebooks using the fork mechanism to address these shortcomings. This is only available for macOS and Linux (sorry Windows!).


    NOTE

    This feature is experimental.


    This is how you use multi-device in notebooks now:

    import pytorch_lightning as pl
    
    # Train on 2 GPUs in a Jupyter notebook
    trainer = pl.Trainer(accelerator="gpu", devices=2)
    
    # Can be set explicitly
    trainer = pl.Trainer(accelerator="gpu", devices=2, strategy="ddp_notebook")
    
    # Can also be used in non-interactive environments
    trainer = pl.Trainer(accelerator="gpu", devices=2, strategy="ddp_fork")
    

    By default, the Trainer detects the interactive environment and selects the right strategy for you. Learn more in the full documentation.

    Versioning of "last" checkpoints

    If a run is configured to save to the same directory as a previous run and ModelCheckpoint(save_last=True) is enabled, the "last" checkpoint is now versioned with a simple -v1 suffix to avoid overwriting the existing "last" checkpoint. This mimics the behaviour for checkpoints that monitor a metric.
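
    The naming rule can be illustrated with a small standalone sketch. It is not Lightning's implementation: the helper name, the "last.ckpt" file name, and the assumption that the counter keeps increasing past -v1 are illustrative only.

```python
from typing import Iterable


def next_last_checkpoint_name(existing: Iterable[str], stem: str = "last", ext: str = ".ckpt") -> str:
    """Hypothetical helper: pick a "last" checkpoint name that does not clash with existing files."""
    taken = set(existing)
    candidate = f"{stem}{ext}"
    version = 0
    # Append -v1, -v2, ... until the name is free (assumed numbering scheme).
    while candidate in taken:
        version += 1
        candidate = f"{stem}-v{version}{ext}"
    return candidate


print(next_last_checkpoint_name([]))             # last.ckpt
print(next_last_checkpoint_name(["last.ckpt"]))  # last-v1.ckpt
```

    In this sketch the first run keeps the plain name and only later runs in the same directory receive a suffix, so an existing "last" checkpoint is never overwritten.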

    Automatically reload the "last" checkpoint

    In certain scenarios, like when running in a cloud spot instance with fault-tolerant training enabled, it is useful to load the latest available checkpoint. It is now possible to pass the string ckpt_path="last" in order to load the latest available checkpoint from the set of existing checkpoints.

    trainer = Trainer(...)
    trainer.fit(..., ckpt_path="last")
    

    Validation every N batches across epochs

    In some cases, for example iteration-based training, it is useful to run validation after every N training batches without being limited by the epoch boundary. Now, you can enable validation based on the total number of training batches.

    trainer = Trainer(..., val_check_interval=N, check_val_every_n_epoch=None)
    trainer.fit(...)
    

    For example, given 5 epochs of 10 batches, setting N=25 would run validation in the 3rd and 5th epoch.
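
    The arithmetic behind this example can be checked with a few lines of plain Python. The function below is a hypothetical helper written for illustration; it only counts batches globally and does not call Lightning.

```python
def validation_epochs(num_epochs: int, batches_per_epoch: int, every_n_batches: int) -> list:
    """Return the 1-based epochs in which validation runs when batches are counted across epochs."""
    epochs = []
    total_batches = num_epochs * batches_per_epoch
    for global_batch in range(every_n_batches, total_batches + 1, every_n_batches):
        # Map the 1-based global batch count back to its 1-based epoch.
        epochs.append((global_batch - 1) // batches_per_epoch + 1)
    return epochs


print(validation_epochs(num_epochs=5, batches_per_epoch=10, every_n_batches=25))  # [3, 5]
```

    Validation fires after global batches 25 and 50, which fall in the 3rd and 5th epoch respectively.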

    CPU stats monitoring

    PyTorch Lightning provides the DeviceStatsMonitor callback to monitor the stats of the hardware currently used. However, users often also want to monitor the stats of other hardware. In this release, we have added an option to additionally monitor CPU stats:

    import pytorch_lightning as pl
    from pytorch_lightning.callbacks import DeviceStatsMonitor
    
    # Log both CPU stats and GPU stats
    trainer = pl.Trainer(callbacks=DeviceStatsMonitor(cpu_stats=True), accelerator="gpu")
    
    # Log just the GPU stats
    trainer = pl.Trainer(callbacks=DeviceStatsMonitor(cpu_stats=False), accelerator="gpu")
    
    # Equivalent to `DeviceStatsMonitor()`
    trainer = pl.Trainer(callbacks=DeviceStatsMonitor(cpu_stats=True), accelerator="cpu")
    

    The CPU stats are gathered using the psutil package.

    Automatic distributed samplers

    It is now possible to use custom samplers in a distributed environment without the need to set replace_ddp_sampler=False and wrap your sampler manually with the DistributedSampler.
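
    Conceptually, wrapping a custom sampler for distributed training means that each replica consumes a disjoint slice of the indices the sampler yields. The sketch below shows that idea with strided slicing. It is a simplified illustration under our own assumptions (no padding, no shuffling, hypothetical function name), not the wrapper Lightning applies.

```python
from typing import Iterable, List


def shard_indices(sampler_indices: Iterable[int], num_replicas: int, rank: int) -> List[int]:
    """Hypothetical helper: keep every num_replicas-th index, starting at this replica's rank."""
    if not 0 <= rank < num_replicas:
        raise ValueError(f"rank must be in [0, {num_replicas}), got {rank}")
    return list(sampler_indices)[rank::num_replicas]


# Order produced by some custom sampler (made-up values).
custom_order = [9, 3, 7, 1, 5, 0, 8, 2]
shards = [shard_indices(custom_order, num_replicas=2, rank=r) for r in range(2)]
print(shards)  # [[9, 7, 5, 8], [3, 1, 0, 2]]
```

    The custom ordering is preserved within each shard, and together the shards cover every index exactly once.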

    Inference mode support

    PyTorch 1.9 introduced torch.inference_mode, which is a faster alternative for torch.no_grad. Lightning will now use inference_mode wherever possible during evaluation.

    Support for warn-level determinism

    In PyTorch 1.11, operations that do not have a deterministic implementation can be set to throw a warning instead of an error when run in deterministic mode. This is now supported by our Trainer:

    trainer = pl.Trainer(deterministic="warn")
    

    LightningCLI improvements

    After the latest updates to jsonargparse, the library supporting the LightningCLI, there's now complete support for shorthand notation. This includes automatic support for shorthand notation to all arguments, not just the ones that are part of the registries, plus support inside configuration files.

    + # pytorch_lightning==1.7.0
      trainer:
        callbacks:
    -     - class_path: pytorch_lightning.callbacks.EarlyStopping
    +     - class_path: EarlyStopping
            init_args:
              monitor: "loss"
    

    A header with the version that generated the config is now included.

    All subclasses for a given base class can be specified by name, so there's no need to explicitly register them. The only requirement is that the module where the subclass is defined is imported prior to parsing.

    from pytorch_lightning.cli import LightningCLI
    import my_code.models
    import my_code.optimizers
    
    cli = LightningCLI()
    # Now use any of the classes:
    # python trainer.py fit --model=Model1 --optimizer=CustomOptimizer
    

    The new version renders the registries and the auto_registry flag, introduced in 1.6.0, unnecessary, so we have deprecated them.

    Support was also added for list appending; for example, to add a callback to an existing list that might be already configured:

    $ python trainer.py fit \
    -   --trainer.callbacks=EarlyStopping \
    +   --trainer.callbacks+=EarlyStopping \
        --trainer.callbacks.patience=5 \
    -   --trainer.callbacks=LearningRateMonitor \
    +   --trainer.callbacks+=LearningRateMonitor \
        --trainer.callbacks.logging_interval=epoch
    

    Callback registration through entry points

    Entry Points are an advanced feature in Python's setuptools that allow packages to expose metadata to other packages. In Lightning, we allow an arbitrary package to include callbacks that the Lightning Trainer can automatically use when installed, without you having to manually add them to the Trainer. This is useful in production environments where it is common to provide specialized monitoring and logging callbacks globally for every application.

    A setup.py file for a callbacks plugin package could look something like this:

    from setuptools import setup
    
    setup(
        name="my-package",
        version="0.0.1",
        entry_points={
            # Lightning will look for this key here in the environment:
            "pytorch_lightning.callbacks_factory": [
                "monitor_callbacks=factories:my_custom_callbacks_factory"
            ]
        },
    )
    

    Read more about callback entry points in our docs.
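
    To make the mechanism concrete, here is a minimal sketch of how a host program could discover such factories using only the standard library. It is not Lightning's internal code. The function name is hypothetical, and it assumes that each registered factory takes no arguments and returns a list of callbacks.

```python
import sys
from importlib.metadata import entry_points
from typing import Any, List

# Entry point group taken from the setup.py example above.
GROUP = "pytorch_lightning.callbacks_factory"


def collect_callbacks(group: str = GROUP) -> List[Any]:
    """Hypothetical helper: load every factory registered under `group` and gather its callbacks."""
    if sys.version_info >= (3, 10):
        found = entry_points(group=group)
    else:
        # Python 3.8 and 3.9 return a dict that maps group names to entry points.
        found = entry_points().get(group, [])
    callbacks: List[Any] = []
    for entry_point in found:
        factory = entry_point.load()  # imports e.g. factories:my_custom_callbacks_factory
        callbacks.extend(factory())   # assumed to return a list of callbacks
    return callbacks


# A group that no installed package registers yields an empty list.
print(collect_callbacks("hypothetical.group.without.plugins"))  # []
```

    Because discovery happens through package metadata, installing or uninstalling the plugin package is enough to add or remove its callbacks.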

    Rank-zero only EarlyStopping messages

    Our EarlyStopping callback implementation, by default, logs the stopping messages on every rank when it's run in a distributed environment. This was done in case the monitored values were not synchronized. However, some users found this verbose. To avoid this, you can now set a flag:

    import pytorch_lightning as pl
    from pytorch_lightning.callbacks import EarlyStopping
    
    trainer = pl.Trainer(callbacks=EarlyStopping(..., log_rank_zero_only=True))
    

    A base Checkpoint class for extra customization

    If you want to customize checkpointing without all the extra functionality that the ModelCheckpoint callback provides, this release adds an empty Checkpoint class for easier inheritance. In all internal code, the check is made against the Checkpoint class to ensure everything works properly for custom classes.

    Validation now runs in overfitting mode

    Setting overfit_batches=N now enables validation and runs N validation batches during trainer.fit.

    from pytorch_lightning import Trainer
    
    # Uses 1% of each train & val set
    trainer = Trainer(overfit_batches=0.01)
    
    # Uses 10 batches for each train & val set
    trainer = Trainer(overfit_batches=10)
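
    The difference between the float and the int form can be sketched as follows. The helper is hypothetical, and the truncation of fractional results is an assumption made for illustration rather than a statement about Lightning's exact rounding.

```python
from typing import Union


def resolve_overfit_batches(overfit_batches: Union[int, float], total_batches: int) -> int:
    """Hypothetical helper: turn an overfit_batches value into a concrete number of batches."""
    if isinstance(overfit_batches, bool):
        raise TypeError("overfit_batches must be an int or a float, not a bool")
    if isinstance(overfit_batches, float):
        if not 0.0 <= overfit_batches <= 1.0:
            raise ValueError("a float overfit_batches must lie in [0.0, 1.0]")
        # A float is read as a fraction of the available batches (truncation assumed).
        return int(total_batches * overfit_batches)
    # An int is read as an absolute number of batches, capped at what is available.
    return min(overfit_batches, total_batches)


print(resolve_overfit_batches(0.25, total_batches=200))  # 50
print(resolve_overfit_batches(10, total_batches=200))    # 10
```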
    

    Device Stats Monitoring support for HPUs

    The DeviceStatsMonitor callback can now be used to automatically monitor and log device stats during the training stage with Habana devices.

    from pytorch_lightning import Trainer
    from pytorch_lightning.callbacks import DeviceStatsMonitor
    
    device_stats = DeviceStatsMonitor()
    trainer = Trainer(accelerator="hpu", callbacks=[device_stats])
    

    New Hooks

    LightningDataModule.load_from_checkpoint

    Hyperparameters from a LightningDataModule are now saved to checkpoints and reloaded when training is resumed. Just as you use LightningModule.load_from_checkpoint to load a model from a checkpoint file path, you can now load a LightningDataModule using the same hook.

    # Load weights without mapping ...
    datamodule = MyLightningDataModule.load_from_checkpoint('path/to/checkpoint.ckpt')
    
    # Or load weights and hyperparameters from separate files.
    datamodule = MyLightningDataModule.load_from_checkpoint(
        'path/to/checkpoint.ckpt',
        hparams_file='/path/to/hparams_file.yaml'
    )
    
    # Override some of the params with new values
    datamodule = MyLightningDataModule.load_from_checkpoint(
        'path/to/checkpoint.ckpt',
        batch_size=32,
        num_workers=10,
    )
    

    Experimental Features

    ServableModule and its Servable Module Validator Callback

    When serving models in production, it is generally good practice to ensure that the model can be served and optimized before starting training to avoid wasting money.

    To do so, you can import a ServableModule (an nn.Module) and add it as an extra base class to your base model as follows:

    from pytorch_lightning import LightningModule
    from pytorch_lightning.serve import ServableModule
    
    class ProductionReadyModel(LightningModule, ServableModule):
        ...
    

    To make your model servable, you would need to implement three hooks:

    • configure_payload: Describe the format of the payload (data sent to the server).
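    • configure_serialization: Describe the functions used to convert the payload to tensors (de-serialization) and tensors to payload (serialization).
    • serve_step: The method used to transform the input tensors to a dictionary of prediction tensors.

    import base64
    from io import BytesIO
    from typing import Dict
    
    import torch
    import torchvision.transforms as T  # assumed source of T.ToPILImage
    
    from pytorch_lightning.serve import ServableModule, ServableModuleValidator
    
    # LitModule, Image, and Top1 are assumed to be defined elsewhere (see the full example).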
    
    class ProductionReadyModel(LitModule, ServableModule):
        def configure_payload(self):
            # 1: Access the train dataloader and load a single sample.
            image, _ = self.trainer.train_dataloader.loaders.dataset[0]
    
            # 2: Convert the image into a PIL Image to bytes and encode it with base64
            pil_image = T.ToPILImage()(image)
            buffered = BytesIO()
            pil_image.save(buffered, format="JPEG")
            img_str = base64.b64encode(buffered.getvalue()).decode("UTF-8")
    
            payload = {"body": {"x": img_str}}
            return payload
    
        def configure_serialization(self):
            deserializers = {"x": Image(224, 224).deserialize}
            serializers = {"output": Top1().serialize}
            return deserializers, serializers
    
        def serve_step(self, x: torch.Tensor) -> Dict[str, torch.Tensor]:
            return {"output": self.model(x)}
    

    Finally, add the ServableModuleValidator callback to the Trainer to validate that the model is servable in on_train_start. This uses a FastAPI server.

    pl_module = ProductionReadyModel()
    trainer = Trainer(..., callbacks=[ServableModuleValidator()])
    trainer.fit(pl_module)
    

    Have a look at the full example here.

    Asynchronous Checkpointing

    You can now save checkpoints asynchronously using the AsyncCheckpointIO plugin without blocking your training process. To enable this, pass an AsyncCheckpointIO plugin to the Trainer.

    from pytorch_lightning import Trainer
    from pytorch_lightning.plugins.io import AsyncCheckpointIO
    
    trainer = Trainer(plugins=[AsyncCheckpointIO()])
    

    Have a look at the full example here.
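
    The general idea of non-blocking checkpoint writes can be sketched with the standard library alone. This is a conceptual illustration, not the plugin's implementation: the class name is hypothetical, pickle stands in for the real serialization, and only the disk write is moved to a background thread.

```python
import pickle
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


class BackgroundCheckpointWriter:
    """Hypothetical helper: hand checkpoint writes to a single worker thread."""

    def __init__(self) -> None:
        self._executor = ThreadPoolExecutor(max_workers=1)
        self._pending = []

    def save(self, state: dict, path: Path) -> None:
        # Snapshot the state now so later training steps cannot change what gets written.
        snapshot = pickle.dumps(state)
        self._pending.append(self._executor.submit(path.write_bytes, snapshot))

    def close(self) -> None:
        # Wait for outstanding writes and surface any exception they raised.
        for future in self._pending:
            future.result()
        self._executor.shutdown(wait=True)


with tempfile.TemporaryDirectory() as tmp:
    writer = BackgroundCheckpointWriter()
    target = Path(tmp) / "step_10.ckpt"
    writer.save({"step": 10, "weights": [0.1, 0.2]}, target)
    # ... the training loop would keep running here while the file is written ...
    writer.close()
    restored = pickle.loads(target.read_bytes())

print(restored["step"])  # 10
```

    A single worker keeps writes in submission order, and calling close() before exiting ensures that no checkpoint is left half-written.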

    Backward Incompatible Changes

    This section outlines notable changes that are not backward compatible with previous versions. The full list of changes and removals can be found in the CHANGELOG below.

    Removed support for the DDP2 strategy

    The DDP2 strategy, previously known as the DDP2 plugin, has been part of Lightning since its inception. Due to both the technical challenges in maintaining the plugin after PyTorch's removal of the multi-device support in DistributedDataParallel, as well as a general lack of interest, we have decided to retire the strategy entirely.

    Do not force metric synchronization on epoch end

    In previous versions, metrics logged inside epoch-end hooks were forcefully synced. This makes the sync_dist flag irrelevant and causes communication overhead that might be undesired. In this release, we've removed this behaviour and instead warn the user that synchronization might be desired.

    Deprecations

    | API | Removal version | Alternative |
    |-----|-----------------|-------------|
    | Import pytorch_lightning.loggers.base.LightningLoggerBase | 1.9 | pytorch_lightning.loggers.logger.Logger |
    | Import pytorch_lightning.callbacks.base.Callback | 1.9 | pytorch_lightning.callbacks.callback.Callback |
    | Import pytorch_lightning.core.lightning.LightningModule | 1.9 | pytorch_lightning.core.module.LightningModule |
    | Import pytorch_lightning.loops.base.Loop | 1.9 | pytorch_lightning.loops.loop.Loop |
    | Import pytorch_lightning.profiler | 1.9 | pytorch_lightning.profilers |
    | Arguments Trainer(num_processes=..., gpus=..., tpu_cores=..., ipus=...) | 2.0 | Trainer(accelerator=..., devices=...) |
    | Argument LightningCLI(seed_everything_default=None) | 1.9 | LightningCLI(seed_everything_default=False) |
    | Method Trainer.reset_train_val_dataloaders() | 1.9 | Trainer.reset_{train,val}_dataloader |
    | Import pytorch_lightning.utilities.cli module | 1.9 | pytorch_lightning.cli |
    | Objects pytorch_lightning.utilities.cli.{OPTIMIZER,LR_SCHEDULER,MODEL,DATAMODULE,CALLBACK,LOGGER}_REGISTRY | 1.9 | Not necessary anymore |
    | Argument LightningCLI(auto_registry=...) | 1.9 | Not necessary anymore |
    | Argument Trainer(strategy="ddp2") and class pytorch_lightning.strategies.DDP2Strategy | 1.8 | No longer supported |
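
    For the deprecated device arguments, the migration to accelerator and devices can be sketched as a plain dictionary rewrite. The helper below is hypothetical and covers only the simple case of one legacy argument. The mapping of num_processes to the "cpu" accelerator and the skipping of None or 0 values are assumptions made for this illustration.

```python
from typing import Any, Dict

# Legacy Trainer argument -> value for the `accelerator` argument (assumed mapping).
_LEGACY_TO_ACCELERATOR = {
    "gpus": "gpu",
    "tpu_cores": "tpu",
    "ipus": "ipu",
    "num_processes": "cpu",
}


def migrate_trainer_kwargs(kwargs: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical helper: rewrite legacy device arguments as accelerator/devices."""
    migrated = {k: v for k, v in kwargs.items() if k not in _LEGACY_TO_ACCELERATOR}
    # Ignore legacy arguments that were left unset (None or 0).
    legacy = {
        k: v
        for k, v in kwargs.items()
        if k in _LEGACY_TO_ACCELERATOR and v is not None and v != 0
    }
    if len(legacy) > 1:
        raise ValueError(f"cannot migrate several device arguments at once: {sorted(legacy)}")
    for name, value in legacy.items():
        migrated.setdefault("accelerator", _LEGACY_TO_ACCELERATOR[name])
        migrated.setdefault("devices", value)
    return migrated


print(migrate_trainer_kwargs({"gpus": 2, "max_epochs": 5}))
# {'max_epochs': 5, 'accelerator': 'gpu', 'devices': 2}
```

    Values that are already expressed through accelerator or devices are left untouched, so the rewrite is safe to apply to configurations that have been partly migrated.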

    CHANGELOG

    Added
    • Added ServableModule and its associated callback called ServableModuleValidator to ensure the model can be served (#13614)
    • Converted validation loop config warnings to PossibleUserWarning (#13377)
    • Added a flag named log_rank_zero_only to EarlyStopping to disable logging to non-zero rank processes (#13233)
    • Added support for reloading the last checkpoint saved by passing ckpt_path="last" (#12816)
    • Added LightningDataModule.load_from_checkpoint to support loading datamodules directly from checkpoint (#12550)
    • Added a friendly error message when attempting to call Trainer.save_checkpoint() without a model attached (#12772)
    • Added a friendly error message when attempting to use DeepSpeedStrategy on unsupported accelerators (#12699)
    • Enabled torch.inference_mode for evaluation and prediction (#12715)
    • Added support for setting val_check_interval to a value higher than the amount of training batches when check_val_every_n_epoch=None (#11993)
    • Include the pytorch_lightning version as a header in the CLI config files (#12532)
    • Added support for Callback registration through entry points (#12739)
    • Added support for Trainer(deterministic="warn") to warn instead of fail when a non-deterministic operation is encountered (#12588)
    • Added profiling to the loops' dataloader __next__ calls (#12124)
    • Hivemind Strategy
      • Added CollaborativeStrategy (#12842)
      • Renamed CollaborativeStrategy to HivemindStrategy (#13388)
      • Removed unnecessary endpoint logic, renamed collaborative to hivemind (#13392)
    • Include a version suffix for new "last" checkpoints of later runs in the same directory (#12902)
    • Show a better error message when a Metric that does not return a Tensor is logged (#13164)
    • Added missing predict_dataset argument in LightningDataModule.from_datasets to create predict dataloaders (#12942)
    • Added class name prefix to metrics logged by DeviceStatsMonitor (#12228)
    • Automatically wrap custom samplers under a distributed environment by using DistributedSamplerWrapper (#12959)
    • Added profiling of LightningDataModule hooks (#12971)
    • Added Native FSDP Strategy (#12447)
    • Added breaking of lazy graph across training, validation, test and predict steps when training with habana accelerators to ensure better performance (#12938)
    • Added Checkpoint class to inherit from (#13024)
    • Added CPU metric tracking to DeviceStatsMonitor (#11795)
    • Added teardown() method to Accelerator (#11935)
    • Added support for using custom Trainers that don't include callbacks using the CLI (#13138)
    • Added a timeout argument to DDPStrategy and DDPSpawnStrategy. (#13244, #13383)
    • Added XLAEnvironment cluster environment plugin (#11330)
    • Added logging messages to notify when FitLoop stopping conditions are met (#9749)
    • Added support for calling unknown methods with DummyLogger (#13224)
    • Added support for recursively setting the Trainer reference for ensembles of LightningModules (#13638)
    • Added Apple Silicon Support via MPSAccelerator (#13123)
    • Added support for DDP Fork (#13405)
    • Added support for async checkpointing (#13658)
    • Added support for HPU Device stats monitor (#13819)
    Changed
    • accelerator="gpu" now automatically selects an available GPU backend (CUDA and MPS currently) (#13642)
    • Enable validation during overfitting (#12527)
    • Added dataclass support to extract_batch_size (#12573)
    • Changed checkpoints save path in the case of one logger and user-provided weights_save_path from weights_save_path/name/version/checkpoints to weights_save_path/checkpoints (#12372)
    • Changed checkpoints save path in the case of multiple loggers and user-provided weights_save_path from weights_save_path/name1_name2/version1_version2/checkpoints to weights_save_path/checkpoints (#12372)
    • Marked swa_lrs argument in StochasticWeightAveraging callback as required (#12556)
    • LightningCLI's shorthand notation changed to use jsonargparse native feature (#12614)
    • LightningCLI changed to use jsonargparse native support for list append (#13129)
    • Changed seed_everything_default argument in the LightningCLI to type Union[bool, int]. If set to True a seed is automatically generated for the parser argument --seed_everything. (#12822, #13110)
    • Make positional arguments required for classes passed into the add_argparse_args function. (#12504)
    • Raise an error if there are insufficient training batches when using a float value of limit_train_batches (#12885)
    • DataLoader instantiated inside a *_dataloader hook will not set the passed arguments as attributes anymore (#12981)
    • When a multi-element tensor is logged, an error is now raised instead of silently taking the mean of all elements (#13164)
    • The WandbLogger will now use the run name in the logs folder if it is provided, and otherwise the project name (#12604)
    • Enabled using any Sampler in distributed environment in Lite (#13646)
    • Raised a warning instead of forcing sync_dist=True on epoch end (#13364)
    • Updated val_check_interval(int) to consider total train batches processed instead of _batches_that_stepped for validation check during training (#12832)
    • Updated Habana Accelerator's auto_device_count, is_available & get_device_name methods based on the latest torch habana package (#13423)
    • Disallowed using BatchSampler when running on multiple IPUs (#13854)
    Deprecated
    • Deprecated pytorch_lightning.accelerators.gpu.GPUAccelerator in favor of pytorch_lightning.accelerators.cuda.CUDAAccelerator (#13636)
    • Deprecated pytorch_lightning.loggers.base.LightningLoggerBase in favor of pytorch_lightning.loggers.logger.Logger, and deprecated pytorch_lightning.loggers.base in favor of pytorch_lightning.loggers.logger (#120148)
    • Deprecated pytorch_lightning.callbacks.base.Callback in favor of pytorch_lightning.callbacks.callback.Callback (#13031)
    • Deprecated num_processes, gpus, tpu_cores, and ipus from the Trainer constructor in favor of using the accelerator and devices arguments (#11040)
    • Deprecated setting LightningCLI(seed_everything_default=None) in favor of False (#12804).
    • Deprecated pytorch_lightning.core.lightning.LightningModule in favor of pytorch_lightning.core.module.LightningModule (#12740)
    • Deprecated pytorch_lightning.loops.base.Loop in favor of pytorch_lightning.loops.loop.Loop (#13043)
    • Deprecated Trainer.reset_train_val_dataloaders() in favor of Trainer.reset_{train,val}_dataloader (#12184)
    • Deprecated LightningCLI's registries in favor of importing the respective package (#13221)
    • Deprecated public utilities in pytorch_lightning.utilities.cli.LightningCLI in favor of equivalent copies in pytorch_lightning.cli.LightningCLI (#13767)
    • Deprecated pytorch_lightning.profiler in favor of pytorch_lightning.profilers (#12308)
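For readers updating older code, the renames and the Trainer device arguments listed above can be summarised in a short sketch. The import paths are copied from the deprecation list; the helper names (updated_path, migrate_device_args) are hypothetical and are not part of Lightning. The sketch also assumes the simple case where the legacy value (for example gpus=2) carries over unchanged to devices, which may not hold for every legacy form.

```python
# Old -> new locations, copied from the deprecation list above.
RENAMED_PATHS = {
    "pytorch_lightning.accelerators.gpu.GPUAccelerator":
        "pytorch_lightning.accelerators.cuda.CUDAAccelerator",
    "pytorch_lightning.loggers.base.LightningLoggerBase":
        "pytorch_lightning.loggers.logger.Logger",
    "pytorch_lightning.callbacks.base.Callback":
        "pytorch_lightning.callbacks.callback.Callback",
    "pytorch_lightning.core.lightning.LightningModule":
        "pytorch_lightning.core.module.LightningModule",
    "pytorch_lightning.loops.base.Loop":
        "pytorch_lightning.loops.loop.Loop",
}

# Legacy Trainer argument -> value for the `accelerator` argument (assumed mapping).
LEGACY_DEVICE_ARGS = {
    "gpus": "gpu",
    "tpu_cores": "tpu",
    "ipus": "ipu",
    "num_processes": "cpu",
}


def updated_path(dotted_path: str) -> str:
    """Return the new import path, or the input unchanged if it was not renamed."""
    return RENAMED_PATHS.get(dotted_path, dotted_path)


def migrate_device_args(kwargs: dict) -> dict:
    """Hypothetical helper: rewrite legacy device kwargs as accelerator/devices."""
    migrated = dict(kwargs)
    for legacy_name, accelerator in LEGACY_DEVICE_ARGS.items():
        value = migrated.pop(legacy_name, None)
        if value is None or value == 0:
            continue  # argument unset: nothing to migrate
        if "accelerator" in migrated or "devices" in migrated:
            raise ValueError(f"cannot combine {legacy_name!r} with accelerator/devices")
        migrated["accelerator"] = accelerator
        migrated["devices"] = value
    return migrated


print(updated_path("pytorch_lightning.loops.base.Loop"))
print(migrate_device_args({"gpus": 2, "max_epochs": 3}))
```

The resulting dictionary could then be passed to the Trainer constructor as keyword arguments.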
    Removed
    • Removed deprecated IndexBatchSamplerWrapper.batch_indices (#13565)
    • Removed the deprecated LightningModule.add_to_queue and LightningModule.get_from_queue method (#13600)
    • Removed deprecated pytorch_lightning.core.decorators.parameter_validation from decorators (#13514)
    • Removed the deprecated Logger.close method (#13149)
    • Removed the deprecated weights_summary argument from the Trainer constructor (#13070)
    • Removed the deprecated flush_logs_every_n_steps argument from the Trainer constructor (#13074)
    • Removed the deprecated process_position argument from the Trainer constructor (#13071)
    • Removed the deprecated checkpoint_callback argument from the Trainer constructor (#13027)
    • Removed the deprecated on_{train,val,test,predict}_dataloader hooks from the LightningModule and LightningDataModule (#13033)
    • Removed the deprecated TestTubeLogger (#12859)
    • Removed the deprecated pytorch_lightning.core.memory.LayerSummary and pytorch_lightning.core.memory.ModelSummary (#12593)
    • Removed the deprecated summarize method from the LightningModule (#12559)
    • Removed the deprecated model_size property from the LightningModule class (#12641)
    • Removed the deprecated stochastic_weight_avg argument from the Trainer constructor (#12535)
    • Removed the deprecated progress_bar_refresh_rate argument from the Trainer constructor (#12514)
    • Removed the deprecated prepare_data_per_node argument from the Trainer constructor (#12536)
    • Removed the deprecated pytorch_lightning.core.memory.{get_gpu_memory_map,get_memory_profile} (#12659)
    • Removed the deprecated terminate_on_nan argument from the Trainer constructor (#12553)
    • Removed the deprecated XLAStatsMonitor callback (#12688)
    • Removed deprecated pytorch_lightning.callbacks.progress.progress (#12658)
    • Removed the deprecated dim and size arguments from the LightningDataModule constructor (#12780)
    • Removed the deprecated train_transforms argument from the LightningDataModule constructor (#12662)
    • Removed the deprecated log_gpu_memory argument from the Trainer constructor (#12657)
    • Removed the deprecated automatic logging of GPU stats by the logger connector (#12657)
    • Removed deprecated GPUStatsMonitor callback (#12554)
    • Removed support for passing strategy names or strategy instances to the accelerator Trainer argument (#12696)
    • Removed support for passing strategy names or strategy instances to the plugins Trainer argument (#12700)
    • Removed the deprecated val_transforms argument from the LightningDataModule constructor (#12763)
    • Removed the deprecated test_transforms argument from the LightningDataModule constructor (#12773)
    • Removed deprecated Trainer(max_steps=None) (#13591)
    • Removed deprecated dataloader_idx argument from on_train_batch_start/end hooks Callback and LightningModule (#12769, #12977)
    • Removed deprecated get_progress_bar_dict property from LightningModule (#12839)
    • Removed sanity check for multi-optimizer support with habana backends (#13217)
    • Removed the need to explicitly load habana module (#13338)
    • Removed the deprecated Strategy.post_dispatch() hook (#13461)
    • Removed deprecated pytorch_lightning.callbacks.lr_monitor.LearningRateMonitor.lr_sch_names (#13353)
    • Removed deprecated Trainer.slurm_job_id in favor of SLURMEnvironment.job_id (#13459)
    • Removed support for the DDP2Strategy (#12705)
    • Removed deprecated LightningDistributed (#13549)
    • Removed deprecated ClusterEnvironment properties master_address and master_port in favor of main_address and main_port (#13458)
    • Removed deprecated ClusterEnvironment methods KubeflowEnvironment.is_using_kubeflow(), LSFEnvironment.is_using_lsf() and TorchElasticEnvironment.is_using_torchelastic() in favor of the detect() method (#13458)
    • Removed deprecated Callback.on_keyboard_interrupt (#13438)
    • Removed deprecated LightningModule.on_post_move_to_device (#13548)
    • Removed TPUSpawnStrategy.{tpu_local_core_rank,tpu_global_core_rank} attributes in favor of TPUSpawnStrategy.{local_rank,global_rank} (#11163)
    • Removed SingleTPUStrategy.{tpu_local_core_rank,tpu_global_core_rank} attributes in favor of SingleTPUStrategy.{local_rank,global_rank} (#11163)
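Several of the removals above concern Trainer constructor arguments, so a quick check of existing configuration can help before upgrading. The sketch below collects the argument names from this list; the function name find_removed_trainer_args is hypothetical and the check is only a plain-Python illustration, not a Lightning utility.

```python
# Trainer constructor arguments reported as removed in the list above.
REMOVED_TRAINER_ARGS = frozenset({
    "weights_summary",
    "flush_logs_every_n_steps",
    "process_position",
    "checkpoint_callback",
    "stochastic_weight_avg",
    "progress_bar_refresh_rate",
    "prepare_data_per_node",
    "terminate_on_nan",
    "log_gpu_memory",
})


def find_removed_trainer_args(kwargs: dict) -> list:
    """Hypothetical helper: return the removed argument names present in kwargs."""
    return sorted(set(kwargs) & REMOVED_TRAINER_ARGS)


old_config = {"max_epochs": 1, "weights_summary": "top", "terminate_on_nan": True}
print(find_removed_trainer_args(old_config))
```

An empty list means none of the arguments named in this list are present; it says nothing about other incompatibilities.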
    Fixed
    • Improved support for custom DataLoaders when instantiated in *_dataloader hook (#12981)
    • Allowed custom BatchSamplers when instantiated in *_dataloader hook (#13640)
    • Fixed an issue with unsupported torch.inference_mode() on hpu backends by making it use no_grad (#13014)
    • The model wrapper returned by LightningLite.setup() now properly supports pass-through when looking up attributes (#12597)
    • Fixed issue where the CLI fails with certain torch objects (#13153)
    • Fixed LightningCLI signature parameter resolving for some lightning classes (#13283)
    • Fixed Model Summary when using DeepSpeed Stage 3 (#13427)
    • Fixed pytorch_lightning.utilities.distributed.gather_all_tensors to handle tensors of different dimensions (#12630)
    • Fixed the input validation for the accelerator Trainer argument when passed as a string (#13417)
    • Fixed Trainer.predict(return_predictions=False) to track prediction's batch_indices (#13629)
    • Fixed an issue that prevented setting a custom CheckpointIO plugin with strategies (#13785)
    • Fixed main progress bar counter when val_check_interval=int and check_val_every_n_epoch=None (#12832)
    • Improved support for custom ReduceLROnPlateau scheduler if reduce_on_plateau is set by the user in scheduler config (#13838)
    • Used global_step while restoring logging step for old checkpoints (#13645)
    • When training with precision=16 on IPU, the cast has been moved off the IPU onto the host, making the copies from host to IPU cheaper (#13880)
    • Fixed error handling in learning rate finder when not enough data points are available to give a good suggestion (#13845)
    • Fixed an issue that caused the learning rate finder to set the model's learning rate to None when no suggestion was possible (#13845)
    • Fixed an issue causing deterministic algorithms and other globals to get reset in spawned processes (#13921)
    • Fixed default amp_level for DeepSpeedPrecisionPlugin to O2 (#13897)
    • Fixed Python 3.10 compatibility for truncated back-propagation through time (TBPTT) (#13973)
    • Fixed TQDMProgressBar reset and update to show correct time estimation (2/2) (#13962)
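The fix to the model wrapper returned by LightningLite.setup() (#12597) concerns attribute pass-through. The general idea can be sketched in plain Python as below. This is not the wrapper's actual code: PassThroughWrapper and TinyModel are hypothetical names, and the sketch only shows the common pattern of forwarding failed attribute lookups to the wrapped object.

```python
class PassThroughWrapper:
    """Hypothetical wrapper that forwards unknown attributes to the wrapped object."""

    def __init__(self, wrapped):
        self._wrapped = wrapped

    def __getattr__(self, name):
        # __getattr__ runs only when normal lookup on the wrapper itself fails.
        if name == "_wrapped":
            # Avoid infinite recursion if _wrapped has not been set yet.
            raise AttributeError(name)
        return getattr(self._wrapped, name)


class TinyModel:
    """Stand-in for a user-defined model with its own attributes and methods."""

    hidden_size = 16

    def describe(self):
        return f"TinyModel(hidden_size={self.hidden_size})"


wrapper = PassThroughWrapper(TinyModel())
print(wrapper.hidden_size)
print(wrapper.describe())
```

With this pattern, attributes defined on the wrapper take precedence, and a name missing from both objects still raises AttributeError.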

    Full commit list: https://github.com/PyTorchLightning/pytorch-lightning/compare/1.6.0...1.7.0

    Contributors

    Veteran

    @akashkw @akihironitta @aniketmaurya @awaelchli @Benjamin-Etheredge @Borda @carmocca @catalys1 @daniellepintz @edenlightning @edward-io @EricWiener @fschlatt @ftorres16 @jerome-habana @justusschock @karthikrangasai @kaushikb11 @krishnakalyan3 @krshrimali @mauvilsa @nikvaessen @otaj @pre-commit-ci @puhuk @raoakarsha @rasbt @rohitgr7 @SeanNaren @s-rog @talregev @tchaton @tshu-w @twsl @weiji14 @williamFalcon @WrRan

    New

    @alvitawa @aminst @ankitaS11 @ar90n @Atharva-Phatak @bibhabasumohapatra @BongYang @code-review-doctor @CompRhys @Cyprien-Ricque @dependabot @digital-idiot @DN6 @donlapark @ekagra-ranjan @ethanfurman @gautierdag @georgestein @HallerPatrick @HenryLau0220 @hhsecond @himkt @HMellor @igorgad @inwaves @ishtos @JeroenDelcour @JiahaoYao @jiny419 @jinyoung-lim @JustinGoheen @jxmorris12 @Keiku @kingjuno @lsy643 @luca-medeiros @lukasugar @maciek-pioro @mads-oestergaard @manskx @martinosorb @MohammedAlkhrashi @MrShevan @myxik @naisofly @NathanielDamours @nayoungjun @niberger @nitinramvelraj @nninept @pbsds @Pragyanstha @PrajwalBorkar @Prometheos2 @rampartrange @rhjohnstone @rschireman @samz5320 @Schinkikami @semaphore-egg @shantam-8 @shenoynikhil @sisilmehta2000 @s-kumano @stanbiryukov @talregev @tanmoyio @tkonopka @vumichien @wangherr @yhl48 @YongWookHa

    If we forgot somebody or you have a suggestion, find support here :zap:

    Did you know?

    Chuck Norris can unit-test entire applications with a single assert.

    Source code(tar.gz)
    Source code(zip)
    lightning-2022.8.2-py3-none-any.whl(164.77 KB)
    lightning-2022.8.2.tar.gz(58.69 KB)
    pytorch-lightning-1.7.0.tar.gz(505.14 KB)
    pytorch_lightning-1.7.0-py3-none-any.whl(684.43 KB)
  • app/0.5.4(Aug 1, 2022)

    [0.5.4] - 2022-08-01

    Changed

    • Wrapped imports for traceability (#13924)
    • Set version as today (#13906)

    Fixed

    • Included app templates to the lightning and app packages (#13731)
    • Added UI for installing it all (#13732)
    • Fixed build meta pkg flow (#13926)

    Contributors

    @Borda, @manskx

    If we forgot someone due to not matching commit email with GitHub account, let us know :]

    Source code(tar.gz)
    Source code(zip)
    lightning-2022.8.1-py3-none-any.whl(165.34 KB)
    lightning-2022.8.1.tar.gz(58.34 KB)
    lightning-app-0.5.4.tar.gz(1.03 MB)
    lightning_app-0.5.4-py3-none-any.whl(1.07 MB)