Overview

03/2021: DeepSpeed is hiring! Come join us: SDE 2, Sr. SDE, Sr. Researcher

DeepSpeed is a deep learning optimization library that makes distributed training easy, efficient, and effective.

10x Larger Models

10x Faster Training

Minimal Code Change

DeepSpeed delivers extreme-scale model training for everyone, from data scientists training on massive supercomputers to those training on low-end clusters or even on a single GPU:

  • Extreme scale: Using the current generation of GPU clusters with hundreds of devices, DeepSpeed's 3D parallelism can efficiently train deep learning models with trillions of parameters.
  • Extremely memory efficient: With just a single GPU, DeepSpeed's ZeRO-Offload can train models with over 10B parameters, 10x bigger than the state of the art, democratizing multi-billion-parameter model training so that many deep learning scientists can explore bigger and better models (see the config sketch after this list).
  • Extremely long sequence length: DeepSpeed's sparse attention powers input sequences an order of magnitude longer and achieves up to 6x faster execution compared with dense transformers.
  • Extremely communication efficient: 3D parallelism improves communication efficiency, allowing users to train multi-billion-parameter models 2–7x faster on clusters with limited network bandwidth. 1-bit Adam/1-bit LAMB reduce communication volume by up to 5x while achieving convergence comparable to Adam/LAMB, enabling scaling to different types of GPU clusters and networks.
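
As an illustration of how these capabilities are enabled, here is a minimal, hedged sketch of a ZeRO-Offload setup in Python. The configuration keys are standard DeepSpeed JSON config fields, but the model and all values are placeholders rather than a recipe from this page:

import torch
import deepspeed

# Stand-in model; any torch.nn.Module works here.
model = torch.nn.Linear(512, 512)

# Minimal config sketch: fp16 training with ZeRO stage 2 and the optimizer
# state offloaded to CPU memory (the ZeRO-Offload pattern).
ds_config = {
    "train_batch_size": 8,
    "fp16": {"enabled": True},
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu"},
    },
}

engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)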

Early adopters of DeepSpeed have already produced a language model (LM) with over 17B parameters called Turing-NLG, establishing a new SOTA in the LM category.

DeepSpeed is an important part of Microsoft’s new AI at Scale initiative to enable next-generation AI capabilities at scale; you can find more information here.

For further documentation, tutorials, and technical deep-dives, please see deepspeed.ai!

Table of Contents

Section Description
Why DeepSpeed? DeepSpeed overview
Install Installation details
Features Feature list and overview
Further Reading Documentation, tutorials, etc.
Contributing Instructions for contributing
Publications Publications related to DeepSpeed
Videos Videos related to DeepSpeed

Why DeepSpeed?

Training advanced deep learning models is challenging. Beyond model design, model scientists also need to set up state-of-the-art training techniques such as distributed training, mixed precision, gradient accumulation, and checkpointing. Even then, scientists may not achieve the desired system performance and convergence rate. Large model sizes are even more challenging: a large model easily runs out of memory under pure data parallelism, and model parallelism is difficult to apply. DeepSpeed addresses these challenges to accelerate model development and training.
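
The "minimal code change" claim comes down to a training loop like the following sketch, where engine is the object returned by deepspeed.initialize (as in the configuration example above) and the data loader and loss are placeholders:

for batch in data_loader:          # placeholder loader yielding input tensors
    outputs = engine(batch)        # forward pass through the wrapped model
    loss = outputs.float().mean()  # stand-in loss
    engine.backward(loss)          # DeepSpeed handles loss scaling and gradient accumulation
    engine.step()                  # optimizer step, lr schedule, gradient zeroing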

Installation

The quickest way to get started with DeepSpeed is via pip; this installs the latest release, which is not tied to specific PyTorch or CUDA versions. DeepSpeed includes several C++/CUDA extensions that we commonly refer to as our 'ops'. By default, all of these extensions/ops are built just-in-time (JIT) using torch's JIT C++ extension loader, which relies on ninja to build and dynamically link them at runtime.

Note: PyTorch must be installed before installing DeepSpeed.

pip install deepspeed

After installation, you can validate your install and see which extensions/ops your machine is compatible with via the DeepSpeed environment report.

ds_report

If you would like to pre-install any of the DeepSpeed extensions/ops (instead of JIT compiling) or install pre-compiled ops via PyPI, please see our advanced installation instructions.
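
For example, based on DeepSpeed's documented build flags (an assumption, since this page does not show it), all compatible ops can be pre-built at install time:

DS_BUILD_OPS=1 pip install deepspeed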

Features

Below we provide a brief feature list; see our detailed feature overview for descriptions and usage.

Further Reading

All DeepSpeed documentation can be found on our website: deepspeed.ai

Article Description
DeepSpeed Features DeepSpeed features
Getting Started First steps with DeepSpeed
DeepSpeed JSON Configuration Configuring DeepSpeed
API Documentation Generated DeepSpeed API documentation
CIFAR-10 Tutorial Getting started with CIFAR-10 and DeepSpeed
Megatron-LM Tutorial Train GPT2 with DeepSpeed and Megatron-LM
BERT Pre-training Tutorial Pre-train BERT with DeepSpeed
Learning Rate Range Test Tutorial Faster training with large learning rates
1Cycle Tutorial SOTA learning schedule in DeepSpeed

Contributing

DeepSpeed welcomes your contributions! Please see our contributing guide for more details on formatting, testing, etc.

Contributor License Agreement

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

Code of Conduct

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

Publications

  1. Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, Yuxiong He. (2019) ZeRO: memory optimizations toward training trillion parameter models. arXiv:1910.02054 and in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '20).
  2. Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. (2020) DeepSpeed: System Optimizations Enable Training Deep Learning Models with Over 100 Billion Parameters. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD '20, Tutorial).
  3. Minjia Zhang, Yuxiong He. (2020) Accelerating Training of Transformer-Based Language Models with Progressive Layer Dropping. arXiv:2010.13369 and NeurIPS 2020.
  4. Jie Ren, Samyam Rajbhandari, Reza Yazdani Aminabadi, Olatunji Ruwase, Shuangyan Yang, Minjia Zhang, Dong Li, Yuxiong He. (2021) ZeRO-Offload: Democratizing Billion-Scale Model Training. arXiv:2101.06840.
  5. Hanlin Tang, Shaoduo Gan, Ammar Ahmad Awan, Samyam Rajbhandari, Conglong Li, Xiangru Lian, Ji Liu, Ce Zhang, Yuxiong He. (2021) 1-bit Adam: Communication Efficient Large-Scale Training with Adam's Convergence Speed. arXiv:2102.02888.
  6. Samyam Rajbhandari, Olatunji Ruwase, Jeff Rasley, Shaden Smith, Yuxiong He. (2021) ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning. arXiv:2104.07857.
  7. Conglong Li, Ammar Ahmad Awan, Hanlin Tang, Samyam Rajbhandari, Yuxiong He. (2021) 1-bit LAMB: Communication Efficient Large-Scale Large-Batch Training with LAMB's Convergence Speed. arXiv:2104.06069.

Videos

  1. DeepSpeed KDD 2020 Tutorial
    1. Overview
    2. ZeRO + large model training
    3. 17B T-NLG demo
    4. Fastest BERT training + RScan tuning
    5. DeepSpeed hands-on deep dive: part 1, part 2, part 3
    6. FAQ
  2. Microsoft Research Webinar
  3. DeepSpeed on AzureML
  4. Community Tutorials
Comments
  • [REQUEST] How to wrap normalization layers like LayerNorm in FP32 when using ZeRO (fp16 or bf16)?

    enhancement 
    opened by xiaohu2015 0
  • [BUG] some docs have broken formatting

    Describe the bug

    The API arguments docs aren't formatted, e.g. these ones (but there are probably more of those):

    https://deepspeed.readthedocs.io/en/latest/training.html#model-saving
    https://deepspeed.readthedocs.io/en/latest/training.html#gradient-accumulation
    https://deepspeed.readthedocs.io/en/latest/model-checkpointing.html (most of the page)

    e.g. have a look at: https://deepspeed.readthedocs.io/en/latest/model-checkpointing.html#loading-training-checkpoints

    here all the args are piled into one paragraph instead of being itemized, and nothing is formatted:

    Load training checkpoint :param load_dir: Required. Directory to load the checkpoint from :param tag: Checkpoint tag used as a unique identifier for checkpoint, if not provided will attempt to load tag in ‘latest’ file :param load_module_strict: Optional. Boolean to strictly enforce that the keys in state_dict of module and checkpoint match. :param load_optimizer_states: Optional. Boolean to load the training optimizer states from Checkpoint. Ex. ADAM’s momentum and variance :param load_lr_scheduler_states: Optional. Boolean to add the learning rate scheduler states from Checkpoint. :param load_module_only: Optional. Boolean to load only the model weights from the checkpoint. Ex. warmstarting. :param custom_load_fn: Optional. Custom model load function.
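
    For reference, here is a hedged sketch of a call using the parameters listed above; the engine object and directory name are placeholders, while the parameter names are the ones from the docstring:

    # `engine` is assumed to be the object returned by deepspeed.initialize().
    load_path, client_state = engine.load_checkpoint(
        load_dir="checkpoints",         # Required: directory to load from
        tag=None,                       # defaults to the tag in the 'latest' file
        load_module_strict=True,        # enforce matching state_dict keys
        load_optimizer_states=True,     # e.g. ADAM's momentum and variance
        load_lr_scheduler_states=True,  # restore the lr scheduler state
    )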

    bug training 
    opened by stas00 0
  • [GatheredParameters] fix memory leak

    Currently, on exit from GatheredParameters with modifier_rank=None, memory is leaked, as the gathered param remains gathered (the leak remains until the param is gathered again, most likely at the first forward).

    This PR fixes this problem by re-partitioning the param on exit from the GatheredParameters context.

    A new test is supplied that reproduces this scenario and which fails prior to this PR.
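
    For context, a minimal sketch of the context manager involved (illustrative only; `model` is assumed to be a module whose parameters were partitioned by ZeRO-3, e.g. built under deepspeed.zero.Init):

    import torch
    import deepspeed

    # Gather the partitioned parameter so rank 0 can modify it; on exit the
    # parameter should be re-partitioned, which is the behavior this PR fixes
    # for the modifier_rank=None case.
    with deepspeed.zero.GatheredParameters(model.weight, modifier_rank=0):
        if torch.distributed.get_rank() == 0:
            model.weight.data.fill_(0.0)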

    @tjruwase

    opened by stas00 0
  • [GatheredParameters] add support for any iterable

    This PR extends GatheredParameters to support any iterable of parameters.

    Currently there is an issue if someone does:

    with deepspeed.zero.GatheredParameters(model.parameters(), ...):
    

    it gets silently skipped and no gathering happens.

    I raised this issue here: https://github.com/microsoft/DeepSpeed/issues/2658. It can be a huge problem if this happens during model weight init, which 99% of the time will silently do nothing on 0-length vectors, and the user is none the wiser that their training is going to break because of it. This is a very important issue, so please kindly give it extra attention. I ran into it myself and had users report the same issue.

    So this PR at least makes the most obvious mistake no longer a mistake: intuitively, model.parameters() should just work and not require the user to remember to call list(model.parameters()), since there is no assert to catch the mistake otherwise.

    I modified one of the tests to ensure this case is tested; the list case is an obvious sub-case of the generator one, but I can fork the test and do each explicitly if you prefer that.

    I changed the API doc to match the new reality. The tutorials/docs don't seem to discuss GatheredParameters's args, so there is nothing to change there.

    @tjruwase

    opened by stas00 0
  • [fp16] lower `initial_scale_power` to `16`

    I'm proposing to change the default initial_scale_power to 16 from the current 32. Here is why:

    From Wikipedia:

    The minimum strictly positive (subnormal) value is 2**-24 ≈ 5.96 × 10**-8. The minimum positive normal value is 2**-14 ≈ 6.10 × 10**-5. The maximum representable value is (2 - 2**-10) × 2**15 = 65504.

    So I guess if the loss were to be 2**-24 then the maximum possible loss scale could be 2**40 (24+16) before it overflows, so 2**32 mathematically passes as a legit loss scale, except it's fantastically improbable.
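
    A quick numeric sanity check of that bound (a sketch, not from the PR):

    import torch

    fp16_max = torch.finfo(torch.float16).max  # 65504, roughly 2**16
    loss = 2.0 ** -24                          # smallest subnormal fp16 magnitude
    # largest loss scale that keeps loss * scale representable in fp16:
    print(fp16_max / loss)                     # ~1.099e12, i.e. about 2**40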

    But practically, have you ever seen loss < 1? And loss > 1 then takes us to initial_scale_power=16 as the practical starting point, which is likely to lead to just a few skipped optimizer steps.

    (While this might sound like a pointless change, since a few skipped steps are likely to be totally insignificant when training for thousands of steps, these things affect situations like writing tests, or debugging a training run that fails from the very beginning, etc.)

    I hope this is not a backward-compatibility-breaking change: someone not specifying initial_scale_power explicitly and relying on the default 32 will now have a slightly different outcome, as they will start training sooner (less skipping). If it is, then we should leave the default at 32 but change the doc to use 16 and add a note explaining why.

    So please kindly discuss among your colleagues whether the proposed change is a good idea as is, or whether it'd be safer to change only the docs and not the default. Thank you.

    @tjruwase

    opened by stas00 0
  • Fix INT8-quantization for BLOOM, OPT, and Neo-X

    This PR addresses https://github.com/microsoft/DeepSpeed/issues/2616 and https://github.com/microsoft/DeepSpeed/issues/2379

    Also, this adds support for INT8 inference of the different model architectures, quantizing directly from the HF checkpoint. Here is an example using the DeepSpeedExamples inference test-suite, running facebook/opt-30b on only one 32GB NVIDIA V100 card:

    deepspeed --num_nodes 1 --num_gpus 1 inference-test.py --ds_inference --use_kernel --name facebook/opt-30b --use_meta_tensor --checkpoint_path ~/.cache/huggingface/hub/models--facebook--opt-30b/snapshots/463007d7da4e87fe962909a027811a8c0b32ede8/ --dtype int8
    

    producing the following text:

    ------------------------------------------------------
    Free memory : 0.238525 (GigaBytes)  
    Total memory: 31.748535 (GigaBytes)  
    Requested memory: 0.140137 (GigaBytes) 
    Setting maximum total tokens (input + output) to 82 
    ------------------------------------------------------
    generation time is 10.450812101364136 sec
    
    in=DeepSpeed is a machine learning framework
    out=DeepSpeed is a machine learning framework for large-scale, complex data
    
    DeepSpeed is a machine learning framework specifically designed to solve some of the most complex and large-scale problems. The goal of DeepSpeed is to provide a rich infrastructure on top of which researchers can build highly
    ------------------------------------------------------------
    [2023-01-04 11:23:05,806] [INFO] [launch.py:350:main] Process 33466 exits successfully.
    

    Note that memory is very tight here; nevertheless, we can still generate 50 tokens from the input text!

    opened by RezaYazdaniAminabadi 0
Releases (latest: v0.7.7)
  • v0.7.7(Dec 12, 2022)

    What's Changed

    • Update the locator for Megatron-LM by @rapsealk in https://github.com/microsoft/DeepSpeed/pull/2564
    • use get_global_rank if available by @jeffra in https://github.com/microsoft/DeepSpeed/pull/2567
    • Add Determined to open-source DL frameworks by @sirredbeard in https://github.com/microsoft/DeepSpeed/pull/2573
    • Support fp32 gradaccum for bf16 model by @delock in https://github.com/microsoft/DeepSpeed/pull/2566
    • Drop Maxwell Support by @cmikeh2 in https://github.com/microsoft/DeepSpeed/pull/2574
    • Fix quantized-inference & Add generic support of checkpoint loading by @RezaYazdaniAminabadi in https://github.com/microsoft/DeepSpeed/pull/2547
    • Fix MegatronLayerPolicy to have megatron_v2=True by @lekurile in https://github.com/microsoft/DeepSpeed/pull/2579
    • Update barrier and reduce_scatter_base to conform to PyTorch signatures by @Quentin-Anthony in https://github.com/microsoft/DeepSpeed/pull/2570
    • Support N-dimension input in quantization kernel by @lokoppakmsft in https://github.com/microsoft/DeepSpeed/pull/2575
    • Add checkpoint sharding unit tests by @mrwyattii in https://github.com/microsoft/DeepSpeed/pull/2561
    • Updating docs README by @jomayeri in https://github.com/microsoft/DeepSpeed/pull/2587
    • Updating API docs by @jomayeri in https://github.com/microsoft/DeepSpeed/pull/2586
    • Fix issues w. python 3.6 + add py-version checks to CI by @jeffra in https://github.com/microsoft/DeepSpeed/pull/2589
    • [benchmarks] get mask token from tokenizer by @jeffra in https://github.com/microsoft/DeepSpeed/pull/2592

    New Contributors

    • @rapsealk made their first contribution in https://github.com/microsoft/DeepSpeed/pull/2564
    • @sirredbeard made their first contribution in https://github.com/microsoft/DeepSpeed/pull/2573

    Full Changelog: https://github.com/microsoft/DeepSpeed/compare/v0.7.6...v0.7.7

  • v0.7.6(Dec 1, 2022)

    What's Changed

    • DeepSpeed inference config. (#2459) by @awan-10 in https://github.com/microsoft/DeepSpeed/pull/2472
    • Update docs to autogenerate pydantic config model docs by @mrwyattii in https://github.com/microsoft/DeepSpeed/pull/2509
    • Add max_tokens alias to max_out_tokens arg to maintain backwards compatibility by @lekurile in https://github.com/microsoft/DeepSpeed/pull/2508
    • Deepspeed quantization library v0.1 by @lokoppakmsft in https://github.com/microsoft/DeepSpeed/pull/2450
    • Fix backward compatibility for InferenceConfig by @mrwyattii in https://github.com/microsoft/DeepSpeed/pull/2516
    • Add missing Inference sub-configs by @mrwyattii in https://github.com/microsoft/DeepSpeed/pull/2518
    • Add note about nvcc/hipcc requirement by @jeffra in https://github.com/microsoft/DeepSpeed/pull/2519
    • Update codeowners by @jeffra in https://github.com/microsoft/DeepSpeed/pull/2525
    • Dequantization Utils Library by @cmikeh2 in https://github.com/microsoft/DeepSpeed/pull/2521
    • Fixes for torch 1.14 due to new torch.numel return type by @jeffra in https://github.com/microsoft/DeepSpeed/pull/2522
    • Ensure MOE is initialized for SD by @cmikeh2 in https://github.com/microsoft/DeepSpeed/pull/2534
    • Make DS-Inference config readable from JSON by @mrwyattii in https://github.com/microsoft/DeepSpeed/pull/2537
    • Add MII tests by @mrwyattii in https://github.com/microsoft/DeepSpeed/pull/2533
    • Remove mutable default parameter in init_inference() by @aphedges in https://github.com/microsoft/DeepSpeed/pull/2540
    • Change Where DS/Triton is Used in Stable Diffusion by @cmikeh2 in https://github.com/microsoft/DeepSpeed/pull/2536
    • Pass down the new DS inference config to replace_transformer_layer. by @awan-10 in https://github.com/microsoft/DeepSpeed/pull/2539
    • Adding Gradient Accumulation Data Type Config by @jomayeri in https://github.com/microsoft/DeepSpeed/pull/2512
    • Report progress at gradient accumulation boundary by @ShijieZZZZ in https://github.com/microsoft/DeepSpeed/pull/2553
    • encoded ds config into command line argument when launching child processes in autotuning by @cli99 in https://github.com/microsoft/DeepSpeed/pull/2524
    • Add missing MoE fields to inference config for backward compatibility by @mrwyattii in https://github.com/microsoft/DeepSpeed/pull/2556
    • Abstract accelerator (step 1) by @delock in https://github.com/microsoft/DeepSpeed/pull/2504
    • Fix invalid check of recorded parameter orders in zero stage3. by @inkcherry in https://github.com/microsoft/DeepSpeed/pull/2550

    New Contributors

    • @ShijieZZZZ made their first contribution in https://github.com/microsoft/DeepSpeed/pull/2553
    • @delock made their first contribution in https://github.com/microsoft/DeepSpeed/pull/2504
    • @inkcherry made their first contribution in https://github.com/microsoft/DeepSpeed/pull/2550

    Full Changelog: https://github.com/microsoft/DeepSpeed/compare/v0.7.5...v0.7.6

  • v0.7.5(Nov 14, 2022)

    What's Changed

    • Fix Bug #2319 by @jomayeri in https://github.com/microsoft/DeepSpeed/pull/2438
    • update pytorch pool operator function signiture by @cli99 in https://github.com/microsoft/DeepSpeed/pull/2443
    • Fix build issues on Windows by @eltonzheng in https://github.com/microsoft/DeepSpeed/pull/2428
    • rollback ds config changes by @cli99 in https://github.com/microsoft/DeepSpeed/pull/2395
    • Use CUDA events for inference model profiling by @mrwyattii in https://github.com/microsoft/DeepSpeed/pull/2371
    • Fixing a config mismatch in unit test. by @jomayeri in https://github.com/microsoft/DeepSpeed/pull/2447
    • Reduction Kernel Utility by @cmikeh2 in https://github.com/microsoft/DeepSpeed/pull/2436
    • deepspeed/launcher/launch.py: add option enable_each_rank_log by @guoyejun in https://github.com/microsoft/DeepSpeed/pull/2409
    • Fixes for various CI problems by @mrwyattii in https://github.com/microsoft/DeepSpeed/pull/2457
    • Cache Allocation and Softmax Fixes by @cmikeh2 in https://github.com/microsoft/DeepSpeed/pull/2433
    • Fix checkpoint loading at inference-engine by @RezaYazdaniAminabadi in https://github.com/microsoft/DeepSpeed/pull/2429
    • Create a new folder structure to isolate model-specific code in DS by @awan-10 in https://github.com/microsoft/DeepSpeed/pull/2464
    • don't gather partitioned activations for mp size 1 by @guoyejun in https://github.com/microsoft/DeepSpeed/pull/2454
    • Updating autotune json default in docs. by @jomayeri in https://github.com/microsoft/DeepSpeed/pull/2476
    • Added MLFLOW environment variables for logging metrics within trainig… by @savitamittal1 in https://github.com/microsoft/DeepSpeed/pull/2477
    • fix accelerate link in README by @kyoto7250 in https://github.com/microsoft/DeepSpeed/pull/2481
    • Fix Stable-Diffusion: Add correct memory-allocation at DeepSpeed-Attention by @RezaYazdaniAminabadi in https://github.com/microsoft/DeepSpeed/pull/2474
    • Fix CI issues related to cupy install by @mrwyattii in https://github.com/microsoft/DeepSpeed/pull/2483
    • Add scale_attn_by_inverse_layer_idx feature by @hyunwoongko in https://github.com/microsoft/DeepSpeed/pull/2486
    • Stable Diffusion Enhancements by @cmikeh2 in https://github.com/microsoft/DeepSpeed/pull/2491
    • stage_1_and_2.py: no allreduce needed when mp size is 1 by @guoyejun in https://github.com/microsoft/DeepSpeed/pull/2494
    • Make bf16_optimizer work for non pipeline parallelism by @tjruwase in https://github.com/microsoft/DeepSpeed/pull/2470
    • Fix nightly CI tests by @mrwyattii in https://github.com/microsoft/DeepSpeed/pull/2493
    • Make data contiguous before the inplace reshape-copy_ function. by @lokoppakmsft in https://github.com/microsoft/DeepSpeed/pull/2489
    • Fix typos: deepseed -> deepspeed by @jinyouzhi in https://github.com/microsoft/DeepSpeed/pull/2499

    New Contributors

    • @guoyejun made their first contribution in https://github.com/microsoft/DeepSpeed/pull/2409
    • @savitamittal1 made their first contribution in https://github.com/microsoft/DeepSpeed/pull/2477
    • @kyoto7250 made their first contribution in https://github.com/microsoft/DeepSpeed/pull/2481
    • @lokoppakmsft made their first contribution in https://github.com/microsoft/DeepSpeed/pull/2489
    • @jinyouzhi made their first contribution in https://github.com/microsoft/DeepSpeed/pull/2499

    Full Changelog: https://github.com/microsoft/DeepSpeed/compare/v0.7.4...v0.7.5

  • v0.7.4(Oct 21, 2022)

    What's Changed

    • MOE residual matmult unit test by @samadejacobs in https://github.com/microsoft/DeepSpeed/pull/2323
    • MOE matmult with memaccess by @samadejacobs in https://github.com/microsoft/DeepSpeed/pull/2336
    • Refactor residual add kernels by @arashb in https://github.com/microsoft/DeepSpeed/pull/2333
    • mem access for quantize kernel by @GuanhuaWang in https://github.com/microsoft/DeepSpeed/pull/2331
    • increase min pre-commit versions by @jeffra in https://github.com/microsoft/DeepSpeed/pull/2346
    • Extend scratch buffer for long prompts by @cmikeh2 in https://github.com/microsoft/DeepSpeed/pull/2212
    • [docs] fix zero docs by @jeffra in https://github.com/microsoft/DeepSpeed/pull/2350
    • Staging profile inference v1 (#2348) by @awan-10 in https://github.com/microsoft/DeepSpeed/pull/2349
    • Kernel Data Conversion Utility by @cmikeh2 in https://github.com/microsoft/DeepSpeed/pull/2327
    • Add Onebit Optimizers in init by @l4d2boomer in https://github.com/microsoft/DeepSpeed/pull/2340
    • docs(mixture-of-experts-inference): fix typo in tuto by @jqueguiner in https://github.com/microsoft/DeepSpeed/pull/2345
    • Use blob storage for datasets in unit tests by @mrwyattii in https://github.com/microsoft/DeepSpeed/pull/2342
    • Refactor gptj_residual_add kernels for better readability by @arashb in https://github.com/microsoft/DeepSpeed/pull/2358
    • Updated issue templates by @jeffra in https://github.com/microsoft/DeepSpeed/pull/2363
    • fix cuda invalid config error in dequant kernel by @GuanhuaWang in https://github.com/microsoft/DeepSpeed/pull/2362
    • Add missing pytest fixture scope by @arashb in https://github.com/microsoft/DeepSpeed/pull/2353
    • Extend residual_add kernel tests to cover pre_attn_norm by @arashb in https://github.com/microsoft/DeepSpeed/pull/2354
    • Refactor fused_bias_residual kernels for better readability by @arashb in https://github.com/microsoft/DeepSpeed/pull/2356
    • Capture error message during sweep tests by @molly-smith in https://github.com/microsoft/DeepSpeed/pull/2351
    • Fix an exception when auto-casting dicts to fp16 by @mjksmith in https://github.com/microsoft/DeepSpeed/pull/2370
    • Refactor remaining distributed tests by @mrwyattii in https://github.com/microsoft/DeepSpeed/pull/2216
    • Fix the MLP output tensor's shape by @arashb in https://github.com/microsoft/DeepSpeed/pull/2380
    • add 11.8 to cuda_minor_mismatch_ok to allow building with current CUDA by @Thomas-MMJ in https://github.com/microsoft/DeepSpeed/pull/2390
    • Pin Transformers test version by @mrwyattii in https://github.com/microsoft/DeepSpeed/pull/2402
    • Change type to tuple in replace_wo_policy isinstance check by @lekurile in https://github.com/microsoft/DeepSpeed/pull/2387
    • Checkpoint backwards-compatbility workaround by @tjruwase in https://github.com/microsoft/DeepSpeed/pull/2384
    • Add Predicated Global Load to Memory Access Utils by @cmikeh2 in https://github.com/microsoft/DeepSpeed/pull/2373
    • MII blog post by @jeffra in https://github.com/microsoft/DeepSpeed/pull/2418
    • Fix figure reference by @awan-10 in https://github.com/microsoft/DeepSpeed/pull/2419
    • Add SLURM Multinode Runner by @dashstander in https://github.com/microsoft/DeepSpeed/pull/2404
    • Fix issue with corrupted output on long generation for GPT by @andrewchernyh in https://github.com/microsoft/DeepSpeed/pull/2359
    • Fix GPT Neo-X multi-gpu inference by @andrewchernyh in https://github.com/microsoft/DeepSpeed/pull/2401
    • CI fixes related to triton by @jeffra in https://github.com/microsoft/DeepSpeed/pull/2422
    • [docs] update mii blog title by @jeffra in https://github.com/microsoft/DeepSpeed/pull/2423
    • add SD injection policy by @jeffra in https://github.com/microsoft/DeepSpeed/pull/2381
    • Fix checkpoint loading when it is a dictionary by @RezaYazdaniAminabadi in https://github.com/microsoft/DeepSpeed/pull/2425
    • Make error regex more generic in collect_results.py by @molly-smith in https://github.com/microsoft/DeepSpeed/pull/2415
    • fixes #2389 by @clumsy in https://github.com/microsoft/DeepSpeed/pull/2411
    • Fix for inference gpt-j test by @mrwyattii in https://github.com/microsoft/DeepSpeed/pull/2430
    • Fixing bug 2361 by @jomayeri in https://github.com/microsoft/DeepSpeed/pull/2410
    • Universal checkpoint for zero stage 1 by @tjruwase in https://github.com/microsoft/DeepSpeed/pull/2284
    • only add deps if extra is explicitly called by @jeffra in https://github.com/microsoft/DeepSpeed/pull/2432
    • Add TestInjectionPolicy inference unittest class for testing custom injection policies by @lekurile in https://github.com/microsoft/DeepSpeed/pull/2426
    • [memory estimators] new config args sync by @stas00 in https://github.com/microsoft/DeepSpeed/pull/2431
    • parallelize writing of layer checkpoint files across data parallel instances by @adammoody in https://github.com/microsoft/DeepSpeed/pull/1419
    • Fix broken link to DeepSpeed Megatron fork by @lekurile in https://github.com/microsoft/DeepSpeed/pull/2440

    New Contributors

    • @l4d2boomer made their first contribution in https://github.com/microsoft/DeepSpeed/pull/2340
    • @jqueguiner made their first contribution in https://github.com/microsoft/DeepSpeed/pull/2345
    • @mjksmith made their first contribution in https://github.com/microsoft/DeepSpeed/pull/2370
    • @Thomas-MMJ made their first contribution in https://github.com/microsoft/DeepSpeed/pull/2390
    • @lekurile made their first contribution in https://github.com/microsoft/DeepSpeed/pull/2387
    • @dashstander made their first contribution in https://github.com/microsoft/DeepSpeed/pull/2404
    • @andrewchernyh made their first contribution in https://github.com/microsoft/DeepSpeed/pull/2359
    • @clumsy made their first contribution in https://github.com/microsoft/DeepSpeed/pull/2411
    • @jomayeri made their first contribution in https://github.com/microsoft/DeepSpeed/pull/2410

    Full Changelog: https://github.com/microsoft/DeepSpeed/compare/v0.7.3...v0.7.4

  • v0.7.3(Sep 19, 2022)

    What's Changed

    • Add blob storage to CI runners by @mrwyattii in https://github.com/microsoft/DeepSpeed/pull/2260
    • Update replace_module.py, test-gptj.py related fix by @molly-smith in https://github.com/microsoft/DeepSpeed/pull/2269
    • Fix OrderedDict import for python3.6 by @Dipet in https://github.com/microsoft/DeepSpeed/pull/2267
    • Ds inference/fix mp2 by @RezaYazdaniAminabadi in https://github.com/microsoft/DeepSpeed/pull/2270
    • Trajepl: nebula load fix by @trajepl in https://github.com/microsoft/DeepSpeed/pull/2182
    • Prevent torch ext folder mkdir at tmp by @jeffra in https://github.com/microsoft/DeepSpeed/pull/2274
    • Ds-inference Int8 support through ZeroQuant technology by @RezaYazdaniAminabadi in https://github.com/microsoft/DeepSpeed/pull/2217
    • add a new unit test for cuda ops by @awan-10 in https://github.com/microsoft/DeepSpeed/pull/2278
    • Addition to code owners file by @cmikeh2 in https://github.com/microsoft/DeepSpeed/pull/2279
    • Memory Access Utility by @cmikeh2 in https://github.com/microsoft/DeepSpeed/pull/2276
    • Fp32 accuracy bug fix by @RezaYazdaniAminabadi in https://github.com/microsoft/DeepSpeed/pull/2285
    • Refactor universal checkpointing and tensor fragments by @tjruwase in https://github.com/microsoft/DeepSpeed/pull/2253
    • [ds-inference] fix progress bar by @stas00 in https://github.com/microsoft/DeepSpeed/pull/2286
    • Offload all gradients to nvme by @tjruwase in https://github.com/microsoft/DeepSpeed/pull/2282
    • fused bias relu unittest by @molly-smith in https://github.com/microsoft/DeepSpeed/pull/2297
    • Fix for pytest picking up wrong deepspeed by @mrwyattii in https://github.com/microsoft/DeepSpeed/pull/2299
    • Fix for Zero3 when MP>1 by @Quentin-Anthony in https://github.com/microsoft/DeepSpeed/pull/2289
    • Unit test for bias add kernel by @mrwyattii in https://github.com/microsoft/DeepSpeed/pull/2298
    • Update relu.cu with mem_access_utils by @molly-smith in https://github.com/microsoft/DeepSpeed/pull/2306
    • Add tensor parallel inference unit tests by @mrwyattii in https://github.com/microsoft/DeepSpeed/pull/2232
    • Fix the residual add mp scaling for GPTNeoX by @arashb in https://github.com/microsoft/DeepSpeed/pull/2310
    • Add unit tests for residual_add kernel by @arashb in https://github.com/microsoft/DeepSpeed/pull/2307
    • add inference eval scripts by @jeffra in https://github.com/microsoft/DeepSpeed/pull/2303
    • Upgrade P40 tests to torch 1.8 by @mrwyattii in https://github.com/microsoft/DeepSpeed/pull/2316
    • ZeRO-Inference blog by @tjruwase in https://github.com/microsoft/DeepSpeed/pull/2271
    • ZeRO-Inference blog - wrap up by @tjruwase in https://github.com/microsoft/DeepSpeed/pull/2321
    • ZeRO-Inference blog - Update README by @tjruwase in https://github.com/microsoft/DeepSpeed/pull/2322
    • Refactor relu bias add with mem_access utils by @mrwyattii in https://github.com/microsoft/DeepSpeed/pull/2317
    • add quant unit test by @GuanhuaWang in https://github.com/microsoft/DeepSpeed/pull/2315
    • only override forward if using cuda-graph by @jeffra in https://github.com/microsoft/DeepSpeed/pull/2291
    • Add more options to inference benchmark by @mrwyattii in https://github.com/microsoft/DeepSpeed/pull/2325

    New Contributors

    • @molly-smith made their first contribution in https://github.com/microsoft/DeepSpeed/pull/2269

    Full Changelog: https://github.com/microsoft/DeepSpeed/compare/v0.7.2...v0.7.3

  • v0.7.2(Aug 25, 2022)

    What's Changed

    • Enable contiguous gradients with Z1+MoE by @siddharth9820 in https://github.com/microsoft/DeepSpeed/pull/2250
    • Correctly detect CPU optimizer usage by @tjruwase in https://github.com/microsoft/DeepSpeed/pull/2257
    • Update Half Precision Kernel Compatibility by @cmikeh2 in https://github.com/microsoft/DeepSpeed/pull/2261
    • fix #2240: wrong time unit in flops_profiler by @yzs981130 in https://github.com/microsoft/DeepSpeed/pull/2241

    New Contributors

    • @cmikeh2 made their first contribution in https://github.com/microsoft/DeepSpeed/pull/2261
    • @yzs981130 made their first contribution in https://github.com/microsoft/DeepSpeed/pull/2241

    Full Changelog: https://github.com/microsoft/DeepSpeed/compare/v0.7.1...v0.7.2

  • v0.7.1(Aug 23, 2022)

    What's Changed

    • Fix for distributed tests on pytorch>=1.12 by @mrwyattii in https://github.com/microsoft/DeepSpeed/pull/2141
    • delay torch import for inference compatability check by @jeffra in https://github.com/microsoft/DeepSpeed/pull/2167
    • Fix wrong unit of latency in flops-profiler (#2090) by @zionwu in https://github.com/microsoft/DeepSpeed/pull/2095
    • [docs] adoption updates by @jeffra in https://github.com/microsoft/DeepSpeed/pull/2173
    • Update for AMD CI workflow by @mrwyattii in https://github.com/microsoft/DeepSpeed/pull/2172
    • [docs] update offload docs to include stage 1 by @jeffra in https://github.com/microsoft/DeepSpeed/pull/2178
    • Fixing model partitioning without injection by @RezaYazdaniAminabadi in https://github.com/microsoft/DeepSpeed/pull/2179
    • Match compute and reduce dtype by @tjruwase in https://github.com/microsoft/DeepSpeed/pull/2145
    • Enable fused_lamb_cuda_kernel on ROCm by @rraminen in https://github.com/microsoft/DeepSpeed/pull/2148
    • Update README to latest Composer version by @hanlint in https://github.com/microsoft/DeepSpeed/pull/2177
    • [deepspeed/autotuner] Missing hjson import by @rahilbathwal5 in https://github.com/microsoft/DeepSpeed/pull/2175
    • [docs] add more models to adoption by @jeffra in https://github.com/microsoft/DeepSpeed/pull/2189
    • [CI] fix lightning tests by @jeffra in https://github.com/microsoft/DeepSpeed/pull/2190
    • Fix typos on README.md by @gasparitiago in https://github.com/microsoft/DeepSpeed/pull/2192
    • Fix the layer-past for GPT based models by @RezaYazdaniAminabadi in https://github.com/microsoft/DeepSpeed/pull/2196
    • Add gradient_average flag support for sparse grads by @Dipet in https://github.com/microsoft/DeepSpeed/pull/2188
    • Adding the compression tutorial on GPT distillation and quantization by @minjiaz in https://github.com/microsoft/DeepSpeed/pull/2197
    • Log user config exactly by @tjruwase in https://github.com/microsoft/DeepSpeed/pull/2201
    • Fix the tensor-slicing copy for qkv parameters by @RezaYazdaniAminabadi in https://github.com/microsoft/DeepSpeed/pull/2198
    • Refactor Distributed Tests by @mrwyattii in https://github.com/microsoft/DeepSpeed/pull/2180
    • fix table syntax by @kamalkraj in https://github.com/microsoft/DeepSpeed/pull/2204
    • Correctly detect offload configuration by @tjruwase in https://github.com/microsoft/DeepSpeed/pull/2208
    • add cuda 11.7 by @jeffra in https://github.com/microsoft/DeepSpeed/pull/2211
    • use torch 1.9 in accelerate tests by @jeffra in https://github.com/microsoft/DeepSpeed/pull/2215
    • [zero-3] print warning once and support torch parameter by @awan-10 in https://github.com/microsoft/DeepSpeed/pull/2127
    • Add support of OPT models by @arashb in https://github.com/microsoft/DeepSpeed/pull/2205
    • fix typos in readme. by @zhjohnchan in https://github.com/microsoft/DeepSpeed/pull/2218
    • Fix regression w. dist_init_required by @jeffra in https://github.com/microsoft/DeepSpeed/pull/2225
    • add doc for new bert example by @conglongli in https://github.com/microsoft/DeepSpeed/pull/2224
    • Remove the random-generator from context during inference by @RezaYazdaniAminabadi in https://github.com/microsoft/DeepSpeed/pull/2228
    • allow saving ckpt w/o ckpt json + bloom copy fix by @jeffra in https://github.com/microsoft/DeepSpeed/pull/2237
    • Correctly detect zero_offload by @tjruwase in https://github.com/microsoft/DeepSpeed/pull/2213
    • [docs] update community videos by @jeffra in https://github.com/microsoft/DeepSpeed/pull/2249
    • Refactor dist tests: Checkpointing by @tjruwase in https://github.com/microsoft/DeepSpeed/pull/2202
    • Make OPT policy backward compatible with pre-OPT transformers versions by @arashb in https://github.com/microsoft/DeepSpeed/pull/2254
    • fix ds-inference without policy by @RezaYazdaniAminabadi in https://github.com/microsoft/DeepSpeed/pull/2247

    New Contributors

    • @zionwu made their first contribution in https://github.com/microsoft/DeepSpeed/pull/2095
    • @hanlint made their first contribution in https://github.com/microsoft/DeepSpeed/pull/2177
    • @rahilbathwal5 made their first contribution in https://github.com/microsoft/DeepSpeed/pull/2175
    • @gasparitiago made their first contribution in https://github.com/microsoft/DeepSpeed/pull/2192
    • @arashb made their first contribution in https://github.com/microsoft/DeepSpeed/pull/2205
    • @zhjohnchan made their first contribution in https://github.com/microsoft/DeepSpeed/pull/2218

    Full Changelog: https://github.com/microsoft/DeepSpeed/compare/v0.7.0...v0.7.1

  • v0.7.0(Aug 1, 2022)

    New features

    • DeepSpeed Compression: https://www.microsoft.com/en-us/research/blog/deepspeed-compression-a-composable-library-for-extreme-compression-and-zero-cost-quantization/

    What's Changed

    • Adding DeepSpeed Compression Composer by @yaozhewei in https://github.com/microsoft/DeepSpeed/pull/2105
    • Remove hardcoded ROCm install path by @mrwyattii in https://github.com/microsoft/DeepSpeed/pull/2093
    • Fix softmax dim of Residual MoE implementation in moe/layer.py by @hero007feng in https://github.com/microsoft/DeepSpeed/pull/2110
    • reduce ds-inference log verbosity by @jeffra in https://github.com/microsoft/DeepSpeed/pull/2111
    • DeepSpeed Compression announcement by @conglongli in https://github.com/microsoft/DeepSpeed/pull/2114
    • Checkpoint reshaping by @tjruwase in https://github.com/microsoft/DeepSpeed/pull/1953
    • Fix init_process_group by @Quentin-Anthony in https://github.com/microsoft/DeepSpeed/pull/2121
    • DS Benchmarks QoL Improvements by @Quentin-Anthony in https://github.com/microsoft/DeepSpeed/pull/2120
    • [ROCm] Wrong command broke ROCm build. by @jpvillam-amd in https://github.com/microsoft/DeepSpeed/pull/2118
    • DeepSpeed Communication Profiling and Logging by @Quentin-Anthony in https://github.com/microsoft/DeepSpeed/pull/2012
    • Add flake8 to pre-commit checks by @aphedges in https://github.com/microsoft/DeepSpeed/pull/2051
    • Fix conflict between Tutel and top-2 gate in MoE layer by @yetiansh in https://github.com/microsoft/DeepSpeed/pull/2053
    • adding HF Accelerate+DS tests workflow by @pacman100 in https://github.com/microsoft/DeepSpeed/pull/2134
    • [inference tests] turn off time check for now by @jeffra in https://github.com/microsoft/DeepSpeed/pull/2142
    • Allow turning off loss scaling wrt GAS + update tput calculator by @jeffra in https://github.com/microsoft/DeepSpeed/pull/2140
    • Refactor ZeRO configs to use Pydantic by @mrwyattii in https://github.com/microsoft/DeepSpeed/pull/2004
    • Add purely-local sliding window sparse attention config by @Quentin-Anthony in https://github.com/microsoft/DeepSpeed/pull/1962
    • Trajepl/nebula ckpt engine by @trajepl in https://github.com/microsoft/DeepSpeed/pull/2085
    • Graceful exit on failures for multi-node runs by @jerrymannil in https://github.com/microsoft/DeepSpeed/pull/2008
    • fix: fix BF16_Optimizer compatibility issue by @shjwudp in https://github.com/microsoft/DeepSpeed/pull/2152
    • Fix random token-generation issue + MP-checkpoint loading/saving by @RezaYazdaniAminabadi in https://github.com/microsoft/DeepSpeed/pull/2132
    • Added retain_graph as a kwarg to the main engine backward function by @ncilfone in https://github.com/microsoft/DeepSpeed/pull/1149
    • Elastic Training support in DeepSpeed by @aj-prime in https://github.com/microsoft/DeepSpeed/pull/2156
    • prevent cuda 10 builds of inference kernels on ampere by @jeffra in https://github.com/microsoft/DeepSpeed/pull/2157
    • [zero-3] shutdown zero.Init from within ds.init by @jeffra in https://github.com/microsoft/DeepSpeed/pull/2150
    • enable fp16 input autocasting by @jeffra in https://github.com/microsoft/DeepSpeed/pull/2158
    • Release swap buffers for persisted params by @tjruwase in https://github.com/microsoft/DeepSpeed/pull/2089
    • Tensor parallelism for Mixture of Experts by @siddharth9820 in https://github.com/microsoft/DeepSpeed/pull/2074

    New Contributors

    • @hero007feng made their first contribution in https://github.com/microsoft/DeepSpeed/pull/2110
    • @jpvillam-amd made their first contribution in https://github.com/microsoft/DeepSpeed/pull/2118
    • @yetiansh made their first contribution in https://github.com/microsoft/DeepSpeed/pull/2053
    • @pacman100 made their first contribution in https://github.com/microsoft/DeepSpeed/pull/2134
    • @jimwu6 made their first contribution in https://github.com/microsoft/DeepSpeed/pull/2144
    • @trajepl made their first contribution in https://github.com/microsoft/DeepSpeed/pull/2085
    • @ncilfone made their first contribution in https://github.com/microsoft/DeepSpeed/pull/1149

    Full Changelog: https://github.com/microsoft/DeepSpeed/compare/v0.6.7...v0.7.0

  • v0.6.7(Jul 19, 2022)

    What's Changed

    • Add Inference support for running the BigScience-BLOOM Architecture by @RezaYazdaniAminabadi in https://github.com/microsoft/DeepSpeed/pull/2083
    • [ds-inference] checkpoint loading => tqdm by @stas00 in https://github.com/microsoft/DeepSpeed/pull/2107
    • Dont overwrite hook handles in flop profiler by @Sanger2000 in https://github.com/microsoft/DeepSpeed/pull/2106
    • Support HuggingFace NeoX injection policy by @mrwyattii in https://github.com/microsoft/DeepSpeed/pull/2087

    Full Changelog: https://github.com/microsoft/DeepSpeed/compare/v0.6.6...v0.6.7

  • v0.6.6(Jul 18, 2022)

    What's Changed

    • [docs] add 530b paper by @jeffra in https://github.com/microsoft/DeepSpeed/pull/1979
    • small fix for the HF Bert models by @RezaYazdaniAminabadi in https://github.com/microsoft/DeepSpeed/pull/1984
    • Add unit test for various model families and inference tasks by @mrwyattii in https://github.com/microsoft/DeepSpeed/pull/1981
    • Fix for lightning tests by @mrwyattii in https://github.com/microsoft/DeepSpeed/pull/1988
    • fix typo when getting kernel dim in conv calculation by @cli99 in https://github.com/microsoft/DeepSpeed/pull/1989
    • Add torch-latest and torch-nightly CI workflows by @mrwyattii in https://github.com/microsoft/DeepSpeed/pull/1990
    • [bug] Add user-defined launcher args for MPI launcher by @mrwyattii in https://github.com/microsoft/DeepSpeed/pull/1933
    • Propagate max errorcode to deepspeed when using PDSH launcher by @jerrymannil in https://github.com/microsoft/DeepSpeed/pull/1994
    • [docs] add new build badges to landing page by @jeffra in https://github.com/microsoft/DeepSpeed/pull/1998
    • DeepSpeed Comm. Backend v1 by @awan-10 in https://github.com/microsoft/DeepSpeed/pull/1985
    • Relax DeepSpeed MoE ZeRO-1 Assertion by @Quentin-Anthony in https://github.com/microsoft/DeepSpeed/pull/2007
    • update CODEOWNERS by @conglongli in https://github.com/microsoft/DeepSpeed/pull/2017
    • [CI] force upgrade HF dependencies & output py env by @jeffra in https://github.com/microsoft/DeepSpeed/pull/2015
    • [inference] test suite for ds-kernels (bert, roberta, gpt2, gpt-neo, gpt-j) by @jeffra in https://github.com/microsoft/DeepSpeed/pull/1992
    • DeepSpeed examples refresh by @jeffra in https://github.com/microsoft/DeepSpeed/pull/2021
    • Fix transformer API for training-evaluation pipeline by @RezaYazdaniAminabadi in https://github.com/microsoft/DeepSpeed/pull/2018
    • DataLoader Length Fix by @Sanger2000 in https://github.com/microsoft/DeepSpeed/pull/1718
    • DeepSpeed Monitor Module (Master) by @Quentin-Anthony in https://github.com/microsoft/DeepSpeed/pull/2013
    • Use partition numel by @tjruwase in https://github.com/microsoft/DeepSpeed/pull/2011
    • fix import errors by @KMFODA in https://github.com/microsoft/DeepSpeed/pull/2026
    • Fix inference unit test import error catching by @mrwyattii in https://github.com/microsoft/DeepSpeed/pull/2024
    • Retain available params until last use by @tjruwase in https://github.com/microsoft/DeepSpeed/pull/2016
    • Split parameter offload from z3 by @tjruwase in https://github.com/microsoft/DeepSpeed/pull/2009
    • Fix flops profiler print statements by @mrwyattii in https://github.com/microsoft/DeepSpeed/pull/2038
    • Add compression papers by @conglongli in https://github.com/microsoft/DeepSpeed/pull/2042
    • Fix the half-precision version of CPU-Adam by @RezaYazdaniAminabadi in https://github.com/microsoft/DeepSpeed/pull/2032
    • Fix for AMD unit tests by @mrwyattii in https://github.com/microsoft/DeepSpeed/pull/2047
    • Wrong partition_id while copying fp32_params -> fp16 params in Z2 for MoE by @siddharth9820 in https://github.com/microsoft/DeepSpeed/pull/2058
    • Fix missing import in replace_module.py by @aphedges in https://github.com/microsoft/DeepSpeed/pull/2050
    • Comms Benchmarks by @Quentin-Anthony in https://github.com/microsoft/DeepSpeed/pull/2040
    • add ds inference paper by @jeffra in https://github.com/microsoft/DeepSpeed/pull/2072
    • Comments for better understanding of zero stage1_2 by @kisseternity in https://github.com/microsoft/DeepSpeed/pull/2027
    • [docs] fix broken read-the-docs build by @jeffra in https://github.com/microsoft/DeepSpeed/pull/2075
    • Fix building package without a GPU by @aphedges in https://github.com/microsoft/DeepSpeed/pull/2049
    • Fix partition id in the fp32->fp16 param copying step for z2+cpu-offload by @siddharth9820 in https://github.com/microsoft/DeepSpeed/pull/2059
    • Codeowner addendum and fix to small model debugging script by @samadejacobs in https://github.com/microsoft/DeepSpeed/pull/2076
    • remove require grad in params count by @cli99 in https://github.com/microsoft/DeepSpeed/pull/2065
    • Add missing newline for ZeroOneAdam parameter table by @manuelciosici in https://github.com/microsoft/DeepSpeed/pull/2088
    • fixed "None type has no len()" by @xiazeyu in https://github.com/microsoft/DeepSpeed/pull/2091
    • Improving memory utilization of Z2+MoE by @siddharth9820 in https://github.com/microsoft/DeepSpeed/pull/2079

    New Contributors

    • @jerrymannil made their first contribution in https://github.com/microsoft/DeepSpeed/pull/1994
    • @Sanger2000 made their first contribution in https://github.com/microsoft/DeepSpeed/pull/1718
    • @KMFODA made their first contribution in https://github.com/microsoft/DeepSpeed/pull/2026
    • @siddharth9820 made their first contribution in https://github.com/microsoft/DeepSpeed/pull/2058
    • @samadejacobs made their first contribution in https://github.com/microsoft/DeepSpeed/pull/2076
    • @xiazeyu made their first contribution in https://github.com/microsoft/DeepSpeed/pull/2091

    Full Changelog: https://github.com/microsoft/DeepSpeed/compare/v0.6.5...v0.6.6

  • v0.6.5(May 25, 2022)

    What's Changed

    • GatheredParameters - accept a tuple of params by @stas00 in https://github.com/microsoft/DeepSpeed/pull/1941
    • Update partition_parameters.py by @manuelciosici in https://github.com/microsoft/DeepSpeed/pull/1943
    • fix step in adam by @szhengac in https://github.com/microsoft/DeepSpeed/pull/1823
    • [pipe] prevent deadlock with multiple evals sequence by @stas00 in https://github.com/microsoft/DeepSpeed/pull/1944
    • Fairseq support by @jeffra in https://github.com/microsoft/DeepSpeed/pull/1915
    • DeepSpeed needs to start cleaning up by @tjruwase in https://github.com/microsoft/DeepSpeed/pull/1947
    • trivial fix by @kisseternity in https://github.com/microsoft/DeepSpeed/pull/1954
    • Enabling CUDA-graph for the bert-type models by @RezaYazdaniAminabadi in https://github.com/microsoft/DeepSpeed/pull/1952
    • Add loss scale guard to avoid inf loop by @Quentin-Anthony in https://github.com/microsoft/DeepSpeed/pull/1958
    • [launcher] add option to bypass ssh check by @liamcli in https://github.com/microsoft/DeepSpeed/pull/1957
    • Bump nokogiri from 1.13.4 to 1.13.6 in /docs by @dependabot in https://github.com/microsoft/DeepSpeed/pull/1965
    • Fix typo in timer.py by @Quentin-Anthony in https://github.com/microsoft/DeepSpeed/pull/1964
    • [docs] fix dependabot version issue by @jeffra in https://github.com/microsoft/DeepSpeed/pull/1966
    • Don't add curand on rocm by @jeffra in https://github.com/microsoft/DeepSpeed/pull/1968
    • Add Unidirectional Sparse Attention Type to BigBird and BSLongformer by @Quentin-Anthony in https://github.com/microsoft/DeepSpeed/pull/1959
    • Fix: Sparse tensors not updating by @Dipet in https://github.com/microsoft/DeepSpeed/pull/1914
    • Fixing several bugs in the inference-api and the kernels by @RezaYazdaniAminabadi in https://github.com/microsoft/DeepSpeed/pull/1951

    New Contributors

    • @Quentin-Anthony made their first contribution in https://github.com/microsoft/DeepSpeed/pull/1958

    Full Changelog: https://github.com/microsoft/DeepSpeed/compare/v0.6.4...v0.6.5

  • v0.6.4(May 6, 2022)

    What's Changed

    • [fix] Windows installs cannot import fcntl by @mrwyattii in https://github.com/microsoft/DeepSpeed/pull/1921
    • [build] explicitly add op_builder to manifest by @jeffra in https://github.com/microsoft/DeepSpeed/pull/1920
    • Enable DeepSpeed inference on ROCm by @rraminen in https://github.com/microsoft/DeepSpeed/pull/1922
    • bf16 inference by @tjruwase in https://github.com/microsoft/DeepSpeed/pull/1917
    • spell err by @kisseternity in https://github.com/microsoft/DeepSpeed/pull/1929
    • [ZeRO-3] Rename confusing log message by @jeffra in https://github.com/microsoft/DeepSpeed/pull/1932
    • [bug] Fix time log error in PipelineEngine by @Codle in https://github.com/microsoft/DeepSpeed/pull/1934
    • Improve z3 trace management by @tjruwase in https://github.com/microsoft/DeepSpeed/pull/1916

    New Contributors

    • @kisseternity made their first contribution in https://github.com/microsoft/DeepSpeed/pull/1929
    • @Codle made their first contribution in https://github.com/microsoft/DeepSpeed/pull/1934

    Full Changelog: https://github.com/microsoft/DeepSpeed/compare/v0.6.3...v0.6.4

  • v0.6.3(Apr 27, 2022)

    What's Changed

    • Fix setup.py crash when torch is not installed. by @PaperclipBadger in https://github.com/microsoft/DeepSpeed/pull/1866
    • Add support for AWS SageMaker. by @matherit in https://github.com/microsoft/DeepSpeed/pull/1868
    • Fix broken links by @tjruwase in https://github.com/microsoft/DeepSpeed/pull/1873
    • [docs] add amd blog to website by @jeffra in https://github.com/microsoft/DeepSpeed/pull/1874
    • [docs] add moe paper by @jeffra in https://github.com/microsoft/DeepSpeed/pull/1875
    • Supporting multiple modules injection with a single policy when they … by @samyam in https://github.com/microsoft/DeepSpeed/pull/1869
    • [docs] fix dead links by @jeffra in https://github.com/microsoft/DeepSpeed/pull/1877
    • add now required -lcurand to solve undefined symbol: curandCreateGenerator by @stas00 in https://github.com/microsoft/DeepSpeed/pull/1879
    • Bug fix for flops profilers output by @VisionTheta in https://github.com/microsoft/DeepSpeed/pull/1885
    • Bump nokogiri from 1.13.3 to 1.13.4 in /docs by @dependabot in https://github.com/microsoft/DeepSpeed/pull/1889
    • [docs] fix commonmarker security issue by @jeffra in https://github.com/microsoft/DeepSpeed/pull/1892
    • bf16+pipeline parallelism by @tjruwase in https://github.com/microsoft/DeepSpeed/pull/1801
    • fix file ordering by @szhengac in https://github.com/microsoft/DeepSpeed/pull/1822
    • Use f-strings where possible by @manuelciosici in https://github.com/microsoft/DeepSpeed/pull/1900
    • [partition_parameters.py] better diagnostics by @stas00 in https://github.com/microsoft/DeepSpeed/pull/1887
    • comm backend: cast bool when not supported by torch2cupy by @conglongli in https://github.com/microsoft/DeepSpeed/pull/1894
    • Use cuda events to improve timing for multi-stream execution by @tjruwase in https://github.com/microsoft/DeepSpeed/pull/1881
    • Fix multiple zero 3 tracing errors by @tjruwase in https://github.com/microsoft/DeepSpeed/pull/1901
    • Improve ds_report output for HIP/ROCm by @mrwyattii in https://github.com/microsoft/DeepSpeed/pull/1906
    • Fix launcher for reading env vars by @szhengac in https://github.com/microsoft/DeepSpeed/pull/1907
    • Fix OOM and type mismatch by @tjruwase in https://github.com/microsoft/DeepSpeed/pull/1884

    New Contributors

    • @PaperclipBadger made their first contribution in https://github.com/microsoft/DeepSpeed/pull/1866
    • @matherit made their first contribution in https://github.com/microsoft/DeepSpeed/pull/1868
    • @VisionTheta made their first contribution in https://github.com/microsoft/DeepSpeed/pull/1885
    • @szhengac made their first contribution in https://github.com/microsoft/DeepSpeed/pull/1822

    Misc

    • v0.6.2 was skipped due to a build/deploy issue with that release

    Full Changelog: https://github.com/microsoft/DeepSpeed/compare/v0.6.1...v0.6.3

  • v0.6.0(Mar 7, 2022)

  • v0.5.10(Jan 19, 2022)

  • v0.5.0(Aug 17, 2021)
