🔮 Execution time predictions for deep neural network training iterations across different GPUs.

Overview

Habitat: A Runtime-Based Computational Performance Predictor for Deep Neural Network Training

Habitat is a tool that predicts a deep neural network's training iteration execution time on a given GPU. It currently supports PyTorch. To learn more about how Habitat works, please see our research paper.

Running From Source

Currently, the only way to run Habitat is to build it from source. You should use the Docker image provided in this repository to make sure that you can compile the code.

  1. Download the Habitat pre-trained models.
  2. Run extract-models.sh under analyzer/ to extract and install the pre-trained models.
  3. Run setup.sh under docker/ to build the Habitat container image.
  4. Run start.sh to start a new container. By default, your home directory will be mounted inside the container under ~/home.
  5. Once inside the container, run install-dev.sh under analyzer/ to build and install the Habitat package.
  6. In your scripts, import habitat to get access to Habitat. See experiments/run_experiment.py for an example showing how to use Habitat.
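
Before starting the steps above, it can help to confirm that the checkout contains the scripts they refer to. The following is a minimal sketch, not part of Habitat: the helper name missing_scripts is hypothetical, and the location of start.sh under docker/ is an assumption, since step 4 does not say where it lives.

```python
from pathlib import Path
from typing import List

# Scripts named in the build steps, relative to the repository root.
# Step 4 does not state a directory for start.sh; docker/ is an assumption.
REQUIRED_SCRIPTS = [
    "analyzer/extract-models.sh",  # step 2
    "docker/setup.sh",             # step 3
    "docker/start.sh",             # step 4 (assumed location)
    "analyzer/install-dev.sh",     # step 5
]


def missing_scripts(repo_root: str) -> List[str]:
    """Return the required scripts that are not present under repo_root."""
    root = Path(repo_root)
    return [rel for rel in REQUIRED_SCRIPTS if not (root / rel).is_file()]


if __name__ == "__main__":
    missing = missing_scripts(".")
    if missing:
        print("Missing scripts:", ", ".join(missing))
    else:
        print("All build scripts found.")
```

Run it from the repository root; an empty result only means the files exist, not that the build will succeed.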

License

The code in this repository is licensed under the Apache 2.0 license (see LICENSE and NOTICE), with the exception of the files mentioned below.

This software contains source code provided by NVIDIA Corporation. These files are:

  • The code under cpp/external/cupti_profilerhost_util/ (CUPTI sample code)
  • cpp/src/cuda/cuda_occupancy.h

The code mentioned above is licensed under the NVIDIA Software Development Kit End User License Agreement.

We include the implementations of several deep neural networks under experiments/ for our evaluation. These implementations are copyrighted by their original authors and carry their original licenses. Please see the corresponding README files and license files inside the subdirectories for more information.

Research Paper

Habitat began as a research project in the EcoSystem Group at the University of Toronto. The accompanying research paper will appear in the proceedings of USENIX ATC'21. If you are interested, you can read a preprint of the paper here.

If you use Habitat in your research, please consider citing our paper:

@inproceedings{habitat-yu21,
  author = {Yu, Geoffrey X. and Gao, Yubo and Golikov, Pavel and Pekhimenko,
    Gennady},
  title = {{Habitat: A Runtime-Based Computational Performance Predictor for
    Deep Neural Network Training}},
  booktitle = {{Proceedings of the 2021 USENIX Annual Technical Conference
    (USENIX ATC'21)}},
  year = {2021},
}
Comments
  • I wonder what the meaning of "varying kernel" is.

    Hi, I am reading the Habitat research paper.

    I wonder what the meaning of "varying kernel" is. I thought a GPU kernel is a collection of instructions that run in parallel. Is that right?

    Can you give me an example of this phrase? "some DNN operations are implemented using different GPU kernels on different GPUs"

    Thank you for taking the time to read.

    question 
    opened by Baek-sohyeon 6
  •  error: function cuptiProfilerBeginSession(&begin_session_params) failed with error CUPTI_ERROR_UNKNOWN

    Hi @geoffxy,

    Great work here. I am quite interested in your project and am trying to reproduce it on my side. However, I hit the error in the title. I suspect that it may be caused by an incompatibility between CUPTI and the NVIDIA driver version. Could you share your experiment setup, mostly the host side? Are you still using 18.04? What is the NVIDIA driver version? Did you use nvidia-docker2 or nvidia-container-runtime? What is your Docker version?

    As for mine, I am using 18.04 as the host, driver 470.103.01, nvidia-docker2, and Docker 20.10.12.

    Thanks, Liang

    opened by liayan 3
  • How does Habitat measure the execution time associated with an operation's backward pass?

    Hi! Thanks for the great work.

    It is easy to understand how to measure the execution time of the forward pass. But how does Habitat do it for the backward pass? I think it is undoubtedly a different process, right?

    @geoffxy Hope for your reply soon!

    question 
    opened by xiyiyia 2
  • Large Prediction Errors

    Hi, I am reproducing the experiments in Habitat now. This is an interesting work and it's very convenient to run Habitat and process the results using the following two scripts.

    bash habitat/experiments/gather_raw_data.sh  <target_device>
    bash habitat/experiments/process_raw_data.sh
    

    Due to limited GPU resources, I cannot access all the GPU models listed in the paper and only tested on the V100, P100, and T4. However, the prediction error is quite large compared to that shown in the paper. You can check the results here.

    Basically, the setting I used follows habitat/docker/Dockerfile. Here are some of my experiment settings that may be different from yours:

    • CUDA driver version: 455.32.00,
    • I do not mount the user account on the host machine into the container

    So,

    1. Is there any hyper-parameter I need to tune to get a better prediction error?
    2. Can you share the cross-GPU prediction error between each pair of GPUs, or just the output of habitat/experiments/process_raw_data.sh? Fig. 3 in the paper only shows the results "averaged across all other “origin” GPUs".
    3. Will the setting differences listed above affect the prediction error? Or are there any other possible reasons?
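
When comparing results like these against the paper, it helps to be explicit about how the error is computed. The sketch below uses the conventional absolute percentage error between a predicted and a measured iteration time. That this matches what process_raw_data.sh reports is an assumption, and the function names are hypothetical.

```python
from statistics import mean
from typing import Iterable, Tuple


def percent_error(predicted_ms: float, measured_ms: float) -> float:
    """Absolute prediction error as a percentage of the measured time."""
    if measured_ms <= 0:
        raise ValueError("measured_ms must be positive")
    return abs(predicted_ms - measured_ms) * 100.0 / measured_ms


def average_percent_error(pairs: Iterable[Tuple[float, float]]) -> float:
    """Mean percentage error over (predicted_ms, measured_ms) pairs."""
    return mean(percent_error(p, m) for p, m in pairs)


if __name__ == "__main__":
    # Illustrative numbers only, not measurements from any GPU.
    samples = [(110.0, 100.0), (45.0, 50.0)]
    print(f"average error: {average_percent_error(samples):.1f}%")
```

Reporting this value per (origin, target) GPU pair, rather than one average, makes it easier to see whether a single pair dominates the error.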

    Thanks.

    question 
    opened by joapolarbear 2
  • CMake Error at CMakeLists.txt:22 (pybind11_add_module):

    When running install-dev.sh, I hit the error below:

    CMake Error at CMakeLists.txt:22 (pybind11_add_module): Unknown CMake command "pybind11_add_module".

    -- Configuring incomplete, errors occurred!

    opened by liayan 1
  • CUPTI_ERROR_INSUFFICIENT_PRIVILEGES in container

    The default configuration on my OS, combined with the current directions in the README, may lead to a CUPTI_ERROR_INSUFFICIENT_PRIVILEGES error when using CUPTI inside the container.

    An example log is attached below:

    /home/ubuntu/home/habitat/cpp/src/cuda/cupti_tracer.cpp:120: error: function cuptiActivityRegisterCallbacks(cuptiBufferRequested, cuptiBufferCompleted) failed with error CUPTI_ERROR_INSUFFICIENT_PRIVILEGES.
    Traceback (most recent call last):
      File "run_experiment.py", line 246, in <module>
        main()
      File "run_experiment.py", line 238, in main
        run_dcgan_experiments(context)
      File "run_experiment.py", line 155, in run_dcgan_experiments
        context,
      File "run_experiment.py", line 85, in run_experiment_config
        threshold = compute_threshold(runnable, context)
      File "run_experiment.py", line 66, in compute_threshold
        runnable()
      File "run_experiment.py", line 150, in runnable
        iteration(*inputs)
      File "/home/ubuntu/home/habitat/experiments/dcgan/entry_point.py", line 41, in iteration
        netD.zero_grad()
      File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 1098, in zero_grad
        p.grad.detach_()
      File "/home/ubuntu/home/habitat/analyzer/habitat/tracking/operation.py", line 62, in hook
        kwargs,
      File "/home/ubuntu/home/habitat/analyzer/habitat/profiling/operation.py", line 45, in measure_operation
        record_kernels,
      File "/home/ubuntu/home/habitat/analyzer/habitat/profiling/operation.py", line 164, in _to_run_time_measurement
        if record_kernels else []
      File "/home/ubuntu/home/habitat/analyzer/habitat/profiling/kernel.py", line 34, in measure_kernels
        self._measure_kernels_raw(runnable, fname)
      File "/home/ubuntu/home/habitat/analyzer/habitat/profiling/kernel.py", line 48, in _measure_kernels_raw
        time_kernels = hc.profile(runnable)
    RuntimeError: CUPTI_ERROR_INSUFFICIENT_PRIVILEGES
    

    My solution: add the line options nvidia "NVreg_RestrictProfilingToAdminUsers=0" to /etc/modprobe.d/nvidia-kernel-common.conf on the host, then reboot.
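
To check whether a modprobe configuration file already carries this option, a small parser is enough. This is a sketch under assumptions: the function name is hypothetical, it only inspects lines of the form shown above, and it does not tell you what the loaded driver is currently using (that still requires the reboot).

```python
import re
from pathlib import Path
from typing import Optional

OPTION = "NVreg_RestrictProfilingToAdminUsers"
CONF_PATH = "/etc/modprobe.d/nvidia-kernel-common.conf"  # path from the report above


def restrict_profiling_value(conf_text: str) -> Optional[int]:
    """Return the last value set for OPTION in modprobe config text, or None."""
    value = None
    for raw_line in conf_text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # ignore comments
        parts = line.split()
        if parts[:2] != ["options", "nvidia"]:
            continue
        match = re.search(rf"{OPTION}=(\d+)", line)
        if match:
            value = int(match.group(1))
    return value


if __name__ == "__main__":
    path = Path(CONF_PATH)
    text = path.read_text() if path.is_file() else ""
    value = restrict_profiling_value(text)
    if value == 0:
        print("Profiling is configured to be allowed for non-admin users.")
    else:
        print(f"{OPTION} is not set to 0 in {CONF_PATH}.")
```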

    Ref:

    • https://developer.nvidia.com/nvidia-development-tools-solutions-err_nvgpuctrperm-permission-issue-performance-counters
    • https://github.com/tensorflow/tensorflow/issues/35860#issuecomment-585436324
    opened by yzs981130 1
  • Fail to build the image

    Hi, I am following the steps here to reproduce Habitat. When running setup.sh to build the image, the following error occurs:

    Step 14/19 : RUN gpg --keyserver ha.pool.sks-keyservers.net --recv-keys B42F6819007F00F88E364FD4036A9C25BF357DD4
    ---> Running in d42ae3b13a05
    gpg: WARNING: unsafe permissions on homedir '/root/.gnupg'
    gpg: keybox '/root/.gnupg/pubring.kbx' created
    gpg: keyserver receive failed: No name
    The command '/bin/sh -c gpg --keyserver ha.pool.sks-keyservers.net --recv-keys B42F6819007F00F88E364FD4036A9C25BF357DD4' returned a non-zero code: 2
    

    Does this mean the keyserver ha.pool.sks-keyservers.net is no longer accessible?

    I also wonder whether it is necessary to duplicate the host machine's user account inside the container. With a root account in the container, I can access everything mounted from the host machine. What problem does that cause?

    Looking forward to your reply. Thanks.

    opened by joapolarbear 1
  • Fix format specifier for size_t

    https://stackoverflow.com/questions/2524611/how-can-one-print-a-size-t-variable-portably-using-the-printf-family

    Signed-off-by: Kiruya Momochi

    opened by KiruyaMomochi 0
  • Broken pillow for torchvision in Dockerfile causes docker build failed

    Currently, pip3 install torchvision==0.5.0 fails because of a broken Pillow dependency, as shown in the following CI build:

    https://github.com/yzs-lab/habitat/runs/4311964953?check_suite_focus=true#step:3:915

    Corresponding logs are attached below:

    The headers or library files could not be found for zlib,
        a required dependency when compiling Pillow from source.
        
        Please see the install instructions at:
           https://pillow.readthedocs.io/en/latest/installation.html
        
        Traceback (most recent call last):
          File "/tmp/pip-build-c0iq5ua_/pillow/setup.py", line 1024, in <module>
            zip_safe=not (debug_build() or PLATFORM_MINGW),
          File "/usr/lib/python3/dist-packages/setuptools/__init__.py", line 129, in setup
            return distutils.core.setup(**attrs)
          File "/usr/lib/python3.6/distutils/core.py", line 148, in setup
            dist.run_commands()
          File "/usr/lib/python3.6/distutils/dist.py", line 955, in run_commands
            self.run_command(cmd)
          File "/usr/lib/python3.6/distutils/dist.py", line 974, in run_command
            cmd_obj.run()
          File "/usr/lib/python3/dist-packages/setuptools/command/install.py", line 61, in run
            return orig.install.run(self)
          File "/usr/lib/python3.6/distutils/command/install.py", line 589, in run
            self.run_command('build')
          File "/usr/lib/python3.6/distutils/cmd.py", line 313, in run_command
            self.distribution.run_command(command)
          File "/usr/lib/python3.6/distutils/dist.py", line 974, in run_command
            cmd_obj.run()
          File "/usr/lib/python3.6/distutils/command/build.py", line 135, in run
            self.run_command(cmd_name)
          File "/usr/lib/python3.6/distutils/cmd.py", line 313, in run_command
            self.distribution.run_command(command)
          File "/usr/lib/python3.6/distutils/dist.py", line 974, in run_command
            cmd_obj.run()
          File "/usr/lib/python3/dist-packages/setuptools/command/build_ext.py", line 78, in run
            _build_ext.run(self)
          File "/usr/lib/python3.6/distutils/command/build_ext.py", line 339, in run
            self.build_extensions()
          File "/tmp/pip-build-c0iq5ua_/pillow/setup.py", line 790, in build_extensions
            raise RequiredDependencyException(f)
        __main__.RequiredDependencyException: zlib
        
        During handling of the above exception, another exception occurred:
        
        Traceback (most recent call last):
          File "<string>", line 1, in <module>
          File "/tmp/pip-build-c0iq5ua_/pillow/setup.py", line 1037, in <module>
            raise RequiredDependencyException(msg)
        __main__.RequiredDependencyException:
        
        The headers or library files could not be found for zlib,
        a required dependency when compiling Pillow from source.
        
        Please see the install instructions at:
           https://pillow.readthedocs.io/en/latest/installation.html
        
        
        
        ----------------------------------------
    Command "/usr/bin/python3 -u -c "import setuptools, tokenize;__file__='/tmp/pip-build-c0iq5ua_/pillow/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record /tmp/pip-8eakyb7g-record/install-record.txt --single-version-externally-managed --compile" failed with error code 1 in /tmp/pip-build-c0iq5ua_/pillow/
    The command '/bin/sh -c pip3 install   torch==1.4.0   torchvision==0.5.0   pandas==1.1.2   tqdm==4.49.0' returned a non-zero code: 1
    
    opened by yzs981130 0
Releases(v1.0.0)
  • v1.0.0(Jun 1, 2021)

    This release is the first feature release of Habitat.

    Habitat is a tool that predicts a deep neural network's training iteration execution time on a given GPU. To learn more about how Habitat works, please see our research paper.

Owner
Geoffrey Yu
Computer Science PhD Student at MIT | Software Engineering '18 @uWaterloo