SmartSim Infrastructure Library.

Overview






SmartSim

SmartSim makes it easier to use common Machine Learning (ML) libraries, like PyTorch and TensorFlow, in High Performance Computing (HPC) simulations and workloads.

SmartSim provides an API to connect HPC (MPI + X) simulations written in Fortran, C, C++, and Python to an in-memory database called the Orchestrator. The Orchestrator is built on Redis, a popular caching database written in C. This connection between simulation and database is the fundamental paradigm of SmartSim. Simulations in the aforementioned languages can stream data to the Orchestrator and pull the data out in Python for online analysis, visualization, and training.

In addition, the Orchestrator is equipped with ML inference runtimes: PyTorch, TensorFlow, and ONNX. From inside a simulation, users can store and execute trained models and retrieve the result.
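As a hedged illustration of this workflow, the sketch below uses the SmartRedis Python client to stream a tensor into the Orchestrator, pull it back out, and run a stored TorchScript model inside the database. The address, tensor keys, and model file are placeholders, and constructor arguments may differ slightly between SmartRedis releases.

```python
# Minimal sketch of the SmartRedis Python client; the address, keys, and
# "model.pt" file are assumptions for illustration only.
import numpy as np
from smartredis import Client

# Connect to a standalone (non-clustered) Orchestrator
client = Client(address="127.0.0.1:6379", cluster=False)

# Stream simulation data into the database ...
client.put_tensor("forces", np.random.rand(100, 3).astype(np.float32))

# ... and pull it back out, e.g. from a separate Python analysis process
forces = client.get_tensor("forces")

# Store a trained TorchScript model and run inference inside the database
client.set_model_from_file("my_model", "model.pt", "TORCH", device="CPU")
client.run_model("my_model", inputs=["forces"], outputs=["prediction"])
prediction = client.get_tensor("prediction")
```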

Supported ML Libraries

SmartSim 0.3.0 uses Redis 6.0.8 and RedisAI 1.2.

| Library    | Supported Version |
|------------|-------------------|
| PyTorch    | 1.7.0             |
| TensorFlow | 1.15.0            |
| ONNX       | 1.2.0             |

PyTorch is currently the most tested backend within SmartSim, and we recommend that users choose PyTorch if possible.

SmartSim is made up of two parts:

  1. SmartSim Infrastructure Library (This repository)
  2. SmartRedis

SmartSim Infrastructure Library

The Infrastructure Library (IL) helps users get the Orchestrator running on HPC systems. In addition, the IL provides mechanisms for creating, configuring, executing, and monitoring HPC workloads. Users can launch everything needed to run converged ML and simulation workloads right from a Jupyter notebook using the IL Python interface.
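The sketch below shows the kind of driver script the IL enables, assuming a local launcher and the Experiment/Orchestrator interfaces described in this README and in the release notes further down; executable names and arguments are placeholders, and method names may vary between releases.

```python
# A hedged sketch of an IL driver script; paths, executables, and some
# keyword arguments are illustrative rather than exact.
from smartsim import Experiment
from smartsim.database import Orchestrator

exp = Experiment("online-analysis", launcher="local")

# Launch the Orchestrator (Redis + RedisAI) that simulations connect to
db = Orchestrator(port=6379)
exp.start(db)

# Describe and launch the simulation as a SmartSim Model
rs = exp.create_run_settings(exe="python", exe_args="simulation.py")
model = exp.create_model("simulation", rs)
exp.start(model, block=True)

# Tear down the database when the workflow is done
exp.stop(db)
```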

Dependencies

The following third-party (non-Python) libraries are used in the SmartSim IL.

SmartRedis

The SmartSim IL clients (SmartRedis) are Redis client implementations that support the RedisAI API, with additions specific to scientific workflows.

SmartRedis clients are available in Fortran, C, C++, and Python. Users can seamlessly push data to and pull data from the Orchestrator across these languages.

| Language | Version/Standard |
|----------|------------------|
| Python   | 3.7+             |
| Fortran  | 2003             |
| C        | C99              |
| C++      | C++11            |

SmartRedis clients are cluster compatible and work with the open source Redis stack.

Dependencies

SmartRedis utilizes the following libraries.

Publications

The following are public presentations or publications using SmartSim (more to come!)

Cite

Please use the following citation when referencing SmartSim, SmartRedis, or any SmartSim related work.

Partee et al., “Using Machine Learning at Scale in HPC Simulations with SmartSim: An Application to Ocean Climate Modeling,” arXiv:2104.09355, Apr. 2021, [Online]. Available: http://arxiv.org/abs/2104.09355.

bibtex

```latex
@misc{partee2021using,
      title={Using Machine Learning at Scale in HPC Simulations with SmartSim: An Application to Ocean Climate Modeling},
      author={Sam Partee and Matthew Ellis and Alessandro Rigazzi and Scott Bachman and Gustavo Marques and Andrew Shao and Benjamin Robbins},
      year={2021},
      eprint={2104.09355},
      archivePrefix={arXiv},
      primaryClass={cs.CE}
}
```
Comments
  • Special torch version breaks GPU install of ML backends

    Special torch version breaks GPU install of ML backends

    Hi, I'm trying to configure SmartSim for an Intel+Nvidia Volta node on our local cluster. I was able to get the conda environment set up and successfully executed 'pip install smartsim'. However, when I tried the next step, 'smart --device gpu', I get this error:

    (SmartSim-cime) [[email protected] SmartSim]$ smart --device gpu

    Backends Requested

    PyTorch: True
    TensorFlow: True
    ONNX: False
    

    Running SmartSim build process...

    Traceback (most recent call last):
      File "/home/rta/.conda/envs/SmartSim-cime/bin/smart", line 424, in <module>
        cli()
      File "/home/rta/.conda/envs/SmartSim-cime/bin/smart", line 421, in cli
        builder.run_build(args.device, pt, tf, onnx)
      File "/home/rta/.conda/envs/SmartSim-cime/bin/smart", line 89, in run_build
        self.install_torch(device=device)
      File "/home/rta/.conda/envs/SmartSim-cime/bin/smart", line 138, in install_torch
        if not self.check_installed("torch", self.torch_version):
      File "/home/rta/.conda/envs/SmartSim-cime/bin/smart", line 121, in check_installed
        installed_major, installed_minor, _ = installed_version.split(".")
    ValueError: too many values to unpack (expected 3)
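    The failing unpack in check_installed assumes the installed version string has exactly three dot-separated fields, which breaks when a GPU-specific torch build carries extra version components. A hedged sketch of a more tolerant parse (helper name is illustrative, not the actual `smart` implementation) is shown below.

    ```python
    # Hypothetical, more tolerant version parse
    def parse_major_minor(installed_version: str):
        # Drop any local build suffix (e.g. "+cu110"), then split at most
        # twice so extra dot-separated fields cannot break the unpacking.
        release = installed_version.split("+")[0]
        major, minor, *_ = release.split(".", 2)
        return int(major), int(minor)

    assert parse_major_minor("1.7.1") == (1, 7)
    ```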

    area: build area: third-party User Issue 
    opened by aulwes 11
  • Codecov and reporter for CI

    Codecov and reporter for CI

    This PR adds running the coverage tests to the GitHub Actions CI. In addition to running the coverage tests, the resulting coverage report is uploaded to codecov, and a PR bot will report any coverage changes that a given PR causes. We also now have a codecov badge for our README.md.

    opened by EricGustin 6
  • Fix re-run smart build bug

    Fix re-run smart build bug

    This PR fixes #162. There are four cases that this PR accounts for, each of which has been confirmed to work on a local machine and on Horizon (a sketch of the decision logic follows the list).

    1. User attempts to run smart build when RAI_PATH is set and backends are installed
      • Inform user that backends are already built, tell them where they are built, and notify the user that there is no reason to build.
    2. User attempts to run smart build when RAI_PATH is set, but backends are not installed
      • Inform user that before running smart build, they should unset RAI_PATH
    3. User attempts to run smart build when RAI_PATH is not set and backends are installed
      • Inform user that they need to run smart clean before running smart build
    4. User attempts to run smart build when RAI_PATH is not set and backends are not installed
      • Allow smart build to continue
    opened by EricGustin 5
  • Change the way environmental variables are passed (batch mode)

    Change the way environmental variables are passed (batch mode)

    Description

    When in batch mode, SmartSim creates the batch script that is then executed on the target machine. Each generated srun execution in the script overwrites PATH, LD_LIBRARY_PATH, and PYTHONPATH using the export parameter of srun. This can be inconvenient if an added preamble (using the add_preamble function) modifies one of those variables, since the change will never propagate to the target machine.

    Justification

    Some use cases require sourcing scripts before using specialised hardware. Those scripts often modify the environment variables listed above. In general, the user should be able to change them in the preamble.

    Implementation Strategy

    PATH, LD_LIBRARY_PATH, and PYTHONPATH should not be changed using the export parameter of srun, but rather in one of the two following ways:

    1. Adding an export option to the generated sbatch script (#SBATCH --export) and removing the export from the srun parameters of the generated script. srun will then propagate those variables, and the user will have a chance to add additional paths.
    2. Simply modifying those variables as part of the sbatch script by adding export PATH=$PATH:smartsim_paths.... to the script after the preamble.

    This should not change the way SmartSim works, while giving the user more flexibility with preamble scripts.
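    A hedged sketch of the workflow this issue targets, assuming the add_preamble interface mentioned above and batch/run settings created through an Experiment; the sourced script path is hypothetical.

    ```python
    from smartsim import Experiment

    exp = Experiment("batch-env", launcher="slurm")

    bs = exp.create_batch_settings(nodes=1, time="00:10:00")
    # Source a site-specific script; today its PATH/LD_LIBRARY_PATH/PYTHONPATH
    # changes are clobbered by the per-srun export parameters described above.
    bs.add_preamble("source /opt/site/setup_accelerators.sh")  # hypothetical path

    rs = exp.create_run_settings(exe="python", exe_args="simulation.py")
    model = exp.create_model("sim", rs, batch_settings=bs)
    exp.start(model)
    ```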

    type: feature User Issue area: settings 
    opened by hubert-chrzaniuk-gc 5
  • Tutorial for SmartSim on Slurm

    Tutorial for SmartSim on Slurm

    Description

    Add a tutorial for using SmartSim on Slurm based systems.

    Justification

    Currently, we only have two tutorials, and neither covers Slurm functionality.

    Implementation Strategy

    The following items should be covered in a SmartSim-Slurm tutorial

    • [x] Launching jobs in a previously obtained interactive slurm allocation (using SrunSettings)
    • [x] Launching batch jobs with SbatchSettings in ensembles
    • [x] Getting and releasing allocations through the slurm interface
    • [x] Launching on allocations obtained through the smartsim.slurm interface
    • [x] Creating and launching the SlurmOrchestrator on interactive allocations and as a batch

    The tutorial should be implemented as a Jupyter Notebook and use nbsphinx to host it as part of the documentation.
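    A rough sketch of what such a tutorial might demonstrate, assuming the smartsim.slurm helpers and SrunSettings interface named above; keyword arguments are illustrative and may differ between SmartSim releases.

    ```python
    from smartsim import Experiment, slurm
    from smartsim.settings import SrunSettings

    # Get an interactive allocation through the slurm interface ...
    alloc = slurm.get_allocation(nodes=2, time="00:30:00")

    exp = Experiment("slurm-tutorial", launcher="slurm")

    # ... launch a job onto that allocation with SrunSettings ...
    rs = SrunSettings(exe="python", exe_args="simulation.py", alloc=alloc)
    rs.set_nodes(1)
    model = exp.create_model("sim", rs)
    exp.start(model, block=True)

    # ... and release the allocation when finished
    slurm.release_allocation(alloc)
    ```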

    area: docs 
    opened by Spartee 5
  • Fix codecov badge 'unknown' bug

    Fix codecov badge 'unknown' bug

    This is a fix for the codecov badge in the README displaying as 'unknown' rather than displaying the repository's code coverage percentage. This occurs because the coverage report is never uploaded on the develop branch. Instead, the coverage report is run on the compare branch of a PR. This is a problem because the codecov badge is linked to the develop branch. Since no coverage report is currently being uploaded from the develop branch, the badge is not able to display the repository's code coverage percentage.

    The solution introduced in this PR is to run the coverage tests & upload to codecov on pushes to the develop branch, in addition to the already existing event trigger for all PRs.

    opened by EricGustin 4
  • Fix build_docs github action + some minor improvements

    Fix build_docs github action + some minor improvements

    Fix the build_docs GitHub action, which was previously broken. Some other minor improvements were added as part of this fix.

    Changes include:

    • build_docs action is triggered when we push to develop branch
      • Previously, it was triggered on PR close, which isn't necessarily a merge; i.e., a contributor could close a PR without merging and the action would still trigger
    • Use GITHUB_TOKEN instead of personal authentication (PA) credentials to push
    • Use "[email protected]" as the git author (user) to distinguish automated commits from human commits
    • Check static files from doc branch into develop, so that we can modify them in a regular PR to develop, rather than have to edit the doc branch directly. I'd like for us to get to a point where doc branch pushes are fully automated, so we should never have to modify it directly.
      • Static files include: .nojekyll (both top-level and per-sphinx-build), CNAME, and top-level index.html
    opened by ben-albrecht 4
  • Orchestrator port parametrized in tests implemented

    Orchestrator port parametrized in tests implemented

    Created a parameter that is used for the port argument when instantiating an Orchestrator in the test suite.

    Originally, the default port parameter was a hardcoded port number. However, if two local tests run on the same system, local Orchestrators will all be launched on the same port, which will result in test failures.

    I added the method get_test_port() to the WLMUtils class in conftest.py and replaced each original port argument with a call to get_test_port(). Now, local Orchestrators will be launched on different ports rather than all on the same one.
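    A hypothetical illustration of the pattern (the real helper lives in the test suite's conftest.py as described above; the environment variable name here is an assumption).

    ```python
    import os

    def get_test_port(default: int = 6780) -> int:
        # Let parallel test sessions pick distinct ports via an environment
        # variable instead of a hardcoded number.
        return int(os.environ.get("SMARTSIM_TEST_PORT", default))

    # e.g. Orchestrator(port=get_test_port()) inside a test
    ```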

    opened by amandarichardsonn 3
  • Keydb support

    Keydb support

    This PR adds the ability to use KeyDB in place of Redis as the database through a new --keydb flag for smart build. This flexibility enables users to maximize their throughput if they desire.

    opened by EricGustin 3
  • Add new RunSettings Methods

    Add new RunSettings Methods

    Adds new RunSettings methods to grow the list of interface functions that are available across all RunSettings objects where possible. Methods were added for arguments that are shared by two or more of the existing RunSettings subclasses.

    Do let me know if I am missing any additional methods for arguments that should be included, or implementations for existing RunSettings.

    opened by MattToast 3
  • Adapt to new LSF situation on Summit

    Adapt to new LSF situation on Summit

    After the last update of Summit, many internal mechanisms of SmartSim stopped working. Here is a list of the issues and what I did to mitigate them:

    • ERF files used for MPMD no longer accept a simple rank count, but need rank IDs. The LSFOrchestrator will now run each shard on the rank with the same ID (rank 0 runs shard 0, and so on). The behavior is the same as before, but we now specify it explicitly.
    • ERF files no longer accept more than one app on the same host. I suspect this is a bug, but it means we cannot run more than one shard per host. This did not require any change, but it limits our features.
    • Environment variables were read the wrong way: specifying more than one env var resulted in incorrect handling (everything was assigned to the first var). We now store the formatted env vars as a list of strings, which is then parsed correctly.
    • Killing a jsrun process no longer kills its spawned processes. This caused most of the problems, as we were relying on it to stop applications. I turned JsrunSteps into managed steps, which means we use jslist to get the status of a jsrun call inside an allocation. It works, but if anything other than SmartSim launches a jsrun command, the matching step ID could be lost due to a race condition (IDs are assigned incrementally, starting from 0; we mock the Slurm alloc_id.step_id format internally to distinguish them from batch jobs). Users will need to launch jsrun only through SmartSim within an allocation.
    opened by al-rigazzi 3
  • Set number of tasks for PALS mpiexec

    Set number of tasks for PALS mpiexec

    Added an explicit call to set the number of tasks for the PALS mpiexec settings. This now sets the number of tasks with "--np", whereas before it was being set with "--n", which was giving an error on Polaris.

    opened by rickybalin 0
  • Domain Socket Support for co-located databases

    Domain Socket Support for co-located databases

    Models can now connect to a co-located database using a Unix domain socket (UDS). In synthetic benchmarks, this can lead to significant improvements over the TCP/IP loopback interface. To ensure compatibility with previous scripts, we now provide the following interfaces (a short sketch follows the list):

    • colocate_db (original): Will throw a DeprecationWarning but otherwise just wraps colocate_db_tcp
    • colocate_db_tcp: Listens for connections over the loopback
    • colocate_db_uds: Listens for connections over UDS
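    A minimal sketch of the colocation interfaces listed above; keyword arguments (port, ifname, unix_socket) are illustrative and may differ from the released Model API.

    ```python
    from smartsim import Experiment

    exp = Experiment("colocated", launcher="slurm")
    rs = exp.create_run_settings(exe="python", exe_args="simulation.py")
    model = exp.create_model("sim", rs)

    # TCP/IP over the loopback interface ...
    model.colocate_db_tcp(port=6780, ifname="lo")
    # ... or, alternatively, a Unix domain socket for lower-latency local traffic
    # model.colocate_db_uds(unix_socket="/tmp/smartsim.sock")

    exp.start(model)
    ```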
    opened by ashao 1
  • Model as Batch Job

    Model as Batch Job

    Allows users to launch individual models through an experiment as a batch job with batch settings.

    ```python
    exp = Experiment(...)
    rs = exp.create_run_settings(...)    # how the model is executed
    bs = exp.create_batch_settings(...)  # the batch allocation it runs in
    model = exp.create_model("MyModel", rs, batch_settings=bs)
    exp.start(model)  # launches the model as a batch job
    ```

    If a model with batch settings is added to a higher-order entity (e.g. an ensemble), the batch settings of the higher-order entity will be used and those of the model will be ignored, to avoid an "sbatch-in-sbatch" scenario or similar.

    opened by MattToast 2
  • Implement UDS support for co-located databases

    Implement UDS support for co-located databases

    Description

    Unix domain sockets allow for more direct communication between entities that exist on the same device. This feature was implemented for the SmartRedis client, but support for configuring the orchestrator and clients via SmartSim still needs to be added.

    Justification

    Users of the co-located deployment of the orchestrator will benefit from increased performance. Initial results show that the latency is significantly reduced.

    Implementation Strategy

    • [ ] Modify the colocated orchestrator method of Model to include an option for UDS
    • [ ] Ensure that the SSDB variable set during experiment launch points to the UDS address of the orchestrator
    area: orchestrator type: feature 
    opened by ashao 0
  • Check compatibility of GCC/Python versions for compiling SmartSim dependencies

    Check compatibility of GCC/Python versions for compiling SmartSim dependencies

    Description

    The documentation is currently fairly prescriptive with regard to GCC and Python versions, stating that SmartSim has build issues with GCC>=10 and only supports Python up to a certain version. The former has to do with error messages during compilation of RedisAI or one of its dependencies. However, anecdotally, @ashao has compiled with GCC 11 and 12 without apparent issue. This should be double-checked and the documentation updated to specify which versions of GCC are likely to work out of the box. For Python, the team recalls reasons why 3.10 would not work, but this should be revisited and documented.

    Justification

    Users who use newer versions of GCC will be less likely to avoid SmartSim due to apparent compatibility issues.

    Implementation Strategy

    • [ ] RedisAI 1.2.3, 1.2.5, 1.2.7: Test SmartSim builds with GPU on Horizon, which has GCC versions 10.3.0, 11.2.0, and 12.2.0, and on Python 3.10 and 3.11
    • [ ] Evaluate whether there is a quick workaround for errors with GCC 10
    • [ ] Update documentation
    type: feature 
    opened by ashao 0
  • Client logger

    Client logger

    Currently users do not have access to logging capabilities in SmartRedis. This means that users do not have a detailed understanding of client activity in order to debug complex workflows. Completion of this epic will enable logging in all four clients with varying levels of verbosity.

    Epic 
    opened by mellis13 0
Releases (v0.4.1)
  • v0.4.1(Jun 25, 2022)

    Released on June 24, 2022

    Description: This release of SmartSim introduces a new experimental feature to help make SmartSim workflows more portable: the ability to run simulation models in a container via Singularity. This feature has been tested on a small number of platforms, and we encourage users to provide feedback on its use.

    We have also made improvements in a variety of areas: new utilities to load scripts and machine learning models into the database directly from SmartSim driver scripts, and an install-time choice to use either KeyDB or Redis for the Orchestrator. The RunSettings API is now more consistent across subclasses. Another key focus of this release was to aid new SmartSim users by including more extensive tutorials and improving the documentation. The Docker image containing the SmartSim tutorials now also includes a tutorial on online training.

    Launcher improvements

    Documentation and tutorials

    General improvements and bug fixes

    Dependency updates

  • v0.4.0(Feb 12, 2022)

    Released on Feb 11, 2022

    Description: In this release SmartSim continues to promote ease of use. To this end SmartSim has introduced new portability features that allow users to abstract away their targeted hardware, while providing even more compatibility with existing libraries.

    A new feature, co-located Orchestrator deployment, has been added, which provides scalable online inference capabilities that overcome previous performance limitations of separated orchestrator/application deployments. For more information on the advantages of co-located deployments, see the Orchestrator section of the SmartSim documentation.

    The SmartSim build was significantly improved to increase customization of the build toolchain, and the smart command line interface was expanded.

    Additional tweaks and upgrades have also been made to ensure an optimal experience. Here is a comprehensive list of changes made in SmartSim 0.4.0.

    Orchestrator Enhancements:

    • Add Orchestrator Co-location (PR139)
    • Add Orchestrator configuration file edit methods (PR109)

    Emphasize Driver Script Portability:

    • Add ability to create run settings through an experiment (PR110)
    • Add ability to create batch settings through an experiment (PR112)
    • Add automatic launcher detection to experiment portability functions (PR120)

    Expand Machine Learning Library Support:

    • Data loaders for online training in Keras/TF and Pytorch (PR115)(PR140)
    • ML backend versions updated with expanded support for multiple versions (PR122)
    • Launch Ray internally using RunSettings (PR118)
    • Add Ray cluster setup and deployment to SmartSim (PR50)

    Expand Launcher Setting Options:

    • Add ability to use base RunSettings on Slurm, PBS, or Cobalt launchers (PR90)
    • Add ability to use base RunSettings on the LSF launcher (PR108)

    Deprecations and Breaking Changes

    • Orchestrator classes combined into single implementation for portability (PR139)
    • smartsim.constants changed to smartsim.status (PR122)
    • smartsim.tf migrated to smartsim.ml.tf (PR115)(PR140)
    • TOML configuration option removed in favor of environment variable approach (PR122)

    General Improvements and Bug Fixes:

    • Improve and extend parameter handling (PR107)(PR119)
    • Abstract away non-user facing implementation details (PR122)
    • Add various dimensions to the CI build matrix for SmartSim testing (PR130)
    • Add missing functions to LSFSettings API (PR113)
    • Add RedisAI checker for installed backends (PR137)
    • Remove heavy and unnecessary dependencies (PR116)(PR132)
    • Fix LSFLauncher and LSFOrchestrator(PR86)
    • Fix over greedy Workload Manager Parsers (PR95)
    • Fix Slurm handling of comma-separated env vars (PR104)
    • Fix internal method calls (PR138)

    Documentation Updates:

  • v0.3.2(Aug 12, 2021)

    Released on August 11, 2021

    Description:

    • Upgraded RedisAI backend to 1.2.3 (PR69)
    • PyTorch 1.7.1, TF 2.4.2, and ONNX 1.6-7 (PR69)
    • LSF launcher for IBM machines (PR62)
    • Improved code coverage by adding more unit tests (PR53)
    • Orchestrator methods to get address and check status (PR60)
    • Added Manifest object that tracks deployables in Experiments (PR61)
    • Bug fixes (PR52) (PR58) (PR67)
    • Updated documentation and examples (PR51) (PR57) (PR71)
    • Improved IP address acquisition (PR72)
    • Binding database to network interfaces (PR73)
    • Support for custom TF installation (i.e. Summit Power9) (PR76)
  • v0.3.1(May 8, 2021)

  • v0.3.0(Apr 2, 2021)
