Overview

TorchX

TorchX is a library containing standard DSLs for authoring and running PyTorch related components for an E2E production ML pipeline.

For the latest documentation, please refer to our website.

Requirements

TorchX SDK (torchx):

  • python3 (3.8+)
  • torch

TorchX Kubeflow Pipelines Support (torchx-kfp):

  • torchx
  • kfp

Installation

# install torchx sdk and CLI
pip install torchx

# install torchx kubeflow pipelines (kfp) support
pip install "torchx[kfp]"

Quickstart

See the quickstart guide.
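
To get a feel for the SDK before the quickstart, here is a minimal sketch, assuming the `torchx.specs` and `torchx.runner` APIs documented on the website (double-check signatures against your installed version), that defines a single-role app and launches it on the local scheduler:

# minimal sketch, assuming torchx.specs / torchx.runner as documented;
# verify names against your installed version
from torchx import specs
from torchx.runner import get_runner

app = specs.AppDef(
    name="hello_torchx",
    roles=[
        specs.Role(
            name="echo",
            image="/tmp",  # largely ignored by the local_cwd scheduler
            entrypoint="/bin/echo",
            args=["hello from torchx"],
            num_replicas=1,
        )
    ],
)

runner = get_runner()
app_handle = runner.run(app, scheduler="local_cwd", cfg={})
print(runner.status(app_handle))
runner.wait(app_handle)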

Contributing

We welcome PRs! See the CONTRIBUTING file.

License

TorchX is BSD licensed, as found in the LICENSE file.

Comments
  • cli: defer loading schedulers until used

    cli: defer loading schedulers until used

    This updates the cli, schedulers, and runners so that schedulers are only imported when they need to be used. This means only the relevant scheduler is loaded, which vastly improves responsiveness.

    • torchx --help: 900ms -> 300ms
    • torchx status local_cwd://: 1.38s -> 840ms

    Breaking changes:

    • schedulers.get_schedulers() has been removed in favor of schedulers.get_scheduler_factories(), since it's too dangerous (it forces every scheduler to be imported eagerly).
    • Runner.run_opts() now takes a specific scheduler name instead of returning all scheduler runopts, since returning them all would require loading every scheduler.

    get_schedulers is an internal interface (downstream users should be using the runner interface), so changing it shouldn't be an issue.

    Runner.run_opts() is part of the user-facing Runner interface, but it's only really practical for the CLI, so I doubt there's any OSS usage of it.
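
    Roughly, the deferral boils down to keeping a name-to-factory mapping and doing the heavy import inside the factory body. The sketch below is illustrative only; the module and function names are stand-ins, not the exact torchx internals:

    # illustrative sketch of lazy scheduler loading; names are stand-ins
    from typing import Callable, Dict


    def _kubernetes_factory(session_name: str):
        # the heavy import happens only when this scheduler is actually requested
        from torchx.schedulers import kubernetes_scheduler
        return kubernetes_scheduler.create_scheduler(session_name)


    def _local_cwd_factory(session_name: str):
        from torchx.schedulers import local_scheduler
        return local_scheduler.create_scheduler(session_name)


    SCHEDULER_FACTORIES: Dict[str, Callable] = {
        "kubernetes": _kubernetes_factory,
        "local_cwd": _local_cwd_factory,
    }


    def get_scheduler(name: str, session_name: str):
        # only the requested scheduler module is imported
        return SCHEDULER_FACTORIES[name](session_name)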

    Test plan:

    (torchx) [email protected] ~/D/torchx (deferload)> time torchx --help
    usage: torchx [-h] [--log_level LOG_LEVEL] [--version] {builtins,cancel,configure,describe,log,run,runopts,status} ...
    
    torchx CLI
    
    optional arguments:
      -h, --help            show this help message and exit
      --log_level LOG_LEVEL
                            Python logging log level
      --version             show program's version number and exit
    
    sub-commands:
      Use the following commands to run operations, e.g.: torchx run ${JOB_NAME}
    
      {builtins,cancel,configure,describe,log,run,runopts,status}
    
    ________________________________________________________
    Executed in  300.70 millis    fish           external
       usr time  280.99 millis  916.00 micros  280.07 millis
       sys time   16.46 millis    0.00 micros   16.46 millis
    
    (torchx) [email protected] ~/D/torchx (deferload)> time torchx status local_docker://torchx/sh-lsgcv92jm13ps
    torchx 2022-06-24 15:55:58 INFO     AppDef:
      State: SUCCEEDED
      Num Restarts: -1
    Roles:
     *sh[0]:SUCCEEDED
      sh[1]:SUCCEEDED
      sh[2]:SUCCEEDED
    
    ________________________________________________________
    Executed in  507.45 millis    fish           external
       usr time  472.78 millis  926.00 micros  471.86 millis
       sys time   19.77 millis    0.00 micros   19.77 millis
    
    CLA Signed 
    opened by d4l3k 17
  • Extend k8s integ tests, add general class that allows executing different components on local and k8s schedulers

    Extend k8s integ tests, add general class that allows executing different components on local and k8s schedulers

    Summary: Extend k8s integ tests, add general class that allows executing different components on local and k8s schedulers

    Differential Revision: D30980471

    CLA Signed fb-exported 
    opened by aivanou 17
  • API to list jobs by scheduler

    API to list jobs by scheduler

    Given a scheduler, torchx list -s <scheduler_name> lists the jobs scheduled on it. This is an MVP that supports listing jobs for the kubernetes scheduler, to address https://github.com/pytorch/torchx/issues/503. More features, such as listing only jobs started by torchx, listing active jobs, and filtering options, will be implemented in follow-up PRs.
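
    As a rough illustration of what listing boils down to for the kubernetes scheduler, here is a hedged sketch using the kubernetes Python client against Volcano's batch.volcano.sh CRDs; the handle format mirrors the test plan output below, and this is not the PR's exact code:

    # hedged sketch: list Volcano jobs and format them as torchx app handles
    from kubernetes import client, config


    def list_torchx_jobs(namespace: str = "default"):
        config.load_kube_config()
        api = client.CustomObjectsApi()
        resp = api.list_namespaced_custom_object(
            group="batch.volcano.sh",
            version="v1alpha1",
            namespace=namespace,
            plural="jobs",
        )
        # e.g. kubernetes://default/default:trainer-m3mxh6dcxgwtv
        # (the first "default" is the torchx session name)
        return [
            f"kubernetes://default/{namespace}:{item['metadata']['name']}"
            for item in resp.get("items", [])
        ]


    for handle in list_torchx_jobs():
        print(handle)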

    Test plan:

    (venv38) [email protected]:~/wp/torchx$ torchx list -s kubernetes
    kubernetes://default/default:cifar-trainer-a5qvfhe1hyb1b
    kubernetes://default/default:cifar-trainer-d796ei2tdf4bc
    kubernetes://default/default:cifar-trainer-em0iao2m9agog
    kubernetes://default/default:cifar-trainer-ew33oxmdg7t0c
    kubernetes://default/default:cifar-trainer-grjsnfxeinqbd
    kubernetes://default/default:cifar-trainer-p4bwlewvwmt1f
    kubernetes://default/default:cifar-trainer-pwy4c4omfufff
    kubernetes://default/default:cifar-trainer-qyan5cp6vz5od
    kubernetes://default/default:cifar-trainer-rbrd5krzkyz3c
    kubernetes://default/default:cifar-trainer-rw9kn5rtau6se
    kubernetes://default/default:cifar-trainer-tz45941r7u1b
    kubernetes://default/default:cifar-trainer-uycdlshurnf6c
    kubernetes://default/default:cifar-trainer-vstsr7qtecaif
    kubernetes://default/default:cifar-trainer-xnqu2mxdc2a5b
    .
    .
    .
    .
    kubernetes://default/default:train-zhqz9fhr7h0tsc
    kubernetes://default/default:trainer-bwhcp3vw2khftc
    kubernetes://default/default:trainer-m3mxh6dcxgwtv
    
    CLA Signed 
    opened by priyaramani 15
  • ray_scheduler: workspace + fixed no role logging

    ray_scheduler: workspace + fixed no role logging

    This updates Ray to have proper workspace support.

    • -c working_dir=... is deprecated in favor of torchx run --workspace=...
    • -c requirements=... is optional and requirements.txt will be automatically read from the workspace if present
    • torchx log ray://foo/bar works without requiring /ray/0
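
    The requirements.txt auto-detection mentioned above amounts to something like the following sketch; the parameter names and the workspace-as-local-directory assumption are illustrative:

    # sketch: fall back to the workspace's requirements.txt when -c requirements=... is absent
    import os
    from typing import Optional


    def resolve_requirements(workspace_dir: str, requirements: Optional[str]) -> Optional[str]:
        if requirements:  # explicit -c requirements=... wins
            return requirements
        candidate = os.path.join(workspace_dir, "requirements.txt")
        if os.path.exists(candidate):  # otherwise use the workspace copy
            return candidate
        return None  # nothing to install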

    Test plan:

    (torchx) [email protected] ~/D/t/e/ray (ray)> torchx run -s ray --wait --log dist.ddp --env LOGLEVEL=INFO -j 2x1 -m scripts.compute_world_size
    torchx 2022-05-18 16:55:31 INFO     Checking for changes in workspace `file:///home/tristanr/Developer/torchrec/examples/ray`...
    torchx 2022-05-18 16:55:31 INFO     To disable workspaces pass: --workspace="" from CLI or workspace=None programmatically.
    torchx 2022-05-18 16:55:31 INFO     Built new image `/tmp/torchx_workspacebe6331jv` based on original image `ghcr.io/pytorch/torchx:0.2.0dev0` and changes in workspace `file:///home/tristanr/Developer/torchrec/examples/ray` for role[0]=compute_world_size.
    torchx 2022-05-18 16:55:31 WARNING  The Ray scheduler does not support port mapping.
    torchx 2022-05-18 16:55:31 INFO     Uploading package gcs://_ray_pkg_63a39f7096dfa0bd.zip.
    torchx 2022-05-18 16:55:31 INFO     Creating a file package for local directory '/tmp/torchx_workspacebe6331jv'.
    ray://torchx/127.0.0.1:8265-compute_world_size-mpr03nzqvvg3td
    torchx 2022-05-18 16:55:31 INFO     Launched app: ray://torchx/127.0.0.1:8265-compute_world_size-mpr03nzqvvg3td
    torchx 2022-05-18 16:55:31 INFO     AppStatus:
      msg: PENDING
      num_restarts: -1
      roles:
      - replicas:
        - hostname: <NONE>
          id: 0
          role: ray
          state: !!python/object/apply:torchx.specs.api.AppState
          - 2
          structured_error_msg: <NONE>
        role: ray
      state: PENDING (2)
      structured_error_msg: <NONE>
      ui_url: null
    
    torchx 2022-05-18 16:55:31 INFO     Job URL: None
    torchx 2022-05-18 16:55:31 INFO     Waiting for the app to finish...
    torchx 2022-05-18 16:55:31 INFO     Waiting for app to start before logging...
    torchx 2022-05-18 16:55:43 INFO     Job finished: SUCCEEDED
    (torchx) [email protected] ~/D/t/e/ray (ray)> torchx log ray://torchx/127.0.0.1:8265-compute_world_size-mpr03nzqvvg3td
    ray/0 Waiting for placement group to start.
    ray/0 running ray.wait on [ObjectRef(8f2664c081ffc268e1c4275021ead9801a8d33861a00000001000000), ObjectRef(afe9f14f5a927c04b8e247b9daca5a9348ef61061a00000001000000)]
    ray/0 (CommandActor pid=494377) INFO:torch.distributed.launcher.api:Starting elastic_operator with launch configs:
    ray/0 (CommandActor pid=494377)   entrypoint       : scripts.compute_world_size
    ray/0 (CommandActor pid=494377)   min_nodes        : 2
    ray/0 (CommandActor pid=494377)   max_nodes        : 2
    ray/0 (CommandActor pid=494377)   nproc_per_node   : 1
    ray/0 (CommandActor pid=494377)   run_id           : compute_world_size-mpr03nzqvvg3td
    ray/0 (CommandActor pid=494377)   rdzv_backend     : c10d
    ray/0 (CommandActor pid=494377)   rdzv_endpoint    : localhost:29500
    ray/0 (CommandActor pid=494377)   rdzv_configs     : {'timeout': 900}
    ray/0 (CommandActor pid=494377)   max_restarts     : 0
    ray/0 (CommandActor pid=494377)   monitor_interval : 5
    ray/0 (CommandActor pid=494377)   log_dir          : None
    ray/0 (CommandActor pid=494377)   metrics_cfg      : {}
    ray/0 (CommandActor pid=494377)
    ray/0 (CommandActor pid=494377) INFO:torch.distributed.elastic.agent.server.local_elastic_agent:log directory set to: /tmp/torchelastic_vyq136c_/compute_world_size-mpr03nzqvvg3td_nu4r0f6t
    ray/0 (CommandActor pid=494377) INFO:torch.distributed.elastic.agent.server.api:[] starting workers for entrypoint: python
    ray/0 (CommandActor pid=494377) INFO:torch.distributed.elastic.agent.server.api:[] Rendezvous'ing worker group
    ray/0 (CommandActor pid=494406) INFO:torch.distributed.launcher.api:Starting elastic_operator with launch configs:
    ray/0 (CommandActor pid=494406)   entrypoint       : scripts.compute_world_size
    ray/0 (CommandActor pid=494406)   min_nodes        : 2
    ray/0 (CommandActor pid=494406)   max_nodes        : 2
    ray/0 (CommandActor pid=494406)   nproc_per_node   : 1
    ray/0 (CommandActor pid=494406)   run_id           : compute_world_size-mpr03nzqvvg3td
    ray/0 (CommandActor pid=494406)   rdzv_backend     : c10d
    ray/0 (CommandActor pid=494406)   rdzv_endpoint    : 172.26.20.254:29500
    ray/0 (CommandActor pid=494406)   rdzv_configs     : {'timeout': 900}
    ray/0 (CommandActor pid=494406)   max_restarts     : 0
    ray/0 (CommandActor pid=494406)   monitor_interval : 5
    ray/0 (CommandActor pid=494406)   log_dir          : None
    ray/0 (CommandActor pid=494406)   metrics_cfg      : {}
    ray/0 (CommandActor pid=494406)
    ray/0 (CommandActor pid=494406) INFO:torch.distributed.elastic.agent.server.local_elastic_agent:log directory set to: /tmp/torchelastic_t38mo11i/compute_world_size-mpr03nzqvvg3td_ehvp80_p
    ray/0 (CommandActor pid=494406) INFO:torch.distributed.elastic.agent.server.api:[] starting workers for entrypoint: python
    ray/0 (CommandActor pid=494406) INFO:torch.distributed.elastic.agent.server.api:[] Rendezvous'ing worker group
    ray/0 (CommandActor pid=494377) INFO:torch.distributed.elastic.agent.server.api:[] Rendezvous complete for workers. Result:
    ray/0 (CommandActor pid=494377)   restart_count=0
    ray/0 (CommandActor pid=494377)   master_addr=tristanr-arch2
    ray/0 (CommandActor pid=494377)   master_port=48089
    ray/0 (CommandActor pid=494377)   group_rank=1
    ray/0 (CommandActor pid=494377)   group_world_size=2
    ray/0 (CommandActor pid=494377)   local_ranks=[0]
    ray/0 (CommandActor pid=494377)   role_ranks=[1]
    ray/0 (CommandActor pid=494377)   global_ranks=[1]
    ray/0 (CommandActor pid=494377)   role_world_sizes=[2]
    ray/0 (CommandActor pid=494377)   global_world_sizes=[2]
    ray/0 (CommandActor pid=494377)
    ray/0 (CommandActor pid=494377) INFO:torch.distributed.elastic.agent.server.api:[] Starting worker group
    ray/0 (CommandActor pid=494377) INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_vyq136c_/compute_world_size-mpr03nzqvvg3td_nu4r0f6t/attempt_0/0/error.json
    ray/0 (CommandActor pid=494406) INFO:torch.distributed.elastic.agent.server.api:[] Rendezvous complete for workers. Result:
    ray/0 (CommandActor pid=494406)   restart_count=0
    ray/0 (CommandActor pid=494406)   master_addr=tristanr-arch2
    ray/0 (CommandActor pid=494406)   master_port=48089
    ray/0 (CommandActor pid=494406)   group_rank=0
    ray/0 (CommandActor pid=494406)   group_world_size=2
    ray/0 (CommandActor pid=494406)   local_ranks=[0]
    ray/0 (CommandActor pid=494406)   role_ranks=[0]
    ray/0 (CommandActor pid=494406)   global_ranks=[0]
    ray/0 (CommandActor pid=494406)   role_world_sizes=[2]
    ray/0 (CommandActor pid=494406)   global_world_sizes=[2]
    ray/0 (CommandActor pid=494406)
    ray/0 (CommandActor pid=494406) INFO:torch.distributed.elastic.agent.server.api:[] Starting worker group
    ray/0 (CommandActor pid=494406) INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_t38mo11i/compute_world_size-mpr03nzqvvg3td_ehvp80_p/attempt_0/0/error.json
    ray/0 (CommandActor pid=494377) INFO:torch.distributed.elastic.agent.server.api:[] worker group successfully finished. Waiting 300 seconds for other agents to finish.
    ray/0 (CommandActor pid=494377) INFO:torch.distributed.elastic.agent.server.api:Local worker group finished (SUCCEEDED). Waiting 300 seconds for other agents to finish
    ray/0 (CommandActor pid=494377) INFO:torch.distributed.elastic.agent.server.api:Done waiting for other agents. Elapsed: 0.000942230224609375 seconds
    ray/0 (CommandActor pid=494406) INFO:torch.distributed.elastic.agent.server.api:[] worker group successfully finished. Waiting 300 seconds for other agents to finish.
    ray/0 (CommandActor pid=494406) INFO:torch.distributed.elastic.agent.server.api:Local worker group finished (SUCCEEDED). Waiting 300 seconds for other agents to finish
    ray/0 (CommandActor pid=494406) INFO:torch.distributed.elastic.agent.server.api:Done waiting for other agents. Elapsed: 0.0013003349304199219 seconds
    ray/0 (CommandActor pid=494377) [0]:initializing `gloo` process group
    ray/0 (CommandActor pid=494377) [0]:successfully initialized process group
    ray/0 (CommandActor pid=494377) [0]:rank: 1, actual world_size: 2, computed world_size: 2
    ray/0 (CommandActor pid=494406) [0]:initializing `gloo` process group
    ray/0 (CommandActor pid=494406) [0]:successfully initialized process group
    ray/0 (CommandActor pid=494406) [0]:rank: 0, actual world_size: 2, computed world_size: 2
    ray/0 running ray.wait on [ObjectRef(afe9f14f5a927c04b8e247b9daca5a9348ef61061a00000001000000)]
    
    CLA Signed 
    opened by d4l3k 14
  • Investigate Perf Drop for fairseq gpu training on aws batch with EFA (p3dn)

    Investigate Perf Drop for fairseq gpu training on aws batch with EFA (p3dn)

    πŸ› Bug

    Not really a TorchX bug per se, but opening this issue to keep track of this investigation.

    There seems to be a perf drop when running the same fairseq trainer on AWS Batch on p3dn (w/ EFA) versus running on the same host baremetal (no Docker).

    Module (check all that applies):

    • [ ] torchx.spec
    • [ ] torchx.component
    • [ ] torchx.apps
    • [ ] torchx.runtime
    • [ ] torchx.cli
    • [ ] torchx.schedulers
    • [ ] torchx.pipelines
    • [ ] torchx.aws
    • [ ] torchx.examples
    • [x] other

    Below is a table summarizing the different runs:

    Common Params:

    1. Instance Type: p3dn.24xlarge
    2. Num Nodes: 2
    3. Nodes in same placement group: Yes
    4. Environment Variables: LOGLEVEL=INFO, NCCL_SOCKET_IFNAME=eth0, NCCL_ALGO=RING, NCCL_PROTO=simple, FI_PROVIDER=efa, FI_EFA_USE_DEVICE_RDMA=1
    5. In all cases RDMA is not enabled (it seems to not be available on p3dn, see: https://github.com/aws/aws-ofi-nccl/issues/104)

    | Scheduler | Docker | EFA picked up by NCCL | NCCL_ALGO | Performance | Log |
    |-----------|--------|-----------------------|-----------|-------------|-----|
    | BareMetal | No  | No  | Tree | baseline (~110k wpm) | gist-link |
    | BareMetal | Yes | Yes | Tree | WIP | |
    | AWS Batch | Yes | Yes | Ring | ~44k wpm | |
    | AWS Batch | Yes | Yes | Tree | ~70k wpm | |

    • Scheduler BareMetal: ssh onto the nodes and kick off the trainer manually
    • Docker Yes/No: whether the trainer runs in Docker (e.g. docker run) versus directly on the host (python -m torch.distributed.launch)

    Logs

    To Reproduce

    Steps to reproduce the behavior:

    Expected behavior

    Environment

    • torchx version (e.g. 0.1.0rc1): torchx-nightly
    • Python version: 3.8+
    • OS (e.g., Linux): Amazon Linux2
    • How you installed torchx (conda, pip, source, docker): pip
    • Docker image and tag (if using docker):
    • Git commit (if installed from source): N/A
    • Execution environment (on-prem, AWS, GCP, Azure etc): AWS Batch
    • Any other relevant information:

    Additional context

    bug aws_batch 
    opened by kiukchung 14
  • schedulers/aws_batch: add a scheduler to launch jobs directly on aws_batch

    schedulers/aws_batch: add a scheduler to launch jobs directly on aws_batch

    This adds a scheduler that allows for launching TorchX jobs directly on AWS Batch. This requires almost no infrastructure setup (just UI) and provides a fairly sane Docker job engine.
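
    Under the hood the scheduler drives the AWS Batch APIs. Below is a hedged boto3 sketch of the core calls; the job definition name, queue, and container properties are placeholders, and the real implementation registers richer, multi-node job definitions:

    # hedged boto3 sketch of an AWS Batch submission; values are placeholders
    import boto3

    batch = boto3.client("batch")

    job_def = batch.register_job_definition(
        jobDefinitionName="torchx-echo",  # placeholder name
        type="container",
        containerProperties={
            "image": "ghcr.io/pytorch/torchx:0.2.0dev0",
            "command": ["python", "-c", "print('hello from torchx')"],
            "resourceRequirements": [
                {"type": "VCPU", "value": "1"},
                {"type": "MEMORY", "value": "1024"},
            ],
        },
    )

    resp = batch.submit_job(
        jobName="torchx-echo-run",
        jobQueue="my-job-queue",  # placeholder queue
        jobDefinition=job_def["jobDefinitionArn"],
    )
    print(resp["jobId"])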

    Test plan:

    scripts/awsbatchint.sh
    pytest torchx/schedulers/test/aws_batch_scheduler_test.py
    pyre
    
    CLA Signed 
    opened by d4l3k 14
  • Ray scheduler driver and job api

    Ray scheduler driver and job api

    To run distributed pytorch test

    cd torch/torchx/schedulers/test
    python -m unittest ray_scheduler_test.py
    
    2021-11-24 00:53:22,810 INFO worker.py:840 -- Connecting to existing Ray cluster at address: 172.31.2.209:6379
    2021-11-24 00:53:23,127 INFO sdk.py:144 -- Uploading package gcs://_ray_pkg_e212ba1ff8e15e24.zip.
    2021-11-24 00:53:23,128 INFO packaging.py:352 -- Creating a file package for local directory '/tmp/tmp9copqxd2'.
    status: PENDING
    status: PENDING
    status: RUNNING
    status: RUNNING
    status: RUNNING
    status: RUNNING
    status: SUCCEEDED
    2021-11-24 00:53:25,521 INFO worker.py:840 -- Connecting to existing Ray cluster at address: 172.31.2.209:6379
    (CommandActor pid=3575) initializing `gloo` process group
    (CommandActor pid=3574) initializing `gloo` process group
    (CommandActor pid=3575) successfully initialized process group
    (CommandActor pid=3575) rank: 1, actual world_size: 2, computed world_size: 2
    (CommandActor pid=3574) successfully initialized process group
    (CommandActor pid=3574) rank: 0, actual world_size: 2, computed world_size: 2
    

    To run distributed pytorch test using torchX CLI

    # Setup cluster and get a HEAD NODE IP
    ray up -y ray_cluster.yaml
    pip install torchx[dev]
    
    # Get a job ID from deployed job
    torchx run -s ray -cfg dashboard_address=34.209.89.185:20002,working_dir=aivanou_test utils.binary --entrypoint ray_simple.py
    
    # Use job ID to get logs or job status
    torchx describe ray://torchx/34.209.89.185:20002-raysubmit_aKvezN3NyA2mqZeW
    torchx log ray://torchx/34.209.89.185:20002-raysubmit_aKvezN3NyA2mqZeW
    
    CLA Signed 
    opened by msaroufim 14
  • Move `examples` under `torchx` module

    Move `examples` under `torchx` module

    Summary: The diff moves examples under the torchx namespace, removes the examples Dockerfile, and makes the torchx image use dev-requirements.

    Differential Revision: D31464358

    fb-exported 
    opened by aivanou 14
  • github/kfp: run integration tests even for external users

    github/kfp: run integration tests even for external users

    This splits the kfp integration tests into two steps.

    1. without secrets on PR branch: builds the pipeline + containers and saves them as a GitHub artifact
    2. with secrets from master branch: loads the artifact and launches it on the KFP cluster

    Test plan:

    scripts/kfpint.py --path /tmp/foo --save
    scripts/kfpint.py --path /tmp/foo --load
    

    CI

    CLA Signed Merged 
    opened by d4l3k 14
  • add LSF scheduler

    add LSF scheduler

    I prototyped an LSF scheduler for torchx. At the moment it supports native, Docker, and Singularity runtimes with a shared filesystem. I confirmed it works with Gloo and NCCL on small VPC V100 clusters.

    Note: the torchx log command is available only when the torchx host shares the filesystem with cluster nodes (e.g., NFS).

    In a nutshell, the LSF scheduler translates a torchx request into LSF job submissions (i.e., bsub). For distributed apps, it creates multiple bsub submissions. I also added lsf to scripts/component_integration_tests.py. Here is the log output from my three-node LSF cluster; you can find dryrun results there.

    component_integration_tests.lsf.txt

    Regarding Singularity image compatibility, Singularity already automates converting Docker images into the Singularity image format, so all we have to do is generate singularity exec arguments from torchx requests. Note that users still need to prefix image names with docker:// if they want to use Docker images.
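
    To make the bsub translation concrete, here is a hedged sketch of building the submission for a single replica; the role fields and jobdir layout are illustrative, not the exact code in this PR:

    # hedged sketch: one torchx replica -> one bsub submission
    import shlex
    import subprocess


    def bsub_for_replica(job_name: str, jobdir: str, num_cpus: int, num_gpus: int, cmd: list):
        bsub = [
            "bsub",
            "-J", job_name,  # LSF job name
            "-n", str(num_cpus),  # slots
            "-o", f"{jobdir}/{job_name}.out",  # stdout on the shared filesystem
            "-e", f"{jobdir}/{job_name}.err",  # stderr on the shared filesystem
        ]
        if num_gpus > 0:
            bsub += ["-gpu", f"num={num_gpus}"]
        return bsub + cmd


    args = bsub_for_replica("echo-0", "/mnt/data/torchx", 1, 0, ["/bin/echo", "hello_world"])
    print(shlex.join(args))  # inspect the generated submission
    # subprocess.run(args, check=True)  # actually submit on an LSF host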

    The following are example commands.

    Example: native hello_world and CLI utils

    $ torchx run -s lsf -cfg jobdir=/mnt/data/torchx,runtime=native utils.echo --msg hello_world --num_replicas 3
    lsf://torchx/echo-pxc3gn5ct061k
    $ torchx list -s lsf
    $ torchx status lsf://torchx/echo-pxc3gn5ct061k
    $ torchx cancel lsf://torchx/echo-pxc3gn5ct061k
    $ torchx log --stream stdout lsf://torchx/echo-pxc3gn5ct061k/echo/0
    

    Example: Docker hello_world

    $ torchx run -s lsf -cfg jobdir=/mnt/data/torchx,runtime=docker utils.echo --image alpine:latest --msg hello_world --num_replicas 3
    

    Example: Singularity hello_world

    $ torchx run -s lsf -cfg jobdir=/mnt/data/torchx,runtime=singularity utils.echo --image docker://alpine:latest --msg hello_world --num_replicas 3
    

    Example: Docker Distributed

    $ cp scripts/dist_app.py /mnt/data/dist/
    $ torchx run -s lsf -cfg "jobdir=/mnt/data/torchx,runtime=docker,host_network=True" dist.ddp -j 2x2 --gpu 2 --script /data/dist_app.py --mount "type=bind,src=/mnt/data/dist,dst=/data"
    

    Example: Singularity Distributed

    $ cp scripts/dist_app.py /mnt/data/dist/
    $ torchx run -s lsf -cfg "jobdir=/mnt/data/torchx,runtime=singularity,host_network=True" dist.ddp --image docker://ghcr.io/pytorch/torchx:0.3.0dev0 -j 2x2 --gpu 2 --script /data/dist_app.py --mount "type=bind,src=/mnt/data/dist,dst=/data"
    
    CLA Signed 
    opened by takeshi-yoshimura 13
  • c10d communication failure on k8s cluster

    c10d communication failure on k8s cluster

    πŸ› Bug

    cross-referencing the issue posted here:

    https://github.com/pytorch/pytorch/issues/80772

    which is about a race condition between the rank 0 pod registering with DNS and the c10d service trying to look it up.
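
    Not part of the issue itself, but one common mitigation for this kind of DNS race is to wait for the rank-0 hostname to resolve before calling init_process_group. A hedged sketch, assuming the standard MASTER_ADDR environment variable:

    # hedged sketch (not from the issue): wait for the rank-0 service name
    # to resolve in DNS before initializing the process group
    import os
    import socket
    import time

    import torch.distributed as dist


    def wait_for_master(host: str, timeout: float = 300.0, interval: float = 2.0) -> None:
        deadline = time.monotonic() + timeout
        while True:
            try:
                socket.getaddrinfo(host, None)
                return  # name resolves, safe to rendezvous
            except socket.gaierror:
                if time.monotonic() > deadline:
                    raise
                time.sleep(interval)


    master_addr = os.environ.get("MASTER_ADDR", "localhost")
    wait_for_master(master_addr)
    dist.init_process_group("gloo")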

    To Reproduce

    Steps to reproduce the behavior:

    1. run the code with torchx run --scheduler kubernetes dist.ddp -j 8x1 --script cifar_dist.py
    2. see the pods scheduled and then erroring

    Code:

    import os
    import torch
    import torch.distributed as dist
    import torch.nn as nn
    import torch.optim as optim
    from torch.utils.data.distributed import DistributedSampler
    from torch.utils.data import DataLoader
    
    from torchvision.models import resnet18
    from torchvision.datasets import CIFAR10
    import torchvision.transforms as transforms
    
    from torch.nn.parallel import DistributedDataParallel as DDP
    import random
    import numpy as np
    from time import sleep
    
    model_dir = '/model'
    model_filename = 'mymodel'
    batch_size = 128
    
    
    def set_random_seeds(seed=0):
    
        torch.manual_seed(seed)
        # torch.backends.cudnn.deterministic = True
        # torch.backends.cudnn.benchmark = False
        np.random.seed(seed)
        random.seed(seed)
    
    
    def evaluate(model, device, test_loader):
        model.eval()
        correct = 0
        total = 0
        with torch.no_grad():
            for data in test_loader:
                images, labels = data[0].to(device), data[1].to(device)
                outputs = model(images)
                _, predicted = torch.max(outputs.data, 1)
                total += labels.size(0)
                correct += (predicted == labels).sum().item()
        accuracy = correct / total
        return accuracy
    
    
    dist.init_process_group("gloo")
    rank = dist.get_rank()
    world_size = dist.get_world_size()
    
    if torch.cuda.is_available():
        device = torch.cuda.device(f'cuda:{rank}')
    else:
        device = torch.device('cpu')
    
    
    print(f"Running basic DDP example on rank {rank} of {world_size}")
    
    if rank == 0:
        model_filepath = os.path.join(model_dir, model_filename)
        if not os.path.exists(model_dir):
            os.makedirs(model_dir)
    
    set_random_seeds()
    
    print('create model...')
    # create model
    model = resnet18(pretrained=False)
    if torch.cuda.is_available():
        torch.cuda.set_device(rank)
        device = torch.cuda.device(f'cuda:{rank}')
        ddp_model = DDP(model, device_ids=[rank], output_device=rank)
    else:
        device = torch.device('cpu')
        ddp_model = DDP(model)
    
    print('prepare dataset...')
    # Prepare dataset and dataloader
    transform = transforms.Compose([
        transforms.RandomCrop(32, padding=4),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010)),
    ])
    
    train_set = CIFAR10(root="data", train=True, download=True, transform=transform) 
    test_set = CIFAR10(root="data", train=False, download=True, transform=transform)
    
    train_sampler = DistributedSampler(dataset=train_set)
    
    train_loader = DataLoader(dataset=train_set, batch_size=batch_size, sampler=train_sampler, num_workers=0)
    # Test loader does not have to follow distributed sampling strategy
    test_loader = DataLoader(dataset=test_set, batch_size=128, shuffle=False, num_workers=0)
    
    print('setup model...')
    
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.SGD(ddp_model.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-5)
    
    for epoch in range(1000):
        print(f'Epoch {epoch}')
        if epoch % 10 == 0:
            if rank == 0:
                accuracy = evaluate(model=ddp_model, device=device, test_loader=test_loader)
                torch.save(ddp_model.state_dict(), model_filepath)
                print("-" * 75)
                print(f"Epoch: {epoch}, Accuracy: {accuracy}")
                print("-" * 75)
            else:
                print("-" * 75)
                print(f"Epoch: {epoch}")
                print("-" * 75)
    
        ddp_model.train()
    
        i = 0
        for data in train_loader:
            print(i)
            i += 1
            inputs, labels = data[0].to(device), data[1].to(device)
            optimizer.zero_grad()
            outputs = ddp_model(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()
        
    

    Expected behavior

    I expect pods to connect to c10d properly and run the code.

    Environment

    Collecting environment information...
    PyTorch version: 1.11.0
    Is debug build: False
    CUDA used to build PyTorch: 11.3
    ROCM used to build PyTorch: N/A
    
    OS: Ubuntu 18.04.6 LTS (x86_64)
    GCC version: Could not collect
    Clang version: Could not collect
    CMake version: Could not collect
    Libc version: glibc-2.27
    
    Python version: 3.8.12 (default, Oct 12 2021, 13:49:34)  [GCC 7.5.0] (64-bit runtime)
    Python platform: Linux-5.4.17-2136.306.1.3.el8uek.x86_64-x86_64-with-glibc2.17
    Is CUDA available: False
    CUDA runtime version: No CUDA
    GPU models and configuration: No CUDA
    Nvidia driver version: No CUDA
    cuDNN version: No CUDA
    HIP runtime version: N/A
    MIOpen runtime version: N/A
    Is XNNPACK available: True
    
    Versions of relevant libraries:
    [pip3] botorch==0.6.0
    [pip3] gpytorch==1.6.0
    [pip3] mypy-extensions==0.4.3
    [pip3] numpy==1.21.2
    [pip3] pytorch-lightning==1.5.10
    [pip3] torch==1.11.0
    [pip3] torch-model-archiver==0.6.0
    [pip3] torchelastic==0.2.2
    [pip3] torchmetrics==0.9.1
    [pip3] torchserve==0.6.0
    [pip3] torchtext==0.12.0
    [pip3] torchvision==0.12.0
    [pip3] torchx==0.2.0
    [conda] blas                      1.0                         mkl  
    [conda] botorch                   0.6.0                    pypi_0    pypi
    [conda] cudatoolkit               11.3.1               ha36c431_9    nvidia
    [conda] ffmpeg                    4.3                  hf484d3e_0    pytorch
    [conda] gpytorch                  1.6.0                    pypi_0    pypi
    [conda] mkl                       2021.4.0           h06a4308_640  
    [conda] mkl-service               2.4.0            py38h7f8727e_0  
    [conda] mkl_fft                   1.3.1            py38hd3c417c_0  
    [conda] mkl_random                1.2.2            py38h51133e4_0  
    [conda] numpy                     1.21.2           py38h20f2e39_0  
    [conda] numpy-base                1.21.2           py38h79a1101_0  
    [conda] pytorch                   1.11.0          py3.8_cuda11.3_cudnn8.2.0_0    pytorch
    [conda] pytorch-lightning         1.5.10                   pypi_0    pypi
    [conda] pytorch-mutex             1.0                        cuda    pytorch
    [conda] torch-model-archiver      0.6.0                    pypi_0    pypi
    [conda] torchelastic              0.2.2                    pypi_0    pypi
    [conda] torchmetrics              0.9.1                    pypi_0    pypi
    [conda] torchserve                0.6.0                    pypi_0    pypi
    [conda] torchtext                 0.12.0                     py38    pytorch
    [conda] torchvision               0.12.0               py38_cu113    pytorch
    [conda] torchx                    0.2.0                    pypi_0    pypi
    
    • Execution environment: Oracle Cloud OKE cluster k8s version 1.23.4
    • dns: CoreDNS-1.8.6 linux/amd64, go1.16.7 BoringCrypto, 69a006c9f1a
    opened by streamnsight 13
  • Find default namespace from kube_config.

    Find default namespace from kube_config.

    For the kubernetes scheduler, find the default namespace from kube_config instead of the hard-coded "default" value.

    Right now the default namespace for the list method is hard-coded as "default", which does not fit multi-user scenarios on a kubernetes cluster. The default namespace can instead be read from the current context of the kube configuration.
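
    For illustration, a hedged sketch of reading the active context's namespace with the kubernetes Python client (not necessarily this PR's exact code):

    # hedged sketch: namespace from the active kube-config context, else "default"
    from kubernetes import config


    def current_namespace() -> str:
        try:
            _, active_context = config.list_kube_config_contexts()
            return active_context["context"].get("namespace", "default")
        except config.ConfigException:
            return "default"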

    Test plan:

    Tested manually with a generated kube config.

    CLA Signed 
    opened by liuwenchao 3
  • (torchx/aws_batch) Enable Region Selection in TorchConfig

    (torchx/aws_batch) Enable Region Selection in TorchConfig

    QOL improvements for AWS region support.

    Current behavior: use the default region specified in the .aws config.

    Proposed behavior: allow a manual override in .torchxconfig. If there is no override and no default, throw an error.

    Future Improvements: Add profile support. Add fargate support.
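
    A hedged sketch of the proposed lookup order; the region key under [aws_batch] in .torchxconfig is an assumption of this proposal, not an existing option:

    # hedged sketch of the proposed lookup order; the "region" key is hypothetical
    import configparser

    import boto3


    def resolve_region(torchxconfig_path: str = ".torchxconfig") -> str:
        cfg = configparser.ConfigParser()
        cfg.read(torchxconfig_path)
        region = cfg.get("aws_batch", "region", fallback=None)  # manual override
        if region is None:
            region = boto3.session.Session().region_name  # .aws default
        if region is None:
            raise ValueError("no AWS region configured in .torchxconfig or the AWS config")
        return region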

    CLA Signed 
    opened by ashvinnihalani 3
  • Kubernetes: support default scheduler instead of volcano for autoscaling

    Kubernetes: support default scheduler instead of volcano for autoscaling

    Description

    Right now Volcano doesn't handle autoscaling correctly since it won't schedule jobs if there aren't enough resources. For single-node jobs it would be nice to support the default (non-Volcano) K8s scheduler so the Pods will be created and can trigger cluster autoscaling.

    Detailed Proposal

    Add a `scheduler` setting to the kubernetes scheduler that allows setting schedulerName to "default-scheduler" to use the standard scheduling logic.

    https://volcano.sh/en/docs/vcjob/#schedulername

    There's some prototype work for TPU nodes in https://github.com/pytorch/torchx/commit/66e934c59b376a8c0b9aa6523fde34e81a673eb2 that required the same default-scheduler logic.
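
    A hedged sketch of what the proposed setting would control on the generated pod template; the function and option names here are illustrative:

    # hedged sketch: expose schedulerName on the generated pod spec
    from kubernetes import client


    def pod_spec(containers, scheduler_name: str = "volcano") -> client.V1PodSpec:
        # "default-scheduler" bypasses Volcano so pending single-node pods
        # are visible to the cluster autoscaler
        return client.V1PodSpec(
            containers=containers,
            restart_policy="Never",
            scheduler_name=scheduler_name,
        )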

    Alternatives

    Additional context/links

    opened by d4l3k 0
  • [torchx/schedulers - aws batch] Only display jobs with torchx tag when listing

    [torchx/schedulers - aws batch] Only display jobs with torchx tag when listing

    Description

    Add a tag filter on the existence of the tag key torchx.pytorch.org/version when listing jobs.

    Motivation/Background

    Currently, when the Batch compute environment has jobs that were not launched via torchx and we list jobs via:

    torchx list -s aws_batch
    

    It lists ALL the jobs (not just the ones that were submitted via torchx). Since we already tag the Batch jobs launched with torchx with the tag key torchx.pytorch.org/version (see screenshot below), we should only list the jobs that have this tag.

    (screenshot of the tagged Batch jobs omitted)

    Furthermore, we can also:

    1. Add a unix-username tag so that we only display the jobs relevant to the user
    2. Add a --filter option to either show all (*) jobs or filter by user or job queue or other useful criteria

    Detailed Proposal

    See motivation above.
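
    A hedged sketch of the filter; AWS Batch's ListJobs cannot filter on tags directly, so this describes jobs in batches of 100 and keeps the tagged ones:

    # hedged sketch of the proposed tag filter
    import boto3

    TORCHX_TAG = "torchx.pytorch.org/version"


    def list_torchx_jobs(job_queue: str, status: str = "RUNNING"):
        batch = boto3.client("batch")
        job_ids = []
        for page in batch.get_paginator("list_jobs").paginate(jobQueue=job_queue, jobStatus=status):
            job_ids += [j["jobId"] for j in page["jobSummaryList"]]

        tagged = []
        for i in range(0, len(job_ids), 100):  # describe_jobs accepts at most 100 ids
            resp = batch.describe_jobs(jobs=job_ids[i : i + 100])
            tagged += [j for j in resp["jobs"] if TORCHX_TAG in j.get("tags", {})]
        return tagged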

    Alternatives

    N/A

    Additional context/links

    N/A

    opened by kiukchung 0
  • [RFC][torchx/schedulers - aws batch] support wildcard job queue selection

    [RFC][torchx/schedulers - aws batch] support wildcard job queue selection

    Description

    For the AWS Batch scheduler integration in torchx, support wildcards in the job_queue runopt with a simple (or configurable or extendable) queue selection logic. For instance, assume that my organization uses a job queue naming convention of the form

    ${TEAM_NAME}-${CLUSTER_TYPE}-${REGION}
    
    Example:
    
    pt_r2p-gpu_cluster-us-east-1a
    pt_r2p-gpu_cluster-us-east-1b
    pt_r2p-gpu_cluster-us-east-1c
    pt_r2p-cpu_cluster-us-east-1a
    pt_r2p-cpu_cluster-us-east-1b
    pt_r2p-trainium_cluster-us-east-1a
    pt_r2p-trainium_cluster-us-east-1c
    
    pt_core-gpu_cluster-us-east-1a
    pt_core-cpu_cluster-us-east-1a
    

    If I'm in the pt_r2p team and want to submit a job to any gpu compute environment that has free capacity regardless of region, then I can use a wildcard on the ${REGION} portion of the job queue name as:

    [aws_batch]
    job_queue=pt_r2p-gpu_cluster-*
    

    Motivation/Background

    Ideally, with AWS Batch we create a single job queue (JQ) connected to multiple compute environments (CE) and always submit to the same JQ, letting Batch figure out which CE the job needs to be submitted to. With Fair Share (FS) scheduling announced a year ago (see announcement) this is theoretically possible. However, many users of AWS Batch are still using the FIFO (originally supported) scheduling policy, in which case having a single JQ is impractical in a multi-team scenario, since users from other teams may affect the scheduling overhead of my team. This starts escalating pretty quickly in cases where teams BYO (bring your own) capacity.

    Detailed Proposal

    Support wild-cards for job_queue names for the aws_batch scheduler with the following MVP queue selection algorithm (greedy):

    1. Find all job queues that match the wild-card expression
    2. For each job queue pull the details of the CE that it is hooked up to
    3. Filter the CEs down to the ones that actually support the host type + quantity that the job needs
    4. For each filtered CE look at the free resources and rank them by most-free -> least-free
    5. Pick the job queue that has the most CEs with the highest rank

    This algorithm effectively chooses the JQ that will yield the least wait time in the queue for the job.

    To actually implement the greedy algorithm above, I suggest that we add chain-able selection algorithms. For instance the algorithm above can be expressed as a chain of primitives:

    jqs = get_matching("pt_r2p-gpu_cluster-*")
    jqs = filter_resource(jqs, role.resource)
    jqs = order_by(jqs, Ordering.FREE_CAPACITY, desc=True)
    
    jqs[0] # <-- select this one to run
    

    Similarly a "first-match" algorithm can be implemented as:

    get_matching("pt_r2p-gpu_cluster-*")[0]
    

    We can follow torchdata's datapipe interface such that each function in the chain has the signature:

    from typing import List

    def fn(jqs: List[JobQueue], *args, **kwargs) -> List[JobQueue]:
        """
        Returns a sorted/filtered/manipulated list of job queues to pass to the next chained fn.
        """
        pass
    

    Alternatives

    1. Resources/guidelines/scripts to migrate from FIFO queues to Fair-Share.

    Additional context/links

    N/A

    opened by kiukchung 1
Releases (v0.4.0)
  • v0.4.0(Dec 30, 2022)

  • v0.3.0(Oct 27, 2022)

  • v0.2.0(Jun 15, 2022)

  • v0.1.2(Mar 29, 2022)

    Full Changelog: https://github.com/pytorch/torchx/compare/v0.1.1...v0.1.2

    Milestone: https://github.com/pytorch/torchx/milestones/3

    • PyTorch 1.11 Support
    • Python 3.10 Support
    • torchx.workspace
      • TorchX now supports a concept of workspaces. This enables seamless launching of jobs using changes present in your local workspace. For Docker-based schedulers, we automatically build a new Docker container on job launch, making it easier than ever to run experiments. #333
    • torchx.schedulers
      • Ray #329
        • Newly added Ray scheduler makes it easy to launch jobs on Ray.
        • https://pytorch.medium.com/large-scale-distributed-training-with-torchx-and-ray-1d09a329aacb
      • AWS Batch #381
        • Newly added AWS Batch scheduler makes it easy to launch jobs in AWS with minimal infrastructure setup.
      • Slurm
        • Slurm jobs will by default launch in the current working directory to match local_cwd and workspace behavior. #372
        • Replicas now have their own log files and can be accessed programmatically. #373
        • Support for comment, mail-user and constraint fields. #391
        • Workspace support (prototype) - Slurm jobs can now be launched in isolated experiment directories. #416
      • Kubernetes
        • Support for running jobs under service accounts. #408
        • Support for specifying instance types. #433
      • All Docker-based Schedulers (Kubernetes, Batch, Docker)
        • Added bind mount and volume supports #420, #426
        • Bug fix: Better shm support for large dataloader #429
        • Support for .dockerignore and custom Dockerfiles #401
      • Local Scheduler
        • Automatically set CUDA_VISIBLE_DEVICES #383
        • Improved log ordering #366
    • torchx.components
      • dist.ddp
        • Rendezvous works out of the box on all schedulers #400
        • Logs are now prefixed with local ranks #412
        • Can specify resources via the CLI #395
        • Can specify environment variables via the CLI #399
      • HPO
        • Ax runner now lives in the Ax repo https://github.com/facebook/Ax/commit/8e2e68f21155e918996bda0b7d97b5b9ef4e0cba
    • torchx.cli
      • .torchxconfig
        • You can now specify component argument defaults in .torchxconfig https://github.com/pytorch/torchx/commit/c37cfd7846d5a0cb527dd19c8c95e881858f8f0a
        • ~/.torchxconfig can now be used to set user level defaults. #378
        • --workspace can be configured #397
      • Color change and bug fixes #419
    • torchx.runner
      • Now supports workspace interfaces. #360
      • Returned lines now preserve whitespace to provide support for progress bars #425
      • Events are now logged to torch.monitor when available. #379
    • torchx.notebook (prototype)
      • Added new workspace interface for developing models and launching jobs via a Jupyter Notebook. #356
    • Docs
      • Improvements to clarify TorchX usage w/ workspaces and general cleanups.
      • #374, #402, #404, #407, #434
    Source code(tar.gz)
    Source code(zip)
  • v0.1.1(Nov 18, 2021)

  • v0.1.0(Oct 21, 2021)

  • v0.1.0rc1(Oct 18, 2021)

  • v0.1.0rc0(Oct 5, 2021)

  • v0.1.0b0(Jun 29, 2021)

  • v0.1.0.dev2(Jun 24, 2021)

  • v0.1.0.dev1(Jun 17, 2021)

  • v0.1.0.dev0(Jun 16, 2021)
