A Lightweight Cluster/Cloud VM Job Management Tool 🚀

Last update: Dec 12, 2022

Overview

Lightweight Cluster/Cloud VM Job Management 🚀

Are you looking for a tool to manage your training runs locally, on Slurm/Open Grid Engine clusters, SSH servers or Google Cloud Platform VMs? mle-scheduler provides a lightweight API to launch and monitor job queues. It smoothly orchestrates simultaneous runs for different configurations and/or random seeds. It is meant to reduce boilerplate and to make job resource specification intuitive. It comes with two core pillars:

MLEJob: Launches and monitors a single job on a resource (Slurm, Open Grid Engine, GCP, SSH, etc.).
MLEQueue: Launches and monitors a queue of jobs with different training configurations and/or seeds.

For a quickstart, check out the notebook blog or the example scripts 📖

Supported resources: Local, Slurm, Grid Engine, SSH, GCP

Installation ⏳

pip install mle-scheduler

Managing a Single Job with `MLEJob` Locally 🚀

from mle_scheduler import MLEJob

# python train.py -config base_config_1.yaml -exp_dir logs_single -seed_id 1
job = MLEJob(
    resource_to_run="local",
    job_filename="train.py",
    config_filename="base_config_1.yaml",
    experiment_dir="logs_single",
    seed_id=1
)

_ = job.run()
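
The commented command above suggests that the launched script receives `-config`, `-exp_dir` and `-seed_id` flags. Below is a minimal sketch of how a `train.py` could read them, assuming plain `argparse`; the function names, defaults and the print statement are illustrative assumptions and not part of mle-scheduler.

```python
import argparse


def parse_args(argv=None):
    """Parse the three flags shown in the commented command above."""
    parser = argparse.ArgumentParser(description="Hypothetical training script")
    parser.add_argument("-config", type=str, required=True, help="Path to config file")
    parser.add_argument("-exp_dir", type=str, default="experiments", help="Log directory")
    parser.add_argument("-seed_id", type=int, default=0, help="Random seed")
    return parser.parse_args(argv)


def main(argv=None):
    args = parse_args(argv)
    # Placeholder for the actual training loop (assumption)
    print(f"config={args.config} exp_dir={args.exp_dir} seed_id={args.seed_id}")
    return args


# Simulate the call made for the single job above
args = main(["-config", "base_config_1.yaml", "-exp_dir", "logs_single", "-seed_id", "1"])
```

In a real script you would call `main()` without arguments under the usual `if __name__ == "__main__":` guard so that the flags are read from the command line.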

Managing a Queue of Jobs with `MLEQueue` Locally 🚀 ... 🚀

from mle_scheduler import MLEQueue

# python train.py -config base_config_1.yaml -seed 0 -exp_dir logs_queue/<...>_base_config_1
# python train.py -config base_config_1.yaml -seed 1 -exp_dir logs_queue/<...>_base_config_1
# python train.py -config base_config_2.yaml -seed 0 -exp_dir logs_queue/<...>_base_config_2
# python train.py -config base_config_2.yaml -seed 1 -exp_dir logs_queue/<...>_base_config_2
queue = MLEQueue(
    resource_to_run="local",
    job_filename="train.py",
    config_filenames=["base_config_1.yaml",
                      "base_config_2.yaml"],
    random_seeds=[0, 1],
    experiment_dir="logs_queue"
)

queue.run()
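
The queue above covers every combination of config file and random seed, i.e. 2 configs x 2 seeds = 4 jobs. The following sketch only illustrates that expansion with `itertools.product`; `expand_jobs` is a hypothetical helper, and this is not how `MLEQueue` is necessarily implemented internally.

```python
from itertools import product

config_filenames = ["base_config_1.yaml", "base_config_2.yaml"]
random_seeds = [0, 1]


def expand_jobs(configs, seeds):
    """Return one (config, seed) pair per job in the grid."""
    return list(product(configs, seeds))


jobs = expand_jobs(config_filenames, random_seeds)
for config, seed in jobs:
    # Mirrors the commented commands listed above the queue definition
    print(f"python train.py -config {config} -seed {seed}")
```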

Launching Slurm Cluster-Based Jobs 🐒

# Each job requests 5 CPU cores & 1 V100S GPU & loads CUDA 10.0
job_args = {
    "partition": "<SLURM_PARTITION>",  # Partition to schedule jobs on
    "env_name": "mle-toolbox",  # Env to activate at job start-up
    "use_conda_venv": True,  # Whether to use anaconda venv
    "num_logical_cores": 5,  # Number of requested CPU cores per job
    "num_gpus": 1,  # Number of requested GPUs per job
    "gpu_type": "V100S",  # GPU model requested for each job
    "modules_to_load": "nvidia/cuda/10.0"  # Modules to load at start-up
}

queue = MLEQueue(
    resource_to_run="slurm-cluster",
    job_filename="train.py",
    job_arguments=job_args,
    config_filenames=["base_config_1.yaml",
                      "base_config_2.yaml"],
    experiment_dir="logs_slurm",
    random_seeds=[0, 1]
)
queue.run()

Launching GridEngine Cluster-Based Jobs 🐘

# Each job requests 5 CPU cores & 1 V100S GPU w. CUDA 10.0 loaded
job_args = {
    "queue": "<...>",  # Queue to schedule jobs on
    "env_name": "mle-toolbox",  # Env to activate at job start-up
    "use_conda_venv": True,  # Whether to use anaconda venv
    "num_logical_cores": 5,  # Number of requested CPU cores per job
    "num_gpus": 1,  # Number of requested GPUs per job
    "gpu_type": "V100S",  # GPU model requested for each job
    "gpu_prefix": "cuda"  #$ -l {gpu_prefix}="{num_gpus}"
}

queue = MLEQueue(
    resource_to_run="sge-cluster",
    job_filename="train.py",
    job_arguments=job_args,
    config_filenames=["base_config_1.yaml",
                      "base_config_2.yaml"],
    experiment_dir="logs_grid_engine",
    random_seeds=[0, 1]
)
queue.run()

Launching SSH Server-Based Jobs 🦊

ssh_settings = {
    "user_name": "<...>",  # SSH server user name
    "pkey_path": "<...>",  # Private key path (e.g. ~/.ssh/id_rsa)
    "main_server": "<...>",  # SSH Server address
    "jump_server": '',  # Jump host address
    "ssh_port": 22,  # SSH port
    "remote_dir": "mle-code-dir",  # Dir to sync code to on server
    "start_up_copy_dir": True,  # Whether to copy code to server
    "clean_up_remote_dir": True  # Whether to delete remote_dir on exit
}

job_args = {
    "env_name": "mle-toolbox",  # Env to activate at job start-up
    "use_conda_venv": True  # Whether to use anaconda venv
}

queue = MLEQueue(
    resource_to_run="ssh-node",
    job_filename="train.py",
    config_filenames=["base_config_1.yaml",
                      "base_config_2.yaml"],
    random_seeds=[0, 1],
    experiment_dir="logs_ssh_queue",
    job_arguments=job_args,
    ssh_settings=ssh_settings)

queue.run()

Launching GCP VM-Based Jobs 🦄

cloud_settings = {
    "project_name": "<...>",  # Name of your GCP project
    "bucket_name": "<...>",  # Name of your GCS bucket
    "remote_dir": "<...>",  # Name of code dir in bucket
    "start_up_copy_dir": True,  # Whether to copy code to bucket
    "clean_up_remote_dir": True  # Whether to delete remote_dir on exit
}

job_args = {
    "num_gpus": 0,  # Number of requested GPUs per job
    "gpu_type": None,  # GPU requested e.g. "nvidia-tesla-v100"
    "num_logical_cores": 1,  # Number of requested CPU cores per job
}

queue = MLEQueue(
    resource_to_run="gcp-cloud",
    job_filename="train.py",
    config_filenames=["base_config_1.yaml",
                      "base_config_2.yaml"],
    random_seeds=[0, 1],
    experiment_dir="logs_gcp_queue",
    job_arguments=job_args,
    cloud_settings=cloud_settings,
)
queue.run()

Development & Milestones for Next Release

You can run the test suite via `python -m pytest -vv tests/`. If you find a bug or are missing your favourite feature, feel free to contact me @RobertTLange or create an issue 🤗. In future releases I plan on implementing the following:

Clean up TPU GCP VM & JAX dependencies case
Add local launching of cluster jobs via SSH to headnode
Add Docker/Singularity container setup support
Add Azure support
Add AWS support

Comments

use sys.executable instead of 'python'

In some systems (like mine, when I run locally on conda), the Python executable is not "python". I used here a global variable - not sure if that's the best way, but it allows for cases where we don't want the executable to be the same as sys.executable (e.g. if we want to execute the job on a different python interpreter than the one we are using).
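
A rough sketch of the idea described in this comment, assuming a module-level default that can be overridden per call; `PYTHON_EXECUTABLE` and `run_script` are hypothetical names and do not claim to match the actual patch.

```python
import subprocess
import sys

# Module-level default: the interpreter running this process.
# Override it to launch jobs with a different interpreter (hypothetical name).
PYTHON_EXECUTABLE = sys.executable


def run_script(args, python_executable=None):
    """Run the chosen interpreter with `args` and return its stdout."""
    exe = python_executable or PYTHON_EXECUTABLE
    result = subprocess.run([exe, *args], capture_output=True, text=True, check=True)
    return result.stdout


output = run_script(["-c", "print('hello from job')"])
```

Using `sys.executable` avoids relying on a `python` entry being on the `PATH`, which is the failure mode the comment describes.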

opened by boazbk 4
Handle case when experiment_dir is not provided

At the moment if "experiment_dir" is None, then cmd_line_args is not initialized, and hence future lines like cmd_line_args += " -config " + self.config_filename will fail.

The proposed change just initializes cmd_line_args to the empty string, and then adds all options to it later.
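
The fix described here can be sketched as follows. `build_cmd_line_args` is a hypothetical stand-alone helper written for illustration; the flag names are taken from the commented commands in the README, and the actual method in the library may be structured differently.

```python
def build_cmd_line_args(config_filename=None, experiment_dir=None, seed_id=None):
    """Assemble the argument string, skipping any option that is None."""
    cmd_line_args = ""  # always initialized, so later += never fails
    if experiment_dir is not None:
        cmd_line_args += " -exp_dir " + experiment_dir
    if config_filename is not None:
        cmd_line_args += " -config " + config_filename
    if seed_id is not None:
        cmd_line_args += " -seed_id " + str(seed_id)
    return cmd_line_args


# Works even though no experiment_dir is given
only_config = build_cmd_line_args(config_filename="base_config_1.yaml")
```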

opened by boazbk 2

[Feature] Make `meta_log` accessible from queue

Instead of having to ...

# Merge logs of random seeds & configs -> load & get final scores
queue.merge_configs(merge_seeds=True)
meta_log = load_meta_log("logs_search/meta_log.hdf5")
test_scores = [meta_log[r].stats.test_loss.mean[-1] for r in queue.mle_run_ids]

it would be great to do the load_meta_log already within the MLEQueue if merge_configs is called.

opened by RobertTLange 1

Handling Errors thrown in GCP VMs

Complete newbie to using VMs, so I'm guessing this will be a rookie question.

If an error is encountered when executing a job on a GCP VM, what are the best practices for handling them? I'm not even sure how to know if there was an error, which obviously complicates the debugging process.

opened by wbrenton 0
Cmd capture
Adds MLEQueue option to delete config after job has finished

Adds debug_mode option to store stdout & stderr to files - partially addresses #3

Adds merging/loading of generated logs in MLEQueue w. automerge_configs option

Use system executable python version
opened by RobertTLange 0
What environment does it depend on?

It's great that you have built such a good tool for job scheduling. I want to know what environment it depends on, and whether it can run in a Kubernetes/Docker environment. Thanks!

opened by kongjibai 0

Releases(v0.0.5)

v0.0.5(Jan 5, 2022)
Adds MLEQueue option to delete config after job has finished (delete_config)

Adds debug_mode option to store stdout & stderr to files

Adds merging/loading of generated logs in MLEQueue w. automerge_configs option

Use system executable python version

v0.0.4(Dec 7, 2021)
[x] Track config base strings for auto-merging of mle-logs & add merge_configs

[x] Allow scheduling on multiple partitions via -p <part1>,<part2> & queues via -q <queue1>,<queue2>
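
As an illustration of the comma-separated form mentioned above, the sketch below joins several partition names into the value passed to `-p`. `format_partitions` is a hypothetical helper, and how the names are actually supplied through `job_arguments` is not shown in this README.

```python
def format_partitions(partitions):
    """Join one or more partition names into a comma-separated string."""
    if isinstance(partitions, str):
        partitions = [partitions]
    return ",".join(p.strip() for p in partitions)


# Header line as it could appear in a generated Slurm submission script
sbatch_line = f"#SBATCH -p {format_partitions(['part1', 'part2'])}"
print(sbatch_line)
```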

v0.0.3(Nov 12, 2021)

v0.0.2(Nov 12, 2021)


v0.0.1(Nov 12, 2021)

First release 🤗 implementing core API of MLEJob and MLEQueue

# Each job requests 5 CPU cores & 1 V100S GPU & loads CUDA 10.0
job_args = {
    "partition": "<SLURM_PARTITION>",  # Partition to schedule jobs on
    "env_name": "mle-toolbox",  # Env to activate at job start-up
    "use_conda_venv": True,  # Whether to use anaconda venv
    "num_logical_cores": 5,  # Number of requested CPU cores per job
    "num_gpus": 1,  # Number of requested GPUs per job
    "gpu_type": "V100S",  # GPU model requested for each job
    "modules_to_load": "nvidia/cuda/10.0"  # Modules to load at start-up
}

queue = MLEQueue(
    resource_to_run="slurm-cluster",
    job_filename="train.py",
    job_arguments=job_args,
    config_filenames=["base_config_1.yaml",
                      "base_config_2.yaml"],
    experiment_dir="logs_slurm",
    random_seeds=[0, 1]
)
queue.run()


