Large-scale and asynchronous hyperparameter optimization at your fingertips.

Overview

Syne Tune


This package provides state-of-the-art distributed hyperparameter optimizers (HPO), where trials can be evaluated with several backend options: a local backend that evaluates them on the local machine, a SageMaker backend that evaluates them as separate SageMaker training jobs, and another backend with fast startup times that is in the making.

Installing

To install Syne Tune from pip, simply run:

pip install syne-tune

This installs a bare-bones version. If, in addition, you want to install our own Gaussian-process-based optimizers, Ray Tune, or the BORE optimizer, run pip install syne-tune[X], where X can be:

  • gpsearchers: For built-in Gaussian process based optimizers
  • raytune: For Ray Tune optimizers
  • benchmarks: For installing all dependencies required to run all benchmarks
  • extra: For installing all the above
  • bore: For Bore optimizer

For instance, pip install syne-tune[gpsearchers] will install Syne Tune along with the built-in Gaussian process optimizers.

To install the latest version from git, run the following:

pip install git+https://github.com/awslabs/syne-tune.git

For local development, we recommend the following setup, which makes it easy to test your changes:

pip install --upgrade pip
git clone [email protected]:awslabs/syne-tune.git
cd syne-tune
pip install -e .[extra]

How to enable tuning and tuning script conventions

This section describes how to prepare an entry point script for tuning. In particular, we describe:

  1. how hyperparameters are passed from the “tuner” to the user script
  2. how the user script communicates metrics back to the “tuner” (which depends on the backend implementation)
  3. how the user enables checkpointing to pause/resume trials

Hyperparameters. Hyperparameters are passed through command-line arguments, as in SageMaker. For instance, for a hyperparameter num_epochs:

import argparse
parser = argparse.ArgumentParser()
parser.add_argument('--num_epochs', type=int, required=True)
args, _ = parser.parse_known_args()
for i in range(1, args.num_epochs + 1):
  ... # do something
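With this convention, the backend launches each trial by passing the sampled configuration on the command line; for num_epochs=10 the call would look like the following (the script name is illustrative):

python train_script.py --num_epochs 10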

Communicating metrics. You should call a function to report metrics after each epoch or at the end of the trial. For example:

from syne_tune.report import report
for epoch in range(1, num_epochs + 1):
   # ... do something
   train_acc = compute_accuracy()
   report(train_acc=train_acc, epoch=epoch)

This example reports artificial results obtained in a dummy loop. In addition to user metrics, Syne Tune automatically adds the following metrics:

  • st_worker_timestamp: the timestamp at which report was called
  • st_worker_time: the total time elapsed since the creation of the reporter when report was called
  • st_worker_cost (only when running on SageMaker): the dollar cost spent since the creation of the reporter

Model output and checkpointing (optional). Since trials may be paused and resumed (either by schedulers or when using spot instances), the user can checkpoint intermediate results. Model outputs and checkpoints must be written to a specific local path, given by the command-line argument st_checkpoint_dir. Saving/loading model checkpoints from this directory enables the state to be saved/loaded when a job is stopped/resumed (setting this folder correctly and uniquely per trial is the responsibility of the backend).
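For illustration, here is a minimal sketch of this pattern, assuming a JSON file named state.json inside st_checkpoint_dir (the file name and its contents are hypothetical; a real script would also persist model weights):

import argparse
import json
import os

parser = argparse.ArgumentParser()
parser.add_argument('--num_epochs', type=int, required=True)
parser.add_argument('--st_checkpoint_dir', type=str, default=None)
args, _ = parser.parse_known_args()

# resume from the last saved epoch, if a checkpoint exists
start_epoch = 1
ckpt_path = None
if args.st_checkpoint_dir is not None:
    ckpt_path = os.path.join(args.st_checkpoint_dir, "state.json")
    if os.path.exists(ckpt_path):
        with open(ckpt_path) as f:
            start_epoch = json.load(f)["epoch"] + 1

for epoch in range(start_epoch, args.num_epochs + 1):
    ...  # train for one epoch and report metrics
    if ckpt_path is not None:
        os.makedirs(args.st_checkpoint_dir, exist_ok=True)
        with open(ckpt_path, "w") as f:
            json.dump({"epoch": epoch}, f)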

Under the hood, we use the SageMaker checkpoint mechanism to enable checkpointing when running the tuning remotely or when using the SageMaker backend. Checkpoints are saved in s3://{s3_bucket}/syne-tune/{tuner-name}/{trial-id}/, where s3_bucket can be configured (it defaults to the default_bucket of the session).

We refer to checkpoint_example.py for a complete example of a script with checkpointing enabled.

Many other examples of scripts that can be tuned are available in examples/training_scripts.

Launching a tuning job

Tuning options. At a high level, a tuning job consists of a tuning loop that evaluates different trials in parallel and only lets the top ones continue. The loop runs until a stopping criterion is met (for instance, a maximum wallclock time); each time a worker becomes available, a scheduler (an HPO algorithm) is asked which trial should be evaluated next. Trials are executed on a backend. The pseudo-code of an HPO loop is as follows:

def hpo_loop(hpo_algorithm, backend):
    while not_done():
        if worker_is_free():
            config = hpo_algorithm.suggest()
            backend.start_trial(config)
        for result in backend.fetch_new_results():
            decision = hpo_algorithm.on_trial_result(result)
            if decision == "stop":
                backend.stop_trial(result.trial)

By changing the backend, users can decide whether trials are evaluated on the local machine, executed on SageMaker as separate training jobs, or evaluated on a cluster of multiple machines (available as a separate package for now).

Below is a minimal example showing how to tune a script train_height.py with random search:

from pathlib import Path
from syne_tune.search_space import randint
from syne_tune.backend.local_backend import LocalBackend
from syne_tune.optimizer.schedulers.fifo import FIFOScheduler
from syne_tune.stopping_criterion import StoppingCriterion
from syne_tune.tuner import Tuner

config_space = {
    "steps": 100,
    "width": randint(0, 20),
    "height": randint(-100, 100)
}

# path of a training script to be tuned
entry_point = Path(__file__).parent / "training_scripts" / "height_example" / "train_height.py"

# Local back-end
backend = LocalBackend(entry_point=str(entry_point))

# Random search without stopping
scheduler = FIFOScheduler(
    config_space,
    searcher="random",
    mode="min",
    metric="mean_loss",
)

tuner = Tuner(
    backend=backend,
    scheduler=scheduler,
    stop_criterion=StoppingCriterion(max_wallclock_time=30),
    n_workers=4,
)

tuner.run()
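The tuned script must accept the hyperparameters above as command-line arguments and report a metric named mean_loss. A minimal sketch of such a script (for illustration only; the loss formula is made up and this is not the actual train_height.py):

import argparse
from syne_tune.report import report

parser = argparse.ArgumentParser()
parser.add_argument('--steps', type=int, required=True)
parser.add_argument('--width', type=float, required=True)
parser.add_argument('--height', type=float, required=True)
args, _ = parser.parse_known_args()

for step in range(args.steps):
    # dummy loss that depends on the hyperparameters
    mean_loss = 1.0 / (0.1 + args.width * step / 100) + 0.1 * args.height
    report(step=step, mean_loss=mean_loss)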

An important part of this script is the definition of config_space, the configuration space (or search space). This tutorial provides some advice on this choice.
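For reference, a configuration space mixes constants with sampled domains; here is a short sketch, assuming the domains uniform, loguniform, and choice are also available in syne_tune.search_space alongside randint:

from syne_tune.search_space import randint, uniform, loguniform, choice

config_space = {
    "num_layers": randint(1, 8),              # integer in [1, 8]
    "dropout": uniform(0.0, 0.9),             # float in [0.0, 0.9]
    "learning_rate": loguniform(1e-6, 1e-2),  # float, sampled on a log scale
    "optimizer": choice(["adam", "sgd"]),     # categorical
    "epochs": 81,                             # constant, not tuned
}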

Using the local backend LocalBackend(entry_point=...) runs the trials (4 at a time) on the local machine. If users prefer to evaluate trials on SageMaker instead, the SageMaker backend can be used, which allows tuning any SageMaker framework (see launch_height_sagemaker.py for an example). Here is an example that runs a PyTorch estimator on a GPU:

from sagemaker.pytorch import PyTorch
from syne_tune.backend.sagemaker_backend.sagemaker_backend import SagemakerBackend
from syne_tune.backend.sagemaker_backend.sagemaker_utils import get_execution_role

backend = SagemakerBackend(
    # we tune a PyTorch Framework from Sagemaker
    sm_estimator=PyTorch(
        entry_point="path_to_your_entrypoint.py",
        instance_type="ml.p2.xlarge",
        instance_count=1,
        role=get_execution_role(),
        max_run=10 * 60,
        framework_version='1.7.1',
        py_version='py3',
    ),
)

Note that the Syne Tune code is sent along with the SageMaker framework, so that import syne_tune.report (which provides the reporter) works when the training script is executed; there is thus no need to install Syne Tune in the Docker image of the SageMaker framework.

In addition, users can decide to run the tuning loop on a remote instance. This is helpful to avoid keeping a developer machine running, and to benchmark many seed/model options.

tuner = RemoteLauncher(
    tuner=Tuner(
        backend=backend,
        scheduler=scheduler,
        n_workers=n_workers,
        tuner_name="height-tuning",
        stop_criterion=StoppingCriterion(max_wallclock_time=600),
    ),
    # Extra arguments describing the resources of the remote tuning instance and whether we want to wait
    # for the tuning to finish. The instance type on which the tuning job runs can be different from the
    # instance type used for evaluating the training jobs.
    instance_type='ml.m5.large',
)

tuner.run(wait=False)

In this case, the tuning loop is executed on an ml.m5.large instance instead of running locally. Both backends can be used with the remote launcher: with the SageMaker backend, the tuning loop runs on the instance type specified in the remote launcher, while the trials are evaluated on the instance(s) configured in the SageMaker framework (this may include several instances in the case of distributed training). When the remote launcher is used with the SageMaker backend, a SageMaker job is created to execute the tuning loop, which then schedules a new SageMaker training job for each configuration to be evaluated. The options and use cases are summarized in the following table:

| Tuning loop | Trial execution | Use-case example |
| --- | --- | --- |
| Local | Local | Quick tuning of cheap models, debugging. Example: launch_height_local.py |
| Local | SageMaker | Avoid saturating the local machine with expensive trial computation, possibly use distributed training, debug the tuning loop on a local machine. Example: launch_height_sagemaker.py |
| SageMaker | Local | Run remotely to benchmark many HPO algorithm/seed options, possibly on a big machine with multiple CPUs or GPUs. Example: launch_height_sagemaker_remotely.py |
| SageMaker | SageMaker | Run remotely to benchmark many HPO algorithm/seed options, enable distributed training or heavy computation. Example: launch_height_sagemaker_remotely.py with distribute_trials_on_SageMaker=True |

To summarize: to evaluate trials locally, use LocalBackend; to evaluate trials on SageMaker, use SageMakerBackend, which allows tuning any SageMaker estimator (see launch_height_local.py or launch_height_sagemaker.py for examples). To run the tuning loop remotely, RemoteLauncher can be used; see launch_height_sagemaker_remotely.py for an example.

Output of a tuning job.

Every tuning experiment generates three files:

  • results.csv.zip contains live information on all results seen by the scheduler, in addition to other information such as the decision taken by the scheduler, the wallclock time, or the dollar cost of the tuning (only on SageMaker).
  • tuner.dill contains the checkpoint of the tuner, which includes the backend, the scheduler, and other information. It can be used to resume a tuning experiment, use Spot instances for tuning, or perform a fine-grained analysis of the scheduler state.
  • metadata.json contains the timestamp when the Tuner effectively started to run, as well as possible user metadata.

For instance, the following code:

tuner = Tuner(
   backend=backend,
   scheduler=scheduler,
   n_workers=4,
   tuner_name="height-tuning",
   stop_criterion=StoppingCriterion(max_wallclock_time=600),
   metadata={'description': 'just an example'},
)
tuner.run()

runs a tuning job that evaluates 4 configurations in parallel with the given backend/scheduler and stops after 600 seconds. Tuner appends a unique suffix to ensure that the tuner name is unique (with the above example, the id of the experiment may be height-tuning-2021-07-02-10-04-37-233). Results are updated every 30 seconds by default, which is configurable.
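For example, the update cadence can be changed when constructing the Tuner (results_update_interval is assumed here to be the relevant argument, in seconds):

tuner = Tuner(
   backend=backend,
   scheduler=scheduler,
   n_workers=4,
   tuner_name="height-tuning",
   stop_criterion=StoppingCriterion(max_wallclock_time=600),
   results_update_interval=60,  # assumed argument: store results every 60 seconds instead of 30
)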

Experiment data can be retrieved at a later stage for further analysis with the following command:

from syne_tune.experiments import load_experiment

tuning_experiment = load_experiment("height-tuning-2021-07-02-10-04-37-233")
tuning_experiment = load_experiment(tuner.name)  # equivalent

The result returned by load_experiment has the following schema:

class ExperimentResult:
    name: str
    results: pandas.DataFrame
    metadata: Dict
    tuner: Tuner

Here, metadata contains the metadata provided by the user ({'description': 'just an example'} in this case), as well as st_tuner_creation_timestamp, which stores the timestamp when the tuning actually started.
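Continuing the snippet above, the fields of ExperimentResult can be inspected directly; for example (assuming the metric mean_loss from the earlier example):

print(tuning_experiment.name)        # experiment name
print(tuning_experiment.metadata)    # user metadata plus st_tuner_creation_timestamp

df = tuning_experiment.results       # pandas DataFrame with one row per reported result
best_row = df.loc[df["mean_loss"].idxmin()]
print(best_row)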

Output of a tuning job when running on SageMaker. When the tuning runs remotely on SageMaker, results are stored at regular intervals to s3://{s3_bucket}/syne-tune/{tuner-name}/, where s3_bucket can be configured (it defaults to the default_bucket of the session). For instance, if the above experiment is run remotely, the following path is used to checkpoint results and state:

s3://sagemaker-us-west-2-{aws_account_id}/syne-tune/height-tuning-2021-07-02-10-04-37-233/results.csv.zip

Multiple GPUs. If your instance has multiple GPUs, the local backend can run different trials in parallel, each on its own GPU, using the option LocalBackend(rotate_gpus=True), which is activated by default. When a new trial starts, it is assigned to a free GPU if possible. In case of ties, the GPU with the fewest prior assignments is chosen. If the number of workers is larger than the number of GPUs, several trials will run as subprocesses on the same GPU. If the number of workers is smaller than or equal to the number of GPUs, each trial occupies a GPU on its own, and trials can start without delay. Reasons to choose rotate_gpus=False include insufficient GPU memory or training code that already makes good use of multiple GPUs.
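For example, to disable the rotation and let every trial see all GPUs:

backend = LocalBackend(entry_point=str(entry_point), rotate_gpus=False)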

Examples

Once you have a tuning script, you can call Tuner with any scheduler to perform your HPO. Further examples can be found in the examples/ folder.

Running on SageMaker

If you want to launch experiments on SageMaker rather than on your local machine, you will need access to AWS and SageMaker on your machine.

Make sure that:

  • awscli is installed (see this link)
  • docker is installed and running (see this link)
  • A SageMaker role has been created (see this page for instructions; if you have created a SageMaker notebook in the past, this role should already exist).
  • AWS credentials have been set properly (see this link).

Note: all those conditions are already met if you run in a SageMaker notebook; they are only relevant if you run on your local machine or in another environment.

The following command should run without error if your credentials are available:

python -c "import boto3; print(boto3.client('sagemaker').list_training_jobs(MaxResults=1))"

Or run the following example, which evaluates trials on SageMaker:

python examples/launch_height_sagemaker.py

Syne Tune allows you to launch HPO experiments remotely on SageMaker, instead of them running on your local machine. This is particularly interesting for running many experiments in parallel. Here is an example:

python examples/launch_height_sagemaker_remotely.py

If you run this for the first time, it will take a while, as a Docker image with the Syne Tune dependencies is built and pushed to ECR. This has to be done only once, even if the Syne Tune source code is modified later on.

Assuming that launch_height_sagemaker_remotely.py is working for you now, you should note that the script returns immediately after starting the experiment, which is running as a SageMaker training job. This allows you to run many experiments in parallel, possibly by using the command line launcher.

If running this example fails, you are probably not set up to build Docker images and push them to ECR on your local machine. Check that aws-cli is installed and that Docker is running on your machine. After checking that those conditions are met (consider using a SageMaker notebook if not, since AWS access and Docker are configured there automatically), you can try building the image again by running:

cd container
bash build_syne_tune_container.sh

To run on SageMaker, you can also use any custom Docker image available on ECR. See launch_height_sagemaker_custom_image.py for an example of how to run a script with a custom Docker image.

Benchmarks

Syne Tune comes with a range of benchmarks for testing and demonstration. Turning your own tuning problem into a benchmark is simple and comes with a number of advantages. As detailed in this tutorial, you can use the command-line launcher launch_hpo.py to start one or more experiments, adjusting many parameters of the benchmark, backend, tuner, or scheduler from the command line. The simpler launch_benchmarks.py can also be used to launch experiments.

Once tuning experiments are finished, show_experiment_results.py gives an example of how results can be retrieved and plotted.

Tutorials

Do you want to know more? Here are a number of short tutorials.

Security

See CONTRIBUTING for more information.

License

This project is licensed under the Apache-2.0 License.

Comments
  • Allow for independence between number of trials and number of combinations.

    TL;DR: Please allow to run experiments with more (or less) trials than the number of combinations.

    Right now when running a hyperparameter tuning job with AMT I get an error message when the number of trials exceeds the number of enumerable combinations; here being the discrete combinations of integer and categorical parameter values.

    But this is limiting as the exploration of the search space is noisy and more than one trial may be needed to understand the variance and to establish a stable mean.

    As a workaround I am starting tuning jobs with the Random strategy with an additional hyper parameter "dummy" that is continuous. This way I can specify the number of trials I need. But this makes it harder to use this data basis for future warmstarts to narrow down scenarios. Further it forces me to allow the "dummy" parameter in my training script.

    Example: I want to know if adding a non-linearity and additional capacity (Pooler) on top of a BERT-like model will yield better performance as result, or if the extra capacity will make the model lazier and not use the transformer blocks below. I also want to see if this assessment changes when adding more transformer layers.

    So I have two categorical variables. Layers: [1, 4, 8] and scale-of-classifier: [0, 0.5, 1.0, 2.0]. These are just 3*4 combinations. But given the noisy nature of a NN training a single data point per combination has next to no meaning. To produce the understanding below I used about 100 data points with the workaround from above.

    If I could just specify the categorical parameters and the number of trials (for GridSearch/Random), my appreciation will follow you until the end of your hopefully long and fulfilling life.

    opened by marianokamp 23
  • Experiment Results Contain Random Rows

    In my experiment, the result data frame contains multiple rows with trial id 1 with the same content as the next row, the only difference being the config. This causes problems since sometimes the best config is now trial id 1 that shows a config which did not achieve the best performance.

    See this example: True trial id 1 performance is 81% (Row 4) but trial id 1 also shows up in row 10 with highest accuracy. I've added a simple example to reproduce this behavior.

    from pathlib import Path
    
    from sagemaker.pytorch import PyTorch
    
    from syne_tune.backend import SageMakerBackend
    from sagemaker import get_execution_role
    from syne_tune.optimizer.baselines import RandomSearch
    from syne_tune import Tuner
    from syne_tune.config_space import randint
    from syne_tune import StoppingCriterion
    from syne_tune.optimizer.schedulers.fifo import FIFOScheduler
    
    entry_point = Path('examples') / "training_scripts" / "height_example" / "train_height.py"
    assert entry_point.is_file(), 'File unknown'
    mode = "min"
    metric = "mean_loss"
    instance_type = 'ml.c5.4xlarge'
    instance_count = 1
    instance_max_time = 999
    n_workers = 20
    
    config_space = {
        "steps": 1,
        "width": randint(0, 20),
        "height": randint(-100, 100)
    }
    
    backend = SageMakerBackend(
        sm_estimator=PyTorch(
            entry_point=str(entry_point),
            instance_type=instance_type,
            instance_count=instance_count,
            role=get_execution_role(),
            max_run=instance_max_time,
            py_version='py3',
            framework_version='1.6',
        ),
        metrics_names=[metric],
    )
    
    # Random search without stopping
    scheduler = FIFOScheduler(
        config_space=config_space,
        searcher='random',
        mode=mode,
        metric=metric,
    )
    
    tuner = Tuner(
        trial_backend=backend,
        scheduler=scheduler,
        stop_criterion=StoppingCriterion(max_wallclock_time=300),
        n_workers=n_workers,
    )
    
    tuner.run()
    
    
    bug 
    opened by wistuba 17
  • Promotion Logic Bug

    There seems to be a problem with the Hyperband promotion logic.

    How to reproduce: Add type="promotion" to https://github.com/awslabs/syne-tune/blob/main/benchmarking/nursery/benchmark_automl/baselines.py#L69

    Run python benchmarking/nursery/benchmark_automl/benchmark_main.py --num_seeds 1 --method ASHA --benchmark lcbench-airlines

      File "/syne-tune/benchmarking/nursery/benchmark_automl/benchmark_main.py", line 209, in <module>
        tuner.run()
      File "/syne-tune/syne_tune/tuner.py", line 240, in run
        raise e
      File "/syne-tune/syne_tune/tuner.py", line 175, in run
        new_done_trial_statuses, new_results = self._process_new_results(
      File "/syne-tune/syne_tune/tuner.py", line 345, in _process_new_results
        done_trials_statuses = self._update_running_trials(
      File "/syne-tune/syne_tune/tuner.py", line 465, in _update_running_trials
        decision = self.scheduler.on_trial_result(trial=trial, result=result)
      File "/syne-tune/syne_tune/optimizer/schedulers/hyperband.py", line 779, in on_trial_result
        task_info = self.terminator.on_task_report(trial_id, result)
      File "/syne-tune/syne_tune/optimizer/schedulers/hyperband.py", line 1124, in on_task_report
        rung_sys.on_task_report(trial_id, result, skip_rungs=skip_rungs)
      File "/syne-tune/syne_tune/optimizer/schedulers/hyperband_promotion.py", line 221, in on_task_report
        assert resource == milestone, (
    AssertionError: trial_id 1: resource = 4 > 3 milestone. Make sure to report time attributes covering all milestones
    bug 
    opened by wistuba 16
  • Refactor surrogates in blackbox repository

    Currently, surrogates may return inconsistent metric curves (e.g., elapsed_time not monotonic w.r.t. fidelity). It is also unclear how seed is treated in a surrogate.

    Will use multi-variate regression natively supported in scikit-learn. We currently already use that w.r.t. num_objectives. The input of the model will be the HP config only. The old way can still be used, but won't be the default.

    Will also sort out the situation with seed.

    enhancement 
    opened by mseeger 15
  • Grid search in syne-tune

    Hey folks, would you be interested in grid search implemented in syne-tune? I had a few offline discussions with some of you already, and it seems that you are not against grid search added to syne-tune, but want to keep a record of that here.

    Additionally, would you have any pointers as to what would be the best way to add grid search to syne-tune?

    enhancement 
    opened by iaroslav-ai 15
  • SageMaker ResourceLimitExceeded

    Hi, I have a limit of 8 ml.g5.12xlarge instances, and although I set Tuner.n_workers = 5 I still got a ResourceLimitExceeded error. Is there a way to make sure that jobs are fully stopped when using SageMakerBackend before launching new ones?

    Also, when using RemoteLauncher, in situations where the management instance does error out (for example due to ResourceLimitExceeded), is there a way to make sure the management instance sends a stop signal to all tuning jobs before exiting? Maybe something like:

    try:
        # manage tuning jobs
    except:
       # raise error
    finally:
       # stop any trials still running
    
    enhancement 
    opened by austinmw 15
  • Issue with running launch_sagemaker_backend.py: No module named 'benchmarks'

    Hello! When running https://github.com/awslabs/syne-tune/blob/main/docs/tutorials/basics/scripts/launch_sagemaker_backend.py (python docs/tutorials/basics/scripts/launch_sagemaker_backend.py) on the main branch I get an error within the spawned SageMaker training jobs:

    Traceback (most recent call last):
      File "traincode_report_withcheckpointing.py", line 29, in <module>
        from benchmarks.checkpoint import resume_from_checkpointed_model, \
    ModuleNotFoundError: No module named 'benchmarks'
    

    I'm including the full log below. I’m not certain if it’s due to my AWS environment setup (although I am generally able to run SageMaker training jobs) or an issue with the code, could you please have a look?

    Best wishes, Adam

    Full log:

    showing log of sagemaker job: traincode-report-withcheckpointing-2022-01-18-16-26-35-248-4
    bash: cannot set terminal process group (-1): Inappropriate ioctl for device
    bash: no job control in this shell
    2022-01-18 16:34:35,020 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
    2022-01-18 16:34:35,023 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
    2022-01-18 16:34:35,035 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
    2022-01-18 16:34:36,465 sagemaker_pytorch_container.training INFO     Invoking user training script.
    2022-01-18 16:34:37,061 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
    2022-01-18 16:34:37,076 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
    2022-01-18 16:34:37,090 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
    2022-01-18 16:34:37,103 sagemaker-training-toolkit INFO     Invoking user script
    Training Env:
    {
        "additional_framework_parameters": {},
        "channel_input_dirs": {},
        "current_host": "algo-1",
        "framework_module": "sagemaker_pytorch_container.training:main",
        "hosts": [
            "algo-1"
        ],
        "hyperparameters": {
            "batch_size": 126,
            "weight_decay": 0.7744002774231975,
            "st_checkpoint_dir": "/opt/ml/checkpoints",
            "st_instance_count": 1,
            "n_units_2": 322,
            "dataset_path": "./",
            "n_units_1": 107,
            "dropout_2": 0.20979101632756325,
            "dropout_1": 0.4715702331554363,
            "epochs": 81,
            "learning_rate": 0.0029903699075321814,
            "st_instance_type": "ml.m4.10xlarge"
        },
        "input_config_dir": "/opt/ml/input/config",
        "input_data_config": {},
        "input_dir": "/opt/ml/input",
        "is_master": true,
        "job_name": "traincode-report-withcheckpointing-2022-01-18-16-26-35-248-4",
        "log_level": 20,
        "master_hostname": "algo-1",
        "model_dir": "/opt/ml/model",
        "module_dir": "s3://sagemaker-us-west-2-640549960621/traincode-report-withcheckpointing-2022-01-18-16-26-35-248-4/source/sourcedir.tar.gz",
        "module_name": "traincode_report_withcheckpointing",
        "network_interface_name": "eth0",
        "num_cpus": 40,
        "num_gpus": 0,
        "output_data_dir": "/opt/ml/output/data",
        "output_dir": "/opt/ml/output",
        "output_intermediate_dir": "/opt/ml/output/intermediate",
        "resource_config": {
            "current_host": "algo-1",
            "hosts": [
                "algo-1"
            ],
            "network_interface_name": "eth0"
        },
        "user_entry_point": "traincode_report_withcheckpointing.py"
    }
    Environment variables:
    SM_HOSTS=["algo-1"]
    SM_NETWORK_INTERFACE_NAME=eth0
    SM_HPS={"batch_size":126,"dataset_path":"./","dropout_1":0.4715702331554363,"dropout_2":0.20979101632756325,"epochs":81,"learning_rate":0.0029903699075321814,"n_units_1":107,"n_units_2":322,"st_checkpoint_dir":"/opt/ml/checkpoints","st_instance_count":1,"st_instance_type":"ml.m4.10xlarge","weight_decay":0.7744002774231975}
    SM_USER_ENTRY_POINT=traincode_report_withcheckpointing.py
    SM_FRAMEWORK_PARAMS={}
    SM_RESOURCE_CONFIG={"current_host":"algo-1","hosts":["algo-1"],"network_interface_name":"eth0"}
    SM_INPUT_DATA_CONFIG={}
    SM_OUTPUT_DATA_DIR=/opt/ml/output/data
    SM_CHANNELS=[]
    SM_CURRENT_HOST=algo-1
    SM_MODULE_NAME=traincode_report_withcheckpointing
    SM_LOG_LEVEL=20
    SM_FRAMEWORK_MODULE=sagemaker_pytorch_container.training:main
    SM_INPUT_DIR=/opt/ml/input
    SM_INPUT_CONFIG_DIR=/opt/ml/input/config
    SM_OUTPUT_DIR=/opt/ml/output
    SM_NUM_CPUS=40
    SM_NUM_GPUS=0
    SM_MODEL_DIR=/opt/ml/model
    SM_MODULE_DIR=s3://sagemaker-us-west-2-640549960621/traincode-report-withcheckpointing-2022-01-18-16-26-35-248-4/source/sourcedir.tar.gz
    SM_TRAINING_ENV={"additional_framework_parameters":{},"channel_input_dirs":{},"current_host":"algo-1","framework_module":"sagemaker_pytorch_container.training:main","hosts":["algo-1"],"hyperparameters":{"batch_size":126,"dataset_path":"./","dropout_1":0.4715702331554363,"dropout_2":0.20979101632756325,"epochs":81,"learning_rate":0.0029903699075321814,"n_units_1":107,"n_units_2":322,"st_checkpoint_dir":"/opt/ml/checkpoints","st_instance_count":1,"st_instance_type":"ml.m4.10xlarge","weight_decay":0.7744002774231975},"input_config_dir":"/opt/ml/input/config","input_data_config":{},"input_dir":"/opt/ml/input","is_master":true,"job_name":"traincode-report-withcheckpointing-2022-01-18-16-26-35-248-4","log_level":20,"master_hostname":"algo-1","model_dir":"/opt/ml/model","module_dir":"s3://sagemaker-us-west-2-640549960621/traincode-report-withcheckpointing-2022-01-18-16-26-35-248-4/source/sourcedir.tar.gz","module_name":"traincode_report_withcheckpointing","network_interface_name":"eth0","num_cpus":40,"num_gpus":0,"output_data_dir":"/opt/ml/output/data","output_dir":"/opt/ml/output","output_intermediate_dir":"/opt/ml/output/intermediate","resource_config":{"current_host":"algo-1","hosts":["algo-1"],"network_interface_name":"eth0"},"user_entry_point":"traincode_report_withcheckpointing.py"}
    SM_USER_ARGS=["--batch_size","126","--dataset_path","./","--dropout_1","0.4715702331554363","--dropout_2","0.20979101632756325","--epochs","81","--learning_rate","0.0029903699075321814","--n_units_1","107","--n_units_2","322","--st_checkpoint_dir","/opt/ml/checkpoints","--st_instance_count","1","--st_instance_type","ml.m4.10xlarge","--weight_decay","0.7744002774231975"]
    SM_OUTPUT_INTERMEDIATE_DIR=/opt/ml/output/intermediate
    SM_HP_BATCH_SIZE=126
    SM_HP_WEIGHT_DECAY=0.7744002774231975
    SM_HP_ST_CHECKPOINT_DIR=/opt/ml/checkpoints
    SM_HP_ST_INSTANCE_COUNT=1
    SM_HP_N_UNITS_2=322
    SM_HP_DATASET_PATH=./
    SM_HP_N_UNITS_1=107
    SM_HP_DROPOUT_2=0.20979101632756325
    SM_HP_DROPOUT_1=0.4715702331554363
    SM_HP_EPOCHS=81
    SM_HP_LEARNING_RATE=0.0029903699075321814
    SM_HP_ST_INSTANCE_TYPE=ml.m4.10xlarge
    PYTHONPATH=/opt/ml/code:/opt/conda/bin:/opt/conda/lib/python36.zip:/opt/conda/lib/python3.6:/opt/conda/lib/python3.6/lib-dynload:/opt/conda/lib/python3.6/site-packages
    Invoking script with the following command:
    /opt/conda/bin/python3.6 traincode_report_withcheckpointing.py --batch_size 126 --dataset_path ./ --dropout_1 0.4715702331554363 --dropout_2 0.20979101632756325 --epochs 81 --learning_rate 0.0029903699075321814 --n_units_1 107 --n_units_2 322 --st_checkpoint_dir /opt/ml/checkpoints --st_instance_count 1 --st_instance_type ml.m4.10xlarge --weight_decay 0.7744002774231975
    Traceback (most recent call last):
      File "traincode_report_withcheckpointing.py", line 29, in <module>
        from benchmarks.checkpoint import resume_from_checkpointed_model, \
    ModuleNotFoundError: No module named 'benchmarks'
    2022-01-18 16:34:38,444 sagemaker-training-toolkit ERROR    ExecuteUserScriptError:
    Command "/opt/conda/bin/python3.6 traincode_report_withcheckpointing.py --batch_size 126 --dataset_path ./ --dropout_1 0.4715702331554363 --dropout_2 0.20979101632756325 --epochs 81 --learning_rate 0.0029903699075321814 --n_units_1 107 --n_units_2 322 --st_checkpoint_dir /opt/ml/checkpoints --st_instance_count 1 --st_instance_type ml.m4.10xlarge --weight_decay 0.7744002774231975"
    Traceback (most recent call last):
      File "traincode_report_withcheckpointing.py", line 29, in <module>
        from benchmarks.checkpoint import resume_from_checkpointed_model, \
    ModuleNotFoundError: No module named 'benchmarks'
    
    opened by talesa 14
  • Bug with Seeded Searchers

    Opened on behalf of @timyber:

    When we set a fixed seed, random sampling always has the same behavior, and it would run out of the search space if we run a large number of training jobs. This can be reproduced by testing a large budget (e.g. max_training_jobs: 100, batch_size: 1) and setting the seed to a fixed value.

    bug 
    opened by wistuba 12
  • Numeric and Log-Scale Choice

    There is no equivalent of choice for numeric values. E.g., in the FCNet blackbox the learning rate is defined as 'hp_init_lr': choice([0.0005, 0.001, 0.005, 0.01, 0.05, 0.1]). This will not allow model-based approaches to encode this hyperparameter correctly. Would be great to identify them as numeric and also indicate whether log transform is needed.

    enhancement 
    opened by wistuba 10
  • Gridsearcher issue 2

    Issue #, if available: #378

    Description of changes: Added support for continuous hyperparameters to Gridsearch, added a unit test for it as well

    By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

    opened by mina-ghashami 9
  • [WIP] Integrate YAHPO Gym

    Description of changes:

    First draft for including YAHPO Gym as a BlackBoxRecipe. This is not entirely straightforward and I might need some input from @geoalgo on how to progress.

    Currently, yahpo has a nested structure: /scenario/instance where scenario is a problem set and all instances within a scenario share the same search space. (A scenario is e.g. xgboost and the instances are different datasets) In the current design, the user would call

    bb = load_blackbox("YAHPO")[<scenario>]
    bb.set_instance(<instance>)
    

    If we unnest this, this would (in total) be around 850 instances.

    @geoalgo Could you perhaps do a pass / help me think about how to integrate the different designs? I guess we might want to have one Recipe per scenario as you do in the icml_2020 recipe? Would this bloat the Recipes?

    I will list a few open to-do's:

    • [ ] Check what needs to be serialized / How to distribute the .onnx neural networks.
    • [ ] Check where the YAHPO setup (pointer to data dir etc.) needs to happen.
    • [ ] Add an example script.
    opened by pfistfl 9
  • Different searchers suggest same initial random configs. New methods in baselines

    Issue #, if available:

    Description of changes: Ensures that all searchers return the same random initial configs when started with the same seed. Also:

    • New classes in baselines.py
    • Split searcher.py (got too large)
    • Make sure that BOHB schedulers do not return duplicates

    By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

    opened by mseeger 0
  • Allow get_config to return the same config more than once

    Issue #, if available: 415

    Description of changes: Introduces flag allow_duplicates to searchers, which allows to return the same config more than once. Also contains a new test on searchers, whether they properly implement allow_duplicates=False (the default).

    By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

    opened by mseeger 0
  • pytest: Add pytest-xdist

    feat/parallelize-tests

    Why

    With pytest-xdist we can parallelize the test suite. This results in a quick win for a faster feedback loop.

    Numbers

    Below are the results of the test suite on my machine (an i9-13900k):

    | command | run # | real | user | sys |
    | --- | --- | --- | --- | --- |
    | pytest | 1 | 0m35.141s | 1m12.942s | 2m6.306s |
    | pytest | 2 | 0m37.991s | 2m18.906s | 2m49.160s |
    | pytest | 3 | 0m36.207s | 1m36.687s | 2m39.535s |
    | pytest -n 1 --dist loadgroup | 1 | 0m35.896s | 1m25.656s | 2m17.317s |
    | pytest -n 1 --dist loadgroup | 2 | 0m38.354s | 2m2.959s | 3m7.145s |
    | pytest -n 1 --dist loadgroup | 3 | 0m38.792s | 2m14.633s | 3m8.681s |
    | pytest -n 2 --dist loadgroup | 1 | 0m25.270s | 2m24.090s | 3m14.851s |
    | pytest -n 2 --dist loadgroup | 2 | 0m28.600s | 3m51.649s | 4m6.193s |
    | pytest -n 2 --dist loadgroup | 3 | 0m28.093s | 3m43.806s | 3m49.875s |
    | pytest -n 3 --dist loadgroup | 1 | 0m22.370s | 2m59.735s | 3m32.893s |
    | pytest -n 3 --dist loadgroup | 2 | 0m19.252s | 1m55.877s | 2m37.809s |
    | pytest -n 3 --dist loadgroup | 3 | 0m22.168s | 2m56.645s | 3m25.640s |
    | pytest -n 4 --dist loadgroup | 1 | 0m20.715s | 2m50.816s | 3m20.745s |
    | pytest -n 4 --dist loadgroup | 2 | 0m20.518s | 2m48.112s | 3m36.456s |
    | pytest -n 4 --dist loadgroup | 3 | 0m20.832s | 3m2.586s | 3m28.027s |

    The average of each run's real times are:

    | command | real |
    | --- | --- |
    | pytest | 0m36.446s |
    | pytest -n 1 --dist loadgroup | 0m37.681s |
    | pytest -n 2 --dist loadgroup | 0m27.321s |
    | pytest -n 3 --dist loadgroup | 0m21.263s |
    | pytest -n 4 --dist loadgroup | 0m20.688s |

    Going beyond four processes doesn't seem to yield any further improvements (likely because some portions of the test suite involve parallelized operations).

    How

    I added pytest-xdist to the dev extra requirements. I also updated the pytest.ini file with addopts to configure the test suite to run with four processes and run tests with the same group on the same worker (to avoid resource contention). The test_cholesky_factorization test was refactored into two smaller tests which both increase their timeout with respect to input size. Additionally, tests which involved parallelized operations were added to an xdist_group named parallel to ensure they are run on the same worker (to avoid resource contention).

    While the standard GitHub runners only have two cores (the unit-tests.yml workflow has been edited accordingly), local development can benefit from this.


    By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

    opened by ConnorBaker 3
  • add dependabot

    Issue #, if available:

    Description of changes:

    Add dependabot to keep dependencies up-to-date

    By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

    opened by wesk 3
  • Files seem to be corrupted (hash mismatch)

    Hello all,

    I updated to the most recent library version and noticed an issue in using the benchmarks. I am getting the following: Files seem to be corrupted (hash mismatch), which keeps persisting across multiple code reruns. The issue was fixed in PR #428 initially.

    The hash is consistent between the calls, however, it seems not to match the hardcoded one and not to be the same between different operating systems. For example for PD1:

    Hardcoded one in syne-tune:

    bd5b599179b1c5163d146a26dd2d559e5cb561f491ef48a22503e821651fd4d1

    On Windows (11) I get:

    f693fed481e344267c3a897eb8e629056e93304ee21e6e90955029c9804cdfda

    On Linux (CentOS 7.9) I get:

    ca162d81cacadb1e177ec319e65d68f812140bbc5864b0dceac28bbcca328a70

    I am able to overcome the problem by hardcoding the hash to my local specific value, but it seems the function that calculates the hash is not working as intended maybe, unless I am doing something wrong.

    bug 
    opened by ArlindKadra 4
  • feat: Add `py.typed` file to package so type annotations are exposed

    Hello all!

    Would you be interested in adding a py.typed file to your package?

    Per PEP-561 (https://peps.python.org/pep-0561/), library authors who want to support type-checking of their code must add a py.typed file to their package and include it as part of the package data so it's redistributed.

    Doing so would allow downstream users of the library to benefit from the inline type annotations you have, freeing them of the need to create and maintain type stubs.

    opened by ConnorBaker 4
Releases(v0.3.3)
  • v0.3.3(Dec 19, 2022)

    [0.3.3] - 2022-12-19

    We release version 0.3.3 which you can install with pip install syne-tune[extra].

    Thanks to all contributors (sorted by chronological commit order): @mseeger, @mina-ghashami, @aaronkl, @jgolebiowski, @Valavanca, @TrellixVulnTeam, @geoalgo, @wistuba, @mlblack

    Added

    • Revamped documentation hosted at https://syne-tune.readthedocs.io
    • New tutorial: Benchmarking in Syne Tune
    • Added section on backends in Basics of Syne Tune tutorial
    • Control of re-creating of blackboxes by checking and storing hash codes
    • New benchmark: Transformer on WikiText-2
    • Support SageMaker managed warm pools in SageMaker backend
    • Improvements for benchmarking with YAHPO blackboxes
    • Support points_to_evaluate in BORE
    • SageMaker backend supports delete_checkpoints=True

    Changed

    • GridSearch supports all domain types now
    • BlackboxSurrogate of blackbox repository supports different modes
    • Add timeout to unit tests
    • New unit tests which run schedulers for longer, using simulator backend

    Fixed

    • HyperbandScheduler: does_pause_resume for all types
    • ASHA with type="promotion" did not work when checkpointing not implemented
    • Fixed preprocessing of PD1 blackbox
    • SageMaker backend reports on true number of busy workers (fixes issue #250)
    • Fix issue with uploading/syncing to S3 of YAHPO blackbox
    • Fix YAHPO surrogate evaluations in the presence of inactive hyperparameters
    • Fix treatment of Status.paused in TuningStatus and Tuner
  • v0.3.2(Oct 14, 2022)

    Added

    • New tutorial: How to Contribute a New Scheduler
    • New tutorial: Multi-Fidelity Hyperparameter Optimization
    • YAHPO benchmarks integrated into blackbox repository
    • PD1 benchmarks integrated into blackbox repository
    • New HPO algorithm: Hyper-Tune
    • New HPO algorithm: Differential Evolution Hyperband (DEHB)
    • New experimental HPO algorithm: Neuralband
    • New HPO algorithm: Grid search (categorical variables only)
    • BOTorch searcher
    • MOBSTER algorithm supports independent GPs at each rung level
    • Support for launching experiments in benchmarking/commons, for local, SageMaker, and simulator back-end
    • New benchmark: Fine-tuning Hugging Face transformers
    • Add IPython util function to display results as parallel categories plot
    • New hyperparameter types ordinal, logordinal
    • Support no checkpointing in BlackboxRepositoryBackend
    • Plateau rule as StoppingCriterion
    • Automate PyPI releases: python-publish.yml
    • Add license hook

    Changed

    • Replace PyTorch MLP by sklearn in BORE (better performance)
    • AWS dependencies moved out of core into aws
    • New dependencies yahpo

    Fixed

    • In SageMaker back-end, trials with low IDs received reports several times. This is fixed
    • Fixing issue with checkpoint_s3_uri usage
    • Fix mode in BOTorch searcher when maximizing
    • Avoid experiment abort due to throttling of SageMaker job launching
    • Surrogate model for lcbench defaults to 1-NN now
    • Fix conditional imports, so Syne Tune can be run with reduced dependencies
    • Fix lcbench blackbox (ignore first and last fidelity)
    • Fix bug in BlackboxSimulatorBackend for pause/resume scheduling (issue #304)
    • Revert wait_trial_completion_when_stopping to False
    • Terminate with error when tuning sees an exception
    • Docker Building Fixed by Adding Line Breaks At End of Requirements Files
    • Control Decision for Running Trials When Stopping Criterion is Met
    • Fix mode MSR and HB+BB
  • v0.3.1(Sep 16, 2022)

Owner
Amazon Web Services - Labs