Generate job input scripts for HPC scheduler systems, submit them to HPC systems, and poke until they finish

Overview

DPDispatcher

DPDispatcher is a Python package that generates job input scripts for HPC (High Performance Computing) scheduler systems (Slurm/PBS/LSF/dpcloudserver), submits them to HPC systems, and monitors (pokes) the jobs until they finish.
It then downloads the result files if the jobs ran on remote systems connected by SSH.

For more information, check the documentation.

Installation

DPDispatcher can be installed with pip:

pip install dpdispatcher

Usage

See Getting Started for usage.

Contributing

DPDispatcher is maintained by Deep Modeling's developers and welcomes other contributors. See the Contributing Guide to become a contributor! 🤓

Comments
  • paramiko.ssh_exception.SSHException: Server connection dropped

    I recently encountered a problem when using dpgen to perform model deviation. When paramiko downloads a large file, it reports the following error: paramiko.ssh_exception.SSHException: Server connection dropped. The error occurs when downloading the tar file of model_devi (123 Gb), which is consistent with the situation described in this article: https://zhuanlan.zhihu.com/p/102372919. At present, dpgen downloads the trajectory to the local machine and then obtains the candidate structures through analysis. If this could be changed to analyze the trajectory on the remote server, extract the candidate structures, and only then download them, it would avoid downloading large trajectory files, greatly speed up the process, and avoid the above error. I hope the developers can consider my suggestion. Thank you very much.

    opened by wankiwi 8
  • add ratio_unfinished param to allow jobs failed or discarded in a submission

    In order to speed up dpgen jobs, it may be practical to accelerate the FP stage by skipping slow jobs or by waiting for them to finish asynchronously.

    While running dpgen jobs, we found that the FP phase takes up a large part of each iteration. This is because the DFT calculations for some candidates are hard and time-consuming, and we have to wait for this long tail to finish before going to the next iteration. We found that the proportion of such candidates is very small, possibly less than 1%.

    So we propose a new parameter, "ratio_unfinished", to optimize the execution: if most of the fp jobs have finished, we can discard the remaining jobs and go directly to the next iteration. We think this is acceptable since the ratio is very small. Furthermore, we could also enable an asynchronous check that waits for these jobs to finish and downloads the results asynchronously, but that step needs to be done in dpgen. For now, we modify dpdispatcher as the first step.

    The following are our test numbers, which show a significant time saving: image

    opened by shazj99 7
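
The threshold idea behind ratio_unfinished in the proposal above can be sketched as follows. This is only an illustration of the decision rule, not dpdispatcher's actual implementation; the function name can_proceed and its arguments are hypothetical.

```python
def can_proceed(n_finished: int, n_total: int, ratio_unfinished: float = 0.0) -> bool:
    """Return True when few enough jobs remain unfinished to move on.

    Hypothetical helper: it only illustrates the threshold logic described
    in the proposal and is not part of the dpdispatcher API.
    """
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    if not 0.0 <= ratio_unfinished <= 1.0:
        raise ValueError("ratio_unfinished must be in [0, 1]")
    n_unfinished = n_total - n_finished
    # Compare the unfinished count against the allowed share of all jobs.
    return n_unfinished <= ratio_unfinished * n_total
```

With the default of 0.0 every job must finish, which matches the behavior before the parameter existed; a value such as 0.01 would let an iteration continue once the long tail is down to about 1% of the jobs.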
  • paramiko.ssh_exception.AuthenticationException: Authentication failed.

    When I updated dpgen from 10.0 to 10.6, I encountered an SSH error.

    dpgen: 10.0 to 10.6

    SSH error for dpgen 10.0

    image

    image

    I solved the problem for version 10.0 by changing look_for_keys in client.py; however, that does not work for version 10.6.

    SSH error for dpgen 10.6

    Traceback (most recent call last):
      File "/data/jccao/app/deepmd2_1_5/bin/dpgen", line 8, in <module>
        sys.exit(main())
      File "/data/jccao/app/deepmd2_1_5/lib/python3.10/site-packages/dpgen/main.py", line 185, in main
        args.func(args)
      File "/data/jccao/app/deepmd2_1_5/lib/python3.10/site-packages/dpgen/generator/run.py", line 3642, in gen_run
        run_iter (args.PARAM, args.MACHINE)
      File "/data/jccao/app/deepmd2_1_5/lib/python3.10/site-packages/dpgen/generator/run.py", line 3607, in run_iter
        run_train (ii, jdata, mdata)
      File "/data/jccao/app/deepmd2_1_5/lib/python3.10/site-packages/dpgen/generator/run.py", line 598, in run_train
        submission = make_submission(
      File "/data/jccao/app/deepmd2_1_5/lib/python3.10/site-packages/dpgen/dispatcher/Dispatcher.py", line 359, in make_submission
        machine = Machine.load_from_dict(abs_mdata_machine)
      File "/data/jccao/app/deepmd2_1_5/lib/python3.10/site-packages/dpdispatcher/machine.py", line 129, in load_from_dict
        context = BaseContext.load_from_dict(machine_dict)
      File "/data/jccao/app/deepmd2_1_5/lib/python3.10/site-packages/dpdispatcher/base_context.py", line 35, in load_from_dict
        context = context_class.load_from_dict(context_dict)
      File "/data/jccao/app/deepmd2_1_5/lib/python3.10/site-packages/dpdispatcher/ssh_context.py", line 350, in load_from_dict
        ssh_context = cls(
      File "/data/jccao/app/deepmd2_1_5/lib/python3.10/site-packages/dpdispatcher/ssh_context.py", line 323, in __init__
        self.ssh_session = SSHSession(**remote_profile)
      File "/data/jccao/app/deepmd2_1_5/lib/python3.10/site-packages/dpdispatcher/ssh_context.py", line 44, in __init__
        self._setup_ssh()
      File "/data/jccao/app/deepmd2_1_5/lib/python3.10/site-packages/dpdispatcher/utils.py", line 162, in wrapper
        return func(*args, **kwargs)
      File "/data/jccao/app/deepmd2_1_5/lib/python3.10/site-packages/dpdispatcher/ssh_context.py", line 166, in _setup_ssh
        ts.auth_password(self.username, self.password)
      File "/data/jccao/app/deepmd2_1_5/lib/python3.10/site-packages/paramiko/transport.py", line 1564, in auth_password
        return self.auth_handler.wait_for_response(my_event)
      File "/data/jccao/app/deepmd2_1_5/lib/python3.10/site-packages/paramiko/auth_handler.py", line 245, in wait_for_response
        raise e
    paramiko.ssh_exception.AuthenticationException: Authentication failed.

    Details

    enhancement 
    opened by caojiachun 6
  • Why do jobs start running in a different folder when I run the same submission twice?

    Hi, guys.

    I have a problem. First, I ran a job submission and then stopped it (with Ctrl+C). Second, I restarted the same job, but dpdispatcher could not resume the submission from the previous folder. It ran a new calculation in a new folder.

    What causes this, and how can I fix it?

    Thanks for any help.

    opened by LiuGaoyong 6
  • `check_status()` always return `JobStatus.unknown` unless `qstat -x` not return `0`

    Some environment info:

    • dpdispatcher: updated with branch master
    • python: 3.8
    • "pbs": torque-6.1

    Description

    dpdispatcher works well on a Machine with batch_type = slurm or shell, but not with PBS. I think qstat -x in Line 53 may behave differently on my nodes than on yours. https://github.com/deepmodeling/dpdispatcher/blob/60f5c90ef3b57dbbb270aeea8565bde74a53fdbd/dpdispatcher/pbs.py#L48-L78

    1. If a job finished moments ago, qstat prints nothing to stdout and only writes to stderr. The code in check_status() then reaches Line 56 and returns JobStatus.finished or JobStatus.terminated. In fact, I do get the correct result (files downloaded, Python exits with 0) after a while.
    2. But the script raises a RuntimeError:
    RuntimeError: job_state for job ...<json info of the job, not important>... is unknown
    

    Locate the problem

    On my nodes, qstat -x seems to return output in some XML-like format, for example (run from the shell, not from the script):

    <?xml version="1.0"?>
    <Data></Data>
    

    I'm not sure whether I did something wrong or whether "pbs" works differently on my system than on yours. Perhaps @felix5572 knows this well. Maybe you are using PBS Pro while I'm using Torque.

    By the way, I can also get job_state from my qstat -x output via the <job_state> tag, rather than with a simple .split()[-2]:

    <?xml version="1.0"?>
    <Data>
    <Job>
    ....
    <job_state>C</job_state>
    ....
    </Job>
    </Data>
    
    opened by saltball 6
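
Reading the job state from the XML form of qstat -x output, as suggested in the report above, could look like the sketch below. It uses only the standard library; the function name job_state_from_qstat_xml is hypothetical, and the sample strings are reduced versions of the two outputs quoted in the issue, not real scheduler output.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def job_state_from_qstat_xml(xml_text: str) -> Optional[str]:
    """Return the text of the first <job_state> element, or None if absent.

    Hypothetical helper for Torque-style `qstat -x` XML output; it is not
    part of dpdispatcher.
    """
    root = ET.fromstring(xml_text)
    node = root.find(".//job_state")
    if node is None or node.text is None:
        return None
    return node.text.strip()


# Reduced samples based on the outputs shown in the issue.
finished = '<?xml version="1.0"?><Data><Job><job_state>C</job_state></Job></Data>'
empty = '<?xml version="1.0"?><Data></Data>'

print(job_state_from_qstat_xml(finished))  # prints: C
print(job_state_from_qstat_xml(empty))     # prints: None
```

Returning None for an empty <Data> element lets the caller decide how to treat a job that the scheduler no longer lists, instead of failing on an index into split output.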
  • Task has no attribute 'load_from_json'

    Traceback (most recent call last):
      File "submitted.py", line 6, in <module>
        task0 = Task.load_from_json('json/task.json')
    AttributeError: type object 'Task' has no attribute 'load_from_json'

    Here is the environment I used:

    Name                 Version      Build                Channel
    bcrypt               3.2.0        pypi_0               pypi
    ca-certificates      2021.5.30    h4653dfc_0           conda-forge
    certifi              2021.5.30    pypi_0               pypi
    cffi                 1.14.6       pypi_0               pypi
    charset-normalizer   2.0.4        pypi_0               pypi
    cryptography         3.4.7        pypi_0               pypi
    dargs                0.2.6        pypi_0               pypi
    dpdispatcher         0.3.39       pypi_0               pypi
    idna                 3.2          pypi_0               pypi
    libcxx               12.0.1       h168391b_0           conda-forge
    libffi               3.3          h9f76cd9_2           conda-forge
    ncurses              6.2          h9aa5885_4           conda-forge
    openssl              1.1.1k       h3422bc3_1           conda-forge
    paramiko             2.7.2        pypi_0               pypi
    pip                  21.2.4       pyhd8ed1ab_0         conda-forge
    pycparser            2.20         pypi_0               pypi
    pynacl               1.4.0        pypi_0               pypi
    python               3.8.10       hf9733c0_1_cpython   conda-forge
    python_abi           3.8          2_cp38               conda-forge
    readline             8.1          hedafd6a_0           conda-forge
    requests             2.26.0       pypi_0               pypi
    setuptools           57.4.0       py38h10201cd_0       conda-forge
    six                  1.16.0       pypi_0               pypi
    sqlite               3.36.0       h72a2b83_0           conda-forge
    tk                   8.6.10       hf7e6567_1           conda-forge
    urllib3              1.26.6       pypi_0               pypi
    wheel                0.37.0       pyhd8ed1ab_1         conda-forge
    xz                   5.2.5        h642e427_1           conda-forge
    zlib                 1.2.11       h31e879b_1009        conda-forge

    documentation enhancement 
    opened by csu1505110121 5
  • Add prepend_script and append_script for job.resource

    It is sometimes necessary to execute commands before or after a task is submitted, to start up or shut down the required environment. Most cases that need a prepend script can be handled by sourcing a static script on the remote server (which may itself be difficult when the remote server is not stable), but running scripts after tasks finish is difficult with the current dpdispatcher resources. In this PR, the prepend_script and append_script parameters have been added to job.resource as lists, in which each item is a single line, like this:

    "prepend_script": [
        "conda activate test_env",
        "export PATH=/path/to/package:$PATH",
        "send_an_email_to [email protected]",
        "sleep 1919810"
    ]
    

    and the expected output:

    conda activate test_env
    export PATH=/path/to/package:$PATH
    send_an_email_to [email protected]
    sleep 1919810
    

    Another small change adds delay=True to the logger, which prevents dpdispatcher.log from being created when there is no log output.

    opened by Cloudac7 4
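
How list-valued prepend_script and append_script entries end up as lines around the task command, as described in the PR above, can be sketched like this. The helper render_job_script and the example command are hypothetical and only illustrate the one-item-per-line mapping; the real job script generated by dpdispatcher contains more than this.

```python
from typing import List


def render_job_script(command: str,
                      prepend_script: List[str],
                      append_script: List[str]) -> str:
    """Join prepend lines, the task command, and append lines, one per line.

    Hypothetical helper: it shows the list-to-lines mapping only and is
    not dpdispatcher's script template.
    """
    lines = [*prepend_script, command, *append_script]
    return "\n".join(lines) + "\n"


script = render_job_script(
    "lmp -i in.lammps",  # hypothetical task command
    prepend_script=["conda activate test_env",
                    "export PATH=/path/to/package:$PATH"],
    append_script=["echo finished"],
)
print(script, end="")
```

Keeping each line as a separate list item means a missing comma between two items silently concatenates them into one command in JSON-like Python literals, or is a syntax error in strict JSON, so the configuration is worth validating before submission.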
  • Re-running dpdispatcher will re-upload forward_files to the remote and replace it

    I found that re-running dpdispatcher would re-upload forward_files to the remote and replace it. But this can cause some problems:

    • If forward_files is large, it will take a lot of time to retransfer files.
    • Sometimes forward_files are rewritten during calculation, then the substitution will cause an error.

    So it seems to me that, since dpdispatcher already checks for an old submission, there is no need to re-upload forward_files.

    opened by LavendaRaphael 4
  • Fix: NoneType Error on DPCloudServer

    After a job group is deleted manually on Bohrium, a NoneType error is raised when submitting a job in the same path. To solve this problem, I added a check for NoneType and a note to help the user submit the job again. (Only the last commit in this PR applies.)

    opened by HuangJiameng 3
  • Support optional compress for SSHContext

    For a large number of small files, compression can be very CPU-intensive and time-consuming. See https://github.com/deepmodeling/dpgen/issues/766. This PR adds tar_compress to the remote_profile dict. The archive is compressed on upload and download if it is True; otherwise compression is skipped.

    opened by HuangJiameng 3
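
The effect of a tar_compress switch like the one described above can be sketched with the standard tarfile module: the only difference between the two cases is the open mode. The function make_archive is hypothetical and is not the code in SSHContext.

```python
import tarfile
from pathlib import Path
from typing import Iterable


def make_archive(archive: Path, files: Iterable[Path], tar_compress: bool = True) -> Path:
    """Pack files into a tar archive, gzip-compressed only when requested.

    Hypothetical helper illustrating the idea of an optional-compression
    flag; it is not taken from dpdispatcher.
    """
    # "w:gz" writes a gzip-compressed archive, "w" a plain uncompressed one.
    mode = "w:gz" if tar_compress else "w"
    with tarfile.open(archive, mode) as tar:
        for f in files:
            # Store each file under its base name to keep the sketch simple.
            tar.add(f, arcname=Path(f).name)
    return archive
```

Skipping gzip trades a larger transfer for less CPU time, which is the trade-off the PR describes for many small files.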
  • add wait_time to Resources for delayed submission

    For some special queues or hosts, job submission may require a waiting time after each job is submitted, to prevent crashes when many tasks are submitted at once. Therefore, the parameter wait_time is added to Resources to support this case.

    The default value of wait_time is 0, and it accepts an int. If set to a value larger than 0, dpdispatcher sleeps for wait_time seconds after each job submission.

    Also, a condition is added to the mirror_gitee action to prevent an error message after pushing to a forked repository instead of the main one.

    opened by Cloudac7 3
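
The wait_time behavior described above, sleeping after each submission when the value is larger than 0, can be sketched as follows. The function submit_with_delay and the submit callable are hypothetical stand-ins; the sleep function is injectable so the logic can be checked without actually waiting.

```python
import time
from typing import Callable, Iterable, List


def submit_with_delay(jobs: Iterable[str],
                      submit: Callable[[str], str],
                      wait_time: int = 0,
                      sleep: Callable[[float], None] = time.sleep) -> List[str]:
    """Submit jobs one by one, pausing wait_time seconds after each one.

    Hypothetical helper: `submit` stands in for whatever call hands a job
    to the scheduler and returns its job id.
    """
    job_ids = []
    for job in jobs:
        job_ids.append(submit(job))
        if wait_time > 0:
            # Give the scheduler time to settle before the next submission.
            sleep(wait_time)
    return job_ids
```

With the default wait_time of 0 no sleep call is made, so existing workflows are unaffected.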
  • RuntimeError: Authentication failed, try to provide password

    We have two batch job systems (Torque and LSF). This example logs in with an RSA key (id_rsa). The results are as follows:

    • On Torque: RuntimeError: Authentication failed, try to provide password
    • On LSF: it works.

    So I tried connecting with paramiko directly.

    import paramiko
    ssh = paramiko.SSHClient()
    ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    ssh.connect("Torque ip", username="user", key_filename="/home/user/.ssh/id_rsa")
    ssh.connect("LSF ip", username="user", key_filename="/home/user/.ssh/id_rsa")
    

    • On Torque: ValueError: q must be exactly 160, 224, or 256 bits long (paramiko)
    • On LSF: it works.

    wontfix 
    opened by scott-5 3
  • paramiko.ssh_exception.AuthenticationException: Authentication failed.

    Summary

    When I use dpgen run param.json machine.json to run a job, I frequently get errors: paramiko.ssh_exception.AuthenticationException: Authentication failed.

    DeePMD-kit Version

    2.1.1

    TensorFlow Version

    tf=2.5.0

    Python Version, CUDA Version, GCC Version, LAMMPS Version, etc

    python=3.8.5 cudatoolkit-11.3.1 gcc=7.5

    Details

    Hello, dear developers,

    When I use dpgen run param.json machine.json to run a job, I frequently get the following errors: Please cite: Yuzhi Zhang, Haidi Wang, Weijie Chen, Jinzhe Zeng, Linfeng Zhang, Han Wang, and Weinan E, DP-GEN: A concurrent learning platform for the generation of reliable deep learning based potential energy models, Computer Physics Communications, 2020, 107206.

    Description

    Traceback (most recent call last):
      File "/HOME/zhoujy/.conda/envs/deepmd-kit2.1.1/bin/dpgen", line 8, in <module>
        sys.exit(main())
      File "/HOME/zhoujy/.conda/envs/deepmd-kit2.1.1/lib/python3.9/site-packages/dpgen/main.py", line 185, in main
        args.func(args)
      File "/HOME/zhoujy/.conda/envs/deepmd-kit2.1.1/lib/python3.9/site-packages/dpgen/generator/run.py", line 3642, in gen_run
        run_iter (args.PARAM, args.MACHINE)
      File "/HOME/zhoujy/.conda/envs/deepmd-kit2.1.1/lib/python3.9/site-packages/dpgen/generator/run.py", line 3628, in run_iter
        run_fp (ii, jdata, mdata)
      File "/HOME/zhoujy/.conda/envs/deepmd-kit2.1.1/lib/python3.9/site-packages/dpgen/generator/run.py", line 3018, in run_fp
        run_fp_inner(iter_index, jdata, mdata, forward_files, backward_files, _vasp_check_fin,
      File "/HOME/zhoujy/.conda/envs/deepmd-kit2.1.1/lib/python3.9/site-packages/dpgen/generator/run.py", line 2985, in run_fp_inner
        submission = make_submission(
      File "/HOME/zhoujy/.conda/envs/deepmd-kit2.1.1/lib/python3.9/site-packages/dpgen/dispatcher/Dispatcher.py", line 359, in make_submission
        machine = Machine.load_from_dict(abs_mdata_machine)
      File "/HOME/zhoujy/.conda/envs/deepmd-kit2.1.1/lib/python3.9/site-packages/dpdispatcher/machine.py", line 134, in load_from_dict
        context = BaseContext.load_from_dict(machine_dict)
      File "/HOME/zhoujy/.conda/envs/deepmd-kit2.1.1/lib/python3.9/site-packages/dpdispatcher/base_context.py", line 41, in load_from_dict
        context = context_class.load_from_dict(context_dict)
      File "/HOME/zhoujy/.conda/envs/deepmd-kit2.1.1/lib/python3.9/site-packages/dpdispatcher/ssh_context.py", line 350, in load_from_dict
        ssh_context = cls(
      File "/HOME/zhoujy/.conda/envs/deepmd-kit2.1.1/lib/python3.9/site-packages/dpdispatcher/ssh_context.py", line 323, in __init__
        self.ssh_session = SSHSession(**remote_profile)
      File "/HOME/zhoujy/.conda/envs/deepmd-kit2.1.1/lib/python3.9/site-packages/dpdispatcher/ssh_context.py", line 44, in __init__
        self._setup_ssh()
      File "/HOME/zhoujy/.conda/envs/deepmd-kit2.1.1/lib/python3.9/site-packages/dpdispatcher/utils.py", line 162, in wrapper
        return func(*args, **kwargs)
      File "/HOME/zhoujy/.conda/envs/deepmd-kit2.1.1/lib/python3.9/site-packages/dpdispatcher/ssh_context.py", line 166, in _setup_ssh
        ts.auth_password(self.username, self.password)
      File "/HOME/zhoujy/.conda/envs/deepmd-kit2.1.1/lib/python3.9/site-packages/paramiko/transport.py", line 1564, in auth_password
        return self.auth_handler.wait_for_response(my_event)
      File "/HOME/zhoujy/.conda/envs/deepmd-kit2.1.1/lib/python3.9/site-packages/paramiko/auth_handler.py", line 245, in wait_for_response
        raise e
    paramiko.ssh_exception.AuthenticationException: Authentication failed.

    Here is my machine.json. The hostname, username and password are correct. For fp, I want to run it on a remote cluster:

    {
      "api_version": "1.0",
      "train": [
        {
          "machine": {
            "context_type": "local",
            "batch_type": "Slurm",
            "machine_type": "Slurm",
            "local_root": "./",
            "_remote_root": "/data/run01/scz5616/HHM/dpmd-project/WFpro/S2tet/work",
            "remote_root": "/HOME/zhoujy/run/dp-test/work"
          },
          "resources": {
            "module_list": [],
            "_source_list": ["/data/run01/scz5616/HHM/dpmd-project/WFpro/S2tet/train.sh"],
            "source_list": ["/HOME/zhoujy/run/dp-test/train.sh"],
            "cpu_per_node": 6,
            "number_node": 1,
            "gpu_per_node": 1,
            "queue_name": "gpu_c128",
            "_exclude_list": [],
            "_time_limit": "24:0:0",
            "group_size": 1
          },
          "command": "dp"
        }
      ],
      "model_devi": [
        {
          "machine": {
            "context_type": "local",
            "batch_type": "Slurm",
            "machine_type": "Slurm",
            "local_root": "./",
            "_remote_root": "/data/run01/scz5616/HHM/dpmd-project/WFpro/S2tet/work",
            "remote_root": "/HOME/zhoujy/run/dp-test/work"
          },
          "resources": {
            "_module_list": [],
            "_source_list": ["/data/run01/scz5616/HHM/dp-test/lammps.sh"],
            "cpu_per_node": 6,
            "number_node": 1,
            "gpu_per_node": 1,
            "queue_name": "gpu",
            "_exclude_list": [],
            "_time_limit": "23:0:0",
            "group_size": 1
          },
          "command": "lmp"
        }
      ],
      "fp": [
        {
          "machine": {
            "context_type": "ssh",
            "batch_type": "Slurm",
            "_machine_type": "Slurm",
            "local_root": "./",
            "remote_root": "/public1/ws133/sc94566/zhou/work",
            "remote_profile": {
              "hostname": "36.103.203.6",
              "username": "[email protected]",
              "port": 22,
              "password": "xxxxxxxxxxxxxxxxxxxx"
            }
          },
          "resources": {
            "number_node": 1,
            "cpu_per_node": 64,
            "_custom_flags": ["-p G1Part_sce"],
            "queue_name": "amd_256",
            "_with_mpi": false,
            "source_list": ["/public1/ws133/sc94566/zhou/env.sh"],
            "_time_limit": "120:0:0",
            "_comment": "that's all",
            "group_size": 100
          },
          "command": "ulimit -s unlimited; srun -n 64 vasp_std"
        }
      ]
    }

    bug 
    opened by zhoujingyu13687306871 2
  • RuntimeError in make_model_devi step

    After I updated the dpdispatcher version to 0.4.18, I got the following error when DPGEN performed the make_model_devi, which can be solved when downgrading dpdispatcher to 0.4.17.

    2022-09-20 04:00:40,341 - INFO : job: 31fcd1c1d95b2fedff35615bf29adbc61e3057e5 315398 finished
    INFO:dpgen:-------------------------iter.000007 task 02--------------------------
    INFO:dpgen:-------------------------iter.000007 task 03--------------------------
    INFO:dpgen:-------------------------iter.000007 task 04--------------------------
    Traceback (most recent call last):
      File "/home/kwwan/.local/bin/dpgen", line 8, in <module>
        sys.exit(main())
      File "/home/kwwan/.local/lib/python3.8/site-packages/dpgen/main.py", line 185, in main
        args.func(args)
      File "/home/kwwan/.local/lib/python3.8/site-packages/dpgen/generator/run.py", line 3914, in gen_run
        run_iter (args.PARAM, args.MACHINE)
      File "/home/kwwan/.local/lib/python3.8/site-packages/dpgen/generator/run.py", line 3787, in run_iter
        run_model_devi (ii, jdata, mdata)
      File "/home/kwwan/.local/lib/python3.8/site-packages/dpgen/generator/run.py", line 1614, in run_model_devi
        run_md_model_devi(iter_index,jdata,mdata)
      File "/home/kwwan/.local/lib/python3.8/site-packages/dpgen/generator/run.py", line 1608, in run_md_model_devi
        submission.run_submission()
      File "/home/kwwan/software/Anaconda3/lib/python3.8/site-packages/dpdispatcher/submission.py", line 176, in run_submission
        self.generate_jobs()
      File "/home/kwwan/software/Anaconda3/lib/python3.8/site-packages/dpdispatcher/submission.py", line 340, in generate_jobs
        self.bind_machine(self.machine)
      File "/home/kwwan/software/Anaconda3/lib/python3.8/site-packages/dpdispatcher/submission.py", line 163, in bind_machine
        self.machine.context.bind_submission(self)
      File "/home/kwwan/software/Anaconda3/lib/python3.8/site-packages/dpdispatcher/ssh_context.py", line 389, in bind_submission
        self.block_checkcall(f"mv {old_remote_root} {self.remote_root}")
      File "/home/kwwan/software/Anaconda3/lib/python3.8/site-packages/dpdispatcher/ssh_context.py", line 537, in block_checkcall
        raise RuntimeError("Get error code %d in calling %s through ssh with job: %s . message: %s" %
    RuntimeError: Get error code 1 in calling mv /data/home/scv3616/run/wankw/temp/dpmd_remote/447fbf8e9ee0ecc33a67e8f01f1847a2d3888f29 /data/home/scv3616/run/wankw/temp/dpmd_remote/5b3271c64c830aca6cfc836322191dc2482054ad through ssh with job: 5b3271c64c830aca6cfc836322191dc2482054ad . message:
    
    opened by wankiwi 5
  • ratio_unfinished for group_size > 1

    I want dpdispatcher to handle those failed tasks via parameter ratio_unfinished, but I found that it doesn't work as expected when group_size exceeds 1. In this case, even if only one task in the group fails, all tasks in the group are deleted, which is unexpected.

    enhancement 
    opened by LavendaRaphael 0
  • Add more examples

    I have added two examples below:

    • https://docs.deepmodeling.com/projects/dpdispatcher/en/latest/examples/expanse.html
    • https://docs.deepmodeling.com/projects/dpdispatcher/en/latest/examples/shell.html

    Here, I would like to solicit more examples to add to the documentation. They should cover more job scheduling packages including PBS, LSF, Lebesgue, etc.

    documentation 
    opened by njzjz 0
Releases(v0.5.1)
  • v0.5.1(Jan 6, 2023)

    What's Changed

    • fix local context with uploading files in the subdirectory by @njzjz in https://github.com/deepmodeling/dpdispatcher/pull/300

    Full Changelog: https://github.com/deepmodeling/dpdispatcher/compare/v0.5.0...v0.5.1

  • v0.5.0(Jan 5, 2023)

    What's Changed

    • fix codecov by @njzjz in https://github.com/deepmodeling/dpdispatcher/pull/274
    • Add prepend_script and append_script for job.resource by @Cloudac7 in https://github.com/deepmodeling/dpdispatcher/pull/273
    • Remove os.chdir() method; add support for other key_file types other than RSA by @Cloudac7 in https://github.com/deepmodeling/dpdispatcher/pull/275
    • add tests for Python 3.11, macos, and windows by @njzjz in https://github.com/deepmodeling/dpdispatcher/pull/276
    • skip building docker out of deepmodeling by @njzjz in https://github.com/deepmodeling/dpdispatcher/pull/278
    • add cloudserver to the docker by @njzjz in https://github.com/deepmodeling/dpdispatcher/pull/282
    • add Optional to type hints when default is None by @njzjz in https://github.com/deepmodeling/dpdispatcher/pull/283
    • avoid compressing duplicated files by @njzjz in https://github.com/deepmodeling/dpdispatcher/pull/284
    • fix shell when filename contains special characters by @njzjz in https://github.com/deepmodeling/dpdispatcher/pull/285
    • add type checker by @njzjz in https://github.com/deepmodeling/dpdispatcher/pull/286
    • disable tqdm when stderr is redirected by @njzjz in https://github.com/deepmodeling/dpdispatcher/pull/277
    • fix bohrium remote root by @njzjz in https://github.com/deepmodeling/dpdispatcher/pull/287
    • fix pass action by @njzjz in https://github.com/deepmodeling/dpdispatcher/pull/292
    • Support redirect log from bohrium which will be used by dflow by @KZHIWEI in https://github.com/deepmodeling/dpdispatcher/pull/298
    • add look_for_keys option by @njzjz in https://github.com/deepmodeling/dpdispatcher/pull/299

    Full Changelog: https://github.com/deepmodeling/dpdispatcher/compare/v0.4.19...v0.5.0

  • v0.4.19(Nov 3, 2022)

    What's Changed

    • migrate from setup.py to pyproject.toml by @njzjz in https://github.com/deepmodeling/dpdispatcher/pull/265
    • document different contexts and batches by @njzjz in https://github.com/deepmodeling/dpdispatcher/pull/266
    • Fix typo in pr#266 by @HuangJiameng in https://github.com/deepmodeling/dpdispatcher/pull/267
    • Change Lebesgue API Service To Bohrium API Service by @KZHIWEI in https://github.com/deepmodeling/dpdispatcher/pull/268
    • drop Python 3.6 support by @njzjz in https://github.com/deepmodeling/dpdispatcher/pull/270
    • support machine and context alias by @njzjz in https://github.com/deepmodeling/dpdispatcher/pull/269

    Full Changelog: https://github.com/deepmodeling/dpdispatcher/compare/v0.4.18...v0.4.19

  • v0.4.18(Sep 18, 2022)

    What's Changed

    • add retry for totp authentication by @PKUfjh in https://github.com/deepmodeling/dpdispatcher/pull/246
    • fix authing using secrets and totp by @njzjz in https://github.com/deepmodeling/dpdispatcher/pull/251
    • change source of mock-ssh-server; uncomment 3.8 by @njzjz in https://github.com/deepmodeling/dpdispatcher/pull/252
    • add ci tests on slurm by @njzjz in https://github.com/deepmodeling/dpdispatcher/pull/253
    • fix codecov upload by @njzjz in https://github.com/deepmodeling/dpdispatcher/pull/255
    • add tests for openpbs by @njzjz in https://github.com/deepmodeling/dpdispatcher/pull/256
    • add tests for slurm job array by @njzjz in https://github.com/deepmodeling/dpdispatcher/pull/257
    • migrate ssh ci test to docker by @njzjz in https://github.com/deepmodeling/dpdispatcher/pull/258
    • support port and key_filename for rsync by @njzjz in https://github.com/deepmodeling/dpdispatcher/pull/249
    • add tests for LazyLocalContext by @njzjz in https://github.com/deepmodeling/dpdispatcher/pull/259
    • add tests for empty transfer by @njzjz in https://github.com/deepmodeling/dpdispatcher/pull/260
    • ssh: move remote_root when it changes by @njzjz in https://github.com/deepmodeling/dpdispatcher/pull/261

    Full Changelog: https://github.com/deepmodeling/dpdispatcher/compare/v0.4.17...v0.4.18

  • v0.4.16(Aug 11, 2022)

  • v0.4.14(Jul 12, 2022)

  • v0.4.12(Jun 30, 2022)

  • v0.4.11(Jun 22, 2022)

    What's Changed

    • docs: use dargs directive by @njzjz in https://github.com/deepmodeling/dpdispatcher/pull/196
    • catch socket.timeout for ut by @njzjz in https://github.com/deepmodeling/dpdispatcher/pull/197
    • docs: add links for classes, methods, and parameters by @njzjz in https://github.com/deepmodeling/dpdispatcher/pull/198
    • set Machine and BaseContext as abstract classes by @njzjz in https://github.com/deepmodeling/dpdispatcher/pull/199
    • follow symlink in LocalContext downloading by @njzjz in https://github.com/deepmodeling/dpdispatcher/pull/201

    Full Changelog: https://github.com/deepmodeling/dpdispatcher/compare/v0.4.10...v0.4.11

  • v0.4.9(May 9, 2022)

    Breaking Change

    • enable strict check for arguments by @njzjz in https://github.com/deepmodeling/dpdispatcher/pull/183

      The hash of the submission may change in this version. Do not upgrade dpdispatcher before a submission is finished.

    What's Changed

    • Fix symlink subdirs not uploaded to remote by @LavendaRaphael in https://github.com/deepmodeling/dpdispatcher/pull/185
    • allow batch_type with strict check; check kwargs when batch_type exists by @njzjz in https://github.com/deepmodeling/dpdispatcher/pull/186
    • doc: add links to DP-GUI by @njzjz in https://github.com/deepmodeling/dpdispatcher/pull/187

    Full Changelog: https://github.com/deepmodeling/dpdispatcher/compare/v0.4.8...v0.4.9

  • 0.4.0(Dec 2, 2021)

  • v0.3.46(Nov 20, 2021)

  • v0.3.44(Oct 29, 2021)

  • v0.3.43(Oct 12, 2021)

  • v0.3.42(Sep 28, 2021)

Owner
DeepModeling
Define the future of scientific computing together
DeepModeling
A powerful workflow engine implemented in pure Python

Spiff Workflow Summary Spiff Workflow is a workflow engine implemented in pure Python. It is based on the excellent work of the Workflow Patterns init

Samuel 1.3k Jan 08, 2023
Python job scheduling for humans.

schedule Python job scheduling for humans. Run Python functions (or any other callable) periodically using a friendly syntax. A simple to use API for

Dan Bader 10.4k Jan 02, 2023
A Lightweight Cluster/Cloud VM Job Management Tool 🚀

Lightweight Cluster/Cloud VM Job Management 🚀 Are you looking for a tool to manage your training runs locally, on Slurm/Open Grid Engine clusters, SS

29 Dec 12, 2022
Automate SQL Jobs Monitoring with python

Automate_SQLJobsMonitoring_python Using python 3rd party modules we can automate

Aejaz Ayaz 1 Dec 27, 2021
Aiorq is a distributed task queue with asyncio and redis

Aiorq is a distributed task queue with asyncio and redis, which rewrite from arq to make improvement and include web interface.

PY-GZKY 5 Mar 18, 2022
Crontab jobs management in Python

Plan Plan is a Python package for writing and deploying cron jobs. Plan will convert Python code to cron syntax. You can easily manage you

Shipeng Feng 1.2k Dec 28, 2022
A simple scheduler tool that provides desktop notifications about classes and opens their meet links in the browser automatically at the start of the class.

This application provides desktop notifications about classes and opens their meet links in browser automatically at the start of the class.

Anshit 14 Jun 29, 2022
dragonscales is a highly customizable asynchronous job-scheduler framework

dragonscales 🐉 dragonscales is a highly customizable asynchronous job-scheduler framework. This framework is used to scale the execution of multiple

Sorcero 2 May 16, 2022
Vertigo is an application used to schedule @code4tomorrow classes.

Vertigo Vertigo is an application used to schedule @code4tomorrow classes. It uses the Google Sheets API and is deployed using AWS. Documentation Lear

Ben Nguyen 4 Feb 10, 2022
A task scheduler with task scheduling, timing and task completion time tracking functions

A task scheduler with task scheduling, timing and task completion time tracking functions. Could be helpful for time management in daily life.

ArthurLCW 0 Jan 15, 2022
The easiest way to automate your data

Hello, world! 👋 We've rebuilt data engineering for the data science era. Prefect is a new workflow management system, designed for modern infrastruct

Prefect 10.9k Jan 04, 2023
A calendaring app for Django. It is now stable, Please feel free to use it now. Active development has been taken over by bartekgorny.

Django-schedule A calendaring/scheduling application, featuring: one-time and recurring events calendar exceptions (occurrences changed or cancelled)

Tony Hauber 814 Dec 26, 2022
CoSA: Scheduling by Constrained Optimization for Spatial Accelerators

CoSA is a scheduler for spatial DNN accelerators that generate high-performance schedules in one shot using mixed integer programming

UC Berkeley Architecture Research 44 Dec 13, 2022
Remote task execution tool

Gunnery Gunnery is a multipurpose task execution tool for distributed systems with web-based interface. If your application is divided into multiple s

Gunnery 747 Nov 09, 2022
Another Scheduler is a Kubernetes controller that automatically starts, stops, or restarts pods from a deployment at a specified time using a cron annotation.

Another Scheduler Another Scheduler is a Kubernetes controller that automatically starts, stops, or restarts pods from a deployment at a specified tim

Diego Najar 66 Nov 19, 2022
A Python concurrency scheduling library, compatible with asyncio and trio.

aiometer aiometer is a Python 3.6+ concurrency scheduling library compatible with asyncio and trio and inspired by Trimeter. It makes it easier to exe

Florimond Manca 182 Dec 26, 2022
Python-Repeated-Timer is an open-source & highly performing timer using only standard-libraries.

Python Repeated Timer Python-Repeated-Timer is an open-source & highly performing timer using only standard-libraries.

TACKHYUN JUNG 3 Oct 09, 2022
Clepsydra is a mini framework for task scheduling

Intro Clepsydra is a mini framework for task scheduling All parts are designed to be replaceable. Main ideas are: No pickle! Tasks are stored in reada

Andrey Tikhonov 15 Nov 04, 2022
A flexible python library for building your own cron-like system, with REST APIs and a Web UI.

Nextdoor Scheduler ndscheduler is a flexible python library for building your own cron-like system to schedule jobs, which is to run a tornado process

1k Dec 15, 2022