Slurm tutorial

2022-06-22 04:40:00 Humorous Jing Kejun

Common terms

user: user name
node: compute node
core: CPU core
job: a job
job step: a step within a job; a single job can have multiple job steps
partition: a partition; a job must run in a specific partition
QOS: quality of service; can be understood as the limits on the CPU, memory, and other resources a user may use
tasks: number of tasks; by default one task uses one CPU core, so it can be understood as the number of CPU cores the job needs
socket: CPU socket; can be understood as the number of physical CPUs
stdout: standard output file; receives what the program prints when it runs normally, i.e. what would usually appear on the screen
stderr: standard error file; receives what the program prints when it hits an error, which would also usually appear on the screen

Commands

sbatch: submit a job script; the script typically contains one or more srun commands that launch parallel tasks
sinfo: show partition or node status; the output can be filtered and sorted with options
squeue: show the jobs in the queue and their states
scancel: cancel queued or running jobs
scontrol: display or modify Slurm jobs, partitions, nodes, and other configuration
sacctmgr: display and set accounts and associated information such as QOS
sacct: display historical job information
srun: run a parallel job; its many options include the minimum and maximum number of nodes, the number of processors, and nodes to include or exclude

Node status view

The sinfo command prints the following columns:

  • PARTITION: the partition the nodes belong to
  • AVAIL: partition state; up means available, down means unavailable
  • TIMELIMIT: maximum run time; infinite means no limit, otherwise the format is days-hours:minutes:seconds
  • NODES: number of nodes
  • NODELIST: list of node names
  • STATE: node state; possible states include:
    • allocated, alloc: allocated to jobs
    • completing, comp: jobs on the node are finishing
    • down: the node is down
    • drained, drain: the node has been drained and accepts no new jobs
    • fail: the node is failing or has failed and is unavailable
    • idle: idle
    • mixed: some of the node's CPU cores are running jobs and others are free, so new jobs can still be accepted
    • reserved, resv: held by a resource reservation
    • unknown, unk: state unknown
    • A * suffix on the state means the node is not responding
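
The column layout above lends itself to simple text filtering. The following is a minimal sketch, assuming sinfo prints its default six columns in the order PARTITION, AVAIL, TIMELIMIT, NODES, STATE, NODELIST; the helper name idle_rows is made up for this example.

```shell
# Print "partition node-count nodelist" for every sinfo row whose STATE is
# exactly "idle". Rows such as "idle*" (node not responding) are skipped.
# Expects sinfo's default layout on stdin:
#   PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
idle_rows() {
    awk 'NR > 1 && $5 == "idle" { print $1, $4, $6 }'
}

if command -v sinfo >/dev/null 2>&1; then
    sinfo | idle_rows     # idle node groups, one line per partition/state row
    sinfo -N -l           # one line per node, long format
else
    echo "sinfo not found; run this on a Slurm login node" >&2
fi
```

If a site has customised the sinfo output format, the column positions in the awk filter would need to be adjusted.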

View partition information

The scontrol show partition command reports fields including:

  • DisableRootJobs: whether root is barred from submitting jobs
  • MaxTime: maximum run time of a job
  • LLN: whether to schedule onto the least-loaded nodes
  • MaxNodes: maximum number of nodes per job
  • Hidden: whether the partition is hidden
  • Default: whether this is the default partition
  • OverSubscribe: whether resources may be oversubscribed, i.e. shared by more than one job
  • ExclusiveUser: whether nodes are allocated exclusively to a single user
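
Since scontrol show partition prints its fields as Key=Value pairs, a single field can be pulled out with standard text tools. This is a rough sketch under that assumption; part_field is a hypothetical helper name, and it assumes values contain no spaces.

```shell
# Read "scontrol show partition" output on stdin and print the value of the
# field named in $1 (for example MaxTime or OverSubscribe).
part_field() {
    tr ' ' '\n' | awk -v k="$1" 'index($0, k "=") == 1 { print substr($0, length(k) + 2) }'
}

if command -v scontrol >/dev/null 2>&1; then
    # With several partitions this prints one value per partition.
    scontrol show partition | part_field MaxTime
else
    echo "scontrol not found; run this on a Slurm login node" >&2
fi
```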

View job information

The squeue command prints columns including:

  • JOBID: job ID
  • PARTITION: partition name
  • NAME: job name
  • USER: user name
  • ST: job state; common states include:
    • PD: queued, PENDING
    • R: running, RUNNING
    • CA: cancelled, CANCELLED
    • CG: finishing, COMPLETING
    • F: failed, FAILED
    • TO: timed out, TIMEOUT
    • NF: node failure, NODE_FAIL
    • CD: completed, COMPLETED
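
The state codes above can be tallied to get a quick overview of one's own jobs. A small sketch, assuming the squeue format specifier %t (compact state code) and the -h option to suppress the header; count_states is a name invented for this example.

```shell
# Read one state code per line on stdin and print "STATE count" per state.
count_states() {
    sort | uniq -c | awk '{ print $2, $1 }'
}

if command -v squeue >/dev/null 2>&1; then
    squeue -u "$USER"                               # full listing of your jobs
    squeue -h -u "$USER" -o "%t" | count_states     # e.g. how many PD vs. R
else
    echo "squeue not found; run this on a Slurm login node" >&2
fi
```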

Submit a job in batch mode

1. The user writes a job script
2. The user submits the job
3. The job waits in the queue for resources to be allocated
4. The job script is loaded and executed on the first allocated node
5. The script finishes and the resources are released
6. The user views the results in the output file
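
Steps 2 to 6 can be sketched as a small wrapper around sbatch and squeue. This is only an illustration: it assumes a job script already exists (job.sh is a hypothetical file name), that the script does not override the default output file slurm-<jobid>.out, and that polling every 10 seconds is acceptable on your cluster. The function names are made up for this example.

```shell
# "sbatch --parsable" prints "jobid" or "jobid;cluster"; keep only the job ID.
parse_jobid() {
    printf '%s\n' "${1%%;*}"
}

# Submit a job script (step 2), wait while it is queued or running
# (steps 3-5), then print its output file (step 6).
submit_and_wait() {
    local script=$1 out jobid
    out=$(sbatch --parsable "$script") || return 1
    jobid=$(parse_jobid "$out")
    echo "submitted job $jobid"
    # squeue -h -j prints nothing once the job has left the queue.
    while [ -n "$(squeue -h -j "$jobid" 2>/dev/null)" ]; do
        sleep 10
    done
    cat "slurm-${jobid}.out"
}

if command -v sbatch >/dev/null 2>&1 && [ -f job.sh ]; then
    submit_and_wait job.sh
fi
```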

The job script is a text file whose first line starts with "#!" and specifies the interpreter
Within the script, srun can be used to launch compute tasks
A job can contain multiple job steps
The script is submitted from the management node but actually executes on the compute nodes
The script's output is written to the output file
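
As an illustration of these points, the sketch below writes a job script with an interpreter line and two srun calls, each of which becomes one job step. The file name steps.sh, the task counts, and the output file pattern are arbitrary choices for this example.

```shell
# Write a small job script with two job steps.
cat > steps.sh <<'EOF'
#!/bin/bash
#SBATCH -n 4
#SBATCH -o steps-%j.out

# Each srun call below is one job step; both write to the job's output file.
srun -n 4 hostname
srun -n 1 date
EOF

# Syntax-check the script locally; this does not contact Slurm.
bash -n steps.sh && echo "steps.sh: syntax OK"
```

The script would then be submitted with sbatch steps.sh. Where job accounting is enabled, sacct -j <jobid> lists the two steps as <jobid>.0 and <jobid>.1.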

Here are some common job resource request parameters; each can be written into the script in the form #SBATCH -xx xxx

-J, --job-name: job name
-N, --nodes: number of nodes
-n, --ntasks: number of tasks, which by default equals the number of CPU cores used
--mem: memory required on each node
-t, --time: run time limit; jobs that exceed it are terminated
-p, --partition: partition to run in
--reservation: name of the resource reservation to use
-w, --nodelist: nodes the job must run on
-x, --exclude: nodes to exclude from the job's allocation
--ntasks-per-node: number of tasks to run on each node (by default one CPU core per task)
--begin: earliest start time of the job
-D, --chdir: working directory of the script or command
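
A job script header using several of these parameters might look like the sketch below. All values are placeholders: the partition name debug, the job name, and the resource amounts are assumptions that need to be replaced with values valid on your cluster. The final check uses sbatch --test-only, which validates the request without actually submitting the job.

```shell
# Write a job script whose header requests resources via #SBATCH lines.
cat > demo.sh <<'EOF'
#!/bin/bash
# Job name and partition (the partition name is a placeholder)
#SBATCH -J demo
#SBATCH -p debug
# 2 nodes with 4 tasks each, 4 GB of memory per node, 30 minute limit
#SBATCH -N 2
#SBATCH --ntasks-per-node=4
#SBATCH --mem=4G
#SBATCH -t 00:30:00

srun hostname
EOF

if command -v sbatch >/dev/null 2>&1; then
    # Validate the request without queuing the job.
    sbatch --test-only demo.sh || echo "validation failed; check the placeholder values" >&2
else
    echo "sbatch not found; demo.sh was written but not validated" >&2
fi
```

Options given on the sbatch command line take precedence over the #SBATCH lines in the script, so the same script can be reused with, for example, a different time limit.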

Reprinted from:

https://cloud.tencent.com/developer/article/1672432
