Slurm tutorial
2022-06-22 04:40:00 【Humorous Jing Kejun】
Common terms
user: user name
node: compute node
core: CPU core
job: a job
job step: a job step; a single job can contain multiple job steps
partition: a partition; a job must run in a specific partition
QOS: quality of service; can be understood as the limits on the CPU, memory, and other resources a user may use
tasks: number of tasks; by default each task uses one CPU core, so this can be understood as the number of CPU cores the job needs
socket: CPU socket; can be understood as the number of physical CPUs
stdout: standard output file; receives what a program prints when it runs normally, usually the information that would appear on the screen
stderr: standard error file; receives what a program prints when it hits an error, usually the information that would appear on the screen
Commands
sbatch: submit a job script; the script typically contains one or more srun commands that launch parallel tasks
sinfo: show partition and node status; the output can be filtered and sorted with command-line options
squeue: show the jobs in the queue and their states
scancel: cancel queued or running jobs
scontrol: display or modify Slurm jobs, partitions, nodes, and other configuration
sacctmgr: display and set account-related information such as QOS
sacct: display historical job information
srun: run a parallel job; its many options include the minimum and maximum number of nodes, the number of processors, and nodes to use or exclude
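As a rough illustration of how these commands are usually invoked, the sketch below builds the argument list for a few common calls. The helper name `slurm_command` and the action labels are made up for this example; the commands and flags themselves (`squeue -u`, `sinfo -p`, `sacct -j`, `scontrol show job`) are standard Slurm usage.

```python
import shlex


def slurm_command(action, target):
    """Return the argv list for a few common Slurm invocations.

    `action` is a label invented for this sketch, not a Slurm concept.
    """
    templates = {
        "submit": ["sbatch", "{t}"],                   # submit a job script
        "queue": ["squeue", "-u", "{t}"],              # jobs of one user
        "cancel": ["scancel", "{t}"],                  # cancel a job by ID
        "partition": ["sinfo", "-p", "{t}"],           # state of one partition
        "history": ["sacct", "-j", "{t}"],             # accounting record of a job
        "detail": ["scontrol", "show", "job", "{t}"],  # full description of a job
    }
    return [part.format(t=target) for part in templates[action]]


# Print the command line instead of running it, so this works without Slurm.
print(shlex.join(slurm_command("detail", 12345)))
```

The returned list can be passed directly to `subprocess.run` on a machine where Slurm is installed.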
View node status (sinfo)

- PARTITION: the partition the node belongs to
- AVAIL: partition state; up means available, down means unavailable
- TIMELIMIT: maximum run time of a job; infinite means unlimited, otherwise the format is days-hours:minutes:seconds
- NODES: number of nodes
- NODELIST: list of node names
- STATE: node state; possible states include:
  - allocated, alloc: allocated
  - completing, comp: completing
  - down: down
  - drained, drain: drained; the node no longer accepts new jobs
  - fail: failed
  - idle: idle
  - mixed: mixed; the node is running jobs but still has some idle CPU cores and can accept new jobs
  - reserved, resv: reserved
  - unknown, unk: state unknown
  - a * suffix on the state means the node is not responding
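The TIMELIMIT and STATE columns above can be decoded mechanically. The following is a minimal sketch, assuming the values look like the ones described in the list; `parse_timelimit`, `parse_state`, and `STATE_ALIASES` are names chosen for this example and are not part of Slurm.

```python
# Short forms listed above, mapped to their full state names.
STATE_ALIASES = {
    "alloc": "allocated",
    "comp": "completing",
    "drain": "drained",
    "resv": "reserved",
    "unk": "unknown",
}


def parse_timelimit(text):
    """Convert a TIMELIMIT string to seconds; None means unlimited."""
    if text == "infinite":
        return None
    days = 0
    if "-" in text:
        day_part, text = text.split("-", 1)
        days = int(day_part)
    fields = [int(f) for f in text.split(":")]
    # Right-align the fields so that "30:00" is read as minutes:seconds.
    hours, minutes, seconds = [0] * (3 - len(fields)) + fields
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def parse_state(text):
    """Split a STATE value into (full state name, node is responding)."""
    responding = not text.endswith("*")
    name = text.rstrip("*")
    return STATE_ALIASES.get(name, name), responding


print(parse_timelimit("1-12:00:00"))
print(parse_state("idle*"))
```

Treating a shorter value such as `30:00` as minutes:seconds is an assumption of this sketch about how such values are displayed.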
View partition information (scontrol show partition)

- DisableRootJobs: whether root is forbidden from submitting jobs
- MaxTime: maximum run time of a job
- LLN: whether jobs are scheduled onto the least-loaded nodes
- MaxNodes: maximum number of nodes per job
- Hidden: whether the partition is hidden
- Default: whether this is the default partition
- OverSubscribe: whether resources in the partition may be shared by more than one job at a time
- ExclusiveUser: whether nodes are allocated exclusively to a single user
View job information (squeue)

- JOBID: job ID
- PARTITION: partition name
- NAME: job name
- USER: user name
- ST: job state; common states include:
  - PD, Q: queued (PENDING)
  - R: running (RUNNING)
  - CA: cancelled (CANCELLED)
  - CG: completing (COMPLETING)
  - F: failed (FAILED)
  - TO: timed out (TIMEOUT)
  - NF: node failure (NODE_FAIL)
  - CD: completed (COMPLETED)
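To show how the ST codes can be used, here is a small sketch that counts jobs per state from squeue-style text. The sample rows, user names, and node names are invented for illustration, and the parser assumes whitespace-separated columns with no spaces inside job names.

```python
from collections import Counter

# Two-letter codes from the list above, mapped to Slurm's full state names.
STATE_NAMES = {
    "PD": "PENDING",
    "R": "RUNNING",
    "CA": "CANCELLED",
    "CG": "COMPLETING",
    "F": "FAILED",
    "TO": "TIMEOUT",
    "NF": "NODE_FAIL",
    "CD": "COMPLETED",
}

# Hypothetical squeue output, not captured from a real cluster.
SAMPLE = """\
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
101 debug sim alice R 5:12 1 node01
102 debug sim alice PD 0:00 2 (Resources)
103 batch post bob PD 0:00 1 (Priority)
"""


def count_states(squeue_text):
    """Count jobs per full state name, locating the ST column via the header."""
    lines = squeue_text.strip().splitlines()
    st_index = lines[0].split().index("ST")
    codes = (line.split()[st_index] for line in lines[1:])
    # Unknown codes are kept as-is rather than dropped.
    return Counter(STATE_NAMES.get(code, code) for code in codes)


print(count_states(SAMPLE))
```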
Submit a job in batch mode
1. The user writes a job script
2. The user submits the job
3. The job waits in the queue for resources to be allocated
4. Once resources are allocated, the job script is loaded and executed on the first allocated node
5. When the script finishes, the resources are released
6. The user can view the results in the output file
The job script is a text file whose first line starts with "#!" and specifies the interpreter
Inside the script, srun can be used to launch the computing tasks
A job can contain multiple job steps
The script is submitted on the management node but actually executes on a compute node
The script's output is written to the output file
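The workflow above can be sketched in a few lines. This is only an illustration: `JOB_SCRIPT`, `parse_job_id`, `default_output_file`, and `submit` are names made up for this example. It relies on two standard sbatch behaviours: printing `Submitted batch job <id>` on success, and writing output to `slurm-<id>.out` when no output file is specified.

```python
import re
import shutil
import subprocess

# A minimal job script: interpreter line first, then one job step via srun.
JOB_SCRIPT = """#!/bin/bash
srun hostname
"""


def parse_job_id(sbatch_output):
    """Extract the job ID from sbatch's confirmation message."""
    match = re.search(r"Submitted batch job (\d+)", sbatch_output)
    if match is None:
        raise ValueError(f"unexpected sbatch output: {sbatch_output!r}")
    return int(match.group(1))


def default_output_file(job_id):
    """Name of the output file sbatch uses when none is requested."""
    return f"slurm-{job_id}.out"


def submit(script_path):
    """Submit a script and return its job ID; needs a machine with Slurm."""
    if shutil.which("sbatch") is None:
        raise RuntimeError("sbatch not found; run this on a Slurm login node")
    result = subprocess.run(
        ["sbatch", script_path], capture_output=True, text=True, check=True
    )
    return parse_job_id(result.stdout)
```

On a login node, writing `JOB_SCRIPT` to a file and calling `submit` on its path returns the job ID, and `default_output_file` gives the file to inspect once the job has finished.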
Here are some common job resource request options. Each can be written into the script as a line of the form #SBATCH -xx xxx
-J, --job-name: job name
-N, --nodes: number of nodes
-n, --ntasks: number of tasks, which by default equals the number of CPU cores used
--mem: physical memory required on each node
-t, --time: run time limit; a job that exceeds it is terminated
-p, --partition: partition to run in
--reservation: resource reservation to use
-w, --nodelist: nodes the job must run on
-x, --exclude: nodes that must not be assigned to the job
--ntasks-per-node: number of tasks (CPU cores) to use on each node
--begin: start time of the job
-D, --chdir: working directory of the script or command
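The options above can be turned into `#SBATCH` lines programmatically. The sketch below assumes the long option names from the list; `render_job_script` is a hypothetical helper, and the partition name `debug` and the other values are placeholders rather than settings from any real cluster.

```python
def render_job_script(options, commands, shell="/bin/bash"):
    """Build job script text: shebang, #SBATCH directives, then commands."""
    lines = [f"#!{shell}"]
    for name, value in options.items():
        # Long form, e.g. "#SBATCH --nodes=1".
        lines.append(f"#SBATCH --{name}={value}")
    lines.append("")
    lines.extend(commands)
    return "\n".join(lines) + "\n"


# Placeholder values for illustration only.
EXAMPLE_OPTIONS = {
    "job-name": "demo",
    "nodes": 1,
    "ntasks": 4,
    "time": "00:10:00",
    "partition": "debug",
}

print(render_job_script(EXAMPLE_OPTIONS, ["srun hostname"]))
```

All `#SBATCH` lines are emitted before the first command, since Slurm stops reading directives once it reaches the first executable line of the script.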
Reprinted from:
https://cloud.tencent.com/developer/article/1672432