Slurm (Simple Linux Utility for Resource Management, http://slurm.schedmd.com/) is an open-source, fault-tolerant, and highly scalable cluster management and job scheduling system for Linux clusters large and small. Slurm requires no kernel modifications and is relatively self-contained, which avoids interference between nodes and improves efficiency.
As a cluster workload manager, Slurm has three key functions. First, it allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform their computations. Second, it provides a framework for starting, executing, and monitoring work (typically parallel jobs) on the allocated nodes. Finally, it arbitrates contention for resources by managing a queue of pending jobs. Every job, whether for debugging or production computation, can be submitted with the interactive srun command, the batch-style sbatch command, or the allocation-style salloc command; after submission, related commands can be used to query job status.
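As a quick illustration, the three submission styles look like this (a minimal sketch; the script name job.sh and the resource counts are placeholders):

```bash
# Interactive: launch 2 tasks on 1 node and stream output to the terminal
srun -N1 -n2 hostname

# Batch: submit a script (illustrative name) and return immediately
sbatch job.sh

# Allocation: reserve resources first, then launch job steps inside them
salloc -N1 -n2
srun hostname   # runs inside the allocation
exit            # release the allocation

# Check job status
squeue
```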
To manage and allocate resources more effectively, optimize job scheduling, improve system utilization, and meet diverse workload requirements, queues are an indispensable part of the scheduling configuration. Well-designed queues ensure that high-priority jobs obtain the resources they need first, maximizing resource utilization.
This article describes how, in a Slurm environment, an appropriate queue configuration strategy can schedule as many jobs as possible for best performance whenever jobs are submitted or job states change.
Slurm executes jobs in priority order; if a partition contains a job that cannot be scheduled, the jobs queued behind it wait. A high-priority job can preempt the resources of low-priority jobs, and a preempted job can be cancelled, requeued, or suspended. With backfill scheduling enabled (the default), Slurm periodically (every bf_interval) checks whether low-priority jobs can run without delaying any high-priority job; such jobs occupy whole-node resources and may trigger whole-node preemption. Scheduling is configured in slurm.conf through SchedulerType (the sched/backfill plugin by default) and the detailed SchedulerParameters; for the specific options, see the official documentation (https://slurm.schedmd.com/sched_config.html).
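The corresponding lines in slurm.conf might look like the following (a sketch; bf_interval=30 is Slurm's default backfill interval, and other SchedulerParameters options are omitted):

```
# slurm.conf: scheduler plugin and backfill tuning (sketch)
SchedulerType=sched/backfill
SchedulerParameters=bf_interval=30   # seconds between backfill passes
```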
During scheduling, all jobs are merged into a single list and ordered by a priority algorithm. Slurm supports the following two queue types:
By default, Slurm assigns job priorities on a first-in, first-out (FIFO) basis. Priority scheduling is configured in slurm.conf; set the PriorityType parameter to choose how priorities are computed.
# 1. Locate and edit slurm.conf
sudo nano /etc/slurm/slurm.conf
# 2. Select the basic (FIFO) priority plugin
PriorityType=priority/basic

Slurm multifactor scheduling determines a job's priority as a weighted sum of the following factors: job age (time spent waiting in the queue), association, fair-share (resources allocated vs. consumed), job size, the user's nice value, partition, TRES (trackable resource) types, and quality of service (QOS). For the weight assignment and the exact computation, see the multifactor priority documentation (https://slurm.schedmd.com/priority_multifactor.html). The priority is computed as:
Job_priority =
    site_factor +
    (PriorityWeightAge) * (age_factor) +
    (PriorityWeightAssoc) * (assoc_factor) +
    (PriorityWeightFairshare) * (fair-share_factor) +
    (PriorityWeightJobSize) * (job_size_factor) +
    (PriorityWeightPartition) * (priority_job_factor) +
    (PriorityWeightQOS) * (QOS_factor) +
    SUM(TRES_weight_cpu * TRES_factor_cpu,
        TRES_weight_<type> * TRES_factor_<type>,
        ...)
    - nice_factor
By tuning these weight parameters dynamically, you can achieve fair and efficient job scheduling.
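For example, a multifactor setup might use weights like these (a sketch; the values are illustrative, not recommendations):

```
# slurm.conf: multifactor priority with illustrative weights
PriorityType=priority/multifactor
PriorityWeightAge=1000         # time spent waiting in the queue
PriorityWeightFairshare=10000  # allocated vs. consumed resources
PriorityWeightJobSize=1000     # job size (see PriorityFavorSmall)
PriorityWeightPartition=1000   # per-partition boost
PriorityWeightQOS=2000         # per-QOS boost
```

After editing, apply the change with scontrol reconfigure (or restart slurmctld).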
A typical application example:
# 1. Locate and edit slurm.conf
sudo nano /etc/slurm/slurm.conf
# 2. Enable preemption based on partition priority
PreemptType=preempt/partition_prio
# 3. PreemptMode defines what happens to a preempted job:
#    cancel terminates it; suspend pauses it until resources free up again.
PreemptMode=suspend  # or "cancel"

The recommended parameter set for partition-based preemption is summarized below:

| Parameter | Recommended value | Purpose |
|---|---|---|
| SelectType | select/cons_tres | Resource-selection plugin: decides how jobs are placed onto node resources. Note: workers created by a slurm cluster use the dynamic-node feature, so only select/cons_tres is supported. |
| SelectTypeParameters | CR_Core | Parameters passed to the SelectType plugin; control the details of resource allocation. |
| SchedulerType | sched/backfill | Scheduling algorithm: decides how jobs are assigned to nodes. |
| PriorityType | priority/multifactor | Rule for computing job priority, which determines scheduling order. |
| PreemptMode | SUSPEND,GANG | Under what conditions running jobs may be preempted. Note: with preemption enabled and the select/cons_tres plugin, only SUSPEND and GANG are allowed. |
| PreemptType | preempt/partition_prio | Which preemption mechanism is used. Note: preempt/qos and preempt/partition_prio are currently supported; this example preempts by partition priority. |
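Putting these together, the relevant slurm.conf section might read (a sketch mirroring the table above):

```
# slurm.conf: partition-priority preemption (sketch)
SelectType=select/cons_tres
SelectTypeParameters=CR_Core
SchedulerType=sched/backfill
PriorityType=priority/multifactor
PreemptMode=SUSPEND,GANG
PreemptType=preempt/partition_prio
```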
Add a partition record to slurm.conf, for example:
# sudo nano /etc/slurm/slurm.conf
...
NodeName=rhel-efserver NodeAddr=192.168.1.25 CPUs=8 CoresPerSocket=2 ThreadsPerCore=1 RealMemory=1024 State=UNKNOWN
NodeName=rhel-compute NodeAddr=192.168.1.24 CPUs=16 CoresPerSocket=2 ThreadsPerCore=1 RealMemory=10240 State=UNKNOWN
NodeName=rhel-dcv NodeAddr=192.168.1.26 CPUs=8 CoresPerSocket=2 ThreadsPerCore=1 RealMemory=8192 State=UNKNOWN
NodeName=rhel-openscow NodeAddr=192.168.1.28 CPUs=16 CoresPerSocket=2 ThreadsPerCore=1 RealMemory=4096 State=UNKNOWN
NodeName=rhel-login NodeAddr=192.168.1.27 CPUs=8 CoresPerSocket=2 ThreadsPerCore=1 RealMemory=2048 State=UNKNOWN
PartitionName=compute Nodes=rhel-compute Default=YES MaxTime=INFINITE State=UP OverSubscribe=FORCE:1
PartitionName=dcv Nodes=rhel-dcv Default=NO MaxTime=INFINITE State=UP
PartitionName=debug Nodes=rhel-dcv,rhel-efserver,rhel-compute,rhel-openscow,rhel-login Default=NO MaxTime=INFINITE State=UP
# The high_prio_debug partition has Priority=10000; the others keep the default of 1
PartitionName=high_prio_debug Nodes=rhel-dcv,rhel-efserver,rhel-compute,rhel-openscow,rhel-login Priority=10000 Default=NO MaxTime=INFINITE State=UP

View the partition information:
[root@rhel-openscow ~]# scontrol show partitions
PartitionName=compute
AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
AllocNodes=ALL Default=YES QoS=N/A
DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO ExclusiveTopo=NO GraceTime=0 Hidden=NO
MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED MaxCPUsPerSocket=UNLIMITED
Nodes=rhel-compute
PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=FORCE:1
OverTimeLimit=NONE PreemptMode=GANG,SUSPEND
State=UP TotalCPUs=16 TotalNodes=1 SelectTypeParameters=NONE
JobDefaults=(null)
DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED
TRES=cpu=16,mem=10G,node=1,billing=16
PartitionName=dcv
AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
AllocNodes=ALL Default=NO QoS=N/A
DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO ExclusiveTopo=NO GraceTime=0 Hidden=NO
MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED MaxCPUsPerSocket=UNLIMITED
Nodes=rhel-dcv
PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
OverTimeLimit=NONE PreemptMode=GANG,SUSPEND
State=UP TotalCPUs=8 TotalNodes=1 SelectTypeParameters=NONE
JobDefaults=(null)
DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED
TRES=cpu=8,mem=8G,node=1,billing=8
PartitionName=debug
AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
AllocNodes=ALL Default=NO QoS=N/A
DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO ExclusiveTopo=NO GraceTime=0 Hidden=NO
MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED MaxCPUsPerSocket=UNLIMITED
Nodes=rhel-compute,rhel-dcv,rhel-efserver,rhel-login,rhel-openscow
PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
OverTimeLimit=NONE PreemptMode=GANG,SUSPEND
State=UP TotalCPUs=56 TotalNodes=5 SelectTypeParameters=NONE
JobDefaults=(null)
DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED
TRES=cpu=56,mem=25G,node=5,billing=56
PartitionName=high_prio_debug
AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
AllocNodes=ALL Default=NO QoS=N/A
DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO ExclusiveTopo=NO GraceTime=0 Hidden=NO
MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED MaxCPUsPerSocket=UNLIMITED
Nodes=rhel-compute,rhel-dcv,rhel-efserver,rhel-login,rhel-openscow
PriorityJobFactor=10000 PriorityTier=10000 RootOnly=NO ReqResv=NO OverSubscribe=NO
OverTimeLimit=NONE PreemptMode=GANG,SUSPEND
State=UP TotalCPUs=56 TotalNodes=5 SelectTypeParameters=NONE
JobDefaults=(null)
DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED
TRES=cpu=56,mem=25G,node=5,billing=56

Submit several long-running jobs in succession:

srun sleep 1d &
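A shell loop makes this easy (a sketch; the job count of 18 is illustrative and roughly matches the queue shown below):

```bash
# Submit 18 one-day sleep jobs in the background
for i in $(seq 1 18); do
    srun sleep 1d &
done
```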
[root@rhel-openscow ~]# squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
61 compute sleep root PD 0:00 1 (Priority)
60 compute sleep root PD 0:00 1 (Resources)
58 compute sleep root R 0:17 1 rhel-compute
59 compute sleep root R 0:17 1 rhel-compute
56 compute sleep root R 0:18 1 rhel-compute
57 compute sleep root R 0:18 1 rhel-compute
54 compute sleep root R 0:19 1 rhel-compute
55 compute sleep root R 0:19 1 rhel-compute
52 compute sleep root R 0:20 1 rhel-compute
53 compute sleep root R 0:20 1 rhel-compute
50 compute sleep root R 0:21 1 rhel-compute
51 compute sleep root R 0:21 1 rhel-compute
48 compute sleep root R 0:22 1 rhel-compute
49 compute sleep root R 0:22 1 rhel-compute
47 compute sleep root R 0:23 1 rhel-compute
45 compute sleep root R 0:24 1 rhel-compute
46 compute sleep root R 0:24 1 rhel-compute
44 compute sleep root R 0:26 1 rhel-compute

Submit a job to the high-priority partition high_prio_debug:
# Job 44's ST (state) changed from R to S and job 62's state became R: job 44 was suspended and job 62 preempted its resources to run
[root@rhel-openscow ~]# srun --partition=high_prio_debug -w rhel-compute sleep 1d &
[19] 62843
[root@rhel-openscow ~]#
[root@rhel-openscow ~]# squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
60 compute sleep root PD 0:00 1 (Resources)
61 compute sleep root PD 0:00 1 (Priority)
58 compute sleep root R 0:52 1 rhel-compute
59 compute sleep root R 0:52 1 rhel-compute
56 compute sleep root R 0:53 1 rhel-compute
57 compute sleep root R 0:53 1 rhel-compute
54 compute sleep root R 0:54 1 rhel-compute
55 compute sleep root R 0:54 1 rhel-compute
52 compute sleep root R 0:55 1 rhel-compute
53 compute sleep root R 0:55 1 rhel-compute
50 compute sleep root R 0:56 1 rhel-compute
51 compute sleep root R 0:56 1 rhel-compute
48 compute sleep root R 0:57 1 rhel-compute
49 compute sleep root R 0:57 1 rhel-compute
47 compute sleep root R 0:58 1 rhel-compute
45 compute sleep root R 0:59 1 rhel-compute
46 compute sleep root R 0:59 1 rhel-compute
44 compute sleep root S 0:56 1 rhel-compute
62 high_prio sleep root R 0:05 1 rhel-compute

Update a pending job to the high-priority partition:
# Job 60 runs immediately and job 45's ST (state) changed from R to S: job 45 was suspended and job 60 preempted its resources to run
[root@rhel-openscow ~]# scontrol update jobid=60 partition=high_prio_debug nodelist=rhel-compute
[root@rhel-openscow ~]# srun: job 60 has been allocated resources
[root@rhel-openscow ~]# squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
61 compute sleep root PD 0:00 1 (Resources)
58 compute sleep root R 5:21 1 rhel-compute
59 compute sleep root R 5:21 1 rhel-compute
56 compute sleep root R 5:22 1 rhel-compute
57 compute sleep root R 5:22 1 rhel-compute
54 compute sleep root R 5:23 1 rhel-compute
55 compute sleep root R 5:23 1 rhel-compute
52 compute sleep root R 5:24 1 rhel-compute
53 compute sleep root R 5:24 1 rhel-compute
50 compute sleep root R 5:25 1 rhel-compute
51 compute sleep root R 5:25 1 rhel-compute
48 compute sleep root R 5:26 1 rhel-compute
49 compute sleep root R 5:26 1 rhel-compute
47 compute sleep root R 5:27 1 rhel-compute
46 compute sleep root R 5:28 1 rhel-compute
45 compute sleep root S 5:20 1 rhel-compute
44 compute sleep root S 0:56 1 rhel-compute
60 high_prio sleep root R 0:08 1 rhel-compute
62 high_prio sleep root R 4:34 1 rhel-compute

For QOS-based preemption, Slurm needs high- and low-priority QOSes configured (a QOS named normal with priority 0 exists by default); a high-priority QOS is created with sacctmgr, and preemption must be enabled in slurm.conf (PreemptType=preempt/qos with PreemptMode=SUSPEND,GANG).
Note, however, that with PreemptMode=SUSPEND,GANG, low-priority jobs are not fully stopped after a high-priority job preempts them: they coexist with it in gang-scheduled (time-sliced) fashion. QOS configuration uses the sacctmgr tool; the following command creates a high-priority QOS.
sacctmgr add qos high preempt=normal preemptmode=gang,suspend priority=10

The recommended parameter set for QOS-based preemption is:

| Parameter | Recommended value | Purpose |
|---|---|---|
| SelectType | select/cons_tres | Resource-selection plugin: decides how jobs are placed onto node resources. Note: workers created by a slurm cluster use the dynamic-node feature, so only select/cons_tres is supported. |
| SelectTypeParameters | CR_Core | Parameters passed to the SelectType plugin; control the details of resource allocation. |
| SchedulerType | sched/backfill | Scheduling algorithm: decides how jobs are assigned to nodes. |
| PriorityType | priority/multifactor | Rule for computing job priority, which determines scheduling order. |
| PreemptMode | SUSPEND,GANG | Under what conditions running jobs may be preempted. Note: with preemption enabled and the select/cons_tres plugin, only SUSPEND and GANG are allowed. |
| PreemptType | preempt/qos | Which preemption mechanism is used. Note: preempt/qos and preempt/partition_prio are currently supported; this example preempts by QOS. |
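The corresponding slurm.conf lines might read (a sketch mirroring the table; only the preemption mechanism differs from the partition-priority example):

```
# slurm.conf: QOS-based preemption (sketch)
SelectType=select/cons_tres
SelectTypeParameters=CR_Core
SchedulerType=sched/backfill
PriorityType=priority/multifactor
PreemptMode=SUSPEND,GANG
PreemptType=preempt/qos
```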
View the current QOSes:
[root@rhel-openscow ~]# sacctmgr show qos format=name
Name
----------
normal
low

Create a high-priority QOS:
[root@rhel-openscow ~]# sacctmgr add qos high preempt=normal preemptmode=gang,suspend priority=10
Adding QOS(s)
high
Settings
Description = high
Preempt = normal
PreemptMode = GANG,SUSPEND
Priority = 10
Would you like to commit changes? (You have 30 seconds to decide)
(N/y): y
[root@rhel-openscow ~]#
[root@rhel-openscow ~]# sacctmgr show qos format=name
Name
----------
normal
low
high
[root@rhel-openscow ~]# sacctmgr show qos format=name,priority,preempt
Name Priority Preempt
---------- ---------- ----------
normal 0
low 0
high 10 normal

Create a test script:
# sudo nano test.sh
#!/bin/bash
srun sleep 10m

Submit several jobs in succession:
# sbatch test.sh
# ...
[root@rhel-openscow ~]# squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
79 compute test.sh root PD 0:00 1 (Resources)
77 compute test.sh root R 0:07 1 rhel-compute
78 compute test.sh root R 0:07 1 rhel-compute
73 compute test.sh root R 0:10 1 rhel-compute
74 compute test.sh root R 0:10 1 rhel-compute
75 compute test.sh root R 0:10 1 rhel-compute
76 compute test.sh root R 0:10 1 rhel-compute
71 compute test.sh root R 0:13 1 rhel-compute
72 compute test.sh root R 0:13 1 rhel-compute
67 compute test.sh root R 0:14 1 rhel-compute
68 compute test.sh root R 0:14 1 rhel-compute
69 compute test.sh root R 0:14 1 rhel-compute
70 compute test.sh root R 0:14 1 rhel-compute
64 compute test.sh root R 0:17 1 rhel-compute
65 compute test.sh root R 0:17 1 rhel-compute
66 compute test.sh root R 0:17 1 rhel-compute
63 compute test.sh root R 0:20 1 rhel-compute向高优先级QOS提交任务:
# 高优先级QOS任务开始执行,通过分时的方式与其他任务共享资源
[root@rhel-openscow ~]# sbatch -w rhel-compute --qos=high test.sh
Submitted batch job 80
[root@rhel-openscow ~]# squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
79 compute test.sh root PD 0:00 1 (Resources)
77 compute test.sh root R 0:44 1 rhel-compute
78 compute test.sh root R 0:44 1 rhel-compute
73 compute test.sh root R 0:47 1 rhel-compute
74 compute test.sh root R 0:47 1 rhel-compute
75 compute test.sh root R 0:47 1 rhel-compute
76 compute test.sh root R 0:47 1 rhel-compute
71 compute test.sh root R 0:50 1 rhel-compute
72 compute test.sh root R 0:50 1 rhel-compute
67 compute test.sh root R 0:51 1 rhel-compute
68 compute test.sh root R 0:51 1 rhel-compute
69 compute test.sh root R 0:51 1 rhel-compute
70 compute test.sh root R 0:51 1 rhel-compute
64 compute test.sh root R 0:54 1 rhel-compute
65 compute test.sh root R 0:54 1 rhel-compute
66 compute test.sh root R 0:54 1 rhel-compute
63 compute test.sh root R 0:57 1 rhel-compute
80 compute test.sh root S 0:00 1 rhel-compute

Job size factor

Job-size priority is governed by PriorityWeightJobSize, together with PriorityWeightAge for queue-wait time (both set to 1000 in the configuration below).
Non-urgent jobs should use the cluster efficiently without exceeding their deadlines. When job run times are unknown, backfill scheduling becomes ineffective; in that case, scheduling small jobs first reduces head-of-line blocking, while raising the priority of large jobs according to their queue-wait time prevents starvation; a large job close to its deadline can preempt the resources of small jobs (suspending them until it completes).
To raise cluster utilization for non-urgent jobs without missing deadlines, you can apply the strategy below: favor small jobs, weight queue-wait time, and cap the age bonus. These measures let critical jobs finish on time while making the fullest use of cluster resources and keeping a balance between different job types.
The following slurm.conf settings are required (only the relevant lines are shown; other slurm.conf settings are unaffected):
PriorityFavorSmall=YES
PriorityWeightAge=1000
PriorityWeightJobSize=1000
PriorityMaxAge=1-0

Job wait-time factor

The age factor raises a job's priority the longer it waits in the queue: PriorityWeightAge sets its weight, and PriorityMaxAge (here 1-0, i.e. one day) is the wait time at which the age factor reaches its maximum.
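To verify how these factors combine for pending jobs, sprio can print the per-factor breakdown:

```bash
# Show the configured factor weights
sprio -w
# Show each pending job's priority broken down by factor
sprio -l
```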