Extend Kubernetes series: Extend Kubernetes - Kubectl Plugin; Extend Kubernetes - FlexVolume And CSI; Extend Kubernetes - CRI; Extend Kubernetes - CNI
Note: the scheduling flow discussed below is that of the predicates-and-priorities based scheduler (Predicates and Priorities), i.e. Kubernetes v1.0.0 ~ v1.14.0.
The currently mainstream extension mechanism, the webhook-based Scheduler Extender, has several limitations: the extension points are fixed (predicate, priority, preempt, bind), every call crosses an HTTP boundary with JSON serialization, and the extender cannot share the scheduler's internal cache. To address these problems, the Scheduler Framework defines new extension points and APIs for the default scheduler and exposes them through plugins.
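For orientation, here is a rough sketch of what a Score plugin looks like under the scheduling framework. Treat it as an illustration rather than working code for a specific release: the interface shape roughly follows the in-tree framework package of newer releases (around v1.22), the import path and the Handle type have changed across versions, and GroupScore is a hypothetical plugin name chosen to mirror the example developed later in this post.

```go
package main

import (
	"context"

	v1 "k8s.io/api/core/v1"
	"k8s.io/kubernetes/pkg/scheduler/framework"
)

// GroupScore is a hypothetical Score plugin that prefers nodes without the
// group=Scale label, roughly mirroring the extender logic used later in this post.
type GroupScore struct {
	handle framework.Handle
}

func (g *GroupScore) Name() string { return "GroupScore" }

// Score is called for every candidate node during the scoring extension point.
func (g *GroupScore) Score(ctx context.Context, _ *framework.CycleState, _ *v1.Pod, nodeName string) (int64, *framework.Status) {
	nodeInfo, err := g.handle.SnapshotSharedLister().NodeInfos().Get(nodeName)
	if err != nil {
		return 0, framework.NewStatus(framework.Error, err.Error())
	}
	if nodeInfo.Node().Labels["group"] == "Scale" {
		return 0, nil // pay-as-you-go nodes are the last resort
	}
	return framework.MaxNodeScore, nil
}

// ScoreExtensions would return a score normalizer; not needed for this sketch.
func (g *GroupScore) ScoreExtensions() framework.ScoreExtensions { return nil }
```

Such a plugin would be compiled into a custom scheduler binary, registered through the kube-scheduler app's plugin registry, and enabled in a scheduler configuration profile.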
Generally speaking, there are four ways to extend the Kubernetes scheduler:
| Extension approach | Pros and cons |
|---|---|
| Fork the official kube-scheduler and modify it | Hard to maintain |
| Run a second, standalone kube-scheduler selected via `pod.spec.schedulerName` | Can cause scheduling conflicts, e.g. by the time one scheduler binds a pod, the resources may already have been allocated by the other scheduler |
| Scheduler Extender | A webhook configured through the scheduler policy file; supports the Predicate, Priority, Bind, and Preemption extension points; simple to implement |
| Scheduling Framework | Introduced in Kubernetes v1.15; pluggable; the mainstream approach going forward, intended to replace the Scheduler Extender |
k8s-scheduler-extender-example
kube-batch: gang scheduling is a common scheduling pattern in areas such as big data and batch computing. A group of pods is treated as a single unit: if there are enough resources for the whole group it is scheduled as a whole, otherwise nothing in the group is scheduled (whereas vanilla Kubernetes schedules at the granularity of a single pod). kube-batch tries to solve this class of problems, and aims to turn this common requirement into a standard that covers all similar scenarios.
gpushare-scheduler-extender: a scheduler extension for GPU sharing that lets multiple pods share GPU memory and cards. The current device plugin mechanism can only register the total amount of a resource, which is not enough information for scheduling, so gpushare-scheduler-extender adds a filter layer that decides whether a node really has enough GPU resources.
Constrained by the Kubernetes versions that are still mainstream in production, we use the Scheduler Extender approach for the hands-on part below.
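At the wire level, a Scheduler Extender is just an HTTP service: kube-scheduler POSTs an ExtenderArgs payload to the URLs configured in the policy file and, for the prioritize verb, expects a HostPriorityList in response. The handler below is a minimal sketch of that contract, not the actual code of the example repository; the /prioritize path and the :8888 port are arbitrary choices, and the schedulerapi types follow k8s.io/kubernetes/pkg/scheduler/api of roughly the v1.14 era.

```go
package main

import (
	"encoding/json"
	"log"
	"net/http"

	schedulerapi "k8s.io/kubernetes/pkg/scheduler/api"
)

// prioritizeHandler decodes the candidate nodes sent by kube-scheduler and
// returns one score per node. kube-scheduler multiplies these scores by the
// extender's weight from the policy file and adds them to the default priorities.
func prioritizeHandler(w http.ResponseWriter, r *http.Request) {
	var args schedulerapi.ExtenderArgs
	if err := json.NewDecoder(r.Body).Decode(&args); err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	// This sketch assumes full node objects are sent (NodeCacheCapable disabled).
	if args.Nodes == nil {
		http.Error(w, "expected node objects in ExtenderArgs", http.StatusBadRequest)
		return
	}

	result := make(schedulerapi.HostPriorityList, 0, len(args.Nodes.Items))
	for _, node := range args.Nodes.Items {
		// A real extender would call scoring logic such as GroupPriority below.
		result = append(result, schedulerapi.HostPriority{Host: node.Name, Score: 1})
	}
	if err := json.NewEncoder(w).Encode(result); err != nil {
		log.Printf("encode prioritize result: %v", err)
	}
}

func main() {
	http.HandleFunc("/prioritize", prioritizeHandler)
	log.Fatal(http.ListenAndServe(":8888", nil))
}
```

The filter (predicate) verb works the same way, except that the response type is ExtenderFilterResult.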
Imagine the following scenario: we divide all nodes in a Kubernetes cluster into two groups. Group A are fixed nodes bought on a monthly subscription; group B are pay-as-you-go nodes that cover elastic demand.
For this scenario, our requirements on the scheduler are: schedule pods to group A first and spread them evenly there; only when group A runs out of resources should pods go to group B, and within group B use as few nodes as possible, so that the pay-as-you-go nodes can be released again when load drops.
The full implementation is available at u2takey/k8s-scheduler-extender-example.
The core implementation is as follows (some secondary code is omitted):
// GroupPriority scores nodes so that pods prefer the fixed group (group A)
// and get bin-packed onto as few "Scale" (group B) nodes as possible.
GroupPriority = Prioritize{
	Name: "group_score",
	Func: func(_ v1.Pod, nodes []v1.Node) (*schedulerapi.HostPriorityList, error) {
		priorityList := make(schedulerapi.HostPriorityList, len(nodes))
		for i, node := range nodes {
			// Nodes outside the Scale group always get the highest score,
			// so group A is preferred as long as it has room.
			priorityList[i] = schedulerapi.HostPriority{
				Host:  node.Name,
				Score: 1000,
			}
			if group, ok := node.Labels["group"]; ok && group == "Scale" {
				// Score Scale nodes by how full they already are:
				// (cpu(sum(requested)/capacity) + memory(sum(requested)/capacity)) * 100.
				// Higher utilization -> higher score, which packs pods onto as few
				// Scale nodes as possible.
				pods, err := indexer.ByIndex("node", node.Name)
				if err != nil {
					return nil, err
				}
				cpu, mem := &resource.Quantity{}, &resource.Quantity{}
				for _, obj := range pods {
					if pod, ok := obj.(*v1.Pod); ok {
						for _, container := range pod.Spec.Containers {
							cpu.Add(*container.Resources.Requests.Cpu())
							mem.Add(*container.Resources.Requests.Memory())
						}
					}
				}
				nodeCpu, nodeMem := node.Status.Capacity.Cpu(), node.Status.Capacity.Memory()
				score := (toFloat(cpu)/toFloat(nodeCpu) + toFloat(mem)/toFloat(nodeMem)) * 100.0
				priorityList[i].Score = int64(score)
			}
			log.Printf("score for %s %d\n", node.Name, priorityList[i].Score)
		}
		return &priorityList, nil
	},
}
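The snippet above relies on an indexer that can list the pods running on a given node and on a toFloat helper, both part of the omitted secondary code. Below is a minimal sketch of how they could look with client-go; it is an assumption about the surrounding code rather than a copy of it, and only the "node" index key has to match the ByIndex call above. The helpers are meant to live in the same package (and binary) as GroupPriority.

```go
package main

import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
)

// newPodByNodeIndexer builds a pod informer whose indexer can answer
// "which pods run on node X", matching indexer.ByIndex("node", node.Name)
// used in GroupPriority above.
func newPodByNodeIndexer(clientset kubernetes.Interface, stopCh <-chan struct{}) cache.Indexer {
	factory := informers.NewSharedInformerFactory(clientset, 0)
	podInformer := factory.Core().V1().Pods().Informer()
	_ = podInformer.AddIndexers(cache.Indexers{
		"node": func(obj interface{}) ([]string, error) {
			if pod, ok := obj.(*v1.Pod); ok {
				return []string{pod.Spec.NodeName}, nil
			}
			return nil, nil
		},
	})
	factory.Start(stopCh)
	factory.WaitForCacheSync(stopCh)
	return podInformer.GetIndexer()
}

// toFloat converts a resource.Quantity to a float so that requested/capacity
// ratios can be computed; milli-units keep enough precision for CPU and memory.
func toFloat(q *resource.Quantity) float64 {
	return float64(q.MilliValue())
}
```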
Create a Kubernetes cluster with Terraform for testing. The configuration is shown below (variable definitions omitted); it creates 4 worker nodes, each with 2 vCPUs and 4 GB of memory.
provider "tencentcloud" {
secret_id = var.secret_id
secret_key = var.secret_key
region = var.region
}
# test cluster
resource "tencentcloud_kubernetes_cluster" "managed_cluster" {
vpc_id = var.vpc
cluster_cidr = "10.4.0.0/16"
cluster_max_pod_num = 32
cluster_desc = "cluster created by terraform"
cluster_max_service_num = 32
container_runtime = "containerd"
cluster_version = "1.14.3"
worker_config {
count = 4
availability_zone = var.availability_zone
instance_type = var.default_instance_type
system_disk_size = 50
security_group_ids = [var.sg]
internet_charge_type = "TRAFFIC_POSTPAID_BY_HOUR"
internet_max_bandwidth_out = 100
public_ip_assigned = true
subnet_id = var.subnet
key_ids = [var.key_id]
}
cluster_deploy_type = "MANAGED_CLUSTER"
provisioner "local-exec" {
command = <<EOT
echo "${self.certification_authority}" > /tmp/{self.user_name}.cert;
kubectl config set-credentials ${self.id} --username=${self.user_name} --password=${self.password};
kubectl config set-cluster ${self.id} --server=https://${self.domain} --certificate-authority=/tmp/{self.user_name}.cert --embed-certs=true;
kubectl config set-context ${self.id} --cluster=${self.id} --user=${self.id} ;
kubectl config use-context ${self.id};
EOT
}
provisioner "local-exec" {
when = "destroy"
command = <<EOT
kubectl config unset users.${self.id};
kubectl config unset contexts.${self.id};
kubectl config unset clusters.${self.id};
EOT
}
}
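After `terraform init` and `terraform apply`, the first provisioner writes the cluster credentials into the local kubeconfig and switches the current context to the new cluster, so the kubectl commands below target it directly; the destroy-time provisioner removes those kubeconfig entries again.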
Once the cluster is up, patch two of the nodes with the label group: Scale. These form group B described above, the group used for scaling out.
kubectl patch node 10.203.0.16 10.203.0.6 -p '{"metadata":{"labels":{"group":"Scale"}}}'
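To double-check the result, `kubectl get node -L group` prints the group label as an extra column.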
Create a Deployment for testing with requests/limits of 500m CPU / 500M memory and scale it up step by step, watching where the replicas land. The replicas are first spread evenly across group A (10.203.0.14, 10.203.0.11); once group A runs out of resources they spill over to group B, which uses as few nodes as possible: one node (10.203.0.6) is chosen first and filled up until it, too, runs out of resources.
# 6 replicas: spread evenly across group A first
➜ kubectl get pod -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
nginx-866d5f6df5-4gxcn 1/1 Running 0 5s 10.4.0.107 10.203.0.11 <none> <none>
nginx-866d5f6df5-4wwn8 1/1 Running 0 18s 10.4.0.41 10.203.0.14 <none> <none>
nginx-866d5f6df5-cnpld 1/1 Running 0 36s 10.4.0.40 10.203.0.14 <none> <none>
nginx-866d5f6df5-drpsz 1/1 Running 0 18s 10.4.0.106 10.203.0.11 <none> <none>
nginx-866d5f6df5-frb6c 1/1 Running 0 18s 10.4.0.42 10.203.0.14 <none> <none>
nginx-866d5f6df5-xg79m 1/1 Running 0 18s 10.4.0.105 10.203.0.11 <none> <none>
# 7 replicas: group A is now out of resources, the new pod goes to group B
➜ kubectl get pod -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
nginx-866d5f6df5-4gxcn 1/1 Running 0 12s 10.4.0.107 10.203.0.11 <none> <none>
nginx-866d5f6df5-4wwn8 1/1 Running 0 25s 10.4.0.41 10.203.0.14 <none> <none>
nginx-866d5f6df5-89fxh 0/1 ContainerCreating 0 2s <none> 10.203.0.6 <none> <none>
nginx-866d5f6df5-cnpld 1/1 Running 0 43s 10.4.0.40 10.203.0.14 <none> <none>
nginx-866d5f6df5-drpsz 1/1 Running 0 25s 10.4.0.106 10.203.0.11 <none> <none>
nginx-866d5f6df5-frb6c 1/1 Running 0 25s 10.4.0.42 10.203.0.14 <none> <none>
nginx-866d5f6df5-xg79m 1/1 Running 0 25s 10.4.0.105 10.203.0.11 <none> <none>
# 9 replicas: the additional replicas are packed onto 10.203.0.6
➜ kubectl get pod -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
nginx-866d5f6df5-4gxcn 1/1 Running 0 39s 10.4.0.107 10.203.0.11 <none> <none>
nginx-866d5f6df5-4wwn8 1/1 Running 0 52s 10.4.0.41 10.203.0.14 <none> <none>
nginx-866d5f6df5-89fxh 1/1 Running 0 29s 10.4.0.72 10.203.0.6 <none> <none>
nginx-866d5f6df5-9ng2n 1/1 Running 0 3s 10.4.0.74 10.203.0.6 <none> <none>
nginx-866d5f6df5-cnpld 1/1 Running 0 70s 10.4.0.40 10.203.0.14 <none> <none>
nginx-866d5f6df5-drpsz 1/1 Running 0 52s 10.4.0.106 10.203.0.11 <none> <none>
nginx-866d5f6df5-frb6c 1/1 Running 0 52s 10.4.0.42 10.203.0.14 <none> <none>
nginx-866d5f6df5-q7rhc 1/1 Running 0 16s 10.4.0.73 10.203.0.6 <none> <none>
nginx-866d5f6df5-xg79m 1/1 Running 0 52s 10.4.0.105 10.203.0.11 <none> <none>
# 10 replicas: 10.203.0.6 is now out of resources, so the new pod goes to 10.203.0.16
➜ kubectl get pod -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
nginx-866d5f6df5-4gxcn 1/1 Running 0 56s 10.4.0.107 10.203.0.11 <none> <none>
nginx-866d5f6df5-4wwn8 1/1 Running 0 69s 10.4.0.41 10.203.0.14 <none> <none>
nginx-866d5f6df5-89fxh 1/1 Running 0 46s 10.4.0.72 10.203.0.6 <none> <none>
nginx-866d5f6df5-9ng2n 1/1 Running 0 20s 10.4.0.74 10.203.0.6 <none> <none>
nginx-866d5f6df5-cnpld 1/1 Running 0 87s 10.4.0.40 10.203.0.14 <none> <none>
nginx-866d5f6df5-drpsz 1/1 Running 0 69s 10.4.0.106 10.203.0.11 <none> <none>
nginx-866d5f6df5-frb6c 1/1 Running 0 69s 10.4.0.42 10.203.0.14 <none> <none>
nginx-866d5f6df5-q7rhc 1/1 Running 0 33s 10.4.0.73 10.203.0.6 <none> <none>
nginx-866d5f6df5-sc4x6 1/1 Running 0 6s 10.4.0.10 10.203.0.16 <none> <none>
nginx-866d5f6df5-xg79m 1/1 Running 0 69s 10.4.0.105 10.203.0.11 <none> <none>
Finally, don't forget to run terraform destroy to tear down the cluster.