As GPU usage in AI applications keeps growing, two problems are widespread across the industry: GPU resources are expensive yet often underutilized, and sharing a card between workloads lacks strong isolation.
The GPU-sharing capability of Tencent Cloud qGPU lets a single GPU card be shared by multiple containers, with strong isolation of GPU memory and compute power between them, so GPU resources can be scheduled at a finer granularity. While keeping workloads stable, it helps cloud users control resource costs and improve operating efficiency.
This practice is built on Tencent Cloud TKE and applies to AI training and inference scenarios.
The article has two parts. Part one covers the cloud-native installation of qGPU and provides two installation options, all-qGPU and mixed nvidia + qGPU, so you can choose according to your actual scenario. Part two verifies qGPU capabilities with hands-on test cases for scheduling, isolation, and online/offline co-location.
The environment for this practice is a Tencent Cloud TKE cluster.
Note: when reproducing the verification, configure nodes as each test case requires; see [Test Cases] - [Environment Preparation].
Create new nodes from the TKE cluster console. When purchasing the nodes you can choose between the two setups described below.
kubectl describe node x.x.x.x
Do not enable [qGPU sharing] here, otherwise nodes cannot be provisioned in mixed mode (once it is enabled, every node created afterwards becomes a qGPU node).
Choose [Public Image], for example the CentOS 7.8 operating system, and install the 470-series GPU driver.
kubectl label node x.x.x.x gputype=nvidia
Add the [qGPU component] add-on through the TKE console.
Choose [Marketplace Image] and pick a machine image marked as "混部" (co-location); the GPU driver is already installed in that OS, so there is no need to install it again.
Note: in this mixed mode, qGPU nodes can only use a [Marketplace Image]; all [Public Image] nodes follow the Nvidia scheme.
Label the node so that workloads can select it with a nodeSelector at scheduling time:
kubectl label node x.x.x.x gputype=qgpu
kubectl describe node x.x.x.x
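As an optional sanity check (a minimal sketch, assuming the add-on deploys the qgpu-manager and qgpu-scheduler components into kube-system, as later steps in this article reference), confirm that the qGPU components are running:
kubectl get pods -n kube-system | grep qgpu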
The walkthrough below validates four qGPU capabilities, covering scheduling, isolation, and online/offline co-location, one scenario at a time.
To avoid jitter in the measurements, lock the GPU clock before testing by running the command matching your GPU model on the GPU node:
| GPU model | Command |
| --- | --- |
| T4 | nvidia-smi --lock-gpu-clocks=1590 |
| V100 | nvidia-smi --lock-gpu-clocks=1530 |
| A100 | nvidia-smi --lock-gpu-clocks=1410 |
| A10 | nvidia-smi --lock-gpu-clocks=1695 |
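To confirm the lock took effect, and to release it once testing is done, the standard nvidia-smi clock options can be used on the node (a sketch; supported on recent drivers for these GPU models):
nvidia-smi -q -d CLOCK
nvidia-smi --reset-gpu-clocks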
The tests use the TensorFlow CNN Benchmark as the simulated workload: it trains image classifiers based on several convolutional neural network models. Build the image as follows and push it to the TCR image registry.
mkdir -p /tmp/qgpu
cd /tmp/qgpu
git clone https://github.com/tensorflow/benchmarks.git
cat <<'EOF' > start.sh
#! /bin/bash
python3 /benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --forward_only=True --model=resnet50 --num_batches="$@" --allow_growth --batch_size=16
EOF
chmod +x start.sh
cat <<EOF > Dockerfile
FROM nvcr.io/nvidia/tensorflow:21.08-tf1-py3
ADD benchmarks /benchmarks
ADD start.sh /start.sh
EOF
docker build -t qgpu-tf-test:21.08-tf1-py3 .
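To push the image to TCR, a minimal sketch is to tag it with the registry path referenced by the Job specs below (the qcbm namespace is simply the one used in this article; substitute your own TCR namespace) and push:
docker login ccr.ccs.tencentyun.com
docker tag qgpu-tf-test:21.08-tf1-py3 ccr.ccs.tencentyun.com/qcbm/qgpu-tf-test:21.08-tf1-py3
docker push ccr.ccs.tencentyun.com/qcbm/qgpu-tf-test:21.08-tf1-py3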
Confirm that the Nvidia node exposes the whole-card GPU resource:
kubectl describe node <node-name>
Allocatable:
  nvidia.com/gpu: 1
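The Jobs below are created in the qgpu-test namespace; if it does not exist yet, create it first:
kubectl create namespace qgpu-test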
cat <<EOF | kubectl create -f -
apiVersion: batch/v1
kind: Job
metadata:
  name: scene1-job1
  namespace: qgpu-test
spec:
  template:
    spec:
      nodeSelector:
        gputype: nvidia
      restartPolicy: Never
      containers:
      - name: container1
        image: ccr.ccs.tencentyun.com/qcbm/qgpu-tf-test:21.08-tf1-py3
        command: [ "/start.sh", "50000" ]
        resources:
          limits:
            nvidia.com/gpu: 1
EOF
cat <<EOF | kubectl create -f -
apiVersion: batch/v1
kind: Job
metadata:
  name: scene1-job2
  namespace: qgpu-test
spec:
  template:
    spec:
      nodeSelector:
        gputype: nvidia
      restartPolicy: Never
      containers:
      - name: container1
        image: ccr.ccs.tencentyun.com/qcbm/qgpu-tf-test:21.08-tf1-py3
        command: [ "/start.sh", "50000" ]
        resources:
          limits:
            nvidia.com/gpu: 1
EOF
Confirm that the qGPU node exposes the qGPU resources (qgpu-core is a percentage of one card, so 100 means a whole card; qgpu-memory is in GB):
kubectl describe node <node-name>
Allocatable:
  tke.cloud.tencent.com/qgpu-core: 100
  tke.cloud.tencent.com/qgpu-memory: 14
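The same values can also be read directly with a quick jsonpath query:
kubectl get node <node-name> -o jsonpath='{.status.allocatable}'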
cat <<EOF | kubectl create -f -
apiVersion: batch/v1
kind: Job
metadata:
  name: scene2-job1
  namespace: qgpu-test
spec:
  template:
    spec:
      nodeSelector:
        gputype: qgpu
      restartPolicy: Never
      containers:
      - name: container1
        image: ccr.ccs.tencentyun.com/qcbm/qgpu-tf-test:21.08-tf1-py3
        command: [ "/start.sh", "50000" ]
        resources:
          limits:
            tke.cloud.tencent.com/qgpu-memory: "5"
            tke.cloud.tencent.com/qgpu-core: "40"
EOF
kubectl exec -it scene2-job1 nvidia-smi
Inside the container, nvidia-smi should report only the memory quota assigned to the Pod (about 5 GiB here) rather than the whole card, which demonstrates the memory isolation.
cat <<EOF | kubectl create -f -
apiVersion: batch/v1
kind: Job
metadata:
  name: scene2-job1
  namespace: qgpu-test
spec:
  template:
    spec:
      nodeSelector:
        gputype: qgpu
      restartPolicy: Never
      containers:
      - name: container1
        image: ccr.ccs.tencentyun.com/qcbm/qgpu-tf-test:21.08-tf1-py3
        command: [ "/start.sh", "50000" ]
        resources:
          limits:
            tke.cloud.tencent.com/qgpu-memory: "5"
            tke.cloud.tencent.com/qgpu-core: "40"
EOF
cat <<EOF | kubectl create -f -
apiVersion: batch/v1
kind: Job
metadata:
  name: scene2-job2
  namespace: qgpu-test
spec:
  template:
    spec:
      nodeSelector:
        gputype: qgpu
      restartPolicy: Never
      containers:
      - name: container1
        image: ccr.ccs.tencentyun.com/qcbm/qgpu-tf-test:21.08-tf1-py3
        command: [ "/start.sh", "50000" ]
        resources:
          limits:
            tke.cloud.tencent.com/qgpu-memory: "5"
            tke.cloud.tencent.com/qgpu-core: "30"
EOF
kubectl exec -it scene2-job1 nvidia-smi
kubectl exec -it scene2-job2 nvidia-smi
kubectl logs scene2-job1 --tail=20
kubectl logs scene2-job2 --tail=20
In the YAML above, Pod1 and Pod2 are assigned different compute quotas. Under the default preemptive (best-effort) policy, both Pods compete for the remaining idle compute, and since both workloads are heavy the results come out nearly identical: Pod1's training log reports about 202 images/sec and Pod2's about 198 images/sec.
The monitoring of the underlying CVM node also shows GPU utilization at 100%.
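Utilization can also be sampled directly on the CVM node, independent of console monitoring (a simple sketch using standard nvidia-smi query options):
nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv -l 5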
cat <<EOF | kubectl create -f -
apiVersion: batch/v1
kind: Job
metadata:
  name: scene2-job1
  namespace: qgpu-test
spec:
  template:
    spec:
      nodeSelector:
        gputype: qgpu
      restartPolicy: Never
      containers:
      - name: container1
        image: ccr.ccs.tencentyun.com/qcbm/qgpu-tf-test:21.08-tf1-py3
        command: [ "/start.sh", "50000" ]
        resources:
          limits:
            tke.cloud.tencent.com/qgpu-memory: "5"
            tke.cloud.tencent.com/qgpu-core: "40"
EOF
cat <<EOF | kubectl create -f -
apiVersion: batch/v1
kind: Job
metadata:
  name: scene2-job2
  namespace: qgpu-test
spec:
  template:
    spec:
      nodeSelector:
        gputype: qgpu
      restartPolicy: Never
      containers:
      - name: container1
        image: ccr.ccs.tencentyun.com/qcbm/qgpu-tf-test:21.08-tf1-py3
        command: [ "/start.sh", "50000" ]
        resources:
          limits:
            tke.cloud.tencent.com/qgpu-memory: "5"
            tke.cloud.tencent.com/qgpu-core: "30"
EOF
After both Pods reach the Running state, run the following commands and record the output:
kubectl exec -it scene2-job1 nvidia-smi
kubectl exec -it scene2-job2 nvidia-smi
Delete all GPU workloads, then run the following command, search for binpack, replace it with spread, and save and exit:
kubectl edit deploy -n kube-system qgpu-scheduler
Wait for the old qgpu-scheduler Pod to exit and the new qgpu-scheduler Pod to reach Running. Then submit the GPU workloads again, run the nvidia-smi commands, and record the output once more.
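To wait for the scheduler rollout to finish instead of polling by hand, the standard rollout command works against the qgpu-scheduler Deployment edited above:
kubectl rollout status deploy/qgpu-scheduler -n kube-system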
cat <<EOF | kubectl create -f -
apiVersion: batch/v1
kind: Job
metadata:
  name: scene2-job1
  namespace: qgpu-test
spec:
  template:
    spec:
      nodeSelector:
        gputype: qgpu
      restartPolicy: Never
      containers:
      - name: container1
        image: ccr.ccs.tencentyun.com/qcbm/qgpu-tf-test:21.08-tf1-py3
        command: [ "/start.sh", "50000" ]
        resources:
          limits:
            tke.cloud.tencent.com/qgpu-core: "100" # percentage of one card's compute (100 = a whole card)
EOF
On the qGPU node, check the isolation policy currently in effect:
cat /proc/qgpu/0/policy
kubectl label node <node-name> --overwrite tke.cloud.tencent.com/qgpu-schedule-policy=fixed-share
Run the following command on the node again; the policy should now read fixed-share.
cat /proc/qgpu/0/policy
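The label can also be confirmed from the Kubernetes side (the -L flag prints the label value as a column):
kubectl get node <node-name> -L tke.cloud.tencent.com/qgpu-schedule-policy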
cat <<EOF | kubectl create -f -
apiVersion: batch/v1
kind: Job
metadata:
  name: scene3-job1
  namespace: qgpu-test
spec:
  template:
    spec:
      nodeSelector:
        gputype: qgpu
      restartPolicy: Never
      containers:
      - name: container1
        image: ccr.ccs.tencentyun.com/qcbm/qgpu-tf-test:21.08-tf1-py3
        command: [ "/start.sh", "50000" ]
        resources:
          limits:
            tke.cloud.tencent.com/qgpu-memory: "6"
            tke.cloud.tencent.com/qgpu-core: "60"
EOF
cat <<EOF | kubectl create -f -
apiVersion: batch/v1
kind: Job
metadata:
  name: scene3-job2
  namespace: qgpu-test
spec:
  template:
    spec:
      nodeSelector:
        gputype: qgpu
      restartPolicy: Never
      containers:
      - name: container1
        image: ccr.ccs.tencentyun.com/qcbm/qgpu-tf-test:21.08-tf1-py3
        command: [ "/start.sh", "50000" ]
        resources:
          limits:
            tke.cloud.tencent.com/qgpu-memory: "3"
            tke.cloud.tencent.com/qgpu-core: "30"
EOF
kubectl exec -it scene3-job1 nvidia-smi
kubectl exec -it scene3-job2 nvidia-smi
kubectl logs scene3-job1 --tail=20
kubectl logs scene3-job2 --tail=20
Note: once online/offline co-location is enabled, the best-effort and fixed-share isolation policies described above no longer take effect; co-location is configured per node. Enable it by labeling the node and then restarting the qGPU components:
kubectl label node xxxx mixed-qgpu-enable=enable
kubectl delete pod qgpu-manager-xxx -n kube-system
kubectl delete pod qgpu-scheduler-xxx -n kube-system
kubectl describe node xxxx
If the node's allocatable resources now include the co-location resource tke.cloud.tencent.com/qgpu-core-greedy (requested by the offline Jobs below), co-location has been enabled successfully.
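A quick command-line check (a sketch, assuming success is reflected by the node advertising that low-priority compute resource):
kubectl describe node xxxx | grep qgpu-core-greedy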
An offline workload is declared with the offline annotation and requests the low-priority qgpu-core-greedy compute resource, following this template:
apiVersion: v1
kind: Pod
metadata:
  annotations:
    tke.cloud.tencent.com/app-class: offline
spec:
  containers:
  - name: offline-container
    resources:
      requests:
        tke.cloud.tencent.com/qgpu-core-greedy: xx
        tke.cloud.tencent.com/qgpu-memory: xx
An online workload is declared with the online annotation and only needs to request GPU memory:
apiVersion: v1
kind: Pod
metadata:
  annotations:
    tke.cloud.tencent.com/app-class: online
spec:
  containers:
  - name: online-container
    resources:
      requests:
        tke.cloud.tencent.com/qgpu-memory: xx
cat <<EOF | kubectl create -f -
apiVersion: batch/v1
kind: Job
metadata:
  name: scene4-job1
  namespace: qgpu-test
spec:
  template:
    metadata:
      annotations:
        tke.cloud.tencent.com/app-class: offline # offline workload
    spec:
      nodeSelector:
        gputype: qgpu
      restartPolicy: Never
      containers:
      - name: container1
        image: ccr.ccs.tencentyun.com/qcbm/qgpu-tf-test:21.08-tf1-py3
        command: [ "/start.sh", "50000" ]
        resources:
          limits:
            tke.cloud.tencent.com/qgpu-memory: "7" # GPU memory in GB
            tke.cloud.tencent.com/qgpu-core-greedy: "60" # low-priority (greedy) compute resource
EOF
cat <<EOF | kubectl create -f -
apiVersion: batch/v1
kind: Job
metadata:
  name: scene4-job2
  namespace: qgpu-test
spec:
  template:
    metadata:
      annotations:
        tke.cloud.tencent.com/app-class: online # online workload
    spec:
      nodeSelector:
        gputype: qgpu
      restartPolicy: Never
      containers:
      - name: container1
        image: ccr.ccs.tencentyun.com/qcbm/qgpu-tf-test:21.08-tf1-py3
        command: [ "bash", "-c", "while true; do /start.sh 2000 && sleep 90;done" ]
        resources:
          limits:
            tke.cloud.tencent.com/qgpu-memory: "7" # GPU memory in GB
EOF
kubectl logs -f scene4-job1
kubectl logs -f scene4-job2
kubectl exec -it scene4-job1 nvidia-smi
kubectl exec -it scene4-job2 nvidia-smi
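Once the verification is complete, the test workloads can be removed in one step (remember also to release the GPU clock lock on the node, as noted earlier):
kubectl delete jobs -n qgpu-test --all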