GPU usage keeps growing in industry AI applications. Tencent Cloud TACO is a heterogeneous computing acceleration software service: combined with Tencent's in-house software-hardware co-optimization components and hardware vendors' specific optimizations, it accelerates computing, graphics rendering, and video transcoding workloads on physical machines, cloud servers, and containers, helping users cut cost and raise efficiency across all scenarios.
This practice uses the HARP and LightCC optimizations from TACO Train to accelerate the Horovod distributed training framework in a non-intrusive way. Different training models and different Batch-Size settings are used to verify TACO's effect on training speed.
The environment for this practice runs on Tencent Cloud TKE. On each GPU node, install the TACO environment with the following script, then confirm that huge pages are reserved and the HARP configuration files have been generated:
# Install the TACO acceleration environment on the node
curl -s -L http://mirrors.tencent.com/install/GPU/taco/taco_setup.sh | sudo bash
# Verify that huge pages have been reserved
cat /proc/meminfo | grep HugePages_Total
# Verify that the HARP configuration files have been generated
ls -l /usr/local/tfabric/tools/config/ztcp*.conf
The benchmark program is Horovod's distributed training benchmark script driven by synthetic (random) data. The TACO runtime uses Tencent Cloud's official taco-train image:
ccr.ccs.tencentyun.com/qcloud/taco-train:ttf115-cu112-cvm-0.4.1
Because TACO integrates as a plug-in, removing the HARP acceleration library from the TACO image yields a native Horovod runtime image:
cat <<'EOF' > Dockerfile
FROM ccr.ccs.tencentyun.com/qcloud/taco-train:ttf115-cu112-cvm-0.4.1
ARG DEBIAN_FRONTEND=noninteractive
ENV TZ=Asia/Shanghai
RUN apt-get update || :
RUN apt-get install -y tzdata
RUN ln -snf /usr/share/zoneinfo/$TZ /etc/localtime && echo $TZ > /etc/timezone
# Remove the HARP NCCL network plugin so the image falls back to native Horovod/NCCL
RUN mv /usr/lib/x86_64-linux-gnu/libnccl-net.so /root/
EOF
docker build -t ttf115-cu112-cvm-0.4.1-horovod .
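So that the TKE Worker pods can pull the custom image, push it to an image registry; the registry namespace and tag below are placeholders and should match the image referenced in the horovod-bench manifest later:
# hypothetical registry path -- replace <your-namespace> with your own CCR/TCR namespace
docker tag ttf115-cu112-cvm-0.4.1-horovod ccr.ccs.tencentyun.com/<your-namespace>/horovod-train:ttf115-cu112-cvm-0.4.1
docker push ccr.ccs.tencentyun.com/<your-namespace>/horovod-train:ttf115-cu112-cvm-0.4.1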
Deploy the TACO job with the official taco-train image, configuring huge-page memory per node according to the following formula:
hugepages = (number of GPUs per node * 5 + 10)
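For example, a node with a single GPU needs 1 * 5 + 10 = 15 huge pages, which matches the hugepages-1Gi: "15Gi" limit in the manifests below. A minimal check on a node (a sketch assuming nvidia-smi is available; the reservation itself is expected to have been done by the taco_setup.sh step above):
GPUS=$(nvidia-smi --list-gpus | wc -l)      # GPUs on this node
EXPECTED=$((GPUS * 5 + 10))                 # hugepages = GPUs * 5 + 10
echo "expected HugePages_Total: ${EXPECTED}"
grep HugePages_Total /proc/meminfo          # compare against the expected value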
An example YAML manifest is shown below:
apiVersion: kubeflow.org/v1
kind: MPIJob
metadata:
  name: taco-bench
spec:
  slotsPerWorker: 1
  runPolicy:
    cleanPodPolicy: Running
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
          - image: ccr.ccs.tencentyun.com/qcbm/taco-train:ttf115-cu112-cvm-0.4.2
            name: mpi-launcher
            command: ["/bin/sh", "-ec", "sleep infinity"]
            resources:
              limits:
                cpu: 1
                memory: 2Gi
    Worker:
      replicas: 2
      template:
        spec:
          containers:
          - image: ccr.ccs.tencentyun.com/qcbm/taco-train:ttf115-cu112-cvm-0.4.2
            name: mpi-worker
            securityContext:
              privileged: true
            volumeMounts:
            - mountPath: /sys/
              name: sys
            - mountPath: /dev/hugepages
              name: dev-hge
            - mountPath: /usr/local/tfabric/tools
              name: tfabric
            resources:
              limits:
                hugepages-1Gi: "15Gi"
                memory: "5Gi"
                nvidia.com/gpu: 1 # requesting 1 GPU
          volumes:
          - name: sys
            hostPath:
              path: /sys/
          - name: dev-hge
            hostPath:
              path: /dev/hugepages/
          - name: tfabric
            hostPath:
              path: /usr/local/tfabric/tools/
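To deploy, save the manifest (for example as taco-bench.yaml, an assumed file name) and apply it. This assumes the MPI Operator providing the kubeflow.org/v1 MPIJob CRD is already installed in the cluster, and uses the taco-test namespace that the kubectl exec commands later refer to:
kubectl create namespace taco-test            # skip if the namespace already exists
kubectl apply -n taco-test -f taco-bench.yaml
kubectl get pods -n taco-test                 # wait until the launcher and both workers are Running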
For the custom Horovod image, an example YAML manifest is shown below:
apiVersion: kubeflow.org/v1
kind: MPIJob
metadata:
  name: horovod-bench
spec:
  slotsPerWorker: 1
  runPolicy:
    cleanPodPolicy: Running
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
          - image: ccr.ccs.tencentyun.com/qcbm/horovod-train:ttf115-cu112-cvm-0.4.2
            name: mpi-launcher
            command: ["/bin/sh", "-ec", "sleep infinity"]
            resources:
              limits:
                cpu: 1
                memory: 2Gi
    Worker:
      replicas: 2
      template:
        spec:
          containers:
          - image: ccr.ccs.tencentyun.com/qcbm/horovod-train:ttf115-cu112-cvm-0.4.2
            name: mpi-worker
            securityContext:
              privileged: true
            volumeMounts:
            - mountPath: /sys/
              name: sys
            - mountPath: /dev/hugepages
              name: dev-hge
            - mountPath: /usr/local/tfabric/tools
              name: tfabric
            resources:
              limits:
                hugepages-1Gi: "15Gi"
                memory: "5Gi"
                nvidia.com/gpu: 1 # requesting 1 GPU
          volumes:
          - name: sys
            hostPath:
              path: /sys/
          - name: dev-hge
            hostPath:
              path: /dev/hugepages/
          - name: tfabric
            hostPath:
              path: /usr/local/tfabric/tools/
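Deploy it the same way; the file name horovod-bench.yaml and the horovod-test namespace are assumptions consistent with the kubectl exec commands below:
kubectl create namespace horovod-test
kubectl apply -n horovod-test -f horovod-bench.yaml
kubectl get pods -n horovod-test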
Using the TACO training acceleration components LightCC (a distributed training framework deeply optimized on top of Horovod) and HARP (an in-house user-space network protocol stack), we compare the multi-node training speedup of ResNet50 and VGG16 against the native Horovod environment. Open a shell in each Launcher pod:
kubectl exec -i -t -n horovod-test horovod-bench-launcher -c mpi-launcher -- sh -c "clear;(bash || sh)"
kubectl exec -i -t -n taco-test taco-bench-launcher -c mpi-launcher -- sh -c "clear;(bash || sh)"
Run the benchmark with the following commands, in taco-bench and horovod-bench respectively:
/usr/local/openmpi/bin/mpirun -np 2 -H taco-bench-worker-0:1,taco-bench-worker-1:1 --allow-run-as-root -bind-to none -map-by slot -x NCCL_ALGO=RING -x NCCL_DEBUG=INFO -x HOROVOD_MPI_THREADS_DISABLE=1 -x HOROVOD_FUSION_THRESHOLD=0 -x HOROVOD_CYCLE_TIME=0 -x LIGHT_TOPK_ALLREDUCE=1 -x LIGHT_TOPK_THRESHOLD=2097152 -x LD_LIBRARY_PATH -x PATH -mca btl_tcp_if_include eth0 python3 /mnt/tensorflow_synthetic_benchmark.py --model=VGG16 --batch-size=16
/usr/local/openmpi/bin/mpirun -np 2 -H horovod-bench-worker-0:1,horovod-bench-worker-1:1 --allow-run-as-root -bind-to none -map-by slot -x NCCL_ALGO=RING -x NCCL_DEBUG=INFO -x HOROVOD_MPI_THREADS_DISABLE=1 -x HOROVOD_FUSION_THRESHOLD=0 -x HOROVOD_CYCLE_TIME=0 -x LD_LIBRARY_PATH -x PATH -mca btl_tcp_if_include eth0 python3 /mnt/tensorflow_synthetic_benchmark.py --model=VGG16 --batch-size=16
Because this practice uses synthetic random data and the amount of computation per step is small, a small Batch-Size is used to increase communication frequency. With Batch-Size set to 16, TACO's HARP and LightCC optimizations show a clear improvement on the VGG16 model, as shown in the figure below:
Run the benchmark with the following commands, in taco-bench and horovod-bench respectively:
/usr/local/openmpi/bin/mpirun -np 2 -H taco-bench-worker-0:1,taco-bench-worker-1:1 --allow-run-as-root -bind-to none -map-by slot -x NCCL_ALGO=RING -x NCCL_DEBUG=INFO -x HOROVOD_MPI_THREADS_DISABLE=1 -x HOROVOD_FUSION_THRESHOLD=0 -x HOROVOD_CYCLE_TIME=0 -x LIGHT_TOPK_ALLREDUCE=1 -x LIGHT_TOPK_THRESHOLD=2097152 -x LD_LIBRARY_PATH -x PATH -mca btl_tcp_if_include eth0 python3 /mnt/tensorflow_synthetic_benchmark.py --model=ResNet50 --batch-size=16
/usr/local/openmpi/bin/mpirun -np 2 -H horovod-bench-worker-0:1,horovod-bench-worker-1:1 --allow-run-as-root -bind-to none -map-by slot -x NCCL_ALGO=RING -x NCCL_DEBUG=INFO -x HOROVOD_MPI_THREADS_DISABLE=1 -x HOROVOD_FUSION_THRESHOLD=0 -x HOROVOD_CYCLE_TIME=0 -x LD_LIBRARY_PATH -x PATH -mca btl_tcp_if_include eth0 python3 /mnt/tensorflow_synthetic_benchmark.py --model=ResNet50 --batch-size=16
Compared with VGG16, ResNet50 requires less network communication. With Batch-Size again set to 16, TACO's HARP and LightCC still deliver a clear improvement, as shown in the figure below:
Run the benchmark with the following commands, in taco-bench and horovod-bench respectively:
/usr/local/openmpi/bin/mpirun -np 2 -H taco-bench-worker-0:1,taco-bench-worker-1:1 --allow-run-as-root -bind-to none -map-by slot -x NCCL_ALGO=RING -x NCCL_DEBUG=INFO -x HOROVOD_MPI_THREADS_DISABLE=1 -x HOROVOD_FUSION_THRESHOLD=0 -x HOROVOD_CYCLE_TIME=0 -x LIGHT_TOPK_ALLREDUCE=1 -x LIGHT_TOPK_THRESHOLD=2097152 -x LD_LIBRARY_PATH -x PATH -mca btl_tcp_if_include eth0 python3 /mnt/tensorflow_synthetic_benchmark.py --model=ResNet50 --batch-size=128
/usr/local/openmpi/bin/mpirun -np 2 -H horovod-bench-worker-0:1,horovod-bench-worker-1:1 --allow-run-as-root -bind-to none -map-by slot -x NCCL_ALGO=RING -x NCCL_DEBUG=INFO -x HOROVOD_MPI_THREADS_DISABLE=1 -x HOROVOD_FUSION_THRESHOLD=0 -x HOROVOD_CYCLE_TIME=0 -x LD_LIBRARY_PATH -x PATH -mca btl_tcp_if_include eth0 python3 /mnt/tensorflow_synthetic_benchmark.py --model=ResNet50 --batch-size=128
With Batch-Size set to 128, computation dominates the overall training process: for the same amount of data, the number of gradient synchronizations falls in proportion to the batch size, so communication takes a smaller share of the time and HARP's advantage over the native network protocol stack is less pronounced. The comparison for this setting is shown below:
When AI training involves heavy communication, the most common problem is insufficient network bandwidth: when inter-node bandwidth in the cluster is limited, the efficiency of data exchange between nodes drops significantly.
To address this, TACO introduces a top-k compression algorithm, the LightCC optimization: gradients are compressed to reduce the traffic of each synchronization, and the algorithm includes a compensation mechanism, so distributed training performance improves substantially with minimal impact on model training accuracy.
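In this practice LightCC is enabled per job through environment variables passed to mpirun with -x (see the benchmark commands above); a minimal sketch using the same values, whose exact semantics should be confirmed in the TACO documentation:
# enable LightCC's top-k compressed allreduce
export LIGHT_TOPK_ALLREDUCE=1
# compression threshold; 2097152 is the value used in the benchmarks in this article
export LIGHT_TOPK_THRESHOLD=2097152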
In addition, HARP, the in-house user-space network protocol stack, lowers kernel protocol-stack overhead through zero-copy memory, multi-instance isolation, and a lock-free data plane, significantly improving network communication efficiency during distributed training, at a much lower cost than industry InfiniBand or RoCE solutions.
This practice shows that TACO delivers a clear training speedup for communication-heavy training programs, reducing cost and improving efficiency for AI model training in the cloud.