
最近需要 Horovod 相关的知识,在这里记录一下,进行备忘:

保持更新,更多内容请关注 cnblogs.com/xuyaowen;
Horovod 安装:
安装 cuda 9.0; https://cloud.tencent.com/developer/article/1766888
编译安装nccl 根据cuda 9.0; https://cloud.tencent.com/developer/article/1766933
安装 gcc 4.9: https://cloud.tencent.com/developer/article/1766993
python 版本 Python 3.6.9 (具体环境请自行适配)
安装 openmpi 4.0 : https://cloud.tencent.com/developer/article/1766932
pip 安装 Horovod 框架:
HOROVOD_NCCL_HOME=nccl的home目录 HOROVOD_NCCL_LIB=nccl的lib目录 HOROVOD_NCCL_INCLUDE=nccl的include目录 HOROVOD_GPU_ALLREDUCE=NCCL pip install --no-cache-dir horovod
HOROVOD_NCCL_HOME=/home/name/nccl/build/ HOROVOD_NCCL_LIB=/home/name/nccl/build/lib/ HOROVOD_NCCL_INCLUDE=/home/name/nccl/build/include/ HOROVOD_GPU_ALLREDUCE=NCCL pip install --no-cache-dir horovod安装后,使用:python -c "import horovod.tensorflow as hvd;" 命令进行测试,如果无错误输出,则表示安装成功;之后可参考官方手册使用Horovod;
➜ openmpi python -c "import horovod.tensorflow as hvd;"
/home/name/anaconda3/envs/gnnalgos/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:523: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint8 = np.dtype([("qint8", np.int8, 1)])
/home/name/anaconda3/envs/gnnalgos/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:524: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_quint8 = np.dtype([("quint8", np.uint8, 1)])
/home/name/anaconda3/envs/gnnalgos/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:525: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint16 = np.dtype([("qint16", np.int16, 1)])
/home/name/anaconda3/envs/gnnalgos/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:526: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_quint16 = np.dtype([("quint16", np.uint16, 1)])
/home/name/anaconda3/envs/gnnalgos/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:527: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint32 = np.dtype([("qint32", np.int32, 1)])
/home/name/anaconda3/envs/gnnalgos/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:532: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
np_resource = np.dtype([("resource", np.ubyte, 1)])参考连接:
https://github.com/horovod/horovod (官方文档,可以参考安装和使用)
https://www.infoq.cn/article/J4ry_9bsfbcNkv6dfuqC
http://fyubang.com/2019/07/08/distributed-training/ (讲解了分布式多卡训练相关的基础知识)
分布式多卡-pytorch,tensorflow 系列教程 (较为详细的教程,讲解了现有较为优秀的框架的特点和使用方式)
https://zhuanlan.zhihu.com/p/78303865 (安装使用参考,本文中的安装步骤参考此教程)