RDMA - GDR (GPUDirect RDMA) Quick Start, Part 2

Original · 晓兵 · Published 2025-03-30

Continued from Part 1: https://cloud.tencent.com/developer/article/2508958

Two Usage Scenarios

MPI and GDR Benchmarking

https://hpcadvisorycouncil.atlassian.net/wiki/spaces/HPCWORKS/pages/2791440385/GPUDirect+Benchmarking
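
At the MPI level, GDR is commonly benchmarked with the OSU micro-benchmarks. A representative invocation (a sketch only: it assumes the benchmarks were built with CUDA support, run over a CUDA-aware MPI, and placeholder host names):

# D = device (GPU) buffer, H = host buffer
mpirun -np 2 -H host1,host2 ./osu_bw -d cuda D D
mpirun -np 2 -H host1,host2 ./osu_latency -d cuda D D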

perftest and GDR Benchmarking (use_cuda)

  • When testing CUDA/GDR with perftest, inline messages are not supported (check_link sets user_param->inline_size = 0), as sketched below.
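
The corresponding logic in perftest is roughly the following (a paraphrased sketch, not the verbatim source; field names follow perftest's perftest_parameters struct):

/* In check_link(): inline sends are disabled for GPU memory, because an
 * inlined payload is copied by the CPU out of the send buffer, which does
 * not work for device memory. */
if (user_param->use_cuda && user_param->inline_size != 0) {
    printf("Perftest doesn't supports CUDA tests with inline messages: inline size set to 0\n");
    user_param->inline_size = 0;
}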

Test Logs

Registering GPU Memory as a Regular MR on the DPU
# Client:
root@gdr114:~/project/rdma/perftest# ./ib_write_bw -d mlx5_0 --use_cuda=0 192.168.1.116
Perftest doesn't supports CUDA tests with inline messages: inline size set to 0
initializing CUDA
Listing all CUDA devices in system:
CUDA device 0: PCIe address is 05:00
​
Picking device No. 0
[pid = 4800, dev = 0] device name = [Tesla V100-SXM2-16GB]
creating CUDA Ctx
making it the current CUDA Ctx
CUDA device integrated: 0
cuMemAlloc() of a 131072 bytes GPU buffer
allocated GPU buffer address at 00007c8f3da00000 pointer=0x7c8f3da00000
---------------------------------------------------------------------------------------
                    RDMA_Write BW Test
 Dual-port       : OFF      Device         : mlx5_0
 Number of qps   : 1        Transport type : IB
 Connection type : RC       Using SRQ      : OFF
 PCIe relax order: ON       Lock-free      : OFF
 WARNING: CPU is not PCIe relaxed ordering compliant.
 WARNING: You should disable PCIe RO with `--disable_pcie_relaxed` for both server and client.
 ibv_wr* API     : ON       Using DDP      : OFF
 TX depth        : 128
 CQ Moderation   : 1
 Mtu             : 1024[B]
 Link type       : Ethernet
 GID index       : 3
 Max inline data : 0[B]
 rdma_cm QPs     : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0000 QPN 0x01c9 PSN 0xa7f08e RKey 0x1fccbc VAddr 0x007c8f3da10000
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:01:114
 remote address: LID 0000 QPN 0x01c9 PSN 0x682a07 RKey 0x1fccbc VAddr 0x007851b5a10000
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:01:116
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[MiB/sec]    BW average[MiB/sec]   MsgRate[Mpps]
Conflicting CPU frequency values detected: 1200.907000 != 2100.000000. CPU Frequency is not max.
 65536      5000             2471.09            2452.52          0.039240
---------------------------------------------------------------------------------------
deallocating GPU buffer 00007c8f3da00000
destroying current CUDA Ctx
​
# Server:
root@gdr116:~/project/rdma/perftest# ./ib_write_bw -d mlx5_0 --use_cuda=0
Perftest doesn't supports CUDA tests with inline messages: inline size set to 0
​
************************************
* Waiting for client to connect... *
************************************
initializing CUDA
Listing all CUDA devices in system:
CUDA device 0: PCIe address is 04:00
​
Picking device No. 0
[pid = 5561, dev = 0] device name = [Tesla V100-SXM2-16GB]
creating CUDA Ctx
making it the current CUDA Ctx
CUDA device integrated: 0
cuMemAlloc() of a 131072 bytes GPU buffer
allocated GPU buffer address at 00007851b5a00000 pointer=0x7851b5a00000
---------------------------------------------------------------------------------------
                    RDMA_Write BW Test
 Dual-port       : OFF      Device         : mlx5_0
 Number of qps   : 1        Transport type : IB
 Connection type : RC       Using SRQ      : OFF
 PCIe relax order: ON       Lock-free      : OFF
 WARNING: CPU is not PCIe relaxed ordering compliant.
 WARNING: You should disable PCIe RO with `--disable_pcie_relaxed` for both server and client.
 ibv_wr* API     : ON       Using DDP      : OFF
 CQ Moderation   : 1
 Mtu             : 1024[B]
 Link type       : Ethernet
 GID index       : 3
 Max inline data : 0[B]
 rdma_cm QPs     : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0000 QPN 0x01c9 PSN 0x682a07 RKey 0x1fccbc VAddr 0x007851b5a10000
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:01:116
 remote address: LID 0000 QPN 0x01c9 PSN 0xa7f08e RKey 0x1fccbc VAddr 0x007c8f3da10000
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:01:114
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[MiB/sec]    BW average[MiB/sec]   MsgRate[Mpps]
 65536      5000             2471.09            2452.52          0.039240
---------------------------------------------------------------------------------------
deallocating GPU buffer 00007851b5a00000
destroying current CUDA Ctx


Registering GPU Memory as a DMA_BUF MR on the DPU
# Client:
root@gdr114:~/project/rdma/perftest# ./ib_write_bw -d mlx5_0 --use_cuda=0 --use_cuda_dmabuf 192.168.1.116
Perftest doesn't supports CUDA tests with inline messages: inline size set to 0
initializing CUDA
Listing all CUDA devices in system:
CUDA device 0: PCIe address is 05:00
​
Picking device No. 0
[pid = 3573, dev = 0] device name = [NVIDIA RTX A6000]
creating CUDA Ctx
making it the current CUDA Ctx
CUDA device integrated: 0
cuMemAlloc() of a 131072 bytes GPU buffer
allocated GPU buffer address at 00007c7d90600000 pointer=0x7c7d90600000
using DMA-BUF for GPU buffer address at 0x7c7d90600000 aligned at 0x7c7d90600000 with aligned size 131072
Calling ibv_reg_dmabuf_mr(offset=0, size=131072, addr=0x7c7d90600000, fd=37) for QP #0
---------------------------------------------------------------------------------------
                    RDMA_Write BW Test
 Dual-port       : OFF      Device         : mlx5_0
 Number of qps   : 1        Transport type : IB
 Connection type : RC       Using SRQ      : OFF
 PCIe relax order: ON       Lock-free      : OFF
 WARNING: CPU is not PCIe relaxed ordering compliant.
 WARNING: You should disable PCIe RO with `--disable_pcie_relaxed` for both server and client.
 ibv_wr* API     : ON       Using DDP      : OFF
 TX depth        : 128
 CQ Moderation   : 1
 Mtu             : 1024[B]
 Link type       : Ethernet
 GID index       : 3
 Max inline data : 0[B]
 rdma_cm QPs     : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0000 QPN 0x01ca PSN 0x76adeb RKey 0x176d5d VAddr 0x007c7d90610000
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:01:114
 remote address: LID 0000 QPN 0x01ca PSN 0xe79753 RKey 0x176f5f VAddr 0x00792a04610000
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:01:116
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[MiB/sec]    BW average[MiB/sec]   MsgRate[Mpps]
Conflicting CPU frequency values detected: 1200.966000 != 3299.943000. CPU Frequency is not max.
 65536      5000             2758.05            2758.04          0.044129
---------------------------------------------------------------------------------------
deallocating GPU buffer 00007c7d90600000
destroying current CUDA Ctx
​
​
# Server:
root@gdr116:~/project/rdma/perftest# ./ib_write_bw -d mlx5_0 --use_cuda=0 --use_cuda_dmabuf
Perftest doesn't supports CUDA tests with inline messages: inline size set to 0
​
************************************
* Waiting for client to connect... *
************************************
initializing CUDA
Listing all CUDA devices in system:
CUDA device 0: PCIe address is 04:00
​
Picking device No. 0
[pid = 4694, dev = 0] device name = [NVIDIA RTX A6000]
creating CUDA Ctx
making it the current CUDA Ctx
CUDA device integrated: 0
cuMemAlloc() of a 131072 bytes GPU buffer
allocated GPU buffer address at 0000792a04600000 pointer=0x792a04600000
using DMA-BUF for GPU buffer address at 0x792a04600000 aligned at 0x792a04600000 with aligned size 131072
Calling ibv_reg_dmabuf_mr(offset=0, size=131072, addr=0x792a04600000, fd=37) for QP #0
---------------------------------------------------------------------------------------
                    RDMA_Write BW Test
 Dual-port       : OFF      Device         : mlx5_0
 Number of qps   : 1        Transport type : IB
 Connection type : RC       Using SRQ      : OFF
 PCIe relax order: ON       Lock-free      : OFF
 WARNING: CPU is not PCIe relaxed ordering compliant.
 WARNING: You should disable PCIe RO with `--disable_pcie_relaxed` for both server and client.
 ibv_wr* API     : ON       Using DDP      : OFF
 CQ Moderation   : 1
 Mtu             : 1024[B]
 Link type       : Ethernet
 GID index       : 3
 Max inline data : 0[B]
 rdma_cm QPs     : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0000 QPN 0x01ca PSN 0xe79753 RKey 0x176f5f VAddr 0x00792a04610000
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:01:116
 remote address: LID 0000 QPN 0x01ca PSN 0x76adeb RKey 0x176d5d VAddr 0x007c7d90610000
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:01:114
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[MiB/sec]    BW average[MiB/sec]   MsgRate[Mpps]
 65536      5000             2758.05            2758.04          0.044129
---------------------------------------------------------------------------------------
deallocating GPU buffer 0000792a04600000
destroying current CUDA Ctx
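
Note that the two runs above do not isolate the registration path: the regular-MR run used a Tesla V100 and averaged ~2452 MiB/sec, while the DMA_BUF run used an RTX A6000 and averaged ~2758 MiB/sec, so the bandwidth difference reflects the different GPUs as much as the MR type.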

Source Code Analysis

# Client:
ib_write_bw -d mlx5_0 --use_cuda=0 192.168.1.116

Initializing CUDA Memory

cuda_memory_init(struct memory_ctx * ctx) (\root\project\rdma\perftest\src\cuda_memory.c:111)
ctx_init(struct pingpong_context * ctx, struct pingpong_context * ctx@entry, struct perftest_parameters * user_param, struct perftest_parameters * user_param@entry) (\root\project\rdma\perftest\src\perftest_resources.c:2089)
main(int argc, char ** argv) (\root\project\rdma\perftest\src\write_bw.c:161)
​
...
cuda_memory_init
    return_value = init_gpu(cuda_ctx)
        CUresult error = cuInit(0)
        error = cuCtxCreate(&ctx->cuContext, CU_CTX_MAP_HOST, ctx->cuDevice)
        error = cuCtxSetCurrent(ctx->cuContext)
​
    
Allocating and registering memory:
ibv_reg_mr
cuda_memory_allocate_buffer(struct memory_ctx * ctx, int alignment, uint64_t size, int * dmabuf_fd, uint64_t * dmabuf_offset, void ** addr, _Bool * can_init) (\root\project\rdma\perftest\src\cuda_memory.c:194)
create_single_mr(struct pingpong_context * ctx, struct pingpong_context * ctx@entry, struct perftest_parameters * user_param, struct perftest_parameters * user_param@entry, int qp_index, int qp_index@entry) (\root\project\rdma\perftest\src\perftest_resources.c:1661)
create_mr(struct pingpong_context * ctx, struct pingpong_context * ctx@entry, struct perftest_parameters * user_param, struct perftest_parameters * user_param@entry) (\root\project\rdma\perftest\src\perftest_resources.c:1858)
ctx_init(struct pingpong_context * ctx, struct pingpong_context * ctx@entry, struct perftest_parameters * user_param, struct perftest_parameters * user_param@entry) (\root\project\rdma\perftest\src\perftest_resources.c:2094)
main(int argc, char ** argv) (\root\project\rdma\perftest\src\write_bw.c:161)
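
Putting the two call stacks together, the regular-MR path reduces to roughly the following minimal sketch (an illustration, not perftest's verbatim code; it assumes the nvidia-peermem kernel module is loaded so ibv_reg_mr can pin a cuMemAlloc pointer; error handling omitted):

#include <stdint.h>
#include <cuda.h>
#include <infiniband/verbs.h>

/* cuda_memory_init() + create_single_mr(), condensed: create a CUDA context,
 * allocate device memory, then register the GPU VA as a regular MR. */
static struct ibv_mr *register_gpu_buffer(struct ibv_pd *pd, size_t size)
{
    CUdevice dev;
    CUcontext ctx;
    CUdeviceptr dptr;

    cuInit(0);                                /* init_gpu(): cuInit */
    cuDeviceGet(&dev, 0);                     /* "Picking device No. 0" */
    cuCtxCreate(&ctx, CU_CTX_MAP_HOST, dev);  /* "creating CUDA Ctx" */
    cuCtxSetCurrent(ctx);                     /* "making it the current CUDA Ctx" */

    cuMemAlloc(&dptr, size);                  /* "cuMemAlloc() of a ... GPU buffer" */

    /* The GPU virtual address is handed to the HCA like any host address;
     * pinning it relies on the peer-memory kernel module. */
    return ibv_reg_mr(pd, (void *)(uintptr_t)dptr, size,
                      IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE);
}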

DMA_BUF May Not Be Supported on the GPU (Tesla V100)

root@gdr116:~/project/rdma/perftest# ./ib_write_bw -d mlx5_0 --use_cuda=0 --use_cuda_dmabuf
Perftest doesn't supports CUDA tests with inline messages: inline size set to 0
​
************************************
* Waiting for client to connect... *
************************************
initializing CUDA
Listing all CUDA devices in system:
CUDA device 0: PCIe address is 04:00
​
Picking device No. 0
[pid = 54519, dev = 0] device name = [Tesla V100-SXM2-16GB]
creating CUDA Ctx
making it the current CUDA Ctx
DMA-BUF is not supported on this GPU
Failed to init memory
 Couldn't create IB resources
destroying current CUDA Ctx
​
Code that checks whether the GPU supports DMA_BUF:
#ifdef HAVE_CUDA_DMABUF
    if (cuda_ctx->use_dmabuf) {
        int is_supported = 0;
​
        CUCHECK(cuDeviceGetAttribute(&is_supported, CU_DEVICE_ATTRIBUTE_DMA_BUF_SUPPORTED, cuda_ctx->cuDevice));
        if (!is_supported) {
            fprintf(stderr, "DMA-BUF is not supported on this GPU\n");
            return FAILURE;
        }
    }
#endif

CUDA Device Operations

/**
 * Device properties
 */
typedef enum CUdevice_attribute_enum {
    CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_BLOCK = 1,                          /**< Maximum number of threads per block */
    CU_DEVICE_ATTRIBUTE_MAX_BLOCK_DIM_X = 2,                                /**< Maximum block dimension X */
    CU_DEVICE_ATTRIBUTE_MAX_BLOCK_DIM_Y = 3,                                /**< Maximum block dimension Y */
    CU_DEVICE_ATTRIBUTE_MAX_BLOCK_DIM_Z = 4,                                /**< Maximum block dimension Z */
    CU_DEVICE_ATTRIBUTE_MAX_GRID_DIM_X = 5,                                 /**< Maximum grid dimension X */
    CU_DEVICE_ATTRIBUTE_MAX_GRID_DIM_Y = 6,                                 /**< Maximum grid dimension Y */
    CU_DEVICE_ATTRIBUTE_MAX_GRID_DIM_Z = 7,                                 /**< Maximum grid dimension Z */
    CU_DEVICE_ATTRIBUTE_MAX_SHARED_MEMORY_PER_BLOCK = 8,                    /**< Maximum shared memory available per block in bytes */
    CU_DEVICE_ATTRIBUTE_SHARED_MEMORY_PER_BLOCK = 8,                        /**< Deprecated, use CU_DEVICE_ATTRIBUTE_MAX_SHARED_MEMORY_PER_BLOCK */
    CU_DEVICE_ATTRIBUTE_TOTAL_CONSTANT_MEMORY = 9,                          /**< Memory available on device for __constant__ variables in a CUDA C kernel in bytes */
    CU_DEVICE_ATTRIBUTE_WARP_SIZE = 10,                                     /**< Warp size in threads */
    CU_DEVICE_ATTRIBUTE_MAX_PITCH = 11,                                     /**< Maximum pitch in bytes allowed by memory copies */
    CU_DEVICE_ATTRIBUTE_MAX_REGISTERS_PER_BLOCK = 12,                       /**< Maximum number of 32-bit registers available per block */
    CU_DEVICE_ATTRIBUTE_REGISTERS_PER_BLOCK = 12,                           /**< Deprecated, use CU_DEVICE_ATTRIBUTE_MAX_REGISTERS_PER_BLOCK */
    CU_DEVICE_ATTRIBUTE_CLOCK_RATE = 13,                                    /**< Typical clock frequency in kilohertz */
    CU_DEVICE_ATTRIBUTE_TEXTURE_ALIGNMENT = 14,                             /**< Alignment requirement for textures */
    CU_DEVICE_ATTRIBUTE_GPU_OVERLAP = 15,                                   /**< Device can possibly copy memory and execute a kernel concurrently. Deprecated. Use instead CU_DEVICE_ATTRIBUTE_ASYNC_ENGINE_COUNT. */
    CU_DEVICE_ATTRIBUTE_MULTIPROCESSOR_COUNT = 16,                          /**< Number of multiprocessors on device */
    CU_DEVICE_ATTRIBUTE_KERNEL_EXEC_TIMEOUT = 17,                           /**< Specifies whether there is a run time limit on kernels */
    CU_DEVICE_ATTRIBUTE_INTEGRATED = 18,                                    /**< Device is integrated with host memory */
    CU_DEVICE_ATTRIBUTE_CAN_MAP_HOST_MEMORY = 19,                           /**< Device can map host memory into CUDA address space */
    CU_DEVICE_ATTRIBUTE_COMPUTE_MODE = 20,                                  /**< Compute mode (See ::CUcomputemode for details) */
    CU_DEVICE_ATTRIBUTE_MAXIMUM_TEXTURE1D_WIDTH = 21,                       /**< Maximum 1D texture width */
    CU_DEVICE_ATTRIBUTE_MAXIMUM_TEXTURE2D_WIDTH = 22,                       /**< Maximum 2D texture width */
    CU_DEVICE_ATTRIBUTE_MAXIMUM_TEXTURE2D_HEIGHT = 23,                      /**< Maximum 2D texture height */
    CU_DEVICE_ATTRIBUTE_MAXIMUM_TEXTURE3D_WIDTH = 24,                       /**< Maximum 3D texture width */
    CU_DEVICE_ATTRIBUTE_MAXIMUM_TEXTURE3D_HEIGHT = 25,                      /**< Maximum 3D texture height */
    CU_DEVICE_ATTRIBUTE_MAXIMUM_TEXTURE3D_DEPTH = 26,                       /**< Maximum 3D texture depth */
    CU_DEVICE_ATTRIBUTE_MAXIMUM_TEXTURE2D_LAYERED_WIDTH = 27,               /**< Maximum 2D layered texture width */
    CU_DEVICE_ATTRIBUTE_MAXIMUM_TEXTURE2D_LAYERED_HEIGHT = 28,              /**< Maximum 2D layered texture height */
    CU_DEVICE_ATTRIBUTE_MAXIMUM_TEXTURE2D_LAYERED_LAYERS = 29,              /**< Maximum layers in a 2D layered texture */
    CU_DEVICE_ATTRIBUTE_MAXIMUM_TEXTURE2D_ARRAY_WIDTH = 27,                 /**< Deprecated, use CU_DEVICE_ATTRIBUTE_MAXIMUM_TEXTURE2D_LAYERED_WIDTH */
    CU_DEVICE_ATTRIBUTE_MAXIMUM_TEXTURE2D_ARRAY_HEIGHT = 28,                /**< Deprecated, use CU_DEVICE_ATTRIBUTE_MAXIMUM_TEXTURE2D_LAYERED_HEIGHT */
    CU_DEVICE_ATTRIBUTE_MAXIMUM_TEXTURE2D_ARRAY_NUMSLICES = 29,             /**< Deprecated, use CU_DEVICE_ATTRIBUTE_MAXIMUM_TEXTURE2D_LAYERED_LAYERS */
    CU_DEVICE_ATTRIBUTE_SURFACE_ALIGNMENT = 30,                             /**< Alignment requirement for surfaces */
    CU_DEVICE_ATTRIBUTE_CONCURRENT_KERNELS = 31,                            /**< Device can possibly execute multiple kernels concurrently */
    CU_DEVICE_ATTRIBUTE_ECC_ENABLED = 32,                                   /**< Device has ECC support enabled */
    CU_DEVICE_ATTRIBUTE_PCI_BUS_ID = 33,                                    /**< PCI bus ID of the device */
    CU_DEVICE_ATTRIBUTE_PCI_DEVICE_ID = 34,                                 /**< PCI device ID of the device */
    CU_DEVICE_ATTRIBUTE_TCC_DRIVER = 35,                                    /**< Device is using TCC driver model */
    CU_DEVICE_ATTRIBUTE_MEMORY_CLOCK_RATE = 36,                             /**< Peak memory clock frequency in kilohertz */
    CU_DEVICE_ATTRIBUTE_GLOBAL_MEMORY_BUS_WIDTH = 37,                       /**< Global memory bus width in bits */
    CU_DEVICE_ATTRIBUTE_L2_CACHE_SIZE = 38,                                 /**< Size of L2 cache in bytes */
    CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_MULTIPROCESSOR = 39,                /**< Maximum resident threads per multiprocessor */
    CU_DEVICE_ATTRIBUTE_ASYNC_ENGINE_COUNT = 40,                            /**< Number of asynchronous engines */
    CU_DEVICE_ATTRIBUTE_UNIFIED_ADDRESSING = 41,                            /**< Device shares a unified address space with the host */
    CU_DEVICE_ATTRIBUTE_MAXIMUM_TEXTURE1D_LAYERED_WIDTH = 42,               /**< Maximum 1D layered texture width */
    CU_DEVICE_ATTRIBUTE_MAXIMUM_TEXTURE1D_LAYERED_LAYERS = 43,              /**< Maximum layers in a 1D layered texture */
    CU_DEVICE_ATTRIBUTE_CAN_TEX2D_GATHER = 44,                              /**< Deprecated, do not use. */
    CU_DEVICE_ATTRIBUTE_MAXIMUM_TEXTURE2D_GATHER_WIDTH = 45,                /**< Maximum 2D texture width if CUDA_ARRAY3D_TEXTURE_GATHER is set */
    CU_DEVICE_ATTRIBUTE_MAXIMUM_TEXTURE2D_GATHER_HEIGHT = 46,               /**< Maximum 2D texture height if CUDA_ARRAY3D_TEXTURE_GATHER is set */
    CU_DEVICE_ATTRIBUTE_MAXIMUM_TEXTURE3D_WIDTH_ALTERNATE = 47,             /**< Alternate maximum 3D texture width */
    CU_DEVICE_ATTRIBUTE_MAXIMUM_TEXTURE3D_HEIGHT_ALTERNATE = 48,            /**< Alternate maximum 3D texture height */
    CU_DEVICE_ATTRIBUTE_MAXIMUM_TEXTURE3D_DEPTH_ALTERNATE = 49,             /**< Alternate maximum 3D texture depth */
    CU_DEVICE_ATTRIBUTE_PCI_DOMAIN_ID = 50,                                 /**< PCI domain ID of the device */
    CU_DEVICE_ATTRIBUTE_TEXTURE_PITCH_ALIGNMENT = 51,                       /**< Pitch alignment requirement for textures */
    CU_DEVICE_ATTRIBUTE_MAXIMUM_TEXTURECUBEMAP_WIDTH = 52,                  /**< Maximum cubemap texture width/height */
    CU_DEVICE_ATTRIBUTE_MAXIMUM_TEXTURECUBEMAP_LAYERED_WIDTH = 53,          /**< Maximum cubemap layered texture width/height */
    CU_DEVICE_ATTRIBUTE_MAXIMUM_TEXTURECUBEMAP_LAYERED_LAYERS = 54,         /**< Maximum layers in a cubemap layered texture */
    CU_DEVICE_ATTRIBUTE_MAXIMUM_SURFACE1D_WIDTH = 55,                       /**< Maximum 1D surface width */
    CU_DEVICE_ATTRIBUTE_MAXIMUM_SURFACE2D_WIDTH = 56,                       /**< Maximum 2D surface width */
    CU_DEVICE_ATTRIBUTE_MAXIMUM_SURFACE2D_HEIGHT = 57,                      /**< Maximum 2D surface height */
    CU_DEVICE_ATTRIBUTE_MAXIMUM_SURFACE3D_WIDTH = 58,                       /**< Maximum 3D surface width */
    CU_DEVICE_ATTRIBUTE_MAXIMUM_SURFACE3D_HEIGHT = 59,                      /**< Maximum 3D surface height */
    CU_DEVICE_ATTRIBUTE_MAXIMUM_SURFACE3D_DEPTH = 60,                       /**< Maximum 3D surface depth */
    CU_DEVICE_ATTRIBUTE_MAXIMUM_SURFACE1D_LAYERED_WIDTH = 61,               /**< Maximum 1D layered surface width */
    CU_DEVICE_ATTRIBUTE_MAXIMUM_SURFACE1D_LAYERED_LAYERS = 62,              /**< Maximum layers in a 1D layered surface */
    CU_DEVICE_ATTRIBUTE_MAXIMUM_SURFACE2D_LAYERED_WIDTH = 63,               /**< Maximum 2D layered surface width */
    CU_DEVICE_ATTRIBUTE_MAXIMUM_SURFACE2D_LAYERED_HEIGHT = 64,              /**< Maximum 2D layered surface height */
    CU_DEVICE_ATTRIBUTE_MAXIMUM_SURFACE2D_LAYERED_LAYERS = 65,              /**< Maximum layers in a 2D layered surface */
    CU_DEVICE_ATTRIBUTE_MAXIMUM_SURFACECUBEMAP_WIDTH = 66,                  /**< Maximum cubemap surface width */
    CU_DEVICE_ATTRIBUTE_MAXIMUM_SURFACECUBEMAP_LAYERED_WIDTH = 67,          /**< Maximum cubemap layered surface width */
    CU_DEVICE_ATTRIBUTE_MAXIMUM_SURFACECUBEMAP_LAYERED_LAYERS = 68,         /**< Maximum layers in a cubemap layered surface */
    CU_DEVICE_ATTRIBUTE_MAXIMUM_TEXTURE1D_LINEAR_WIDTH = 69,                /**< Deprecated, do not use. Use cudaDeviceGetTexture1DLinearMaxWidth() or cuDeviceGetTexture1DLinearMaxWidth() instead. */
    CU_DEVICE_ATTRIBUTE_MAXIMUM_TEXTURE2D_LINEAR_WIDTH = 70,                /**< Maximum 2D linear texture width */
    CU_DEVICE_ATTRIBUTE_MAXIMUM_TEXTURE2D_LINEAR_HEIGHT = 71,               /**< Maximum 2D linear texture height */
    CU_DEVICE_ATTRIBUTE_MAXIMUM_TEXTURE2D_LINEAR_PITCH = 72,                /**< Maximum 2D linear texture pitch in bytes */
    CU_DEVICE_ATTRIBUTE_MAXIMUM_TEXTURE2D_MIPMAPPED_WIDTH = 73,             /**< Maximum mipmapped 2D texture width */
    CU_DEVICE_ATTRIBUTE_MAXIMUM_TEXTURE2D_MIPMAPPED_HEIGHT = 74,            /**< Maximum mipmapped 2D texture height */
    CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MAJOR = 75,                      /**< Major compute capability version number */
    CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MINOR = 76,                      /**< Minor compute capability version number */
    CU_DEVICE_ATTRIBUTE_MAXIMUM_TEXTURE1D_MIPMAPPED_WIDTH = 77,             /**< Maximum mipmapped 1D texture width */
    CU_DEVICE_ATTRIBUTE_STREAM_PRIORITIES_SUPPORTED = 78,                   /**< Device supports stream priorities */
    CU_DEVICE_ATTRIBUTE_GLOBAL_L1_CACHE_SUPPORTED = 79,                     /**< Device supports caching globals in L1 */
    CU_DEVICE_ATTRIBUTE_LOCAL_L1_CACHE_SUPPORTED = 80,                      /**< Device supports caching locals in L1 */
    CU_DEVICE_ATTRIBUTE_MAX_SHARED_MEMORY_PER_MULTIPROCESSOR = 81,          /**< Maximum shared memory available per multiprocessor in bytes */
    CU_DEVICE_ATTRIBUTE_MAX_REGISTERS_PER_MULTIPROCESSOR = 82,              /**< Maximum number of 32-bit registers available per multiprocessor */
    CU_DEVICE_ATTRIBUTE_MANAGED_MEMORY = 83,                                /**< Device can allocate managed memory on this system */
    CU_DEVICE_ATTRIBUTE_MULTI_GPU_BOARD = 84,                               /**< Device is on a multi-GPU board */
    CU_DEVICE_ATTRIBUTE_MULTI_GPU_BOARD_GROUP_ID = 85,                      /**< Unique id for a group of devices on the same multi-GPU board */
    CU_DEVICE_ATTRIBUTE_HOST_NATIVE_ATOMIC_SUPPORTED = 86,                  /**< Link between the device and the host supports native atomic operations (this is a placeholder attribute, and is not supported on any current hardware)*/
    CU_DEVICE_ATTRIBUTE_SINGLE_TO_DOUBLE_PRECISION_PERF_RATIO = 87,         /**< Ratio of single precision performance (in floating-point operations per second) to double precision performance */
    CU_DEVICE_ATTRIBUTE_PAGEABLE_MEMORY_ACCESS = 88,                        /**< Device supports coherently accessing pageable memory without calling cudaHostRegister on it */
    CU_DEVICE_ATTRIBUTE_CONCURRENT_MANAGED_ACCESS = 89,                     /**< Device can coherently access managed memory concurrently with the CPU */
    CU_DEVICE_ATTRIBUTE_COMPUTE_PREEMPTION_SUPPORTED = 90,                  /**< Device supports compute preemption. */
    CU_DEVICE_ATTRIBUTE_CAN_USE_HOST_POINTER_FOR_REGISTERED_MEM = 91,       /**< Device can access host registered memory at the same virtual address as the CPU */
    CU_DEVICE_ATTRIBUTE_CAN_USE_STREAM_MEM_OPS_V1 = 92,                     /**< Deprecated, along with v1 MemOps API, ::cuStreamBatchMemOp and related APIs are supported. */
    CU_DEVICE_ATTRIBUTE_CAN_USE_64_BIT_STREAM_MEM_OPS_V1 = 93,              /**< Deprecated, along with v1 MemOps API, 64-bit operations are supported in ::cuStreamBatchMemOp and related APIs. */
    CU_DEVICE_ATTRIBUTE_CAN_USE_STREAM_WAIT_VALUE_NOR_V1 = 94,              /**< Deprecated, along with v1 MemOps API, ::CU_STREAM_WAIT_VALUE_NOR is supported. */
    CU_DEVICE_ATTRIBUTE_COOPERATIVE_LAUNCH = 95,                            /**< Device supports launching cooperative kernels via ::cuLaunchCooperativeKernel */
    CU_DEVICE_ATTRIBUTE_COOPERATIVE_MULTI_DEVICE_LAUNCH = 96,               /**< Deprecated, ::cuLaunchCooperativeKernelMultiDevice is deprecated. */
    CU_DEVICE_ATTRIBUTE_MAX_SHARED_MEMORY_PER_BLOCK_OPTIN = 97,             /**< Maximum optin shared memory per block */
    CU_DEVICE_ATTRIBUTE_CAN_FLUSH_REMOTE_WRITES = 98,                       /**< The ::CU_STREAM_WAIT_VALUE_FLUSH flag and the ::CU_STREAM_MEM_OP_FLUSH_REMOTE_WRITES MemOp are supported on the device. See \ref CUDA_MEMOP for additional details. */
    CU_DEVICE_ATTRIBUTE_HOST_REGISTER_SUPPORTED = 99,                       /**< Device supports host memory registration via ::cudaHostRegister. */
    CU_DEVICE_ATTRIBUTE_PAGEABLE_MEMORY_ACCESS_USES_HOST_PAGE_TABLES = 100, /**< Device accesses pageable memory via the host's page tables. */
    CU_DEVICE_ATTRIBUTE_DIRECT_MANAGED_MEM_ACCESS_FROM_HOST = 101,          /**< The host can directly access managed memory on the device without migration. */
    CU_DEVICE_ATTRIBUTE_VIRTUAL_ADDRESS_MANAGEMENT_SUPPORTED = 102,         /**< Deprecated, Use CU_DEVICE_ATTRIBUTE_VIRTUAL_MEMORY_MANAGEMENT_SUPPORTED*/
    CU_DEVICE_ATTRIBUTE_VIRTUAL_MEMORY_MANAGEMENT_SUPPORTED = 102,         /**< Device supports virtual memory management APIs like ::cuMemAddressReserve, ::cuMemCreate, ::cuMemMap and related APIs */
    CU_DEVICE_ATTRIBUTE_HANDLE_TYPE_POSIX_FILE_DESCRIPTOR_SUPPORTED = 103,  /**< Device supports exporting memory to a posix file descriptor with ::cuMemExportToShareableHandle, if requested via ::cuMemCreate */
    CU_DEVICE_ATTRIBUTE_HANDLE_TYPE_WIN32_HANDLE_SUPPORTED = 104,           /**< Device supports exporting memory to a Win32 NT handle with ::cuMemExportToShareableHandle, if requested via ::cuMemCreate */
    CU_DEVICE_ATTRIBUTE_HANDLE_TYPE_WIN32_KMT_HANDLE_SUPPORTED = 105,       /**< Device supports exporting memory to a Win32 KMT handle with ::cuMemExportToShareableHandle, if requested via ::cuMemCreate */
    CU_DEVICE_ATTRIBUTE_MAX_BLOCKS_PER_MULTIPROCESSOR = 106,                /**< Maximum number of blocks per multiprocessor */
    CU_DEVICE_ATTRIBUTE_GENERIC_COMPRESSION_SUPPORTED = 107,                /**< Device supports compression of memory */
    CU_DEVICE_ATTRIBUTE_MAX_PERSISTING_L2_CACHE_SIZE = 108,                 /**< Maximum L2 persisting lines capacity setting in bytes. */
    CU_DEVICE_ATTRIBUTE_MAX_ACCESS_POLICY_WINDOW_SIZE = 109,                /**< Maximum value of CUaccessPolicyWindow::num_bytes. */
    CU_DEVICE_ATTRIBUTE_GPU_DIRECT_RDMA_WITH_CUDA_VMM_SUPPORTED = 110,      /**< Device supports specifying the GPUDirect RDMA flag with ::cuMemCreate */
    CU_DEVICE_ATTRIBUTE_RESERVED_SHARED_MEMORY_PER_BLOCK = 111,             /**< Shared memory reserved by CUDA driver per block in bytes */
    CU_DEVICE_ATTRIBUTE_SPARSE_CUDA_ARRAY_SUPPORTED = 112,                  /**< Device supports sparse CUDA arrays and sparse CUDA mipmapped arrays */
    CU_DEVICE_ATTRIBUTE_READ_ONLY_HOST_REGISTER_SUPPORTED = 113,            /**< Device supports using the ::cuMemHostRegister flag ::CU_MEMHOSTERGISTER_READ_ONLY to register memory that must be mapped as read-only to the GPU */
    CU_DEVICE_ATTRIBUTE_TIMELINE_SEMAPHORE_INTEROP_SUPPORTED = 114,         /**< External timeline semaphore interop is supported on the device */
    CU_DEVICE_ATTRIBUTE_MEMORY_POOLS_SUPPORTED = 115,                       /**< Device supports using the ::cuMemAllocAsync and ::cuMemPool family of APIs */
    CU_DEVICE_ATTRIBUTE_GPU_DIRECT_RDMA_SUPPORTED = 116,                    /**< Device supports GPUDirect RDMA APIs, like nvidia_p2p_get_pages (see https://docs.nvidia.com/cuda/gpudirect-rdma for more information) */
    CU_DEVICE_ATTRIBUTE_GPU_DIRECT_RDMA_FLUSH_WRITES_OPTIONS = 117,         /**< The returned attribute shall be interpreted as a bitmask, where the individual bits are described by the ::CUflushGPUDirectRDMAWritesOptions enum */
    CU_DEVICE_ATTRIBUTE_GPU_DIRECT_RDMA_WRITES_ORDERING = 118,              /**< GPUDirect RDMA writes to the device do not need to be flushed for consumers within the scope indicated by the returned attribute. See ::CUGPUDirectRDMAWritesOrdering for the numerical values returned here. */
    CU_DEVICE_ATTRIBUTE_MEMPOOL_SUPPORTED_HANDLE_TYPES = 119,               /**< Handle types supported with mempool based IPC */
    CU_DEVICE_ATTRIBUTE_CLUSTER_LAUNCH = 120,                               /**< Indicates device supports cluster launch */
    CU_DEVICE_ATTRIBUTE_DEFERRED_MAPPING_CUDA_ARRAY_SUPPORTED = 121,        /**< Device supports deferred mapping CUDA arrays and CUDA mipmapped arrays */
    CU_DEVICE_ATTRIBUTE_CAN_USE_64_BIT_STREAM_MEM_OPS = 122,                /**< 64-bit operations are supported in ::cuStreamBatchMemOp and related MemOp APIs. */
    CU_DEVICE_ATTRIBUTE_CAN_USE_STREAM_WAIT_VALUE_NOR = 123,                /**< ::CU_STREAM_WAIT_VALUE_NOR is supported by MemOp APIs. */
    CU_DEVICE_ATTRIBUTE_DMA_BUF_SUPPORTED = 124,                            /**< Device supports buffer sharing with dma_buf mechanism (the attribute perftest queries above) */
    CU_DEVICE_ATTRIBUTE_IPC_EVENT_SUPPORTED = 125,                          /**< Device supports IPC Events. */ 
    CU_DEVICE_ATTRIBUTE_MEM_SYNC_DOMAIN_COUNT = 126,                        /**< Number of memory domains the device supports. */
    CU_DEVICE_ATTRIBUTE_TENSOR_MAP_ACCESS_SUPPORTED = 127,                  /**< Device supports accessing memory using Tensor Map. */
    CU_DEVICE_ATTRIBUTE_UNIFIED_FUNCTION_POINTERS = 129,                    /**< Device supports unified function pointers. */
    CU_DEVICE_ATTRIBUTE_MAX
} CUdevice_attribute;
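
A standalone example of querying the GDR-related attributes from this enum (error handling omitted):

#include <stdio.h>
#include <cuda.h>

int main(void)
{
    CUdevice dev;
    int dmabuf = 0, gdr = 0;

    cuInit(0);
    cuDeviceGet(&dev, 0);
    /* Attribute 124: dma_buf export support (the perftest check above). */
    cuDeviceGetAttribute(&dmabuf, CU_DEVICE_ATTRIBUTE_DMA_BUF_SUPPORTED, dev);
    /* Attribute 116: GPUDirect RDMA (nvidia_p2p_get_pages) support. */
    cuDeviceGetAttribute(&gdr, CU_DEVICE_ATTRIBUTE_GPU_DIRECT_RDMA_SUPPORTED, dev);
    printf("DMA_BUF: %d, GPUDirect RDMA: %d\n", dmabuf, gdr);
    return 0;
}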

Common Commands

NUMA Binding

numactl --cpunodebind=0 --localalloc <command>
numactl --cpunodebind=<node_list> <command>
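
For example, to pin a perftest run to NUMA node 0 (a hypothetical placement; in the s114 topology below, mlx5_0 has CPU affinity on node 0):

numactl --cpunodebind=0 --localalloc ./ib_write_bw -d mlx5_0 --use_cuda=0 192.168.1.116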

Viewing GPU Topology

2025/03/11 19:22:44 s114 nvidia-smi topo -m
    GPU0    NIC0    NIC1    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X  PHB PHB 0-13,28-41  0       N/A
NIC0    PHB  X  PIX             
NIC1    PHB PIX  X              
​
Legend:
​
  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks
​
NIC Legend:
​
  NIC0: mlx5_0
  NIC1: mlx5_1
​
​
2025/03/11 19:22:44 s116 nvidia-smi topo -m
    GPU0    NIC0    NIC1    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X  PIX PIX 0-13,28-41  0       N/A
NIC0    PIX  X  PIX             
NIC1    PIX PIX  X              
​
Legend:
​
  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks
​
NIC Legend:
​
  NIC0: mlx5_0
  NIC1: mlx5_1
​
​
lstopo-no-graphics
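
lstopo-no-graphics (from the hwloc package) prints the same PCIe/NUMA hierarchy as plain text, which is useful for confirming which host bridge a given GPU and NIC share.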

Raising/Setting the GPU Clock to 875 MHz

nvidia-smi -i 0 -ac 3004,875
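# -ac <memory_clock,graphics_clock> sets application clocks in MHz; 3004,875 are the
# Tesla K40 values from NVIDIA's GPU Boost blog post. List the pairs your GPU accepts with:
# nvidia-smi -q -d SUPPORTED_CLOCKS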

Miscellaneous

GDA

IBGDA does not work with DMABUF

NVSHMEM vs. GPUDirect Async

https://developer.nvidia.com/blog/improving-network-performance-of-hpc-systems-using-nvidia-magnum-io-nvshmem-and-gpudirect-async

GDS

CPU vs. GDS I/O Path Comparison


Benefits of NVIDIA's Open-Source Driver

With an open-source driver you can trace code paths and see how kernel event scheduling interacts with your workload, which makes root-cause debugging faster. Enterprise software developers can also integrate the driver seamlessly into the custom Linux kernels configured for their projects. This helps improve the quality and security of the NVIDIA GPU driver through input and review from the Linux end-user community.

Reference: https://developer.nvidia.com/blog/nvidia-releases-open-source-gpu-kernel-modules/

QA

perftest fails when registering GPU memory, returning error code 14 (bad address for the GPU buffer)

#define EFAULT      14  /* Bad address */
libmlx5.so.1!mlx5_reg_mr(struct ibv_pd * pd, void * addr, size_t length, uint64_t hca_va, int acc) (\root\project\rdma\rdma-core\providers\mlx5\verbs.c:646)
libibverbs.so.1!ibv_reg_mr_iova2(struct ibv_pd * pd, void * addr, size_t length, uint64_t iova, unsigned int access) (\root\project\rdma\rdma-core\libibverbs\verbs.c:323)
__ibv_reg_mr(int is_access_const, unsigned int access) (\usr\include\infiniband\verbs.h:2591)
create_single_mr(struct pingpong_context * ctx, struct pingpong_context * ctx@entry, struct perftest_parameters * user_param, struct perftest_parameters * user_param@entry, int qp_index, int qp_index@entry) (\root\project\rdma\perftest\src\perftest_resources.c:1722)
create_mr(struct pingpong_context * ctx, struct pingpong_context * ctx@entry, struct perftest_parameters * user_param, struct perftest_parameters * user_param@entry) (\root\project\rdma\perftest\src\perftest_resources.c:1858)
ctx_init(struct pingpong_context * ctx, struct pingpong_context * ctx@entry, struct perftest_parameters * user_param, struct perftest_parameters * user_param@entry) (\root\project\rdma\perftest\src\perftest_resources.c:2094)
main(int argc, char ** argv) (\root\project\rdma\perftest\src\write_bw.c:161)
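
In practice, EFAULT from ibv_reg_mr() on a cuMemAlloc() pointer usually means the HCA could not pin the GPU address. One common cause (an educated guess for this trace, not confirmed by it) is that the peer-memory kernel module (nvidia-peermem, formerly nv_peer_mem) is not loaded; checking lsmod | grep peermem is a reasonable first step before moving to the dma-buf path described below.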

Recommendation: Migrate Software to the dma-buf API

Support for 4K page size in nv-p2p (bug 4316020). The NVIDIA driver's kernel-mode GPUDirect RDMA APIs, used for peer-direct support in MLNX_OFED and for GPUDirect Storage, are not supported on GH200 platforms when used with Linux kernels configured with a 4K page size. These APIs are non-functional and may lead to kernel memory corruption.

Users are strongly encouraged to move their software stacks to the dma-buf APIs, which require the open-source GPU driver, Linux kernel 5.12 or later, and an NVIDIA Turing or newer GPU. Since the dma-buf APIs work correctly on 4K-page kernels, they are the ideal mitigation for this issue.

Original text:

Support for 4k page size in nv-p2p. – 4316020
The NVIDIA driver's kernel mode GPUDirect RDMA APIs that are used for Peer-direct support in MLNX_OFED and GPUDirect Storage are not supported on GH200 platforms when used with Linux kernels that are configured with the 4K page size. These APIs are not functional and might lead to a kernel memory corruption.
​
Users are strongly encouraged to move their software stack to the dma-buf APIs, which requires the open-source GPU driver, Linux kernel 5.12 or later, and NVIDIA Turing™ + GPU. Since the dma-buf APIs work correctly on 4K page kernels, using the APIs is ideal mitigation for this issue.
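
The dma-buf path replaces those kernel peer-memory hooks: the CUDA allocation is exported as a dma-buf file descriptor and registered with ibv_reg_dmabuf_mr(), which is what perftest's --use_cuda_dmabuf mode does above. A minimal sketch (assumptions: CUDA 11.7+ for cuMemGetHandleForAddressRange, rdma-core with ibv_reg_dmabuf_mr support, a page-aligned allocation; error handling omitted; not perftest's verbatim code):

#include <stdint.h>
#include <cuda.h>
#include <infiniband/verbs.h>

/* Export a cuMemAlloc() range as a dma-buf fd, then register it as an MR.
 * Mirrors perftest's "using DMA-BUF for GPU buffer ..." and
 * "Calling ibv_reg_dmabuf_mr(offset=0, size=..., fd=...)" log lines. */
static struct ibv_mr *register_gpu_dmabuf(struct ibv_pd *pd,
                                          CUdeviceptr dptr, size_t size)
{
    int fd = -1;

    /* Requires CU_DEVICE_ATTRIBUTE_DMA_BUF_SUPPORTED == 1 on this GPU. */
    cuMemGetHandleForAddressRange(&fd, dptr, size,
                                  CU_MEM_RANGE_HANDLE_TYPE_DMA_BUF_FD, 0);

    /* offset 0 within the dma-buf; iova set to the GPU VA. */
    return ibv_reg_dmabuf_mr(pd, 0, size, (uint64_t)dptr, fd,
                             IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE);
}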

References

GPU Direct

Nvidia Magnum IO: https://www.nvidia.com/en-us/data-center/magnum-io/

Nvidia GPU Direct: https://developer.nvidia.com/gpudirect

GDR whitepaper (Developing a Linux Kernel Module using GPUDirect RDMA): https://docs.nvidia.com/cuda/gpudirect-rdma/index.html

GDR benchmark tests: https://docs.nvidia.com/networking/display/gpudirectrdmav18/benchmark+tests

Benchmarking GPUDirect RDMA on modern server platforms: https://developer.nvidia.com/blog/benchmarking-gpudirect-rdma-on-modern-server-platforms/

GDR: difference between ibv_reg_mr and ibv_reg_dmabuf_mr: https://forums.developer.nvidia.com/t/gpudirect-rdma-difference-between-ibv-reg-mr-and-ibv-reg-dmabuf-mr/262313

Linux kernel dma_buf documentation: https://kernel.org/doc/html/v5.18/userspace-api/media/v4l/dmabuf.html

Introduction to CUDA-aware MPI: https://developer.nvidia.com/blog/introduction-cuda-aware-mpi/

Increasing application performance with NVIDIA GPU Boost (GPU clocks): https://developer.nvidia.com/blog/cuda-pro-tip-increase-application-performance-nvidia-gpu-boost/

GPUDirect RDMA technical analysis and practice: https://mp.weixin.qq.com/s/gviH1YbddJx_s7U4TI2W2w

Efficient inter-node MPI communication using GPUDirect RDMA for InfiniBand clusters with NVIDIA GPUs: https://ieeexplore.ieee.org/abstract/document/6687341

GPUs and accelerators are ubiquitous in modern supercomputing systems, and scientific applications from many domains are being adapted to exploit their compute power. However, data movement remains a critical bottleneck in harnessing the full potential of GPUs: data in GPU memory had to be moved into host memory before it could be sent over the network. MPI libraries such as MVAPICH2 already mitigate this bottleneck with techniques like pipelining. GPUDirect RDMA, introduced with CUDA 5.0, allows third-party devices such as network adapters to access data in GPU device memory directly over the PCIe bus, and NVIDIA partnered with Mellanox to make this solution available for InfiniBand clusters. The paper evaluates the first release of GPUDirect RDMA for InfiniBand and proposes designs in the MVAPICH2 MPI library to use it efficiently, highlighting the limitations of then-current architectures and addressing them with new designs in MVAPICH2. To the authors' knowledge, it was the first work to demonstrate inter-node GPU-to-GPU MPI communication using GPUDirect RDMA. The proposed designs improve the latency of inter-node GPU-to-GPU MPI Send/Recv by 69% for 4-byte messages and 32% for 128 KB messages, and improve achievable one-way bandwidth by 2x for 4 KB messages and 35% for 64 KB messages. Applied to two end applications, LBMGPU and AWP-ODC, the designs reduce communication time by 35% and 40%, respectively. Published at the 42nd International Conference on Parallel Processing (ICPP), 2013.

MLNX_OFED GDR support: https://network.nvidia.com/products/GPUDirect-RDMA/

Q&A: GPU does not support DMA_BUF in perftest (use_cuda_dmabuf): https://forums.developer.nvidia.com/t/use-cuda-dmabuf-is-not-supported-on-this-gpu/260105

Setting up an NCCL environment: https://cloud.baidu.com/doc/GPU/s/Yl3mr0ren

NVIDIA 535 driver release notes: https://docs.nvidia.com/datacenter/tesla/tesla-release-notes-535-183-06/index.html

CUDA

NVIDIA CUDA installation guide for Linux: https://docs.nvidia.com/cuda/cuda-installation-guide-linux

Papers

[1] Offloading communication control logic in GPU accelerated applications (GDA)

[2] Progress of Upstream GPU RDMA Support, 2021 (Jianxin Xiong)

Linux

DMA_BUF documentation: https://docs.kernel.org/driver-api/dma-buf.html

NVIDIA open-source GPU driver (discussion of real dma_buf support): https://github.com/NVIDIA/open-gpu-kernel-modules/discussions/243

AI

DeepSeek (DeepEP) all-to-all source code analysis: https://www.cnblogs.com/CQzhangyu/p/18741625

Original statement: this article was published on the Tencent Cloud Developer Community with the author's authorization and may not be reproduced without permission.

For infringement concerns, contact cloudcommunity@tencent.com for removal.

