libfabric_ofa_简介_指南_设计思想_高性能网络2

原创 | 作者:晓兵 | 修改于 2025-06-13 | 专栏:DPU

上文: libfabric_ofa_简介_指南_设计思想_高性能网络1: https://cloud.tencent.com/developer/article/2531002

Memory Footprint 内存占用

Memory footprint concerns are most notable among high-performance computing (HPC) applications that communicate with thousands of peers. Excessive memory consumption impacts application scalability, limiting the number of peers that can operate in parallel to solve problems. There is often a trade-off between minimizing the memory footprint needed for network communication, application performance, and ease of use of the network interface.

As we discussed with the socket API semantics, part of the ease of using sockets comes from the network layering copying the user's buffer into an internal buffer belonging to the network stack. The amount of internal buffering that's made available to the application directly correlates with the bandwidth that an application can achieve. In general, larger internal buffering increases network performance, with a cost of increasing the memory footprint consumed by the application. This memory footprint exists independent of the amount of memory allocated directly by the application. Eliminating network buffering not only helps with performance, but also scalability, by reducing the memory footprint needed to support the application.

While network memory buffering increases as an application scales, it can often be configured to a fixed size. The amount of buffering needed is dependent on the number of active communication streams being used at any one time. That number is often significantly lower than the total number of peers that an application may need to communicate with. The amount of memory required to address the peers, however, usually has a linear relationship with the total number of peers.

With the socket API, each peer is identified using a struct sockaddr. If we consider a UDP based socket application using IPv4 addresses, a peer is identified by the following address.

内存占用问题在与数千个对等点进行通信的高性能计算 (HPC) 应用程序中最为显著。过多的内存消耗会影响应用程序的可扩展性,从而限制可以并行运行以解决问题的对等点的数量。通常需要在最小化网络通信所需的内存占用、应用程序性能和网络接口的易用性之间进行权衡。

正如我们对套接字 API 语义所讨论的,使用套接字的部分易用性来自网络分层,将用户的缓冲区复制到属于网络堆栈的内部缓冲区。应用程序可用的内部缓冲量与应用程序可以实现的带宽直接相关。通常,较大的内部缓冲会提高网络性能,但代价是会增加应用程序消耗的内存占用。这种内存占用独立于应用程序直接分配的内存量。通过减少支持应用程序所需的内存占用,消除网络缓冲不仅有助于提高性能,还有助于提高可扩展性。

虽然网络内存缓冲随着应用程序的扩展而增加,但通常可以将其配置为固定大小。所需的缓冲量取决于任何时候使用的活动通信流的数量。该数字通常远低于应用程序可能需要与之通信的对等点的总数。然而,寻址对等点所需的内存量通常与对等点的总数呈线性关系。

使用套接字 API,每个对等点都使用 struct sockaddr 进行标识。如果我们考虑使用 IPv4 地址的基于 UDP 的套接字应用程序,则对等点由以下地址标识。

/* IPv4 socket address - with typedefs removed */
struct sockaddr_in {
    uint16_t sin_family; /* AF_INET */
    uint16_t sin_port;
    struct {
        uint32_t sin_addr;
    } in_addr;
};

In total, the application requires 8-bytes of addressing for each peer. If the app communicates with a million peers, that explodes to roughly 8 MB of memory space that is consumed just to maintain the address list. If IPv6 addressing is needed, then the requirement increases by a factor of 4.

Luckily, there are some tricks that can be used to help reduce the addressing memory footprint, though doing so will introduce more instructions into the code path to access the network stack. For instance, we can notice that all addresses in the above example have the same sin_family value (AF_INET). There's no need to store that for each address. This potentially shrinks each address from 8 bytes to 6. (We may be left with unaligned data, but that's a trade-off for reducing the memory consumption). Depending on how the addresses are assigned, further reduction may be possible. For example, if the application uses the same set of port addresses at each node, then we can eliminate storing the port, and instead calculate it from some base value. This type of trick can be applied to the IP portion of the address if the app is lucky enough to run across sequential IP addresses.

The main issue with this sort of address reduction is that it is difficult to achieve. It requires that each application check for and handle address compression, exposing the application to the addressing format used by the networking stack. It should be kept in mind that TCP/IP and UDP/IP addresses are logical addresses, not physical. When running over Ethernet, the addresses that appear at the link layer are MAC addresses, not IP addresses. The IP to MAC address association is managed by the network software. We would like to provide addressing that is simple for an application to use, but at the same time can provide a minimal memory footprint.

总的来说,该应用程序需要为每个对等方提供 8 字节的寻址。如果应用程序与一百万个对等点进行通信,那么仅维护地址列表就会膨胀到大约 8 MB 的内存空间。如果需要 IPv6 寻址,则该需求会增加为原来的 4 倍。

幸运的是,有一些技巧可以用来帮助减少寻址内存占用,尽管这样做会在代码路径中引入更多指令来访问网络堆栈。例如,我们可以注意到上例中的所有地址都具有相同的 sin_family 值 (AF_INET)。无需为每个地址存储它。这可能会将每个地址从 8 个字节缩小到 6 个。(我们可能会留下未对齐的数据,但这是减少内存消耗的权衡)。根据地址的分配方式,可能会进一步减少。例如,如果应用程序在每个节点使用相同的端口地址集,那么我们可以消除存储端口,而是从某个基值计算它。如果应用程序足够幸运地跨连续 IP 地址运行,则可以将这种类型的技巧应用于地址的 IP 部分。

这种地址减少的主要问题是难以实现。它要求每个应用程序检查并处理地址压缩,将应用程序暴露给网络堆栈使用的寻址格式。应该记住,TCP/IP 和 UDP/IP 地址是逻辑地址,而不是物理地址。在以太网上运行时,出现在链路层的地址是 MAC 地址,而不是 IP 地址。 IP 到 MAC 地址的关联由网络软件管理。我们希望为应用程序提供简单易用的寻址,但同时可以提供最小的内存占用。
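
To make the compression idea above concrete, here is a minimal C sketch of a hypothetical 6-byte peer entry that drops the constant sin_family field and keeps only the port and IPv4 address. The struct name and helper function are invented for illustration; they are not part of the socket API or libfabric.

/* Hypothetical compact peer entry: sin_family (always AF_INET) is dropped,
 * leaving 6 bytes per peer instead of 8.  Illustration only. */
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>

#pragma pack(push, 1)
struct compact_peer {
    uint16_t port;       /* network byte order */
    uint32_t ipv4_addr;  /* network byte order */
};
#pragma pack(pop)

/* Rebuild a full sockaddr_in from a compact entry when the socket API needs one. */
static void expand_peer(const struct compact_peer *p, struct sockaddr_in *sin)
{
    memset(sin, 0, sizeof(*sin));
    sin->sin_family = AF_INET;        /* implied, no longer stored per peer */
    sin->sin_port = p->port;
    sin->sin_addr.s_addr = p->ipv4_addr;
}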

Communication Resources 通讯资源

We need to take a brief detour in the discussion in order to delve deeper into the network problem and solution space. Instead of continuing to think of a socket as a single entity, with both send and receive capabilities, we want to consider its components separately. A network socket can be viewed as three basic constructs: a transport level address, a send or transmit queue, and a receive queue. Because our discussion will begin to pivot away from pure socket semantics, we will refer to our network 'socket' as an endpoint.

In order to reduce an application's memory footprint, we need to consider features that fall outside of the socket API. So far, much of the discussion has been around sending data to a peer. We now want to focus on the best mechanisms for receiving data.

With sockets, when an app has data to receive (indicated, for example, by a POLLIN event), we call recv(). The network stack copies the receive data into its buffer and returns. If we want to avoid the data copy on the receive side, we need a way for the application to post its buffers to the network stack before data arrives.

Arguably, a natural way of extending the socket API to support this feature is to have each call to recv() simply post the buffer to the network layer. As data is received, the receive buffers are removed in the order that they were posted. Data is copied into the posted buffer and returned to the user. It should be noted that the size of the posted receive buffer may be larger (or smaller) than the amount of data received. If the available buffer space is larger, hypothetically, the network layer could wait a short amount of time to see if more data arrives. If nothing more arrives, the receive completes with the buffer returned to the application.

This raises an issue regarding how to handle buffering on the receive side. So far, with sockets we've mostly considered a streaming protocol. However, many applications deal with messages which end up being layered over the data stream. If they send an 8 KB message, they want the receiver to receive an 8 KB message. Message boundaries need to be maintained.

If an application sends and receives a fixed sized message, buffer allocation becomes trivial. The app can post X number of buffers each of an optimal size. However, if there is a wide mix in message sizes, difficulties arise. It is not uncommon for an app to have 80% of its messages be a couple hundred bytes or less, but 80% of the total data that it sends to be in large transfers that are, say, a megabyte or more. Pre-posting receive buffers in such a situation is challenging.

A commonly used technique to handle this situation is to implement one application level protocol for smaller messages, and use a separate protocol for transfers that are larger than some given threshold. This would allow an application to post a bunch of smaller messages, say 4 KB, to receive data. For transfers that are larger than 4 KB, a different communication protocol is used, possibly over a different socket or endpoint.

我们需要在讨论中绕道而行,以便更深入地研究网络问题和解决方案空间。我们不想继续将套接字视为具有发送和接收功能的单个实体,而是要单独考虑其组件。网络套接字可以被视为三个基本结构:传输层地址、发送或传输队列以及接收队列。因为我们的讨论将开始脱离纯套接字语义,我们将把我们的网络“套接字”称为端点。

为了减少应用程序的内存占用,我们需要考虑不属于套接字 API 的特性。到目前为止,大部分讨论都是围绕向对等点发送数据进行的。我们现在要关注接收数据的最佳机制。

对于套接字,当应用程序有数据要接收(例如,由 POLLIN 事件指示)时,我们调用 recv()。网络堆栈将接收到的数据复制到其缓冲区并返回。如果我们想避免接收端的数据复制,我们需要一种方法让应用程序在数据到达之前将其缓冲区发布到网络堆栈。

可以说,扩展套接字 API 以支持此功能的一种自然方式是让对 recv() 的每次调用都简单地将缓冲区发布到网络层。当接收到数据时,接收缓冲区会按照它们发布的顺序被移除。数据被复制到发布的缓冲区并返回给用户。应注意,发布的接收缓冲区的大小可能大于(或小于)接收的数据量。如果可用缓冲区空间更大,假设网络层可以等待很短的时间来查看是否有更多数据到达。如果没有其他内容到达,则接收完成,缓冲区返回给应用程序。
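
For comparison with the recv()-posts-a-buffer idea above, the following is a minimal sketch of pre-posting fixed-size receive buffers using libfabric's fi_recv(). It assumes an already opened and enabled endpoint ep and a registered descriptor desc, and it keeps error handling to a single return-code check.

#include <rdma/fabric.h>
#include <rdma/fi_endpoint.h>

#define NUM_RECVS 16
#define BUF_SIZE  4096

/* Pre-post a set of fixed-size receive buffers before any data arrives.
 * 'ep', the buffers, and 'desc' are assumed to have been set up elsewhere. */
static int post_receives(struct fid_ep *ep, char bufs[NUM_RECVS][BUF_SIZE], void *desc)
{
    for (int i = 0; i < NUM_RECVS; i++) {
        /* Posted buffers are consumed, in order, as messages arrive. */
        ssize_t ret = fi_recv(ep, bufs[i], BUF_SIZE, desc, FI_ADDR_UNSPEC, &bufs[i]);
        if (ret)
            return (int) ret;
    }
    return 0;
}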

这引发了一个关于如何在接收端处理缓冲的问题。到目前为止,对于套接字,我们主要考虑的是流式协议。然而,许多应用程序处理的消息最终被分层覆盖在数据流上。如果他们发送一个 8 KB 的消息,他们希望接收者接收一个 8 KB 的消息。需要维护消息边界。

如果应用程序发送和接收固定大小的消息,则缓冲区分配变得微不足道。该应用程序可以发布 X 个缓冲区,每个缓冲区都具有最佳大小。但是,如果消息大小混杂,就会出现困难。应用程序有 80% 的消息是几百字节或更少的情况并不少见,但它发送的总数据的 80% 是大型传输,例如 1 兆字节或更多。在这种情况下预先发布接收缓冲区是具有挑战性的。

用于处理这种情况的常用技术是为较小的消息实现一个应用程序级协议,并为大于某个给定阈值的传输使用单独的协议。这将允许应用程序发布一堆较小的消息,例如 4 KB,以接收数据。对于大于 4 KB 的传输,使用不同的通信协议,可能通过不同的套接字或端点。

Shared Receive Queues 共享接收队列(SRQ)

If an application pre-posts receive buffers to a network queue, it needs to balance the size of each buffer posted, the number of buffers that are posted to each queue, and the number of queues that are in use. With a socket like approach, each socket would maintain an independent receive queue where data is placed. If an application is using 1000 endpoints and posts 100 buffers, each 4 KB, that results in 400 MB of memory space being consumed to receive data. (We can start to realize that by eliminating memory copies, one of the trade offs is increased memory consumption.) While 400 MB seems like a lot of memory, there is less than half a megabyte allocated to a single receive queue. At today's networking speeds, that amount of space can be consumed within milliseconds. The result is that if only a few endpoints are in use, the application will experience long delays where flow control will kick in and back the transfers off.

There are a couple of observations that we can make here. The first is that in order to achieve high scalability, we need to move away from a connection-oriented protocol, such as streaming sockets. Secondly, we need to reduce the number of receive queues that an application uses.

A shared receive queue is a network queue that can receive data for many different endpoints at once. With shared receive queues, we no longer associate a receive queue with a specific transport address. Instead network data will target a specific endpoint address. As data arrives, the endpoint will remove an entry from the shared receive queue, place the data into the application's posted buffer, and return it to the user. Shared receive queues can greatly reduce the amount of buffer space needed by an application. In the previous example, if a shared receive queue were used, the app could post 10 times the number of buffers (1000 total), yet still consume 100 times less memory (4 MB total). This is far more scalable. The drawback is that the application must now be aware of receive queues and shared receive queues, rather than considering the network only at the level of a socket.

如果应用程序将接收缓冲区预先发布到网络队列,它需要平衡每个发布的缓冲区的大小、发布到每个队列的缓冲区数量以及正在使用的队列数量。使用类似套接字的方法,每个套接字将维护一个放置数据的独立接收队列。如果应用程序使用 1000 个端点并发布 100 个缓冲区,每个 4 KB,这将导致 400 MB 的内存空间用于接收数据。 (我们可以开始意识到,通过消除内存副本,权衡之一是增加了内存消耗。)虽然 400 MB 似乎是很多内存,但分配给单个接收队列的内存不到 0.5 兆字节。以今天的网络速度,可以在几毫秒内消耗掉这么多的空间。结果是,如果只有少数端点在使用,应用程序将经历长时间的延迟,此时流量控制将启动并停止传输。

我们可以在这里进行一些观察。首先是为了实现高可扩展性,我们需要远离面向连接的协议,例如流式套接字。其次,我们需要减少应用程序使用的接收队列数量。

共享接收队列是一个网络队列,可以同时接收许多不同端点的数据。使用共享接收队列,我们不再将接收队列与特定传输地址相关联。相反,网络数据将针对特定的端点地址。当数据到达时,端点将从共享接收队列中删除一个条目,将数据放入应用程序的发布缓冲区,并将其返回给用户。共享接收队列可以大大减少应用程序所需的缓冲区空间量。在前面的示例中,如果使用共享接收队列,应用程序可以发布 10 倍的缓冲区(总共 1000 个),但仍然消耗 100 倍的内存(总共 4 MB)。这更具可扩展性。缺点是应用程序现在必须知道接收队列和共享接收队列,而不是仅在套接字级别考虑网络。
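
libfabric exposes this idea through shared receive contexts. The sketch below, which assumes an open domain, a provider that supports shared contexts, and omits error cleanup, creates one shared receive context and binds several endpoints to it so they all draw buffers from the same queue.

#include <rdma/fabric.h>
#include <rdma/fi_endpoint.h>
#include <rdma/fi_domain.h>

/* Create one shared receive context and bind several endpoints to it,
 * so that all of them draw receive buffers from the same queue. */
static int setup_srx(struct fid_domain *domain, struct fi_info *info,
                     struct fid_ep *eps[], int num_eps, struct fid_ep **srx)
{
    int ret = fi_srx_context(domain, info->rx_attr, srx, NULL);
    if (ret)
        return ret;

    for (int i = 0; i < num_eps; i++) {
        ret = fi_ep_bind(eps[i], &(*srx)->fid, 0);   /* share one receive queue */
        if (ret)
            return ret;
    }
    return 0;
}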

Multi-Receive Buffers 多个接收缓冲区

Shared receive queues greatly improve application scalability; however, they still result in some inefficiencies as defined so far. We've only considered the case of posting a series of fixed sized memory buffers to the receive queue. As mentioned, determining the size of each buffer is challenging. Transfers larger than the fixed size require using some other protocol in order to complete. If transfers are typically much smaller than the fixed size, then the extra buffer space goes unused.

Again referring to our example, if the application posts 1000 buffers, then it can only receive 1000 messages before the queue is emptied. At data rates measured in millions of messages per second, this will introduce stalls in the data stream. An obvious solution is to increase the number of buffers posted. The problem is dealing with variable sized messages, including some which are only a couple hundred bytes in length. For example, if the average message size in our case is 256 bytes or less, then even though we've allocated 4 MB of buffer space, we only make use of 6% of that space. The rest is wasted in order to handle messages which may only occasionally be up to 4 KB.

A second optimization that we can make is to fill up each posted receive buffer as messages arrive. So, instead of a 4 KB buffer being removed from use as soon as a single 256 byte message arrives, it can instead receive up to sixteen 256-byte messages. We refer to such a feature as 'multi-receive' buffers.

With multi-receive buffers, instead of posting a bunch of smaller buffers, we instead post a single larger buffer, say the entire 4 MB, at once. As data is received, it is placed into the posted buffer. Unlike TCP streams, we still maintain message boundaries. The advantages here are twofold. Not only is memory used more efficiently, allowing us to receive more smaller messages at once and larger messages overall, but we reduce the number of function calls that the application must make to maintain its supply of available receive buffers.

When combined with shared receive queues, multi-receive buffers help support optimal receive side buffering and processing. The main drawback to supporting multi-receive buffers is that the application will not necessarily know up front how many messages may be associated with a single posted memory buffer. This is rarely a problem for applications.

共享接收队列极大地提高了应用程序的可扩展性;但是,它仍然会导致迄今为止定义的一些低效率。我们只考虑了将一系列固定大小的内存缓冲区发布到接收队列的情况。如前所述,确定每个缓冲区的大小具有挑战性。大于固定大小的传输需要使用其他协议才能完成。如果传输通常比固定大小小得多,则额外的缓冲区空间将未被使用。

再次参考我们的示例,如果应用程序发布 1000 个缓冲区,那么在队列清空之前它只能接收 1000 条消息。在以每秒数百万条消息测量的数据速率下,这将在数据流中引入停顿。一个明显的解决方案是增加发布的缓冲区数量。问题在于处理可变大小的消息,包括一些只有几百字节长度的消息。例如,如果我们案例中的平均消息大小为 256 字节或更小,那么即使我们分配了 4 MB 的缓冲区空间,我们也只使用了该空间的 6%。其余的被浪费以处理可能只是偶尔达到 4 KB 的消息。

我们可以进行的第二个优化是在消息到达时填满每个已发布的接收缓冲区。因此,不是在单个 256 字节消息到达时就让一个 4 KB 缓冲区退出使用,而是让它最多接收 16 条 256 字节的消息。我们将这种特性称为“多接收”缓冲区。

对于多接收缓冲区,我们不是发布一堆较小的缓冲区,而是一次发布一个更大的缓冲区,比如整个 4 MB。接收到数据后,会将其放入已发布的缓冲区中。与 TCP 流不同,我们仍然维护消息边界。这里的优势是双重的。不仅内存使用效率更高,允许我们一次接收更多较小的消息和整体较大的消息,而且我们减少了应用程序为维持其可用接收缓冲区的供应而必须进行的函数调用的数量。

当与共享接收队列结合使用时,多接收缓冲区有助于支持最佳接收端缓冲和处理。支持多接收缓冲区的主要缺点是应用程序不一定预先知道有多少消息可能与单个发布的内存缓冲区相关联。这对于应用程序来说很少是一个问题。
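
In libfabric this behavior maps to the FI_MULTI_RECV flag: one large buffer is posted and the provider packs incoming messages into it until the remaining space falls below a configurable threshold. A rough sketch follows; the endpoint, the registered buffer, and the 16 KB threshold are assumptions for illustration, and error handling is minimal.

#include <sys/uio.h>
#include <rdma/fabric.h>
#include <rdma/fi_endpoint.h>

/* Post a single large buffer that can absorb many incoming messages. */
static int post_multi_recv(struct fid_ep *ep, void *buf, size_t len, void *desc)
{
    size_t min_free = 16 * 1024;   /* example threshold: release the buffer when
                                      less than 16 KB of space remains */
    int ret = fi_setopt(&ep->fid, FI_OPT_ENDPOINT, FI_OPT_MIN_MULTI_RECV,
                        &min_free, sizeof(min_free));
    if (ret)
        return ret;

    struct iovec iov = { .iov_base = buf, .iov_len = len };
    struct fi_msg msg = {
        .msg_iov = &iov, .desc = &desc, .iov_count = 1,
        .addr = FI_ADDR_UNSPEC, .context = buf, .data = 0,
    };
    /* FI_MULTI_RECV: keep reusing this buffer until it is nearly full. */
    return (int) fi_recvmsg(ep, &msg, FI_MULTI_RECV);
}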

Optimal Hardware Allocation 最佳硬件分配

As part of scalability considerations, we not only need to consider the processing and memory resources of the host system, but also the allocation and use of the NIC hardware. We've referred to network endpoints as a combination of transport addressing, transmit queues, and receive queues. The latter two queues are often implemented as hardware command queues. Command queues are used to signal the NIC to perform some sort of work. A transmit queue indicates that the NIC should transfer data. A transmit command often contains information such as the address of the buffer to transmit, the length of the buffer, and destination addressing data. The actual format and data contents vary based on the hardware implementation.

NICs have limited resources. Only the most scalable, high-performance applications likely need to be concerned with utilizing NIC hardware optimally. However, such applications are an important and specific focus of OFI. Managing NIC resources is often handled by a resource manager application, which is responsible for allocating systems to competing applications, among other activities.

Supporting applications that wish to make optimal use of hardware requires that hardware related abstractions be exposed to the application. Such abstractions cannot require a specific hardware implementation, and care must be taken to ensure that the resulting API is still usable by developers unfamiliar with dealing with such low level details. Exposing concepts such as shared receive queues is an example of giving an application more control over how hardware resources are used.

作为可扩展性考虑的一部分,我们不仅需要考虑主机系统的处理和内存资源,还要考虑网卡硬件的分配和使用。我们将网络端点称为传输寻址、传输队列和接收队列的组合。后两个队列通常实现为硬件命令队列。命令队列用于向 NIC 发出信号以执行某种工作。传输队列指示 NIC 应该传输数据。传输命令通常包含诸如要传输的缓冲区地址、缓冲区长度和目标寻址数据等信息。实际格式和数据内容因硬件实现而异。

NIC 的资源有限。只有最具可扩展性的高性能应用程序才可能需要关注以最佳方式利用 NIC 硬件。然而,此类应用是 OFI 的一个重要且具体的重点。管理 NIC 资源通常由资源管理器应用程序处理,该应用程序负责将系统分配给竞争应用程序以及其他活动。

支持希望充分利用硬件的应用程序需要向应用程序公开与硬件相关的抽象。这种抽象不需要特定的硬件实现,必须注意确保生成的 API 仍然可供不熟悉处理此类低级细节的开发人员使用。公开诸如共享接收队列之类的概念是让应用程序更好地控制硬件资源使用方式的一个示例。

Sharing Command Queues 共享命令队列

By exposing the transmit and receive queues to the application, we open the possibility for the application that makes use of multiple endpoints to determine how those queues might be shared. We talked about the benefits of sharing a receive queue among endpoints. The benefits of sharing transmit queues are not as obvious.

An application that uses more addressable endpoints than there are transmit queues will need to share transmit queues among the endpoints. By controlling which endpoint uses which transmit queue, the application can prioritize traffic. A transmit queue can also be configured to optimize for a specific type of data transfer, such as large transfers only.

From the perspective of a software API, sharing transmit or receive queues implies exposing those constructs to the application, and allowing them to be associated with different endpoint addresses.

通过向应用程序公开传输队列和接收队列,使用多个端点的应用程序就有可能自行决定如何共享这些队列。我们已经讨论过在端点之间共享接收队列的好处。共享传输队列的好处则不那么明显。

使用比传输队列更多的可寻址端点的应用程序将需要在端点之间共享传输队列。 通过控制哪个端点使用哪个传输队列,应用程序可以优先处理流量。 传输队列也可以配置为优化特定类型的数据传输,例如仅大型传输。

从软件 API 的角度来看,共享传输或接收队列意味着将这些构造暴露给应用程序,并允许它们与不同的端点地址相关联。

Multiple Queues 多队列

The opposite of a shared command queue is an endpoint that has multiple queues. An application that can take advantage of multiple transmit or receive queues can increase parallel handling of messages without synchronization constraints. Being able to use multiple command queues through a single endpoint has advantages over using multiple endpoints. Multiple endpoints require separate addresses, which increases memory use. A single endpoint with multiple queues can continue to expose a single address, while taking full advantage of available NIC resources.

与共享命令队列相反的是具有多个队列的端点。 可以利用多个传输或接收队列的应用程序可以增加对消息的并行处理而没有同步限制。 能够通过单个端点使用多个命令队列比使用多个端点具有优势。 多个端点需要单独的地址,这会增加内存使用。 具有多个队列的单个端点可以继续公开单个地址,同时充分利用可用的 NIC 资源。
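
libfabric expresses this through scalable endpoints, where a single addressable endpoint owns several independent transmit (and receive) contexts. A minimal sketch, assuming the provider supports scalable endpoints and using default context attributes; error cleanup omitted.

#include <rdma/fabric.h>
#include <rdma/fi_endpoint.h>

/* One scalable endpoint (single address) with several transmit contexts
 * that can be driven from different threads without shared locking. */
static int setup_scalable_ep(struct fid_domain *domain, struct fi_info *info,
                             struct fid_ep **sep, struct fid_ep *tx_ctx[], int num_tx)
{
    int ret = fi_scalable_ep(domain, info, sep, NULL);
    if (ret)
        return ret;

    for (int i = 0; i < num_tx; i++) {
        /* NULL attributes: accept the provider's default per-context settings. */
        ret = fi_tx_context(*sep, i, NULL, &tx_ctx[i], NULL);
        if (ret)
            return ret;
    }
    return 0;
}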

Progress Model Considerations 进展模型注意事项

One aspect of the sockets programming interface that developers often don't consider is the location of the protocol implementation. This is usually managed by the operating system kernel. The network stack is responsible for handling flow control messages, timing out transfers, re-transmitting unacknowledged transfers, processing received data, and sending acknowledgments. This processing requires that the network stack consume CPU cycles. Portions of that processing can be done within the context of the application thread, but much must be handled by kernel threads dedicated to network processing.

By moving the network processing directly into the application process, we need to be concerned with how network communication makes forward progress. For example, how and when are acknowledgments sent? How are timeouts and message re-transmissions handled? The progress model defines this behavior, and it depends on how much of the network processing has been offloaded onto the NIC.

More generally, progress is the ability of the underlying network implementation to complete processing of an asynchronous request. In many cases, the processing of an asynchronous request requires the use of the host processor. For performance reasons, it may be undesirable for the provider to allocate a thread for this purpose, which will compete with the application thread(s). We can avoid thread context switches if the application thread can be used to make forward progress on requests -- check for acknowledgments, retry timed out operations, etc. Doing so requires that the application periodically call into the network stack.

开发人员通常不考虑的套接字编程接口的一个方面是协议实现的位置。这通常由操作系统内核管理。网络堆栈负责处理流控制消息、超时传输、重新传输未确认的传输、处理接收到的数据以及发送确认。此处理要求网络堆栈消耗 CPU 周期。该处理的一部分可以在应用程序线程的上下文中完成,但许多必须由专用于网络处理的内核线程处理。

通过将网络处理直接移动到应用程序进程中,我们需要关注网络通信如何向前推进。例如,如何以及何时发送确认?如何处理超时和消息重传?进度模型定义了这种行为,它取决于有多少网络处理已卸载到 NIC 上。

更一般地说,进度是底层网络实现完成异步请求处理的能力。在许多情况下,异步请求的处理需要使用主机处理器。出于性能原因,提供者可能不希望为此目的分配一个线程,这将与应用程序线程竞争。如果应用程序线程可用于对请求进行前向处理,我们可以避免线程上下文切换——检查确认、重试超时操作等。这样做需要应用程序定期调用网络堆栈。
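
An application states its preferred progress model through the domain attributes passed to fi_getinfo(). The sketch below requests manual progress, meaning the application promises to call into libfabric (for example via fi_cq_read()) often enough to drive acknowledgments and retransmissions. The API version number used here is only an example.

#include <rdma/fabric.h>

/* Ask for manual progress: the provider may rely on the application's own
 * calls (e.g. fi_cq_read) to advance outstanding transfers, instead of
 * spawning internal progress threads. */
static struct fi_info *get_manual_progress_info(void)
{
    struct fi_info *hints = fi_allocinfo();
    struct fi_info *info = NULL;

    if (!hints)
        return NULL;

    hints->domain_attr->control_progress = FI_PROGRESS_MANUAL;
    hints->domain_attr->data_progress = FI_PROGRESS_MANUAL;

    if (fi_getinfo(FI_VERSION(1, 9), NULL, NULL, 0, hints, &info))
        info = NULL;
    fi_freeinfo(hints);
    return info;
}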

Ordering 排序

Network ordering is a complex subject. With TCP sockets, data is sent and received in the same order. Buffers are re-usable by the application immediately upon returning from a function call. As a result, ordering is simple to understand and use. UDP sockets complicate things slightly. With UDP sockets, messages may be received out of order from how they were sent. In practice, this often doesn't occur, particularly if the application only communicates over a local area network, such as Ethernet.

With our evolving network API, there are situations where exposing different order semantics can improve performance. These details will be discussed further below.

网络排序是一个复杂的主题。使用 TCP 套接字,数据按相同的顺序发送和接收。从函数调用返回后,应用程序可以立即重用缓冲区。因此,这种排序很容易理解和使用。UDP 套接字会使事情稍微复杂化。使用 UDP 套接字,消息的接收顺序可能与发送顺序不同。实际上,这通常不会发生,尤其是当应用程序仅通过局域网(例如以太网)进行通信时。

随着我们不断发展的网络 API,在某些情况下公开不同的顺序语义可以提高性能。 这些细节将在下面进一步讨论。

Messages 消息

UDP sockets allow messages to arrive out of order because each message is routed from the sender to the receiver independently. This allows packets to take different network paths, to avoid congestion or take advantage of multiple network links for improved bandwidth. We would like to take advantage of the same features in those cases where the application doesn't care in which order messages arrive.

Unlike UDP sockets, however, our definition of message ordering is more subtle. UDP messages are small, MTU sized packets. In our case, messages may be gigabytes in size. We define message ordering to indicate whether the start of each message is processed in order or out of order. This is related to, but separate from the order of how the message payload is received.

An example will help clarify this distinction. Suppose that an application has posted two messages to its receive queue. The first receive points to a 4 KB buffer. The second receive points to a 64 KB buffer. The sender will transmit a 4 KB message followed by a 64 KB message. If messages are processed in order, then the 4 KB send will match with the 4 KB received, and the 64 KB send will match with the 64 KB receive. However, if messages can be processed out of order, then the sends and receives can mismatch, resulting in the 64 KB send being truncated.

In this example, we're not concerned with what order the data is received in. The 64 KB send could be broken into 64 1-KB transfers that take different routes to the destination. So, bytes 2k-3k could be received before bytes 1k-2k. Message ordering is not concerned with ordering within a message, only between messages. With ordered messages, the messages themselves need to be processed in order.

The more relaxed the message ordering can be, the more optimizations the network stack can use to transfer the data. However, the application must be aware of message ordering semantics, and be able to select the desired semantic for its needs. For the purposes of this section, messages refers to transport level operations, which includes RDMA and similar operations (some of which have not yet been discussed).

UDP 套接字允许消息无序到达,因为每条消息都是从发送方独立路由到接收方的。这允许数据包采用不同的网络路径,以避免拥塞或利用多个网络链接来提高带宽。在应用程序不关心消息到达的顺序的情况下,我们希望利用相同的功能。

然而,与 UDP 套接字不同,我们对消息顺序的定义更加微妙。 UDP 消息是 MTU 大小的小数据包。在我们的例子中,消息的大小可能是千兆字节。我们定义消息排序来指示每条消息的开始是按顺序处理还是乱序处理。这与接收消息有效负载的顺序有关,但与之不同。

一个例子将有助于阐明这种区别。假设应用程序已将两条消息发布到其接收队列。第一个接收指向一个 4 KB 的缓冲区。第二个接收指向一个 64 KB 的缓冲区。发送者将发送一个 4 KB 的消息,然后是一个 64 KB 的消息。如果消息按顺序处理,则 4 KB 发送将与 4 KB 接收匹配,64 KB 发送将与 64 KB 接收匹配。但是,如果可以乱序处理消息,则发送和接收可能会不匹配,从而导致 64 KB 发送被截断。

在此示例中,我们不关心数据以什么顺序被接收。64 KB 的发送可能被拆分为 64 个 1 KB 的传输,这些传输经由不同的路由到达目的地。因此,字节 2k-3k 可能在字节 1k-2k 之前被接收。消息排序不关心单条消息内部的顺序,只关心消息之间的顺序。对于有序消息,消息本身需要按顺序处理。

消息排序越宽松,网络堆栈可以用来传输数据的优化就越多。但是,应用程序必须了解消息排序语义,并且能够根据需要选择所需的语义。就本节而言,消息指的是传输层操作,包括 RDMA 和类似操作(其中一些尚未讨论)。
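
In libfabric, message ordering is negotiated through the msg_order bits of the transmit and receive attributes. A small, hedged sketch: setting FI_ORDER_SAS (send-after-send) asks that message sends be processed in the order they were issued, while clearing the bits tells the provider that out-of-order message processing is acceptable. hints is assumed to come from fi_allocinfo().

#include <rdma/fabric.h>

static void request_ordered_sends(struct fi_info *hints)
{
    /* Sends are processed at the target in the order they were posted. */
    hints->tx_attr->msg_order = FI_ORDER_SAS;
    hints->rx_attr->msg_order = FI_ORDER_SAS;
}

static void allow_unordered_sends(struct fi_info *hints)
{
    hints->tx_attr->msg_order = 0;   /* provider is free to reorder messages */
    hints->rx_attr->msg_order = 0;
}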

Data 数据

Data ordering refers to the receiving and placement of data both within and between messages. Data ordering is most important to messages that can update the same target memory buffer. For example, imagine an application that writes a series of database records directly into a peer memory location. Data ordering, combined with message ordering, ensures that the data from the second write updates memory after the first write completes. The result is that the memory location will contain the records carried in the second write.

Enforcing data ordering between messages requires that the messages themselves be ordered. Data ordering can also apply within a single message, though this level of ordering is usually less important to applications. Intra-message data ordering indicates that the data for a single message is received in order. Some applications use this feature to 'spin' reading the last byte of a receive buffer. Once the byte changes, the application knows that the operation has completed and all earlier data has been received. (Note that while such behavior is interesting for benchmark purposes, using such a feature in this way is strongly discouraged. It is not portable between networks or platforms.)

数据排序是指在消息内和消息之间接收和放置数据。数据排序对于可以更新相同目标内存缓冲区的消息来说是最重要的。例如,想象一个将一系列数据库记录直接写入对等内存位置的应用程序。数据排序与消息排序相结合,可确保来自第二次写入的数据在第一次写入完成后更新内存。结果是内存位置将包含第二次写入时携带的记录。

强制消息之间的数据排序要求消息本身是有序的。数据排序也可以应用在单个消息内部,尽管这种级别的排序通常对应用程序不太重要。消息内数据排序表示单个消息的数据是按顺序接收的。一些应用程序使用此功能自旋(轮询)读取接收缓冲区的最后一个字节:一旦该字节发生变化,应用程序就知道操作已经完成并且所有之前的数据都已收到。(请注意,虽然这种行为对于基准测试来说很有趣,但强烈建议不要以这种方式使用该功能,它不能在网络或平台之间移植。)
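
For the write-after-write scenario described above, the relevant ordering bit in libfabric is FI_ORDER_WAW, requested alongside FI_ORDER_SAS in the endpoint attributes. A brief sketch of the hints, assuming hints was allocated with fi_allocinfo() and other attributes are set elsewhere:

#include <rdma/fabric.h>

/* Request that RMA writes targeting the same peer memory are applied in the
 * order they were issued, so the second record lands after the first. */
static void request_write_after_write(struct fi_info *hints)
{
    hints->tx_attr->msg_order |= FI_ORDER_SAS | FI_ORDER_WAW;
    hints->rx_attr->msg_order |= FI_ORDER_SAS | FI_ORDER_WAW;
}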

Completions 完成

Completion ordering refers to the sequence in which asynchronous operations report their completion to the application. Typically, unreliable data transfers will naturally complete in the order that they are submitted to a transmit queue. Each operation is transmitted to the network, with the completion occurring immediately after. For reliable data transfers, an operation cannot complete until it has been acknowledged by the peer. Since ack packets can be lost or possibly take different paths through the network, operations can be marked as completed out of order. Out of order acks are more likely if messages can be processed out of order.

Asynchronous interfaces require that the application track their outstanding requests. Handling out of order completions can increase application complexity, but it does allow for optimizing network utilization.

完成顺序是指异步操作向应用程序报告完成的顺序。 通常,不可靠的数据传输自然会按照它们提交到传输队列的顺序完成。 每个操作都被传输到网络,然后立即完成。 对于可靠的数据传输,操作只有在对等方确认后才能完成。 由于 ack 数据包可能会丢失或可能通过网络采用不同的路径,因此可以将操作标记为无序完成。 如果消息可以乱序处理,则更有可能出现乱序确认。

异步接口要求应用程序跟踪其未完成的请求。 处理乱序完成会增加应用程序的复杂性,但它确实可以优化网络利用率。

OFI Architecture 开放Fabric接口架构

Libfabric is well architected to support the previously discussed features, with specific focus on exposing direct network access to an application. Direct network access, sometimes referred to as RDMA, allows an application to access network resources without operating system intervention. Data transfers can occur between networking hardware and application memory with minimal software overhead. Although libfabric supports scalable network solutions, it does not mandate any implementation. And the APIs have been defined specifically to allow multiple implementations.

The following diagram highlights the general architecture of the interfaces exposed by libfabric. For reference, the diagram shows libfabric in reference to a NIC.

Libfabric 的架构设计良好,能够支持前面讨论的功能,并特别关注向应用程序公开直接网络访问。直接网络访问(有时称为 RDMA)允许应用程序访问网络资源而无需操作系统干预。数据传输可以在网络硬件和应用程序内存之间以最小的软件开销进行。尽管 libfabric 支持可扩展的网络解决方案,但它并不强制要求任何实现,并且 API 的定义专门允许多种实现。

下图概述了 libfabric 公开的接口的总体架构。作为参考,该图以 NIC 为例展示了 libfabric。

Framework versus Provider 框架和提供者(实用程序+底层提供者)

OFI is divided into two separate components. The main component is the OFI framework, which defines the interfaces that applications use. The OFI framework provides some generic services; however, the bulk of the OFI implementation resides in the providers. Providers plug into the framework and supply access to fabric hardware and services. Providers are often associated with a specific hardware device or NIC. Because of the structure of the OFI framework, applications access the provider implementation directly for most operations, in order to ensure the lowest possible software latency.

One important provider is referred to as the sockets provider. This provider implements the libfabric API over TCP sockets. A primary objective of the sockets provider is to support development efforts. Developers can write and test their code over the sockets provider on a small system, possibly even a laptop, before debugging on a larger cluster. The sockets provider can also be used as a fallback mechanism for applications that wish to target libfabric features for high-performance networks, but which may still need to run on small clusters connected, for example, by Ethernet.

The UDP provider has a similar goal, but implements a much smaller feature set than the sockets provider. The UDP provider is implemented over UDP sockets. It only implements those features of libfabric which would be most useful for applications wanting unreliable, unconnected communication. The primary goal of the UDP provider is to provide a simple building block upon which the framework can construct more complex features, such as reliability. As a result, a secondary objective of the UDP provider is to improve application scalability when restricted to using native operating system sockets.

The final generic (not associated with a specific network technology) provider is often referred to as the utility provider. The utility provider is a collection of software modules that can be used to extend the feature coverage of any provider. For example, the utility provider layers over the UDP provider to implement connection-oriented and reliable endpoint types. It can similarly layer over a provider that only supports connection-oriented communication to expose reliable, connection-less (aka reliable datagram) semantics.

Other providers target specific network technologies and systems, such as InfiniBand, Cray Aries networks, or Intel Omni-Path Architecture.

OFI 分为两个独立的组件。主要组件是 OFI 框架,它定义了应用程序使用的接口。 OFI 框架提供了一些通用服务;然而,大部分的 OFI 实现都存在于提供程序中。提供商插入框架并提供对结构硬件和服务的访问。提供程序通常与特定的硬件设备或 NIC 相关联。由于 OFI 框架的结构,应用程序直接访问提供程序实现以进行大多数操作,以确保尽可能低的软件延迟。

一个重要的提供者被称为套接字提供者。此提供程序通过 TCP 套接字实现 libfabric API。套接字提供者的主要目标是支持开发工作。开发人员可以在小型系统(甚至可能是笔记本电脑)上通过套接字提供程序编写和测试他们的代码,然后再在更大的集群上进行调试。对于希望针对高性能网络的 libfabric 功能但可能仍需要在连接的小型集群(例如通过以太网)上运行的应用程序,套接字提供程序也可以用作后备机制。

UDP 提供者具有类似的目标,但实现的功能集比套接字提供者小得多。 UDP 提供程序是通过 UDP 套接字实现的。它只实现了 libfabric 的那些特性,这些特性对于需要不可靠、未连接通信的应用程序最有用。 UDP 提供者的主要目标是提供一个简单的构建块,框架可以在该构建块上构建更复杂的特性,例如可靠性。因此,UDP 提供程序的次要目标是在仅限于使用本机操作系统套接字时提高应用程序的可伸缩性。

最后一个通用的(即不与特定网络技术关联的)提供者通常被称为实用工具提供者(utility provider)。实用工具提供者是一组软件模块,可用于扩展任何提供者的功能覆盖范围。例如,实用工具提供者分层在 UDP 提供者之上,以实现面向连接和可靠的端点类型。它同样可以分层在仅支持面向连接通信的提供者之上,以公开可靠的、无连接(也称为可靠数据报)语义。

其他供应商针对特定的网络技术和系统,例如 InfiniBand、Cray Aries 网络或英特尔 Omni-Path 架构。

Control services

Control services are used by applications to discover information about the types of communication services available in the system. For example, discovery will indicate what fabrics are reachable from the local node, and what sort of communication each provides.

In terms of implementation, control services are handled primarily by a single API, fi_getinfo(). Modeled very loosely on getaddrinfo(), it is used not just to discover what features are available in the system, but also how they might best be used by an application desiring maximum performance.

Control services themselves are not considered performance critical. However, the information exchanged between an application and the providers must be expressive enough to indicate the most performant way to access the network. Those details must be balanced with ease of use. As a result, the fi_getinfo() call provides the ability to access complex network details, while allowing an application to ignore them if desired.

应用程序使用控制服务来发现有关系统中可用的通信服务类型的信息。 例如,发现将指示可以从本地节点访问哪些结构,以及每个结构提供什么样的通信。

在实现方面,控制服务主要由单个 API fi_getinfo() 处理。它大致仿照 getaddrinfo() 建模,不仅用于发现系统中可用的功能,还用于发现追求最高性能的应用程序应如何最好地使用这些功能。

控制服务本身不被视为性能关键。 但是,应用程序和提供者之间交换的信息必须具有足够的表达性,以指示访问网络的最佳性能方式。 这些细节必须与易用性相平衡。 因此,fi_getinfo() 调用提供了访问复杂网络详细信息的能力,同时允许应用程序在需要时忽略它们。
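
A minimal discovery sketch using fi_getinfo(): ask for providers that support reliable, unconnected messaging and print what was found. The requested API version and capability bits are illustrative choices, and error handling is reduced to a single check.

#include <stdio.h>
#include <rdma/fabric.h>

int main(void)
{
    struct fi_info *hints = fi_allocinfo();
    struct fi_info *info, *cur;

    if (!hints)
        return 1;
    hints->caps = FI_MSG;                 /* want send/receive messaging */
    hints->ep_attr->type = FI_EP_RDM;     /* reliable, unconnected endpoints */

    if (fi_getinfo(FI_VERSION(1, 9), NULL, NULL, 0, hints, &info))
        return 1;

    for (cur = info; cur; cur = cur->next)
        printf("provider: %s, fabric: %s\n",
               cur->fabric_attr->prov_name, cur->fabric_attr->name);

    fi_freeinfo(info);
    fi_freeinfo(hints);
    return 0;
}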

Communication Services

Communication interfaces are used to set up communication between nodes. They include calls to establish connections (connection management), as well as functionality used to address connection-less endpoints (address vectors).

The best match to socket routines would be connect(), bind(), listen(), and accept(). In fact, the connection management calls are modeled after those functions, but with improved support for the asynchronous nature of the calls. For performance and scalability reasons, connection-less endpoints use a unique model that is not based on sockets or other network interfaces. Address vectors are discussed in detail later, but target applications needing to talk with potentially thousands to millions of peers. For applications communicating with a handful of peers, address vectors can slightly complicate initialization for connection-less endpoints. (Connection-oriented endpoints may be a better option for such applications.)

通信接口用于建立节点之间的通信。 它包括建立连接的调用(连接管理),以及用于寻址无连接端点的功能(地址向量)。

与套接字例程最匹配的是 connect()、bind()、listen() 和 accept()。事实上,连接管理调用就是仿照这些函数建模的,但改进了对调用异步特性的支持。出于性能和可扩展性的原因,无连接端点使用一种独特的模型,它不基于套接字或其他网络接口。地址向量将在稍后详细讨论,其目标是可能需要与数千乃至数百万个对等方通信的应用程序。对于只与少数对等方通信的应用程序,地址向量可能会使无连接端点的初始化稍微复杂化。(对于此类应用程序,面向连接的端点可能是更好的选择。)

Completion Services

OFI exports asynchronous interfaces. Completion services are used to report the results of submitted data transfer operations. Completions may be reported using the cleverly named completion queues, which provide details about the operation that completed. Or, completions may be reported using lower-impact counters that simply return the number of operations that have completed.

Completion services are designed with high-performance, low-latency in mind. The calls map directly into the providers, and data structures are defined to minimize memory writes and cache impact. Completion services do not have corresponding socket APIs. (For Windows developers, they are similar to IO completion ports).

OFI 导出异步接口。 完成服务用于报告提交的数据传输操作的结果。 可以使用巧妙命名的完成队列来报告完成,该队列提供有关已完成操作的详细信息。 或者,可以使用影响较小的计数器报告完成,该计数器仅返回已完成的操作数。

完成服务的设计考虑了高性能、低延迟。 调用直接映射到提供程序,并定义数据结构以最小化内存写入和缓存影响。 完成服务没有相应的套接字 API。 (对于 Windows 开发人员来说,它们类似于 IO 完成端口)。
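
A sketch of the completion-queue path: open a CQ, bind it to an endpoint for both transmit and receive completions, and busy-poll until one completion arrives. It assumes an open domain and endpoint and trims error cleanup.

#include <rdma/fabric.h>
#include <rdma/fi_domain.h>
#include <rdma/fi_endpoint.h>
#include <rdma/fi_eq.h>

static int wait_one_completion(struct fid_domain *domain, struct fid_ep *ep,
                               struct fid_cq **cq_out)
{
    struct fi_cq_attr cq_attr = {
        .size = 0,                        /* provider default depth */
        .format = FI_CQ_FORMAT_CONTEXT,   /* report only the operation context */
    };
    struct fi_cq_entry entry;
    int ret;

    ret = fi_cq_open(domain, &cq_attr, cq_out, NULL);
    if (ret)
        return ret;

    /* Use one CQ for both transmit and receive completions. */
    ret = fi_ep_bind(ep, &(*cq_out)->fid, FI_TRANSMIT | FI_RECV);
    if (ret)
        return ret;

    /* Busy-poll until a single completion arrives (-FI_EAGAIN means empty). */
    do {
        ret = (int) fi_cq_read(*cq_out, &entry, 1);
    } while (ret == -FI_EAGAIN);

    return ret < 0 ? ret : 0;
}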

Data Transfer Services

Applications need different data transfer semantics. The data transfer services in OFI are designed around different communication paradigms. Although shown outside the data transfer services, triggered operations are strongly related to the data transfer operations.

There are four basic data transfer interface sets. Message queues expose the ability to send and receive data with message boundaries being maintained. Message queues act as FIFOs, with sent messages matched with receive buffers in the order that messages are received. The message queue APIs are derived from the socket data transfer APIs, such as send(), sendto(), sendmsg(), recv(), recvmsg(), etc.

Tag matching is similar to message queues in that it maintains message boundaries. Tag matching differs from message queues in that received messages are directed into buffers based on small steering tags that are carried in the sent message. This allows a receiver to post buffers labeled 1, 2, 3, and so forth, with sends labeled respectively. The benefit is that send 1 will match with receive buffer 1, independent of how send operations may be transmitted or re-ordered by the network.

RMA stands for remote memory access. RMA transfers allow an application to write data directly into a specific memory location in a target process, or to read memory from a specific address at the target process and return the data into a local buffer. RMA is essentially equivalent to RDMA; the exception being that RDMA originally defined a specific transport implementation of RMA.

Atomic operations are often viewed as a type of extended RMA transfer. They permit direct access to the memory on the target process. The benefit of atomic operations is that they allow for manipulation of the memory, such as incrementing the value found at the target buffer. So, where RMA can write the value X to a remote memory buffer, atomics can change the value of the remote memory buffer, say Y, to Y + 1. Because RMA and atomic operations provide direct access to a process’s memory buffers, additional security synchronization is needed.

应用程序需要不同的数据传输语义。 OFI 中的数据传输服务是围绕不同的通信范式设计的。尽管显示在数据传输服务之外,但触发操作与数据传输操作密切相关。

有四个基本的数据传输接口集。消息队列公开了在维护消息边界的情况下发送和接收数据的能力。消息队列充当 FIFO,发送的消息按照接收消息的顺序与接收缓冲区匹配。消息队列 API 派生自套接字数据传输 API,例如 send()、sendto()、sendmsg()、recv()、recvmsg() 等。

标签匹配类似于消息队列,因为它维护消息边界。标签匹配与消息队列的不同之处在于,接收到的消息根据发送消息中携带的小导向标签被定向到缓冲区中。这允许接收者发布标记为 1、2、3 等的缓冲区,并分别标记发送。好处是发送 1 将与接收缓冲区 1 匹配,而与网络如何传输或重新排序发送操作无关。

RMA 代表远程内存访问。 RMA 传输允许应用程序将数据直接写入目标进程中的特定内存位置,或者从目标进程的特定地址读取内存并将数据返回到本地缓冲区。 RMA 本质上等同于 RDMA;例外是 RDMA 最初定义了 RMA 的特定传输实现。

原子操作通常被视为一种扩展的 RMA 传输。它们允许直接访问目标进程的内存。原子操作的好处是它们允许对内存进行操作,例如递增目标缓冲区中的值。因此,RMA 可以将值 X 写入远程内存缓冲区,而原子操作可以把远程内存缓冲区的值(例如 Y)改为 Y + 1。由于 RMA 和原子操作提供对进程内存缓冲区的直接访问,因此需要额外的安全同步机制。
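
The four interface sets appear as distinct call families. The fragment below shows one representative call from each family; it assumes an enabled endpoint ep, a registered buffer buf with descriptor desc, a resolved peer address, and a remote buffer address/key pair exchanged out of band, and it ignores return codes for brevity.

#include <rdma/fabric.h>
#include <rdma/fi_endpoint.h>
#include <rdma/fi_tagged.h>
#include <rdma/fi_rma.h>
#include <rdma/fi_atomic.h>

/* One representative operation per data transfer interface set. */
static void data_transfer_examples(struct fid_ep *ep, void *buf, size_t len, void *desc,
                                   fi_addr_t peer, uint64_t remote_addr, uint64_t rkey)
{
    /* 1. Message queue: FIFO send, matched with posted receives in order. */
    fi_send(ep, buf, len, desc, peer, NULL);

    /* 2. Tag matching: the receiver steers this message by tag value 1. */
    fi_tsend(ep, buf, len, desc, peer, /*tag=*/1, NULL);

    /* 3. RMA: write directly into the peer's registered memory region. */
    fi_write(ep, buf, len, desc, peer, remote_addr, rkey, NULL);

    /* 4. Atomics: add one 64-bit value from buf into the remote location. */
    fi_atomic(ep, buf, 1, desc, peer, remote_addr, rkey, FI_UINT64, FI_SUM, NULL);
}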

Memory Registration

Memory registration is the security mechanism used to grant a remote peer access to local memory buffers. Registered memory regions associate memory buffers with permissions granted for access by fabric resources. A memory buffer must be registered before it can be used as the target of an RMA or atomic data transfer. Memory registration supports a simple protection mechanism. After a memory buffer has been registered, that registration request (buffer's address, buffer length, and access permission) is given a registration key. Peers that issue RMA or atomic operations against that memory buffer must provide this key as part of their operation. This helps protect against unintentional accesses to the region. (Memory registration can help guard against malicious access, but it is often too weak by itself to ensure system isolation. Other, fabric specific, mechanisms protect against malicious access. Those mechanisms are currently outside of the scope of the libfabric API.)

Memory registration often plays a secondary role with high-performance networks. In order for a NIC to read or write application memory directly, it must access the physical memory pages that back the application's address space. Modern operating systems employ page files that swap out virtual pages from one process with the virtual pages from another. As a result, a physical memory page may map to different virtual addresses depending on when it is accessed. Furthermore, when a virtual page is swapped in, it may be mapped to a new physical page. If a NIC attempts to read or write application memory without being linked into the virtual address manager, it could access the wrong data, possibly corrupting an application's memory. Memory registration can be used to avoid this situation from occurring. For example, registered pages can be marked such that the operating system locks the virtual to physical mapping, avoiding any possibility of the virtual page being paged out or remapped.

内存注册是用于授予远程对等方访问本地内存缓冲区的安全机制。已注册的内存区域将内存缓冲区与授予结构资源访问权限相关联。必须先注册内存缓冲区,然后才能将其用作 RMA 或原子数据传输的目标。内存注册支持简单的保护机制。注册内存缓冲区后,该注册请求(缓冲区地址、缓冲区长度和访问权限)将获得注册密钥。对该内存缓冲区发出 RMA 或原子操作的对等方必须提供此密钥作为其操作的一部分。这有助于防止无意访问该区域。 (内存注册可以帮助防止恶意访问,但它本身通常太弱而无法确保系统隔离。其他特定于结构的机制可以防止恶意访问。这些机制目前超出了 libfabric API 的范围。)

内存注册通常在高性能网络中扮演次要角色。为了让 NIC 直接读取或写入应用程序内存,它必须访问支持应用程序地址空间的物理内存页。现代操作系统使用页面文件将一个进程的虚拟页面与另一个进程的虚拟页面交换出来。因此,物理内存页可能会根据访问时间映射到不同的虚拟地址。此外,当一个虚拟页面被换入时,它可能被映射到一个新的物理页面。如果 NIC 尝试在未链接到虚拟地址管理器的情况下读取或写入应用程序内存,它可能会访问错误的数据,可能会损坏应用程序的内存。可以使用内存注册来避免这种情况的发生。例如,可以标记已注册的页面,以便操作系统锁定虚拟到物理的映射,避免虚拟页面被调出或重新映射的任何可能性。
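
A minimal registration sketch: register a buffer for remote writes and retrieve the key that the peer must present with its RMA or atomic operations. Depending on the provider's mr_mode, the buffer's virtual address (or an offset) must also be exchanged out of band; error handling is omitted.

#include <rdma/fabric.h>
#include <rdma/fi_domain.h>

static int register_for_remote_write(struct fid_domain *domain, void *buf, size_t len,
                                     struct fid_mr **mr, uint64_t *key_out)
{
    /* Grant remote peers permission to write into this buffer. */
    int ret = fi_mr_reg(domain, buf, len, FI_REMOTE_WRITE,
                        0 /* offset */, 0 /* requested key */, 0 /* flags */,
                        mr, NULL);
    if (ret)
        return ret;

    /* The peer must supply this key with each RMA/atomic that targets the buffer. */
    *key_out = fi_mr_key(*mr);
    return 0;
}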

Object Model 对象模型

Interfaces exposed by OFI are associated with different objects. The following diagram shows a high-level view of the parent-child relationships.

OFI 公开的接口与不同的对象相关联。 下图显示了父子关系的高级视图。
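
The parent-child relationships translate directly into the order in which objects are opened. A skeletal allocation chain follows, with attribute setup and error handling stripped out; info is assumed to come from an earlier fi_getinfo() call, and a connection-less endpoint would additionally bind an address vector before being enabled.

#include <rdma/fabric.h>
#include <rdma/fi_domain.h>
#include <rdma/fi_endpoint.h>
#include <rdma/fi_eq.h>

static int open_objects(struct fi_info *info)
{
    struct fid_fabric *fabric;
    struct fid_domain *domain;
    struct fid_ep *ep;
    struct fid_cq *cq;
    struct fi_cq_attr cq_attr = { .format = FI_CQ_FORMAT_CONTEXT };

    /* fabric -> domain -> (endpoint, completion queue), mirroring the object model */
    fi_fabric(info->fabric_attr, &fabric, NULL);
    fi_domain(fabric, info, &domain, NULL);
    fi_cq_open(domain, &cq_attr, &cq, NULL);
    fi_endpoint(domain, info, &ep, NULL);

    fi_ep_bind(ep, &cq->fid, FI_TRANSMIT | FI_RECV);
    return fi_enable(ep);   /* endpoint is now ready for data transfer calls */
}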

Fabric

A fabric represents a collection of hardware and software resources that access a single physical or virtual network. For example, a fabric may be a single network subnet or cluster. All network ports on a system that can communicate with each other through the fabric belong to the same fabric. A fabric shares network addresses and can span multiple providers.

Fabrics are the top level object from which other objects are allocated.

fabric表示访问单个物理或虚拟网络的硬件和软件资源的集合。 例如,结构可以是单个网络子网或集群。 一个系统上所有可以通过Fabric进行通信的网络端口都属于同一个Fabric。 一个结构共享网络地址并且可以跨越多个提供商。

Fabrics 是分配其他对象的顶级对象。

Domain

A domain represents a logical connection into a fabric. For example, a domain may correspond to a physical or virtual NIC. Because domains often correlate to a single NIC, a domain defines the boundary within which other resources may be associated. Objects such as completion queues and active endpoints must be part of the same domain in order to be related to each other.

域代表与fabric的逻辑连接。 例如,域可能对应于物理或虚拟 NIC。 由于域通常与单个 NIC 相关联,因此域定义了其他资源可能关联的边界。 完成队列和活动端点等对象必须属于同一域才能相互关联。

Passive Endpoint 被动端点

Passive endpoints are used by connection-oriented protocols to listen for incoming connection requests. Passive endpoints often map to software constructs and may span multiple domains. They are best represented by a listening socket. Unlike the socket API, however, in which an allocated socket may be used with either a connect() or listen() call, a passive endpoint may only be used with a listen call.

面向连接的协议使用被动端点来侦听传入的连接请求。 被动端点通常映射到软件结构并且可能跨越多个域。 它们最好由侦听套接字表示。 然而,与套接字 API 不同,其中分配的套接字可以与 connect() 或 listen() 调用一起使用,被动端点只能与 listen 调用一起使用。
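
A passive endpoint corresponds to the listen side of connection-oriented setup. The sketch below assumes an fi_info describing a connection-oriented (FI_EP_MSG) provider and binds an event queue so that connection requests arrive as FI_CONNREQ events; error handling is omitted.

#include <rdma/fabric.h>
#include <rdma/fi_endpoint.h>
#include <rdma/fi_cm.h>
#include <rdma/fi_eq.h>

static int start_listening(struct fid_fabric *fabric, struct fi_info *info,
                           struct fid_pep **pep, struct fid_eq **eq)
{
    struct fi_eq_attr eq_attr = { .wait_obj = FI_WAIT_UNSPEC };

    fi_eq_open(fabric, &eq_attr, eq, NULL);
    fi_passive_ep(fabric, info, pep, NULL);   /* listen-only object */
    fi_pep_bind(*pep, &(*eq)->fid, 0);        /* connection requests arrive on the EQ */
    return fi_listen(*pep);                   /* FI_CONNREQ events will follow */
}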

Event Queues

EQs are used to collect and report the completion of asynchronous operations and events. Event queues handle control events, which are not directly associated with data transfer operations. Control events are separated from data transfer events for performance reasons. Control events usually occur during an application's initialization phase, or at a rate that's several orders of magnitude smaller than data transfer events. Event queues are most commonly used by connection-oriented protocols for notification of connection request or established events. A single event queue may combine multiple hardware queues with a software queue and expose them as a single abstraction.

事件队列用于收集和报告异步操作和事件的完成情况。事件队列处理控制(control)事件,这些事件与数据传输操作没有直接关联。出于性能方面的考虑,控制事件与数据传输事件被分开处理。控制事件通常发生在应用程序的初始化阶段,或者以比数据传输事件低几个数量级的速率发生。面向连接的协议最常使用事件队列来通知连接请求或连接已建立等事件。单个事件队列可以将多个硬件队列与一个软件队列结合起来,并将它们公开为单一抽象。

Wait Sets

The intended objective of a wait set is to reduce system resources used for signaling events. For example, a wait set may allocate a single file descriptor. All fabric resources that are associated with the wait set will signal that file descriptor when an event occurs. The advantage is that the number of opened file descriptors is greatly reduced. The closest operating system semantic would be the Linux epoll construct. The difference is that a wait set does not merely multiplex file descriptors to another file descriptor, but allows for their elimination completely. Wait sets allow a single underlying wait object to be signaled whenever a specified condition occurs on an associated event queue, completion queue, or counter.

等待集的预期目标是减少用于信令事件的系统资源。 例如,等待集可以分配单个文件描述符。 当事件发生时,与等待集相关联的所有结构资源都会向该文件描述符发出信号。 优点是打开的文件描述符的数量大大减少。 最接近的操作系统语义是 Linux epoll 结构。 不同之处在于等待集不仅将文件描述符多路复用到另一个文件描述符,而且允许完全消除它们。 等待集允许在关联的事件队列、完成队列或计数器上发生指定条件时发出单个底层等待对象的信号。

Active Endpoint

Active endpoints are data transfer communication portals. Active endpoints are used to perform data transfers, and are conceptually similar to a connected TCP or UDP socket. Active endpoints are often associated with a single hardware NIC, with the data transfers partially or fully offloaded onto the NIC.

活动端点是数据传输通信门户。 活动端点用于执行数据传输,在概念上类似于连接的 TCP 或 UDP 套接字。 活动端点通常与单个硬件 NIC 相关联,数据传输部分或全部卸载到 NIC 上。

Completion Queue

Completion queues are high-performance queues used to report the completion of data transfer operations. Unlike event queues, completion queues are often associated with a single hardware NIC, and may be implemented entirely in hardware. Completion queue interfaces are designed to minimize software overhead.

完成队列是用于报告数据传输操作完成的高性能队列。 与事件队列不同,完成队列通常与单个硬件 NIC 相关联,并且可以完全在硬件中实现。 完成队列接口旨在最大限度地减少软件开销。

Completion Counter

Completion queues are used to report information about which request has completed. However, some applications use this information simply to track how many requests have completed. Other details are unnecessary. Completion counters are optimized for this use case. Rather than writing entries into a queue, completion counters allow the provider to simply increment a count whenever a completion occurs.

完成队列用于报告有关哪个请求已完成的信息。 但是,某些应用程序仅使用此信息来跟踪已完成的请求数。 其他细节是不必要的。 完成计数器已针对此用例进行了优化。 完成计数器不是将条目写入队列,而是允许提供者在完成时简单地增加计数。
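
A counter sketch: bind a completion counter to an endpoint's receive side and block until a given number of receives have completed, without inspecting individual completion entries. The 5-second timeout is an arbitrary example, and error handling is omitted.

#include <rdma/fabric.h>
#include <rdma/fi_domain.h>
#include <rdma/fi_endpoint.h>

static int count_receives(struct fid_domain *domain, struct fid_ep *ep, uint64_t expected)
{
    struct fid_cntr *cntr;
    struct fi_cntr_attr attr = { .events = FI_CNTR_EVENTS_COMP };

    fi_cntr_open(domain, &attr, &cntr, NULL);
    fi_ep_bind(ep, &cntr->fid, FI_RECV);   /* count receive completions only */

    /* Block (with a 5 second timeout) until the counter reaches 'expected'. */
    return fi_cntr_wait(cntr, expected, 5000);
}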

Poll Set

OFI allows providers to use an application’s thread to process asynchronous requests. This can provide performance advantages for providers that use software to progress the state of a data transfer. Poll sets allow an application to group together multiple objects, such that progress can be driven across all associated data transfers. In general, poll sets are used to simplify applications where a manual progress model is employed.

OFI 允许提供者使用应用程序的线程来处理异步请求。 这可以为使用软件来推进数据传输状态的提供商提供性能优势。 轮询集允许应用程序将多个对象组合在一起,以便可以跨所有关联的数据传输推动进度。 通常,轮询集用于简化采用手动进度模型的应用程序。

Memory Region

Memory regions describe application’s local memory buffers. In order for fabric resources to access application memory, the application must first grant permission to the fabric provider by constructing a memory region. Memory regions are required for specific types of data transfer operations, such as RMA and atomic operations.

内存区域描述应用程序的本地内存缓冲区。 为了让fabric资源访问应用程序内存,应用程序必须首先通过构造一个内存区域向fabric提供者授予权限。 特定类型的数据传输操作(例如 RMA 和原子操作)需要内存区域。

Address Vectors 地址向量

Address vectors are used by connection-less endpoints. They map higher level addresses, such as IP addresses, which may be more natural for an application to use, into fabric specific addresses. The use of address vectors allows providers to reduce the amount of memory required to maintain large address look-up tables, and eliminate expensive address resolution and look-up methods during data transfer operations.

地址向量由无连接端点使用。 它们将更高级别的地址(例如 IP 地址,对于应用程序使用起来可能更自然)映射到特定于结构的地址。 地址向量的使用允许提供商减少维护大型地址查找表所需的内存量,并在数据传输操作期间消除昂贵的地址解析和查找方法。
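
An address vector sketch: insert a peer's raw fabric address (obtained out of band, for example from the peer's fi_getname()) and receive a compact fi_addr_t handle that data transfer calls use in place of a full address. It assumes an open domain and a connection-less endpoint, and omits error handling.

#include <rdma/fabric.h>
#include <rdma/fi_domain.h>
#include <rdma/fi_endpoint.h>

static int add_peer(struct fid_domain *domain, struct fid_ep *ep,
                    const void *peer_addr, fi_addr_t *peer_out)
{
    struct fid_av *av;
    struct fi_av_attr av_attr = { .type = FI_AV_MAP };   /* provider chooses handle values */

    fi_av_open(domain, &av_attr, &av, NULL);
    fi_ep_bind(ep, &av->fid, 0);   /* connection-less endpoints resolve peers via the AV */

    /* Translate the raw fabric address into a compact handle for fi_send() etc. */
    return (int) fi_av_insert(av, peer_addr, 1, peer_out, 0, NULL);
}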

下文: https://cloud.tencent.com/developer/article/2531005

原创声明:本文系作者授权腾讯云开发者社区发表,未经许可,不得转载。

如有侵权,请联系 cloudcommunity@tencent.com 删除。