libfabric (OFA): Introduction, Guide, and Design Concepts (High-Performance Networking, Part 3)

Previous article: https://cloud.tencent.com/developer/article/2531004

Communication Model

OFI supports three main communication endpoint types: reliable-connected, unreliable datagram, and reliable-unconnected. (The fourth option, unreliable-connected, is unused by applications, so it is not included in the current implementation.) Communication setup is based on whether the endpoint is connected or unconnected. Reliability is a feature of the endpoint's data transfer protocol.

Connected Communications

The following diagram highlights the general usage behind connection-oriented communication. Connected communication is based on the flow used to connect TCP sockets, with improved asynchronous support.

Connections require the use of both passive and active endpoints. In order to establish a connection, an application must first create a passive endpoint and associate it with an event queue. The event queue will be used to report the connection management events. The application then calls listen on the passive endpoint. A single passive endpoint can be used to form multiple connections.

The connecting peer allocates an active endpoint, which is also associated with an event queue. Connect is called on the active endpoint, which results in sending a connection request (CONNREQ) message to the passive endpoint. The CONNREQ event is inserted into the passive endpoint’s event queue, where the listening application can process it.

Upon processing the CONNREQ, the listening application will allocate an active endpoint to use with the connection. The active endpoint is bound with an event queue. Although the diagram shows the use of a separate event queue, the active endpoint may use the same event queue as used by the passive endpoint. Accept is called on the active endpoint to finish forming the connection. It should be noted that the OFI accept call is different than the accept call used by sockets. The differences result from OFI supporting process direct I/O.

OFI does not define the connection establishment protocol, but does support a traditional three-way handshake used by many technologies. After calling accept, a response is sent to the connecting active endpoint. That response generates a CONNECTED event on the remote event queue. If a three-way handshake is used, the remote endpoint will generate an acknowledgment message that will generate a CONNECTED event for the accepting endpoint. Regardless of the connection protocol, both the active and passive sides of the connection will receive a CONNECTED event that signals that the connection has been established.
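
As a rough illustration of this flow, the sketch below outlines the passive (listening) side using the connection management calls named above. It is a minimal sketch only: error handling is omitted, and it assumes a fabric, domain, and fi_info structure have already been obtained from fi_getinfo().

/* Hypothetical sketch of the passive (listening) side; error checks omitted */
#include <rdma/fabric.h>
#include <rdma/fi_domain.h>
#include <rdma/fi_endpoint.h>
#include <rdma/fi_cm.h>
#include <rdma/fi_eq.h>

static int listen_and_accept(struct fid_fabric *fabric, struct fid_domain *domain,
    struct fi_info *info)
{
    struct fi_eq_attr eq_attr = { .wait_obj = FI_WAIT_UNSPEC };
    struct fid_eq *eq;
    struct fid_pep *pep;
    struct fid_ep *ep;
    struct fi_eq_cm_entry entry;
    uint32_t event;

    fi_eq_open(fabric, &eq_attr, &eq, NULL);      /* event queue for CM events */
    fi_passive_ep(fabric, info, &pep, NULL);      /* passive endpoint */
    fi_pep_bind(pep, &eq->fid, 0);                /* report CM events to the EQ */
    fi_listen(pep);

    /* Wait for a connection request (CONNREQ) */
    fi_eq_sread(eq, &event, &entry, sizeof entry, -1, 0);
    /* event == FI_CONNREQ; entry.info describes the requesting peer */

    fi_endpoint(domain, entry.info, &ep, NULL);   /* active endpoint for this connection */
    fi_ep_bind(ep, &eq->fid, 0);                  /* may reuse the same EQ */
    /* A real program also binds completion queue(s) and enables the
     * endpoint (fi_enable) before accepting. */
    fi_accept(ep, NULL, 0);

    /* Wait for the CONNECTED event before transferring data */
    fi_eq_sread(eq, &event, &entry, sizeof entry, -1, 0);
    return 0;
}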

Connection-less Communications

Connection-less communication allows data transfers between active endpoints without going through a connection setup process. The diagram below shows the basic components needed to setup connection-less communication. Connection-less communication setup differs from UDP sockets in that it requires that the remote addresses be stored with libfabric.

OFI requires the addresses of peer endpoints be inserted into a local addressing table, or address vector, before data transfers can be initiated against the remote endpoint. Address vectors abstract fabric-specific addressing requirements and avoid long queuing delays on data transfers when address resolution is needed. For example, IP addresses may need to be resolved into Ethernet MAC addresses. Address vectors allow this resolution to occur during application initialization. OFI does not define how an address vector is implemented, only its conceptual model.

Because address vector setup is considered a control operation, and often occurs during an application's initialization phase, they may be used both synchronously and asynchronously. When used synchronously, calls to insert new addresses into the AV block until the resolution completes. When an address vector is used asynchronously, it must be associated with an event queue. With the asynchronous model, after an address has been inserted into the AV and the fabric specific details have been resolved, a completion event is generated on the event queue. Data transfer operations against that address are then permissible on active endpoints that are associated with the address vector.

All connection-less endpoints that transfer data must be associated with an address vector.
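
The sketch below shows the minimal steps implied by this model: open an address vector, bind it to an endpoint, and insert a peer address. It is a sketch only; it assumes a domain and endpoint already exist, that peer_addr holds the peer's address in the format reported by fi_getinfo(), and error handling is omitted.

/* Hypothetical sketch: inserting one peer address into an address vector */
#include <rdma/fabric.h>
#include <rdma/fi_domain.h>
#include <rdma/fi_endpoint.h>

static int setup_av(struct fid_domain *domain, struct fid_ep *ep,
    const void *peer_addr, fi_addr_t *peer_fi_addr)
{
    struct fi_av_attr av_attr = { .type = FI_AV_UNSPEC };
    struct fid_av *av;

    fi_av_open(domain, &av_attr, &av, NULL);  /* synchronous AV (no EQ bound) */
    fi_ep_bind(ep, &av->fid, 0);              /* connection-less endpoints need an AV */

    /* Blocking insert: returns once the fabric-specific address is resolved.
     * peer_addr is in the format reported by fi_getinfo (e.g. a sockaddr). */
    fi_av_insert(av, peer_addr, 1, peer_fi_addr, 0, NULL);
    return 0;
}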

Endpoints

Endpoints represent communication portals, and all data transfer operations are initiated on endpoints. OFI defines the conceptual model for how endpoints are exposed to applications, as demonstrated in the diagrams below.

Endpoints are usually associated with a transmit context and a receive context. Transmit and receive contexts are often implemented using hardware queues that are mapped directly into the process’s address space, though OFI does not require this implementation. Although not shown, an endpoint may be configured only to transmit or receive data. Data transfer requests are converted by the underlying provider into commands that are inserted into transmit and/or receive contexts.

Endpoints are also associated with completion queues. Completion queues are used to report the completion of asynchronous data transfer operations. An endpoint may direct completed transmit and receive operations to separate completion queues, or the same queue (not shown).
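
A minimal sketch of this arrangement is shown below. It assumes the domain and endpoint have already been opened, omits error handling, and simply binds separate transmit and receive completion queues to one endpoint.

/* Hypothetical sketch: separate completion queues for transmit and receive */
#include <rdma/fabric.h>
#include <rdma/fi_domain.h>
#include <rdma/fi_endpoint.h>
#include <rdma/fi_eq.h>

static int bind_cqs(struct fid_domain *domain, struct fid_ep *ep)
{
    struct fi_cq_attr cq_attr = { .format = FI_CQ_FORMAT_CONTEXT };
    struct fid_cq *tx_cq, *rx_cq;

    fi_cq_open(domain, &cq_attr, &tx_cq, NULL);
    fi_cq_open(domain, &cq_attr, &rx_cq, NULL);

    /* Direct completed transmits and receives to their own queues */
    fi_ep_bind(ep, &tx_cq->fid, FI_TRANSMIT);
    fi_ep_bind(ep, &rx_cq->fid, FI_RECV);
    return fi_enable(ep);
}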

Shared Contexts

A more advanced usage model of endpoints that allows for resource sharing is shown below.

Because transmit and receive contexts may be associated with limited hardware resources, OFI defines mechanisms for sharing contexts among multiple endpoints. The diagram above shows two endpoints each sharing transmit and receive contexts. However, endpoints may share only the transmit context or only the receive context or neither. Shared contexts allow an application or resource manager to prioritize where resources are allocated and how shared hardware resources should be used.

Completions are still associated with the endpoints, with each endpoint being associated with their own completion queue(s).

Receive Contexts

TODO

Transmit Contexts

TODO

Scalable Endpoints

The final endpoint model is known as a scalable endpoint. Scalable endpoints allow a single endpoint to take advantage of multiple underlying hardware resources.

Scalable endpoints have multiple transmit and/or receive contexts. Applications can direct data transfers to use a specific context, or the provider can select which context to use. Each context may be associated with its own completion queue. Scalable contexts allow applications to separate resources to avoid thread synchronization or data ordering restrictions.
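
The sketch below is a hypothetical outline of opening a scalable endpoint with two transmit contexts. It assumes the provider supports multiple contexts (see the max_ep_tx_ctx domain attribute) and that info->ep_attr->tx_ctx_cnt was set accordingly; completion queue bindings and error handling are omitted.

/* Hypothetical sketch: a scalable endpoint with two transmit contexts */
#include <rdma/fabric.h>
#include <rdma/fi_endpoint.h>

static int open_scalable(struct fid_domain *domain, struct fi_info *info)
{
    struct fid_ep *sep, *tx0, *tx1;

    /* info->ep_attr->tx_ctx_cnt would be set to the number of contexts requested */
    fi_scalable_ep(domain, info, &sep, NULL);

    /* Each transmit context is addressed by index and may have its own CQ */
    fi_tx_context(sep, 0, info->tx_attr, &tx0, NULL);
    fi_tx_context(sep, 1, info->tx_attr, &tx1, NULL);

    /* tx0 and tx1 can now be used by different threads without sharing a queue,
     * subject to the domain threading model */
    return 0;
}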

Data Transfers

Obviously, the goal of network communication is to transfer data between systems. In the same way that the sockets API defines different data transfer semantics for TCP versus UDP sockets (streaming versus datagram messages), OFI defines different data transfer semantics. However, unlike sockets, OFI allows different semantics over a single endpoint, even when communicating with the same peer.

OFI defines separate API sets for the different data transfer semantics; although, there are strong similarities between the API sets. The differences are the result of the parameters needed to invoke each type of data transfer.

Message transfers

Message transfers are most similar to UDP datagram transfers. The sender requests that data be transferred as a single transport operation to a peer. Even if the data is referenced using an I/O vector, it is treated as a single logical unit. The data is placed into a waiting receive buffer at the peer. Unlike UDP sockets, message transfers may be reliable or unreliable, and many providers support message transfers that are gigabytes in size.

Message transfers are usually invoked using API calls that contain the string "send" or "recv". As a result they may be referred to simply as sends or receives.

Message transfers involve the target process posting memory buffers to the receive context of its endpoint. When a message arrives from the network, a receive buffer is removed from the Rx context, and the data is copied from the network into the receive buffer. Messages are matched with posted receives in the order that they are received. Note that this may differ from the order that messages are sent, depending on the transmit side's ordering semantics. Furthermore, received messages may complete out of order. For instance, short messages could complete before larger messages, especially if the messages originate from different peers. Completion ordering semantics indicate the order that posted receive operations complete.

Conceptually, on the transmit side, messages are posted to a transmit context. The network processes messages from the Tx context, packetizing the data into outbound messages. Although many implementations process the Tx context in order (i.e. the Tx context is a true queue), ordering guarantees determine the actual processing order. For example, sent messages may be copied to the network out of order if targeting different peers.

In the default case, OFI defines ordering semantics such that messages 1, 2, 3, etc. from the sender are received in the same order at the target. Relaxed ordering semantics is an optimization technique that applications can opt into in order to improve network performance and utilization.
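
A minimal sketch of a send/receive exchange is shown below. It assumes an enabled endpoint already bound to the tx_cq and rx_cq completion queues, a manual progress model, and no local memory registration requirement (desc passed as NULL); error handling is omitted, and dest would come from an address vector (or be ignored on a connected endpoint).

/* Hypothetical sketch: post a receive, send a message, wait for completions */
#include <rdma/fabric.h>
#include <rdma/fi_endpoint.h>
#include <rdma/fi_eq.h>

static void ping(struct fid_ep *ep, struct fid_cq *tx_cq, struct fid_cq *rx_cq,
    fi_addr_t dest)
{
    char tx_buf[64] = "hello", rx_buf[64];
    struct fi_cq_entry comp;

    /* Receive buffers are posted before the matching message arrives */
    fi_recv(ep, rx_buf, sizeof rx_buf, NULL, FI_ADDR_UNSPEC, rx_buf);
    fi_send(ep, tx_buf, sizeof tx_buf, NULL, dest, tx_buf);

    /* Busy-poll both completion queues (driving manual progress) */
    while (fi_cq_read(tx_cq, &comp, 1) != 1)
        ;
    while (fi_cq_read(rx_cq, &comp, 1) != 1)
        ;
}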

Tagged messages

Tagged messages are similar to message transfers except that the messages carry one additional piece of information, a message tag. Tags are application defined values that are part of the message transfer protocol and are used to route packets at the receiver. At a high level, they are roughly similar to sequence numbers or message ids. The difference is that tag values are set by the application, may be any value, and duplicate tag values are allowed.

Each sent message carries a single tag value, which is used to select a receive buffer into which the data is copied. On the receiving side, message buffers are also marked with a tag. Messages that arrive from the network search through the posted receive messages until a matching tag is found. Tags allow messages to be placed into overlapping groups.

Tags are often used to identify virtual communication groups or roles. For example, one tag value may be used to identify a group of systems that contain input data into a program. A second tag value could identify the systems involved in the processing of the data. And a third tag may identify systems responsible for gathering the output from the processing. (This is purely a hypothetical example for illustrative purposes only). Moreover, tags may carry additional data about the type of message being used by each group. For example, messages could be separated based on whether the context carries control or data information.

In practice, message tags are typically divided into fields. For example, the upper 16 bits of the tag may indicate a virtual group, with the lower 16 bits identifying the message purpose. The tag message interface in OFI is designed around this usage model. Each sent message carries exactly one tag value, specified through the API. At the receiver, buffers are associated with both a tag value and a mask. The mask is applied to both the send and receive tag values (using a bit-wise AND operation). If the resulting values match, then the tags are said to match. The received data is then placed into the matched buffer.

For performance reasons, the mask is specified as 'ignore' bits. Although this is backwards from how many developers think of a mask (where the bits that are valid would be set to 1), the definition ends up mapping well with applications. The actual operation performed when matching tags is:

send_tag | ignore == recv_tag | ignore
/* this is equivalent to:
 * send_tag & ~ignore == recv_tag & ~ignore
 */

Tagged messages are the equivalent of message transfers if a single tag value is used. But tagged messages require that the receiver perform the matching operation at the target, which can impact performance versus untagged messages.
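
The sketch below illustrates the group/purpose split described above using the tagged APIs. The GROUP_SHIFT and PURPOSE_MASK names are illustrative only; it assumes an enabled endpoint and omits completion handling.

/* Hypothetical sketch: tag split into a 16-bit group and 16-bit purpose */
#include <rdma/fabric.h>
#include <rdma/fi_tagged.h>

#define GROUP_SHIFT   16
#define PURPOSE_MASK  0xffffULL

static void tagged_example(struct fid_ep *ep, fi_addr_t dest)
{
    static char tx_buf[64], rx_buf[64];
    uint64_t tag = (2ULL << GROUP_SHIFT) | 0x1;   /* group 2, purpose 1 */

    /* Receive any message for group 2, regardless of purpose:
     * the ignore argument marks the purpose bits as "don't care" */
    fi_trecv(ep, rx_buf, sizeof rx_buf, NULL, FI_ADDR_UNSPEC,
             2ULL << GROUP_SHIFT, PURPOSE_MASK, rx_buf);

    /* Each sent message carries exactly one tag value */
    fi_tsend(ep, tx_buf, sizeof tx_buf, NULL, dest, tag, tx_buf);
}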

RMA

RMA operations are architected such that they can require no processing at the RMA target. NICs which offload transport functionality can perform RMA operations without impacting host processing. RMA write operations transmit data from the initiator to the target. The memory location where the data should be written is carried within the transport message itself.

RMA read operations fetch data from the target system and transfer it back to the initiator of the request, where it is copied into memory. This too can be done without involving the host processor at the target system when the NIC supports transport offloading.

The advantage of RMA operations is that they decouple the processing of the peers. Data can be placed or fetched whenever the initiator is ready without necessarily impacting the peer process.

Because RMA operations allow a peer to directly access the memory of a process, additional protection mechanisms are used to prevent unintentional or unwanted access. RMA memory that is updated by a write operation or is fetched by a read operation must be registered for access with the correct permissions specified.
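
The sketch below shows the two halves of an RMA write under these rules: the target registers a buffer for remote write access, and the initiator writes into it using the remote address and key. How the address and key are exchanged (typically via an ordinary message) is outside the sketch; error and completion handling are omitted, and the function names are illustrative.

/* Hypothetical sketch: register memory at the target, RMA write from the initiator */
#include <rdma/fabric.h>
#include <rdma/fi_domain.h>
#include <rdma/fi_rma.h>

/* Target side: expose a buffer for remote writes */
static uint64_t expose_buffer(struct fid_domain *domain, void *buf, size_t len,
    struct fid_mr **mr)
{
    fi_mr_reg(domain, buf, len, FI_REMOTE_WRITE, 0, 0, 0, mr, NULL);
    return fi_mr_key(*mr);   /* key is sent to the initiator out of band */
}

/* Initiator side: write local data into the target buffer */
static void rma_write(struct fid_ep *ep, const void *src, size_t len,
    fi_addr_t dest, uint64_t remote_addr, uint64_t remote_key)
{
    /* remote_addr is either a virtual address or an offset, depending on
     * the domain's mr_mode; completion is reported through the bound CQ */
    fi_write(ep, src, len, NULL, dest, remote_addr, remote_key, NULL);
}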

Atomic operations

Atomic transfers are used to read and update data located in remote memory regions in an atomic fashion. Conceptually, they are similar to local atomic operations of a similar nature (e.g. atomic increment, compare and swap, etc.). The benefit of atomic operations is they enable offloading basic arithmetic capabilities onto a NIC. Unlike other data transfer operations, atomics require knowledge of the format of the data being accessed.

A single atomic function may operate across an array of data, applying an atomic operation to each entry, but the atomicity of an operation is limited to a single data type or entry. OFI defines a wide variety of atomic operations across all common data types. However, support for a given operation is dependent on the provider implementation.
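
As a small illustration, the sketch below performs a remote atomic add on a 64-bit counter. It assumes the target memory was registered for remote write access and its key exchanged beforehand, and it checks provider support with fi_atomicvalid(); completion handling is omitted.

/* Hypothetical sketch: atomically add to a 64-bit counter in remote memory */
#include <rdma/fabric.h>
#include <rdma/fi_atomic.h>

static void remote_increment(struct fid_ep *ep, fi_addr_t dest,
    uint64_t remote_addr, uint64_t remote_key)
{
    static uint64_t one = 1;
    size_t count;

    /* Providers differ in which (datatype, op) pairs they support */
    if (fi_atomicvalid(ep, FI_UINT64, FI_SUM, &count) != 0)
        return;

    /* One FI_UINT64 operand; the addition is performed atomically at the target */
    fi_atomic(ep, &one, 1, NULL, dest, remote_addr, remote_key,
              FI_UINT64, FI_SUM, NULL);
}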

Fabric Interfaces

A full description of the libfabric API is documented in the relevant man pages. This section provides an introduction to select interfaces, including how they may be used. It does not attempt to capture all subtleties or use cases, nor describe all possible data structures or fields.

Using fi_getinfo

https://ofiwg.github.io/libfabric/v1.13.2/man/fi_getinfo.3.html

The fi_getinfo() call is one of the first calls that most applications will invoke. It is designed to be easy to use for simple applications, but extensible enough to configure a network for optimal performance. It serves several purposes. First, it abstracts away network implementation and addressing details. Second, it allows an application to specify which features they require of the network. Last, it provides a mechanism for a provider to report how an application can use the network in order to achieve the best performance.

fi_getinfo, fi_freeinfo - Obtain / free fabric interface information

fi_allocinfo, fi_dupinfo - Allocate / duplicate an fi_info structure

/* API prototypes */
struct fi_info *fi_allocinfo(void);

int fi_getinfo(int version, const char *node, const char *service,
    uint64_t flags, struct fi_info *hints, struct fi_info **info);

/* Sample initialization code flow */
struct fi_info *hints, *info;

hints = fi_allocinfo();

/* hints will point to a cleared fi_info structure
 * Initialize hints here to request specific network capabilities
 */

fi_getinfo(FI_VERSION(1, 4), NULL, NULL, 0, hints, &info);
fi_freeinfo(hints);

/* Use the returned info structure to allocate fabric resources */

The hints parameter is the key for requesting fabric services. The fi_info structure contains several data fields, plus pointers to a wide variety of attributes. The fi_allocinfo() call simplifies the creation of an fi_info structure. In this example, the application is merely attempting to get a list of what providers are available in the system and the features that they support. Note that the API is designed to be extensible. Versioning information is provided as part of the fi_getinfo() call. The version is used by libfabric to determine what API features the application is aware of. In this case, the application indicates that it can properly handle any feature that was defined for the 1.4 release (or earlier).

Applications should always hard code the version that they are written for into the fi_getinfo() call. This ensures that newer versions of libfabric will provide backwards compatibility with that used by the application.

Typically, an application will initialize the hints parameter to list the features that it will use.
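
As a rough illustration, the sketch below fills in hints for a hypothetical application that wants a reliable-unconnected (FI_EP_RDM) endpoint with tagged messaging and RMA, and that agrees to the FI_CONTEXT mode bit. The specific values are illustrative only.

/* Hypothetical sketch: requesting specific capabilities through hints */
#include <rdma/fabric.h>

static struct fi_info *get_rdm_info(void)
{
    struct fi_info *hints, *info = NULL;

    hints = fi_allocinfo();
    hints->ep_attr->type = FI_EP_RDM;     /* reliable-unconnected endpoint */
    hints->caps = FI_TAGGED | FI_RMA;     /* primary capabilities the app will use */
    hints->mode = FI_CONTEXT;             /* mode bits the app agrees to support */

    fi_getinfo(FI_VERSION(1, 4), NULL, NULL, 0, hints, &info);
    fi_freeinfo(hints);
    return info;   /* linked list, ordered from most to least preferred */
}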

/* Taking a peek at the contents of fi_info */
struct fi_info {
    struct fi_info *next;
    uint64_t caps;
    uint64_t mode;
    uint32_t addr_format;
    size_t src_addrlen;
    size_t dest_addrlen;
    void *src_addr;
    void *dest_addr;
    fid_t handle;
    struct fi_tx_attr *tx_attr;
    struct fi_rx_attr *rx_attr;
    struct fi_ep_attr *ep_attr;
    struct fi_domain_attr *domain_attr;
    struct fi_fabric_attr *fabric_attr;
};

The fi_info structure references several different attributes, which correspond to the different OFI objects that an application allocates. Details of the various attribute structures are defined below. For basic applications, modifying or accessing most attribute fields is unnecessary. Many applications will only need to deal with a few fields of fi_info, most notably the capability (caps) and mode bits.

On success, the fi_getinfo() function returns a linked list of fi_info structures. Each entry in the list will meet the conditions specified through the hints parameter. The returned entries may come from different network providers, or may differ in the returned attributes. For example, if hints does not specify a particular endpoint type, there may be an entry for each of the three endpoint types. As a general rule, libfabric returns the list of fi_info structures in order from most desirable to least. High-performance network providers are listed before more generic providers, such as the socket or UDP providers.

Capabilities

The fi_info caps field is used to specify the features and services that the application requires of the network. This field is a bit-mask of desired capabilities. There are capability bits for each of the data transfer services mentioned above: FI_MSG, FI_TAGGED, FI_RMA, and FI_ATOMIC. Applications should set each bit for each set of operations that it will use. These bits are often the only bits set by an application.

In some cases, additional bits may be used to limit how a feature will be used. For example, an application can use the FI_SEND or FI_RECV bits to indicate that it will only send or receive messages, respectively. Similarly, an application that will only initiate RMA writes, can set the FI_WRITE bit, leaving FI_REMOTE_WRITE unset. The FI_SEND and FI_RECV bits can be used to restrict the supported message and tagged operations. By default, if FI_MSG or FI_TAGGED are set, the resulting endpoint will be enabled to both send and receive messages. Likewise, FI_READ, FI_WRITE, FI_REMOTE_READ, FI_REMOTE_WRITE can restrict RMA and atomic operations.

Capabilities are grouped into two general categories: primary and secondary. Primary capabilities must explicitly be requested by an application, and a provider must enable support for only those primary capabilities which were selected. Secondary capabilities may optionally be requested by an application. If requested, a provider must support the capability or fail the fi_getinfo request. A provider may optionally report non-selected secondary capabilities if doing so would not compromise performance or security. That is, a provider may grant an application a secondary capability, whether the application requests it or not.

All of the capabilities discussed so far are primary. Secondary capabilities mostly deal with features desired by highly scalable, high-performance applications. For example, the FI_MULTI_RECV secondary capability indicates if the provider can support the multi-receive buffers feature described above.

Because different providers support different sets of capabilities, applications that desire optimal network performance may need to code for a capability being either present or absent. When present, such capabilities can offer a scalability or performance boost. When absent, an application may prefer to adjust its protocol or implementation to work around the network limitations. Although providers can often emulate features, doing so can impact overall performance, including the performance of data transfers that otherwise appear unrelated to the feature in use. For example, if a provider needs to insert protocol headers into the message stream in order to implement a given capability, the appearance of that header could negatively impact the performance of all transfers. By exposing such limitations to the application, the app has better control over how to best emulate the feature or work around its absence.

It is recommended that applications code for only those capabilities required to achieve the best performance. If a capability would have little to no effect on overall performance, developers should avoid using such features as part of an initial implementation. This will allow the application to work well across the widest variety of hardware. Application optimizations can then add support for less common features. To see which features are supported by which providers, see the libfabric Provider Feature Matrix for the relevant release.

Mode Bits

Where capability bits represent features desired by applications, mode bits correspond to behavior requested by the provider. That is, capability bits are top down requests, whereas mode bits are bottom up restrictions. Mode bits are set by the provider to request that the application use the API in a specific way in order to achieve optimal performance. Mode bits often imply that the additional work needed by the application will be less overhead than forcing that same implementation down into the provider. Mode bits arise as a result of hardware implementation restrictions.

An application developer decides which mode bits they want to or can easily support as part of their development process. Each mode bit describes a particular behavior that the application must follow to use various interfaces. Applications set the mode bits that they support when calling fi_getinfo(). If a provider requires a mode bit that isn't set, that provider will be skipped by fi_getinfo(). If a provider does not need a mode bit that is set, it will respond to the fi_getinfo() call, with the mode bit cleared. This indicates that the application does not need to perform the action required by the mode bit.

One of the most common mode bits needed by providers is FI_CONTEXT. This mode bit requires that applications pass in a libfabric defined data structure (struct fi_context) into any data transfer function. That structure must remain valid and unused by the application until the data transfer operation completes. The purpose behind this mode bit is that the struct fi_context provides "scratch" space that the provider can use to track the request. For example, it may need to insert the request into a linked list, or track the number of times that an outbound transfer has been retried. Since many applications already track outstanding operations with their own data structure, by embedding the struct fi_context into that same structure, overall performance can be improved. This avoids the provider needing to allocate and free internal structures for each request.

Continuing with this example, if an application does not already track outstanding requests, then it would leave the FI_CONTEXT mode bit unset. This would indicate that the provider needs to get and release its own structure for tracking purposes. In this case, the costs would essentially be the same whether it were done by the application or provider.
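
The sketch below shows the embedding technique described above. The app_request structure is hypothetical; the only requirement imposed by FI_CONTEXT is that the fi_context passed as the operation's context remains valid and untouched by the application until the operation completes.

/* Hypothetical sketch: embedding struct fi_context in an application request */
#include <rdma/fabric.h>
#include <rdma/fi_endpoint.h>

struct app_request {
    struct fi_context ctx;   /* scratch space reserved for the provider */
    void *buffer;            /* application bookkeeping follows */
    size_t length;
    int status;
};

static void post_recv(struct fid_ep *ep, struct app_request *req)
{
    /* When FI_CONTEXT is required, the context argument must point to a
     * struct fi_context; here it is simply a member of the application's
     * own request structure, so no extra allocation is needed. */
    fi_recv(ep, req->buffer, req->length, NULL, FI_ADDR_UNSPEC, &req->ctx);
}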

For the broadest support of different network technologies, applications should attempt to support as many mode bits as feasible. Most providers attempt to support applications that cannot support any mode bits, with as small an impact as possible. However, implementation of mode bit avoidance in the provider will often impact latency tests.

FIDs

FID stands for fabric identifier. It is the conceptual equivalent to a file descriptor. All fabric resources are represented by a fid structure, and all fid's are derived from a base fid type. In object-oriented terms, a fid would be the parent class. The contents of a fid are visible to the application.

/* Base FID definition */
enum {
    FI_CLASS_UNSPEC,
    FI_CLASS_FABRIC,
    FI_CLASS_DOMAIN,
    ...
};

struct fi_ops {
    size_t size;
    int (*close)(struct fid *fid);
    ...
};

/* All fabric interface descriptors must start with this structure */
struct fid {
    size_t fclass;
    void *context;
    struct fi_ops *ops;
};

The fid structure is designed as a trade-off between minimizing memory footprint versus software overhead. Each fid is identified as a specific object class. Examples are given above (e.g. FI_CLASS_FABRIC). The context field is an application defined data value. The context field is usually passed as a parameter into the call that allocates the fid structure (e.g. fi_fabric() or fi_domain()). The use of the context field is application specific. Applications often set context to a corresponding structure that they've allocated. The context field is the only field that applications are recommended to access directly. Access to other fields should be done using defined function calls.
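
A small sketch of this usage is shown below: the application passes its own state structure as the context when opening a domain and can later recover it from the base fid. The app_domain_state name is illustrative only.

/* Hypothetical sketch: using the fid context field to find application state */
#include <rdma/fabric.h>
#include <rdma/fi_domain.h>

struct app_domain_state {   /* application-defined bookkeeping */
    int id;
};

static void open_with_context(struct fid_fabric *fabric, struct fi_info *info)
{
    static struct app_domain_state state = { .id = 42 };
    struct fid_domain *domain;

    fi_domain(fabric, info, &domain, &state);        /* context = &state */

    /* The context pointer is stored in the base fid and may be read back */
    struct app_domain_state *s = domain->fid.context;
    (void)s;   /* s == &state */
}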

The ops field points to a set of function pointers. The fi_ops structure defines the operations that apply to that class. The size field in the fi_ops structure is used for extensibility, and allows the fi_ops structure to grow in a backward compatible manner as new operations are added. The fid deliberately points to the fi_ops structure, rather than embedding the operations directly. This allows multiple fids to point to the same set of ops, which minimizes the memory footprint of each fid. (Internally, providers usually set ops to a static data structure, with the fid structure dynamically allocated.)

Although it's possible for applications to access function pointers directly, it is strongly recommended that the static inline functions defined in the man pages be used instead. This is required by applications that may be built using the FABRIC_DIRECT library feature. (FABRIC_DIRECT is a compile time option that allows for highly optimized builds by tightly coupling an application with a specific provider. See the man pages for more details.)

Other OFI classes are derived from this structure, adding their own set of operations.

/* Example of deriving a new class for a fabric object */
struct fi_ops_fabric {
    size_t size;
    int (*domain)(struct fid_fabric *fabric, struct fi_info *info,
        struct fid_domain **dom, void *context);
    ...
};

struct fid_fabric {
    struct fid fid;
    struct fi_ops_fabric *ops;
};

Other fid classes follow a similar pattern as that shown for fid_fabric. The base fid structure is followed by zero or more pointers to operation sets.

Fabric

The top-level object that applications open is the fabric identifier. The fabric can mostly be viewed as a container object by applications, though it does identify which provider applications use. (Future extensions are likely to expand methods that apply directly to the fabric object. An example is adding topology data to the API.)

Opening a fabric is usually a straightforward call after calling fi_getinfo().

int fi_fabric(struct fi_fabric_attr *attr, struct fid_fabric **fabric, void *context);

The fabric attributes can be directly accessed from struct fi_info. The newly opened fabric is returned through the 'fabric' parameter. The 'context' parameter appears in many operations. It is a user-specified value that is associated with the fabric. It may be used to point to an application specific structure and is retrievable from struct fid_fabric.

Attributes

The fabric attributes are straightforward.

struct fi_fabric_attr {
    struct fid_fabric *fabric;
    char *name;
    char *prov_name;
    uint32_t prov_version;
};

The only field that applications are likely to use directly is the prov_name. This is a string value that can be used by hints to select a specific provider for use. On most systems, there will be multiple providers available. Only one is likely to represent the high-performance network attached to the system. Others are generic providers that may be available on any system, such as the TCP socket and UDP providers.

The fabric field is used to help applications manage open fabric resources. If an application has already opened a fabric that can support the returned fi_info structure, this will be set to that fabric. The contents of struct fid_fabric is visible to applications. It contains a pointer to the application's context data that was provided when the fabric was opened.

Environment Variables

Environment variables are used by providers to configure internal options for optimal performance or memory consumption. Libfabric provides an interface for querying which environment variables are usable, along with an application to display the information to a command window. Although environment variables are usually configured by an administrator, an application can query for variables programmatically.

/* APIs to query for supported environment variables */
enum fi_param_type {
    FI_PARAM_STRING,
    FI_PARAM_INT,
    FI_PARAM_BOOL
};

struct fi_param {
    /* The name of the environment variable */
    const char *name;
    /* What type of value it stores: string, integer, or boolean */
    enum fi_param_type type;
    /* A description of how the variable is used */
    const char *help_string;
    /* The current value of the variable */
    const char *value;
};

int fi_getparams(struct fi_param **params, int *count);
void fi_freeparams(struct fi_param *params);
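
A short sketch of using these calls to enumerate the supported variables programmatically (roughly what fi_info -e prints) might look like the following.

/* Hypothetical sketch: list supported environment variables programmatically */
#include <stdio.h>
#include <rdma/fabric.h>

static void dump_params(void)
{
    struct fi_param *params;
    int count, i;

    if (fi_getparams(&params, &count))
        return;
    for (i = 0; i < count; i++)
        printf("%s: %s\n", params[i].name,
               params[i].help_string ? params[i].help_string : "");
    fi_freeparams(params);
}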

The modification of environment variables is typically a tuning activity done on larger clusters. However there are a few values that are useful for developers. These can be seen by executing the fi_info command.

$ fi_info -e
# FI_LOG_LEVEL: String
# Specify logging level: warn, trace, info, debug (default: warn)

# FI_LOG_PROV: String
# Specify specific provider to log (default: all)

# FI_LOG_SUBSYS: String
# Specify specific subsystem to log (default: all)

# FI_PROVIDER: String
# Only use specified provider (default: all available)

Full documentation for these variables is available in the man pages. Variables beyond these may only be documented directly in the library itself, and available using the 'fi_info -e' command.

The FI_LOG_LEVEL can be used to increase the debug output from libfabric and the providers. Note that in the release build of libfabric, debug output from data path operations (transmit, receive, and completion processing) may not be available. The FI_PROVIDER variable can be used to enable or disable specific providers. This is useful to ensure that a given provider will be used.

Domains

Domains usually map to a specific local network interface adapter. A domain may either refer to the entire NIC, a port on a multi-port NIC, or a virtual device exposed by a NIC. From the viewpoint of the application, a domain identifies a set of resources that may be used together.

Similar to a fabric, opening a domain is straightforward after calling fi_getinfo().

int fi_domain(struct fid_fabric *fabric, struct fi_info *info,
    struct fid_domain **domain, void *context);

The fi_info structure returned from fi_getinfo() can be passed directly to fi_domain() to open a new domain.

Attributes

A domain defines the relationship between data transfer services (endpoints) and completion services (completion queues and counters). Many of the domain attributes describe that relationship and its impact to the application.

struct fi_domain_attr {
    struct fid_domain *domain;
    char *name;
    enum fi_threading threading;
    enum fi_progress control_progress;
    enum fi_progress data_progress;
    enum fi_resource_mgmt resource_mgmt;
    enum fi_av_type av_type;
    enum fi_mr_mode mr_mode;
    size_t mr_key_size;
    size_t cq_data_size;
    size_t cq_cnt;
    size_t ep_cnt;
    size_t tx_ctx_cnt;
    size_t rx_ctx_cnt;
    size_t max_ep_tx_ctx;
    size_t max_ep_rx_ctx;
    size_t max_ep_stx_ctx;
    size_t max_ep_srx_ctx;
    /* additional limit and capability fields omitted; see fi_domain(3) */
};

Details of select attributes and their impact to the application are described below.

Threading

OFI defines a unique threading model. The libfabric design is heavily influenced by object-oriented programming concepts. A multi-threaded application must determine how libfabric objects (domains, endpoints, completion queues, etc.) will be allocated among its threads, or if any thread can access any object. For example, an application may spawn a new thread to handle each new connected endpoint. The domain threading field provides a mechanism for an application to identify which objects may be accessed simultaneously by different threads. This in turn allows a provider to optimize or, in some cases, eliminate internal synchronization and locking around those objects.

The threading is best described as synchronization levels. As threading levels increase, greater potential parallelism is achieved. For example, an application can indicate that it will only access an endpoint from a single thread. This allows the provider to avoid acquiring locks around data transfer calls, knowing that there cannot be two simultaneous calls to send data on the same endpoint. The provider would only need to provide serialization if separate endpoints accessed the same shared software or hardware resources.

Threading defines where providers could optimize synchronization primitives. However, providers may still implement more serialization than is needed by the application. (This is usually a result of keeping the provider implementation simpler).

Various threading models are described in detail in the man pages. Developers should study the fi_domain man page and available threading options, and select a mode that is best suited for how the application was designed. If an application leaves the value undefined, providers will report the highest (most parallel) threading level that they support.
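
For example, an application that guarantees each domain and its objects are only ever touched by one thread could request the corresponding level through hints, as in the sketch below; the exact level should be chosen from the options documented in fi_domain(3).

/* Hypothetical sketch: advertise the application's threading model via hints */
#include <rdma/fabric.h>

static void set_threading(struct fi_info *hints)
{
    /* FI_THREAD_DOMAIN: objects under a domain are accessed by a single thread,
     * so the provider may drop internal locking around them.
     * FI_THREAD_SAFE is the most permissive (fully thread-safe) level. */
    hints->domain_attr->threading = FI_THREAD_DOMAIN;
}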

Progress

As previously discussed, progress models are a result of using the host processor in order to perform some portion of the transport protocol. In order to simplify development, OFI defines two progress models: automatic or manual. It does not attempt to identify which specific interface features may be offloaded, or what operations require additional processing by the application's thread.

Automatic progress means that an operation initiated by the application will eventually complete, even if the application makes no further calls into the libfabric API. The operation is either offloaded entirely onto hardware, handled by an internal provider thread, or performed by the operating system kernel. The use of automatic progress may increase system overhead and latency in the latter two cases. For control operations, this is usually acceptable. However, the impact to data transfers may be measurable, especially if internal threads are required to provide automatic progress.

The manual progress model can avoid this overhead for providers that do not offload all transport features into hardware. With manual progress the provider implementation will handle transport operations as part of specific libfabric functions. For example, a call to fi_cq_read() which retrieves a list of completed operations may also be responsible for sending ack messages to notify peers that a message has been received. Since reading the completion queue is part of the normal operation of an application, there is little impact to the application and additional threads are avoided.

Applications need to take care when using manual progress, particularly if they link into libfabric multiple times through different code paths or library dependencies. If application threads are used to drive progress, such as responding to received data with ACKs, then it is critical that the application thread call into libfabric in a timely manner.

OFI defines wait and poll set objects that are specifically designed to assist with driving manual progress.
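
A minimal sketch of opting into manual progress and driving it from an application thread is shown below. It assumes a completion queue is already bound to the active endpoint, and it simply polls until a completion is available.

/* Hypothetical sketch: request manual progress and drive it from the app thread */
#include <rdma/fabric.h>
#include <rdma/fi_eq.h>
#include <rdma/fi_errno.h>

static void request_manual_progress(struct fi_info *hints)
{
    hints->domain_attr->data_progress = FI_PROGRESS_MANUAL;
    hints->domain_attr->control_progress = FI_PROGRESS_MANUAL;
}

static void progress_loop(struct fid_cq *cq)
{
    struct fi_cq_entry comp;
    ssize_t ret;

    /* Each call into fi_cq_read also lets the provider advance its
     * transport protocol (e.g. send pending acknowledgments). */
    do {
        ret = fi_cq_read(cq, &comp, 1);
    } while (ret == -FI_EAGAIN);
}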

Next article: https://cloud.tencent.com/developer/article/2531046
