Programming Examples Using IBV Verbs

This chapter provides code examples using the IBV Verbs.

Synopsis for RDMA_RC Example Using IBV Verbs

The following is a synopsis of the functions in the programming example, in the order that they are called.

Main

Parse command line. The user may set the TCP port, device name, and device port for the test. If set, these values will override default values in config. The last parameter is the server name. If the server name is set, this designates a server to connect to and therefore puts the program into client mode. Otherwise the program is in server mode.

Call print_config.

Call resources_init.

Call resources_create.

Call connect_qp.

If in server mode, call post_send with the IBV_WR_SEND operation.

Call poll_completion. Note that the server side expects a completion from the SEND request and the client side expects a RECEIVE completion.

If in client mode, show the message we received via the RECEIVE operation, otherwise, if we are in server mode, load the buffer with a new message.

Sync client<->server.

At this point the server goes directly to the next sync. All RDMA operations are done strictly by the client.

***Client only ***

Call post_send with IBV_WR_RDMA_READ to perform an RDMA read of the server’s buffer.

Call poll_completion.

Show server’s message.

Setup send buffer with new message.

Call post_send with IBV_WR_RDMA_WRITE to perform an RDMA write to the server’s buffer.

Call poll_completion.

*** End client only operations ***

Sync client<->server.

If server mode, show buffer, proving RDMA write worked.

Call resources_destroy.

Free device name string.

Done.
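
The command-line handling in the first step of Main can be sketched roughly as follows. This is only an illustration under stated assumptions: `struct opts` and `parse_args` are hypothetical names that do not appear in the example program, and the fields simply mirror the example's `config` structure.

```c
#include <getopt.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical container for the parsed values (mirrors the example's config). */
struct opts {
    const char *dev_name;    /* IB device name, NULL = first device found */
    const char *server_name; /* NULL = server mode */
    unsigned    tcp_port;
    int         ib_port;
    int         gid_idx;     /* -1 = GRH not used */
};

/* Returns 0 on success, -1 on an unknown option or extra arguments. */
static int parse_args(int argc, char *argv[], struct opts *o)
{
    static const struct option long_options[] = {
        { .name = "port",    .has_arg = required_argument, .val = 'p' },
        { .name = "ib-dev",  .has_arg = required_argument, .val = 'd' },
        { .name = "ib-port", .has_arg = required_argument, .val = 'i' },
        { .name = "gid-idx", .has_arg = required_argument, .val = 'g' },
        { .name = NULL }
    };
    int c;

    optind = 1; /* start scanning at the first argument */
    while ((c = getopt_long(argc, argv, "p:d:i:g:", long_options, NULL)) != -1) {
        switch (c) {
        case 'p': o->tcp_port = (unsigned)strtoul(optarg, NULL, 0); break;
        case 'd': o->dev_name = optarg; break;
        case 'i': o->ib_port  = (int)strtol(optarg, NULL, 0); break;
        case 'g': o->gid_idx  = (int)strtol(optarg, NULL, 0); break;
        default:  return -1;
        }
    }

    /* A single trailing argument is the server name and selects client mode. */
    if (optind == argc - 1)
        o->server_name = argv[optind];
    else if (optind < argc)
        return -1;
    return 0;
}
```

The caller fills `struct opts` with its defaults before calling `parse_args`, so any option that is not given keeps its default value.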

print_config

Print out configuration information.

resources_init

Clears resources struct.

resources_create

Call sock_connect to connect a TCP socket to the peer.

Get the list of devices, locate the one we want, and open it.

Free the device list.

Get the port information.

Create a PD.

Create a CQ.

Allocate a buffer, initialize it, register it.

Create a QP.

sock_connect

If client, resolve DNS address of server and initiate a connection to it.

If server, listen for incoming connection on indicated port.
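
The address-resolution step shared by both modes might look like the sketch below. `resolve_peer` is a hypothetical helper, not a function of the example program. It relies on the standard behavior of `getaddrinfo`: with `AI_PASSIVE` set and a NULL host name, the result is a wildcard address suitable for `bind`.

```c
#include <netdb.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>

/*
 * Hypothetical helper: resolve the address to connect to (client mode,
 * servername != NULL) or to bind to (server mode, servername == NULL).
 * Returns 0 on success or a getaddrinfo() error code. On success the
 * caller releases *out with freeaddrinfo().
 */
static int resolve_peer(const char *servername, int port, struct addrinfo **out)
{
    struct addrinfo hints;
    char service[6]; /* "65535" plus the terminating NUL */

    if (port < 0 || port > 65535)
        return EAI_SERVICE;
    snprintf(service, sizeof service, "%d", port);

    memset(&hints, 0, sizeof hints);
    hints.ai_flags    = AI_PASSIVE;  /* only takes effect when servername is NULL */
    hints.ai_family   = AF_INET;
    hints.ai_socktype = SOCK_STREAM;

    return getaddrinfo(servername, service, &hints, out);
}
```

A client would then call `socket` and `connect` on each returned entry until one succeeds, while a server would call `bind`, `listen`, and `accept`.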

connect_qp

Call modify_qp_to_init.

If in client mode, call post_receive.

Call sock_sync_data to exchange information between server and client.

Call modify_qp_to_rtr.

Call modify_qp_to_rts.

Call sock_sync_data to synchronize client<->server.

modify_qp_to_init

Transition QP to INIT state.

post_receive

Prepare a scatter/gather entry for the receive buffer.

Prepare an RR.

Post the RR.

sock_sync_data

Using the TCP socket created with sock_connect, synchronize the given set of data between client and the server. Since this function is blocking, it is also called with dummy data to synchronize the timing of the client and server.

modify_qp_to_rtr

Transition QP to RTR state.

modify_qp_to_rts

Transition QP to RTS state.

post_send

Prepare a scatter/gather entry for data to be sent (or received in RDMA read case).

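Create an SR. Note that setting IBV_SEND_SIGNALED is redundant here, because the QP is created with sq_sig_all set to 1, so every send request generates a completion.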

If this is an RDMA operation, set the address and key.

Post the SR.

poll_completion

Poll CQ until an entry is found or MAX_POLL_CQ_TIMEOUT milliseconds are reached.
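
The polling loop can be pictured independently of the verbs library. In the sketch below, the `poll_fn` callback is a hypothetical stand-in for `ibv_poll_cq` with the same return convention (negative on error, 0 when the queue is empty, positive when an entry was found). `ready_after_n` and `never_ready` are made-up stubs that exist only to exercise the loop.

```c
#include <stddef.h>
#include <sys/time.h>

/* Hypothetical stand-in for ibv_poll_cq(): <0 error, 0 empty, >0 entry found. */
typedef int (*poll_fn)(void *ctx);

/* Current wall-clock time in milliseconds. */
static unsigned long now_msec(void)
{
    struct timeval tv;

    gettimeofday(&tv, NULL);
    return (unsigned long)tv.tv_sec * 1000UL + (unsigned long)tv.tv_usec / 1000UL;
}

/* Returns 0 if an entry was found, 1 on a poll error or after the timeout. */
static int poll_with_timeout(poll_fn poll_once, void *ctx, unsigned long timeout_msec)
{
    unsigned long start = now_msec();
    int result;

    do {
        result = poll_once(ctx);
    } while (result == 0 && now_msec() - start < timeout_msec);

    return result > 0 ? 0 : 1;
}

/* Stub: reports an entry once the counter pointed to by ctx reaches zero. */
static int ready_after_n(void *ctx)
{
    int *left = ctx;

    if (*left > 0)
        (*left)--;
    return *left == 0 ? 1 : 0;
}

/* Stub: never reports an entry, so the caller runs into the timeout. */
static int never_ready(void *ctx)
{
    (void)ctx;
    return 0;
}
```

This is a busy-wait loop, which is acceptable for a short example but consumes a full CPU core while it waits.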

resources_destroy

Release/free/deallocate all items in resource struct.

Code for Send, Receive, RDMA Read, RDMA Write

            /*

* BUILD COMMAND:

* gcc -Wall -I/usr/local/ofed/include -O2 -o RDMA_RC_example -L/usr/local/ofed/lib64 -L/usr/local/ofed/lib -libverbs RDMA_RC_example.c

*

*/

/******************************************************************************

*

* RDMA Aware Networks Programming Example

*

* This code demonstrates how to perform the following operations using the VPI Verbs API:

*

* Send

* Receive

* RDMA Read

* RDMA Write

*

*****************************************************************************/

 

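#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <stdint.h>
#include <inttypes.h>
#include <endian.h>
#include <byteswap.h>
#include <getopt.h>
#include <sys/time.h>
#include <arpa/inet.h>
#include <infiniband/verbs.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netdb.h>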

 

/* poll CQ timeout in millisec (2 seconds) */

#define MAX_POLL_CQ_TIMEOUT 2000

#define MSG "SEND operation "

#define RDMAMSGR "RDMA read operation "

#define RDMAMSGW "RDMA write operation"

#define MSG_SIZE (strlen(MSG) + 1)

 

#if __BYTE_ORDER == __LITTLE_ENDIAN

static inline uint64_t htonll(uint64_t x) { return bswap_64(x); }

static inline uint64_t ntohll(uint64_t x) { return bswap_64(x); }

#elif __BYTE_ORDER == __BIG_ENDIAN

static inline uint64_t htonll(uint64_t x) { return x; }

static inline uint64_t ntohll(uint64_t x) { return x; }

#else

#error __BYTE_ORDER is neither __LITTLE_ENDIAN nor __BIG_ENDIAN

#endif

 

/* structure of test parameters */

struct config_t

{

const char	*dev_name;	/* IB device name */
char	*server_name;	/* server host name */
u_int32_t	tcp_port;	/* server TCP port */
int	ib_port;	/* local IB port to work with */
int	gid_idx;	/* gid index to use */

};

/* structure to exchange data which is needed to connect the QPs */

struct cm_con_data_t

{

uint64_t	addr;	/* Buffer address */
uint32_t	rkey;	/* Remote key */
uint32_t	qp_num;	/* QP number */
uint16_t	lid;	/* LID of the IB port */
uint8_t	gid[16];	/* gid */

} __attribute__ ((packed));

/* structure of system resources */

struct resources

{

struct ibv_device_attr device_attr;	/* Device attributes */
struct ibv_port_attr	port_attr;	/* IB port attributes */
struct cm_con_data_t	remote_props;	/* values to connect to remote side */
struct ibv_context	*ib_ctx;	/* device handle */
struct ibv_pd	*pd;	/* PD handle */
struct ibv_cq	*cq;	/* CQ handle */
struct ibv_qp	*qp;	/* QP handle */
struct ibv_mr	*mr;	/* MR handle for buf */
char	*buf;	/* memory buffer pointer, used for RDMA and send ops */
int	sock;	/* TCP socket file descriptor */

};

struct config_t config =

{

NULL,	/* dev_name */
NULL,	/* server_name */
19875,	/* tcp_port */
1,	/* ib_port */
-1	/* gid_idx */

};

            /******************************************************************************
 
Socket operations
 
 
 
For simplicity, the example program uses TCP sockets to exchange control
 
information. If a TCP/IP stack/connection is not available, connection manager
 
(CM) may be used to pass this information. Use of CM is beyond the scope of
 
this example.
 
 
 
******************************************************************************/
 
 
 
 
 
/******************************************************************************
 
* Function: sock_connect
 
*
 
* Input
 
* servername URL of server to connect to (NULL for server mode)
 
* port port of service
 
*
 
* Output
 
* none
 
*
 
* Returns
 
* socket (fd) on success, negative error code on failure
 
*
 
* Description
 
* Connect a socket. If servername is specified a client connection will be
 
* initiated to the indicated server and port. Otherwise listen on the
 
* indicated port for an incoming connection.
 
*
 
******************************************************************************/
 
 
 
static int sock_connect(const char *servername, int port)
 
{

struct addrinfo	*resolved_addr = NULL;
struct addrinfo	*iterator;
char	service[6];
int	sockfd = -1;
int	listenfd = 0;
int	tmp;

            struct addrinfo hints =
 
{
 
.ai_flags = AI_PASSIVE,
 
.ai_family = AF_INET,
 
.ai_socktype = SOCK_STREAM
 
};
 
 
 
if (sprintf(service, "%d", port) < 0)
 
goto sock_connect_exit;
 
 
 
/* Resolve DNS address, use sockfd as temp storage */
 
 
sockfd = getaddrinfo(servername, service, &hints, &resolved_addr);
 
 
 
/* getaddrinfo returns 0 on success and a nonzero error code on failure */
if (sockfd != 0)
{
fprintf(stderr, "%s for %s:%d\n", gai_strerror(sockfd), servername, port);
sockfd = -1;
goto sock_connect_exit;
}
 
 
 
/* Search through results and find the one we want */
 
 
for (iterator = resolved_addr; iterator ; iterator = iterator->ai_next)
 
{
 
sockfd = socket(iterator->ai_family, iterator->ai_socktype, iterator->ai_protocol);
 
 
 
if (sockfd >= 0)
 
{
 
if (servername)
{
/* Client mode. Initiate connection to remote */
if((tmp=connect(sockfd, iterator->ai_addr, iterator->ai_addrlen)))
{
fprintf(stdout, "failed connect \n");
close(sockfd);
sockfd = -1;
}
}
else
{
/* Server mode. Set up listening socket and accept a connection */
 
listenfd = sockfd;
 
sockfd = -1;
 
if(bind(listenfd, iterator->ai_addr, iterator->ai_addrlen))
 
goto sock_connect_exit;
 
listen(listenfd, 1);
 
sockfd = accept(listenfd, NULL, 0);
 
}
 
}
 
}
 
 
 
sock_connect_exit:
 
 
if(listenfd)
 
close(listenfd);
 
 
 
if(resolved_addr)
 
freeaddrinfo(resolved_addr);
 
 
 
if (sockfd < 0)
 
{
 
if(servername)
 
fprintf(stderr, "Couldn't connect to %s:%d\n", servername, port);
 
else
 
{
 
perror("server accept");
 
fprintf(stderr, "accept() failed\n");
 
}
 
}
 
 
 
return sockfd;
 
}
 
 
 
 
 
/******************************************************************************
 
* Function: sock_sync_data
 
*
 
* Input

* sock	socket to transfer data on
* xfer_size	size of data to transfer
* local_data	pointer to data to be sent to remote

            *
 
* Output
 
* remote_data pointer to buffer to receive remote data
 
*
 
* Returns
 
* 0 on success, negative error code on failure
 
*
 
* Description
 
* Sync data across a socket. The indicated local data will be sent to the
 
* remote. It will then wait for the remote to send its data back. It is
 
* assumed that the two sides are in sync and call this function in the proper
 
* order. Chaos will ensue if they are not. :)
 
*
 
* Also note this is a blocking function and will wait for the full data to be
 
* received from the remote.
 
*
 
******************************************************************************/
 
 
 
int sock_sync_data(int sock, int xfer_size, char *local_data, char *remote_data)
 
{
 
int rc;
 
int read_bytes = 0;
 
int total_read_bytes = 0;
 
 
 
rc = write(sock, local_data, xfer_size);
 
if(rc < xfer_size)
 
fprintf(stderr, "Failed writing data during sock_sync_data\n");
 
else
 
rc = 0;
 
 
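while(!rc && total_read_bytes < xfer_size)
{
/* read into the unfilled tail of the buffer so a partial read is not overwritten */
read_bytes = read(sock, remote_data + total_read_bytes, xfer_size - total_read_bytes);
if(read_bytes > 0)
total_read_bytes += read_bytes;
else
rc = -1; /* read error, or the peer closed the connection early */
}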
 
 
 
return rc;
 
}
 
 
 
 
 
/******************************************************************************
 
End of socket operations
 
******************************************************************************/
 
 
 
/* poll_completion */
 
/******************************************************************************
 
* Function: poll_completion
 
*
 
* Input
 
* res pointer to resources structure
 
*
 
* Output
 
* none
 
*
 
* Returns
 
* 0 on success, 1 on failure
 
*
 
* Description
 
* Poll the completion queue for a single event. This function will continue to
 
* poll the queue until MAX_POLL_CQ_TIMEOUT milliseconds have passed.
 
*
 
******************************************************************************/
 
 
 
static int poll_completion(struct resources *res)
 
{

struct ibv_wc	wc;
unsigned long	start_time_msec;
unsigned long	cur_time_msec;
struct timeval	cur_time;
int	poll_result;
int	rc = 0;

/* poll the completion queue for a while before giving up */
 
gettimeofday(&cur_time, NULL);
 
start_time_msec = (cur_time.tv_sec * 1000) + (cur_time.tv_usec / 1000);
 
 
 
do
 
{
 
poll_result = ibv_poll_cq(res->cq, 1, &wc);
 
gettimeofday(&cur_time, NULL);
 
cur_time_msec = (cur_time.tv_sec * 1000) + (cur_time.tv_usec / 1000);
 
} while ((poll_result == 0) && ((cur_time_msec - start_time_msec) < MAX_POLL_CQ_TIMEOUT));
 
 
 
if(poll_result < 0)
 
{
 
/* poll CQ failed */
 
fprintf(stderr, "poll CQ failed\n");
 
rc = 1;
 
}
 
else if (poll_result == 0)
 
{
 
/* the CQ is empty */
 
fprintf(stderr, "completion wasn't found in the CQ after timeout\n");
 
rc = 1;
 
}
 
else
 
{
 
/* CQE found */
 
fprintf(stdout, "completion was found in CQ with status 0x%x\n", wc.status);
 
 
 
/* check the completion status (here we don't care about the completion opcode) */
 
if (wc.status != IBV_WC_SUCCESS)
 
{
 
fprintf(stderr, "got bad completion with status: 0x%x, vendor syndrome: 0x%x\n", wc.status, wc.vendor_err);
 
rc = 1;
 
}
 
}
 
 
 
return rc;
 
}
 
 
 
/******************************************************************************
 
* Function: post_send
 
*
 
* Input
 
* res pointer to resources structure
 
* opcode IBV_WR_SEND, IBV_WR_RDMA_READ or IBV_WR_RDMA_WRITE
 
*
 
* Output
 
* none
 
*
 
* Returns
 
* 0 on success, error code on failure
 
*
 
* Description
 
* This function will create and post a send work request
 
******************************************************************************/
 
 
 
static int post_send(struct resources *res, int opcode)
 
{

struct ibv_send_wr	sr;
struct ibv_sge	sge;
struct ibv_send_wr	*bad_wr = NULL;
int	rc;

            /* prepare the scatter/gather entry */
 
memset(&sge, 0, sizeof(sge));
 
 
 
sge.addr = (uintptr_t)res->buf;
 
sge.length = MSG_SIZE;
 
sge.lkey = res->mr->lkey;
 
 
 
/* prepare the send work request */
 
memset(&sr, 0, sizeof(sr));
 
 
 
sr.next = NULL;
 
sr.wr_id = 0;
 
sr.sg_list = &sge;
 
sr.num_sge = 1;
 
sr.opcode = opcode;
 
sr.send_flags = IBV_SEND_SIGNALED;
 
 
 
if(opcode != IBV_WR_SEND)
 
{
 
sr.wr.rdma.remote_addr = res->remote_props.addr;
 
sr.wr.rdma.rkey = res->remote_props.rkey;
 
}
 
 
 
/* there is a Receive Request on the responder side, so we won't get into the RNR flow */
 
rc = ibv_post_send(res->qp, &sr, &bad_wr);
 
if (rc)
 
fprintf(stderr, "failed to post SR\n");
 
else
 
{
 
switch(opcode)
 
{
 
case IBV_WR_SEND:
 
fprintf(stdout, "Send Request was posted\n");
 
break;
 
 
 
case IBV_WR_RDMA_READ:
 
fprintf(stdout, "RDMA Read Request was posted\n");
 
break;
 
 
 
case IBV_WR_RDMA_WRITE:
 
fprintf(stdout, "RDMA Write Request was posted\n");
 
break;
 
 
 
default:
 
fprintf(stdout, "Unknown Request was posted\n");
 
break;
 
}
 
}
 
 
 
return rc;
 
}
 
 
 
/******************************************************************************
 
* Function: post_receive
 
*
 
* Input
 
* res pointer to resources structure
 
*
 
* Output
 
* none
 
*
 
* Returns
 
* 0 on success, error code on failure
 
*
 
* Description
 
* This function will create and post a receive work request
 
******************************************************************************/
 
 
 
static int post_receive(struct resources *res)
 
{

struct ibv_recv_wr	rr;
struct ibv_sge	sge;
struct ibv_recv_wr	*bad_wr;
int	rc;

            /* prepare the scatter/gather entry */
 
memset(&sge, 0, sizeof(sge));
 
sge.addr = (uintptr_t)res->buf;
 
sge.length = MSG_SIZE;
 
sge.lkey = res->mr->lkey;
 
 
 
/* prepare the receive work request */
 
memset(&rr, 0, sizeof(rr));
 
 
 
rr.next = NULL;
 
rr.wr_id = 0;
 
rr.sg_list = &sge;
 
rr.num_sge = 1;
 
 
 
/* post the Receive Request to the RQ */
 
rc = ibv_post_recv(res->qp, &rr, &bad_wr);
 
if (rc)
 
fprintf(stderr, "failed to post RR\n");
 
else
 
fprintf(stdout, "Receive Request was posted\n");
 
 
 
return rc;
 
}
 
 
 
/******************************************************************************
 
* Function: resources_init
 
*
 
* Input
 
* res pointer to resources structure
 
*
 
* Output
 
* res is initialized
 
*
 
* Returns
 
* none
 
*
 
* Description
 
* res is initialized to default values
 
******************************************************************************/
 
static void resources_init(struct resources *res)
 
{
 
memset(res, 0, sizeof *res);
 
res->sock = -1;
 
}
 
 
 
/******************************************************************************
 
* Function: resources_create
 
*
 
* Input
 
* res pointer to resources structure to be filled in
 
*
 
* Output
 
* res filled in with resources
 
*
 
* Returns
 
* 0 on success, 1 on failure
 
*
 
* Description
 
*
 
* This function creates and allocates all necessary system resources. These
 
* are stored in res.
 
*****************************************************************************/
 
 
 
static int resources_create(struct resources *res)
 
{
 
struct ibv_device **dev_list = NULL;
 
struct ibv_qp_init_attr qp_init_attr;
 
struct ibv_device *ib_dev = NULL;

size_t	size;
int	i;
int	mr_flags = 0;
int	cq_size = 0;
int	num_devices;
int	rc = 0;

            /* if client side */
 
if (config.server_name)
 
{
 
res->sock = sock_connect(config.server_name, config.tcp_port);
 
if (res->sock < 0)
 
{
 
fprintf(stderr, "failed to establish TCP connection to server %s, port %d\n",
 
config.server_name, config.tcp_port);
 
rc = -1;
 
goto resources_create_exit;
 
}
 
}
 
else
 
{
 
fprintf(stdout, "waiting on port %d for TCP connection\n", config.tcp_port);
 
 
 
res->sock = sock_connect(NULL, config.tcp_port);
 
if (res->sock < 0)
 
{
 
fprintf(stderr, "failed to establish TCP connection with client on port %d\n",
 
config.tcp_port);
 
rc = -1;
 
goto resources_create_exit;
 
}
 
}
 
 
 
fprintf(stdout, "TCP connection was established\n");
 
 
 
fprintf(stdout, "searching for IB devices in host\n");
 
 
 
/* get device names in the system */
 
dev_list = ibv_get_device_list(&num_devices);
 
if (!dev_list)
 
{
 
fprintf(stderr, "failed to get IB devices list\n");
 
rc = 1;
 
goto resources_create_exit;
 
}
 
 
 
/* if there isn't any IB device in host */
 
if (!num_devices)
 
{
 
fprintf(stderr, "found %d device(s)\n", num_devices);
 
rc = 1;
 
goto resources_create_exit;
 
}
 
 
 
fprintf(stdout, "found %d device(s)\n", num_devices);
 
 
 
/* search for the specific device we want to work with */
 
for (i = 0; i < num_devices; i ++)
 
{
 
if(!config.dev_name)
 
{
 
config.dev_name = strdup(ibv_get_device_name(dev_list[i]));
 
fprintf(stdout, "device not specified, using first one found: %s\n", config.dev_name);
 
}
 
if (!strcmp(ibv_get_device_name(dev_list[i]), config.dev_name))
 
{
 
ib_dev = dev_list[i];
 
break;
 
}
 
}
 
 
 
/* if the device wasn't found in host */
 
if (!ib_dev)
 
{
 
fprintf(stderr, "IB device %s wasn't found\n", config.dev_name);
 
rc = 1;
 
goto resources_create_exit;
 
}
 
 
 
/* get device handle */
 
res->ib_ctx = ibv_open_device(ib_dev);
 
if (!res->ib_ctx)
 
{
 
fprintf(stderr, "failed to open device %s\n", config.dev_name);
 
rc = 1;
 
goto resources_create_exit;
 
}
 
 
 
/* We are now done with device list, free it */
 
 
 
ibv_free_device_list(dev_list);
 
dev_list = NULL;
 
ib_dev = NULL;
 
 
 
 
 
/* query port properties */
 
if (ibv_query_port(res->ib_ctx, config.ib_port, &res->port_attr))
 
{
 
fprintf(stderr, "ibv_query_port on port %u failed\n", config.ib_port);
 
rc = 1;
 
goto resources_create_exit;
 
}
 
 
 
/* allocate Protection Domain */
 
res->pd = ibv_alloc_pd(res->ib_ctx);
 
if (!res->pd)
 
{
 
fprintf(stderr, "ibv_alloc_pd failed\n");
 
rc = 1;
 
goto resources_create_exit;
 
}
 
 
 
/* each side will send only one WR, so Completion Queue with 1 entry is enough */
 
cq_size = 1;
 
res->cq = ibv_create_cq(res->ib_ctx, cq_size, NULL, NULL, 0);
 
if (!res->cq)
 
{
 
fprintf(stderr, "failed to create CQ with %u entries\n", cq_size);
 
rc = 1;
 
goto resources_create_exit;
 
}
 
 
 
/* allocate the memory buffer that will hold the data */
 
 
 
size = MSG_SIZE;
 
res->buf = (char *) malloc(size);
 
 
 
if (!res->buf )
 
{
 
fprintf(stderr, "failed to malloc %zu bytes to memory buffer\n", size);
 
rc = 1;
 
goto resources_create_exit;
 
}
 
 
 
memset(res->buf, 0 , size);
 
 
 
/* only in the server side put the message in the memory buffer */
 
if (!config.server_name)
 
{
 
strcpy(res->buf, MSG);
 
fprintf(stdout, "going to send the message: '%s'\n", res->buf);
 
}
 
else
 
memset(res->buf, 0, size);
 
 
 
/* register the memory buffer */
 
 
 
mr_flags = IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_READ | IBV_ACCESS_REMOTE_WRITE ;
 
res->mr = ibv_reg_mr(res->pd, res->buf, size, mr_flags);
 
if (!res->mr)
 
{
 
fprintf(stderr, "ibv_reg_mr failed with mr_flags=0x%x\n", mr_flags);
 
rc = 1;
 
goto resources_create_exit;
 
}
 
 
 
fprintf(stdout, "MR was registered with addr=%p, lkey=0x%x, rkey=0x%x, flags=0x%x\n",
 
res->buf, res->mr->lkey, res->mr->rkey, mr_flags);
 
 
 
 
 
/* create the Queue Pair */
 
memset(&qp_init_attr, 0, sizeof(qp_init_attr));
 
 
 
qp_init_attr.qp_type = IBV_QPT_RC;
 
qp_init_attr.sq_sig_all = 1;
 
qp_init_attr.send_cq = res->cq;
 
qp_init_attr.recv_cq = res->cq;
 
qp_init_attr.cap.max_send_wr = 1;
 
qp_init_attr.cap.max_recv_wr = 1;
 
qp_init_attr.cap.max_send_sge = 1;
 
qp_init_attr.cap.max_recv_sge = 1;
 
 
 
res->qp = ibv_create_qp(res->pd, &qp_init_attr);
 
if (!res->qp)
 
{
 
fprintf(stderr, "failed to create QP\n");
 
rc = 1;
 
goto resources_create_exit;
 
}
 
fprintf(stdout, "QP was created, QP number=0x%x\n", res->qp->qp_num);
 
 
 
resources_create_exit:
 
 
 
if(rc)
 
{
 
/* Error encountered, cleanup */
 
 
 
if(res->qp)
 
{
 
ibv_destroy_qp(res->qp);
 
res->qp = NULL;
 
}
 
 
 
if(res->mr)
 
{
 
ibv_dereg_mr(res->mr);
 
res->mr = NULL;
 
}
 
 
 
if(res->buf)
 
{
 
free(res->buf);
 
res->buf = NULL;
 
}
 
 
 
if(res->cq)
 
{
 
ibv_destroy_cq(res->cq);
 
res->cq = NULL;
 
}
 
 
if(res->pd)
 
{
 
ibv_dealloc_pd(res->pd);
 
res->pd = NULL;
 
}
 
 
 
if(res->ib_ctx)
 
{
 
ibv_close_device(res->ib_ctx);
 
res->ib_ctx = NULL;
 
}
 
 
 
if(dev_list)
 
{
 
ibv_free_device_list(dev_list);
 
dev_list = NULL;
 
}
 
if (res->sock >= 0)
 
{
 
if (close(res->sock))
 
fprintf(stderr, "failed to close socket\n");
 
res->sock = -1;
 
}
 
}
 
 
 
return rc;
 
}
 
 
 
/******************************************************************************
 
* Function: modify_qp_to_init
 
*
 
* Input
 
* qp QP to transition
 
*
 
* Output
 
* none
 
*
 
* Returns
 
* 0 on success, ibv_modify_qp failure code on failure
 
*
 
* Description
 
* Transition a QP from the RESET to INIT state
 
******************************************************************************/
 
 
 
static int modify_qp_to_init(struct ibv_qp *qp)
 
{

struct ibv_qp_attr	attr;
int	flags;
int	rc;

            memset(&attr, 0, sizeof(attr));
 
 
 
attr.qp_state = IBV_QPS_INIT;
 
attr.port_num = config.ib_port;
 
attr.pkey_index = 0;
 
attr.qp_access_flags = IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_READ | IBV_ACCESS_REMOTE_WRITE;
 
 
 
flags = IBV_QP_STATE | IBV_QP_PKEY_INDEX | IBV_QP_PORT | IBV_QP_ACCESS_FLAGS;
 
 
 
rc = ibv_modify_qp(qp, &attr, flags);
 
if (rc)
 
fprintf(stderr, "failed to modify QP state to INIT\n");
 
 
 
return rc;
 
}
 
 
 
/******************************************************************************
 
* Function: modify_qp_to_rtr
 
*
 
* Input

*	qp	QP to transition
*	remote_qpn	remote QP number
*	dlid	destination LID
*	dgid	destination GID (mandatory for RoCEE)

            *
 
* Output
 
* none
 
*
 
* Returns
 
* 0 on success, ibv_modify_qp failure code on failure
 
*
 
* Description
 
* Transition a QP from the INIT to RTR state, using the specified QP number
 
******************************************************************************/
 
 
 
static int modify_qp_to_rtr(struct ibv_qp *qp, uint32_t remote_qpn, uint16_t dlid, uint8_t *dgid)
 
{

struct ibv_qp_attr	attr;
int	flags;
int	rc;

            memset(&attr, 0, sizeof(attr));
 
 
 
attr.qp_state = IBV_QPS_RTR;
 
attr.path_mtu = IBV_MTU_256;
 
attr.dest_qp_num = remote_qpn;
 
attr.rq_psn = 0;
 
attr.max_dest_rd_atomic = 1;
 
attr.min_rnr_timer = 0x12;
 
attr.ah_attr.is_global = 0;
 
attr.ah_attr.dlid = dlid;
 
attr.ah_attr.sl = 0;
 
attr.ah_attr.src_path_bits = 0;
 
attr.ah_attr.port_num = config.ib_port;
 
if (config.gid_idx >= 0)
 
{
 
attr.ah_attr.is_global = 1;
 
attr.ah_attr.port_num = 1;
 
memcpy(&attr.ah_attr.grh.dgid, dgid, 16);
 
attr.ah_attr.grh.flow_label = 0;
 
attr.ah_attr.grh.hop_limit = 1;
 
attr.ah_attr.grh.sgid_index = config.gid_idx;
 
attr.ah_attr.grh.traffic_class = 0;
 
}
 
 
 
flags = IBV_QP_STATE | IBV_QP_AV | IBV_QP_PATH_MTU | IBV_QP_DEST_QPN |
 
IBV_QP_RQ_PSN | IBV_QP_MAX_DEST_RD_ATOMIC | IBV_QP_MIN_RNR_TIMER;
 
 
 
rc = ibv_modify_qp(qp, &attr, flags);
 
if (rc)
 
fprintf(stderr, "failed to modify QP state to RTR\n");
 
 
 
return rc;
 
}
 
 
 
/******************************************************************************
 
* Function: modify_qp_to_rts
 
*
 
* Input
 
* qp QP to transition
 
*
 
* Output
 
* none
 
*
 
* Returns
 
* 0 on success, ibv_modify_qp failure code on failure
 
*
 
* Description
 
* Transition a QP from the RTR to RTS state
 
******************************************************************************/
 
 
 
static int modify_qp_to_rts(struct ibv_qp *qp)
 
{

struct ibv_qp_attr	attr;
int	flags;
int	rc;

            memset(&attr, 0, sizeof(attr));
 
 
 
attr.qp_state = IBV_QPS_RTS;
 
attr.timeout = 0x12;
 
attr.retry_cnt = 6;
 
attr.rnr_retry = 0;
 
attr.sq_psn = 0;
 
attr.max_rd_atomic = 1;
 
 
 
flags = IBV_QP_STATE | IBV_QP_TIMEOUT | IBV_QP_RETRY_CNT |
 
IBV_QP_RNR_RETRY | IBV_QP_SQ_PSN | IBV_QP_MAX_QP_RD_ATOMIC;
 
 
 
rc = ibv_modify_qp(qp, &attr, flags);
 
if (rc)
 
fprintf(stderr, "failed to modify QP state to RTS\n");
 
 
 
return rc;
 
}
 
 
 
/******************************************************************************
 
* Function: connect_qp
 
*
 
* Input
 
* res pointer to resources structure
 
*
 
* Output
 
* none
 
*
 
* Returns
 
* 0 on success, error code on failure
 
*
 
* Description
 
* Connect the QP. Transition the server side to RTR, sender side to RTS
 
******************************************************************************/
 
 
 
static int connect_qp(struct resources *res)
 
{
 
struct cm_con_data_t local_con_data;
 
struct cm_con_data_t remote_con_data;
 
struct cm_con_data_t tmp_con_data;
 
int rc = 0;
 
char temp_char;
 
union ibv_gid my_gid;
 
 
 
 
 
if (config.gid_idx >= 0)
 
{
 
rc = ibv_query_gid(res->ib_ctx, config.ib_port, config.gid_idx, &my_gid);
 
if (rc)
 
{
 
fprintf(stderr, "could not get gid for port %d, index %d\n", config.ib_port, config.gid_idx);
 
return rc;
 
}
 
} else
 
memset(&my_gid, 0, sizeof my_gid);
 
 
 
 
 
/* exchange using TCP sockets info required to connect QPs */
 
local_con_data.addr = htonll((uintptr_t)res->buf);
 
local_con_data.rkey = htonl(res->mr->rkey);
 
local_con_data.qp_num = htonl(res->qp->qp_num);
 
local_con_data.lid = htons(res->port_attr.lid);
 
memcpy(local_con_data.gid, &my_gid, 16);
 
 
 
fprintf(stdout, "\nLocal LID = 0x%x\n", res->port_attr.lid);
 
 
 
if (sock_sync_data(res->sock, sizeof(struct cm_con_data_t), (char *) &local_con_data, (char *) &tmp_con_data) < 0)
 
{
 
fprintf(stderr, "failed to exchange connection data between sides\n");
 
rc = 1;
 
goto connect_qp_exit;
 
}
 
 
 
remote_con_data.addr = ntohll(tmp_con_data.addr);
 
remote_con_data.rkey = ntohl(tmp_con_data.rkey);
 
remote_con_data.qp_num = ntohl(tmp_con_data.qp_num);
 
remote_con_data.lid = ntohs(tmp_con_data.lid);
 
memcpy(remote_con_data.gid, tmp_con_data.gid, 16);
 
 
 
/* save the remote side attributes, we will need it for the post SR */
 
res->remote_props = remote_con_data;
 
 
 
fprintf(stdout, "Remote address = 0x%"PRIx64"\n", remote_con_data.addr);
 
fprintf(stdout, "Remote rkey = 0x%x\n", remote_con_data.rkey);
 
 
fprintf(stdout, "Remote QP number = 0x%x\n", remote_con_data.qp_num);
 
fprintf(stdout, "Remote LID = 0x%x\n", remote_con_data.lid);
 
if (config.gid_idx >= 0)
 
{
 
uint8_t *p = remote_con_data.gid;
 
fprintf(stdout, "Remote GID = %02x:%02x:%02x:%02x:%02x:%02x:%02x:%02x:%02x:%02x:%02x:%02x:%02x:%02x:%02x:%02x\n",
p[0], p[1], p[2], p[3], p[4], p[5], p[6], p[7], p[8], p[9], p[10], p[11], p[12], p[13], p[14], p[15]);
 
}
 
 
 
/* modify the QP to init */
 
rc = modify_qp_to_init(res->qp);
 
if (rc)
 
{
 
fprintf(stderr, "change QP state to INIT failed\n");
 
goto connect_qp_exit;
 
}
 
 
 
/* let the client post RR to be prepared for incoming messages */
 
if (config.server_name)
 
{
 
rc = post_receive(res);
 
if (rc)
 
{
 
fprintf(stderr, "failed to post RR\n");
 
goto connect_qp_exit;
 
}
 
}
 
 
 
 
 
/* modify the QP to RTR */
 
rc = modify_qp_to_rtr(res->qp, remote_con_data.qp_num, remote_con_data.lid, remote_con_data.gid);
 
if (rc)
 
{
 
fprintf(stderr, "failed to modify QP state to RTR\n");
 
goto connect_qp_exit;
 
}
 
 
 
rc = modify_qp_to_rts(res->qp);
 
if (rc)
 
{
 
fprintf(stderr, "failed to modify QP state to RTS\n");
 
goto connect_qp_exit;
 
}
 
 
 
fprintf(stdout, "QP state was changed to RTS\n");
 
 
 
 
 
/* sync to make sure that both sides are in states that they can connect, to prevent packet loss */
 
if (sock_sync_data(res->sock, 1, "Q", &temp_char)) /* just send a dummy char back and forth */
 
{
 
fprintf(stderr, "sync error after QPs were moved to RTS\n");
 
rc = 1;
 
}
 
 
 
connect_qp_exit:
 
 
 
return rc;
 
}
 
 
 
/******************************************************************************
 
* Function: resources_destroy
 
*
 
* Input
 
* res pointer to resources structure
 
*
 
* Output
 
* none
 
*
 
* Returns
 
* 0 on success, 1 on failure
 
*
 
* Description
 
* Cleanup and deallocate all resources used
 
******************************************************************************/
 
 
 
static int resources_destroy(struct resources *res)
 
{
 
int rc = 0;
 
 
 
if (res->qp)
 
if (ibv_destroy_qp(res->qp))
 
{
 
fprintf(stderr, "failed to destroy QP\n");
 
rc = 1;
 
}
 
 
 
if (res->mr)
 
if (ibv_dereg_mr(res->mr))
 
{
 
fprintf(stderr, "failed to deregister MR\n");
 
rc = 1;
 
}
 
 
 
if (res->buf)
 
free(res->buf);
 
 
 
if (res->cq)
 
if (ibv_destroy_cq(res->cq))
 
{
 
fprintf(stderr, "failed to destroy CQ\n");
 
rc = 1;
 
}
 
 
 
if (res->pd)
 
if (ibv_dealloc_pd(res->pd))
 
{
 
fprintf(stderr, "failed to deallocate PD\n");
 
rc = 1;
 
}
 
 
 
if (res->ib_ctx)
 
if (ibv_close_device(res->ib_ctx))
 
{
 
fprintf(stderr, "failed to close device context\n");
 
rc = 1;
 
}
 
 
 
if (res->sock >= 0)
 
if (close(res->sock))
 
{
 
fprintf(stderr, "failed to close socket\n");
 
rc = 1;
 
}
 
 
 
return rc;
 
}
 
 
 
/******************************************************************************
 
* Function: print_config
 
*
 
* Input
 
* none
 
*
 
* Output
 
* none
 
*
 
* Returns
 
* none
 
*
 
* Description
 
* Print out config information
 
******************************************************************************/
 
static void print_config(void)
 
{
 
fprintf(stdout, " ------------------------------------------------\n");

fprintf(stdout,	" Device name	: \"%s\"\n", config.dev_name);
fprintf(stdout,	" IB port	: %u\n", config.ib_port);

if (config.server_name)
fprintf(stdout, " IP : %s\n", config.server_name);

fprintf(stdout, " TCP port : %u\n", config.tcp_port);

if (config.gid_idx >= 0)
fprintf(stdout, " GID index : %u\n", config.gid_idx);

            fprintf(stdout, " ------------------------------------------------\n\n");
 
}
 
 
 
/******************************************************************************
 
* Function: usage
 
*
 
* Input
 
* argv0 command line arguments
 
*
 
* Output
 
* none
 
*
 
* Returns
 
* none
 
*
 
* Description
 
* print a description of command line syntax
 
******************************************************************************/
 
 
 
static void usage(const char *argv0)
 
{
 
fprintf(stdout, "Usage:\n");
 
fprintf(stdout, " %s start a server and wait for connection\n", argv0);
 
fprintf(stdout, " %s <host> connect to server at <host>\n", argv0);
 
fprintf(stdout, "\n");
 
fprintf(stdout, "Options:\n");
 
fprintf(stdout, " -p, --port <port> listen on/connect to port <port> (default 18515)\n");
 
fprintf(stdout, " -d, --ib-dev <dev> use IB device <dev> (default first device found)\n");
 
fprintf(stdout, " -i, --ib-port <port> use port <port> of IB device (default 1)\n");
 
fprintf(stdout, " -g, --gid-idx <gid index> gid index to be used in GRH (default not used)\n");
 
}
 
 
 
/******************************************************************************
 
* Function: main
 
*
 
* Input
 
* argc number of items in argv
 
* argv command line parameters
 
*
 
* Output
 
* none
 
*
 
* Returns
 
* 0 on success, 1 on failure
 
*
 
* Description
 
* Main program code
 
******************************************************************************/
 
 
 
int main(int argc, char *argv[])
 
{

struct resources	res;
int	rc = 1;
char	temp_char;

/* parse the command line parameters */

while (1)

{

int c;

static struct option long_options[] =

{

{.name = "port", .has_arg = 1, .val = 'p'},
{.name = "ib-dev", .has_arg = 1, .val = 'd'},
{.name = "ib-port", .has_arg = 1, .val = 'i'},
{.name = "gid-idx", .has_arg = 1, .val = 'g'},
{.name = NULL, .has_arg = 0, .val = '\0'}

};
 
 
 
c = getopt_long(argc, argv, "p:d:i:g:", long_options, NULL);
 
if (c == -1)
 
break;
 
 
 
switch (c)
 
{
 
case 'p':
 
config.tcp_port = strtoul(optarg, NULL, 0);
 
break;
 
 
 
case 'd':
 
config.dev_name = strdup(optarg);
 
break;
 
 
 
case 'i':
 
config.ib_port = strtoul(optarg, NULL, 0);
 
if (config.ib_port < 0)
 
{
 
usage(argv[0]);
 
return 1;
 
}
 
break;
 
 
 
case 'g':
 
config.gid_idx = strtoul(optarg, NULL, 0);
 
if (config.gid_idx < 0)
 
{
 
usage(argv[0]);
 
return 1;
 
}
 
break;
 
 
default:
 
usage(argv[0]);
 
return 1;
 
}
 
}
 
 
 
/* parse the last parameter (if exists) as the server name */
 
if (optind == argc - 1)
 
config.server_name = argv[optind];
 
else if (optind < argc)
 
{
 
usage(argv[0]);
 
return 1;
 
}
 
 
 
/* print the used parameters for info*/
 
print_config();
 
 
 
/* init all of the resources, so cleanup will be easy */
 
resources_init(&res);
 
 
 
/* create resources before using them */
 
if (resources_create(&res))
 
{
 
fprintf(stderr, "failed to create resources\n");
 
goto main_exit;
 
}
 
 
 
/* connect the QPs */
 
if (connect_qp(&res))
 
{
 
fprintf(stderr, "failed to connect QPs\n");
 
goto main_exit;
 
}
 
 
 
/* let the server post the sr */
 
if (!config.server_name)
 
if (post_send(&res, IBV_WR_SEND))
 
{
 
fprintf(stderr, "failed to post sr\n");
 
goto main_exit;
 
}
 
 
 
/* in both sides we expect to get a completion */
 
if (poll_completion(&res))
 
{
 
fprintf(stderr, "poll completion failed\n");
 
goto main_exit;
 
}
 
 
 
/* after polling the completion we have the message in the client buffer too */
 
if (config.server_name)
 
fprintf(stdout, "Message is: '%s'\n", res.buf);
 
else
 
{
 
/* setup server buffer with read message */
 
strcpy(res.buf, RDMAMSGR);
 
}
 
 
 
/* Sync so we are sure server side has data ready before client tries to read it */
 
if (sock_sync_data(res.sock, 1, "R", &temp_char)) /* just send a dummy char back and forth */
 
{
 
fprintf(stderr, "sync error before RDMA ops\n");
 
rc = 1;
 
goto main_exit;
 
}
 
 
 
 
 
/* Now the client performs an RDMA read and then write on server.
 
Note that the server has no idea these events have occurred */
 
 
 
if (config.server_name)
 
{
 
/* First we read contents of server's buffer */
 
 
if (post_send(&res, IBV_WR_RDMA_READ))
 
{
 
fprintf(stderr, "failed to post SR 2\n");
 
rc = 1;
 
goto main_exit;
 
}
 
 
 
if (poll_completion(&res))
 
{
 
fprintf(stderr, "poll completion failed 2\n");
 
rc = 1;
 
goto main_exit;
 
}
 
 
 
fprintf(stdout, "Contents of server's buffer: '%s'\n", res.buf);
 
 
 
/* Now we replace what's in the server's buffer */
 
strcpy(res.buf, RDMAMSGW);
 
 
 
fprintf(stdout, "Now replacing it with: '%s'\n", res.buf);
 
 
if (post_send(&res, IBV_WR_RDMA_WRITE))
 
{
 
fprintf(stderr, "failed to post SR 3\n");
 
rc = 1;
 
goto main_exit;
 
}
 
 
 
if (poll_completion(&res))
 
{
 
fprintf(stderr, "poll completion failed 3\n");
 
rc = 1;
 
goto main_exit;
 
}
 
}
 
 
 
/* Sync so server will know that client is done mucking with its memory */
 
 
if (sock_sync_data(res.sock, 1, "W", &temp_char)) /* just send a dummy char back and forth */
 
{
 
fprintf(stderr, "sync error after RDMA ops\n");
 
rc = 1;
 
goto main_exit;
 
}
 
 
 
if(!config.server_name)
 
fprintf(stdout, "Contents of server buffer: '%s'\n", res.buf);
 
 
 
rc = 0;
 
 
 
main_exit:
 
if (resources_destroy(&res))
 
{
 
fprintf(stderr, "failed to destroy resources\n");
 
rc = 1;
 
}
 
 
 
if(config.dev_name)
 
free((char *) config.dev_name);
 
 
 
fprintf(stdout, "\ntest result is %d\n", rc);
 
 
 
return rc;
 
}
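
The option parsing in main converts the -p, -i and -g values with strtoul and a NULL end pointer, so trailing garbage and overflow go unnoticed. A stricter conversion could be sketched as follows. The helper name parse_u32_option and its max parameter are hypothetical and are not part of the original example.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper (not in the original example): parse a decimal, hex
 * (0x...) or octal option value and range-check it.
 * Returns 0 and stores the value in *out on success, -1 on any error. */
static int parse_u32_option(const char *text, uint32_t max, uint32_t *out)
{
    char *end = NULL;
    unsigned long value;

    assert(text != NULL && out != NULL);

    /* strtoul accepts leading whitespace and a sign; require a digit first */
    if (!isdigit((unsigned char)text[0]))
        return -1;

    errno = 0;
    value = strtoul(text, &end, 0);
    /* reject overflow, trailing characters and out-of-range values */
    if (errno != 0 || *end != '\0' || value > max)
        return -1;

    *out = (uint32_t)value;
    return 0;
}
```

With such a helper, the 'p' case could call parse_u32_option(optarg, 65535, &port) and print the usage text when it returns -1.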

Synopsis for Multicast Example Using RDMA_CM and IBV Verbs

This multicast code example uses RDMA_CM and VPI, and can therefore run over both IB and LLE.

Notes:

No change to the test's code is needed to run the multicast example on either IB or LLE. However, when RDMA_CM is used, the network interface must be configured and up, whether the example runs over RoCE or over IB.
For the IB case, a join operation is involved, but it is performed by the rdma_cm kernel code.
For the LLE case, no join is required. All MGIDs are resolved into MACs at the host.
To tell the multicast example which port to use, specify "-b <IP address>" to bind to the desired device port.

Main

Get command line parameters:
m - MC address, destination port
M - unmapped MC address; also requires a bind address (parameter "b")
s - sender flag
b - bind address
c - number of connections
C - message count
S - message size
p - port space (UDP by default; IPoIB)
Create an event channel to receive asynchronous events.
Allocate the nodes and create an identifier for each one; the identifier is used to track communication information.
Start the "run" main function.
On ending, release and free resources.
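
The parameter handling listed above can be sketched with getopt as follows. This is a simplified illustration, not the program's actual code: the struct mc_options container and the function parse_mc_args are hypothetical, the -p port space option is left out so that the block builds without the RDMA headers, and the defaults (1 connection, 10 messages of 100 bytes) are taken from the static variables in the listing.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical container; the original program uses separate static variables. */
struct mc_options {
    const char *dst_addr;   /* -m or -M: multicast address */
    const char *src_addr;   /* -b: bind address */
    int unmapped;           /* set when -M was used */
    int is_sender;          /* -s */
    int connections;        /* -c */
    int message_count;      /* -C */
    int message_size;       /* -S */
};

/* Returns 0 on success, -1 on an unknown option or an invalid combination. */
static int parse_mc_args(int argc, char **argv, struct mc_options *opt)
{
    int op;

    assert(argv != NULL && opt != NULL);
    memset(opt, 0, sizeof *opt);
    opt->connections = 1;
    opt->message_count = 10;
    opt->message_size = 100;

    optind = 1; /* restart scanning so the function can be called repeatedly */
    opterr = 0; /* let the caller report errors */
    while ((op = getopt(argc, argv, "m:M:sb:c:C:S:")) != -1) {
        switch (op) {
        case 'm': opt->dst_addr = optarg; break;
        case 'M': opt->dst_addr = optarg; opt->unmapped = 1; break;
        case 's': opt->is_sender = 1; break;
        case 'b': opt->src_addr = optarg; break;
        case 'c': opt->connections = atoi(optarg); break;
        case 'C': opt->message_count = atoi(optarg); break;
        case 'S': opt->message_size = atoi(optarg); break;
        default:  return -1;
        }
    }

    if (!opt->dst_addr)
        return -1; /* a multicast address is mandatory */
    if (opt->unmapped && !opt->src_addr)
        return -1; /* -M also requires -b */
    return 0;
}
```

Checking the -M/-b dependency right after parsing reports the mistake before any RDMA resources are created.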

API definition files: rdma/rdma_cma.h and infiniband/verbs.h

Run

Get the source address (if provided for binding) and the destination address, and convert the input addresses to their socket representation.
Joining:
1. For all connections:
  If a source address is specifically provided, bind the rdma_cm object to the corresponding network interface (this associates a source address with an rdma_cm identifier).
  If an unmapped MC address with a bind address is provided, check the remote address and then bind.
2. Poll on all the connection events and wait until all rdma_cm objects have joined the MC group.
Send & receive:
1. If sender: send the messages to all connection nodes (function "post_sends").
2. If receiver: poll the completion queue (function "poll_cqs") until the messages arrive.

On ending, release network resources (for every connection: leave the multicast group and detach its associated QP from the group).
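
The first step of "Run", converting a textual address into a socket address, can be sketched as below. This is a minimal variant under stated assumptions: it accepts numeric addresses only (AI_NUMERICHOST, so no name lookup takes place), it copies into a sockaddr_storage with a length check, and the function name text_to_sockaddr is hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <netdb.h>

/* Hypothetical helper: convert a numeric IPv4 or IPv6 string into a socket
 * address. Returns 0 on success or a non-zero getaddrinfo error code. */
static int text_to_sockaddr(const char *text, struct sockaddr_storage *out)
{
    struct addrinfo hints, *res = NULL;
    int ret;

    assert(text != NULL && out != NULL);

    memset(&hints, 0, sizeof hints);
    hints.ai_family = AF_UNSPEC;      /* accept IPv4 and IPv6 */
    hints.ai_flags = AI_NUMERICHOST;  /* numeric input only, no DNS lookup */

    ret = getaddrinfo(text, NULL, &hints, &res);
    if (ret)
        return ret;

    /* never copy more than the destination can hold */
    if (res->ai_addrlen > sizeof *out) {
        freeaddrinfo(res);
        return EAI_FAIL;
    }

    memset(out, 0, sizeof *out);
    memcpy(out, res->ai_addr, res->ai_addrlen);
    freeaddrinfo(res);
    return 0;
}
```

The resulting structure can be cast to struct sockaddr * wherever a bind or destination address is expected.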

Code for Multicast Using RDMA_CM and IBV Verbs

/*
 * BUILD COMMAND: 
 * gcc -g -Wall -D_GNU_SOURCE -g -O2 -o examples/mckey  examples/mckey.c  -libverbs -lrdmacm
 * 
* $Id$
 */

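#include <stdlib.h>
#include <string.h>
#include <stdio.h>
#include <errno.h>
#include <sys/types.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <sys/socket.h>
#include <netdb.h>
#include <byteswap.h>
#include <unistd.h>
#include <getopt.h>

#include <rdma/rdma_cma.h>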

struct cmatest_node 
{
	int			id;
	struct rdma_cm_id	*cma_id;
	int			connected;
	struct ibv_pd	*pd;
	struct ibv_cq	*cq;
	struct ibv_mr	*mr;
	struct ibv_ah	*ah;
	uint32_t		remote_qpn;
	uint32_t		remote_qkey;
	void			*mem;
};

struct cmatest 
{
	struct rdma_event_channel *channel;
	struct cmatest_node *nodes;
	int conn_index;
	int connects_left;

	struct sockaddr_in6	dst_in;
	struct sockaddr		*dst_addr;
	struct sockaddr_in6	src_in;
	struct sockaddr		*src_addr;
};

static struct cmatest test;
static int connections = 1;
static int message_size = 100;
static int message_count = 10;
static int is_sender;
static int unmapped_addr;
static char *dst_addr;
static char *src_addr;
static enum rdma_port_space port_space = RDMA_PS_UDP;

static int create_message(struct cmatest_node *node)
{
	if (!message_size)
		message_count = 0;

	if (!message_count)
		return 0;

	node->mem = malloc(message_size + sizeof(struct ibv_grh));
	if (!node->mem)
	{
		printf("failed message allocation\n");
		return -1;
	}
	node->mr = ibv_reg_mr(node->pd, node->mem, message_size + sizeof(struct ibv_grh),
			      IBV_ACCESS_LOCAL_WRITE);
	if (!node->mr)
	{
		printf("failed to reg MR\n");
		goto err;
	}
	return 0;
err:
	free(node->mem);
	/* clear the pointer so destroy_node() does not deregister or free it again */
	node->mem = NULL;
	return -1;
}

static int verify_test_params(struct cmatest_node *node)
{
	struct ibv_port_attr port_attr;
	int ret;

	ret = ibv_query_port(node->cma_id->verbs, node->cma_id->port_num, &port_attr);
	if (ret)
		return ret;

	if (message_count && message_size > (1 << (port_attr.active_mtu + 7)))
	{
		printf("mckey: message_size %d is larger than active mtu %d\n",
		       message_size, 1 << (port_attr.active_mtu + 7));
		return -EINVAL;
	}

	return 0;
}

static int init_node(struct cmatest_node *node)
{
	struct ibv_qp_init_attr init_qp_attr;
	int cqe, ret;

	node->pd = ibv_alloc_pd(node->cma_id->verbs);
	if (!node->pd) 
	{
	ret = -ENOMEM;
	printf("mckey: unable to allocate PD\n");
	goto out;
	}

	cqe = message_count ? message_count * 2 : 2;
	node->cq = ibv_create_cq(node->cma_id->verbs, cqe, node, 0, 0);
	if (!node->cq) 
	{
	ret = -ENOMEM;
	printf("mckey: unable to create CQ\n");
	goto out;
	}

	memset(&init_qp_attr, 0, sizeof init_qp_attr);
	init_qp_attr.cap.max_send_wr = message_count ? message_count : 1;
	init_qp_attr.cap.max_recv_wr = message_count ? message_count : 1;
	init_qp_attr.cap.max_send_sge = 1;
	init_qp_attr.cap.max_recv_sge = 1;
	init_qp_attr.qp_context = node;
	init_qp_attr.sq_sig_all = 0;
	init_qp_attr.qp_type = IBV_QPT_UD;
	init_qp_attr.send_cq = node->cq;
	init_qp_attr.recv_cq = node->cq;
	ret = rdma_create_qp(node->cma_id, node->pd, &init_qp_attr);
	if (ret) 
	{
	printf("mckey: unable to create QP: %d\n", ret);
	goto out;
	}

	ret = create_message(node);
	if (ret) 
	{
	printf("mckey: failed to create messages: %d\n", ret);
	goto out;
	}
out:
	return ret;
}

static int post_recvs(struct cmatest_node *node)
{
	struct ibv_recv_wr recv_wr, *recv_failure;
	struct ibv_sge sge;
	int i, ret = 0;

	if (!message_count)
	return 0;

	recv_wr.next = NULL;
	recv_wr.sg_list = &sge;
	recv_wr.num_sge = 1;
	recv_wr.wr_id = (uintptr_t) node;

	sge.length = message_size + sizeof(struct ibv_grh);
	sge.lkey = node->mr->lkey;
	sge.addr = (uintptr_t) node->mem;

	for (i = 0; i < message_count && !ret; i++ ) 
	{
	ret = ibv_post_recv(node->cma_id->qp, &recv_wr, &recv_failure);
	if (ret) 
	{
	printf("failed to post receives: %d\n", ret);
	break;
	}
	}
	return ret;
}

static int post_sends(struct cmatest_node *node, int signal_flag)
{
	struct ibv_send_wr send_wr, *bad_send_wr;
	struct ibv_sge sge;
	int i, ret = 0;

	if (!node->connected || !message_count)
	return 0;

	send_wr.next = NULL;
	send_wr.sg_list = &sge;
	send_wr.num_sge = 1;
	send_wr.opcode = IBV_WR_SEND_WITH_IMM;
	send_wr.send_flags = signal_flag;
	send_wr.wr_id = (unsigned long)node;
	send_wr.imm_data = htonl(node->cma_id->qp->qp_num);

	send_wr.wr.ud.ah = node->ah;
	send_wr.wr.ud.remote_qpn = node->remote_qpn;
	send_wr.wr.ud.remote_qkey = node->remote_qkey;

	sge.length = message_size;
	sge.lkey = node->mr->lkey;
	sge.addr = (uintptr_t) node->mem;

	for (i = 0; i < message_count && !ret; i++) 
	{
	ret = ibv_post_send(node->cma_id->qp, &send_wr, &bad_send_wr);
	if (ret)
	printf("failed to post sends: %d\n", ret);
	}
	return ret;
}

static void connect_error(void)
{
	test.connects_left--;
}

static int addr_handler(struct cmatest_node *node)
{
	int ret;

	ret = verify_test_params(node);
	if (ret)
	goto err;

	ret = init_node(node);
	if (ret)
	goto err;

	if (!is_sender) 
	{
	ret = post_recvs(node);
	if (ret)
	goto err;
	}

	ret = rdma_join_multicast(node->cma_id, test.dst_addr, node);
	if (ret) 
	{
	printf("mckey: failure joining: %d\n", ret);
	goto err;
	}
	return 0;
err:
	connect_error();
	return ret;
}

static int join_handler(struct cmatest_node *node, struct rdma_ud_param *param)
{
	char buf[40];

	inet_ntop(AF_INET6, param->ah_attr.grh.dgid.raw, buf, 40);
	printf("mckey: joined dgid: %s\n", buf);

	node->remote_qpn = param->qp_num;
	node->remote_qkey = param->qkey;
	node->ah = ibv_create_ah(node->pd, &param->ah_attr);
	if (!node->ah)
	{
		printf("mckey: failure creating address handle\n");
		goto err;
	}

	node->connected = 1;
	test.connects_left--;
	return 0;
err:
	connect_error();
	return -1;
}

static int cma_handler(struct rdma_cm_id *cma_id, struct rdma_cm_event *event)
{
	int ret = 0;

	switch (event->event)
	{
	case RDMA_CM_EVENT_ADDR_RESOLVED:
		ret = addr_handler(cma_id->context);
		break;
	case RDMA_CM_EVENT_MULTICAST_JOIN:
		ret = join_handler(cma_id->context, &event->param.ud);
		break;
	case RDMA_CM_EVENT_ADDR_ERROR:
	case RDMA_CM_EVENT_ROUTE_ERROR:
	case RDMA_CM_EVENT_MULTICAST_ERROR:
		printf("mckey: event: %s, error: %d\n",
		       rdma_event_str(event->event), event->status);
		connect_error();
		ret = event->status;
		break;
	case RDMA_CM_EVENT_DEVICE_REMOVAL:
		/* Cleanup will occur after test completes. */
		break;
	default:
		break;
	}
	return ret;
}

static void destroy_node(struct cmatest_node *node)
{
	if (!node->cma_id)
	return;

	if (node->ah)
	ibv_destroy_ah(node->ah);

	if (node->cma_id->qp)
	rdma_destroy_qp(node->cma_id);

	if (node->cq)
	ibv_destroy_cq(node->cq);

	if (node->mem) 
	{
	ibv_dereg_mr(node->mr);
	free(node->mem);
	}

	if (node->pd)
	ibv_dealloc_pd(node->pd);

	/* Destroy the RDMA ID after all device resources */
	rdma_destroy_id(node->cma_id);
}

static int alloc_nodes(void)
{
	int ret, i;

	test.nodes = malloc(sizeof *test.nodes * connections);
	if (!test.nodes) 
	{
	printf("mckey: unable to allocate memory for test nodes\n");
	return -ENOMEM;
	}
	memset(test.nodes, 0, sizeof *test.nodes * connections);

	for (i = 0; i < connections; i++) 
	{
	test.nodes[i].id = i;
	ret = rdma_create_id(test.channel, &test.nodes[i].cma_id, &test.nodes[i], port_space);
	if (ret)
	goto err;
	}
	return 0;
err:
	while (--i >= 0)
	rdma_destroy_id(test.nodes[i].cma_id);
	free(test.nodes);
	return ret;
}

static void destroy_nodes(void)
{
	int i;

	for (i = 0; i < connections; i++)
	destroy_node(&test.nodes[i]);
	free(test.nodes);
}

static int poll_cqs(void)
{
	struct ibv_wc wc[8];
	int done, i, ret;

	for (i = 0; i < connections; i++) 
	{
	if (!test.nodes[i].connected)
	continue;

	for (done = 0; done < message_count; done += ret) 
	{
	ret = ibv_poll_cq(test.nodes[i].cq, 8, wc);
	if (ret < 0) 
	{
	printf("mckey: failed polling CQ: %d\n", ret);
	return ret;
	}
	}
	}
	return 0;
}

static int connect_events(void)
{
	struct rdma_cm_event *event;
	int ret = 0;

	while (test.connects_left && !ret) 
	{
		ret = rdma_get_cm_event(test.channel, &event);
		if (!ret) 
		{
			ret = cma_handler(event->id, event);
			rdma_ack_cm_event(event);
		}
	}
	return ret;
}

static int get_addr(char *dst, struct sockaddr *addr)
{
	struct addrinfo *res;
	int ret;

	ret = getaddrinfo(dst, NULL, NULL, &res);
	if (ret) 
	{
	printf("getaddrinfo failed - invalid hostname or IP address\n");
	return ret;
	}

	memcpy(addr, res->ai_addr, res->ai_addrlen);
	freeaddrinfo(res);
	return ret;
}

static int run(void)
{
	int i, ret;

	printf("mckey: starting %s\n", is_sender ? "client" : "server");
	if (src_addr)
	{
		ret = get_addr(src_addr, (struct sockaddr *) &test.src_in);
		if (ret)
			return ret;
	}

	ret = get_addr(dst_addr, (struct sockaddr *) &test.dst_in);
	if (ret)
		return ret;

	printf("mckey: joining\n");
	for (i = 0; i < connections; i++) 
	{
		if (src_addr) 
		{
			ret = rdma_bind_addr(test.nodes[i].cma_id, test.src_addr);
			if (ret)
			{
				printf("mckey: addr bind failure: %d\n", ret);
				connect_error();
				return ret;
			}
		}

		if (unmapped_addr)
			ret = addr_handler(&test.nodes[i]);
		else
			ret = rdma_resolve_addr(test.nodes[i].cma_id, test.src_addr,
						test.dst_addr, 2000);
		if (ret) 
		{
			printf("mckey: resolve addr failure: %d\n", ret);
			connect_error();
			return ret;
		}
	}

	ret = connect_events();
	if (ret)
		goto out;

	/*
	 * Pause to give SM chance to configure switches.  We don't want to
	 * handle reliability issue in this simple test program.
	 */
	sleep(3);

	if (message_count)
	{
		if (is_sender)
		{
			printf("initiating data transfers\n");
			for (i = 0; i < connections; i++)
			{
				ret = post_sends(&test.nodes[i], 0);
				if (ret)
					goto out;
			}
		}
		else
		{
			printf("receiving data transfers\n");
			ret = poll_cqs();
			if (ret)
				goto out;
		}
		printf("data transfers complete\n");
	}
out:
	for (i = 0; i < connections; i++) 
	{
		ret = rdma_leave_multicast(test.nodes[i].cma_id, test.dst_addr);
		if (ret)
			printf("mckey: failure leaving: %d\n", ret);
	}
	return ret;
}

int main(int argc, char **argv)
{
	int op, ret;


	while ((op = getopt(argc, argv, "m:M:sb:c:C:S:p:")) != -1) 
	{
		switch (op) 
		{
		case 'm':
			dst_addr = optarg;
			break;
		case 'M':
			unmapped_addr = 1;
			dst_addr = optarg;
			break;
		case 's':
			is_sender = 1;
			break;
		case 'b':
			src_addr = optarg;
			test.src_addr = (struct sockaddr *) &test.src_in;
			break;
		case 'c':
			connections = atoi(optarg);
			break;
		case 'C':
			message_count = atoi(optarg);
			break;
		case 'S':
			message_size = atoi(optarg);
			break;
		case 'p':
			port_space = strtol(optarg, NULL, 0);
			break;
		default:
			printf("usage: %s\n", argv[0]);
			printf("\t-m multicast_address\n");
			printf("\t[-M unmapped_multicast_address]\n"
			       "\t replaces -m and requires -b\n");
			printf("\t[-s(ender)]\n");
			printf("\t[-b bind_address]\n");
			printf("\t[-c connections]\n");
			printf("\t[-C message_count]\n");
			printf("\t[-S message_size]\n");
			printf("\t[-p port_space - %#x for UDP (default), %#x for IPOIB]\n", RDMA_PS_UDP, RDMA_PS_IPOIB);
			exit(1);
		}
	}

	test.dst_addr = (struct sockaddr *) &test.dst_in;
	test.connects_left = connections;

	test.channel = rdma_create_event_channel();
	if (!test.channel) 
	{
		printf("failed to create event channel\n");
		exit(1);
	}

	if (alloc_nodes())
		exit(1);

	ret = run();

	printf("test complete\n");
	destroy_nodes();
	rdma_destroy_event_channel(test.channel);

	printf("return status %d\n", ret);
	return ret;
}
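
In the listing, verify_test_params compares message_size against 1 << (port_attr.active_mtu + 7). The shift is needed because active_mtu holds an enumerator rather than a byte count. The sketch below spells out that conversion. It mirrors the ibv_mtu enumerator values in a local enum (an assumption made so the block builds without infiniband/verbs.h), and the names sketch_mtu, mtu_enum_to_bytes and fits_in_mtu are hypothetical.

```c
#include <assert.h>

/* Local mirror of the libibverbs ibv_mtu enumerator values (assumed here so
 * the sketch does not depend on infiniband/verbs.h). */
enum sketch_mtu {
    SKETCH_MTU_256 = 1,
    SKETCH_MTU_512 = 2,
    SKETCH_MTU_1024 = 3,
    SKETCH_MTU_2048 = 4,
    SKETCH_MTU_4096 = 5
};

/* Convert an MTU enumerator to bytes: 1 -> 256, 2 -> 512, ..., 5 -> 4096.
 * Returns -1 for values outside the enumeration. */
static int mtu_enum_to_bytes(int mtu)
{
    if (mtu < SKETCH_MTU_256 || mtu > SKETCH_MTU_4096)
        return -1;
    return 1 << (mtu + 7);
}

/* Returns 1 if a message of message_size bytes fits in one MTU, else 0. */
static int fits_in_mtu(int message_size, int mtu)
{
    int bytes = mtu_enum_to_bytes(mtu);

    assert(message_size >= 0);
    return bytes > 0 && message_size <= bytes;
}
```

The listing needs this check because its UD sends carry each message in a single packet, so a message larger than the active MTU cannot be delivered.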

Programming Examples Using RDMA Verbs

This chapter provides code examples using the RDMA Verbs

Automatic Path Migration (APM)

/*
 * Compile Command:
 * gcc apm.c -o apm -libverbs -lrdmacm
 * 
 * Description:
 * This example demonstrates Automatic Path Migration (APM). The basic flow is
 * as follows:
 * 1. Create connection between client and server
 * 2. Set the alternate path details on each side of the connection
 * 3. Perform send operations back and forth between client and server
 * 4. Cause the path to be migrated (manually or automatically)
 * 5. Complete sends using the alternate path
 * 
 * There are two ways to cause the path to be migrated.
 * 1. Use the ibv_modify_qp verb to set path_mig_state = IBV_MIG_MIGRATED
 * 2. Assuming there are two ports on at least one side of the connection, and
 *    each port has a path to the other host, pull out the cable of the original
 *    port and watch it migrate to the other port.
 * 
 * Running the Example:
 * This example requires a specific IB network configuration to properly
 * demonstrate APM. Two hosts are required, one for the client and one for the
 * server. At least one of these two hosts must have an IB card with two ports.
 * Both of these ports should be connected to the same subnet and each have a
 * route to the other host through an IB switch.
 * The executable can operate as either the client or server application. Start
 * the server side first on one host then start the client on the other host.
 * With default parameters, the client and server will exchange 100 sends over
 * 100 seconds. During that time, manually unplug the cable connected to the
 * original port of the two port host, and watch the path get migrated to the
 * other port. It may take up to a minute for the path to be migrated. To see
 * the path get migrated by software, use the -m option on the client side.
 * 
 * Server:
 * ./apm -s
 * 
 * Client (-a is IP of remote interface):
 * ./apm -a 192.168.1.12
 *      
 */
 
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <getopt.h>
#include <pthread.h>
#include <arpa/inet.h>
#include <rdma/rdma_verbs.h>
 
#define VERB_ERR(verb, ret) \
        fprintf(stderr, "%s returned %d errno %d\n", verb, ret, errno)
 
/* Default parameter values */
#define DEFAULT_PORT "51216"
#define DEFAULT_MSG_COUNT 100
#define DEFAULT_MSG_LENGTH 1000000
#define DEFAULT_MSEC_DELAY 500
 
/* Resources used in the example */
struct context
{
    /* User parameters */
    int server;
    char *server_name;
    char *server_port;
    int msg_count;
    int msg_length;
    int msec_delay;
    uint8_t alt_srcport;
    uint16_t alt_dlid;
    uint16_t my_alt_dlid;
    int migrate_after;
 
    /* Resources */
    struct rdma_cm_id *id;
    struct rdma_cm_id *listen_id;
    struct ibv_mr *send_mr;
    struct ibv_mr *recv_mr;
    char *send_buf;
    char *recv_buf;
    pthread_t async_event_thread;
};
 
/*
 * Function:    async_event_thread
 * 
 * Input:
 *      arg    The context object
 * 
 * Output:
 *      none
 * 
 * Returns:
 *      NULL
 * 
 * Description:
 *      Reads any Asynchronous events that occur during the sending of data
 *      and prints out the details of the event. Specifically migration
 *      related events.
 */
static void *async_event_thread(void *arg)
{
    struct ibv_async_event event;
    int ret;
 
    struct context *ctx = (struct context *) arg;
 
    while (1) {
        ret = ibv_get_async_event(ctx->id->verbs, &event);
        if (ret) {
            VERB_ERR("ibv_get_async_event", ret);
            break;
        }
 
        switch (event.event_type) {
        case IBV_EVENT_PATH_MIG:
            printf("QP path migrated\n");
            break;
        case IBV_EVENT_PATH_MIG_ERR:
            printf("QP path migration error\n");
            break;
        default:
            printf("Async Event %d\n", event.event_type);
            break;
        }
 
        ibv_ack_async_event(&event);
    }
 
    return NULL;
}
 
/*
 * Function:    get_alt_dlid_from_private_data  
 * 
 * Input:
 *      event  The RDMA event containing private data
 * 
 * Output:
 *      dlid   The DLID that was sent in the private data
 * 
 * Returns:
 *      0 on success, non-zero on failure
 * 
 * Description:
 *      Takes the private data sent from the remote side and returns the
 *      destination LID that was contained in the private data
 */
int get_alt_dlid_from_private_data(struct rdma_cm_event *event, uint16_t *dlid)
{
    if (event->param.conn.private_data_len < 4) {
        printf("unexpected private data len: %d\n",
               event->param.conn.private_data_len);
        return -1;
    }
 
    *dlid = ntohs(*((uint16_t *) event->param.conn.private_data));
    return 0;
}
 
/*
 * Function:    get_alt_port_details 
 * 
 * Input:
 *      ctx     The context object
 * 
 * Output:
 *      none
 * 
 * Returns:
 *      0 on success, non-zero on failure
 * 
 * Description:
 *      First, query the device to determine if path migration is supported.
 *      Next, query all the ports on the device to determine if there is a
 *      different port than the current one to use as an alternate port. If so,
 *      copy the port number and dlid to the context so they can be used when
 *      the alternate path is loaded. 
 * 
 * Note: 
 *      This function assumes that if another port is found in the active state,
 *      that the port is connected to the same subnet as the initial port and
 *      that there is a route to the other host's alternate port.
 */
int get_alt_port_details(struct context *ctx)
{
    int ret, i;
    struct ibv_qp_attr qp_attr;
    struct ibv_qp_init_attr qp_init_attr;
    struct ibv_device_attr dev_attr;
 
    /* This example assumes the alternate port we want to use is on the same
     * HCA. Ports from other HCAs can be used as alternate paths as well. Get
     * a list of devices using ibv_get_device_list or rdma_get_devices.*/
    ret = ibv_query_device(ctx->id->verbs, &dev_attr);
    if (ret) {
        VERB_ERR("ibv_query_device", ret);
        return ret;
    }
 
    /* Verify the APM is supported by the HCA */
    if (!(dev_attr.device_cap_flags & IBV_DEVICE_AUTO_PATH_MIG)) {
        printf("device does not support auto path migration!\n");
        return -1;
    }
 
    /* Query the QP to determine which port we are bound to */
    ret = ibv_query_qp(ctx->id->qp, &qp_attr, 0, &qp_init_attr);
    if (ret) {
        VERB_ERR("ibv_query_qp", ret);
        return ret;
    }
 
    for (i = 1; i <= dev_attr.phys_port_cnt; i++) {
        /* Query all ports until we find one in the active state that is
         * not the port we are currently connected to. */
 
        struct ibv_port_attr port_attr;
        ret = ibv_query_port(ctx->id->verbs, i, &port_attr);
        if (ret) {
            VERB_ERR("ibv_query_port", ret);
            return ret;
        }
 
        if (port_attr.state == IBV_PORT_ACTIVE) {
            ctx->my_alt_dlid = port_attr.lid;
            ctx->alt_srcport = i;
            if (qp_attr.port_num != i)
                break;
        }
    }
 
    return 0;
}
 
/*
 * Function:    load_alt_path
 * 
 * Input:
 *      ctx     The context object
 * 
 * Output:
 *      none
 * 
 * Returns:
 *      0 on success, non-zero on failure
 * 
 * Description:
 *      Uses ibv_modify_qp to load the alternate path information and set the
 *      path migration state to rearm.
 */
int load_alt_path(struct context *ctx)
{
    int ret;
    struct ibv_qp_attr qp_attr;
    struct ibv_qp_init_attr qp_init_attr;
 
    /* query to get the current attributes of the qp */
    ret = ibv_query_qp(ctx->id->qp, &qp_attr, 0, &qp_init_attr);
    if (ret) {
        VERB_ERR("ibv_query_qp", ret);
        return ret;
    }
 
    /* initialize the alternate path attributes with the current path 
     * attributes */
    memcpy(&qp_attr.alt_ah_attr, &qp_attr.ah_attr, sizeof (struct ibv_ah_attr));
 
    /* set the alt path attributes to some basic values */
    qp_attr.alt_pkey_index = qp_attr.pkey_index;
    qp_attr.alt_timeout = qp_attr.timeout;
    qp_attr.path_mig_state = IBV_MIG_REARM;
 
    /* if an alternate path was supplied, set the source port and the dlid */
    if (ctx->alt_srcport)
        qp_attr.alt_port_num = ctx->alt_srcport;
    else
        qp_attr.alt_port_num = qp_attr.port_num;
 
    if (ctx->alt_dlid)
        qp_attr.alt_ah_attr.dlid = ctx->alt_dlid;
 
    printf("loading alt path - local port: %d, dlid: %d\n",
           qp_attr.alt_port_num, qp_attr.alt_ah_attr.dlid);
 
    ret = ibv_modify_qp(ctx->id->qp, &qp_attr,
                        IBV_QP_ALT_PATH | IBV_QP_PATH_MIG_STATE);
    if (ret) {
        VERB_ERR("ibv_modify_qp", ret);
        return ret;
    }

    return 0;
}
 
/*
 * Function:    reg_mem
 * 
 * Input:
 *      ctx     The context object
 * 
 * Output:
 *      none
 * 
 * Returns:
 *      0 on success, non-zero on failure
 * 
 * Description:
 *      Registers memory regions to use for our data transfer  
 */
int reg_mem(struct context *ctx)
{
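    ctx->send_buf = (char *) malloc(ctx->msg_length);
    if (!ctx->send_buf) {
        VERB_ERR("malloc", -1);
        return -1;
    }
    memset(ctx->send_buf, 0x12, ctx->msg_length);
 
    ctx->recv_buf = (char *) malloc(ctx->msg_length);
    if (!ctx->recv_buf) {
        VERB_ERR("malloc", -1);
        return -1;
    }
    memset(ctx->recv_buf, 0x00, ctx->msg_length);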
 
    ctx->send_mr = rdma_reg_msgs(ctx->id, ctx->send_buf, ctx->msg_length);
    if (!ctx->send_mr) {
        VERB_ERR("rdma_reg_msgs", -1);
        return -1;
    }
 
    ctx->recv_mr = rdma_reg_msgs(ctx->id, ctx->recv_buf, ctx->msg_length);
    if (!ctx->recv_mr) {
        VERB_ERR("rdma_reg_msgs", -1);
        return -1;
    }
 
    return 0;
}
 
/*
 * Function:    getaddrinfo_and_create_ep
 * 
 * Input:
 *      ctx     The context object
 * 
 * Output:
 *      none
 * 
 * Returns:
 *      0 on success, non-zero on failure
 * 
 * Description:
 *      Gets the address information and creates our endpoint  
 */
int getaddrinfo_and_create_ep(struct context *ctx)
{
    int ret;
    struct rdma_addrinfo *rai, hints;
    struct ibv_qp_init_attr qp_init_attr;
 
    memset(&hints, 0, sizeof (hints));
    hints.ai_port_space = RDMA_PS_TCP;
    if (ctx->server == 1)
        hints.ai_flags = RAI_PASSIVE; /* this makes it a server */
 
    printf("rdma_getaddrinfo\n");
    ret = rdma_getaddrinfo(ctx->server_name, ctx->server_port, &hints, &rai);
    if (ret) {
        VERB_ERR("rdma_getaddrinfo", ret);
        return ret;
    }
 
    memset(&qp_init_attr, 0, sizeof (qp_init_attr));
 
    qp_init_attr.cap.max_send_wr = 1;
    qp_init_attr.cap.max_recv_wr = 1;
    qp_init_attr.cap.max_send_sge = 1;
    qp_init_attr.cap.max_recv_sge = 1;
 
    printf("rdma_create_ep\n");
    ret = rdma_create_ep(&ctx->id, rai, NULL, &qp_init_attr);
    if (ret) {
        VERB_ERR("rdma_create_ep", ret);
        return ret;
    }
 
    rdma_freeaddrinfo(rai);
 
    return 0;
}
 
/*
 * Function:    get_connect_request
 * 
 * Input:
 *      ctx     The context object
 * 
 * Output:
 *      none
 * 
 * Returns:
 *      0 on success, non-zero on failure
 * 
 * Description:
 *      Wait for a connect request from the client 
 */
int get_connect_request(struct context *ctx)
{
    int ret;
 
    printf("rdma_listen\n");
    ret = rdma_listen(ctx->id, 4);
    if (ret) {
        VERB_ERR("rdma_listen", ret);
        return ret;
    }
 
    ctx->listen_id = ctx->id;
 
    printf("rdma_get_request\n");
    ret = rdma_get_request(ctx->listen_id, &ctx->id);
    if (ret) {
        VERB_ERR("rdma_get_request", ret);
        return ret;
    }
 
    if (ctx->id->event->event != RDMA_CM_EVENT_CONNECT_REQUEST) {
        printf("unexpected event: %s\n",
               rdma_event_str(ctx->id->event->event));
        return -1; /* ret is 0 here, so report the failure explicitly */
    }
 
    /* If the alternate path info was not set on the command line, get
     * it from the private data */
    if (ctx->alt_dlid == 0 && ctx->alt_srcport == 0) {
        ret = get_alt_dlid_from_private_data(ctx->id->event, &ctx->alt_dlid);
        if (ret) {
            return ret;
        }
    }
 
    return 0;
}
 
/*
 * Function:    establish_connection
 * 
 * Input:
 *      ctx     The context object
 * 
 * Output:
 *      none
 * 
 * Returns:
 *      0 on success, non-zero on failure
 * 
 * Description:
 *      Create the connection. For the client, call rdma_connect. For the
 *      server, the connect request was already received, so just do
 *      rdma_accept to complete the connection.
 */
int establish_connection(struct context *ctx)
{
    int ret;
    uint16_t private_data;
    struct rdma_conn_param conn_param;
 
    /* post a receive to catch the first send */
    ret = rdma_post_recv(ctx->id, NULL, ctx->recv_buf, ctx->msg_length,
                         ctx->recv_mr);
    if (ret) {
        VERB_ERR("rdma_post_recv", ret);
        return ret;
    }
 
    /* send the dlid for the alternate port in the private data */
    private_data = htons(ctx->my_alt_dlid);
 
    memset(&conn_param, 0, sizeof (conn_param));
    conn_param.private_data_len = sizeof (int);
    conn_param.private_data = &private_data;
    conn_param.responder_resources = 2;
    conn_param.initiator_depth = 2;
    conn_param.retry_count = 5;
    conn_param.rnr_retry_count = 5;
 
    if (ctx->server) {
        printf("rdma_accept\n");
        ret = rdma_accept(ctx->id, &conn_param);
        if (ret) {
            VERB_ERR("rdma_accept", ret);
            return ret;
        }
    }
    else {
        printf("rdma_connect\n");
        ret = rdma_connect(ctx->id, &conn_param);
        if (ret) {
            VERB_ERR("rdma_connect", ret);
            return ret;
        }
 
        if (ctx->id->event->event != RDMA_CM_EVENT_ESTABLISHED) {
            printf("unexpected event: %s\n",
                   rdma_event_str(ctx->id->event->event));
            return -1;
        }
 
        /* If the alternate path info was not set on the command line, get
         * it from the private data */
        if (ctx->alt_dlid == 0 && ctx->alt_srcport == 0) {
            ret = get_alt_dlid_from_private_data(ctx->id->event,
                                                 &ctx->alt_dlid);
            if (ret)
                return ret;
        }
    }
 
    return 0;
}
 
/*
 * Function:    send_msg
 * 
 * Input:
 *      ctx     The context object
 * 
 * Output:
 *      none
 * 
 * Returns:
 *      0 on success, non-zero on failure
 * 
 * Description:
 *      Performs a Send and gets the completion
 *      
 */
int send_msg(struct context *ctx)
{
    int ret;
    struct ibv_wc wc;
 
    ret = rdma_post_send(ctx->id, NULL, ctx->send_buf, ctx->msg_length,
                         ctx->send_mr, IBV_SEND_SIGNALED);
    if (ret) {
        VERB_ERR("rdma_post_send", ret);
        return ret;
    }
 
    ret = rdma_get_send_comp(ctx->id, &wc);
    if (ret < 0) {
        VERB_ERR("rdma_get_send_comp", ret);
        return ret;
    }
 
    return 0;
}
 
/*
 * Function:    recv_msg
 * 
 * Input:
 *      ctx     The context object
 * 
 * Output:
 *      none
 * 
 * Returns:
 *      0 on success, non-zero on failure
 * 
 * Description:
 *      Waits for a receive completion and posts a new receive buffer
 */
int recv_msg(struct context *ctx)
{
    int ret;
    struct ibv_wc wc;
 
    ret = rdma_get_recv_comp(ctx->id, &wc);
    if (ret < 0) {
        VERB_ERR("rdma_get_recv_comp", ret);
        return ret;
    }
 
    ret = rdma_post_recv(ctx->id, NULL, ctx->recv_buf, ctx->msg_length,
                         ctx->recv_mr);
    if (ret) {
        VERB_ERR("rdma_post_recv", ret);
        return ret;
    }
 
    return 0;
}
 
/*
 * Function:    main
 * 
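 * Input:
 *      argc    The number of arguments
 *      argv    Command line arguments
 * 
 * Output:
 *      none
 * 
 * Returns:
 *      0 on success, non-zero on failure
 * 
 * Description:
 *      Main program to demonstrate path migration. Parses the command line,
 *      establishes the connection, loads the alternate path and exchanges
 *      messages, optionally migrating the path manually on the client.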
 */
int main(int argc, char** argv)
{
    int ret, op, i, send_cnt, recv_cnt;
    struct context ctx;
    struct ibv_qp_attr qp_attr;
 
    memset(&ctx, 0, sizeof (ctx));
    memset(&qp_attr, 0, sizeof (qp_attr));
 
    ctx.server = 0;
    ctx.server_port = DEFAULT_PORT;
    ctx.msg_count = DEFAULT_MSG_COUNT;
    ctx.msg_length = DEFAULT_MSG_LENGTH;
    ctx.msec_delay = DEFAULT_MSEC_DELAY;
    ctx.alt_dlid = 0;
    ctx.alt_srcport = 0;
    ctx.migrate_after = -1;
 
    while ((op = getopt(argc, argv, "sa:p:c:l:d:r:m:w:")) != -1) {
        switch (op) {
        case 's':
            ctx.server = 1;
            break;
        case 'a':
            ctx.server_name = optarg;
            break;
        case 'p':
            ctx.server_port = optarg;
            break;
        case 'c':
            ctx.msg_count = atoi(optarg);
            break;
        case 'l':
            ctx.msg_length = atoi(optarg);
            break;
        case 'd':
            ctx.alt_dlid = atoi(optarg);
            break;
        case 'r':
            ctx.alt_srcport = atoi(optarg);
            break;
        case 'm':
            ctx.migrate_after = atoi(optarg);
            break;
        case 'w':
            ctx.msec_delay = atoi(optarg);
            break;
        default:
            printf("usage: %s [-s or -a required]\n", argv[0]);
            printf("\t[-s[erver mode]\n");
            printf("\t[-a ip_address]\n");
            printf("\t[-p port_number]\n");
            printf("\t[-c msg_count]\n");
            printf("\t[-l msg_length]\n");
            printf("\t[-d alt_dlid] (requires -r)\n");
            printf("\t[-r alt_srcport] (requires -d)\n");
            printf("\t[-m num_iterations_then_migrate] (client only)\n");
            printf("\t[-w msec_wait_between_sends]\n");
            exit(1);
        }
    }
 
    printf("mode:       %s\n", (ctx.server) ? "server" : "client");
    printf("address:    %s\n", (!ctx.server_name) ? "NULL" : ctx.server_name);
    printf("port:       %s\n", ctx.server_port);
    printf("count:      %d\n", ctx.msg_count);
    printf("length:     %d\n", ctx.msg_length);
    printf("alt_dlid:   %d\n", ctx.alt_dlid);
    printf("alt_port:   %d\n", ctx.alt_srcport);
    printf("mig_after:  %d\n", ctx.migrate_after);
    printf("msec_wait:  %d\n", ctx.msec_delay);
    printf("\n");
 
    if (!ctx.server && !ctx.server_name) {
        printf("server address must be specified for client mode\n");
        exit(1);
    }
 
    /* both of these must be set or neither should be set */
    if (!((ctx.alt_dlid > 0 && ctx.alt_srcport > 0) ||
        (ctx.alt_dlid == 0 && ctx.alt_srcport == 0))) {
        printf("-d and -r must be used together\n");
        exit(1);
    }
 
    if (ctx.migrate_after >= ctx.msg_count) {
        printf("num_iterations_then_migrate must be less than msg_count\n");
        exit(1);
    }
 
    ret = getaddrinfo_and_create_ep(&ctx);
    if (ret)
        goto out;
 
    if (ctx.server) {
        ret = get_connect_request(&ctx);
        if (ret)
            goto out;
    }
 
    /* only query for alternate port if information was not specified on the 
     * command line */
    if (ctx.alt_dlid == 0 && ctx.alt_srcport == 0) {
        ret = get_alt_port_details(&ctx);
        if (ret)
            goto out;
    }
 
    /* create a thread to handle async events */
    pthread_create(&ctx.async_event_thread, NULL, async_event_thread, &ctx);
 
    ret = reg_mem(&ctx);
    if (ret)
        goto out;
 
    ret = establish_connection(&ctx);
    if (ret)
        goto out;
 
    /* load the alternate path after the connection was created. This can be
     * done at connection time, but the connection must be created and 
     * established using all ib verbs */
    ret = load_alt_path(&ctx);
    if (ret)
        goto out;
 
    send_cnt = recv_cnt = 0;
 
    for (i = 0; i < ctx.msg_count; i++) {
        if (ctx.server) {
            if (recv_msg(&ctx))
                break;
 
            printf("recv: %d\n", ++recv_cnt);
        }
 
        if (ctx.msec_delay > 0)
            usleep(ctx.msec_delay * 1000);
 
        if (send_msg(&ctx))
            break;
 
        printf("send: %d\n", ++send_cnt);
 
        if (!ctx.server) {
            if (recv_msg(&ctx))
                break;
 
            printf("recv: %d\n", ++recv_cnt);
        }
 
        /* migrate the path manually if desired after the specified number of
         * sends */
        if (!ctx.server && i == ctx.migrate_after) {
            qp_attr.path_mig_state = IBV_MIG_MIGRATED;
            ret = ibv_modify_qp(ctx.id->qp, &qp_attr, IBV_QP_PATH_MIG_STATE);
            if (ret) {
                VERB_ERR("ibv_modify_qp", ret);
                goto out;
            }
        }
    }
 
    rdma_disconnect(ctx.id);
 
out:
    if (ctx.send_mr)
        rdma_dereg_mr(ctx.send_mr);
 
    if (ctx.recv_mr)
        rdma_dereg_mr(ctx.recv_mr);
 
    if (ctx.id)
        rdma_destroy_ep(ctx.id);
 
    if (ctx.listen_id)
        rdma_destroy_ep(ctx.listen_id);
 
    if (ctx.send_buf)
        free(ctx.send_buf);
 
    if (ctx.recv_buf)
        free(ctx.recv_buf);
 
    return ret;
}
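
The example above passes the alternate-port DLID to the peer in the CM private data as a 16-bit value in network byte order, so the length advertised in private_data_len should match the size of that value. The following is a minimal sketch of the encode/decode round trip, not the manual's implementation: pack_alt_dlid and unpack_alt_dlid are hypothetical helpers introduced here for illustration only, and the snippet deliberately avoids any RDMA headers so it can be compiled on its own.

```c
#include <arpa/inet.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: store a DLID in network byte order in buf.
 * Returns the number of bytes written, or 0 if buf is too small. */
size_t pack_alt_dlid(uint16_t dlid, void *buf, size_t buf_len)
{
    uint16_t wire = htons(dlid);

    if (buf_len < sizeof (wire))
        return 0;

    memcpy(buf, &wire, sizeof (wire));
    return sizeof (wire);
}

/* Hypothetical helper: read a DLID back from private data.
 * Returns 0 on success, -1 if fewer than two bytes are available. */
int unpack_alt_dlid(const void *buf, size_t buf_len, uint16_t *dlid)
{
    uint16_t wire;

    if (buf_len < sizeof (wire))
        return -1;

    memcpy(&wire, buf, sizeof (wire));
    *dlid = ntohs(wire);
    return 0;
}
```

In a sketch like this, the value returned by pack_alt_dlid is what a caller would assign to private_data_len, which keeps the advertised length and the buffer size from drifting apart.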

Multicast Code Example Using RDMA CM


            
/*
 * Compile Command:
 * gcc mc.c -o mc -libverbs -lrdmacm -lpthread
 * 
 * Description:
 * Both the sender and receiver create a UD Queue Pair and join the specified
 * multicast group (ctx.mcast_addr). If the join is successful, the sender must
 * create an Address Handle (ctx.ah). The sender then posts the specified
 * number of sends (ctx.msg_count) to the multicast group. The receiver waits
 * to receive each one of the sends and then both sides leave the multicast
 * group and cleanup resources.
 * 
 * Running the Example:
 * The executable can operate as either the sender or receiver application. It
 * can be demonstrated on a simple fabric of two nodes with the sender
 * application running on one node and the receiver application running on the
 * other. Each node must be configured to support IPoIB and the IB interface
 * (ex. ib0) must be assigned an IP Address. Finally, the fabric must be
 * initialized using OpenSM.
 * 
 * Receiver (-m is the multicast address, often the IP of the receiver):
 * ./mc -m 192.168.1.12
 * 
 * Sender (-m is the multicast address, often the IP of the receiver):
 * ./mc -s -m 192.168.1.12
 *      
 */
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <getopt.h>
#include <pthread.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <rdma/rdma_verbs.h>
 
#define VERB_ERR(verb, ret) \
        fprintf(stderr, "%s returned %d errno %d\n", verb, ret, errno)
 
/* Default parameter values */
#define DEFAULT_PORT "51216"
#define DEFAULT_MSG_COUNT 4
#define DEFAULT_MSG_LENGTH 64
 
/* Resources used in the example */
struct context
{
    /* User parameters */
    int sender;
    char *bind_addr;
    char *mcast_addr;
    char *server_port;
    int msg_count;
    int msg_length;
 
    /* Resources */
    struct sockaddr mcast_sockaddr;
    struct rdma_cm_id *id;
    struct rdma_event_channel *channel;
    struct ibv_pd *pd;
    struct ibv_cq *cq;
    struct ibv_mr *mr;
    char *buf;
    struct ibv_ah *ah;
    uint32_t remote_qpn;
    uint32_t remote_qkey;
    pthread_t cm_thread;
};
 
/*
 * Function:    cm_thread
 * 
 * Input:
 *      arg     The context object
 * 
 * Output:
 *      none
 * 
 * Returns:
 *      NULL
 * 
 * Description:
 *      Reads any CM events that occur during the sending of data
 *      and prints out the details of the event
 */
static void *cm_thread(void *arg)
{
    struct rdma_cm_event *event;
    int ret;
 
    struct context *ctx = (struct context *) arg;
 
    while (1) {
        ret = rdma_get_cm_event(ctx->channel, &event);
        if (ret) {
            VERB_ERR("rdma_get_cm_event", ret);
            break;
        }
 
        printf("event %s, status %d\n",
               rdma_event_str(event->event), event->status);
 
        rdma_ack_cm_event(event);
    }
 
    return NULL;
}
 
/*
 * Function:    get_cm_event
 * 
 * Input:
 *      channel The event channel
 *      type    The event type that is expected
 * 
 * Output:
 *      out_ev  The event will be passed back to the caller, if desired
 *              Set this to NULL and the event will be acked automatically
 *              Otherwise the caller must ack the event using rdma_ack_cm_event
 * 
 * Returns:
 *      0 on success, non-zero on failure
 * 
 * Description:
 *      Waits for the next CM event and checks that it matches the expected
 *      type.
 */
int get_cm_event(struct rdma_event_channel *channel,
                 enum rdma_cm_event_type type,
                 struct rdma_cm_event **out_ev)
{
    int ret = 0;
    struct rdma_cm_event *event = NULL;
 
    ret = rdma_get_cm_event(channel, &event);
    if (ret) {
        VERB_ERR("rdma_get_cm_event", ret);
        return -1;
    }
 
    /* Verify the event is the expected type */
    if (event->event != type) {
        printf("event: %s, status: %d\n",
               rdma_event_str(event->event), event->status);
        ret = -1;
    }
 
    /* Pass the event back to the user if requested */
    if (!out_ev)
        rdma_ack_cm_event(event);
    else
        *out_ev = event;
 
    return ret;
}
 
/*
 * Function:    resolve_addr
 * 
 * Input:
 *      ctx     The context structure
 * 
 * Output:
 *      none
 * 
 * Returns:
 *      0 on success, non-zero on failure
 * 
 * Description:
 *      Resolves the multicast address and also binds to the source address
 *      if one was provided in the context
 */
int resolve_addr(struct context *ctx)
{
    int ret;
    struct rdma_addrinfo *bind_rai = NULL;
    struct rdma_addrinfo *mcast_rai = NULL;
    struct rdma_addrinfo hints;
 
    memset(&hints, 0, sizeof (hints));
    hints.ai_port_space = RDMA_PS_UDP;
 
    if (ctx->bind_addr) {
        hints.ai_flags = RAI_PASSIVE;
 
        ret = rdma_getaddrinfo(ctx->bind_addr, NULL, &hints, &bind_rai);
        if (ret) {
            VERB_ERR("rdma_getaddrinfo (bind)", ret);
            return ret;
        }
    }
 
    hints.ai_flags = 0;
 
    ret = rdma_getaddrinfo(ctx->mcast_addr, NULL, &hints, &mcast_rai);
    if (ret) {
        VERB_ERR("rdma_getaddrinfo (mcast)", ret);
        return ret;
    }
 
    if (ctx->bind_addr) {
        /* bind to a specific adapter if requested to do so */
        ret = rdma_bind_addr(ctx->id, bind_rai->ai_src_addr);
        if (ret) {
            VERB_ERR("rdma_bind_addr", ret);
            return ret;
        }
 
        /* A PD is created when we bind. Copy it to the context so it can
         * be used later on */
        ctx->pd = ctx->id->pd;
    }
 
    ret = rdma_resolve_addr(ctx->id, (bind_rai) ? bind_rai->ai_src_addr : NULL,
                            mcast_rai->ai_dst_addr, 2000);
    if (ret) {
        VERB_ERR("rdma_resolve_addr", ret);
        return ret;
    }
 
    ret = get_cm_event(ctx->channel, RDMA_CM_EVENT_ADDR_RESOLVED, NULL);
    if (ret) {
        return ret;
    }
 
    memcpy(&ctx->mcast_sockaddr,
           mcast_rai->ai_dst_addr,
           sizeof (struct sockaddr));
 
    return 0;
}
 
/*
 * Function:    create_resources
 * 
 * Input:
 *      ctx     The context structure
 * 
 * Output:
 *      none
 * 
 * Returns:
 *      0 on success, non-zero on failure
 * 
 * Description:
 *      Creates the PD, CQ, QP and MR
 */
int create_resources(struct context *ctx)
{
    int ret, buf_size;
    struct ibv_qp_init_attr attr;
 
    memset(&attr, 0, sizeof (attr));
 
    /* If we are bound to an address, then a PD was already allocated
     * to the CM ID */
    if (!ctx->pd) {
        ctx->pd = ibv_alloc_pd(ctx->id->verbs);
        if (!ctx->pd) {
            VERB_ERR("ibv_alloc_pd", -1);
            return -1;
        }
    }
 
    ctx->cq = ibv_create_cq(ctx->id->verbs, 2, 0, 0, 0);
    if (!ctx->cq) {
        VERB_ERR("ibv_create_cq", -1);
        return -1;
    }
 
    attr.qp_type = IBV_QPT_UD;
    attr.send_cq = ctx->cq;
    attr.recv_cq = ctx->cq;
    attr.cap.max_send_wr = ctx->msg_count;
    attr.cap.max_recv_wr = ctx->msg_count;
    attr.cap.max_send_sge = 1;
    attr.cap.max_recv_sge = 1;
 
    ret = rdma_create_qp(ctx->id, ctx->pd, &attr);
    if (ret) {
        VERB_ERR("rdma_create_qp", ret);
        return ret;
    }
 
    /* The receiver must allow enough space in the receive buffer for
     * the GRH */
    buf_size = ctx->msg_length + (ctx->sender ? 0 : sizeof (struct ibv_grh));
 
    /* calloc returns zero-filled memory, so no extra memset is needed */
    ctx->buf = calloc(1, buf_size);
    if (!ctx->buf) {
        VERB_ERR("calloc", -1);
        return -1;
    }
 
    /* Register our memory region */
    ctx->mr = rdma_reg_msgs(ctx->id, ctx->buf, buf_size);
    if (!ctx->mr) {
        VERB_ERR("rdma_reg_msgs", -1);
        return -1;
    }
 
    return 0;
}
 
/*
 * Function:    destroy_resources
 * 
 * Input:
 *      ctx     The context structure
 * 
 * Output:
 *      none
 * 
 * Returns:
 *      none
 * 
 * Description:
 *      Destroys AH, QP, CQ, MR, PD and ID
 */
void destroy_resources(struct context *ctx)
{
    if (ctx->ah)
        ibv_destroy_ah(ctx->ah);
 
    if (ctx->id->qp)
        rdma_destroy_qp(ctx->id);
 
    if (ctx->cq)
        ibv_destroy_cq(ctx->cq);
 
    if (ctx->mr)
        rdma_dereg_mr(ctx->mr);
 
    if (ctx->buf)
        free(ctx->buf);
 
    if (ctx->pd && ctx->id->pd == NULL)
        ibv_dealloc_pd(ctx->pd);
 
    rdma_destroy_id(ctx->id);
}
 
/*
 * Function:    post_send
 * 
 * Input:
 *      ctx     The context structure
 * 
 * Output:
 *      none
 * 
 * Returns:
 *      0 on success, non-zero on failure
 * 
 * Description:
 *      Posts a UD send to the multicast address
 */
int post_send(struct context *ctx)
{
    int ret;
    struct ibv_send_wr wr, *bad_wr;
    struct ibv_sge sge;
 
    memset(ctx->buf, 0x12, ctx->msg_length); /* set the data to non-zero */
 
    sge.length = ctx->msg_length;
    sge.lkey = ctx->mr->lkey;
    sge.addr = (uint64_t) ctx->buf;
 
    /* Multicast requires that the message is sent with immediate data
     * and that the QP number is the contents of the immediate data */
    wr.next = NULL;
    wr.sg_list = &sge;
    wr.num_sge = 1;
    wr.opcode = IBV_WR_SEND_WITH_IMM;
    wr.send_flags = IBV_SEND_SIGNALED;
    wr.wr_id = 0;
    wr.imm_data = htonl(ctx->id->qp->qp_num);
    wr.wr.ud.ah = ctx->ah;
    wr.wr.ud.remote_qpn = ctx->remote_qpn;
    wr.wr.ud.remote_qkey = ctx->remote_qkey;
 
    ret = ibv_post_send(ctx->id->qp, &wr, &bad_wr);
    if (ret) {
        VERB_ERR("ibv_post_send", ret);
        return -1;
    }
 
    return 0;
}
 
/*
 * Function:    get_completion
 * 
 * Input:
 *      ctx     The context structure
 * 
 * Output:
 *      none
 * 
 * Returns:
 *      0 on success, non-zero on failure
 * 
 * Description:
 *      Waits for a completion and verifies that the operation was successful
 */
int get_completion(struct context *ctx)
{
    int ret;
    struct ibv_wc wc;
 
    do {
        ret = ibv_poll_cq(ctx->cq, 1, &wc);
        if (ret < 0) {
            VERB_ERR("ibv_poll_cq", ret);
            return -1;
        }
    }
    while (ret == 0);
 
    if (wc.status != IBV_WC_SUCCESS) {
        printf("work completion status %s\n",
               ibv_wc_status_str(wc.status));
        return -1;
    }
 
    return 0;
}
 
/*
 * Function:    main
 * 
 * Input:       
 *      argc    The number of arguments
 *      argv    Command line arguments
 * 
 * Output:
 *      none
 * 
 * Returns:
 *      0 on success, non-zero on failure
 * 
 * Description:
 *      Main program to demonstrate multicast functionality.
 *      Both the sender and receiver create a UD Queue Pair and join the 
 *      specified multicast group (ctx.mcast_addr). If the join is successful,
 *      the sender must create an Address Handle (ctx.ah). The sender then posts
 *      the specified number of sends (ctx.msg_count) to the multicast group. 
 *      The receiver waits to receive each one of the sends and then both sides
 *      leave the multicast group and cleanup resources.
 */
int main(int argc, char** argv)
{
    int ret, op, i;
    struct context ctx;
    struct ibv_port_attr port_attr;
    struct rdma_cm_event *event;
    char buf[40];
 
    memset(&ctx, 0, sizeof (ctx));
 
    ctx.sender = 0;
    ctx.msg_count = DEFAULT_MSG_COUNT;
    ctx.msg_length = DEFAULT_MSG_LENGTH;
    ctx.server_port = DEFAULT_PORT;
 
    // Read options from command line
    while ((op = getopt(argc, argv, "shb:m:p:c:l:")) != -1) {
        switch (op) {
        case 's':
            ctx.sender = 1;
            break;
        case 'b':
            ctx.bind_addr = optarg;
            break;
        case 'm':
            ctx.mcast_addr = optarg;
            break;
        case 'p':
            ctx.server_port = optarg;
            break;
        case 'c':
            ctx.msg_count = atoi(optarg);
            break;
        case 'l':
            ctx.msg_length = atoi(optarg);
            break;
        default:
            printf("usage: %s -m mc_address\n", argv[0]);
            printf("\t[-s[ender mode]\n");
            printf("\t[-b bind_address]\n");
            printf("\t[-p port_number]\n");
            printf("\t[-c msg_count]\n");
            printf("\t[-l msg_length]\n");
            exit(1);
        }
    }
    if(ctx.mcast_addr == NULL) {
        printf("multicast address must be specified with -m\n");
        exit(1);
    }
 
    ctx.channel = rdma_create_event_channel();
    if (!ctx.channel) {
        VERB_ERR("rdma_create_event_channel", -1);
        exit(1);
    }
 
    ret = rdma_create_id(ctx.channel, &ctx.id, NULL, RDMA_PS_UDP);
    if (ret) {
        VERB_ERR("rdma_create_id", -1);
        exit(1);
    }
 
    ret = resolve_addr(&ctx);
    if (ret)
        goto out;
 
    /* Verify that the buffer length is not larger than the MTU */
    ret = ibv_query_port(ctx.id->verbs, ctx.id->port_num, &port_attr);
    if (ret) {
        VERB_ERR("ibv_query_port", ret);
        goto out;
    }
 
    if (ctx.msg_length > (1 << (port_attr.active_mtu + 7))) {
        printf("buffer length %d is larger than active mtu %d\n",
               ctx.msg_length, 1 << (port_attr.active_mtu + 7));
        ret = -1;
        goto out;
    }
 
    ret = create_resources(&ctx);
    if (ret)
        goto out;
 
    if (!ctx.sender) {
        for (i = 0; i < ctx.msg_count; i++) {
            ret = rdma_post_recv(ctx.id, NULL, ctx.buf,
                                 ctx.msg_length + sizeof (struct ibv_grh), 
                                 ctx.mr);
            if (ret) {
                VERB_ERR("rdma_post_recv", ret);
                goto out;
            }
        }
    }
 
    /* Join the multicast group */
    ret = rdma_join_multicast(ctx.id, &ctx.mcast_sockaddr, NULL);
    if (ret) {
        VERB_ERR("rdma_join_multicast", ret);
        goto out;
    }
 
    /* Verify that we successfully joined the multicast group */
    ret = get_cm_event(ctx.channel, RDMA_CM_EVENT_MULTICAST_JOIN, &event);
    if (ret)
        goto out;
 
    inet_ntop(AF_INET6, event->param.ud.ah_attr.grh.dgid.raw, buf, 40);
    printf("joined dgid: %s, mlid 0x%x, sl %d\n", buf,
           event->param.ud.ah_attr.dlid, event->param.ud.ah_attr.sl);
 
    ctx.remote_qpn = event->param.ud.qp_num;
    ctx.remote_qkey = event->param.ud.qkey;
 
    if (ctx.sender) {
        /* Create an address handle for the sender */
        ctx.ah = ibv_create_ah(ctx.pd, &event->param.ud.ah_attr);
        if (!ctx.ah) {
            VERB_ERR("ibv_create_ah", -1);
            rdma_ack_cm_event(event);
            ret = -1;
            goto out;
        }
    }
 
    rdma_ack_cm_event(event);
 
    /* Create a thread to handle any CM events while messages are exchanged */
    pthread_create(&ctx.cm_thread, NULL, cm_thread, &ctx);
 
    if (!ctx.sender)
        printf("waiting for messages...\n");
 
    for (i = 0; i < ctx.msg_count; i++) {
        if (ctx.sender) {
            ret = post_send(&ctx);
            if (ret)
                goto out;
        }
 
        ret = get_completion(&ctx);
        if (ret)
            goto out;
 
        if (ctx.sender)
            printf("sent message %d\n", i + 1);
        else
            printf("received message %d\n", i + 1);
    }
 
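out:
    /* Keep an earlier failure code in ret; only report the leave result
     * if nothing else has failed */
    if (rdma_leave_multicast(ctx.id, &ctx.mcast_sockaddr)) {
        VERB_ERR("rdma_leave_multicast", -1);
        if (!ret)
            ret = -1;
    }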
 
    destroy_resources(&ctx);
 
    return ret;
}
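
The MTU check in main relies on the encoding of the active_mtu port attribute, where the enum values 1 through 5 stand for 256 through 4096 bytes, and the receiver sizes its buffer with extra room for the GRH. As a rough sketch of that arithmetic only: mtu_enum_to_bytes, recv_buf_size and msg_fits_mtu are hypothetical names, and GRH_BYTES is assumed to equal sizeof (struct ibv_grh) (40 bytes). It is hard-coded here so the snippet compiles without libibverbs.

```c
#include <stddef.h>

/* Assumption: matches sizeof (struct ibv_grh) in libibverbs */
#define GRH_BYTES 40

/* Convert an active_mtu enum value (1..5) to bytes: 256, 512, ... 4096.
 * Returns 0 for a value outside that range. */
int mtu_enum_to_bytes(int mtu_enum)
{
    if (mtu_enum < 1 || mtu_enum > 5)
        return 0;

    return 1 << (mtu_enum + 7);
}

/* A UD receiver needs room for the GRH in front of the payload;
 * the sender only needs the payload itself. */
size_t recv_buf_size(size_t msg_length, int is_sender)
{
    return msg_length + (is_sender ? 0 : GRH_BYTES);
}

/* Returns 1 if a message of msg_length bytes fits in one MTU, else 0 */
int msg_fits_mtu(int msg_length, int mtu_enum)
{
    int mtu_bytes = mtu_enum_to_bytes(mtu_enum);

    return mtu_bytes > 0 && msg_length >= 0 && msg_length <= mtu_bytes;
}
```

With the default 64-byte message, a receiver in this sketch would post 104-byte buffers, while any message longer than the active MTU would be rejected before resources are created.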

Shared Receive Queue (SRQ)


            
/*
 * Compile Command:
 * gcc srq.c -o srq -libverbs -lrdmacm
 * 
 * Description:
 * Both the client and server use an SRQ. A number of Queue Pairs (QPs) are
 * created (ctx.qp_count) and each QP uses the SRQ. The connection between the
 * client and server is established using the IP address details passed on the
 * command line. After the connection is established, the client starts
 * blasting sends to the server and stops when the maximum work requests
 * (ctx.max_wr) have been sent. When the server has received all the sends, it
 * performs a send to the client to tell it to continue. The process repeats
 * until the requested number of sends (ctx.msg_count) have been
 * performed.
 * 
 * Running the Example:
 * The executable can operate as either the client or server application. It
 * can be demonstrated on a simple fabric of two nodes with the server
 * application running on one node and the client application running on the
 * other. Each node must be configured to support IPoIB and the IB interface
 * (ex. ib0) must be assigned an IP Address. Finally, the fabric must be
 * initialized using OpenSM.
 * 
 * Server (-a is IP of local interface):
 * ./srq -s -a 192.168.1.12
 * 
 * Client (-a is IP of remote interface):
 * ./srq -a 192.168.1.12
 *      
 */
 
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <getopt.h>
#include <rdma/rdma_verbs.h>
 
#define VERB_ERR(verb, ret) \
        fprintf(stderr, "%s returned %d errno %d\n", verb, ret, errno)
 
/* Default parameter values */
#define DEFAULT_PORT "51216"
#define DEFAULT_MSG_COUNT 100
#define DEFAULT_MSG_LENGTH 100000
#define DEFAULT_QP_COUNT 4
#define DEFAULT_MAX_WR 64
 
/* Resources used in the example */
struct context
{
    /* User parameters */
    int server;
    char *server_name;
    char *server_port;
    int msg_count;
    int msg_length;
    int qp_count;
    int max_wr;
 
    /* Resources */
    struct rdma_cm_id *srq_id;
    struct rdma_cm_id *listen_id;
    struct rdma_cm_id **conn_id;
    struct ibv_mr *send_mr;
    struct ibv_mr *recv_mr;
    struct ibv_srq *srq;
    struct ibv_cq *srq_cq;
    struct ibv_comp_channel *srq_cq_channel;
    char *send_buf;
    char *recv_buf;
};
 
/*
 * Function: init_resources
 * 
 * Input:
 *      ctx     The context object
 *      rai     The RDMA address info for the connection
 * 
 * Output:
 *      none
 * 
 * Returns:
 *      0 on success, non-zero on failure
 * 
 * Description:
 *      This function initializes resources that are common to both the client
 *      and server functionality. 
 *      It creates our SRQ, registers memory regions, posts receive buffers 
 *      and creates a single completion queue that will be used for the receive 
 *      queue on each queue pair.
 */
int init_resources(struct context *ctx, struct rdma_addrinfo *rai)
{
    int ret, i;
    struct rdma_cm_id *id;
 
    /* Create an ID used for creating/accessing our SRQ */
    ret = rdma_create_id(NULL, &ctx->srq_id, NULL, RDMA_PS_TCP);
    if (ret) {
        VERB_ERR("rdma_create_id", ret);
        return ret;
    }
 
    /* We need to bind the ID to a particular RDMA device
     * This is done by resolving the address or binding to the address */
    if (ctx->server == 0) {
        ret = rdma_resolve_addr(ctx->srq_id, NULL, rai->ai_dst_addr, 1000);
        if (ret) {
            VERB_ERR("rdma_resolve_addr", ret);
            return ret;
        }
    }
    else {
        ret = rdma_bind_addr(ctx->srq_id, rai->ai_src_addr);
        if (ret) {
            VERB_ERR("rdma_bind_addr", ret);
            return ret;
        }
    }
 
    /* Create the memory regions being used in this example */
    ctx->recv_mr = rdma_reg_msgs(ctx->srq_id, ctx->recv_buf, ctx->msg_length);
    if (!ctx->recv_mr) {
        VERB_ERR("rdma_reg_msgs", -1);
        return -1;
    }
 
    ctx->send_mr = rdma_reg_msgs(ctx->srq_id, ctx->send_buf, ctx->msg_length);
    if (!ctx->send_mr) {
        VERB_ERR("rdma_reg_msgs", -1);
        return -1;
    }
 
    /* Create our shared receive queue */
    struct ibv_srq_init_attr srq_attr;
    memset(&srq_attr, 0, sizeof (srq_attr));
    srq_attr.attr.max_wr = ctx->max_wr;
    srq_attr.attr.max_sge = 1;
 
    ret = rdma_create_srq(ctx->srq_id, NULL, &srq_attr);
    if (ret) {
        VERB_ERR("rdma_create_srq", ret);
        return -1;
    }
 
    /* Save the SRQ in our context so we can assign it to other QPs later */
    ctx->srq = ctx->srq_id->srq;
 
    /* Post our receive buffers on the SRQ */
    for (i = 0; i < ctx->max_wr; i++) {
        ret = rdma_post_recv(ctx->srq_id, NULL, ctx->recv_buf, ctx->msg_length,
                             ctx->recv_mr);
        if (ret) {
            VERB_ERR("rdma_post_recv", ret);
            return ret;
        }
    }
 
    /* Create a completion channel to use with the SRQ CQ */
    ctx->srq_cq_channel = ibv_create_comp_channel(ctx->srq_id->verbs);
    if (!ctx->srq_cq_channel) {
        VERB_ERR("ibv_create_comp_channel", -1);
        return -1;
    }
 
    /* Create a CQ to use for all connections (QPs) that use the SRQ */
    ctx->srq_cq = ibv_create_cq(ctx->srq_id->verbs, ctx->max_wr, NULL,
                                ctx->srq_cq_channel, 0);
    if (!ctx->srq_cq) {
        VERB_ERR("ibv_create_cq", -1);
        return -1;
    }
 
    /* Make sure that we get notified on the first completion */
    ret = ibv_req_notify_cq(ctx->srq_cq, 0);
    if (ret) {
        VERB_ERR("ibv_req_notify_cq", ret);
        return ret;
    }
 
    return 0;
}
 
/*
 * Function:    destroy_resources
 * 
 * Input:
 *      ctx     The context object
 * 
 * Output:
 *      none
 * 
 * Returns:
 *      none
 * 
 * Description:
 *      This function cleans up resources used by the application
 */
void destroy_resources(struct context *ctx)
{
    int i;
 
    if (ctx->conn_id) {
        for (i = 0; i < ctx->qp_count; i++) {
            if (ctx->conn_id[i]) {
                if (ctx->conn_id[i]->qp &&
                    ctx->conn_id[i]->qp->state == IBV_QPS_RTS) {
                    rdma_disconnect(ctx->conn_id[i]);
                }
                rdma_destroy_qp(ctx->conn_id[i]);
                rdma_destroy_id(ctx->conn_id[i]);
            }
        }
 
        free(ctx->conn_id);
    }
 
    if (ctx->recv_mr)
        rdma_dereg_mr(ctx->recv_mr);
 
    if (ctx->send_mr)
        rdma_dereg_mr(ctx->send_mr);
 
    if (ctx->recv_buf)
        free(ctx->recv_buf);
 
    if (ctx->send_buf)
        free(ctx->send_buf);
 
    if (ctx->srq_cq)
        ibv_destroy_cq(ctx->srq_cq);
 
    if (ctx->srq_cq_channel)
        ibv_destroy_comp_channel(ctx->srq_cq_channel);
 
    if (ctx->srq_id) {
        rdma_destroy_srq(ctx->srq_id);
        rdma_destroy_id(ctx->srq_id);
    }
}
 
/*
 * Function:    await_completion
 * 
 * Input:
 *      ctx     The context object
 * 
 * Output:
 *      none
 * 
 * Returns:
 *      0 on success, non-zero on failure
 * 
 * Description:
 *      Waits for a completion on the SRQ CQ
 * 
 */
int await_completion(struct context *ctx)
{
    int ret;
    struct ibv_cq *ev_cq;
    void *ev_ctx;
 
    /* Wait for a CQ event to arrive on the channel */
    ret = ibv_get_cq_event(ctx->srq_cq_channel, &ev_cq, &ev_ctx);
    if (ret) {
        VERB_ERR("ibv_get_cq_event", ret);
        return ret;
    }
 
    ibv_ack_cq_events(ev_cq, 1);
 
    /* Reload the event notification */
    ret = ibv_req_notify_cq(ctx->srq_cq, 0);
    if (ret) {
        VERB_ERR("ibv_req_notify_cq", ret);
        return ret;
    }
 
    return 0;
}
 
/*
 * Function:    run_server
 * 
 * Input:
 *      ctx     The context object
 *      rai     The RDMA address info for the connection
 * 
 * Output:
 *      none
 * 
 * Returns:
 *      0 on success, non-zero on failure
 * 
 * Description:
 *      Executes the server side of the example
 */
int run_server(struct context *ctx, struct rdma_addrinfo *rai)
{
    int ret, i;
    uint64_t send_count = 0;
    uint64_t recv_count = 0;
    struct ibv_wc wc;
    struct ibv_qp_init_attr qp_attr;
 
    ret = init_resources(ctx, rai);
    if (ret) {
        printf("init_resources returned %d\n", ret);
        return ret;
    }
 
    /* Use the srq_id as the listen_id since it is already setup */
    ctx->listen_id = ctx->srq_id;
 
    ret = rdma_listen(ctx->listen_id, 4);
    if (ret) {
        VERB_ERR("rdma_listen", ret);
        return ret;
    }
 
    printf("waiting for connection from client...\n");
    for (i = 0; i < ctx->qp_count; i++) {
        ret = rdma_get_request(ctx->listen_id, &ctx->conn_id[i]);
        if (ret) {
            VERB_ERR("rdma_get_request", ret);
            return ret;
        }
 
        /* Create the queue pair */
        memset(&qp_attr, 0, sizeof (qp_attr));
 
        qp_attr.qp_context = ctx;
        qp_attr.qp_type = IBV_QPT_RC;
        qp_attr.cap.max_send_wr = ctx->max_wr;
        qp_attr.cap.max_recv_wr = ctx->max_wr;
        qp_attr.cap.max_send_sge = 1;
        qp_attr.cap.max_recv_sge = 1;
        qp_attr.cap.max_inline_data = 0;
        qp_attr.recv_cq = ctx->srq_cq;
        qp_attr.srq = ctx->srq;
        qp_attr.sq_sig_all = 0;
 
        ret = rdma_create_qp(ctx->conn_id[i], NULL, &qp_attr);
        if (ret) {
            VERB_ERR("rdma_create_qp", ret);
            return ret;
        }
 
        /* Set the new connection to use our SRQ */
        ctx->conn_id[i]->srq = ctx->srq;
 
        ret = rdma_accept(ctx->conn_id[i], NULL);
        if (ret) {
            VERB_ERR("rdma_accept", ret);
            return ret;
        }
    }
 
    while (recv_count < ctx->msg_count) {
        i = 0;
        while (i < ctx->max_wr && recv_count < ctx->msg_count) {
            int ne;
 
            ret = await_completion(ctx);
            if (ret) {
                printf("await_completion %d\n", ret);
                return ret;
            }
 
            do {
                ne = ibv_poll_cq(ctx->srq_cq, 1, &wc);
                if (ne < 0) {
                    VERB_ERR("ibv_poll_cq", ne);
                    return ne;
                }
                else if (ne == 0)
                    break;
 
                if (wc.status != IBV_WC_SUCCESS) {
                    printf("work completion status %s\n",
                           ibv_wc_status_str(wc.status));
                    return -1;
                }
 
                recv_count++;
                printf("recv count: %llu, qp_num: %u\n",
                       (unsigned long long) recv_count, wc.qp_num);
 
                ret = rdma_post_recv(ctx->srq_id, (void *)(uintptr_t) wc.wr_id,
                                     ctx->recv_buf, ctx->msg_length, 
                                     ctx->recv_mr);
                if (ret) {
                    VERB_ERR("rdma_post_recv", ret);
                    return ret;
                }
 
                i++;
            }
            while (ne);
        }
 
        ret = rdma_post_send(ctx->conn_id[0], NULL, ctx->send_buf, 
                             ctx->msg_length, ctx->send_mr, IBV_SEND_SIGNALED);
        if (ret) {
            VERB_ERR("rdma_post_send", ret);
            return ret;
        }
 
        ret = rdma_get_send_comp(ctx->conn_id[0], &wc);
        if (ret <= 0) {
            VERB_ERR("rdma_get_send_comp", ret);
            return -1;
        }
 
        send_count++;
        printf("send count: %llu\n", (unsigned long long) send_count);
    }
 
    return 0;
}
 
/*
 * Function:    run_client
 * 
 * Input:
 *      ctx     The context object
 *      rai     The RDMA address info for the connection
 * 
 * Output:
 *      none
 * 
 * Returns:
 *      0 on success, non-zero on failure
 * 
 * Description:
 *      Executes the client side of the example
 */
int run_client(struct context *ctx, struct rdma_addrinfo *rai)
{
    int ret, i, ne;
    uint64_t send_count = 0;
    uint64_t recv_count = 0;
    struct ibv_wc wc;
    struct ibv_qp_init_attr attr;
 
    ret = init_resources(ctx, rai);
    if (ret) {
        printf("init_resources returned %d\n", ret);
        return ret;
    }
 
    for (i = 0; i < ctx->qp_count; i++) {
        memset(&attr, 0, sizeof (attr));
 
        attr.qp_context = ctx;
        attr.cap.max_send_wr = ctx->max_wr;
        attr.cap.max_recv_wr = ctx->max_wr;
        attr.cap.max_send_sge = 1;
        attr.cap.max_recv_sge = 1;
        attr.cap.max_inline_data = 0;
        attr.recv_cq = ctx->srq_cq;
        attr.srq = ctx->srq;
        attr.sq_sig_all = 0;
 
        ret = rdma_create_ep(&ctx->conn_id[i], rai, NULL, &attr);
        if (ret) {
            VERB_ERR("rdma_create_ep", ret);
            return ret;
        }
 
        ret = rdma_connect(ctx->conn_id[i], NULL);
        if (ret) {
            VERB_ERR("rdma_connect", ret);
            return ret;
        }
    }
 
    while (send_count < ctx->msg_count) {
        for (i = 0; i < ctx->max_wr && send_count < ctx->msg_count; i++) {
            /* perform our send to the server */
            ret = rdma_post_send(ctx->conn_id[i % ctx->qp_count], NULL,
                                 ctx->send_buf, ctx->msg_length, ctx->send_mr, 
                                 IBV_SEND_SIGNALED);
            if (ret) {
                VERB_ERR("rdma_post_send", ret);
                return ret;
            }
 
            ret = rdma_get_send_comp(ctx->conn_id[i % ctx->qp_count], &wc);
            if (ret <= 0) {
                VERB_ERR("rdma_get_send_comp", ret);
                return -1; /* ret may be 0 here, which would look like success */
            }
 
            send_count++;
            printf("send count: %llu, qp_num: %u\n",
                   (unsigned long long) send_count, wc.qp_num);
        }
 
        /* wait for a recv indicating that all buffers were processed */
        ret = await_completion(ctx);
        if (ret) {
            VERB_ERR("await_completion", ret);
            return ret;
        }
 
        do {
            ne = ibv_poll_cq(ctx->srq_cq, 1, &wc);
            if (ne < 0) {
                VERB_ERR("ibv_poll_cq", ne);
                return ne;
            }
            else if (ne == 0)
                break;
 
            if (wc.status != IBV_WC_SUCCESS) {
                printf("work completion status %s\n",
                       ibv_wc_status_str(wc.status));
                return -1;
            }
 
            recv_count++;
            printf("recv count: %llu\n", (unsigned long long) recv_count);
 
            ret = rdma_post_recv(ctx->srq_id, (void *)(uintptr_t) wc.wr_id,
                                 ctx->recv_buf, ctx->msg_length, ctx->recv_mr);
            if (ret) {
                VERB_ERR("rdma_post_recv", ret);
                return ret;
            }
        }
        while (ne);
    }
 
    return ret;
}
 
/*
 * Function:    main
 * 
 * Input:
 *      argc    The number of arguments
 *      argv    Command line arguments
 * 
 * Output:
 *      none
 * 
 * Returns:
 *      0 on success, non-zero on failure
 * 
 * Description:
 *      Main program to demonstrate SRQ functionality.
 *      Both the client and server use an SRQ. ctx.qp_count number of QPs are
 *      created and each one of them uses the SRQ. After the connection, the
 *      client starts blasting sends to the server up to ctx.max_wr. When the
 *      server has received all the sends, it performs a send to the client to
 *      tell it that it can continue. Process repeats until ctx.msg_count
 *      sends have been performed.
 */
int main(int argc, char** argv)
{
    int ret, op;
    struct context ctx;
    struct rdma_addrinfo *rai, hints;
 
    memset(&ctx, 0, sizeof (ctx));
    memset(&hints, 0, sizeof (hints));
 
    ctx.server = 0;
    ctx.server_port = DEFAULT_PORT;
    ctx.msg_count = DEFAULT_MSG_COUNT;
    ctx.msg_length = DEFAULT_MSG_LENGTH;
    ctx.qp_count = DEFAULT_QP_COUNT;
    ctx.max_wr = DEFAULT_MAX_WR;
 
    /* Read options from command line */
    while ((op = getopt(argc, argv, "sa:p:c:l:q:w:")) != -1) {
        switch (op) {
        case 's':
            ctx.server = 1;
            break;
        case 'a':
            ctx.server_name = optarg;
            break;
        case 'p':
            ctx.server_port = optarg;
            break;
        case 'c':
            ctx.msg_count = atoi(optarg);
            break;
        case 'l':
            ctx.msg_length = atoi(optarg);
            break;
        case 'q':
            ctx.qp_count = atoi(optarg);
            break;
        case 'w':
            ctx.max_wr = atoi(optarg);
            break;
        default:
            printf("usage: %s -a server_address\n", argv[0]);
            printf("\t[-s server mode]\n");
            printf("\t[-p port_number]\n");
            printf("\t[-c msg_count]\n");
            printf("\t[-l msg_length]\n");
            printf("\t[-q qp_count]\n");
            printf("\t[-w max_wr]\n");
            exit(1);
        }
    }
 
    if (ctx.server_name == NULL) {
        printf("server address required (use -a)!\n");
        exit(1);
    }
 
    hints.ai_port_space = RDMA_PS_TCP;
    if (ctx.server == 1)
        hints.ai_flags = RAI_PASSIVE; /* this makes it a server */
 
    ret = rdma_getaddrinfo(ctx.server_name, ctx.server_port, &hints, &rai);
    if (ret) {
        VERB_ERR("rdma_getaddrinfo", ret);
        exit(1);
    }
 
    /* allocate memory for our QPs and send/recv buffers;
     * calloc returns zeroed memory, so no separate memset is needed */
    ctx.conn_id = (struct rdma_cm_id **) calloc(ctx.qp_count,
                                                sizeof (struct rdma_cm_id *));
    ctx.send_buf = (char *) calloc(1, ctx.msg_length);
    ctx.recv_buf = (char *) calloc(1, ctx.msg_length);
    if (!ctx.conn_id || !ctx.send_buf || !ctx.recv_buf) {
        printf("failed to allocate memory\n");
        rdma_freeaddrinfo(rai);
        exit(1);
    }
 
    if (ctx.server)
        ret = run_server(&ctx, rai);
    else
        ret = run_client(&ctx, rai);
 
    destroy_resources(&ctx);
    rdma_freeaddrinfo(rai); /* rai came from rdma_getaddrinfo, not malloc */
 
    return ret;
}
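
The flow control described in the comment above main can be checked with a small calculation. The following is a minimal sketch and is not part of the original example program; the helper name expected_sync_sends is hypothetical. It assumes the client posts at most max_wr sends per batch and the server answers each batch with exactly one send, as the loops in run_server and run_client do.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not part of the example program): number of
 * server-to-client "continue" sends in one run. Each batch carries up to
 * max_wr client sends and is answered by one server send, so the result
 * is ceil(msg_count / max_wr). */
uint64_t expected_sync_sends(uint64_t msg_count, uint64_t max_wr)
{
    assert(max_wr > 0);   /* a batch must hold at least one send */
    return (msg_count + max_wr - 1) / max_wr;
}
```

For instance, with a message count of 10 and a max_wr of 4, the server would send three times: after the batches of 4, 4, and 2 receives.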

Experimental APIs

Dynamically Connected Transport

The Dynamically Connected (DC) transport provides reliable transport services from a DC Initiator (DCI) to a DC Target (DCT). A DCI can send data to multiple targets on the same or a different subnet, and a DCT can simultaneously service traffic from multiple DCIs. No explicit connections are set up by the user; the target DCT is identified by an address vector similar to that used in UD transport, a DCT number, and a DC access key.

DC Usage Model

Query the device to detect whether the DC transport is supported and, if so, what its characteristics are.
The user creates DCIs. The number of DCIs depends on the user's strategy for handling concurrent data transmissions.
The user defines a DC Access Key and initializes a DCT using this access key.
The user can query the DCI with the routine ibv_exp_query_qp(), and can query the DCT with the routine ibv_exp_query_dct().
The user can arm the DCT, so that an event is generated when a DC Access Key violation occurs.
Send work requests are posted to the DCIs. Data can be sent to a different DCT only after all previous sends complete, so send CQEs can be used to detect such completions.
The CQ associated with the DCT is used to detect data arrival.
Destroy the resources when done.

Query Device

The function

int ibv_exp_query_device(struct ibv_context *context, struct ibv_exp_device_attr *attr)

is used to query for device capabilities. The flag IBV_EXP_DEVICE_DC_TRANSPORT in the field exp_atomic_cap of the struct ibv_exp_device_attr indicates whether the DC transport is supported.

The fields

int max_dc_req_rd_atom;
int max_dc_res_rd_atom;

in the same structure describe DC's atomic support characteristics.

Create DCT

/* create a DC target object */
struct ibv_exp_dct *ibv_exp_create_dct(struct ibv_context *context,
                                       struct ibv_exp_dct_init_attr *attr);

context - Context to the InfiniBand device as returned from ibv_open_device.
attr - Defines attributes of the DCT and includes:
- struct ibv_pd *pd - The PD to verify access validity with respect to protection domains
- struct ibv_cq *cq - CQ used to report receive completions
- struct ibv_srq *srq - The SRQ that will provide the receive buffers.
  Note that the PD is not checked against the PD of the scatter entry. This check is done with the PD of the DC target.
- dc_key - A 64-bit key associated with the DCT
- port - The port number this DCT is bound to
- access flags - Semantics similar to RC QPs
  - remote read
  - remote write
  - remote atomics
- min_rnr_timer - Minimum RNR NAK time required from the requester between successive requests of a message that was previously rejected due to insufficient receive buffers. IB spec 9.7.5.2.8
- tclass - Used by packets sent by the DCT in case GRH is used
- flow_label - Used by packets sent by the DCT in case GRH is used
- mtu - MTU
- pkey_index - pkey index used by the DC target
- gid_index - GID index associated with the DCT. Used to verify incoming packets if GRH is used. This field is mandatory
- hop_limit - Used by packets sent by the DCT in case GRH is used
- Create flags

Destroy DCT

/* destroy a DCT object */

int ibv_exp_destroy_dct(struct ibv_exp_dct *dct);

Destroys a DC target. This call may take some time, because it waits until all DCRs are disconnected.

Query DCT

/* query DCT attributes */

int ibv_exp_query_dct(struct ibv_exp_dct *dct, struct ibv_exp_dct_attr *attr);

Attributes queried are:

state
cq
access_flags
min_rnr_timer
pd
tclass
flow_label
dc_key
mtu
port
pkey_index
gid_index
hop_limit
key_violations
srq

Arm DCT

A DC target can be armed to request notification when DC key violations occur. After return from a call to ibv_exp_arm_dct, the DC target is moved into the “ARMED” state. If a packet targeting this DCT with a wrong key is received, the DCT moves to the “FIRED” state and the event IBV_EXP_EVENT_DCT_KEY_VIOLATION is generated. The user can read these events by calling ibv_get_async_event. Events must be acked with ibv_ack_async_event.

struct ibv_exp_arm_attr {
    uint32_t comp_mask;
};

int ibv_exp_arm_dct(struct ibv_exp_dct *dct,
                    struct ibv_exp_arm_attr *attr);

dct - Pointer to a previously created DC target
attr - Pointer to arm DCT attributes. This struct has a single comp_mask field that must be zero in this version
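
The ARMED and FIRED states described in this section can be pictured with a small software model. This is only a sketch: the names below (model_dct, model_arm, model_receive, and the MODEL_DCT_ACTIVE state) are invented for illustration and are not part of the verbs API. It also assumes that a DCT reports at most one event per arming, which the text above does not state explicitly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative model only; none of these names exist in the verbs API. */
enum model_dct_state { MODEL_DCT_ACTIVE, MODEL_DCT_ARMED, MODEL_DCT_FIRED };

struct model_dct {
    uint64_t dc_key;             /* key the target was created with */
    enum model_dct_state state;
    unsigned events;             /* key-violation events generated so far */
};

/* Models the effect of arming: the target moves to the ARMED state. */
void model_arm(struct model_dct *dct)
{
    assert(dct != NULL);
    dct->state = MODEL_DCT_ARMED;
}

/* Models a packet arriving with pkt_key. Returns 1 if the key matches,
 * 0 otherwise. A mismatch on an ARMED target moves it to FIRED and
 * counts one event (assumption: one event per arming). */
int model_receive(struct model_dct *dct, uint64_t pkt_key)
{
    assert(dct != NULL);
    if (pkt_key == dct->dc_key)
        return 1;
    if (dct->state == MODEL_DCT_ARMED) {
        dct->state = MODEL_DCT_FIRED;
        dct->events++;
    }
    return 0;
}
```

In a real application the event would be read with ibv_get_async_event and acknowledged with ibv_ack_async_event, as described above.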

Create DCI

A DCI is created by calling ibv_exp_create_qp() with a new QP type, IBV_EXP_QPT_DC_INI. The semantics are similar to those of regular QPs. A DCI is an initiator endpoint which connects to DC targets. Matching rules are identical to those of QKEY for UD; however, the key is 64 bits. A DCI is not a responder; it is only an initiator.

The following are the valid state transitions for a DCI, with the required and optional parameters:

From	To	Required	Optional
Reset	Init	IBV_QP_PKEY_INDEX, IBV_QP_PORT, IBV_QP_DC_KEY
Init	Init	IBV_QP_PKEY_INDEX, IBV_QP_PORT, IBV_QP_ACCESS_FLAGS
Init	RTR	IBV_QP_AV, IBV_QP_PATH_MTU	IBV_QP_PKEY_INDEX, IBV_QP_DC_KEY
RTR	RTS	IBV_QP_TIMEOUT, IBV_QP_RETRY_CNT, IBV_QP_RNR_RETRY, IBV_QP_MAX_QP_RD_ATOMIC	IBV_QP_ALT_PATH, IBV_QP_MIN_RNR_TIMER, IBV_QP_PATH_MIG_STATE
RTS	RTS		IBV_QP_ALT_PATH, IBV_QP_PATH_MIG_STATE, IBV_QP_MIN_RNR_TIMER
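
A table like the one above lends itself to a table-driven check. The sketch below is an illustration under stated assumptions: the M_* constants are local stand-ins with arbitrary bit values (they do not match the verbs headers), only the three forward transitions are modeled, and the rule that attributes outside the required and optional sets are rejected is an assumption of this sketch rather than documented driver behavior.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local stand-ins for the attribute-mask bits named in the table. The bit
 * values are arbitrary and do NOT match the verbs headers. */
enum {
    M_PKEY_INDEX = 1 << 0, M_PORT = 1 << 1, M_DC_KEY = 1 << 2,
    M_AV = 1 << 3, M_PATH_MTU = 1 << 4, M_TIMEOUT = 1 << 5,
    M_RETRY_CNT = 1 << 6, M_RNR_RETRY = 1 << 7,
    M_MAX_QP_RD_ATOMIC = 1 << 8, M_ALT_PATH = 1 << 9,
    M_MIN_RNR_TIMER = 1 << 10, M_PATH_MIG_STATE = 1 << 11
};

enum dci_state { DCI_RESET, DCI_INIT, DCI_RTR, DCI_RTS };

struct dci_transition {
    enum dci_state from, to;
    uint32_t required, optional;
};

/* Rows copied from the table; Init->Init and RTS->RTS are not modeled. */
static const struct dci_transition dci_table[] = {
    { DCI_RESET, DCI_INIT, M_PKEY_INDEX | M_PORT | M_DC_KEY, 0 },
    { DCI_INIT, DCI_RTR, M_AV | M_PATH_MTU, M_PKEY_INDEX | M_DC_KEY },
    { DCI_RTR, DCI_RTS,
      M_TIMEOUT | M_RETRY_CNT | M_RNR_RETRY | M_MAX_QP_RD_ATOMIC,
      M_ALT_PATH | M_MIN_RNR_TIMER | M_PATH_MIG_STATE },
};

/* Returns 1 if mask carries every required attribute and nothing outside
 * required|optional for the from->to transition, 0 otherwise. */
int dci_mask_ok(enum dci_state from, enum dci_state to, uint32_t mask)
{
    size_t i;

    for (i = 0; i < sizeof dci_table / sizeof dci_table[0]; i++) {
        const struct dci_transition *t = &dci_table[i];

        if (t->from == from && t->to == to)
            return (mask & t->required) == t->required &&
                   (mask & ~(t->required | t->optional)) == 0;
    }
    return 0; /* transition not modeled */
}
```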

Verbs API for Extended Atomics Support

The extended atomics capabilities provide support for performing Fetch&Add and masked Compare&Swap atomic operations on multiple fields. The figure below shows how the individual fields within the user-supplied-data field are specified.

In the figure above, the total operand size is N bits, with the length of each data field being four bits. The 1's in the mask indicate the termination of a data field. With the ConnectX® family of HCAs and Connect-IB®, there is always an implicit 1 in the mask.
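
The effect of the field mask can be modeled in software for a single 64-bit operand. The sketch below is an illustration, not the HCA implementation; the function name model_masked_fetch_add is hypothetical. It assumes that a 1 in the mask marks the most significant bit of a field, so that a carry never crosses from one field into the next, and it treats bit 63 as an always-present boundary to mirror the implicit 1 mentioned above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Software model of a multi-field fetch-and-add on one 64-bit operand.
 * A 1 bit in field_boundary marks the most significant bit of a field
 * (assumption), so carries stop there. Returns the value before the add. */
uint64_t model_masked_fetch_add(uint64_t *target, uint64_t add_val,
                                uint64_t field_boundary)
{
    uint64_t boundary, old, low_sum;

    assert(target != NULL);
    boundary = field_boundary | (1ULL << 63);   /* implicit top boundary */
    old = *target;

    /* Add all non-boundary bits. A carry can reach a boundary bit but
     * cannot cross it, because both boundary inputs were cleared. */
    low_sum = (old & ~boundary) + (add_val & ~boundary);

    /* Fold in the boundary bits themselves modulo 2, which drops any
     * carry out of each field. */
    *target = low_sum ^ ((old ^ add_val) & boundary);
    return old;
}
```

With four-bit fields (a mask of 0x8888888888888888), adding 0x1F to 0x12 yields 0x21: the low field wraps from 2 + 0xF to 1 without disturbing the next field, which goes from 1 to 2.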

Supported Hardware

The extended atomic operations are supported by ConnectX®-2 and subsequent hardware. ConnectX-2/ConnectX®-3 devices employ read-modify-write operations on regions that are sized as multiples of 64 bits with 64 bit alignment. Therefore, when operations are performed on user buffers that are smaller than 64 bits, the unmodified sections of such regions will be written back unmodified when the results are committed to user memory. Connect-IB® and subsequent devices operate on memory regions that are multiples of 32 or 64 bits, with natural alignment.

Verbs Interface Changes

Usage model:

Query the device to determine:
- Whether atomic operations are supported
- The endianness of the atomic response
- Whether extended atomics are supported, and which data sizes are supported
Initialize the QP for use with atomic operations, taking device capabilities into account
Use the atomic operations
Destroy the QP when it is no longer needed

Query Device Capabilities

The device capabilities flags enumeration is updated to reflect the support for extended atomic operations by adding the flag:

+ IBV_EXP_DEVICE_EXT_ATOMICS ,

and the device attribute comp mask enumeration ibv_exp_device_attr_comp_mask is updated with:

+ IBV_EXP_DEVICE_ATTR_EXT_ATOMIC_ARGS,

The device attributes struct, ibv_exp_device_attr, is modified by adding the field struct ibv_exp_ext_atomics_params ext_atom:

struct ibv_exp_ext_atomics_params {
    uint64_t atomic_arg_sizes; /* bit-mask of supported sizes */
    uint32_t max_fa_bit_boundary;
    uint32_t log_max_atomic_inline;
};

Atomic fetch&add operations on subsections of the operands are also supported, with max_fa_bit_boundary being the log-base-2 of the largest such subfield, in bytes. log_max_atomic_inline is the log of the largest amount of atomic data, in bytes, that can be put in the work request, and includes the space for all required fields. For ConnectX and Connect-IB, the largest subsection supported is eight bytes.

The returned data is formatted in units that correspond to the host's natural word size. For example, if extended atomics are used for a 16-byte field and the data is returned in big-endian format, each eight-byte portion is arranged in big-endian format, regardless of the size of the fields used in a multi-field fetch-and-add operation.

Response Format

QP Initialization

QP initialization needs additional information about the sizes of atomic operations that will be supported inline. This is needed to ensure the QP is provisioned with sufficient send resources to support the number of supported WQEs.

The QP attribute enumeration comp-mask, ibv_exp_qp_init_attr_comp_mask, is expanded by adding

+ IBV_EXP_QP_INIT_ATTR_ATOMICS_ARG ,

Send Work Request Changes

The send opcodes are extended to include:
+	IBV_EXP_WR_EXT_MASKED_ATOMIC_CMP_AND_SWP,
+	IBV_EXP_WR_EXT_MASKED_ATOMIC_FETCH_AND_ADD
The send flags, ibv_exp_send_flags, are expanded to include inline support for extended atomic operations with the flag:
+	IBV_EXP_SEND_EXT_ATOMIC_INLINE
The send work request is extended by appending
union {
    struct {
        /* Log base-2 of total operand size
         */
        uint32_t        log_arg_sz;
        uint64_t  remote_addr;
        uint32_t  rkey;  /* remote memory key */
        union {
            struct {
                /* For the next four fields:
                 * If operand_size < 8 bytes then inline data is in
                 * the corresponding field; for small operands,
                 * LSBs are used.
                 * Else the fields are pointers in the process's 
                 * address space to
                 * where the arguments are stored
                 */
                union {
                    struct ibv_exp_cmp_swap cmp_swap;
                    struct ibv_exp_fetch_add fetch_add;
                } op;
            } inline_data;
            /* in the future add support for non-inline
             * argument provisioning
             */
        } wr_data;
    } masked_atomics;
} ext_op;
 
to the end of the work request, ibv_exp_send_wr,
with
struct ibv_exp_cmp_swap {
  uint64_t  compare_mask;
  uint64_t  compare_val;
  uint64_t  swap_val;
  uint64_t  swap_mask;
};
and
struct ibv_exp_fetch_add {
  uint64_t  add_val;
  uint64_t  field_boundary;
};
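
The four fields of the compare-and-swap argument suggest the following semantics, shown here as a software model on one 64-bit operand. This is a sketch under assumptions: the struct model_cmp_swap is a local mirror of the fields listed above, the function name is hypothetical, and the behavior (compare only the bits selected by compare_mask, replace only the bits selected by swap_mask) follows the usual definition of a masked compare-and-swap rather than anything stated explicitly in this section.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local mirror of the four fields listed above (illustrative only). */
struct model_cmp_swap {
    uint64_t compare_mask;
    uint64_t compare_val;
    uint64_t swap_val;
    uint64_t swap_mask;
};

/* Returns the old value. The bits selected by swap_mask are replaced
 * only if the bits selected by compare_mask match compare_val. */
uint64_t model_masked_cmp_swap(uint64_t *target,
                               const struct model_cmp_swap *op)
{
    uint64_t old;

    assert(target != NULL && op != NULL);
    old = *target;
    if ((old & op->compare_mask) == (op->compare_val & op->compare_mask))
        *target = (old & ~op->swap_mask) | (op->swap_val & op->swap_mask);
    return old;
}
```

The caller can tell whether the swap took place by applying the same masked comparison to the returned value.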

User-Mode Memory Registration (UMR)

This section describes User-Mode Memory Registration (UMR) which supports the creation of memory keys for non-contiguous memory regions. This includes the concatenation of arbitrary contiguous regions of memory, as well as regions with regular structure.

Three examples of non-contiguous regions of memory that are used to form new contiguous regions of memory are described below. Figure 2 shows an example where portions of three separate contiguous regions of memory are combined to create a single logically contiguous region of memory. The base address of the new memory region is defined by the user when the new memory key is defined.

[Figure: Memory region described by Indirect Memory key (KLM)]
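
The concatenation described above amounts to an address translation from the new logically contiguous region into the underlying pieces. The sketch below models that translation in software; the type model_segment and the function model_translate are hypothetical names for illustration and do not correspond to the driver's internal data structures.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One contiguous piece of the combined region (illustrative type). */
struct model_segment {
    uint64_t addr;    /* start address of the piece */
    size_t   length;  /* bytes contributed by the piece */
};

/* Translates an address inside the logically contiguous region that
 * starts at base into the underlying address. Returns 0 on success and
 * -1 if addr lies outside the region. */
int model_translate(const struct model_segment *segs, size_t nsegs,
                    uint64_t base, uint64_t addr, uint64_t *out)
{
    uint64_t offset;
    size_t i;

    assert(segs != NULL && out != NULL);
    if (addr < base)
        return -1;
    offset = addr - base;
    for (i = 0; i < nsegs; i++) {
        if (offset < segs[i].length) {
            *out = segs[i].addr + offset;
            return 0;
        }
        offset -= segs[i].length;   /* skip this piece */
    }
    return -1;
}
```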

The figure below shows a non-contiguous memory region with regular structure. This region is defined by a base address, the stride between adjacent elements, the extent of each element, and a repeat count.

The figure below shows an example where two non-contiguous memory regions are interleaved, using the repeat structure UMR.

[Figure: Interleaving data from two separate non-contiguous regions of memory]
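
The logical layout produced by interleaving strided regions can be illustrated by gathering the data into a contiguous buffer in software. The following is a sketch only: model_repeat_block and model_gather_interleaved are hypothetical names, and the ordering assumed here (one element from each region per repetition) is one plausible reading of the figure.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Illustrative description of a regularly strided region. */
struct model_repeat_block {
    const unsigned char *base; /* address of the first element */
    size_t byte_count;         /* extent of each element */
    size_t stride;             /* distance between element starts */
};

/* Gathers repeat_count rounds into dst; each round takes one element
 * from every block, in order. Returns the number of bytes written.
 * dst must have room for repeat_count * sum(byte_count) bytes. */
size_t model_gather_interleaved(unsigned char *dst,
                                const struct model_repeat_block *blocks,
                                size_t nblocks, size_t repeat_count)
{
    size_t r, b, written = 0;

    assert(dst != NULL && blocks != NULL);
    for (r = 0; r < repeat_count; r++) {
        for (b = 0; b < nblocks; b++) {
            memcpy(dst + written,
                   blocks[b].base + r * blocks[b].stride,
                   blocks[b].byte_count);
            written += blocks[b].byte_count;
        }
    }
    return written;
}
```

With a UMR memory key, no such copy is made; the HCA presents the interleaved view directly, and the model above only shows which bytes that view would contain.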

Interfaces

The usage model for the UMR includes:

Checking with ibv_exp_query_device() whether UMR is supported
If UMR is supported, checking struct ibv_exp_device_attr for its characteristics
Using ibv_exp_create_mr() to create an uninitialized memory key for future UMR use
Using ibv_exp_post_send() to define the new memory key. This can be posted to the same send queue that will use the memory key in future operations.
Using the UMR defined as one would use any other memory keys
Using ibv_exp_post_send() to invalidate the UMR memory key
Releasing the memory key with ibv_dereg_mr()

Device Capabilities

The device capabilities are queried to see whether the UMR capability is supported and, if so, what its characteristics are. The routine used is:

int ibv_exp_query_device(struct ibv_context *context, struct ibv_exp_device_attr *attr)

The struct ibv_exp_umr_caps umr_caps field describes the UMR capabilities. This structure is defined as:

struct ibv_exp_umr_caps {
    uint32_t max_klm_list_size;
    uint32_t max_send_wqe_inline_klms;
    uint32_t max_umr_recursion_depth;
    uint32_t max_umr_stride_dimension;
};

The fields added to struct ibv_exp_device_attr to support UMR include:

exp_device_cap_flags - UMR support is available if the flag IBV_EXP_DEVICE_ATTR_UMR is set.
max_klm_list_size - the maximum number of memory keys that may be input to UMR
max_send_wqe_inline_klms - the largest number of KLMs that can be provided inline in the work request. When the list is larger than this, a buffer is allocated via the struct ibv_mr *ibv_exp_reg_mr(struct ibv_exp_reg_mr_in *in) function and provided to the driver as part of the memory key creation.
max_umr_recursion_depth - memory keys created by UMR operations may be input to UMR memory key creation. This specifies the limit on how deep this recursion can be.
max_umr_stride_dimension - The maximum number of independent dimensions that may be used with the regular structure UMR operations. The current limit is one.
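
The capability fields above imply a simple decision when a UMR memory key is defined: whether the KLM list fits inline in the work request or needs a separate buffer. The sketch below illustrates that decision. The struct model_umr_caps is a local mirror of two of the fields, and the enum and function names are hypothetical; a real application would read the limits from the umr_caps field returned by ibv_exp_query_device.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local mirror of two capability fields shown above (illustrative). */
struct model_umr_caps {
    uint32_t max_klm_list_size;
    uint32_t max_send_wqe_inline_klms;
};

enum model_klm_placement {
    MODEL_KLM_INLINE,     /* list fits in the work request */
    MODEL_KLM_BUFFER,     /* list needs a hardware-accessible buffer */
    MODEL_KLM_TOO_LARGE   /* list exceeds the device limit */
};

/* Decides where a KLM list of num_klms entries would have to go. */
enum model_klm_placement model_place_klms(const struct model_umr_caps *caps,
                                          uint32_t num_klms)
{
    assert(caps != NULL);
    if (num_klms > caps->max_klm_list_size)
        return MODEL_KLM_TOO_LARGE;
    if (num_klms <= caps->max_send_wqe_inline_klms)
        return MODEL_KLM_INLINE;
    return MODEL_KLM_BUFFER;
}
```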

QP Creation

To configure QP UMR support, the routine

struct ibv_qp *ibv_exp_create_qp(struct ibv_context *context, struct ibv_exp_qp_init_attr *qp_init_attr)

is to be used. Setting the attribute IBV_EXP_QP_CREATE_UMR in the exp_create_flags field of struct ibv_exp_qp_init_attr enables UMR support. To set the maximum number of inline KLMs, the attribute IBV_EXP_QP_INIT_ATTR_MAX_INL_KLMS is set in the comp_mask field of struct ibv_exp_qp_init_attr, with the field max_inl_send_klms defining this number.

Memory Key Manipulation

To create an uninitialized memory key for future use the routine

struct ibv_mr *ibv_exp_create_mr(struct ibv_exp_create_mr_in *create_mr_in)
is used with
struct ibv_exp_create_mr_in {
    struct ibv_pd *pd;
    struct ibv_exp_mr_init_attr attr;
};
and
struct ibv_exp_mr_init_attr {
    uint64_t max_reg_descriptors; /* maximum number of entries */
    uint32_t create_flags; /* enum ibv_mr_create_flags */
    uint64_t access_flags; /* region's access rights */
    uint32_t comp_mask;
};

To query the resources associated with the memory key, the routine

int ibv_exp_query_mkey(struct ibv_mr *mr, struct ibv_exp_mkey_attr *query_mkey_in)
is used with
struct ibv_exp_mkey_attr {
    int n_mkey_entries;  /* the maximum number of memory keys that can be supported */
    uint32_t comp_mask;
};

Non-inline memory objects

When the list of memory keys input into the UMR memory key creation is too large to fit into the work request, a hardware accessible buffer needs to be provided in the posted send request. This buffer will be populated by the driver with the relevant memory objects.

The following enum is defined:
enum memory_reg_type{
    IBV_MEM_REG_MKEY
};
 
The memory registration function is defined as:
 
struct ibv_exp_mkey_list_container *ibv_exp_alloc_mkey_list_memory
             (struct ibv_exp_mkey_list_container_attr *attr)
where
struct ibv_exp_mkey_list_container_attr {
  struct ibv_pd *pd;
  uint32_t mkey_list_type;  /* use ibv_exp_mkey_list_type */
  uint32_t max_klm_list_size;
  uint32_t comp_mask; /*use ibv_exp_alloc_mkey_list_comp_mask */
};
This memory is freed with
int ibv_exp_dealloc_mkey_list_memory(struct ibv_exp_mkey_list_container *mem)
 
where
struct ibv_exp_mkey_list_container {
  uint32_t max_klm_list_size;
  struct ibv_context *context;
};

Memory Key Initialization

The memory key is manipulated with the ibv_exp_post_send() routine. The opcodes IBV_EXP_WR_UMR_FILL and IBV_EXP_WR_UMR_INVALIDATE are used to define and invalidate, respectively, the memory key.

The struct ibv_exp_send_wr contains the following fields to support the UMR capabilities:

union {
    struct {
      uint32_t umr_type; /* use ibv_exp_umr_wr_type */
      struct ibv_exp_mkey_list_container *memory_objects; /* used when IBV_EXP_SEND_INLINE is not set */
      uint64_t exp_access; /* use ibv_exp_access_flags */
      struct ibv_mr *modified_mr;
      uint64_t base_addr;
      uint32_t num_mrs; /* array size of mem_repeat_block_list or mem_reg_list */
      union {
        struct ibv_exp_mem_region *mem_reg_list; /* array, size corresponds to num_mrs */
        struct {
          struct ibv_exp_mem_repeat_block *mem_repeat_block_list; /* array, size corresponds to num_mrs */
          size_t *repeat_count; /* array size corresponds to stride_dim */
          uint32_t stride_dim;
        } rb;
      } mem_list;
    } umr;
    /* other members of the union are omitted here */
} ext_op;
 
where 
enum ibv_exp_umr_wr_type {
  IBV_EXP_UMR_MR_LIST,
  IBV_EXP_UMR_REPEAT
};
 
and
 
struct ibv_exp_mkey_list_container {
  uint32_t max_klm_list_size;
  struct ibv_context *context;
};
 
struct ibv_exp_mem_region {
  uint64_t base_addr;
  struct ibv_mr *mr;
  size_t length;
};
 
and
 
struct ibv_exp_mem_repeat_block {
  uint64_t base_addr; /* array, size corresponds to ndim */
  struct ibv_mr *mr;
  size_t *byte_count; /* array, size corresponds to ndim */
  size_t *stride; /* array, size corresponds to ndim */
};

Cross-Channel Communications Support

Cross-Channel Communications adds support for work requests that are used for synchronizing communication between separate QPs, and support for data reductions. This functionality is, for example, sufficient for implementing MPI collective communication with a single post of work requests, with the need to check only for completion of the full communication, rather than for completion of individual work requests.

Terms relevant to the Cross-Channel Synchronization are defined in the following table:

Term	Description
Cross Channel supported QP	QP that allows send_enable, recv_enable, wait, and reduction tasks.
Managed send QP	Work requests in the corresponding send queues must be explicitly enabled before they can be executed.
Managed receive QP	Work requests in the corresponding receive queues must be explicitly enabled before they can be executed.
Master Queue	Queue that uses send_enable and/or recv_enable work requests to enable tasks in managed QP. A QP can be both master and managed QP.
Wait task (n)	Task that completes when n completion tasks appear in the specified completion queue.
Send Enable task (n)	Enables the next n send tasks in the specified send queue to be executable.
Receive Enable task (n)	Enables the next n receive tasks in the specified receive queue to be executable.
Reduction operation	Data reduction operation to be executed by the HCA on specified data.
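
The relationship between a managed send queue and the Send Enable and Wait tasks defined in the table can be pictured with a toy model. Everything below is an illustration with hypothetical names; it is not the verbs interface. The model also clamps enabling to the number of posted requests, which is a simplification made for this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of a managed send queue: posted work requests stay dormant
 * until a Send Enable task releases them. */
struct model_managed_sq {
    unsigned posted;    /* work requests posted so far */
    unsigned enabled;   /* how many of them have been enabled */
    unsigned executed;  /* how many have run */
};

/* Posts n work requests to the managed queue; none may run yet. */
void model_post(struct model_managed_sq *q, unsigned n)
{
    assert(q != NULL);
    q->posted += n;
}

/* Send Enable task (n): makes the next n posted send tasks executable.
 * Enabling beyond what is posted is clamped (simplification). */
void model_send_enable(struct model_managed_sq *q, unsigned n)
{
    assert(q != NULL);
    q->enabled += n;
    if (q->enabled > q->posted)
        q->enabled = q->posted;
}

/* Runs every enabled task that has not run yet; returns how many ran. */
unsigned model_progress(struct model_managed_sq *q)
{
    unsigned ran;

    assert(q != NULL);
    ran = q->enabled - q->executed;
    q->executed = q->enabled;
    return ran;
}

/* Wait task (n): satisfied once n completions have appeared. */
int model_wait_satisfied(unsigned completions_seen, unsigned n)
{
    return completions_seen >= n;
}
```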

Usage Model

Creating completion queues, setting the ignore-overrun bit for the CQs that only hardware will monitor.
Creating and configuring the relevant QPs, setting the flags indicating that Cross-Channel Synchronization work requests are supported, and the appropriate master and managed flags (based on planned QP usage). For example, this may happen when an MPI library creates a new communicator.
Posting the task lists for the compound operations.
Checking the appropriate queue for compound operation completion (completion notification must be requested from the appropriate work request). For example, a user may set up a CQ that receives completion notification for the work request whose completion indicates that the entire collective operation has completed locally.
Destroying the QPs and CQs created for Cross-Channel Synchronization operations once the application is done using them. For example, an MPI library may destroy these resources after it frees all the communicators using them.

Resource Initialization

Device Capabilities

The device query function,
int ibv_exp_query_device(struct ibv_context *context,
               struct ibv_exp_device_attr *attr);
is used to query for device capabilities.
A value of 
IBV_EXP_DEVICE_CROSS_CHANNEL
in exp_device_cap_flags indicates support for Cross-Channel capabilities.
 
In addition, the field calc_cap is used to define what reduction capabilities are supported:
struct ibv_exp_device_attr {
 … 
    struct ibv_exp_device_calc_cap calc_cap;
 …
};
 
where,
struct ibv_exp_device_calc_cap {
  uint64_t    data_types;
  uint64_t    data_sizes;
  uint64_t    int_ops;
  uint64_t    uint_ops;
  uint64_t    fp_ops;
};
Where the operation types are given by:
IBV_EXP_CALC_OP_ADD , /* addition */
IBV_EXP_CALC_OP_BAND, /* bit-wise and */
IBV_EXP_CALC_OP_BXOR, /*bit wise xor */
IBV_EXP_CALC_OP_BOR, /* bit-wise or */
 
and the data sizes supported are described by
IBV_EXP_CALC_DATA_SIZE_64_BIT

Completion Queue

A completion queue (CQ) that will be used with Cross-Channel Synchronization operations needs to be marked as such at creation time. This CQ needs to be initialized with

struct ibv_cq *ibv_exp_create_cq(struct ibv_context *context,
                 int cqe,
                 void *cq_context,
                 struct ibv_comp_channel *channel,
                 int comp_vector,
                 struct ibv_exp_cq_init_attr *attr)
where the new parameter is defined as:
struct ibv_exp_cq_init_attr {
    uint32_t comp_mask;
    uint32_t flags;
};
The appropriate flag to set is:
IBV_EXP_CQ_CREATE_CROSS_CHANNEL 
The comp_mask needs to set the bit,
IBV_EXP_CQ_INIT_ATTR_FLAGS 
To avoid the CQs entering the error state due to lack of CQ processing, the overrun ignore (OI) bit of the Completion Queue Context table must be set.
To set this bit, the following function is used:
/**
 * ibv_exp_modify_cq - Modifies the attributes for the specified CQ.
 * @cq: The CQ to modify.
 * @cq_attr: Specifies the CQ attributes to modify.
 * @cq_attr_mask: A bit-mask used to specify which attributes of the CQ
 *   are being modified.
 */
static inline int ibv_exp_modify_cq(struct ibv_cq *cq,
            struct ibv_exp_cq_attr *cq_attr,
            int cq_attr_mask)
The bit IBV_EXP_CQ_CAP_FLAGS in cq_attr_mask needs to be set, as does the bit IBV_EXP_CQ_ATTR_CQ_CAP_FLAGS in cq_attr's comp_mask. Finally, the bit IBV_EXP_CQ_IGNORE_OVERRUN needs to be set in the field cq_cap_flags.

QP Creation

To configure the QP for Cross-Channel use, the following function is used:

struct ibv_qp *ibv_exp_create_qp(struct ibv_context *context,
    struct ibv_exp_qp_init_attr *qp_init_attr)
 
where
 
struct ibv_exp_qp_init_attr {
  void           *qp_context;
  struct ibv_cq        *send_cq;
  struct ibv_cq        *recv_cq;
  struct ibv_srq         *srq;
  struct ibv_qp_cap cap;
  enum ibv_qp_type  qp_type;
  int     sq_sig_all;
 
  uint32_t    comp_mask; /* use ibv_exp_qp_init_attr_comp_mask */
  struct ibv_pd        *pd;
  struct ibv_xrcd        *xrcd;
  uint32_t    exp_create_flags; /* use ibv_exp_qp_create_flags */
 
  uint32_t    max_inl_recv;
  struct ibv_exp_qpg  qpg;
  uint32_t    max_atomic_arg;
  uint32_t                max_inl_send_klms;
};

The exp_create_flags that are available are

IBV_EXP_QP_CREATE_CROSS_CHANNEL - This must be set for any QP to which cross-channel-synchronization work requests will be posted.
IBV_EXP_QP_CREATE_MANAGED_SEND - This is set for a managed send QP, i.e. one for which send-enable operations are used to activate the posted send requests.
IBV_EXP_QP_CREATE_MANAGED_RECV - This is set for a managed receive QP, i.e. one for which receive-enable operations are used to activate the posted receive requests.

Posting Request List

A single operation is defined by a set of work requests posted to multiple QPs, as described in the figure below.

The lists of tasks are NULL-terminated.

The routine

int ibv_exp_post_task(struct ibv_context *context, struct ibv_exp_task *task, struct ibv_exp_task **bad_task)
is used to post the list of work requests, with
 
struct ibv_exp_task {
  enum ibv_exp_task_type  task_type;
  struct {
    struct ibv_qp  *qp;
    union {
      struct ibv_exp_send_wr  *send_wr;
      struct ibv_recv_wr  *recv_wr;
    };
  } item;
  struct ibv_exp_task    *next;
  uint32_t                comp_mask; /* reserved for future growth (must be 0) */
};
 
The task type is defined by:
 
IBV_EXP_TASK_SEND
IBV_EXP_TASK_RECV
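As a minimal sketch of how such a NULL-terminated task list might be chained, the following uses local stand-in types (prefixed demo_) that copy the layout of struct ibv_exp_task shown above. The stand-in QP and work-request structs, the enum values, and the helper names are hypothetical and exist only so the sketch compiles without the experimental verbs header.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the verbs types; NOT the library's definitions. */
struct demo_qp      { int id; };
struct demo_send_wr { uint64_t wr_id; };
struct demo_recv_wr { uint64_t wr_id; };

enum demo_task_type { DEMO_TASK_SEND, DEMO_TASK_RECV };

/* Same shape as struct ibv_exp_task in the text above. */
struct demo_task {
    enum demo_task_type task_type;
    struct {
        struct demo_qp *qp;
        union {
            struct demo_send_wr *send_wr;
            struct demo_recv_wr *recv_wr;
        };
    } item;
    struct demo_task *next;
    uint32_t comp_mask; /* reserved, must be 0 */
};

/* Initialize a send task and link it in front of next (NULL ends the list). */
void demo_task_init_send(struct demo_task *t, struct demo_qp *qp,
                         struct demo_send_wr *wr, struct demo_task *next)
{
    assert(t != NULL);
    t->task_type    = DEMO_TASK_SEND;
    t->item.qp      = qp;
    t->item.send_wr = wr;
    t->next         = next;
    t->comp_mask    = 0;
}

/* Initialize a receive task and link it in front of next. */
void demo_task_init_recv(struct demo_task *t, struct demo_qp *qp,
                         struct demo_recv_wr *wr, struct demo_task *next)
{
    assert(t != NULL);
    t->task_type    = DEMO_TASK_RECV;
    t->item.qp      = qp;
    t->item.recv_wr = wr;
    t->next         = next;
    t->comp_mask    = 0;
}

/* Walk the list until the terminating NULL and count the tasks. */
size_t demo_task_count(const struct demo_task *head)
{
    size_t n = 0;
    for (; head != NULL; head = head->next)
        n++;
    return n;
}
```

Building the list back to front (last task first, with next set to NULL) keeps each initializer a single call; the head of the resulting list is what would be handed to ibv_exp_post_task.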
 
To support the new work requests, the struct ibv_exp_send_wr is expanded with
union {
  struct {
    uint64_t    remote_addr;
    uint32_t    rkey;
  } rdma;
  struct {
    uint64_t    remote_addr;
    uint64_t    compare_add;
    uint64_t    swap;
    uint32_t    rkey;
  } atomic;
  struct {
    struct ibv_cq *cq;
    int32_t        cq_count;
  } cqe_wait;
  struct {
    struct ibv_qp *qp;
    int32_t        wqe_count;
  } wqe_enable;
} task;
 
The calc operation is also defined in ibv_exp_send_wr by the union:
 
union {
  struct {
    enum ibv_exp_calc_op        calc_op;
    enum ibv_exp_calc_data_type data_type;
    enum ibv_exp_calc_data_size data_size;
  } calc;
} op;

In addition, in the exp_send_flags field of ibv_exp_send_wr, the flag IBV_EXP_SEND_WITH_CALC indicates the presence of a reduction operation, and IBV_EXP_SEND_WAIT_EN_LAST signals the last wait task posted for a given CQ in the current task list.

For ibv_exp_calc_data_type, the following types are supported:

IBV_EXP_CALC_DATA_TYPE_INT
IBV_EXP_CALC_DATA_TYPE_UINT
IBV_EXP_CALC_DATA_TYPE_FLOAT

The supported data size for ibv_exp_calc_data_size is IBV_EXP_CALC_DATA_SIZE_64_BIT.

New send opcodes are defined for the new work requests. These include:

IBV_EXP_WR_SEND_ENABLE
IBV_EXP_WR_RECV_ENABLE
IBV_EXP_WR_CQE_WAIT

ConnectX-3/Connect-IB Data Endianness

The ConnectX-3 and Connect-IB HCAs expect to get the data in network (big-endian) byte order.
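One way to satisfy this requirement, assuming the operands are 64-bit integers that the host writes into a buffer before posting, is to serialize each value byte by byte in big-endian order. The helper names below are hypothetical and not part of the verbs API; the sketch is independent of host endianness.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Store a 64-bit host value into buf in network (big-endian) byte order. */
void demo_store_be64(uint8_t buf[8], uint64_t v)
{
    assert(buf != NULL);
    for (int i = 0; i < 8; i++)
        buf[i] = (uint8_t)(v >> (56 - 8 * i)); /* most significant byte first */
}

/* Read a 64-bit value back from a big-endian buffer into host order. */
uint64_t demo_load_be64(const uint8_t buf[8])
{
    assert(buf != NULL);
    uint64_t v = 0;
    for (int i = 0; i < 8; i++)
        v = (v << 8) | buf[i];
    return v;
}
```

On Linux with glibc, htobe64 and be64toh from <endian.h> perform the same conversion on whole 64-bit words; the byte-wise version is shown here only because it makes the resulting memory layout explicit.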
