1. cuObjClient API Specification v1.0.0#

1.1. Overview#

The cuObjClient library provides client side APIs for performing high performance GET and PUT operations with RDMA (Remote Direct Memory Access) acceleration and GPUDirect Storage support.

Key features include:

  • Memory registration and management for RDMA transfers

  • Synchronous GET and PUT operations with user defined callbacks

  • Support for system memory, CUDA managed memory, and CUDA device memory

  • RDMA descriptor generation and management

  • Telemetry and logging capabilities

  • Maximum memory registration size: 4 GiB per buffer

Protocol support:

  • CUOBJ_PROTO_RDMA_DC_V1 (RDMA Dynamically Connected version 1)

1.2. I/O flow#

cuObjClient follows a callback based architecture:

  • The user application creates a cuObjClient instance with callback operations.

  • Memory registration via cuMemObjGetDescriptor() prepares buffers for RDMA.

  • I O operations (cuObjGet() and cuObjPut()) trigger callbacks with RDMA information.

  • Callbacks handle data transfer logic and communication with the server.

  • Cleanup via cuMemObjPutDescriptor() deregisters memory.

1.3. Core Types and Enumerations#

1.3.1. Error Types#

typedef enum cuObjErr_enum {
    CU_OBJ_SUCCESS = 0,    // Operation successfully completed
    CU_OBJ_FAIL = 1        // Operation failed
} cuObjErr_t;

1.3.2. Protocol Types#

typedef enum cuObjProto_enum {
    CUOBJ_PROTO_RDMA_DC_V1 = 1001,    // RDMA Dynamically Connected version 1
    CUOBJ_PROTO_MAX
} cuObjProto_t;

1.3.3. Operation Types#

typedef enum cuObjOpType_enum {
    CUOBJ_GET = 0,        // GET operation (read from server)
    CUOBJ_PUT = 1,        // PUT operation (write to server)
    CUOBJ_INVALID = 9999
} cuObjOpType_t;

1.3.4. Memory Types#

typedef enum cuObjMemoryType_enum {
    CUOBJ_MEMORY_SYSTEM = 0,        // System (host) memory
    CUOBJ_MEMORY_CUDA_MANAGED = 1,  // CUDA unified/managed memory
    CUOBJ_MEMORY_CUDA_DEVICE = 2,   // CUDA device memory
    CUOBJ_MEMORY_UNKNOWN = 3,       // Unknown memory type
    CUOBJ_MEMORY_INVALID = 4        // Invalid memory type
} cuObjMemoryType_t;

1.4. cuObjClient Class API#

1.4.1. Class Declaration#

class cuObjClient {
public:
    // Constructor & Destructor
    cuObjClient(CUObjOps_t& ops, cuObjProto_t proto = CUOBJ_PROTO_RDMA_DC_V1);
    ~cuObjClient();

    // Memory Management
    cuObjErr_t cuMemObjGetDescriptor(void* ptr, size_t size);
    cuObjErr_t cuMemObjPutDescriptor(void* ptr);
    ssize_t cuMemObjGetMaxRequestCallbackSize(void* ptr);
    cuObjErr_t cuMemObjGetRDMAToken(void* ptr, size_t size, size_t buffer_offset,
                                     cuObjOpType_t operation, char** desc_str_out);
    cuObjErr_t cuMemObjPutRDMAToken(char* desc_str);

    // I/O Operations
    ssize_t cuObjGet(void* ctx, void* ptr, size_t size, loff_t offset = 0,
                     loff_t buf_offset = 0);
    ssize_t cuObjPut(void* ctx, void* ptr, size_t size, loff_t offset = 0,
                     loff_t buf_offset = 0);

    // Connection Management
    bool isConnected(void);

    // Static Utilities
    static void* getCtx(const void* handle);
    static cuObjMemoryType_t getMemoryType(const void* ptr);

    // Telemetry Management
    static void setupTelemetry(bool use_OTEL, std::ostream* os);
    static void shutdownTelemetry();
    static void setTelemFlags(unsigned flags);
};

1.4.2. Constructor#

Creates a new cuObjClient instance with user defined callback operations.

cuObjClient(CUObjOps_t& ops, cuObjProto_t proto = CUOBJ_PROTO_RDMA_DC_V1);

Parameters:

  • ops: Reference to callback operations structure (CUObjIOOps).

  • proto: RDMA descriptor protocol. Defaults to CUOBJ_PROTO_RDMA_DC_V1.

Notes:

  • Callbacks are invoked during cuObjGet() and cuObjPut() operations.

  • Callbacks may be called from different threads than the caller thread.

1.5. Memory Management APIs#

1.5.1. cuMemObjGetDescriptor#

Registers memory for RDMA operations and prepares it for data transfer.

cuObjErr_t cuMemObjGetDescriptor(void* ptr, size_t size);

Parameters:

  • ptr: Start address of user memory.

  • size: Size of memory to register. Maximum 4 GiB.

Returns:

  • CU_OBJ_SUCCESS on successful registration.

  • CU_OBJ_FAIL on failure.

Notes:

  • Memory sizes greater than or equal to CUOBJ_MAX_MEMORY_REG_SIZE (4 GiB) are not supported.

  • Memory must remain valid until cuMemObjPutDescriptor() is called.

  • Supports system memory, CUDA managed memory, and CUDA device memory.

1.5.2. cuMemObjPutDescriptor#

Unregisters previously registered memory and releases associated resources.

cuObjErr_t cuMemObjPutDescriptor(void* ptr);

Parameters:

  • ptr: Start address of memory. Must match the cuMemObjGetDescriptor call.

Returns:

  • CU_OBJ_SUCCESS on successful deregistration.

  • CU_OBJ_FAIL if memory cannot be unregistered.

Notes:

  • Call after all I O operations on this memory have completed.

  • ptr must match the address used in cuMemObjGetDescriptor().

1.5.3. cuMemObjGetMaxRequestCallbackSize#

Gets the maximum callback size for registered memory.

ssize_t cuMemObjGetMaxRequestCallbackSize(void* ptr);

Parameters:

  • ptr: Start address of registered memory.

Returns:

  • Maximum callback size in bytes for this memory pointer.

  • May be smaller than allocated memory if the RDMA subsystem has limitations.

Notes:

  • The callback may be invoked multiple times if the requested I O size exceeds this limit.

  • Returns the chunk size used in each callback invocation.

1.5.4. cuMemObjGetRDMAToken#

Generates an RDMA descriptor string for a registered memory buffer.

cuObjErr_t cuMemObjGetRDMAToken(void* ptr, size_t size, size_t buffer_offset,
                                 cuObjOpType_t operation, char** desc_str_out);

Parameters:

  • ptr: Start address of registered memory. Must be previously registered.

  • size: Size of the memory region for the descriptor.

  • buffer_offset: Offset from the base address to start the descriptor region.

  • operation: Operation type, CUOBJ_GET or CUOBJ_PUT.

  • desc_str_out: Pointer to store the allocated descriptor string (output).

Returns:

  • CU_OBJ_SUCCESS on success.

  • CU_OBJ_FAIL on failure.

Notes:

  • Memory must be registered with cuMemObjGetDescriptor() before calling.

  • The descriptor string is allocated by the underlying cuFile API.

  • Free the string using cuMemObjPutRDMAToken().

  • buffer_offset + size must not exceed the originally registered buffer size.

  • The buffer must be RDMA capable for this function to succeed.

1.5.5. cuMemObjPutRDMAToken#

Frees an RDMA descriptor string allocated by cuMemObjGetRDMAToken.

cuObjErr_t cuMemObjPutRDMAToken(char* desc_str);

Parameters:

  • desc_str: Descriptor string allocated by cuMemObjGetRDMAToken.

Returns:

  • CU_OBJ_SUCCESS on success.

  • CU_OBJ_FAIL on failure.

Notes:

  • Call to free memory allocated by cuMemObjGetRDMAToken().

  • Calls the underlying cuFileRDMADescStrPut() function.

1.6. I/O Operations#

1.6.1. cuObjGet#

Performs a GET operation to read data from the remote server.

ssize_t cuObjGet(void* ctx, void* ptr, size_t size, loff_t offset = 0,
                 loff_t buf_offset = 0);

Parameters:

  • ctx: User context pointer passed to the GET callback. Retrieve via getCtx().

  • ptr: Pointer to local memory buffer. Must be registered.

  • size: Size of the GET operation in bytes.

  • offset: Object offset. Reserved for future use. Set to 0.

  • buf_offset: Buffer offset. Reserved for future use. Set to 0.

Returns:

  • Data size returned by server on success (greater than or equal to 0).

  • Negative error code on failure.

Notes:

  • Register the memory buffer via cuMemObjGetDescriptor() before calling.

  • The GET callback is invoked one or more times during this operation.

  • If size exceeds MaxRequestCallbackSize, multiple callbacks are invoked.

  • This is a synchronous operation and returns after the operation completes.

1.6.2. cuObjPut#

Performs a PUT operation to write data to the remote server.

ssize_t cuObjPut(void* ctx, void* ptr, size_t size, loff_t offset = 0,
                 loff_t buf_offset = 0);

Parameters:

  • ctx: User context pointer passed to the PUT callback. Retrieve via getCtx().

  • ptr: Pointer to local memory buffer. Must be registered.

  • size: Size of the PUT operation in bytes.

  • offset: Object offset. Reserved for future use. Set to 0.

  • buf_offset: Buffer offset. Reserved for future use. Set to 0.

Returns:

  • Data size returned by server on success (greater than or equal to 0).

  • Negative error code on failure.

Notes:

  • Register the memory buffer via cuMemObjGetDescriptor() before calling.

  • The PUT callback is invoked one or more times during this operation.

  • If size exceeds MaxRequestCallbackSize, multiple callbacks are invoked.

  • This is a synchronous operation and returns after the operation completes.

1.7. Connection Management#

1.7.1. isConnected#

Checks if the client is connected and ready for operations.

bool isConnected(void);

Returns:

  • true if the client is connected and operational.

  • false otherwise.

1.8. Static Utility Methods#

1.8.1. getCtx#

Extracts the user context from the callback handle.

static void* getCtx(const void* handle);

Parameters:

  • handle: Handle from GET or PUT callback.

Returns:

  • User context pointer passed to cuObjGet() or cuObjPut().

Notes:

  • Call from within the callback to retrieve the user context.

  • handle is the first parameter passed to the callback.

1.8.2. getMemoryType#

Determines the memory type of a given pointer.

static cuObjMemoryType_t getMemoryType(const void* ptr);

Parameters:

  • ptr: Memory pointer to analyze.

Returns:

  • Memory type enumeration (cuObjMemoryType_t).

Possible values:

  • CUOBJ_MEMORY_SYSTEM: System host memory

  • CUOBJ_MEMORY_CUDA_MANAGED: CUDA managed memory

  • CUOBJ_MEMORY_CUDA_DEVICE: CUDA device memory

  • CUOBJ_MEMORY_UNKNOWN: Unknown memory type

  • CUOBJ_MEMORY_INVALID: Invalid memory type

1.9. Telemetry Management#

1.9.1. setupTelemetry#

Configures telemetry output stream for logging and monitoring.

static void setupTelemetry(bool use_OTEL, std::ostream* os);

Parameters:

  • use_OTEL: Enable OpenTelemetry integration.

  • os: Output stream for telemetry data.

Notes:

  • The output stream must remain valid until shutdownTelemetry() is called.

  • The stream must remain valid until all cuObjClient objects are destroyed.

  • Affects all cuObjClient instances.

1.9.2. shutdownTelemetry#

Shuts down telemetry and resets to the default output stream.

static void shutdownTelemetry();

Notes:

  • Call before closing custom output streams.

  • Telemetry is fully closed when all cuObjClient objects are destroyed.

1.9.3. setTelemFlags#

Configures the telemetry logging level and flags.

static void setTelemFlags(unsigned flags);

Parameters:

  • flags: Telemetry configuration flags.

1.10. Callback Interface#

1.10.1. CUObjIOOps Structure#

User defined callback operations structure.

typedef struct CUObjIOOps {
    ssize_t (*get)(const void* handle, char* ptr, size_t size, loff_t offset,
                   const cufileRDMAInfo_t* rdma_info);
    ssize_t (*put)(const void* handle, const char* ptr, size_t size, loff_t offset,
                   const cufileRDMAInfo_t* rdma_info);
} CUObjOps_t;

1.10.2. GET Callback#

Called during cuObjGet() to handle data retrieval.

ssize_t (*get)(const void* handle, char* ptr, size_t size, loff_t offset,
               const cufileRDMAInfo_t* rdma_info);

Parameters:

  • handle: Cookie to user context. Use cuObjClient::getCtx(handle) to extract.

  • ptr: Pointer to memory chunk to be filled with data.

  • size: Size of the memory chunk being read.

  • offset: Starting object offset for this chunk.

  • rdma_info: RDMA memory descriptor information for out of band communication.

Returns:

  • Size of data read on success.

  • -1 on failure.

Notes:

  • offset is 0 when MaxRequestCallbackSize is greater than or equal to the total request size.

  • size equals the total request size when MaxRequestCallbackSize is greater than or equal to the total request size.

  • Multiple callbacks may be invoked for large requests.

  • The callback may be called from a different thread than the caller.

  • The callback must communicate with cuObjServer using the RDMA descriptor information.

1.10.3. PUT Callback#

Called during cuObjPut() to handle data transmission.

ssize_t (*put)(const void* handle, const char* ptr, size_t size, loff_t offset,
               const cufileRDMAInfo_t* rdma_info);

Parameters:

  • handle: Cookie to user context. Use cuObjClient::getCtx(handle) to extract.

  • ptr: Pointer to memory chunk containing data to write.

  • size: Size of the memory chunk being written.

  • offset: Starting object offset for this chunk.

  • rdma_info: RDMA memory descriptor information for out of band communication.

Returns:

  • Size of data written on success.

  • -1 on failure.

Notes:

  • offset is 0 when MaxRequestCallbackSize is greater than or equal to the total request size.

  • size equals the total request size when MaxRequestCallbackSize is greater than or equal to the total request size.

  • Multiple callbacks may be invoked for large requests.

  • The callback may be called from a different thread than the caller.

  • The callback must communicate with cuObjServer using the RDMA descriptor information.

1.11. Error Handling#

1.11.1. Error Codes#

  • CU_OBJ_SUCCESS (0): Operation completed successfully

  • CU_OBJ_FAIL (1): Operation failed

1.11.2. Return Value Conventions#

  • Memory management APIs return cuObjErr_t.

  • I O operations return ssize_t (positive is bytes transferred, negative is error).

  • Connection status returns boolean values.

  • Utility functions return pointers or specific types as documented.

1.11.3. Best Practices#

  • Always check return values for error conditions.

  • Ensure proper cleanup in error paths, deregister memory and free tokens.

  • Callbacks should return -1 on errors to propagate failures.

  • Verify memory is registered before performing I O operations.

  • Keep RDMA tokens valid only as long as needed.

1.12. Usage Patterns#

1.12.1. Basic Client Usage#

// 1. Define callback operations
CUObjOps_t ops = {
    .get = my_get_callback,
    .put = my_put_callback
};

// 2. Create client instance
cuObjClient client(ops, CUOBJ_PROTO_RDMA_DC_V1);

// 3. Allocate and register memory
void* buffer = malloc(1024 * 1024);  // 1MB buffer
if (client.cuMemObjGetDescriptor(buffer, 1024 * 1024) != CU_OBJ_SUCCESS) {
    // Handle error
}

// 4. Perform GET operation
MyContext ctx = { /* user data */ };
ssize_t result = client.cuObjGet(&ctx, buffer, 1024, 0, 0);
if (result < 0) {
    // Handle error
}

// 5. Perform PUT operation
result = client.cuObjPut(&ctx, buffer, 1024, 0, 0);
if (result < 0) {
    // Handle error
}

// 6. Cleanup
client.cuMemObjPutDescriptor(buffer);
free(buffer);

1.12.2. Callback Implementation Example#

ssize_t my_get_callback(const void* handle, char* ptr, size_t size,
                        loff_t offset, const cufileRDMAInfo_t* rdma_info) {
    // Extract user context
    MyContext* ctx = (MyContext*)cuObjClient::getCtx(handle);

    // Get RDMA descriptor information
    const char* rdma_desc = (const char*)rdma_info;

    // Communicate with server via control path
    // Send: operation=GET, size, offset, rdma_desc
    ssize_t bytes_read = send_request_to_server(ctx->server_conn,
                                                "GET", size, offset, rdma_desc);

    // Server performs RDMA_WRITE directly to ptr
    return bytes_read;
}

ssize_t my_put_callback(const void* handle, const char* ptr, size_t size,
                        loff_t offset, const cufileRDMAInfo_t* rdma_info) {
    // Extract user context
    MyContext* ctx = (MyContext*)cuObjClient::getCtx(handle);

    // Get RDMA descriptor information
    const char* rdma_desc = (const char*)rdma_info;

    // Communicate with server via control path
    // Send: operation=PUT, size, offset, rdma_desc
    ssize_t bytes_written = send_request_to_server(ctx->server_conn,
                                                   "PUT", size, offset, rdma_desc);

    // Server performs RDMA_READ directly from ptr
    return bytes_written;
}

1.12.3. CUDA Device Memory Usage#

// Allocate CUDA device memory
void* d_buffer;
cudaMalloc(&d_buffer, 1024 * 1024);

// Check memory type
cuObjMemoryType_t mem_type = cuObjClient::getMemoryType(d_buffer);
assert(mem_type == CUOBJ_MEMORY_CUDA_DEVICE);

// Register for RDMA
client.cuMemObjGetDescriptor(d_buffer, 1024 * 1024);

// Perform I/O operations
client.cuObjGet(&ctx, d_buffer, 1024, 0, 0);

// Cleanup
client.cuMemObjPutDescriptor(d_buffer);
cudaFree(d_buffer);

1.12.4. Manual RDMA Token Management#

// Register memory
void* buffer = malloc(1024 * 1024);
client.cuMemObjGetDescriptor(buffer, 1024 * 1024);

// Get RDMA token for a specific region
char* rdma_token = nullptr;
size_t region_size = 4096;
size_t region_offset = 1024;
cuObjErr_t err = client.cuMemObjGetRDMAToken(buffer, region_size,
                                             region_offset, CUOBJ_GET,
                                             &rdma_token);
if (err == CU_OBJ_SUCCESS) {
    // Use rdma_token for custom RDMA operations

    // Free the token
    client.cuMemObjPutRDMAToken(rdma_token);
}

// Cleanup
client.cuMemObjPutDescriptor(buffer);
free(buffer);

1.12.5. Telemetry Configuration#

// Setup telemetry with custom output stream
std::ofstream log_file("cuobj_client.log");
cuObjClient::setupTelemetry(false, &log_file);
cuObjClient::setTelemFlags(0xFFFF);  // Enable all logging

// Create and use client
cuObjClient client(ops);
// ... perform operations ...

// Shutdown telemetry before closing file
cuObjClient::shutdownTelemetry();
log_file.close();

1.13. Constants and Limits#

1.13.1. Memory Limits#

#define CUOBJ_MAX_MEMORY_REG_SIZE (4ULL * 1024 * 1024 * 1024)  // 4GiB
  • Maximum memory registration size: 4 GiB per cuMemObjGetDescriptor() call.

  • No explicit limit on the number of registered buffers.

  • Total RDMA resources are limited by system capabilities.

1.13.2. Protocol Constants#

#define OBJ_RDMA_V1 "CUOBJ"
  • Protocol identifier string: "CUOBJ".

  • Default protocol: CUOBJ_PROTO_RDMA_DC_V1 (value: 1001).

1.13.3. Operation Limits#

  • I O operations are split into chunks if they exceed MaxRequestCallbackSize.

  • Each callback invocation receives a contiguous memory chunk.

  • Total I O size is limited by the registered buffer size.

1.14. Thread Safety#

1.14.1. Client Thread Safety#

  • Constructor and destructor are not thread safe. Create and destroy from a single thread.

  • Memory registration is not thread safe for the same buffer. Synchronize if multiple threads register or deregister the same memory.

  • I O operations are thread safe for different buffers. Multiple threads can call cuObjGet() and cuObjPut() concurrently with different registered buffers.

  • Callbacks may be invoked from different threads than the caller. Synchronize access to shared resources within callbacks.

  • Each callback invocation is serialized for a single I O operation.

  • Static methods are thread safe with internal synchronization (getCtx, getMemoryType, telemetry methods).

1.14.2. Synchronization Guidelines#

// SAFE: Multiple threads with different buffers
void* buffer1 = malloc(1024);
void* buffer2 = malloc(1024);
client.cuMemObjGetDescriptor(buffer1, 1024);
client.cuMemObjGetDescriptor(buffer2, 1024);

std::thread t1([&]() { client.cuObjGet(&ctx1, buffer1, 1024); });
std::thread t2([&]() { client.cuObjGet(&ctx2, buffer2, 1024); });

// UNSAFE: Multiple threads accessing same buffer
// Add external synchronization
std::mutex buffer_mutex;
std::thread t3([&]() {
    std::lock_guard<std::mutex> lock(buffer_mutex);
    client.cuObjGet(&ctx, buffer1, 512);
});

1.15. Best Practices#

  • Memory alignment: While not strictly required, page aligned buffers (4 KB) provide optimal performance.

  • RDMA capability: Not all memory types support RDMA. Device memory requires GPUDirect RDMA support.

  • Error recovery: If callbacks return -1, the I O operation fails immediately and returns an error.

  • Context lifetime: User context passed to cuObjGet and cuObjPut must remain valid until the operation completes.

  • Descriptor lifetime: RDMA descriptors are only valid during the callback invocation.

  • Connection status: Check isConnected() after construction to ensure the client initialized successfully.

  • Cleanup order: Deregister memory before freeing buffers.

  • Callback performance: Keep callbacks fast and avoid heavy processing.