1. cuObjClient API Specification v1.0.0#
1.1. Overview#
The cuObjClient library provides client side APIs for performing high performance GET and PUT operations with RDMA (Remote Direct Memory Access) acceleration and GPUDirect Storage support.
Key features include:
Memory registration and management for RDMA transfers
Synchronous GET and PUT operations with user defined callbacks
Support for system memory, CUDA managed memory, and CUDA device memory
RDMA descriptor generation and management
Telemetry and logging capabilities
Maximum memory registration size: 4 GiB per buffer
Protocol support:
CUOBJ_PROTO_RDMA_DC_V1(RDMA Dynamically Connected version 1)
1.2. I/O flow#
cuObjClient follows a callback based architecture:
The user application creates a
cuObjClientinstance with callback operations.Memory registration via
cuMemObjGetDescriptor()prepares buffers for RDMA.I O operations (
cuObjGet()andcuObjPut()) trigger callbacks with RDMA information.Callbacks handle data transfer logic and communication with the server.
Cleanup via
cuMemObjPutDescriptor()deregisters memory.
1.3. Core Types and Enumerations#
1.3.1. Error Types#
typedef enum cuObjErr_enum {
CU_OBJ_SUCCESS = 0, // Operation successfully completed
CU_OBJ_FAIL = 1 // Operation failed
} cuObjErr_t;
1.3.2. Protocol Types#
typedef enum cuObjProto_enum {
CUOBJ_PROTO_RDMA_DC_V1 = 1001, // RDMA Dynamically Connected version 1
CUOBJ_PROTO_MAX
} cuObjProto_t;
1.3.3. Operation Types#
typedef enum cuObjOpType_enum {
CUOBJ_GET = 0, // GET operation (read from server)
CUOBJ_PUT = 1, // PUT operation (write to server)
CUOBJ_INVALID = 9999
} cuObjOpType_t;
1.3.4. Memory Types#
typedef enum cuObjMemoryType_enum {
CUOBJ_MEMORY_SYSTEM = 0, // System (host) memory
CUOBJ_MEMORY_CUDA_MANAGED = 1, // CUDA unified/managed memory
CUOBJ_MEMORY_CUDA_DEVICE = 2, // CUDA device memory
CUOBJ_MEMORY_UNKNOWN = 3, // Unknown memory type
CUOBJ_MEMORY_INVALID = 4 // Invalid memory type
} cuObjMemoryType_t;
1.4. cuObjClient Class API#
1.4.1. Class Declaration#
class cuObjClient {
public:
// Constructor & Destructor
cuObjClient(CUObjOps_t& ops, cuObjProto_t proto = CUOBJ_PROTO_RDMA_DC_V1);
~cuObjClient();
// Memory Management
cuObjErr_t cuMemObjGetDescriptor(void* ptr, size_t size);
cuObjErr_t cuMemObjPutDescriptor(void* ptr);
ssize_t cuMemObjGetMaxRequestCallbackSize(void* ptr);
cuObjErr_t cuMemObjGetRDMAToken(void* ptr, size_t size, size_t buffer_offset,
cuObjOpType_t operation, char** desc_str_out);
cuObjErr_t cuMemObjPutRDMAToken(char* desc_str);
// I/O Operations
ssize_t cuObjGet(void* ctx, void* ptr, size_t size, loff_t offset = 0,
loff_t buf_offset = 0);
ssize_t cuObjPut(void* ctx, void* ptr, size_t size, loff_t offset = 0,
loff_t buf_offset = 0);
// Connection Management
bool isConnected(void);
// Static Utilities
static void* getCtx(const void* handle);
static cuObjMemoryType_t getMemoryType(const void* ptr);
// Telemetry Management
static void setupTelemetry(bool use_OTEL, std::ostream* os);
static void shutdownTelemetry();
static void setTelemFlags(unsigned flags);
};
1.4.2. Constructor#
Creates a new cuObjClient instance with user defined callback operations.
cuObjClient(CUObjOps_t& ops, cuObjProto_t proto = CUOBJ_PROTO_RDMA_DC_V1);
Parameters:
ops: Reference to callback operations structure (CUObjIOOps).proto: RDMA descriptor protocol. Defaults toCUOBJ_PROTO_RDMA_DC_V1.
Notes:
Callbacks are invoked during
cuObjGet()andcuObjPut()operations.Callbacks may be called from different threads than the caller thread.
1.5. Memory Management APIs#
1.5.1. cuMemObjGetDescriptor#
Registers memory for RDMA operations and prepares it for data transfer.
cuObjErr_t cuMemObjGetDescriptor(void* ptr, size_t size);
Parameters:
ptr: Start address of user memory.size: Size of memory to register. Maximum 4 GiB.
Returns:
CU_OBJ_SUCCESSon successful registration.CU_OBJ_FAILon failure.
Notes:
Memory sizes greater than or equal to
CUOBJ_MAX_MEMORY_REG_SIZE(4 GiB) are not supported.Memory must remain valid until
cuMemObjPutDescriptor()is called.Supports system memory, CUDA managed memory, and CUDA device memory.
1.5.2. cuMemObjPutDescriptor#
Unregisters previously registered memory and releases associated resources.
cuObjErr_t cuMemObjPutDescriptor(void* ptr);
Parameters:
ptr: Start address of memory. Must match thecuMemObjGetDescriptorcall.
Returns:
CU_OBJ_SUCCESSon successful deregistration.CU_OBJ_FAILif memory cannot be unregistered.
Notes:
Call after all I O operations on this memory have completed.
ptrmust match the address used incuMemObjGetDescriptor().
1.5.3. cuMemObjGetMaxRequestCallbackSize#
Gets the maximum callback size for registered memory.
ssize_t cuMemObjGetMaxRequestCallbackSize(void* ptr);
Parameters:
ptr: Start address of registered memory.
Returns:
Maximum callback size in bytes for this memory pointer.
May be smaller than allocated memory if the RDMA subsystem has limitations.
Notes:
The callback may be invoked multiple times if the requested I O size exceeds this limit.
Returns the chunk size used in each callback invocation.
1.5.4. cuMemObjGetRDMAToken#
Generates an RDMA descriptor string for a registered memory buffer.
cuObjErr_t cuMemObjGetRDMAToken(void* ptr, size_t size, size_t buffer_offset,
cuObjOpType_t operation, char** desc_str_out);
Parameters:
ptr: Start address of registered memory. Must be previously registered.size: Size of the memory region for the descriptor.buffer_offset: Offset from the base address to start the descriptor region.operation: Operation type,CUOBJ_GETorCUOBJ_PUT.desc_str_out: Pointer to store the allocated descriptor string (output).
Returns:
CU_OBJ_SUCCESSon success.CU_OBJ_FAILon failure.
Notes:
Memory must be registered with
cuMemObjGetDescriptor()before calling.The descriptor string is allocated by the underlying
cuFileAPI.Free the string using
cuMemObjPutRDMAToken().buffer_offset + sizemust not exceed the originally registered buffer size.The buffer must be RDMA capable for this function to succeed.
1.5.5. cuMemObjPutRDMAToken#
Frees an RDMA descriptor string allocated by cuMemObjGetRDMAToken.
cuObjErr_t cuMemObjPutRDMAToken(char* desc_str);
Parameters:
desc_str: Descriptor string allocated bycuMemObjGetRDMAToken.
Returns:
CU_OBJ_SUCCESSon success.CU_OBJ_FAILon failure.
Notes:
Call to free memory allocated by
cuMemObjGetRDMAToken().Calls the underlying
cuFileRDMADescStrPut()function.
1.6. I/O Operations#
1.6.1. cuObjGet#
Performs a GET operation to read data from the remote server.
ssize_t cuObjGet(void* ctx, void* ptr, size_t size, loff_t offset = 0,
loff_t buf_offset = 0);
Parameters:
ctx: User context pointer passed to the GET callback. Retrieve viagetCtx().ptr: Pointer to local memory buffer. Must be registered.size: Size of the GET operation in bytes.offset: Object offset. Reserved for future use. Set to 0.buf_offset: Buffer offset. Reserved for future use. Set to 0.
Returns:
Data size returned by server on success (greater than or equal to 0).
Negative error code on failure.
Notes:
Register the memory buffer via
cuMemObjGetDescriptor()before calling.The GET callback is invoked one or more times during this operation.
If
sizeexceedsMaxRequestCallbackSize, multiple callbacks are invoked.This is a synchronous operation and returns after the operation completes.
1.6.2. cuObjPut#
Performs a PUT operation to write data to the remote server.
ssize_t cuObjPut(void* ctx, void* ptr, size_t size, loff_t offset = 0,
loff_t buf_offset = 0);
Parameters:
ctx: User context pointer passed to the PUT callback. Retrieve viagetCtx().ptr: Pointer to local memory buffer. Must be registered.size: Size of the PUT operation in bytes.offset: Object offset. Reserved for future use. Set to 0.buf_offset: Buffer offset. Reserved for future use. Set to 0.
Returns:
Data size returned by server on success (greater than or equal to 0).
Negative error code on failure.
Notes:
Register the memory buffer via
cuMemObjGetDescriptor()before calling.The PUT callback is invoked one or more times during this operation.
If
sizeexceedsMaxRequestCallbackSize, multiple callbacks are invoked.This is a synchronous operation and returns after the operation completes.
1.7. Connection Management#
1.7.1. isConnected#
Checks if the client is connected and ready for operations.
bool isConnected(void);
Returns:
trueif the client is connected and operational.falseotherwise.
1.8. Static Utility Methods#
1.8.1. getCtx#
Extracts the user context from the callback handle.
static void* getCtx(const void* handle);
Parameters:
handle: Handle from GET or PUT callback.
Returns:
User context pointer passed to
cuObjGet()orcuObjPut().
Notes:
Call from within the callback to retrieve the user context.
handleis the first parameter passed to the callback.
1.8.2. getMemoryType#
Determines the memory type of a given pointer.
static cuObjMemoryType_t getMemoryType(const void* ptr);
Parameters:
ptr: Memory pointer to analyze.
Returns:
Memory type enumeration (
cuObjMemoryType_t).
Possible values:
CUOBJ_MEMORY_SYSTEM: System host memoryCUOBJ_MEMORY_CUDA_MANAGED: CUDA managed memoryCUOBJ_MEMORY_CUDA_DEVICE: CUDA device memoryCUOBJ_MEMORY_UNKNOWN: Unknown memory typeCUOBJ_MEMORY_INVALID: Invalid memory type
1.9. Telemetry Management#
1.9.1. setupTelemetry#
Configures telemetry output stream for logging and monitoring.
static void setupTelemetry(bool use_OTEL, std::ostream* os);
Parameters:
use_OTEL: Enable OpenTelemetry integration.os: Output stream for telemetry data.
Notes:
The output stream must remain valid until
shutdownTelemetry()is called.The stream must remain valid until all
cuObjClientobjects are destroyed.Affects all
cuObjClientinstances.
1.9.2. shutdownTelemetry#
Shuts down telemetry and resets to the default output stream.
static void shutdownTelemetry();
Notes:
Call before closing custom output streams.
Telemetry is fully closed when all
cuObjClientobjects are destroyed.
1.9.3. setTelemFlags#
Configures the telemetry logging level and flags.
static void setTelemFlags(unsigned flags);
Parameters:
flags: Telemetry configuration flags.
1.10. Callback Interface#
1.10.1. CUObjIOOps Structure#
User defined callback operations structure.
typedef struct CUObjIOOps {
ssize_t (*get)(const void* handle, char* ptr, size_t size, loff_t offset,
const cufileRDMAInfo_t* rdma_info);
ssize_t (*put)(const void* handle, const char* ptr, size_t size, loff_t offset,
const cufileRDMAInfo_t* rdma_info);
} CUObjOps_t;
1.10.2. GET Callback#
Called during cuObjGet() to handle data retrieval.
ssize_t (*get)(const void* handle, char* ptr, size_t size, loff_t offset,
const cufileRDMAInfo_t* rdma_info);
Parameters:
handle: Cookie to user context. UsecuObjClient::getCtx(handle)to extract.ptr: Pointer to memory chunk to be filled with data.size: Size of the memory chunk being read.offset: Starting object offset for this chunk.rdma_info: RDMA memory descriptor information for out of band communication.
Returns:
Size of data read on success.
-1on failure.
Notes:
offsetis 0 whenMaxRequestCallbackSizeis greater than or equal to the total request size.sizeequals the total request size whenMaxRequestCallbackSizeis greater than or equal to the total request size.Multiple callbacks may be invoked for large requests.
The callback may be called from a different thread than the caller.
The callback must communicate with
cuObjServerusing the RDMA descriptor information.
1.10.3. PUT Callback#
Called during cuObjPut() to handle data transmission.
ssize_t (*put)(const void* handle, const char* ptr, size_t size, loff_t offset,
const cufileRDMAInfo_t* rdma_info);
Parameters:
handle: Cookie to user context. UsecuObjClient::getCtx(handle)to extract.ptr: Pointer to memory chunk containing data to write.size: Size of the memory chunk being written.offset: Starting object offset for this chunk.rdma_info: RDMA memory descriptor information for out of band communication.
Returns:
Size of data written on success.
-1on failure.
Notes:
offsetis 0 whenMaxRequestCallbackSizeis greater than or equal to the total request size.sizeequals the total request size whenMaxRequestCallbackSizeis greater than or equal to the total request size.Multiple callbacks may be invoked for large requests.
The callback may be called from a different thread than the caller.
The callback must communicate with
cuObjServerusing the RDMA descriptor information.
1.11. Error Handling#
1.11.1. Error Codes#
CU_OBJ_SUCCESS(0): Operation completed successfullyCU_OBJ_FAIL(1): Operation failed
1.11.2. Return Value Conventions#
Memory management APIs return
cuObjErr_t.I O operations return
ssize_t(positive is bytes transferred, negative is error).Connection status returns boolean values.
Utility functions return pointers or specific types as documented.
1.11.3. Best Practices#
Always check return values for error conditions.
Ensure proper cleanup in error paths, deregister memory and free tokens.
Callbacks should return
-1on errors to propagate failures.Verify memory is registered before performing I O operations.
Keep RDMA tokens valid only as long as needed.
1.12. Usage Patterns#
1.12.1. Basic Client Usage#
// 1. Define callback operations
CUObjOps_t ops = {
.get = my_get_callback,
.put = my_put_callback
};
// 2. Create client instance
cuObjClient client(ops, CUOBJ_PROTO_RDMA_DC_V1);
// 3. Allocate and register memory
void* buffer = malloc(1024 * 1024); // 1MB buffer
if (client.cuMemObjGetDescriptor(buffer, 1024 * 1024) != CU_OBJ_SUCCESS) {
// Handle error
}
// 4. Perform GET operation
MyContext ctx = { /* user data */ };
ssize_t result = client.cuObjGet(&ctx, buffer, 1024, 0, 0);
if (result < 0) {
// Handle error
}
// 5. Perform PUT operation
result = client.cuObjPut(&ctx, buffer, 1024, 0, 0);
if (result < 0) {
// Handle error
}
// 6. Cleanup
client.cuMemObjPutDescriptor(buffer);
free(buffer);
1.12.2. Callback Implementation Example#
ssize_t my_get_callback(const void* handle, char* ptr, size_t size,
loff_t offset, const cufileRDMAInfo_t* rdma_info) {
// Extract user context
MyContext* ctx = (MyContext*)cuObjClient::getCtx(handle);
// Get RDMA descriptor information
const char* rdma_desc = (const char*)rdma_info;
// Communicate with server via control path
// Send: operation=GET, size, offset, rdma_desc
ssize_t bytes_read = send_request_to_server(ctx->server_conn,
"GET", size, offset, rdma_desc);
// Server performs RDMA_WRITE directly to ptr
return bytes_read;
}
ssize_t my_put_callback(const void* handle, const char* ptr, size_t size,
loff_t offset, const cufileRDMAInfo_t* rdma_info) {
// Extract user context
MyContext* ctx = (MyContext*)cuObjClient::getCtx(handle);
// Get RDMA descriptor information
const char* rdma_desc = (const char*)rdma_info;
// Communicate with server via control path
// Send: operation=PUT, size, offset, rdma_desc
ssize_t bytes_written = send_request_to_server(ctx->server_conn,
"PUT", size, offset, rdma_desc);
// Server performs RDMA_READ directly from ptr
return bytes_written;
}
1.12.3. CUDA Device Memory Usage#
// Allocate CUDA device memory
void* d_buffer;
cudaMalloc(&d_buffer, 1024 * 1024);
// Check memory type
cuObjMemoryType_t mem_type = cuObjClient::getMemoryType(d_buffer);
assert(mem_type == CUOBJ_MEMORY_CUDA_DEVICE);
// Register for RDMA
client.cuMemObjGetDescriptor(d_buffer, 1024 * 1024);
// Perform I/O operations
client.cuObjGet(&ctx, d_buffer, 1024, 0, 0);
// Cleanup
client.cuMemObjPutDescriptor(d_buffer);
cudaFree(d_buffer);
1.12.4. Manual RDMA Token Management#
// Register memory
void* buffer = malloc(1024 * 1024);
client.cuMemObjGetDescriptor(buffer, 1024 * 1024);
// Get RDMA token for a specific region
char* rdma_token = nullptr;
size_t region_size = 4096;
size_t region_offset = 1024;
cuObjErr_t err = client.cuMemObjGetRDMAToken(buffer, region_size,
region_offset, CUOBJ_GET,
&rdma_token);
if (err == CU_OBJ_SUCCESS) {
// Use rdma_token for custom RDMA operations
// Free the token
client.cuMemObjPutRDMAToken(rdma_token);
}
// Cleanup
client.cuMemObjPutDescriptor(buffer);
free(buffer);
1.12.5. Telemetry Configuration#
// Setup telemetry with custom output stream
std::ofstream log_file("cuobj_client.log");
cuObjClient::setupTelemetry(false, &log_file);
cuObjClient::setTelemFlags(0xFFFF); // Enable all logging
// Create and use client
cuObjClient client(ops);
// ... perform operations ...
// Shutdown telemetry before closing file
cuObjClient::shutdownTelemetry();
log_file.close();
1.13. Constants and Limits#
1.13.1. Memory Limits#
#define CUOBJ_MAX_MEMORY_REG_SIZE (4ULL * 1024 * 1024 * 1024) // 4GiB
Maximum memory registration size: 4 GiB per
cuMemObjGetDescriptor()call.No explicit limit on the number of registered buffers.
Total RDMA resources are limited by system capabilities.
1.13.2. Protocol Constants#
#define OBJ_RDMA_V1 "CUOBJ"
Protocol identifier string:
"CUOBJ".Default protocol:
CUOBJ_PROTO_RDMA_DC_V1(value: 1001).
1.13.3. Operation Limits#
I O operations are split into chunks if they exceed
MaxRequestCallbackSize.Each callback invocation receives a contiguous memory chunk.
Total I O size is limited by the registered buffer size.
1.14. Thread Safety#
1.14.1. Client Thread Safety#
Constructor and destructor are not thread safe. Create and destroy from a single thread.
Memory registration is not thread safe for the same buffer. Synchronize if multiple threads register or deregister the same memory.
I O operations are thread safe for different buffers. Multiple threads can call
cuObjGet()andcuObjPut()concurrently with different registered buffers.Callbacks may be invoked from different threads than the caller. Synchronize access to shared resources within callbacks.
Each callback invocation is serialized for a single I O operation.
Static methods are thread safe with internal synchronization (
getCtx,getMemoryType, telemetry methods).
1.14.2. Synchronization Guidelines#
// SAFE: Multiple threads with different buffers
void* buffer1 = malloc(1024);
void* buffer2 = malloc(1024);
client.cuMemObjGetDescriptor(buffer1, 1024);
client.cuMemObjGetDescriptor(buffer2, 1024);
std::thread t1([&]() { client.cuObjGet(&ctx1, buffer1, 1024); });
std::thread t2([&]() { client.cuObjGet(&ctx2, buffer2, 1024); });
// UNSAFE: Multiple threads accessing same buffer
// Add external synchronization
std::mutex buffer_mutex;
std::thread t3([&]() {
std::lock_guard<std::mutex> lock(buffer_mutex);
client.cuObjGet(&ctx, buffer1, 512);
});
1.15. Best Practices#
Memory alignment: While not strictly required, page aligned buffers (4 KB) provide optimal performance.
RDMA capability: Not all memory types support RDMA. Device memory requires GPUDirect RDMA support.
Error recovery: If callbacks return
-1, the I O operation fails immediately and returns an error.Context lifetime: User context passed to
cuObjGetandcuObjPutmust remain valid until the operation completes.Descriptor lifetime: RDMA descriptors are only valid during the callback invocation.
Connection status: Check
isConnected()after construction to ensure the client initialized successfully.Cleanup order: Deregister memory before freeing buffers.
Callback performance: Keep callbacks fast and avoid heavy processing.