Deploy on DGX Spark#

This guide describes how to deploy large language models on NVIDIA DGX Spark using a two-node setup with a ConnectX-7 interconnect for distributed inference.

MiniMax-M2.5#

MiniMax-M2.5 has 229B parameters.

Environment Configuration#

The following specifications apply to the DGX Spark single unit:

  • 128 GB LPDDR5x coherent unified system memory

  • ConnectX-7 NIC @ 200 Gbps

The NVFP4 version of MiniMax-M2.5 requires at least 115 GiB of GPU VRAM, plus substantial memory for the KV cache during inference. To run the model, use two DGX Spark machines connected via ConnectX-7 for parallel inference.
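Before deploying, it can help to confirm how much unified memory is actually free on each node. A minimal check, reading /proc/meminfo directly (the 115 GiB figure above is split across the two nodes, so each node needs roughly half plus headroom for the KV cache):

```shell
# Report available unified memory in GiB on this DGX Spark node.
# /proc/meminfo reports MemAvailable in kB; 1048576 kB = 1 GiB.
awk '/MemAvailable/ {printf "available: %.1f GiB\n", $2/1048576}' /proc/meminfo
```

Run this on both nodes; if the available figure is low, revisit the memory-freeing step below.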

Pre-Deployment Preparation#

  1. Free system memory. Stop unnecessary or memory-heavy processes on both DGX Spark nodes.

  2. Set up the ConnectX-7 interconnect.

    1. Use verified 100 Gbps QSFP28 network cables to connect the two Spark nodes.

    2. Configure RDMA over Converged Ethernet (RoCE) on each node. For more information, refer to Connect Two Sparks.

  3. Configure the Docker container startup options.

    The container must use the host network. With two nodes connected over RoCE and the more complex networking that entails, --network=host simplifies configuration and is recommended.

    Use the following options when starting the container:

    • --network=host: Use the host network stack.

    • --device=/dev/infiniband: Map the RoCE/InfiniBand device.

    • --ulimit memlock=-1: Remove the memory locking limit (required for RDMA/RoCE).
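The preparation steps above can be sanity-checked with a short pre-flight script. This is a sketch: it only inspects the /dev/infiniband device path and the memlock limit named above, and does not validate RoCE connectivity itself.

```shell
#!/usr/bin/env bash
# Pre-flight check (sketch): confirm the RDMA device path and memlock
# limit before launching the NIM container.
set -u

if [ -d /dev/infiniband ]; then
  echo "RDMA device directory present: /dev/infiniband"
else
  echo "WARNING: /dev/infiniband not found; revisit the RoCE setup"
fi

# RDMA requires unlocked memory locking; inside the container this is
# granted by --ulimit memlock=-1 and should print "unlimited".
echo "memlock limit: $(ulimit -l)"
```

Run it on both hosts, and again inside each container after startup to confirm the --device and --ulimit options took effect.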

Deploy MiniMax-M2.5#

Use the steps in the following sections to deploy the NIM.

Start the First Node#

Launch the NIM container on the first DGX Spark node with the options specified in Pre-Deployment Preparation. Replace $NIM_IMAGE and any model-specific variables with your image and settings.

docker run -it --rm --name=nim-node1 \
  --shm-size=32g \
  --gpus all \
  --network=host \
  --device=/dev/infiniband \
  --ulimit memlock=-1 \
  -e NGC_API_KEY=$NGC_API_KEY \
  $NIM_IMAGE

Wait until the log shows that the primary node is ready and displays the connection parameters. For example:

INFO ... model_runner.py:735 ] Init torch distributed begin.
INFO ... model_runner.py:736 ] You need to start node1 within ten minutes while setting NIM_PRIMARY_NODE=169.254.26.132 NIM_NODE_MANAGER_PORT=20000

Note the NIM_PRIMARY_NODE IP address and the NIM_NODE_MANAGER_PORT value from this message. You must start the second node within ten minutes using these values.
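Rather than copying the values by hand, they can be parsed out of the node-1 log. The sample line below mirrors the log format shown above; in practice you would pull it from the running container with `docker logs nim-node1`:

```shell
# Extract NIM_PRIMARY_NODE and NIM_NODE_MANAGER_PORT from the node-1 log.
# In practice: log_line=$(docker logs nim-node1 2>&1 | grep NIM_PRIMARY_NODE)
log_line='INFO ... You need to start node1 within ten minutes while setting NIM_PRIMARY_NODE=169.254.26.132 NIM_NODE_MANAGER_PORT=20000'

primary=$(echo "$log_line" | sed -n 's/.*NIM_PRIMARY_NODE=\([^ ]*\).*/\1/p')
port=$(echo "$log_line" | sed -n 's/.*NIM_NODE_MANAGER_PORT=\([0-9]*\).*/\1/p')

echo "NIM_PRIMARY_NODE=$primary"      # 169.254.26.132
echo "NIM_NODE_MANAGER_PORT=$port"    # 20000
```

The two variables can then be passed directly as the -e options when starting the second node.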

Start the Second Node#

On the second DGX Spark node, start the container with the same image and options. Set the primary node and port from the first node’s log:

docker run -it --rm --name=nim-node2 \
  --shm-size=32g \
  --gpus all \
  --network=host \
  --device=/dev/infiniband \
  --ulimit memlock=-1 \
  -e NGC_API_KEY=$NGC_API_KEY \
  -e NIM_PRIMARY_NODE=169.254.26.132 \
  -e NIM_NODE_MANAGER_PORT=20000 \
  $NIM_IMAGE

Replace 169.254.26.132 and 20000 with the NIM_PRIMARY_NODE and NIM_NODE_MANAGER_PORT values from the first node.

The second node connects to the first node automatically for distributed inference. Once both nodes are connected and the service reports ready, you can send requests to the first node using curl or an OpenAI-compatible client.

Verification#

After the first node reports a ready state, send a simple inference request. Replace <port> with the port your NIM is serving on, typically 8000, and <model> with the name of the served model.

curl -X POST http://localhost:<port>/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "<model>", "prompt": "Hello", "max_tokens": 16}'

If you receive a valid JSON response, the deployment is complete.
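For scripted verification, the request above can be wrapped in a small readiness loop. This is a sketch: it assumes the service listens on port 8000, that a /v1/health/ready endpoint is exposed, and it uses a hypothetical model name that you should replace with the name your NIM actually serves.

```shell
#!/usr/bin/env bash
# Wait for the NIM to report ready, then send a test completion.
# Assumptions: port 8000, a /v1/health/ready endpoint, and the model
# name "minimax-m2.5" (hypothetical; substitute your served model name).
PORT=8000

until curl -sf "http://localhost:${PORT}/v1/health/ready" > /dev/null; do
  echo "waiting for NIM to become ready..."
  sleep 10
done

curl -s -X POST "http://localhost:${PORT}/v1/completions" \
  -H "Content-Type: application/json" \
  -d '{"model": "minimax-m2.5", "prompt": "Hello", "max_tokens": 16}'
```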