Deploy GPT-OSS-120B
Learn how to deploy GPT-OSS-120B for inference on DGX Cloud Lepton.
This guide demonstrates how to deploy OpenAI's open source GPT-OSS-120B model as an endpoint on DGX Cloud Lepton for serving inference requests. The model is deployed as a NIM (NVIDIA Inference Microservice), which is optimized for high throughput and low latency on your specific GPU architecture. This guide covers two paths for deploying the NIM: via the UI in a web browser, and via the CLI in a local terminal.
Prerequisites
To deploy the NIM, you need:
- Access to your NIM Container Registry and NGC API Key. For more information, refer to the setup information.
- Access to at least one NVIDIA H100 GPU or higher on DGX Cloud Lepton.
Option 1: Create Endpoint from NVIDIA NIM via UI
The first path to deploying the NIM is via a web browser. To get started, navigate to the create endpoint page on the dashboard.
Specify a name for the endpoint in the Endpoint Name field such as gpt-oss-120b.
Select the node group and resource shape to use for the deployment in the Resource section. GPT-OSS-120B requires at least one GPU with a minimum of 80 GB of GPU memory. Deploying with multiple GPUs can yield higher throughput for responses.
In the NIM Configuration section, select Built-in under Image and enter or select nim/openai/gpt-oss-120b:latest from the list. Then specify the registry authentication and your NGC API key in the following fields:
- Registry Auth: Select one of the existing registry auth configurations you created in your workspace or create a new one with access to NGC.
- NGC API key: Select one of the existing NGC API keys you saved in your workspace as a secret or create a new one.
GPT-OSS-120B is a large model that can take a long time to download if not already available in storage. To prevent the endpoint from failing before the model is downloaded, go to Advanced Configuration and set Healthcheck Initial Delay Seconds to Custom with a value of 99999 seconds or higher as necessary.
Save Model in Cache
Since models can take hours to download depending on their size, you may want to store the model in a persistent cache. To do this, set an environment variable with a new cache path in persistent storage and mount that storage path.
Go to Environment Variables and Secrets and click Add Variable. Set the variable name to NIM_CACHE_PATH and the variable value to /opt/nim/.cache. Then go to Storages and click + Mount Storage. Select your desired volume, set the from path to /opt/nim/.cache or another location for storing the cache on your volume, and set the to path to /opt/nim/.cache.
For other endpoint configurations, refer to the Endpoints documentation.
Access Token
By default, endpoints require an access token with each request. This can be changed in the Access Tokens section. When access tokens are enabled, a random token is generated for the endpoint and appears in the UI alongside your deployed endpoint once it is ready.
Endpoints can also be made publicly available without authentication. Note that if an endpoint is made public, anyone with the endpoint URL can access the model and send requests without authentication.
Option 2: Create Endpoint from NVIDIA NIM via CLI
The second path to deploying the NIM is via the DGX Cloud Lepton CLI. To use the CLI, you must first install the Python package.
Install DGX Cloud Lepton Python package
Open a terminal and install Python 3.10 or newer if not already available. Follow the official Python documentation for instructions on installing a compatible version for your device. Then, install the DGX Cloud Lepton Python package in your terminal with:
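A minimal sketch of the install command, assuming the SDK is published on PyPI as leptonai:

```bash
# Install the DGX Cloud Lepton Python SDK, which bundles the lep CLI
pip install -U leptonai
```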
This installs the Python SDK, which includes the lep CLI.
Authenticate with the DGX Cloud Lepton CLI
After installing the Python package, authenticate with your workspace by using lep login as shown below:
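```bash
# Log in to your DGX Cloud Lepton workspace; a browser tab should open for authentication
lep login
```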
If a browser tab does not open automatically, navigate to the API tokens page. Click + Create Token in the top right corner, specify a name for your access token, and select the expiration of the token.
After you create a token, your credential is displayed. Copy it somewhere safe, as it is only shown once. Then return to your terminal and enter the credential when prompted. You should now be authenticated.
To validate authentication, run lep workspace list and you should see a table containing the workspace you connected to. You are now ready to use the CLI to launch endpoints.
Deploy the Endpoint via CLI
The template for launching the NIM via the CLI is as follows:
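The sketch below shows the general shape of the command. The flags described in the list that follows come from this guide; the lep endpoint create subcommand, the nvcr.io image path, and the container port are assumptions based on the UI flow above, so adjust them for your workspace:

```bash
# Sketch: deploy the GPT-OSS-120B NIM as an endpoint
# The nvcr.io image path and container port are assumptions, not confirmed by this guide
lep endpoint create \
  --name gpt-oss-120b \
  --container-image nvcr.io/nim/openai/gpt-oss-120b:latest \
  --container-port 8000 \
  --resource-shape gpu.1xh200 \
  --node-group my-node-group \
  --secret NGC_API_KEY=NGC_API_KEY \
  --image-pull-secrets my-pull-secret \
  --public
```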
Some of these values may change depending on your workspace, such as:
- --resource-shape gpu.1xh200: Replace with the name of your GPU resource if different, such as gpu.1xh100 or gpu.4xh100.
- --node-group my-node-group: Replace with the name of your node group to deploy the model on.
- --secret NGC_API_KEY=NGC_API_KEY: Replace the second instance of NGC_API_KEY with the name of the secret that was created on DGX Cloud Lepton. For example, if your secret is named MY_NGC_KEY, the flag would be --secret NGC_API_KEY=MY_NGC_KEY.
- --image-pull-secrets my-pull-secret: Replace with the name of the container registry authentication secret you created on DGX Cloud Lepton.
- --public: This makes the endpoint available to anyone with the link. To require an access token, replace --public with --token <enter token here> and specify the token to require for incoming requests.
After modifying and running the command above, the model deploys as an endpoint and becomes viewable in the UI.
Test Endpoint
Once your endpoint is ready, you can use the playground to send requests directly from the UI, or you can send requests to the endpoint URL by running a command like the following:
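A sketch of a test request, assuming the endpoint serves the OpenAI-compatible chat completions route and the model is registered as openai/gpt-oss-120b:

```bash
# Send a test chat completion request to the deployed endpoint
# <ENDPOINT_URL> and <ACCESS_TOKEN> are placeholders; omit the Authorization
# header if the endpoint is public
curl -X POST "<ENDPOINT_URL>/v1/chat/completions" \
  -H "Authorization: Bearer <ACCESS_TOKEN>" \
  -H "Content-Type: application/json" \
  -d '{
        "model": "openai/gpt-oss-120b",
        "messages": [{"role": "user", "content": "What is DGX Cloud Lepton?"}],
        "max_tokens": 128
      }'
```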
To get your <ENDPOINT_URL>, either copy the URL shown after the Endpoint: string in the UI under your deployed model, or run lep endpoint get -n <endpoint name> | grep external_endpoint. Use this URL to send requests to your model.
If you set your endpoint to be publicly accessible, you do not need to set the authorization token.
NIM LLM endpoints follow the OpenAI API standard. For more information, refer to the official documentation.