Manage Files#
NeMo Platform provides a file storage interface through the Files service. The Files service supports multiple storage backends and can be used to store datasets for training, evaluation results, model artifacts, and other files.
Concepts#
Fileset: A named container that holds files.
Filesets are uniquely identified by a name within a given workspace.
Storage Backend: Each fileset is backed by a storage backend where the files are actually persisted. Supported backends include:
local: Local filesystem storage (default, read/write)
s3: Amazon S3 or S3-compatible storage such as MinIO (read/write)
ngc: NVIDIA GPU Cloud storage (read-only)
huggingface: HuggingFace Hub repositories (read-only)
Read-only backends allow you to create a fileset that acts as a handle to external resources. This provides a unified interface to access files from different sources using the same SDK methods, and allows other platform services to reference external data through a fileset.
Purpose: A fileset field that indicates the intended use. Each purpose enables specific metadata fields under the corresponding metadata key:
Use purpose="generic" (default) for files that don't fit the dataset or model categories. There are no purpose-specific metadata fields.
Use purpose="dataset" for training and evaluation data. Metadata fields (metadata.dataset.*):
metadata.dataset.schema (object): Schema describing the dataset format (e.g., column names and types).
Use purpose="model" for model weights and checkpoints. Metadata fields (metadata.model.*):
metadata.model.tool_calling.chat_template (string): Jinja2 chat template for the model. Propagated to the model entity spec by the model-spec background task.
metadata.model.tool_calling.tool_call_parser (string): Name of the tool call parser (e.g., hermes, llama3_json, mistral).
metadata.model.tool_calling.tool_call_plugin (string): Reference to a fileset containing a custom tool call plugin Python file ({workspace}/{fileset_name}). Requires models.tool_call_plugin.enabled at the platform level.
metadata.model.tool_calling.auto_tool_choice (boolean): Whether to enable automatic tool choice.
These fields are merged into the model entity spec by the model-spec background task. For details, see Chat Templates and Tool Calling.
Custom Fields: Arbitrary key-value data attached to a fileset via custom_fields for user-defined metadata.
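To make the nesting concrete, the purpose-specific fields above sit under metadata.&lt;purpose&gt;.* on the fileset, alongside any custom_fields. A minimal sketch of a model-purpose payload as a plain Python dict (the field values and the "team" custom field are illustrative placeholders, not defaults):

```python
# Hypothetical metadata for a fileset with purpose="model".
# Only the nested key structure comes from the field reference above;
# every value here is an illustrative placeholder.
model_fileset_metadata = {
    "purpose": "model",
    "metadata": {
        "model": {
            "tool_calling": {
                "chat_template": "{% for message in messages %}...{% endfor %}",
                "tool_call_parser": "hermes",
                "auto_tool_choice": True,
            }
        }
    },
    # Arbitrary user-defined key-value data
    "custom_fields": {"team": "ml-infra"},
}

# Tool-calling settings live under metadata.model.tool_calling.*
tool_calling = model_fileset_metadata["metadata"]["model"]["tool_calling"]
print(tool_calling["tool_call_parser"])  # hermes
```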
Managing Filesets#
Fileset management operations (create, retrieve, list, delete) are available through the CLI (nmp files filesets) or the SDK (sdk.files.filesets).
Tip
CLI commands use the workspace from your current context by default. Use --workspace to specify a different workspace:
nmp files filesets list --workspace my-workspace
Creating Filesets#
Creating a fileset involves specifying a name and workspace. You can optionally provide a description, purpose, and custom storage configuration.
nmp files filesets create \
--name my-files \
--description "Training data for model fine-tuning"
{
"id": "fileset-TeufFfapeKBrMtpBb42zdv",
"created_at": "2026-01-20T03:00:00",
"custom_fields": {},
"description": "Training data for model fine-tuning",
"metadata": {
"dataset": null
},
"name": "my-files",
"project": "",
"purpose": "generic",
"storage": {
"path": "/var/mnt/filesets/default/my-files",
"read_chunk_size": 16777216,
"type": "local",
"write_buffer_size": 16777216
},
"updated_at": "2026-01-20T03:00:00",
"workspace": "default"
}
import os
from nemo_platform import NeMoPlatform
sdk = NeMoPlatform(
base_url=os.environ.get("NMP_BASE_URL", "http://localhost:8080"),
workspace="default",
)
# Create a fileset
fileset = sdk.files.filesets.create(
name="my-files",
description="Training data for model fine-tuning",
)
print(fileset.model_dump_json(indent=2))
{
"id": "fileset-TeufFfapeKBrMtpBb42zdv",
"created_at": "2026-01-20T03:00:00",
"custom_fields": {},
"description": "Training data for model fine-tuning",
"metadata": {
"dataset": null
},
"name": "my-files",
"project": "",
"purpose": "generic",
"storage": {
"path": "/var/mnt/filesets/default/my-files",
"read_chunk_size": 16777216,
"type": "local",
"write_buffer_size": 16777216
},
"updated_at": "2026-01-20T03:00:00",
"workspace": "default"
}
Listing Filesets#
List all filesets in a given workspace:
nmp files filesets list
┏━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ name ┃ workspace ┃ created_at ┃
┡━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ my-files │ default │ 2026-01-20T03:00:00 │
└──────────┴───────────┴────────────────────────────┘
filesets = sdk.files.filesets.list()
for fileset in filesets:
print(f"{fileset.name}: {fileset.description}")
Filter filesets by purpose or storage type:
# List only dataset filesets
nmp files filesets list --filter.purpose dataset
# List filesets using local storage
nmp files filesets list --filter.storage-type local
# List only dataset filesets
datasets = sdk.files.filesets.list(filter={"purpose": "dataset"})
# List filesets using local storage
local_filesets = sdk.files.filesets.list(filter={"storage_type": "local"})
Use pagination for large result sets:
# The "-" prefix sorts in descending order (newest first)
nmp files filesets list --page 1 --page-size 10 --sort "-created_at"
filesets = sdk.files.filesets.list(
page=1,
page_size=10,
sort="-created_at", # The "-" prefix sorts descending (newest first)
)
Deleting Filesets#
Delete an entire fileset:
nmp files filesets delete my-files
✓ Deleted successfully
deleted_fileset = sdk.files.filesets.delete(name="my-files")
print(f"Deleted fileset: {deleted_fileset.name}")
Warning
Deleting a fileset is permanent and cannot be undone. For local and s3 storage backends, this also deletes all underlying files.
Managing Files Within Filesets#
High-level file operations are available through the CLI (nmp files) or the SDK (sdk.files), which provide convenient methods for uploading, downloading, and listing files.
For advanced use cases, an fsspec-compatible filesystem is available at sdk.files.fsspec. Refer to the fsspec documentation for additional methods.
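The fsspec interface follows the standard fsspec filesystem API (ls, open, cat, and so on). As a sketch of that API shape, the snippet below uses fsspec's in-memory backend as a stand-in for sdk.files.fsspec, which is server-backed; the paths are illustrative:

```python
import fsspec

# fsspec's in-memory filesystem illustrates the generic filesystem API;
# sdk.files.fsspec exposes the same methods against fileset storage.
fs = fsspec.filesystem("memory")

# Write a file through the fsspec interface...
with fs.open("/my-files/training/data.jsonl", "wb") as f:
    f.write(b'{"text": "example"}\n')

# ...then list and read it back through the same interface
print(fs.ls("/my-files/training"))
print(fs.cat("/my-files/training/data.jsonl"))
```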
Uploading Files#
Upload files to a fileset:
# Upload a single file
nmp files upload ./data.jsonl --fileset my-files --remote-path training/data.jsonl
# Upload an entire directory
nmp files upload ./training_data/ --fileset my-files --remote-path training/
Uploading ━━━━━━━━━━━━━━━━ 100% • 3/3 files
Completed upload to my-files#training/
Upload without specifying a fileset to auto-create one:
# Auto-creates a new fileset with a generated name (fileset-<8 hex chars>)
nmp files upload ./data.jsonl
Uploading ━━━━━━━━━━━━━━━━ 100% • 1/1 files
Completed upload to fileset-a1b2c3d4
# Upload a single file
sdk.files.upload(
fileset="my-files",
local_path="./data.jsonl",
remote_path="training/data.jsonl",
)
# Upload an entire directory
sdk.files.upload(
fileset="my-files",
local_path="./training_data/",
remote_path="training/",
)
# Auto-create a new fileset (generates name like "fileset-a1b2c3d4")
result = sdk.files.upload(
local_path="./data.jsonl",
fileset_auto_create=True,
)
print(f"Uploaded to fileset: {result.name}")
Tip
If --fileset is omitted in the CLI (or fileset_auto_create=True is passed in the SDK), a new fileset is automatically created with a unique name following the pattern fileset-<8 hex chars> (e.g., fileset-a1b2c3d4). The generated name is returned so you can reference it in subsequent operations.
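If you need to recognize auto-created filesets later (for example, to clean them up), the generated names can be matched programmatically. A small sketch assuming the fileset-<8 hex chars> naming shown above:

```python
import re

# Match auto-generated fileset names: "fileset-" followed by 8 hex characters,
# as produced when uploading without an explicit fileset.
GENERATED_NAME = re.compile(r"^fileset-[0-9a-f]{8}$")

print(bool(GENERATED_NAME.match("fileset-a1b2c3d4")))  # True
print(bool(GENERATED_NAME.match("my-files")))          # False
```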
Listing Files#
List all files in a fileset:
nmp files list --fileset my-files
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━┓
┃ PATH ┃ SIZE ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━┩
│ training/data.jsonl │ 1024 │
│ training/validation.jsonl │ 512 │
└────────────────────────────┴──────┘
response = sdk.files.list(fileset="my-files")
for file in response.data:
print(f"{file.path}: {file.size} bytes")
List files under a specific directory:
nmp files list --fileset my-files --remote-path training/
training_files = sdk.files.list(fileset="my-files", remote_path="training/")
Downloading Files#
Download files to a local path:
# Download a single file
nmp files download --fileset my-files --remote-path training/data.jsonl -o ./data.jsonl
# Download an entire directory
nmp files download --fileset my-files --remote-path training/ -o ./training_data/
Downloading ━━━━━━━━━━━━━━━━ 100% • 2/2 files
Downloaded my-files#training/ to './training_data/'
# Download a single file
sdk.files.download(
fileset="my-files",
remote_path="training/data.jsonl",
local_path="./data.jsonl",
)
# Download an entire directory
sdk.files.download(
fileset="my-files",
remote_path="training/",
local_path="./training_data/",
)
Read file content into memory (SDK only):
content = sdk.files.download_content(
fileset="my-files",
remote_path="config.json",
)
print(content.decode("utf-8"))
Deleting Files#
Delete files from a fileset:
nmp files delete --fileset my-files --remote-path training/old-data.jsonl
Deleted my-files#training/old-data.jsonl
sdk.files.delete(
fileset="my-files",
remote_path="training/old-data.jsonl",
)
Using Progress Callbacks#
Note
The CLI displays progress bars automatically during uploads and downloads. This section covers custom progress handling in the SDK.
Track progress during large file transfers using the RichProgressCallback context manager:
from nemo_platform.filesets import RichProgressCallback
# Upload a directory with progress bar
with RichProgressCallback(description="Uploading dataset") as callback:
sdk.files.upload(
fileset="my-files",
local_path="./large_dataset/",
remote_path="",
callback=callback,
)
# Download all files from a fileset with progress bar
with RichProgressCallback(description="Downloading dataset") as callback:
sdk.files.download(
fileset="my-files",
remote_path="",
local_path="./downloaded_data/",
callback=callback,
)
Use Cases#
Using External Storage Backends#
Connect to files stored in NVIDIA GPU Cloud (NGC):
Note
By default, the platform pre-configures a built-in system/ngc-api-key secret and filesets for Nemotron Personas.
The example below demonstrates how to recreate those entities in your own workspace.
To distinguish them from the built-in entities, this example prefixes the manually created names with my-.
# Create a secret to store your NGC API key
echo "$NGC_API_KEY" | nmp secrets create --name my-ngc-api-key --from-file -
# Create a fileset pointing to NGC storage
nmp files filesets create \
--name my-nemotron-personas-dataset-en_us \
--description "Nemotron Personas USA" \
--storage '{
"type": "ngc",
"org": "nvidia",
"team": "nemotron-personas",
"resource": "nemotron-personas-dataset-en_us",
"version": "0.0.2",
"api_key_secret": "my-ngc-api-key"
}'
import os
# Create a secret to store your NGC API key
secret = sdk.secrets.create(name="my-ngc-api-key", data=os.getenv("NGC_API_KEY"))
# Create a fileset pointing to NGC storage
ngc_fileset = sdk.files.filesets.create(
name="my-nemotron-personas-dataset-en_us",
description="Nemotron Personas USA",
storage={
"type": "ngc",
"org": "nvidia",
"team": "nemotron-personas",
"resource": "nemotron-personas-dataset-en_us",
"version": "0.0.2",
"api_key_secret": secret.name,
}
)
Connect to a HuggingFace repository:
# Create a secret to store your HuggingFace token (needed for gated and private repos)
echo "$HF_TOKEN" | nmp secrets create --name hf_token --from-file -
# Create a fileset pointing to a HuggingFace repo
nmp files filesets create \
--name hf-dataset \
--description "Dataset from HuggingFace" \
--storage '{
"type": "huggingface",
"repo_id": "nvidia/Nemotron-Personas-Japan",
"repo_type": "dataset",
"token_secret": "hf_token"
}'
import os
# Create a secret to store your HuggingFace token (needed for gated and private repos)
secret = sdk.secrets.create(name="hf_token", data=os.getenv("HF_TOKEN"))
# Create a fileset pointing to a HuggingFace repo
hf_fileset = sdk.files.filesets.create(
name="hf-dataset",
description="Dataset from HuggingFace",
storage={
"type": "huggingface",
"repo_id": "nvidia/Nemotron-Personas-Japan",
"repo_type": "dataset",
"token_secret": secret.name, # Optional, needed for gated and private repos
}
)
Connect to an S3 bucket or S3-compatible storage (e.g., MinIO, Ceph):
# Create a fileset backed by S3 storage using SDK credential chain
# (uses AWS_ACCESS_KEY_ID/AWS_SECRET_ACCESS_KEY env vars, IRSA, instance profiles, etc.)
# The "prefix" field is optional - use it to scope the fileset to a folder within the bucket
nmp files filesets create \
--name s3-training-data \
--description "Training data stored in S3" \
--storage '{
"type": "s3",
"bucket": "my-ml-bucket",
"prefix": "datasets/training",
"region": "us-east-1",
"use_sdk_auth": true
}'
# Upload data to S3
nmp files upload ./training_data/ --fileset s3-training-data
# Download data from S3
nmp files download --fileset s3-training-data -o ./downloaded_data/
# Create a fileset backed by S3 storage using SDK credential chain
# (uses AWS_ACCESS_KEY_ID/AWS_SECRET_ACCESS_KEY env vars, IRSA, instance profiles, etc.)
s3_fileset = sdk.files.filesets.create(
name="s3-training-data",
description="Training data stored in S3",
storage={
"type": "s3",
"bucket": "my-ml-bucket",
"prefix": "datasets/training", # Optional: scope to a folder within the bucket
"region": "us-east-1",
"use_sdk_auth": True, # Use AWS SDK credential chain (default)
}
)
# Upload data to S3
sdk.files.upload(
fileset="s3-training-data",
local_path="./training_data/",
remote_path="",
)
# Download data from S3
sdk.files.download(
fileset="s3-training-data",
remote_path="",
local_path="./downloaded_data/",
)
For S3-compatible storage like MinIO, use explicit credentials and a custom endpoint:
# Create secrets to store your S3 credentials
echo "$S3_ACCESS_KEY" | nmp secrets create --name s3_access_key --from-file -
echo "$S3_SECRET_KEY" | nmp secrets create --name s3_secret_key --from-file -
nmp files filesets create \
--name minio-fileset \
--description "Data stored in MinIO" \
--storage '{
"type": "s3",
"bucket": "my-bucket",
"endpoint_url": "http://minio.example.com:9000",
"region": "us-east-1",
"use_sdk_auth": false,
"access_key_id_secret": "s3_access_key",
"secret_access_key_secret": "s3_secret_key"
}'
import os
# Create secrets to store your S3 credentials
access_key = sdk.secrets.create(name="s3_access_key", data=os.getenv("S3_ACCESS_KEY"))
secret_key = sdk.secrets.create(name="s3_secret_key", data=os.getenv("S3_SECRET_KEY"))
s3_fileset = sdk.files.filesets.create(
name="minio-fileset",
description="Data stored in MinIO",
storage={
"type": "s3",
"bucket": "my-bucket",
"endpoint_url": "http://minio.example.com:9000", # Custom S3 endpoint
"region": "us-east-1",
"use_sdk_auth": False, # Use explicit credentials instead of SDK auth
"access_key_id_secret": access_key.name,
"secret_access_key_secret": secret_key.name,
}
)