Manage Files#
NeMo Platform provides a file storage interface through the Files service. The Files service supports multiple storage backends and can be used to store datasets for training, evaluation results, model artifacts, and other files.
Concepts#
Fileset: A named container that holds files.
Filesets are uniquely identified by a name within a given workspace.
Storage Backend: Each fileset is backed by a storage backend where the files are actually persisted. Supported backends include:
local: Local filesystem storage (default, read/write)
s3: Amazon S3 or S3-compatible storage such as MinIO (read/write)
ngc: NVIDIA GPU Cloud storage (read-only)
huggingface: HuggingFace Hub repositories (read-only)
Read-only backends allow you to create a fileset that acts as a handle to external resources. This provides a unified interface to access files from different sources using the same SDK methods, and allows other platform services to reference external data through a fileset.
Purpose: A fileset field that indicates the intended use. Each purpose enables specific metadata fields under the corresponding metadata key:
Use purpose="generic" (default) for files that don't fit the dataset or model categories. There are no purpose-specific metadata fields.
Use purpose="dataset" for training and evaluation data. Metadata fields (metadata.dataset.*):
metadata.dataset.schema (object): Schema describing the dataset format (e.g., column names and types).
Use purpose="model" for model weights and checkpoints. Metadata fields (metadata.model.*):
metadata.model.tool_calling.chat_template (string): Jinja2 chat template for the model. Propagated to the model entity spec by the model-spec background task.
metadata.model.tool_calling.tool_call_parser (string): Name of the tool call parser (e.g., hermes, llama3_json, mistral).
metadata.model.tool_calling.tool_call_plugin (string): Reference to a fileset containing a custom tool call plugin Python file ({workspace}/{fileset_name}). Requires models.tool_call_plugin.enabled at the platform level.
metadata.model.tool_calling.auto_tool_choice (boolean): Whether to enable automatic tool choice.
These fields are merged into the model entity spec by the model-spec background task. For details, see Chat Templates and Tool Calling.
Custom Fields: Arbitrary key-value data attached to a fileset via custom_fields for user-defined metadata.
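To make the nesting concrete, the purpose-specific fields above sit under metadata.&lt;purpose&gt;.* on the fileset, alongside any custom_fields. A minimal sketch of a model-purpose payload as a plain Python dict (the field values and the "team" custom field are illustrative placeholders, not defaults):

```python
# Hypothetical metadata for a fileset with purpose="model".
# Only the nested key structure comes from the field reference above;
# every value here is an illustrative placeholder.
model_fileset_metadata = {
    "purpose": "model",
    "metadata": {
        "model": {
            "tool_calling": {
                "chat_template": "{% for message in messages %}...{% endfor %}",
                "tool_call_parser": "hermes",
                "auto_tool_choice": True,
            }
        }
    },
    # Arbitrary user-defined key-value data
    "custom_fields": {"team": "ml-infra"},
}

# Tool-calling settings live under metadata.model.tool_calling.*
tool_calling = model_fileset_metadata["metadata"]["model"]["tool_calling"]
print(tool_calling["tool_call_parser"])  # hermes
```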
Managing Filesets#
Fileset management operations (create, retrieve, list, delete) are available through the CLI (nmp files filesets) or the SDK (sdk.files.filesets).
Tip
CLI commands use the workspace from your current context by default. Use --workspace to specify a different workspace:
nmp files filesets list --workspace my-workspace
Creating Filesets#
Creating a fileset involves specifying a name and workspace. You can optionally provide a description, purpose, and custom storage configuration.
nmp files filesets create \
--name my-files \
--description "Training data for model fine-tuning"
{
"id": "fileset-TeufFfapeKBrMtpBb42zdv",
"created_at": "2026-01-20T03:00:00",
"custom_fields": {},
"description": "Training data for model fine-tuning",
"metadata": {
"dataset": null
},
"name": "my-files",
"project": "",
"purpose": "generic",
"storage": {
"path": "/var/mnt/filesets/default/my-files",
"read_chunk_size": 16777216,
"type": "local",
"write_buffer_size": 16777216
},
"updated_at": "2026-01-20T03:00:00",
"workspace": "default"
}
import os
from nemo_platform import NeMoPlatform
sdk = NeMoPlatform(
base_url=os.environ.get("NMP_BASE_URL", "http://localhost:8080"),
workspace="default",
)
# Create a fileset
fileset = sdk.files.filesets.create(
name="my-files",
description="Training data for model fine-tuning",
)
print(fileset.model_dump_json(indent=2))
{
"id": "fileset-TeufFfapeKBrMtpBb42zdv",
"created_at": "2026-01-20T03:00:00",
"custom_fields": {},
"description": "Training data for model fine-tuning",
"metadata": {
"dataset": null
},
"name": "my-files",
"project": "",
"purpose": "generic",
"storage": {
"path": "/var/mnt/filesets/default/my-files",
"read_chunk_size": 16777216,
"type": "local",
"write_buffer_size": 16777216
},
"updated_at": "2026-01-20T03:00:00",
"workspace": "default"
}
Listing Filesets#
List all filesets in a given workspace:
nmp files filesets list
┏━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ name ┃ workspace ┃ created_at ┃
┡━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ my-files │ default │ 2026-01-20T03:00:00 │
└──────────┴───────────┴────────────────────────────┘
filesets = sdk.files.filesets.list()
for fileset in filesets:
print(f"{fileset.name}: {fileset.description}")
Filter filesets by purpose or storage type:
# List only dataset filesets
nmp files filesets list --filter.purpose dataset
# List filesets using local storage
nmp files filesets list --filter.storage-type local
# List only dataset filesets
datasets = sdk.files.filesets.list(filter={"purpose": "dataset"})
# List filesets using local storage
local_filesets = sdk.files.filesets.list(filter={"storage_type": "local"})
Use pagination for large result sets:
# The "-" prefix sorts in descending order (newest first)
nmp files filesets list --page 1 --page-size 10 --sort "-created_at"
filesets = sdk.files.filesets.list(
page=1,
page_size=10,
sort="-created_at", # The "-" prefix sorts descending (newest first)
)
Deleting Filesets#
Delete an entire fileset:
nmp files filesets delete my-files
✓ Deleted successfully
deleted_fileset = sdk.files.filesets.delete(name="my-files")
print(f"Deleted fileset: {deleted_fileset.name}")
Warning
Deleting a fileset is permanent and cannot be undone. For local and s3 storage backends, this also deletes all underlying files.
Managing Files Within Filesets#
High-level file operations are available through the CLI (nmp files) or the SDK (sdk.files), which provide convenient methods for uploading, downloading, and listing files.
For advanced use cases, an fsspec-compatible filesystem is available at sdk.files.fsspec. Refer to the fsspec documentation for additional methods.
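The fsspec interface follows the standard fsspec filesystem API (ls, open, cat, and so on). As a sketch of that API shape, the snippet below uses fsspec's in-memory backend as a stand-in for sdk.files.fsspec, which is server-backed; the paths are illustrative:

```python
import fsspec

# fsspec's in-memory filesystem illustrates the generic filesystem API;
# sdk.files.fsspec exposes the same methods against fileset storage.
fs = fsspec.filesystem("memory")

# Write a file through the fsspec interface...
with fs.open("/my-files/training/data.jsonl", "wb") as f:
    f.write(b'{"text": "example"}\n')

# ...then list and read it back through the same interface
print(fs.ls("/my-files/training"))
print(fs.cat("/my-files/training/data.jsonl"))
```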
Uploading Files#
Upload files to a fileset:
# Upload a single file
nmp files upload ./data.jsonl --fileset my-files --remote-path training/data.jsonl
# Upload an entire directory
nmp files upload ./training_data/ --fileset my-files --remote-path training/
Uploading ━━━━━━━━━━━━━━━━ 100% • 3/3 files
Completed upload to my-files#training/
Upload without specifying a fileset to auto-create one:
# Auto-creates a new fileset with a generated name (fileset-<8 hex chars>)
nmp files upload ./data.jsonl
Uploading ━━━━━━━━━━━━━━━━ 100% • 1/1 files
Completed upload to fileset-a1b2c3d4
# Upload a single file
sdk.files.upload(
fileset="my-files",
local_path="./data.jsonl",
remote_path="training/data.jsonl",
)
# Upload an entire directory
sdk.files.upload(
fileset="my-files",
local_path="./training_data/",
remote_path="training/",
)
# Auto-create a new fileset (generates name like "fileset-a1b2c3d4")
result = sdk.files.upload(
local_path="./data.jsonl",
fileset_auto_create=True,
)
print(f"Uploaded to fileset: {result.name}")
Tip
If --fileset is omitted in the CLI (or fileset_auto_create=True is passed in the SDK), a new fileset is automatically created with a unique name following the pattern fileset-<8 hex chars> (e.g., fileset-a1b2c3d4). The generated name is returned so you can reference it in subsequent operations.
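If you need to recognize auto-created filesets later (for example, to clean them up), the generated names can be matched programmatically. A small sketch assuming the fileset-<8 hex chars> naming shown above:

```python
import re

# Match auto-generated fileset names: "fileset-" followed by 8 hex characters,
# as produced when uploading without an explicit fileset.
GENERATED_NAME = re.compile(r"^fileset-[0-9a-f]{8}$")

print(bool(GENERATED_NAME.match("fileset-a1b2c3d4")))  # True
print(bool(GENERATED_NAME.match("my-files")))          # False
```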
Listing Files#
List all files in a fileset:
nmp files list --fileset my-files
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━┓
┃ PATH ┃ SIZE ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━┩
│ training/data.jsonl │ 1024 │
│ training/validation.jsonl │ 512 │
└────────────────────────────┴──────┘
response = sdk.files.list(fileset="my-files")
for file in response.data:
print(f"{file.path}: {file.size} bytes")
List files under a specific directory:
nmp files list --fileset my-files --remote-path training/
training_files = sdk.files.list(fileset="my-files", remote_path="training/")
Downloading Files#
Download files to a local path:
# Download a single file
nmp files download --fileset my-files --remote-path training/data.jsonl -o ./data.jsonl
# Download an entire directory
nmp files download --fileset my-files --remote-path training/ -o ./training_data/
Downloading ━━━━━━━━━━━━━━━━ 100% • 2/2 files
Downloaded my-files#training/ to './training_data/'
# Download a single file
sdk.files.download(
fileset="my-files",
remote_path="training/data.jsonl",
local_path="./data.jsonl",
)
# Download an entire directory
sdk.files.download(
fileset="my-files",
remote_path="training/",
local_path="./training_data/",
)
Read file content into memory (SDK only):
content = sdk.files.download_content(
fileset="my-files",
remote_path="config.json",
)
print(content.decode("utf-8"))
Deleting Files#
Delete files from a fileset:
nmp files delete --fileset my-files --remote-path training/old-data.jsonl
Deleted my-files#training/old-data.jsonl
sdk.files.delete(
fileset="my-files",
remote_path="training/old-data.jsonl",
)
Using Progress Callbacks#
Note
The CLI displays progress bars automatically during uploads and downloads. This section covers custom progress handling in the SDK.
Track progress during large file transfers using the RichProgressCallback context manager:
from nemo_platform.filesets import RichProgressCallback
# Upload a directory with progress bar
with RichProgressCallback(description="Uploading dataset") as callback:
sdk.files.upload(
fileset="my-files",
local_path="./large_dataset/",
remote_path="",
callback=callback,
)
# Download all files from a fileset with progress bar
with RichProgressCallback(description="Downloading dataset") as callback:
sdk.files.download(
fileset="my-files",
remote_path="",
local_path="./downloaded_data/",
callback=callback,
)
Use Cases#
Using External Storage Backends#
Connect to files stored in NVIDIA GPU Cloud (NGC):
Note
By default, the platform pre-configures a built-in system/ngc-api-key secret and filesets for Nemotron Personas.
The example below demonstrates how to recreate those entities in your own workspace.
To distinguish them from the built-in entities, this example prefixes the manually created names with my-.
# Create a secret to store your NGC API key
echo "$NGC_API_KEY" | nmp secrets create --name my-ngc-api-key --from-file -
# Create a fileset pointing to NGC storage
nmp files filesets create \
--name my-nemotron-personas-dataset-en_us \
--description "Nemotron Personas USA" \
--storage '{
"type": "ngc",
"org": "nvidia",
"team": "nemotron-personas",
"resource": "nemotron-personas-dataset-en_us",
"version": "0.0.2",
"api_key_secret": "my-ngc-api-key"
}'
import os
# Create a secret to store your NGC API key
secret = sdk.secrets.create(name="my-ngc-api-key", data=os.getenv("NGC_API_KEY"))
# Create a fileset pointing to NGC storage
ngc_fileset = sdk.files.filesets.create(
name="my-nemotron-personas-dataset-en_us",
description="Nemotron Personas USA",
storage={
"type": "ngc",
"org": "nvidia",
"team": "nemotron-personas",
"resource": "nemotron-personas-dataset-en_us",
"version": "0.0.2",
"api_key_secret": secret.name,
}
)
Connect to a HuggingFace repository:
# Create a secret to store your HuggingFace token (needed for gated and private repos)
echo "$HF_TOKEN" | nmp secrets create --name hf_token --from-file -
# Create a fileset pointing to a HuggingFace repo
nmp files filesets create \
--name hf-dataset \
--description "Dataset from HuggingFace" \
--storage '{
"type": "huggingface",
"repo_id": "nvidia/Nemotron-Personas-Japan",
"repo_type": "dataset",
"token_secret": "hf_token"
}'
import os
# Create a secret to store your HuggingFace token (needed for gated and private repos)
secret = sdk.secrets.create(name="hf_token", data=os.getenv("HF_TOKEN"))
# Create a fileset pointing to a HuggingFace repo
hf_fileset = sdk.files.filesets.create(
name="hf-dataset",
description="Dataset from HuggingFace",
storage={
"type": "huggingface",
"repo_id": "nvidia/Nemotron-Personas-Japan",
"repo_type": "dataset",
"token_secret": secret.name, # Optional, needed for gated and private repos
}
)
Connect to an S3 bucket or S3-compatible storage (e.g., MinIO, Ceph):
# Create a fileset backed by S3 storage using SDK credential chain
# (uses AWS_ACCESS_KEY_ID/AWS_SECRET_ACCESS_KEY env vars, IRSA, instance profiles, etc.)
# The "prefix" field is optional - use it to scope the fileset to a folder within the bucket
nmp files filesets create \
--name s3-training-data \
--description "Training data stored in S3" \
--storage '{
"type": "s3",
"bucket": "my-ml-bucket",
"prefix": "datasets/training",
"region": "us-east-1",
"use_sdk_auth": true
}'
# Upload data to S3
nmp files upload ./training_data/ --fileset s3-training-data
# Download data from S3
nmp files download --fileset s3-training-data -o ./downloaded_data/
# Create a fileset backed by S3 storage using SDK credential chain
# (uses AWS_ACCESS_KEY_ID/AWS_SECRET_ACCESS_KEY env vars, IRSA, instance profiles, etc.)
s3_fileset = sdk.files.filesets.create(
name="s3-training-data",
description="Training data stored in S3",
storage={
"type": "s3",
"bucket": "my-ml-bucket",
"prefix": "datasets/training", # Optional: scope to a folder within the bucket
"region": "us-east-1",
"use_sdk_auth": True, # Use AWS SDK credential chain (default)
}
)
# Upload data to S3
sdk.files.upload(
fileset="s3-training-data",
local_path="./training_data/",
remote_path="",
)
# Download data from S3
sdk.files.download(
fileset="s3-training-data",
remote_path="",
local_path="./downloaded_data/",
)
For S3-compatible storage like MinIO, use explicit credentials and a custom endpoint:
# Create secrets to store your S3 credentials
echo "$S3_ACCESS_KEY" | nmp secrets create --name s3_access_key --from-file -
echo "$S3_SECRET_KEY" | nmp secrets create --name s3_secret_key --from-file -
nmp files filesets create \
--name minio-fileset \
--description "Data stored in MinIO" \
--storage '{
"type": "s3",
"bucket": "my-bucket",
"endpoint_url": "http://minio.example.com:9000",
"region": "us-east-1",
"use_sdk_auth": false,
"access_key_id_secret": "s3_access_key",
"secret_access_key_secret": "s3_secret_key"
}'
import os
# Create secrets to store your S3 credentials
access_key = sdk.secrets.create(name="s3_access_key", data=os.getenv("S3_ACCESS_KEY"))
secret_key = sdk.secrets.create(name="s3_secret_key", data=os.getenv("S3_SECRET_KEY"))
s3_fileset = sdk.files.filesets.create(
name="minio-fileset",
description="Data stored in MinIO",
storage={
"type": "s3",
"bucket": "my-bucket",
"endpoint_url": "http://minio.example.com:9000", # Custom S3 endpoint
"region": "us-east-1",
"use_sdk_auth": False, # Use explicit credentials instead of SDK auth
"access_key_id_secret": access_key.name,
"secret_access_key_secret": secret_key.name,
}
)