CLI Reference
After you install the Python dependencies, you can use the NeMo Retriever Library command line interface (CLI).
To use the CLI, use the retriever command.
Command name
Depending on your installation (NeMo Retriever Library vs. nv-ingest-client), you invoke the CLI by using retriever or nv-ingest-cli. Both expose the same options and behavior. The following sections use retriever for consistency with the examples.
To check the version of the CLI that you have installed, run the following command.
retriever --version
To get a list of the current CLI commands and their options, run the following command.
retriever --help
Tip
There is a Jupyter notebook available to help you get started with the CLI. For more information, refer to CLI Client Quick Start Guide.
Parameter Reference
The following table lists all CLI options.
* At least one of --doc or --dataset must be provided; otherwise there are no files to process.
| Flag | Aliases | Type | Default | Required | Description |
|---|---|---|---|---|---|
| --doc | — | path (multiple) | none | No | Path to a document to process. Can be specified multiple times. Files must exist. Supports glob-style patterns. |
| --dataset | — | path | none | No | Path to a dataset definition file (JSON with sampled_files list). |
| --output_directory | — | path | none | No | Directory where result metadata and optional media are written. If omitted, results are not saved to disk. |
| --task | — | string (multiple) | none | No | Task definition in task_id:{"key":"value"} format. Repeat for multiple tasks (e.g. extract, split, caption). |
| --client_host | — | string | localhost | No | Hostname or IP of the ingest service. |
| --client_port | — | int | 7670 | No | Port of the ingest service. |
| --api_version | — | enum | v2 | No | API version: v1 or v2. Required for --pdf_split_page_count. |
| --pdf_split_page_count | — | int | none | No | Pages per PDF chunk when splitting (V2 API). Typically 1–128; server default if unset. |
| --client_type | — | enum | rest | No | Client transport: rest or simple. |
| --client_kwargs | — | string (JSON) | {} | No | Extra JSON object passed to the client. |
| --batch_size | — | int | 10 | No | The number of in-flight jobs. This value must be greater than or equal to 1. |
| --concurrency_n | — | int | 10 | No | Number of concurrent jobs to maintain. |
| --log_level | — | enum | INFO | No | Logging level: DEBUG, INFO, WARNING, ERROR, CRITICAL. |
| --dry_run | — | flag | false | No | Do not run the pipeline; only validate and log what would be done. |
| --fail_on_error | — | flag | false | No | Stop on first job error instead of continuing. |
| --save_images_separately | — | flag | false | No | Write extracted images to disk under output_directory and set content_url in metadata. |
| --shuffle_dataset | — | flag | true | No | Shuffle file list from --dataset before processing. |
| --collect_profiling_traces | — | flag | false | No | After the run, fetch Zipkin traces for submitted jobs and write them under output_directory. |
| --zipkin_host | — | string | localhost | No | Host for Zipkin API (used when --collect_profiling_traces is set). |
| --zipkin_port | — | int | 9411 | No | Port for Zipkin API. |
| --version | — | flag | — | No | Print nv-ingest and nv-ingest-cli versions and exit. |
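The --task format and the rule that at least one of --doc or --dataset must be provided can also be applied when you assemble a command from a script. The following is a minimal Python sketch and is not part of the CLI: the helper names task_arg and build_command are hypothetical, and the sketch assumes that a task without options is passed as a bare task name (as in --task='split' in the examples).

```python
import json
import shlex


def task_arg(task_id, options=None):
    """Format one --task value as task_id:{json}, or a bare task_id when there are no options."""
    if options is None:
        return task_id
    return f"{task_id}:{json.dumps(options)}"


def build_command(docs=(), dataset=None, tasks=(), executable="retriever", **flags):
    """Assemble an argv list for the CLI (hypothetical helper, not shipped with the library)."""
    if not docs and dataset is None:
        # Mirrors the rule stated above the parameter table.
        raise ValueError("At least one of --doc or --dataset must be provided.")
    argv = [executable]
    for doc in docs:
        argv += ["--doc", str(doc)]
    if dataset is not None:
        argv += ["--dataset", str(dataset)]
    for task_id, options in tasks:
        argv += ["--task", task_arg(task_id, options)]
    for name, value in flags.items():
        if value is True:
            argv.append(f"--{name}")  # boolean flag such as --dry_run
        elif value is not None and value is not False:
            argv += [f"--{name}", str(value)]
        # False and None values are omitted; this sketch does not emit --no-shuffle_dataset.
    return argv


if __name__ == "__main__":
    cmd = build_command(
        docs=["./data/test.pdf"],
        tasks=[("extract", {"document_type": "pdf", "extract_method": "pdfium"}), ("split", None)],
        output_directory="./processed_docs",
        client_host="localhost",
        client_port=7670,
    )
    # Print a shell-quoted command line that can be pasted into a terminal.
    print(shlex.join(cmd))
```

Building the task value with json.dumps avoids hand-written quoting mistakes, which otherwise surface as an "Invalid JSON format for task" error.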
Output Format and Output_Directory Layout
Output Format
- Metadata: When --output_directory is set, the CLI writes JSON files. Each result is a JSON structure that follows the content metadata schema: content, content_url, source_metadata, content_metadata, and type-specific blocks (text_metadata, image_metadata, table_metadata, etc.).
- Streaming / stdout: Progress and telemetry are written to stderr (e.g. progress bar, timing). No structured result stream is written to stdout unless you use a different mode (e.g. streaming APIs) not covered by this CLI.
Output_Directory Structure
When --output_directory DIR is used, the CLI creates:
- DIR/<content_type>/<source_basename>.metadata.json
  One JSON file per source document, grouped by content type (text, image, structured, etc.). Each file contains an array of document objects (one per extracted chunk/element) with the full metadata structure.
- DIR/<content_type>/media/ (optional)
  If --save_images_separately is set, extracted images are saved here as files (e.g. <source_basename>_0.png). Metadata JSON then references them via content_url instead of inline base64.
- DIR/zipkin_profiles/ (optional)
  If --collect_profiling_traces is used, Zipkin trace JSON files are written here for profiling analysis.
Parsing Results Programmatically
- Read each *.metadata.json file under DIR and parse as JSON. Each file is an array of objects; each object has a metadata key with the schema described in Content Metadata.
- To iterate over all extracted items for a single run, walk the subdirectories of output_directory and load every *.metadata.json; then iterate over the array elements in each file.
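The two steps above can be sketched in Python as follows. This is an illustrative sketch, not a library API: the function name iter_results is hypothetical, and it assumes the DIR/<content_type>/<source_basename>.metadata.json layout described earlier in this section.

```python
import json
from pathlib import Path


def iter_results(output_directory):
    """Yield (content_type, file_name, item) for every element of every *.metadata.json file."""
    root = Path(output_directory)
    # Match DIR/<content_type>/<source_basename>.metadata.json only.
    for path in sorted(root.glob("*/*.metadata.json")):
        content_type = path.parent.name
        with path.open(encoding="utf-8") as fh:
            items = json.load(fh)  # each file holds an array of objects
        for item in items:
            yield content_type, path.name, item


if __name__ == "__main__":
    for content_type, file_name, item in iter_results("./processed_docs"):
        # Each object is expected to carry a "metadata" key; list its top-level fields.
        metadata = item.get("metadata", {})
        print(content_type, file_name, sorted(metadata))
```

Because the function is a generator, large runs can be processed one item at a time without loading every file into memory first.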
Errors and Exit Codes
The CLI does not define custom exit codes for every case. In general:
- Exit 0: All requested files were processed successfully (or --dry_run completed).
- Non-zero: Validation failed, a runtime error occurred, or the process was interrupted.
Common errors and how they appear:
| Condition | Message / behavior | Exit |
|---|---|---|
| Connection refused | Client cannot reach --client_host:--client_port. Log message like Error: ... or connection-related exception. | Non-zero |
| Missing input | Neither --doc nor --dataset provided, or no files matched. | Non-zero |
| File does not exist | --doc or --dataset path missing. | Non-zero (Click: "File does not exist: <path>") |
| Invalid task format | --task value not in task_id:{"options"} form or invalid JSON. | Non-zero (Click: "Invalid JSON format for task '...' ..." or "Unsupported task type: ...") |
| Unsupported document type | Task expects a document type not supported by the server or extractor. | Non-zero (server/validation error) |
| Batch size < 1 | --batch_size < 1. | Non-zero (Click: "Batch size must be >= 1.") |
| Invalid boolean in extract task | e.g. extract_text not true/false, 1/0, yes/no. | Non-zero (ValueError: "Invalid boolean value for ...") |
| UDF validation | Missing target_stage or phase, or both specified. | Non-zero (ValueError from task validation) |
| Timeout | Job did not complete within the client’s timeout; may be retried internally. | Depends on retries / --fail_on_error |
Running with --fail_on_error causes the process to exit on the first job failure; otherwise the CLI may continue and report failures at the end.
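Because only the zero versus non-zero distinction is documented, a calling script should treat any non-zero exit code as a failure and rely on stderr for the details. The following is a minimal Python sketch under that assumption; the run_cli helper is hypothetical, and the usage part runs only if a retriever executable is found on PATH.

```python
import shutil
import subprocess
import sys


def run_cli(argv, timeout=None):
    """Run a command and return (exit_code, stderr_text)."""
    completed = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
    return completed.returncode, completed.stderr


if __name__ == "__main__" and shutil.which("retriever"):
    code, stderr = run_cli(["retriever", "--doc", "./data/test.pdf", "--dry_run"])
    if code != 0:
        # Do not branch on specific non-zero values; the CLI does not define them.
        print(f"retriever exited with code {code}", file=sys.stderr)
        print(stderr, file=sys.stderr)
        sys.exit(code)
```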
Complete --help Output
The following is the standard help output for the CLI (equivalent to retriever --help or nv-ingest-cli --help). Use it as a quick reference when you cannot run the command locally.
Usage: retriever [OPTIONS]

Options:
  --batch_size INTEGER            Batch size (must be >= 1). [default: 10]
  --doc PATH                      Add a new document to be processed
                                  (supports multiple). [default: (none)]
  --dataset PATH                  Path to a dataset definition file.
                                  [default: (none)]
  --client_host TEXT              DNS name or URL for the endpoint.
                                  [default: localhost]
  --client_port INTEGER           Port for the client endpoint. [default: 7670]
  --client_kwargs TEXT            Additional arguments to pass to the client.
                                  [default: {}]
  --api_version [v1|v2]           API version to use (v1 or v2). V2 required
                                  for PDF split page count feature.
                                  [default: v2]
  --client_type [rest|simple]     Client type used to connect to the ingest
                                  service. [default: rest]
  --concurrency_n INTEGER         Number of inflight jobs to maintain at one
                                  time. [default: 10]
  --dry_run                       Perform a dry run without executing actions.
  --fail_on_error                 Fail on error.
  --output_directory PATH         Output directory for results.
                                  [default: (none)]
  --log_level [DEBUG|INFO|WARNING|ERROR|CRITICAL]
                                  Log level. [default: INFO]
  --save_images_separately        Save images separately from returned
                                  metadata.
  --shuffle_dataset / --no-shuffle_dataset
                                  Shuffle the dataset before processing.
                                  [default: True]
  --task TEXT                     Task definition in
                                  '[task_id]:{json_options}' format (repeatable).
  --collect_profiling_traces      Collect Zipkin traces for submitted jobs
                                  into output_directory.
  --zipkin_host TEXT              DNS name or Zipkin API. [default: localhost]
  --zipkin_port INTEGER           Port for the Zipkin trace API.
                                  [default: 9411]
  --pdf_split_page_count INTEGER  Number of pages per PDF chunk for splitting
                                  (v2 api). [default: (none)]
  --version                       Show version.
  --help                          Show this message and exit.
For detailed task syntax (extract, split, caption, embed, udf, etc.), refer to the --task option in the parameter table and the examples below.
Examples
Use the following code examples to submit a document to the nemo-retriever-ms-runtime service.
Each of the following commands can be run from the host machine, or from within the nemo-retriever-ms-runtime container.
- Host: retriever ...
- Container: retriever ...
Example: Text File With No Splitting
To submit a text file with no splitting, run the following code.
Note
You receive a response that contains a single document, which is the entire text file. The data that is returned is wrapped in the appropriate metadata structure.
retriever \
--doc ./data/test.pdf \
--client_host=localhost \
--client_port=7670
Example: PDF File With Splitting Only
To submit a .pdf file with only a splitting task, run the following code.
retriever \
--doc ./data/test.pdf \
--output_directory ./processed_docs \
--task='split' \
--client_host=localhost \
--client_port=7670
Example: PDF File With Splitting and Extraction
To submit a .pdf file with both a splitting task and an extraction task, run the following code.
Note
Currently, split only works for pdfium and nemotron-parse.
retriever \
--doc ./data/test.pdf \
--output_directory ./processed_docs \
--task='extract:{"document_type": "pdf", "extract_method": "pdfium"}' \
--task='extract:{"document_type": "docx", "extract_method": "python_docx"}' \
--task='split' \
--client_host=localhost \
--client_port=7670
Example: PDF File With Custom Split Page Count
To submit a PDF file with a custom split page count, use the --pdf_split_page_count option.
This allows you to control how many pages are included in each PDF chunk during processing.
Note
The --pdf_split_page_count option requires using the V2 API (set via --api_version v2 or environment variable NEMO_RETRIEVER_API_VERSION=v2).
It accepts values between 1 and 128 pages per chunk (default is server default, typically 32).
Smaller chunks provide more parallelism but increase overhead, while larger chunks reduce overhead but limit concurrency.
retriever \
--doc ./data/test.pdf \
--output_directory ./processed_docs \
--task='extract:{"document_type": "pdf", "extract_method": "pdfium", "extract_text": "true"}' \
--pdf_split_page_count 64 \
--api_version v2 \
--client_host=localhost \
--client_port=7670
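The trade-off described in the note can be made concrete with a little arithmetic. The following Python sketch assumes that a PDF is cut into consecutive fixed-size page ranges with a shorter final chunk; the source does not spell out the server's exact splitting behavior, so treat the numbers as an estimate. The function name pdf_chunk_count is hypothetical.

```python
import math


def pdf_chunk_count(total_pages, pages_per_chunk):
    """Estimate how many chunks a PDF yields for a given --pdf_split_page_count value."""
    if not 1 <= pages_per_chunk <= 128:
        raise ValueError("pages_per_chunk must be between 1 and 128")
    # Assumption: fixed-size consecutive chunks, with a partial chunk at the end.
    return math.ceil(total_pages / pages_per_chunk)


# Estimated chunk counts for a 300-page PDF at several split sizes.
for pages_per_chunk in (16, 32, 64, 128):
    print(pages_per_chunk, pdf_chunk_count(300, pages_per_chunk))
# 16 -> 19 chunks, 32 -> 10, 64 -> 5, 128 -> 3
```

Under this assumption, a 300-page PDF at 64 pages per chunk offers at most 5 units of parallel work, while 16 pages per chunk offers 19 at the cost of more per-chunk overhead.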
Example: Caption Images With Reasoning Control
To invoke image captioning and control reasoning:
retriever \
--doc ./data/test.pdf \
--task='extract:{"document_type": "pdf", "extract_method": "pdfium", "extract_images": "true"}' \
--task='caption:{"prompt": "Caption the content of this image:", "reasoning": true}' \
--client_host=localhost \
--client_port=7670
- reasoning (boolean): Set to true to enable reasoning, false to disable it. Defaults to the service default (typically disabled).
- Ensure the VLM caption profile/service is running or pointing to the public build endpoint; otherwise the caption task will be skipped.
Tip
The caption service uses a default VLM which you can override by selecting other vision-language models to better match your image captioning needs. For more information, refer to Extract Captions from Images.
As an alternative to passing --api_version in the custom split page count example, you can set the API version with the NEMO_RETRIEVER_API_VERSION environment variable:
export NEMO_RETRIEVER_API_VERSION=v2
retriever \
--doc ./data/test.pdf \
--output_directory ./processed_docs \
--task='extract:{"document_type": "pdf", "extract_method": "pdfium", "extract_text": "true"}' \
--pdf_split_page_count 64 \
--client_host=localhost \
--client_port=7670
Example: Process a Dataset
To submit a dataset for processing, run the following code. To create a dataset, refer to Command Line Dataset Creation with Enumeration and Sampling.
retriever \
--dataset dataset.json \
--output_directory ./processed_docs \
--task='extract:{"document_type": "pdf", "extract_method": "pdfium"}' \
--client_host=localhost \
--client_port=7670
To submit a PDF file with extraction tasks and upload extracted images to MinIO, run the following code.
retriever \
--doc ./data/test.pdf \
--output_directory ./processed_docs \
--task='extract:{"document_type": "pdf", "extract_method": "pdfium"}' \
--client_host=localhost \
--client_port=7670
Command Line Dataset Creation with Enumeration and Sampling
The gen_dataset.py script samples files from a specified source directory according to defined proportions and a total size target.
It offers options for caching the file list, outputting a sampled file list, and validating the output.
python ./src/util/gen_dataset.py \
--source_directory=./data \
--size=1GB \
--sample pdf=60 \
--sample txt=40 \
--output_file dataset.json \
--validate-output
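For a quick test with a handful of known files, a dataset definition can also be written by hand. The following Python sketch relies only on what the parameter table states, namely that the dataset file is JSON with a sampled_files list. It does not reproduce the sampling logic of gen_dataset.py, the script's real output may contain additional keys, and the write_dataset helper name is hypothetical.

```python
import json
from pathlib import Path


def write_dataset(files, output_file):
    """Write a minimal dataset definition containing only a sampled_files list."""
    missing = [str(f) for f in files if not Path(f).is_file()]
    if missing:
        raise FileNotFoundError(f"Missing files: {missing}")
    # Assumption: sampled_files is a list of file paths; other keys are not covered here.
    dataset = {"sampled_files": [str(f) for f in files]}
    Path(output_file).write_text(json.dumps(dataset, indent=2), encoding="utf-8")
    return dataset


if __name__ == "__main__":
    # Collect every PDF under ./data (empty if the directory does not exist).
    pdfs = sorted(Path("./data").glob("*.pdf"))
    if pdfs:
        write_dataset(pdfs, "dataset.json")
```

The resulting file can then be passed to the CLI with --dataset dataset.json.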