Is this page helpful?

CLI Reference

After you install the Python dependencies, you can use the NeMo Retriever Library command line interface (CLI). To use the CLI, use the retriever command.

Command name

Depending on your installation (NeMo Retriever Library vs. nv-ingest-client), you invoke the CLI by using retriever or nv-ingest-cli. Both expose the same options and behavior. The following sections use retriever for consistency with the examples.

To check the version of the CLI that you have installed, run the following command.

retriever --version

To get a list of the current CLI commands and their options, run the following command.

retriever --help

Tip

There is a Jupyter notebook available to help you get started with the CLI. For more information, refer to CLI Client Quick Start Guide.

Parameter Reference

The following table lists all CLI options.

* At least one of --doc or --dataset must be provided; otherwise there are no files to process.

Flag	Aliases	Type	Default	Required	Description
`--doc`	—	path (multiple)	none	No	Path to a document to process. Can be specified multiple times. Files must exist. Supports glob-style patterns.
`--dataset`	—	path	none	No	Path to a dataset definition file (JSON with `sampled_files` list).
`--output_directory`	—	path	none	No	Directory where result metadata and optional media are written. If omitted, results are not saved to disk.
`--task`	—	string (multiple)	none	No	Task definition in `task_id:{"key":"value"}` format. Repeat for multiple tasks (e.g. extract, split, caption).
`--client_host`	—	string	`localhost`	No	Hostname or IP of the ingest service.
`--client_port`	—	int	`7670`	No	Port of the ingest service.
`--api_version`	—	enum	`v2`	No	API version: `v1` or `v2`. Required for `--pdf_split_page_count`.
`--pdf_split_page_count`	—	int	none	No	Pages per PDF chunk when splitting (V2 API). Typically 1–128; server default if unset.
`--client_type`	—	enum	`rest`	No	Client transport: `rest` or `simple`.
`--client_kwargs`	—	string (JSON)	`{}`	No	Extra JSON object passed to the client.
`--batch_size`	—	int	`10`	No	The number of in-flight jobs. This value must be greater than or equal to 1.
`--concurrency_n`	—	int	`10`	No	Number of concurrent jobs to maintain.
`--log_level`	—	enum	`INFO`	No	Logging level: `DEBUG`, `INFO`, `WARNING`, `ERROR`, `CRITICAL`.
`--dry_run`	—	flag	false	No	Do not run the pipeline; only validate and log what would be done.
`--fail_on_error`	—	flag	false	No	Stop on first job error instead of continuing.
`--save_images_separately`	—	flag	false	No	Write extracted images to disk under `output_directory` and set `content_url` in metadata.
`--shuffle_dataset`	—	flag	true	No	Shuffle file list from `--dataset` before processing.
`--collect_profiling_traces`	—	flag	false	No	After the run, fetch Zipkin traces for submitted jobs and write them under `output_directory`.
`--zipkin_host`	—	string	`localhost`	No	Host for Zipkin API (used when `--collect_profiling_traces` is set).
`--zipkin_port`	—	int	`9411`	No	Port for Zipkin API.
`--version`	—	flag	—	No	Print nv-ingest and nv-ingest-cli versions and exit.

Output Format and Output_Directory Layout

Output Format

Metadata: When --output_directory is set, the CLI writes JSON files. Each result is a JSON structure that follows the content metadata schema: content, content_url, source_metadata, content_metadata, and type-specific blocks (text_metadata, image_metadata, table_metadata, etc.).
Streaming / stdout: Progress and telemetry are written to stderr (e.g. progress bar, timing). No structured result stream is written to stdout unless you use a different mode (e.g. streaming APIs) not covered by this CLI.

Output_Directory Structure

When --output_directory DIR is used, the CLI creates:

DIR/<content_type>/<source_basename>.metadata.json
One JSON file per source document, grouped by content type (text, image, structured, etc.). Each file contains an array of document objects (one per extracted chunk/element) with the full metadata structure.
DIR/<content_type>/media/ (optional)
If --save_images_separately is set, extracted images are saved here as files (e.g. <source_basename>_0.png). Metadata JSON then references them via content_url instead of inline base64.
DIR/zipkin_profiles/ (optional)
If --collect_profiling_traces is used, Zipkin trace JSON files are written here for profiling analysis.

Parsing Results Programmatically

Read each *.metadata.json file under DIR and parse as JSON. Each file is an array of objects; each object has a metadata key with the schema described in Content Metadata.
To iterate over all extracted items for a single run, walk the subdirectories of output_directory and load every *.metadata.json; then iterate over the array elements in each file.

Errors and Exit Codes

The CLI does not define custom exit codes for every case. In general:

Exit 0: All requested files were processed successfully (or --dry_run completed).
Non-zero: Validation failed, a runtime error occurred, or the process was interrupted.

Common errors and how they appear:

Condition	Message / behavior	Exit
Connection refused	Client cannot reach `--client_host`:`--client_port`. Log message like `Error: ...` or connection-related exception.	Non-zero
Missing input	Neither `--doc` nor `--dataset` provided, or no files matched.	Non-zero
File does not exist	`--doc` or `--dataset` path missing.	Non-zero (Click: "File does not exist: <path>")
Invalid task format	`--task` value not in `task_id:{"options"}` form or invalid JSON.	Non-zero (Click: "Invalid JSON format for task '...' ..." or "Unsupported task type: ...")
Unsupported document type	Task expects a document type not supported by the server or extractor.	Non-zero (server/validation error)
Batch size < 1	`--batch_size` < 1.	Non-zero (Click: "Batch size must be >= 1.")
Invalid boolean in extract task	e.g. `extract_text` not true/false, 1/0, yes/no.	Non-zero (ValueError: "Invalid boolean value for ...")
UDF validation	Missing `target_stage` or `phase`, or both specified.	Non-zero (ValueError from task validation)
Timeout	Job did not complete within the client’s timeout; may be retried internally.	Depends on retries / `--fail_on_error`

Running with --fail_on_error causes the process to exit on the first job failure; otherwise the CLI may continue and report failures at the end.

Complete --help Output

The following is the standard help output for the CLI (equivalent to retriever --help or nv-ingest-cli --help). Use it as a quick reference when you cannot run the command locally.

Usage: retriever [OPTIONS]

Options:
  --batch_size INTEGER          Batch size (must be >= 1).  [default: 10]
  --doc PATH                     Add a new document to be processed
                                 (supports multiple).  [default: (none)]
  --dataset PATH                 Path to a dataset definition file.
                                 [default: (none)]
  --client_host TEXT             DNS name or URL for the endpoint.
                                 [default: localhost]
  --client_port INTEGER           Port for the client endpoint.  [default: 7670]
  --client_kwargs TEXT           Additional arguments to pass to the client.
                                 [default: {}]
  --api_version [v1|v2]          API version to use (v1 or v2). V2 required
                                 for PDF split page count feature.
                                 [default: v2]
  --client_type [rest|simple]    Client type used to connect to the ingest
                                 service.  [default: rest]
  --concurrency_n INTEGER        Number of inflight jobs to maintain at one
                                 time.  [default: 10]
  --dry_run                      Perform a dry run without executing actions.
  --fail_on_error                Fail on error.
  --output_directory PATH        Output directory for results.
                                 [default: (none)]
  --log_level [DEBUG|INFO|WARNING|ERROR|CRITICAL]
                                  Log level.  [default: INFO]
  --save_images_separately       Save images separately from returned
                                 metadata.
  --shuffle_dataset / --no-shuffle_dataset
                                  Shuffle the dataset before processing.
                                 [default: True]
  --task TEXT                    Task definition in
                                 '[task_id]:{json_options}' format (repeatable).
  --collect_profiling_traces     Collect Zipkin traces for submitted jobs
                                 into output_directory.
  --zipkin_host TEXT             DNS name or Zipkin API.  [default: localhost]
  --zipkin_port INTEGER          Port for the Zipkin trace API.
                                 [default: 9411]
  --pdf_split_page_count INTEGER Number of pages per PDF chunk for splitting
                                 (v2 api).  [default: (none)]
  --version                      Show version.
  --help                         Show this message and exit.

For detailed task syntax (extract, split, caption, embed, udf, etc.), refer to the --task option in the parameter table and the examples below.

Examples

Use the following code examples to submit a document to the nemo-retriever-ms-runtime service.

Each of the following commands can be run from the host machine, or from within the nemo-retriever-ms-runtime container.

Host: retriever ...
Container: retriever ...

Example: Text File With No Splitting

To submit a text file with no splitting, run the following code.

Note

You receive a response that contains a single document, which is the entire text file. The data that is returned is wrapped in the appropriate metadata structure.

retriever \
  --doc ./data/test.pdf \
  --client_host=localhost \
  --client_port=7670

Example: PDF File With Splitting Only

To submit a .pdf file with only a splitting task, run the following code.

retriever \
  --doc ./data/test.pdf \
  --output_directory ./processed_docs \
  --task='split' \
  --client_host=localhost \
  --client_port=7670

Example: PDF File With Splitting and Extraction

To submit a .pdf file with both a splitting task and an extraction task, run the following code.

Note

Currently, split only works for pdfium and nemotron-parse.

retriever \
  --doc ./data/test.pdf \
  --output_directory ./processed_docs \
  --task='extract:{"document_type": "pdf", "extract_method": "pdfium"}' \
  --task='extract:{"document_type": "docx", "extract_method": "python_docx"}' \
  --task='split' \
  --client_host=localhost \
  --client_port=7670

Example: PDF File With Custom Split Page Count

To submit a PDF file with a custom split page count, use the --pdf_split_page_count option. This allows you to control how many pages are included in each PDF chunk during processing.

Note

The --pdf_split_page_count option requires using the V2 API (set via --api_version v2 or environment variable NEMO_RETRIEVER_API_VERSION=v2). It accepts values between 1 and 128 pages per chunk (default is server default, typically 32). Smaller chunks provide more parallelism but increase overhead, while larger chunks reduce overhead but limit concurrency.

retriever \
  --doc ./data/test.pdf \
  --output_directory ./processed_docs \
  --task='extract:{"document_type": "pdf", "extract_method": "pdfium", "extract_text": "true"}' \
  --pdf_split_page_count 64 \
  --api_version v2 \
  --client_host=localhost \
  --client_port=7670

Example: Caption images with reasoning control

To invoke image captioning and control reasoning:

retriever \
  --doc ./data/test.pdf \
  --task='extract:{"document_type": "pdf", "extract_method": "pdfium", "extract_images": "true"}' \
  --task='caption:{"prompt": "Caption the content of this image:", "reasoning": true}' \
  --client_host=localhost \
  --client_port=7670

reasoning (boolean): Set to true to enable reasoning, false to disable it. Defaults to service default (typically disabled).
Ensure the VLM caption profile/service is running or pointing to the public build endpoint; otherwise the caption task will be skipped.

Tip

The caption service uses a default VLM which you can override by selecting other vision-language models to better match your image captioning needs. For more information, refer to Extract Captions from Images.

Alternatively, you can use an environment variable to set the API version:

export NEMO_RETRIEVER_API_VERSION=v2

retriever \
  --doc ./data/test.pdf \
  --output_directory ./processed_docs \
  --task='extract:{"document_type": "pdf", "extract_method": "pdfium", "extract_text": "true"}' \
  --pdf_split_page_count 64 \
  --client_host=localhost \
  --client_port=7670

Example: Process a Dataset

To submit a dataset for processing, run the following code. To create a dataset, refer to Command Line Dataset Creation with Enumeration and Sampling.

retriever \
  --dataset dataset.json \
  --output_directory ./processed_docs \
  --task='extract:{"document_type": "pdf", "extract_method": "pdfium"}' \
  --client_host=localhost \
  --client_port=7670

Submit a PDF file with extraction tasks and upload extracted images to MinIO.

retriever \
  --doc ./data/test.pdf \
  --output_directory ./processed_docs \
  --task='extract:{"document_type": "pdf", "extract_method": "pdfium"}' \
  --client_host=localhost \
  --client_port=7670

Command Line Dataset Creation with Enumeration and Sampling

The gen_dataset.py script samples files from a specified source directory according to defined proportions and a total size target. It offers options for caching the file list, outputting a sampled file list, and validating the output.

python ./src/util/gen_dataset.py --source_directory=./data --size=1GB --sample pdf=60 --sample txt=40 --output_file \
  dataset.json --validate-output