Is this page helpful?

Use the NV-Ingest Command Line Interface

After you install the Python dependencies, you can use the NV-Ingest command line interface (CLI). To use the CLI, use the nv-ingest-cli command.

To check the version of the CLI that you have installed, run the following command.

nv-ingest-cli --version

To get a list of the current CLI commands and their options, run the following command.

nv-ingest-cli --help

Tip

There is a Jupyter notebook available to help you get started with the CLI. For more information, refer to CLI Client Quick Start Guide.

Examples

Use the following code examples to submit a document to the nv-ingest-ms-runtime service.

Each of the following commands can be run from the host machine, or from within the nv-ingest-ms-runtime container.

Host: nv-ingest-cli ...
Container: nv-ingest-cli ...

Example: Text File With No Splitting

To submit a text file with no splitting, run the following code.

Note

You receive a response that contains a single document, which is the entire text file. The data that is returned is wrapped in the appropriate metadata structure.

nv-ingest-cli \
  --doc ./data/test.pdf \
  --client_host=localhost \
  --client_port=7670

Example: PDF File With Splitting Only

To submit a .pdf file with only a splitting task, run the following code.

nv-ingest-cli \
  --doc ./data/test.pdf \
  --output_directory ./processed_docs \
  --task='split' \
  --client_host=localhost \
  --client_port=7670

Example: PDF File With Splitting and Extraction

To submit a .pdf file with both a splitting task and an extraction task, run the following code.

Note

Currently, split only works for pdfium, nemotron-parse, and Unstructured.io.

nv-ingest-cli \
  --doc ./data/test.pdf \
  --output_directory ./processed_docs \
  --task='extract:{"document_type": "pdf", "extract_method": "pdfium"}' \
  --task='extract:{"document_type": "docx", "extract_method": "python_docx"}' \
  --task='split' \
  --client_host=localhost \
  --client_port=7670

Example: PDF File With Custom Split Page Count

To submit a PDF file with a custom split page count, use the --pdf_split_page_count option. This allows you to control how many pages are included in each PDF chunk during processing.

Note

The --pdf_split_page_count option requires using the V2 API (set via --api_version v2 or environment variable NV_INGEST_API_VERSION=v2). It accepts values between 1 and 128 pages per chunk (default is server default, typically 32). Smaller chunks provide more parallelism but increase overhead, while larger chunks reduce overhead but limit concurrency.

nv-ingest-cli \
  --doc ./data/test.pdf \
  --output_directory ./processed_docs \
  --task='extract:{"document_type": "pdf", "extract_method": "pdfium", "extract_text": "true"}' \
  --pdf_split_page_count 64 \
  --api_version v2 \
  --client_host=localhost \
  --client_port=7670

Example: Caption images with reasoning control

To invoke image captioning and control reasoning:

nv-ingest-cli \
  --doc ./data/test.pdf \
  --task='extract:{"document_type": "pdf", "extract_method": "pdfium", "extract_images": "true"}' \
  --task='caption:{"prompt": "Caption the content of this image:", "reasoning": true}' \
  --client_host=localhost \
  --client_port=7670

reasoning (boolean): Set to true to enable reasoning, false to disable it. Defaults to service default (typically disabled).
Ensure the VLM caption profile/service is running or pointing to the public build endpoint; otherwise the caption task will be skipped.

Alternatively, you can use an environment variable to set the API version:

export NV_INGEST_API_VERSION=v2

nv-ingest-cli \
  --doc ./data/test.pdf \
  --output_directory ./processed_docs \
  --task='extract:{"document_type": "pdf", "extract_method": "pdfium", "extract_text": "true"}' \
  --pdf_split_page_count 64 \
  --client_host=localhost \
  --client_port=7670

Example: Process a Dataset

To submit a dataset for processing, run the following code. To create a dataset, refer to Command Line Dataset Creation with Enumeration and Sampling.

nv-ingest-cli \
  --dataset dataset.json \
  --output_directory ./processed_docs \
  --task='extract:{"document_type": "pdf", "extract_method": "pdfium"}' \
  --client_host=localhost \
  --client_port=7670

Submit a PDF file with extraction tasks and upload extracted images to MinIO.

nv-ingest-cli \
  --doc ./data/test.pdf \
  --output_directory ./processed_docs \
  --task='extract:{"document_type": "pdf", "extract_method": "pdfium"}' \
  --client_host=localhost \
  --client_port=7670

Command Line Dataset Creation with Enumeration and Sampling

The gen_dataset.py script samples files from a specified source directory according to defined proportions and a total size target. It offers options for caching the file list, outputting a sampled file list, and validating the output.

python ./src/util/gen_dataset.py --source_directory=./data --size=1GB --sample pdf=60 --sample txt=40 --output_file \
  dataset.json --validate-output