Use the NV-Ingest Command Line Interface

After you install the Python dependencies, you can use the NV-Ingest command line interface (CLI) by running the nv-ingest-cli command.

To check the version of the CLI that you have installed, run the following command.

nv-ingest-cli --version

To get a list of the current CLI commands and their options, run the following command.

nv-ingest-cli --help

Tip

A Jupyter notebook is available to help you get started with the CLI. For more information, refer to the CLI Client Quick Start Guide.

Examples

Use the following code examples to submit a document to the nv-ingest-ms-runtime service.

You can run each of the following commands from the host machine or from within the nv-ingest-ms-runtime container.

Host: nv-ingest-cli ...
Container: nv-ingest-cli ...

Example: Text File With No Splitting

To submit a text file with no splitting, run the following code.

Note

You receive a response that contains a single document, which is the entire text file. The data that is returned is wrapped in the appropriate metadata structure.

nv-ingest-cli \
  --doc ./data/test.pdf \
  --client_host=localhost \
  --client_port=7670

Example: PDF File With Splitting Only

To submit a .pdf file with only a splitting task, run the following code.

nv-ingest-cli \
  --doc ./data/test.pdf \
  --output_directory ./processed_docs \
  --task='split' \
  --client_host=localhost \
  --client_port=7670

Example: PDF File With Splitting and Extraction

To submit a .pdf file with both a splitting task and an extraction task, run the following code.

Note

This example currently works only with pdfium, nemoretriever_parse, and Unstructured.io. Haystack, Adobe, and LlamaParse have existing workflows, but they have not been fully converted to use the unified metadata schema.

nv-ingest-cli \
  --doc ./data/test.pdf \
  --output_directory ./processed_docs \
  --task='extract:{"document_type": "pdf", "extract_method": "pdfium"}' \
  --task='extract:{"document_type": "docx", "extract_method": "python_docx"}' \
  --task='split' \
  --client_host=localhost \
  --client_port=7670
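
When a submission like the one above is scripted, the repeated --task flags can be generated from a list. The following is a minimal Python sketch, assuming the CLI is launched as a subprocess. The helper name build_ingest_command and its parameters are hypothetical and are not part of NV-Ingest; the sketch uses only the flags shown in the preceding example, and it prints the command instead of running it.

```python
import json
import shlex


def build_ingest_command(doc, output_directory, tasks,
                         client_host="localhost", client_port=7670):
    """Assemble an nv-ingest-cli argument list (hypothetical helper).

    Each entry in `tasks` is either a bare task name such as "split",
    or a (name, options) pair whose options are serialized as JSON.
    """
    cmd = ["nv-ingest-cli", "--doc", doc,
           "--output_directory", output_directory]
    for task in tasks:
        if isinstance(task, str):
            cmd.append(f"--task={task}")
        else:
            name, options = task
            cmd.append(f"--task={name}:{json.dumps(options)}")
    cmd.append(f"--client_host={client_host}")
    cmd.append(f"--client_port={client_port}")
    return cmd


command = build_ingest_command(
    "./data/test.pdf",
    "./processed_docs",
    [
        ("extract", {"document_type": "pdf", "extract_method": "pdfium"}),
        ("extract", {"document_type": "docx", "extract_method": "python_docx"}),
        "split",
    ],
)

# Print a shell-quoted form for review. To run it, pass `command`
# to subprocess.run(command, check=True) instead.
print(shlex.join(command))
```

Building the arguments as a list avoids manual shell quoting of the JSON task options.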

Example: Process a Dataset

To submit a dataset for processing, run the following code. To create a dataset, refer to Command Line Dataset Creation with Enumeration and Sampling.

nv-ingest-cli \
  --dataset dataset.json \
  --output_directory ./processed_docs \
  --task='extract:{"document_type": "pdf", "extract_method": "pdfium"}' \
  --client_host=localhost \
  --client_port=7670

Example: Upload Extracted Images to MinIO

To submit a PDF file with extraction tasks and upload extracted images to MinIO, run the following code.

nv-ingest-cli \
  --doc ./data/test.pdf \
  --output_directory ./processed_docs \
  --task='extract:{"document_type": "pdf", "extract_method": "pdfium"}' \
  --client_host=localhost \
  --client_port=7670

Command Line Dataset Creation with Enumeration and Sampling

The gen_dataset.py script samples files from a specified source directory according to defined proportions and a total size target. It offers options for caching the file list, outputting a sampled file list, and validating the output.

python ./src/util/gen_dataset.py \
  --source_directory=./data \
  --size=1GB \
  --sample pdf=60 \
  --sample txt=40 \
  --output_file dataset.json \
  --validate-output
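
To make the sampling idea concrete, here is a minimal Python sketch of proportional sampling toward a total size target. It is an illustration only and does not reproduce the logic of gen_dataset.py: the function name sample_by_proportion, the (path, size) input format, and the rule of skipping files that exceed a type's budget are all assumptions.

```python
import random


def sample_by_proportion(files, proportions, total_bytes, seed=0):
    """Pick files per type until each type's share of the byte budget is met.

    files: list of (path, size_in_bytes) tuples (hypothetical input format).
    proportions: mapping of extension to percentage, e.g. {"pdf": 60, "txt": 40}.
    """
    rng = random.Random(seed)
    selected = []
    for ext, percent in proportions.items():
        budget = total_bytes * percent / 100
        candidates = [f for f in files if f[0].lower().endswith("." + ext)]
        rng.shuffle(candidates)
        used = 0
        for path, size in candidates:
            # Skip any file that would push this type past its budget.
            if used + size > budget:
                continue
            selected.append(path)
            used += size
    return selected


files = [("a.pdf", 300), ("b.pdf", 300), ("c.pdf", 300),
         ("d.txt", 200), ("e.txt", 200), ("f.txt", 200)]
picked = sample_by_proportion(files, {"pdf": 60, "txt": 40}, total_bytes=1000)
print(picked)
```

With a 1000-byte target, the 60/40 split gives PDF files a 600-byte budget and text files a 400-byte budget, so the sketch selects files of each type until the next one would exceed that type's share.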