API Reference for NVIDIA NIM for Object Detection#

This documentation contains the API reference for NVIDIA NIM for Object Detection.

OpenAPI Specification#

You can download the complete page-elements API spec, graphic-elements API spec, and table-structure API spec.

API Examples#

The Object Detection NIM supports multiple models for page element, table structure, and graphic element detection. This section provides examples that use the Page Elements NIM, but the API is the same for all models.

See the Table Structure Example and Graphic Element Example sections for sample images and the output to expect when you use those NIMs instead.

Compute Bounding Boxes#

The v1/infer endpoint accepts multiple images and returns a list of bounding boxes for each image. The bounding box coordinates are defined with respect to the top-left corner of the image. This means that:

  • x: Represents the horizontal distance from the left edge of the image to the left edge of the bounding box.

  • y: Represents the vertical distance from the top edge of the image to the top edge of the bounding box.
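As the sample responses later on this page show, the returned coordinates are normalized to the range 0 to 1, so converting a box to pixel coordinates is a matter of scaling by the image dimensions. The following sketch illustrates the conversion; the helper name and the 1000 x 800 image size are illustrative:

```python
def to_pixels(box, width, height):
    """Convert a normalized bounding box to pixel coordinates (truncated to ints)."""
    return {
        "x_min": int(box["x_min"] * width),
        "y_min": int(box["y_min"] * height),
        "x_max": int(box["x_max"] * width),
        "y_max": int(box["y_max"] * height),
    }

# Example: one of the normalized boxes from the sample response, on a 1000 x 800 image
print(to_pixels({"x_min": 0.36, "y_min": 0.2616, "x_max": 0.4907, "y_max": 0.3881}, 1000, 800))
# → {'x_min': 360, 'y_min': 209, 'x_max': 490, 'y_max': 310}
```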

The only supported type is image_url.

Each image must be base64 encoded, and should be represented in the following JSON format. The supported image formats are png and jpeg.

{
  "type": "image_url",
  "url": "data:image/<IMAGE_FORMAT>;base64,<BASE64_ENCODED_IMAGE>"
}

An inference request has an entry for input. The value for input is an array of dictionaries that contain fields type and url. For example, a JSON payload of three images looks like the following:

{
  "input": [
    {
      "type": "image_url",
      "url": "data:image/png;base64,<BASE64_ENCODED_IMAGE>"
    },
    {
      "type": "image_url",
      "url": "data:image/png;base64,<BASE64_ENCODED_IMAGE>"
    },
    {
      "type": "image_url",
      "url": "data:image/png;base64,<BASE64_ENCODED_IMAGE>"
    }
  ]
}
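A minimal sketch of assembling this payload from local files in Python; the helper names are illustrative, not part of the API:

```python
import base64

def make_image_entry(path, image_format="png"):
    """Read a local image file and wrap it in the input-entry format."""
    with open(path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("utf-8")
    return {"type": "image_url", "url": f"data:image/{image_format};base64,{encoded}"}

def build_payload(paths):
    """Build the full inference payload from a list of image paths."""
    return {"input": [make_image_entry(p) for p in paths]}
```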

cURL Example

API_ENDPOINT="http://localhost:8000"

# Create JSON payload with base64 encoded image
# Set your image source - can be a URL or a local file path
IMAGE_SOURCE="https://assets.ngc.nvidia.com/products/api-catalog/nemo-retriever/object-detection/page-elements-example-1.jpg"
# IMAGE_SOURCE="path/to/your/image.jpg"  # Uncomment to use a local file instead

# Encode the image to base64 (handles both URLs and local files)
if [[ $IMAGE_SOURCE == http* ]]; then
  # Handle URL
  BASE64_IMAGE=$(curl -s ${IMAGE_SOURCE} | base64 -w 0)
else
  # Handle local file
  BASE64_IMAGE=$(base64 -w 0 ${IMAGE_SOURCE})
fi

# Construct the full JSON payload
JSON_PAYLOAD='{
  "input": [{
    "type": "image_url",
    "url": "data:image/jpeg;base64,'${BASE64_IMAGE}'"
  }]
}'

# Send POST request to inference endpoint
echo "${JSON_PAYLOAD}" | \
  curl -X POST "${API_ENDPOINT}/v1/infer" \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d @-

The following image is used as the input in the previous example.

_images/page-elements-example-input-1.png

The following JSON response shows the output from the inference API. The response contains a bounding box for each element, such as tables and charts, that was detected. For each bounding box the response includes the coordinates x_min, y_min, x_max, and y_max, and a confidence score.

Note

Each detected element includes a confidence score between 0 and 1. In production applications, you might want to filter results based on a minimum confidence threshold (for example, 0.5) to reduce false positives.

{
  "data": [
    {
      "index": 0,
      "bounding_boxes": {
        "table": [
          {
            "x_min": 0.36,
            "y_min": 0.2616,
            "x_max": 0.4907,
            "y_max": 0.3881,
            "confidence": 0.6416
          },
          {
            "x_min": 0.505,
            "y_min": 0.2287,
            "x_max": 0.6356,
            "y_max": 0.3538,
            "confidence": 0.5757
          },
          {
            "x_min": 0.2437,
            "y_min": 0.7994,
            "x_max": 0.7526,
            "y_max": 0.8382,
            "confidence": 0.4475
          },
          {
            "x_min": 0.6518,
            "y_min": 0.1928,
            "x_max": 0.7821,
            "y_max": 0.3258,
            "confidence": 0.4405
          },
          {
            "x_min": 0.2156,
            "y_min": 0.3202,
            "x_max": 0.3488,
            "y_max": 0.438,
            "confidence": 0.2427
          }
        ],
        "chart": [
          {
            "x_min": 0.2133,
            "y_min": 0.548,
            "x_max": 0.7816,
            "y_max": 0.8542,
            "confidence": 0.8397
          }
        ],
        "title": [
          {
            "x_min": 0.2384,
            "y_min": 0.1365,
            "x_max": 0.7192,
            "y_max": 0.1926,
            "confidence": 0.5737
          }
        ]
      }
    }
  ],
  "usage": {
    "images_size_mb": 0.10183906555175781
  }
}
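Following the earlier note about confidence thresholds, the response can be filtered before further processing. This is a sketch; the 0.5 threshold is only an example value:

```python
def filter_boxes(response, threshold=0.5):
    """Keep only bounding boxes at or above the given confidence threshold."""
    filtered = []
    for item in response["data"]:
        kept = {
            element_type: [b for b in boxes if b["confidence"] >= threshold]
            for element_type, boxes in item["bounding_boxes"].items()
        }
        filtered.append({"index": item["index"], "bounding_boxes": kept})
    return filtered
```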

The following image shows the input image with the bounding boxes overlaid to visualize the detected page elements.

_images/page-elements-example-output-1.png

Python Example

The following Python code demonstrates how to detect page elements and visualize the results.

Note

This example requires the requests and Pillow libraries. You can install them by using pip (or your preferred package manager). For example: pip install requests Pillow

import requests
import base64
import json
import io
from PIL import Image, ImageDraw

def encode_image(image_source):
    """
    Encode an image to base64 data URL.

    Args:
        image_source: A URL or a local file path

    Returns:
        A base64-encoded data URL
    """
    # Check if the source is a URL or local file
    if image_source.startswith(('http://', 'https://')):
        # Handle remote URL
        response = requests.get(image_source)
        response.raise_for_status()
        image_bytes = response.content
    else:
        # Handle local file
        with open(image_source, 'rb') as f:
            image_bytes = f.read()

    # Encode to base64 and pick the MIME type from the file extension
    media_type = "png" if image_source.lower().endswith(".png") else "jpeg"
    base64_image = base64.b64encode(image_bytes).decode('utf-8')
    return f"data:image/{media_type};base64,{base64_image}"


def detect_elements(image_data_url, api_endpoint):
    """
    Detect page elements in an image using the Page Elements NIM API.

    Args:
        image_data_url: Data URL of the image to process
        api_endpoint: Base URL of the NIM service

    Returns:
        API response dict
    """
    # Prepare payload
    payload = {
        "input": [{
            "type": "image_url",
            "url": image_data_url,
        }]
    }

    # Make inference request
    url = f"{api_endpoint}/v1/infer"
    headers = {
        'accept': 'application/json',
        'Content-Type': 'application/json'
    }

    response = requests.post(url, headers=headers, json=payload)
    response.raise_for_status()
    return response.json()


def visualize_detections(image_data, result, output_path):
    """Draw bounding boxes on the image based on API results."""
    # Load image from data URL or URL
    if image_data.startswith('data:'):
        # Extract base64 data after the comma
        b64_data = image_data.split(',')[1]
        image_bytes = base64.b64decode(b64_data)
        image = Image.open(io.BytesIO(image_bytes))
    else:
        # Download from URL
        response = requests.get(image_data)
        image = Image.open(io.BytesIO(response.content))

    draw = ImageDraw.Draw(image)

    # Get image dimensions
    width, height = image.size

    # Define colors for different element types
    colors = {
        "table": "red",
        "chart": "green",
        "title": "blue"
    }

    # Draw detected elements
    for detection in result["data"]:
        for element_type, boxes in detection["bounding_boxes"].items():
            color = colors.get(element_type, "yellow")
            for box in boxes:
                # Convert normalized coordinates to pixels
                x1 = int(box["x_min"] * width)
                y1 = int(box["y_min"] * height)
                x2 = int(box["x_max"] * width)
                y2 = int(box["y_max"] * height)

                # Draw rectangle
                draw.rectangle([x1, y1, x2, y2], outline=color, width=3)

                # Add label with confidence
                label = f"{element_type}: {box['confidence']:.2f}"
                draw.text((x1, y1-15), label, fill=color)

    # Save the annotated image
    image.save(output_path)
    print(f"Annotated image saved to {output_path}")


# Example usage
if __name__ == "__main__":
    # Process the sample image
    image_source = "https://assets.ngc.nvidia.com/products/api-catalog/nemo-retriever/object-detection/page-elements-example-1.jpg"
    # Also works with local files
    # image_source = "path/to/your/image.jpg"
    api_endpoint = "http://localhost:8000"
    output_path = "detected_page_elements.jpg"

    try:
        # Encode the image
        image_data_url = encode_image(image_source)

        # Detect elements
        result = detect_elements(image_data_url, api_endpoint)
        print(json.dumps(result, indent=2))

        # Visualize the results
        visualize_detections(image_data_url, result, output_path)

    except requests.exceptions.RequestException as e:
        print(f"API request failed: {e}")
    except Exception as e:
        print(f"Error: {e}")

Table Structure Example#

API_ENDPOINT="http://localhost:8000"

# Create JSON payload with base64 encoded image
IMAGE_SOURCE="https://assets.ngc.nvidia.com/products/api-catalog/nemo-retriever/object-detection/table-structure-example-1.png"
# IMAGE_SOURCE="path/to/your/image.jpg"  # Uncomment to use a local file instead

# Encode the image to base64 (handles both URLs and local files)
if [[ $IMAGE_SOURCE == http* ]]; then
  # Handle URL
  BASE64_IMAGE=$(curl -s ${IMAGE_SOURCE} | base64 -w 0)
else
  # Handle local file
  BASE64_IMAGE=$(base64 -w 0 ${IMAGE_SOURCE})
fi

# Construct the full JSON payload
JSON_PAYLOAD='{
  "input": [{
    "type": "image_url",
    "url": "data:image/jpeg;base64,'${BASE64_IMAGE}'"
  }]
}'

# Send POST request to inference endpoint
echo "${JSON_PAYLOAD}" | \
  curl -X POST "${API_ENDPOINT}/v1/infer" \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d @-

The following image is used as the input in the previous example.

_images/table-structure-example-1-input.png

Using the same visualization steps from earlier, the following images show the bounding boxes overlaid on the input image, with the detected table structure elements visualized by cell, row, and column.

_images/table-structure-example-1-output-cell.png _images/table-structure-example-1-output-row.png _images/table-structure-example-1-output-col.png

Graphic Element Example#

API_ENDPOINT="http://localhost:8000"

# Create JSON payload with base64 encoded image
IMAGE_SOURCE="https://assets.ngc.nvidia.com/products/api-catalog/nemo-retriever/object-detection/graphic-elements-example-1.jpg"
# IMAGE_SOURCE="path/to/your/image.jpg"  # Uncomment to use a local file instead

# Encode the image to base64 (handles both URLs and local files)
if [[ $IMAGE_SOURCE == http* ]]; then
  # Handle URL
  BASE64_IMAGE=$(curl -s ${IMAGE_SOURCE} | base64 -w 0)
else
  # Handle local file
  BASE64_IMAGE=$(base64 -w 0 ${IMAGE_SOURCE})
fi

# Construct the full JSON payload
JSON_PAYLOAD='{
  "input": [{
    "type": "image_url",
    "url": "data:image/jpeg;base64,'${BASE64_IMAGE}'"
  }]
}'

# Send POST request to inference endpoint
echo "${JSON_PAYLOAD}" | \
  curl -X POST "${API_ENDPOINT}/v1/infer" \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d @-

The following image is used as the input in the previous example.

_images/graphic-element-example-1-input.jpeg

Using the same visualization steps from earlier, the following image shows the bounding boxes overlaid to visualize the detected graphic elements.

_images/graphic-element-example-1-output.png

Error Handling#

When you use the NVIDIA NIM for Object Detection APIs, you might encounter various errors. Understanding these errors helps you troubleshoot issues in your applications.

Common Error Responses#

| Status Code | Error Type | Description | Resolution |
| --- | --- | --- | --- |
| 422 (Unprocessable Entity) | Invalid image URL format | The image URL doesn’t follow the required data URL format | Ensure all URLs follow the pattern: data:<image-media-type>;base64,<base64-image-data> |
| 422 (Unprocessable Entity) | Invalid base64 content | The base64-encoded data in the URL is invalid | Verify that your base64 encoding process is correct and that the image data is not corrupted |
| 422 (Unprocessable Entity) | Malformed request | The JSON payload structure is incorrect | Verify that your request format matches the API specification |
| 429 (Too Many Requests) | Request queue full | The number of concurrent requests exceeds the configured queue size | Reduce the request rate, or increase the queue size by using NIM_TRITON_MAX_QUEUE_SIZE |
| 500 (Internal Server Error) | Server error | An unexpected error occurred during processing | Check server logs for details and report the issue if it persists |
| 503 (Service Unavailable) | Service not ready | The service is still initializing or loading models | Check the health endpoints and wait for the service to complete initialization |

Error Response Example#

{
  "error": "One or more images in the request contain an invalid image URL. Ensure that all URLs are data URLs with an image media type and base64-encoded image data. The pattern for this is 'data:<image-media-type>;base64,<base64-image-data>'."
}

Troubleshooting Tips#

  1. Invalid Image Format: The API supports PNG and JPEG formats. Ensure your images are in one of these formats before encoding.

  2. Image Size Limits: Very large images may cause processing issues. Consider resizing large images before sending them to the API.

  3. Service Health: Use the health check endpoints (/v1/health/live and /v1/health/ready) to verify the service is operational before sending inference requests.

  4. Base64 Encoding: When encoding images, ensure you’re using the correct MIME type in the data URL:

    • For JPEG: data:image/jpeg;base64,...

    • For PNG: data:image/png;base64,...

  5. Request Timeout: If requests are timing out, the model may be processing a large batch or complex images. Consider adjusting timeout settings in your client application.

  6. Rate Limiting: If you’re receiving 429 errors, implement backoff strategies in your client application to handle rate limiting gracefully.
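For item 6, a common backoff strategy is exponential backoff with full jitter, retrying only on 429 and 503. The sketch below is one possible shape, not part of the NIM API: send is any zero-argument callable that returns (status_code, body), for example a wrapper around requests.post against your /v1/infer endpoint, and the base and cap values are illustrative:

```python
import random
import time

def backoff_delay(attempt, base=0.5, cap=30.0):
    """Full-jitter exponential backoff: random delay in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))

def call_with_retries(send, max_attempts=5, retry_statuses=(429, 503)):
    """Call send() until it returns a status outside retry_statuses, backing off between tries."""
    for attempt in range(max_attempts):
        status, body = send()
        if status not in retry_statuses:
            return status, body
        time.sleep(backoff_delay(attempt))
    raise RuntimeError(f"giving up after {max_attempts} attempts")
```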

Health Check#

cURL Request

Use the following command to query the health endpoints.

HOSTNAME="localhost"
SERVICE_PORT=8000

# Readiness check
curl "http://${HOSTNAME}:${SERVICE_PORT}/v1/health/ready" \
  -H 'Accept: application/json'

# Liveness check
curl "http://${HOSTNAME}:${SERVICE_PORT}/v1/health/live" \
  -H 'Accept: application/json'

Response

The readiness endpoint returns the following:

{
  "ready": true
}

The liveness endpoint returns the following:

{
  "live": true
}
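A startup script can poll the readiness endpoint before sending inference traffic. The sketch below uses only the Python standard library; the timeout and interval values are illustrative:

```python
import json
import time
import urllib.error
import urllib.request

def is_ready(body):
    """Interpret the /v1/health/ready response body."""
    return body.get("ready") is True

def wait_until_ready(endpoint, timeout=120.0, interval=2.0):
    """Poll the readiness endpoint until the service reports ready or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{endpoint}/v1/health/ready", timeout=5) as resp:
                if is_ready(json.load(resp)):
                    return True
        except (urllib.error.URLError, ValueError):
            pass  # service still starting, or response not yet valid JSON
        time.sleep(interval)
    return False
```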

OpenAPI Reference for Page Elements#

The following is the OpenAPI reference for NVIDIA NIM for Page Elements.

OpenAPI Reference for Graphic Elements#

The following is the OpenAPI reference for NVIDIA NIM for Graphic Elements.

OpenAPI Reference for Table Structure#

The following is the OpenAPI reference for NVIDIA NIM for Table Structure.