Grounding DINO (GDINO)
Overview
Grounding DINO supports open-vocabulary object detection with an unlimited set of object categories. It also offers capabilities beyond traditional object detectors through spatial and contextual understanding of images and video. Examples include:
Multi-object detection - Example: “car, bicycle”
Object attributes - Example: “red car”
Spatial location - Example: “boy next to the dog”
Action performed - Example: “person sitting on the chair”
The microservice container ships with the Grounding DINO model and pre-trained weights that can be used out of the box, and it accepts both videos and images as input. The Docker container distributed as part of JPS includes the TAO Pretrained Grounding DINO model from NGC.
Running GDINO
Pull and start the container:
$ sudo docker pull nvcr.io/nvidia/jps/jps-gdino:ds7.1-public-12-11-1
$ sudo docker run -itd --runtime nvidia --network host nvcr.io/nvidia/jps/jps-gdino:ds7.1-public-12-11-1
Note that by default, the container uses port 8000. If that port is already in use by another process, the container will throw an error. You can specify a different port by adding the -e SERVER_PORT=8025 flag to the docker run command above, changing 8025 to the port value you want. You may also want to mount a directory at the /ds_microservices/output folder in the container using the -v docker run flag so that you can easily access output videos and images.
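For example, a run command that both overrides the port and mounts a host directory for outputs might look like the following sketch (the host path ~/gdino_output is just an illustration; use any directory you like):
$ sudo docker run -itd --runtime nvidia --network host \
    -e SERVER_PORT=8025 \
    -v ~/gdino_output:/ds_microservices/output \
    nvcr.io/nvidia/jps/jps-gdino:ds7.1-public-12-11-1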
View the container logs to determine when the service is fully up; this will be the case once you see output similar to the below. Since engine files are included in the container, startup should be fairly quick (less than 30 seconds). You may see some “Failed to connect” errors initially, but these should go away once the Triton server instance within the container is up and accessible.
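You can follow the logs with, for example (find the container ID with sudo docker ps):
$ sudo docker logs -f <container_id>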
Starting inference server
INFO:     Started server process [627]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
Send some sample videos/images to the server.
If using images:
$ curl -X POST "http://localhost:8000/files" -H "Content-Type: multipart/form-data" -F purpose="vision" -F media_type="image" -F "file=@image.jpg"
Or if using videos:
$ curl -X POST "http://localhost:8000/files" -H "Content-Type: multipart/form-data" -F purpose="vision" -F media_type="video" -F "file=@MOJD2.mp4"
The file parameter should point to the path on the Jetson (relative to your current directory), not the path in the container.
The curl request above will return an ID. This ID needs to be added to the payload below in order to request analysis on the uploaded image/video.
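If jq is available, you can capture the ID directly into a shell variable. This sketch assumes the response JSON carries the ID in an id field; inspect the actual response to confirm the field name:
$ ASSET_ID=$(curl -s -X POST "http://localhost:8000/files" -H "Content-Type: multipart/form-data" \
    -F purpose="vision" -F media_type="image" -F "file=@image.jpg" | jq -r '.id')
$ echo $ASSET_ID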
Create a JSON file with the following contents, replacing {asset_id} with the ID from before and prompt with whatever prompt you want to pass in. You may also have to tune the threshold value, depending on your input and prompt, in order to properly detect the desired objects.
{
    "model": "Grounding-Dino",
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "prompt"
                },
                {
                    "type": "media_url",
                    "media_url": {
                        "url": "data:image/jpeg;asset_id,{asset_id}"
                    }
                }
            ]
        }
    ],
    "threshold": 0.35
}
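If you captured the ID in $ASSET_ID as sketched above, you can generate the file in one step; the prompt "car, bicycle" here is just an example:
$ cat > input.json <<EOF
{
    "model": "Grounding-Dino",
    "messages": [
        {
            "role": "user",
            "content": [
                { "type": "text", "text": "car, bicycle" },
                { "type": "media_url", "media_url": { "url": "data:image/jpeg;asset_id,${ASSET_ID}" } }
            ]
        }
    ],
    "threshold": 0.35
}
EOF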
A curl request can then be made to request analysis. Make sure to update the port if you changed it in step 1, and change input.json to the path/name of the JSON file you created:
$ curl -X POST http://localhost:8000/inference -H "Content-Type: application/json" -d @input.json
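To pretty-print the JSON response, you can pipe it through jq:
$ curl -s -X POST http://localhost:8000/inference -H "Content-Type: application/json" -d @input.json | jq .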
The inference request will output the detection bboxes to the console and write an overlay image to /ds_microservices/output/<id>/out1.jpg (or /ds_microservices/output/<id>/out1.mp4 if using videos) in the container. The API request will also return an array of detections.
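If you did not mount the output folder in step 1, you can still copy results out of the container with docker cp, for example (container ID from sudo docker ps; <id> is the asset ID):
$ sudo docker cp <container_id>:/ds_microservices/output/<id>/out1.jpg .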
GDINO Performance and Sample Results
Performance
Performance will vary depending on the device used, the prompt, and the input stream resolution. For reference, on an AGX Orin, a sample 1080p video with the prompt “car, bike, person, bus” was processed at about 11.8 fps.
Sample Results
Below are some sample output results using image inputs:
Prompt: “car, bike”
[Output image with detection overlays]
Prompt: “red car”
[Output image with detection overlays]
Prompt: “person next to bike”
[Output image with detection overlays]