JSON Schema

Introduction

This document specifies requirements for metadata description for interoperability across multiple NVIDIA SDKs and libraries, including the Metropolis platform. Various existing standards were studied in the process of formulating these requirements; the references section lists the industry standards and academic research used. There are several standards for describing video metadata, including MPEG-7, LSCOM, ViPER, ONVIF, VERL, ViSOR, CVML, HumanML, and VERSA. Of these, MPEG-7 is by far the most comprehensive but also the most onerous from an implementation perspective. ONVIF, on the other hand, is widely adopted by the camera community and is the most widespread for camera metadata, but it is weak on scene description metadata. Several studies have attempted to define a minimal essential MPEG-7 profile for video surveillance. Unfortunately, there is no consensus, and no single standard has succeeded for all aspects of metadata description. It is also futile to attempt to create, or to expect, a universal standard for the semantics of video metadata. Efforts to standardize this for the web (the Semantic Web) have not seen much success after years of work; challenges include vastness, vagueness, uncertainty, and inconsistency. Trying to define a standard ontology for describing video will not achieve universal adoption and effectiveness for the very same reasons. In any case, natural language will replace most structured and controlled means of describing semantics in the future, and the way this schema is defined should leave that option open for future use.

This document therefore defines requirements for the schema as well as a controlled vocabulary that favor extensibility with a minimal core of necessity. It combines valuable features from more than one of the standards described above without adhering in its entirety to any single existing standard, and it extends the richness of the existing standards. The metadata description standard requirements in this document cover the following aspects of the needed standard:

  • A Schema to formally define the elements in a structured format

  • A language to encode the schema: JSON should be used for the schema and OWL for the upper ontology.

  • A Taxonomy inclusive of Controlled Vocabularies to represent the semantics.

  • The attributes of Things listed in the taxonomy.

  • The relationships between Things listed in the taxonomy.

Ingredients of a Video Metadata Description Standard

When a video is captured and analyzed, the metadata generated must describe the following independent characteristics:

  1. Metadata about the “sensor” (i.e. one or more cameras that captured the video)

  2. Metadata about the “Things” seen (and heard) in the video

  3. Metadata about the “analytical engine” that is analyzing the videos and generating scene analysis metadata

While expressing the metadata, the following primitive data types will be used.

  • Boolean: A Boolean value, either “true” or “false”.

  • Integer: An integer.

  • Float: A floating-point number.

  • lvalue: An enumeration type. The list of allowed values must be defined in the config part.

  • Long: A long integer.

  • String: A string value. Strings must be XML-escaped.

  • Timestamp: RFC 3339 UTC timestamp, e.g. 2006-01-02T15:04:05.999Z.

  • Array: An array of any of the above types of data.

Metadata about the sensor

This document proposes the use of the ONVIF standard for describing the metadata about the sensor which in this case is the camera. Given the widespread use and adoption of ONVIF in the camera community for this purpose it is reasonable to adopt it for the description of camera specific communications and metadata.

  • SensorID: Unique camera ID in a system of multiple cameras. This is a mandatory attribute.

  • SensorLocation: Latitude, Longitude, Altitude of the sensor. This is a mandatory attribute.

  • SensorCoordinates: Cartesian coordinates (X, Y, Z).

  • SensorType: Indicates the type of sensor, for example camera, puck, etc.

  • SensorOrientation: Pan, Tilt, Zoom (PTZ) values.

  • SensorDescription: Free-text description (e.g. “Camera facing parking lot A”).

  • SensorSamplingRate: For example, 30 frames per second.

Table 2: Sensor Metadata

Note

Additional camera parameters such as lens distortion, focal length, etc., used for camera calibration can also be added to the sensor metadata. Similarly, environmental attributes such as temperature, relative humidity, etc., can also be added. Attributes that are defined by the ONVIF core specification should follow the standard specified in ONVIF-Core-Specification-v1612.pdf. SensorCoordinates denotes the Cartesian coordinates with respect to some Euclidean plane. Such information yields faster and simpler computation in spatio-temporal analysis than geo-coordinates, which require computations on an ellipsoid surface.

Metadata about the “Things” seen (and heard) in the video

This section focuses on the requirements for metadata about the “Things” in the scene that are observed through audio and visual signals. The purpose of describing a video is to answer one or more questions of the following kind: Who? What? When? Where? How? To do so, a taxonomy is needed. This taxonomy comprises three parts: a list of “Things” and their “attributes”; a list of “relationship types”; and an actual instantiation of Things, their attributes, and relationship types to define the schema to be used for the semantic description of the scene. The schema should be equally applicable to labeling a video as well as to developing models for the class labels and then performing inference with those models during runtime processing.

Things can be of one of the following three types:

  1. Places: Describing the Place and its attributes where the scene is taking place. A Place is immobile. E.g. Airport, Indoor, Outdoor, etc.

  2. Objects: Describing the objects that are observed in the scene along with their appearance and behavior attributes. Objects can be immobile or mobile. E.g. Vehicle, Person, Face, Backpack, etc.

  3. Events: Describing the occurrence of various events over a period of time. Events are complex; they occur as a consequence of a change of state of one or more objects and the interaction of one or more objects with the environment or Place. E.g. Person Entering Room, Car Accident, etc.

Attributes:

There are two sets of attributes to be defined for all Things.

  1. A set of attributes that will be common to all Things.

  2. An additional set of optional attributes that are unique to each Thing.

Common Attributes

The following attributes are common for all Things.

  • ID: Unique ID for every object, place, or event detected in the scene. This is a mandatory attribute. It is not globally unique but can be made so by combining it with other metadata such as sensor ID and time. Rollover should be handled by application logic.

  • Appearance – Shape (BBox, COG, Polygon): Bounding box, center of gravity, and polygon coordinates indicating the location of the object within the frame. An arbitrary polygon can optionally be added to the appearance shape descriptor. For Places the default shape is the entire image.

  • Appearance – Color: Color of the object as defined. Use the ONVIF schema for the color space specification.

  • Confidence: A floating-point number in the closed interval [0,1] indicating the probability of detection of said object by the analytical engine.

  • Individual or Composite: Specifies whether the Thing is an individual Thing or is composed of multiple other Things and is therefore a composite Thing. The default assumption is that the Thing is an individual Thing unless noted otherwise.

Table 3: Common Attributes of Things

Time: Time is mandatory metadata. Time comes in two varieties – instant and interval. We will use t to indicate an instant and T to indicate an interval with a start and end time.

  • Timestamp: RFC 3339 UTC timestamp, e.g. 2006-01-02T15:04:05.999Z. This is a mandatory attribute.

  • Frame (optional): Frame number within the video sequence.

Table 4: Defining Time

An instant should always be expressed as a timestamp. An interval is defined by a pair of timestamps or frame numbers and is treated as a closed range. E.g. (t1,t2) indicates an interval T beginning at time instant t1 and ending at time instant t2, inclusive of both t1 and t2.
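As an illustration only, here is a minimal Python sketch of the closed-interval convention above, assuming RFC 3339 UTC timestamps with a trailing "Z" (the helper names are hypothetical, not part of the schema):

from datetime import datetime

def parse_ts(ts: str) -> datetime:
    # RFC 3339 UTC timestamp, e.g. "2006-01-02T15:04:05.999Z"
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))

def inside(t: str, interval) -> bool:
    # Closed range: both endpoints t1 and t2 are included.
    t1, t2 = (parse_ts(x) for x in interval)
    return t1 <= parse_ts(t) <= t2

# Example usage
print(inside("2006-01-02T15:04:05.999Z",
             ("2006-01-02T15:04:00.000Z", "2006-01-02T15:05:00.000Z")))  # True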

Optional Attributes:

The following appearance attributes are optional:

  • AppearanceFeatureName: Name of the numerical feature being used for describing the appearance.

  • AppearanceFeatureType: The data type of this appearance feature, for example integer, float, string, etc.

  • AppearanceFeatureVector: A vector of AppearanceFeatureType to represent any kind of image descriptor, such as SIFT features, texture, or fiducials for face recognition.

Table 5: Optional Attributes of Appearance

Location: Location is mandatory metadata. Location has two aspects. Relative location is camera-centric. Absolute location is world-centric.

Absolute Location

  • AbsoluteLocation: Latitude, Longitude, and Altitude of the center of gravity of the Thing being observed. This is a mandatory attribute.

  • Coordinates: Cartesian coordinates (X, Y, Z). As with the sensor, Cartesian coordinates enable faster computation.

  • Coordinates Bounding box (p1, p2): Defined by a pair of points p1 and p2 indicating the top-left and bottom-right Cartesian coordinates.

Table 6: Location Attributes

Relative location allows for within-frame localization of the Thing being described. None of the items listed in the table below is individually mandatory, but having at least one representation of location (either absolute or relative) is mandatory. If location is described in the relative frame of reference, one of the items in Table 7 must be associated with each Thing as a relative location attribute.

  • Point p: A pixel p in the image defined by an x,y coordinate system. It is recommended to use the ONVIF standard for the coordinate system of a point pixel in a camera-centric view.

  • Bounding box (p1,p2): Defined by a pair of points p1 and p2 indicating the top-left and bottom-right coordinates of a rectangular bounding box.

  • Mask: An arbitrary collection of points that define an arbitrarily shaped mask within an image. The advantage is that it can describe any shape precisely. The disadvantage is that it can be large in size and expensive to encode.

  • Center of gravity: x and y coordinates of the center of gravity.

  • Distance (p1, p2): Euclidean distance between points p1 and p2.

  • Translation and Scale: Follow the ONVIF schema definition of a translation vector T and a scale vector V to move the coordinate system if needed.

Table 7: Attributes about relative position within image/video

Movement Attributes

In many applications, including video surveillance, mobile objects are of greater interest than stationary objects. The movement of objects can be described by a set of primitive events that can happen to them. A mobile object has a velocity and a direction. The primitive movement attributes are listed below (a velocity sketch follows the table):

  • Direction: Represented by a float denoting the angle of motion, between 0 and 360 degrees, measured with respect to the horizontal axis of the Cartesian coordinate system.

  • Orientation: Indicates the orientation of the object. Represented as a float denoting the angle made with the horizontal axis. For example, this indicates which direction the front of the car is facing.

  • Speedpixel: Represented by a float, in pixels/second. In conjunction with the direction of motion, it provides the velocity of the moving object.

  • Speed: Real-world speed in miles per hour.

Table 8: Movement Attributes
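As a minimal sketch only, a consumer might combine Speedpixel and Direction into per-axis velocity components as below. The counter-clockwise-from-horizontal convention is an assumption; the table above only fixes the 0–360 degree range.

import math

def velocity_components(speed_px_per_s: float, direction_deg: float):
    # Direction is assumed to be measured counter-clockwise from the horizontal (x) axis.
    rad = math.radians(direction_deg)
    vx = speed_px_per_s * math.cos(rad)
    vy = speed_px_per_s * math.sin(rad)
    return vx, vy

# Example: 10 px/s at 225 degrees yields negative x and y components.
print(velocity_components(10.0, 225.0))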

Defining Things and their Thing-specific Attributes

The table below lists the class labels for Things and the unique attributes of each class label. The list also includes a few events whose attributes are described later in the document. The name of a Thing can be provided as a fully qualified name in the JSON schema. For example, a garage (parking lot) within a building can have the fully qualified type “building/garage”.

Indentation indicates hierarchy of things.

The hierarchy table is organized into columns: Type, Hierarchy Level 1, Hierarchy Level 2, and Attributes. The following attributes apply to all Things: ID, Individual/Composite, Time, Location (Relative, Absolute), Appearance (Color, Shape), Behavior, Freetextstring, AppearanceFeatureName, AppearanceFeatureVector.

Place

Outdoor/Indoor, Bright/Dark

Building

Airport

Station

Mall

School

Garage

Hospital

Store

House

Hangar

Tower

Road

Walkway

Trail

Railwaytrack

Runway

Biketrack

Pavement

Sky

Forest

Greenery

Beach

Bodyofwater

Mountain

Object

Vehicle

Make, Model, Type, LicensePlate, LicensePlateState, FeatureVector=VehicleSignatureFeatureVector

Car

Taxi

Truck

Bus

Large/Small/Minibus

Pickuptruck

Firetruck

Van

Ambulance

Train

Helicopter

Airplane

Tram

Bicycle

Motorbicycle

Person

Gender, Age, Hair, Glasses, Apparel, Bag, Cap, AppearanceFeatureVector=PersonSignatureFeatureVector

Cyclist

Driver

Lawenforcement official

Face

Gender, Age, Hair, Glasses, Cap, Facialhair, Name, Eyecolor, Emotion, FeatureVector=FiducialFeatures

Group

Numberofpersons

Bag

Animal

Glasses

Dark/Transparent

Cap

Apparel

Formal, Casual, Suit, Ethnic

Umbrella

Trafficsignal

Signpost

Lightpole

Tree

Pipeline

Stairwell

Elevator

Bridge

Busstop

Escalator

Door

Open/Closed

Event

Action

Active/Idle, Slow/Fast

Moving

Walking

Running

Smiling

Driving

Speaking

Touching

Holding

Picking

Releasing

Starting

Stopping

Flying

Swimming

Loitering

Entering

Throwing

Screaming

Exiting

Brawl

Crash

Glassbreaking

Fire

Explosion

Gunshot

Sportsevent

None of the above

Table 9: A Taxonomy of Things

Defining Relationships between Things and Attributes

Four types of relationships need to be supported:

  1. Subsumption: When a Thing subsumes another Thing. For example a “Car” IS-A “Vehicle”

  2. Composition: When a Thing is part of another Thing. For example a Vehicle HAS-A Licenseplate. All attributes of a Thing are also to be supported through this “HAS-A” relationship.

  3. Spatial relationship: Needed to describe the spatial relations among objects, Places, and events. For example, Vehicle 1 is to the “right-of” Vehicle 2.

  4. Temporal relationship: Needed to define all kinds of events that occur due to change of state of Places and objects over time. For example, the event of loitering can be defined as repeated entry and exit of a person within a zone within a certain amount of time.

We will allow applications to define an ontology to describe the scene using the following components:

Subsumption (IS-A)

The IS-A relationship will allow for class subsumption. Here are a few examples of the use of this relationship in defining Things (a lookup sketch follows the table):

  • Airport IS-A Building

  • Station IS-A Building

  • Building IS-A Place

  • Car IS-A Vehicle

  • Pavement IS-A Road

Table 10: Subsumption Relationship Examples
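For illustration, a minimal sketch of how an application could evaluate the IS-A relation transitively over a small fragment of the taxonomy. The dictionary below is a hypothetical encoding derived from Table 10, not part of the schema:

# Hypothetical parent map derived from Table 10.
IS_A = {
    "Airport": "Building",
    "Station": "Building",
    "Building": "Place",
    "Car": "Vehicle",
    "Pavement": "Road",
}

def is_a(thing, ancestor):
    # Walk up the subsumption chain until the ancestor is found or the chain ends.
    while thing is not None:
        if thing == ancestor:
            return True
        thing = IS_A.get(thing)
    return False

print(is_a("Airport", "Place"))  # True: Airport IS-A Building IS-A Place
print(is_a("Car", "Place"))      # False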

Composition (HAS-A)

The HAS-A relationship allows for an object to be composed of several other objects and for an object to have attributes. All mandatory and Thing-specific attributes can be defined using a HAS-A relationship. All attributes defined in Column 3 of Table 11 are in fact to be defined using the HAS-A relationship.

  • Person (ThingType: Object) HAS-A Gender, Age, Hair, Glasses, Apparel, Bag, Cap, Emotion, Expression

Table 11: Composition Relationship Example

Temporal Relationships

There are two types of temporal relationships: one between an instant and an interval, and the other between two distinct time intervals. These definitions are based on Allen’s temporal relationships.

Relationship between time instant t and time interval T.

  • Begins(t,T): Time instant t is the beginning of time interval T.

  • Inside(t,T): Time instant t is within time interval T.

  • Ends(t,T): Time instant t is the end of time interval T.

  • Outside(t,T): Time instant t is outside time interval T.

Table 12: Temporal Relationships between instant and interval

Relationship between time interval T1 and time interval T2 (a sketch of these relations follows the table).

  • Before(T1,T2): Time interval T1 happens entirely before time interval T2.

  • Meets(T1,T2): Time interval T1 ends when time interval T2 begins, so they meet at a common time instant.

  • Overlaps(T1,T2): Time intervals T1 and T2 share more than one time instant.

  • Begins(T1,T2): Time intervals T1 and T2 start together but do not end together.

  • Contains(T1,T2): Time interval T1 is a subset of time interval T2.

  • Ends(T1,T2): Time intervals T1 and T2 start at different instants but end together.

Table 13: Temporal Relationships between multiple intervals
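A minimal Python sketch of the interval relations in Table 13, assuming closed intervals represented as (start, end) pairs of comparable timestamps; the representation and the strict reading of Overlaps are assumptions made for illustration only:

def before(T1, T2):
    return T1[1] < T2[0]

def meets(T1, T2):
    return T1[1] == T2[0]

def overlaps(T1, T2):
    # One interpretation: T1 starts first and the intervals share a proper overlap.
    return T1[0] < T2[0] < T1[1] < T2[1]

def begins(T1, T2):
    return T1[0] == T2[0] and T1[1] != T2[1]

def contains(T1, T2):
    # T1 is a subset of T2.
    return T2[0] <= T1[0] and T1[1] <= T2[1]

def ends(T1, T2):
    return T1[0] != T2[0] and T1[1] == T2[1]

print(contains((2, 3), (1, 4)))  # True: [2, 3] is a subset of [1, 4]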

There are two possible relations between events and times: some events happen instantaneously, and some events happen over a time interval. The corresponding video granularities are Frame (instant) and Segment (interval). A UTC timecode is the expected way to define a timestamp.

Spatial Relationships

A common requirement is to be able to describe the spatial relationship between objects and Places. The following primitives can be a starting point, and further refinement can be decided in terms of the granularity of relative position. Assume two objects A and B; a sketch of these predicates follows the table.

  • Above(A,B): Y coordinate of COG(A) is above the Y coordinate of COG(B).

  • Below(A,B): Y coordinate of COG(A) is below the Y coordinate of COG(B).

  • Leftof(A,B): X coordinate of COG(A) is to the left of the X coordinate of COG(B).

  • Rightof(A,B): X coordinate of COG(A) is to the right of the X coordinate of COG(B).

  • Inside(A,B): A is entirely contained inside B.

  • Outside(A,B): A and B do not overlap.

Table 14: Spatial relationships between Things
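A minimal sketch of the center-of-gravity based predicates above, assuming axis-aligned bounding boxes given as (leftX, topY, rightX, bottomY) in image coordinates with y increasing downward; both assumptions are for illustration and the convention for “above” may differ per application:

def cog(bbox):
    left, top, right, bottom = bbox
    return ((left + right) / 2.0, (top + bottom) / 2.0)

def above(a, b):
    # Smaller y means higher in the image when y grows downward.
    return cog(a)[1] < cog(b)[1]

def leftof(a, b):
    return cog(a)[0] < cog(b)[0]

def inside(a, b):
    # A is entirely contained inside B.
    return b[0] <= a[0] and b[1] <= a[1] and a[2] <= b[2] and a[3] <= b[3]

def outside(a, b):
    # A and B do not overlap.
    return a[2] < b[0] or b[2] < a[0] or a[3] < b[1] or b[3] < a[1]

print(above((0, 0, 10, 10), (0, 20, 10, 30)))  # True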

Merging and Separation of Objects in a Scene:

We will follow the ONVIF notion of an objecttree for the merge and separation of objects in a scene, described in Section 5 below, but require an objecttree to have its own unique ID instead of reusing the ID of one of the merged objects. To avoid confusion with the ONVIF namespace, this is done as follows.

Imagine Thing 1 and Thing 2 now need to be combined in a new container (objecttree in ONVIF).

For any container that needs to contain multiple Things, define a new Thing class, as in the example below (a Python sketch follows it).

Thing Object
Object.ID = 3
Object.Thingtype = Composite
Object.Join(1, 3)  # Thing 3 now includes Thing 1
Object.Join(2, 3)  # Thing 3 now includes Thing 2

If Thing 2 now has to leave the container:

Object.Leave(2, 3)  # Thing 2 has now left the container Thing 3
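A minimal Python sketch of this container bookkeeping on the application side; the class below is hypothetical and only mirrors the Join/Leave convention above, it is not part of the schema:

class CompositeThing:
    """Container with its own unique ID that tracks member Thing IDs (ONVIF objecttree analogue)."""

    def __init__(self, container_id):
        self.id = container_id
        self.thingtype = "Composite"
        self.members = set()

    def join(self, thing_id):
        # Thing `thing_id` is merged into this container.
        self.members.add(thing_id)

    def leave(self, thing_id):
        # Thing `thing_id` separates from this container.
        self.members.discard(thing_id)

container = CompositeThing(3)
container.join(1)   # Thing 3 now includes Thing 1
container.join(2)   # Thing 3 now includes Thing 2
container.leave(2)  # Thing 2 has now left the container
print(container.members)  # {1}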

Object movement relationships for pairs of objects

Two Objects can be in a “distance-from” relation. It may also be useful to have a notion of one object being “near” another object. This is generally a functional notion: near enough for some purpose. In a particular domain, it may be possible to define this precisely in terms of distance. When at least one of the objects is mobile, this relation can change. Assume objects A and B. This gives rise to the following primitive events (a sketch follows Table 15):

  • MoveToward(A,B): A moves and changes the distance from B to a smaller value.

  • MoveAwayFrom(A,B): A moves and changes the distance from B to a larger value.

Table 15: Object movement relationships for pairs of objects
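A minimal sketch of how these primitive events could be derived from two successive center-of-gravity observations of A relative to a stationary B. Euclidean pixel distance is an assumption; any consistent distance measure works:

import math

def distance(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

def movement_relation(a_prev, a_now, b):
    """Classify A's motion relative to B from two COG samples of A."""
    d_prev, d_now = distance(a_prev, b), distance(a_now, b)
    if d_now < d_prev:
        return "MoveToward"
    if d_now > d_prev:
        return "MoveAwayFrom"
    return None  # no change in distance

print(movement_relation((0, 0), (3, 4), (10, 10)))  # MoveToward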

Additional movement relationships are described in the Table 16 below and are also listed in Table 18.

  • Enter(A,B): A changes from outside B to inside B.

  • Exit(A,B): A changes from inside B to outside B.

Table 16: Additional Movement Relationships

The following events are of interest to the video surveillance community: the event of objects being picked up, left behind or carried.

  • Pick-Up(A,B): There is a change from A not holding B to A holding B.

  • Put-Down(A,B): There is a change from A holding B to A not holding B.

  • Carry(A,B): Sequence(Pick-Up(A,B), AND(Hold(A,B), Move(A,C,D)), Put-Down(A,B)).

Table 17: Events of objects picked up or left behind of particular interest in surveillance

The hierarchy below summarizes all the relationships.

  • Interthing: IS-A, HAS-A, Join, Leave

  • Spatial: Point, Polyline, Polygon, Boundingbox, Mask, Center of gravity, Distance, Translation, Above, Below, Leftof, Rightof, Inside, Outside, Overlap, Scale, Absolute Location (latitude, longitude, altitude), Coordinates (X, Y, Z), Coordinates Bounding box

  • Movement: Direction (degrees), Orientation (degrees), Speed (pixels/second), Absolute speed (miles/hour), Movetoward, Moveawayfrom, Enter, Exit

  • Temporal: Instant (Frame): Before, Together; Interval (Segment): Begins, Ends, Contains, Inside, Outside, Before, After, Meets, Overlaps

Table 18: Hierarchy of relationships

To simplify representation, the following convention should be followed: when defining an event that involves more than one Thing, that event should be created as a new Thing entity as described above.

However, if an action is to be described and the action involves only an elemental Thing, then it need not be defined with a new Thing ID and should instead be defined as an attribute of the Thing using a HAS-A relationship.

All HAS-A relationships will be symbolically denoted by the “.” notation.

Metadata about the analytical engine

The following items need to be specified as metadata about the analytical engine.

  • EngineID: Unique ID for each analytical engine module used in the system.

  • EngineDescription: Free-text description of the analysis performed by the engine, e.g. Face Recognition, License Plate Recognition, etc.

  • EngineSource: Name of the entity (individual/company) providing this analytical engine.

  • EngineVersion: Version number of the analytical engine.

Table 19: Metadata about the analytical engine

Implementation Examples

This section will now illustrate the use of this knowledge model with a JSON schema implementation for a concrete use case that describes the movement of vehicles and people in a parking garage.

Sensor

Sensor object to be nested in every message:

"sensor": {
 "id": "string",
 "type": "Camera/Puck",
 "location": {
     "lat": 45.99,
     "lon": 35.54,
     "alt": 79.03
 },
  "coordinate": {
     "x": 5.2,
     "y": 10.1,
     "z": 11.2
 },
 "description": "Entrance of Endeavor Garage Right Lane"
}

Video Path

Data Source to be included with every metadata for playback:

"videoPath": "URL of the playback Video"

Analytics Module

Analytics Module to be nested in every message:

"analyticsModule": {
   "id": "string",
   "description": "Vehicle Detection and License Plate Recognition",
   "source": "OpenALR",
   "version": "string"
}

Place

Describing a scene requires describing where the scene is happening (the place), what is happening in terms of events, and which objects are participating in the events.

To describe a place:

"place": {
   "id": "string",
   "name": "endeavor",
   "type": "building/garage/entrance",
   "location": {
      "lat": 30.333,
      "lon": -40.555,
      "alt": 100.00
   },
   "coordinate": {
      "x": 1.0,
      "y": 2.0,
      "z": 3.0
   }
   "info": {
      "name": "walsh",
      "lane": "lane1",
      "level": "P2",
   }
}

Object

To describe an object – for example a vehicle:

"object": {
   "id": "string",
   "type": "Vehicle",
   "confidence": 0.9,
   "info": {
      "type": "sedan",
      "make": "Bugatti",
      "model": "M",
      "color": "blue",
      "license": "CGP93S",
      "licenseState": "CA"
   },
   "bbox": {
      "leftX": 0.0,
      "topY": 0.0,
      "rightX": 100.0,
      "bottomY": 200.0
   },
   "location": {
      "lat": 30.333,
      "lon": -40.555,
      "alt": 100.00,
   },

   "coordinate": {
      "x": 5.2,
      "y": 10.1,
      "z": 11.2
   },
   "embedding": {
      "vector": [
         9.3953978E+8,


         -263123344,
         -12817122
    ]
   },
   "orientation": 45.0,
   "direction": 225.0,
   "speed": 7.5
}

Person will be described as below:

"object": {
     "id": "123",
     "type": "Person",
     "confidence": 0.9,
     "info": {
         "gender": "male",
         "age": 45,
         "hair": "black",
         "cap": "none",
         "apparel": "formal"

     },
     "bbox": {
             "leftX": 0.0,
             "topY": 0.0,
             "rightX": 100.0,
             "bottomY": 200.0
         },
     "location": {
         "lat": 30.333,
         "lon": -40.555,
         "alt": 100.00
      },
     "coordinate": {
         "x": 5.2,
         "y": 10.1,
         "z": 11.2
     },
    "embedding": {
     "vector": [
       9.3953978E+8,

       -263123344,
       -12817122
     ]
   },
           "orientation": 45.0,
      "direction": 225.0,
      "speed": 7.5
 }

Face will be described as below:

"object": {
    "id": "string",
    "type": "Face",
    "confidence": 0.9,
    "info": {
        "gender": "male",
        "age": 45,
        "hair": "black",
        "cap": "none",
        "glasses": "yes",
        "facialhair": "yes",
        "name": "John Smith",
        "eyecolor": "brown"
    },
    "bbox": {
            "leftX": 0.0,
            "topY": 0.0,
            "rightX": 100.0,
            "bottomY": 200.0
    },
    "location": {
        "lat": 30.333,
        "lon": -40.555,
        "alt": 100.00
     },
    "coordinate": {
        "x": 5.2,
        "y": 10.1,
        "z": 11.2
    },
  "embedding": {
    "vector": [
      9.3953978E+8,

      -263123344,
      -12817122
    ]
  }
    "orientation": 45.0,
    "direction": 225.0,
    "speed": 7.5
}

Generic object-type

To support any other object type, such as Cat, Dog, Painting, etc., use the generic "info" field inside the object, as shown below:

"Object": {
  "type": "Cat or Dog or Painting",

  "info":

      {
         "name-1": "value-1",
         "name-2": "value-2":

      }


},

The secondary attributes for an object type such as Cat, Dog, or Painting in the above example will vary in number, type, and ordering; the downstream application consuming the message needs to interpret the information for each object type and its attributes. The Metropolis microservices analytics stack by default persists all messages for search and retrieval.

An example of a Cat object is shown below:

"object": {
     "id": "string",
     "type": "Cat",
     "info": {
            "color": "grey",
            "species": "X":

     },
     "bbox": {
             "leftX": 0.0,
             "topY": 0.0,
             "rightX": 100.0,
             "bottomY": 200.0
         },
     "location": {
         "lat": 30.333,
         "lon": -40.555,
         "alt": 100.00
           },
     "coordinate": {
         "x": 5.2,
         "y": 10.1,
         "z": 11.2
     },
    "embedding": {
     "vector": [
       9.3953978E+8,

       -263123344,
       -12817122
     ]
   }
     "orientation": 45.0,
     "direction": 225.0,
     "speed": 7.5
 }

Event

To describe an event:

"event": {
   "id": "event-id",
   "type": "entry"
}

The following set of events can be implemented using a similar approach:

  • entry

  • exit

  • moving

  • stopped

  • parked

  • empty

  • reset

Units

  • Distance: meters

  • Speed: miles per hour

  • Time: UTC

  • Geo-location: Latitude, Longitude, Altitude

Putting it all together to construct messages

Frame Object

This JSON is compatible with the protobuf object. The best practice is to send protobuf messages over the network and store the generated JSON in a database for search queries. Section 4 represents a variation of the JSON schema. If the message payload is small, an application can send JSON based on section 4:

{
   "version": "4.0",
   "id": "252",
   "timestamp": "2022-02-09T10:45:10.170Z",
   "sensorId": "xyz",
   "objects": [
      {
         "id": "3",
         "bbox": {
            "leftX": 285.0,
            "topY": 238.0,
            "rightX": 622.0,
            "bottomY": 687.0
         },
         "type": "Person",
         "confidence": 0.9779,
         "info": {
            "gender": "male",
            "age": 45,
            "hair": "black",
            "cap": "none",
            "apparel": "formal"
         },
         "embedding": {
            "vector": [
               9.3953978E+8,
               -263123344,
               -12817122
            ]
         }
      }
   ]
}

Frame Structure and schema

The second (minimal) variation of the schema is described here. The objective is to keep a low footprint for the payload transmitted from the Sensor Perception layer (for example, DeepStream) to any message broker. This schema is an alternative to protobuf.

Each message/payload should have a one-to-one correspondence with a frame (video frame). A frame comprises one or more detected objects.

The key elements are described below.

Frame

This represents a video frame and contains a list of objects detected by the DeepStream perception layer.

The Key attributes of the frame are:

  • “version”: represents version of the schema

  • “id”: represents the video frame-id

  • “@timestamp”: represents the camera timestamp

  • “sensorId”: Unique sensor-id

  • “objects”: List of objects, Object element is defined in the next section

The JSON structure of the frame is:

{
   "version": "4.0",
      "id": "frame-id",
      "@timestamp": "2018-04-11T04:59:59.828Z",
      "sensorId": "sensor-id",
      "objects": [
         ".......object-1 attributes...........",
         ".......object-2 attributes...........",
         ".......object-3 attributes..........."
      ]
}

Object

The Object is defined using a string. There may be a number of GPU Inference Engines (GIEs) present in the perception pipeline, or a model may have multiple outputs. The outputs can be related to object detection and tracking, object appearance, pose, or gaze. The outputs are logically separated by |#|, as in the example shown below.

The object string is represented as below:

"primary attributes \|#\| secondary attributes \|#|......Additional attributes N ...|confidence|"

Attributes within a single section are pipe (|) separated, and the ordering of attributes is strict.

Primary Attributes

The primary attributes are fixed; they are:

  • “object-Id”: represents the unique object ID. When using the single-camera tracker, the ID needs to be maintained across frames over time.

  • The bounding box image coordinates:

    • bbox.leftX

    • bbox.topY

    • bbox.rightX

    • bbox.bottomY

  • “object-type”: represents the type of the object, which can be

    • Vehicle

    • Person

    • Face

    • RoadSign

    • Bicycle

    • Bag

    • others

Example frame with only Primary attributes:

{
   "version": "4.0",
      "id": "frame-id",
      "@timestamp": "2018-04-11T04:59:59.828Z",
      "sensorId": "sensor-id",
      "objects": [
         "object-Id | bbox.leftX | bbox.topY | bbox.rightX | bbox.bottomY |object-type",
         ".......object-2 primary attributes...........",
         ".......object-3 primary attributes..........."
      ]
   }

Secondary Attributes

Secondary attributes are based on the appearance of the “object-type”:

  • Vehicle Attributes

    • type

    • make

    • model

    • color

    • license

    • licenseState

An example object with a Vehicle object-type (a parsing sketch follows the example):

{
   "version": "4.0",
      "id": "frame-id",
      "@timestamp": "2018-04-11T04:59:59.828Z",
      "sensorId": "sensor-id",
      "objects": [
         "957|1834|150|1918|215|Vehicle|#|sedan|Bugatti|M|blue|CA 444|California|0.8",
         "..........."
      ]
   }
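For illustration, a minimal sketch of parsing the Vehicle object string above, assuming the primary and secondary attribute ordering listed in this section and a trailing confidence value; the dictionary field names are hypothetical:

def parse_vehicle_object(obj_str):
    # Split the |#|-separated sections, then the pipe-separated attributes.
    sections = [s.strip() for s in obj_str.split("|#|")]
    primary = sections[0].split("|")
    secondary = sections[1].split("|")
    return {
        "id": primary[0],
        "bbox": {
            "leftX": float(primary[1]),
            "topY": float(primary[2]),
            "rightX": float(primary[3]),
            "bottomY": float(primary[4]),
        },
        "type": primary[5],
        "info": dict(zip(
            ["type", "make", "model", "color", "license", "licenseState"],
            secondary[:6])),
        "confidence": float(secondary[6]),
    }

print(parse_vehicle_object(
    "957|1834|150|1918|215|Vehicle|#|sedan|Bugatti|M|blue|CA 444|California|0.8"))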

Vehicle attributes and possible classes/values each attribute may have:

  • type

    • coupe, largevehicle, sedan, suv, truck, van

  • make

    • acura, audi, bmw, chevrolet, chrysler, dodge, ford, gmc, honda,

      hyundai, infiniti, jeep, kia, lexus, mazda, mercedes, nissan, subaru, toyota, volkswagen

  • color

    • black, blue, brown, gold, green, grey, maroon, orange, red,

      silver, white, yellow

  • Person Attributes

    • gender

    • age

    • hair

    • cap

    • apparel

    • height

Person example: a frame with both a Vehicle and a Person object

{
   "version": "4.0",
      "id": "frame-id",
      "@timestamp": "2018-04-11T04:59:59.828Z",
      "sensorId": "sensor-id",
      "objects": [
         "957|1834|150|1918|215|Vehicle|#|sedan|Bugatti|M|blue|CA 444|California|0.8",
         "323|1800|140|1800|190|Person|#|female|30|black|none|formal|5.5|0.8"

      ]
   }

  • Face Attributes

    • gender

    • age

    • hair

    • cap

    • glasses

    • facialhair

    • name

    • eyecolor

Generic object-type

The schema is flexible enough to define any other class of object. The approach is the same as for Vehicle, Person, or Face: use "primary attributes |#| secondary attributes" to describe a detected object.

Examples for Cat, Dog, and Painting will be as below:

{
   "version": "4.0",
      "id": "frame-id",
      "@timestamp": "2018-04-11T04:59:59.828Z",
      "sensorId": "sensor-id",
      "objects": [
         "object-Id & bbox values | Cat |#| Cat attributes",
         "object-Id & bbox values | Dog |#| Dog attributes",
         "object-Id & bbox values | Painting |#| Painting attributes"
      ]
}

The secondary attributes for an object type such as Cat, Dog, or Painting in the above example will vary in number, type, and ordering; the downstream application consuming the message needs to interpret the information for each object type and its attributes. The Metropolis microservices analytics stack by default persists all messages for search and retrieval.

Pose Attributes

The pose attributes are added as a new section in the object string representation and treated as Additional attributes.

" primary attributes \|#\| secondary attributes \|#|......pose2D or pose3D attributes......"

Further breaking it down:

" ID + bbox attributes \|#\| appearance attributes \|#|pose3D|bodypart-1|bodypart-2|…"

Including multiple pose dimensions:

" ID + bbox attributes \|#\| appearance attributes \|#|pose3D|bodypart-1|bodypart-2|…\|#|pose2D|bodypart-1|bodypart-2|…"

Example Frame JSON will be:

{
    "version": "4.0",
    "id": "frame-id",
    "@timestamp": "2018-04-11T04:59:59.828Z",
    "sensorId": "sensor-id",
    "objects": [
           "323|1800|140|1800|190|Person|#|female|30|black|none|formal|5.5|0.8|#|pose3D|nose,x,y,z,0.8|left-eye,x,y,z,1.0|..."
    ]
}

The first attribute value indicates the nature of the pose, i.e., pose2D, pose25D, or pose3D, and is followed by several bodyparts. Each bodypart comprises a name, a location (x,y,z), and a confidence, e.g., |left-eye,x,y,z,0.75|. For pose3D, the coordinates (x,y,z) are in the world coordinate system with respect to the camera (unit: mm). For pose2.5D, the format is the same as pose3D, but the coordinates (x,y) are in the image plane (unit: pixel) and the z coordinate stands for the metric depth relative to the root keypoint, i.e., the pelvis (unit: mm). For pose2D, (x,y) are the image pixel coordinates and the z coordinate is ignored, e.g., |right-ear,x,y,0.80|.
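A minimal sketch of parsing a pose section from an object string, assuming the comma-separated bodypart layout described above (name, coordinates, confidence); the helper and field names are hypothetical:

def parse_pose_section(section):
    """Parse e.g. 'pose3D|nose,x1,y1,z1,0.8|left-eye,x2,y2,z2,1.0' into a keypoint dict."""
    fields = section.split("|")
    pose_type = fields[0]            # pose2D, pose25D, or pose3D
    keypoints = {}
    for part in fields[1:]:
        if not part:
            continue
        values = part.split(",")
        name, confidence = values[0], float(values[-1])
        coords = [float(v) for v in values[1:-1]]   # (x, y) or (x, y, z)
        keypoints[name] = {"coords": coords, "confidence": confidence}
    return {"type": pose_type, "keypoints": keypoints}

print(parse_pose_section("pose3D|nose,10.0,20.0,900.0,0.8|left-eye,12.0,18.0,905.0,1.0"))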

The bodyparts are pipe separated, and there is an implicit ordering of bodyparts. For example, the ordering for the 18 joints of the COCO format is as follows: (0) nose; (1) neck; (2) right-shoulder; (3) right-elbow; (4) right-wrist; (5) left-shoulder; (6) left-elbow; (7) left-wrist; (8) right-hip; (9) right-knee; (10) right-ankle; (11) left-hip; (12) left-knee; (13) left-ankle; (14) right-eye; (15) left-eye; (16) right-ear; (17) left-ear.

As for the NVIDIA MAXINE format with 34 joints, the implicit ordering of the keypoints is as follows: (0) pelvis; (1) left-hip; (2) right-hip; (3) torso; (4) left-knee; (5) right-knee; (6) neck; (7) left-ankle; (8) right-ankle; (9) left-big-toe; (10) right-big-toe; (11) left-small-toe; (12) right-small-toe; (13) left-heel; (14) right-heel; (15) nose; (16) left-eye; (17) right-eye; (18) left-ear; (19) right-ear; (20) left-shoulder; (21) right-shoulder; (22) left-elbow; (23) right-elbow; (24) left-wrist; (25) right-wrist; (26) left-pinky-knuckle; (27) right-pinky-knuckle; (28) left-middle-tip; (29) right-middle-tip; (30) left-index-knuckle; (31) right-index-knuckle; (32) left-thumb-tip; (33) right-thumb-tip.

Embedding Attributes

Like the pose attributes, the embedding attributes are also added as a new section in the object string representation and treated as Additional attributes. There can be more than one section of additional attributes, which should be separated by “|#|”.

" primary attributes \|#\| secondary attributes \| confidence \|#\|......embedding attributes......"

Further breaking it down:

" ID + bbox attributes \|#\| appearance attributes \| confidence \|#\|embedding|dimension-1,dimension-2,…,dimension-N\| "

Example Frame JSON will be:

{
      "version": "4.0",
      "id": "frame-id",
      "@timestamp": "2018-04-11T04:59:59.828Z",
      "sensorId": "sensor-id",
      "objects": [
            "323|1800|140|1800|190|Person|#|female|30|black|none|formal|5.5|0.8|#|embedding|0.9881,0.677869,0.779454,0.375686,0.396891,0.747902,0.91728,0.481577,0.706675,0.181111|"
      ]
}

The first attribute value indicates that the following attributes represent an embedding. Each embedding consists of a sequence of values of a pre-defined dimension N, separated by comma.
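A minimal sketch of consuming the embedding section, assuming a fixed dimension N and cosine similarity as the comparison metric; the metric choice is an assumption, the schema only defines the encoding:

import math

def parse_embedding_section(section):
    """Parse e.g. 'embedding|0.9881,0.6778,...' into a list of floats."""
    label, values = section.split("|", 1)
    assert label == "embedding"
    return [float(v) for v in values.strip("|").split(",")]

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

e = parse_embedding_section("embedding|0.9881,0.677869,0.779454,0.375686|")
print(cosine_similarity(e, e))  # 1.0 for identical embeddings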

Single-View 3D Tracking Attributes

Like the pose and embedding attributes, the single-view 3D tracking (SV3DT) attributes are also added as a new section in the object string representation and treated as Additional attributes, which should be separated by “|#|”.

" primary attributes \|#\| secondary attributes \|#|......SV3DT attributes......"

Further breaking it down:

" ID + bbox attributes \|#\| appearance attributes
\|#|SV3DT|visibility|foot-location-2D|foot-location-3D|convex-hull-2D\| "

Example Frame JSON will be:

{
      "version": "4.0",
      "id": "frame-id",
      "@timestamp": "2018-04-11T04:59:59.828Z",
      "sensorId": "sensor-id",
      "objects": [
            "323|1800|140|1800|190|Person|#|female|30|black|none|formal|5.5|0.8|#|SV3DT|0.991698|297.655,179.069|17.56687,20.29478|-19,-33,-18,-32,3,35|"
      ]
}

The first attribute value indicates that the following attributes represent SV3DT.

The second attribute value is the visibility, a floating-point number between 0 and 1 indicating the ratio of visible bodyparts under occlusion.

The third attribute is the comma-separated x and y coordinates (in pixels) of the foot location in 2D, whereas the fourth attribute gives the foot location in 3D (unit: meter).

The last attribute indicates the 2D convex hull coordinates relative to the bounding box center. In the above example, if the bounding box center is (286, 144), the convex hull is formed by the following three points (a conversion sketch follows):

(-19,-33)+(286,144)=(267,111)

(-18,-32)+(286,144)=(268,112)

(3,35)+(286,144)=(289,179)
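A minimal sketch of this conversion, assuming the convex hull is encoded as a flat, comma-separated list of x,y offsets relative to the bounding box center (as in the example above):

def absolute_convex_hull(hull_csv, bbox_center):
    """Convert relative convex-hull offsets to absolute pixel coordinates."""
    values = [int(v) for v in hull_csv.strip("|").split(",")]
    cx, cy = bbox_center
    # Pair consecutive values as (dx, dy) offsets and add the bounding box center.
    return [(cx + dx, cy + dy) for dx, dy in zip(values[0::2], values[1::2])]

print(absolute_convex_hull("-19,-33,-18,-32,3,35", (286, 144)))
# [(267, 111), (268, 112), (289, 179)]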

Lip Activity Classification Attributes

The lip activity classification attributes are added as a new section in the object string representation and treated as Additional attributes.

They will be optionally present when the object type is Face. Each Face object will have its own lip activity classification attributes.

" primary attributes \|#\| secondary attributes \|#|......lip activity classification attributes......"

Further breaking it down:

" ID + bbox attributes \|#\| appearance attributes|#|lip_activity|class-label"

Where class-label is an enum with possible values of speaking, silent, undefined.

Example Frame JSON will be:

{
 "version":"4.0",
 "id":"frame-id",
 "@timestamp":"2018-04-11T04:59:59.828Z",
 "sensorId":"sensor-id",
 "objects": [
     "323|1200|140|1600|190|Face|#|lip_activity|speaking",
     "324|1600|160|1800|210|Face|#|lip_activity|silent"
 ]
}

Gaze Estimation Attributes

The gaze attributes are added as a new section in the object string representation and treated as Additional attributes.

They will be optionally present when the object type is Face. Each Face object will have its own gaze attributes.

" primary attributes \|#\| secondary attributes \|#|......gaze attributes......"

Further breaking it down:

” ID + bbox attributes |#| appearance-attributes|#|gaze|x|y|z|theta|phi”

Example Frame JSON will be:

 {
    "version":"4.0",
    "id":"frame-id",
    "@timestamp":"2018-04-11T04:59:59.828Z",
    "sensorId":"sensor-id",
    "objects": [
       "323|1200|140|1600|190|Face|#|gaze|100|120|130|0.042603|0.026154",
       "324|1600|160|1800|210|Face|#|gaze|100|120|130|0.005659|0.281006"
    ]

 }

The gaze point of reference (x, y, z) is in the camera coordinate system; theta and phi are angles.