Heterogeneous GNN Data Format for Edge Prediction#

This section defines the data organization and structure required to prepare heterogeneous graph data for edge prediction tasks.


Directory Structure Overview#

All training data must be organized under a common root directory containing two subfolders:

data_dir/
├── edges/
└── nodes/

Each folder follows strict naming conventions and structural rules, described in the sections below.


Nodes Folder#

Path#

nodes/

Description#

Each node type in the heterogeneous graph corresponds to one file inside the nodes folder.

File Naming Convention#

<nodetype>.<ext>
  • <nodetype>: The unique identifier for a node type (for example, User, Merchant, Device).

  • <ext>: File format extension, one of:

    {"csv", "parquet", "orc"}
    

Example#

nodes/
├── A.csv
├── B.csv
└── C.csv

Format Requirements#

  • Each file contains one node per row.

  • Columns represent node features or attributes (for example, embeddings, categorical variables, numerical variables, and so on).

  • The row position of a node in its corresponding file (for example, nodes/A.csv) represents the node ID, that is, Node Index = Row Position.
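
The row-position rule can be illustrated with a minimal sketch. It assumes CSV node files with a single header row and zero-based row positions; neither detail is stated above, so treat both as assumptions. The helper names (`write_node_file`, `read_node_features`) and the `age` and `country` columns are hypothetical.

```python
import csv
import tempfile
from pathlib import Path


def write_node_file(path, feature_names, rows):
    """Write one node per row; the row order defines the node indices."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(feature_names)  # header row (an assumption of this sketch)
        writer.writerows(rows)


def read_node_features(path):
    """Return one feature dict per node; the list position is the node index."""
    with path.open(newline="") as f:
        return list(csv.DictReader(f))


# Hypothetical node type "User" with two illustrative feature columns.
root = Path(tempfile.mkdtemp())
write_node_file(
    root / "nodes" / "User.csv",
    ["age", "country"],
    [[34, "DE"], [27, "US"], [45, "JP"]],
)

users = read_node_features(root / "nodes" / "User.csv")
# Under the zero-based assumption, users[1] is the node with index 1.
print(users[1])
```

Because the index is implied by position, any step that reorders or filters rows in a node file also changes the node IDs that edge files refer to.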


Edges Folder#

Path#

edges/

Description#

The edges folder contains all edge relationships between node types, as well as optional edge attributes and a target label file for edge prediction.

File Naming Conventions#

1. Main Edge Files#

Each edge type is represented by a file of the form:

<A_rel_B>.<ext>

Where:

  • A: Source node type (must exist in nodes/)

  • B: Target node type (must exist in nodes/)

  • rel: Relationship name (for example, buys, follows, cites)

  • <ext>: File extension (csv, parquet, or orc)

These files define all edges of a given type, one edge per row, with the columns:

src, dst
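
As a rough sketch of how the naming convention might be interpreted, the hypothetical helper below splits an edge file name into its source type, relationship, and target type. It resolves the split against the set of known node types, since a name such as `A_rel_B` cannot be split unambiguously when type or relationship names themselves contain underscores. `parse_edge_filename` is an illustrative name, not part of any library.

```python
from pathlib import Path

ALLOWED_EXTS = {"csv", "parquet", "orc"}


def parse_edge_filename(filename, node_types):
    """Split '<A>_<rel>_<B>.<ext>' into (A, rel, B, ext), or return None.

    The split is resolved against the known node types. If several splits
    are possible, the first match in sorted order is returned.
    """
    p = Path(filename)
    ext = p.suffix.lstrip(".")
    if ext not in ALLOWED_EXTS:
        return None
    stem = p.stem
    for src in sorted(node_types):
        for dst in sorted(node_types):
            prefix, suffix = src + "_", "_" + dst
            # Require at least one character for the relationship name.
            if (
                stem.startswith(prefix)
                and stem.endswith(suffix)
                and len(stem) > len(prefix) + len(suffix)
            ):
                rel = stem[len(prefix): len(stem) - len(suffix)]
                return src, rel, dst, ext
    return None


print(parse_edge_filename("User_buys_Merchant.csv", {"User", "Merchant"}))
```

A file whose source or target type has no matching node file yields `None`, which corresponds to the requirement that both types must exist in nodes/.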

2. Edge Attribute Files (Optional)#

Each edge type can optionally include an edge attribute file:

<A_rel_B>_attr.<ext>
  • Must correspond exactly to an existing edge type <A_rel_B>.<ext>.

  • The number of rows must match the number of edges.

  • Columns define edge features.

3. Target Edge Label File (Exactly One)#

Exactly one edge type must have a label file for training:

<A_rel_B>_label.<ext>
  • Represents the supervised target for edge prediction.

  • The number of rows must equal the number of edges in the corresponding edge file <A_rel_B>.<ext>.

  • If an edge attribute file exists for the same type, it must also have the same number of rows.
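
The row-count rules suggest that the edge list, attribute file, and label file are aligned row by row. The sketch below illustrates that reading under stated assumptions: CSV content with a header row, a hypothetical attribute column `amount`, and a hypothetical label column `label`. Only `src` and `dst` are column names taken from the text above.

```python
import csv
import io

# Illustrative file contents; row i of each file describes the same edge.
edges_csv = "src,dst\n0,2\n1,0\n1,2\n"
attr_csv = "amount\n9.5\n20.0\n3.25\n"
label_csv = "label\n1\n0\n1\n"


def read_rows(text):
    """Parse CSV text with a header row into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text)))


def align(edges, attrs, labels):
    """Join the three row lists by position, enforcing equal row counts."""
    if not (len(edges) == len(attrs) == len(labels)):
        raise ValueError(
            "edge, attribute, and label files must have the same number of rows"
        )
    return [
        (int(e["src"]), int(e["dst"]), float(a["amount"]), int(l["label"]))
        for e, a, l in zip(edges, attrs, labels)
    ]


aligned = align(read_rows(edges_csv), read_rows(attr_csv), read_rows(label_csv))
for row in aligned:
    print(row)
```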

Example Directory Layout#

edges/
├── A_rel_B.csv
├── A_rel_B_attr.csv
├── A_rel_B_label.csv
├── A_rel_C.csv
└── A_rel_C_attr.csv

Constraints#

To ensure consistency and usability during model training, the following validation checks must hold:

  1. Node Existence

    • For every edge file <A_rel_B>.<ext>, both A and B must exist in the nodes directory.

  2. Edge Attribute Size Match

    • At most one _attr file per edge type.

    • For each edge type <A_rel_B> that has an attribute file, the row count must match the number of edges of the corresponding edge type.

  3. Single Label File

    • There must be exactly one _label file across all edge types, and its row count must match the number of edges of the corresponding edge type.

  4. File Format Consistency

    • All files must be in one of the allowed formats: csv, parquet, or orc.

    • Mixing formats across files is allowed, but discouraged in production for consistency.


Example Complete Structure#

graph_data/
├── nodes/
│   ├── A.csv
│   ├── B.csv
│   └── C.csv
└── edges/
    ├── A_rel_B.csv
    ├── A_rel_B_attr.csv
    ├── A_rel_B_label.csv
    ├── A_rel_C.csv
    └── A_rel_C_attr.csv

Summary#

Component         Description                             Example Filename    Required
Node features     Per-node attributes                     A.csv               Yes
Main edge list    Edge indices between node types         A_rel_B.csv         Yes
Edge attributes   Optional edge-level features            A_rel_B_attr.csv    ⚙️ Optional
Edge labels       Supervised target for edge prediction   A_rel_B_label.csv   ⚠️ Exactly one


Notes#

  • File formats are interchangeable (csv, parquet, orc).

  • Each edge type is self-contained: its edge list, attributes, and labels all belong to the same <A_rel_B> group.

  • This design allows scalable and modular data pipelines for edge prediction in large multi-relational graphs.