Heterogeneous GNN Data Format for Edge Prediction#
This section defines the data organization and structure required to prepare heterogeneous graph data for edge prediction tasks.
Directory Structure Overview#
All training data must be organized under a common root directory containing two subfolders:
data_dir/
├── edges/
└── nodes/
Each folder has strict naming conventions and structural rules, described in the sections below.
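As a quick sanity check on the layout above, the following minimal sketch reports which of the two required subfolders are missing under a root directory. The function name `missing_subfolders` is hypothetical and is not part of any library.

```python
from pathlib import Path

# Subfolder names taken from the directory layout described above.
REQUIRED_SUBFOLDERS = ("nodes", "edges")


def missing_subfolders(data_dir):
    """Return the required subfolders that are absent under data_dir."""
    root = Path(data_dir)
    return [name for name in REQUIRED_SUBFOLDERS if not (root / name).is_dir()]
```

An empty list means both `nodes/` and `edges/` are present; the check says nothing yet about the files inside them.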
Nodes Folder#
Path#
nodes/
Description#
Each node type in the heterogeneous graph corresponds to one file inside the nodes folder.
File Naming Convention#
<nodetype>.<ext>
<nodetype>: The unique identifier for a node type (for example, User, Merchant, Device).
<ext>: The file format extension, one of: csv, parquet, orc.
Example#
nodes/
├── A.csv
├── B.csv
└── C.csv
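The naming convention means node types can be discovered directly from the file names. The sketch below is one possible way to do that with the standard library; `discover_node_types` is a hypothetical helper. It assumes lowercase extensions, and it treats two files with the same stem (for example `A.csv` and `A.parquet`) as an error, which is an interpretation of "one file per node type" rather than a documented rule.

```python
from pathlib import Path

ALLOWED_EXTENSIONS = {"csv", "parquet", "orc"}


def discover_node_types(nodes_dir):
    """Map each node type name to its file path, based on <nodetype>.<ext>."""
    node_files = {}
    for path in sorted(Path(nodes_dir).iterdir()):
        ext = path.suffix.lstrip(".")
        if not path.is_file() or ext not in ALLOWED_EXTENSIONS:
            continue  # skip subfolders and files with other extensions
        if path.stem in node_files:
            # Assumption: the same node type in two formats is a mistake.
            raise ValueError(f"node type {path.stem!r} appears in more than one file")
        node_files[path.stem] = path
    return node_files
```

For the example folder above, the returned dictionary would have the keys `A`, `B`, and `C`.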
Format Requirements#
Each file contains one node per row.
Columns represent node features or attributes (for example, embeddings, categorical variables, numerical variables, and so on).
The row position of a node in its corresponding file (for example, nodes/A.csv) represents the node ID, that is, Node Index = Row Position.
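To illustrate the "Node Index = Row Position" rule, here is a minimal sketch that loads a node CSV into a list, so that the list index doubles as the node index. Two assumptions are made that the text above does not state: the file has a header row naming the feature columns, and indexing starts at 0. `load_node_rows` is a hypothetical helper name.

```python
import csv


def load_node_rows(csv_path):
    """Read a node CSV; the list index of each row is its node index.

    Assumes a header row and zero-based node indices (not confirmed above).
    """
    with open(csv_path, newline="") as handle:
        return list(csv.DictReader(handle))
```

Under these assumptions, `load_node_rows("nodes/A.csv")[3]` would hold the features of node 3 of type A. Because the index is implied by position, reordering or filtering rows in a node file changes node identities.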
Edges Folder#
Path#
edges/
Description#
The edges folder contains all edge relationships between node types, as well as optional edge attributes and a target label file for edge prediction.
File Naming Conventions#
1. Main Edge Files#
Each edge type is represented by a file of the form:
<A_rel_B>.<ext>
Where:
A: Source node type (must exist in nodes/)
B: Target node type (must exist in nodes/)
rel: Relationship name (for example, buys, follows, cites)
<ext>: File extension (csv, parquet, or orc)
These files define all edges of a given type, with columns:
src, dst
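Because node type and relationship names may themselves contain underscores, splitting an edge file name on `_` can be ambiguous. The sketch below shows one way to recover the (source, relationship, target) triple by matching against the node types found in `nodes/`. The helper `parse_edge_type` is hypothetical, and rejecting ambiguous names is a design choice of this sketch, not a documented rule.

```python
def parse_edge_type(stem, node_types):
    """Split an edge file stem A_rel_B into (A, rel, B).

    The split is resolved against the known node types instead of
    splitting blindly on underscores.
    """
    matches = []
    for src in node_types:
        for dst in node_types:
            prefix, suffix = src + "_", "_" + dst
            if stem.startswith(prefix) and stem.endswith(suffix):
                rel = stem[len(prefix):len(stem) - len(suffix)]
                if rel:  # the relationship name must be non-empty
                    matches.append((src, rel, dst))
    if len(matches) != 1:
        raise ValueError(
            f"cannot parse edge type {stem!r}: {len(matches)} candidate splits"
        )
    return matches[0]
```

For example, `parse_edge_type("User_buys_Merchant", {"User", "Merchant"})` yields `("User", "buys", "Merchant")`, while a stem whose source or target type is missing from `node_types` raises `ValueError`.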
2. Edge Attribute Files (Optional)#
Each edge type can optionally include an edge attribute file:
<A_rel_B>_attr.<ext>
Must correspond exactly to an existing edge type <A_rel_B>.<ext>.
The number of rows must match the number of edges.
Columns define edge features.
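The row-count requirement suggests that attribute rows line up with edge rows by position, that is, row i of the attribute file describes edge i. The sketch below pairs them on that assumption, which the text does not state outright. It also assumes CSV files with header rows; `pair_edges_with_attrs` is a hypothetical helper.

```python
import csv


def pair_edges_with_attrs(edge_csv, attr_csv):
    """Pair each edge row with the attribute row at the same position."""
    with open(edge_csv, newline="") as edge_handle:
        edges = list(csv.DictReader(edge_handle))
    with open(attr_csv, newline="") as attr_handle:
        attrs = list(csv.DictReader(attr_handle))
    if len(edges) != len(attrs):
        raise ValueError(f"{len(attrs)} attribute rows for {len(edges)} edges")
    # Keep edge and attribute columns separate to avoid name collisions.
    return list(zip(edges, attrs))
```

Returning `(edge, attributes)` tuples rather than merged dictionaries keeps a feature column that happens to be called `src` or `dst` from overwriting the edge endpoints.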
3. Target Edge Label File (Exactly One)#
Exactly one edge type must have a label file for training:
<A_rel_B>_label.<ext>
Represents the supervised target for edge prediction.
The number of rows must equal the number of edges in the corresponding edge file <A_rel_B>.<ext>.
If an edge attribute file exists for the same type, it must also have the same number of rows.
Example Directory Layout#
edges/
├── A_rel_B.csv
├── A_rel_B_attr.csv
├── A_rel_B_label.csv
├── A_rel_C.csv
└── A_rel_C_attr.csv
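The three file roles in this layout can be told apart by suffix alone. The following sketch groups the contents of an `edges/` folder by edge type and role. `group_edge_files` is a hypothetical helper, and it assumes no target node type is literally named `attr` or `label`, since such a name would be indistinguishable from the role suffix.

```python
from collections import defaultdict
from pathlib import Path

# Suffix on the file stem -> role name used in the result.
ROLE_SUFFIXES = {"_attr": "attr", "_label": "label"}


def group_edge_files(edges_dir):
    """Group files in edges/ by edge type, keyed by role: main, attr, or label."""
    groups = defaultdict(dict)
    for path in sorted(Path(edges_dir).iterdir()):
        if not path.is_file():
            continue
        edge_type, role = path.stem, "main"
        for suffix, name in ROLE_SUFFIXES.items():
            if path.stem.endswith(suffix):
                edge_type, role = path.stem[: -len(suffix)], name
                break
        groups[edge_type].setdefault(role, []).append(path.name)
    return dict(groups)
```

For the example layout above, the `A_rel_B` group would contain a main, an attr, and a label file, and the `A_rel_C` group a main and an attr file.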
Constraints#
To ensure consistency and usability during model training, the following validation checks must hold:
1. Node Existence
For every edge file <A_rel_B>.<ext>, both A and B must exist in the nodes directory.
2. Edge Attribute Size Match
At most one _attr file may exist per edge type.
For each edge type <A_rel_B> that has an attribute file, its row count must match the number of edges of that edge type.
3. Single Label File
Exactly one _label file must exist across all edge types, and its row count must match the number of edges of the corresponding edge type.
4. File Format Consistency
All files must be in one of the allowed formats: csv, parquet, or orc.
Mixing formats across files is allowed but discouraged in production.
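The constraints above can be checked mechanically before training. The sketch below is one possible validator built only on the Python standard library; it is not an official tool. The names `validate_graph_dir` and `count_csv_rows` are hypothetical. Row counts are compared only when both files are CSV, on the assumption that each CSV has a header row; parquet and orc files would need a reader library and are skipped for that check.

```python
import csv
from pathlib import Path

ALLOWED_EXTENSIONS = {"csv", "parquet", "orc"}
ROLE_SUFFIXES = ("_attr", "_label")


def count_csv_rows(path):
    """Count data rows in a CSV file, assuming the first line is a header."""
    with open(path, newline="") as handle:
        return max(sum(1 for _ in csv.reader(handle)) - 1, 0)


def validate_graph_dir(data_dir):
    """Return a list of constraint violations; an empty list means none found."""
    root = Path(data_dir)
    errors = []
    node_files = [p for p in (root / "nodes").iterdir() if p.is_file()]
    edge_files = [p for p in (root / "edges").iterdir() if p.is_file()]

    # File format consistency: every file must use an allowed extension.
    for path in node_files + edge_files:
        if path.suffix.lstrip(".") not in ALLOWED_EXTENSIONS:
            errors.append(f"{path.name}: unsupported file format")

    node_types = {p.stem for p in node_files}
    by_stem = {}
    for path in edge_files:
        by_stem.setdefault(path.stem, []).append(path)
    main = {s: ps for s, ps in by_stem.items() if not s.endswith(ROLE_SUFFIXES)}

    # Node existence: the stem must split as A_rel_B with A and B in nodes/.
    for stem in main:
        if not any(
            stem.startswith(a + "_") and stem.endswith("_" + b)
            and len(stem) > len(a) + len(b) + 2
            for a in node_types for b in node_types
        ):
            errors.append(f"{stem}: source or target node type not found in nodes/")

    # Attribute and label files: matching edge file and equal row counts.
    label_count = 0
    for stem, paths in by_stem.items():
        suffix = next((s for s in ROLE_SUFFIXES if stem.endswith(s)), None)
        if suffix is None:
            continue
        if suffix == "_label":
            label_count += len(paths)
        elif len(paths) > 1:
            errors.append(f"{stem}: more than one _attr file for this edge type")
        base = stem[: -len(suffix)]
        if base not in main:
            errors.append(f"{stem}: no matching edge file {base}")
        elif paths[0].suffix == ".csv" and main[base][0].suffix == ".csv":
            if count_csv_rows(paths[0]) != count_csv_rows(main[base][0]):
                errors.append(f"{stem}: row count differs from {base}")

    # Single label file across all edge types.
    if label_count != 1:
        errors.append(f"expected exactly one _label file, found {label_count}")
    return errors
```

Collecting all violations in a list, instead of raising on the first one, lets a data pipeline report every problem in a single pass.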
Example Complete Structure#
graph_data/
├── nodes/
│ ├── A.csv
│ ├── B.csv
│ └── C.csv
└── edges/
├── A_rel_B.csv
├── A_rel_B_attr.csv
├── A_rel_B_label.csv
├── A_rel_C.csv
└── A_rel_C_attr.csv
Summary#
| Component | Description | Example Filename | Required |
|---|---|---|---|
| Node features | Per-node attributes | nodes/<nodetype>.<ext> | ✅ |
| Main edge list | Edge indices between node types | edges/<A_rel_B>.<ext> | ✅ |
| Edge attributes | Optional edge-level features | edges/<A_rel_B>_attr.<ext> | ⚙️ Optional |
| Edge labels | Supervised target for edge prediction | edges/<A_rel_B>_label.<ext> | ⚠️ Exactly one |
Notes#
File formats are interchangeable (csv, parquet, orc).
Each edge type is self-contained: its edge list, attributes, and labels all belong to the same <A_rel_B> group.
This design allows scalable and modular data pipelines for edge prediction in large multi-relational graphs.