Heterogeneous GNN Data Format for Edge Prediction#

This section defines the data organization and structure required to prepare heterogeneous graph data for edge prediction tasks.

Directory Structure Overview#

All training data must be organized under a common root directory containing two subfolders:

data_dir/
├── edges/
└── nodes/

Each folder follows strict naming conventions and structural rules, described in the sections below.

Nodes Folder#

Path#

nodes/

Description#

Each node type in the heterogeneous graph corresponds to one file inside the nodes folder.

File Naming Convention#

<nodetype>.<ext>

<nodetype>: The unique identifier for a node type (for example, User, Merchant, Device).
<ext>: File format extension, one of:
```
{"csv", "parquet", "orc"}
```

Example#

nodes/
├── A.csv
├── B.csv
└── C.csv

Format Requirements#

Each file contains one node per row.
Columns represent node features or attributes (for example, embeddings, categorical variables, numerical variables, and so on).
The row position of a node in its file (for example, nodes/A.csv) is its node ID: Node Index = Row Position.
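
The row-position rule can be illustrated with a minimal sketch. It assumes a CSV node file with a header row and zero-based indexing; neither is stated above, and the column names `age` and `country` are hypothetical.

```python
import csv
import io

def load_nodes(csv_text):
    """Return one feature dict per node; the list position is the node index."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return list(reader)

# Hypothetical contents of nodes/User.csv (header row assumed).
sample = "age,country\n34,DE\n27,US\n51,JP\n"
nodes = load_nodes(sample)

# Assumption: indices start at 0, so the first data row is node 0.
for node_id, features in enumerate(nodes):
    print(node_id, features)
```

Because the index is implicit, reordering or dropping rows in a node file changes node IDs, so any edge file that refers to those nodes would need to be regenerated.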

Edges Folder#

Path#

edges/

Description#

The edges folder contains all edge relationships between node types, as well as optional edge attributes and a target label file for edge prediction.

File Naming Conventions#

1. Main Edge Files#

Each edge type is represented by a file of the form:

<A_rel_B>.<ext>

Where:

A: Source node type (must exist in nodes/)
B: Target node type (must exist in nodes/)
rel: Relationship name (for example, buys, follows, cites)
<ext>: File extension (csv, parquet, or orc)

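These files define all edges of a given type, one edge per row, with columns:

src, dst

Following the row-position rule for node files, src identifies the source node in the A node file and dst identifies the target node in the B node file.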
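
As a rough sketch of how an edge file name and its src/dst columns might be checked, the helpers below split an `<A_rel_B>` name into its parts and flag edges that point outside the node files. The function names are hypothetical, and the sketch assumes that node-type and relationship names contain no underscores and that src and dst are zero-based row positions.

```python
def parse_edge_stem(stem):
    """Split 'A_rel_B' into (A, rel, B). Assumes no underscores inside names."""
    parts = stem.split("_")
    if len(parts) != 3:
        raise ValueError(f"expected <A>_<rel>_<B>, got {stem!r}")
    return tuple(parts)

def out_of_range_edges(edges, num_src, num_dst):
    """Return row positions of edges whose endpoints are not valid node indices."""
    return [
        i
        for i, (src, dst) in enumerate(edges)
        if not (0 <= src < num_src and 0 <= dst < num_dst)
    ]

src_type, rel, dst_type = parse_edge_stem("User_buys_Merchant")

# Hypothetical rows of edges/User_buys_Merchant.csv as (src, dst) pairs,
# checked against 3 User nodes and 2 Merchant nodes.
edges = [(0, 1), (2, 0), (3, 1)]
bad_rows = out_of_range_edges(edges, num_src=3, num_dst=2)
print(src_type, rel, dst_type, bad_rows)
```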

2. Edge Attribute Files (Optional)#

Each edge type can optionally include an edge attribute file:

<A_rel_B>_attr.<ext>

The attribute file must correspond exactly to an existing edge file <A_rel_B>.<ext>.
Its number of rows must match the number of edges in that file.
Its columns define the edge features.

3. Target Edge Label File (Exactly One)#

Exactly one edge type must have a label file for training:

<A_rel_B>_label.<ext>

The label file holds the supervised target for edge prediction.
Its number of rows must equal the number of edges in the corresponding edge file <A_rel_B>.<ext>.
If an edge attribute file exists for the same edge type, it must also have the same number of rows.
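
The row-count rules suggest that row i of the attribute and label files describes edge i. That alignment is an assumption here, not something stated above. Under it, a minimal sketch of joining the three files by row position, with a hypothetical helper name and a hypothetical `amount` feature, might look like this:

```python
def align_edge_rows(edges, labels, attrs=None):
    """Pair each edge with its label (and attributes, if given) by row position."""
    if len(labels) != len(edges):
        raise ValueError(f"{len(labels)} label rows for {len(edges)} edges")
    if attrs is not None and len(attrs) != len(edges):
        raise ValueError(f"{len(attrs)} attribute rows for {len(edges)} edges")
    if attrs is None:
        attrs = [None] * len(edges)
    return [
        {"src": src, "dst": dst, "attr": attr, "label": label}
        for (src, dst), attr, label in zip(edges, attrs, labels)
    ]

# Hypothetical rows from A_rel_B.csv, A_rel_B_attr.csv, and A_rel_B_label.csv.
edges = [(0, 2), (1, 0), (1, 2)]
attrs = [{"amount": 12.5}, {"amount": 3.0}, {"amount": 48.0}]
labels = [0, 1, 0]

rows = align_edge_rows(edges, labels, attrs)
print(rows[1])
```

Failing early on a length mismatch is a deliberate choice in this sketch: silently truncating to the shortest file would shift labels onto the wrong edges.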

Example Directory Layout#

edges/
├── A_rel_B.csv
├── A_rel_B_attr.csv
├── A_rel_B_label.csv
├── A_rel_C.csv
└── A_rel_C_attr.csv

Constraints#

To ensure consistency and usability during model training, the following validation checks must hold:

1. Node Existence
   - For every edge file <A_rel_B>.<ext>, node files for both A and B must exist in the nodes directory.
2. Edge Attribute Size Match
   - Each edge type may have at most one _attr file.
   - For each edge type <A_rel_B> that has an attribute file, its row count must match the number of edges of that edge type.
3. Single Label File
   - Exactly one _label file must exist across all edge types, and its row count must match the number of edges of the corresponding edge type.
4. File Format Consistency
   - All files must be in one of the allowed formats: csv, parquet, or orc.
   - Mixing formats across files is allowed, but discouraged in production for consistency.
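
The name-level parts of these checks can be sketched as a small validator that works on file listings only. This is an illustrative sketch, not a reference implementation: `validate_names` is a hypothetical helper, it assumes that names contain no underscores and that no node type is called `attr` or `label`, and it leaves the row-count checks to a reader for each file format.

```python
from pathlib import PurePath

ALLOWED_EXTS = {"csv", "parquet", "orc"}

def validate_names(node_files, edge_files):
    """Return a list of problems found in the file names; empty means they pass."""
    problems = []
    node_types = set()
    for name in node_files:
        path = PurePath(name)
        if path.suffix.lstrip(".") not in ALLOWED_EXTS:
            problems.append(f"nodes/{name}: unsupported format")
            continue
        node_types.add(path.stem)

    edge_types, attrs, labels = set(), [], []
    for name in edge_files:
        path = PurePath(name)
        if path.suffix.lstrip(".") not in ALLOWED_EXTS:
            problems.append(f"edges/{name}: unsupported format")
            continue
        stem = path.stem
        if stem.endswith("_attr"):
            attrs.append(stem[: -len("_attr")])
        elif stem.endswith("_label"):
            labels.append(stem[: -len("_label")])
        else:
            edge_types.add(stem)

    # Node existence: both endpoints of every edge type need a node file.
    for edge_type in sorted(edge_types):
        parts = edge_type.split("_")
        if len(parts) != 3:
            problems.append(f"{edge_type}: name is not of the form A_rel_B")
            continue
        src, _rel, dst = parts
        for node_type in (src, dst):
            if node_type not in node_types:
                problems.append(f"{edge_type}: node type {node_type!r} missing from nodes/")

    # Attribute files: each needs a matching edge file, at most one per edge type.
    for edge_type in sorted(set(attrs)):
        if edge_type not in edge_types:
            problems.append(f"{edge_type}_attr: no matching edge file")
        if attrs.count(edge_type) > 1:
            problems.append(f"{edge_type}: more than one _attr file")

    # Label file: exactly one overall, tied to an existing edge type.
    if len(labels) != 1:
        problems.append(f"expected exactly one _label file, found {len(labels)}")
    elif labels[0] not in edge_types:
        problems.append(f"{labels[0]}_label: no matching edge file")
    return problems

node_files = ["A.csv", "B.csv", "C.csv"]
edge_files = ["A_rel_B.csv", "A_rel_B_attr.csv", "A_rel_B_label.csv",
              "A_rel_C.csv", "A_rel_C_attr.csv"]
print(validate_names(node_files, edge_files))
```

Returning a list of problems instead of raising on the first one lets a data pipeline report every naming issue in a single pass.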

Example Complete Structure#

graph_data/
├── nodes/
│   ├── A.csv
│   ├── B.csv
│   └── C.csv
└── edges/
    ├── A_rel_B.csv
    ├── A_rel_B_attr.csv
    ├── A_rel_B_label.csv
    ├── A_rel_C.csv
    └── A_rel_C_attr.csv
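
For experimentation, the layout above could be generated with a short script along these lines. It is a sketch with placeholder data: the header rows, the feature columns `x` and `weight`, and the label column `label` are hypothetical, and the tree is written to a temporary directory.

```python
import csv
import tempfile
from pathlib import Path

def write_csv(path, header, rows):
    """Write a header line followed by one line per row."""
    with path.open("w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(header)
        writer.writerows(rows)

def build_example(root):
    """Create the example tree under root with tiny placeholder contents."""
    nodes, edges = root / "nodes", root / "edges"
    nodes.mkdir(parents=True)
    edges.mkdir(parents=True)
    # Two nodes of each type; the feature column "x" is hypothetical.
    for node_type in ("A", "B", "C"):
        write_csv(nodes / f"{node_type}.csv", ["x"], [[0.1], [0.2]])
    # Two A->B edges, each with one attribute row and one label row.
    write_csv(edges / "A_rel_B.csv", ["src", "dst"], [[0, 1], [1, 0]])
    write_csv(edges / "A_rel_B_attr.csv", ["weight"], [[0.5], [0.9]])
    write_csv(edges / "A_rel_B_label.csv", ["label"], [[1], [0]])
    # One A->C edge with attributes and no label file.
    write_csv(edges / "A_rel_C.csv", ["src", "dst"], [[0, 0]])
    write_csv(edges / "A_rel_C_attr.csv", ["weight"], [[0.3]])

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "graph_data"
    build_example(root)
    created = sorted(p.relative_to(root).as_posix() for p in root.rglob("*.csv"))
    print("\n".join(created))
```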

Summary#

| Component | Description | Example Filename | Required |
| --- | --- | --- | --- |
| Node features | Per-node attributes | `A.csv` | ✅ |
| Main edge list | Edge indices between node types | `A_rel_B.csv` | ✅ |
| Edge attributes | Optional edge-level features | `A_rel_B_attr.csv` | ⚙️ Optional |
| Edge labels | Supervised target for edge prediction | `A_rel_B_label.csv` | ⚠️ Exactly one |

Notes#

File formats are interchangeable (csv, parquet, orc).
Each edge type is self-contained: its edge list, attributes, and labels all belong to the same <A_rel_B> group.
This design allows scalable and modular data pipelines for edge prediction in large multi-relational graphs.