FloydWarshall#

Overview#

The FloydWarshall operator computes the shortest paths between all pairs of nodes in a weighted graph. The current implementation supports graphs with up to 240 nodes.

Algorithm Description#

The Floyd-Warshall algorithm uses dynamic programming to compute all-pairs shortest paths. It considers paths through intermediate nodes, with the core recurrence relation:

\[\text{dist}[i][j][k] = \min(\text{dist}[i][j][k-1], \text{dist}[i][k][k-1] + \text{dist}[k][j][k-1])\]

where \(\text{dist}[i][j][k]\) represents the shortest distance from node \(i\) to node \(j\) considering only nodes \(\{0, 1, ..., k\}\) as intermediate nodes.

Input and Output Tensors#

The FloydWarshall operator takes the following tensors as input:

lenTensor: An \(N \times N\) matrix where \(\text{lenTensor}[i][j]\) contains the shortest path length from \(i\) to \(j\).
nextTensor: An \(N \times N\) matrix where \(\text{nextTensor}[i][j]\) contains the next node to visit on the shortest path from \(i\) to \(j\). This is used to reconstruct the shortest path.

The operator performs in-place updates on input tensors, modifying them to produce the final results.

Algorithm Steps#

Initialization: Both tensors are initialized based on the graph’s edge structure.
- \(\text{lenTensor}[i][i] = 0\) and \(\text{nextTensor}[i][i] = i\) for all \(i\)
- \(\text{lenTensor}[i][j] = \text{weight of the edge}\) if there is an edge from \(i\) to \(j\), otherwise \(\text{lenTensor}[i][j] = \infty\) (0xFFFF for \(\text{uint16_t}\)).
- \(\text{nextTensor}[i][j] = j\) if there is an edge from \(i\) to \(j\), otherwise \(\text{nextTensor}[i][j] = NIL\) (0xFFFF for \(\text{uint16_t}\)).
Computation: For each intermediate node \(k\) from 0 to \(N-1\):
- For each pair of nodes \((i, j)\), if the path \(i \rightarrow k \rightarrow j\) is shorter than the current path \(i \rightarrow j\):
  - Set \(\text{lenTensor}[i][j] = \text{lenTensor}[i][k] + \text{lenTensor}[k][j]\)
  - Set \(\text{nextTensor}[i][j] = \text{nextTensor}[i][k]\)
Output Generation: After iterating through all intermediate nodes, the final lenTensor contains the shortest path lengths, and the nextTensor enables path reconstruction.

Constraints#

The operator only supports \(\text{uint16_t}\) for \(\text{lenTensor}\) and \(\text{nextTensor}\).
The operator supports a maximum of 240 nodes (\(N \leq 240\)) due to VMEM constraints. Processing larger graphs requires partitioning or alternative algorithms.

Implementation Details#

The original Floyd-Warshall algorithm contains three nested loops over \(k\), \(i\), and \(j\). The FloydWarshall operator implementation restructures this as:

Outer Loop: Iterates over intermediate node \(k\) from 0 to \(N-1\)
Inner Loop: Iterates over all pairs \((i, j)\) using SIMD instructions

For each intermediate node \(k\), the VPU kernel processes all pairs \((i, j)\) in parallel using SIMD instructions:

Computes \(\text{newLen} = \text{lenTensor}[i][k] + \text{lenTensor}[k][j]\)
Evaluates the condition: \(\text{newLen} < \text{lenTensor}[i][j]\)
Updates \(\text{lenTensor}[i][j]\) and \(\text{nextTensor}[i][j]\) based on the condition.

Dataflow Configuration#

The FloydWarshall operator uses RasterDataFlow (RDF) for efficient data transfers:

Input Transfer: Two RDFs transfer \(\text{lenTensor}\) and \(\text{nextTensor}\) from DRAM to VMEM before the outer loop starts.
Output Transfer: Two RDFs transfer the updated tensors from VMEM back to DRAM after the outer loop ends.

Performance#

Execution Time is the average time required to execute the operator on a single VPU core. Note that each PVA contains two VPU cores, which can operate in parallel to process two streams simultaneously, or reduce execution time by approximately half by splitting the workload between the two cores.

Total Power represents the average total power consumed by the module when the operator is executed concurrently on both VPU cores. Idle power is approximately 7W when the PVA is not processing data.

For detailed information on interpreting the performance table below and understanding the benchmarking setup, see Performance Benchmark.

Size	DataType	Execution Time	Submit Latency	Total Power
32	U16	0.018ms	0.019ms	10.634W
33	U16	0.020ms	0.018ms	11.032W
65	U16	0.043ms	0.019ms	11.829W
100	U16	0.094ms	0.019ms	11.923W
240	U16	0.843ms	0.022ms	11.207W