## Long Sequence Performance

### LLAMA2-7B (FP8)
The table below shows the pre-training performance of LLAMA2-7B with context parallelism (CP) and compares it against results without CP at various input sequence lengths. For the CP runs, the table lists the detailed model-parallel configuration (TP, PP, DP, CP) alongside the achieved throughput. For the non-CP runs, we use the most performant model- and data-parallel configuration that fits within the memory capacity of the H100 GPU system.
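As a point of reference, the sketch below shows how the parallelism degrees from the table might be expressed in a NeMo Megatron-style model config. This is an illustrative assumption, not a copy of the exact run configuration: the key names (`tensor_model_parallel_size`, `pipeline_model_parallel_size`, `context_parallel_size`, `encoder_seq_length`) follow the convention of NeMo's Megatron GPT configs, and the data-parallel size is derived from the remaining GPUs rather than set explicitly.

```python
# Illustrative sketch only: maps the 1M-token row of the table (1024 GPUs,
# TP=8, PP=1, DP=4, CP=32) onto NeMo Megatron-style config keys. The key
# names are assumed from NeMo's Megatron GPT configs; verify them against
# your container version before use.
from omegaconf import OmegaConf

cfg = OmegaConf.create({
    "trainer": {"num_nodes": 128, "devices": 8},  # 128 x 8 = 1024 H100 GPUs
    "model": {
        "tensor_model_parallel_size": 8,    # TP
        "pipeline_model_parallel_size": 1,  # PP
        "context_parallel_size": 32,        # CP splits the sequence dimension
        # DP is not set directly: DP = world_size / (TP * PP * CP) = 4
        "encoder_seq_length": 1024 * 1024,  # 1M-token input sequences
    },
})
```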
- Container: NeMo 24.03.01.framework
- System: DGX-H100
| SeqLen (K) | # of GPUs | Without CP (TFLOPS/GPU) | TP | PP | DP | CP | With CP (TFLOPS/GPU) | Speedup (with CP / without CP) |
|---|---|---|---|---|---|---|---|---|
| 4 | 4 | 768 | 1 | 1 | 4 | 1 | 768 | 1.00 |
| 8 | 8 | 730 | 1 | 2 | 4 | 1 | 730 | 1.00 |
| 16 | 16 | 660 | 2 | 1 | 8 | 1 | 660 | 1.00 |
| 32 | 32 | 595 | 2 | 1 | 8 | 2 | 610 | 1.03 |
| 64 | 64 | 534 | 4 | 1 | 8 | 2 | 574 | 1.07 |
| 128 | 128 | 424 | 4 | 1 | 8 | 4 | 555 | 1.31 |
| 256 | 256 | 392 | 4 | 1 | 8 | 8 | 549 | 1.40 |
| 512 | 512 | 104 | 8 | 1 | 4 | 16 | 549 | 5.28 |
| 1024 | 1024 | 26.5 | 8 | 1 | 4 | 32 | 536 | 20.23 |
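Two invariants tie the table together: the parallel sizes of each CP run multiply to the total GPU count (TP × PP × DP × CP = # of GPUs), and the speedup column is simply the ratio of the two TFLOPS/GPU columns. Note also that CP sustains roughly 536 to 550 TFLOPS/GPU out to 1M tokens, while the non-CP runs collapse to 104 and 26.5 TFLOPS/GPU at 512K and 1M. The short check below (written for this page, not part of the original benchmark code) reproduces both invariants from the rows above.

```python
# Self-check: each row's parallel sizes cover every GPU, and the speedup
# column equals with-CP TFLOPS divided by without-CP TFLOPS.
rows = [
    # (seqlen_k, gpus, tflops_no_cp, tp, pp, dp, cp, tflops_cp)
    (4,    4,    768,  1, 1, 4,  1, 768),
    (8,    8,    730,  1, 2, 4,  1, 730),
    (16,   16,   660,  2, 1, 8,  1, 660),
    (32,   32,   595,  2, 1, 8,  2, 610),
    (64,   64,   534,  4, 1, 8,  2, 574),
    (128,  128,  424,  4, 1, 8,  4, 555),
    (256,  256,  392,  4, 1, 8,  8, 549),
    (512,  512,  104,  8, 1, 4, 16, 549),
    (1024, 1024, 26.5, 8, 1, 4, 32, 536),
]
for seqlen_k, gpus, tflops_no_cp, tp, pp, dp, cp, tflops_cp in rows:
    assert tp * pp * dp * cp == gpus       # parallel sizes cover every GPU
    speedup = tflops_cp / tflops_no_cp     # e.g. 536 / 26.5 ≈ 20.23 at 1M tokens
    print(f"{seqlen_k:>5}K tokens on {gpus:>4} GPUs: {speedup:.2f}x with CP")
```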