Long Sequence Performance

LLAMA2-7B (FP8)

  • The table below shows the pre-training performance of LLAMA2-7B with CP (context parallelism) and compares it against results without CP at various input sequence lengths. The CP rows list the model-parallel configuration used (TP, PP, DP, CP) and the achieved per-GPU throughput. For the non-CP runs, we use the most performant model- and data-parallel configuration that fits within the memory capacity of the H100 GPU system. A small consistency check over these numbers appears after the table.

| SeqLen (K tokens) | # of GPUs | TFLOPS/GPU (without CP) | TP | PP | DP | CP | TFLOPS/GPU (with CP) | Speedup (with CP / without CP) |
|---|---|---|---|---|---|---|---|---|
| 4 | 4 | 768 | 1 | 1 | 4 | 1 | 768 | 1.00 |
| 8 | 8 | 730 | 1 | 2 | 4 | 1 | 730 | 1.00 |
| 16 | 16 | 660 | 2 | 1 | 8 | 1 | 660 | 1.00 |
| 32 | 32 | 595 | 2 | 1 | 8 | 2 | 610 | 1.03 |
| 64 | 64 | 534 | 4 | 1 | 8 | 2 | 574 | 1.07 |
| 128 | 128 | 424 | 4 | 1 | 8 | 4 | 555 | 1.31 |
| 256 | 256 | 392 | 4 | 1 | 8 | 8 | 549 | 1.40 |
| 512 | 512 | 104 | 8 | 1 | 4 | 16 | 549 | 5.28 |
| 1024 | 1024 | 26.5 | 8 | 1 | 4 | 32 | 536 | 20.23 |

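As a minimal sketch (not part of the original benchmark code), the snippet below recomputes the speedup column and checks that TP × PP × DP × CP equals the GPU count for each row; all numbers are copied from the table above.

```python
# Sanity-check the CP results table (data copied from the table above; this
# script is illustrative only, not the benchmarking code itself).

# (seq_len_k, num_gpus, tflops_without_cp, tp, pp, dp, cp, tflops_with_cp)
rows = [
    (4,    4,    768.0, 1, 1, 4, 1,  768.0),
    (8,    8,    730.0, 1, 2, 4, 1,  730.0),
    (16,   16,   660.0, 2, 1, 8, 1,  660.0),
    (32,   32,   595.0, 2, 1, 8, 2,  610.0),
    (64,   64,   534.0, 4, 1, 8, 2,  574.0),
    (128,  128,  424.0, 4, 1, 8, 4,  555.0),
    (256,  256,  392.0, 4, 1, 8, 8,  549.0),
    (512,  512,  104.0, 8, 1, 4, 16, 549.0),
    (1024, 1024, 26.5,  8, 1, 4, 32, 536.0),
]

for seq_k, gpus, no_cp, tp, pp, dp, cp, with_cp in rows:
    # The product of the parallel sizes must account for every GPU in the job.
    assert tp * pp * dp * cp == gpus, f"parallel sizes do not cover {gpus} GPUs"
    # Speedup is the ratio of per-GPU throughput with CP to without CP.
    speedup = with_cp / no_cp
    print(f"seq={seq_k:5d}K  gpus={gpus:4d}  speedup={speedup:5.2f}x")
```

Running it reproduces the speedup column, including the large gains at 512K and 1024K tokens, where the non-CP configuration is severely memory-bound.
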
Speedup of LLAMA2-7B training with CP over without CP

[Figure: cp_speedup_figure]