## Long Sequence Performance

### LLAMA2-7B (FP8)
The table below shows the pre-training performance of LLAMA2-7B with context parallelism (CP) and compares it against results without CP at various input sequence lengths. For the CP runs, the table lists the detailed model-parallel configuration (TP, PP, DP, CP) alongside the achieved throughput. For the non-CP runs, we use the most performant model- and data-parallel configuration that fits within the memory capacity of the H100 GPU system.
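As a point of reference, the sketch below shows how the parallelism degrees from the table might be expressed in a NeMo Megatron-style model config. This is an illustrative assumption, not a copy of the exact run configuration: the key names (`tensor_model_parallel_size`, `pipeline_model_parallel_size`, `context_parallel_size`, `encoder_seq_length`) follow the convention of NeMo's Megatron GPT configs, and the data-parallel size is derived from the remaining GPUs rather than set explicitly.

```python
# Illustrative sketch only: maps the 1M-token row of the table (1024 GPUs,
# TP=8, PP=1, DP=4, CP=32) onto NeMo Megatron-style config keys. The key
# names are assumed from NeMo's Megatron GPT configs; verify them against
# your container version before use.
from omegaconf import OmegaConf

cfg = OmegaConf.create({
    "trainer": {"num_nodes": 128, "devices": 8},  # 128 x 8 = 1024 H100 GPUs
    "model": {
        "tensor_model_parallel_size": 8,    # TP
        "pipeline_model_parallel_size": 1,  # PP
        "context_parallel_size": 32,        # CP splits the sequence dimension
        # DP is not set directly: DP = world_size / (TP * PP * CP) = 4
        "encoder_seq_length": 1024 * 1024,  # 1M-token input sequences
    },
})
```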
- Container: NeMo 24.03.01.framework
- System: DGX-H100
| SeqLen (K) | # of GPUs | Without CP (TFLOPS/GPU) | TP | PP | DP | CP | With CP (TFLOPS/GPU) | Speedup (with CP / without CP) |
|---|---|---|---|---|---|---|---|---|
| 4 | 4 | 768 | 1 | 1 | 4 | 1 | 768 | 1.00 |
| 8 | 8 | 730 | 1 | 2 | 4 | 1 | 730 | 1.00 |
| 16 | 16 | 660 | 2 | 1 | 8 | 1 | 660 | 1.00 |
| 32 | 32 | 595 | 2 | 1 | 8 | 2 | 610 | 1.03 |
| 64 | 64 | 534 | 4 | 1 | 8 | 2 | 574 | 1.07 |
| 128 | 128 | 424 | 4 | 1 | 8 | 4 | 555 | 1.31 |
| 256 | 256 | 392 | 4 | 1 | 8 | 8 | 549 | 1.40 |
| 512 | 512 | 104 | 8 | 1 | 4 | 16 | 549 | 5.28 |
| 1024 | 1024 | 26.5 | 8 | 1 | 4 | 32 | 536 | 20.23 |
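Two invariants tie the table together: the parallel sizes of each CP run multiply to the total GPU count (TP × PP × DP × CP = # of GPUs), and the speedup column is simply the ratio of the two TFLOPS/GPU columns. Note also that CP sustains roughly 536 to 550 TFLOPS/GPU out to 1M tokens, while the non-CP runs collapse to 104 and 26.5 TFLOPS/GPU at 512K and 1M. The short check below (written for this page, not part of the original benchmark code) reproduces both invariants from the rows above.

```python
# Self-check: each row's parallel sizes cover every GPU, and the speedup
# column equals with-CP TFLOPS divided by without-CP TFLOPS.
rows = [
    # (seqlen_k, gpus, tflops_no_cp, tp, pp, dp, cp, tflops_cp)
    (4,    4,    768,  1, 1, 4,  1, 768),
    (8,    8,    730,  1, 2, 4,  1, 730),
    (16,   16,   660,  2, 1, 8,  1, 660),
    (32,   32,   595,  2, 1, 8,  2, 610),
    (64,   64,   534,  4, 1, 8,  2, 574),
    (128,  128,  424,  4, 1, 8,  4, 555),
    (256,  256,  392,  4, 1, 8,  8, 549),
    (512,  512,  104,  8, 1, 4, 16, 549),
    (1024, 1024, 26.5, 8, 1, 4, 32, 536),
]
for seqlen_k, gpus, tflops_no_cp, tp, pp, dp, cp, tflops_cp in rows:
    assert tp * pp * dp * cp == gpus       # parallel sizes cover every GPU
    speedup = tflops_cp / tflops_no_cp     # e.g. 536 / 26.5 ≈ 20.23 at 1M tokens
    print(f"{seqlen_k:>5}K tokens on {gpus:>4} GPUs: {speedup:.2f}x with CP")
```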