Released in 2023, Baichuan Intelligence Inc.’s Baichuan 2 is a large multilingual language model meant to fill a gap in the English-dominated LLM ecosystem. Built on the standard decoder-only Transformer architecture, Baichuan 2 makes several notable modifications, including SwiGLU activations, memory-efficient attention from xFormers, and Layer Normalization applied to the input of each transformer block. Additionally, the 7B model uses Rotary Position Embedding (RoPE), while the 13B model uses Attention with Linear Biases (ALiBi). More information is available in the paper “Baichuan 2: Open Large-scale Language Models”.
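To make the two architectural choices above concrete, the following is a minimal PyTorch sketch of a SwiGLU feed-forward block and an ALiBi bias matrix. It is illustrative only, not the NeMo or Baichuan reference implementation; the class/function names, layer names, and sizes are made up for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SwiGLUFeedForward(nn.Module):
    """Feed-forward block with the SwiGLU activation: silu(x W_gate) * (x W_up), then W_down."""

    def __init__(self, hidden_size: int, ffn_size: int):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_size, ffn_size, bias=False)
        self.up_proj = nn.Linear(hidden_size, ffn_size, bias=False)
        self.down_proj = nn.Linear(ffn_size, hidden_size, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))


def alibi_bias(num_heads: int, seq_len: int) -> torch.Tensor:
    """Attention-with-Linear-Biases matrix: each head adds a fixed penalty that
    grows linearly with the query/key distance, instead of position embeddings."""
    # Head slopes follow the geometric sequence 2^(-8*(h+1)/num_heads) used in the ALiBi paper.
    slopes = torch.tensor([2 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)])
    distance = torch.arange(seq_len)[None, :] - torch.arange(seq_len)[:, None]
    # Under a causal mask only past positions matter; the bias becomes more negative with distance.
    return slopes[:, None, None] * -distance.abs().float()  # (heads, seq, seq)


if __name__ == "__main__":
    x = torch.randn(2, 16, 64)                        # (batch, seq, hidden)
    print(SwiGLUFeedForward(64, 172)(x).shape)        # torch.Size([2, 16, 64])
    print(alibi_bias(num_heads=4, seq_len=16).shape)  # torch.Size([4, 16, 16])
```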
- Data Preparation
- Training with Predefined Configurations
- Checkpoint Conversion
- Model Evaluation
- Parameter Efficient Fine-Tuning (PEFT)
- Model Export to TensorRT-LLM
- Model Deployment
| Feature | Status |
|---|---|
| Data parallelism | ✓ |
| Tensor parallelism | ✓ |
| Pipeline parallelism | ✓ |
| Interleaved Pipeline Parallelism Schedule | N/A |
| Sequence parallelism | ✓ |
| Selective activation checkpointing | ✓ |
| Gradient checkpointing | ✓ |
| Partial gradient checkpointing | ✓ |
| FP32/TF32 | ✓ |
| AMP/FP16 | ✗ |
| BF16 | ✓ |
| TransformerEngine/FP8 | ✗ |
| Multi-GPU | ✓ |
| Multi-Node | ✓ |
| Inference | N/A |
| Slurm | ✓ |
| Base Command Manager | ✓ |
| Base Command Platform | ✓ |
| Distributed data preprocessing | ✓ |
| nvFuser | ✗ |
| P-Tuning and Prompt Tuning | ✓ |
| IA3 and Adapter learning | ✓ |
| Distributed Optimizer | ✓ |
| Distributed Checkpoint | ✓ |
| Fully Sharded Data Parallel | N/A |