Unified Recipe for Training Global Weather Forecasting Models
This example demonstrates how to train a neural global weather forecast model. The recipe is set up so that modifying the model architecture, data, or the training procedure is straightforward.
The conf directory contains the configuration files for the model, data, training, etc. The configs are written in YAML and managed with the omegaconf library. Several example configs are provided for generating different datasets, models, and training procedures. For example, AFNO and GraphCast are provided with corresponding training procedure and dataset configs. The default configs download and train on only a tiny dataset and can be run on an 8 GB GPU. To train larger models, adjust conf/config.yaml according to the comments.
In this example we provide scripts to obtain the ERA5 dataset from ARCO ERA5 and to perform the needed curation and remapping steps. ARCO ERA5 contains a complete lat-lon gridded dataset of the ERA5 reanalysis, including single-level and pressure-level data. Often, when training a model on ERA5, a temporal and channel subset is used. For example, FourCastNet (AFNO) is trained on a 20-channel subset of ERA5 at 6-hour temporal resolution. Remapping from the lat-lon grid may also be needed, as is the case with the DLWP model. Given these requirements, we provide the following workflow for generating the needed datasets, which works for most applications.
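To make the idea of a temporal subset concrete, here is a minimal sketch that lists the timestamps kept when hourly data is subsampled to a 6-hour resolution. The function name subsample_times is hypothetical and is not part of the scripts in this example; it only illustrates the selection, under the assumption that the source data is hourly.

```python
from datetime import datetime, timedelta


def subsample_times(start, end, step_hours):
    """Return timestamps from start (inclusive) to end (exclusive), one every step_hours."""
    times = []
    current = start
    while current < end:
        times.append(current)
        current += timedelta(hours=step_hours)
    return times


# Keep one day of data at 6-hour resolution (hypothetical date range).
day = subsample_times(datetime(2020, 1, 1), datetime(2020, 1, 2), step_hours=6)
print([t.strftime("%H:%M") for t in day])
```

For one day this keeps four time steps (00:00, 06:00, 12:00, 18:00) instead of 24, which is where most of the storage saving of a temporal subset comes from.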
Download temporally subsampled ERA5 dataset from ARCO ERA5
We recommend first downloading a temporally subsampled, single-leveled subset of ERA5 from ARCO ERA5. This can be done using the download_era5.py script; its configs can be found in ./conf/dataset/. With the non-tiny configs the script requires ~40 TB of storage, but the configs can be adjusted to download a smaller subset of the data. Over a 2.5 Gb/s connection the download takes ~1.5 days. The default configs download only ~100 GB.
python download_era5.py
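The download-time figure can be checked with a short back-of-the-envelope calculation. The sketch below assumes decimal units (1 TB = 10^12 bytes) and an ideal link with no protocol overhead or throttling, so real transfers will usually be slower. The helper download_time_days is hypothetical and not part of download_era5.py.

```python
def download_time_days(size_bytes, link_bits_per_second):
    """Ideal transfer time in days, ignoring protocol overhead and throttling."""
    seconds = size_bytes * 8 / link_bits_per_second
    return seconds / 86_400  # seconds per day


TB = 10**12  # decimal terabyte (assumption)
GB = 10**9   # decimal gigabyte (assumption)
LINK = 2.5e9  # 2.5 Gb/s, as quoted in the text

full = download_time_days(40 * TB, LINK)
tiny = download_time_days(100 * GB, LINK)
print(f"full dataset: {full:.2f} days")
print(f"default configs: {tiny * 24 * 60:.1f} minutes")
```

Under these assumptions the 40 TB download comes out just under 1.5 days, consistent with the estimate above, while the ~100 GB default download takes only a few minutes.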
Generate Curated Dataset for Training
Once the ERA5 dataset is downloaded, you can generate a curated dataset for training. This can be done using the curate_era5.py script; its configs can be found in ./conf/curated_dataset/. The script generates the Zarr dataset needed for training, applying any required transformations such as regridding.
python curate_era5.py
NOTE
In theory, one should be able to perform curation directly from ARCO ERA5. This works, but it carries a significant penalty because the pressure levels are chunked together in ARCO ERA5: to extract a single pressure level you need to download all 37 levels. If you plan to test multiple transforms or channel subsamplings, this becomes prohibitively expensive. For this reason we recommend following the workflow described above. We have also raised an issue with ARCO ERA5 about this chunking, and we will update these instructions if it is resolved.
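The cost described in the note above can be quantified with a rough sketch. It assumes, as the note states, that every read of a pressure-level variable fetches a chunk holding all 37 levels. The function names are hypothetical and only illustrate the arithmetic.

```python
LEVELS_PER_CHUNK = 37  # pressure levels stored together, per the note above


def read_amplification(levels_needed, levels_per_chunk=LEVELS_PER_CHUNK):
    """Ratio of levels downloaded to levels actually used."""
    if not 1 <= levels_needed <= levels_per_chunk:
        raise ValueError("levels_needed must be between 1 and levels_per_chunk")
    return levels_per_chunk / levels_needed


def remote_levels_fetched(curation_runs, levels_per_chunk=LEVELS_PER_CHUNK):
    """Levels transferred when every curation run reads whole chunks remotely."""
    return curation_runs * levels_per_chunk


print(read_amplification(1))      # one level wanted, 37 fetched
print(remote_levels_fetched(5))   # five experiments, each re-reading full chunks
```

Because the remote cost scales with the number of curation runs rather than with the number of levels you keep, repeated experiments against a local copy avoid paying this overhead each time.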
Prerequisites
Install the required dependencies by running:
pip install -r requirements.txt
Apart from the dataset configs, the main configs for training are model, training, and validation. Adjust these as needed, then train the model by running:
python train.py
Progress can be monitored using MLflow. Open a new terminal, navigate to the training directory, and run:
mlflow ui -p 2458
View progress in a browser at http://127.0.0.1:2458
Data parallelism is also supported for multi-GPU runs. To launch multi-GPU training, run:
mpirun -np <num_GPUs> python train.py
If running inside a Docker container, you may need to add the --allow-run-as-root flag to the multi-GPU run command.
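As a generic illustration of what data parallelism means here, the sketch below splits sample indices across processes so that each one works on a disjoint shard. This is not how train.py is implemented. The function shard_indices is hypothetical, and reading the rank from the OMPI_COMM_WORLD_RANK and OMPI_COMM_WORLD_SIZE environment variables assumes the processes were launched by Open MPI's mpirun.

```python
import os


def shard_indices(num_samples, rank, world_size):
    """Indices handled by one process: every world_size-th sample, offset by rank."""
    if not 0 <= rank < world_size:
        raise ValueError("rank must satisfy 0 <= rank < world_size")
    return list(range(rank, num_samples, world_size))


# Fall back to a single process when not launched under Open MPI (assumption).
rank = int(os.environ.get("OMPI_COMM_WORLD_RANK", 0))
world_size = int(os.environ.get("OMPI_COMM_WORLD_SIZE", 1))

my_samples = shard_indices(10, rank, world_size)
print(f"rank {rank} of {world_size} handles samples {my_samples}")
```

With four processes and ten samples, rank 0 would handle samples 0, 4, and 8, and rank 3 would handle samples 3 and 7. Every sample is assigned to exactly one process.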
One of the showcased models available in the configs is the Spherical Fourier Neural Operator (SFNO), from Spherical Fourier Neural Operators: Learning Stable Dynamics on the Sphere. To train the SFNO model, Modulus Makani needs to be installed. This allows the model to be added to the Modulus model registry. For more information on this process, please refer to Modulus model registry.
git clone git@github.com:NVIDIA/makani.git
cd makani
pip install -e .
The config file can be modified to train the SFNO model by uncommenting all SFNO configs. After completing the dataset download and curation steps described earlier, train the model by running:
python train.py