Determinism

For reasons ranging from traceability and auditing to experimentation and debugging, determinism in training may be desired to achieve reproducible results. The goal is exact reproducibility from run to run: identical model weights and identical inference results.

For additional details, see: https://github.com/NVIDIA/tensorflow-determinism

Enabling deterministic training in Clara

All that is needed in Clara is to set the seeds in the determinism section of config.json:

{
    "epochs": 1240,
    "num_training_epoch_per_valid": 20,
    "learning_rate": 1e-4,
    "determinism": {
        "python_seed": "20191015",
        "random_seed": 123456,
        "numpy_seed": 654321,
        "tf_seed": 11111
    },
    ...
}

The values used in the example above are merely for demonstration purposes. You can set your own seeds:

The python_seed is set with a string containing a numerical value. All of the others are set with positive integers, as most random functions expect.

python_seed is responsible for setting the environment variable PYTHONHASHSEED, which fixes the seed used to generate the hash() of the types covered by hash randomization. random_seed seeds Python’s built-in random library. numpy_seed and tf_seed are used exactly as their names suggest, for NumPy and TensorFlow.
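
The following is a minimal sketch of how these four seeds typically map onto the underlying libraries; the function name and the TF 1.x-style seeding call are illustrative assumptions, not Clara's actual internals:

import os
import random

import numpy as np
import tensorflow as tf

def apply_determinism_seeds(python_seed, random_seed, numpy_seed, tf_seed):
    # python_seed is a string; it is exported as PYTHONHASHSEED so that hash()
    # of str and bytes objects is fixed. Note that to affect hash randomization,
    # this variable must be set before the Python interpreter starts, e.g. by a
    # launcher process.
    os.environ['PYTHONHASHSEED'] = python_seed

    # random_seed seeds Python's built-in random library.
    random.seed(random_seed)

    # numpy_seed seeds NumPy's global random state.
    np.random.seed(numpy_seed)

    # tf_seed sets TensorFlow's graph-level seed (TF 1.x-style API).
    tf.compat.v1.set_random_seed(tf_seed)

apply_determinism_seeds("20191015", 123456, 654321, 11111)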

There are still two other sources of non-determinism: multiple workers in the tf.data.Dataset pipeline and Horovod Tensor Fusion. Both are handled internally when determinism is enabled, as sketched below. Specifically, the number of workers is set to 1 for deterministic training, and in multi-GPU training the environment variable HOROVOD_FUSION_THRESHOLD is set to '0'.
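
The sketch below shows how these two adjustments might look in practice; the variable names and the non-deterministic default are illustrative assumptions, not Clara's internal code:

import os

import tensorflow as tf

determinism_enabled = True

if determinism_enabled:
    # A single worker removes non-deterministic element ordering
    # in the input pipeline.
    num_parallel_calls = 1
    # A threshold of '0' disables Horovod Tensor Fusion, whose varying
    # fusion order changes the floating-point summation order across runs.
    os.environ['HOROVOD_FUSION_THRESHOLD'] = '0'
else:
    num_parallel_calls = 4  # illustrative non-deterministic default

dataset = tf.data.Dataset.range(10)
dataset = dataset.map(lambda x: x * 2, num_parallel_calls=num_parallel_calls)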

Note

For deterministic training to work, it is important to eliminate all sources of randomness. It is recommended to keep the number of GPUs, the GPU architecture, driver versions, all framework versions, and the overall setup the same.