Datamodule utils
float_or_int_or_none(value)
Converts a given value into a float, int, or None.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
value | Union[str, float, int, None] | A value that can be either a string, float, int, or None. | required |
Returns:
Type | Description |
---|---|
Union[float, int, None] | A float, int, or None based on the input value. |
If the input value is None or "None", it returns None. If the input value is an int or float, it returns the same value. If the input value is a string, it tries to convert it into an int if possible, otherwise into a float.
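The conversion rules above can be sketched as follows. This is a minimal re-implementation for illustration only, not the code in datamodule_utils.py; the name `float_or_int_or_none_sketch` is hypothetical.

```python
from typing import Union


def float_or_int_or_none_sketch(
    value: Union[str, float, int, None],
) -> Union[float, int, None]:
    """Hypothetical stand-in that follows the rules described above."""
    # None and the literal string "None" both map to None.
    if value is None or value == "None":
        return None
    # Numeric inputs are returned unchanged.
    if isinstance(value, (int, float)):
        return value
    # Strings: prefer int, fall back to float.
    try:
        return int(value)
    except ValueError:
        return float(value)


print(float_or_int_or_none_sketch("None"))  # None
print(float_or_int_or_none_sketch("3"))     # 3
print(float_or_int_or_none_sketch("0.5"))   # 0.5
```

A helper of this kind is useful when a value such as limit_batches arrives from the command line as a string but must be treated as a number downstream.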
Source code in bionemo/llm/utils/datamodule_utils.py
infer_global_batch_size(micro_batch_size, num_nodes, devices, accumulate_grad_batches=1, tensor_model_parallel_size=1, pipeline_model_parallel_size=1)
Infers the global batch size based on the micro batch size, number of nodes, devices, accumulation of gradient batches, and model parallel sizes.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
micro_batch_size | int | The micro batch size. | required |
num_nodes | int | The number of nodes. | required |
devices | int | The number of devices. | required |
accumulate_grad_batches | int | The accumulation of gradient batches. Defaults to 1. | 1 |
tensor_model_parallel_size | int | The tensor model parallel size. Defaults to 1. | 1 |
pipeline_model_parallel_size | int | The pipeline model parallel size. Defaults to 1. | 1 |
Returns:
Type | Description |
---|---|
int | The global batch size. |
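This page does not state the formula, so the sketch below is an assumption based on the usual data-parallel relationship: the world size is `num_nodes * devices` (treating `devices` as devices per node), the data parallel size is the world size divided by the product of the two model parallel sizes, and the global batch size is the micro batch size times the data parallel size times the accumulation factor. The function name and the divisibility check are hypothetical.

```python
def infer_global_batch_size_sketch(
    micro_batch_size: int,
    num_nodes: int,
    devices: int,
    accumulate_grad_batches: int = 1,
    tensor_model_parallel_size: int = 1,
    pipeline_model_parallel_size: int = 1,
) -> int:
    """Hypothetical sketch of the conventional global batch size arithmetic."""
    # Assumption: `devices` counts devices per node.
    world_size = num_nodes * devices
    model_parallel_size = tensor_model_parallel_size * pipeline_model_parallel_size
    # Assumption: model-parallel groups must tile the world size exactly.
    if world_size % model_parallel_size != 0:
        raise ValueError(
            f"world size {world_size} is not divisible by "
            f"model parallel size {model_parallel_size}"
        )
    data_parallel_size = world_size // model_parallel_size
    return micro_batch_size * data_parallel_size * accumulate_grad_batches


# 2 nodes x 8 devices, tensor parallel 2: 16 / 2 = 8 data-parallel replicas,
# so 4 samples x 8 replicas x 2 accumulation steps = 64.
print(infer_global_batch_size_sketch(4, 2, 8, accumulate_grad_batches=2,
                                     tensor_model_parallel_size=2))  # 64
```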
Source code in bionemo/llm/utils/datamodule_utils.py
infer_num_samples(limit_batches, num_samples_in_dataset, global_batch_size, stage)
Infers the number of samples based on the limit_batches parameter, the length of the dataset, and the global batch size.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
limit_batches | Union[float, int, str, None] | The limit on the number of batches. Can be a float between 0 and 1, an integer, a string, or None. If None, defaults to 1.0. | required |
num_samples_in_dataset | int | The number of samples in the dataset. | required |
global_batch_size | int | The global batch size. | required |
stage | str | The stage of the training. | required |
Returns:
Type | Description |
---|---|
int | The number of samples from the limit. |
Raises:
Type | Description |
---|---|
ValueError | If the limited number of samples is less than the global batch size, or if the limit_batches parameter is invalid. |
If limit_batches is a float between 0 and 1, the number of samples is inferred as that fraction of the number of samples in the dataset. If limit_batches is an integer greater than or equal to 1, the number of limited samples is the product of limit_batches and the global batch size. If limit_batches is None, it defaults to 1.0, meaning that all dataset samples are used.
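The three cases and the error conditions can be sketched as below. This is an illustrative re-implementation under stated assumptions, not the library code: the name `infer_num_samples_sketch` is hypothetical, string inputs are assumed to be parsed as an int when possible and otherwise as a float, and `stage` is assumed to be used only to label error messages.

```python
from typing import Union


def infer_num_samples_sketch(
    limit_batches: Union[float, int, str, None],
    num_samples_in_dataset: int,
    global_batch_size: int,
    stage: str,
) -> int:
    """Hypothetical sketch of the limit_batches rules described above."""
    # None means "use the whole dataset".
    if limit_batches is None:
        limit_batches = 1.0
    # Assumption: strings are parsed as int if possible, else as float.
    if isinstance(limit_batches, str):
        try:
            limit_batches = int(limit_batches)
        except ValueError:
            limit_batches = float(limit_batches)

    if isinstance(limit_batches, float) and 0 < limit_batches <= 1.0:
        # Fraction of the dataset.
        num_limited = int(num_samples_in_dataset * limit_batches)
    elif isinstance(limit_batches, int) and limit_batches >= 1:
        # Whole number of global batches.
        num_limited = limit_batches * global_batch_size
    else:
        raise ValueError(f"Invalid limit_batches for {stage}: {limit_batches!r}")

    if num_limited < global_batch_size:
        raise ValueError(
            f"{stage}: {num_limited} limited samples is less than "
            f"the global batch size {global_batch_size}"
        )
    return num_limited


print(infer_num_samples_sketch(0.5, 1000, 32, "train"))   # 500
print(infer_num_samples_sketch(10, 1000, 32, "val"))      # 320
print(infer_num_samples_sketch(None, 1000, 32, "test"))   # 1000
```

Note the type distinction this implies: the float 1.0 selects the whole dataset, while the integer 1 selects a single global batch.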
Source code in bionemo/llm/utils/datamodule_utils.py
parse_kwargs_to_arglist(kwargs)
Converts a dictionary of keyword arguments into a list of command-line arguments.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
kwargs | Dict[str, Any] | A dictionary where keys are argument names and values are argument values. | required |
Returns:
Type | Description |
---|---|
List[str] | A list of strings, where each string is a command-line argument in the format '--argument-name value'. |
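A minimal sketch of such a conversion is shown below. Two details are assumptions rather than facts from this page: underscores in keys are turned into hyphens (suggested by the '--argument-name' format), and the flag and its value are emitted as two separate list elements, which is the shape `argparse` and `subprocess` expect. The name `parse_kwargs_to_arglist_sketch` is hypothetical.

```python
from typing import Any, Dict, List


def parse_kwargs_to_arglist_sketch(kwargs: Dict[str, Any]) -> List[str]:
    """Hypothetical sketch: turn {"some_arg": 1} into ["--some-arg", "1"]."""
    arglist: List[str] = []
    for key, value in kwargs.items():
        # Assumption: snake_case keys become kebab-case flags.
        arglist.append(f"--{key.replace('_', '-')}")
        # Values are stringified so the list can be passed to a CLI parser.
        arglist.append(str(value))
    return arglist


print(parse_kwargs_to_arglist_sketch({"micro_batch_size": 4, "lr": 0.001}))
# ['--micro-batch-size', '4', '--lr', '0.001']
```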
Source code in bionemo/llm/utils/datamodule_utils.py
tensor_dict_hash(tensor_dict, hash_func=None)
Generates a hash for the given tensor dictionary using the specified hash function.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
tensor_dict | Dict[str, Tensor] | The input tensor dictionary to be hashed. | required |
hash_func | Optional[Callable] | An optional hash function to use. If None, defaults to SHA-256. | None |
Returns:
Type | Description |
---|---|
str | The resulting hash string. |
If no hash function is provided, SHA-256 is used by default. Each tensor is first converted to a contiguous array on the CPU and then to bytes before hashing.
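How the per-tensor hashes are combined is not stated on this page. The sketch below assumes one plausible scheme: hash each tensor separately, visit keys in sorted order so the result does not depend on insertion order, and concatenate the hex digests. It also assumes that `hash_func` receives a tensor and returns a string. Both function names are hypothetical, and the input is expected to hold PyTorch tensors (anything exposing `detach`, `cpu`, `contiguous`, and `numpy`).

```python
import hashlib
from typing import Callable, Dict, Optional


def _sha256_of_tensor(tensor) -> str:
    """Default hasher: contiguous CPU array -> bytes -> SHA-256 hex digest."""
    data = tensor.detach().cpu().contiguous().numpy().tobytes()
    return hashlib.sha256(data).hexdigest()


def tensor_dict_hash_sketch(
    tensor_dict: Dict[str, "torch.Tensor"],
    hash_func: Optional[Callable] = None,
) -> str:
    """Hypothetical sketch: combine per-tensor hashes in sorted key order."""
    hasher = hash_func if hash_func is not None else _sha256_of_tensor
    # Assumption: digests are concatenated; sorting keys makes the result
    # independent of dictionary insertion order.
    return "".join(hasher(tensor_dict[key]) for key in sorted(tensor_dict))
```

With PyTorch installed, `tensor_dict_hash_sketch({"a": torch.ones(2), "b": torch.zeros(2)})` returns the same string regardless of the order in which the two keys were inserted.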
Source code in bionemo/llm/utils/datamodule_utils.py
tensor_hash(tensor, hash_func=None)
Generates a hash for the given tensor using the specified hash function.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
tensor | Tensor | The input tensor to be hashed. | required |
hash_func | Optional[Callable] | An optional hash function to use. If None, defaults to SHA-256. | None |
Returns:
Type | Description |
---|---|
str | The resulting hash string. |
If no hash function is provided, SHA-256 is used by default. The function first converts the tensor to a contiguous array on the CPU and then to bytes before hashing.
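The default path described above (contiguous CPU array, then bytes, then SHA-256) can be sketched step by step as follows. The name `tensor_hash_sketch` is hypothetical, the hex-digest return format is an assumption, and so is the convention that a custom `hash_func` receives the tensor itself and returns the hash string.

```python
import hashlib
from typing import Callable, Optional


def tensor_hash_sketch(
    tensor: "torch.Tensor",
    hash_func: Optional[Callable] = None,
) -> str:
    """Hypothetical sketch of hashing a single PyTorch tensor."""
    if hash_func is not None:
        # Assumption: a custom hasher takes the tensor and returns a string.
        return hash_func(tensor)
    # 1. Detach from autograd and move to host memory.
    cpu_tensor = tensor.detach().cpu()
    # 2. Force a contiguous layout so views and transposes hash by value.
    array = cpu_tensor.contiguous().numpy()
    # 3. Serialize the raw buffer and hash it.
    return hashlib.sha256(array.tobytes()).hexdigest()
```

Because only the raw bytes are hashed in this sketch, two tensors with the same buffer contents but different shapes would produce the same digest; a stricter variant could also feed the shape and dtype into the hash.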
Source code in bionemo/llm/utils/datamodule_utils.py