core.datasets.object_storage_utils#
Module Contents#
Classes#
ObjectStorageConfig | Config when the data (.bin) file and the index (.idx) file are in object storage |
S3Client | The protocol which all S3 clients should abide by |
Functions#
_remove_s3_prefix | Remove the S3 prefix from a path |
_is_s3_path | Ascertain whether a path is in S3 |
_remove_msc_prefix | Remove the MSC prefix from a path |
_is_msc_path | Check whether a path is an MSC path (msc://profile/path/to/file) |
_s3_download_file | Download the object at the given S3 path to the given local file system path |
_s3_object_exists | Ascertain whether the object at the given S3 path exists in S3 |
is_object_storage_path | Ascertain whether a path is in object storage |
get_index_cache_path | Get the index cache path for the given path |
parse_s3_path | Parse the given S3 path, returning the corresponding bucket and key |
get_object_storage_access | Get the object storage access |
dataset_exists | Check if the dataset exists on object storage |
cache_index_file | Download a file from object storage to a local path with distributed training support. The download only happens on Rank 0, and other ranks will wait for the file to be available. |
Data#
API#
- core.datasets.object_storage_utils.S3_PREFIX#
‘s3://’
- core.datasets.object_storage_utils.MSC_PREFIX#
‘msc://’
- class core.datasets.object_storage_utils.ObjectStorageConfig#
Config when the data (.bin) file and the index (.idx) file are in object storage
.. attribute:: path_to_idx_cache
The local directory where we will store the index (.idx) file
- Type:
str
.. attribute:: bin_chunk_nbytes
If the number of bytes is too small, then we send a request to S3 at each call of the read method in _S3BinReader, which is slow, because each request has a fixed cost independent of the size of the byte range requested. If the number of bytes is too large, then we only rarely have to send requests to S3, but it takes a lot of time to complete the request when we do, which can block training. We’ve found that 256 * 1024 * 1024 (i.e., 256 MiB) has worked well (though we have not put that much effort into tuning it), so we default to it.
- Type:
int
- path_to_idx_cache: str#
None
- bin_chunk_nbytes: int#
None
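The trade-off described for bin_chunk_nbytes can be made concrete with a little arithmetic. The sketch below assumes a reader that fetches exactly one chunk per request while scanning a file from start to end; the helper name `num_range_requests` and the 10 GiB file size are hypothetical and are not part of this module.

```python
MiB = 1024 * 1024
GiB = 1024 * MiB


def num_range_requests(file_nbytes: int, bin_chunk_nbytes: int) -> int:
    """Count the ranged requests needed to read a file once, one chunk per request.

    Hypothetical helper for illustration only.
    """
    if bin_chunk_nbytes <= 0:
        raise ValueError("bin_chunk_nbytes must be positive")
    # Ceiling division without floating point.
    return -(-file_nbytes // bin_chunk_nbytes)


file_nbytes = 10 * GiB  # assumed size of a .bin file

# Small chunks: many requests, each paying the fixed per-request cost.
requests_small = num_range_requests(file_nbytes, 1 * MiB)

# The documented default: few requests, each transferring 256 MiB.
requests_default = num_range_requests(file_nbytes, 256 * MiB)

print(requests_small, requests_default)
```

Under these assumptions, the 1 MiB setting issues 256 times as many requests as the default, which is the fixed-cost overhead the attribute description warns about.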
- core.datasets.object_storage_utils.S3Config#
None
- class core.datasets.object_storage_utils.S3Client#
Bases: typing.Protocol
The protocol which all S3 clients should abide by
- download_file(Bucket: str, Key: str, Filename: str) None#
Download the file from S3 to the local file system
- upload_file(Filename: str, Bucket: str, Key: str) None#
Upload the file to S3
- head_object(Bucket: str, Key: str) Dict[str, Any]#
Get the metadata of the file in S3
- get_object(Bucket: str, Key: str, Range: str)#
Get the file from S3
- close() None#
Close the S3 client
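Because S3Client is a typing.Protocol, any object with matching methods should satisfy it structurally, which can be handy for tests. The following is a minimal sketch of an in-memory stand-in. The class name `InMemoryS3Client`, the `ContentLength` and `Body` response keys (modelled on boto3-style responses), and the handling of the Range string are assumptions for illustration, not behaviour defined by this module.

```python
import io
import os
import tempfile
from typing import Any, Dict, Tuple


class InMemoryS3Client:
    """Hypothetical test double exposing the five S3Client methods."""

    def __init__(self) -> None:
        self._objects: Dict[Tuple[str, str], bytes] = {}

    def upload_file(self, Filename: str, Bucket: str, Key: str) -> None:
        with open(Filename, "rb") as f:
            self._objects[(Bucket, Key)] = f.read()

    def download_file(self, Bucket: str, Key: str, Filename: str) -> None:
        with open(Filename, "wb") as f:
            f.write(self._objects[(Bucket, Key)])

    def head_object(self, Bucket: str, Key: str) -> Dict[str, Any]:
        return {"ContentLength": len(self._objects[(Bucket, Key)])}

    def get_object(self, Bucket: str, Key: str, Range: str) -> Dict[str, Any]:
        # Assumes the HTTP form "bytes=start-end", where end is inclusive.
        start, end = (int(x) for x in Range[len("bytes="):].split("-"))
        data = self._objects[(Bucket, Key)][start : end + 1]
        return {"Body": io.BytesIO(data)}

    def close(self) -> None:
        self._objects.clear()


with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "data.bin")
    with open(src, "wb") as f:
        f.write(b"0123456789")

    client = InMemoryS3Client()
    client.upload_file(src, "bucket", "data.bin")
    size = client.head_object("bucket", "data.bin")["ContentLength"]
    chunk = client.get_object("bucket", "data.bin", "bytes=2-5")["Body"].read()
```

Unlike a real client, this stand-in raises KeyError for missing objects, so it is not suitable for exercising error-handling paths.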
- core.datasets.object_storage_utils._remove_s3_prefix(path: str) str#
Remove the S3 prefix from a path
- Parameters:
path (str) – The path
- Returns:
The path without the S3 prefix
- Return type:
str
- core.datasets.object_storage_utils._is_s3_path(path: str) bool#
Ascertain whether a path is in S3
- Parameters:
path (str) – The path
- Returns:
True if the path is in S3, False otherwise
- Return type:
bool
- core.datasets.object_storage_utils._remove_msc_prefix(path: str) str#
Remove the MSC prefix from a path
- Parameters:
path (str) – The path
- Returns:
The path without the MSC prefix
- Return type:
str
- core.datasets.object_storage_utils._is_msc_path(path: str) bool#
Check whether a path is an MSC path (msc://profile/path/to/file)
- Parameters:
path (str) – The path
- Returns:
True if the path is an MSC path, False otherwise
- Return type:
bool
- core.datasets.object_storage_utils._s3_download_file(client: core.datasets.object_storage_utils.S3Client, s3_path: str, local_path: str)#
Download the object at the given S3 path to the given local file system path
- Parameters:
client (S3Client) – The S3 client
s3_path (str) – The S3 source path
local_path (str) – The local destination path
- core.datasets.object_storage_utils._s3_object_exists(client: core.datasets.object_storage_utils.S3Client, path: str) bool#
Ascertain whether the object at the given S3 path exists in S3
- Parameters:
client (S3Client) – The S3 client
path (str) – The S3 path
- Raises:
botocore.exceptions.ClientError – The error code is 404
- Returns:
True if the object exists in S3, False otherwise
- Return type:
bool
- core.datasets.object_storage_utils.is_object_storage_path(path: str) bool#
Ascertain whether a path is in object storage
- Parameters:
path (str) – The path
- Returns:
True if the path is in object storage (s3:// or msc://), False otherwise
- Return type:
bool
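The prefix checks described above amount to simple string tests against S3_PREFIX and MSC_PREFIX. The sketch below shows one plausible way to write them; the function names ending in `_sketch` are hypothetical, and the combined prefix-stripping helper is an illustrative convenience rather than something this module provides.

```python
S3_PREFIX = "s3://"
MSC_PREFIX = "msc://"


def is_object_storage_path_sketch(path: str) -> bool:
    """Return True for paths that start with s3:// or msc://."""
    return path.startswith((S3_PREFIX, MSC_PREFIX))


def remove_object_storage_prefix_sketch(path: str) -> str:
    """Strip a leading s3:// or msc:// if present; otherwise return the path unchanged."""
    for prefix in (S3_PREFIX, MSC_PREFIX):
        if path.startswith(prefix):
            return path[len(prefix):]
    return path


print(is_object_storage_path_sketch("s3://bucket/data/train.bin"))
print(is_object_storage_path_sketch("/local/data/train.bin"))
print(remove_object_storage_prefix_sketch("msc://profile/path/to/file"))
```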
- core.datasets.object_storage_utils.get_index_cache_path(idx_path: str, object_storage_config: core.datasets.object_storage_utils.ObjectStorageConfig) str#
Get the index cache path for the given path
- Parameters:
idx_path (str) – The path to the index file
object_storage_config (ObjectStorageConfig) – The object storage config
- Returns:
The index cache path
- Return type:
str
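One plausible reading of get_index_cache_path is that the remote path, minus its scheme prefix, is mirrored beneath path_to_idx_cache. The sketch below works under that assumption only; `ConfigSketch` and `index_cache_path_sketch` are hypothetical names, and the real function may lay out the cache differently.

```python
import os
from dataclasses import dataclass


@dataclass
class ConfigSketch:
    """Hypothetical stand-in carrying only the cache directory."""

    path_to_idx_cache: str


def index_cache_path_sketch(idx_path: str, config: ConfigSketch) -> str:
    """Map a remote .idx path to a local path under the cache directory (assumed layout)."""
    for prefix in ("s3://", "msc://"):
        if idx_path.startswith(prefix):
            relative = idx_path[len(prefix):]
            return os.path.join(config.path_to_idx_cache, relative)
    raise ValueError(f"Not an object storage path: {idx_path}")


cache_path = index_cache_path_sketch(
    "s3://bucket/data/train.idx", ConfigSketch(path_to_idx_cache="/tmp/idx_cache")
)
print(cache_path)
```

Keeping the bucket and key in the local path means index files from different buckets cannot collide in the cache, at the cost of deeper directory trees.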
- core.datasets.object_storage_utils.parse_s3_path(path: str) Tuple[str, str]#
Parse the given S3 path, returning the corresponding bucket and key.
- Parameters:
path (str) – The S3 path
- Returns:
A (bucket, key) tuple
- Return type:
Tuple[str, str]
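To illustrate the (bucket, key) split, here is a minimal sketch that treats everything up to the first slash after s3:// as the bucket and the remainder as the key. The name `parse_s3_path_sketch` is hypothetical, and raising ValueError on a non-S3 path is an assumption; the real function may validate its input differently.

```python
from typing import Tuple


def parse_s3_path_sketch(path: str) -> Tuple[str, str]:
    """Split an s3:// path into (bucket, key). Illustrative only."""
    prefix = "s3://"
    if not path.startswith(prefix):
        raise ValueError(f"Not an S3 path: {path}")
    # partition splits on the first "/" only, so slashes inside the key survive.
    bucket, _, key = path[len(prefix):].partition("/")
    return bucket, key


bucket, key = parse_s3_path_sketch("s3://my-bucket/datasets/train.bin")
print(bucket, key)
```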
- core.datasets.object_storage_utils.get_object_storage_access(path: str) str#
Get the object storage access
- core.datasets.object_storage_utils.dataset_exists(path_prefix: str, idx_path: str, bin_path: str) bool#
Check if the dataset exists on object storage
- Parameters:
path_prefix (str) – The prefix to the index (.idx) and data (.bin) files
idx_path (str) – The path to the index file
bin_path (str) – The path to the data file
- Returns:
True if the dataset exists on object storage, False otherwise
- Return type:
bool
- core.datasets.object_storage_utils.cache_index_file(remote_path: str, local_path: str) None#
Download a file from object storage to a local path with distributed training support. The download only happens on Rank 0, and other ranks will wait for the file to be available.
Note that this function does not include any barrier synchronization. The caller (typically in blended_megatron_dataset_builder.py) is responsible for ensuring proper synchronization between ranks using torch.distributed.barrier() after this function returns.
- Parameters:
remote_path (str) – The URL of the file to download (e.g., s3://bucket/path/file.idx or msc://profile/path/file.idx)
local_path (str) – The local destination path where the file should be saved
- Raises:
ValueError – If the remote_path is not a valid S3 or MSC path
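The rank-0 download pattern described for cache_index_file can be sketched without any distributed framework by passing the rank in explicitly. Everything here is a stand-in: `cache_file_on_rank_zero`, its `rank` and `download` parameters, and `fake_download` are hypothetical, and a real caller would follow the call with torch.distributed.barrier() as the note above explains.

```python
import os
import tempfile
from typing import Callable


def cache_file_on_rank_zero(
    rank: int,
    remote_path: str,
    local_path: str,
    download: Callable[[str, str], None],
) -> bool:
    """Download on rank 0 only. Returns True if this call performed the download."""
    if not remote_path.startswith(("s3://", "msc://")):
        raise ValueError(f"Not a valid S3 or MSC path: {remote_path}")
    if rank != 0:
        # Other ranks do nothing here; the caller is expected to synchronize
        # afterwards (for example with a barrier) before reading local_path.
        return False
    os.makedirs(os.path.dirname(local_path) or ".", exist_ok=True)
    download(remote_path, local_path)
    return True


def fake_download(remote: str, local: str) -> None:
    """Stand-in for a real object storage download."""
    with open(local, "wb") as f:
        f.write(b"index-bytes")


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "cache", "train.idx")
    rank1_downloaded = cache_file_on_rank_zero(1, "s3://bucket/train.idx", target, fake_download)
    exists_after_rank1 = os.path.exists(target)
    rank0_downloaded = cache_file_on_rank_zero(0, "s3://bucket/train.idx", target, fake_download)
    exists_after_rank0 = os.path.exists(target)
```

The sketch shows why the barrier matters: after the rank-1 call returns, the file does not yet exist, so a non-zero rank that reads it without synchronizing would fail.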