core.datasets.object_storage_utils#

Module Contents#

Classes#

ObjectStorageConfig

Config when the data (.bin) file and the index (.idx) file are in object storage

S3Client

The protocol which all S3 clients should abide by

Functions#

_remove_s3_prefix

Remove the S3 prefix from a path

_is_s3_path

Ascertain whether a path is in S3

_remove_msc_prefix

Remove the MSC prefix from a path

_is_msc_path

Checks whether a path is an MSC path (msc://profile/path/to/file)

_s3_download_file

Download the object at the given S3 path to the given local file system path

_s3_object_exists

Ascertain whether the object at the given S3 path exists in S3

is_object_storage_path

Ascertain whether a path is in object storage

get_index_cache_path

Get the index cache path for the given path

parse_s3_path

Parses the given S3 path, returning the corresponding bucket and key.

get_object_storage_access

Get the object storage access type for the given path

dataset_exists

Check if the dataset exists on object storage

cache_index_file

Download a file from object storage to a local path with distributed training support. The download only happens on Rank 0, and other ranks will wait for the file to be available.

Data#

S3_PREFIX

MSC_PREFIX

S3Config

API#

core.datasets.object_storage_utils.S3_PREFIX#

‘s3://’

core.datasets.object_storage_utils.MSC_PREFIX#

‘msc://’

class core.datasets.object_storage_utils.ObjectStorageConfig#

Config when the data (.bin) file and the index (.idx) file are in object storage

.. attribute:: path_to_idx_cache

The local directory where we will store the index (.idx) file

Type:

str

.. attribute:: bin_chunk_nbytes

If the number of bytes is too small, then we send a request to S3 at each call of the read method in _S3BinReader, which is slow, because each request has a fixed cost independent of the size of the byte range requested. If the number of bytes is too large, then we only rarely have to send requests to S3, but it takes a lot of time to complete the request when we do, which can block training. We've found that 256 * 1024 * 1024 (i.e., 256 MiB) is a good value (although we have not put much effort into tuning it), so we default to it.

Type:

int

path_to_idx_cache: str#

None

bin_chunk_nbytes: int#

None
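
A minimal construction sketch, assuming the dataclass is instantiated directly with these two fields; the cache directory is a hypothetical placeholder and the chunk size is the 256 MiB value recommended above.

```python
from core.datasets.object_storage_utils import ObjectStorageConfig

# Illustrative values: a hypothetical local cache directory for the .idx
# file, and the 256 MiB chunk size recommended in the attribute description.
config = ObjectStorageConfig(
    path_to_idx_cache="/tmp/idx_cache",
    bin_chunk_nbytes=256 * 1024 * 1024,
)
```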

core.datasets.object_storage_utils.S3Config#

None

class core.datasets.object_storage_utils.S3Client#

Bases: typing.Protocol

The protocol which all S3 clients should abide by

download_file(Bucket: str, Key: str, Filename: str) None#

Download the file from S3 to the local file system

upload_file(Filename: str, Bucket: str, Key: str) None#

Upload the file to S3

head_object(Bucket: str, Key: str) Dict[str, Any]#

Get the metadata of the file in S3

get_object(
Bucket: str,
Key: str,
Range: str,
) Dict[str, Any]#

Get the file from S3

close() None#

Close the S3 client
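
Because S3Client is a typing.Protocol, any object exposing these methods satisfies it structurally; no inheritance is required. The sketch below is a hypothetical in-memory stub (not part of this module) that type-checks against the protocol, e.g. for tests. A boto3 S3 client exposes the same method names and therefore also satisfies the protocol.

```python
from typing import Any, Dict


class FakeS3Client:
    """Hypothetical stub that structurally satisfies S3Client."""

    def __init__(self) -> None:
        self._objects: Dict[str, bytes] = {}

    def download_file(self, Bucket: str, Key: str, Filename: str) -> None:
        with open(Filename, "wb") as f:
            f.write(self._objects[f"{Bucket}/{Key}"])

    def upload_file(self, Filename: str, Bucket: str, Key: str) -> None:
        with open(Filename, "rb") as f:
            self._objects[f"{Bucket}/{Key}"] = f.read()

    def head_object(self, Bucket: str, Key: str) -> Dict[str, Any]:
        return {"ContentLength": len(self._objects[f"{Bucket}/{Key}"])}

    def get_object(self, Bucket: str, Key: str, Range: str) -> Dict[str, Any]:
        # Range uses the usual "bytes=start-end" form; a real S3 response
        # wraps Body in a streaming object, but raw bytes suffice for a stub.
        start, end = (int(x) for x in Range.removeprefix("bytes=").split("-"))
        return {"Body": self._objects[f"{Bucket}/{Key}"][start : end + 1]}

    def close(self) -> None:
        self._objects.clear()
```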

core.datasets.object_storage_utils._remove_s3_prefix(path: str) str#

Remove the S3 prefix from a path

Parameters:

path (str) – The path

Returns:

The path without the S3 prefix

Return type:

str

core.datasets.object_storage_utils._is_s3_path(path: str) bool#

Ascertain whether a path is in S3

Parameters:

path (str) – The path

Returns:

True if the path is in S3, False otherwise

Return type:

bool

core.datasets.object_storage_utils._remove_msc_prefix(path: str) str#

Remove the MSC prefix from a path

Parameters:

path (str) – The path

Returns:

The path without the MSC prefix

Return type:

str

core.datasets.object_storage_utils._is_msc_path(path: str) bool#

Checks whether a path is an MSC path (msc://profile/path/to/file)

Parameters:

path (str) – The path

Returns:

True if the path is an MSC path, False otherwise

Return type:

bool
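
The four helpers above only deal with the s3:// and msc:// prefixes. Below is a self-contained sketch of the behavior they describe, using a hypothetical local helper rather than the module's private functions.

```python
S3_PREFIX = "s3://"
MSC_PREFIX = "msc://"


def _strip_prefix(path: str, prefix: str) -> str:
    # Illustrative only: remove the prefix when present, otherwise return
    # the path unchanged.
    return path[len(prefix):] if path.startswith(prefix) else path


print(_strip_prefix("s3://my-bucket/data/train.idx", S3_PREFIX))
# -> my-bucket/data/train.idx
print(_strip_prefix("msc://profile/path/to/file.idx", MSC_PREFIX))
# -> profile/path/to/file.idx
```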

core.datasets.object_storage_utils._s3_download_file(
client: core.datasets.object_storage_utils.S3Client,
s3_path: str,
local_path: str,
) None#

Download the object at the given S3 path to the given local file system path

Parameters:
  • client (S3Client) – The S3 client

  • s3_path (str) – The S3 source path

  • local_path (str) – The local destination path

core.datasets.object_storage_utils._s3_object_exists(
client: core.datasets.object_storage_utils.S3Client,
path: str,
) bool#

Ascertain whether the object at the given S3 path exists in S3

Parameters:
  • client (S3Client) – The S3 client

  • path (str) – The S3 path

Raises:

botocore.exceptions.ClientError – The error code is 404

Returns:

True if the object exists in S3, False otherwise

Return type:

bool
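
A usage sketch for the two private S3 helpers above, assuming boto3 is installed and that its S3 client satisfies the S3Client protocol; the bucket, key, and local path are illustrative.

```python
import boto3

from core.datasets.object_storage_utils import _s3_download_file, _s3_object_exists

client = boto3.client("s3")  # structurally satisfies the S3Client protocol

s3_idx = "s3://my-bucket/datasets/train_text_document.idx"  # illustrative path
if _s3_object_exists(client, s3_idx):
    _s3_download_file(client, s3_idx, "/tmp/train_text_document.idx")

client.close()
```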

core.datasets.object_storage_utils.is_object_storage_path(path: str) bool#

Ascertain whether a path is in object storage

Parameters:

path (str) – The path

Returns:

True if the path is in object storage (s3:// or msc://), False otherwise

Return type:

bool
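
A quick illustration of the documented behavior, with made-up paths:

```python
from core.datasets.object_storage_utils import is_object_storage_path

is_object_storage_path("s3://my-bucket/data/train.bin")   # True
is_object_storage_path("msc://profile/data/train.bin")    # True
is_object_storage_path("/local/fs/data/train.bin")        # False
```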

core.datasets.object_storage_utils.get_index_cache_path(
idx_path: str,
object_storage_config: core.datasets.object_storage_utils.ObjectStorageConfig,
) str#

Get the index cache path for the given path

Parameters:
  • idx_path (str) – The path to the index file

  • object_storage_config (ObjectStorageConfig) – The object storage config

Returns:

The index cache path

Return type:

str
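
A hedged usage sketch: the exact layout under path_to_idx_cache is determined by the implementation, so only the call shape is shown; all paths are illustrative.

```python
from core.datasets.object_storage_utils import ObjectStorageConfig, get_index_cache_path

config = ObjectStorageConfig(
    path_to_idx_cache="/tmp/idx_cache",  # hypothetical cache directory
    bin_chunk_nbytes=256 * 1024 * 1024,
)

local_idx_path = get_index_cache_path(
    "s3://my-bucket/datasets/train_text_document.idx", config
)
# local_idx_path points somewhere under /tmp/idx_cache; the file itself is
# only present after it has been downloaded (see cache_index_file below).
```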

core.datasets.object_storage_utils.parse_s3_path(path: str) Tuple[str, str]#

Parses the given S3 path, returning the corresponding bucket and key.

Parameters:

path (str) – The S3 path

Returns:

A (bucket, key) tuple

Return type:

Tuple[str, str]
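
A usage sketch assuming the documented (bucket, key) contract; the path is illustrative.

```python
from core.datasets.object_storage_utils import parse_s3_path

bucket, key = parse_s3_path("s3://my-bucket/datasets/train_text_document.idx")
# bucket == "my-bucket"
# key == "datasets/train_text_document.idx"
```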

core.datasets.object_storage_utils.get_object_storage_access(path: str) str#

Get the object storage access type for the given path

Parameters:

path (str) – The path

Returns:

The object storage access type

Return type:

str

core.datasets.object_storage_utils.dataset_exists(path_prefix: str, idx_path: str, bin_path: str) bool#

Check if the dataset exists on object storage

Parameters:
  • path_prefix (str) – The prefix to the index (.idx) and data (.bin) files

  • idx_path (str) – The path to the index file

  • bin_path (str) – The path to the data file

Returns:

True if the dataset exists on object storage, False otherwise

Return type:

bool
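
A call sketch with illustrative paths; the prefix and the .idx/.bin paths refer to the same dataset.

```python
from core.datasets.object_storage_utils import dataset_exists

exists = dataset_exists(
    path_prefix="s3://my-bucket/datasets/train_text_document",
    idx_path="s3://my-bucket/datasets/train_text_document.idx",
    bin_path="s3://my-bucket/datasets/train_text_document.bin",
)
```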

core.datasets.object_storage_utils.cache_index_file(remote_path: str, local_path: str) None#

Download a file from object storage to a local path with distributed training support. The download only happens on Rank 0, and other ranks will wait for the file to be available.

Note that this function does not include any barrier synchronization. The caller (typically in blended_megatron_dataset_builder.py) is responsible for ensuring proper synchronization between ranks using torch.distributed.barrier() after this function returns.

Parameters:
  • remote_path (str) – The URL of the file to download (e.g., s3://bucket/path/file.idx or msc://profile/path/file.idx)

  • local_path (str) – The local destination path where the file should be saved

Raises:

ValueError – If the remote_path is not a valid S3 or MSC path
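
A sketch of the documented calling pattern, including the barrier the note above says the caller must issue; the paths are illustrative and torch.distributed is assumed to have been initialized by the training script.

```python
import torch.distributed as dist

from core.datasets.object_storage_utils import cache_index_file

remote_idx = "s3://my-bucket/datasets/train_text_document.idx"  # illustrative
local_idx = "/tmp/idx_cache/train_text_document.idx"            # illustrative

cache_index_file(remote_idx, local_idx)  # only rank 0 performs the download

# The caller is responsible for synchronization after the download,
# as noted above.
if dist.is_initialized():
    dist.barrier()
```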