core.datasets.object_storage_utils#

Module Contents#

Classes#

ObjectStorageConfig

Config when the data (.bin) file and the index (.idx) file are in object storage

S3Client

The protocol which all S3 clients should abide by

Functions#

_remove_s3_prefix

Remove the S3 prefix from a path

_is_s3_path

Ascertain whether a path is in S3

_remove_msc_prefix

Remove the MSC prefix from a path

_is_msc_path

Checks whether a path is an MSC path (msc://profile/path/to/file)

_s3_download_file

Download the object at the given S3 path to the given local file system path

_s3_object_exists

Ascertain whether the object at the given S3 path exists in S3

is_object_storage_path

Ascertain whether a path is in object storage

get_index_cache_path

Get the index cache path for the given path

parse_s3_path

Parses the given S3 path, returning the corresponding bucket and key.

get_object_storage_access

Get the object storage access type for the given path

dataset_exists

Check if the dataset exists on object storage

cache_index_file

Download a file from object storage to a local path with distributed training support. The download only happens on Rank 0, and other ranks will wait for the file to be available.

Data#

S3_PREFIX

MSC_PREFIX

S3Config

API#

core.datasets.object_storage_utils.S3_PREFIX#

‘s3://’

core.datasets.object_storage_utils.MSC_PREFIX#

‘msc://’

class core.datasets.object_storage_utils.ObjectStorageConfig#

Config when the data (.bin) file and the index (.idx) file are in object storage

.. attribute:: path_to_idx_cache

The local directory where we will store the index (.idx) file

Type:

str

.. attribute:: bin_chunk_nbytes

If the number of bytes is too small, then we send a request to S3 at each call of the read method in _S3BinReader, which is slow, because each request has a fixed cost independent of the size of the byte range requested. If the number of bytes is too large, then we only rarely have to send requests to S3, but it takes a lot of time to complete the request when we do, which can block training. We've found that 256 * 1024 * 1024 (i.e., 256 MiB) is a good value (although we have not put much effort into tuning it), so we default to it.

Type:

int

path_to_idx_cache: str#

None

bin_chunk_nbytes: int#

None
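
A minimal construction sketch, assuming the dataclass is instantiated directly with these two fields; the cache directory is a hypothetical placeholder and the chunk size is the 256 MiB value recommended above.

```python
from core.datasets.object_storage_utils import ObjectStorageConfig

# Illustrative values: a hypothetical local cache directory for the .idx
# file, and the 256 MiB chunk size recommended in the attribute description.
config = ObjectStorageConfig(
    path_to_idx_cache="/tmp/idx_cache",
    bin_chunk_nbytes=256 * 1024 * 1024,
)
```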

core.datasets.object_storage_utils.S3Config#

None

class core.datasets.object_storage_utils.S3Client#

Bases: typing.Protocol

The protocol which all S3 clients should abide by

download_file(Bucket: str, Key: str, Filename: str) None#

Download the file from S3 to the local file system

upload_file(Filename: str, Bucket: str, Key: str) None#

Upload the file to S3

head_object(Bucket: str, Key: str) Dict[str, Any]#

Get the metadata of the file in S3

get_object(
Bucket: str,
Key: str,
Range: str,
) Dict[str, Any]#

Get the file from S3

close() None#

Close the S3 client
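
Because S3Client is a typing.Protocol, any object exposing these methods satisfies it structurally; no inheritance is required. The sketch below is a hypothetical in-memory stub (not part of this module) that type-checks against the protocol, e.g. for tests. A boto3 S3 client exposes the same method names and therefore also satisfies the protocol.

```python
from typing import Any, Dict


class FakeS3Client:
    """Hypothetical stub that structurally satisfies S3Client."""

    def __init__(self) -> None:
        self._objects: Dict[str, bytes] = {}

    def download_file(self, Bucket: str, Key: str, Filename: str) -> None:
        with open(Filename, "wb") as f:
            f.write(self._objects[f"{Bucket}/{Key}"])

    def upload_file(self, Filename: str, Bucket: str, Key: str) -> None:
        with open(Filename, "rb") as f:
            self._objects[f"{Bucket}/{Key}"] = f.read()

    def head_object(self, Bucket: str, Key: str) -> Dict[str, Any]:
        return {"ContentLength": len(self._objects[f"{Bucket}/{Key}"])}

    def get_object(self, Bucket: str, Key: str, Range: str) -> Dict[str, Any]:
        # Range uses the usual "bytes=start-end" form; a real S3 response
        # wraps Body in a streaming object, but raw bytes suffice for a stub.
        start, end = (int(x) for x in Range.removeprefix("bytes=").split("-"))
        return {"Body": self._objects[f"{Bucket}/{Key}"][start : end + 1]}

    def close(self) -> None:
        self._objects.clear()
```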

core.datasets.object_storage_utils._remove_s3_prefix(path: str) str#

Remove the S3 prefix from a path

Parameters:

path (str) – The path

Returns:

The path without the S3 prefix

Return type:

str

core.datasets.object_storage_utils._is_s3_path(path: str) bool#

Ascertain whether a path is in S3

Parameters:

path (str) – The path

Returns:

True if the path is in S3, False otherwise

Return type:

bool

core.datasets.object_storage_utils._remove_msc_prefix(path: str) str#

Remove the MSC prefix from a path

Parameters:

path (str) – The path

Returns:

The path without the MSC prefix

Return type:

str

core.datasets.object_storage_utils._is_msc_path(path: str) bool#

Checks whether a path is an MSC path (msc://profile/path/to/file)

Parameters:

path (str) – The path

Returns:

True if the path is an MSC path, False otherwise

Return type:

bool
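
The four helpers above only deal with the s3:// and msc:// prefixes. Below is a self-contained sketch of the behavior they describe, using a hypothetical local helper rather than the module's private functions.

```python
S3_PREFIX = "s3://"
MSC_PREFIX = "msc://"


def _strip_prefix(path: str, prefix: str) -> str:
    # Illustrative only: remove the prefix when present, otherwise return
    # the path unchanged.
    return path[len(prefix):] if path.startswith(prefix) else path


print(_strip_prefix("s3://my-bucket/data/train.idx", S3_PREFIX))
# -> my-bucket/data/train.idx
print(_strip_prefix("msc://profile/path/to/file.idx", MSC_PREFIX))
# -> profile/path/to/file.idx
```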

core.datasets.object_storage_utils._s3_download_file(
client: core.datasets.object_storage_utils.S3Client,
s3_path: str,
local_path: str,
) None#

Download the object at the given S3 path to the given local file system path

Parameters:
  • client (S3Client) – The S3 client

  • s3_path (str) – The S3 source path

  • local_path (str) – The local destination path

core.datasets.object_storage_utils._s3_object_exists(
client: core.datasets.object_storage_utils.S3Client,
path: str,
) bool#

Ascertain whether the object at the given S3 path exists in S3

Parameters:
  • client (S3Client) – The S3 client

  • path (str) – The S3 path

Raises:

botocore.exceptions.ClientError – The error code is 404

Returns:

True if the object exists in S3, False otherwise

Return type:

bool
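
A usage sketch for the two private S3 helpers above, assuming boto3 is installed and that its S3 client satisfies the S3Client protocol; the bucket, key, and local path are illustrative.

```python
import boto3

from core.datasets.object_storage_utils import _s3_download_file, _s3_object_exists

client = boto3.client("s3")  # structurally satisfies the S3Client protocol

s3_idx = "s3://my-bucket/datasets/train_text_document.idx"  # illustrative path
if _s3_object_exists(client, s3_idx):
    _s3_download_file(client, s3_idx, "/tmp/train_text_document.idx")

client.close()
```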

core.datasets.object_storage_utils.is_object_storage_path(path: str) bool#

Ascertain whether a path is in object storage

Parameters:

path (str) – The path

Returns:

True if the path is in object storage (s3:// or msc://), False otherwise

Return type:

bool
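
A quick illustration of the documented behavior, with made-up paths:

```python
from core.datasets.object_storage_utils import is_object_storage_path

is_object_storage_path("s3://my-bucket/data/train.bin")   # True
is_object_storage_path("msc://profile/data/train.bin")    # True
is_object_storage_path("/local/fs/data/train.bin")        # False
```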

core.datasets.object_storage_utils.get_index_cache_path(
idx_path: str,
object_storage_config: core.datasets.object_storage_utils.ObjectStorageConfig,
) str#

Get the index cache path for the given path

Parameters:
  • idx_path (str) – The path to the index file

  • object_storage_config (ObjectStorageConfig) – The object storage config

Returns:

The index cache path

Return type:

str
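
A hedged usage sketch: the exact layout under path_to_idx_cache is determined by the implementation, so only the call shape is shown; all paths are illustrative.

```python
from core.datasets.object_storage_utils import ObjectStorageConfig, get_index_cache_path

config = ObjectStorageConfig(
    path_to_idx_cache="/tmp/idx_cache",  # hypothetical cache directory
    bin_chunk_nbytes=256 * 1024 * 1024,
)

local_idx_path = get_index_cache_path(
    "s3://my-bucket/datasets/train_text_document.idx", config
)
# local_idx_path points somewhere under /tmp/idx_cache; the file itself is
# only present after it has been downloaded (see cache_index_file below).
```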

core.datasets.object_storage_utils.parse_s3_path(path: str) Tuple[str, str]#

Parses the given S3 path, returning the corresponding bucket and key.

Parameters:

path (str) – The S3 path

Returns:

A (bucket, key) tuple

Return type:

Tuple[str, str]
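
A usage sketch assuming the documented (bucket, key) contract; the path is illustrative.

```python
from core.datasets.object_storage_utils import parse_s3_path

bucket, key = parse_s3_path("s3://my-bucket/datasets/train_text_document.idx")
# bucket == "my-bucket"
# key == "datasets/train_text_document.idx"
```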

core.datasets.object_storage_utils.get_object_storage_access(path: str) str#

Get the object storage access type for the given path

Parameters:

path (str) – The path

Returns:

The object storage access type

Return type:

str

core.datasets.object_storage_utils.dataset_exists(path_prefix: str, idx_path: str, bin_path: str) bool#

Check if the dataset exists on object storage

Parameters:
  • path_prefix (str) – The prefix to the index (.idx) and data (.bin) files

  • idx_path (str) – The path to the index file

  • bin_path (str) – The path to the data file

Returns:

True if the dataset exists on object storage, False otherwise

Return type:

bool
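
A call sketch with illustrative paths; the prefix and the .idx/.bin paths refer to the same dataset.

```python
from core.datasets.object_storage_utils import dataset_exists

exists = dataset_exists(
    path_prefix="s3://my-bucket/datasets/train_text_document",
    idx_path="s3://my-bucket/datasets/train_text_document.idx",
    bin_path="s3://my-bucket/datasets/train_text_document.bin",
)
```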

core.datasets.object_storage_utils.cache_index_file(remote_path: str, local_path: str) None#

Download a file from object storage to a local path with distributed training support. The download only happens on Rank 0, and other ranks will wait for the file to be available.

Note that this function does not include any barrier synchronization. The caller (typically in blended_megatron_dataset_builder.py) is responsible for ensuring proper synchronization between ranks using torch.distributed.barrier() after this function returns.

Parameters:
  • remote_path (str) – The URL of the file to download (e.g., s3://bucket/path/file.idx or msc://profile/path/file.idx)

  • local_path (str) – The local destination path where the file should be saved

Raises:

ValueError – If the remote_path is not a valid S3 or MSC path
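
A sketch of the documented calling pattern, including the barrier the note above says the caller must issue; the paths are illustrative and torch.distributed is assumed to have been initialized by the training script.

```python
import torch.distributed as dist

from core.datasets.object_storage_utils import cache_index_file

remote_idx = "s3://my-bucket/datasets/train_text_document.idx"  # illustrative
local_idx = "/tmp/idx_cache/train_text_document.idx"            # illustrative

cache_index_file(remote_idx, local_idx)  # only rank 0 performs the download

# The caller is responsible for synchronization after the download,
# as noted above.
if dist.is_initialized():
    dist.barrier()
```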