aistore.pytorch.multishard_dataset


Multishard Stream Dataset for AIS.

Copyright (c) 2024-2025, NVIDIA CORPORATION. All rights reserved.

Module Contents

Classes

Name | Description
AISMultiShardStream | An iterable-style dataset that iterates over multiple shard streams and yields combined samples.

API

class aistore.pytorch.multishard_dataset.AISMultiShardStream(
data_sources: typing.List[aistore.sdk.DataShard]
)

Bases: IterableDataset

An iterable-style dataset that iterates over multiple shard streams and yields combined samples.

Parameters:

data_sources
List[DataShard]

List of DataShard objects

Returns:

Iterable over the combined samples, where each sample is a tuple containing the bytes of one object from each shard stream
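
The shape of a combined sample can be illustrated with a minimal, library-free sketch. The helper `combine_shard_streams` and the stand-in streams below are hypothetical and are not part of the aistore API; the sketch also assumes the streams advance in lockstep and stop at the shortest one (the behavior of Python's `zip`), which this page does not confirm.

```python
from typing import Iterable, Iterator, List, Tuple


def combine_shard_streams(
    streams: List[Iterable[bytes]],
) -> Iterator[Tuple[bytes, ...]]:
    """Yield one tuple per step, holding one object's bytes from each stream.

    Hypothetical helper: zip() stops at the shortest stream, which is an
    assumption here, not documented AISMultiShardStream behavior.
    """
    return zip(*streams)


# Stand-in streams; in practice each would be backed by one DataShard.
images = iter([b"img-0", b"img-1", b"img-2"])
labels = iter([b"cat", b"dog", b"bird"])

# Each sample pairs the i-th object of every stream.
samples = list(combine_shard_streams([images, labels]))
```

With two streams, every sample is a 2-tuple such as `(b"img-0", b"cat")`; with N data sources, each sample would hold N byte strings.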

aistore.pytorch.multishard_dataset.AISMultiShardStream.__iter__() -> typing.Iterator

aistore.pytorch.multishard_dataset.AISMultiShardStream._get_shard_objects_iterator(
bucket: aistore.sdk.Bucket,
prefix: str = '',
etl_name: str = ''
) -> typing.Iterable[bytes]

Create an iterable over all the objects in the given shards.

Parameters:

bucket
Bucket

Bucket containing the shards

prefix
str, defaults to ''

Prefix of the object names

etl_name
str, defaults to ''

ETL name to apply on each object

Returns: Iterable[bytes]

Iterable over all the objects in the given shards, with each iteration returning the bytes of one object
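
The idea of walking every shard under a prefix and yielding each contained object as bytes can be sketched without a running cluster. Everything below is an assumption made for illustration: a plain dict stands in for the bucket, shards are taken to be tar archives, `make_shard` and `iter_shard_objects` are hypothetical names, and the `etl_name` transform is not modeled. It is not the library's implementation.

```python
import io
import tarfile
from typing import Dict, Iterator


def make_shard(files: Dict[str, bytes]) -> bytes:
    """Build an in-memory tar archive to act as a stand-in shard."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def iter_shard_objects(bucket: Dict[str, bytes], prefix: str = "") -> Iterator[bytes]:
    """Yield the bytes of every file in every shard whose name starts with prefix."""
    for shard_name in sorted(bucket):
        if not shard_name.startswith(prefix):
            continue  # skip shards outside the requested prefix
        with tarfile.open(fileobj=io.BytesIO(bucket[shard_name]), mode="r") as tar:
            for member in tar:
                if member.isfile():
                    yield tar.extractfile(member).read()


# Hypothetical bucket contents: shard name -> archive bytes.
fake_bucket = {
    "train/shard-0.tar": make_shard({"a.txt": b"alpha", "b.txt": b"beta"}),
    "train/shard-1.tar": make_shard({"c.txt": b"gamma"}),
    "val/shard-0.tar": make_shard({"d.txt": b"delta"}),
}

train_objects = list(iter_shard_objects(fake_bucket, prefix="train/"))
```

Because the function is a generator, objects are read one at a time rather than loaded up front, which matches the streaming intent of an iterable-style dataset.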