aistore.sdk.dataset.dataset_config


Module Contents

Classes

Name             Description
DatasetConfig    Represents the configuration for managing datasets, particularly focusing on how data attributes are structured

API

class aistore.sdk.dataset.dataset_config.DatasetConfig(
primary_attribute: aistore.sdk.dataset.config_attribute.ConfigAttribute,
secondary_attributes: typing.List[aistore.sdk.dataset.config_attribute.ConfigAttribute] = None
)

Represents the configuration for managing datasets, particularly focusing on how data attributes are structured

Parameters:

primary_attribute
ConfigAttribute

The primary attribute of the dataset. The filename of each sample defined by primary_attribute determines the primary key used to look up the corresponding secondary_attributes

secondary_attributes
List[ConfigAttribute], defaults to None

A list of configurations for each attribute or feature in the dataset
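
The key lookup described above can be sketched in plain Python. This sketch does not use the SDK: `primary_files`, `labels`, and `key_for` are hypothetical stand-ins, and deriving the key by dropping the file extension is an assumption about how filenames map to keys, not confirmed SDK behavior.

```python
from pathlib import PurePosixPath


def key_for(filename: str) -> str:
    """Derive a lookup key from a sample filename (assumed: name without its extension)."""
    return PurePosixPath(filename).stem


# Hypothetical primary attribute: sample filenames mapped to raw content.
primary_files = {"cat_001.jpg": b"<jpeg bytes>", "dog_002.jpg": b"<jpeg bytes>"}

# Hypothetical secondary attribute: labels indexed by the same key.
labels = {"cat_001": "cat", "dog_002": "dog"}

# Each sample is keyed by the primary file's name; secondary values are looked up by that key.
samples = [
    {"key": key_for(name), "image": content, "label": labels.get(key_for(name))}
    for name, content in primary_files.items()
]
```

The point of the design is that only the primary attribute needs to enumerate samples; every secondary attribute is resolved from the key that the primary filename yields.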

aistore.sdk.dataset.dataset_config.DatasetConfig._get_format_string(
val
) -> str
staticmethod

Get a key string for an item in webdataset format

aistore.sdk.dataset.dataset_config.DatasetConfig.generate_dataset(
max_shard_items: int
) -> typing.Generator[typing.Tuple[typing.Dict[str, typing.Any], typing.List[str]], None, None]

Generate a dataset in webdataset format

Parameters:

max_shard_items
int

The maximum number of items to include in a shard
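
A minimal sketch of consuming a generator with this return type follows. `fake_generate_dataset` is a hypothetical stand-in, not the SDK method. Reading the second tuple element as the names of missing attributes is an assumption, and the stand-in accepts `max_shard_items` only to mirror the signature; it does no sharding. The `__key__` field follows the WebDataset sample convention.

```python
from typing import Any, Dict, Generator, List, Tuple

Sample = Tuple[Dict[str, Any], List[str]]


def fake_generate_dataset(max_shard_items: int) -> Generator[Sample, None, None]:
    """Hypothetical stand-in yielding (sample, missing_attribute_names) pairs."""
    # max_shard_items is unused here; it only mirrors the documented signature.
    yield {"__key__": "cat_001", "jpg": b"...", "cls": "cat"}, []
    yield {"__key__": "dog_002", "jpg": b"..."}, ["cls"]


complete: List[str] = []
incomplete: List[str] = []
for sample, missing in fake_generate_dataset(max_shard_items=10):
    # Sort sample keys by whether any attribute was reported missing.
    (incomplete if missing else complete).append(sample["__key__"])
```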

aistore.sdk.dataset.dataset_config.DatasetConfig.write_shards(
skip_missing: bool,
**kwargs
)

Write the dataset to a bucket in webdataset format and log the missing attributes

Parameters:

skip_missing
bool

If True, skip samples that are missing one or more attributes. Documented as defaulting to True, although the signature above declares no default value

**kwargs
Defaults to {}

Additional arguments to pass to the webdataset writer
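
The effect of `skip_missing` and the forwarding of extra keyword arguments can be sketched as below. `write_with_skip` is a hypothetical stand-in that records what would be written instead of writing to a bucket. The `pattern` and `maxcount` names are borrowed from webdataset's `ShardWriter`; whether `write_shards` expects exactly these is an assumption.

```python
from typing import Any, Dict, Iterable, List, Tuple

Sample = Tuple[Dict[str, Any], List[str]]


def write_with_skip(
    pairs: Iterable[Sample], skip_missing: bool, **kwargs: Any
) -> Tuple[List[str], List[str], Dict[str, Any]]:
    """Hypothetical stand-in: returns (written keys, log lines, writer kwargs)."""
    written: List[str] = []
    log: List[str] = []
    for sample, missing in pairs:
        if missing:
            # Missing attributes are logged whether or not the sample is skipped.
            log.append(f"{sample['__key__']}: missing {', '.join(missing)}")
            if skip_missing:
                continue
        written.append(sample["__key__"])
    # kwargs would be handed to the underlying writer unchanged.
    return written, log, kwargs


pairs = [
    ({"__key__": "cat_001", "jpg": b"...", "cls": "cat"}, []),
    ({"__key__": "dog_002", "jpg": b"..."}, ["cls"]),
]
written, log, writer_kwargs = write_with_skip(
    pairs, skip_missing=True, pattern="shard-%06d.tar", maxcount=256
)
```

With `skip_missing=False`, the same stand-in would keep `dog_002` in `written` while still logging its missing `cls` attribute.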