ai4med.components.data package

class BaseImagePipeline(task, data_list_file_path, data_file_base_dir, data_list_key, crop_size, data_format='channels_first', num_data_dims=3, num_channels=1, num_label_channels=1, label_format=None, shuffle=True, duplicate_count=1, batch_transforms=None, items_per_category=None, category_weights=None)

Bases: ai4med.components.data.data_pipeline.DataPipeline

Base class for ImagePipelines.

Parameters

batch_transforms (list) – List of transforms to be applied to batched data.

get_batched_data(session)
get_next_batch(session)

Get the next batch of data.

Parameters

session – the TF session

Returns: batched data

process_data_list()
class ClassificationImagePipeline(data_list_file_path, data_file_base_dir, data_list_key, output_crop_size, transforms, output_data_format='channels_first', output_data_dims=3, output_image_channels=1, output_image_dtype='float32', output_label_format=None, output_batch_size=10, batched_by_transforms=False, num_workers=4, prefetch_size=20, shuffle=True, repeat=True, duplicate_count=1, extra_inputs=None, batch_transforms=None, items_per_category=None, category_weights=None)

Bases: ai4med.components.data.image_pipeline.ImagePipeline

An ImagePipeline for classification tasks.

Note that data_list_file_path must point to a json file that is similar to what you get from http://medicaldecathlon.com/.
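
A hedged sketch of the expected datalist layout, modeled on the Medical Segmentation Decathlon dataset.json convention (the exact keys required by ai4med and the file paths below are assumptions for illustration):

    import json

    datalist = {
        "training": [
            {"image": "imagesTr/case_001.nii.gz", "label": "labelsTr/case_001.nii.gz"},
            {"image": "imagesTr/case_002.nii.gz", "label": "labelsTr/case_002.nii.gz"}
        ],
        "validation": [
            {"image": "imagesTr/case_003.nii.gz", "label": "labelsTr/case_003.nii.gz"}
        ]
    }
    with open("dataset_0.json", "w") as f:
        json.dump(datalist, f, indent=2)

The data_list_key argument selects one of the top-level lists (for example "training"), and each relative path is resolved against data_file_base_dir.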

Parameters
  • data_list_file_path (string) – The path to the json file

  • data_file_base_dir (string) – The base directory of the dataset

  • data_list_key (string) – The key to get the list of dictionaries to be used

  • output_crop_size (tuple, list) – Crop size of the output data

  • transforms – A list of transforms to be applied to the data

  • output_data_format – Format of the output data. Must be a valid format from DataFormat. See ai4med.common.data_format

  • output_data_dims (int) – Number of dimensions of output images

  • output_image_channels (int) – Number of channels of output images

  • output_image_dtype (string) – Data type of output images

  • output_label_format – Format of output labels, refer to ai4med.common.label_format

  • output_batch_size (int) – Batch size of output

  • batched_by_transforms (bool) – Batching can be done either by transforms or by the TF dataset. This arg specifies how the batching is done.

  • num_workers (int) – Number of worker threads for data transformation

  • prefetch_size (int) – Number of data subjects to prefetch

  • shuffle (bool) – To shuffle the data or not

  • duplicate_count (int) – Number of times to duplicate the datalist.

  • extra_inputs – Extra placeholders for data inputs

  • batch_transforms (list) – List of transforms to be applied to batched data.
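
A minimal construction sketch. Assumptions: the import path follows this package, the file paths are hypothetical, and real transform objects from the ai4med transform modules would replace the empty list:

    from ai4med.components.data import ClassificationImagePipeline  # import path assumed

    pipeline = ClassificationImagePipeline(
        data_list_file_path="/data/dataset_0.json",  # hypothetical path
        data_file_base_dir="/data",                  # hypothetical path
        data_list_key="training",
        output_crop_size=(96, 96, 96),
        transforms=[],                               # real transforms go here
        output_batch_size=8,
        num_workers=4,
        shuffle=True,
    )

All other arguments keep the defaults shown in the signature above.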

class ClassificationImagePipelineWithCache(data_list_file_path, data_file_base_dir, data_list_key, output_crop_size, transforms, output_data_format='channels_first', output_data_dims=3, output_image_channels=1, output_image_dtype='float32', output_label_format=None, output_batch_size=10, batched_by_transforms=False, num_workers=4, prefetch_size=20, shuffle=True, repeat=True, duplicate_count=1, extra_inputs=None, num_cache_objects=10000, replace_percent=0.1, caches_data=True, batch_transforms=None, items_per_category=None, category_weights=None)

Bases: ai4med.components.data.image_pipeline_with_cache.ImagePipelineWithCache

An implementation of DataPipeline that uses SmartCache to efficiently generate data for training/testing of classification tasks.

Note that data_list_file_path must point to a json file that is similar to what you get from http://medicaldecathlon.com/.

Parameters
  • data_list_file_path (string) – The path to the json file

  • data_file_base_dir (string) – The base directory of the dataset

  • data_list_key (string) – The key to get the list of dictionaries to be used

  • output_crop_size (tuple, list) – Crop size of the output data

  • transforms – A list of transforms to be applied to the data

  • output_data_format – Format of the output data. Must be a valid format from DataFormat. See ai4med.common.data_format

  • output_data_dims (int) – Number of dimensions of output images

  • output_image_channels (int) – Number of channels of output images

  • output_image_dtype (string) – Data type of output images

  • output_label_format – Format of output labels, refer to ai4med.common.label_format

  • output_batch_size (int) – Batch size of output

  • batched_by_transforms (bool) – Batching can be done either by transforms or by the TF dataset. This parameter specifies how the batching is done.

  • num_workers (int) – Number of worker threads for data transformation

  • prefetch_size (int) – Number of data subjects to prefetch

  • shuffle (bool) – To shuffle the data or not

  • duplicate_count (int) – Number of times to duplicate the datalist.

  • extra_inputs – Extra placeholders for data inputs

  • num_cache_objects (int) – Number of objects to be cached

  • replace_percent (float) – The percent of cached data to be replaced in every epoch

  • caches_data (bool) – Whether to cache data in memory

  • batch_transforms (list) – List of transforms to be applied to batched data.

Note

SmartCache has a content rotation feature that is based on the config parameters, and it determines the content in the cache by dynamically rotating through the whole dataset. Only the data in the cache is used for training.

If you set ‘caches_data’ to False, then you are only using this content rotation feature without incurring any memory consumption.
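
A small arithmetic sketch of the rotation rate, using the default values from the signature above:

    num_cache_objects = 10000
    replace_percent = 0.1

    # Per the replace_percent description, this many cached items are
    # swapped for fresh dataset items at every epoch:
    items_replaced_per_epoch = int(num_cache_objects * replace_percent)  # 1000

With these defaults, the cache content therefore rotates through the whole dataset at a rate of 1,000 subjects per epoch.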

class ClassificationKerasImagePipeline(data_list_file_path, data_file_base_dir, data_list_key, output_crop_size, transforms, output_data_format='channels_first', output_data_dims=3, output_image_channels=1, output_image_dtype='float32', output_label_format=None, output_batch_size=10, batched_by_transforms=False, num_workers=4, prefetch_size=20, shuffle=True, repeat=True, duplicate_count=1, extra_inputs=None, batch_transforms=None, items_per_category=None, category_weights=None, multiprocessing=False, sampling=None)

Bases: ai4med.components.data.keras_image_pipeline.KerasImagePipeline

An ImagePipeline for classification tasks using the Keras backend.

Note that data_list_file_path must point to a json file that is similar to what you get from http://medicaldecathlon.com/.

Parameters
  • data_list_file_path (string) – The path to the json file

  • data_file_base_dir (string) – The base directory of the dataset

  • data_list_key (string) – The key to get the list of dictionaries to be used

  • output_crop_size (tuple, list) – Crop size of the output data

  • transforms – A list of transforms to be applied to the data

  • output_data_format – Format of the output data. Must be a valid format from DataFormat. See ai4med.common.data_format

  • output_data_dims (int) – Number of dimensions of output images

  • output_image_channels (int) – Number of channels of output images

  • output_image_dtype (string) – Data type of output images

  • output_label_format – Format of output labels, refer to ai4med.common.label_format

  • output_batch_size (int) – Batch size of output

  • batched_by_transforms (bool) – Batching can be done either by transforms or by the TF dataset. This arg specifies how the batching is done.

  • num_workers (int) – Number of worker threads for data transformation

  • prefetch_size (int) – Number of data subjects to prefetch

  • shuffle (bool) – To shuffle the data or not

  • duplicate_count (int) – Number of times to duplicate the datalist.

  • extra_inputs – Extra placeholders for data inputs

  • batch_transforms (list) – List of transforms to be applied to batched data.

  • multiprocessing (bool) – Whether to use the multiprocessing library or Python’s native threading. Default: False.

  • sampling (str) – Whether to use weighted sampling for the data. The default sampling=None means no weighted sampling (plain uniform sampling). Options are ‘element’ and ‘automatic’: ‘element’ picks weights from the dataset json; ‘automatic’ calculates them from the number of elements in each class (see the sketch after this parameter list).
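
The exact weighting formula behind ‘automatic’ is not spelled out here; the sketch below assumes simple inverse class frequency, purely for illustration:

    from collections import Counter

    labels = [0, 0, 0, 1, 2, 2]                    # hypothetical class labels
    counts = Counter(labels)                       # {0: 3, 1: 1, 2: 2}
    class_weights = {c: 1.0 / n for c, n in counts.items()}
    sample_weights = [class_weights[c] for c in labels]
    # Rare classes receive larger weights, so weighted sampling draws
    # them more often than their raw frequency would suggest.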

class ClassificationKerasImagePipelineWithCache(data_list_file_path, data_file_base_dir, data_list_key, output_crop_size, transforms, output_data_format='channels_first', output_data_dims=3, output_image_channels=1, output_image_dtype='float32', output_label_format=None, output_batch_size=10, batched_by_transforms=False, num_workers=4, prefetch_size=20, shuffle=True, repeat=True, duplicate_count=1, extra_inputs=None, num_cache_objects=10000, replace_percent=0.1, caches_data=True, batch_transforms=None, items_per_category=None, category_weights=None, multiprocessing=False, sampling=None)

Bases: ai4med.components.data.keras_image_pipeline_with_cache.KerasImagePipelineWithCache

An implementation of DataPipeline that uses SmartCache to efficiently generate data for training/testing of classification tasks.

Note that data_list_file_path must point to a json file that is similar to what you get from http://medicaldecathlon.com/.

Parameters
  • data_list_file_path (string) – The path to the json file

  • data_file_base_dir (string) – The base directory of the dataset

  • data_list_key (string) – The key to get the list of dictionaries to be used

  • output_crop_size (tuple, list) – Crop size of the output data

  • transforms – A list of transforms to be applied to the data

  • output_data_format – Format of the output data. Must be a valid format from DataFormat. See ai4med.common.data_format

  • output_data_dims (int) – Number of dimensions of output images

  • output_image_channels (int) – Number of channels of output images

  • output_image_dtype (string) – Data type of output images

  • output_label_format – Format of output labels, refer to ai4med.common.label_format

  • output_batch_size (int) – Batch size of output

  • batched_by_transforms (bool) – Batching can be done either by transforms or by the TF dataset. This parameter specifies how the batching is done.

  • num_workers (int) – Number of worker threads for data transformation

  • prefetch_size (int) – Number of data subjects to prefetch

  • shuffle (bool) – To shuffle the data or not

  • duplicate_count (int) – Number of times to duplicate the datalist.

  • extra_inputs – Extra placeholders for data inputs

  • num_cache_objects (int) – Number of objects to be cached

  • replace_percent (float) – The percent of cached data to be replaced in every epoch

  • caches_data (bool) – Whether to cache data in memory

  • batch_transforms (list) – List of transforms to be applied to batched data.

  • multiprocessing (bool) – Whether to use the multiprocessing library or Python’s native threading. Default: False.

  • sampling (str) – Whether to use weighted sampling for the data. The default sampling=None means no weighted sampling (plain uniform sampling). Options are ‘element’ and ‘automatic’: ‘element’ picks weights from the dataset json; ‘automatic’ calculates them from the number of elements in each class.

Note

SmartCache has a content rotation feature that is based on the config parameters, and it determines the content in the cache by dynamically rotating through the whole dataset. Only the data in the cache is used for training.

If you set ‘caches_data’ to False, then you are only using this content rotation feature without incurring any memory consumption.

class DataPipeline

Bases: ai4med.common.graph_component.GraphComponent

This class defines the required methods for data pipeline implementations.

A DataPipeline produces data items for training and validation.

Note

DataPipeline is a graph building component. Implementations must implement the build method required by GraphComponent’s interface.

abstract get_data_property()

Get the property of produced data

Returns: DataProperty object

abstract get_dataset_size()

Get the size of the dataset, which is the number of training subjects.

Returns: size of dataset

get_extra_inputs()

Get the placeholder specs of extra data inputs, if any

Returns: list of PlaceholderSpec objects, or None

abstract get_next_batch(session)

Get the next batch of data.

Parameters

session – the TF session

Returns: batched data

abstract initialize_dataset(session, state=-1)

Initializes the dataset.

Note

This method is called at the beginning of each training epoch.

Parameters
  • session – the TF session.

  • state – the current training state. It is usually the epoch number. State -1 means the training has not started.

abstract number_of_subjects_per_batch()

Get the number of subjects used to produce a batch. Depending on how the training data is transformed, training samples in the same batch could be produced from one or more subjects.

Returns: number of subjects used to produce a batch

abstract set_sharding(rank, num_shards, equal_shard_size=True, fixed_shard_data=False)

Computes the parameters for dividing the dataset into multiple partitions (shards). This is used for Horovod-based multi-GPU training. See the sketch after the parameters below.

Parameters
  • rank (int) – the rank of the current process

  • num_shards (int) – total number of shards

  • equal_shard_size (bool) – whether to make all shards equal size (Default: True)

  • fixed_shard_data (bool) – whether to make content of each shard fixed. If not fixed, content of each shard is recomputed randomly across the whole dataset each time the dataset is initialized. (Default: False)
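
A hedged sketch of how a shard can be derived from rank and num_shards (the library’s actual partitioning logic may differ; this only illustrates the relationship between the parameters):

    def shard_indices(dataset_size, rank, num_shards, equal_shard_size=True):
        """Illustrative round-robin split of dataset indices into shards."""
        shard = list(range(dataset_size))[rank::num_shards]
        if equal_shard_size:
            # Trim every shard to the same size so all ranks step through
            # the same number of batches per epoch.
            shard = shard[: dataset_size // num_shards]
        return shard

    # e.g. dataset_size=10, num_shards=4: rank 0 -> [0, 4], rank 1 -> [1, 5], ...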

abstract shutdown()

Shuts down the data pipeline, letting an implementation clean up any resources it used.
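
A minimal skeleton of a custom implementation. Assumptions: the import path follows this package, and the method bodies are illustrative stubs rather than the library’s own logic:

    from ai4med.components.data import DataPipeline  # import path assumed

    class InMemoryPipeline(DataPipeline):
        """Serves pre-built batches from memory (illustration only)."""

        def __init__(self, batches, data_property=None):
            self._batches = batches      # hypothetical list of ready batches
            self._prop = data_property
            self._cursor = 0

        def build(self, build_ctx):      # required by GraphComponent
            pass                         # nothing to add to the TF graph here

        def get_data_property(self):
            return self._prop            # real code returns a DataProperty

        def get_dataset_size(self):
            return len(self._batches)

        def get_next_batch(self, session):
            batch = self._batches[self._cursor % len(self._batches)]
            self._cursor += 1
            return batch

        def initialize_dataset(self, session, state=-1):
            self._cursor = 0             # called at the start of each epoch

        def number_of_subjects_per_batch(self):
            return 1

        def set_sharding(self, rank, num_shards, equal_shard_size=True,
                         fixed_shard_data=False):
            pass                         # single-process sketch: no sharding

        def shutdown(self):
            pass                         # no resources to release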

class ImagePipeline(task, data_list_file_path, data_file_base_dir, data_list_key, crop_size, transforms, data_format='channels_first', num_data_dims=3, num_channels=1, image_dtype='float32', num_label_channels=1, label_format=None, label_dtype='float32', batch_size=10, num_workers=4, prefetch_size=20, shuffle=True, repeat=True, duplicate_count=1, extra_inputs=None, batch_transforms=None, items_per_category=None, category_weights=None)

Bases: ai4med.components.data.base_image_pipeline.BaseImagePipeline

An implementation of DataPipeline that generates images for training/testing.

Note that data_list_file_path must point to a json file that is similar to what you get from http://medicaldecathlon.com/.

Note

This class implements the data pipeline with TF’s Dataset API.

Parameters
  • task (string) – Task to perform

  • data_list_file_path (string) – The path to the json file

  • data_file_base_dir (string) – The base directory of the dataset

  • data_list_key (string) – The key to get the list of dictionaries to be used

  • crop_size (tuple, list) – Crop size of the output data

  • transforms – A list of transforms to be applied to the data

  • data_format – Format of the output data. Must be a valid format from DataFormat. See ai4med.common.data_format

  • num_data_dims (int) – Number of dimensions of output images

  • num_channels (int) – Number of channels of output images

  • image_dtype (string) – Data type of output images

  • num_label_channels (int) – Number of channels of output label images (for segmentation task)

  • label_format – Format of output labels, refer to ai4med.common.label_format (for classification task)

  • label_dtype (string) – Data type for output labels

  • batch_size (int) – Batch size of output

  • num_workers (int) – Number of worker threads for data transformation

  • duplicate_count (int) – Number of times to duplicate the datalist.

  • prefetch_size (int) – Number of data subjects to prefetch

  • shuffle (bool) – To shuffle the data or not

  • extra_inputs – Extra placeholders for data inputs

  • batch_transforms (list) – List of transforms to be applied to batched data.

build(build_ctx: ai4med.common.build_ctx.BuildContext)

Builds TF graph components. It reads and processes the data list file and creates a data property object based on the data list content and init parameters.

Parameters

build_ctx – the build context.

get_batched_data(session)

Get the next batch of data.

Parameters

session – the TF session

Returns: batched data

get_data_property()

Get the property of produced data

Returns: DataProperty object

get_dataset_size()

Get the size of the TF dataset, which is the number of training subjects.

Returns: size of dataset

get_extra_inputs()

Get the placeholder specs of extra data inputs, if any

Returns: list of PlaceholderSpec objects, or None

initialize_dataset(session, state=-1)

Initializes the dataset. Note: this method is called at the beginning of each training epoch.

Parameters
  • session – the TF session.

  • state – the current training state. It is usually the epoch number. State -1 means the training has not started.
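
A hedged epoch-loop sketch showing how these calls fit together (TF 1.x session style, as the API implies; pipeline is assumed to be an already-built ImagePipeline):

    import tensorflow as tf  # TF 1.x assumed

    num_epochs = 5  # illustrative value
    with tf.Session() as sess:
        steps_per_epoch = (pipeline.get_dataset_size()
                           // pipeline.number_of_subjects_per_batch())
        for epoch in range(num_epochs):
            pipeline.initialize_dataset(sess, state=epoch)  # re-init every epoch
            for _ in range(steps_per_epoch):
                batch = pipeline.get_next_batch(sess)
                # ...feed batch into the training step here...
        pipeline.shutdown()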

number_of_subjects_per_batch()

Get the number of subjects used to produce a batch. Depending on how the training data is transformed, training samples in the same batch could be produced from one or more subjects.

Returns: number of subjects used to produce a batch

set_sharding(rank, num_shards, equal_shard_size=True, fixed_shard_data=False)

Computes the parameters for dividing the dataset into multiple partitions (shards). This is used for Horovod-based multi-GPU training.

Parameters
  • rank (int) – the rank of the current process

  • num_shards (int) – total number of shards

  • equal_shard_size (bool) – whether to make all shards equal size

  • fixed_shard_data (bool) – whether to make content of each shard fixed. If not fixed, content of each shard is recomputed randomly across the whole dataset each time the dataset is initialized.

shutdown()

Shut down the image pipeline and clean up dataset resources.

class ImagePipelineWithCache(task, data_list_file_path, data_file_base_dir, data_list_key, crop_size, transforms, data_format='channels_first', num_data_dims=3, num_channels=1, image_dtype='float32', num_label_channels=1, label_format=None, label_dtype='float32', batch_size=10, num_workers=4, prefetch_size=20, shuffle=True, repeat=True, duplicate_count=1, extra_inputs=None, num_cache_objects=10000, replace_percent=0.1, caches_data=True, batch_transforms=None, items_per_category=None, category_weights=None)

Bases: ai4med.components.data.image_pipeline.ImagePipeline

An implementation of DataPipeline that uses SmartCache to efficiently generate data for training/testing.

Note that data_list_file_path must point to a json file that is similar to what you get from http://medicaldecathlon.com/.

Parameters
  • task (string) – Task to perform

  • data_list_file_path (string) – The path to the json file

  • data_file_base_dir (string) – The base directory of the dataset

  • data_list_key (string) – The key to get the list of dictionaries to be used

  • crop_size (tuple, list) – Crop size of the output data

  • transforms – A list of transforms to be applied to the data

  • data_format – Format of the output data. Must be a valid format from DataFormat. See ai4med.common.data_format

  • num_data_dims (int) – Number of dimensions of output images

  • num_channels (int) – Number of channels of output images

  • image_dtype (string) – Data type of output images

  • num_label_channels (int) – Number of channels of output label images (for segmentation task)

  • label_format – Format of output labels, refer to ai4med.common.label_format (for classification task)

  • label_dtype (string) – Data type for output labels

  • batch_size (int) – Batch size of output

  • num_workers (int) – Number of worker processes for data transformation

  • prefetch_size (int) – Number of data subjects to prefetch

  • shuffle (bool) – To shuffle the data or not

  • duplicate_count (int) – Number of times to duplicate the datalist.

  • extra_inputs – Extra placeholders for data inputs

  • num_cache_objects (int) – Number of objects to be cached

  • replace_percent (float) – The percent of cached data to be replaced in every epoch

  • caches_data (bool) – Whether to cache data in memory.

  • batch_transforms (list) – List of transforms to be applied to batched data.

Note

SmartCache has a content rotation feature that is based on the config parameters, and it determines the content in the cache by dynamically rotating through the whole dataset. Only the data in the cache is used for training.

If you set ‘caches_data’ to False, then you are only using this content rotation feature without incurring any memory consumption.

class KerasImagePipeline(task, data_list_file_path, data_file_base_dir, data_list_key, crop_size, transforms, data_format='channels_first', num_data_dims=3, num_channels=1, image_dtype='float32', num_label_channels=1, label_format=None, label_dtype='float32', batch_size=10, num_workers=4, prefetch_size=20, shuffle=True, repeat=True, duplicate_count=1, extra_inputs=None, batch_transforms=None, items_per_category=None, category_weights=None, multiprocessing=False, sampling=None)

Bases: ai4med.components.data.base_image_pipeline.BaseImagePipeline

Implementation of the data pipeline using Keras.

Note

This class uses Keras’s data enqueuer to manage worker threads.

Parameters
  • task (string) – Task to perform

  • data_list_file_path (string) – The path to the json file

  • data_file_base_dir (string) – The base directory of the dataset

  • data_list_key (string) – The key to get the list of dictionaries to be used

  • crop_size (tuple, list) – Crop size of the output data

  • transforms – A list of transforms to be applied to the data

  • data_format – Format of the output data. Must be a valid format from DataFormat. See ai4med.common.data_format

  • num_data_dims (int) – Number of dimensions of output images

  • num_channels (int) – Number of channels of output images

  • image_dtype (string) – Data type of output images

  • num_label_channels (int) – Number of channels of output label images (for segmentation task)

  • label_format – Format of output labels, refer to ai4med.common.label_format (for classification task)

  • label_dtype (string) – Data type for output labels

  • batch_size (int) – Batch size of output

  • num_workers (int) – Number of worker threads for data transformation

  • prefetch_size (int) – Number of data subjects to prefetch

  • shuffle (bool) – To shuffle the data or not

  • duplicate_count (int) – Number of times to duplicate the datalist.

  • extra_inputs – Extra placeholders for data inputs

  • batch_transforms (list) – List of transforms to be applied to batched data.

  • multiprocessing (bool) – Whether to use the multiprocessing library or Python’s native threading. Default: False.

  • sampling (str) – Whether to use weighted sampling for the data. The default sampling=None means no weighted sampling (plain uniform sampling). Options are ‘element’ and ‘automatic’: ‘element’ picks weights from the dataset json; ‘automatic’ calculates them from the number of elements in each class.

begin_generator()
build(build_ctx: ai4med.common.build_ctx.BuildContext)

Builds the Keras pipeline using a Sequence generator and starts the queue operation.

create_data_gen_and_enqueuer()
create_sample_weights()

Adds weights to items list if sampling is enabled.

get_batched_data(session)

Get the next batch of data. In the Keras implementation, the session argument is not used.
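
A brief usage sketch for the Keras variant, under the assumption that passing None for the unused session argument is acceptable:

    pipeline.initialize_dataset(None)        # prepares the generator/enqueuer
    batch = pipeline.get_batched_data(None)  # session argument is ignored
    # ...consume batch...
    pipeline.shutdown()                      # stop the enqueuer's workers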

get_data_property()

Get the property of produced data.

get_dataset_size()

Get the size of the Keras dataset.

get_extra_inputs()

Get placeholder specs for extra inputs

initialize_dataset(session, state=-1)

Initializes the dataset. Note: this method is called at the beginning of each training epoch.

Parameters
  • session – the TF session.

  • state – the current training state. It is usually the epoch number. State -1 means the training has not started.

number_of_subjects_per_batch()

Get the number of subjects used to produce a batch. Depending on how the training data is transformed, training samples in the same batch could be produced from one or more subjects.

Returns: number of subjects used to produce a batch

set_sharding(rank, num_shards, equal_shard_size=True, fixed_shard_data=False)

Computes the parameters for dividing the dataset into multiple partitions (shards). This is used for Horovod-based multi-GPU training.

Parameters
  • rank (int) – the rank of the current process

  • num_shards (int) – total number of shards

  • equal_shard_size (bool) – whether to make all shards equal size

  • fixed_shard_data (bool) – whether to make content of each shard fixed. If not fixed, content of each shard is recomputed randomly across the whole dataset each time the dataset is initialized.

shutdown()

Shut down the image pipeline and clean up dataset resources.

class KerasImagePipelineWithCache(task, data_list_file_path, data_file_base_dir, data_list_key, crop_size, transforms, data_format='channels_first', num_data_dims=3, num_channels=1, image_dtype='float32', num_label_channels=1, label_format=None, label_dtype='float32', batch_size=10, num_workers=4, prefetch_size=20, shuffle=True, repeat=True, duplicate_count=1, extra_inputs=None, num_cache_objects=10000, replace_percent=0.1, caches_data=True, batch_transforms=None, items_per_category=None, category_weights=None, multiprocessing=False, sampling=None)

Bases: ai4med.components.data.keras_image_pipeline.KerasImagePipeline

An implementation of KerasImagePipeline that uses SmartCache to efficiently generate data for training/testing.

Note

This class uses Keras’s data enqueuer to manage worker threads.

Parameters
  • task (string) – Task to perform

  • data_list_file_path (string) – The path to the json file

  • data_file_base_dir (string) – The base directory of the dataset

  • data_list_key (string) – The key to get the list of dictionaries to be used

  • crop_size (tuple, list) – Crop size of the output data

  • transforms – A list of transforms to be applied to the data

  • data_format – Format of the output data. Must be a valid format from DataFormat. See ai4med.common.data_format

  • num_data_dims (int) – Number of dimensions of output images

  • num_channels (int) – Number of channels of output images

  • image_dtype (string) – Data type of output images

  • num_label_channels (int) – Number of channels of output label images (for segmentation task)

  • label_format – Format of output labels, refer to ai4med.common.label_format (for classification task)

  • label_dtype (string) – Data type for output labels

  • batch_size (int) – Batch size of output

  • num_workers (int) – Number of worker processes for data transformation

  • prefetch_size (int) – Number of data subjects to prefetch

  • shuffle (bool) – To shuffle the data or not

  • duplicate_count (int) – Number of times to duplicate the datalist.

  • extra_inputs – Extra placeholders for data inputs

  • num_cache_objects (int) – Number of objects to be cached

  • replace_percent (float) – The percent of cached data to be replaced in every epoch

  • caches_data (bool) – Whether to cache data in memory.

  • batch_transforms (list) – List of transforms to be applied to batched data.

  • multiprocessing (bool) – Whether to use the multiprocessing library or Python’s native threading. Default: False.

  • sampling (str) – Whether to use weighted sampling for the data. The default sampling=None means no weighted sampling (plain uniform sampling). Options are ‘element’ and ‘automatic’: ‘element’ picks weights from the dataset json; ‘automatic’ calculates them from the number of elements in each class.

Note

SmartCache has a content rotation feature that is based on the config parameters, and it determines the content in the cache by dynamically rotating through the whole dataset. Only the data in the cache is used for training.

If you set ‘caches_data’ to False, then you are only using this content rotation feature without incurring any memory consumption.

class SegmentationImagePipeline(data_list_file_path, data_file_base_dir, data_list_key, output_crop_size, transforms, output_data_format='channels_first', output_data_dims=3, output_image_channels=1, output_image_dtype='float32', output_label_channels=1, output_label_dtype='float32', output_batch_size=10, batched_by_transforms=False, num_workers=4, prefetch_size=20, shuffle=True, repeat=True, duplicate_count=1, extra_inputs=None, batch_transforms=None)

Bases: ai4med.components.data.image_pipeline.ImagePipeline

An ImagePipeline for segmentation tasks.

Note that data_list_file_path must point to a json file that is similar to what you get from http://medicaldecathlon.com/.

Parameters
  • data_list_file_path (string) – The path to the json file

  • data_file_base_dir (string) – The base directory of the dataset

  • data_list_key (string) – The key to get the list of dictionaries to be used

  • output_crop_size (tuple, list) – Crop size of the output data

  • transforms – A list of transforms to be applied to the data

  • output_data_format – Format of the output data. Must be a valid format from DataFormat. See ai4med.common.data_format

  • output_data_dims (int) – Number of dimensions of output images

  • output_image_channels (int) – Number of channels of output images

  • output_image_dtype (string) – Data type of output images

  • output_label_channels (int) – Number of channels of output label images

  • output_label_dtype (string) – Data type for output label images

  • output_batch_size (int) – Batch size of output data

  • batched_by_transforms (bool) – Batching can be done either by transforms or by the TF dataset. This arg specifies how the batching is done.

  • num_workers (int) – Number of worker threads for data transformation

  • prefetch_size (int) – Number of data subjects to prefetch

  • shuffle (bool) – To shuffle the data or not

  • duplicate_count (int) – Number of times to duplicate the datalist.

  • extra_inputs – Extra placeholders for data inputs

  • batch_transforms (list) – List of transforms to be applied to batched data.
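
A construction sketch analogous to the classification example earlier, adding the segmentation-specific label arguments (paths hypothetical; import path assumed; real transforms would replace the empty list):

    from ai4med.components.data import SegmentationImagePipeline  # import path assumed

    pipeline = SegmentationImagePipeline(
        data_list_file_path="/data/dataset_0.json",  # hypothetical path
        data_file_base_dir="/data",                  # hypothetical path
        data_list_key="training",
        output_crop_size=(96, 96, 96),
        transforms=[],                               # real transforms go here
        output_label_channels=1,                     # channels of label images
        output_label_dtype="float32",
        output_batch_size=4,
    )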

class SegmentationImagePipelineWithCache(data_list_file_path, data_file_base_dir, data_list_key, output_crop_size, transforms, output_data_format='channels_first', output_data_dims=3, output_image_channels=1, output_image_dtype='float32', output_label_channels=1, output_label_dtype='float32', output_batch_size=10, batched_by_transforms=False, num_workers=4, prefetch_size=20, shuffle=True, repeat=True, duplicate_count=1, extra_inputs=None, num_cache_objects=10000, replace_percent=0.1, caches_data=True, batch_transforms=None)

Bases: ai4med.components.data.image_pipeline_with_cache.ImagePipelineWithCache

An implementation of DataPipeline that uses SmartCache to efficiently generate data for training/testing of segmentation tasks.

Note that data_list_file_path must point to a json file that is similar to what you get from http://medicaldecathlon.com/.

Parameters
  • data_list_file_path (string) – The path to the json file

  • data_file_base_dir (string) – The base directory of the dataset

  • data_list_key (string) – The key to get the list of dictionaries to be used

  • output_crop_size (tuple, list) – Crop size of the output data

  • transforms – A list of transforms to be applied to the data

  • output_data_format – Format of the output data. Must be a valid format from DataFormat. See ai4med.common.data_format

  • output_data_dims (int) – Number of dimensions of output images

  • output_image_channels (int) – Number of channels of output images

  • output_image_dtype (string) – Data type of output images

  • output_label_channels (int) – Number of channels of output label images

  • output_label_dtype (string) – Data type for output label images

  • output_batch_size (int) – Batch size of output data

  • batched_by_transforms (bool) – Batching can be done either by transforms or by the TF dataset. This parameter specifies how the batching is done.

  • num_workers (int) – Number of worker threads for data transformation

  • prefetch_size (int) – Number of data subjects to prefetch

  • shuffle (bool) – To shuffle the data or not

  • duplicate_count (int) – Number of times to duplicate the datalist.

  • extra_inputs – Extra placeholders for data inputs

  • num_cache_objects (int) – Number of objects to be cached

  • replace_percent (float) – The percent of cached data to be replaced in every epoch

  • caches_data (bool) – Whether to cache data in memory.

  • batch_transforms (list) – List of transforms to be applied to batched data.

Note

SmartCache has a content rotation feature that is based on the config parameters, and it determines the content in the cache by dynamically rotating through the whole dataset. Only the data in the cache is used for training.

If you set ‘caches_data’ to False, then you are only using this content rotation feature without incurring any memory consumption.

class SegmentationKerasImagePipeline(data_list_file_path, data_file_base_dir, data_list_key, output_crop_size, transforms, output_data_format='channels_first', output_data_dims=3, output_image_channels=1, output_image_dtype='float32', output_label_channels=1, output_label_dtype='float32', output_batch_size=10, batched_by_transforms=False, num_workers=4, prefetch_size=20, shuffle=True, repeat=True, duplicate_count=1, extra_inputs=None, batch_transforms=None, multiprocessing=False)

Bases: ai4med.components.data.keras_image_pipeline.KerasImagePipeline

An ImagePipeline for segmentation tasks using the Keras backend.

Note that data_list_file_path must point to a json file that is similar to what you get from http://medicaldecathlon.com/.

Parameters
  • data_list_file_path (string) – The path to the json file

  • data_file_base_dir (string) – The base directory of the dataset

  • data_list_key (string) – The key to get the list of dictionaries to be used

  • output_crop_size (tuple, list) – Crop size of the output data

  • transforms – A list of transforms to be applied to the data

  • output_data_format – Format of the output data. Must be a valid format from DataFormat. See ai4med.common.data_format

  • output_data_dims (int) – Number of dimensions of output images

  • output_image_channels (int) – Number of channels of output images

  • output_image_dtype (string) – Data type of output images

  • output_label_channels (int) – Number of channels of output label images

  • output_label_dtype (string) – Data type for output label images

  • output_batch_size (int) – Batch size of output data

  • batched_by_transforms (bool) – Batching can be done either by transforms or by the TF dataset. This arg specifies how the batching is done.

  • num_workers (int) – Number of worker threads for data transformation

  • prefetch_size (int) – Number of data subjects to prefetch

  • shuffle (bool) – To shuffle the data or not

  • duplicate_count (int) – Number of times to duplicate the datalist.

  • extra_inputs – Extra placeholders for data inputs

  • multiprocessing (bool) – Whether to use the multiprocessing library or Python’s native threading. Default: False.

  • batch_transforms (list) – List of transforms to be applied to batched data.

class SegmentationKerasImagePipelineWithCache(data_list_file_path, data_file_base_dir, data_list_key, output_crop_size, transforms, output_data_format='channels_first', output_data_dims=3, output_image_channels=1, output_image_dtype='float32', output_label_channels=1, output_label_dtype='float32', output_batch_size=10, batched_by_transforms=False, num_workers=4, prefetch_size=20, shuffle=True, repeat=True, duplicate_count=1, extra_inputs=None, num_cache_objects=10000, replace_percent=0.1, caches_data=True, batch_transforms=None, multiprocessing=False)

Bases: ai4med.components.data.keras_image_pipeline_with_cache.KerasImagePipelineWithCache

An implementation of DataPipeline that uses SmartCache to efficiently generate data for training/testing of segmentation tasks.

Note that data_list_file_path must point to a json file that is similar to what you get from http://medicaldecathlon.com/.

Parameters
  • data_list_file_path (string) – The path to the json file

  • data_file_base_dir (string) – The base directory of the dataset

  • data_list_key (string) – The key to get the list of dictionaries to be used

  • output_crop_size (tuple, list) – Crop size of the output data

  • transforms – A list of transforms to be applied to the data

  • output_data_format – Format of the output data. Must be a valid format from DataFormat. See ai4med.common.data_format

  • output_data_dims (int) – Number of dimensions of output images

  • output_image_channels (int) – Number of channels of output images

  • output_image_dtype (string) – Data type of output images

  • output_label_channels (int) – Number of channels of output label images

  • output_label_dtype (string) – Data type for output label images

  • output_batch_size (int) – Batch size of output data

  • batched_by_transforms (bool) – Batching can be done either by transforms or by the TF dataset. This parameter specifies how the batching is done.

  • num_workers (int) – Number of worker threads for data transformation

  • prefetch_size (int) – Number of data subjects to prefetch

  • shuffle (bool) – To shuffle the data or not

  • duplicate_count (int) – Number of times to duplicate the datalist.

  • extra_inputs – Extra placeholders for data inputs

  • num_cache_objects (int) – Number of objects to be cached

  • replace_percent (float) – The percent of cached data to be replaced in every epoch

  • caches_data (bool) – Whether to cache data in memory

  • batch_transforms (list) – List of transforms to be applied to batched data.

  • multiprocessing (bool) – Whether to use the multiprocessing library or Python’s native threading. Default: False.

Note

SmartCache has a content rotation feature that is based on the config parameters, and it determines the content in the cache by dynamically rotating through the whole dataset. Only the data in the cache is used for training.

If you set ‘caches_data’ to False, then you are only using this content rotation feature without incurring any memory consumption.
