ai4med.components.data package

class BaseImagePipeline(task, data_list_file_path, data_file_base_dir, data_list_key, crop_size, data_format='channels_first', num_data_dims=3, num_channels=1, num_label_channels=1, label_format=None, shuffle=True, duplicate_count=1, batch_transforms=None, items_per_category=None, category_weights=None)
Bases: ai4med.components.data.data_pipeline.DataPipeline
Base class for ImagePipelines.
- Parameters
batch_transforms (list) – List of transforms to be applied to batched data.

get_batched_data(session)

get_next_batch(session)
Get the next batch of data.
- Parameters
session – the TF session
Returns: batched data

process_data_list()

class ClassificationImagePipeline(data_list_file_path, data_file_base_dir, data_list_key, output_crop_size, transforms, output_data_format='channels_first', output_data_dims=3, output_image_channels=1, output_image_dtype='float32', output_label_format=None, output_batch_size=10, batched_by_transforms=False, num_workers=4, prefetch_size=20, shuffle=True, repeat=True, duplicate_count=1, extra_inputs=None, batch_transforms=None, items_per_category=None, category_weights=None)
Bases: ai4med.components.data.image_pipeline.ImagePipeline
An ImagePipeline for classification tasks.
Note that data_list_file_path must point to a json file that is similar to what you get from http://medicaldecathlon.com/.
- Parameters
data_list_file_path (string) – The path to the json file
data_file_base_dir (string) – The base directory of the dataset
data_list_key (string) – The key to get the list of dictionaries to be used
output_crop_size (tuple, list) – Crop size of the output data
transforms – A list of transforms to be applied to the data
output_data_format – Format of the output data. Must be a valid format from DataFormat; see ai4med.common.data_format
output_data_dims (int) – Number of dimensions of output images
output_image_channels (int) – Number of channels of output images
output_image_dtype (string) – Data type of output images
output_label_format – Format of output labels; refer to ai4med.common.label_format
output_batch_size (int) – Batch size of output
batched_by_transforms (bool) – Batching can be done either by the transforms or by the TF dataset. This parameter specifies how the batching is done.
num_workers (int) – Number of worker threads for data transformation
prefetch_size (int) – Number of data subjects to prefetch
shuffle (bool) – To shuffle the data or not
duplicate_count (int) – Number of times to duplicate the datalist.
extra_inputs – Extra placeholders for data inputs
batch_transforms (list) – List of transforms to be applied to batched data.
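Example: a minimal construction sketch. The file paths and the transforms list are placeholders, and the import path is assumed from the package name; consult your installation for the actual import.

    # Hypothetical usage sketch, not taken verbatim from the library docs
    from ai4med.components.data import ClassificationImagePipeline  # import path assumed

    my_transforms = []  # placeholder: the list of transform objects goes here

    pipeline = ClassificationImagePipeline(
        data_list_file_path='/workspace/dataset.json',  # Decathlon-style data list (placeholder path)
        data_file_base_dir='/workspace',                # base directory of the dataset (placeholder)
        data_list_key='training',                       # key whose entries list the training subjects
        output_crop_size=(96, 96, 96),                  # crop size of the output data
        transforms=my_transforms,
        output_batch_size=10,
        num_workers=4,
        shuffle=True,
    )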

class ClassificationImagePipelineWithCache(data_list_file_path, data_file_base_dir, data_list_key, output_crop_size, transforms, output_data_format='channels_first', output_data_dims=3, output_image_channels=1, output_image_dtype='float32', output_label_format=None, output_batch_size=10, batched_by_transforms=False, num_workers=4, prefetch_size=20, shuffle=True, repeat=True, duplicate_count=1, extra_inputs=None, num_cache_objects=10000, replace_percent=0.1, caches_data=True, batch_transforms=None, items_per_category=None, category_weights=None)
Bases: ai4med.components.data.image_pipeline_with_cache.ImagePipelineWithCache
An implementation of DataPipeline that uses SmartCache to efficiently generate data for training/testing of classification tasks.
Note that data_list_file_path must point to a json file that is similar to what you get from http://medicaldecathlon.com/.
- Parameters
data_list_file_path (string) – The path to the json file
data_file_base_dir (string) – The base directory of the dataset
data_list_key (string) – The key to get the list of dictionaries to be used
output_crop_size (tuple, list) – Crop size of the output data
transforms – A list of transforms to be applied to the data
output_data_format – Format of the output data. Must be a valid format from DataFormat; see ai4med.common.data_format
output_data_dims (int) – Number of dimensions of output images
output_image_channels (int) – Number of channels of output images
output_image_dtype (string) – Data type of output images
output_label_format – Format of output labels; refer to ai4med.common.label_format
output_batch_size (int) – Batch size of output
batched_by_transforms (bool) – Batching can be done either by transforms or by the TF dataset. This parameter specifies how the batching is done.
num_workers (int) – Number of worker threads for data transformation
prefetch_size (int) – Number of data subjects to prefetch
shuffle (bool) – To shuffle the data or not
duplicate_count (int) – Number of times to duplicate the datalist.
extra_inputs – Extra placeholders for data inputs
num_cache_objects (int) – Number of objects to be cached
replace_percent (float) – The percent of cached data to be replaced in every epoch
caches_data (bool) – Whether to cache data in memory
batch_transforms (list) – List of transforms to be applied to batched data.
Note: SmartCache has a content rotation feature that, based on the config parameters, determines the content of the cache by dynamically rotating through the whole dataset. Only the data in the cache is used for training.
If you set caches_data to False, then you are only using the content rotation feature without incurring any memory consumption.
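Example: how the cache parameters interact. A rough sketch of the rotation arithmetic with the default values; the exact rounding behavior is an assumption.

    # With num_cache_objects=10000 and replace_percent=0.1, roughly 10% of the
    # cached objects are swapped out for fresh subjects at each epoch.
    num_cache_objects = 10000
    replace_percent = 0.1
    replaced_per_epoch = int(num_cache_objects * replace_percent)  # ~1000 items rotated per epoch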

class ClassificationKerasImagePipeline(data_list_file_path, data_file_base_dir, data_list_key, output_crop_size, transforms, output_data_format='channels_first', output_data_dims=3, output_image_channels=1, output_image_dtype='float32', output_label_format=None, output_batch_size=10, batched_by_transforms=False, num_workers=4, prefetch_size=20, shuffle=True, repeat=True, duplicate_count=1, extra_inputs=None, batch_transforms=None, items_per_category=None, category_weights=None, multiprocessing=False, sampling=None)
Bases: ai4med.components.data.keras_image_pipeline.KerasImagePipeline
An ImagePipeline for classification tasks using the Keras backend.
Note that data_list_file_path must point to a json file that is similar to what you get from http://medicaldecathlon.com/.
- Parameters
data_list_file_path (string) – The path to the json file
data_file_base_dir (string) – The base directory of the dataset
data_list_key (string) – The key to get the list of dictionaries to be used
output_crop_size (tuple, list) – Crop size of the output data
transforms – A list of transforms to be applied to the data
output_data_format – Format of the output data. Must be a valid format from DataFormat; see ai4med.common.data_format
output_data_dims (int) – Number of dimensions of output images
output_image_channels (int) – Number of channels of output images
output_image_dtype (string) – Data type of output images
output_label_format – Format of output labels; refer to ai4med.common.label_format
output_batch_size (int) – Batch size of output
batched_by_transforms (bool) – Batching can be done either by the transforms or by the TF dataset. This parameter specifies how the batching is done.
num_workers (int) – Number of worker threads for data transformation
prefetch_size (int) – Number of data subjects to prefetch
shuffle (bool) – To shuffle the data or not
duplicate_count (int) – Number of times to duplicate the datalist.
extra_inputs – Extra placeholders for data inputs
batch_transforms (list) – List of transforms to be applied to batched data.
multiprocessing (bool) – Whether to use the multiprocessing library or Python's native threading. Default: False.
sampling (str) – Whether to use weighted sampling for the data. The default, sampling=None, means no weighted sampling. Options are 'element' and 'automatic': 'element' picks weights from the dataset json; 'automatic' calculates them based on the number of elements in each class.
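Example: enabling class-balanced sampling. A sketch; arguments other than sampling behave as in ClassificationImagePipeline, and the import path is assumed.

    # Hypothetical usage sketch: weight samples by per-class element counts
    from ai4med.components.data import ClassificationKerasImagePipeline  # import path assumed

    pipeline = ClassificationKerasImagePipeline(
        data_list_file_path='/workspace/dataset.json',  # placeholder path
        data_file_base_dir='/workspace',                # placeholder base directory
        data_list_key='training',
        output_crop_size=(96, 96, 96),
        transforms=[],                 # placeholder transform list
        multiprocessing=False,         # use Python threading for the workers
        sampling='automatic',          # derive weights from the number of elements in each class
    )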

class ClassificationKerasImagePipelineWithCache(data_list_file_path, data_file_base_dir, data_list_key, output_crop_size, transforms, output_data_format='channels_first', output_data_dims=3, output_image_channels=1, output_image_dtype='float32', output_label_format=None, output_batch_size=10, batched_by_transforms=False, num_workers=4, prefetch_size=20, shuffle=True, repeat=True, duplicate_count=1, extra_inputs=None, num_cache_objects=10000, replace_percent=0.1, caches_data=True, batch_transforms=None, items_per_category=None, category_weights=None, multiprocessing=False, sampling=None)
Bases: ai4med.components.data.keras_image_pipeline_with_cache.KerasImagePipelineWithCache
An implementation of DataPipeline that uses SmartCache to efficiently generate data for training/testing of classification tasks.
Note that data_list_file_path must point to a json file that is similar to what you get from http://medicaldecathlon.com/.
- Parameters
data_list_file_path (string) – The path to the json file
data_file_base_dir (string) – The base directory of the dataset
data_list_key (string) – The key to get the list of dictionaries to be used
output_crop_size (tuple, list) – Crop size of the output data
transforms – A list of transforms to be applied to the data
output_data_format – Format of the output data. Must be a valid format from DataFormat; see ai4med.common.data_format
output_data_dims (int) – Number of dimensions of output images
output_image_channels (int) – Number of channels of output images
output_image_dtype (string) – Data type of output images
output_label_format – Format of output labels; refer to ai4med.common.label_format
output_batch_size (int) – Batch size of output
batched_by_transforms (bool) – Batching can be done either by transforms or by the TF dataset. This parameter specifies how the batching is done.
num_workers (int) – Number of worker threads for data transformation
prefetch_size (int) – Number of data subjects to prefetch
shuffle (bool) – To shuffle the data or not
duplicate_count (int) – Number of times to duplicate the datalist.
extra_inputs – Extra placeholders for data inputs
num_cache_objects (int) – Number of objects to be cached
replace_percent (float) – The percent of cached data to be replaced in every epoch
caches_data (bool) – Whether to cache data in memory
batch_transforms (list) – List of transforms to be applied to batched data.
multiprocessing (bool) – Whether to use the multiprocessing library or Python's native threading. Default: False.
sampling (str) – Whether to use weighted sampling for the data. The default, sampling=None, means no weighted sampling. Options are 'element' and 'automatic': 'element' picks weights from the dataset json; 'automatic' calculates them based on the number of elements in each class.
Note: SmartCache has a content rotation feature that, based on the config parameters, determines the content of the cache by dynamically rotating through the whole dataset. Only the data in the cache is used for training.
If you set caches_data to False, then you are only using the content rotation feature without incurring any memory consumption.

class DataPipeline
Bases: ai4med.common.graph_component.GraphComponent
This class defines the required methods for data pipeline implementations.
A DataPipeline produces data items for training and validation.
Note: DataPipeline is a graph-building component. Implementations must implement the build method required by GraphComponent's interface.

abstract get_data_property()
Get the property of produced data.
Returns: DataProperty object

abstract get_dataset_size()
Get the size of the dataset, which is the number of training subjects.
Returns: size of dataset

get_extra_inputs()
Get the placeholder specs of extra data inputs, if any.
Returns: list of PlaceholderSpec objects, or None

abstract get_next_batch(session)
Get the next batch of data.
- Parameters
session – the TF session
Returns: batched data

abstract initialize_dataset(session, state=-1)
Initializes the dataset.
Note: This method is called at the beginning of each training epoch.
- Parameters
session – the TF session.
state – the current training state. It is usually the epoch number. State -1 means the training has not started.

abstract number_of_subjects_per_batch()
Get the number of subjects used to produce a batch. Depending on how the training data is transformed, training samples in the same batch could be produced from one or more subjects.
Returns: number of subjects used to produce a batch

abstract set_sharding(rank, num_shards, equal_shard_size=True, fixed_shard_data=False)
Computes the parameters for dividing the dataset into multiple partitions (shards). This is used for Horovod-based multi-GPU training.
- Parameters
rank (int) – the rank of the current process
num_shards (int) – total number of shards
equal_shard_size (bool) – whether to make all shards equal size (Default: True)
fixed_shard_data (bool) – whether to make the content of each shard fixed. If not fixed, the content of each shard is recomputed randomly across the whole dataset each time the dataset is initialized. (Default: False)

abstract shutdown()
Shuts down the data pipeline. This allows an implementation to clean up the resources it uses.
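Example: a per-epoch driver loop over this interface. A sketch only; pipeline, session, and num_epochs are assumed to exist, and the step count shown is one common way to derive it.

    # Hypothetical driver loop against the abstract DataPipeline interface
    steps = pipeline.get_dataset_size() // pipeline.number_of_subjects_per_batch()
    for epoch in range(num_epochs):
        pipeline.initialize_dataset(session, state=epoch)  # called at the start of each epoch
        for _ in range(steps):
            batch = pipeline.get_next_batch(session)       # fetch one batch of training data
            # ... feed `batch` to the training step ...
    pipeline.shutdown()  # release pipeline resources when training ends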

class ImagePipeline(task, data_list_file_path, data_file_base_dir, data_list_key, crop_size, transforms, data_format='channels_first', num_data_dims=3, num_channels=1, image_dtype='float32', num_label_channels=1, label_format=None, label_dtype='float32', batch_size=10, num_workers=4, prefetch_size=20, shuffle=True, repeat=True, duplicate_count=1, extra_inputs=None, batch_transforms=None, items_per_category=None, category_weights=None)
Bases: ai4med.components.data.base_image_pipeline.BaseImagePipeline
An implementation of DataPipeline that generates images for training/testing.
Note that data_list_file_path must point to a json file that is similar to what you get from http://medicaldecathlon.com/.
Note: This class implements the dataset with TF's dataset.
- Parameters
task (string) – Task to perform
data_list_file_path (string) – The path to the json file
data_file_base_dir (string) – The base directory of the dataset
data_list_key (string) – The key to get the list of dictionaries to be used
crop_size (tuple, list) – Crop size of the output data
transforms – A list of transforms to be applied to the data
data_format – Format of the output data. Must be a valid format from DataFormat; see ai4med.common.data_format
num_data_dims (int) – Number of dimensions of output images
num_channels (int) – Number of channels of output images
image_dtype (string) – Data type of output images
num_label_channels (int) – Number of channels of output label images (for segmentation task)
label_format – Format of output labels (for classification tasks); refer to ai4med.common.label_format
label_dtype (string) – Data type for output labels
batch_size (int) – Batch size of output
num_workers (int) – Number of worker threads for data transformation
duplicate_count (int) – Number of times to duplicate the datalist.
prefetch_size (int) – Number of data subjects to prefetch
shuffle (bool) – To shuffle the data or not
extra_inputs – Extra placeholders for data inputs
batch_transforms (list) – List of transforms to be applied to batched data.

build(build_ctx: ai4med.common.build_ctx.BuildContext)
Builds TF graph components. It reads and processes the data list file, and creates the data property object based on the data list content and init parameters.
- Parameters
build_ctx – the build context.

get_batched_data(session)
Get the next batch of data.
- Parameters
session – the TF session
Returns: batched data

get_data_property()
Get the property of produced data.
Returns: DataProperty object

get_dataset_size()
Get the size of the TF dataset, which is the number of training subjects.
Returns: size of dataset

get_extra_inputs()
Get the placeholder specs of extra data inputs, if any.
Returns: list of PlaceholderSpec objects, or None

initialize_dataset(session, state=-1)
Initializes the dataset. Note: this method is called at the beginning of each training epoch.
- Parameters
session – the TF session.
state – the current training state. It is usually the epoch number. State -1 means the training has not started.

number_of_subjects_per_batch()
Get the number of subjects used to produce a batch. Depending on how the training data is transformed, training samples in the same batch could be produced from one or more subjects.
Returns: number of subjects used to produce a batch

set_sharding(rank, num_shards, equal_shard_size=True, fixed_shard_data=False)
Computes the parameters for dividing the dataset into multiple partitions (shards). This is used for Horovod-based multi-GPU training.
- Parameters
rank (int) – the rank of the current process
num_shards (int) – total number of shards
equal_shard_size (bool) – whether to make all shards equal size
fixed_shard_data (bool) – whether to make the content of each shard fixed. If not fixed, the content of each shard is recomputed randomly across the whole dataset each time the dataset is initialized.

shutdown()
Shut down the image pipeline and clean up dataset resources.
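Example: sharding the dataset for Horovod-based multi-GPU training with set_sharding above. A sketch; it assumes Horovod is installed and a pipeline object already exists.

    # Hypothetical sketch: one shard per Horovod process
    import horovod.tensorflow as hvd

    hvd.init()
    pipeline.set_sharding(
        rank=hvd.rank(),         # rank of the current process
        num_shards=hvd.size(),   # total number of processes
        equal_shard_size=True,   # make all shards equal size
        fixed_shard_data=False,  # recompute shard contents on each dataset initialization
    )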

class ImagePipelineWithCache(task, data_list_file_path, data_file_base_dir, data_list_key, crop_size, transforms, data_format='channels_first', num_data_dims=3, num_channels=1, image_dtype='float32', num_label_channels=1, label_format=None, label_dtype='float32', batch_size=10, num_workers=4, prefetch_size=20, shuffle=True, repeat=True, duplicate_count=1, extra_inputs=None, num_cache_objects=10000, replace_percent=0.1, caches_data=True, batch_transforms=None, items_per_category=None, category_weights=None)
Bases: ai4med.components.data.image_pipeline.ImagePipeline
An implementation of DataPipeline that uses SmartCache to efficiently generate data for training/testing.
Note that data_list_file_path must point to a json file that is similar to what you get from http://medicaldecathlon.com/.
- Parameters
task (string) – Task to perform
data_list_file_path (string) – The path to the json file
data_file_base_dir (string) – The base directory of the dataset
data_list_key (string) – The key to get the list of dictionaries to be used
crop_size (tuple, list) – Crop size of the output data
transforms – A list of transforms to be applied to the data
data_format – Format of the output data. Must be a valid format from DataFormat; see ai4med.common.data_format
num_data_dims (int) – Number of dimensions of output images
num_channels (int) – Number of channels of output images
image_dtype (string) – Data type of output images
num_label_channels (int) – Number of channels of output label images (for segmentation task)
label_format – Format of output labels (for classification tasks); refer to ai4med.common.label_format
label_dtype (string) – Data type for output labels
batch_size (int) – Batch size of output
num_workers (int) – Number of worker processes for data transformation
prefetch_size (int) – Number of data subjects to prefetch
shuffle (bool) – To shuffle the data or not
duplicate_count (int) – Number of times to duplicate the datalist.
extra_inputs – Extra placeholders for data inputs
num_cache_objects (int) – Number of objects to be cached
replace_percent (float) – The percent of cached data to be replaced in every epoch
caches_data (bool) – Whether to cache data in memory.
batch_transforms (list) – List of transforms to be applied to batched data.
Note: SmartCache has a content rotation feature that, based on the config parameters, determines the content of the cache by dynamically rotating through the whole dataset. Only the data in the cache is used for training.
If you set caches_data to False, then you are only using the content rotation feature without incurring any memory consumption.

class KerasImagePipeline(task, data_list_file_path, data_file_base_dir, data_list_key, crop_size, transforms, data_format='channels_first', num_data_dims=3, num_channels=1, image_dtype='float32', num_label_channels=1, label_format=None, label_dtype='float32', batch_size=10, num_workers=4, prefetch_size=20, shuffle=True, repeat=True, duplicate_count=1, extra_inputs=None, batch_transforms=None, items_per_category=None, category_weights=None, multiprocessing=False, sampling=None)
Bases: ai4med.components.data.base_image_pipeline.BaseImagePipeline
Implementation of the data pipeline using Keras.
Note: This class uses Keras's data enqueuer to manage worker threads.
- Parameters
task (string) – Task to perform
data_list_file_path (string) – The path to the json file
data_file_base_dir (string) – The base directory of the dataset
data_list_key (string) – The key to get the list of dictionaries to be used
crop_size (tuple, list) – Crop size of the output data
transforms – A list of transforms to be applied to the data
data_format – Format of the output data. Must be a valid format from DataFormat; see ai4med.common.data_format
num_data_dims (int) – Number of dimensions of output images
num_channels (int) – Number of channels of output images
image_dtype (string) – Data type of output images
num_label_channels (int) – Number of channels of output label images (for segmentation task)
label_format – Format of output labels (for classification tasks); refer to ai4med.common.label_format
label_dtype (string) – Data type for output labels
batch_size (int) – Batch size of output
num_workers (int) – Number of worker threads for data transformation
prefetch_size (int) – Number of data subjects to prefetch
shuffle (bool) – To shuffle the data or not
duplicate_count (int) – Number of times to duplicate the datalist.
extra_inputs – Extra placeholders for data inputs
batch_transforms (list) – List of transforms to be applied to batched data.
multiprocessing (bool) – Whether to use the multiprocessing library or Python's native threading. Default: False.
sampling (str) – Whether to use weighted sampling for the data. The default, sampling=None, means no weighted sampling. Options are 'element' and 'automatic': 'element' picks weights from the dataset json; 'automatic' calculates them based on the number of elements in each class.

begin_generator()

build(build_ctx: ai4med.common.build_ctx.BuildContext)
Builds the Keras pipeline using a Sequence generator and starts the queue operation.

create_data_gen_and_enqueuer()

create_sample_weights()
Adds weights to the items list if sampling is enabled.

get_batched_data(session)
Get the next batch of data. In Keras, the session is not used.

get_data_property()
Get the property of produced data.

get_dataset_size()
Get the size of the Keras dataset.

get_extra_inputs()
Get the placeholder specs for extra inputs.

initialize_dataset(session, state=-1)
Initializes the dataset. Note: this method is called at the beginning of each training epoch.
- Parameters
session – the TF session.
state – the current training state. It is usually the epoch number. State -1 means the training has not started.

number_of_subjects_per_batch()
Get the number of subjects used to produce a batch. Depending on how the training data is transformed, training samples in the same batch could be produced from one or more subjects.
Returns: number of subjects used to produce a batch

set_sharding(rank, num_shards, equal_shard_size=True, fixed_shard_data=False)
Computes the parameters for dividing the dataset into multiple partitions (shards). This is used for Horovod-based multi-GPU training.
- Parameters
rank (int) – the rank of the current process
num_shards (int) – total number of shards
equal_shard_size (bool) – whether to make all shards equal size
fixed_shard_data (bool) – whether to make the content of each shard fixed. If not fixed, the content of each shard is recomputed randomly across the whole dataset each time the dataset is initialized.

shutdown()
Shut down the image pipeline and clean up dataset resources.
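Example: fetching batches from a built Keras pipeline. A sketch; since the Keras pipeline does not use a TF session, None can be passed where a session is expected.

    # Hypothetical sketch of the Keras pipeline lifecycle
    pipeline.begin_generator()               # ensure the enqueuer is running (an assumption; build may already start it)
    batch = pipeline.get_batched_data(None)  # the session argument is ignored in Keras
    # ... train on `batch` ...
    pipeline.shutdown()                      # stop workers and release resources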

class KerasImagePipelineWithCache(task, data_list_file_path, data_file_base_dir, data_list_key, crop_size, transforms, data_format='channels_first', num_data_dims=3, num_channels=1, image_dtype='float32', num_label_channels=1, label_format=None, label_dtype='float32', batch_size=10, num_workers=4, prefetch_size=20, shuffle=True, repeat=True, duplicate_count=1, extra_inputs=None, num_cache_objects=10000, replace_percent=0.1, caches_data=True, batch_transforms=None, items_per_category=None, category_weights=None, multiprocessing=False, sampling=None)
Bases: ai4med.components.data.keras_image_pipeline.KerasImagePipeline
An implementation of KerasPipeline that uses SmartCache to efficiently generate data for training/testing.
Note: This class uses Keras's data enqueuer to manage worker threads.
- Parameters
task (string) – Task to perform
data_list_file_path (string) – The path to the json file
data_file_base_dir (string) – The base directory of the dataset
data_list_key (string) – The key to get the list of dictionaries to be used
crop_size (tuple, list) – Crop size of the output data
transforms – A list of transforms to be applied to the data
data_format – Format of the output data. Must be a valid format from DataFormat; see ai4med.common.data_format
num_data_dims (int) – Number of dimensions of output images
num_channels (int) – Number of channels of output images
image_dtype (string) – Data type of output images
num_label_channels (int) – Number of channels of output label images (for segmentation task)
label_format – Format of output labels (for classification tasks); refer to ai4med.common.label_format
label_dtype (string) – Data type for output labels
batch_size (int) – Batch size of output
num_workers (int) – Number of worker processes for data transformation
prefetch_size (int) – Number of data subjects to prefetch
shuffle (bool) – To shuffle the data or not
duplicate_count (int) – Number of times to duplicate the datalist.
extra_inputs – Extra placeholders for data inputs
num_cache_objects (int) – Number of objects to be cached
replace_percent (float) – The percent of cached data to be replaced in every epoch
caches_data (bool) – Whether to cache data in memory.
batch_transforms (list) – List of transforms to be applied to batched data.
multiprocessing (bool) – Whether to use the multiprocessing library or Python's native threading. Default: False.
sampling (str) – Whether to use weighted sampling for the data. The default, sampling=None, means no weighted sampling. Options are 'element' and 'automatic': 'element' picks weights from the dataset json; 'automatic' calculates them based on the number of elements in each class.
Note: SmartCache has a content rotation feature that, based on the config parameters, determines the content of the cache by dynamically rotating through the whole dataset. Only the data in the cache is used for training.
If you set caches_data to False, then you are only using the content rotation feature without incurring any memory consumption.

class SegmentationImagePipeline(data_list_file_path, data_file_base_dir, data_list_key, output_crop_size, transforms, output_data_format='channels_first', output_data_dims=3, output_image_channels=1, output_image_dtype='float32', output_label_channels=1, output_label_dtype='float32', output_batch_size=10, batched_by_transforms=False, num_workers=4, prefetch_size=20, shuffle=True, repeat=True, duplicate_count=1, extra_inputs=None, batch_transforms=None)
Bases: ai4med.components.data.image_pipeline.ImagePipeline
An ImagePipeline for segmentation tasks.
Note that data_list_file_path must point to a json file that is similar to what you get from http://medicaldecathlon.com/.
- Parameters
data_list_file_path (string) – The path to the json file
data_file_base_dir (string) – The base directory of the dataset
data_list_key (string) – The key to get the list of dictionaries to be used
output_crop_size (tuple, list) – Crop size of the output data
transforms – A list of transforms to be applied to the data
output_data_format – Format of the output data. Must be a valid format from DataFormat; see ai4med.common.data_format
output_data_dims (int) – Number of dimensions of output images
output_image_channels (int) – Number of channels of output images
output_image_dtype (string) – Data type of output images
output_label_channels (int) – Number of channels of output label images
output_label_dtype (string) – Data type for output label images
output_batch_size (int) – Batch size of output data
batched_by_transforms (bool) – Batching can be done either by the transforms or by the TF dataset. This parameter specifies how the batching is done.
num_workers (int) – Number of worker threads for data transformation
prefetch_size (int) – Number of data subjects to prefetch
shuffle (bool) – To shuffle the data or not
duplicate_count (int) – Number of times to duplicate the datalist.
extra_inputs – Extra placeholders for data inputs
batch_transforms (list) – List of transforms to be applied to batched data.
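Example: a minimal segmentation pipeline sketch. Placeholder paths and transforms; the import path is assumed from the package name.

    # Hypothetical usage sketch for a segmentation task
    from ai4med.components.data import SegmentationImagePipeline  # import path assumed

    pipeline = SegmentationImagePipeline(
        data_list_file_path='/workspace/dataset.json',  # placeholder path
        data_file_base_dir='/workspace',                # placeholder base directory
        data_list_key='training',
        output_crop_size=(96, 96, 96),
        transforms=[],                # placeholder transform list
        output_image_channels=1,
        output_label_channels=1,      # single-channel label images
        output_label_dtype='float32',
        output_batch_size=10,
    )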

class SegmentationImagePipelineWithCache(data_list_file_path, data_file_base_dir, data_list_key, output_crop_size, transforms, output_data_format='channels_first', output_data_dims=3, output_image_channels=1, output_image_dtype='float32', output_label_channels=1, output_label_dtype='float32', output_batch_size=10, batched_by_transforms=False, num_workers=4, prefetch_size=20, shuffle=True, repeat=True, duplicate_count=1, extra_inputs=None, num_cache_objects=10000, replace_percent=0.1, caches_data=True, batch_transforms=None)
Bases: ai4med.components.data.image_pipeline_with_cache.ImagePipelineWithCache
An implementation of DataPipeline that uses SmartCache to efficiently generate data for training/testing of segmentation tasks.
Note that data_list_file_path must point to a json file that is similar to what you get from http://medicaldecathlon.com/.
- Parameters
data_list_file_path (string) – The path to the json file
data_file_base_dir (string) – The base directory of the dataset
data_list_key (string) – The key to get the list of dictionaries to be used
output_crop_size (tuple, list) – Crop size of the output data
transforms – A list of transforms to be applied to the data
output_data_format – Format of the output data. Must be a valid format from DataFormat; see ai4med.common.data_format
output_data_dims (int) – Number of dimensions of output images
output_image_channels (int) – Number of channels of output images
output_image_dtype (string) – Data type of output images
output_label_channels (int) – Number of channels of output label images
output_label_dtype (string) – Data type for output label images
output_batch_size (int) – Batch size of output data
batched_by_transforms (bool) – Batching can be done either by transforms or by the TF dataset. This parameter specifies how the batching is done.
num_workers (int) – Number of worker threads for data transformation
prefetch_size (int) – Number of data subjects to prefetch
shuffle (bool) – To shuffle the data or not
duplicate_count (int) – Number of times to duplicate the datalist.
extra_inputs – Extra placeholders for data inputs
num_cache_objects (int) – Number of objects to be cached
replace_percent (float) – The percent of cached data to be replaced in every epoch
caches_data (bool) – Whether to cache data in memory.
batch_transforms (list) – List of transforms to be applied to batched data.
Note: SmartCache has a content rotation feature that, based on the config parameters, determines the content of the cache by dynamically rotating through the whole dataset. Only the data in the cache is used for training.
If you set caches_data to False, then you are only using the content rotation feature without incurring any memory consumption.

class SegmentationKerasImagePipeline(data_list_file_path, data_file_base_dir, data_list_key, output_crop_size, transforms, output_data_format='channels_first', output_data_dims=3, output_image_channels=1, output_image_dtype='float32', output_label_channels=1, output_label_dtype='float32', output_batch_size=10, batched_by_transforms=False, num_workers=4, prefetch_size=20, shuffle=True, repeat=True, duplicate_count=1, extra_inputs=None, batch_transforms=None, multiprocessing=False)
Bases: ai4med.components.data.keras_image_pipeline.KerasImagePipeline
An ImagePipeline for segmentation tasks using the Keras backend.
Note that data_list_file_path must point to a json file that is similar to what you get from http://medicaldecathlon.com/.
- Parameters
data_list_file_path (string) – The path to the json file
data_file_base_dir (string) – The base directory of the dataset
data_list_key (string) – The key to get the list of dictionaries to be used
output_crop_size (tuple, list) – Crop size of the output data
transforms – A list of transforms to be applied to the data
output_data_format – Format of the output data. Must be a valid format from DataFormat; see ai4med.common.data_format
output_data_dims (int) – Number of dimensions of output images
output_image_channels (int) – Number of channels of output images
output_image_dtype (string) – Data type of output images
output_label_channels (int) – Number of channels of output label images
output_label_dtype (string) – Data type for output label images
output_batch_size (int) – Batch size of output data
batched_by_transforms (bool) – Batching can be done either by the transforms or by the TF dataset. This parameter specifies how the batching is done.
num_workers (int) – Number of worker threads for data transformation
prefetch_size (int) – Number of data subjects to prefetch
shuffle (bool) – To shuffle the data or not
duplicate_count (int) – Number of times to duplicate the datalist.
extra_inputs – Extra placeholders for data inputs
multiprocessing (bool) – Whether to use the multiprocessing library or Python's native threading. Default: False.
batch_transforms (list) – List of transforms to be applied to batched data.

class SegmentationKerasImagePipelineWithCache(data_list_file_path, data_file_base_dir, data_list_key, output_crop_size, transforms, output_data_format='channels_first', output_data_dims=3, output_image_channels=1, output_image_dtype='float32', output_label_channels=1, output_label_dtype='float32', output_batch_size=10, batched_by_transforms=False, num_workers=4, prefetch_size=20, shuffle=True, repeat=True, duplicate_count=1, extra_inputs=None, num_cache_objects=10000, replace_percent=0.1, caches_data=True, batch_transforms=None, multiprocessing=False)
Bases: ai4med.components.data.keras_image_pipeline_with_cache.KerasImagePipelineWithCache
An implementation of DataPipeline that uses SmartCache to efficiently generate data for training/testing of segmentation tasks.
Note that data_list_file_path must point to a json file that is similar to what you get from http://medicaldecathlon.com/.
- Parameters
data_list_file_path (string) – The path to the json file
data_file_base_dir (string) – The base directory of the dataset
data_list_key (string) – The key to get the list of dictionaries to be used
output_crop_size (tuple, list) – Crop size of the output data
transforms – A list of transforms to be applied to the data
output_data_format – Format of the output data. Must be a valid format from DataFormat; see ai4med.common.data_format
output_data_dims (int) – Number of dimensions of output images
output_image_channels (int) – Number of channels of output images
output_image_dtype (string) – Data type of output images
output_label_channels (int) – Number of channels of output label images
output_label_dtype (string) – Data type for output label images
output_batch_size (int) – Batch size of output data
batched_by_transforms (bool) – Batching can be done either by transforms or by the TF dataset. This parameter specifies how the batching is done.
num_workers (int) – Number of worker threads for data transformation
prefetch_size (int) – Number of data subjects to prefetch
shuffle (bool) – To shuffle the data or not
duplicate_count (int) – Number of times to duplicate the datalist.
extra_inputs – Extra placeholders for data inputs
num_cache_objects (int) – Number of objects to be cached
replace_percent (float) – The percent of cached data to be replaced in every epoch
caches_data (bool) – Whether to cache data in memory
batch_transforms (list) – List of transforms to be applied to batched data.
multiprocessing (bool) – Whether to use the multiprocessing library or Python's native threading. Default: False.
Note: SmartCache has a content rotation feature that, based on the config parameters, determines the content of the cache by dynamically rotating through the whole dataset. Only the data in the cache is used for training.
If you set caches_data to False, then you are only using the content rotation feature without incurring any memory consumption.