Creating a dataloader

melissa.server.deep_learning.dataset.make_dataloader(framework_t, iter_dataset, batch_size, collate_fn=None, num_workers=0, **extra_torch_dl_args)

Factory function to create dataloader based on the specified deep learning framework.

Parameters
  • framework_t (FrameworkType): The type of framework (DEFAULT, TORCH or TENSORFLOW).
  • iter_dataset (MelissaIterableDataset): An iterable dataset that streams data via its __iter__ method.
  • batch_size (int): Number of samples per batch.
  • collate_fn (Callable, optional): A function to combine multiple samples into a batch. Defaults to None, which creates batches as lists of samples.
  • num_workers (int, optional): Number of worker threads for parallel data loading. Defaults to 0 (no threading).
  • extra_torch_dl_args (Dict[str, Any], optional): Extra kwargs for torch.utils.data.DataLoader.
Returns
  • Union[GeneralDataLoader, torch.utils.data.DataLoader, tensorflow.data.Dataset]: An iterable for training over batches.
Raises
  • RuntimeError: If the specified framework is not found.
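
As a rough illustration of the framework selection and the RuntimeError described above, the following standalone sketch maps a framework type to the kind of loader listed under Returns. It does not import Melissa. The FrameworkType enum here is a hypothetical stand-in, select_backend and its is_installed parameter are invented for the example, and reading "framework is not found" as "the package cannot be imported" is an assumption rather than documented behaviour.

```python
import enum
import importlib.util
from typing import Callable, Dict, Optional


class FrameworkType(enum.Enum):
    """Hypothetical stand-in for the FrameworkType named in the docs."""
    DEFAULT = 0
    TORCH = 1
    TENSORFLOW = 2


# Package each framework type is assumed to need (None: nothing extra).
_REQUIRED_PACKAGE: Dict[FrameworkType, Optional[str]] = {
    FrameworkType.DEFAULT: None,
    FrameworkType.TORCH: "torch",
    FrameworkType.TENSORFLOW: "tensorflow",
}

# Loader type listed under "Returns" for each framework type.
_LOADER_KIND: Dict[FrameworkType, str] = {
    FrameworkType.DEFAULT: "GeneralDataLoader",
    FrameworkType.TORCH: "torch.utils.data.DataLoader",
    FrameworkType.TENSORFLOW: "tensorflow.data.Dataset",
}


def _is_installed(package: str) -> bool:
    """Check whether a top-level package can be imported, without importing it."""
    return importlib.util.find_spec(package) is not None


def select_backend(
    framework_t: FrameworkType,
    is_installed: Callable[[str], bool] = _is_installed,
) -> str:
    """Return the name of the loader kind to build, or raise RuntimeError."""
    package = _REQUIRED_PACKAGE[framework_t]
    if package is not None and not is_installed(package):
        raise RuntimeError(
            f"{framework_t.name} requested but '{package}' is not installed"
        )
    return _LOADER_KIND[framework_t]


if __name__ == "__main__":
    print(select_backend(FrameworkType.DEFAULT))
```

The is_installed hook exists only so the failure path can be exercised without uninstalling anything. A real factory would go on to construct the loader instead of returning a label.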

Iterable Dataset Classes

melissa.server.deep_learning.dataset.GeneralDataLoader

A general-purpose data loader designed to handle streaming datasets with optional multi-threaded loading and batch collation.

This class supports datasets like MelissaIterableDataset that provide infinite or streaming data. It enables efficient batching and parallel data loading while ensuring compatibility with custom collation functions.

Parameters
  • dataset (MelissaIterableDataset): An iterable dataset that streams data via its __iter__ method.
  • batch_size (int): Number of samples per batch.
  • collate_fn (Callable, optional): A function to combine multiple samples into a batch. Defaults to None, which creates batches as lists of samples.
  • num_workers (int, optional): Number of worker threads for parallel data loading. Defaults to 0 (no threading).
  • drop_last (bool, optional): Whether to drop the last incomplete batch. Defaults to True.
Attributes
  • dataset (MelissaIterableDataset): The dataset being wrapped for batching and loading.
  • batch_size (int): Size of each batch produced by the data loader.
  • collate_fn (Optional[Callable]): The function used to collate samples into batches.
  • num_workers (int): Number of worker threads for parallel data loading.
  • drop_last (bool): Indicates if incomplete batches are dropped.
  • _queue (queue.Queue): An internal buffer to hold preloaded samples during multi-threaded loading.
  • _stop_event (threading.Event): A flag to signal worker threads to stop loading data.
  • _threads (List[threading.Thread]): List of worker threads for parallel data loading.
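
To make the roles of the queue, stop event and worker threads more concrete, here is a minimal standalone sketch of a threaded streaming batcher. It is not Melissa's implementation: StreamingBatcher, its helper methods and the buffer size are hypothetical, and the local variables buf, stop and threads only play the roles that _queue, _stop_event and _threads are described as playing above.

```python
import queue
import threading
from typing import Any, Callable, Iterable, Iterator, List, Optional

_DONE = object()  # marker a worker enqueues once the source is exhausted


class StreamingBatcher:
    """Hypothetical, simplified stand-in for a streaming data loader."""

    def __init__(self, dataset: Iterable[Any], batch_size: int,
                 collate_fn: Optional[Callable[[List[Any]], Any]] = None,
                 num_workers: int = 0, drop_last: bool = True) -> None:
        self.dataset = dataset
        self.batch_size = batch_size
        self.collate_fn = collate_fn
        self.num_workers = num_workers
        self.drop_last = drop_last

    @staticmethod
    def _worker(source: Iterator[Any], lock: threading.Lock,
                buf: "queue.Queue[Any]", stop: threading.Event) -> None:
        while not stop.is_set():
            with lock:  # the shared iterator is not thread-safe
                try:
                    item = next(source)
                except StopIteration:
                    item = _DONE
            # Retry with a timeout so a stopped consumer cannot block us forever.
            while not stop.is_set():
                try:
                    buf.put(item, timeout=0.1)
                    break
                except queue.Full:
                    continue
            if item is _DONE:
                return

    def _samples(self) -> Iterator[Any]:
        if self.num_workers == 0:
            yield from self.dataset  # no threading: read the stream directly
            return
        source, lock = iter(self.dataset), threading.Lock()
        buf: "queue.Queue[Any]" = queue.Queue(maxsize=4 * self.batch_size)
        stop = threading.Event()
        threads = [
            threading.Thread(target=self._worker,
                             args=(source, lock, buf, stop), daemon=True)
            for _ in range(self.num_workers)
        ]
        for thread in threads:
            thread.start()
        finished = 0
        try:
            while finished < self.num_workers:
                item = buf.get()
                if item is _DONE:
                    finished += 1
                else:
                    yield item
        finally:
            stop.set()  # also runs when this generator is closed early

    def _collate(self, batch: List[Any]) -> Any:
        return self.collate_fn(batch) if self.collate_fn else batch

    def __iter__(self) -> Iterator[Any]:
        batch: List[Any] = []
        for sample in self._samples():
            batch.append(sample)
            if len(batch) == self.batch_size:
                yield self._collate(batch)
                batch = []
        if batch and not self.drop_last:
            yield self._collate(batch)  # keep the final, shorter batch


if __name__ == "__main__":
    for batch in StreamingBatcher(range(8), batch_size=3):
        print(batch)
```

With num_workers greater than zero, the order in which samples reach the consumer depends on thread scheduling, so this sketch makes no ordering guarantee across workers. The sketch also terminates when the source is exhausted, which a truly infinite stream never does.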

__iter__()

Iterate over the dataset and yield batches.
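
Each yielded batch is either a plain list of samples or whatever collate_fn returns. The sketch below shows one possible collate function for samples shaped as (features, labels) pairs. The name collate_pairs and the sample layout are illustrative assumptions, as is the idea that collate_fn receives the batch as a list of samples.

```python
from typing import List, Sequence, Tuple

Sample = Tuple[Sequence[float], Sequence[float]]


def collate_pairs(
    samples: List[Sample],
) -> Tuple[List[Sequence[float]], List[Sequence[float]]]:
    """Turn [(x0, y0), (x1, y1), ...] into ([x0, x1, ...], [y0, y1, ...]).

    Assumes a non-empty list in which every sample is a 2-tuple.
    """
    features, labels = zip(*samples)
    return list(features), list(labels)


if __name__ == "__main__":
    raw_batch = [([0.0, 1.0], [0.5]), ([2.0, 3.0], [1.5])]
    features, labels = collate_pairs(raw_batch)
    print(features)  # [[0.0, 1.0], [2.0, 3.0]]
    print(labels)    # [[0.5], [1.5]]
```

In practice a collate function would usually go one step further and stack the grouped features and labels into arrays or tensors of the chosen framework.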

__len__()

__len__ is not supported for infinite datasets.

melissa.server.deep_learning.dataset.torch_dataset.as_torch_dataloader

Creates a torch DataLoader using the iterable dataset.

Parameters
  • iter_dataset (TorchMelissaIterableDataset): An iterable dataset that streams data via its __iter__ method.
  • batch_size (int): Number of samples per batch.
  • collate_fn (Callable, optional): A function to combine multiple samples into a batch. Defaults to None, which creates batches as lists of samples.
  • num_workers (int, optional): Number of worker threads for parallel data loading. Defaults to 0 (no threading).
  • extra_torch_dl_args (Dict[str, Any], optional): Extra kwargs for torch.utils.data.DataLoader.
Returns
  • torch.utils.data.DataLoader: A torch dataloader instance for training over batches.

melissa.server.deep_learning.dataset.tf_dataset.as_tensorflow_dataset

Converts the iterable dataset into a TensorFlow tf.data.Dataset.

This function uses TensorFlow's from_generator to wrap the given iterable dataset in a tf.data.Dataset, so that it can be used in TensorFlow's data processing pipelines.

Parameters
  • iter_dataset (TfMelissaIterableDataset): An iterable dataset instance defining __iter__ method.
  • batch_size (int): Batch size for the iterable.
  • collate_fn (Callable, optional): A function to combine multiple samples into a batch.
Returns
  • tf.data.Dataset: A TensorFlow dataset with elements structured as (features, labels). Both features and labels are of type tf.float32 with dynamic shapes (None).