fusilli.data

Data loading classes for multimodal and unimodal data. This module contains functions and classes for loading the data for the different modalities and for training the subspace methods on that data (when a subspace method needs pre-training). Train/test splits and k-fold cross validation are also implemented here.

Functions

downsample_img_batch(imgs, output_size)

Downsamples a batch of images to a specified size.

prepare_fusion_data(prediction_task, ...[, ...])

Gets the data module for a specific fusion model and training protocol.

Classes

CustomDataset(pred_features, labels)

Custom dataset class for multimodal data.

KFoldDataModule(fusion_model, sources, ...)

Custom PyTorch Lightning datamodule class for the different modalities with k-fold cross validation.

KFoldGraphDataModule(num_folds, ...[, ...])

Custom PyTorch Lightning datamodule class for the different modalities with graph data structure and k-fold cross validation.

LoadDatasets(sources[, img_downsample_dims])

Class for loading the different datasets for the different modalities.

TrainTestDataModule(fusion_model, sources, ...)

Custom PyTorch Lightning datamodule class for the different modalities.

TrainTestGraphDataModule(fusion_model, ...)

Custom PyTorch Lightning datamodule class for the different modalities with graph data structure.

class CustomDataset(pred_features, labels)[source]

Bases: Dataset

Custom dataset class for multimodal data.

multimodal_flag

Flag for multimodal data. True if multimodal, False if unimodal.

Type:

bool

dataset1

Tensor of predictive features for modality 1.

Type:

tensor

dataset2

Tensor of predictive features for modality 2.

Type:

tensor

dataset

Tensor of predictive features for uni-modal data.

Type:

tensor

labels

Tensor of labels.

Type:

tensor

__init__(pred_features, labels)[source]
Parameters:
  • pred_features (list or tensor) – List of tensors or tensor of predictive features (i.e. tabular or image data without labels).

  • labels (dataframe) – Dataframe of labels (column name must be “prediction_label”).

Raises:

ValueError – If pred_features is not a list or tensor.

class KFoldDataModule(fusion_model, sources, output_paths, prediction_task, batch_size, num_folds, multiclass_dimensions, subspace_method=None, image_downsample_size=None, layer_mods=None, max_epochs=1000, extra_log_string_dict=None, own_early_stopping_callback=None, num_workers=0, own_kfold_indices=None, kwargs=None)[source]

Bases: LightningDataModule

Custom PyTorch Lightning datamodule class for the different modalities with k-fold cross validation.

num_folds

Total number of folds.

Type:

int

sources

List of source files (csv or torch files), in the order [Tabular1, Tabular2, Image].

Type:

list

output_paths

Dictionary of output paths for saving the checkpoints, figures, and the losses.

Type:

dict

image_downsample_size

Size to downsample the images to (height, width, depth) or (height, width) for 2D images. None if not downsampling. (default None)

Type:

tuple

modality_methods

Dictionary of methods for loading the different modalities.

Type:

dict

fusion_model

Fusion model class. e.g. “TabularCrossmodalAttention”.

Type:

class

batch_size

Batch size. Required by this class; the default of 8 is set by prepare_fusion_data.

Type:

int

prediction_task

Prediction type (binary, multiclass, regression).

Type:

str

multiclass_dimensions

Number of classes for multiclass prediction (default None).

Type:

int

subspace_method

Subspace method class (default None) (only for subspace methods).

Type:

class

layer_mods

Dictionary of layer modifications to make to the subspace method. (default None)

Type:

dict

max_epochs

Maximum number of epochs to train subspace methods for. (default 1000)

Type:

int

dataset

Tensor of predictive features. Created in prepare_data().

Type:

tensor

data_dims

List of data dimensions [mod1_dim, mod2_dim, img_dim]. Created in prepare_data().

Type:

list

train_dataset

Tensor of predictive features for training. Created in setup().

Type:

tensor

test_dataset

Tensor of predictive features for testing. Created in setup().

Type:

tensor

own_early_stopping_callback

Early stopping callback class.

Type:

pytorch_lightning.callbacks.EarlyStopping

num_workers

Number of workers for the dataloader (default 0).

Type:

int

own_kfold_indices

List of indices to use for k-fold cross validation (default None). If None, the k-fold indices are randomly selected. Structure is a list of tuples of (train_indices, test_indices). Must be the same length as num_folds.

Type:

list

kwargs

Dictionary of extra arguments for the subspace method class.

Type:

dict

__init__(fusion_model, sources, output_paths, prediction_task, batch_size, num_folds, multiclass_dimensions, subspace_method=None, image_downsample_size=None, layer_mods=None, max_epochs=1000, extra_log_string_dict=None, own_early_stopping_callback=None, num_workers=0, own_kfold_indices=None, kwargs=None)[source]
Parameters:
  • fusion_model (class) – Fusion model class. e.g. “TabularCrossmodalAttention”.

  • sources (list) – List of source data files: csv or torch files.

  • output_paths (dict) – Dictionary of output paths for saving the checkpoints, figures, and the losses.

  • prediction_task (str) – Prediction task (binary, multiclass, regression).

  • batch_size (int) – Batch size.

  • num_folds (int) – Total number of folds.

  • test_size (float) – Fraction of data to use for testing (default 0.2). Not needed for k-fold cross validation and absent from the signature above; it is listed only for consistency with TrainTestDataModule.

  • multiclass_dimensions (int) – Number of classes for multiclass prediction (default None).

  • subspace_method (class) – Subspace method class (default None) (only for subspace methods).

  • image_downsample_size (tuple) – Size to downsample the images to (height, width, depth) or (height, width) for 2D images. None if not downsampling. (default None)

  • layer_mods (dict) – Dictionary of layer modifications to make to the subspace method. (default None)

  • max_epochs (int) – Maximum number of epochs to train subspace methods for. (default 1000)

  • extra_log_string_dict (dict) – Dictionary of extra strings to add to the log.

  • own_early_stopping_callback (pytorch_lightning.callbacks.EarlyStopping) – Early stopping callback class (default None).

  • num_workers (int) – Number of workers for the dataloader (default 0).

  • own_kfold_indices (list) – List of indices to use for k-fold cross validation (default None). If None, the k-fold indices are randomly selected. Structure is a list of tuples of (train_indices, test_indices). Must be the same length as num_folds.

  • kwargs (dict) – Dictionary of extra arguments for the subspace method class.
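
The structure expected for own_kfold_indices, a list of (train_indices, test_indices) tuples with one entry per fold, can be illustrated with a small standard-library sketch. The helper name make_kfold_indices and its seed argument are hypothetical and not part of fusilli; the library's own random splitting may work differently.

```python
import random


def make_kfold_indices(num_samples, num_folds, seed=0):
    """Return a list of (train_indices, test_indices) tuples, one per fold.

    Hypothetical helper for illustration only; not part of fusilli.
    """
    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices round-robin into num_folds test sets.
    test_sets = [indices[i::num_folds] for i in range(num_folds)]
    folds = []
    for test_indices in test_sets:
        held_out = set(test_indices)
        # Every index that is not held out in this fold is used for training.
        train_indices = [i for i in indices if i not in held_out]
        folds.append((train_indices, test_indices))
    return folds


# The length of the list matches num_folds, as the parameter description requires.
own_kfold_indices = make_kfold_indices(num_samples=10, num_folds=5)
```

A list built this way has the documented shape and could be passed as own_kfold_indices together with num_folds=5.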

kfold_split()[source]

Splits the dataset into k folds

Returns:

folds – List of tuples of (train_dataset, test_dataset)

Return type:

list

prepare_data()[source]

Loads the data with LoadDatasets class

Returns:

  • dataset (tensor) – Tensor of predictive features.

  • data_dims (list) – List of data dimensions [mod1_dim, mod2_dim, img_dim], e.g. [None, None, [100, 100, 100]] for image-only data (image dimensions 100 x 100 x 100), or [8, 32, None] for tabular1 and tabular2 with no image (tabular1 has 8 features, tabular2 has 32 features).
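
As a reading aid for the data_dims convention, the following sketch turns a [mod1_dim, mod2_dim, img_dim] list into a short description. The function describe_data_dims is hypothetical and exists only to show how None marks an absent modality.

```python
def describe_data_dims(data_dims):
    """Summarise which modalities are present in a data_dims list.

    Hypothetical helper for illustration only; not part of fusilli.
    """
    mod1_dim, mod2_dim, img_dim = data_dims
    parts = []
    # A None entry means that modality is not part of the dataset.
    if mod1_dim is not None:
        parts.append(f"tabular1: {mod1_dim} features")
    if mod2_dim is not None:
        parts.append(f"tabular2: {mod2_dim} features")
    if img_dim is not None:
        parts.append("image: " + " x ".join(str(d) for d in img_dim))
    return ", ".join(parts)


image_only = describe_data_dims([None, None, [100, 100, 100]])
tabular_pair = describe_data_dims([8, 32, None])
```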

setup(checkpoint_path=None)[source]

Splits the data into train and test sets, and runs the subspace method if specified

Parameters:

checkpoint_path (str) – Path to the checkpoint file for the subspace method (default None).

Returns:

  • train_dataloader (dataloader) – Dataloader for training.

  • val_dataloader (dataloader) – Dataloader for validation.

train_dataloader(fold_idx)[source]

Returns the dataloader for training.

Parameters:

fold_idx (int) – Index of the fold to use.

Returns:

dataloader – Dataloader for training.

Return type:

dataloader

val_dataloader(fold_idx)[source]

Returns the dataloader for validation.

Parameters:

fold_idx (int) – Index of the fold to use.

Returns:

dataloader – Dataloader for validation.

Return type:

dataloader

class KFoldGraphDataModule(num_folds, fusion_model, sources, graph_creation_method, image_downsample_size=None, layer_mods=None, extra_log_string_dict=None, own_kfold_indices=None)[source]

Bases: object

Custom PyTorch Lightning datamodule class for the different modalities with graph data structure and k-fold cross validation.

num_folds

Total number of folds.

Type:

int

image_downsample_size

Size to downsample the images to (height, width, depth) or (height, width) for 2D images.

Type:

tuple

sources

List of source csv files. [Tabular1, Tabular2, Image]

Type:

list

modality_methods

Dictionary of methods for loading the different modalities.

Type:

dict

fusion_model

Fusion model class. e.g. “TabularCrossmodalAttention”.

Type:

class

graph_creation_method

Graph creation method class.

Type:

class

graph_maker_instance

Graph maker class instance.

Type:

graph maker class

layer_mods

Dictionary of layer modifications to make to the graph maker method.

Type:

dict

dataset

Tensor of predictive features.

Type:

tensor

data_dims

List of data dimensions [mod1_dim, mod2_dim, img_dim]

Type:

list

folds

List of tuples of (graph_data, train_idxs, test_idxs)

Type:

list

__init__(num_folds, fusion_model, sources, graph_creation_method, image_downsample_size=None, layer_mods=None, extra_log_string_dict=None, own_kfold_indices=None)[source]
Parameters:
  • num_folds (int) – Total number of folds.

  • fusion_model (class) – Fusion model class. e.g. “TabularCrossmodalAttention”.

  • sources (list) – List of source csv files.

  • graph_creation_method (class) – Graph creation method class.

  • image_downsample_size (tuple) – Size to downsample the images to (height, width, depth) or (height, width) for 2D images. None if not downsampling. (default None)

  • layer_mods (dict) – Dictionary of layer modifications to make to the graph maker method. (default None)

  • extra_log_string_dict (dict) – Dictionary of extra strings to add to the log.

  • own_kfold_indices (list) – List of indices to use for k-fold cross validation (default None). If None, the k-fold indices are randomly selected. Structure is a list of tuples of (train_indices, test_indices). Must be the same length as num_folds.
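
Before passing own_kfold_indices, it can be useful to check the documented constraint that the list has one entry per fold. The sketch below does that with a hypothetical helper, check_kfold_indices. The additional check that train and test indices do not overlap is an assumption about what a sensible split looks like, not a documented requirement.

```python
def check_kfold_indices(own_kfold_indices, num_folds):
    """Raise ValueError if the fold list is malformed; return None otherwise.

    Hypothetical helper for illustration only; not part of fusilli.
    """
    if len(own_kfold_indices) != num_folds:
        raise ValueError(
            f"expected {num_folds} folds, got {len(own_kfold_indices)}"
        )
    for fold_number, (train_indices, test_indices) in enumerate(own_kfold_indices):
        # Assumption: a sample should not be in both halves of the same fold.
        overlap = set(train_indices) & set(test_indices)
        if overlap:
            raise ValueError(
                f"fold {fold_number} has overlapping indices: {sorted(overlap)}"
            )


folds = [
    ([2, 3, 4, 5], [0, 1]),
    ([0, 1, 4, 5], [2, 3]),
    ([0, 1, 2, 3], [4, 5]),
]
check_kfold_indices(folds, num_folds=3)  # passes silently
```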

get_lightning_module()[source]

Returns the lightning modules, one per fold, using the PyTorch Geometric lightning module to convert the graph data structure into PyTorch dataloaders.

Returns:

lightning_modules – List of lightning modules for each fold.

Return type:

list

kfold_split()[source]

Splits the dataset into k folds

Returns:

folds – List of tuples of (train_dataset, test_dataset)

Return type:

list

prepare_data()[source]

Loads the data with LoadDatasets class

Return type:

None

setup()[source]

Gets random train and test indices, and gets the graph data structure.

Return type:

None

class LoadDatasets(sources, img_downsample_dims=None)[source]

Bases: object

Class for loading the different datasets for the different modalities.

tabular1_source

Source csv file for tabular1 data.

Type:

str

tabular2_source

Source csv file for tabular2 data.

Type:

str

img_source

Source torch file for image data.

Type:

str

image_downsample_size

Size to downsample the images to (height, width, depth) or (height, width) for 2D images. None if not downsampling. (default None)

Type:

tuple

__init__(sources, img_downsample_dims=None)[source]
Parameters:
  • sources (list) – List of source csv files. [tabular1_source, tabular2_source, img_source]

  • img_downsample_dims (tuple) – Size to downsample the images to (height, width, depth) or (height, width) for 2D images. None if not downsampling. (default None)

Raises:
  • ValueError – If sources is not a list.

  • ValueError – If the CSVs do not have the right columns or if the index column is not named “ID”.
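
The column requirements can be checked ahead of time with a standard-library sketch such as the one below. It assumes that the index column "ID" is the first column of the CSV and that the label column is named "prediction_label", as stated for CustomDataset. The helper check_tabular_header is hypothetical, and the actual validation performed by LoadDatasets may differ.

```python
import csv
import io


def check_tabular_header(csv_text, label_column="prediction_label"):
    """Validate a CSV header and return the names of the feature columns.

    Hypothetical helper for illustration only; not part of fusilli.
    """
    header = next(csv.reader(io.StringIO(csv_text)))
    # Assumption: the index column comes first and must be called "ID".
    if header[0] != "ID":
        raise ValueError('the index column must be named "ID"')
    if label_column not in header:
        raise ValueError(f'missing label column "{label_column}"')
    return [name for name in header[1:] if name != label_column]


example_csv = "ID,age,score,prediction_label\n0,41,0.7,1\n1,35,0.2,0\n"
feature_columns = check_tabular_header(example_csv)
```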

load_img()[source]

Loads the image-only dataset

Returns:

  • dataset (tensor) – Tensor of predictive features.

  • data_dims (list) – List of data dimensions [mod1_dim, mod2_dim, img_dim], e.g. [None, None, [100, 100, 100]] for image-only data (image dimensions 100 x 100 x 100), or [8, 32, None] for tabular1 and tabular2 with no image (tabular1 has 8 features, tabular2 has 32 features).

load_tab_and_img()[source]

Loads the tabular1 and image multimodal dataset.

Returns:

  • dataset (tensor) – Tensor of predictive features.

  • data_dims (list) – List of data dimensions [mod1_dim, mod2_dim, img_dim], e.g. [None, None, [100, 100, 100]] for image-only data (image dimensions 100 x 100 x 100), or [8, 32, None] for tabular1 and tabular2 with no image (tabular1 has 8 features, tabular2 has 32 features).

load_tabular1()[source]

Loads the tabular1-only dataset

Returns:

  • dataset (tensor) – Tensor of predictive features.

  • data_dims (list) – List of data dimensions [mod1_dim, mod2_dim, img_dim], e.g. [None, None, [100, 100, 100]] for image-only data (image dimensions 100 x 100 x 100), or [8, 32, None] for tabular1 and tabular2 with no image (tabular1 has 8 features, tabular2 has 32 features).

load_tabular2()[source]

Loads the tabular2-only dataset

Returns:

  • dataset (tensor) – Tensor of predictive features.

  • data_dims (list) – List of data dimensions [mod1_dim, mod2_dim, img_dim], e.g. [None, None, [100, 100, 100]] for image-only data (image dimensions 100 x 100 x 100), or [8, 32, None] for tabular1 and tabular2 with no image (tabular1 has 8 features, tabular2 has 32 features).

load_tabular_tabular()[source]

Loads the tabular1 and tabular2 multimodal dataset

Returns:

  • dataset (tensor) – Tensor of predictive features.

  • data_dims (list) – List of data dimensions [mod1_dim, mod2_dim, img_dim], e.g. [None, None, [100, 100, 100]] for image-only data (image dimensions 100 x 100 x 100), or [8, 32, None] for tabular1 and tabular2 with no image (tabular1 has 8 features, tabular2 has 32 features).

class TrainTestDataModule(fusion_model, sources, output_paths, prediction_task, batch_size, test_size, multiclass_dimensions, subspace_method=None, image_downsample_size=None, layer_mods=None, max_epochs=1000, extra_log_string_dict=None, own_early_stopping_callback=None, num_workers=0, test_indices=None, kwargs=None)[source]

Bases: LightningDataModule

Custom PyTorch Lightning datamodule class for the different modalities.

sources

List of source csv files. [Tabular1, Tabular2, Image]

Type:

list

modality_methods

Dictionary of methods for loading the different modalities.

Type:

dict

fusion_model

Fusion model class. e.g. “TabularCrossmodalAttention”.

Type:

class

output_paths

Dictionary of output paths for saving the checkpoints, figures, and the losses.

Type:

dict

batch_size

Batch size. Required by this class; the default of 8 is set by prepare_fusion_data.

Type:

int

test_size

Fraction of data to use for testing. Required by this class; prepare_fusion_data uses a default of 0.2.

Type:

float

prediction_task

Prediction type (binary, multiclass, or regression).

Type:

str

multiclass_dimensions

Number of classes for multiclass prediction (default None).

Type:

int

subspace_method

Subspace method class (default None) (only for subspace methods).

Type:

class

layer_mods

Dictionary of layer modifications to make to the subspace method. (default None)

Type:

dict

max_epochs

Maximum number of epochs to train subspace methods for. (default 1000)

Type:

int

dataset

Tensor of predictive features. Created in prepare_data().

Type:

tensor

data_dims

List of data dimensions [mod1_dim, mod2_dim, img_dim]. Created in prepare_data().

Type:

list

train_dataset

Tensor of predictive features for training. Created in setup().

Type:

tensor

test_dataset

Tensor of predictive features for testing. Created in setup().

Type:

tensor

subspace_method_train

Subspace method class trained (only for subspace methods).

Type:

class

own_early_stopping_callback

Early stopping callback class.

Type:

pytorch_lightning.callbacks.EarlyStopping

num_workers

Number of workers for the dataloader (default 0).

Type:

int

test_indices

List of indices to use for testing (default None). If None, the test indices are randomly selected using the test_size parameter.

Type:

list

kwargs

Dictionary of extra arguments for the subspace method class.

Type:

dict

__init__(fusion_model, sources, output_paths, prediction_task, batch_size, test_size, multiclass_dimensions, subspace_method=None, image_downsample_size=None, layer_mods=None, max_epochs=1000, extra_log_string_dict=None, own_early_stopping_callback=None, num_workers=0, test_indices=None, kwargs=None)[source]
Parameters:
  • fusion_model (class) – Fusion model class. e.g. “TabularCrossmodalAttention”.

  • sources (list) – List of source csv files.

  • output_paths (dict) – Dictionary of output paths for saving the checkpoints, figures, and the losses.

  • prediction_task (str) – Prediction task (binary, multiclass, regression).

  • batch_size (int) – Batch size. Required by this class; the default of 8 is set by prepare_fusion_data.

  • test_size (float) – Fraction of data to use for testing. Required by this class; prepare_fusion_data uses a default of 0.2.

  • multiclass_dimensions (int) – Number of classes for multiclass prediction (default None).

  • subspace_method (class) – Subspace method class (default None) (only for subspace methods).

  • image_downsample_size (tuple) – Size to downsample the images to (height, width, depth) or (height, width) for 2D images. None if not downsampling. (default None)

  • layer_mods (dict) – Dictionary of layer modifications to make to the subspace method. (default None)

  • max_epochs (int) – Maximum number of epochs to train subspace methods for. (default 1000)

  • extra_log_string_dict (dict) – Dictionary of extra strings to add to the log.

  • own_early_stopping_callback (pytorch_lightning.callbacks.EarlyStopping) – Early stopping callback class (default None).

  • num_workers (int) – Number of workers for the dataloader (default 0).

  • test_indices (list) – List of indices to use for testing (default None). If None, the test indices are randomly selected using the test_size parameter.

  • kwargs (dict) – Dictionary of extra arguments for the subspace method class.
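
The interplay between test_size and test_indices can be sketched in plain Python: when test_indices is None a random fraction of the samples is held out, otherwise the given indices are used as they are. The helper split_indices, its seed argument, and the rounding of the test set size are assumptions for illustration, not the library's implementation.

```python
import random


def split_indices(num_samples, test_size=0.2, test_indices=None, seed=0):
    """Return (train_indices, test_indices) for a single train/test split.

    Hypothetical helper for illustration only; not part of fusilli.
    """
    all_indices = list(range(num_samples))
    if test_indices is None:
        # Assumption: the test set size is the rounded fraction of the data.
        num_test = round(num_samples * test_size)
        test_indices = sorted(random.Random(seed).sample(all_indices, num_test))
    held_out = set(test_indices)
    train_indices = [i for i in all_indices if i not in held_out]
    return train_indices, list(test_indices)


# Random split driven by test_size.
random_train, random_test = split_indices(10, test_size=0.2)
# Fixed split driven by user-supplied test_indices.
fixed_train, fixed_test = split_indices(10, test_indices=[0, 9])
```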

prepare_data()[source]

Loads the data with LoadDatasets class

Returns:

  • dataset (tensor) – Tensor of predictive features.

  • data_dims (list) – List of data dimensions [mod1_dim, mod2_dim, img_dim], e.g. [None, None, [100, 100, 100]] for image-only data (image dimensions 100 x 100 x 100), or [8, 32, None] for tabular1 and tabular2 with no image (tabular1 has 8 features, tabular2 has 32 features).

setup(checkpoint_path=None)[source]

Splits the data into train and test sets, and runs the subspace method if specified. If checkpoint_path is specified, the subspace method is loaded from the checkpoint and not trained.

Parameters:

checkpoint_path (str) – Path to the checkpoint file for the subspace method (default None).

Returns:

  • train_dataloader (dataloader) – Dataloader for training.

  • val_dataloader (dataloader) – Dataloader for validation.

train_dataloader()[source]

Returns the dataloader for training.

Returns:

dataloader – Dataloader for training.

Return type:

dataloader

val_dataloader()[source]

Returns the dataloader for validation.

Returns:

dataloader – Dataloader for validation.

Return type:

dataloader

class TrainTestGraphDataModule(fusion_model, sources, graph_creation_method, test_size, image_downsample_size=None, layer_mods=None, extra_log_string_dict=None, own_test_indices=None)[source]

Bases: object

Custom PyTorch Lightning datamodule class for the different modalities with graph data structure.

sources

List of source csv files.

Type:

list

image_downsample_size

Size to downsample the images to (height, width, depth) or (height, width) for 2D images.

Type:

tuple

modality_methods

Dictionary of methods for loading the different modalities.

Type:

dict

fusion_model

Fusion model class. e.g. “TabularCrossmodalAttention”.

Type:

class

test_size

Fraction of data to use for testing. Required by this class; prepare_fusion_data uses a default of 0.2.

Type:

float

graph_creation_method

Graph creation method class.

Type:

class

graph_maker_instance

Graph maker class instance.

Type:

graph maker class

layer_mods

Dictionary of layer modifications to make to the graph maker method.

Type:

dict

dataset

Tensor of predictive features. Created in prepare_data().

Type:

tensor

data_dims

List of data dimensions [mod1_dim, mod2_dim, img_dim]. Created in prepare_data().

Type:

list

train_idxs

List of indices for training. Created in setup().

Type:

list

test_idxs

List of indices for testing. Created in setup().

Type:

list

graph_data

Graph data structure. Created in setup().

Type:

graph data structure

own_test_indices

List of indices to use for testing (default None). If None, the test indices are randomly selected using the test_size parameter.

Type:

list

__init__(fusion_model, sources, graph_creation_method, test_size, image_downsample_size=None, layer_mods=None, extra_log_string_dict=None, own_test_indices=None)[source]
Parameters:
  • fusion_model (class) – Fusion model class. e.g. “TabularCrossmodalAttention”.

  • sources (list) – List of source csv files.

  • graph_creation_method (class) – Graph creation method class.

  • test_size (float) – Fraction of data to use for testing. Required by this class; prepare_fusion_data uses a default of 0.2.

  • image_downsample_size (tuple) – Size to downsample the images to (height, width, depth) or (height, width) for 2D images. None if not downsampling. (default None)

  • layer_mods (dict) – Dictionary of layer modifications to make to the graph maker method. (default None)

  • extra_log_string_dict (dict) – Dictionary of extra strings to add to the log.

  • own_test_indices (list) – List of indices to use for testing (default None). If None, the test indices are randomly selected using the test_size parameter.

get_lightning_module()[source]

Gets the lightning module using the pytorch geometric lightning module for converting the graph data structure into a pytorch dataloader.

Returns:

lightning_module – Lightning module for converting the graph data structure into a pytorch dataloader.

Return type:

lightning module

prepare_data()[source]

Loads the data with LoadDatasets class

Returns:

  • dataset (tensor) – Tensor of predictive features.

  • data_dims (list) – List of data dimensions [mod1_dim, mod2_dim, img_dim], e.g. [None, None, [100, 100, 100]] for image-only data (image dimensions 100 x 100 x 100), or [8, 32, None] for tabular1 and tabular2 with no image (tabular1 has 8 features, tabular2 has 32 features).

setup()[source]

Gets random train and test indices, and gets the graph data structure.

Return type:

None

downsample_img_batch(imgs, output_size)[source]

Downsamples a batch of images to a specified size.

Parameters:
  • imgs (array-like) – Batch of images. Shape (batch_size, channels, height, width) or (batch_size, channels, height, width, depth) for 3D images.

  • output_size (tuple) – Size to downsample the images to (height, width) or (height, width, depth) for 3D images. Do not put the batch_size dimension in the tuple. If None, no downsampling is performed

Returns:

downsampled_img – Downsampled image.

Return type:

array-like
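
The relationship between the input shape and output_size can be illustrated without any image data. The sketch below computes the shape one would expect after downsampling, assuming the batch and channel dimensions are left unchanged and only the spatial dimensions are replaced. The helper downsampled_shape is hypothetical and does not resample anything.

```python
def downsampled_shape(img_batch_shape, output_size):
    """Return the expected shape of a batch after downsampling.

    Hypothetical helper for illustration only; not part of fusilli.
    """
    if output_size is None:
        # No downsampling requested: the shape is unchanged.
        return tuple(img_batch_shape)
    batch_size, channels, *spatial = img_batch_shape
    # output_size must describe the spatial dimensions only (no batch dimension).
    if len(output_size) != len(spatial):
        raise ValueError(
            f"output_size has {len(output_size)} dimensions, "
            f"but the images have {len(spatial)} spatial dimensions"
        )
    return (batch_size, channels, *output_size)


shape_2d = downsampled_shape((8, 1, 256, 256), (100, 100))
shape_3d = downsampled_shape((8, 1, 128, 128, 64), (50, 50, 32))
unchanged = downsampled_shape((8, 1, 256, 256), None)
```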

prepare_fusion_data(prediction_task, fusion_model, data_paths, output_paths, kfold=False, num_folds=None, test_size=0.2, batch_size=8, multiclass_dimensions=None, image_downsample_size=None, layer_mods=None, max_epochs=1000, checkpoint_path=None, extra_log_string_dict=None, own_early_stopping_callback=None, num_workers=0, test_indices=None, own_kfold_indices=None)[source]

Gets the data module for a specific fusion model and training protocol.

Parameters:
  • prediction_task (str) – Prediction task (binary, multiclass, regression).

  • fusion_model (class) – Fusion model class.

  • data_paths (dict) – Dictionary of data paths with keys “tabular1”, “tabular2”, “image”.

  • output_paths (dict) – Dictionary of output paths with keys “checkpoints”, “figures”, “losses”.

  • kfold (bool) – Whether to use kfold cross validation (default False means train/test split).

  • num_folds (int or None) – Number of folds for kfold cross validation (default None).

  • test_size (float) – Fraction of data to use for testing when using train/test split (default 0.2).

  • batch_size (int) – Batch size (default 8).

  • multiclass_dimensions (int) – Number of classes for multiclass prediction (default None).

  • image_downsample_size (tuple) – Tuple of image dimensions to downsample to (default None). e.g. (100, 100, 100) for 3D images, (100, 100) for 2D images.

  • layer_mods (dict) – Dictionary of layer modifications (default None).

  • max_epochs (int) – Maximum number of epochs to train subspace methods for. (default 1000)

  • checkpoint_path (list) – List of paths to the checkpoint files to load. The length of the list is the number of trainable subspace models in the fusion model (e.g., DAETabImgMaps requires two models to be pre-trained, so two checkpoint paths would be passed in the list). Default None, which results in the default lightning format.

  • extra_log_string_dict (dict) – Dictionary of extra strings to add to a subspace method checkpoint file name (default None), e.g. the hyperparameters when running the same model with different hyperparameters. Input format is {“name”: “value”}. The extra string is added to the run name as “name_value”, and a tag “name_value” is also added.

  • own_early_stopping_callback (pytorch_lightning.callbacks.EarlyStopping) – Early stopping callback class (default None).

  • num_workers (int) – Number of workers for the dataloader (default 0).

  • test_indices (list or None) – List of indices to use for testing (default None). If None, then random split is used.

  • own_kfold_indices (list or None) – List of indices to use for k-fold cross validation (default None). If None, then random split is used.
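
The “name_value” convention described for extra_log_string_dict can be sketched as follows. The helper log_tags is hypothetical and only shows how each {“name”: “value”} pair would map to one string; how the library joins these strings into the run name is not shown here.

```python
def log_tags(extra_log_string_dict):
    """Build one "name_value" string per dictionary entry.

    Hypothetical helper for illustration only; not part of fusilli.
    """
    if extra_log_string_dict is None:
        # The documented default: no extra strings are added.
        return []
    return [f"{name}_{value}" for name, value in extra_log_string_dict.items()]


tags = log_tags({"lr": 0.001, "dropout": 0.5})
no_tags = log_tags(None)
```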

Returns:

dm – Datamodule for the specified fusion method.

Return type:

datamodule
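
The two path dictionaries expected by prepare_fusion_data can be prepared and checked before the call. The sketch below uses the documented keys. The file and directory names are placeholders and the helper missing_keys is hypothetical.

```python
# Keys documented for the data_paths and output_paths parameters.
REQUIRED_DATA_KEYS = {"tabular1", "tabular2", "image"}
REQUIRED_OUTPUT_KEYS = {"checkpoints", "figures", "losses"}


def missing_keys(paths, required):
    """Return the required keys that are absent from a paths dictionary.

    Hypothetical helper for illustration only; not part of fusilli.
    """
    return sorted(required - set(paths))


# Placeholder paths; replace them with real locations.
data_paths = {
    "tabular1": "data/tabular1.csv",
    "tabular2": "data/tabular2.csv",
    "image": "data/images.pt",
}
# Deliberately incomplete, to show what the check reports.
output_paths = {
    "checkpoints": "out/checkpoints",
    "figures": "out/figures",
}

missing_data = missing_keys(data_paths, REQUIRED_DATA_KEYS)
missing_output = missing_keys(output_paths, REQUIRED_OUTPUT_KEYS)
```

Once both checks return an empty list, the dictionaries have the documented keys and can be passed as data_paths and output_paths.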