fusilli.data
Data loading classes for multimodal and unimodal data. This module contains functions and classes for loading the data for the different modalities and for training the subspace methods on that data (when a subspace method needs pre-training). Train/test splits and k-fold cross validation are also implemented here.
Functions
| downsample_img_batch | Downsamples a batch of images to a specified size. |
| prepare_fusion_data | Gets the data module for a specific fusion model and training protocol. |
Classes
| CustomDataset | Custom dataset class for multimodal data. |
| KFoldDataModule | Custom pytorch lightning datamodule class for the different modalities with k-fold cross validation. |
| KFoldGraphDataModule | Custom pytorch lightning datamodule class for the different modalities with graph data structure and k-fold cross validation. |
| LoadDatasets | Class for loading the different datasets for the different modalities. |
| TrainTestDataModule | Custom pytorch lightning datamodule class for the different modalities. |
| TrainTestGraphDataModule | Custom pytorch lightning datamodule class for the different modalities with graph data structure. |
- class CustomDataset(pred_features, labels)[source]
Bases:
Dataset
Custom dataset class for multimodal data.
- multimodal_flag
Flag for multimodal data. True if multimodal, False if unimodal.
- Type:
bool
- dataset1
Tensor of predictive features for modality 1.
- Type:
tensor
- dataset2
Tensor of predictive features for modality 2.
- Type:
tensor
- dataset
Tensor of predictive features for uni-modal data.
- Type:
tensor
- labels
Tensor of labels.
- Type:
tensor
- __init__(pred_features, labels)[source]
- Parameters:
pred_features (list or tensor) – List of tensors or tensor of predictive features (i.e. tabular or image data without labels).
labels (dataframe) – Dataframe of labels (column name must be “prediction_label”).
- Raises:
ValueError – If pred_features is not a list or tensor.
- class KFoldDataModule(fusion_model, sources, output_paths, prediction_task, batch_size, num_folds, multiclass_dimensions, subspace_method=None, image_downsample_size=None, layer_mods=None, max_epochs=1000, extra_log_string_dict=None, own_early_stopping_callback=None, num_workers=0, own_kfold_indices=None, kwargs=None)[source]
Bases:
LightningDataModule
Custom pytorch lightning datamodule class for the different modalities with k-fold cross validation.
- num_folds
Total number of folds.
- Type:
int
- sources
List of source data files (csv or torch), ordered as [Tabular1, Tabular2, Image].
- Type:
list
- output_paths
Dictionary of output paths for saving the checkpoints, figures, and the losses.
- Type:
dict
- image_downsample_size
Size to downsample the images to (height, width, depth) or (height, width) for 2D images. None if not downsampling. (default None)
- Type:
tuple
- modality_methods
Dictionary of methods for loading the different modalities.
- Type:
dict
- fusion_model
Fusion model class. e.g. “TabularCrossmodalAttention”.
- Type:
class
- batch_size
Batch size. Required by this class; prepare_fusion_data defaults to 8.
- Type:
int
- prediction_task
Prediction type (binary, multiclass, regression).
- Type:
str
- multiclass_dimensions
Number of classes for multiclass prediction. Required by this class; prepare_fusion_data defaults to None.
- Type:
int
- subspace_method
Subspace method class (default None) (only for subspace methods).
- Type:
class
- layer_mods
Dictionary of layer modifications to make to the subspace method. (default None)
- Type:
dict
- max_epochs
Maximum number of epochs to train subspace methods for. (default 1000)
- Type:
int
- dataset
Tensor of predictive features. Created in prepare_data().
- Type:
tensor
- data_dims
List of data dimensions [mod1_dim, mod2_dim, img_dim]. Created in prepare_data().
- Type:
list
- train_dataset
Tensor of predictive features for training. Created in setup().
- Type:
tensor
- test_dataset
Tensor of predictive features for testing. Created in setup().
- Type:
tensor
- own_early_stopping_callback
Early stopping callback class.
- Type:
pytorch_lightning.callbacks.EarlyStopping
- num_workers
Number of workers for the dataloader (default 0).
- Type:
int
- own_kfold_indices
List of indices to use for k-fold cross validation (default None). If None, the k-fold indices are randomly selected. The structure is a list of (train_indices, test_indices) tuples, one per fold, so its length must equal num_folds.
- Type:
list
- kwargs
Dictionary of extra arguments for the subspace method class.
- Type:
dict
- __init__(fusion_model, sources, output_paths, prediction_task, batch_size, num_folds, multiclass_dimensions, subspace_method=None, image_downsample_size=None, layer_mods=None, max_epochs=1000, extra_log_string_dict=None, own_early_stopping_callback=None, num_workers=0, own_kfold_indices=None, kwargs=None)[source]
- Parameters:
fusion_model (class) – Fusion model class. e.g. “TabularCrossmodalAttention”.
sources (list) – List of source data files: csv or torch files.
output_paths (dict) – Dictionary of output paths for saving the checkpoints, figures, and the losses.
prediction_task (str) – Prediction task (binary, multiclass, regression).
batch_size (int) – Batch size.
num_folds (int) – Total number of folds.
test_size (float) – Fraction of data to use for testing (default 0.2). Not used for k-fold cross validation and absent from the signature above; listed only for consistency with TrainTestDataModule.
multiclass_dimensions (int) – Number of classes for multiclass prediction. Required by this class; prepare_fusion_data defaults to None.
subspace_method (class) – Subspace method class (default None) (only for subspace methods).
image_downsample_size (tuple) – Size to downsample the images to (height, width, depth) or (height, width) for 2D images. None if not downsampling. (default None)
layer_mods (dict) – Dictionary of layer modifications to make to the subspace method. (default None)
max_epochs (int) – Maximum number of epochs to train subspace methods for. (default 1000)
extra_log_string_dict (dict) – Dictionary of extra strings to add to the log.
own_early_stopping_callback (pytorch_lightning.callbacks.EarlyStopping) – Early stopping callback class (default None).
num_workers (int) – Number of workers for the dataloader (default 0).
own_kfold_indices (list) – List of indices to use for k-fold cross validation (default None). If None, the k-fold indices are randomly selected. The structure is a list of (train_indices, test_indices) tuples, one per fold, so its length must equal num_folds.
kwargs (dict) – Dictionary of extra arguments for the subspace method class.
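The structure expected for own_kfold_indices can be sketched in plain Python. This is a minimal illustration only: make_kfold_indices is a hypothetical helper, and the shuffling strategy is an assumption rather than the way fusilli selects folds internally.

```python
import random


def make_kfold_indices(num_samples, num_folds, seed=0):
    """Return a list of (train_indices, test_indices) tuples, one per fold.

    Hypothetical helper: the name and the shuffling strategy are
    assumptions made for illustration only.
    """
    if not 2 <= num_folds <= num_samples:
        raise ValueError("num_folds must be between 2 and num_samples")

    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)

    # Deal the shuffled indices round-robin so fold sizes differ by at most 1.
    test_folds = [sorted(indices[i::num_folds]) for i in range(num_folds)]

    folds = []
    for test_indices in test_folds:
        held_out = set(test_indices)
        train_indices = [i for i in range(num_samples) if i not in held_out]
        folds.append((train_indices, test_indices))
    return folds


own_kfold_indices = make_kfold_indices(num_samples=10, num_folds=5)
```

Every sample index appears in exactly one test fold, and the list has one tuple per fold, which is the shape the parameter description asks for.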
- kfold_split()[source]
Splits the dataset into k folds.
- Returns:
folds – List of tuples of (train_dataset, test_dataset)
- Return type:
list
- prepare_data()[source]
Loads the data with the LoadDatasets class.
- Returns:
dataset (tensor) – Tensor of predictive features.
data_dims (list) – List of data dimensions [mod1_dim, mod2_dim, img_dim], e.g. [None, None, [100, 100, 100]] for image only (image dimensions 100 x 100 x 100), or [8, 32, None] for tabular1 and tabular2 with no image (tabular1 has 8 features, tabular2 has 32 features).
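The data_dims convention can be read off with a small helper. This is a sketch: describe_data_dims is a hypothetical name and is not part of fusilli.

```python
def describe_data_dims(data_dims):
    """Name the modalities present in a [mod1_dim, mod2_dim, img_dim] list.

    Hypothetical helper for illustration; a None entry means the
    modality is absent.
    """
    if len(data_dims) != 3:
        raise ValueError("data_dims must have exactly three entries")

    names = ("tabular1", "tabular2", "image")
    return [name for name, dim in zip(names, data_dims) if dim is not None]


image_only = describe_data_dims([None, None, [100, 100, 100]])
both_tabular = describe_data_dims([8, 32, None])
```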
- setup(checkpoint_path=None)[source]
Splits the data into train and test sets, and runs the subspace method if specified.
- checkpoint_path
Path to the checkpoint file for the subspace method (default None).
- Type:
str
- Returns:
train_dataloader (dataloader) – Dataloader for training.
val_dataloader (dataloader) – Dataloader for validation.
- class KFoldGraphDataModule(num_folds, fusion_model, sources, graph_creation_method, image_downsample_size=None, layer_mods=None, extra_log_string_dict=None, own_kfold_indices=None)[source]
Bases:
object
Custom pytorch lightning datamodule class for the different modalities with graph data structure and k-fold cross validation.
- num_folds
Total number of folds.
- Type:
int
- image_downsample_size
Size to downsample the images to (height, width, depth) or (height, width) for 2D images.
- Type:
tuple
- sources
List of source data files (csv or torch), ordered as [Tabular1, Tabular2, Image].
- Type:
list
- modality_methods
Dictionary of methods for loading the different modalities.
- Type:
dict
- fusion_model
Fusion model class. e.g. “TabularCrossmodalAttention”.
- Type:
class
- graph_creation_method
Graph creation method class.
- Type:
class
- graph_maker_instance
Graph maker class instance.
- Type:
graph maker class
- layer_mods
Dictionary of layer modifications to make to the graph maker method.
- Type:
dict
- dataset
Tensor of predictive features.
- Type:
tensor
- data_dims
List of data dimensions [mod1_dim, mod2_dim, img_dim]
- Type:
list
- folds
List of tuples of (graph_data, train_idxs, test_idxs)
- Type:
list
- __init__(num_folds, fusion_model, sources, graph_creation_method, image_downsample_size=None, layer_mods=None, extra_log_string_dict=None, own_kfold_indices=None)[source]
- Parameters:
num_folds (int) – Total number of folds.
fusion_model (class) – Fusion model class. e.g. “TabularCrossmodalAttention”.
sources (list) – List of source data files (csv or torch), ordered as [Tabular1, Tabular2, Image].
graph_creation_method (class) – Graph creation method class.
image_downsample_size (tuple) – Size to downsample the images to (height, width, depth) or (height, width) for 2D images. None if not downsampling. (default None)
layer_mods (dict) – Dictionary of layer modifications to make to the graph maker method. (default None)
extra_log_string_dict (dict) – Dictionary of extra strings to add to the log.
own_kfold_indices (list) – List of indices to use for k-fold cross validation (default None). If None, the k-fold indices are randomly selected. The structure is a list of (train_indices, test_indices) tuples, one per fold, so its length must equal num_folds.
- get_lightning_module()[source]
Returns the lightning module using the pytorch geometric lightning module for converting the graph data structure into a pytorch dataloader.
- Returns:
lightning_modules – List of lightning modules for each fold.
- Return type:
list
- class LoadDatasets(sources, img_downsample_dims=None)[source]
Bases:
object
Class for loading the different datasets for the different modalities.
- tabular1_source
Source csv file for tabular1 data.
- Type:
str
- tabular2_source
Source csv file for tabular2 data.
- Type:
str
- img_source
Source torch file for image data.
- Type:
str
- image_downsample_size
Size to downsample the images to (height, width, depth) or (height, width) for 2D images. None if not downsampling. (default None)
- Type:
tuple
- __init__(sources, img_downsample_dims=None)[source]
- Parameters:
sources (list) – List of source files: [tabular1_source, tabular2_source, img_source]. The tabular sources are csv files and the image source is a torch file.
img_downsample_dims (tuple) – Size to downsample the images to (height, width, depth) or (height, width) for 2D images. None if not downsampling. (default None)
- Raises:
ValueError – If sources is not a list.
ValueError – If the CSVs do not have the right columns or if the index column is not named “ID”.
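The column requirements can be checked before handing files to LoadDatasets. The sketch below uses only the standard library. check_csv_header is a hypothetical helper, and reading "the right columns" as "contains a prediction_label column" is an assumption based on the CustomDataset documentation, not on fusilli's own validation code.

```python
import csv
import io


def check_csv_header(csv_text):
    """Raise ValueError unless the header starts with "ID" and has a label column.

    Hypothetical pre-flight check, not fusilli's own validation.
    """
    header = next(csv.reader(io.StringIO(csv_text)), None)
    if not header:
        raise ValueError("CSV is empty")
    if header[0] != "ID":
        raise ValueError('first (index) column must be named "ID"')
    if "prediction_label" not in header:
        raise ValueError('missing "prediction_label" column')
    return header


good = "ID,feature1,feature2,prediction_label\n0,1.5,2.0,1\n"
bad = "subject,feature1,prediction_label\n0,1.5,1\n"

columns = check_csv_header(good)

# The second header uses "subject" instead of "ID", so it is rejected.
try:
    check_csv_header(bad)
    bad_rejected = False
except ValueError:
    bad_rejected = True
```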
- load_img()[source]
Loads the image-only dataset.
- Returns:
dataset (tensor) – Tensor of predictive features.
data_dims (list) – List of data dimensions [mod1_dim, mod2_dim, img_dim], e.g. [None, None, [100, 100, 100]] for image only (image dimensions 100 x 100 x 100), or [8, 32, None] for tabular1 and tabular2 with no image (tabular1 has 8 features, tabular2 has 32 features).
- load_tab_and_img()[source]
Loads the tabular1 and image multimodal dataset.
- Returns:
dataset (tensor) – Tensor of predictive features.
data_dims (list) – List of data dimensions [mod1_dim, mod2_dim, img_dim], e.g. [None, None, [100, 100, 100]] for image only (image dimensions 100 x 100 x 100), or [8, 32, None] for tabular1 and tabular2 with no image (tabular1 has 8 features, tabular2 has 32 features).
- load_tabular1()[source]
Loads the tabular1-only dataset.
- Returns:
dataset (tensor) – Tensor of predictive features.
data_dims (list) – List of data dimensions [mod1_dim, mod2_dim, img_dim], e.g. [None, None, [100, 100, 100]] for image only (image dimensions 100 x 100 x 100), or [8, 32, None] for tabular1 and tabular2 with no image (tabular1 has 8 features, tabular2 has 32 features).
- load_tabular2()[source]
Loads the tabular2-only dataset.
- Returns:
dataset (tensor) – Tensor of predictive features.
data_dims (list) – List of data dimensions [mod1_dim, mod2_dim, img_dim], e.g. [None, None, [100, 100, 100]] for image only (image dimensions 100 x 100 x 100), or [8, 32, None] for tabular1 and tabular2 with no image (tabular1 has 8 features, tabular2 has 32 features).
- load_tabular_tabular()[source]
Loads the tabular1 and tabular2 multimodal dataset.
- Returns:
dataset (tensor) – Tensor of predictive features.
data_dims (list) – List of data dimensions [mod1_dim, mod2_dim, img_dim], e.g. [None, None, [100, 100, 100]] for image only (image dimensions 100 x 100 x 100), or [8, 32, None] for tabular1 and tabular2 with no image (tabular1 has 8 features, tabular2 has 32 features).
- class TrainTestDataModule(fusion_model, sources, output_paths, prediction_task, batch_size, test_size, multiclass_dimensions, subspace_method=None, image_downsample_size=None, layer_mods=None, max_epochs=1000, extra_log_string_dict=None, own_early_stopping_callback=None, num_workers=0, test_indices=None, kwargs=None)[source]
Bases:
LightningDataModule
Custom pytorch lightning datamodule class for the different modalities.
- sources
List of source data files (csv or torch), ordered as [Tabular1, Tabular2, Image].
- Type:
list
- modality_methods
Dictionary of methods for loading the different modalities.
- Type:
dict
- fusion_model
Fusion model class. e.g. “TabularCrossmodalAttention”.
- Type:
class
- output_paths
Dictionary of output paths for saving the checkpoints, figures, and the losses.
- Type:
dict
- batch_size
Batch size. Required by this class; prepare_fusion_data defaults to 8.
- Type:
int
- test_size
Fraction of data to use for testing. Required by this class; prepare_fusion_data defaults to 0.2.
- Type:
float
- prediction_task
Prediction type (binary, multiclass, or regression).
- Type:
str
- multiclass_dimensions
Number of classes for multiclass prediction. Required by this class; prepare_fusion_data defaults to None.
- Type:
int
- subspace_method
Subspace method class (default None) (only for subspace methods).
- Type:
class
- layer_mods
Dictionary of layer modifications to make to the subspace method. (default None)
- Type:
dict
- max_epochs
Maximum number of epochs to train subspace methods for. (default 1000)
- Type:
int
- dataset
Tensor of predictive features. Created in prepare_data().
- Type:
tensor
- data_dims
List of data dimensions [mod1_dim, mod2_dim, img_dim]. Created in prepare_data().
- Type:
list
- train_dataset
Tensor of predictive features for training. Created in setup().
- Type:
tensor
- test_dataset
Tensor of predictive features for testing. Created in setup().
- Type:
tensor
- subspace_method_train
Subspace method class trained (only for subspace methods).
- Type:
class
- own_early_stopping_callback
Early stopping callback class.
- Type:
pytorch_lightning.callbacks.EarlyStopping
- num_workers
Number of workers for the dataloader (default 0).
- Type:
int
- test_indices
List of indices to use for testing (default None). If None, the test indices are randomly selected using the test_size parameter.
- Type:
list
- kwargs
Dictionary of extra arguments for the subspace method class.
- Type:
dict
- __init__(fusion_model, sources, output_paths, prediction_task, batch_size, test_size, multiclass_dimensions, subspace_method=None, image_downsample_size=None, layer_mods=None, max_epochs=1000, extra_log_string_dict=None, own_early_stopping_callback=None, num_workers=0, test_indices=None, kwargs=None)[source]
- Parameters:
fusion_model (class) – Fusion model class. e.g. “TabularCrossmodalAttention”.
sources (list) – List of source data files (csv or torch), ordered as [Tabular1, Tabular2, Image].
output_paths (dict) – Dictionary of output paths for saving the checkpoints, figures, and the losses.
prediction_task (str) – Prediction task (binary, multiclass, regression).
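batch_size (int) – Batch size. Required by this class; prepare_fusion_data defaults to 8.
test_size (float) – Fraction of data to use for testing. Required by this class; prepare_fusion_data defaults to 0.2.
multiclass_dimensions (int) – Number of classes for multiclass prediction. Required by this class; prepare_fusion_data defaults to None.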
subspace_method (class) – Subspace method class (default None) (only for subspace methods).
image_downsample_size (tuple) – Size to downsample the images to (height, width, depth) or (height, width) for 2D images. None if not downsampling. (default None)
layer_mods (dict) – Dictionary of layer modifications to make to the subspace method. (default None)
max_epochs (int) – Maximum number of epochs to train subspace methods for. (default 1000)
extra_log_string_dict (dict) – Dictionary of extra strings to add to the log.
own_early_stopping_callback (pytorch_lightning.callbacks.EarlyStopping) – Early stopping callback class (default None).
num_workers (int) – Number of workers for the dataloader (default 0).
test_indices (list) – List of indices to use for testing (default None). If None, the test indices are randomly selected using the test_size parameter.
kwargs (dict) – Dictionary of extra arguments for the subspace method class.
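How test_indices relates to test_size can be sketched as follows. This is an illustration only: pick_test_indices is a hypothetical helper, and the rounding rule and the use of the random module are assumptions, not fusilli's actual splitting code.

```python
import random


def pick_test_indices(num_samples, test_size=0.2, test_indices=None, seed=0):
    """Return (train_indices, test_indices) for a single train/test split.

    Hypothetical helper. If test_indices is given it is used as-is;
    otherwise a random fraction test_size of the samples is held out.
    """
    if test_indices is None:
        if not 0.0 < test_size < 1.0:
            raise ValueError("test_size must be strictly between 0 and 1")
        num_test = max(1, round(num_samples * test_size))
        test_indices = random.Random(seed).sample(range(num_samples), num_test)

    test_indices = sorted(test_indices)
    held_out = set(test_indices)
    train_indices = [i for i in range(num_samples) if i not in held_out]
    return train_indices, test_indices


random_train, random_test = pick_test_indices(num_samples=20, test_size=0.2)
fixed_train, fixed_test = pick_test_indices(num_samples=20, test_indices=[0, 5, 19])
```

Passing explicit indices makes the split reproducible across runs and models, which is the use case the test_indices parameter describes.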
- prepare_data()[source]
Loads the data with the LoadDatasets class.
- Returns:
dataset (tensor) – Tensor of predictive features.
data_dims (list) – List of data dimensions [mod1_dim, mod2_dim, img_dim], e.g. [None, None, [100, 100, 100]] for image only (image dimensions 100 x 100 x 100), or [8, 32, None] for tabular1 and tabular2 with no image (tabular1 has 8 features, tabular2 has 32 features).
- setup(checkpoint_path=None)[source]
Splits the data into train and test sets, and runs the subspace method if specified. If checkpoint_path is specified, the subspace method is loaded from the checkpoint and not trained.
- checkpoint_path
Path to the checkpoint file for the subspace method (default None).
- Type:
str
- Returns:
train_dataloader (dataloader) – Dataloader for training.
val_dataloader (dataloader) – Dataloader for validation.
- class TrainTestGraphDataModule(fusion_model, sources, graph_creation_method, test_size, image_downsample_size=None, layer_mods=None, extra_log_string_dict=None, own_test_indices=None)[source]
Bases:
object
Custom pytorch lightning datamodule class for the different modalities with graph data structure.
- sources
List of source data files (csv or torch), ordered as [Tabular1, Tabular2, Image].
- Type:
list
- image_downsample_size
Size to downsample the images to (height, width, depth) or (height, width) for 2D images.
- Type:
tuple
- modality_methods
Dictionary of methods for loading the different modalities.
- Type:
dict
- fusion_model
Fusion model class. e.g. “TabularCrossmodalAttention”.
- Type:
class
- test_size
Fraction of data to use for testing. Required by this class; prepare_fusion_data defaults to 0.2.
- Type:
float
- graph_creation_method
Graph creation method class.
- Type:
class
- graph_maker_instance
Graph maker class instance.
- Type:
graph maker class
- layer_mods
Dictionary of layer modifications to make to the graph maker method.
- Type:
dict
- dataset
Tensor of predictive features. Created in prepare_data().
- Type:
tensor
- data_dims
List of data dimensions [mod1_dim, mod2_dim, img_dim]. Created in prepare_data().
- Type:
list
- train_idxs
List of indices for training. Created in setup().
- Type:
list
- test_idxs
List of indices for testing. Created in setup().
- Type:
list
- graph_data
Graph data structure. Created in setup().
- Type:
graph data structure
- own_test_indices
List of indices to use for testing (default None). If None, the test indices are randomly selected using the test_size parameter.
- Type:
list
- __init__(fusion_model, sources, graph_creation_method, test_size, image_downsample_size=None, layer_mods=None, extra_log_string_dict=None, own_test_indices=None)[source]
- Parameters:
fusion_model (class) – Fusion model class. e.g. “TabularCrossmodalAttention”.
sources (list) – List of source data files (csv or torch), ordered as [Tabular1, Tabular2, Image].
graph_creation_method (class) – Graph creation method class.
test_size (float) – Fraction of data to use for testing. Required by this class; prepare_fusion_data defaults to 0.2.
image_downsample_size (tuple) – Size to downsample the images to (height, width, depth) or (height, width) for 2D images. None if not downsampling. (default None)
layer_mods (dict) – Dictionary of layer modifications to make to the graph maker method. (default None)
extra_log_string_dict (dict) – Dictionary of extra strings to add to the log.
own_test_indices (list) – List of indices to use for testing (default None). If None, the test indices are randomly selected using the test_size parameter.
- get_lightning_module()[source]
Gets the lightning module using the pytorch geometric lightning module for converting the graph data structure into a pytorch dataloader.
- Returns:
lightning_module – Lightning module for converting the graph data structure into a pytorch dataloader.
- Return type:
lightning module
- prepare_data()[source]
Loads the data with the LoadDatasets class.
- Returns:
dataset (tensor) – Tensor of predictive features.
data_dims (list) – List of data dimensions [mod1_dim, mod2_dim, img_dim], e.g. [None, None, [100, 100, 100]] for image only (image dimensions 100 x 100 x 100), or [8, 32, None] for tabular1 and tabular2 with no image (tabular1 has 8 features, tabular2 has 32 features).
- downsample_img_batch(imgs, output_size)[source]
Downsamples a batch of images to a specified size.
- Parameters:
imgs (array-like) – Batch of images. Shape (batch_size, channels, height, width) or (batch_size, channels, height, width, depth) for 3D images.
output_size (tuple) – Size to downsample the images to: (height, width) for 2D images or (height, width, depth) for 3D images. Do not include the batch_size dimension in the tuple. If None, no downsampling is performed.
- Returns:
downsampled_img – Downsampled batch of images.
- Return type:
array-like
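The shape bookkeeping for output_size can be sketched without any tensor library. downsampled_shape is a hypothetical helper. It assumes the batch and channel dimensions are preserved and only the spatial dimensions change, which is consistent with the parameter descriptions but is not taken from the fusilli source.

```python
def downsampled_shape(img_shape, output_size):
    """Return the expected shape of a batch after downsampling.

    img_shape is (batch_size, channels, *spatial); output_size gives the
    new spatial dimensions only, or None to leave the batch unchanged.
    """
    if output_size is None:
        return tuple(img_shape)

    num_spatial = len(img_shape) - 2
    if num_spatial not in (2, 3):
        raise ValueError("expected a batch of 2D or 3D images")
    if len(output_size) != num_spatial:
        raise ValueError("output_size must have one entry per spatial dimension")

    return tuple(img_shape[:2]) + tuple(output_size)


shape_2d = downsampled_shape((16, 1, 200, 200), (100, 100))
shape_3d = downsampled_shape((4, 1, 128, 128, 64), (64, 64, 32))
unchanged = downsampled_shape((4, 1, 128, 128, 64), None)
```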
- prepare_fusion_data(prediction_task, fusion_model, data_paths, output_paths, kfold=False, num_folds=None, test_size=0.2, batch_size=8, multiclass_dimensions=None, image_downsample_size=None, layer_mods=None, max_epochs=1000, checkpoint_path=None, extra_log_string_dict=None, own_early_stopping_callback=None, num_workers=0, test_indices=None, own_kfold_indices=None)[source]
Gets the data module for a specific fusion model and training protocol.
- Parameters:
prediction_task (str) – Prediction task (binary, multiclass, regression).
fusion_model (class) – Fusion model class.
data_paths (dict) – Dictionary of data paths with keys “tabular1”, “tabular2”, “image”.
output_paths (dict) – Dictionary of output paths with keys “checkpoints”, “figures”, “losses”.
kfold (bool) – Whether to use kfold cross validation (default False means train/test split).
num_folds (int or None) – Number of folds for kfold cross validation (default None).
test_size (float) – Fraction of data to use for testing when using train/test split (default 0.2).
batch_size (int) – Batch size (default 8).
multiclass_dimensions (int) – Number of classes for multiclass prediction (default None).
image_downsample_size (tuple) – Tuple of image dimensions to downsample to (default None). e.g. (100, 100, 100) for 3D images, (100, 100) for 2D images.
layer_mods (dict) – Dictionary of layer modifications (default None).
max_epochs (int) – Maximum number of epochs to train subspace methods for. (default 1000)
checkpoint_path (list) – List of checkpoint file paths for the subspace models. The length of the list is the number of trainable subspace models in the fusion model (e.g. DAETabImgMaps requires two models to be pre-trained, so two checkpoint paths are passed). Default None, which results in the default lightning format.
extra_log_string_dict (dict) – Dictionary of extra strings to add to a subspace method checkpoint file name (default None), e.g. the hyperparameters when running the same model with different hyperparameters. Input format is {“name”: “value”}. The extra string is added to the run name as “name_value”, and a tag “name_value” is also added.
own_early_stopping_callback (pytorch_lightning.callbacks.EarlyStopping) – Early stopping callback class (default None).
num_workers (int) – Number of workers for the dataloader (default 0).
test_indices (list or None) – List of indices to use for testing (default None). If None, then random split is used.
own_kfold_indices (list or None) – List of indices to use for k-fold cross validation (default None). If None, then random split is used.
- Returns:
dm – Datamodule for the specified fusion method.
- Return type:
datamodule
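The dictionary arguments and the extra_log_string_dict naming rule can be sketched as below. The file paths are placeholders, format_log_tags is a hypothetical helper that mirrors the "name_value" rule described in the parameter list, and the call to prepare_fusion_data is left commented out because it needs fusilli installed and a fusion model class.

```python
# Keys follow the parameter descriptions; the paths are placeholders.
data_paths = {
    "tabular1": "data/tabular1.csv",
    "tabular2": "data/tabular2.csv",
    "image": "data/images.pt",
}
output_paths = {
    "checkpoints": "outputs/checkpoints",
    "figures": "outputs/figures",
    "losses": "outputs/losses",
}
extra_log_string_dict = {"lr": 0.001, "dropout": 0.2}


def format_log_tags(extra_log_strings):
    """Join each entry as "name_value" (hypothetical helper)."""
    if extra_log_strings is None:
        return []
    return [f"{name}_{value}" for name, value in extra_log_strings.items()]


tags = format_log_tags(extra_log_string_dict)

# With fusilli installed, the call would look roughly like this
# (SomeFusionModel is a placeholder for a fusion model class):
#
# from fusilli.data import prepare_fusion_data
# dm = prepare_fusion_data(
#     prediction_task="binary",
#     fusion_model=SomeFusionModel,
#     data_paths=data_paths,
#     output_paths=output_paths,
#     extra_log_string_dict=extra_log_string_dict,
# )
```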