Customising Training

This page will show you how to customise the training and evaluation of your fusion models.

We will cover the following topics:

Early stopping
Validation metrics
Batch size
Number of epochs
Checkpoint suffix modification
Number of workers in PyTorch DataLoader
Train/test and cross-validation splitting yourself

Early stopping

Early stopping is implemented in Fusilli using the PyTorch Lightning EarlyStopping callback. To customise it, create your own EarlyStopping callback and pass it to the prepare_fusion_data() function using the own_early_stopping_callback argument. For example:

from fusilli.data import prepare_fusion_data
from fusilli.train import train_and_save_models

from lightning.pytorch.callbacks import EarlyStopping

modified_early_stopping_callback = EarlyStopping(
    monitor="val_loss",
    min_delta=0.00,
    patience=3,
    verbose=True,
    mode="min",
)

datamodule = prepare_fusion_data(
        prediction_task="binary",
        fusion_model=example_model,
        data_paths=data_paths,
        output_paths=output_path,
        own_early_stopping_callback=modified_early_stopping_callback,
    )

trained_model_list = train_and_save_models(
    data_module=datamodule,
    fusion_model=example_model,
    )

Note that you only need to pass the callback to prepare_fusion_data(), not to train_and_save_models(). The new early stopping callback is saved within the data module and accessed during training.
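To make the patience and min_delta arguments more concrete, here is a minimal pure-Python sketch of patience-based early stopping in "min" mode. It is a simplified illustration, not PyTorch Lightning's actual implementation, and the function name epochs_until_stop is hypothetical.

```python
def epochs_until_stop(val_losses, patience=3, min_delta=0.0):
    """Return the 1-based epoch at which training would stop, or None.

    Simplified illustration of patience-based early stopping in "min" mode.
    """
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if best - loss > min_delta:
            # The monitored loss improved by more than min_delta.
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None


# Made-up validation losses: the best value (0.65) is reached at epoch 3,
# and the following three epochs do not improve on it.
losses = [0.90, 0.70, 0.65, 0.66, 0.67, 0.65, 0.64]
stop_epoch = epochs_until_stop(losses, patience=3, min_delta=0.0)
```

With these made-up losses, training would stop after epoch 6, so the later improvement at epoch 7 is never seen. A larger patience trades longer training for a lower chance of stopping too early.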

Choosing metrics

By default, Fusilli uses the following metrics for each prediction task:

Binary classification: Area under the ROC curve and accuracy
Multiclass classification: Area under the ROC curve and accuracy
Regression: R2 score and mean absolute error

You can change the metrics used by passing a list of metrics to the metrics_list argument in the train_and_save_models() function. For example, if you wanted to change the metrics used for a binary classification task to precision, recall, and area under the precision-recall curve, you could do the following:

new_metrics_list = ["precision", "recall", "auprc"]

trained_model = train_and_save_models(
    data_module=datamodule,
    fusion_model=example_model,
    metrics_list=new_metrics_list,
    )

Here are the supported metrics as of Fusilli v1.2.0:

Regression:

Binary or multiclass classification:

Area under the ROC curve: auroc
Accuracy: accuracy
Recall: recall
Specificity: specificity
Precision: precision
F1 score: f1
Area under the precision-recall curve: auprc
Balanced accuracy: balanced_accuracy

If you’d like more metrics to be added to Fusilli, please open an issue on the Fusilli GitHub repository or submit a pull request. The metrics are calculated in MetricsCalculator, with a separate method for each metric.

Using your own custom metric:

If you’d like to use your own custom metric without adding it to Fusilli, you can calculate it from the validation labels and the validation predictions or probabilities. Both are available on the trained model returned by the train_and_save_models() function. See BaseModel for the list of attributes you can access.
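As a rough sketch of this approach, the snippet below computes the Matthews correlation coefficient for a binary task from plain lists of labels and predictions. The two lists are made-up stand-ins for the values you would read from the trained model; the actual attribute names are documented in BaseModel and are not assumed here.

```python
import math


def matthews_corrcoef_binary(labels, predictions):
    """Matthews correlation coefficient for 0/1 labels and predictions."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        # Convention: return 0 when any row or column of the
        # confusion matrix is empty.
        return 0.0
    return (tp * tn - fp * fn) / denominator


# Placeholder data standing in for the validation labels and predictions
# taken from a trained model.
val_labels = [1, 0, 1, 1, 0, 0, 1, 0]
val_predictions = [1, 0, 0, 1, 0, 1, 1, 0]

mcc = matthews_corrcoef_binary(val_labels, val_predictions)
```

If the model exposes tensors rather than lists, convert them first (for example with .tolist()) before passing them to a helper like this one.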

Note

The first metric in the metrics list is used to rank the models in the model comparison evaluation figures. Only the first two metrics will be shown in the model comparison figures. The rest of the metrics will be shown in the model evaluation dataframe and printed out to the console during training.

Warning

There must be at least two metrics in the metrics list.
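The rules in the note and warning above can be checked before training starts. The helper below is a hypothetical sketch, not part of Fusilli: it only knows the classification metric names listed on this page, and it reports which metrics would be used for ranking, shown in the comparison figures, or shown only in the dataframe and console output.

```python
# Classification metric names as listed on this page.
CLASSIFICATION_METRICS = {
    "auroc",
    "accuracy",
    "recall",
    "specificity",
    "precision",
    "f1",
    "auprc",
    "balanced_accuracy",
}


def check_metrics_list(metrics_list):
    """Validate a classification metrics list and describe how it is used."""
    if len(metrics_list) < 2:
        raise ValueError("metrics_list must contain at least two metrics")
    unknown = [name for name in metrics_list if name not in CLASSIFICATION_METRICS]
    if unknown:
        raise ValueError(f"Unsupported metric names: {unknown}")
    return {
        "ranking_metric": metrics_list[0],          # used to rank the models
        "figure_metrics": metrics_list[:2],         # shown in comparison figures
        "table_only_metrics": metrics_list[2:],     # dataframe and console only
    }


roles = check_metrics_list(["precision", "recall", "auprc"])
```

Here precision would rank the models, precision and recall would appear in the comparison figures, and auprc would appear only in the evaluation dataframe and console output.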

Batch size

The batch size can be set using the batch_size argument in the prepare_fusion_data() function. By default, the batch size is 8. The example below passes the same value to train_and_save_models() as well.

from fusilli.data import prepare_fusion_data
from fusilli.train import train_and_save_models

datamodule = prepare_fusion_data(
        prediction_task="binary",
        fusion_model=example_model,
        data_paths=data_paths,
        output_paths=output_path,
        batch_size=32
    )

trained_model_list = train_and_save_models(
        data_module=datamodule,
        fusion_model=example_model,
        batch_size=32,
    )

Number of epochs

You can change the maximum number of epochs using the max_epochs argument in the prepare_fusion_data() and train_and_save_models() functions. By default, the maximum number of epochs is 1000.

The argument is also passed to prepare_fusion_data() because some of the fusion models require pre-training.

Lowering max_epochs is especially useful for a quick test run of your model, for example with max_epochs=5.

from fusilli.data import prepare_fusion_data
from fusilli.train import train_and_save_models

datamodule = prepare_fusion_data(
        prediction_task="binary",
        fusion_model=example_model,
        data_paths=data_paths,
        output_paths=output_path,
        max_epochs=5,
    )

trained_model_list = train_and_save_models(
        data_module=datamodule,
        fusion_model=example_model,
        max_epochs=5,
    )

Setting max_epochs to -1 will train the model until early stopping is triggered.

Checkpoint file names

By default, Fusilli saves the model checkpoints in the following format:

{fusion_model.__name__}_epoch={epoch_n}.ckpt

If the checkpoint is for a pre-trained model, then the following format is used:

subspace_{fusion_model.__name__}_{pretrained_model.__name__}.ckpt

You can add suffixes to the checkpoint names by passing a dictionary to the extra_log_string_dict argument in the prepare_fusion_data() and train_and_save_models() functions. For example, you could add a suffix to denote that you’ve changed the batch size for this particular run:

from fusilli.data import prepare_fusion_data
from fusilli.train import train_and_save_models

extra_suffix_dict = {"batchsize": 32}

datamodule = prepare_fusion_data(
        prediction_task="binary",
        fusion_model=example_model,
        data_paths=data_paths,
        output_paths=output_path,
        batch_size=32,
        extra_log_string_dict=extra_suffix_dict,
    )

trained_model_list = train_and_save_models(
        data_module=datamodule,
        fusion_model=example_model,
        batch_size=32,
        extra_log_string_dict=extra_suffix_dict,
    )

The checkpoint name would then be (if the model trained for 100 epochs):

ExampleModel_epoch=100_batchsize_32.ckpt
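The naming pattern can be sketched as a small function. This is inferred from the example file name above rather than taken from Fusilli's source; the function name checkpoint_name is hypothetical, and the ordering used when the dictionary has several keys is an assumption.

```python
def checkpoint_name(model_name, epoch, extra_log_string_dict=None):
    """Build a checkpoint file name following the pattern shown above.

    Assumption: each key/value pair is appended as "_{key}_{value}"
    in dictionary order.
    """
    name = f"{model_name}_epoch={epoch}"
    for key, value in (extra_log_string_dict or {}).items():
        name += f"_{key}_{value}"
    return name + ".ckpt"


default_name = checkpoint_name("ExampleModel", 100)
suffixed_name = checkpoint_name("ExampleModel", 100, {"batchsize": 32})
```

Without the dictionary, the sketch yields the default format; with {"batchsize": 32}, it yields the suffixed name shown above.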

Note

The extra_log_string_dict argument is also used to modify the logging behaviour of the model. For more information, see Logging with Weights and Biases.

Number of workers in PyTorch DataLoader

You can change the number of workers in the PyTorch DataLoader using the num_workers argument in the prepare_fusion_data() function. By default, the number of workers is 0.

from fusilli.data import prepare_fusion_data
from fusilli.train import train_and_save_models

datamodule = prepare_fusion_data(
        prediction_task="binary",
        fusion_model=example_model,
        data_paths=data_paths,
        output_paths=output_path,
        num_workers=4,
    )

trained_model_list = train_and_save_models(
        data_module=datamodule,
        fusion_model=example_model,
    )
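If you are unsure which value to pass as num_workers, one common rule of thumb is to base it on the number of CPU cores. The helper below is a hypothetical sketch of that heuristic, not a Fusilli or PyTorch function, and the reserve and cap defaults are arbitrary assumptions.

```python
import os


def suggest_num_workers(reserve=1, cap=8):
    """Suggest a DataLoader worker count from the available CPU cores.

    Leaves `reserve` cores free for the main process and never exceeds `cap`.
    Returns 0 (load data in the main process) when no cores can be spared.
    """
    available = os.cpu_count() or 1  # os.cpu_count() may return None
    return max(0, min(cap, available - reserve))


num_workers = suggest_num_workers()
```

The result can then be passed as num_workers to prepare_fusion_data(). Small tabular datasets often gain little from extra workers, so it is worth timing a short run before settling on a value.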

Train/test and cross-validation splitting yourself

By default, Fusilli randomly splits your data into train/test sets or cross-validation folds, based on the test size or number of folds you specify in the prepare_fusion_data() function.

To remove the randomness, you can specify the data indices for the train and test sets, or for each cross-validation fold, yourself by passing optional arguments to prepare_fusion_data().

For train/test splitting, the argument test_indices should be a list of indices for the test set. To make the test set the first 6 data points in the overall dataset, follow the example below:

from fusilli.data import prepare_fusion_data
from fusilli.train import train_and_save_models

test_indices = [0, 1, 2, 3, 4, 5]

datamodule = prepare_fusion_data(
        prediction_task="binary",
        fusion_model=example_model,
        data_paths=data_paths,
        output_paths=output_path,
        test_indices=test_indices,
    )
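To see what such a split implies, the sketch below derives the complementary training indices from a list of test indices and checks that every index is in range. It assumes that all rows not listed in test_indices end up in the training set; the helper name split_by_test_indices is hypothetical and not part of Fusilli.

```python
def split_by_test_indices(n_samples, test_indices):
    """Return (train_indices, test_indices) for a dataset of n_samples rows."""
    test_set = set(test_indices)
    out_of_range = sorted(i for i in test_set if not 0 <= i < n_samples)
    if out_of_range:
        raise ValueError(f"Test indices out of range: {out_of_range}")
    # Assumption: every row that is not a test row is a training row.
    train_indices = [i for i in range(n_samples) if i not in test_set]
    return train_indices, sorted(test_set)


# A made-up dataset of 12 rows with the first 6 held out for testing.
train_indices, checked_test_indices = split_by_test_indices(12, [0, 1, 2, 3, 4, 5])
```

Running a check like this before calling prepare_fusion_data() catches out-of-range or duplicated indices early.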

To specify your own cross-validation folds, the argument own_kfold_indices should be a list with one (train indices, test indices) tuple per fold.

If you wanted to have non-random cross validation folds through your data, you can either specify the folds like so for 3 folds:

own_kfold_indices = [
    ([ 4,  5,  6,  7,  8,  9, 10, 11], [0, 1, 2, 3]), # first fold
    ([ 0,  1,  2,  3,  8,  9, 10, 11], [4, 5, 6, 7]), # second fold
    ([ 0,  1,  2,  3,  4,  5,  6,  7], [8, 9, 10, 11]) # third fold
]

Or to do this automatically, use the Scikit-Learn KFold functionality to generate the folds outside of the fusilli functions, like so:

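from fusilli.data import prepare_fusion_data
from sklearn.model_selection import KFold

num_folds = 5

# `dataset` stands for your full dataset; only its length is needed here.
# Each element of own_kfold_indices is a (train_indices, test_indices) tuple.
own_kfold_indices = list(KFold(n_splits=num_folds).split(range(len(dataset))))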


datamodule = prepare_fusion_data(
    kfold=True,
    prediction_task="binary",
    fusion_model=example_model,
    data_paths=data_paths,
    output_paths=output_path,
    own_kfold_indices=own_kfold_indices,
    num_folds=num_folds,
)
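If you would rather not depend on Scikit-Learn, contiguous non-random folds like the hand-written 3-fold example above can be generated with a few lines of plain Python. This is a minimal sketch with a hypothetical function name; any remainder rows are assigned to the first folds, which is an assumption about how you want uneven sizes handled.

```python
def contiguous_kfold_indices(n_samples, num_folds):
    """Return a list of (train_indices, test_indices) tuples, one per fold.

    Folds are contiguous blocks in dataset order, with no shuffling.
    """
    fold_sizes = [n_samples // num_folds] * num_folds
    for i in range(n_samples % num_folds):
        fold_sizes[i] += 1  # spread leftover rows over the first folds

    folds = []
    start = 0
    for size in fold_sizes:
        stop = start + size
        test = list(range(start, stop))
        train = [i for i in range(n_samples) if i < start or i >= stop]
        folds.append((train, test))
        start = stop
    return folds


# 12 rows split into 3 folds, matching the hand-written example above.
own_kfold_indices = contiguous_kfold_indices(12, 3)
```

The resulting list has the same structure as the hand-written example and can be passed as own_kfold_indices, together with kfold=True and a matching num_folds.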