Fusion Model Guide

Below are explanations and diagrams of the fusion models available in this library. Some of the models are inspired by papers in the literature, so links to the papers are provided where appropriate.

Model fusion diagram

The diagram above shows the categorisation of the fusion models available in this library. This image has been taken from Cui et al. (2023).

The table below shows the categorisation of the models available in this library. It is important to note that some of the methods in this library could be categorised in more than one way. For example, the ConcatImgLatentTabDoubleTrain model can be considered both a subspace-based model and an operation-based model: it uses an autoencoder to learn an image latent space, which is then concatenated with the tabular data, but the concatenation itself is an operation used to combine the modalities. The categorisation of the models is a guide, rather than a strict rule.

| Fusion type | Description |
| --- | --- |
| Unimodal | These models use only one modality (e.g. tabular data or images) to make a prediction. |
| Operation | Operation-based models use operations such as concatenation, element-wise summation, or element-wise multiplication to combine the modalities. These methods are easy to implement and are often used as baselines. |
| Attention | Attention-based models use attention mechanisms to combine the modalities. Attention mechanisms learn a weight for each modality’s importance, which is then used to combine the modalities. |
| Subspace | Subspace-based models try to learn a joint latent space for the modalities. This can be done simply by looking at the correlations between the modalities, or by using a variational autoencoder (VAE) that learns the latent space by reconstructing the modalities from a lower-dimensional representation. |
| Graph | Graph-based models look at the interactions between nodes in a graph, where the edges can be learned from similarities between nodes, for example. |
| Tensor | Tensor-based models use tensor operations, such as outer products, to combine the modalities and capture inter-modal and intra-modal interactions. |

Operation-based

ConcatTabularFeatureMaps

This tabular-tabular fusion model works by passing each tabular modality through its own set of fully-connected layers, and then concatenating the outputs of these layers together. The concatenated features are then passed through another set of fully-connected layers to make a prediction.
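
As a rough illustration of this pattern (not fusilli's actual implementation; the class name, layer sizes, and output dimension below are invented for the example), a minimal PyTorch sketch might look like this:

```python
import torch
import torch.nn as nn


class ConcatFeatureMapsSketch(nn.Module):
    """Each tabular modality gets its own fully-connected branch; the branch
    outputs (feature maps) are concatenated and passed to a prediction head."""

    def __init__(self, tab1_dim, tab2_dim, hidden_dim=32, out_dim=1):
        super().__init__()
        self.tab1_branch = nn.Sequential(nn.Linear(tab1_dim, hidden_dim), nn.ReLU())
        self.tab2_branch = nn.Sequential(nn.Linear(tab2_dim, hidden_dim), nn.ReLU())
        self.head = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, out_dim),
        )

    def forward(self, tab1, tab2):
        # Concatenate the two modality feature maps, then predict from the fused features.
        fused = torch.cat([self.tab1_branch(tab1), self.tab2_branch(tab2)], dim=1)
        return self.head(fused)


# Example: a batch of 4 subjects with 10 and 15 tabular features respectively.
prediction = ConcatFeatureMapsSketch(tab1_dim=10, tab2_dim=15)(torch.randn(4, 10), torch.randn(4, 15))
```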

_images/ConcatTabularFeatureMaps.png

ConcatTabularData

This tabular-tabular fusion model works by concatenating the tabular data together, and then passing the concatenated features through a set of fully-connected layers to make a prediction.

_images/ConcatTabularData.png

TabularDecision

This tabular-tabular fusion model works by passing each tabular modality through its own set of fully-connected layers to make a prediction for each modality. The two predictions are then averaged to make a final prediction. This is known as a “decision-level fusion” method.
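
A minimal sketch of decision-level fusion (illustrative layer sizes and names, not the library's code): each modality network produces its own prediction and the two are averaged.

```python
import torch
import torch.nn as nn


class DecisionFusionSketch(nn.Module):
    """Each tabular modality makes its own prediction; the final output is the mean."""

    def __init__(self, tab1_dim, tab2_dim, out_dim=1):
        super().__init__()
        self.tab1_net = nn.Sequential(nn.Linear(tab1_dim, 32), nn.ReLU(), nn.Linear(32, out_dim))
        self.tab2_net = nn.Sequential(nn.Linear(tab2_dim, 32), nn.ReLU(), nn.Linear(32, out_dim))

    def forward(self, tab1, tab2):
        pred1 = self.tab1_net(tab1)  # prediction from the first tabular modality
        pred2 = self.tab2_net(tab2)  # prediction from the second tabular modality
        return (pred1 + pred2) / 2   # decision-level fusion: average the two predictions


prediction = DecisionFusionSketch(tab1_dim=10, tab2_dim=15)(torch.randn(4, 10), torch.randn(4, 15))
```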

_images/TabularDecision.png

ConcatImageMapsTabularData

This tabular-image fusion model works by passing the image through a convolutional neural network (CNN) to extract features from the image. The tabular data is then concatenated with the image features, and the concatenated features are passed through a set of fully-connected layers to make a prediction.
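
A hedged sketch of the idea (the CNN here is deliberately tiny and the names are made up; fusilli's own layers will differ): the flattened image feature maps are concatenated with the raw tabular data before the prediction layers.

```python
import torch
import torch.nn as nn


class ConcatImageMapsTabularDataSketch(nn.Module):
    """CNN image feature maps are flattened and concatenated with the raw tabular data."""

    def __init__(self, tab_dim, out_dim=1):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(8, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),  # -> (batch, 16)
        )
        self.head = nn.Sequential(nn.Linear(16 + tab_dim, 32), nn.ReLU(), nn.Linear(32, out_dim))

    def forward(self, image, tab):
        image_features = self.cnn(image)  # extract image feature maps
        return self.head(torch.cat([image_features, tab], dim=1))


# Example: a batch of 4 single-channel 28x28 images and 10 tabular features per subject.
prediction = ConcatImageMapsTabularDataSketch(tab_dim=10)(torch.randn(4, 1, 28, 28), torch.randn(4, 10))
```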

_images/ConcatImageMapsTabularData.png

ConcatImageMapsTabularMaps

This tabular-image fusion model works by passing the image through a CNN to extract features from the image. The tabular data is also passed through its own fully-connected layers to get tabular feature maps. The tabular features are then concatenated with the image features, and the concatenated features are passed through a set of fully-connected layers to make a prediction.

_images/ConcatImageMapsTabularMaps.png

ImageDecision

This tabular-image fusion model works by passing each modality through its own network (fully-connected for tabular, CNN for image) to create their own predictions. The two predictions are then averaged to make a final prediction. This is known as a “decision-level fusion” method.

_images/ImageDecision.png

ActivationFusion

This is a tabular-tabular fusion model inspired by Chen et al. (2022): MDFNet: application of multimodal fusion method based on skin image and clinical data to skin cancer classification.

The feature maps from the two separate modality networks are multiplied together, passed through a tanh activation function, then a sigmoid activation function, and then concatenated with the first tabular modality feature map. The concatenated feature maps are then passed through a set of fully-connected layers to make a prediction.
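
As a rough sketch of the fusion step described above (the tensor shapes are illustrative and this is not fusilli's exact implementation):

```python
import torch


def activation_fusion_step(tab1_maps, tab2_maps):
    """Multiply the two modality feature maps element-wise, apply tanh then sigmoid,
    and concatenate the result with the first modality's feature maps."""
    gated = torch.sigmoid(torch.tanh(tab1_maps * tab2_maps))
    return torch.cat([gated, tab1_maps], dim=1)


tab1_maps = torch.randn(4, 32)  # output of the first tabular modality's layers
tab2_maps = torch.randn(4, 32)  # output of the second tabular modality's layers
fused = activation_fusion_step(tab1_maps, tab2_maps)  # (4, 64), then fed to fully-connected layers
```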

Note

A tabular-image version is coming soon!

_images/ActivationFusion.png

AttentionAndActivation

Again, this is a tabular-tabular fusion model inspired by Chen et al. (2022): MDFNet: application of multimodal fusion method based on skin image and clinical data to skin cancer classification.

This model is an extension of ActivationFusion, with a self-attention module added to the second tabular modality’s pipeline before its network layers. The second tabular modality is passed through a fully-connected layer to be downsampled (the downsampling factor can be modified), then a ReLU, an upsampling fully-connected layer, and then a sigmoid activation function. The output is then multiplied by the original second tabular modality input data and passed through its own fully-connected layers. After this, the process is the same as in ActivationFusion.
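
The added self-attention module is essentially a squeeze-and-excitation-style gate. A minimal sketch (the downsampling factor and dimensions here are illustrative, not fusilli's defaults):

```python
import torch
import torch.nn as nn


class SelfAttentionGateSketch(nn.Module):
    """Downsample, ReLU, upsample, sigmoid, then scale the original input by the result."""

    def __init__(self, in_dim, downsample_factor=2):
        super().__init__()
        hidden_dim = max(1, in_dim // downsample_factor)
        self.gate = nn.Sequential(
            nn.Linear(in_dim, hidden_dim), nn.ReLU(),    # downsampling layer
            nn.Linear(hidden_dim, in_dim), nn.Sigmoid(), # upsampling layer + gate
        )

    def forward(self, tab2):
        return tab2 * self.gate(tab2)  # attention-weighted version of the input


weighted_tab2 = SelfAttentionGateSketch(in_dim=15)(torch.randn(4, 15))
# weighted_tab2 is then passed through the second modality's own fully-connected
# layers, and the rest of the model proceeds as in ActivationFusion.
```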

Note

A tabular-image version is coming soon!

_images/ActivationandSelfAttention.png

Attention-based

TabularChannelWiseMultiAttention

This tabular-tabular fusion model works by passing each tabular modality through its own set of fully-connected layers. At each layer, the feature maps from the first tabular modality are multiplied into the feature maps from the second tabular modality, effectively modulating the feature maps from the second modality with the feature maps from the first modality (an attention mechanism). The final second tabular feature maps are then passed through a set of fully-connected layers to make a prediction.

This model is inspired by Duanmu et al. (2020): Deep learning prediction of pathological complete response, residual cancer burden, and progression-free survival in breast cancer patients.
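
A minimal sketch of the channel-wise modulation idea (two layers per modality here, with made-up sizes; not the exact fusilli architecture):

```python
import torch
import torch.nn as nn


class ChannelWiseModulationSketch(nn.Module):
    """At each layer, the first modality's feature maps multiplicatively modulate
    the second modality's feature maps."""

    def __init__(self, tab1_dim, tab2_dim, layer_dims=(32, 16), out_dim=1):
        super().__init__()
        self.tab1_layers = nn.ModuleList()
        self.tab2_layers = nn.ModuleList()
        in1, in2 = tab1_dim, tab2_dim
        for dim in layer_dims:
            self.tab1_layers.append(nn.Sequential(nn.Linear(in1, dim), nn.ReLU()))
            self.tab2_layers.append(nn.Sequential(nn.Linear(in2, dim), nn.ReLU()))
            in1, in2 = dim, dim
        self.head = nn.Linear(layer_dims[-1], out_dim)

    def forward(self, tab1, tab2):
        x1, x2 = tab1, tab2
        for layer1, layer2 in zip(self.tab1_layers, self.tab2_layers):
            x1 = layer1(x1)
            x2 = layer2(x2) * x1  # modality 1 modulates modality 2 (channel-wise attention)
        return self.head(x2)      # prediction from the modulated second modality


prediction = ChannelWiseModulationSketch(10, 15)(torch.randn(4, 10), torch.randn(4, 15))
```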

_images/TabularChannelwiseAttention.png

TabularCrossmodalMultiheadAttention

This tabular-tabular fusion model works by passing each tabular modality through its own set of fully-connected layers. Self attention is applied to each modality, and then crossmodal attention is applied to the two modalities. The output of the crossmodal attention is then passed through a fully-connected layer to make a prediction.

This model is inspired by MADDi, the Multimodal Alzheimer’s Disease Diagnosis framework by Golovanevsky et al. (2022). The authors also have their own code available.
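
A rough sketch of self-attention followed by crossmodal attention, using PyTorch's nn.MultiheadAttention (the token and embedding sizes are invented for illustration; this is neither the MADDi authors' code nor fusilli's exact implementation):

```python
import torch
import torch.nn as nn


class CrossmodalAttentionSketch(nn.Module):
    """Self-attention within each modality, then cross-attention where one modality's
    features attend to the other's, followed by a prediction layer."""

    def __init__(self, tab1_dim, tab2_dim, n_tokens=4, embed_dim=16, out_dim=1):
        super().__init__()
        self.n_tokens, self.embed_dim = n_tokens, embed_dim
        self.embed1 = nn.Linear(tab1_dim, n_tokens * embed_dim)
        self.embed2 = nn.Linear(tab2_dim, n_tokens * embed_dim)
        self.self_attn1 = nn.MultiheadAttention(embed_dim, num_heads=2, batch_first=True)
        self.self_attn2 = nn.MultiheadAttention(embed_dim, num_heads=2, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(embed_dim, num_heads=2, batch_first=True)
        self.head = nn.Linear(n_tokens * embed_dim, out_dim)

    def forward(self, tab1, tab2):
        # Embed each modality into a short sequence of tokens so attention has something to attend over.
        x1 = self.embed1(tab1).view(-1, self.n_tokens, self.embed_dim)
        x2 = self.embed2(tab2).view(-1, self.n_tokens, self.embed_dim)
        x1, _ = self.self_attn1(x1, x1, x1)       # self-attention, modality 1
        x2, _ = self.self_attn2(x2, x2, x2)       # self-attention, modality 2
        crossed, _ = self.cross_attn(x1, x2, x2)  # modality 1 queries attend to modality 2
        return self.head(crossed.flatten(start_dim=1))


prediction = CrossmodalAttentionSketch(10, 15)(torch.randn(4, 10), torch.randn(4, 15))
```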

_images/TabularCrossmodalAttention.png

CrossmodalMultiheadAttention

This tabular-image fusion model works the same as the TabularCrossmodalMultiheadAttention model, except that the tabular modality is passed through a fully-connected layer, and the image modality is passed through a CNN.

_images/CrossmodalMultiheadAttention.png

ImageChannelWiseMultiAttention

This tabular-image model works the same as the TabularChannelWiseMultiAttention model, except that the tabular modality is passed through a fully-connected layer, and the image modality is passed through a CNN.

_images/ImageChannelwiseMultiheadAttention.png

Subspace-based

MCVAE_tab

This subspace-based model uses the Multi-channel Variational Autoencoder (MCVAE) by Antelmi et al. (2019). This model works by passing each tabular modality as a separate ‘channel’ into a VAE with a modified loss function, which is then used to learn a joint latent space for the modalities. The 1-dimensional joint latent space is then passed through a set of fully-connected layers to make a prediction.

For many more examples of multi-modal VAE-based models, I highly recommend looking at the Python library Multi-view-AE by Ana Lawry Aguila et al.

_images/MCVAE.png

ConcatImgLatentTabDoubleLoss

This tabular-image model works by passing the image through a convolutional autoencoder to learn the latent space of the image. The tabular data is concatenated with the image latent space, and the concatenated features are passed through a set of fully-connected layers to make a prediction.

The reconstruction loss of the autoencoder is added to the loss function of the model, to encourage the model to learn a good latent space for the image. This means that the image autoencoder and the prediction model are trained at the same time.
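
The joint training amounts to summing the two losses, roughly along these lines (the tensors below are stand-ins for the model's outputs; the actual loss terms and any weighting in fusilli may differ):

```python
import torch
import torch.nn.functional as F

# Illustrative stand-ins for the model's outputs (in practice these come from the
# prediction head and the image autoencoder's decoder, so they carry gradients).
prediction = torch.randn(4, 1, requires_grad=True)
target = torch.randn(4, 1)
reconstruction = torch.randn(4, 1, 28, 28, requires_grad=True)
image = torch.randn(4, 1, 28, 28)

# One combined objective: the reconstruction term encourages a good image latent
# space while the prediction term is optimised at the same time.
loss = F.mse_loss(prediction, target) + F.mse_loss(reconstruction, image)
loss.backward()  # updates the autoencoder and the prediction layers together
```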

_images/ImgLatentDoubleLoss.png

ConcatImgLatentTabDoubleTrain

Very similar to the ConcatImgLatentTabDoubleLoss model, except that the image autoencoder is trained separately from the prediction model.

_images/ImgLatentDoubleTrain.png

DAETabImgMaps

This tabular-image fusion model is inspired by Zhao et al. (2022): A Multimodal Deep Learning Approach to Predicting Systemic Diseases from Oral Conditions.

The tabular data is input into a denoising autoencoder, which upsamples the tabular data and applies dropout at the beginning of the network to make the model more robust to noise and missing data (simulating a common problem in medical data). The image data is passed through a CNN trained to make a prediction, so that it learns prediction-relevant features from the image. The final two convolutional layers of the CNN are then flattened and concatenated with the upsampled tabular data, and the concatenated features are passed through a set of fully-connected layers to make a prediction.

The denoising autoencoder and the image CNN are trained separately from the prediction model, and the final prediction model is trained on the concatenated features.
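
A minimal sketch of the denoising, upsampling tabular branch (the dimensions and dropout rate are invented for illustration, not fusilli's defaults):

```python
import torch
import torch.nn as nn


class DenoisingUpsamplerSketch(nn.Module):
    """Dropout on the raw tabular input simulates noise/missing values; the
    following layers map the input up to a higher-dimensional representation."""

    def __init__(self, tab_dim, upsampled_dim=64, dropout=0.2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Dropout(dropout),                          # corrupt the input during training
            nn.Linear(tab_dim, upsampled_dim), nn.ReLU(),
            nn.Linear(upsampled_dim, upsampled_dim), nn.ReLU(),
        )

    def forward(self, tab):
        return self.net(tab)


upsampled_tab = DenoisingUpsamplerSketch(tab_dim=10)(torch.randn(4, 10))
# upsampled_tab would then be concatenated with the flattened feature maps from the
# image CNN's final convolutional layers and passed to the prediction layers.
```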

_images/DAETabImgMaps.png

Tensor-based

Incoming!
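
In the meantime, here is a small illustration of the kind of outer-product fusion these models use (the shapes and names are invented; appending a constant 1 to each feature vector is a trick commonly used in tensor-fusion methods to keep the unimodal terms alongside the cross-modal ones):

```python
import torch

tab1_features = torch.randn(4, 8)  # feature maps from the first modality
tab2_features = torch.randn(4, 6)  # feature maps from the second modality

# Appending a constant 1 to each vector keeps the unimodal (intra-modal) terms
# in the outer product as well as the cross-modal interaction terms.
ones = torch.ones(4, 1)
z1 = torch.cat([tab1_features, ones], dim=1)        # (4, 9)
z2 = torch.cat([tab2_features, ones], dim=1)        # (4, 7)

# Batched outer product, flattened so it can feed a fully-connected prediction head.
fusion_tensor = torch.einsum("bi,bj->bij", z1, z2)  # (4, 9, 7)
fused = fusion_tensor.flatten(start_dim=1)          # (4, 63)
```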

Graph-based

Warning

⚠️ It is not possible to use external test set data with graph-based fusion models. Trying to use a “from new data” method such as RealsVsPreds.from_new_data() will result in an error.

EdgeCorrGNN

The graph structure of this tabular-tabular model is made by calculating the correlation between the first tabular modality’s features and using the correlation as the edge weights in a graph. If a correlation is less than a certain threshold (default of 0.8), the edge is removed from the graph. The node features of the graph are the second tabular modality’s features. The graph is then passed through a graph neural network (GNN) to make a prediction.
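
A sketch of the graph construction, assuming nodes correspond to individual subjects (the numbers are illustrative and fusilli's graph maker may differ in detail):

```python
import torch

tab1 = torch.randn(6, 10)  # first tabular modality: 6 subjects, 10 features
tab2 = torch.randn(6, 15)  # second tabular modality: used as the node features

# Pairwise correlations between subjects, computed from the first modality.
corr = torch.corrcoef(tab1)                                   # (6, 6)
threshold = 0.8
keep = (corr >= threshold) & ~torch.eye(6, dtype=torch.bool)  # drop weak edges and self-loops

# Edge list in the COO format used by graph libraries such as PyTorch Geometric.
edge_index = keep.nonzero().t()  # shape (2, num_edges)
edge_weight = corr[keep]
node_features = tab2             # the graph is then passed through a GNN
```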

_images/EdgeCorrGNN.png

AttentionWeightedGNN

This is a model inspired by the method in Bintsi et al. (2023): Multimodal brain age estimation using interpretable adaptive population-graph learning. In the paper, the method is based on adaptive graph learning. However, the fusilli implementation changes this to a static graph, because the adaptive graph learning method is not yet implementable in fusilli.

The attention-weighted GNN works by pretraining a tabular-tabular fusion model, ConcatTabularData(), and taking its “attention weights”: the output of the pretrained model’s final layer that is the same size as the input. These attention weights are multiplied with the concatenated first and second tabular modalities, and the Euclidean distance between each pair of subjects’ attention-weighted features is calculated. If the distance between two subjects is in the lowest 25% of all distances, an edge is created between those two subjects in the graph. The graph is then passed through a GNN to make a prediction.
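
A sketch of the graph construction step (the attention weights here are a random stand-in for the pretrained model's weights, and the sizes are illustrative):

```python
import torch

tab1, tab2 = torch.randn(6, 10), torch.randn(6, 15)  # 6 subjects
attention_weights = torch.rand(25)                    # stand-in for the pretrained model's weights

# Weight the concatenated tabular features, then compute pairwise Euclidean distances.
weighted = torch.cat([tab1, tab2], dim=1) * attention_weights  # (6, 25)
distances = torch.cdist(weighted, weighted)                    # (6, 6)

# Connect pairs of subjects whose distance is in the lowest 25% of all pairwise distances.
off_diagonal = ~torch.eye(6, dtype=torch.bool)
cutoff = torch.quantile(distances[off_diagonal], 0.25)
adjacency = (distances <= cutoff) & off_diagonal
edge_index = adjacency.nonzero().t()  # edges for the GNN, in COO format
```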

_images/AttentionWeightedGNN.png

Unimodal

Tabular1Unimodal

A simple tabular model that uses a fully-connected network with the first tabular modality to make a prediction.

_images/Tabular1Unimodal.png

Tabular2Unimodal

A simple tabular model that uses a fully-connected network with the second tabular modality to make a prediction.

_images/Tabular2Unimodal.png

ImgUnimodal

A simple image model that uses a convolutional neural network (CNN) with the image modality to make a prediction.

_images/ImageUnimodal.png