Fusion Model Guide

Below are explanations and diagrams of the fusion models available in this library. Some of the models are inspired by papers in the literature, so links to the papers are provided where appropriate.

Model fusion diagram

The diagram above shows the categorisation of the fusion models available in this library. This image has been taken from Cui et al. (2023).

The table below shows the categorisation of the models available in this library. It is important to note that some of the methods in this library could be categorised in more than one way. For example, the ConcatImgLatentTabDoubleTrain model can be considered both a subspace-based model and an operation-based model: it uses an autoencoder to learn an image latent space, which is then concatenated with the tabular data, but the concatenation itself is an operation used to combine the modalities. The categorisation of the models is a guide, rather than a strict rule.

| Fusion type | Description |
| --- | --- |
| Unimodal | These models use only one modality (e.g. tabular data or images) to make a prediction. |
| Operation | Operation-based models use operations such as concatenation, element-wise summation, or element-wise multiplication to combine the modalities. These methods are easy to implement and are often used as baselines. |
| Attention | Attention-based models use attention mechanisms to combine the modalities. Attention mechanisms learn a weight for each modality’s importance, which is then used to combine the modalities. |
| Subspace | Subspace-based models try to learn a joint latent space for the modalities. This can be done simply by looking at the correlations between the modalities, or by using a variational autoencoder (VAE) that learns the latent space by reconstructing the modalities from a lower-dimensional representation. |
| Graph | Graph-based models look at the interactions between nodes in a graph, where the edges can be learned from similarities between nodes, for example. |
| Tensor | Tensor-based models use tensor operations, such as outer products, to combine the modalities and capture inter-modal and intra-modal interactions. |

Operation-based

ConcatTabularFeatureMaps

This tabular-tabular fusion model works by passing each tabular modality through its own set of fully-connected layers, and then concatenating the outputs of these layers together. The concatenated features are then passed through another set of fully-connected layers to make a prediction.
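
As a rough illustration of this pattern (not fusilli's actual implementation; the class name, layer sizes, and output dimension below are invented for the example), a minimal PyTorch sketch might look like this:

```python
import torch
import torch.nn as nn


class ConcatFeatureMapsSketch(nn.Module):
    """Each tabular modality gets its own fully-connected branch; the branch
    outputs (feature maps) are concatenated and passed to a prediction head."""

    def __init__(self, tab1_dim, tab2_dim, hidden_dim=32, out_dim=1):
        super().__init__()
        self.tab1_branch = nn.Sequential(nn.Linear(tab1_dim, hidden_dim), nn.ReLU())
        self.tab2_branch = nn.Sequential(nn.Linear(tab2_dim, hidden_dim), nn.ReLU())
        self.head = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, out_dim),
        )

    def forward(self, tab1, tab2):
        # Concatenate the two modality feature maps, then predict from the fused features.
        fused = torch.cat([self.tab1_branch(tab1), self.tab2_branch(tab2)], dim=1)
        return self.head(fused)


# Example: a batch of 4 subjects with 10 and 15 tabular features respectively.
prediction = ConcatFeatureMapsSketch(tab1_dim=10, tab2_dim=15)(torch.randn(4, 10), torch.randn(4, 15))
```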

_images/ConcatTabularFeatureMaps.png

ConcatTabularData

This tabular-tabular fusion model works by concatenating the tabular data together, and then passing the concatenated features through a set of fully-connected layers to make a prediction.

_images/ConcatTabularData.png

TabularDecision

This tabular-tabular fusion model works by passing each tabular modality through its own set of fully-connected layers to make a prediction for each modality. The two predictions are then averaged to make a final prediction. This is known as a “decision-level fusion” method.
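
A minimal sketch of decision-level fusion (illustrative layer sizes and names, not the library's code): each modality network produces its own prediction and the two are averaged.

```python
import torch
import torch.nn as nn


class DecisionFusionSketch(nn.Module):
    """Each tabular modality makes its own prediction; the final output is the mean."""

    def __init__(self, tab1_dim, tab2_dim, out_dim=1):
        super().__init__()
        self.tab1_net = nn.Sequential(nn.Linear(tab1_dim, 32), nn.ReLU(), nn.Linear(32, out_dim))
        self.tab2_net = nn.Sequential(nn.Linear(tab2_dim, 32), nn.ReLU(), nn.Linear(32, out_dim))

    def forward(self, tab1, tab2):
        pred1 = self.tab1_net(tab1)  # prediction from the first tabular modality
        pred2 = self.tab2_net(tab2)  # prediction from the second tabular modality
        return (pred1 + pred2) / 2   # decision-level fusion: average the two predictions


prediction = DecisionFusionSketch(tab1_dim=10, tab2_dim=15)(torch.randn(4, 10), torch.randn(4, 15))
```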

_images/TabularDecision.png

ConcatImageMapsTabularData

This tabular-image fusion model works by passing the image through a convolutional neural network (CNN) to extract features from the image. The tabular data is then concatenated with the image features, and the concatenated features are passed through a set of fully-connected layers to make a prediction.
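
A hedged sketch of the idea (the CNN here is deliberately tiny and the names are made up; fusilli's own layers will differ): the flattened image feature maps are concatenated with the raw tabular data before the prediction layers.

```python
import torch
import torch.nn as nn


class ConcatImageMapsTabularDataSketch(nn.Module):
    """CNN image feature maps are flattened and concatenated with the raw tabular data."""

    def __init__(self, tab_dim, out_dim=1):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(8, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),  # -> (batch, 16)
        )
        self.head = nn.Sequential(nn.Linear(16 + tab_dim, 32), nn.ReLU(), nn.Linear(32, out_dim))

    def forward(self, image, tab):
        image_features = self.cnn(image)  # extract image feature maps
        return self.head(torch.cat([image_features, tab], dim=1))


# Example: a batch of 4 single-channel 28x28 images and 10 tabular features per subject.
prediction = ConcatImageMapsTabularDataSketch(tab_dim=10)(torch.randn(4, 1, 28, 28), torch.randn(4, 10))
```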

_images/ConcatImageMapsTabularData.png

ConcatImageMapsTabularMaps

This tabular-image fusion model works by passing the image through a CNN to extract features from the image. The tabular data is also passed through its own fully-connected layers to get tabular feature maps. The tabular features are then concatenated with the image features, and the concatenated features are passed through a set of fully-connected layers to make a prediction.

_images/ConcatImageMapsTabularMaps.png

ImageDecision

This tabular-image fusion model works by passing each modality through its own network (fully-connected for tabular, CNN for image) to create their own predictions. The two predictions are then averaged to make a final prediction. This is known as a “decision-level fusion” method.

_images/ImageDecision.png

ActivationFusion

This is a tabular-tabular fusion model inspired by Chen et al. (2022): MDFNet: application of multimodal fusion method based on skin image and clinical data to skin cancer classification.

The feature maps from the two separate modality networks are multiplied together, passed through a tanh activation function, then a sigmoid activation function, and then concatenated with the first tabular modality feature map. The concatenated feature maps are then passed through a set of fully-connected layers to make a prediction.
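
As a rough sketch of the fusion step described above (the tensor shapes are illustrative and this is not fusilli's exact implementation):

```python
import torch


def activation_fusion_step(tab1_maps, tab2_maps):
    """Multiply the two modality feature maps element-wise, apply tanh then sigmoid,
    and concatenate the result with the first modality's feature maps."""
    gated = torch.sigmoid(torch.tanh(tab1_maps * tab2_maps))
    return torch.cat([gated, tab1_maps], dim=1)


tab1_maps = torch.randn(4, 32)  # output of the first tabular modality's layers
tab2_maps = torch.randn(4, 32)  # output of the second tabular modality's layers
fused = activation_fusion_step(tab1_maps, tab2_maps)  # (4, 64), then fed to fully-connected layers
```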

Note

A tabular-image version is coming soon!

_images/ActivationFusion.png

AttentionAndActivation

Again, this is a tabular-tabular fusion model inspired by Chen et al. (2022): MDFNet: application of multimodal fusion method based on skin image and clinical data to skin cancer classification.

This model is an extension of ActivationFusion, with a self-attention module added to the second tabular modality’s pipeline before its network layers. The second tabular modality is passed through a fully-connected layer to be downsampled (the downsampling factor can be modified), then a ReLU, an upsampling fully-connected layer, and then a sigmoid activation function. The output is then multiplied by the original second tabular modality input data and passed through its own fully-connected layers. After this, the process is the same as in ActivationFusion.
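
The added self-attention module is essentially a squeeze-and-excitation-style gate. A minimal sketch (the downsampling factor and dimensions here are illustrative, not fusilli's defaults):

```python
import torch
import torch.nn as nn


class SelfAttentionGateSketch(nn.Module):
    """Downsample, ReLU, upsample, sigmoid, then scale the original input by the result."""

    def __init__(self, in_dim, downsample_factor=2):
        super().__init__()
        hidden_dim = max(1, in_dim // downsample_factor)
        self.gate = nn.Sequential(
            nn.Linear(in_dim, hidden_dim), nn.ReLU(),    # downsampling layer
            nn.Linear(hidden_dim, in_dim), nn.Sigmoid(), # upsampling layer + gate
        )

    def forward(self, tab2):
        return tab2 * self.gate(tab2)  # attention-weighted version of the input


weighted_tab2 = SelfAttentionGateSketch(in_dim=15)(torch.randn(4, 15))
# weighted_tab2 is then passed through the second modality's own fully-connected
# layers, and the rest of the model proceeds as in ActivationFusion.
```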

Note

A tabular-image version is coming soon!

_images/ActivationandSelfAttention.png

Attention-based

TabularChannelWiseMultiAttention

This tabular-tabular fusion model works by passing each tabular modality through its own set of fully-connected layers. At each layer, the feature maps from the first tabular modality are multiplied into the feature maps from the second tabular modality, effectively modulating the feature maps from the second modality with the feature maps from the first modality (an attention mechanism). The final second tabular feature maps are then passed through a set of fully-connected layers to make a prediction.

This model is inspired by Duanmu et al. (2020): Deep learning prediction of pathological complete response, residual cancer burden, and progression-free survival in breast cancer patients.
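
A minimal sketch of the channel-wise modulation idea (two layers per modality here, with made-up sizes; not the exact fusilli architecture):

```python
import torch
import torch.nn as nn


class ChannelWiseModulationSketch(nn.Module):
    """At each layer, the first modality's feature maps multiplicatively modulate
    the second modality's feature maps."""

    def __init__(self, tab1_dim, tab2_dim, layer_dims=(32, 16), out_dim=1):
        super().__init__()
        self.tab1_layers = nn.ModuleList()
        self.tab2_layers = nn.ModuleList()
        in1, in2 = tab1_dim, tab2_dim
        for dim in layer_dims:
            self.tab1_layers.append(nn.Sequential(nn.Linear(in1, dim), nn.ReLU()))
            self.tab2_layers.append(nn.Sequential(nn.Linear(in2, dim), nn.ReLU()))
            in1, in2 = dim, dim
        self.head = nn.Linear(layer_dims[-1], out_dim)

    def forward(self, tab1, tab2):
        x1, x2 = tab1, tab2
        for layer1, layer2 in zip(self.tab1_layers, self.tab2_layers):
            x1 = layer1(x1)
            x2 = layer2(x2) * x1  # modality 1 modulates modality 2 (channel-wise attention)
        return self.head(x2)      # prediction from the modulated second modality


prediction = ChannelWiseModulationSketch(10, 15)(torch.randn(4, 10), torch.randn(4, 15))
```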

_images/TabularChannelwiseAttention.png

TabularCrossmodalMultiheadAttention

This tabular-tabular fusion model works by passing each tabular modality through its own set of fully-connected layers. Self attention is applied to each modality, and then crossmodal attention is applied to the two modalities. The output of the crossmodal attention is then passed through a fully-connected layer to make a prediction.

This model is inspired by MADDi, the Multimodal Alzheimer’s Disease Diagnosis framework by Golovanevsky et al. (2022). The authors also have their own code available.
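
A rough sketch of self-attention followed by crossmodal attention, using PyTorch's nn.MultiheadAttention (the token and embedding sizes are invented for illustration; this is neither the MADDi authors' code nor fusilli's exact implementation):

```python
import torch
import torch.nn as nn


class CrossmodalAttentionSketch(nn.Module):
    """Self-attention within each modality, then cross-attention where one modality's
    features attend to the other's, followed by a prediction layer."""

    def __init__(self, tab1_dim, tab2_dim, n_tokens=4, embed_dim=16, out_dim=1):
        super().__init__()
        self.n_tokens, self.embed_dim = n_tokens, embed_dim
        self.embed1 = nn.Linear(tab1_dim, n_tokens * embed_dim)
        self.embed2 = nn.Linear(tab2_dim, n_tokens * embed_dim)
        self.self_attn1 = nn.MultiheadAttention(embed_dim, num_heads=2, batch_first=True)
        self.self_attn2 = nn.MultiheadAttention(embed_dim, num_heads=2, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(embed_dim, num_heads=2, batch_first=True)
        self.head = nn.Linear(n_tokens * embed_dim, out_dim)

    def forward(self, tab1, tab2):
        # Embed each modality into a short sequence of tokens so attention has something to attend over.
        x1 = self.embed1(tab1).view(-1, self.n_tokens, self.embed_dim)
        x2 = self.embed2(tab2).view(-1, self.n_tokens, self.embed_dim)
        x1, _ = self.self_attn1(x1, x1, x1)       # self-attention, modality 1
        x2, _ = self.self_attn2(x2, x2, x2)       # self-attention, modality 2
        crossed, _ = self.cross_attn(x1, x2, x2)  # modality 1 queries attend to modality 2
        return self.head(crossed.flatten(start_dim=1))


prediction = CrossmodalAttentionSketch(10, 15)(torch.randn(4, 10), torch.randn(4, 15))
```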

_images/TabularCrossmodalAttention.png

CrossmodalMultiheadAttention

This tabular-image fusion model works the same as the TabularCrossmodalMultiheadAttention model, except that the tabular modality is passed through a fully-connected layer, and the image modality is passed through a CNN.

_images/CrossmodalMultiheadAttention.png

ImageChannelWiseMultiAttention

This tabular-image model works the same as the TabularChannelWiseMultiAttention model, except that the tabular modality is passed through a fully-connected layer, and the image modality is passed through a CNN.

_images/ImageChannelwiseMultiheadAttention.png

Subspace-based

MCVAE_tab

This subspace-based model uses the Multi-channel Variational Autoencoder (MCVAE) by Antelmi et al. (2019). This model works by passing each tabular modality as a separate ‘channel’ into a VAE with a modified loss function, which is then used to learn a joint latent space for the modalities. The 1-dimensional joint latent space is then passed through a set of fully-connected layers to make a prediction.

For many more examples of multi-modal VAE-based models, I highly recommend looking at the Python library Multi-view-AE by Ana Lawry Aguila et al.

_images/MCVAE.png

ConcatImgLatentTabDoubleLoss

This tabular-image model works by passing the image through a convolutional autoencoder to learn the latent space of the image. The tabular data is concatenated with the image latent space, and the concatenated features are passed through a set of fully-connected layers to make a prediction.

The reconstruction loss of the autoencoder is added to the loss function of the model, to encourage the model to learn a good latent space for the image. This means that the image autoencoder and the prediction model are trained at the same time.
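
The joint training amounts to summing the two losses, roughly along these lines (the tensors below are stand-ins for the model's outputs; the actual loss terms and any weighting in fusilli may differ):

```python
import torch
import torch.nn.functional as F

# Illustrative stand-ins for the model's outputs (in practice these come from the
# prediction head and the image autoencoder's decoder, so they carry gradients).
prediction = torch.randn(4, 1, requires_grad=True)
target = torch.randn(4, 1)
reconstruction = torch.randn(4, 1, 28, 28, requires_grad=True)
image = torch.randn(4, 1, 28, 28)

# One combined objective: the reconstruction term encourages a good image latent
# space while the prediction term is optimised at the same time.
loss = F.mse_loss(prediction, target) + F.mse_loss(reconstruction, image)
loss.backward()  # updates the autoencoder and the prediction layers together
```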

_images/ImgLatentDoubleLoss.png

ConcatImgLatentTabDoubleTrain

Very similar to the ConcatImgLatentTabDoubleLoss model, except that the image autoencoder is trained separately from the prediction model.

_images/ImgLatentDoubleTrain.png

DAETabImgMaps

This tabular-image fusion model is inspired by Zhao et al. (2022): A Multimodal Deep Learning Approach to Predicting Systemic Diseases from Oral Conditions.

The tabular data is input into a denoising autoencoder, which upsamples the tabular data and applies dropout at the beginning of the network to make the model more robust to noise and missing data (simulating a common problem in medical data). The image data is passed through a CNN trained to make a prediction, so that it learns prediction-relevant features from the image. The final two convolutional layers of the CNN are then flattened and concatenated with the upsampled tabular data, and the concatenated features are passed through a set of fully-connected layers to make a prediction.

The denoising autoencoder and the image CNN are trained separately from the prediction model, and the final prediction model is trained on the concatenated features.
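
A minimal sketch of the denoising, upsampling tabular branch (the dimensions and dropout rate are invented for illustration, not fusilli's defaults):

```python
import torch
import torch.nn as nn


class DenoisingUpsamplerSketch(nn.Module):
    """Dropout on the raw tabular input simulates noise/missing values; the
    following layers map the input up to a higher-dimensional representation."""

    def __init__(self, tab_dim, upsampled_dim=64, dropout=0.2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Dropout(dropout),                          # corrupt the input during training
            nn.Linear(tab_dim, upsampled_dim), nn.ReLU(),
            nn.Linear(upsampled_dim, upsampled_dim), nn.ReLU(),
        )

    def forward(self, tab):
        return self.net(tab)


upsampled_tab = DenoisingUpsamplerSketch(tab_dim=10)(torch.randn(4, 10))
# upsampled_tab would then be concatenated with the flattened feature maps from the
# image CNN's final convolutional layers and passed to the prediction layers.
```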

_images/DAETabImgMaps.png

Tensor-based

Incoming!
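
In the meantime, here is a small illustration of the kind of outer-product fusion these models use (the shapes and names are invented; appending a constant 1 to each feature vector is a trick commonly used in tensor-fusion methods to keep the unimodal terms alongside the cross-modal ones):

```python
import torch

tab1_features = torch.randn(4, 8)  # feature maps from the first modality
tab2_features = torch.randn(4, 6)  # feature maps from the second modality

# Appending a constant 1 to each vector keeps the unimodal (intra-modal) terms
# in the outer product as well as the cross-modal interaction terms.
ones = torch.ones(4, 1)
z1 = torch.cat([tab1_features, ones], dim=1)        # (4, 9)
z2 = torch.cat([tab2_features, ones], dim=1)        # (4, 7)

# Batched outer product, flattened so it can feed a fully-connected prediction head.
fusion_tensor = torch.einsum("bi,bj->bij", z1, z2)  # (4, 9, 7)
fused = fusion_tensor.flatten(start_dim=1)          # (4, 63)
```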

Graph-based

Warning

⚠️ It is not possible to use external test set data with graph-based fusion models. Trying to use a “from new data” method such as RealsVsPreds.from_new_data() will result in an error.

EdgeCorrGNN

The graph structure of this tabular-tabular model is made by calculating the correlation between the first tabular modality’s features and using the correlation as the edge weights in a graph. If a correlation is less than a certain threshold (default of 0.8), the edge is removed from the graph. The node features of the graph are the second tabular modality’s features. The graph is then passed through a graph neural network (GNN) to make a prediction.
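
A sketch of the graph construction, assuming nodes correspond to individual subjects (the numbers are illustrative and fusilli's graph maker may differ in detail):

```python
import torch

tab1 = torch.randn(6, 10)  # first tabular modality: 6 subjects, 10 features
tab2 = torch.randn(6, 15)  # second tabular modality: used as the node features

# Pairwise correlations between subjects, computed from the first modality.
corr = torch.corrcoef(tab1)                                   # (6, 6)
threshold = 0.8
keep = (corr >= threshold) & ~torch.eye(6, dtype=torch.bool)  # drop weak edges and self-loops

# Edge list in the COO format used by graph libraries such as PyTorch Geometric.
edge_index = keep.nonzero().t()  # shape (2, num_edges)
edge_weight = corr[keep]
node_features = tab2             # the graph is then passed through a GNN
```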

_images/EdgeCorrGNN.png

AttentionWeightedGNN

This is a model inspired by the method in Bintsi et al. (2023): Multimodal brain age estimation using interpretable adaptive population-graph learning. In the paper, the method is based on adaptive graph learning. However, the fusilli implementation changes this to a static graph, because the adaptive graph learning method is not yet implementable in fusilli.

The attention-weighted GNN works by pretraining a tabular-tabular fusion model, ConcatTabularData(), and taking its “attention weights”: the output of the pretrained model’s final layer that is the same size as the input. These attention weights are multiplied with the concatenated first and second tabular modalities, and the Euclidean distance between each pair of subjects’ attention-weighted features is calculated. If the distance between two subjects is in the lowest 25% of all distances, an edge is created between those two subjects in the graph. The graph is then passed through a GNN to make a prediction.
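
A sketch of the graph construction step (the attention weights here are a random stand-in for the pretrained model's weights, and the sizes are illustrative):

```python
import torch

tab1, tab2 = torch.randn(6, 10), torch.randn(6, 15)  # 6 subjects
attention_weights = torch.rand(25)                    # stand-in for the pretrained model's weights

# Weight the concatenated tabular features, then compute pairwise Euclidean distances.
weighted = torch.cat([tab1, tab2], dim=1) * attention_weights  # (6, 25)
distances = torch.cdist(weighted, weighted)                    # (6, 6)

# Connect pairs of subjects whose distance is in the lowest 25% of all pairwise distances.
off_diagonal = ~torch.eye(6, dtype=torch.bool)
cutoff = torch.quantile(distances[off_diagonal], 0.25)
adjacency = (distances <= cutoff) & off_diagonal
edge_index = adjacency.nonzero().t()  # edges for the GNN, in COO format
```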

_images/AttentionWeightedGNN.png

Unimodal

Tabular1Unimodal

A simple tabular model that uses a fully-connected network with the first tabular modality to make a prediction.

_images/Tabular1Unimodal.png

Tabular2Unimodal

A simple tabular model that uses a fully-connected network with the second tabular modality to make a prediction.

_images/Tabular2Unimodal.png

ImgUnimodal

A simple image model that uses a convolutional neural network (CNN) with the image modality to make a prediction.

_images/ImageUnimodal.png