Learning Unseen Modality Interaction

Yunhua Zhang
Hazel Doughty
Cees G.M. Snoek

VIS Lab, University of Amsterdam


The connected squares represent a sample, with each color indicating a different modality. Our goal is to learn from a modality-incomplete training set to make predictions for unseen modality combinations during inference.


Multimodal learning assumes all modality combinations of interest are available during training to learn cross-modal correspondences. In this paper, we challenge this modality-complete assumption for multimodal learning and instead strive for generalization to unseen modality combinations during inference. We pose the problem of unseen modality interaction and introduce a first solution. It exploits a feature projection module to project the multidimensional features of different modalities into a common space with rich information preserved. This allows the information to be accumulated with a simple summation operation across available modalities. To reduce overfitting to unreliable modality combinations during training, we further improve the model learning with pseudo-supervision indicating the reliability of a modality's prediction. We demonstrate that our approach is effective for diverse tasks and modalities by evaluating it for multimodal video classification, robot state regression, and multimedia retrieval.
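The idea of projecting each modality into a common space and summing over whichever modalities are present can be illustrated with a small sketch. This is not the paper's implementation: the fixed linear maps stand in for the learned feature projection module, and the names `project`, `fuse`, and the toy `projections` table are hypothetical, chosen only for illustration.

```python
from typing import Dict, List

Vector = List[float]
Matrix = List[Vector]  # shape: (common_dim, input_dim)


def project(weights: Matrix, features: Vector) -> Vector:
    """Linearly map one modality's feature vector into the common space."""
    return [sum(w * x for w, x in zip(row, features)) for row in weights]


def fuse(projections: Dict[str, Matrix], sample: Dict[str, Vector]) -> Vector:
    """Sum the projected features of whichever modalities the sample contains."""
    if not sample:
        raise ValueError("sample must contain at least one modality")
    common_dim = len(next(iter(projections.values())))
    fused = [0.0] * common_dim
    for name, features in sample.items():
        projected = project(projections[name], features)
        fused = [a + b for a, b in zip(fused, projected)]
    return fused


# Assumed toy setup: video features are 3-D, audio features are 2-D,
# and the common space is 2-D. A real model would learn these weights.
projections = {
    "video": [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]],
    "audio": [[0.5, 0.0], [0.0, 2.0]],
}

# Modality combinations that might appear during training.
video_only = fuse(projections, {"video": [1.0, 2.0, 3.0]})
audio_only = fuse(projections, {"audio": [4.0, 1.0]})

# A combination never seen together during training.
both = fuse(projections, {"video": [1.0, 2.0, 3.0], "audio": [4.0, 1.0]})
```

Because every modality lands in the same space, the fused vector has the same size for any subset of modalities, so a single downstream predictor can in principle handle combinations it never saw during training.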

Paper and Supplementary Material

Yunhua Zhang, Hazel Doughty, Cees G.M. Snoek
Learning Unseen Modality Interaction
In NeurIPS, 2023.
(hosted on arXiv)



This work is financially supported by the Inception Institute of Artificial Intelligence, the University of Amsterdam and the allowance Top consortia for Knowledge and Innovation (TKIs) from the Netherlands Ministry of Economic Affairs and Climate Policy.
