The connected squares represent a sample, with each color indicating a different modality. Our goal is to learn from a modality-incomplete training set to make predictions for unseen modality combinations during inference.
Multimodal learning assumes all modality combinations of interest are available during training to learn cross-modal correspondences. In this paper, we challenge this modality-complete assumption for multimodal learning and instead strive for generalization to unseen modality combinations during inference. We pose the problem of unseen modality interaction and introduce a first solution. It exploits a feature projection module to project the multidimensional features of different modalities into a common space with rich information preserved. This allows the information to be accumulated with a simple summation operation across available modalities. To reduce overfitting to unreliable modality combinations during training, we further improve model learning with pseudo-supervision indicating the reliability of a modality's prediction. We demonstrate that our approach is effective for diverse tasks and modalities by evaluating it for multimodal video classification, robot state regression, and multimedia retrieval.
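To make the projection-and-sum idea concrete, below is a minimal PyTorch sketch, not the paper's implementation: the module name, feature dimensions, and the choice of a linear layer with layer normalization as the per-modality projection are illustrative assumptions. It only shows how features of different sizes can be mapped into a common space and fused by summation over whichever modalities are available.

```python
import torch
import torch.nn as nn

class FeatureProjectionFusion(nn.Module):
    """Sketch: project each modality into a shared space, then sum over
    the modalities that are present for a sample (hypothetical design)."""

    def __init__(self, modality_dims: dict, common_dim: int = 512):
        super().__init__()
        # One projection head per modality, mapping its feature dimension
        # to the shared common space.
        self.projections = nn.ModuleDict({
            name: nn.Sequential(nn.Linear(dim, common_dim), nn.LayerNorm(common_dim))
            for name, dim in modality_dims.items()
        })

    def forward(self, features: dict) -> torch.Tensor:
        # `features` maps modality name -> tensor of shape (batch, dim);
        # missing modalities are simply absent from the dict.
        projected = [self.projections[name](x) for name, x in features.items()]
        # Accumulate information across available modalities by summation,
        # so the fused representation has the same form for any combination.
        return torch.stack(projected, dim=0).sum(dim=0)

# Usage: e.g. train with video+audio, then infer with audio alone.
fusion = FeatureProjectionFusion({"video": 1024, "audio": 128})
fused = fusion({"audio": torch.randn(4, 128)})  # unseen combination at inference
print(fused.shape)  # torch.Size([4, 512])
```

Because every modality lands in the same space and fusion is a plain sum, the same model applies to any subset of modalities, which is what enables prediction for combinations never seen during training.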
Yunhua Zhang, Hazel Doughty, Cees G.M. Snoek. Learning Unseen Modality Interaction. In NeurIPS, 2023. (hosted on ArXiv) [Bibtex]
Acknowledgements
This website template was originally made by Phillip Isola and Richard Zhang for a colorful ECCV project; the code can be found here.