MIRROR: Multi-Modal Pathological Self-Supervised Representation Learning via Modality Alignment and Retention

1The University of Sydney; 2Northwestern Polytechnical University; 3University of Maryland

Abstract

Histopathology and transcriptomics are fundamental modalities in cancer diagnostics, encapsulating the morphological and molecular characteristics of the disease. Multi-modal self-supervised learning has demonstrated remarkable potential in learning pathological representations by integrating diverse data sources. Conventional multi-modal integration methods primarily emphasize modality alignment, while paying insufficient attention to retaining modality-specific intrinsic structures. However, unlike conventional scenarios where multi-modal inputs often share highly overlapping features, histopathology and transcriptomics exhibit pronounced heterogeneity, offering orthogonal yet complementary insights. Histopathology data provides morphological and spatial context, elucidating tissue architecture and cellular topology, whereas transcriptomics data delineates molecular signatures by quantifying gene expression patterns. This inherent disparity introduces a major challenge in aligning these modalities while maintaining modality-specific fidelity. To address these challenges, we present MIRROR, a novel multi-modal representation learning framework designed to foster both modality alignment and retention. MIRROR employs dedicated encoders to extract comprehensive feature representations for each modality, which are further complemented by a modality alignment module to achieve seamless integration between phenotype patterns and molecular profiles. Furthermore, a modality retention module safeguards unique attributes from each modality, while a style clustering module mitigates redundancy and enhances disease-relevant information by modeling and aligning consistent pathological signatures within a clustering space.
Extensive evaluations on The Cancer Genome Atlas (TCGA) cohorts for cancer subtyping and survival analysis highlight MIRROR's superior performance, demonstrating its effectiveness in constructing comprehensive oncological feature representations and benefiting cancer diagnosis.

Introduction

Introduction Image

Unlike conventional methods, which primarily emphasize capturing modality-shared information, pay limited attention to modality-specific intrinsic structures, and indiscriminately learn both disease-relevant and irrelevant data with high redundancy, MIRROR is specifically designed to balance modality alignment and retention. By selectively preserving only disease-relevant features, it effectively mitigates redundancy, thereby enhancing the model's efficiency and representational capability.

Highlights

The key contributions of this study are outlined as follows:

  • MIRROR, a novel multi-modal self-supervised learning (SSL) model, is designed to facilitate both modality alignment and retention, enabling the effective preservation of both modality-shared and modality-specific information.
  • A consistent pathological style-based clustering mechanism is introduced to preserve disease-relevant information while mitigating redundancy.
  • A novel preprocessing pipeline for transcriptomics data is proposed, integrating machine learning-driven feature selection with biological knowledge to create refined transcriptomics datasets.
  • Comprehensive evaluations are conducted across diverse cohorts from the TCGA dataset, focusing on cancer subtyping and survival analysis tasks, substantiating the superior performance and effectiveness of the proposed approach.

Method

Architecture Image

Whole Slide Images (WSIs) are first partitioned into patches, which are processed through a pre-trained patch encoder to extract patch-level feature representations. These features are subsequently aggregated by the slide encoder to encapsulate slide-level characteristics into a [CLS] token while projecting patch embeddings into the shared pathological latent space.
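The aggregation step above can be sketched as a small transformer that lets a learnable [CLS] token attend over pre-extracted patch embeddings; the dimensions, depth, and projection head here are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class SlideEncoder(nn.Module):
    """Illustrative slide-level aggregator: a learnable [CLS] token is
    prepended to patch embeddings, and transformer layers mix them.
    All hyperparameters (dim, heads, depth) are assumptions."""
    def __init__(self, dim=256, heads=4, depth=2):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.proj = nn.Linear(dim, dim)  # map patches into the shared latent space

    def forward(self, patch_feats):               # (B, N, dim)
        B = patch_feats.size(0)
        cls = self.cls_token.expand(B, -1, -1)    # one [CLS] per slide
        out = self.encoder(torch.cat([cls, patch_feats], dim=1))
        # slide-level token, projected patch tokens
        return out[:, 0], self.proj(out[:, 1:])

feats = torch.randn(2, 100, 256)                  # 2 slides x 100 patches
slide_repr, patch_repr = SlideEncoder()(feats)
```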

Transcriptomics data are preprocessed using Recursive Feature Elimination (RFE) and manual selection to identify highly disease-relevant genes. The refined transcriptomic features are then embedded into a compact representation and mapped into the shared latent space via an RNA encoder.
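A minimal sketch of the RFE step on synthetic data; the base estimator, gene count, and elimination step size are assumptions, not the pipeline's actual settings.

```python
# RFE-based gene selection sketch: recursively drop the least-informative
# features according to a fitted estimator's coefficients.
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 500))              # 200 samples x 500 genes
y = (X[:, :5].sum(axis=1) > 0).astype(int)   # labels driven by 5 genes

selector = RFE(LogisticRegression(max_iter=1000),
               n_features_to_select=50,      # keep 50 candidate genes
               step=0.2)                     # drop 20% of features per round
selector.fit(X, y)
selected = np.flatnonzero(selector.support_)  # indices of retained genes
```

In the actual pipeline, the machine-selected gene set would then be refined by manual, biology-informed curation.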

An alignment module for each modality aligns representations across modalities, guided by the alignment loss (Lalign). Meanwhile, modality-specific retention modules utilize perturbed inputs from both encoded patch and transcriptomics features to capture modality-specific intrinsic structures, contributing to the retention loss (Lretention).
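The two objectives can be sketched as follows: a symmetric contrastive loss for alignment and a reconstruction loss from perturbed features for retention. Both are common formulations chosen for illustration; the paper's exact losses may differ.

```python
import torch
import torch.nn.functional as F

def alignment_loss(z_img, z_rna, tau=0.07):
    """Symmetric InfoNCE over paired slide/RNA embeddings: matched pairs
    sit on the diagonal of the similarity matrix."""
    z_img = F.normalize(z_img, dim=-1)
    z_rna = F.normalize(z_rna, dim=-1)
    logits = z_img @ z_rna.t() / tau
    targets = torch.arange(z_img.size(0))
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

def retention_loss(decoder, z_clean, z_perturbed):
    """Reconstruct clean features from perturbed inputs so that
    modality-specific structure survives alignment."""
    return F.mse_loss(decoder(z_perturbed), z_clean)

z_img, z_rna = torch.randn(8, 128), torch.randn(8, 128)
l_align = alignment_loss(z_img, z_rna)
decoder = torch.nn.Linear(128, 128)           # placeholder retention decoder
l_ret = retention_loss(decoder, z_img, z_img + 0.1 * torch.randn(8, 128))
```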

Finally, both slide and transcriptomics representations are processed through a style clustering module to learn and compare their pathological styles against learnable cluster centers. The clustering loss (Lcluster) is used to align consistent pathological styles within the cluster space.
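One way to realize this step is to softly assign each modality's representation to learnable cluster centers and match the two assignment distributions; the similarity measure, temperature, and KL objective here are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def style_assignments(z, centers, tau=0.1):
    """Soft assignment of embeddings to learnable style cluster centers
    via temperature-scaled cosine similarity."""
    z = F.normalize(z, dim=-1)
    c = F.normalize(centers, dim=-1)
    return F.softmax(z @ c.t() / tau, dim=-1)   # (B, num_clusters)

def cluster_loss(z_slide, z_rna, centers):
    """Encourage paired slide/RNA features to share the same pathological
    style by matching their cluster-assignment distributions (KL)."""
    p = style_assignments(z_slide, centers)
    q = style_assignments(z_rna, centers)
    return F.kl_div(q.log(), p, reduction='batchmean')

centers = torch.nn.Parameter(torch.randn(16, 128))  # 16 learnable style centers
loss = cluster_loss(torch.randn(4, 128), torch.randn(4, 128), centers)
```

Because the centers are learnable parameters, gradients from this loss shape the cluster space itself, pulling consistent pathological styles from both modalities toward shared centers.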

Qualitative Analysis

Datasets

We provide the processed transcriptomics data on Kaggle, Hugging Face, and Zenodo for TCGA-BRCA, TCGA-NSCLC, TCGA-COADREAD, and TCGA-RCC.

BibTeX

@misc{wang2025mirrormultimodalpathologicalselfsupervised,
 title={MIRROR: Multi-Modal Pathological Self-Supervised Representation Learning via Modality Alignment and Retention},
 author={Tianyi Wang and Jianan Fan and Dingxin Zhang and Dongnan Liu and Yong Xia and Heng Huang and Weidong Cai},
 year={2025},
 eprint={2503.00374},
 archivePrefix={arXiv},
 primaryClass={cs.CV},
 url={https://arxiv.org/abs/2503.00374},
}