Discrete Diffusion Models with MLLMs for Unified Medical Multimodal Generation

Abstract

Advances in generative medical models are often constrained by modality-specific scenarios that hinder the integration of complementary evidence such as imaging, pathology, and clinical notes. This fragmentation prevents them from developing into true foundation models that empower medical AI agents to learn from and predict across the full spectrum of biomedical knowledge. To address these challenges, we propose MeDiM, the first medical discrete diffusion model that learns shared distributions across medical modalities without requiring modality-specific components. MeDiM unifies multiple generative tasks: it flexibly translates between images and text, or jointly produces image–report pairs across domains in response to user prompts. It builds on a discrete diffusion framework that unifies vision and language by modeling their shared probabilistic distribution. To enable unified and versatile medical generation, we employ a multimodal large language model (MLLM) as the diffusion backbone, leveraging its rich prior knowledge and cross-modal reasoning abilities. Because MLLMs are trained with causal (autoregressive) masking while diffusion denoising benefits from bidirectional context, MeDiM introduces two key designs: (1) removing the causal attention mask to enable fully bidirectional information flow, essential for mutual alignment, and (2) injecting continuous timestep embeddings to make the MLLM aware of the diffusion step. Extensive experiments validate MeDiM as a unified foundation model capable of high-fidelity medical generation across domains. It achieves high-quality generation on various tasks, including medical image generation (16.60 FID on MIMIC-CXR; 24.19 FID on PathGen) and report generation (0.2650 METEOR on MIMIC-CXR; 0.2580 METEOR on PathGen). In addition, the jointly generated image–report pairs improve downstream performance (+6.43% BLEU-1, +18.57% BLEU-2, +31.58% BLEU-3, and +4.80% METEOR on PathGen), showing that MeDiM can take multimodal inputs and generate coherent, clinically grounded multimodal outputs.
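To make the discrete diffusion formulation above concrete, the following is a minimal PyTorch sketch of an absorbing-state (mask-token) forward corruption over a joint image/text token sequence. The function name forward_diffuse, the linear masking schedule, and the codebook/mask-token IDs are illustrative assumptions for this sketch, not MeDiM's actual implementation.

import torch

def forward_diffuse(tokens: torch.Tensor, t: torch.Tensor, num_steps: int,
                    mask_token_id: int) -> torch.Tensor:
    """Corrupt a batch of discrete token sequences at diffusion step t.

    Hypothetical absorbing-state schedule: each token is independently
    replaced by the [MASK] token with probability t / num_steps.
    """
    p_mask = (t.float() / num_steps).unsqueeze(-1)                 # (B, 1)
    mask = torch.rand(tokens.shape) < p_mask                       # (B, L)
    return torch.where(mask, torch.full_like(tokens, mask_token_id), tokens)

# Example: 2 sequences of 16 tokens from an assumed 8192-entry shared codebook
# (image codes and text tokens share one discrete vocabulary in this sketch).
tokens = torch.randint(0, 8192, (2, 16))
t = torch.randint(1, 1000, (2,))                                   # sampled timesteps
noisy = forward_diffuse(tokens, t, num_steps=1000, mask_token_id=8192)

The reverse process then amounts to predicting the original tokens at the corrupted positions, which is what the MLLM backbone described below is trained to do.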

Framework

Overview of the MeDiM architecture. The framework integrates an MLLM backbone within a discrete diffusion process for unified medical multimodal generation. During the forward process, inputs are tokenized and progressively corrupted over diffusion timesteps; the MLLM is then trained to reverse this process. Key architectural changes, including causal attention removal, timestep embeddings, and AdaLN, adapt the autoregressive MLLM to the bidirectional denoising required for unified medical generation.
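For a code-level picture of those adaptations, the block below is a minimal, self-contained PyTorch sketch rather than the authors' released code: a transformer block whose attention call omits any causal mask, so information flows bidirectionally, and whose layer norms receive AdaLN-style scale/shift parameters produced from a sinusoidal timestep embedding. The class name AdaLNDenoiserBlock, the layer sizes, and the fresh initialization are assumptions made to keep the example runnable; a real MLLM backbone would reuse pretrained weights.

import math
import torch
import torch.nn as nn

class AdaLNDenoiserBlock(nn.Module):
    """One transformer block adapted for bidirectional diffusion denoising."""

    def __init__(self, dim: int, heads: int):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))
        # AdaLN: map the timestep embedding to per-block scales and shifts.
        self.ada = nn.Linear(dim, 4 * dim)

    def forward(self, x: torch.Tensor, t_emb: torch.Tensor) -> torch.Tensor:
        scale1, shift1, scale2, shift2 = (
            self.ada(t_emb).unsqueeze(1).chunk(4, dim=-1))
        h = self.norm1(x) * (1 + scale1) + shift1
        # No attn_mask / is_causal argument: attention is fully bidirectional.
        h, _ = self.attn(h, h, h, need_weights=False)
        x = x + h
        h = self.norm2(x) * (1 + scale2) + shift2
        return x + self.mlp(h)

def timestep_embedding(t: torch.Tensor, dim: int) -> torch.Tensor:
    """Sinusoidal embedding of (continuous) diffusion timesteps."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half) / half)
    args = t.float().unsqueeze(-1) * freqs
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)

# Example: denoise hidden states of 16 tokens at two different timesteps.
block = AdaLNDenoiserBlock(dim=64, heads=4)
x = torch.randn(2, 16, 64)
t_emb = timestep_embedding(torch.tensor([500, 120]), 64)
out = block(x, t_emb)   # (2, 16, 64)

The design choice mirrored here is that removing the causal mask lets every image and text token attend to every other token during denoising, while the AdaLN conditioning tells the backbone which diffusion step it is reversing.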

Comparison

Visual comparison of MeDiM against baselines on three tasks: (A) medical image generation (unique colors indicate alignment between the reference report and the images generated by MeDiM), (B) medical report generation (the generated report and the reference are highlighted in the same colors for matched content, while incorrect content is marked with red underlines), and (C) joint medical image–report pair generation (the generated report and the prompt are highlighted in the same colors for matched content, with green underlines denoting additional correct content consistent with the image and red underlines marking incorrect content).

Pathology Image-Report Pair Generation

Chest X-Ray Image-Report Pair Generation

BibTeX

If you find our work helpful for your research, please consider giving a citation 📃


@misc{mao2025discretediffusionmodelsmllms,
      title={Discrete Diffusion Models with MLLMs for Unified Medical Multimodal Generation}, 
      author={Jiawei Mao and Yuhan Wang and Lifeng Chen and Can Zhao and Yucheng Tang and Dong Yang and Liangqiong Qu and Daguang Xu and Yuyin Zhou},
      year={2025},
      eprint={2510.06131},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2510.06131}, 
}