TY - GEN
T1 - Enhancing Medical Vision-Language Models with Rich Textual Descriptions and Multiple Alignments for Chest X-Ray Diagnosis
AU - Ibrahim, Youssef
AU - Sohail, Anabia
AU - Javed, Sajid
AU - AlMarzouqi, Hasan
AU - Deriche, Mohamed
AU - Werghi, Naoufel
N1 - Publisher Copyright: ©2025 IEEE.
PY - 2025
Y1 - 2025
N2 - Vision-Language models (VLMs) integrate natural language understanding with visual data interpretation, crucial in diverse applications such as medical imaging. However, training VLMs on limited data, especially in radiology, remains a challenge. We propose a strategy to improve dual encoder performance under data constraints. Using contrastive learning to align visual and textual embeddings effectively, we generated a bag of rich textual descriptions using GPT-4 to augment merged information from esteemed medical resources and pre-trained BiomedCLIP. These rich textual descriptions provide in-depth information on disease visual description, major causes, and major symptoms, enhancing the model’s contextual understanding and classification accuracy. Unlike previous methods relying on a single alignment, our multiple alignment strategy associates multiple images with multiple textual descriptions per disease class while capping descriptors to maintain computational efficiency. Adapting the vision encoder for chest X-ray classification, our approach achieves competitive accuracy with fewer training pairs, highlighting its potential for data-limited domains.
KW - Chest X-ray
KW - Contrastive Learning
KW - Multiple Alignment
KW - Rich Textual Descriptions
KW - Vision-Language Model
UR - https://www.scopus.com/pages/publications/105028644857
U2 - 10.1109/ICIP55913.2025.11084572
DO - 10.1109/ICIP55913.2025.11084572
M3 - Conference contribution
AN - SCOPUS:105028644857
T3 - Proceedings - International Conference on Image Processing, ICIP
SP - 2079
EP - 2084
BT - 2025 IEEE International Conference on Image Processing, ICIP 2025 - Proceedings
PB - IEEE Computer Society
T2 - 32nd IEEE International Conference on Image Processing, ICIP 2025
Y2 - 14 September 2025 through 17 September 2025
ER -