
Multimodal Multisource Neural Machine Translation: Building Resources for Image Caption Translation from European Languages into Arabic

  • Jordan University of Science and Technology

Research output: Contribution to journal › Article › peer-review

1 Scopus citation

Abstract

Neural machine translation (NMT) models that combine textual and visual inputs generate more accurate translations than unimodal models. Moreover, translation models with an under-resourced target language benefit from multisource inputs (source sentences provided in different languages). Building MultiModal MultiSource NMT (M3S-NMT) systems requires significant effort to curate datasets suitable for such a multifaceted task. This work uses image caption translation as an example of multimodal translation and presents a novel public dataset for translating captions from multiple European languages (viz., English, German, French, and Czech) into Arabic, a distant and under-resourced language. Moreover, it presents multitask learning models trained and tested on this dataset to serve as solid baselines for further research in this area. These models involve two parts: one learns visual representations of the input images, and the other translates the textual input based on these representations. The translations are produced by a framework of attention-based encoder–decoder architectures. The visual features are learned from a pretrained convolutional neural network (CNN). These features are then integrated with textual features learned through basic yet well-known recurrent neural networks (RNNs) with GloVe or BERT word embeddings. Despite the challenges associated with the task, the results of these systems are very promising, reaching METEOR scores of 34.57 and 42.52.
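The abstract does not say how the visual and textual features are fused, so the following is only a minimal sketch of one common pattern, not the authors' method: at each decoding step, a decoder query attends separately over the encoder states pooled from all source languages and over the image feature vectors, and the two context vectors are concatenated. The function names, the dot-product scoring, the concatenation step, and the toy two-dimensional vectors are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def attend(query, states):
    """Dot-product attention: returns (weights, context vector)."""
    weights = softmax([dot(query, s) for s in states])
    dim = len(states[0])
    context = [sum(w * s[i] for w, s in zip(weights, states)) for i in range(dim)]
    return weights, context


def multimodal_multisource_context(query, source_states, image_features):
    """Hypothetical fusion step (an assumption, not taken from the paper).

    source_states: dict mapping a language code to its list of encoder states.
    image_features: list of CNN feature vectors, assumed already projected
    to the same dimensionality as the query.
    """
    # Pool encoder states from every source language into one attention memory.
    text_states = [s for states in source_states.values() for s in states]
    _, text_context = attend(query, text_states)
    _, image_context = attend(query, image_features)
    # Concatenate the textual and visual contexts for the decoder.
    return text_context + image_context


# Toy inputs: two source languages and two image-region vectors.
query = [1.0, 0.0]
source_states = {"en": [[1.0, 0.0], [0.0, 1.0]], "de": [[0.5, 0.5]]}
image_features = [[0.2, 0.8], [0.9, 0.1]]
context = multimodal_multisource_context(query, source_states, image_features)
```

In a trained system the query would be the RNN decoder's hidden state and the vectors would come from the RNN encoders and the pretrained CNN; here they are hand-written toy values chosen only to make the sketch runnable.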

Original language: English
Article number: 194
Journal: Computation
Volume: 13
Issue number: 8
State: Published - Aug 2025

Keywords

  • BERT embeddings
  • GloVe embeddings
  • attention-based NMT
  • bidirectional recurrent neural networks
  • convolutional neural networks
  • image caption translation
  • multilingual multisource translation
  • natural language processing
  • neural machine translation
