Abstract
Neural machine translation (NMT) models that combine textual and visual inputs generate more accurate translations than unimodal models. Moreover, translation models with an under-resourced target language benefit from multisource inputs (source sentences provided in different languages). Building MultiModal MultiSource NMT (M3S-NMT) systems requires significant effort to curate datasets suitable for such a multifaceted task. This work uses image caption translation as an example of multimodal translation and presents a novel public dataset for translating captions from multiple European languages (viz., English, German, French, and Czech) into the distant and under-resourced Arabic language. Moreover, it presents multitask learning models trained and tested on this dataset to serve as solid baselines for further research in this area. These models involve two parts: one for learning the visual representations of the input images, and the other for translating the textual input based on these representations. The translations are produced by a framework of attention-based encoder–decoder architectures. The visual features are learned from a pretrained convolutional neural network (CNN). These features are then integrated with textual features learned through basic yet well-known recurrent neural networks (RNNs) with GloVe or BERT word embeddings. Despite the challenges associated with the task at hand, the results of these systems are very promising, reaching 34.57 and 42.52 METEOR scores.
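The abstract does not say how the visual and textual features are integrated. As a rough illustration only, the sketch below shows one common pattern: a decoder state attends over the text encoder states, and the resulting context vector is concatenated with an image feature vector. The dot-product scoring, the concatenation-based fusion, the function names (`attend`, `fuse_step`), and the toy vectors are all assumptions made for this example, not the authors' method.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attend(query, memory):
    """Dot-product attention (assumed scoring function).

    Returns the attention weights over the rows of `memory`
    and the weighted sum of those rows (the context vector).
    """
    weights = softmax([dot(query, h) for h in memory])
    dim = len(memory[0])
    context = [sum(w * h[i] for w, h in zip(weights, memory)) for i in range(dim)]
    return weights, context

def fuse_step(decoder_state, text_states, image_feature):
    """One hypothetical decoder step: attend over the text encoder states,
    then append the image feature to the context (fusion by concatenation)."""
    weights, context = attend(decoder_state, text_states)
    return weights, context + image_feature  # list concatenation

# Toy inputs: three encoder states of size 2 and a 3-dimensional image feature.
text_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
image_feature = [0.5, -0.5, 0.25]
decoder_state = [2.0, 0.0]

weights, fused = fuse_step(decoder_state, text_states, image_feature)
```

In a multisource setting, `text_states` could plausibly hold encoder states from several source-language sentences placed in one attention memory, though that too is an assumption rather than something the abstract specifies.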
| Original language | English |
|---|---|
| Article number | 194 |
| Journal | Computation |
| Volume | 13 |
| Issue number | 8 |
| DOIs | |
| State | Published - Aug 2025 |
Keywords
- BERT embeddings
- GloVe embeddings
- attention-based NMT
- bidirectional recurrent neural networks
- convolutional neural networks
- image caption translation
- multilingual multisource translation
- natural language processing
- neural machine translation
Title
Multimodal Multisource Neural Machine Translation: Building Resources for Image Caption Translation from European Languages into Arabic