
ATAR: Attention-based LSTM for Arabizi transliteration

  • Jordan University of Science and Technology
  • University of Southampton

Research output: Contribution to journal › Article › peer-review

18 Scopus citations

Abstract

A non-standard romanization of Arabic script, known as Arabizi, is widely used in Arabic online and SMS/chat communities. However, since state-of-the-art tools and applications for Arabic NLP expect Arabic to be written in Arabic script, handling content written in Arabizi requires special attention, either by building customized tools or by transliterating it into Arabic script. The latter approach is the more common one, and this work presents two significant contributions in this direction. The first is the collection and public release of the first large-scale “Arabizi to Arabic script” parallel corpus, focusing on the Jordanian dialect and consisting of more than 25k pairs carefully created and inspected by native speakers to ensure the highest quality. The second is ATAR, an ATtention-based LSTM model for ARabizi transliteration. Training and testing this model on our dataset yields an accuracy of 79% and a BLEU score of 88.49.

Original language: English
Pages (from-to): 2327-2334
Number of pages: 8
Journal: International Journal of Electrical and Computer Engineering
Volume: 11
Issue number: 3
DOIs
State: Published - Jun 2021
Externally published: Yes

Keywords

  • Arabizi transliteration
  • Attention
  • Benchmark dataset
  • LSTM
  • Seq2seq
