Using Big Data Analytics for Authorship Authentication of Arabic Tweets

  • Jafar Albadarneh
  • Bashar Talafha
  • Mahmoud Al-Ayyoub
  • Belal Zaqaibeh
  • Mohammad Al-Smadi
  • Yaser Jararweh
  • Elhadj Benkhelifa
  • Jordan University of Science and Technology

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

28 Scopus citations

Abstract

Authorship authentication of a text is concerned with correctly attributing it to its author based on its contents. It is a very important problem with deep roots in history, as many classical texts have doubtful attributions. The information age and the ubiquitous use of the Internet are further complicating this problem and adding more dimensions to it. We are interested in the modern version of this problem, where the text whose authorship needs authentication is found in online social networks. Specifically, we are interested in the authorship authentication of tweets. This is not the only challenging aspect we consider here. Another is the language of the tweets. Most current works and existing tools support English. We choose to focus on the very important, yet largely understudied, Arabic language. Finally, we add another challenging aspect to the problem by addressing it at a very large scale. We present our effort to employ big data analytics to address the authorship authentication problem for Arabic tweets. We start by crawling a dataset of more than 53K tweets distributed across 20 authors. We then apply preprocessing steps to clean the data and prepare it for analysis. The next step is to compute the feature vector of each tweet. We use the Bag-Of-Words (BOW) approach and compute the weights using Term Frequency-Inverse Document Frequency (TF-IDF). Then, we feed the dataset to a Naive Bayes classifier implemented on Hadoop, a parallel and distributed computing framework. To the best of our knowledge, none of the previous works on authorship authentication of Arabic text has addressed the unique challenges associated with (1) tweets and (2) large-scale datasets. This makes our work unique on many levels. The results show that the testing accuracy is not very high (61.6%), which is expected in the very challenging setting that we consider.
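The pipeline described in the abstract (tokenize, build TF-IDF-weighted bag-of-words vectors, classify with Naive Bayes) can be illustrated with a minimal single-machine sketch. This is not the authors' Hadoop implementation: the whitespace tokenizer, the smoothed IDF formula, the Laplace smoothing constant, the function and class names, and the four toy "tweets" with their author labels are all assumptions made for illustration only.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    # Assumed preprocessing: lowercase and split on whitespace.
    return text.lower().split()

def fit_idf(docs):
    # Smoothed inverse document frequency (one common variant, assumed here).
    n = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    return {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}

def tfidf(tokens, idf):
    # Bag-of-words vector: raw term frequency times IDF; unseen terms are dropped.
    tf = Counter(tokens)
    return {t: tf[t] * idf[t] for t in tf if t in idf}

class NaiveBayes:
    """Multinomial Naive Bayes over real-valued TF-IDF weights."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # Laplace smoothing constant (assumed value)

    def fit(self, vectors, labels):
        self.vocab = {t for v in vectors for t in v}
        self.class_count = Counter(labels)
        self.weights = defaultdict(Counter)
        for v, y in zip(vectors, labels):
            self.weights[y].update(v)  # accumulate per-author term weights
        self.totals = {y: sum(w.values()) for y, w in self.weights.items()}
        return self

    def predict(self, vector):
        n = sum(self.class_count.values())
        best, best_score = None, -math.inf
        for y in self.class_count:
            score = math.log(self.class_count[y] / n)  # log prior
            denom = self.totals[y] + self.alpha * len(self.vocab)
            for t, w in vector.items():
                if t in self.vocab:
                    score += w * math.log((self.weights[y][t] + self.alpha) / denom)
            if score > best_score:
                best, best_score = y, score
        return best

# Toy stand-in data (hypothetical); the paper's dataset is 53K+ Arabic tweets.
train = [
    ("cloud computing scales well", "author_a"),
    ("big data on cloud clusters", "author_a"),
    ("football match tonight", "author_b"),
    ("great match and great goal", "author_b"),
]
docs = [tokenize(text) for text, _ in train]
idf = fit_idf(docs)
model = NaiveBayes().fit([tfidf(d, idf) for d in docs], [y for _, y in train])

def predict_author(text):
    return model.predict(tfidf(tokenize(text), idf))

print(predict_author("cloud clusters for big data"))  # author_a
print(predict_author("what a goal tonight"))          # author_b
```

On a distributed framework the same counts (document frequencies and per-author term weights) are sums over tweets, which is what makes the approach amenable to a map-and-reduce style of computation.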

Original language: English
Title of host publication: Proceedings - 2015 IEEE/ACM 8th International Conference on Utility and Cloud Computing, UCC 2015
Editors: Omer Rana, Rajkumar Buyya, Ioan Raicu
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 448-452
Number of pages: 5
ISBN (Electronic): 9780769556970
State: Published - 2015
Externally published: Yes
Event: 8th IEEE/ACM International Conference on Utility and Cloud Computing, UCC 2015 - Limassol, Cyprus
Duration: 7 Dec 2015 to 10 Dec 2015
