
Overview of the PAN@FIRE 2020 task on the authorship identification of source code

  • Ali Fadel
  • Husam Musleh
  • Ibraheem Tuffaha
  • Mahmoud Al-Ayyoub
  • Yaser Jararweh
  • Elhadj Benkhelifa
  • Paolo Rosso
  • Jordan University of Science and Technology
  • University of Staffordshire
  • Polytechnic University of Valencia

Research output: Contribution to journal › Conference article › peer-review

8 Scopus citations

Abstract

Authorship identification is essential for detecting deceptive misuse of others’ content and for exposing the owners of anonymous malicious content. While it is widely studied for natural languages, it is rarely considered for programming languages. Accordingly, a PAN@FIRE task, named Authorship Identification of SOurce COde (AI-SOCO), is proposed with a focus on identifying the authors of source code. The dataset consists of source code crawled from the CodeForces online judge platform, submitted by the top 1,000 human users with 100 or more correct C++ submissions. The participating systems are asked to predict the author of a given source code from a predefined list of code authors. In total, 60 teams registered on the task’s CodaLab page. Of these, 14 teams submitted 94 runs. The results are surprisingly high, with many teams and baselines breaking the 90% accuracy barrier. These systems used a wide range of models and techniques, from pretrained word embeddings (especially those adapted to handle source code) to stylometric features.

Original language: English
Pages (from-to): 649-676
Number of pages: 28
Journal: CEUR Workshop Proceedings
Volume: 2826
State: Published - 2020
Externally published: Yes
Event: Working Notes of FIRE - 12th Forum for Information Retrieval Evaluation, FIRE-WN 2020 - Hyderabad, India
Duration: 16 Dec 2020 → 20 Dec 2020

Keywords

  • Authorship-identification source-code datasets
