TY - GEN
T1 - Overview of the PAN@FIRE 2020 Task on the Authorship Identification of SOurce COde
AU - Fadel, Ali
AU - Musleh, Husam
AU - Tuffaha, Ibraheem
AU - Al-Ayyoub, Mahmoud
AU - Jararweh, Yaser
AU - Benkhelifa, Elhadj
AU - Rosso, Paolo
N1 - Publisher Copyright: © 2020 ACM.
PY - 2020/12/16
Y1 - 2020/12/16
AB - Authorship identification is essential for detecting the deceptive misuse of others' content and for exposing the owners of anonymous malicious content. While it is widely studied for natural languages, it is rarely considered for programming languages. Accordingly, a PAN@FIRE task, named Authorship Identification of SOurce COde (AI-SOCO), is proposed with a focus on identifying source code authors. The dataset consists of crawled source code submitted to the CodeForces online judge platform by the top 1,000 human users with 100 or more correct C++ submissions. The participating systems are asked to predict the author of a given piece of source code from the predefined list of code authors. In total, 60 teams registered on the task's CodaLab page, of which 14 teams submitted 94 runs. The results are surprisingly high, with many teams and baselines breaking the 90% accuracy barrier. These systems used a wide range of models and techniques, from pretrained word embeddings (especially those tweaked to handle source code) to stylometric features.
KW - authorship-identification
KW - datasets
KW - source-code
UR - https://www.scopus.com/pages/publications/85100403582
U2 - 10.1145/3441501.3441532
DO - 10.1145/3441501.3441532
M3 - Conference contribution
AN - SCOPUS:85100403582
T3 - ACM International Conference Proceeding Series
SP - 4
EP - 8
BT - FIRE 2020 - Proceedings of the 12th Annual Meeting of the Forum for Information Retrieval Evaluation
A2 - Majumder, Prasenjit
A2 - Mitra, Mandar
A2 - Gangopadhyay, Surupendu
A2 - Mehta, Parth
PB - Association for Computing Machinery
T2 - 12th Annual Meeting of the Forum for Information Retrieval Evaluation, FIRE 2020
Y2 - 16 December 2020 through 20 December 2020
ER -