TY - GEN
T1 - Building a standard dataset for Arabic sentiment analysis
T2 - 13th IEEE/ACS International Conference of Computer Systems and Applications, AICCSA 2016
AU - Al-Kabi, Mohammed N.
AU - Al-Qwaqenah, Areej A.
AU - Gigieh, Amal H.
AU - Alsmearat, Kholoud
AU - Al-Ayyoub, Mahmoud
AU - Alsmadi, Izzat M.
N1 - Publisher Copyright: © 2016 IEEE.
PY - 2016/7/2
Y1 - 2016/7/2
N2 - Sentiment Analysis (SA) is one of the hottest research fields nowadays. It is concerned with identifying the sentiment conveyed in a piece of text. Current efforts in SA require standard datasets for training and testing purposes. Such datasets already exist for some languages such as English. Unfortunately, the same cannot be said about other languages such as Arabic. Existing Arabic SA datasets are restricted (in their domain, size, dialects covered, etc.) and/or have limited availability. Moreover, the annotation process has not received the attention it deserves. Some of the existing datasets relied on the author's point of view for annotation, while others employed annotators but did not take into account the personal variations between the annotators and how these would affect their agreement. This study presents our efforts to build a standard Arabic dataset with the above concerns in mind. The constructed dataset is intended for generic use, as it contains reviews from different domains written in Modern Standard Arabic (MSA) as well as several dialects. The annotation process is given close attention by studying the inter-annotator agreements and investigating the potential factors affecting them.
KW - Arabic sentiment analysis
KW - Cohen's Kappa measure
KW - Dataset preparation
KW - Inter-annotator agreement
UR - https://www.scopus.com/pages/publications/85022016240
U2 - 10.1109/AICCSA.2016.7945822
DO - 10.1109/AICCSA.2016.7945822
M3 - Conference contribution
AN - SCOPUS:85022016240
T3 - Proceedings of IEEE/ACS International Conference on Computer Systems and Applications, AICCSA
BT - 2016 IEEE/ACS 13th International Conference of Computer Systems and Applications, AICCSA 2016 - Proceedings
PB - IEEE Computer Society
Y2 - 29 November 2016 through 2 December 2016
ER -