TY - GEN
T1 - An extensive study of the Bag-of-Words approach for gender identification of Arabic articles
AU - Alsmearat, Kholoud
AU - Al-Ayyoub, Mahmoud
AU - Al-Shalabi, Riyad
N1 - Publisher Copyright: © 2014 IEEE.
PY - 2014
Y1 - 2014
AB - The prevalent use of Online Social Networks (OSN) and the anonymity and lack of accountability they inherit from being online give rise to many problems related to finding the connection between the massive amount of text data on OSN and the people who actually wrote it. Analyzing text data for such purposes is called authorship analysis. This work focuses on one specific type of authorship analysis: identifying the author's gender. Gender identification has various applications, from marketing to security. The focus of this work is on Arabic articles. The problem is essentially a classification problem, and current approaches differ in the way they compute the features of each document. However, they all agree on following some 'stylometric features' approach. Unlike these works, ours treats this problem as a variation of the Text Classification (TC) problem and follows the Bag-Of-Words (BOW) approach for feature selection. We perform an extensive set of experiments on the feature selection and classification phases, and the results show that such an approach yields surprisingly high results.
UR - https://www.scopus.com/pages/publications/84988259136
U2 - 10.1109/AICCSA.2014.7073254
DO - 10.1109/AICCSA.2014.7073254
M3 - Conference contribution
AN - SCOPUS:84988259136
T3 - Proceedings of IEEE/ACS International Conference on Computer Systems and Applications, AICCSA
SP - 601
EP - 608
BT - 2014 IEEE/ACS 11th International Conference on Computer Systems and Applications, AICCSA 2014
PB - IEEE Computer Society
T2 - 2014 11th IEEE/ACS International Conference on Computer Systems and Applications, AICCSA 2014
Y2 - 10 November 2014 through 13 November 2014
ER -