TY - GEN
T1 - On the automatic construction of an Arabic thesaurus
AU - Mohsen, Ghassan
AU - Al-Ayyoub, Mahmoud
AU - Hmeidi, Ismail
AU - Al-Aiad, Ahmad
N1 - Publisher Copyright: © 2018 IEEE.
PY - 2018/5/4
Y1 - 2018/5/4
N2 - Despite its accuracy, the traditional approach of manually constructing a thesaurus can be a very complex task with many challenges. Constructing the thesaurus automatically, on the other hand, has been found to avoid a number of drawbacks of the manual approach. Automating thesaurus construction can save time, effort and cost, and it makes the constructed thesaurus easier to maintain and expand. Several approaches have been proposed to build thesauri in many languages (mainly English). To the best of our knowledge, there have been very limited efforts towards automatically building a high-quality large-scale thesaurus for the Arabic language. To fill this knowledge gap, the paper aims to automatically build a thesaurus and compare various methods for this task. To this end, a dataset of 14,148 Arabic documents is collected on different topics such as Arts, Politics, etc. The dataset is analyzed to assign weights to each term using three different weighting approaches: Term Frequency-Inverse Document Frequency (TF-IDF), Pointwise Mutual Information (PMI) and Latent Semantic Analysis (LSA). Then, three different similarity measures (Cosine, Jaccard and Dice) are used to compute term-term similarity. We test the constructed thesauri on 20 queries to evaluate their accuracy and determine which combination performs best. Recall and precision are the main accuracy measures used to evaluate the retrieval process. The experimental results demonstrate the superiority of the TF-IDF approach over the PMI and LSA approaches.
KW - Automatic Thesaurus Construction
KW - Cosine Similarity
KW - Dice Similarity
KW - Jaccard Similarity
KW - Latent Semantic Analysis (LSA)
KW - Modern Standard Arabic
KW - Pointwise Mutual Information (PMI)
KW - Term Frequency-Inverse Document Frequency (TF-IDF)
UR - https://www.scopus.com/pages/publications/85048497629
U2 - 10.1109/IACS.2018.8355431
DO - 10.1109/IACS.2018.8355431
M3 - Conference contribution
AN - SCOPUS:85048497629
T3 - 2018 9th International Conference on Information and Communication Systems, ICICS 2018
SP - 243
EP - 247
BT - 2018 9th International Conference on Information and Communication Systems, ICICS 2018
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 9th International Conference on Information and Communication Systems, ICICS 2018
Y2 - 3 April 2018 through 5 April 2018
ER -