Questions clustering using canopy-K-means and hierarchical-K-means clustering

  • Hashemite University

Research output: Contribution to journal › Article › peer-review

10 Scopus citations

Abstract

In question datasets, many questions are duplicates of one another, because the flexibility of natural language allows the same question to be written in different forms. Extracting related questions manually is time-consuming, so computational methods are needed to group similar questions into clusters based on their semantic similarity. The task remains challenging, however, because the information contained in a single question may be insufficient to cluster it effectively. In this research, canopy clustering is employed as a preliminary step for K-means clustering, and the result is compared with the hierarchical clustering approach. The Quora questions dataset is used in the experiments to identify similar question pairs. In terms of F1 score and the Rand statistic, the results show that the Hierarchical-K-means approach yields better clustering validity measures than the Canopy-K-means approach. In addition to identifying matches, the Canopy approach surfaces the top related questions that share the same intent, placing them in the same cluster across several canopies.
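
The idea of using canopy clustering as a preliminary step for K-means can be illustrated with a minimal sketch. It is not the authors' implementation: the toy 2-D points stand in for question vectors, the thresholds `t1` and `t2` are hypothetical values chosen for this example, and Euclidean distance is an assumption. The canopy pass picks cheap, overlapping groups; their centers then determine both the number of clusters and the initial centroids for K-means.

```python
import math
import random

def canopy_clustering(points, t1, t2, seed=0):
    """Return canopies as (center, members), with loose threshold t1 > tight threshold t2."""
    assert t1 > t2, "the loose threshold must exceed the tight one"
    rng = random.Random(seed)
    remaining = list(points)
    canopies = []
    while remaining:
        center = remaining.pop(rng.randrange(len(remaining)))
        members = [center]
        still_remaining = []
        for p in remaining:
            d = math.dist(center, p)
            if d < t1:
                members.append(p)          # loosely close: joins this canopy
            if d >= t2:
                still_remaining.append(p)  # not tightly bound: may join other canopies
        remaining = still_remaining
        canopies.append((center, members))
    return canopies

def nearest(p, centers):
    """Index of the center closest to point p."""
    return min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))

def kmeans(points, centers, iterations=20):
    """Plain Lloyd iterations starting from the supplied initial centers."""
    centers = [tuple(c) for c in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            clusters[nearest(p, centers)].append(p)
        new_centers = [
            tuple(sum(xs) / len(c) for xs in zip(*c)) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
        if new_centers == centers:
            break  # assignments are stable
        centers = new_centers
    return centers, [nearest(p, centers) for p in points]

# Toy data: two well-separated groups standing in for question vectors.
points = [(0.0, 0.0), (0.5, 0.2), (0.1, 0.6), (5.0, 5.0), (5.4, 5.1), (4.9, 5.6)]
canopies = canopy_clustering(points, t1=3.0, t2=1.5)
centers, labels = kmeans(points, [center for center, _ in canopies])
print(len(canopies), labels)
```

Because points lying between `t2` and `t1` of a center stay in the candidate pool, one point can belong to several canopies. K-means then resolves that overlap into a single hard assignment per point.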

Original language: English
Pages (from-to): 3793-3802
Number of pages: 10
Journal: International Journal of Information Technology (Singapore)
Volume: 14
Issue number: 7
State: Published - Dec 2022

Keywords

  • Canopy clustering
  • Hierarchical clustering
  • K-means clustering
  • Questions clustering
