
Analyzing RISC-V Compiler Toolchain by Adopting Topic Modeling

  • NED University of Engineering and Technology
  • Universidade Federal do Paraná
  • National Center of Artificial Intelligence

Research output: Contribution to journal › Article › peer-review

Abstract

Developers increasingly rely on open repositories and mailing list archives to build software, particularly in specialized domains where structured documentation is scarce. However, navigating and extracting useful knowledge from such scattered sources is challenging and time-consuming. This paper presents the first systematic effort to organize and analyze GitHub commit messages and mailing list patches using topic modeling techniques. The proposed technique is applied to the RISC-V compiler toolchain, where development depends primarily on code repositories and community discussions. By jointly modeling these heterogeneous sources, our method identifies recurring compiler-related themes such as auto-vectorization, intrinsics, and data types, enabling efficient retrieval of development knowledge. Our evaluation shows that for GitHub commit messages, Latent Semantic Analysis (LSA) achieves the highest CV coherence, while BERTopic provides the greatest topic diversity. For mailing list patches, BERTopic outperforms the other models in CV coherence, whereas Word2Vec leads in topic diversity. In addition, we demonstrate practical retrieval scenarios using queries such as autovec vmerge, intrinsic vfdiv, and jalr uint32_t, highlighting key concerns related to efficient code generation, floating-point precision, and address calculation optimizations in the RISC-V compiler. Overall, the experimental results indicate that topic modeling effectively captures development trends that are difficult to uncover through manual inspection or keyword-based search. By providing an organized and coherent view of scattered knowledge, our approach helps bridge knowledge gaps in complex technical domains and accelerates development where resources are limited.
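
The abstract compares models on topic diversity without defining the metric. As a rough illustration only, the sketch below assumes the commonly used definition (the fraction of unique words among the top-k words of all topics); the paper may compute it differently. The function name `topic_diversity` and the example topics are hypothetical, built from the query terms quoted above rather than taken from the paper's results.

```python
def topic_diversity(topics, top_k=10):
    """Fraction of unique words among the top_k words of every topic.

    Assumed definition, not confirmed by the paper: values near 1.0 mean
    topics share few words, values near 0.0 mean they largely overlap.
    """
    top_words = [word for topic in topics for word in topic[:top_k]]
    if not top_words:
        return 0.0
    return len(set(top_words)) / len(top_words)


# Hypothetical topics, each listed as its highest-weighted words first.
topics = [
    ["autovec", "vmerge", "vector", "loop"],
    ["intrinsic", "vfdiv", "float", "vector"],
    ["jalr", "uint32_t", "address", "offset"],
]

# "vector" appears in two topics, so 11 of the 12 top words are unique.
diversity = topic_diversity(topics, top_k=4)
print(f"topic diversity: {diversity:.3f}")
```

A model can score well on this measure while scoring poorly on coherence, which is consistent with the abstract reporting different winners for the two metrics.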

Original language: English
Pages (from-to): 33178-33198
Number of pages: 21
Journal: IEEE Access
Volume: 14
State: Published - 2026

Keywords

  • GCC
  • GitHub
  • RISC-V
  • Topic modeling
  • compiler
  • mailing list
