A topic modeling approach for code clone detection
Document Type
Conference Proceeding
Publication Date
1-1-2018
Abstract
In this paper, we investigate the potential benefits of Latent Dirichlet Allocation (LDA) as a technique for code clone detection. Our objective is to propose a language-independent, effective, and scalable approach for identifying similar code fragments in relatively large software systems. The main assumption is that the latent topic structure of software artifacts gives an indication of the presence of code clones. In particular, we hypothesize that artifacts with similar topic distributions contain duplicated code fragments. To test this novel hypothesis, we conduct an experimental investigation using multiple datasets from different application domains. Preliminary results show that, if calibrated properly, topic modeling can deliver satisfactory performance in capturing different types of code clones. It also achieves levels of accuracy adequate for practical applications, showing comparable performance to already existing tools that adopt different clone detection strategies.
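The hypothesis in the abstract (artifacts with similar topic distributions are likely to contain duplicated code) can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the per-artifact topic distributions have already been produced by some fitted LDA model, and the file names, the choice of Hellinger distance, and the 0.2 threshold are all hypothetical values chosen for illustration.

```python
import math
from itertools import combinations


def hellinger(p, q):
    """Hellinger distance between two discrete distributions.

    Returns 0.0 for identical distributions and 1.0 for distributions
    with no overlapping support.
    """
    return math.sqrt(
        sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)) / 2
    )


def candidate_clones(topic_dists, threshold=0.2):
    """Return (name_a, name_b, distance) for every pair of artifacts whose
    topic distributions are within `threshold`, closest pairs first.

    The threshold is an assumed calibration knob, not a value from the paper.
    """
    pairs = []
    for (name_a, p), (name_b, q) in combinations(sorted(topic_dists.items()), 2):
        d = hellinger(p, q)
        if d <= threshold:
            pairs.append((name_a, name_b, d))
    return sorted(pairs, key=lambda t: t[2])


# Hypothetical topic distributions over three topics, one per source file,
# standing in for the output of a fitted LDA model.
topic_dists = {
    "a.java": [0.70, 0.20, 0.10],
    "b.java": [0.68, 0.22, 0.10],
    "c.java": [0.05, 0.15, 0.80],
}

result = candidate_clones(topic_dists)
for name_a, name_b, d in result:
    print(f"{name_a} ~ {name_b} (distance {d:.3f})")
```

Pairs flagged this way are only clone candidates; the abstract's caveat that the approach must be "calibrated properly" corresponds here to choosing the number of topics and the distance threshold.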
Publication Title
Proceedings of the International Conference on Software Engineering and Knowledge Engineering, SEKE
Volume
2018-July
First Page
486
Last Page
491
Digital Object Identifier (DOI)
10.18293/SEKE2018-179
ISSN
23259000
E-ISSN
23259086
ISBN
1891706446
Citation Information
Reddivari, Sandeep & Khan, Mohammed. (2018). A Topic Modeling Approach for Code Clone Detection. 486-491. 10.18293/SEKE2018-179.