A topic modeling approach for code clone detection
In this paper, we investigate the potential benefits of Latent Dirichlet Allocation (LDA) as a technique for code clone detection. Our objective is to propose a language-independent, effective, and scalable approach for identifying similar code fragments in relatively large software systems. The main assumption is that the latent topic structure of software artifacts gives an indication of the presence of code clones. In particular, we hypothesize that artifacts with similar topic distributions contain duplicated code fragments. To test this novel hypothesis, we conduct an experimental investigation using multiple datasets from different application domains. Preliminary results show that, if calibrated properly, topic modeling can deliver satisfactory performance in capturing different types of code clones. It also achieves levels of accuracy adequate for practical applications, showing performance comparable to that of existing tools that adopt different clone detection strategies.
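The hypothesis above, that artifacts with similar topic distributions are likely to contain duplicated code, can be illustrated with a minimal sketch. It assumes an LDA model has already produced a topic distribution for each artifact. The abstract does not say how distributions are compared, so the Jensen-Shannon divergence, the 0.1 threshold, the file names, and the distribution values below are all illustrative assumptions, not the authors' method or data.

```python
import math
from itertools import combinations


def jensen_shannon(p, q):
    """Jensen-Shannon divergence (base 2) between two discrete distributions."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Kullback-Leibler divergence; terms with zero probability contribute 0.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return (kl(p, m) + kl(q, m)) / 2


def clone_candidates(topic_dists, threshold=0.1):
    """Return artifact pairs whose topic distributions diverge by at most `threshold`.

    The threshold is a hypothetical calibration parameter.
    """
    pairs = []
    for (name_a, p), (name_b, q) in combinations(sorted(topic_dists.items()), 2):
        if jensen_shannon(p, q) <= threshold:
            pairs.append((name_a, name_b))
    return pairs


# Hypothetical per-artifact topic distributions over three topics,
# standing in for the output of a trained LDA model.
topic_dists = {
    "A.java": [0.70, 0.20, 0.10],
    "B.java": [0.68, 0.22, 0.10],
    "C.java": [0.05, 0.15, 0.80],
}

candidates = clone_candidates(topic_dists)
print(candidates)
```

In this sketch, only the pair with nearly identical topic mixtures is flagged as a clone candidate; in practice the threshold and the number of topics would need the kind of calibration the abstract mentions.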
Proceedings of the International Conference on Software Engineering and Knowledge Engineering, SEKE
Digital Object Identifier (DOI): 10.18293/SEKE2018-179
Reddivari, Sandeep & Khan, Mohammed. (2018). A Topic Modeling Approach for Code Clone Detection. 486-535. 10.18293/SEKE2018-179