A topic modeling approach for code clone detection

Document Type

Conference Proceeding

Publication Date

1-1-2018

Abstract

In this paper we investigate the potential benefits of Latent Dirichlet Allocation (LDA) as a technique for code clone detection. Our objective is to propose a language-independent, effective, and scalable approach for identifying similar code fragments in relatively large software systems. The main assumption is that the latent topic structure of software artifacts gives an indication of the presence of code clones. In particular, we hypothesize that artifacts with similar topic distributions contain duplicated code fragments. To test this novel hypothesis, we conduct an experimental investigation using multiple datasets from different application domains. Preliminary results show that, if calibrated properly, topic modeling can deliver satisfactory performance in capturing different types of code clones. It also achieves levels of accuracy adequate for practical applications, showing comparable performance to already existing tools that adopt different clone detection strategies.
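
The hypothesis that artifacts with similar topic distributions contain duplicated code can be illustrated with a minimal sketch. It assumes the per-artifact topic distributions have already been inferred by an LDA model. The Hellinger distance, the `max_distance` threshold, the function names, and the sample values are illustrative assumptions, not the measure or settings reported in the paper.

```python
from itertools import combinations
from math import sqrt


def hellinger(p, q):
    """Hellinger distance between two discrete distributions.

    Returns 0.0 for identical distributions and 1.0 for disjoint ones.
    (Assumed similarity measure; the abstract does not name one.)
    """
    return sqrt(sum((sqrt(a) - sqrt(b)) ** 2 for a, b in zip(p, q))) / sqrt(2)


def clone_candidates(topic_dists, max_distance=0.2):
    """Return pairs of artifact names whose topic distributions are close.

    topic_dists maps an artifact name to its topic distribution.
    max_distance is a hypothetical threshold that would need calibration.
    """
    pairs = []
    for (name_a, p), (name_b, q) in combinations(sorted(topic_dists.items()), 2):
        if hellinger(p, q) <= max_distance:
            pairs.append((name_a, name_b))
    return pairs


# Hypothetical document-topic distributions over three topics,
# of the kind an LDA model might produce for three source files.
topic_dists = {
    "a.py": [0.70, 0.20, 0.10],
    "b.py": [0.68, 0.22, 0.10],
    "c.py": [0.05, 0.15, 0.80],
}
candidates = clone_candidates(topic_dists)
print(candidates)
```

With these made-up values, only `a.py` and `b.py` fall under the threshold and are reported as a clone candidate pair. In practice the threshold and the number of topics would have to be tuned, which is consistent with the abstract's remark that results depend on proper calibration.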

Publication Title

Proceedings of the International Conference on Software Engineering and Knowledge Engineering, SEKE

Volume

2018-July

First Page

486

Last Page

491

Digital Object Identifier (DOI)

10.18293/SEKE2018-179

ISSN

2325-9000

E-ISSN

2325-9086

ISBN

1891706446
