Predicting GitHub Issues Lifetime using Machine Learning and Topic Modeling (LDA)

Abstract

Topic modeling is the process of extracting keywords from documents to characterize them and distinguish them from other documents. It is applied to summarize, compare, and analyze large corpora of text. In the software engineering domain, it has been applied to mine repositories and extract valuable insight into the important properties and aspects of a project and its developers. One important aspect of current project management efforts is the prediction of issue lifetime. This study conducts topic modeling on GitHub issues to observe patterns in the extracted topics and to evaluate their performance as a feature for predicting the lifetime of issues. It is observed that issues from a large collection of projects can yield distinguishable and comprehensible topics. In terms of predictive performance, the prediction model with topic modeling performs better than the previous approach, with a substantial increase in precision and F1-measure. These findings help establish topic modeling as a viable feature in issue-based software development processes.

Keywords

topic modeling, issue lifetime prediction, mining software repositories