Suraj Yatish, Jirayus Jiarpakdee, Patanamon Thongtanunam, Chakkrit Tantithamthavorn
The International Conference on Software Engineering (ICSE)
With the rise of the Mining Software Repositories (MSR) field, defect datasets extracted from software repositories play a foundational role in many empirical studies related to software quality. At the core of defect data preparation is the identification of post-release defects. Prior studies leverage many heuristics (e.g., keywords and issue IDs) to identify post-release defects. However, such heuristic approach is based on several assumptions, which pose common threats to the validity of many studies. In this paper, we set out to investigate the nature of the difference of defect datasets generated by the heuristic approach and the realistic approach that leverages the earliest affected release that is realistically estimated by a software development team for a given defect. In addition, we investigate the impact of defect identification approaches on the predictive accuracy and the ranking of defective modules that are produced by defect models. Through a case study of defect datasets of 32 releases, we conclude that the heuristic approach has a large impact on both defect count datasets and binary defect datasets. On the other hand, the heuristic approach has a minimal impact on the predictive accuracy and the ranking of defective modules that are produced by defect count models and defect classification models. Our findings suggest that practitioners and researchers should not be too concerned about the predictive accuracy and the ranking of defective modules produced by defect models that are constructed using heuristic defect datasets.