Feature Selection in Cross-Project Software Defect Prediction

A Saifudin; A Trisetyarso; W Suparta; C H Kang; B S Abbas; Y Heryadi

Aries Saifudin

2020, Journal of Physics: Conference Series

https://doi.org/10.1088/1742-6596/1569/2/022001

Abstract

Advances in technology have increased the use and complexity of software. Greater complexity increases the likelihood of defects, and defective software can cause heavy losses. Fixing defects is costly because it can consume up to 50% of the project schedule. Most software developers do not document their work properly, which makes it difficult to analyse software development history data. The software metrics used in cross-project software defect prediction have many features. Because software metrics usually combine various measurement techniques, some of their features may be similar or irrelevant, which can degrade classifier performance. In this study, several feature selection techniques are proposed to select the relevant features. The classification algorithm used is Naïve Bayes. Based on an analysis using ANOVA, the SBS and SBFS models significantly improve the performance of the Naïve Bayes model.

metrics dataset. The results of the model performance measurements are compared to find the best model. The proposed model is implemented using the NASA dataset because it is the most widely used dataset in this area of study, which makes the results easy to compare with those of other researchers. The NASA dataset is obtained from https://github.com/klainfo/NASADefectDataset, which is a backup of http://nasa-softwaredefectdatasets.wikispaces.com/ from Shepperd et al. (2014). The NASA collection consists of 10 datasets, but this work uses the five that share the same attributes, namely CM1, MW1, PC1, PC3, and PC4. The proposed feature selection algorithms are applied to select the relevant features for the classifier. The proposed model is shown in Figure 1. The collected software metrics datasets are divided into two groups: one dataset is used for testing and the others for training. The data are then standardized using a min-max scaler and passed to the feature selection algorithm. The feature selection algorithm

Experiments were carried out by applying the model to the collected datasets. The implementation on the NASA datasets follows the proposed model shown in Figure 1. The accuracy and AUC values of the resulting models are visualized in Figure 2 and Figure 3. Figure 2 shows that the average model accuracy decreases as the number of features increases, while Figure 3 shows that the AUC value generally increases.
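The cross-project setup described above (one dataset held out for testing, the others used for training, min-max scaling, then feature selection ahead of Naïve Bayes) might be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the data are synthetic stand-ins rather than the NASA datasets, the training projects are simply stacked together, the number of retained features (`n_keep`) and the `roc_auc` selection criterion are arbitrary choices, and scikit-learn's `SequentialFeatureSelector` provides only plain backward selection (SBS), not the floating variants (SBFS, SFFS).

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler


def make_project(rng, n_modules, n_features=8):
    """Synthetic stand-in for one project's metrics table (NOT NASA data)."""
    X = rng.normal(size=(n_modules, n_features))
    # Assumption for the demo: only the first two features carry signal.
    noise = rng.normal(scale=0.5, size=n_modules)
    y = (X[:, 0] + X[:, 1] + noise > 1.0).astype(int)
    return X, y


def cross_project_eval(projects, target, n_keep=4):
    """Train on every project except `target`, then test on `target`."""
    train = [data for name, data in projects.items() if name != target]
    X_train = np.vstack([X for X, _ in train])
    y_train = np.concatenate([y for _, y in train])
    X_test, y_test = projects[target]

    model = Pipeline([
        ("scale", MinMaxScaler()),  # min-max scaling, fitted on training data only
        ("select", SequentialFeatureSelector(
            GaussianNB(),
            n_features_to_select=n_keep,  # hypothetical value
            direction="backward",         # sequential backward selection
            scoring="roc_auc",
            cv=3,
        )),
        ("clf", GaussianNB()),
    ])
    model.fit(X_train, y_train)

    accuracy = accuracy_score(y_test, model.predict(X_test))
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    kept = model.named_steps["select"].get_support()  # boolean feature mask
    return accuracy, auc, kept


rng = np.random.default_rng(0)
# Hypothetical project names; each one takes a turn as the test set.
projects = {name: make_project(rng, 200) for name in ["A", "B", "C", "D", "E"]}
results = {name: cross_project_eval(projects, name) for name in projects}
```

Placing the scaler and selector inside the pipeline keeps both fitted on the training projects only, so nothing from the held-out project leaks into feature selection.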

Table 2. P-value and Significantly Different Comparison of AUC

Table 2 shows that only two models differ significantly from the Naïve Bayes model. SFFS shows no significant difference from the other models.

Table 1. P-value and Significantly Different Comparison of Accuracy

Figure 4. Boxplot visualization of Accuracy

Figure 5. Boxplot visualization of AUC

To show whether a significant difference is an improvement or a decline, the results are visualized using the boxplot diagram in Figure 5. The visualization shows that the SBS and SBFS models perform significantly better than the Naïve Bayes model without feature selection.
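A pairwise ANOVA comparison of each feature selection model against the Naïve Bayes baseline could look roughly like the sketch below. The AUC values are randomly generated placeholders, not the paper's results, and the 0.05 threshold and the helper name `compare_to_baseline` are assumptions made for illustration. The source does not state which ANOVA procedure or significance level produced the tables.

```python
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(1)

# Placeholder AUC scores, one per held-out project (NOT the paper's numbers).
auc_by_model = {
    "NB": rng.uniform(0.60, 0.70, size=5),
    "SBS": rng.uniform(0.70, 0.80, size=5),
    "SBFS": rng.uniform(0.70, 0.80, size=5),
}


def compare_to_baseline(scores, baseline="NB", alpha=0.05):
    """One-way ANOVA of each model's scores against the baseline's scores."""
    report = {}
    for name, values in scores.items():
        if name == baseline:
            continue
        _, p_value = f_oneway(scores[baseline], values)
        # alpha = 0.05 is an assumed threshold, not taken from the source.
        report[name] = (float(p_value), bool(p_value < alpha))
    return report


report = compare_to_baseline(auc_by_model)
for name, (p_value, significant) in report.items():
    print(f"{name} vs NB: p = {p_value:.4f}, significant = {significant}")
```

A p-value alone does not say whether a model is better or worse than the baseline, which is why a boxplot or a comparison of means is still needed to read the direction of the difference.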

Naeem Seliya

Software: Practice and Experience, 2011

The selection of software metrics for building software quality prediction models is a search-based software engineering problem. An exhaustive search for such metrics is usually not feasible due to limited project resources, especially if the number of available metrics is large. Defect prediction models are necessary in aiding project managers for better utilizing valuable project resources for software quality improvement. The efficacy and usefulness of a fault-proneness prediction model is only as good as the quality of the software measurement data. This study focuses on the problem of attribute selection in the context of software quality estimation. A comparative investigation is presented for evaluating our proposed hybrid attribute selection approach, in which feature ranking is first used to reduce the search space, followed by a feature subset selection. A total of seven different feature ranking techniques are evaluated, while four different feature subset selection approaches are considered. The models are trained using five commonly used classification algorithms. The case study is based on software metrics and defect data collected from multiple releases of a large real-world software system. The results demonstrate that while some feature ranking techniques performed similarly, the automatic hybrid search algorithm performed the best among the feature subset selection methods. Moreover, performances of the defect prediction models either improved or remained unchanged when over 85% of the software metrics were eliminated.
