2012, International Journal on Computational Science & Applications
Software fault prediction plays a vital role in software quality assurance. Identifying faulty modules helps testers concentrate on those modules and improves the quality of the software. With the increasing complexity of today's software, feature selection is important for removing redundant, irrelevant and erroneous data from the dataset. In general, feature selection is done using either filter or wrapper approaches. In this paper a hybrid feature selection method is proposed which gives better prediction than the traditional methods. NASA's public dataset KC1, available at the PROMISE software engineering repository, is used. To evaluate the performance of the software fault prediction models, Accuracy, Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE) values are used.
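A minimal sketch of how such a fault prediction model might be scored with these three measures, assuming the KC1 metrics and binary defect labels are already loaded into arrays `X` and `y` (the Naive Bayes classifier and the 70/30 split are illustrative choices, not the paper's setup):

```python
# Illustrative scoring of a fault-prediction model with Accuracy, MAE and RMSE.
# Assumes X (module metrics) and y (0 = clean, 1 = faulty) are numpy arrays
# already loaded from the KC1 dataset; the classifier choice is arbitrary.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, mean_absolute_error, mean_squared_error

def evaluate_fault_predictor(X, y, seed=42):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                              stratify=y, random_state=seed)
    model = GaussianNB().fit(X_tr, y_tr)
    y_pred = model.predict(X_te)

    acc = accuracy_score(y_te, y_pred)
    mae = mean_absolute_error(y_te, y_pred)
    rmse = np.sqrt(mean_squared_error(y_te, y_pred))
    return acc, mae, rmse
```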
Journal of Research and Development on Information and Communication Technology, 2021
The rapid growth of data has become a huge challenge for software systems. The quality of a fault prediction model depends on the quality of the software dataset. High-dimensional data is a major problem that affects the performance of fault prediction models. To deal with the dimensionality problem, feature selection has been proposed by various researchers. Feature selection provides an effective solution by eliminating irrelevant and redundant features, reducing computation time and improving the accuracy of the machine learning model. In this study, we focus on research and synthesis of filter-based feature selection with several search methods and algorithms. In addition, five filter-based feature selection methods are analyzed using five different classifiers over datasets obtained from the National Aeronautics and Space Administration (NASA) repository. The experimental results show that Chi-Square and Information Gain methods had the best influence on the results of pre...
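As an illustration of the two filter methods the study singles out, the sketch below ranks features with Chi-Square and Information Gain (mutual information) using scikit-learn; keeping the top ten features is an assumption, not a value from the paper.

```python
# Filter-based feature selection with Chi-Square and Information Gain (mutual
# information), keeping the top-k ranked features. Chi-Square assumes the
# metric values in X are non-negative; y holds the binary defect labels.
from sklearn.feature_selection import SelectKBest, chi2, mutual_info_classif

def chi_square_filter(X, y, k=10):
    selector = SelectKBest(score_func=chi2, k=k).fit(X, y)
    return selector.transform(X), selector.get_support(indices=True)

def info_gain_filter(X, y, k=10):
    selector = SelectKBest(score_func=mutual_info_classif, k=k).fit(X, y)
    return selector.transform(X), selector.get_support(indices=True)
```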
Journal of Communications Technology, Electronics and Computer Science, 2016
Improving software product quality before release through periodic testing is one of the most expensive activities in software projects. Because resources for module testing are limited, it is important to identify fault-prone modules and direct the testing resources toward them. Software fault predictors based on machine learning algorithms are effective tools for identifying fault-prone modules. Extensive studies are being done in this field to find the connection between the features of software modules and their fault-proneness. Some features used in predictive algorithms are ineffective and reduce the accuracy of the prediction process, so feature selection methods are widely used to increase the performance of prediction models for fault-prone modules. In this study, we propose a feature selection method for effective selection of features using a combination of filter feature selection methods. In the proposed filter method, the combination of several filter ...
International Journal on Electrical Engineering and Informatics
Software engineering comprises several activities to ensure that a quality product is achieved in the end. Some of these activities are software testing, inspection, formal verification and software defect prediction. Many researchers have developed models for defect prediction. These models are based on machine learning and statistical analysis techniques. The main objective of these models is to identify defects before delivery of the software to the end user. This prediction helps project managers to effectively utilize resources for better quality assurance. Sometimes a single defect can cause an entire system failure, and most of the time defects drop the quality of the software system drastically. Early identification of defects can also help to make a better process plan which handles defects effectively and increases customer satisfaction. However, accurate prediction of defects in software is not an easy task because it is an indirect measure. Therefore, it is important to find suitable and significant measures which are most relevant for finding defects in the software system. This paper presents a feature selection based model to predict the defects in a given software module. The most relevant features are extracted from all features with the help of seven feature selection techniques, and eight classifiers are used to classify the modules. Six NASA software engineering defect prediction data sets are used in this work. Several performance parameters are also calculated for measuring the performance and validation of this work, and the results of the experiments revealed that the proposed model has greater capability to predict software defects.
International Journal of Computer Applications, 2014
Feature subset selection is the process of choosing a subset of good features with respect to the target concept. A clustering-based feature subset selection algorithm has been applied over software defect prediction data sets. The software defect prediction domain was chosen due to the growing importance of maintaining high reliability and high quality for any software being developed. A software quality prediction model is built using software metrics and defect data collected from a previously developed system release or similar software projects. Upon validation of such a model, it can be used for predicting the fault-proneness of program modules that are currently under development. The proposed clustering-based feature selection algorithm uses a minimum spanning tree based method to cluster features. The algorithm is then applied over four different data sets and its impact is analyzed.
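A rough sketch of clustering features via a minimum spanning tree is shown below; the use of 1 − |Pearson correlation| as the edge weight and the edge-cut threshold are assumptions for illustration, not necessarily the measures used by the algorithm in the paper.

```python
# Sketch of clustering features with a minimum spanning tree: build a complete
# graph over features weighted by a correlation-based distance, take its MST,
# drop long edges, and treat the remaining connected components as clusters.
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components

def mst_feature_clusters(X, edge_cut=0.7):
    # Distance between features: 1 - |Pearson correlation|.
    corr = np.corrcoef(X, rowvar=False)
    dist = 1.0 - np.abs(corr)
    dist += 1e-9               # keep genuine zero distances as explicit edges
    np.fill_diagonal(dist, 0)

    # MST over the complete feature graph, then remove edges above the cut.
    mst = minimum_spanning_tree(dist).toarray()
    mst[mst > edge_cut] = 0.0

    # Remaining connected components are the feature clusters.
    n_clusters, labels = connected_components(mst, directed=False)
    return n_clusters, labels
```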
International Journal of Embedded Systems
Software is a collection of computer programs written in a programming language. Software contains various modules which make it a complex entity, and this complexity increases the probability of defects during development of the modules. In turn, the cost and time to develop the software can increase. Sometimes these defects can lead to failure of the entire software, which delays delivery of the software to the customer. Such untimely delivery can be responsible for withdrawal or cancellation of the project. Hence, in this research work, several machine learning algorithms are applied to predict defects and help ensure timely delivery. Further, several feature selection techniques are also adopted to determine the features relevant for defect prediction.
Information Systems Frontiers, 2013
Two important problems which can affect the performance of classification models are high dimensionality (an overabundance of independent features in the dataset) and imbalanced data (a skewed class distribution which creates at least one class with many fewer instances than other classes). To resolve these problems concurrently, we propose an iterative feature selection approach which repeatedly applies data sampling (to address class imbalance) followed by feature selection (to address high dimensionality), and finally performs an aggregation step which combines the ranked feature lists from the separate iterations of sampling. This approach is designed to find a ranked feature list which is particularly effective on the more balanced dataset resulting from sampling, while minimizing the risk of losing data through the sampling step and missing important features. To demonstrate this technique, we employ 18 different feature selection algorithms and Random Undersampling with two post-sampling class distributions. We also investigate
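The iterative scheme can be sketched roughly as follows; the mutual-information ranker, the 50:50 post-sampling distribution, the rank-sum aggregation, and the ten iterations are assumptions for illustration, not the 18 rankers or settings used in the study.

```python
# Sketch of the iterative idea: repeatedly undersample the majority class,
# rank features on each balanced sample, then aggregate the per-iteration ranks.
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def undersample(X, y, rng):
    pos = np.flatnonzero(y == 1)            # minority (defective) modules
    neg = np.flatnonzero(y == 0)
    keep_neg = rng.choice(neg, size=len(pos), replace=False)
    idx = np.concatenate([pos, keep_neg])
    return X[idx], y[idx]

def iterative_feature_ranking(X, y, n_iter=10, seed=0):
    rng = np.random.default_rng(seed)
    rank_sum = np.zeros(X.shape[1])
    for _ in range(n_iter):
        Xs, ys = undersample(X, y, rng)
        scores = mutual_info_classif(Xs, ys, random_state=0)
        # double argsort gives rank 0 to the best-scoring feature
        rank_sum += np.argsort(np.argsort(-scores))
    return np.argsort(rank_sum)             # features ordered best-first
```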
Software defect prediction has recently become an important research topic in the software engineering field. Accurate prediction of defect-prone software modules can help direct the software testing effort, reduce costs, and improve the software testing process by focusing on fault-prone modules. Software defect data sets have an imbalanced nature, with very few defective modules compared to defect-free ones. Prediction performance also decreases significantly when the dataset contains noisy attributes. In this research, we propose the combination of a genetic algorithm and the bagging technique for improving the performance of software defect prediction. The genetic algorithm is applied to deal with feature selection, and bagging is employed to deal with the class imbalance problem. The proposed method is evaluated using data sets from the NASA metric data repository. The results indicate that the proposed method achieves an impressive improvement in prediction performance for most classifiers.
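A compact sketch of the two ingredients is given below: a genetic algorithm searching over feature-subset bit masks, with the cross-validated AUC of a bagged decision-tree classifier as fitness. The population size, mutation rate, selection scheme, and base learner are illustrative assumptions rather than the paper's settings.

```python
# Genetic-algorithm feature selection with a bagging classifier as the learner
# whose cross-validated AUC serves as the fitness of a candidate feature mask.
import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

def fitness(mask, X, y):
    if not mask.any():
        return 0.0
    clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=10, random_state=0)
    return cross_val_score(clf, X[:, mask], y, cv=3, scoring="roc_auc").mean()

def ga_feature_selection(X, y, pop_size=20, n_gen=15, p_mut=0.05, seed=0):
    rng = np.random.default_rng(seed)
    n_feat = X.shape[1]
    pop = rng.random((pop_size, n_feat)) < 0.5            # random bit masks
    for _ in range(n_gen):
        fit = np.array([fitness(ind, X, y) for ind in pop])
        parents = pop[np.argsort(-fit)[: pop_size // 2]]  # truncation selection
        children = []
        for _ in range(pop_size - len(parents)):
            a, b = parents[rng.integers(len(parents), size=2)]
            cut = rng.integers(1, n_feat)                  # one-point crossover
            child = np.concatenate([a[:cut], b[cut:]])
            child ^= rng.random(n_feat) < p_mut            # bit-flip mutation
            children.append(child)
        pop = np.vstack([parents, np.array(children)])
    best = pop[np.argmax([fitness(ind, X, y) for ind in pop])]
    return np.flatnonzero(best)                            # selected feature indices
```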
International Journal of Modern Education and Computer Science, 2020
Testing is considered one of the most expensive activities in the software development process. Fixing defects during the testing process can increase the cost as well as the completion time of the project. The cost of the testing process can be reduced by identifying defective modules during the development (before testing) stage. This process is known as "Software Defect Prediction", which has been widely studied by many researchers over the last two decades. This research proposes a classification framework for the prediction of defective modules using variant-based ensemble learning and feature selection techniques. The variant selection activity identifies the best optimized versions of classification techniques so that their ensemble can achieve high performance, whereas feature selection is performed to remove features which do not participate in classification and cause lower performance. The proposed framework is implemented on four cleaned NASA datasets from the MDP repository and evaluated using three performance measures: F-measure, Accuracy, and MCC. According to the results, the proposed framework outperformed 10 widely used supervised classification techniques, including:
Symmetry
Finding defects early in a software system is a crucial task, as it creates adequate time for fixing such defects using available resources. Strategies such as symmetric testing have proven useful; however, their inability to differentiate incorrect implementations from correct ones is a drawback. Software defect prediction (SDP) is another feasible method that can be used for detecting defects early. Additionally, high dimensionality, a data quality problem, has a detrimental effect on the predictive capability of SDP models. Feature selection (FS) has been used as a feasible solution for solving the high dimensionality issue in SDP. According to the current literature, the two basic forms of FS approaches are filter-based feature selection (FFS) and wrapper-based feature selection (WFS). Between the two, WFS approaches have been deemed superior. However, WFS methods have a high computational cost due to the unknown number of executions available for feature subset search, evalua...
2020
Feature selection is a technique used to select an optimal feature subset from the original input features according to a specific criterion. The criterion is often formulated as an objective function that finds which features are most appropriate for the task at hand. Finding a subset of features is attractive because it is always easier to solve a problem in a lower dimension, and it helps in understanding the nonlinear mapping between input and output variables. This paper reviews the basic feature selection techniques for software defect prediction models and their domain applications. Subset selection methods are categorized into three distinct models and are discussed concisely to provide young researchers with the general methods of subset selection. Support Vector Machine with Recursive Feature Elimination, together with Logistic Regression and Random Forest, was introduced to evaluate the performance of filter, wrapper, and embedded feature select...
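For illustration, Recursive Feature Elimination can be run with different estimators as sketched below; the choice of keeping ten features is an assumption.

```python
# Recursive Feature Elimination (RFE) with two different estimators, in the
# spirit of the wrapper/embedded comparison described above.
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

def rfe_logistic(X, y, n_keep=10):
    rfe = RFE(LogisticRegression(max_iter=1000),
              n_features_to_select=n_keep).fit(X, y)
    return rfe.get_support(indices=True)

def rfe_random_forest(X, y, n_keep=10):
    rfe = RFE(RandomForestClassifier(n_estimators=100, random_state=0),
              n_features_to_select=n_keep).fit(X, y)
    return rfe.get_support(indices=True)
```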
Journal of Systems Engineering and Electronics, 2021
Software defect prediction (SDP) is used to perform statistical analysis of historical defect data to find the distribution rule of historical defects, so as to effectively predict defects in new software. However, software defect datasets contain redundant and irrelevant features that affect the performance of defect predictors. In order to identify and remove these redundant and irrelevant features, we propose ReliefF-based clustering (RFC), a cluster-based feature selection algorithm. The correlation between features is calculated based on symmetric uncertainty. According to the degree of correlation, RFC partitions the features into k clusters using the k-medoids algorithm, and finally selects representative features from each cluster to form the final feature subset. In the experiments, we compare the proposed RFC with classical feature selection algorithms on nine National Aeronautics and Space Administration (NASA) software defect prediction datasets in terms of area under the curve (AUC) and F-value. The experimental results show that RFC can effectively improve the performance of SDP.
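The symmetric uncertainty measure on which the feature correlation is based can be sketched as follows; the equal-width binning used to discretise continuous metrics is an assumption for illustration, and the k-medoids clustering and representative-selection steps are not shown.

```python
# Symmetric uncertainty between two features:
#   SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y))
import numpy as np

def entropy(values):
    _, counts = np.unique(values, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def symmetric_uncertainty(x, y, bins=10):
    xd = np.digitize(x, np.histogram_bin_edges(x, bins=bins))
    yd = np.digitize(y, np.histogram_bin_edges(y, bins=bins))
    hx, hy = entropy(xd), entropy(yd)
    hxy = entropy(xd * (yd.max() + 1) + yd)   # H(X, Y) via unique pair labels
    mi = hx + hy - hxy                        # mutual information I(X; Y)
    return 2.0 * mi / (hx + hy) if (hx + hy) > 0 else 0.0
```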
International Journal of Advanced Science and Technology, 2014
Software is a complex entity composed of various modules, each with a varied range of defect occurrence possibility. Efficient and timely prediction of defect occurrence allows software project managers to effectively utilize people, cost and time for better quality assurance. The presence of defects leads to poor quality software and is also responsible for the failure of a software project. Sometimes it is not possible to identify defects and fix them at the time of development, and such defects must be handled whenever they are noticed by team members. It is therefore important to predict defect-prone software modules prior to deployment of a software project in order to plan a better maintenance strategy. Early knowledge of defect-prone software modules can also help to make an efficient process improvement plan within a justified period of time and cost, which can further lead to better software releases and higher customer satisfaction. Accurate measurement and prediction of defects is a crucial issue for any software because it is an indirect measurement based on several metrics. Therefore, instead of considering all metrics, it is more appropriate to find a suitable set of metrics which are relevant and significant for predicting defects in software modules. This paper proposes a feature selection based Linear Twin Support Vector Machine (LSTSVM) model to predict defect-prone software modules. F-score, a feature selection technique, is used to determine the significant metrics that most prominently affect defect prediction in software modules. The efficiency of the predictive model is enhanced by the reduced metric set obtained after feature selection, which is then used to identify defective modules in a given set of inputs. This paper evaluates the performance of the proposed model and compares it against other existing machine learning models. The experiments have been performed on four PROMISE software engineering repository datasets. The experimental results indicate the effectiveness of the proposed feature selection based LSTSVM predictive model on the basis of standard performance evaluation parameters.
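For illustration, the F-score ranking step might look like the sketch below, using one common definition of the F-score; the paper's exact variant and selection threshold may differ.

```python
# Fisher-style F-score for ranking software metrics: higher scores indicate
# metrics that better separate defective (y == 1) from clean (y == 0) modules.
import numpy as np

def f_score(X, y):
    pos, neg = X[y == 1], X[y == 0]
    mean_all, mean_pos, mean_neg = X.mean(0), pos.mean(0), neg.mean(0)
    numer = (mean_pos - mean_all) ** 2 + (mean_neg - mean_all) ** 2
    denom = pos.var(0, ddof=1) + neg.var(0, ddof=1)
    return numer / np.where(denom == 0, np.finfo(float).eps, denom)

# Example (illustrative threshold): keep metrics scoring above the mean score.
# scores = f_score(X, y)
# selected = np.flatnonzero(scores > scores.mean())
```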
Proceedings of the 7th India Software Engineering Conference on - ISEC '14, 2014
The quality of a fault prediction model depends on the software metrics that are used to build the prediction model. Feature selection is a process of selecting a subset of relevant features that may lead to improved prediction models. Feature selection techniques can be broadly categorized into two subcategories: feature-ranking and feature-subset selection. In this paper, we present a comparative investigation of seven feature-ranking techniques and eight feature-subset selection techniques for improved fault prediction. The performance of these feature selection techniques is evaluated using two popular machine-learning classifiers, Naive Bayes and Random Forest, over fourteen software projects' fault datasets obtained from the PROMISE data repository. Performance was measured using F-measure and AUC values. Our results demonstrate that feature-ranking techniques produced better results than feature-subset selection techniques. Among the feature-ranking techniques used in the study, InfoGain and PCA provided the best performance over all the datasets, while among the feature-subset selection techniques, ClassifierSubsetEval and Logistic Regression produced better results than their peers.
Symmetry
Feature selection (FS) is a feasible solution for mitigating the high dimensionality problem, and many FS methods have been proposed in the context of software defect prediction (SDP). However, empirical studies on the impact and effectiveness of FS methods on SDP models often lead to contradictory experimental results and inconsistent findings. These contradictions can be attributed to study limitations such as small datasets, limited FS search methods, and unsuitable prediction models in the respective scope of studies. It is hence critical to conduct an extensive empirical study to address these contradictions, guide researchers, and strengthen the scientific validity of experimental conclusions. In this study, we investigated the impact of 46 FS methods using Naïve Bayes and Decision Tree classifiers over 25 software defect datasets from 4 software repositories (NASA, PROMISE, ReLink, and AEEEM). The ensuing prediction models were evaluated based on accuracy and AUC va...
Applied Sciences
Software Defect Prediction (SDP) models are built using software metrics derived from software systems. The quality of SDP models depends largely on the quality of the software metrics (dataset) used to build them. High dimensionality is one of the data quality problems that affect the performance of SDP models. Feature selection (FS) is a proven method for addressing the dimensionality problem. However, the choice of FS method for SDP is still a problem, as most empirical studies on FS methods for SDP produce contradictory and inconsistent quality outcomes. These FS methods behave differently due to different underlying computational characteristics. This could be due to the choice of search method used in FS, because the impact of FS depends on the choice of search method. It is hence imperative to comparatively analyze the performance of FS methods based on different search methods in SDP. In this paper, four filter feature ranking (FFR) and fourteen filter feature su...
Software: Practice and Experience, 2011
The selection of software metrics for building software quality prediction models is a search-based software engineering problem. An exhaustive search for such metrics is usually not feasible due to limited project resources, especially if the number of available metrics is large. Defect prediction models are necessary for aiding project managers in better utilizing valuable project resources for software quality improvement. The efficacy and usefulness of a fault-proneness prediction model are only as good as the quality of the software measurement data. This study focuses on the problem of attribute selection in the context of software quality estimation. A comparative investigation is presented for evaluating our proposed hybrid attribute selection approach, in which feature ranking is first used to reduce the search space, followed by feature subset selection. A total of seven different feature ranking techniques are evaluated, while four different feature subset selection approaches are considered. The models are trained using five commonly used classification algorithms. The case study is based on software metrics and defect data collected from multiple releases of a large real-world software system. The results demonstrate that while some feature ranking techniques performed similarly, the automatic hybrid search algorithm performed the best among the feature subset selection methods. Moreover, the performance of the defect prediction models either improved or remained unchanged when over 85% of the software metrics were eliminated.
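The two-stage hybrid idea (ranking to shrink the search space, then subset selection within the reduced space) can be sketched as below; the mutual-information ranker, the Naive Bayes wrapper, and the cut-offs of 15 and 6 features are illustrative assumptions, not the configuration evaluated in the study.

```python
# Hybrid attribute selection sketch: a cheap ranking step first shrinks the
# search space, then a wrapper-style subset search runs over the survivors.
import numpy as np
from sklearn.feature_selection import (SelectKBest, mutual_info_classif,
                                       SequentialFeatureSelector)
from sklearn.naive_bayes import GaussianNB

def hybrid_selection(X, y, pre_k=15, final_k=6):
    # Stage 1: filter ranking keeps the top pre_k metrics.
    ranker = SelectKBest(mutual_info_classif, k=min(pre_k, X.shape[1])).fit(X, y)
    kept = ranker.get_support(indices=True)

    # Stage 2: wrapper subset selection within the reduced space.
    sfs = SequentialFeatureSelector(GaussianNB(), n_features_to_select=final_k,
                                    direction="forward", cv=3)
    sfs.fit(X[:, kept], y)
    return kept[sfs.get_support(indices=True)]
```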
ECTI Transactions on Computer and Information Technology (ECTI-CIT)
Two primary issues have emerged in the machine learning and data mining community: how to deal with imbalanced data and how to choose appropriate features. These are of particular concern in the software engineering domain, and more specifically in the field of software defect prediction. This research highlights a procedure which includes a feature selection technique to single out relevant attributes, and an ensemble technique to handle the class-imbalance issue. In order to determine the advantages of feature selection and ensemble methods, we look at two potential scenarios: (1) ensemble models constructed from the original datasets, without feature selection; (2) ensemble models constructed from the reduced datasets after feature selection has been applied. Four feature selection techniques are employed: Principal Component Analysis (PCA), Pearson’s correlation, Greedy Stepwise Forward selection, and Information Gain (IG). The aim of this research is to assess the effectiveness of ...
Journal of Physics: Conference Series, 2020
Advances in technology have increased the use and complexity of software. The complexity of software can increase the possibility of defects. Defective software can cause high losses, and fixing it requires a high cost because it can consume up to 50% of the project schedule. Most software developers do not document their work properly, making it difficult to analyse software development history data. Software metrics used in cross-project software defect prediction have many features. Software metrics usually consist of various measurement techniques, so some features may be similar or irrelevant, which can decrease the performance of classifiers. In this study, several feature selection techniques were proposed to select the relevant features. The classification algorithm used is Naive Bayes. Based on analysis using ANOVA, the SBS and SBFS models can significantly improve the performance of the Naïve Bayes model.
TELKOMNIKA Telecommunication Computing Electronics and Control, 2024
In software defect prediction, noisy attributes and high-dimensional data remain a critical challenge. This paper introduces a novel approach known as multi correlation-based feature selection (MCFS), which seeks to address these challenges. MCFS integrates two feature selection techniques, namely correlation-based feature selection (CFS) and correlation matrix based feature selection (CMFS), with the aim of reducing data dimensionality and eliminating noisy attributes. To accomplish this, CFS and CMFS are applied independently to filter the datasets, and a weighted average of their outcomes is computed to determine the optimal feature selection. This approach not only reduces data dimensionality but also mitigates the impact of noisy attributes. To further enhance predictive performance, the paper leverages the particle swarm optimization (PSO) algorithm as a feature selection mechanism, specifically targeting improvements in the area under the curve (AUC). The proposed method is evaluated on 12 benchmark datasets sourced from the NASA metrics data program (MDP) corpus, known for their noisy attributes, high dimensionality, and imbalanced class records. The research findings demonstrate that MCFS outperforms CFS and CMFS, yielding an average AUC value of 0.891, thereby emphasizing its efficacy in advancing classification performance in the context of software defect prediction using k-nearest neighbors (KNN) classification.
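A minimal sketch of the score-fusion step in the spirit of MCFS is given below; the per-feature proxies for CFS and CMFS (feature-target correlation, and feature-target correlation penalised by mean inter-feature correlation), the equal weights, and the top-k cut are illustrative assumptions, and the PSO stage is not shown.

```python
# Fuse two correlation-based feature scores by a weighted average and keep the
# top-k features. The two component scores are simple stand-ins, not the
# paper's exact CFS/CMFS formulations.
import numpy as np

def weighted_correlation_selection(X, y, w=0.5, top_k=10):
    n_feat = X.shape[1]
    target_corr = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(n_feat)])
    inter_corr = np.abs(np.corrcoef(X, rowvar=False))
    mean_redundancy = (inter_corr.sum(axis=1) - 1.0) / (n_feat - 1)

    score_a = target_corr                      # relevance-only score
    score_b = target_corr - mean_redundancy    # relevance penalised by redundancy
    fused = w * score_a + (1.0 - w) * score_b
    return np.argsort(-fused)[:top_k]          # indices of the top-k features
```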