1College of Computer Science and Technology, Taiyuan University of Technology, Yingze Street 79, Taiyuan 030024, China
2The Center of Information and Network, Shanxi Medical College of Continuing Education, Shuangtasi Street 22, Taiyuan 030012, China
3The Technology and Product Management, Shanxi Branch of Agricultural Bank of China, Nanneihuan Street 33, Taiyuan 030024, China
Copyright © 2016 Jing Bian et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
In the era of big data, feature selection is an essential process in machine learning. Although the class imbalance problem has recently attracted a great deal of attention, little effort has been devoted to feature selection techniques for imbalanced data. In addition, most applications involving feature selection focus on classification accuracy and ignore costs, although costs are important. To cope with imbalance problems, we developed a cost-sensitive feature selection algorithm, referred to as CSFSG, which adds a cost-based evaluation function to filter feature selection and searches with a chaos genetic algorithm. The evaluation function considers both feature-acquisition costs (test costs) and misclassification costs in the field of network security, thereby weakening the influence of the many majority-class instances in large-scale datasets. The CSFSG algorithm reduces the total cost of feature selection and trades off the two cost factors. The behavior of the CSFSG algorithm is tested on a large-scale network security dataset, using two kinds of classifiers: C4.5 and k-nearest neighbor (KNN). The experimental results show that the approach is efficient, effectively improving classification accuracy and decreasing classification time. In addition, the results of our method are more promising than those of other cost-sensitive feature selection algorithms.
The class imbalance problem is found in various scientific and social arenas, such as fraud/intrusion detection, spam detection, risk management, technical diagnostics/monitoring, financial engineering, and medical diagnostics [1–4]. In most applications, correctly classifying the minority class is more important than correctly classifying the majority class, even though the minority class contains far fewer instances.
There are essentially two methods to address the class imbalance problem: sampling methods and cost-sensitive learning methods. The objective of sampling methods and synthetic data generation is to produce a relatively balanced class distribution through oversampling and/or undersampling techniques. A very popular oversampling approach is the Synthetic Minority Oversampling Technique (SMOTE), which produces synthetic minority-class samples, as opposed to oversampling with replacement. For high-dimensional data, Blagus and Lusa showed that SMOTE does not change the class-specific mean values, and that it decreases data variability and introduces correlation between samples. Cost-sensitive learning methods introduce a cost matrix to minimize total costs while maximizing accuracy. When learning from imbalanced data, most classifiers are overwhelmed by the majority-class samples, so the false negative rate is always high. Researchers have introduced many methods to address these problems, including combining sampling techniques with cost-sensitive learning, setting the cost ratio by inverting prior class distributions, and collecting the cost of features before classification [5, 8, 9].
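To make the SMOTE idea above concrete, the following minimal sketch generates synthetic minority-class samples by interpolating between a minority sample and one of its nearest minority neighbors. The function name, neighbor count, and toy data are illustrative, not part of the original SMOTE implementation:

```python
import random

def smote_sample(minority, k=2, n_new=4, seed=0):
    """Generate synthetic minority samples by interpolating between a
    sample and one of its k nearest minority neighbors (the core SMOTE idea)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest neighbors of x by squared Euclidean distance (excluding x)
        neighbors = sorted((p for p in minority if p is not x),
                           key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)))[:k]
        nb = rng.choice(neighbors)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nb)))
    return synthetic

minority = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1)]
new_points = smote_sample(minority)
print(len(new_points))  # 4 synthetic samples
```

Because each synthetic point lies on a segment between two real minority samples, oversampling this way densifies the minority region instead of duplicating points, which is what distinguishes SMOTE from oversampling with replacement.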
Most data mining techniques are not designed to cope with large numbers of features, and such is the case with feature selection. The class imbalance problem becomes especially severe when data dimensionality is high, a difficulty widely known as the curse of dimensionality. Among the many remedies, the most common are feature selection methods that retain only the relevant features, and these methods are also efficient and effective. Many studies have recognized the importance of feature selection and addressed the subject from various perspectives, and the approach has been used increasingly in class imbalance problems.
In this paper, we investigate cost-sensitive feature selection issues in an imbalanced scenario. Specifically, before briefly introducing cost-sensitive learning and its application to feature selection, we illustrate the imbalanced problem, which is the most relevant topic of study in the current research. Then, we propose a new method for feature selection whose goal is to develop an efficient approach in the field of network security, an arena in which large numbers of imbalanced datasets are typical. Thus, rather than improving on previous methods, our purpose is to match the performance of previous cost-sensitive feature selection approaches using a method that addresses very large datasets with imbalance problems.
2. Related Work
2.1. Cost-Sensitive Learning
Different costs are associated with different misclassification errors in real-world applications. Cost-sensitive learning takes into account the variable cost of misclassifying different classes [12, 13]. In most cases, cost-sensitive learning algorithms are designed to minimize total costs while taking multiple types of cost into account. In 2000, Turney presented the main types of costs involved in inductive concept learning.
Cost-sensitive learning involves two major types of costs: misclassification costs and test costs. Misclassification costs can be viewed as the penalties that result from incorrectly assigning an object to a certain class. Traditional machine learning methods are largely aimed at minimizing the error rate under uniform error costs: they assume equal misclassification costs and relatively balanced class distributions. In practice, however, the cost of misclassifying an abnormal incident as a normal one is typically much higher than the cost of misclassifying a normal incident as an abnormal one. Thus, misclassification costs, rather than misclassification errors, must be minimized. There are two types of misclassification costs: example-dependent costs and class-dependent costs.
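The asymmetry described above can be captured in a class-dependent cost matrix. The sketch below is illustrative (the class names and cost values are assumptions, not taken from the paper), but it shows the structure such a matrix takes:

```python
# Illustrative class-dependent cost matrix, keyed (predicted, actual).
# Misclassifying an abnormal event as normal (a false negative) is far
# more costly than the reverse, and correct predictions cost nothing.
MISCLASSIFICATION_COST = {
    ("normal", "normal"): 0.0,
    ("normal", "abnormal"): 10.0,  # false negative: abnormal predicted as normal
    ("abnormal", "normal"): 1.0,   # false positive: normal predicted as abnormal
    ("abnormal", "abnormal"): 0.0,
}

def cost(predicted, actual):
    """Look up the class-dependent misclassification cost."""
    return MISCLASSIFICATION_COST[(predicted, actual)]

print(cost("normal", "abnormal"))  # 10.0
```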
Test costs typically refer to the money, time, computing, or other resources expended to obtain the data items related to an object. There are numerous measurement methods with different test costs; higher test costs are required to obtain data with smaller measurement error. An appropriate measurement should therefore be selected so that the total test cost is minimized.
Some studies focus on misclassification costs but fail to consider test costs. Others consider test costs but not misclassification costs. Turney was the first to consider test and misclassification costs together, and this combined approach has gradually become one of the foremost trends in the research.
2.2. Cost-Sensitive Feature Selection
In general, classification time increases with the number of features based on the computational complexity of the classification algorithm. However, it is possible to alleviate the curse of dimensionality by reducing the number of features, although this may weaken discriminating power.
A classifier can be understood as a specific function that maps a feature vector onto a class label. An algorithm that can guide the active acquisition of information while balancing the associated costs is often termed a cost-sensitive classifier. The acquisition cost of selected features is an important issue in some applications, and a growing number of researchers have incorporated feature acquisition cost into the feature selection criterion. Ji and Carin introduced several cost-sensitive feature selection criteria, whereas traditional methods select all useful features simultaneously by setting the weights of redundant features to zero.
Several works have addressed cost-sensitive feature selection in recent years. For example, Bosin et al. present a cost-sensitive feature selection approach that focuses on the quality of features and the cost of computation. Despite its complexity, this method increases classifier accuracy by defining a feature selection criterion that judges the goodness of a feature subset with respect to a particular classification model.
Mejía-Lavalle proposes a feature selection heuristic that takes cost-sensitive classification into account. Unlike most feature selection studies, which evaluate only accuracy and processing time, this work evaluates different feature selection-ranking methods over large datasets. In addition, the heuristic can separate relevant from irrelevant features by stressing the cases near the decision boundary.
Wang et al. address the issue of data overfitting by designing three simple and efficient strategies—feature selection, smoothing, and threshold pruning—for the test cost-sensitive decision tree method. Before applying the test cost-sensitive decision tree algorithm, they use a feature selection step that considers test costs and misclassification costs to preprocess the dataset.
In a study by Lee et al., a spam detection model is proposed that is the first to take into account the importance of feature variables and the optimal number of features. The optimal number of selected features is decided using two methods: parameter optimization over the overall feature selection, and parameter optimization in every feature elimination phase.
Chang et al.  propose an efficient hierarchical classification scheme and cost-based group-feature selection criterion to improve feature calculation and classification accuracy. This approach adopts computational cost-sensitive group-feature selection criterion with the Sequential Forward Selection (SFS) to obtain the class-specific quasioptimal feature set.
Zhao et al. define a new cost-sensitive feature selection problem on a covering-based rough set model with normally distributed measurement errors. Unlike existing models, their work proposes backtracking and heuristic algorithms that operate mainly on the error boundaries, with both test costs and misclassification costs.
Liu et al. propose a new cost-sensitive learning method for software defect prediction that utilizes cost information in two stages: the feature selection stage as well as the classification stage.
However, few studies have focused on the class imbalance problem from the perspective of cost-sensitive feature selection. To the best of our knowledge, no study addresses cost-sensitive feature selection in the network security field, because of the significant domain differences and dependencies.
3. Cost-Sensitive Feature Selection Model
3.1. Problem Formulation
Here, we present the common notations, together with an intrusion detection event taken from the KDD CUP'99 dataset to illustrate them.
Let the original feature set be $F = \{f_1, f_2, \ldots, f_n\}$, where $n$ is the feature dimension count. The feature selection problem is to find a subset $F' \subseteq F$ that maximizes some scoring function; simultaneously, $F'$ is an optimal subset that gives the best possible classification accuracy.
Assume that, in the instance space, we have a set of samples $X = \{x_1, x_2, \ldots, x_m\}$ ($x_i \in \mathbb{R}^n$; $i = 1, \ldots, m$), where $m$ is the number of samples. Let $x_{ij}$ denote the $j$th feature of sample $x_i$. The labeled instance space, which is also called the universal instance space $U$, is defined as the Cartesian product of the input instance space $X$ and the target feature domain $Y$; that is, $U = X \times Y$. The training set is denoted as $S = \{(x_1, y_1), \ldots, (x_m, y_m)\}$, where $Y$ contains $k$ classes, $Y = \{c_1, \ldots, c_k\}$, and $y_i \in Y$. The notation $I(S)$ represents a classifier that was trained using inducer $I$ on the training dataset $S$.
The cost-sensitive feature selection problem is also called the feature selection with minimal average total cost problem. In this paper, we focus on cost-sensitive feature selection based on both misclassification costs and test costs. Unlike generic feature selection algorithms, whose purpose is to achieve accuracy or to reduce measurement error, the purpose of feature selection in our study is to minimize the average cost by considering the trade-off between test costs and misclassification costs. In other words, our optimization objective is to minimize the average total cost.
Let MC be the misclassification cost matrix and let TC be the test cost matrix. The average total cost over the $m$ samples should be the following:
$$\mathrm{AvgCost}(F') = \frac{1}{m} \sum_{i=1}^{m} \left( \sum_{f_j \in F'} \mathrm{TC}(f_j) + \mathrm{MC}(\hat{y}_i, y_i) \right),$$
where $\hat{y}_i$ is the predicted class of sample $x_i$.
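As a concrete sketch of the average total cost (the per-instance test cost of the selected subset plus the average misclassification cost), the following computes it for a toy dataset; the function name, cost values, and class labels are illustrative assumptions:

```python
def average_total_cost(selected_features, test_cost, mc, predictions, labels):
    """Average total cost = test cost of measuring the selected feature
    subset on one instance, plus the misclassification cost of each
    prediction averaged over the dataset."""
    tc_per_instance = sum(test_cost[f] for f in selected_features)
    mc_total = sum(mc[(p, y)] for p, y in zip(predictions, labels))
    return tc_per_instance + mc_total / len(labels)

test_cost = {"f1": 1.0, "f2": 5.0, "f3": 100.0}
mc = {("pos", "pos"): 0.0, ("neg", "neg"): 0.0,
      ("pos", "neg"): 1.0, ("neg", "pos"): 10.0}
preds = ["pos", "neg", "neg", "pos"]
labels = ["pos", "pos", "neg", "pos"]
print(average_total_cost(["f1", "f2"], test_cost, mc, preds, labels))  # 8.5
```

Note how the expensive feature `f3` would dominate the total if selected: the subset `{f1, f2}` costs 6.0 in tests plus 2.5 in average misclassification cost, so a subset that trades a little accuracy for a much cheaper test cost can still win under this objective.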
3.2. Cost Information
In the real world, there are many types of costs associated with a potential instance, such as the cost of additional tests, the cost associated with expert analysis, and intervention costs. Different applications are usually associated with various costs involving misclassification and test costs.
Without loss of generality, let $C(i, j)$ be the cost of predicting an instance of class $j$ as class $i$. When addressing imbalance problems, misclassification costs can be categorized as follows: (1) false positive (FP), with notation $C_{\mathrm{FP}}$, is the cost of misclassifying an instance of the negative class as the positive class; (2) false negative (FN), with notation $C_{\mathrm{FN}}$, is the cost of the opposite case; (3) the misclassification costs of true positives (TP) and true negatives (TN) are equal, namely, zero. Typically, it is more important to recognize positive rather than negative instances.
The cost-sensitive classification problem can be constructed as a decision theory problem using Bayesian decision theory [27, 28]. We assume that the probability $P(j \mid x)$ is defined by a subset of features in instance $x$, while the remaining features are irrelevant or redundant. The optimal prediction for an example $x$ is the class $i$ that minimizes the expected loss $R(i \mid x)$:
$$R(i \mid x) = \sum_{j} P(j \mid x)\, C(i, j).$$
Although decision tree building does not require a cost-sensitive measure for feature selection, an algorithm needs the cost-sensitive property to rank or weight features according to their importance. Feature selection could confirm or expand domain knowledge by using such a ranking.
Borrowing ideas from credit card and cellular phone fraud detection in related fields, the Lee research group identifies three major cost factors related to intrusion detection: damage cost (DCost), response cost (RCost), and operational cost (OpCost). The misclassification cost can be identified by the following formula:
$$\mathrm{MC}(i, j) = \mathrm{RCost}(i) + \varepsilon(i, j)\, \mathrm{DCost}(j),$$
where $\varepsilon(i, j)$ is a function of the progress and effect of the attack. For example, from Table 1 we can see that the misclassification cost $\mathrm{MC}(\mathrm{PROBE}, \mathrm{DOS}) = \mathrm{RCost}(\mathrm{PROBE}) + \mathrm{DCost}(\mathrm{DOS})$.
Table 1: Cost metrics of intrusion categories.
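The RCost/DCost accounting can be sketched as follows. The cost values are hypothetical placeholders in the spirit of Table 1 (the actual metrics come from Lee et al.'s intrusion detection cost model), and the fixed `epsilon` is an assumption:

```python
# Hypothetical cost metrics per intrusion category (placeholders, not the
# values from Table 1 of the paper).
DCOST = {"NORMAL": 0, "PROBE": 2, "DOS": 20, "R2L": 50}
RCOST = {"NORMAL": 0, "PROBE": 5, "DOS": 10, "R2L": 30}

def misclassification_cost(predicted, actual, epsilon=1.0):
    """MC(predicted, actual) = RCost(predicted) + epsilon * DCost(actual),
    where epsilon stands in for the progress/effect of the attack."""
    if predicted == actual:
        return 0.0
    return RCOST[predicted] + epsilon * DCOST[actual]

# Misclassifying a DOS attack as a PROBE incurs the PROBE response cost
# plus the full DOS damage cost:
print(misclassification_cost("PROBE", "DOS"))  # 25.0
```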
The expected misclassification cost for a new example $(x, y)$ drawn at random from distribution $D$ is as follows:
$$\mathbb{E}_{(x, y) \sim D}\left[\mathrm{MC}(h(x), y)\right],$$
where $h(x)$ is the predicted class of $x$.
Based on a study by Lee et al., which extracts and constructs predictive features from network audit data, this approach divides features into four relative levels based on their computational costs; see Table 2. Because the OpCost is the cost of the time and computing resources spent on extracting or analyzing features while processing the stream of events, we assume that the acquisition cost of a feature can be associated with its feature level.
Table 2: Operation cost metrics.
We assume that both misclassification and test costs are given in the same cost scale. Therefore, summing together the two costs to obtain the average total cost is feasible.
3.3. Cost-Sensitive Fitness Function
Unlike traditional feature selection algorithms, whose purpose is to improve classification accuracy or reduce measurement error, this paper attempts to minimize total costs and make trade-offs between costs and classification accuracy. The final objective of the feature selection problem is to select a feature subset that minimizes subset size and average total cost while preserving classification accuracy.
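One way such an objective can be folded into a single genetic-algorithm fitness value is sketched below. The weights and functional form are assumptions for illustration, not the paper's exact fitness function:

```python
def fitness(subset, accuracy, test_cost, avg_mc, w_cost=1.0, w_size=0.1):
    """Illustrative fitness for a GA chromosome encoding a feature subset:
    reward classification accuracy, penalize the average total cost (test
    cost of the subset plus average misclassification cost) and the
    subset size."""
    total_tc = sum(test_cost[f] for f in subset)
    return accuracy - w_cost * (total_tc + avg_mc) - w_size * len(subset)

test_cost = {"f1": 0.1, "f2": 0.5, "f3": 2.0}
print(fitness(["f1", "f2"], accuracy=0.95, test_cost=test_cost, avg_mc=0.2))
```

A chaos genetic algorithm would then evolve bit-string chromosomes (one bit per feature) under this fitness, so that cheap, small, accurate subsets dominate the population.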
Let feature subset $F'$ have $k$ features. Then, we calculate the average total cost of the dataset