An Improved SMOTE Algorithm Based on Probability Density Estimation

Journal of Nanjing Normal University (Natural Science Edition) [ISSN:1001-4616/CN:32-1239/N]

Issue:
2019, No. 01
Page:
65-
Research Field:
Special Column on Artificial Intelligence Algorithms and Applications
Publishing date:

Info

Title:
An Improved SMOTE Algorithm Based on Probability Density Estimation
Author(s):
Li Tao, Zheng Shang, Zou Haitao, Yu Hualong
School of Computer Science, Jiangsu University of Science and Technology, Zhenjiang 212003, China
Keywords:
class imbalance; probability density; instance sampling; SMOTE; Gaussian mixture distribution
PACS:
TP181
DOI:
10.3969/j.issn.1001-4616.2019.01.011
Abstract:
The class imbalance problem is one of the main problems in machine learning and data mining. To address it, researchers have proposed many methods, among which instance sampling is the simplest, most effective, and most widely used approach. As a popular instance sampling algorithm, SMOTE (synthetic minority oversampling technique) tends to be influenced by noisy instances and has poor generalization ability. To deal with this problem, this paper presents an improved SMOTE algorithm that takes probability density information into account. First, we assume that the instances in each class follow a Gaussian mixture distribution, and hence adopt a Gaussian mixture model to estimate the probability density of each instance. Noisy instances are then removed by comparing the rankings of their intra-class and inter-class probability densities. Next, the probability density information is recalculated on the filtered data set, and the minority-class instances are divided into three groups: boundary, safety, and outlier. Finally, a different SMOTE strategy is used to generate new instances for each group. In addition, to further improve generalization, the neighborhood calculation rule in SMOTE is also modified. Experimental results on several binary-class imbalanced data sets indicate that the proposed algorithm is effective and feasible, and that it performs significantly better than several previous algorithms.
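
For orientation, the baseline that the abstract sets out to improve can be sketched as follows. This is a minimal pure-Python illustration of the classic SMOTE interpolation step, not the improved algorithm described above. The Gaussian mixture density estimation, noise filtering, and grouping are not shown. The function name `smote_sample`, its parameters, and the toy data are assumptions made for this example.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def smote_sample(minority, k=5, n_new=10, seed=0):
    """Generate n_new synthetic points by classic SMOTE interpolation.

    Hypothetical helper for illustration only: every minority instance,
    including a noisy one, may be picked as a seed point.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        # Pick a random minority instance as the seed point.
        i = rng.randrange(len(minority))
        x = minority[i]
        # Rank the other minority instances by distance to the seed.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: sq_dist(x, p))
        # Choose one of the k nearest neighbours at random.
        neighbour = rng.choice(others[:k])
        # Place the new point at a random position on the segment
        # between the seed and the chosen neighbour.
        gap = rng.random()
        synthetic.append([a + gap * (b - a) for a, b in zip(x, neighbour)])
    return synthetic


# Toy minority class: four clustered points and one isolated point.
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3], [5.0, 5.0]]
new_points = smote_sample(minority, k=2, n_new=6)
```

In this toy data the isolated point at (5.0, 5.0) can be drawn as a seed like any other instance, so synthetic points may land on the segment between it and the cluster. That is one way to picture the noise sensitivity the abstract mentions.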

Memo

Memo:
-
Last Update: 2019-03-30