Li Tao, Zheng Shang, Zou Haitao, et al. An Improved SMOTE Algorithm Based on Probability Density Estimation[J]. Journal of Nanjing Normal University (Natural Science Edition), 2019, 42(01): 65. [doi:10.3969/j.issn.1001-4616.2019.01.011]

An Improved SMOTE Algorithm Based on Probability Density Estimation

Journal of Nanjing Normal University (Natural Science Edition) [ISSN:1001-4616/CN:32-1239/N]

Volume:
Vol. 42
Issue:
No. 01, 2019
Pages:
65
Column:
Special Column on Artificial Intelligence Algorithms and Applications
Publication date:
2019-03-20

Article Info

Title:
An Improved SMOTE Algorithm Based on Probability Density Estimation
Article ID:
1001-4616(2019)01-0065-08
Author(s):
Li Tao, Zheng Shang, Zou Haitao, Yu Hualong
School of Computer Science, Jiangsu University of Science and Technology, Zhenjiang 212003, China
Keywords:
class imbalance; probability density; instance sampling; SMOTE; Gaussian mixture distribution
CLC number:
TP181
DOI:
10.3969/j.issn.1001-4616.2019.01.011
Document code:
A
Abstract:
The class imbalance problem is one of the central concerns in machine learning and data mining. Many solutions have been proposed, among which instance sampling is the simplest, most effective, and most widely used. SMOTE (synthetic minority oversampling technique), the most popular sampling algorithm, is easily affected by noisy instances and tends to generalize poorly. To address these weaknesses, this paper presents an improved SMOTE algorithm based on probability density estimation. First, the instances of each class are assumed to follow a Gaussian mixture distribution, and a Gaussian mixture model is used to estimate the probability density of every instance; noisy instances are then filtered out by comparing the rankings of each instance's intra-class and inter-class probability densities. Next, the probability densities of the filtered minority-class instances are re-estimated, and the instances are divided into three groups according to their characteristics: boundary, safe and outlier instances. Finally, a different SMOTE sampling strategy is applied to each group. In addition, the neighborhood calculation rule of SMOTE is modified to further improve generalization. Experiments on several benchmark binary-class imbalanced data sets show that the proposed algorithm is effective and feasible, and that it significantly outperforms several existing sampling algorithms.
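The sampling pipeline described above can be illustrated with a short Python sketch. The abstract does not specify the exact noise-filtering rule, the boundary/safe/outlier cut-offs or the per-group sampling strategies, so the density-rank comparison, the quantile thresholds and the group-dependent interpolation ranges used below are assumptions for illustration only; the sketch relies on scikit-learn's GaussianMixture and NearestNeighbors.

import numpy as np
from scipy.stats import rankdata
from sklearn.mixture import GaussianMixture
from sklearn.neighbors import NearestNeighbors

def gmm_density(X_fit, X_eval, n_components=2, seed=0):
    # Fit a Gaussian mixture on X_fit and return the estimated density of each row of X_eval.
    gmm = GaussianMixture(n_components=n_components, random_state=seed).fit(X_fit)
    return np.exp(gmm.score_samples(X_eval))

def density_smote(X_min, X_maj, n_new, k=5, seed=0):
    rng = np.random.default_rng(seed)

    # Step 1 (assumed rule): keep a minority instance only if its density rank under the
    # minority-class model is at least its rank under the majority-class model.
    keep = rankdata(gmm_density(X_min, X_min)) >= rankdata(gmm_density(X_maj, X_min))
    X_min = X_min[keep]

    # Step 2: re-estimate densities on the filtered minority set and split it into
    # outlier / boundary / safe groups by density quantiles (assumed cut-offs).
    d = gmm_density(X_min, X_min)
    lo, hi = np.quantile(d, [0.1, 0.5])
    group = np.where(d < lo, "outlier", np.where(d < hi, "boundary", "safe"))

    # Step 3: group-dependent SMOTE interpolation. Safe seeds may interpolate over the
    # full segment to a random minority neighbour, boundary seeds only part of the way,
    # and outliers are never used as seeds (one plausible reading of "different strategies").
    gap_hi = {"safe": 1.0, "boundary": 0.5}
    idx = NearestNeighbors(n_neighbors=k + 1).fit(X_min).kneighbors(
        X_min, return_distance=False)               # idx[:, 0] is the point itself
    seeds = np.flatnonzero(group != "outlier")

    synthetic = []
    for i in rng.choice(seeds, size=n_new):
        j = rng.choice(idx[i, 1:])                  # a random minority-class neighbour
        gap = rng.uniform(0.0, gap_hi[group[i]])
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return X_min, np.asarray(synthetic)

For an imbalanced data set (X, y) with minority label 1, the minority class could be oversampled with, for example, density_smote(X[y == 1], X[y == 0], n_new=200), and the returned synthetic instances appended to the training set.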

References:

[1] GANGANWAR V. An overview of classification algorithms for imbalanced datasets[J]. International journal of emerging technology and advanced engineering,2012,2(4):42-47.
[2]SUN Y,WONG A K C,KAMEL M S. Classification of imbalanced data:a review[J]. International journal of pattern recognition and artificial intelligence,2009,23(4):687-719.
[3]VAN HULSE J,KHOSHGOFTAAR T M,NAPOLITANO A. An exploration of learning when data is noisy and imbalanced[J]. Intelligent data analysis,2011,15(2):215-236.
[4]YANG Q,WU X. 10 challenging problems in data mining research[J]. International journal of information technology and decision making,2006,5(4):597-604.
[5]LIU Y,HAN T L,SUN A. Imbalanced text classification:A term weighting approach[J]. Expert systems with applications,2009,36(1):690-701.
[6]THOMAS C. Improving intrusion detection for imbalanced network traffic[J]. Security and communication networks,2013,6(3):309-324.
[7]WANG S,YAO X. Using class imbalance learning for software defect prediction[J]. IEEE transactions on reliability,2013,62(2):434-443.
[8]BATUWITA R,PALADE V. FSVM-CIL:fuzzy support vector machines for class imbalance learning[J]. IEEE transactions on fuzzy systems,2010,18(3):558-571.
[9]YU H L,MU C,SUN C Y,et al. Support vector machine-based optimized decision threshold adjustment strategy for classifying imbalanced data[J]. Knowledge-based systems,2015,76(1):67-78.
[10]CHAWLA N V,BOWYER K W,HALL L O,et al. SMOTE:synthetic minority over-sampling technique[J]. Journal of artificial intelligence research,2002,16(1):321-357.
[11]HAN H,WANG W Y,MAO B H. Borderline-SMOTE:a new over-sampling method in imbalanced data sets learning[C]//International Conference on Intelligent Computing. USA:ICIC,2005:878-887.
[12]GARCÍA V,SÁNCHEZ J S,MARTÍN-FÉLEZ R,et al. Surrounding neighborhood-based SMOTE for learning from imbalanced data sets[J]. Progress in artificial intelligence,2012,1(4):347-362.
[13]BUNKHUMPORNPAT C,SINAPIROMSARAN K,LURSINSAP C. Safe-level-SMOTE:safe-level-synthetic minority over-sampling technique for handling the class imbalanced problem[C]//Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining. Germany:Springer-Verlag,2009:475-482.
[14]SáEZ J A,LUENGO J,STEFANOWSKI J,et al. SMOTE-IPF:addressing the noisy and borderline examples problem in imbalanced classification by a re-sampling method with filtering[J]. Information sciences,2015,291(5):184-203.
[15]XIANG R H,WANG R S. A range image segmentation algorithm based on Gaussian mixture models[J]. Journal of software,2003,14(7):1250-1257.
[16]WU F X,WEN W D. Maximum likelihood and maximum entropy probability density estimation and its optimized solution[J]. Journal of Nanjing University of Aeronautics & Astronautics,2017,49(1):110-116.
[17]ALCALÁ-FDEZ J,FERNÁNDEZ A,LUENGO J,et al. KEEL data-mining software tool:data set repository,integration of algorithms and experimental analysis framework[J]. Journal of multiple-valued logic and soft computing,2011,17(2/3):255-287.
[18]BLAKE C,KEOGH E,MERZ C J. UCI repository of machine learning databases[EB/OL]. http://www.ics.uci.edu/mlearn/MLRepository.html,1998.
[19]HE H,GARCIA E A. Learning from imbalanced data[J]. IEEE transactions on knowledge & data engineering,2009,21(9):1263-1284.
[20]GUO H X,LI Y,SHANG J,et al. Learning from class-imbalanced data:Review of methods and applications[J]. Expert systems with applications,2016,73:220-239.
[21]LÓPEZ V,FERNÁNDEZ A,GARCÍA S,et al. An insight into classification with imbalanced data:empirical results and current trends on using data intrinsic characteristics[J]. Information sciences,2013,250:113-141.
[22]DEMSAR J. Statistical comparisons of classifiers over multiple data sets[J]. Journal of machine learning research,2006,7:1-30.

Memo:
Received: 2018-08-16.
Foundation item: National Natural Science Foundation of China (61305058, 61572242), Natural Science Foundation of Jiangsu Province (BK20130471), China Postdoctoral Science Foundation Special Funded Project (2015T80481), China Postdoctoral Science Foundation (2013M540404), and Jiangsu Postdoctoral Research Foundation (1401037B).
Corresponding author: Yu Hualong, Ph.D., associate professor, research interests: machine learning, data mining. E-mail: yuhualong@just.edu.cn
Last Update: 2019-03-30