[1]吕宏伟,李 博,刘普凡,等.基于均衡聚类索引的近似最近邻检索方法[J].南京师大学报(自然科学版),2024,(02):99-108.[doi:10.3969/j.issn.1001-4616.2024.02.012]
 L Hongwei,Li Bo,Liu Pufan,et al.Balanced Clustering-based Index for Approximate Nearest Neighbor Retrieval[J].Journal of Nanjing Normal University(Natural Science Edition),2024,(02):99-108.[doi:10.3969/j.issn.1001-4616.2024.02.012]
点击复制

基于均衡聚类索引的近似最近邻检索方法()
分享到:

《南京师大学报(自然科学版)》[ISSN:1001-4616/CN:32-1239/N]

卷:
期数:
2024年02期
页码:
99-108
栏目:
计算机科学与技术
出版日期:
2024-06-15

文章信息/Info

Title:
Balanced Clustering-based Index for Approximate Nearest Neighbor Retrieval
文章编号:
1001-4616(2024)02-0099-10
作者:
吕宏伟李 博刘普凡刘 识李继伟刘俊健
(国家电网大数据中心,江苏 南京 210023)
Author(s):
Lü HongweiLi BoLiu PufanLiu ShiLi JiweiLiu Junjian
(Big Data Center of StateGrid Corporation of China,Nanjing 210023,China)
关键词:
大数据检索与分析最近邻搜索均衡感知
Keywords:
big data retrieval and analysisnearest neighbor searchbalanced perception
分类号:
TP391
DOI:
10.3969/j.issn.1001-4616.2024.02.012
文献标志码:
A
摘要:
大数据时代,深度学习通过将复杂对象表示为高维特征向量,并使用向量之间的距离度量来衡量样本的相似性,在推荐系统、用户画像、数据中台管理等场景中得到了广泛的应用. 但是,随着数据规模的不断增加,海量特征数据的相似向量检索面临着检索模型占用内容大、特征检索算法召回率较低的严重挑战. 如何在保证检索精度的前提下,设计紧凑型索引图结构,降低特征检索的内存消耗,对于提升大数据系统的近邻检索效率具有重要的作用. 因此,本文提出了一种均衡感知的快速K均值近邻聚类的特征数据分桶及其图结构紧凑型索引用于海量数据近邻检索. 首先,设计了均衡感知的快速K-均值聚类算法,通过在图索引构建过程中海量特征数据的均衡分桶,将高维向量压缩成轻量级紧凑型图索引结构,随后通过量化操作进一步压缩高维向量样本,提升其在候选集上的最近邻检索速度. 在基准数据集上实验验证结果表明,本文提出的方法能够在保证较高检测召回率的同时,有效加快索引构建速度,可以用于支持高维特征数据的高效最近邻检索.
Abstract:
In the era of big data,deep learning has been widely applied in recommendation systems,user profiling,and data management by representing complex objects as high-dimensional feature vectors and evaluating their similarities based on vector distance measurements. However,with the continuous growth of data scale,the retrieval of similar feature vectors from massive data faces significant challenges such as large memory consumption of retrieval models and low recall rates of feature retrieval algorithms. It is crucial to design compact index graph structures and reduce memory consumption in feature retrieval to improve the efficiency of nearest neighbor search in large-scale data systems while ensuring retrieval accuracy. Therefore,a balanced-aware distributed K-means clustering-based user feature binning approach and a compact index design algorithm for graph structures are proposed. Firstly,fast balanced-aware K-means clustering algorithm is designed to achieve balanced binning of massive feature data during graph index construction,compressing high-dimensional vectors into lightweight and compact graph index structures. Subsequently,quantization operation is conducted to further compress high-dimensional vectors sample and improve its nearest neighbor search speed in dataset. Experimental results on benchmark datasets demonstrate that the proposed method can effectively accelerate index construction speed while ensuring high accuracy,thus enabling efficient indexing and retrieval of massive data.

参考文献/References:

[1]LI X,ZHANG W,DING Q. Deep learning-based remaining useful life estimation of bearings using multi-scale feature extraction[J]. Reliability engineering & system safety,2019,182(C):208-218.
[2]WU H,LIU Y,WANG J. Review of text classification methods on deep learning[J]. Computers,materials & continua,2020,63(3):1309-1321.
[3]PURWINS H,LI B,VIRTANEN T,et al. Deep learning for audio signal processing[J]. IEEE journal of selected topics in signal processing,2019,13(2):206-219.
[4]LITJENS G,KOOI T,BEJNORDI B E,et al. A survey on deep learning in medical image analysis[J]. Medical image analysis,2017,42:60-88.
[5]DA'U A,SALIM N. Recommendation system based on deep learning methods:a systematic review and new directions[J]. Artificial intelligence review,2020,53(4):2709-2748.
[6]ZHENG B,ZHAO X,WENG L. et al. PM-LSH:a fast and accurate in-memory framework for high-dimensional approximate NN and closest pair search[J]. The VLDB journal,2022,30(6):1339-1363.
[7]BABENKO A,LEMPITSKY V. The inverted multi-Index[J]. IEEE transactions on pattern analysis and machine intelligence,2015,37(6):1247-1260.
[8]GE T Z,HE K M,KE Q F,et al. Optimized product quantization[J]. IEEE transactions on pattern analysis and machine intelligence,2014,36(4):744-755.
[9]KALANTIDIS Y,AVRITHIS Y. Locally optimized product quantization for approximate nearest neighbor search[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Columbus:IEEE,2014:2321-2328.
[10]FU C,XIANG C,WANG C,et al. Fast approximate nearest neighbor search with the navigating spreading-out graph[J]. Proceedings of the VLDB endowment,2019,12(5):461-474.
[11]SUBRAMANYA J S,DEVVRIT F,SIMHADRI H V,et al. DiskANN:fast accurate billion-point nearest neighbor search on a single node[C]//Neural Information Processing Systems. Canada:IEEE,2019,32.
[12]MALKOV Y A,YASHUNIN D A. Efficient and robust approximate near-est neighbor search using hierarchical navigable small world graphs[J]. CoRR,2016.
[13]BARANCHUK D,BABENKO A,MALKOV Y. Revisiting the inverted indices for billion-scale approximate nearest neighbors[C]//Proceedings of the European Conference on Computer Vision(ECCV). Munich:Springer-Verlag,2018:202-216.
[14]MOROUZI M B A,FLEET D J. Cartesian k-means[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Portland:IEEE,2013:3017-3024.
[15]WANG J,WANG J,SONG J,et al. Optimized cartesian k-means[J]. IEEE transactions on knowledge and data engineering,2014,27(1):180-192.
[16]LINDE Y,BUZO A,GRAY R. An algorithm for vector quantizer design[J]. IEEE transactions on communications,1980,28(1):84-95.
[17]ZHAN J,MAO J,LIU Y,et al. Jointly optimizing query encoder and product quantization to improve retrieval performance[C]//Proceedings of the 30th ACM International Conference on Information & Knowledge Management. Santonni:ACM,2021:2487-2496.
[18]BENTLEY J L. Multidimensional binary search trees used for associative searching[J]. Communications of the ACM,1975,18(9):509-517.
[19]CIACCIA P,PATELLA M,ZEZULA P. M-tree:an Efficient access method for similarity search in metric spaces[C]//Proceedings of the 23rd VLDB Conference. Athens,Greece:ACM,1997:426-435.
[20]GUTTMAN A. R-trees:A dynamic index structure for spatial searching[C]//Proceedings of the 1984 ACM SIGMOD International Conference on Management of Data. Boston:ACM,1984:47-57.
[21]ANDONI A,INDYK P,NGUYEN H L,et al. Beyond locality-sensitive hashing[C]//Proceedings of the twenty-fifth annual ACM-SIAM symposium on Discrete algorithms. Society for Industrial and Applied Mathematics. PA:ACM,2014:1018-1028.
[22]DATAR M,IMMORLICA N,INDYK P,et al. Locality-sensitive hashing scheme based on p-stable distributions[C]//Proceedings of the Twentieth Annual Symposium on Computational Geometry. New York:ACM,2004:253-262.
[23]PAULEVE L,JEGOU H,AMSALEG L. Locality sensitive hashing:a comparison of hash functiontypes and querying mechanisms[J]. Pattern recognition letters,2010,31(11):1348-1358.
[24]HARWOOD B,DRUMMOND T. Fanng:fast approximate nearest neighbour graphs[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. New York:IEEE,2016:5713-5722.
[25]JAROMCZYK J W,TOUSSAINT G T. Relative neighborhood graphs and their relatives[J]. Proceedings of the IEEE,1992,80(9):1502-1517.
[26]TOUSSAINT G T. The relative neighbourhood graph of a finite planar set[J]. Pattern recognition,1980,12(4):261-268.
[27]AURENHAMM F. Voronoi diagrams—a survey of a fundamental geometric data structure[J]. ACM computing surveys(CSUR),1991,23(3):345-405.
[28]HAJEBI K,ABBASI-YADKORI Y,SHAHBAZI H,et al. Fast approximate nearest-neighbor search with k-nearest neighbor graph[C]//Twenty-Second International Joint Conference on Artificial Intelligence. Catalonsa,Spain:ACM,2011.
[29]LOWE D G. Distinctive image features from scale-invariant keypoints[J]. International journal of computer vision,2004,60(2):91-110.
[30]DOUZE M,JÉGOU H,Sandhawalia H,et al. Evaluation of gist descriptors for webscale image search[C]//Proceedings of the ACM International Conference on Image and Video Retrieval. Santorini,Greece:ACM,2009:1-8.

备注/Memo

备注/Memo:
收稿日期:2023-08-29.
基金项目:国家电网有限公司大数据中心自建科技项目(SGSJ0000SJJS2310021).
通讯作者:刘识,硕士,高级工程师,研究方向:最近邻算法大数据检索、智能推荐. E-mail:yetfly092@163.com
更新日期/Last Update: 2024-06-15