[1] Du Yifei, Li Lin, Kong Jinghui. Real-Time Hand Mesh Reconstruction Based on Attention-Convolution Hybrid Architecture[J]. Journal of Nanjing Normal University (Natural Science Edition), 2025, 48(06): 90-100. [doi:10.3969/j.issn.1001-4616.2025.06.010]

Real-Time Hand Mesh Reconstruction Based on Attention-Convolution Hybrid Architecture

Journal of Nanjing Normal University (Natural Science Edition) [ISSN:1001-4616/CN:32-1239/N]

Volume:
48
Issue:
No. 06, 2025
Pages:
90-100
Column:
Computer Science and Technology
Publication date:
2025-12-20

Article Info

Title:
Real-Time Hand Mesh Reconstruction Based on Attention-Convolution Hybrid Architecture
Article number:
1001-4616(2025)06-0090-11
Author(s):
Du Yifei, Li Lin, Kong Jinghui
(Visualization and Cooperative Computing (VCC) Laboratory, School of Computer Science and Information Engineering (School of Artificial Intelligence), Hefei University of Technology, Hefei 230601, Anhui, China)
Keywords:
3D hand mesh reconstruction; hybrid hierarchical feature extraction; real-time performance; dimension mapping transformation; mesh reconstruction decoding
CLC number:
TP391
DOI:
10.3969/j.issn.1001-4616.2025.06.010
Document code:
A
Abstract:
Current hand mesh reconstruction methods focus primarily on reconstruction accuracy and error, and pay comparatively little attention to real-time performance. To address this, this paper proposes QuickHand, a 3D hand mesh reconstruction method based on a three-stage attention-convolution hybrid architecture. First, a hybrid hierarchical feature extractor combines lightweight convolutional encoding, multi-scale feature transformation, and a global attention mechanism to map single-view hand images efficiently to discriminative feature representations. Second, a dimension mapping transformer uses adaptive joint feature encoding and spatial transformation to convert 2D planar features precisely into the mesh vertex feature space. Finally, an efficient mesh reconstruction decoder uses depthwise separable spiral convolution and a multi-level upsampling strategy to reconstruct the 3D hand mesh with high accuracy at low computational cost. Experiments show that the proposed method improves real-time performance while maintaining high reconstruction accuracy: it offers better real-time performance than accuracy-oriented methods and better reconstruction accuracy than lightweight methods.
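The abstract attributes the decoder's low computational cost to depthwise separable spiral convolution. The sketch below only illustrates why such a factorization tends to be cheap, by counting learnable parameters. It assumes the common factorization (a per-channel filter over the spiral neighbourhood followed by a pointwise channel mix); the function names and layer sizes are hypothetical and are not taken from the paper, whose decoder may differ.

```python
def spiral_conv_params(spiral_len, c_in, c_out):
    """Parameters of a standard spiral convolution: one dense layer applied
    to the concatenated features of the spiral_len neighbours of a vertex."""
    return spiral_len * c_in * c_out + c_out  # weights + bias


def ds_spiral_conv_params(spiral_len, c_in, c_out):
    """Parameters of a depthwise separable variant (assumed factorization):
    a per-channel filter over the spiral, then a pointwise channel mix."""
    depthwise = spiral_len * c_in + c_in  # one filter per input channel + bias
    pointwise = c_in * c_out + c_out      # 1x1 mix across channels + bias
    return depthwise + pointwise


# Hypothetical layer sizes chosen for illustration only.
S, C_IN, C_OUT = 9, 64, 64
standard = spiral_conv_params(S, C_IN, C_OUT)
separable = ds_spiral_conv_params(S, C_IN, C_OUT)
print(standard, separable, round(standard / separable, 2))
# prints: 36928 4800 7.69
```

Under these assumed sizes the separable layer needs roughly one eighth of the parameters of the dense spiral layer; the actual saving depends on the spiral length and channel widths used in a given decoder.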

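Memo

Received: 2025-03-23.
Funding: General Program of the National Natural Science Foundation of China (62277014).
Corresponding author: Li Lin, Ph.D., associate professor; research interests: virtual reality and human-computer interaction. E-mail: lilin_julia@hfut.edu.cn
Last Update: 2025-12-20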