{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,10]],"date-time":"2026-04-10T01:57:27Z","timestamp":1775786247444,"version":"3.50.1"},"reference-count":33,"publisher":"MDPI AG","issue":"6","license":[{"start":{"date-parts":[[2022,3,21]],"date-time":"2022-03-21T00:00:00Z","timestamp":1647820800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"National Natural Science Foundations of China","award":["62002271, 61772396, 61772392, 61902296"],"award-info":[{"award-number":["62002271, 61772396, 61772392, 61902296"]}]},{"name":"National Key R&D Program of China","award":["2018YFC0807500"],"award-info":[{"award-number":["2018YFC0807500"]}]},{"DOI":"10.13039\/501100012226","name":"Fundamental Research Funds for the Central Universities","doi-asserted-by":"publisher","award":["XJS210310"],"award-info":[{"award-number":["XJS210310"]}],"id":[{"id":"10.13039\/501100012226","id-type":"DOI","asserted-by":"publisher"}]},{"name":"Province Key R&D Program of Shaanxi","award":["2020LSFP3-15"],"award-info":[{"award-number":["2020LSFP3-15"]}]},{"name":"National Natural Science Foundation of Shaanxi Province","award":["2020JQ-330, 2020JM-195"],"award-info":[{"award-number":["2020JQ-330, 2020JM-195"]}]},{"DOI":"10.13039\/501100002858","name":"China Postdoctoral Science Foundation","doi-asserted-by":"publisher","award":["2019M663640"],"award-info":[{"award-number":["2019M663640"]}],"id":[{"id":"10.13039\/501100002858","id-type":"DOI","asserted-by":"publisher"}]},{"name":"Guangxi Key Laboratory of Trusted Software","award":["KX202061"],"award-info":[{"award-number":["KX202061"]}]},{"name":"Key R&D Projects of Qingdao Science and Technology Plan","award":["21-1-2-18-xx"],"award-info":[{"award-number":["21-1-2-18-xx"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Sensors"],"abstract":"<jats:p>Gesture 
recognition is an important direction in computer vision research. Information from the hands is crucial in this task. However, current methods consistently achieve attention on hand regions based on estimated keypoints, which will significantly increase both time and complexity, and may lose position information of the hand due to wrong keypoint estimations. Moreover, for dynamic gesture recognition, it is not enough to consider only the attention in the spatial dimension. This paper proposes a multi-scale attention 3D convolutional network for gesture recognition, with a fusion of multimodal data. The proposed network achieves attention mechanisms both locally and globally. The local attention leverages the hand information extracted by the hand detector to focus on the hand region, and reduces the interference of gesture-irrelevant factors. Global attention is achieved in both the human-posture context and the channel context through a dual spatiotemporal attention module. Furthermore, to make full use of the differences between different modalities of data, we designed a multimodal fusion scheme to fuse the features of RGB and depth data. The proposed method is evaluated using the Chalearn LAP Isolated Gesture Dataset and the Briareo Dataset. 
Experiments on these two datasets prove the effectiveness of our network and show it outperforms many state-of-the-art methods.<\/jats:p>","DOI":"10.3390\/s22062405","type":"journal-article","created":{"date-parts":[[2022,3,21]],"date-time":"2022-03-21T21:48:42Z","timestamp":1647899322000},"page":"2405","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":31,"title":["Multi-Scale Attention 3D Convolutional Network for Multimodal Gesture Recognition"],"prefix":"10.3390","volume":"22","author":[{"given":"Huizhou","family":"Chen","sequence":"first","affiliation":[{"name":"School of Computer Science and Technology, Xidian University, Xi\u2019an 710071, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-7316-4354","authenticated-orcid":false,"given":"Yunan","family":"Li","sequence":"additional","affiliation":[{"name":"School of Computer Science and Technology, Xidian University, Xi\u2019an 710071, China"}]},{"given":"Huijuan","family":"Fang","sequence":"additional","affiliation":[{"name":"Xiaomi Communications, Beijing 100085, China"}]},{"given":"Wentian","family":"Xin","sequence":"additional","affiliation":[{"name":"School of Computer Science and Technology, Xidian University, Xi\u2019an 710071, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-2743-2017","authenticated-orcid":false,"given":"Zixiang","family":"Lu","sequence":"additional","affiliation":[{"name":"School of Computer Science and Technology, Xidian University, Xi\u2019an 710071, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-2872-388X","authenticated-orcid":false,"given":"Qiguang","family":"Miao","sequence":"additional","affiliation":[{"name":"School of Computer Science and Technology, Xidian University, Xi\u2019an 710071, China"}]}],"member":"1968","published-online":{"date-parts":[[2022,3,21]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","unstructured":"Zhou, B., Li, Y., and Wan, J. (2021). 
Regional Attention with Architecture-Rebuilt 3D Network for RGB-D Gesture Recognition. arXiv.","DOI":"10.1609\/aaai.v35i4.16471"},{"key":"ref_2","doi-asserted-by":"crossref","first-page":"2551","DOI":"10.1109\/TMM.2019.2960700","article-title":"Deep gesture video generation with learning on regions of interest","volume":"22","author":"Cui","year":"2019","journal-title":"IEEE Trans. Multimed."},{"key":"ref_3","doi-asserted-by":"crossref","first-page":"1330","DOI":"10.1007\/s40815-020-00825-w","article-title":"Hand Gesture recognition in complex background based on convolutional pose machine and fuzzy Gaussian mixture models","volume":"22","author":"Zhang","year":"2020","journal-title":"Int. J. Fuzzy Syst."},{"key":"ref_4","doi-asserted-by":"crossref","unstructured":"Carreira, J., and Zisserman, A. (2017, January 21\u201326). Quo vadis, action recognition? A new model and the kinetics dataset. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.502"},{"key":"ref_5","doi-asserted-by":"crossref","first-page":"357","DOI":"10.1007\/s11554-012-0295-0","article-title":"Novel Haar features for real-time hand gesture recognition using SVM","volume":"10","author":"Hsieh","year":"2015","journal-title":"J. Real-Time Image Process."},{"key":"ref_6","first-page":"19","article-title":"Real time hand gesture recognition using SIFT","volume":"2","author":"Gurjal","year":"2012","journal-title":"Int. J. Electron. Electr. Eng."},{"key":"ref_7","unstructured":"Bao, J., Song, A., Guo, Y., and Tang, H. (2011, January 5\u201317). Dynamic hand gesture recognition based on SURF tracking. Proceedings of the 2011 International Conference on Electric Information and Control Engineering, Wuhan, China."},{"key":"ref_8","doi-asserted-by":"crossref","unstructured":"Ghafouri, S., and Seyedarabi, H. (2013, January 14\u201316). Hybrid method for hand gesture recognition based on combination of Haar-like and HOG features. 
Proceedings of the 2013 21st Iranian Conference on Electrical Engineering (ICEE), Mashhad, Iran.","DOI":"10.1109\/IranianCEE.2013.6599529"},{"key":"ref_9","first-page":"2513","article-title":"One-shot-learning gesture recognition using hog-hof features","volume":"15","author":"Hagara","year":"2014","journal-title":"J. Mach. Learn. Res."},{"key":"ref_10","unstructured":"Simonyan, K., and Zisserman, A. (2014). Two-stream convolutional networks for action recognition in videos. arXiv."},{"key":"ref_11","doi-asserted-by":"crossref","unstructured":"Donahue, J., Anne, H.L., Guadarrama, S., and Rohrbach, M. (2015, January 7\u201312). Long-term recurrent convolutional networks for visual recognition and description. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA.","DOI":"10.1109\/CVPR.2015.7298878"},{"key":"ref_12","doi-asserted-by":"crossref","unstructured":"Tran, D., Bourdev, L., Fergus, R., Torresani, L., and Paluri, M. (2015, January 7\u201313). Learning spatiotemporal features with 3d convolutional networks. Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile.","DOI":"10.1109\/ICCV.2015.510"},{"key":"ref_13","doi-asserted-by":"crossref","first-page":"187","DOI":"10.1016\/j.patrec.2017.12.003","article-title":"Large-scale gesture recognition with a fusion of RGB-D data based on optical flow and the C3D model","volume":"119","author":"Li","year":"2019","journal-title":"Pattern Recognit. Lett."},{"key":"ref_14","doi-asserted-by":"crossref","unstructured":"Miao, Q., Li, Y., Ouyang, W., Ma, Z., Xu, X., Shi, W., and Cao, X. (2017, January 22\u201329). Multimodal gesture recognition based on the resc3d network. 
Proceedings of the IEEE International Conference on Computer Vision Workshops, Venice, Italy.","DOI":"10.1109\/ICCVW.2017.360"},{"key":"ref_15","doi-asserted-by":"crossref","first-page":"2956","DOI":"10.1109\/TCSVT.2017.2749509","article-title":"Large-scale gesture recognition with a fusion of RGB-D data based on saliency theory and C3D model","volume":"28","author":"Li","year":"2017","journal-title":"IEEE Trans. Circuits Syst. Video Technol."},{"key":"ref_16","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/3131343","article-title":"A unified framework for multimodal isolated gesture recognition","volume":"14","author":"Duan","year":"2018","journal-title":"ACM Trans. Multimed. Comput. Commun. Appl."},{"key":"ref_17","doi-asserted-by":"crossref","unstructured":"Wang, P., Li, W., Liu, S., Gao, Z., Tang, C., and Ogunbona, P. (2016, January 4\u20138). Large-scale isolated gesture recognition using convolutional neural networks. Proceedings of the 2016 23rd International Conference on Pattern Recognition (ICPR), Cancun, Mexico.","DOI":"10.1109\/ICPR.2016.7899599"},{"key":"ref_18","doi-asserted-by":"crossref","unstructured":"Liu, Z., Chai, X., Liu, Z., and Chen, X. (2017, January 22\u201329). Continuous gesture recognition with hand-oriented spatiotemporal feature. Proceedings of the IEEE International Conference on Computer Vision Workshops, Venice, Italy.","DOI":"10.1109\/ICCVW.2017.361"},{"key":"ref_19","first-page":"91","article-title":"Faster r-cnn: Towards real-time object detection with region proposal networks","volume":"28","author":"Ren","year":"2015","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_20","doi-asserted-by":"crossref","unstructured":"Narayana, P., Beveridge, R., and Draper, B.A. (2018, January 18\u201322). Gesture recognition: Focus on the hands. 
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00549"},{"key":"ref_21","unstructured":"(2022, March 15). YOLO v5. Available online: https:\/\/github.com\/ultralytics\/yolov5."},{"key":"ref_22","unstructured":"Mittal, A., Zisserman, A., and Torr, P.H.S. (September, January 29). Hand detection using multiple proposals. Proceedings of the British Machine Vision Conference, Dundee, UK."},{"key":"ref_23","doi-asserted-by":"crossref","unstructured":"Fu, J., Liu, J., Tian, H., Li, Y., Fang, Z., and Lu, H. (2019, January 15\u201320). Dual attention network for scene segmentation. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA.","DOI":"10.1109\/CVPR.2019.00326"},{"key":"ref_24","doi-asserted-by":"crossref","unstructured":"Wan, J., Zhao, Y., Zhou, S., Guyon, I., Escalera, S., and Li, S.Z. (2016, January 27\u201330). Chalearn looking at people rgb-d isolated and continuous datasets for gesture recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Las Vegas, NV, USA.","DOI":"10.1109\/CVPRW.2016.100"},{"key":"ref_25","doi-asserted-by":"crossref","first-page":"1929","DOI":"10.1007\/s00138-014-0596-3","article-title":"The chalearn gesture dataset (cgd 2011)","volume":"25","author":"Guyon","year":"2014","journal-title":"Mach. Vis. Appl."},{"key":"ref_26","unstructured":"Kay, W., Carreira, J., Simonyan, K., Zhang, B., Hillier, C., Vijayanarasimhan, S., Viola, F., Green, T., Back, T., and Zisserman, A. (2017). The kinetics human action video dataset. arXiv."},{"key":"ref_27","doi-asserted-by":"crossref","first-page":"1323","DOI":"10.1109\/TNNLS.2019.2919764","article-title":"Redundancy and attention in convolutional LSTM for gesture recognition","volume":"31","author":"Zhu","year":"2019","journal-title":"IEEE Trans. Neural Netw. Learn. 
Syst."},{"key":"ref_28","doi-asserted-by":"crossref","unstructured":"Wang, P., Li, W., Wan, J., Ogunbona, P., and Liu, X. (2018, January 2\u20137). Cooperative training of deep aggregation networks for RGB-D action recognition. Proceedings of the AAAI Conference on Artificial Intelligence, Hilton New Orleans Riverside, New Orleans, LA, USA.","DOI":"10.1609\/aaai.v32i1.12228"},{"key":"ref_29","doi-asserted-by":"crossref","unstructured":"Hu, T.K., Lin, Y.Y., and Hsiu, P.C. (2018, January 2\u20137). Learning adaptive hidden layers for mobile gesture recognition. Proceedings of the AAAI Conference on Artificial Intelligence, Hilton New Orleans Riverside, New Orleans, LA, USA.","DOI":"10.1609\/aaai.v32i1.12279"},{"key":"ref_30","doi-asserted-by":"crossref","unstructured":"Zhang, L., Zhu, G., Shen, P., Song, J., Shah, S.A., and Bennamoun, M. (2017, January 22\u201329). Learning spatiotemporal features using 3dcnn and convolutional lstm for gesture recognition. Proceedings of the IEEE International Conference on Computer Vision Workshops, Venice, Italy.","DOI":"10.1109\/ICCVW.2017.369"},{"key":"ref_31","unstructured":"Zhang, L., Zhu, G., Mei, L., Shen, P., Shah, S.A.A., and Bennamoun, M. (2018, January 3\u20138). Attention in convolutional LSTM for gesture recognition. Proceedings of the 32nd International Conference on Neural Information Processing Systems, Montreal, QC, Canada."},{"key":"ref_32","doi-asserted-by":"crossref","unstructured":"Manganaro, F., Pini, S., Borghi, G., Vezzani, R., and Cucchiara, R. (2019, January 9\u201313). Hand gestures for the human-car interaction: The briareo dataset. Proceedings of the International Conference on Image Analysis and Processing, Trento, Italy.","DOI":"10.1007\/978-3-030-30645-8_51"},{"key":"ref_33","doi-asserted-by":"crossref","unstructured":"D\u2019Eusanio, A., Simoni, A., Pini, S., Borghi, G., Vezzani, R., and Cucchiara, R. (2020, January 25\u201328). A transformer-based network for dynamic hand gesture recognition. 
Proceedings of the 2020 International Conference on 3D Vision (3DV), Fukuoka, Japan.","DOI":"10.1109\/3DV50981.2020.00072"}],"container-title":["Sensors"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/1424-8220\/22\/6\/2405\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,10]],"date-time":"2025-10-10T22:40:14Z","timestamp":1760136014000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/1424-8220\/22\/6\/2405"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,3,21]]},"references-count":33,"journal-issue":{"issue":"6","published-online":{"date-parts":[[2022,3]]}},"alternative-id":["s22062405"],"URL":"https:\/\/doi.org\/10.3390\/s22062405","relation":{},"ISSN":["1424-8220"],"issn-type":[{"value":"1424-8220","type":"electronic"}],"subject":[],"published":{"date-parts":[[2022,3,21]]}}}