{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,1]],"date-time":"2026-03-01T03:39:48Z","timestamp":1772336388049,"version":"3.50.1"},"reference-count":44,"publisher":"Association for Computing Machinery (ACM)","issue":"1s","license":[{"start":{"date-parts":[[2023,1,23]],"date-time":"2023-01-23T00:00:00Z","timestamp":1674432000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"Seventh special support plan for innovation and entrepreneurship of Anhui Province, China"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Multimedia Comput. Commun. Appl."],"published-print":{"date-parts":[[2023,2,28]]},"abstract":"<jats:p>Lipreading is a task of decoding the movement of the speaker\u2019s lip region into text. In recent years, lipreading methods based on deep neural network have attracted widespread attention, and the accuracy has far surpassed that of experienced human lipreaders. The visual differences in some phonemes are extremely subtle and pose a great challenge to lipreading. Most of the lipreading existing methods do not process the extracted visual features, which mainly suffer from two problems. First, the extracted features contain lot of useless information such as noise caused by differences in speech speed and lip shape, for example. In addition, the extracted features are not abstract enough to distinguish phonemes with similar pronunciation. These problems have a bad effect on the performance of lipreading. To extract features from the lip regions that are more distinguishable and more relevant to the speech content, this article proposes an end-to-end deep neural network-based lipreading model (LCSNet). 
The proposed model extracts short-term spatio-temporal features and motion trajectory features from the lip region in the video clips. The extracted features are filtered by a channel attention module to eliminate useless features and then used as input to the proposed Selective Feature Fusion Module (SFFM) to extract high-level abstract features. Afterwards, these features are fed in temporal order into a bidirectional GRU network for temporal modeling to obtain long-term spatio-temporal features. Finally, a Connectionist Temporal Classification (CTC) decoder is used to generate the output text. The experimental results show that the proposed model achieves a 1.0% CER and a 2.3% WER on the GRID corpus, which represent improvements of 52% and 47%, respectively, over LipNet.<\/jats:p>","DOI":"10.1145\/3524620","type":"journal-article","created":{"date-parts":[[2022,3,17]],"date-time":"2022-03-17T13:36:53Z","timestamp":1647524213000},"page":"1-21","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":19,"title":["LCSNet: End-to-end Lipreading with Channel-aware Feature Selection"],"prefix":"10.1145","volume":"19","author":[{"given":"Feng","family":"Xue","sequence":"first","affiliation":[{"name":"Key Laboratory of Knowledge Engineering with Big Data, Intelligent Interconnected Systems Laboratory of Anhui Province, Ministry of Education, Hefei University of Technology, Hefei, China"}]},{"given":"Tian","family":"Yang","sequence":"additional","affiliation":[{"name":"Hefei University of Technology, Hefei, China"}]},{"given":"Kang","family":"Liu","sequence":"additional","affiliation":[{"name":"Hefei University of Technology, Hefei, China"}]},{"given":"Zikun","family":"Hong","sequence":"additional","affiliation":[{"name":"Hefei University of Technology, Hefei, China"}]},{"given":"Mingwei","family":"Cao","sequence":"additional","affiliation":[{"name":"Anhui University, Hefei, 
China"}]},{"given":"Dan","family":"Guo","sequence":"additional","affiliation":[{"name":"Hefei University of Technology, Hefei, China"}]},{"given":"Richang","family":"Hong","sequence":"additional","affiliation":[{"name":"Hefei University of Technology, Hefei, China"}]}],"member":"320","published-online":{"date-parts":[[2023,1,23]]},"reference":[{"key":"e_1_3_1_2_2","volume-title":"Conference on Audio Visual Speech Processing","author":"Hilder S.","year":"2009","unstructured":"S. Hilder, R. Harvey, and B. Theobald. 2009. Comparison of human and machine-based lip-reading. In Conference on Audio Visual Speech Processing."},{"key":"e_1_3_1_3_2","first-page":"3104","volume-title":"Conference on Advances in Neural Information Processing Systems","author":"Sutskever I.","year":"2014","unstructured":"I. Sutskever, O. Vinyals, and Q. V. Le. 2014. Sequence to sequence learning with neural networks. In Conference on Advances in Neural Information Processing Systems. 3104\u20133112."},{"key":"e_1_3_1_4_2","doi-asserted-by":"crossref","first-page":"369","DOI":"10.1145\/1143844.1143891","volume-title":"Proceedings of the 23rd International Conference on Machine Learning ser. (ICML\u201906)","author":"Graves A.","year":"2006","unstructured":"A. Graves, S. Fern\u00e1ndez, and F. Gomez. 2006. Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd International Conference on Machine Learning ser. (ICML\u201906). Association for Computing Machinery, New York, NY, USA, 369\u2013376."},{"key":"e_1_3_1_5_2","volume-title":"International Conference on Learning Representations","author":"Bahdanau D.","year":"2015","unstructured":"D. Bahdanau, K. Cho, and Y. Bengio. 2015. Neural machine translation by jointly learning to align and translate. In International Conference on Learning Representations."},{"key":"e_1_3_1_6_2","unstructured":"W. Chan N. Jaitly Q. V. Le and O. Vinyals. 2015. Listen Attend and Spell. 
arXiv e-prints p. arXiv:1508.01211."},{"key":"e_1_3_1_7_2","first-page":"1764","volume-title":"International Conference on Machine Learning","author":"Graves A.","year":"2014","unstructured":"A. Graves and N. Jaitly. 2014. Towards end-to-end speech recognition with recurrent neural networks. In International Conference on Machine Learning. 1764\u20131772."},{"key":"e_1_3_1_8_2","unstructured":"A. van den Oord S. Dieleman H. Zen K. Simonyan O. Vinyals A. Graves N. Kalchbrenner A. Senior and K. Kavukcuoglu. 2016. Wavenet: A generative model for raw audio arXiv e-prints p. arXiv:1609.03499."},{"key":"e_1_3_1_9_2","first-page":"99","volume-title":"IEEE Trans. Pattern Anal. Mach. Intell.","author":"Jie H.","year":"2017","unstructured":"H. Jie, S. Li, S. Gang, and S. Albanie. 2017. Squeeze-and-excitation networks. IEEE Trans. Pattern Anal. Mach. Intell. PP, 99 (2017)."},{"key":"e_1_3_1_10_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-01234-2_1"},{"key":"e_1_3_1_11_2","doi-asserted-by":"publisher","DOI":"10.1109\/LSP.2018.2890770"},{"key":"e_1_3_1_12_2","first-page":"4109","volume-title":"IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Cui Y.","year":"2018","unstructured":"Y. Cui, Y. Song, C. Sun, A. Howard, and S. Belongie. 2018. Large scale fine-grained categorization and domain-specific transfer learning. In IEEE\/CVF Conference on Computer Vision and Pattern Recognition. 4109\u20134118."},{"key":"e_1_3_1_13_2","volume-title":"IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","author":"Wang T.","year":"2019","unstructured":"T. Wang, X. Yang, K. Xu, S. Chen, Q. Zhang, and R. W. Lau. 2019. Spatial attentive single-image deraining with a high quality real rain dataset. In IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR)."},{"key":"e_1_3_1_14_2","volume-title":"IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","author":"Li X.","year":"2020","unstructured":"X. Li, W. 
Wang, X. Hu, and J. Yang. 2020. Selective kernel networks. In IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR)."},{"key":"e_1_3_1_15_2","volume-title":"arXiv preprint arXiv:1611.01599","author":"M. Assael Y.","year":"2016","unstructured":"Y. M. Assael, B. Shillingford, S. Whiteson, and N. de Freitas. 2016. LipNet: End-to-end sentence-level lipreading. arXiv preprint arXiv:1611.01599."},{"key":"e_1_3_1_16_2","volume-title":"13th IEEE International Conference on Automatic Face & Gesture Recognition (FG\u201918)","author":"Kai X.","year":"2018","unstructured":"X. Kai, D. Li, N. Cassimatis, and X. Wang. 2018. LCANet: End-to-end lipreading with cascaded attention-CTC. In 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG\u201918)."},{"key":"e_1_3_1_17_2","unstructured":"R. K. Srivastava K. Greff and J. Schmidhuber. 2015. Highway networks. arXiv e-prints p. arXiv:1505.00387."},{"key":"e_1_3_1_18_2","volume-title":"IEEE Conference on Computer Vision and Pattern Recognition (CVPR\u201917)","author":"S. Chung J.","year":"2017","unstructured":"J. S. Chung, A. Senior, O. Vinyals, and A. Zisserman. 2017. Lip reading sentences in the wild. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR\u201917)."},{"key":"e_1_3_1_19_2","doi-asserted-by":"publisher","DOI":"10.1162\/neco.1997.9.8.1735"},{"key":"e_1_3_1_20_2","first-page":"165","volume-title":"IEEE International Conference on Acoustics, Speech, and Signal Processing","author":"Potamianos G.","year":"2001","unstructured":"G. Potamianos, J. Luettin, and C. Neti. 2001. Hierarchical discriminant features for audio-visual LVCSR. In IEEE International Conference on Acoustics, Speech, and Signal Processing. 165\u2013168."},{"key":"e_1_3_1_21_2","first-page":"327","volume-title":"3rd International Congress on Image and Signal Processing","author":"A. Shaikh A.","year":"2010","unstructured":"A. A. Shaikh, D. K. Kumar, W. C. Yau, M. Z. C. Azemin, and J. Gubbi. 2010. 
Lip reading using optical flow and support vector machines. In 3rd International Congress on Image and Signal Processing. 327\u2013330."},{"key":"e_1_3_1_22_2","first-page":"361","volume-title":"International Conference on Computational Intelligence and Security","author":"Li M.","year":"2008","unstructured":"M. Li and Y. Ming Cheung. 2008. A novel motion based lip feature extraction for lip-reading. In International Conference on Computational Intelligence and Security. 361\u2013365."},{"key":"e_1_3_1_23_2","first-page":"561","volume-title":"9th International Conference on Signal Processing","author":"Alizadeh S.","year":"2008","unstructured":"S. Alizadeh, R. Boostani, and V. Asadpour. 2008. Lip feature extraction and reduction for HMM-based visual speech recognition systems. In 9th International Conference on Signal Processing. 561\u2013564."},{"key":"e_1_3_1_24_2","doi-asserted-by":"crossref","first-page":"236","DOI":"10.1007\/978-3-540-89646-3_23","volume-title":"4th International Symposium on Advances in Visual Computing","author":"Chen J.","year":"2008","unstructured":"J. Chen, B. Tiddeman, and G. Zhao. 2008. Real-time lip contour extraction and tracking using an improved active contour model. In 4th International Symposium on Advances in Visual Computing. 236\u2013245."},{"issue":"1","key":"e_1_3_1_25_2","doi-asserted-by":"crossref","first-page":"38","DOI":"10.1006\/cviu.1995.1004","article-title":"Active shape models\u2013their training and application","volume":"61","author":"F. Cootes T.","year":"1995","unstructured":"T. F. Cootes, C. J. Taylor, D. H. Cooper, and J. Graham. 1995. Active shape models\u2013their training and application. Comput. Vis. Image Underst. 61, 1 (1995), 38\u201359.","journal-title":"Comput. Vis. Image Underst."},{"issue":"6","key":"e_1_3_1_26_2","doi-asserted-by":"crossref","first-page":"681","DOI":"10.1109\/34.927467","article-title":"Active appearance models","volume":"23","author":"F. Cootes T.","year":"2001","unstructured":"T. 
F. Cootes, G. J. Edwards, and C. J. Taylor. 2001. Active appearance models. IEEE Trans. Pattern Anal. Mach. Intell. 23, 6 (2001), 681\u2013685.","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"e_1_3_1_27_2","first-page":"1149","volume-title":"15th Annual Conference of the International Speech Communication Association: Celebrating the Diversity of Spoken Languages (INTERSPEECH\u201914)","author":"Noda K.","year":"2014","unstructured":"K. Noda, Y. Yamaguchi, K. Nakadai, H. G. Okuno, and T. Ogata. 2014. Lipreading using convolutional neural network. In 15th Annual Conference of the International Speech Communication Association: Celebrating the Diversity of Spoken Languages (INTERSPEECH\u201914). 1149\u20131153."},{"key":"e_1_3_1_28_2","doi-asserted-by":"publisher","DOI":"10.1145\/3065386"},{"key":"e_1_3_1_29_2","doi-asserted-by":"publisher","DOI":"10.5244\/C.28.6"},{"key":"e_1_3_1_30_2","doi-asserted-by":"crossref","first-page":"3652","DOI":"10.21437\/Interspeech.2017-85","volume-title":"Interspeech Conference","author":"Stafylakis T.","year":"2017","unstructured":"T. Stafylakis and G. Tzimiropoulos. 2017. Combining residual networks with LSTMs for lipreading. In Interspeech Conference. 3652\u20133656."},{"key":"e_1_3_1_31_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.90"},{"key":"e_1_3_1_32_2","first-page":"602","volume-title":"International Joint Conference on Neural Networks","author":"Graves A.","year":"2005","unstructured":"A. Graves and J. Schmidhuber. 2005. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. In International Joint Conference on Neural Networks. 602\u2013610."},{"key":"e_1_3_1_33_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2018.8461326"},{"key":"e_1_3_1_34_2","volume-title":"arXiv preprint arXiv:1803.01271","author":"Bai S.","year":"2018","unstructured":"S. Bai, J. Z. Kolter, and V. Koltun. 2018. 
An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv preprint arXiv:1803.01271."},{"key":"e_1_3_1_35_2","volume-title":"IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP\u201920)","author":"Martinez B.","year":"2020","unstructured":"B. Martinez, P. Ma, S. Petridis, and M. Pantic. 2020. Lipreading using temporal convolutional networks. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP\u201920)."},{"key":"e_1_3_1_36_2","first-page":"7988","volume-title":"IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP\u201921)","author":"Hao M.","year":"2021","unstructured":"M. Hao, M. Mamut, N. Yadikar, A. Aysa, and K. Ubul. 2021. How to use time information effectively? Combining with time shift module for lipreading. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP\u201921). 7988\u20137992."},{"key":"e_1_3_1_37_2","first-page":"7608","volume-title":"IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP\u201921)","author":"Ma P.","year":"2021","unstructured":"P. Ma, B. Martinez, S. Petridis, and M. Pantic. 2021. Towards practical lipreading with distilled and efficient models. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP\u201921). 7608\u20137612."},{"key":"e_1_3_1_38_2","volume-title":"IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR\u201920)","author":"Zhou Y.","year":"2020","unstructured":"Y. Zhou, M. Wang, D. Liu, Z. Hu, and H. Zhang. 2020. More grounded image captioning by distilling image-text matching model. In IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR\u201920)."},{"key":"e_1_3_1_39_2","volume-title":"ACM Multimedia Asia Conference","author":"Zhao Y.","year":"2019","unstructured":"Y. Zhao, R. Xu, and M. Song. 2019. A cascade sequence-to-sequence model for Chinese Mandarin lip reading. 
In ACM Multimedia Asia Conference."},{"key":"e_1_3_1_40_2","first-page":"1","volume-title":"IEEE Trans. Intell. Transport. Syst.","author":"Amodio A.","year":"2018","unstructured":"A. Amodio, M. Ermidoro, D. Maggi, S. Formentin, and S. M. Savaresi. 2018. Automatic detection of driver impairment based on pupillary light reflex. IEEE Trans. Intell. Transport. Syst. PP (2018), 1\u201311."},{"key":"e_1_3_1_41_2","volume-title":"International Conference on Learning Representations","author":"Kingma D. P.","year":"2015","unstructured":"D. P. Kingma and J. L. Ba. 2015. Adam: A method for stochastic optimization. In International Conference on Learning Representations."},{"key":"e_1_3_1_42_2","doi-asserted-by":"publisher","DOI":"10.1109\/FG47880.2020.00134"},{"key":"e_1_3_1_43_2","doi-asserted-by":"publisher","DOI":"10.1145\/3474085.3475420"},{"key":"e_1_3_1_44_2","first-page":"4328","volume-title":"28th ACM International Conference on Multimedia","author":"Liu J.","year":"2020","unstructured":"J. Liu, Y. Ren, Z. Zhao, C. Zhang, B. Huai, and J. Yuan. 2020. FastLR: Non-autoregressive lipreading model with integrate-and-fire. In 28th ACM International Conference on Multimedia. 4328\u20134336."},{"key":"e_1_3_1_45_2","first-page":"6917","volume-title":"AAAI Conference on Artificial Intelligence","author":"Zhao Y.","year":"2020","unstructured":"Y. Zhao, R. Xu, X. Wang, P. Hou, H. Tang, and M. Song. 2020. Hearing lips: Improving lip reading by distilling speech recognizers. In AAAI Conference on Artificial Intelligence. 
6917\u20136924."}],"container-title":["ACM Transactions on Multimedia Computing, Communications, and Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3524620","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3524620","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T18:09:54Z","timestamp":1750183794000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3524620"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,1,23]]},"references-count":44,"journal-issue":{"issue":"1s","published-print":{"date-parts":[[2023,2,28]]}},"alternative-id":["10.1145\/3524620"],"URL":"https:\/\/doi.org\/10.1145\/3524620","relation":{},"ISSN":["1551-6857","1551-6865"],"issn-type":[{"value":"1551-6857","type":"print"},{"value":"1551-6865","type":"electronic"}],"subject":[],"published":{"date-parts":[[2023,1,23]]},"assertion":[{"value":"2021-10-31","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2022-03-08","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2023-01-23","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}