{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,11]],"date-time":"2026-02-11T06:18:15Z","timestamp":1770790695545,"version":"3.50.0"},"reference-count":87,"publisher":"Association for Computing Machinery (ACM)","issue":"FSE","license":[{"start":{"date-parts":[[2024,7,12]],"date-time":"2024-07-12T00:00:00Z","timestamp":1720742400000},"content-version":"vor","delay-in-days":0,"URL":"http:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"DOI":"10.13039\/100011039","name":"IARPA","doi-asserted-by":"crossref","award":["W911NF-19-S-0012"],"award-info":[{"award-number":["W911NF-19-S-0012"]}],"id":[{"id":"10.13039\/100011039","id-type":"DOI","asserted-by":"crossref"}]},{"DOI":"10.13039\/100000185","name":"DARPA","doi-asserted-by":"crossref","award":["HR001120S0058"],"award-info":[{"award-number":["HR001120S0058"]}],"id":[{"id":"10.13039\/100000185","id-type":"DOI","asserted-by":"crossref"}]},{"name":"ONR","award":["N000141712045, N000141410468,N000141712947"],"award-info":[{"award-number":["N000141712045, N000141410468,N000141712947"]}]},{"DOI":"10.13039\/100000001","name":"NSF","doi-asserted-by":"publisher","award":["1901242 and 1910300"],"award-info":[{"award-number":["1901242 and 1910300"]}],"id":[{"id":"10.13039\/100000001","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Proc. ACM Softw. Eng."],"published-print":{"date-parts":[[2024,7,12]]},"abstract":"<jats:p>\n                    Transformer based code models have impressive performance in many software engineering tasks. However, their effectiveness degrades when symbols are missing or not informative. The reason is that the model may not learn to pay attention to the right correlations\/contexts without the help of symbols. We propose a new method to pre-train general code models when symbols are lacking. We observe that in such cases, programs degenerate to something written in a very primitive language. We hence propose to use program analysis to extract contexts a priori (instead of relying on symbols and masked language modeling as in vanilla models). We then leverage a novel attention masking method to only allow the model attending to these contexts, e.g., bi-directional program dependence transitive closures and token co-occurrences. In the meantime, the inherent self-attention mechanism is utilized to learn which of the allowed attentions are more important compared to others. To realize the idea, we enhance the vanilla tokenization and model architecture of a BERT model, construct and utilize attention masks, and introduce a new pre-training algorithm. We pre-train this BERT-like model from scratch, using a dataset of 26 million stripped binary functions with explicit program dependence information extracted by our tool. We apply the model in three downstream tasks: binary similarity, type inference, and malware family classification. 
Our pre-trained model can improve the SOTAs in these tasks from 53% to 64%, 49% to 60%, and 74% to 94%, respectively. 
It also substantially outperforms other general pre-training techniques of code understanding models.\n                  <\/jats:p>","DOI":"10.1145\/3643752","type":"journal-article","created":{"date-parts":[[2024,7,12]],"date-time":"2024-07-12T10:22:09Z","timestamp":1720779729000},"page":"562-585","source":"Crossref","is-referenced-by-count":1,"title":["CodeArt: Better Code Models by Attention Regularization When Symbols Are Lacking"],"prefix":"10.1145","volume":"1","author":[{"ORCID":"https:\/\/orcid.org\/0009-0001-8283-2825","authenticated-orcid":false,"given":"Zian","family":"Su","sequence":"first","affiliation":[{"name":"Purdue University, West Lafayette, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-6619-781X","authenticated-orcid":false,"given":"Xiangzhe","family":"Xu","sequence":"additional","affiliation":[{"name":"Purdue University, West Lafayette, USA"}]},{"ORCID":"https:\/\/orcid.org\/0009-0009-2764-6091","authenticated-orcid":false,"given":"Ziyang","family":"Huang","sequence":"additional","affiliation":[{"name":"Purdue University, West Lafayette, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-6515-0021","authenticated-orcid":false,"given":"Zhuo","family":"Zhang","sequence":"additional","affiliation":[{"name":"Purdue University, West Lafayette, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-7232-0650","authenticated-orcid":false,"given":"Yapeng","family":"Ye","sequence":"additional","affiliation":[{"name":"Purdue University, West Lafayette, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-4403-0060","authenticated-orcid":false,"given":"Jianjun","family":"Huang","sequence":"additional","affiliation":[{"name":"Renmin University of China, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-9544-2500","authenticated-orcid":false,"given":"Xiangyu","family":"Zhang","sequence":"additional","affiliation":[{"name":"Purdue University, West Lafayette, USA"}]}],"member":"320","published-online":{"date-parts":[[2024,7,12]]},"reference":[{"key":"e_1_3_1_2_2","doi-asserted-by":"publisher","unstructured":"2024. CodeArt Artifact. https:\/\/doi.org\/10.5281\/zenodo.11096386 10.5281\/zenodo.11096386","DOI":"10.5281\/zenodo.11096386"},{"key":"e_1_3_1_3_2","article-title":"Gpt-4 technical report","author":"Achiam Josh","year":"2023","unstructured":"Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. Gpt-4 technical report. arXiv preprint arXiv:2303.08774 (2023).","journal-title":"arXiv preprint arXiv:2303.08774"},{"key":"e_1_3_1_4_2","doi-asserted-by":"crossref","unstructured":"Wasi Ahmad Saikat Chakraborty Baishakhi Ray and Kai-Wei Chang. 2021. Unified Pre-training for Program Understanding and Generation. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2655\u20132668.","DOI":"10.18653\/v1\/2021.naacl-main.211"},{"key":"e_1_3_1_5_2","article-title":"Learning to represent programs with graphs","author":"Allamanis Miltiadis","year":"2017","unstructured":"Miltiadis Allamanis, Marc Brockschmidt, and Mahmoud Khademi. 2017. Learning to represent programs with graphs. 
arXiv preprint arXiv:1711.00740 (2017).","journal-title":"arXiv preprint arXiv:1711.00740"},{"key":"e_1_3_1_6_2","doi-asserted-by":"publisher","DOI":"10.1145\/360018.360025"},{"key":"e_1_3_1_7_2","article-title":"On the bottleneck of graph neural networks and its practical implications","author":"Alon Uri","year":"2020","unstructured":"Uri Alon and Eran Yahav. 2020. On the bottleneck of graph neural networks and its practical implications. arXiv preprint arXiv:2006.05205 (2020).","journal-title":"arXiv preprint arXiv:2006.05205"},{"key":"e_1_3_1_8_2","first-page":"8626","article-title":"Learning to execute programs with instruction pointer attention graph neural networks","volume":"33","author":"Bieber David","year":"2020","unstructured":"David Bieber, Charles Sutton, Hugo Larochelle, and Daniel Tarlow. 2020. Learning to execute programs with instruction pointer attention graph neural networks. Advances in Neural Information Processing Systems 33 (2020), 8626\u20138637.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_1_9_2","first-page":"1877","article-title":"Language models are few-shot learners","volume":"33","author":"Brown Tom","year":"2020","unstructured":"Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in neural information processing systems 33 (2020), 1877\u20131901.","journal-title":"Advances in neural information processing systems"},{"key":"e_1_3_1_10_2","first-page":"1186","volume-title":"2021 IEEE\/ACM 43rd International Conference on Software Engineering (ICSE)","author":"Bui Nghi DQ","year":"2021","unstructured":"Nghi DQ Bui, Yijun Yu, and Lingxiao Jiang. 2021. Infercode: Self-supervised learning of code representations by predicting subtrees. In 2021 IEEE\/ACM 43rd International Conference on Software Engineering (ICSE). IEEE, 1186\u20131197."},{"key":"e_1_3_1_11_2","doi-asserted-by":"crossref","unstructured":"Nghi DQ Bui Yijun Yu and Lingxiao Jiang. 2021. Treecaps: Tree-based capsule networks for source code processing. In Proceedings of the AAAI Conference on Artificial Intelligence Vol. 35. 30\u201338.","DOI":"10.1609\/aaai.v35i1.16074"},{"key":"e_1_3_1_12_2","article-title":"A study on prompt design, advantages and limitations of chatgpt for deep learning program repair","author":"Cao Jialun","year":"2023","unstructured":"Jialun Cao, Meiziniu Li, Ming Wen, and Shing-chi Cheung. 2023. A study on prompt design, advantages and limitations of chatgpt for deep learning program repair. arXiv preprint arXiv:2304.08191 (2023).","journal-title":"arXiv preprint arXiv:2304.08191"},{"key":"e_1_3_1_13_2","first-page":"19930","article-title":"GALOIS: boosting deep reinforcement learning via generalizable logic synthesis","volume":"35","author":"Cao Yushi","year":"2022","unstructured":"Yushi Cao, Zhiming Li, Tianpei Yang, Hao Zhang, Yan Zheng, Yi Li, Jianye Hao, and Yang Liu. 2022. GALOIS: boosting deep reinforcement learning via generalizable logic synthesis. Advances in Neural Information Processing Systems 35 (2022), 19930\u201319943.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_1_14_2","doi-asserted-by":"crossref","unstructured":"Saikat Chakraborty Toufique Ahmed Yangruibo Ding Premkumar T Devanbu and Baishakhi Ray. 2022. Natgen: generative pre-training by \u201cnaturalizing\u201d source code. 
In Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 18\u201330.","DOI":"10.1145\/3540250.3549162"},{"key":"e_1_3_1_15_2","doi-asserted-by":"crossref","unstructured":"Deli Chen Yankai Lin Wei Li Peng Li Jie Zhou and Xu Sun. 2020. Measuring and relieving the over-smoothing problem for graph neural networks from the topological view. In Proceedings of the AAAI conference on artificial intelligence Vol. 34. 3438\u20133445.","DOI":"10.1609\/aaai.v34i04.5747"},{"key":"e_1_3_1_16_2","doi-asserted-by":"crossref","unstructured":"Fuxiang Chen Fatemeh H Fard David Lo and Timofey Bryksin. 2022. On the transferability of pre-trained language models for low-resource programming languages. In Proceedings of the 30th IEEE\/ACM International Conference on Program Comprehension. 401\u2013412.","DOI":"10.1145\/3524610.3527917"},{"key":"e_1_3_1_17_2","unstructured":"Qibin Chen Jeremy Lacomis Edward J Schwartz Claire Le Goues Graham Neubig and Bogdan Vasilescu. 2022. Augmenting decompiler output with learned variable names and types. In 31st USENIX Security Symposium (USENIX Security 22). 4327\u20134343."},{"key":"e_1_3_1_18_2","first-page":"23089","article-title":"PLUR: A unifying, graph-based view of program learning, understanding, and repair","volume":"34","author":"Chen Zimin","year":"2021","unstructured":"Zimin Chen, Vincent J Hellendoorn, Pascal Lamblin, Petros Maniatis, Pierre-Antoine Manzagol, Daniel Tarlow, and Subhodeep Moitra. 2021. PLUR: A unifying, graph-based view of program learning, understanding, and repair. Advances in Neural Information Processing Systems 34 (2021), 23089\u201323101.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_1_19_2","volume-title":"Introduction to algorithms","author":"Cormen Thomas H","year":"2022","unstructured":"Thomas H Cormen, Charles E Leiserson, Ronald L Rivest, and Clifford Stein. 2022. Introduction to algorithms. MIT press."},{"key":"e_1_3_1_20_2","article-title":"Bert: Pre-training of deep bidirectional transformers for language understanding","author":"Devlin Jacob","year":"2018","unstructured":"Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).","journal-title":"arXiv preprint arXiv:1810.04805"},{"key":"e_1_3_1_21_2","article-title":"TRACED: Execution-aware Pre-training for Source Code","author":"Ding Yangruibo","year":"2023","unstructured":"Yangruibo Ding, Ben Steenhoek, Kexin Pei, Gail Kaiser, Wei Le, and Baishakhi Ray. 2023. TRACED: Execution-aware Pre-training for Source Code. arXiv preprint arXiv:2306.07487 (2023).","journal-title":"arXiv preprint arXiv:2306.07487"},{"key":"e_1_3_1_22_2","doi-asserted-by":"publisher","DOI":"10.1145\/3542944"},{"key":"e_1_3_1_23_2","doi-asserted-by":"publisher","DOI":"10.1007\/s10664-022-10118-5"},{"key":"e_1_3_1_24_2","doi-asserted-by":"crossref","unstructured":"Jinhao Dong Yiling Lou Qihao Zhu Zeyu Sun Zhilin Li Wenjie Zhang and Dan Hao. 2022. FIRA: fine-grained graph-based code change representation for automated commit message generation. In Proceedings of the 44th International Conference on Software Engineering. 970\u2013981.","DOI":"10.1145\/3510003.3510069"},{"key":"e_1_3_1_25_2","doi-asserted-by":"crossref","unstructured":"Yue Duan Xuezixiang Li Jinghan Wang and Heng Yin. 2020. Deepbindiff: Learning program-wide code representations for binary diffing. 
In Network and distributed system security symposium.","DOI":"10.14722\/ndss.2020.24311"},{"key":"e_1_3_1_26_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICSE48619.2023.00128"},{"key":"e_1_3_1_27_2","doi-asserted-by":"crossref","unstructured":"Pengcheng Fang Zhenhua Zou Xusheng Xiao and Zhuotao Liu. 2023. iSyn: Semi-automated Smart Contract Synthesis from Legal Financial Agreements. In Proceedings of the 32nd ACM SIGSOFT International Symposium on Software Testing and Analysis. 727\u2013739.","DOI":"10.1145\/3597926.3598091"},{"key":"e_1_3_1_28_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICSE48619.2023.00060"},{"key":"e_1_3_1_29_2","unstructured":"Kassem Fawaz Huan Feng and Kang G Shin. 2015. Anatomization and protection of mobile apps\u2019 location privacy threats. In 24th USENIX Security Symposium (USENIX Security 15). 753\u2013768."},{"key":"e_1_3_1_30_2","doi-asserted-by":"crossref","unstructured":"Zhangyin Feng Daya Guo Duyu Tang Nan Duan Xiaocheng Feng Ming Gong Linjun Shou Bing Qin Ting Liu Daxin Jiang et al. 2020. CodeBERT: A Pre-Trained Model for Programming and Natural Languages. In Findings of the Association for Computational Linguistics: EMNLP 2020. 1536\u20131547.","DOI":"10.18653\/v1\/2020.findings-emnlp.139"},{"key":"e_1_3_1_31_2","doi-asserted-by":"publisher","DOI":"10.1145\/24039.24041"},{"key":"e_1_3_1_32_2","doi-asserted-by":"crossref","unstructured":"Lian Gao Yu Qu Sheng Yu Yue Duan and Heng Yin. [n. d.]. SIGMADIFF: Semantics-Aware Deep Graph Matching for Pseudocode Diffing. In Network and Distributed System Security (NDSS) Symposium 2024.","DOI":"10.14722\/ndss.2024.23208"},{"key":"e_1_3_1_33_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICSE48619.2023.00164"},{"key":"e_1_3_1_34_2","article-title":"Identifying authorship style in malicious binaries: techniques, challenges & datasets","author":"Gray Jason","year":"2021","unstructured":"Jason Gray, Daniele Sgandurra, and Lorenzo Cavallaro. 2021. Identifying authorship style in malicious binaries: techniques, challenges & datasets. arXiv preprint arXiv:2101.06124 (2021).","journal-title":"arXiv preprint arXiv:2101.06124"},{"key":"e_1_3_1_35_2","doi-asserted-by":"crossref","unstructured":"Daya Guo Shuai Lu Nan Duan Yanlin Wang Ming Zhou and Jian Yin. 2022. UniXcoder: Unified Cross-Modal Pre-training for Code Representation. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 7212\u20137225.","DOI":"10.18653\/v1\/2022.acl-long.499"},{"key":"e_1_3_1_36_2","article-title":"Graphcodebert: Pre-training code representations with data flow","author":"Guo Daya","year":"2020","unstructured":"Daya Guo, Shuo Ren, Shuai Lu, Zhangyin Feng, Duyu Tang, Shujie Liu, Long Zhou, Nan Duan, Alexey Svyatkovskiy, Shengyu Fu, et al. 2020. Graphcodebert: Pre-training code representations with data flow. arXiv preprint arXiv:2009.08366 (2020).","journal-title":"arXiv preprint arXiv:2009.08366"},{"key":"e_1_3_1_37_2","article-title":"A study on the impact of pre-trained model on Just-In-Time defect prediction","author":"Guo Yuxiang","year":"2023","unstructured":"Yuxiang Guo, Xiaopeng Gao, Zhenyu Zhang, WK Chan, and Bo Jiang. 2023. A study on the impact of pre-trained model on Just-In-Time defect prediction. 
arXiv preprint arXiv:2309.02317 (2023).","journal-title":"arXiv preprint arXiv:2309.02317"},{"key":"e_1_3_1_38_2","article-title":"GRAPHSPY: Fused Program Semantic-Level Embedding via Graph Neural Networks for Dead Store Detection","author":"Guo Yixin","year":"2020","unstructured":"Yixin Guo, Pengcheng Li, Yingwei Luo, Xiaolin Wang, and Zhenlin Wang. 2020. GRAPHSPY: Fused Program Semantic-Level Embedding via Graph Neural Networks for Dead Store Detection. arXiv preprint arXiv:2011.09501 (2020).","journal-title":"arXiv preprint arXiv:2011.09501"},{"key":"e_1_3_1_39_2","unstructured":"IDA Pro 2023. A powerful disassembler and a versatile debugger. https:\/\/hex-rays.com\/ida-pro\/"},{"key":"e_1_3_1_40_2","doi-asserted-by":"crossref","unstructured":"Xin Jin Kexin Pei Jun Yeon Won and Zhiqiang Lin. 2022. Symlm: Predicting function names in stripped binaries via context-sensitive execution-aware code embeddings. In Proceedings of the 2022 ACM SIGSAC Conference on Computer and Communications Security. 1631\u20131645.","DOI":"10.1145\/3548606.3560612"},{"key":"e_1_3_1_41_2","first-page":"5110","volume-title":"International conference on machine learning","author":"Kanade Aditya","year":"2020","unstructured":"Aditya Kanade, Petros Maniatis, Gogul Balakrishnan, and Kensen Shi. 2020. Learning and evaluating contextual embedding of source code. In International conference on machine learning. PMLR, 5110\u20135121."},{"key":"e_1_3_1_42_2","doi-asserted-by":"crossref","unstructured":"Geunwoo Kim Sanghyun Hong Michael Franz and Dokyung Song. 2022. Improving cross-platform binary analysis using representation learning via graph alignment. In Proceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis. 151\u2013163.","DOI":"10.1145\/3533767.3534383"},{"key":"e_1_3_1_43_2","doi-asserted-by":"crossref","unstructured":"Qimai Li Zhichao Han and Xiao-Ming Wu. 2018. Deeper insights into graph convolutional networks for semi-supervised learning. In Proceedings of the AAAI conference on artificial intelligence Vol. 32.","DOI":"10.1609\/aaai.v32i1.11604"},{"key":"e_1_3_1_44_2","article-title":"Soft-Labeled Contrastive Pre-training for Function-level Code Representation","author":"Li Xiaonan","year":"2022","unstructured":"Xiaonan Li, Daya Guo, Yeyun Gong, Yun Lin, Yelong Shen, Xipeng Qiu, Daxin Jiang, Weizhu Chen, and Nan Duan. 2022. Soft-Labeled Contrastive Pre-training for Function-level Code Representation. arXiv preprint arXiv:2210.09597 (2022).","journal-title":"arXiv preprint arXiv:2210.09597"},{"key":"e_1_3_1_45_2","first-page":"3835","volume-title":"International conference on machine learning","author":"Li Yujia","year":"2019","unstructured":"Yujia Li, Chenjie Gu, Thomas Dullien, Oriol Vinyals, and Pushmeet Kohli. 2019. Graph matching networks for learning the similarity of graph structured objects. In International conference on machine learning. PMLR, 3835\u20133845."},{"key":"e_1_3_1_46_2","article-title":"CCT5: A Code-Change-Oriented Pre-Trained Model","author":"Lin Bo","year":"2023","unstructured":"Bo Lin, Shangwen Wang, Zhongxin Liu, Yepang Liu, Xin Xia, and Xiaoguang Mao. 2023. CCT5: A Code-Change-Oriented Pre-Trained Model. arXiv preprint arXiv:2305.10785 (2023).","journal-title":"arXiv preprint arXiv:2305.10785"},{"key":"e_1_3_1_47_2","unstructured":"Zhiqiang Lin Xiangyu Zhang and Dongyan Xu. 2010. Automatic reverse engineering of data structures from binary execution. In Proceedings of the 11th Annual Information Security Symposium. 
1\u20131."},{"key":"e_1_3_1_48_2","article-title":"Roberta: A robustly optimized bert pretraining approach","author":"Liu Yinhan","year":"2019","unstructured":"Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692 (2019).","journal-title":"arXiv preprint arXiv:1907.11692"},{"key":"e_1_3_1_49_2","article-title":"Codexglue: A machine learning benchmark dataset for code understanding and generation","author":"Lu Shuai","year":"2021","unstructured":"Shuai Lu, Daya Guo, Shuo Ren, Junjie Huang, Alexey Svyatkovskiy, Ambrosio Blanco, Colin Clement, Dawn Drain, Daxin Jiang, Duyu Tang, et al. 2021. Codexglue: A machine learning benchmark dataset for code understanding and generation. arXiv preprint arXiv:2102.04664 (2021).","journal-title":"arXiv preprint arXiv:2102.04664"},{"key":"e_1_3_1_50_2","doi-asserted-by":"publisher","DOI":"10.1109\/TSE.2017.2655046"},{"key":"e_1_3_1_51_2","doi-asserted-by":"crossref","unstructured":"Minh-Thang Luong Hieu Pham and Christopher D Manning. 2015. Effective Approaches to Attention-based Neural Machine Translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. 1412\u20131421.","DOI":"10.18653\/v1\/D15-1166"},{"key":"e_1_3_1_52_2","doi-asserted-by":"crossref","unstructured":"Wei Ma Mengjie Zhao Ezekiel Soremekun Qiang Hu Jie M Zhang Mike Papadakis Maxime Cordy Xiaofei Xie and Yves Le Traon. 2022. Graphcode2vec: Generic code embedding via lexical and program dependence analyses. In Proceedings of the 19th International Conference on Mining Software Repositories. 524\u2013536.","DOI":"10.1145\/3524842.3528456"},{"key":"e_1_3_1_53_2","doi-asserted-by":"crossref","unstructured":"Yixuan Ma Shuang Liu Jiajun Jiang Guanhong Chen and Keqiu Li. 2021. A comprehensive study on learning-based PE malware family classification methods. In Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 1314\u20131325.","DOI":"10.1145\/3468264.3473925"},{"key":"e_1_3_1_54_2","unstructured":"Andrea Marcelli Mariano Graziano Xabier Ugarte-Pedrero Yanick Fratantonio Mohamad Mansouri and Davide Balzarotti. 2022. How machine learning is solving the binary function similarity problem. In 31st USENIX Security Symposium (USENIX Security 22). 2099\u20132116."},{"issue":"2","key":"e_1_3_1_55_2","doi-asserted-by":"crossref","first-page":"38","DOI":"10.1007\/s10664-022-10276-6","article-title":"An empirical study of text-based machine learning models for vulnerability detection","volume":"28","author":"Napier Kollin","year":"2023","unstructured":"Kollin Napier, Tanmay Bhowmik, and Shaowei Wang. 2023. An empirical study of text-based machine learning models for vulnerability detection. Empirical Software Engineering 28, 2 (2023), 38.","journal-title":"Empirical Software Engineering"},{"key":"e_1_3_1_56_2","doi-asserted-by":"publisher","DOI":"10.1145\/2016904.2016908"},{"key":"e_1_3_1_57_2","doi-asserted-by":"crossref","unstructured":"Kexin Pei Jonas Guan Matthew Broughton Zhongtian Chen Songchen Yao David Williams-King Vikas Ummadisetty Junfeng Yang Baishakhi Ray and Suman Jana. 2021. StateFormer: Fine-grained type recovery from binaries using generative state modeling. 
In Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 690\u2013702.","DOI":"10.1145\/3468264.3468607"},{"key":"e_1_3_1_58_2","article-title":"Trex: Learning execution semantics from micro-traces for binary similarity","author":"Pei Kexin","year":"2020","unstructured":"Kexin Pei, Zhou Xuan, Junfeng Yang, Suman Jana, and Baishakhi Ray. 2020. Trex: Learning execution semantics from micro-traces for binary similarity. arXiv preprint arXiv:2012.08680 (2020).","journal-title":"arXiv preprint arXiv:2012.08680"},{"key":"e_1_3_1_59_2","first-page":"8476","volume-title":"International Conference on Machine Learning","author":"Peng Dinglan","year":"2021","unstructured":"Dinglan Peng, Shuxin Zheng, Yatao Li, Guolin Ke, Di He, and Tie-Yan Liu. 2021. How could neural networks understand programs?. In International Conference on Machine Learning. PMLR, 8476\u20138486."},{"key":"e_1_3_1_60_2","article-title":"Domain Knowledge Matters: Improving Prompts with Fix Templates for Repairing Python Type Errors","author":"Peng Yun","year":"2023","unstructured":"Yun Peng, Shuzheng Gao, Cuiyun Gao, Yintong Huo, and Michael R Lyu. 2023. Domain Knowledge Matters: Improving Prompts with Fix Templates for Repairing Python Type Errors. arXiv preprint arXiv:2306.01394 (2023).","journal-title":"arXiv preprint arXiv:2306.01394"},{"key":"e_1_3_1_61_2","doi-asserted-by":"publisher","DOI":"10.5555\/509043"},{"key":"e_1_3_1_62_2","unstructured":"Alec Radford Karthik Narasimhan Tim Salimans Ilya Sutskever et al. 2018. Improving language understanding by generative pre-training. (2018)."},{"key":"e_1_3_1_63_2","doi-asserted-by":"publisher","DOI":"10.5555\/3455716.3455856"},{"key":"e_1_3_1_64_2","article-title":"Code llama: Open foundation models for code","author":"Rozi\u00e8re Baptiste","year":"2023","unstructured":"Baptiste Rozi\u00e8re, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal Remez, J\u00e9r\u00e9my Rapin, et al. 2023. Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023).","journal-title":"arXiv preprint arXiv:2308.12950"},{"key":"e_1_3_1_65_2","article-title":"Model-Agnostic Syntactical Information for Pre-Trained Programming Language Models","author":"Saberi Iman","year":"2023","unstructured":"Iman Saberi and Fatemeh H Fard. 2023. Model-Agnostic Syntactical Information for Pre-Trained Programming Language Models. arXiv preprint arXiv:2303.06233 (2023).","journal-title":"arXiv preprint arXiv:2303.06233"},{"key":"e_1_3_1_66_2","doi-asserted-by":"crossref","DOI":"10.1145\/3585008","article-title":"SEAL: Integrating Program Analysis and Repository Mining","author":"Sattler Florian","year":"2023","unstructured":"Florian Sattler, Sebastian B\u00f6hm, Philipp Dominik Schubert, Norbert Siegmund, and Sven Apel. 2023. SEAL: Integrating Program Analysis and Repository Mining. ACM Transactions on Software Engineering and Methodology (2023).","journal-title":"ACM Transactions on Software Engineering and Methodology"},{"key":"e_1_3_1_67_2","doi-asserted-by":"publisher","DOI":"10.1145\/3412376"},{"key":"e_1_3_1_68_2","article-title":"Towards Efficient Fine-tuning of Pre-trained Code Models: An Experimental Study and Beyond","author":"Shi Ensheng","year":"2023","unstructured":"Ensheng Shi, Yanlin Wang, Hongyu Zhang, Lun Du, Shi Han, Dongmei Zhang, and Hongbin Sun. 2023. 
Towards Efficient Fine-tuning of Pre-trained Code Models: An Experimental Study and Beyond. arXiv preprint arXiv:2304.05216 (2023).","journal-title":"arXiv preprint arXiv:2304.05216"},{"key":"e_1_3_1_69_2","doi-asserted-by":"crossref","unstructured":"Yucen Shi Ying Yin Zhengkui Wang David Lo Tao Zhang Xin Xia Yuhai Zhao and Bowen Xu. 2022. How to better utilize code graphs in semantic code search?. In Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 722\u2013733.","DOI":"10.1145\/3540250.3549087"},{"key":"e_1_3_1_70_2","doi-asserted-by":"crossref","unstructured":"Benjamin Steenhoek Hongyang Gao and Wei Le. 2024. Dataflow Analysis-Inspired Deep Learning for Efficient Vulnerability Detection. In Proceedings of the 46th IEEE\/ACM International Conference on Software Engineering. 1\u201313.","DOI":"10.1145\/3597503.3623345"},{"key":"e_1_3_1_71_2","article-title":"CodeArt: Better Code Models by Attention Regularization When Symbols Are Lacking","author":"Su Zian","year":"2024","unstructured":"Zian Su, Xiangzhe Xu, Ziyang Huang, Zhuo Zhang, Yapeng Ye, Jianjun Huang, and Xiangyu Zhang. 2024. CodeArt: Better Code Models by Attention Regularization When Symbols Are Lacking. arXiv preprint arXiv:2402.11842 (2024).","journal-title":"arXiv preprint arXiv:2402.11842"},{"key":"e_1_3_1_72_2","doi-asserted-by":"crossref","unstructured":"Zeyu Sun Qihao Zhu Yingfei Xiong Yican Sun Lili Mou and Lu Zhang. 2020. Treegen: A tree-based transformer architecture for code generation. In Proceedings of the AAAI Conference on Artificial Intelligence Vol. 34. 8984\u20138991.","DOI":"10.1609\/aaai.v34i05.6430"},{"key":"e_1_3_1_73_2","first-page":"667","article-title":"Mining multi-label data","author":"Tsoumakas Grigorios","year":"2010","unstructured":"Grigorios Tsoumakas, Ioannis Katakis, and Ioannis Vlahavas. 2010. Mining multi-label data. Data mining and knowledge discovery handbook (2010), 667\u2013685.","journal-title":"Data mining and knowledge discovery handbook"},{"key":"e_1_3_1_74_2","article-title":"Attention is all you need","volume":"30","author":"Vaswani Ashish","year":"2017","unstructured":"Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems 30 (2017).","journal-title":"Advances in neural information processing systems"},{"key":"e_1_3_1_75_2","doi-asserted-by":"crossref","unstructured":"Deze Wang Zhouyang Jia Shanshan Li Yue Yu Yun Xiong Wei Dong and Xiangke Liao. 2022. Bridging pre-trained models and downstream tasks for source code understanding. In Proceedings of the 44th International Conference on Software Engineering. 287\u2013298.","DOI":"10.1145\/3510003.3510062"},{"key":"e_1_3_1_76_2","doi-asserted-by":"crossref","unstructured":"Hao Wang Wenjie Qu Gilad Katz Wenyu Zhu Zeyu Gao Han Qiu Jianwei Zhuge and Chao Zhang. 2022. Jtrans: Jump-aware transformer for binary code similarity detection. In Proceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis. 1\u201313.","DOI":"10.1145\/3533767.3534367"},{"key":"e_1_3_1_77_2","first-page":"319","volume-title":"Proceedings of the 32nd IEEE\/ACM International Conference on Automated Software Engineering (Urbana-Champaign, IL, USA) (ASE 2017)","author":"Wang Shuai","year":"2017","unstructured":"Shuai Wang and Dinghao Wu. 2017. In-Memory Fuzzing for Binary Code Similarity Analysis. 
In Proceedings of the 32nd IEEE\/ACM International Conference on Automated Software Engineering (Urbana-Champaign, IL, USA) (ASE 2017). IEEE Press, 319\u2013330."},{"key":"e_1_3_1_78_2","doi-asserted-by":"publisher","unstructured":"Yue Wang Weishi Wang Shafiq Joty and Steven C.H. Hoi. 2021. CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics Online and Punta Cana Dominican Republic 8696\u20138708. https:\/\/doi.org\/10.18653\/v1\/2021.emnlp-main.685 10.18653\/v1\/2021.emnlp-main.685","DOI":"10.18653\/v1\/2021.emnlp-main.685"},{"key":"e_1_3_1_79_2","article-title":"How Effective Are Neural Networks for Fixing Security Vulnerabilities","author":"Wu Yi","year":"2023","unstructured":"Yi Wu, Nan Jiang, Hung Viet Pham, Thibaud Lutellier, Jordan Davis, Lin Tan, Petr Babkin, and Sameena Shah. 2023. How Effective Are Neural Networks for Fixing Security Vulnerabilities. arXiv preprint arXiv:2305.18607 (2023).","journal-title":"arXiv preprint arXiv:2305.18607"},{"key":"e_1_3_1_80_2","first-page":"13266","article-title":"Representing long-range context for graph neural networks with global attention","volume":"34","author":"Wu Zhanghao","year":"2021","unstructured":"Zhanghao Wu, Paras Jain, Matthew Wright, Azalia Mirhoseini, Joseph E Gonzalez, and Ion Stoica. 2021. Representing long-range context for graph neural networks with global attention. Advances in Neural Information Processing Systems 34 (2021), 13266\u201313279.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_1_81_2","doi-asserted-by":"crossref","unstructured":"Frank F Xu Uri Alon Graham Neubig and Vincent Josua Hellendoorn. 2022. A systematic evaluation of large language models of code. In Proceedings of the 6th ACM SIGPLAN International Symposium on Machine Programming. 1\u201310.","DOI":"10.1145\/3520312.3534862"},{"key":"e_1_3_1_82_2","first-page":"368","volume-title":"Collaborative Computing: Networking, Applications and Worksharing: 17th EAI International Conference, CollaborateCom 2021, Virtual Event, October 16-18, 2021, Proceedings, Part I 17","author":"Xu Guangquan","year":"2021","unstructured":"Guangquan Xu, Meiqi Feng, Litao Jiao, Jian Liu, Hong-Ning Dai, Ding Wang, Emmanouil Panaousis, and Xi Zheng. 2021. MFF-AMD: multivariate feature fusion for Android malware detection. In Collaborative Computing: Networking, Applications and Worksharing: 17th EAI International Conference, CollaborateCom 2021, Virtual Event, October 16-18, 2021, Proceedings, Part I 17. Springer, 368\u2013385."},{"key":"e_1_3_1_83_2","doi-asserted-by":"publisher","DOI":"10.1145\/3597926.3598121"},{"key":"e_1_3_1_84_2","article-title":"PEM: Representing Binary Program Semantics for Similarity Analysis via a Probabilistic Execution Model","author":"Xu Xiangzhe","year":"2023","unstructured":"Xiangzhe Xu, Zhou Xuan, Shiwei Feng, Siyuan Cheng, Yapeng Ye, Qingkai Shi, Guanhong Tao, Le Yu, Zhuo Zhang, and Xiangyu Zhang. 2023. PEM: Representing Binary Program Semantics for Similarity Analysis via a Probabilistic Execution Model. arXiv preprint arXiv:2308.15449 (2023).","journal-title":"arXiv preprint arXiv:2308.15449"},{"key":"e_1_3_1_85_2","unstructured":"Sheng Yu Yu Qu Xunchao Hu and Heng Yin. 2022. DeepDi: Learning a Relational Graph Convolutional Network Model on Instructions for Fast and Accurate Disassembly. 
In 31st USENIX Security Symposium (USENIX Security 22). 2709\u20132725."},{"key":"e_1_3_1_86_2","first-page":"1","volume-title":"Proceedings of the ACM on Programming Languages","author":"Zhang Zhuo","year":"2019","unstructured":"Zhuo Zhang, Wei You, Guanhong Tao, Guannan Wei, Yonghwi Kwon, and Xiangyu Zhang. 2019. BDA: practical dependence analysis for binary executables by unbiased whole-program path sampling and per-path abstract interpretation. Proceedings of the ACM on Programming Languages 3, OOPSLA (2019), 1\u201331."},{"key":"e_1_3_1_87_2","first-page":"7793","article-title":"Beyond homophily in graph neural networks: Current limitations and effective designs","volume":"33","author":"Zhu Jiong","year":"2020","unstructured":"Jiong Zhu, Yujun Yan, Lingxiao Zhao, Mark Heimann, Leman Akoglu, and Danai Koutra. 2020. Beyond homophily in graph neural networks: Current limitations and effective designs. Advances in neural information processing systems 33 (2020), 7793\u20137804.","journal-title":"Advances in neural information processing systems"},{"key":"e_1_3_1_88_2","article-title":"kTrans: Knowledge-Aware Transformer for Binary Code Embedding","author":"Zhu Wenyu","year":"2023","unstructured":"Wenyu Zhu, Hao Wang, Yuchen Zhou, Jiaming Wang, Zihan Sha, Zeyu Gao, and Chao Zhang. 2023. kTrans: Knowledge-Aware Transformer for Binary Code Embedding. arXiv preprint arXiv:2308.12659 (2023).","journal-title":"arXiv preprint arXiv:2308.12659"}],"container-title":["Proceedings of the ACM on Software Engineering"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3643752","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3643752","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3643752","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,2,4]],"date-time":"2026-02-04T08:08:06Z","timestamp":1770192486000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3643752"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,7,12]]},"references-count":87,"journal-issue":{"issue":"FSE","published-print":{"date-parts":[[2024,7,12]]}},"alternative-id":["10.1145\/3643752"],"URL":"https:\/\/doi.org\/10.1145\/3643752","relation":{},"ISSN":["2994-970X"],"issn-type":[{"value":"2994-970X","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,7,12]]}}}
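Editor's note: the abstract above describes CodeArt's core mechanism only at a high level: program analysis extracts dependence-based contexts for stripped binaries, and an attention mask restricts self-attention to those contexts while the model's own self-attention learns which of the allowed attentions matter most. As a rough illustration of that mechanism only (not the authors' implementation; the edge list, shapes, and function names below are all hypothetical), the following PyTorch sketch builds an allowed-attention mask from token-level dependence edges, takes its bi-directional transitive closure, and applies it as an additive mask in scaled dot-product attention.

    import torch

    def dependence_attention_mask(n_tokens: int, dep_edges) -> torch.Tensor:
        # Allowed-attention mask: True where attention is permitted.
        # Start from self-attention (the diagonal) plus the given
        # dependence edges, made symmetric ("bi-directional").
        allowed = torch.eye(n_tokens, dtype=torch.bool)
        for i, j in dep_edges:
            allowed[i, j] = allowed[j, i] = True
        # Transitive closure by repeated boolean squaring, so a token
        # may attend to anything reachable through the dependence graph.
        while True:
            reachable = allowed | ((allowed.float() @ allowed.float()) > 0)
            if torch.equal(reachable, allowed):
                return allowed
            allowed = reachable

    def masked_self_attention(q, k, v, allowed):
        # Vanilla scaled dot-product attention, except that scores of
        # disallowed pairs are set to -inf before the softmax, so the
        # model can only distribute attention over the allowed contexts.
        scores = (q @ k.transpose(-2, -1)) / (q.size(-1) ** 0.5)
        scores = scores.masked_fill(~allowed, float("-inf"))
        return torch.softmax(scores, dim=-1) @ v

    # Toy usage: 5 tokens with dependence edges 0-1 and 1-3; token 0
    # can then also reach token 3 through the transitive closure.
    mask = dependence_attention_mask(5, [(0, 1), (1, 3)])
    q = k = v = torch.randn(5, 16)
    out = masked_self_attention(q, k, v, mask)

In the paper itself, the masks come from program dependence analysis over 26 million stripped binary functions and are wired into a BERT-style encoder with enhanced tokenization and a new pre-training algorithm; this fragment shows only the masking mechanics.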