
ArithmAttack: Evaluating Robustness of LLMs to Noisy Context in Math Problem Solving


Zain Ul Abedin♣ * Shahzeb Qamar♣ * Lucie Flek♣♠ Akbar Karimi♣♠

♣Conversational AI and Social Analytics (CAISA) Lab, University of Bonn, Germany
♠Lamarr Institute for Machine Learning and Artificial Intelligence, Germany
{s28zabed, s57sqama}@uni-bonn.de
{flek, ak}@bit.uni-bonn.de

arXiv:2501.08203v1 [cs.CL] 14 Jan 2025

Abstract

While Large Language Models (LLMs) have shown impressive capabilities in math problem-solving tasks, their robustness to noisy inputs is not well-studied. In this work, we propose ArithmAttack to examine how robust LLMs are when they encounter noisy prompts that contain extra noise in the form of punctuation marks. While being easy to implement, ArithmAttack does not cause any information loss, since no words are added to or deleted from the context. We evaluate the robustness of seven LLMs, including Llama3, Mistral, and Mathstral, on noisy GSM8K and MultiArith datasets. Our experiments suggest that all the studied models show vulnerability to such noise, with more noise leading to poorer performance.¹

[Figure 1: Noisy context breaks the LLM's capability to give the right answer. The clean prompt "Tiffany baked 8 brownies, but needed 17 total for her party. If she used 8 cups of flour on each one, how many cups of flour does she still need?" yields the correct LLM response (72), while the same prompt with inserted punctuation marks ("... needed 17 ! total for ; her party. If she : used 8 cups ...") yields a wrong response (8).]

1 Introduction

As Large Language Models (LLMs) are improving in their ability to accurately process human language, their math problem-solving is also enhancing (Saraf et al., 2024; Agrawal et al., 2024; Wu et al., 2024). However, these sets of questions might require reasoning capabilities to be answered. While LLMs have been shown to have such capabilities to some extent (Imani et al., 2023), their robustness to adversarial inputs remains a challenge. For instance, these models can be vulnerable to the simple replacement of words with their synonyms (Zhou et al., 2024), and even typographical errors can negatively impact their ability to reason (Gan et al., 2024).

In this paper, we further investigate the math problem-solving robustness of LLMs to a different set of changes that take the form of noisy context containing a variety of punctuation marks. The key research question for this study is: How do LLMs respond to noise attacks consisting of random punctuation marks in the context of math problem-solving? Figure 1 shows an example of an LLM response under ArithmAttack, where the model behaves erratically when it sees a noisy context, whereas it answers the question in the clean prompt correctly.

Inspired by the AEDA method (Karimi et al., 2021), which was initially utilized as a data augmentation method, we propose ArithmAttack to assess the robustness of seven LLMs (i.e., two Llama models (Dubey et al., 2024), two Mistral models (Jiang et al., 2023), Zephyr (Tunstall et al., 2023), Gemma2 (Team et al., 2024), and Qwen2.5 (Yang et al., 2024)) to noisy data. Similar to AEDA, we introduce this noise by randomly inserting punctuation marks into the context of math problems from two math datasets, namely GSM8K (Cobbe et al., 2021) and MultiArith (Roy and Roth, 2015). We then evaluate how these models perform under different noise levels, with the noise affecting 10%, 30%, and 50% of the sentence length (based on the number of words).

Our contributions are twofold: 1) We propose ArithmAttack, which produces noisy contexts containing random punctuation marks to assess the robustness of LLMs in math problem-solving. 2) We evaluate seven LLMs, with parameter counts of 1.5B, 2B, 7B, and 8B, on math datasets and observe that all the studied models show growing vulnerability to ArithmAttack as the amount of noise increases.

* Equal contribution
¹ https://github.com/caisa-lab/ArithmAttack

2 Related Works

Language models have been shown to be vulnerable to a variety of changes in the input context, including character altering (Ebrahimi et al., 2018; Pruthi et al., 2019), typographical errors (Gan et al., 2024), word replacement (Zhou et al., 2024), gibberish or irrelevant context inclusion (Cherepanova and Zou, 2024; Shi et al., 2023), and semantic perturbations (Ribeiro et al., 2020; Zhu et al., 2023). Ebrahimi et al. (2018) show that a single character change can make a neural classifier alter its correct prediction. Similarly, Gan et al. (2024) propose an adversarial typo attack that breaks the reasoning process of LLMs. Instead of modifying characters, Zhou et al. (2024) propose a dataset, called RobustMath, where they replace words with their synonyms to evaluate the robustness of large language models. In the study by Zhu et al. (2023), the authors employ different types of textual attacks on prompts, including character, word, sentence, and semantic attacks.

While the literature mainly concentrates on modifying the lexical or semantic content of the prompts, we aim to keep the contextual information intact and instead focus on how model behavior changes when the model encounters punctuation noise. In addition, an advantage of our method is that it is extremely straightforward to implement and, as we show in the results section, it is also effective in degrading the performance of LLMs in math problem-solving.

3 Experiments

To carry out our experiments, we utilize two well-known math datasets and seven large language models.

3.1 Datasets

GSM8K (Cobbe et al., 2021) contains 8.5K high-quality, linguistically diverse grade school math word problems. The test set contains 1.32k data points, on which we run our experiments. This dataset provides a variety of arithmetic and logical questions typical of middle school education, making it ideal for testing the comprehension and problem-solving capabilities of LLMs under noisy conditions.

MultiArith (Roy and Roth, 2015) offers a broad examination of language model performance across multiple arithmetic problem types and complexities. The test set contains 180 data points, on which we run our experiments. It serves as a crucial benchmark for understanding how contextual noise impacts a model's ability to parse and solve mathematical questions.

3.2 Models

To study a variety of language models while observing our computational budget, we opted for seven widely utilized LLMs trained by different companies. These models are Mistral-7B-Instruct-v0.2 (Jiang et al., 2023), Mathstral-7b-v0.1 (Jiang et al., 2023), Llama-3-8B-Instruct and Llama-3.1-8B-Instruct (Dubey et al., 2024), Gemma-2-2b-it (Team et al., 2024), Zephyr-7b-beta (Tunstall et al., 2023), and Qwen2.5-1.5B-Instruct (Yang et al., 2024). Throughout this paper, we refer to these models as Mistral, Mathstral, Llama3, Llama3.1, Gemma2, Zephyr, and Qwen2.5, respectively.
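For concreteness, responses can be collected from such checkpoints with a standard Hugging Face transformers text-generation pipeline. The sketch below is illustrative rather than the authors' actual harness: the generation length and decoding defaults are our assumptions, and the prompt used in the actual experiments is given in Section 4.

```python
from transformers import pipeline

# Illustrative setup (not the paper's actual inference harness).
generator = pipeline(
    "text-generation",
    model="mistralai/Mistral-7B-Instruct-v0.2",  # one of the seven studied checkpoints
    device_map="auto",
)

def ask(question: str) -> str:
    """Return the model's free-form response to a single math question."""
    out = generator(question, max_new_tokens=512, return_full_text=False)
    return out[0]["generated_text"]
```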
4 Methodology

To obtain the responses from the LLMs, we use AutoCoT (Zhang et al., 2023) prompting, using the following prompt:

Prompt 1: "Think step by step through the following problem and clearly show each step of your reasoning. Ensure the final answer is clearly indicated by ending with {The final answer is}."

4.1 Answer Extraction

To evaluate the accuracy of the models, we developed a script to extract answers from the LLM responses. The extraction process underwent multiple iterations, as it needed to accurately extract the answer and compare it with the ground truth. However, even with the final prompt, we observed a couple of inconsistencies in the answer extraction. Therefore, we went through the outputs manually to estimate the miss rate (i.e., the rate at which the correct answer is not extracted). In this manual inspection, we evaluated all responses for the MultiArith dataset and the first 100 responses from the GSM8K dataset. Table 2 shows that the miss rate is minimal for most of the models. In the cases of Mistral (for GSM8K) and Zephyr (for MultiArith), the miss rates can be significant. While this can be an indication of a lower ability to follow instructions in these models, considering the gap between the performance and ASR scores, it does not affect the observed trends.

Table 2: Miss rate (%) of the models in answer extraction.

Model       GSM8K   MultiArith
Mistral     9.0     1.1
Mathstral   0.0     1.1
Llama3      1.0     1.1
Llama3.1    0.0     0.0
Gemma2      3.0     2.2
Zephyr      2.0     12.8
Qwen2.5     1.0     0.5

4.2 Noisy Dataset Creation

Once satisfactory results were achieved with the clean datasets, we proceeded to test the models on noisy data. For the introduction of noise, we follow an approach similar to Karimi et al. (2021), altering the hyper-parameters in their logic. In their study, they insert punctuation marks by randomly choosing a number between 1 and one-third of the length of the sequence, which indicates how many insertions will be carried out. In our case, instead of randomly choosing the number of insertions, we fix it to 10%, 30%, or 50% of the total length of the sentence, but still choose random positions to insert the noise. We employed six types of punctuation marks: {".", ",", "!", "?", ";", ":"}.
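The insertion logic itself fits in a few lines. The following sketch implements the procedure described above (a fixed number of insertions at random word boundaries); the function name and the explicit seeding are our own choices rather than necessarily those of the released implementation.

```python
import random

PUNCTUATION_MARKS = [".", ",", "!", "?", ";", ":"]

def arithm_attack(text: str, noise_ratio: float, seed: int | None = None) -> str:
    """Insert random punctuation marks at random positions between words.

    Unlike AEDA, which samples the number of insertions uniformly between 1
    and a third of the sequence length, the count here is fixed to
    `noise_ratio` (0.1, 0.3, or 0.5) times the number of words.
    """
    rng = random.Random(seed)
    words = text.split()
    num_insertions = int(noise_ratio * len(words))  # based on the original word count
    for _ in range(num_insertions):
        position = rng.randint(0, len(words))  # any gap between words, including the ends
        words.insert(position, rng.choice(PUNCTUATION_MARKS))
    return " ".join(words)

# Example: at the 30% level, a 20-word question receives 6 random marks.
print(arithm_attack("Tiffany baked 8 brownies but needed 17 total for her party.", 0.3, seed=0))
```

Because only punctuation is inserted and no words are changed, the perturbed question carries the same information as the original.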
4.3 ASR and Similarity Calculation

We evaluate the models with their performance accuracy against noisy input and with the Attack Success Rate (ASR). ASR (Wang et al., 2021) measures how effective an adversarial attack is on a model. Specifically, it looks at how often the model's predictions are changed incorrectly after the adversarial attack. In this study, the average ASR is taken for every model over the responses on the 10%, 30%, and 50% noisy datasets, using Formula 1:

$$\mathrm{ASR} = \frac{\sum_{(x,y)\in D} \mathbb{I}\left[f(A(x)) \neq y\right]}{\sum_{(x,y)\in D} \mathbb{I}\left[f(x) = y\right]} \qquad (1)$$

In other words, ASR is the ratio of answers changed after the attack to the answers previously produced correctly by the LLM.

We also calculate the similarity of the perturbed samples to the original ones. Similarity represents the average semantic similarity between two contexts. Given that our method does not alter the words in the sentence, the resulting samples after applying ArithmAttack are scored 100 percent similar to the original samples using the Universal Sentence Encoder (Cer et al., 2018) as the scorer.
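In code, Formula 1 reduces to a ratio of counts over per-example correctness flags. The sketch below reflects our reading of the formula, restricted (as the text states) to examples the model originally answered correctly; the variable names and the dictionary of per-level results are illustrative.

```python
def attack_success_rate(clean_correct: list[bool], noisy_correct: list[bool]) -> float:
    """ASR: among examples answered correctly on the clean prompt, the
    fraction answered incorrectly after the attack (Formula 1)."""
    flipped = sum(1 for c, n in zip(clean_correct, noisy_correct) if c and not n)
    originally_correct = sum(clean_correct)
    return flipped / originally_correct if originally_correct else 0.0

def average_asr(clean_correct: list[bool], noisy_by_level: dict) -> float:
    """Average ASR over the 10%, 30%, and 50% noise levels, as reported."""
    levels = (0.1, 0.3, 0.5)
    return sum(attack_success_rate(clean_correct, noisy_by_level[p]) for p in levels) / len(levels)
```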

5 Results and Analysis

As shown in Tables 1 and 3, Llama3.1 consistently outperforms all other models across both datasets. It achieves the highest accuracies in both clean and noise-affected conditions (except on the 50% noisy data of the MultiArith dataset, where Llama3 has the highest accuracy). This makes it the most reliable model for handling mathematical problems under noisy input conditions. However, in terms of ASR, Llama3 has the lowest score, with Llama3.1 slightly higher, indicating that the Llama models are more robust to noise than the other studied models. In addition, Mathstral exhibits more robustness than Mistral, which can be attributed to its higher mathematical understanding.

In contrast, Zephyr was the lowest-performing model, exhibiting low clean accuracy and suffering a significant decline in performance as noise was introduced. Its high ASR score makes it unsuitable for tasks involving noisy data, reflecting poor robustness.

Table 1: Results for the GSM8K dataset (percent): clean accuracy, accuracy under 10%, 30%, and 50% punctuation noise, and ASR. The performance of all models drops under ArithmAttack. While Llama3.1 has the top performance under all levels of noise, the ASR score shows that Llama3 has the lowest drop from its original performance, making it the most robust model.

Model       Clean    10%      30%      50%      ASR
Mistral     42.07    41.62    37.75    36.39    39.69
Mathstral   77.63    75.51    71.34    70.65    19.81
Llama3      75.43    73.31    73.08    72.17    11.73
Llama3.1    82.25    81.04    78.84    77.02    12.53
Gemma2      49.65    45.10    36.46    35.63    41.82
Zephyr      23.27    18.04    18.04    10.08    74.80
Qwen2.5     61.10    56.02    52.69    49.35    31.59

Table 3: Results for the MultiArith dataset (percent): clean accuracy, accuracy under 10%, 30%, and 50% punctuation noise, and ASR. The performance of all models drops under ArithmAttack. Llama3 has the lowest drop, making it the most robust model.

Model       Clean    10%      30%      50%      ASR
Mistral     73.88    72.77    71.11    65.55    23.66
Mathstral   96.11    92.77    86.11    87.22    9.47
Llama3      95.00    92.77    91.66    88.33    7.79
Llama3.1    99.44    94.44    91.66    83.88    9.67
Gemma2      89.44    82.77    78.88    72.22    19.45
Zephyr      37.22    22.22    16.11    12.77    77.10
Qwen2.5     97.22    94.44    85.55    83.88    11.04

Figure 2 shows the relationship between the models' accuracy and the noise present in the datasets. For both datasets, as the percentage of noise in the data increases, the accuracy decreases. This indicates that these models are not robust against noise in the data. It also points to a future direction for improving these models and making them more robust to noise.

[Figure 2: Accuracy of the studied models at different levels of noise (10%, 30%, and 50% punctuation ratio) for the GSM8K (left) and MultiArith (right) datasets. Llama models show the highest robustness as well as performance.]

Across all models except Zephyr, the impact of noise was more pronounced on the GSM8K dataset than on MultiArith, with a larger drop in accuracy as the noise levels increased (Figure 3). In manual inspection, we found that the GSM8K dataset was more difficult to solve than the MultiArith dataset. This suggests that the models may struggle more with noise in more difficult math datasets.

[Figure 3: Attack success rates of the studied models on the GSM8K and MultiArith datasets (lower is better). Llama models are the least affected under ArithmAttack, hence they are the most robust ones.]
6 Conclusion

We evaluated how well different language models handle mathematical problem-solving tasks in both clean and noisy conditions. Our results indicate that all studied models can be vulnerable to extra noise to varying degrees, with the Llama models being the highest-performing and the most robust among them. In addition, comparing Mathstral and Mistral, two models from the same family, the one with mathematical knowledge exhibited more robustness to noise. Lastly, the findings revealed that more complex datasets, such as GSM8K, can become more difficult to understand as they become noisier. Future research can include datasets beyond GSM8K and MultiArith, which could provide deeper insights into the models' robustness in different scenarios. Further experimentation with different types of noise could also help enhance our understanding of the latent vulnerabilities in LLMs.

References

Vansh Agrawal, Pratham Singla, Amitoj Singh Miglani, Shivank Garg, and Ayush Mangal. 2024. Give me a hint: Can LLMs take a hint to solve math problems? In The 4th Workshop on Mathematical Reasoning and AI at NeurIPS'24.

Daniel Cer, Yinfei Yang, Sheng-yi Kong, Nan Hua, Nicole Limtiaco, Rhomni St John, Noah Constant, Mario Guajardo-Cespedes, Steve Yuan, Chris Tar, et al. 2018. Universal Sentence Encoder for English. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 169–174.

Valeriia Cherepanova and James Zou. 2024. Talking nonsense: Probing large language models' understanding of adversarial gibberish inputs. arXiv preprint arXiv:2404.17120.

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. 2021. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.

Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. 2024. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783.

Javid Ebrahimi, Anyi Rao, Daniel Lowd, and Dejing Dou. 2018. HotFlip: White-box adversarial examples for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 31–36.

Esther Gan, Yiran Zhao, Liying Cheng, Yancan Mao, Anirudh Goyal, Kenji Kawaguchi, Min-Yen Kan, and Michael Shieh. 2024. Reasoning robustness of LLMs to adversarial typographical errors. arXiv preprint arXiv:2411.05345.

Shima Imani, Liang Du, and Harsh Shrivastava. 2023. MathPrompter: Mathematical reasoning using large language models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 5: Industry Track), pages 37–42.

Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. 2023. Mistral 7B. Preprint, arXiv:2310.06825.

Akbar Karimi, Leonardo Rossi, and Andrea Prati. 2021. AEDA: An easier data augmentation technique for text classification. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 2748–2754, Punta Cana, Dominican Republic. Association for Computational Linguistics.

Danish Pruthi, Bhuwan Dhingra, and Zachary C. Lipton. 2019. Combating adversarial misspellings with robust word recognition. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5582–5591.

Marco Tulio Ribeiro, Tongshuang Wu, Carlos Guestrin, and Sameer Singh. 2020. Beyond accuracy: Behavioral testing of NLP models with CheckList. In Annual Meeting of the Association for Computational Linguistics.

Subhro Roy and Dan Roth. 2015. Solving general arithmetic word problems. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1743–1752.

Amrutesh Saraf, Pooja Kamat, Shilpa Gite, Satish Kumar, and Ketan Kotecha. 2024. Towards robust automated math problem solving: A survey of statistical and deep learning approaches. Evolutionary Intelligence, pages 1–38.

Freda Shi, Xinyun Chen, Kanishka Misra, Nathan Scales, David Dohan, Ed H. Chi, Nathanael Schärli, and Denny Zhou. 2023. Large language models can be easily distracted by irrelevant context. In International Conference on Machine Learning, pages 31210–31227. PMLR.

Gemma Team, Morgane Riviere, Shreya Pathak, Pier Giuseppe Sessa, Cassidy Hardin, Surya Bhupatiraju, Léonard Hussenot, Thomas Mesnard, Bobak Shahriari, Alexandre Ramé, et al. 2024. Gemma 2: Improving open language models at a practical size. arXiv preprint arXiv:2408.00118.

Lewis Tunstall, Edward Beeching, Nathan Lambert, Nazneen Rajani, Kashif Rasul, Younes Belkada, Shengyi Huang, Leandro von Werra, Clémentine Fourrier, Nathan Habib, Nathan Sarrazin, Omar Sanseviero, Alexander M. Rush, and Thomas Wolf. 2023. Zephyr: Direct distillation of LM alignment. Preprint, arXiv:2310.16944.

Boxin Wang, Chejian Xu, Shuohang Wang, Zhe Gan, Yu Cheng, Jianfeng Gao, Ahmed Hassan Awadallah, and Bo Li. 2021. Adversarial GLUE: A multi-task benchmark for robustness evaluation of language models. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2).
Yiran Wu, Feiran Jia, Shaokun Zhang, Hangyu Li, Erkang Zhu, Yue Wang, Yin Tat Lee, Richard Peng, Qingyun Wu, and Chi Wang. 2024. MathChat: Converse to tackle challenging math problems with LLM agents. In ICLR 2024 Workshop on Large Language Model (LLM) Agents.

An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, et al. 2024. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115.

Zhuosheng Zhang, Aston Zhang, Mu Li, and Alex Smola. 2023. Automatic chain of thought prompting in large language models. In The Eleventh International Conference on Learning Representations.

Zihao Zhou, Qiufeng Wang, Mingyu Jin, Jie Yao, Jianan Ye, Wei Liu, Wei Wang, Xiaowei Huang, and Kaizhu Huang. 2024. MathAttack: Attacking large language models towards math solving ability. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 19750–19758.

Kaijie Zhu, Jindong Wang, Jiaheng Zhou, Zichen Wang, Hao Chen, Yidong Wang, Linyi Yang, Wei Ye, Yue Zhang, Neil Zhenqiang Gong, et al. 2023. PromptBench: Towards evaluating the robustness of large language models on adversarial prompts. arXiv e-prints, pages arXiv–2306.