
ArithmAttack: Evaluating Robustness of LLMs to Noisy Context in Math Problem Solving


Zain Ul Abedin♣ * Shahzeb Qamar♣ * Lucie Flek♣♠ Akbar Karimi♣♠

♣Conversational AI and Social Analytics (CAISA) Lab, University of Bonn, Germany
♠Lamarr Institute for Machine Learning and Artificial Intelligence, Germany
{s28zabed, s57sqama}@uni-bonn.de
{flek, ak}@bit.uni-bonn.de

arXiv:2501.08203v1 [cs.CL] 14 Jan 2025

Abstract

While Large Language Models (LLMs) have shown impressive capabilities in math problem-solving tasks, their robustness to noisy inputs is not well-studied. In this work, we propose ArithmAttack to examine how robust LLMs are when they encounter noisy prompts that contain extra noise in the form of punctuation marks. While being easy to implement, ArithmAttack does not cause any information loss, since no words are added to or deleted from the context. We evaluate the robustness of seven LLMs, including Llama3, Mistral, and Mathstral, on noisy GSM8K and MultiArith datasets. Our experiments suggest that all the studied models show vulnerability to such noise, with more noise leading to poorer performance.¹

[Figure 1: Noisy context breaks the LLM's capability to give the right answer. The clean prompt "Tiffany baked 8 brownies, but needed 17 total for her party. If she used 8 cups of flour on each one, how many cups of flour does she still need?" yields the correct LLM response (72), while the same prompt with inserted punctuation marks ("... needed 17 ! total for ; her party. If she : used 8 cups ...") yields a wrong response (8).]

1 Introduction

As Large Language Models (LLMs) are improving in their ability to accurately process human language, their math problem-solving is also enhancing (Saraf et al., 2024; Agrawal et al., 2024; Wu et al., 2024). However, these sets of questions might require reasoning capabilities to be answered. While LLMs have been shown to have such capabilities to some extent (Imani et al., 2023), their robustness to adversarial inputs remains a challenge. For instance, these models can be vulnerable to the simple replacement of words with their synonyms (Zhou et al., 2024), and even typographical errors can negatively impact their ability to reason (Gan et al., 2024).

In this paper, we further investigate the math problem-solving robustness of LLMs to a different set of changes that take the form of noisy context containing a variety of punctuation marks. The key research question for this study is: How do LLMs respond to noise attacks consisting of random punctuation marks in the context of math problem-solving? Figure 1 shows an example of an LLM response under ArithmAttack, where the model behaves erratically when it sees a noisy context, whereas it answers the question in the clean prompt correctly.

Inspired by the AEDA method (Karimi et al., 2021), which was initially utilized as a data augmentation method, we propose ArithmAttack to assess the robustness of seven LLMs (i.e., two Llama models (Dubey et al., 2024), two Mistral models (Jiang et al., 2023), Zephyr (Tunstall et al., 2023), Gemma2 (Team et al., 2024), and Qwen2.5 (Yang et al., 2024)) to noisy data. Similar to AEDA, we introduce this noise by randomly inserting punctuation marks into the context of math problems from two math datasets, namely GSM8K (Cobbe et al., 2021) and MultiArith (Roy and Roth, 2015). We then evaluate how these models perform under different noise levels, with the noise affecting 10%, 30%, and 50% of the sentence length (based on the number of words).

Our contributions are twofold: 1) We propose ArithmAttack, which produces noisy contexts containing random punctuation marks to assess the robustness of LLMs in math problem-solving. 2) We evaluate seven LLMs, with parameter counts of 1.5B, 2B, 7B, and 8B, on math datasets and observe that all the studied models show growing vulnerability to ArithmAttack as the amount of noise increases.

* Equal contribution
¹ https://github.com/caisa-lab/ArithmAttack

2 Related Works

Language models have been shown to be vulnerable to a variety of changes in the input context, including character altering (Ebrahimi et al., 2018; Pruthi et al., 2019), typographical errors (Gan et al., 2024), word replacement (Zhou et al., 2024), gibberish or irrelevant context inclusion (Cherepanova and Zou, 2024; Shi et al., 2023), and semantic perturbations (Ribeiro et al., 2020; Zhu et al., 2023). Ebrahimi et al. (2018) show that a single character change can make a neural classifier alter its correct prediction. Similarly, Gan et al. (2024) propose an adversarial typo attack that breaks the reasoning process of LLMs. Instead of modifying characters, Zhou et al. (2024) propose a dataset, called RobustMath, where they replace words with their synonyms to evaluate the robustness of large language models. In the study by Zhu et al. (2023), the authors employ different types of textual attacks on prompts, including character, word, sentence, and semantic attacks.

While the literature mainly concentrates on modifying the lexical or semantic content of the prompts, we aim to keep the contextual information intact and instead focus on how model behavior changes when the model encounters punctuation noise. In addition, an advantage of our method is that it is extremely straightforward to implement and, as we show in the results section, it is also effective in degrading the performance of LLMs in math problem-solving.

3 Experiments

To carry out our experiments, we utilize two well-known math datasets and seven large language models.

3.1 Datasets

GSM8K (Cobbe et al., 2021) contains 8.5K high-quality, linguistically diverse grade school math word problems. The test set contains 1.32k data points, on which we run our experiments. This dataset provides a variety of arithmetic and logical questions typical of middle school education, making it ideal for testing the comprehension and problem-solving capabilities of LLMs under noisy conditions.

MultiArith (Roy and Roth, 2015) offers a broad examination of language model performance across multiple arithmetic problem types and complexities. The test set contains 180 data points, on which we run our experiments. It serves as a crucial benchmark for understanding how contextual noise impacts a model's ability to parse and solve mathematical questions.

3.2 Models

To study a variety of language models while observing our computational budget, we opted for seven widely utilized LLMs trained by different companies. These models are Mistral-7B-Instruct-v0.2 (Jiang et al., 2023), Mathstral-7b-v0.1 (Jiang et al., 2023), Llama-3-8B-Instruct and Llama-3.1-8B-Instruct (Dubey et al., 2024), Gemma-2-2b-it (Team et al., 2024), Zephyr-7b-beta (Tunstall et al., 2023), and Qwen2.5-1.5B-Instruct (Yang et al., 2024). Throughout this paper, we refer to these models as Mistral, Mathstral, Llama3, Llama3.1, Gemma2, Zephyr, and Qwen2.5, respectively.
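For concreteness, responses can be collected from such checkpoints with a standard Hugging Face transformers text-generation pipeline. The sketch below is illustrative rather than the authors' actual harness: the generation length and decoding defaults are our assumptions, and the prompt used in the actual experiments is given in Section 4.

```python
from transformers import pipeline

# Illustrative setup (not the paper's actual inference harness).
generator = pipeline(
    "text-generation",
    model="mistralai/Mistral-7B-Instruct-v0.2",  # one of the seven studied checkpoints
    device_map="auto",
)

def ask(question: str) -> str:
    """Return the model's free-form response to a single math question."""
    out = generator(question, max_new_tokens=512, return_full_text=False)
    return out[0]["generated_text"]
```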
4 Methodology

To obtain the responses from the LLMs, we use AutoCoT (Zhang et al., 2023) prompting, using the following prompt:

Prompt 1: "Think step by step through the following problem and clearly show each step of your reasoning. Ensure the final answer is clearly indicated by ending with {The final answer is}."

4.1 Answer Extraction

To evaluate the accuracy of the models, we developed a script to extract answers from the LLM responses. The extraction process underwent multiple iterations, as it needed to accurately extract the answer and compare it with the ground truth. However, even with the final prompt, we observed a couple of inconsistencies in the answer extraction. Therefore, we went through the outputs manually to estimate the miss rate (i.e., the rate at which the correct answer is not extracted). In this manual inspection, we evaluated all responses for the MultiArith dataset and the first 100 responses from the GSM8K dataset. Table 2 shows that the miss rate is minimal for most of the models. In the cases of Mistral (for GSM8K) and Zephyr (for MultiArith), the miss rates can be significant. While this can be an indication of a lower ability to follow instructions in these models, considering the gap between the performance and ASR scores, it does not affect the observed trends.

Table 2: Miss rate (%) of the models in answer extraction.

Model       GSM8K   MultiArith
Mistral     9.0     1.1
Mathstral   0.0     1.1
Llama3      1.0     1.1
Llama3.1    0.0     0.0
Gemma2      3.0     2.2
Zephyr      2.0     12.8
Qwen2.5     1.0     0.5

4.2 Noisy Dataset Creation

Once satisfactory results were achieved with the clean datasets, we proceeded to test the models on noisy data. For the introduction of noise, we follow an approach similar to Karimi et al. (2021), altering the hyper-parameters in their logic. In their study, they insert punctuation marks by randomly choosing a number between 1 and one-third of the length of the sequence, which indicates how many insertions will be carried out. In our case, instead of randomly choosing the number of insertions, we fix it to 10%, 30%, or 50% of the total length of the sentence, but still choose random positions to insert the noise. We employed six types of punctuation marks: {".", ",", "!", "?", ";", ":"}.
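The insertion logic itself fits in a few lines. The following sketch implements the procedure described above (a fixed number of insertions at random word boundaries); the function name and the explicit seeding are our own choices rather than necessarily those of the released implementation.

```python
import random

PUNCTUATION_MARKS = [".", ",", "!", "?", ";", ":"]

def arithm_attack(text: str, noise_ratio: float, seed: int | None = None) -> str:
    """Insert random punctuation marks at random positions between words.

    Unlike AEDA, which samples the number of insertions uniformly between 1
    and a third of the sequence length, the count here is fixed to
    `noise_ratio` (0.1, 0.3, or 0.5) times the number of words.
    """
    rng = random.Random(seed)
    words = text.split()
    num_insertions = int(noise_ratio * len(words))  # based on the original word count
    for _ in range(num_insertions):
        position = rng.randint(0, len(words))  # any gap between words, including the ends
        words.insert(position, rng.choice(PUNCTUATION_MARKS))
    return " ".join(words)

# Example: at the 30% level, a 20-word question receives 6 random marks.
print(arithm_attack("Tiffany baked 8 brownies but needed 17 total for her party.", 0.3, seed=0))
```

Because only punctuation is inserted and no words are changed, the perturbed question carries the same information as the original.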
4.3 ASR and Similarity Calculation

We evaluate the models with their performance accuracy against noisy input and with the Attack Success Rate (ASR). ASR (Wang et al., 2021) measures how effective an adversarial attack is on a model. Specifically, it looks at how often the model's predictions are changed incorrectly after the adversarial attack. In this study, the average ASR is taken for every model over the responses on the 10%, 30%, and 50% noisy datasets, using Formula 1:

$$\mathrm{ASR} = \frac{\sum_{(x,y)\in D} \mathbb{I}\left[f(A(x)) \neq y\right]}{\sum_{(x,y)\in D} \mathbb{I}\left[f(x) = y\right]} \qquad (1)$$

In other words, ASR is the ratio of answers changed after the attack to the answers previously produced correctly by the LLM.

We also calculate the similarity of the perturbed samples to the original ones. Similarity represents the average semantic similarity between two contexts. Given that our method does not alter the words in the sentence, the resulting samples after applying ArithmAttack are scored 100 percent similar to the original samples using the Universal Sentence Encoder (Cer et al., 2018) as the scorer.
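In code, Formula 1 reduces to a ratio of counts over per-example correctness flags. The sketch below reflects our reading of the formula, restricted (as the text states) to examples the model originally answered correctly; the variable names and the dictionary of per-level results are illustrative.

```python
def attack_success_rate(clean_correct: list[bool], noisy_correct: list[bool]) -> float:
    """ASR: among examples answered correctly on the clean prompt, the
    fraction answered incorrectly after the attack (Formula 1)."""
    flipped = sum(1 for c, n in zip(clean_correct, noisy_correct) if c and not n)
    originally_correct = sum(clean_correct)
    return flipped / originally_correct if originally_correct else 0.0

def average_asr(clean_correct: list[bool], noisy_by_level: dict) -> float:
    """Average ASR over the 10%, 30%, and 50% noise levels, as reported."""
    levels = (0.1, 0.3, 0.5)
    return sum(attack_success_rate(clean_correct, noisy_by_level[p]) for p in levels) / len(levels)
```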

5 Results and Analysis

As shown in Tables 1 and 3, Llama3.1 consistently outperforms all other models across both datasets. It achieves the highest accuracies in both clean and noise-affected conditions (except on the 50% noisy data of the MultiArith dataset, where Llama3 has the highest accuracy). This makes it the most reliable model for handling mathematical problems under noisy input conditions. However, in terms of ASR, Llama3 has the lowest score, with Llama3.1 slightly higher, indicating that the Llama models are more robust to noise than the other studied models. In addition, Mathstral exhibits more robustness than Mistral, which can be attributed to its higher mathematical understanding.

In contrast, Zephyr was the lowest-performing model, exhibiting low clean accuracy and suffering a significant decline in performance as noise was introduced. Its high ASR score makes it unsuitable for tasks involving noisy data, reflecting poor robustness.

Table 1: Results for the GSM8K dataset (percent): clean accuracy, accuracy under 10%, 30%, and 50% punctuation noise, and ASR. The performance of all models drops under ArithmAttack. While Llama3.1 has the top performance under all levels of noise, the ASR score shows that Llama3 has the lowest drop from its original performance, making it the most robust model.

Model       Clean    10%      30%      50%      ASR
Mistral     42.07    41.62    37.75    36.39    39.69
Mathstral   77.63    75.51    71.34    70.65    19.81
Llama3      75.43    73.31    73.08    72.17    11.73
Llama3.1    82.25    81.04    78.84    77.02    12.53
Gemma2      49.65    45.10    36.46    35.63    41.82
Zephyr      23.27    18.04    18.04    10.08    74.80
Qwen2.5     61.10    56.02    52.69    49.35    31.59

Table 3: Results for the MultiArith dataset (percent): clean accuracy, accuracy under 10%, 30%, and 50% punctuation noise, and ASR. The performance of all models drops under ArithmAttack. Llama3 has the lowest drop, making it the most robust model.

Model       Clean    10%      30%      50%      ASR
Mistral     73.88    72.77    71.11    65.55    23.66
Mathstral   96.11    92.77    86.11    87.22    9.47
Llama3      95.00    92.77    91.66    88.33    7.79
Llama3.1    99.44    94.44    91.66    83.88    9.67
Gemma2      89.44    82.77    78.88    72.22    19.45
Zephyr      37.22    22.22    16.11    12.77    77.10
Qwen2.5     97.22    94.44    85.55    83.88    11.04

Figure 2 shows the relationship between the models' accuracy and the noise present in the datasets. For both datasets, as the percentage of noise in the data increases, the accuracy decreases. This indicates that these models are not robust against noise in the data. It also points to a future direction for improving these models and making them more robust to noise.

[Figure 2: Accuracy of the studied models at different levels of noise (10%, 30%, and 50% punctuation ratio) for the GSM8K (left) and MultiArith (right) datasets. Llama models show the highest robustness as well as performance.]

Across all models except Zephyr, the impact of noise was more pronounced on the GSM8K dataset than on MultiArith, with a larger drop in accuracy as the noise levels increased (Figure 3). In manual inspection, we found that the GSM8K dataset was more difficult to solve than the MultiArith dataset. This suggests that the models may struggle more with noise in more difficult math datasets.

[Figure 3: Attack success rates of the studied models on the GSM8K and MultiArith datasets (lower is better). Llama models are the least affected under ArithmAttack, hence they are the most robust ones.]
6 Conclusion

We evaluated how well different language models handle mathematical problem-solving tasks in both clean and noisy conditions. Our results indicate that all studied models can be vulnerable to extra noise to varying degrees, with the Llama models being the highest-performing and the most robust among them. In addition, comparing Mathstral and Mistral, two models from the same family, the one with mathematical knowledge exhibited more robustness to noise. Lastly, the findings revealed that more complex datasets, such as GSM8K, can become more difficult to understand as they become noisier. Future research can include datasets beyond GSM8K and MultiArith, which could provide deeper insights into the models' robustness in different scenarios. Further experimentation with different types of noise could also help enhance our understanding of the latent vulnerabilities in LLMs.

References

Vansh Agrawal, Pratham Singla, Amitoj Singh Miglani, Shivank Garg, and Ayush Mangal. 2024. Give me a hint: Can LLMs take a hint to solve math problems? In The 4th Workshop on Mathematical Reasoning and AI at NeurIPS'24.

Daniel Cer, Yinfei Yang, Sheng-yi Kong, Nan Hua, Nicole Limtiaco, Rhomni St John, Noah Constant, Mario Guajardo-Cespedes, Steve Yuan, Chris Tar, et al. 2018. Universal Sentence Encoder for English. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 169–174.

Valeriia Cherepanova and James Zou. 2024. Talking nonsense: Probing large language models' understanding of adversarial gibberish inputs. arXiv preprint arXiv:2404.17120.

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. 2021. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.

Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. 2024. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783.

Javid Ebrahimi, Anyi Rao, Daniel Lowd, and Dejing Dou. 2018. HotFlip: White-box adversarial examples for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 31–36.

Esther Gan, Yiran Zhao, Liying Cheng, Yancan Mao, Anirudh Goyal, Kenji Kawaguchi, Min-Yen Kan, and Michael Shieh. 2024. Reasoning robustness of LLMs to adversarial typographical errors. arXiv preprint arXiv:2411.05345.

Shima Imani, Liang Du, and Harsh Shrivastava. 2023. MathPrompter: Mathematical reasoning using large language models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 5: Industry Track), pages 37–42.

Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. 2023. Mistral 7B. Preprint, arXiv:2310.06825.

Akbar Karimi, Leonardo Rossi, and Andrea Prati. 2021. AEDA: An easier data augmentation technique for text classification. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 2748–2754, Punta Cana, Dominican Republic. Association for Computational Linguistics.

Danish Pruthi, Bhuwan Dhingra, and Zachary C. Lipton. 2019. Combating adversarial misspellings with robust word recognition. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5582–5591.

Marco Tulio Ribeiro, Tongshuang Wu, Carlos Guestrin, and Sameer Singh. 2020. Beyond accuracy: Behavioral testing of NLP models with CheckList. In Annual Meeting of the Association for Computational Linguistics.

Subhro Roy and Dan Roth. 2015. Solving general arithmetic word problems. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1743–1752.

Amrutesh Saraf, Pooja Kamat, Shilpa Gite, Satish Kumar, and Ketan Kotecha. 2024. Towards robust automated math problem solving: A survey of statistical and deep learning approaches. Evolutionary Intelligence, pages 1–38.

Freda Shi, Xinyun Chen, Kanishka Misra, Nathan Scales, David Dohan, Ed H. Chi, Nathanael Schärli, and Denny Zhou. 2023. Large language models can be easily distracted by irrelevant context. In International Conference on Machine Learning, pages 31210–31227. PMLR.

Gemma Team, Morgane Riviere, Shreya Pathak, Pier Giuseppe Sessa, Cassidy Hardin, Surya Bhupatiraju, Léonard Hussenot, Thomas Mesnard, Bobak Shahriari, Alexandre Ramé, et al. 2024. Gemma 2: Improving open language models at a practical size. arXiv preprint arXiv:2408.00118.

Lewis Tunstall, Edward Beeching, Nathan Lambert, Nazneen Rajani, Kashif Rasul, Younes Belkada, Shengyi Huang, Leandro von Werra, Clémentine Fourrier, Nathan Habib, Nathan Sarrazin, Omar Sanseviero, Alexander M. Rush, and Thomas Wolf. 2023. Zephyr: Direct distillation of LM alignment. Preprint, arXiv:2310.16944.

Boxin Wang, Chejian Xu, Shuohang Wang, Zhe Gan, Yu Cheng, Jianfeng Gao, Ahmed Hassan Awadallah, and Bo Li. 2021. Adversarial GLUE: A multi-task benchmark for robustness evaluation of language models. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2).
Yiran Wu, Feiran Jia, Shaokun Zhang, Hangyu Li, Erkang Zhu, Yue Wang, Yin Tat Lee, Richard Peng, Qingyun Wu, and Chi Wang. 2024. MathChat: Converse to tackle challenging math problems with LLM agents. In ICLR 2024 Workshop on Large Language Model (LLM) Agents.

An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, et al. 2024. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115.

Zhuosheng Zhang, Aston Zhang, Mu Li, and Alex Smola. 2023. Automatic chain of thought prompting in large language models. In The Eleventh International Conference on Learning Representations.

Zihao Zhou, Qiufeng Wang, Mingyu Jin, Jie Yao, Jianan Ye, Wei Liu, Wei Wang, Xiaowei Huang, and Kaizhu Huang. 2024. MathAttack: Attacking large language models towards math solving ability. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 19750–19758.

Kaijie Zhu, Jindong Wang, Jiaheng Zhou, Zichen Wang, Hao Chen, Yidong Wang, Linyi Yang, Wei Ye, Yue Zhang, Neil Zhenqiang Gong, et al. 2023. PromptBench: Towards evaluating the robustness of large language models on adversarial prompts. arXiv e-prints, pages arXiv–2306.