
Can Language Models Perform Robust Reasoning in Chain-of-thought Prompting with Noisy Rationales?

Figure: Exemplars of noisy questions and noisy rationales (our new research problem).

Abstract

This paper investigates an under-explored challenge in large language models (LLMs): chain-of-thought prompting with noisy rationales, i.e., irrelevant or inaccurate reasoning thoughts within the examples used for in-context learning. We construct the NoRa dataset, which is tailored to evaluate the robustness of reasoning in the presence of noisy rationales. Our findings on the NoRa dataset reveal a prevalent vulnerability to such noise among current LLMs, with existing robust methods like self-correction and self-consistency showing limited efficacy. Notably, compared to prompting with clean rationales, GPT-3.5 drops by 1.4%-19.8% in accuracy with irrelevant thoughts and, more drastically, by 2.2%-40.4% with inaccurate thoughts.

Addressing this challenge necessitates external supervision that should be accessible in practice. Here, we propose the method of contrastive denoising with noisy chain-of-thought (CD-CoT). It enhances LLMs’ denoising-reasoning capabilities by contrasting noisy rationales with only one clean rationale, which can be the minimal requirement for denoising-purpose prompting. This method follows a principle of exploration and exploitation: (1) rephrasing and selecting rationales in the input space to achieve explicit denoising and (2) exploring diverse reasoning paths and voting on answers in the output space. Empirically, CD-CoT demonstrates an average improvement of 17.8% in accuracy over the base model and shows significantly stronger denoising capabilities than baseline methods.
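
Below is a minimal sketch of this exploration-and-exploitation procedure, intended only as an illustration of the two steps described above; the function and helper names are hypothetical, and the actual implementation lives in method/.

from collections import Counter

def cd_cot(rephrase, reason, noisy_demos, clean_demo, question, m=4, n=8):
    # rephrase(...) and reason(...) stand in for caller-supplied LLM calls.
    # (1) Explore in the input space: rephrase each noisy rationale by
    #     contrasting it with the single clean rationale (explicit denoising),
    #     then keep m candidates (the paper selects; we simply truncate here).
    candidates = [rephrase(demo, clean_demo) for demo in noisy_demos]
    demos = candidates[:m]
    # (2) Exploit in the output space: sample n diverse reasoning paths with
    #     the denoised demonstrations and vote on the final answer.
    answers = [reason(demos, question) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]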

[Update] The NoRa dataset can be found on Hugging Face.

Getting Started

1. Install Dependencies

Choose one of the following installation commands based on your OpenAI API version:

## For the new OpenAI API (openai >= 1.0):
pip install openai requests pandas nltk pyyaml scikit-learn tiktoken python-dotenv

## For the legacy OpenAI API:
pip install openai==0.28 requests pandas nltk pyyaml scikit-learn tiktoken python-dotenv

2. Configure OpenAI API

You can set up your OpenAI API credentials using either environment variables or a configuration file:

Option A: Using Environment Variables

export OPENAI_API_KEY=[YOUR_API_KEY_HERE]
# If you have multiple API keys:
export OPENAI_API_KEY=[key1]:[key2]:[key3]...
# If you require a specific API base:
export OPENAI_API_BASE=[YOUR_API_BASE_HERE]

Option B: Using Configuration File

Alternatively, create a .env file in the project root directory:

OPENAI_API_KEY=[YOUR_API_KEY_HERE]
# If you have multiple keys:
OPENAI_API_KEY=[key1]:[key2]:[key3]...
# If you require a specific API base:
OPENAI_API_BASE=[YOUR_API_BASE_HERE]
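
For illustration only, the colon-separated keys can be read back with python-dotenv as follows; this is a sketch of the convention above, not the repository's own loading code.

import os
from dotenv import load_dotenv

load_dotenv()  # picks up OPENAI_API_KEY / OPENAI_API_BASE from a .env file if present
api_keys = os.environ["OPENAI_API_KEY"].split(":")  # one or more keys, colon-separated
api_base = os.environ.get("OPENAI_API_BASE")        # optional custom endpoint
print(f"Loaded {len(api_keys)} API key(s); base = {api_base or 'default'}")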

Project Structure

The repository is organized as follows:

  • data/: Raw datasets and preprocessed datasets
    • Original datasets used for generation
    • Preprocessed datasets ready for experiments
  • data_process/: Libraries and utilities for dataset processing and manipulation
  • method/: Implementations of the different noise-handling methods
    • Various approaches for handling noisy rationales in chain-of-thought prompting
  • llm_model/: Interfaces for different large language models
    • Wrappers and utilities for interacting with various LLMs
  • noise_test.py: Main experiment script for testing noisy-rationale handling
  • config.yml: Configuration file for experiment settings
    • Model parameters
    • Dataset options
    • Testing configurations

Run Experiments

NoRa supports three task categories with corresponding subtasks:

  • Math: base-9, base-11
  • Symbolic: equal, longer
  • Commonsense: (no subtasks)

Running Options

Option 1: Using Config File (Recommended)

Configure experiment settings in config.yml and run:

python noise_test.py

For detailed configuration parameters, please refer to config/introduction.md.

Option 2: Command-Line Arguments (quick start)

# Zero-shot testing
python noise_test.py -task [task]_[subtask]_zeroshot -method [basemodel|CD-CoT] -model gpt-3.5-turbo-0125 -test_num [test_num]
# Clean ICL testing
python noise_test.py -task [task]_[subtask]_clean -method [basemodel|CD-CoT] -model gpt-3.5-turbo-0125 -test_num [test_num]
# Noisy ICL testing
python noise_test.py -task [task]_[subtask]_[irrelevant|inaccurate]_[easy|medium|hard] -method [basemodel|CD-CoT] -model gpt-3.5-turbo-0125 -test_num [test_num]

Available arguments:

  • task_subtask: math_base-9, math_base-11, symbolic_equal, symbolic_longer, commonsense
  • method: basemodel, CD-CoT, smoothllm, selfdenoise, selfpolish, contrastivecot, ISC, SCO, BT
  • noise-type: irrelevant, inaccurate
  • difficulty: easy, medium, hard
  • model: e.g., gpt-3.5-turbo-0125
  • test_num: e.g., 100

Examples:

# Clean data testing
python noise_test.py -task math_base-9_clean -method basemodel

# Noisy data testing with different configurations
python noise_test.py -task math_base-9_inaccurate_easy -method basemodel
python noise_test.py -task symbolic_longer_irrelevant_easy -method CD-CoT
python noise_test.py -task commonsense_inaccurate_hard -method contrastivecot

Results

Results are written to ./results/{task}/{subtask}/{model}/{method}/.

The log file is named log_[ICL_|][n_clean_shots]clean_[noise_[n_noisy_shots][inaccurate|irrelevant]_[fixed|random]_ratio[ratio]|origin]_case[cases_num]_temp[temperature]_n[reasoning_times].json
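
As a small usage sketch (the path values below are hypothetical, and the internal structure of each log is defined by noise_test.py):

import glob, json

pattern = "./results/math/base-9/gpt-3.5-turbo-0125/basemodel/*.json"  # adjust to your run
for path in glob.glob(pattern):
    with open(path) as f:
        log = json.load(f)  # inspect the log contents for per-case records
    print(path)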

Citation

If you find our work helpful, please kindly cite our paper:

@inproceedings{zhou2024can,
  title={Can Language Models Perform Robust Reasoning in Chain-of-thought Prompting with Noisy Rationales?},
  author={Zhou, Zhanke and Tao, Rong and Zhu, Jianing and Luo, Yiwen and Wang, Zengmao and Han, Bo},
  booktitle={The Thirty-eighth Annual Conference on Neural Information Processing Systems (NeurIPS 2024)},
  year={2024},
  url={https://openreview.net/pdf?id=FbuODM02ra}
}
