chattergpt/HSO-Bench

Assessing Historical Structural Oppression Worldwide via Rule-Guided Prompting of Large Language Models

Sreejato Chatterjee¹, Linh Tran¹, Quoc Duy Nguyen¹, Roni Kirson¹, Drue Hamlin², Harvest Aquino¹, Hanjia Lyu¹, Jiebo Luo¹, Timothy Dye³

¹ University of Rochester

² Rochester Institute of Technology

³ University of Rochester School of Medicine & Dentistry

Accepted for publication in IEEE Big Data 2025: 11th Special Session on Intelligent Data Mining

Introduction

Traditional efforts to measure historical structural oppression struggle with cross-national validity because each country has its own locally specific history of exclusion, colonization, and social status. They have also often relied on structured indices that privilege material resources while overlooking lived, identity-based exclusion.

We introduce a novel framework for oppression measurement that leverages Large Language Models (LLMs) to generate context-sensitive scores of lived historical disadvantage across diverse geopolitical settings. Using unstructured self-identified ethnicity utterances from a multilingual COVID-19 global study, we design rule-guided prompting strategies that encourage models to produce interpretable, theoretically grounded estimations of oppression. We systematically evaluate these strategies across multiple state-of-the-art LLMs.

Our results demonstrate that LLMs, when guided by explicit rules, can capture nuanced forms of identity-based historical oppression within nations. This approach provides a complementary measurement tool that highlights dimensions of systemic exclusion, offering a scalable, cross-cultural lens for understanding how oppression manifests in data-driven research and public health contexts.

Example Usage

This pipeline assigns oppression scores (1-5) and explanations to free-text identity-country pairs using LangChain. It is designed for use in Google Colab, with manual variable configuration.

python ethnicity_assignment_pipeline.py

Before running, update the following lines inside the script:

# Choose LLM provider: "gemini" or "openai"
model_choice = "gemini"

# Choose prompt type: "vanilla", "cot", or "rule-guided"
prompt_mode = "rule-guided"

# Path to input Excel file (each sheet must contain columns: 'identity', 'country')
excel_path = "/content/drive/My Drive/Dye Lab/unmatched_identities.xlsx"
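Before launching a long run, it may help to confirm that the settings and the workbook match what the comments above require. The following is a minimal sketch, not part of the repository: the helper names (check_config, missing_columns, check_workbook) are hypothetical, and it assumes pandas with an Excel engine such as openpyxl is installed.

```python
from typing import Dict, Iterable, List

# Allowed values, taken from the configuration comments in this README.
ALLOWED_MODELS = {"gemini", "openai"}
ALLOWED_PROMPTS = {"vanilla", "cot", "rule-guided"}
REQUIRED_COLUMNS = ("identity", "country")


def check_config(model_choice: str, prompt_mode: str) -> None:
    """Raise ValueError if either setting is not one of the documented options."""
    if model_choice not in ALLOWED_MODELS:
        raise ValueError(f"model_choice must be one of {sorted(ALLOWED_MODELS)}")
    if prompt_mode not in ALLOWED_PROMPTS:
        raise ValueError(f"prompt_mode must be one of {sorted(ALLOWED_PROMPTS)}")


def missing_columns(sheet_columns: Dict[str, Iterable[str]]) -> Dict[str, List[str]]:
    """Return {sheet name: [missing required columns]} for sheets with gaps."""
    problems = {}
    for sheet, columns in sheet_columns.items():
        present = set(columns)
        missing = [c for c in REQUIRED_COLUMNS if c not in present]
        if missing:
            problems[sheet] = missing
    return problems


def check_workbook(excel_path: str) -> Dict[str, List[str]]:
    """Read every sheet of the workbook and report missing required columns."""
    import pandas as pd  # imported lazily so the helpers above need no dependencies

    # sheet_name=None makes read_excel return a dict: sheet name -> DataFrame
    sheets = pd.read_excel(excel_path, sheet_name=None)
    return missing_columns({name: df.columns for name, df in sheets.items()})
```

Calling check_config(model_choice, prompt_mode) and then check_workbook(excel_path) should return an empty dict when every sheet has both columns.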

Note: In Google Colab, you will also need to:

  • Mount Google Drive:
from google.colab import drive
drive.mount('/content/drive')
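  • Add your API keys securely through Colab's Secrets panel (key icon in the left sidebar), naming them "GPT-Key" and "GeminiKey" and enabling notebook access. Code can then read them with userdata.get:
from google.colab import userdata
userdata.get("GPT-Key")    # returns the stored OpenAI key
userdata.get("GeminiKey")  # returns the stored Gemini key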
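The google.colab package is only importable inside Colab, so a script run elsewhere needs another source for the keys. The sketch below is one possible approach, not the repository's code: load_api_key is a hypothetical helper, and the environment variable names are assumptions you can change freely.

```python
import os


def load_api_key(secret_name: str, env_var: str) -> str:
    """Return the key from Colab secrets, or from an environment variable outside Colab."""
    try:
        from google.colab import userdata  # only available inside Colab
    except ImportError:
        # Not running in Colab: fall back to the environment.
        key = os.environ.get(env_var)
        if not key:
            raise RuntimeError(f"Set the {env_var} environment variable")
        return key
    # In Colab: read the secret stored in the Secrets panel.
    return userdata.get(secret_name)


# Example calls; the environment variable names are assumptions:
# openai_key = load_api_key("GPT-Key", "OPENAI_API_KEY")
# gemini_key = load_api_key("GeminiKey", "GOOGLE_API_KEY")
```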

The output will be saved as:

gemini_rule-guided.csv  # (or similar, depending on config)
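Since the pipeline is described as producing scores from 1 to 5, a quick check of the saved file can catch malformed model replies. This is a sketch under stated assumptions: the column name "score" is hypothetical (check the header of your own output file), the helper invalid_score_rows is not part of the repository, and the sample rows are placeholders, not real results.

```python
import csv
import io
from typing import List, TextIO

VALID_SCORES = {"1", "2", "3", "4", "5"}


def invalid_score_rows(handle: TextIO, score_column: str = "score") -> List[int]:
    """Return 1-based data-row numbers whose score is not a whole number from 1 to 5."""
    bad = []
    for row_number, row in enumerate(csv.DictReader(handle), start=1):
        raw = (row.get(score_column) or "").strip()
        # Values such as "4.0" are also flagged; adjust if your file stores floats.
        if raw not in VALID_SCORES:
            bad.append(row_number)
    return bad


# Placeholder data, for illustration only.
sample = io.StringIO(
    "identity,country,score,explanation\n"
    "example identity A,Country X,4,placeholder text\n"
    "example identity B,Country Y,7,placeholder text\n"
    "example identity C,Country Z,,placeholder text\n"
)
print(invalid_score_rows(sample))  # rows 2 and 3 fail the check
```

To check a real run, pass an open file instead of the sample, for example open("gemini_rule-guided.csv", newline="", encoding="utf-8").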

Reproducibility

Figure 2

python fig_2.py

Figure 3 & 4

python divergences.py
python fig_3_4.py

Figure 5

python fig_5.py

Table 1

python tab_1.py

Citation

@inproceedings{chatterjee2025oppression,
  title = {Assessing Historical Structural Oppression Worldwide via Rule-Guided Prompting of Large Language Models},
  author = {Chatterjee, Sreejato and Tran, Linh and Nguyen, Quoc Duy and Kirson, Roni and Hamlin, Drue and Aquino, Harvest and Lyu, Hanjia and Luo, Jiebo and Dye, Timothy},
  booktitle = {Proceedings of the 2025 IEEE International Conference on Big Data (IEEE Big Data)},
  year = {2025},
  publisher = {IEEE},
  address = {Macau, CN},
  url = {https://arxiv.org/abs/2509.15216},
  note = {11th Special Session on Intelligent Data Mining},
  organization = {IEEE}
}
