【📊Leaderboard】•【🗄️Dataset】•【📄Paper】•【💻Code】
This is the codebase for HardcoreLogic: Challenging Large Reasoning Models with Long-tail Logic Puzzle Games.
We recommend using uv to set up the Python environment. Run `uv sync` in the project's root directory to create an environment matching the provided `uv.lock` file.
We use and recommend vLLM for serving open-source models; it provides the reasoning and structured-output parsers that we rely on.
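For example, a local vLLM server compatible with the config and run script below can be launched like this (the model ID and context length are assumptions; substitute the checkpoint you want to evaluate):

```shell
# Serve a model behind vLLM's OpenAI-compatible API on port 8000.
# Model name is an example only; adjust to your deployment.
vllm serve openai/gpt-oss-120b \
    --port 8000 \
    --max-model-len 32768
```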
Download the dataset and place all directories containing `.parquet` files under the `data` directory. Edit `config/api.json` to point to your model's API endpoints. Use `main.py` to generate outputs from your model and `evaluate.py` to evaluate them.
Example API config (see `src/llm_client` for available API types):

```json
{
  "openai": {
    "type": "openai",
    "addr": null,
    "ports": null,
    "key": "sk-Y0urC1e!",
    "proxy": "http://example.com:12345"
  },
  "local": {
    "type": "openai-vllm",
    "addr": "http://localhost",
    "ports": [8000],
    "key": "",
    "proxy": null
  }
}
```

Example script that evaluates gpt-oss-120b (served on a local vLLM instance at http://localhost:8000) on the ZebraLogic task (see `src/model` for available model types):
```shell
# Generated output is stored in output/zebra/hardcore_gpt-oss.jsonl
python main.py \
    --split hardcore \
    --task zebra \
    --api local \
    --model-type gpt-oss \
    --model gpt-oss-120b \
    --run-name gpt-oss \
    --sample-rep 4 \
    --max-token 32768 \
    --max-comp-token 2048 \
    --temperature 1 \
    --seed 19260817 \
    --api-rep 8

# Evaluate specific runs: python evaluate.py zebra hardcore_gpt-oss [...]
python evaluate.py zebra
```

After running the evaluation script, use `stat/example.ipynb` to collect the results.
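To illustrate the kind of aggregation the notebook performs, here is a self-contained sketch that computes per-run accuracy from JSONL records (the `run` and `correct` field names are hypothetical, not the repo's actual output schema; real files live under `output/<task>/`):

```python
import json
from collections import defaultdict

# Hypothetical evaluation records, inlined for a self-contained example.
lines = [
    '{"run": "hardcore_gpt-oss", "id": 0, "correct": true}',
    '{"run": "hardcore_gpt-oss", "id": 1, "correct": false}',
    '{"run": "hardcore_gpt-oss", "id": 2, "correct": true}',
]

totals = defaultdict(lambda: [0, 0])  # run name -> [num_correct, num_total]
for line in lines:
    rec = json.loads(line)
    totals[rec["run"]][0] += rec["correct"]
    totals[rec["run"]][1] += 1

accuracy = {run: c / n for run, (c, n) in totals.items()}
print(accuracy)  # {'hardcore_gpt-oss': 0.6666666666666666}
```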
We provide the game-generation scripts in `src/prepare` for reference. These scripts do not produce the dataset directly; instead, they generate logic games following the long-tail transformations introduced in HardcoreLogic.