SC-Arena: A Natural Language Benchmark for Single-Cell Reasoning with Knowledge-Augmented Evaluation

News

[February 2026]: 🎉SC-Arena has been accepted to ICLR 2026!
[February 2026]: Preprint available on arXiv.

SC-Arena is a task-oriented inference and evaluation framework for single-cell benchmarks. It provides a unified runtime for model inference (via pluggable providers) and task evaluation (via pluggable evaluators), both driven by YAML configs and prompt templates.

What This Project Does

Load a dataset and convert each sample into task-specific prompts.
Run batched inference through a selected provider (for example, openai, vllm, or vllm_api).
Evaluate model outputs with task evaluators.
Save prediction outputs and final scores to JSON files.
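
The four steps above can be sketched as a single function. This is only an illustration of the flow, not SC-Arena's actual code: run_pipeline, the toy_* helpers, and the "input"/"prediction" record keys are all hypothetical names chosen for this example.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, Dict, List


def run_pipeline(
    samples: List[Dict],
    build_prompt: Callable[[Dict], str],
    infer_batch: Callable[[List[str]], List[str]],
    evaluate: Callable[[List[str], List[Dict]], Dict],
    out_path: Path,
    score_path: Path,
    batch_size: int = 2,
) -> Dict:
    """Prompts -> batched inference -> evaluation -> JSON files."""
    # 1) Convert each sample into a task-specific prompt.
    prompts = [build_prompt(sample) for sample in samples]

    # 2) Run inference in fixed-size batches.
    predictions: List[str] = []
    for start in range(0, len(prompts), batch_size):
        predictions.extend(infer_batch(prompts[start:start + batch_size]))

    # 3) Score the raw model outputs against the samples.
    scores = evaluate(predictions, samples)

    # 4) Persist predictions (JSONL) and the score summary (JSON).
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with out_path.open("w", encoding="utf-8") as fh:
        for sample, pred in zip(samples, predictions):
            fh.write(json.dumps({"input": sample, "prediction": pred}) + "\n")
    score_path.parent.mkdir(parents=True, exist_ok=True)
    score_path.write_text(json.dumps(scores, indent=2), encoding="utf-8")
    return scores


# Toy stand-ins (hypothetical) so the sketch runs without a real model.
toy_samples = [
    {"genes": "CD3E CD2", "label": "T cell"},
    {"genes": "MS4A1 CD79A", "label": "B cell"},
    {"genes": "LYZ CD14", "label": "monocyte"},
]


def toy_prompt(sample: Dict) -> str:
    return f"Genes: {sample['genes']}. What is the cell type?"


def toy_infer(prompts: List[str]) -> List[str]:
    # A fake "model" that always gives the same answer.
    return ["[Predicted_Cell_Type: T cell]" for _ in prompts]


def toy_eval(preds: List[str], samples: List[Dict]) -> Dict:
    correct = sum(s["label"] in p for p, s in zip(preds, samples))
    return {"correct": correct, "total": len(samples), "accuracy": correct / len(samples)}


workdir = Path(tempfile.mkdtemp())
result = run_pipeline(
    toy_samples, toy_prompt, toy_infer, toy_eval,
    workdir / "outputs" / "pred.jsonl", workdir / "scores" / "score.json",
)
print(result["correct"], result["total"])
```

In the real framework the provider and evaluator are selected by name from the registries rather than passed in as plain functions.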

Repository Layout

SC-Arena/
|-- base.py                       # Abstract base classes: InferenceEngine / EvaluateEngine
|-- registry.py                   # Provider and evaluator registries
|-- providers/                    # Inference backends
|   |-- openai_provider.py
|   |-- qwen3_provider.py
|   |-- vllm_provider.py
|   `-- vllm_api_provider.py
|-- evaluators/                   # Task evaluators
|   |-- cell_type_annotation.py
|   |-- perturbation_prediction.py
|   |-- captioning.py
|   |-- generation.py
|   `-- scienceqa.py
|-- prompts/                      # Prompt templates (.jsonl)
|-- data/                         # Example datasets
|-- configs/                      # Provider configs
|-- scripts/run_inference.py      # Main entry point
`-- requirements.txt

Supported Tasks

Task (`--task`)	Evaluator	Expected Answer Pattern
`celltype`	`CellTypeEvaluator`	`[Predicted_Cell_Type: ...]`
`captioning`	`CaptioningEvaluator`	`[Captioning: ...]`
`generation`	`GenerationEvaluator`	`[Cell_Sentence: ...]`
`perturbation`	`PerturbationEvaluator`	`[Up: ...] [Down: ...] [Cell_Sentence: ...]`
`scienceqa`	`ScienceqaEvaluator`	`[Answer: ...]`
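
Each task expects its answer inside bracketed `[Name: value]` fields. A minimal sketch of how such fields could be pulled out of a model response is shown below; it assumes values never contain a closing bracket, and the SC-Arena evaluators may parse responses differently. extract_fields is a hypothetical helper, not part of the repository.

```python
import re
from typing import Dict

# Matches "[Name: value]" where Name is made of letters, digits, or underscores.
FIELD_RE = re.compile(r"\[(\w+):\s*(.*?)\]", re.DOTALL)


def extract_fields(text: str) -> Dict[str, str]:
    """Return every bracketed field in the text as a name -> value mapping."""
    return {name: value.strip() for name, value in FIELD_RE.findall(text)}


response = (
    "After the knockout, interferon genes rise. "
    "[Up: ISG15, IFIT1] [Down: MYC] [Cell_Sentence: ISG15 IFIT1 ACTB]"
)
fields = extract_fields(response)
print(fields["Up"])
```

A response that lacks the expected field simply yields a mapping without that key, which an evaluator can treat as a wrong or unparseable answer.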

Installation

git clone https://github.com/SUAT-AIRI/SC-ARENA.git
cd SC-ARENA

python -m venv .venv
# Windows PowerShell:
.venv\Scripts\Activate.ps1
# Linux/macOS:
# source .venv/bin/activate

pip install -r requirements.txt

Notes:

The dependency set is large and includes GPU-related packages.
If you only use API-based providers, you may trim dependencies for your environment.

Configuration

Use files in configs/ as templates:

configs/openai_exmaple.yaml
configs/vllm_example.yaml
configs/vllm_api.yaml

Minimal schema:

provider: openai
init_kwargs:
  model_name: "gpt-4o-mini"
  api_key: "${OPENAI_API_KEY}"

gen_kwargs:
  temperature: 0.7
  max_tokens: 1024
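
The api_key value above uses a `${OPENAI_API_KEY}` placeholder. Whether SC-Arena expands such placeholders itself is not stated here, so the following is a sketch under that assumption: it resolves environment variables in an already-parsed config (shown as a plain dict to avoid depending on a YAML library). expand_env is a hypothetical helper.

```python
import os
from typing import Any


def expand_env(value: Any) -> Any:
    """Recursively expand ${VAR} placeholders in strings inside dicts and lists."""
    if isinstance(value, str):
        # os.path.expandvars leaves unknown variables untouched.
        return os.path.expandvars(value)
    if isinstance(value, dict):
        return {key: expand_env(item) for key, item in value.items()}
    if isinstance(value, list):
        return [expand_env(item) for item in value]
    return value


# The same structure as the YAML schema above, after parsing.
config = {
    "provider": "openai",
    "init_kwargs": {"model_name": "gpt-4o-mini", "api_key": "${OPENAI_API_KEY}"},
    "gen_kwargs": {"temperature": 0.7, "max_tokens": 1024},
}

# Placeholder value for the demo only; never hard-code a real key.
os.environ["OPENAI_API_KEY"] = "sk-demo-not-a-real-key"
resolved = expand_env(config)
print(resolved["init_kwargs"]["api_key"])
```

Keeping the key in the environment rather than in the YAML file avoids committing secrets to version control.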

Run Inference

Main command:

python -m scripts.run_inference \
  --config configs/openai_exmaple.yaml \
  --data data/cell_sentences_fixed.jsonl \
  --task celltype \
  --out outputs/celltype/openai_celltype.jsonl \
  --score scores/celltype/openai_celltype.json \
  --baseurl https://api.openai.com/v1 \
  --apikey YOUR_API_KEY \
  --modelname gpt-4o-mini \
  --evaluated_model openai_celltype

Task examples:

# 1) Cell type annotation
python -m scripts.run_inference --config configs/openai_exmaple.yaml --data data/cell_sentences_fixed.jsonl --task celltype --out outputs/celltype/result.jsonl --score scores/celltype/result.json --baseurl https://api.openai.com/v1 --apikey YOUR_API_KEY --modelname gpt-4o-mini --evaluated_model model_a

# 2) Captioning
python -m scripts.run_inference --config configs/openai_exmaple.yaml --data data/cell_sentences_fixed.jsonl --task captioning --out outputs/captioning/result.jsonl --score scores/captioning/result.json --baseurl https://api.openai.com/v1 --apikey YOUR_API_KEY --modelname gpt-4o-mini --evaluated_model model_a

# 3) Generation
python -m scripts.run_inference --config configs/openai_exmaple.yaml --data data/cell_sentences_fixed.jsonl --task generation --out outputs/generation/result.jsonl --score scores/generation/result.json --baseurl https://api.openai.com/v1 --apikey YOUR_API_KEY --modelname gpt-4o-mini --evaluated_model model_a

# 4) Perturbation
python -m scripts.run_inference --config configs/openai_exmaple.yaml --data data/test_perturbation.json --task perturbation --out outputs/perturbation/result.jsonl --score scores/perturbation/result.json --baseurl https://api.openai.com/v1 --apikey YOUR_API_KEY --modelname gpt-4o-mini --evaluated_model model_a

# 5) ScienceQA
python -m scripts.run_inference --config configs/openai_exmaple.yaml --data data/ScientificQA_final.json --task scienceqa --out outputs/scienceqa/result.jsonl --score scores/scienceqa/result.json --baseurl https://api.openai.com/v1 --apikey YOUR_API_KEY --modelname gpt-4o-mini --evaluated_model model_a
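
The five commands above differ only in the task name, the dataset, and the output paths. As a convenience sketch, they can be generated from a small table; the flags and file paths are copied from the examples above, while build_command and TASK_DATA are hypothetical names introduced for this illustration.

```python
import shlex
from typing import List

# Dataset used for each task in the examples above.
TASK_DATA = {
    "celltype": "data/cell_sentences_fixed.jsonl",
    "captioning": "data/cell_sentences_fixed.jsonl",
    "generation": "data/cell_sentences_fixed.jsonl",
    "perturbation": "data/test_perturbation.json",
    "scienceqa": "data/ScientificQA_final.json",
}


def build_command(
    task: str,
    config: str = "configs/openai_exmaple.yaml",
    base_url: str = "https://api.openai.com/v1",
    api_key: str = "YOUR_API_KEY",
    model_name: str = "gpt-4o-mini",
    evaluated_model: str = "model_a",
) -> List[str]:
    """Build the argument list for one run_inference invocation."""
    return [
        "python", "-m", "scripts.run_inference",
        "--config", config,
        "--data", TASK_DATA[task],
        "--task", task,
        "--out", f"outputs/{task}/result.jsonl",
        "--score", f"scores/{task}/result.json",
        "--baseurl", base_url,
        "--apikey", api_key,
        "--modelname", model_name,
        "--evaluated_model", evaluated_model,
    ]


for task in TASK_DATA:
    # shlex.join quotes arguments safely for a POSIX shell.
    print(shlex.join(build_command(task)))
```

The returned list can also be passed directly to subprocess.run from the repository root, which avoids shell quoting issues altogether.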

Outputs

--out: model prediction file (JSONL)
--score: aggregated score summary (JSON), for example:

{
  "task": "celltype",
  "accuracy": 0.82,
  "correct": 82,
  "total": 100
}
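
A quick sanity check on both files can catch truncated runs. The sketch below reads a JSONL prediction file and verifies that accuracy equals correct / total in a score summary shaped like the example above. The "prediction" key in the demo records is an assumption, and other tasks may report different score fields.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, List


def load_jsonl(path: Path) -> List[Dict]:
    """Read one JSON object per non-empty line."""
    with path.open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


def score_is_consistent(score: Dict, tolerance: float = 1e-9) -> bool:
    """Check that accuracy matches correct / total."""
    total = score["total"]
    if total <= 0 or not 0 <= score["correct"] <= total:
        return False
    return abs(score["accuracy"] - score["correct"] / total) <= tolerance


# Demo files written to a temporary directory (contents are illustrative).
workdir = Path(tempfile.mkdtemp())
pred_file = workdir / "openai_celltype.jsonl"
pred_file.write_text(
    '{"prediction": "[Predicted_Cell_Type: T cell]"}\n'
    '{"prediction": "[Predicted_Cell_Type: B cell]"}\n',
    encoding="utf-8",
)
score_file = workdir / "openai_celltype.json"
score_file.write_text(
    json.dumps({"task": "celltype", "accuracy": 0.82, "correct": 82, "total": 100}),
    encoding="utf-8",
)

records = load_jsonl(pred_file)
score = json.loads(score_file.read_text(encoding="utf-8"))
print(len(records), score_is_consistent(score))
```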

Common Pitfalls

The celltype task currently reads its prompt templates from a path hard-coded in the code, prompts/test_prompt.jsonl. If that file is missing, copy prompts/cell_type_annotation.jsonl to prompts/test_prompt.jsonl.
--baseurl, --apikey, and --modelname are required CLI args.
Ensure output directories are writable.
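
The first and third pitfalls can be handled by a small pre-flight step before launching a run. This is a sketch only: preflight is a hypothetical helper, and the demo builds a throwaway directory tree instead of touching a real checkout.

```python
import os
import shutil
import tempfile
from pathlib import Path


def preflight(repo_root: Path, out_file: Path, score_file: Path) -> None:
    """Create the prompt file celltype expects and make output dirs writable."""
    expected = repo_root / "prompts" / "test_prompt.jsonl"
    fallback = repo_root / "prompts" / "cell_type_annotation.jsonl"
    if not expected.exists():
        if not fallback.exists():
            raise FileNotFoundError(f"Neither {expected} nor {fallback} exists")
        shutil.copyfile(fallback, expected)

    for target in (out_file, score_file):
        target.parent.mkdir(parents=True, exist_ok=True)
        if not os.access(target.parent, os.W_OK):
            raise PermissionError(f"Cannot write to {target.parent}")


# Demo on a temporary stand-in for the repository root.
repo = Path(tempfile.mkdtemp())
(repo / "prompts").mkdir()
(repo / "prompts" / "cell_type_annotation.jsonl").write_text(
    '{"template": "demo"}\n', encoding="utf-8"
)
preflight(
    repo,
    repo / "outputs" / "celltype" / "result.jsonl",
    repo / "scores" / "celltype" / "result.json",
)
print((repo / "prompts" / "test_prompt.jsonl").exists())
```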

Extending the Framework

Add a provider

Create a class inheriting InferenceEngine in providers/.
Register it with @register("your_provider").
Add a config file in configs/.
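
The registration pattern can be illustrated with self-contained stand-ins. The InferenceEngine class and register decorator below are simplified placeholders written for this example, not the definitions in base.py and registry.py; in particular, the generate method name and its signature are assumptions, so check the real abstract base class before implementing a provider.

```python
from abc import ABC, abstractmethod
from typing import Callable, Dict, List, Type


class InferenceEngine(ABC):
    """Stand-in for the abstract base class in base.py (interface assumed)."""

    @abstractmethod
    def generate(self, prompts: List[str], **gen_kwargs) -> List[str]:
        ...


# Stand-in for the provider registry in registry.py.
PROVIDERS: Dict[str, Type[InferenceEngine]] = {}


def register(name: str) -> Callable[[Type[InferenceEngine]], Type[InferenceEngine]]:
    """Class decorator that stores a provider class under the given name."""
    def decorator(cls: Type[InferenceEngine]) -> Type[InferenceEngine]:
        if name in PROVIDERS:
            raise KeyError(f"Provider {name!r} is already registered")
        PROVIDERS[name] = cls
        return cls
    return decorator


@register("echo")
class EchoProvider(InferenceEngine):
    """Toy provider that wraps each prompt in an answer tag."""

    def __init__(self, prefix: str = "[Answer: ", suffix: str = "]") -> None:
        self.prefix = prefix
        self.suffix = suffix

    def generate(self, prompts: List[str], **gen_kwargs) -> List[str]:
        return [f"{self.prefix}{prompt}{self.suffix}" for prompt in prompts]


# Look the provider up by name, as a config's "provider" field would.
engine = PROVIDERS["echo"]()
outputs = engine.generate(["A", "B"], temperature=0.0)
print(outputs)
```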

Add an evaluator

Create a class inheriting EvaluateEngine in evaluators/.
Register it with @register_evaluator("your_task").
Add task prompts under prompts/.
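
An evaluator follows the same pattern. In this sketch, EvaluateEngine and register_evaluator are simplified stand-ins for the real definitions, and the evaluate method name, its arguments, and the case-insensitive exact-match scoring are all assumptions made for illustration.

```python
import re
from abc import ABC, abstractmethod
from typing import Callable, Dict, List, Type


class EvaluateEngine(ABC):
    """Stand-in for the abstract base class in base.py (interface assumed)."""

    @abstractmethod
    def evaluate(self, predictions: List[str], references: List[str]) -> Dict:
        ...


# Stand-in for the evaluator registry in registry.py.
EVALUATORS: Dict[str, Type[EvaluateEngine]] = {}


def register_evaluator(name: str) -> Callable[[Type[EvaluateEngine]], Type[EvaluateEngine]]:
    def decorator(cls: Type[EvaluateEngine]) -> Type[EvaluateEngine]:
        EVALUATORS[name] = cls
        return cls
    return decorator


@register_evaluator("your_task")
class TaggedAnswerEvaluator(EvaluateEngine):
    """Scores responses by exact match on the text inside [Answer: ...]."""

    ANSWER_RE = re.compile(r"\[Answer:\s*(.*?)\]", re.DOTALL)

    def evaluate(self, predictions: List[str], references: List[str]) -> Dict:
        correct = 0
        for prediction, reference in zip(predictions, references):
            match = self.ANSWER_RE.search(prediction)
            # A response without the tag counts as incorrect.
            if match and match.group(1).strip().lower() == reference.strip().lower():
                correct += 1
        total = len(references)
        return {
            "accuracy": correct / total if total else 0.0,
            "correct": correct,
            "total": total,
        }


evaluator = EVALUATORS["your_task"]()
result = evaluator.evaluate(
    ["I think [Answer: B]", "no tag here", "[Answer: c]"],
    ["B", "A", "C"],
)
print(result["correct"], result["total"])
```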

License

MIT
