- **[Feb 24, 2026] DeepResearch Bench II Dataset Released**
  - We have released the DeepResearch Bench II Dataset on Hugging Face, containing articles generated by a subset of the models we evaluated.
- **[Feb 2026] Official Website & Leaderboard Released**
  - The official DeepResearch Bench II Website is now live!
  - Check out the Leaderboard to see how SOTA deep research agents compare across 9,430 expert-written rubrics.
  - You are welcome to submit your model! Contact us at [email protected] or [email protected] to join the leaderboard.
- **[Jan 2026] Paper Released on arXiv**
  - Our paper is now available on arXiv (2601.08536).
- **[Nov 2025] DeepResearch Bench II Evaluation Pipeline Released**
  - This repo provides the official evaluation pipeline for DeepResearch Bench II, built on Gemini with fine-grained, verifiable rubrics derived from expert-written research reports.
  - It supports multimodal inputs (PDF/DOCX/images/text) and batched rubric-based evaluation for information recall, analysis, and presentation.
For complete experimental results, model comparisons, and ablation studies, please refer to the main paper (`paper/main.pdf`).
DeepResearch Bench II addresses key limitations of existing deep research benchmarks by combining:
- Real-world, expert-authored research reports as the grounding signal.
- Fine-grained, fully verifiable rubrics that do not rely on the judge model's internal domain knowledge.
- Three core dimensions of deep research quality:
  - **Information Recall**: Can the agent identify, retrieve, and cross-check all key information needed to answer the task?
  - **Analysis**: Can the agent synthesize retrieved information into higher-level conclusions and insights?
  - **Presentation**: Can the agent present the information in a structured, readable, and easily verifiable way?
This repository (DeepResearch-Bench-II) contains a lightweight evaluation pipeline that:
- Takes model-generated research reports (PDF/DOCX/HTML/TXT/images),
- Uses `tasks_and_rubrics.jsonl` to load task descriptions and rubrics, and
- Invokes Gemini to score each rubric item in batches, producing:
- Per-task, per-dimension rubric scores, and
- Aggregated CSVs summarizing model performance.
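As a concrete sketch of what the pipeline consumes, each line of `tasks_and_rubrics.jsonl` pairs one task with its rubric items. The field names below are illustrative assumptions, not the exact schema shipped with the repo:

```python
import json

# One illustrative line of tasks_and_rubrics.jsonl (field names are assumptions,
# not the exact schema shipped with the repo).
line = json.dumps({
    "idx": 1,
    "task": "Survey recent advances in X and quantify trend Y.",
    "rubrics": {
        "information_recall": [{"id": "ir-1", "text": "States that Y grew from A to B."}],
        "analysis": [{"id": "an-1", "text": "Concludes that the growth is driven by Z."}],
        "presentation": [{"id": "pr-1", "text": "Summarizes key figures in a table."}],
    },
})

# Loading mirrors how a JSONL file is read: one JSON object per line.
task = json.loads(line)
n_rubrics = sum(len(v) for v in task["rubrics"].values())
print(task["idx"], n_rubrics)  # 1 3
```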
DeepResearch Bench II is built on top of the original DeepResearch Bench topic distribution and task design:
- We start from real-world user queries and task themes collected in the original benchmark.
- For each seed task, we search for expert-written review reports addressing similar research questions in:
- Reputable journals and top conferences,
- High-quality institutional or governmental reports.
These source reports are:
- Written by domain experts over weeks or months,
- Validated by reviewers, editors, and the broader community,
- Released under CC-BY-4.0 / CC-BY-NC-4.0 licenses.
After license filtering and quality screening, we retain 132 expert-authored reports, which become the basis for:
- Task formulations, and
- Ground-truth, expert-aligned rubrics.
From each expert article, we construct:
- One or more deep research tasks that require both information collection and analysis.
- A set of binary rubrics decomposed across the three dimensions:
- Information Recall,
- Analysis,
- Presentation.
Each rubric is:
- **Essential**: captures information necessary to correctly answer the task.
- **Atomic**: checks a single fact or inference; complex points are split into smaller rubrics.
- **Content-bearing**: encodes the actual answer, not just a vague topic (e.g., "states that X increased from A to B between years Y and Z").
- **Numerically precise**: numerical rubrics explicitly specify values and tolerated error ranges.
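To make these properties concrete, here is a hypothetical numerically precise rubric item as a Python dict, together with a tolerance check. The field names and values are illustrative, not the repo's actual schema:

```python
# A hypothetical atomic, content-bearing, numerically precise rubric item.
rubric = {
    "dimension": "information_recall",
    # Content-bearing: the rubric encodes the answer itself, not just a topic.
    "text": "States that global capacity increased from 94 GW (2012) to 295 GW (2021).",
    # Numerically precise: explicit values plus a tolerated error range.
    "expected_values": {"start": 94, "end": 295, "tolerance_pct": 5},
    # Binary: the judge answers satisfied / not satisfied, nothing in between.
    "binary": True,
}

def within_tolerance(claimed: float, expected: float, tolerance_pct: float) -> bool:
    """Check a reported number against the rubric's tolerated error range."""
    return abs(claimed - expected) <= expected * tolerance_pct / 100

print(within_tolerance(300, 295, 5))  # True: 300 is within 5% of 295
```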
Rubrics are built through a four-stage pipeline:
- LLM extraction from expert articles, guided by carefully designed prompts.
- Self-evaluation iteration β rejecting hallucinated or inconsistent rubrics using the source article as reference.
- Manual revision β human annotators refine wording, remove redundancy, and enforce atomicity.
- Expert review & refinement β domain experts ensure that rubrics faithfully represent the articleβs core content.
DeepResearch Bench II uses LLM-as-judge with verifiable rubrics:
- The task + rubric are serialized into a structured JSON prompt.
- The model report (PDF/DOCX/image/text) is provided as the passage (possibly as multimodal attachments).
- Gemini is prompted to output, for each rubric item:
  - `score` ∈ {1, 0, -1}, `reason`, and `evidence` (supporting sentences from the report).
Scoring semantics:
- `1`: rubric satisfied with valid evidence and no use of blocked references,
- `0`: rubric not mentioned at all,
- `-1`: rubric mentioned but evidence relies on explicitly blocked references.
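Under this three-way scheme, per-dimension scores can be derived by counting outcomes. The sketch below takes the satisfied fraction as the score and the `-1` fraction as the blocked rate; the exact formulas used by `aggregate_scores.py` may differ:

```python
def summarize(scores: list) -> dict:
    """Summarize one dimension's rubric outcomes (1 satisfied, 0 missing, -1 blocked)."""
    n = len(scores)
    return {
        "score": sum(s == 1 for s in scores) / n,         # fraction of satisfied rubrics
        "blocked_rate": sum(s == -1 for s in scores) / n, # fraction relying on blocked refs
    }

print(summarize([1, 1, 0, -1]))  # {'score': 0.5, 'blocked_rate': 0.25}
```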
The evaluation pipeline in this repo:
- Handles multimodal inputs:
- PDFs are uploaded as binary attachments.
- DOCX files are parsed into text + tables (Markdown) + extracted images.
- Images (PNG/JPEG/WebP/GIF/BMP/TIFF) are attached as inline data.
- TXT/MD/HTML are loaded as plain text.
- Supports batched evaluation:
  - Rubric items are split into batches of size `CHUNK_SIZE` (default 50).
  - Each batch is evaluated independently; results are merged and re-grouped by dimension.
- Aggregates token usage statistics:
  - Per batch (`usageMetadata`),
  - Per file, and
  - Per model across the whole run.
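The batching step above amounts to slicing the rubric list into fixed-size chunks. A minimal sketch (the real script additionally merges results and re-groups them by dimension):

```python
CHUNK_SIZE = 50  # matches the default in the .env configuration

def make_batches(items: list, chunk_size: int = CHUNK_SIZE) -> list:
    """Split rubric items into independent evaluation batches."""
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]

batches = make_batches(list(range(120)))
print([len(b) for b in batches])  # [50, 50, 20]
```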
This repository focuses on the evaluation pipeline.
Aggregated scores (per-task, per-dimension, and per-model) can be produced locally via `aggregate_scores.py`.
For full experimental details, including:
- Cross-model comparison,
- Dimension-wise analysis,
- Error cases and ablations,
please refer to the paper (`paper/main.pdf`) and any public leaderboard associated with DeepResearch Bench II.
- Python 3.9+
- A Gemini-compatible API endpoint and token
Create a `.env` file in the project root `DeepResearch-Bench-II` to store API configuration and runtime parameters:

```bash
cd DeepResearch-Bench-II
touch .env
vim .env  # or use your favorite editor
```

Required config (replace with your own values):
```
GEMINI_API_URL=https://your-api-endpoint.com/v1/chat/completions
GEMINI_API_TOKEN=your-api-token
GEMINI_MODEL=gemini-2.5-pro
GEMINI_REQUEST_ID=eval-request-id
PDF_DIR=report
OUT_JSONL=result.jsonl
TASKS_JSONL=tasks_and_rubrics.jsonl
CHUNK_SIZE=50
MAX_WORKERS=10
MAX_RETRIES=5
MAX_PAPER_CHARS=150000
LOG_FILE=run_evaluation.log
```

The project ships with `pyproject.toml`, so you can manage the virtual environment and dependencies via `uv`:
```bash
# Install uv (if not installed)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Create/sync virtual environment and install dependencies
cd DeepResearch-Bench-II
uv sync
```

To verify that `uv` is available, run any of the following commands in your terminal:
```bash
# 1. Check version (recommended)
uv --version

# 2. Check executable path
which uv

# 3. Show help
uv --help
```

- If `uv --version` prints something like `uv 0.x.y`, it is installed correctly.
- If you see `command not found` or similar, `uv` is not installed or not on your `PATH`.
```bash
# Create and activate a conda environment
conda create -n drbench-II python=3.10 -y
conda activate drbench-II

# Install Python dependencies
cd DeepResearch-Bench-II
pip install requests python-docx
```

You can then run all commands inside this conda environment.
```
DeepResearch-Bench-II/
├── assets/                   # Images and figures for README
│   ├── distribution.png
│   ├── intro.png
│   ├── main_result.png
│   └── method.png
├── report/                   # Input directory for model-generated reports
│   └── <model_name>/         # Per-model subdirectories
│       ├── idx-1.pdf         # Model output for task 1
│       ├── idx-2.docx        # Model output for task 2
│       └── ...
├── gemini_client.py          # Gemini API client (handles API calls and multimodal input)
├── run_evaluation.py         # Main evaluation script (batched rubric scoring logic)
├── aggregate_scores.py       # Score aggregation utility (produces CSV summaries)
├── tasks_and_rubrics.jsonl   # Tasks and rubrics (132 expert-derived tasks)
├── pyproject.toml            # Dependency management (uv / pip / conda)
├── .env_example              # Example configuration file
├── .env                      # Local configuration (user-created, ignored by Git)
├── .gitignore                # Git ignore rules
└── README.md                 # This documentation
```

Note: Place your model-generated reports under `report/<model_name>/idx-*.pdf|docx|html|md|txt|...`. The subdirectory name becomes the model identifier in output files.
Organize your model-generated reports under `report/` with the following structure:

```
report/
├── ModelA/
│   ├── idx-1.pdf
│   ├── idx-2.pdf
│   └── ...
└── ModelB/
    ├── idx-1.pdf
    ├── idx-2.pdf
    └── ...
```

- Subdirectory name = model name (used in output JSONL).
- File name pattern = `idx-<task_idx>.<ext>`, where `<ext>` can be `pdf`, `docx`, `html`, `md`, `txt`, or an image type.
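Given this layout, the model name and task index are implicit in each path. The sketch below shows how they can be recovered; the actual parsing logic in `run_evaluation.py` may differ:

```python
import re
from pathlib import Path

def parse_report_path(path: str) -> tuple:
    """Extract (model_name, task_idx) from report/<model_name>/idx-<task_idx>.<ext>."""
    p = Path(path)
    m = re.fullmatch(r"idx-(\d+)\.\w+", p.name)
    if m is None:
        raise ValueError(f"unexpected report file name: {p.name}")
    return p.parent.name, int(m.group(1))

print(parse_report_path("report/ModelA/idx-2.pdf"))  # ('ModelA', 2)
```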
```bash
cd DeepResearch-Bench-II
uv run python run_evaluation.py
```

Or, with an activated environment:

```bash
cd DeepResearch-Bench-II

# Use configuration from .env
python run_evaluation.py

# Or override configuration via CLI arguments
python run_evaluation.py \
  --pdf_dir grok \
  --out_jsonl result.jsonl \
  --chunk_size 50
```

This produces a JSONL file where each line has the form:

```json
{"model": "ModelA", "idx": 1, "result": {...}}
```

After you have a merged JSONL of evaluation results (e.g., `merged.jsonl`), run:
```bash
python aggregate_scores.py \
  --input result.jsonl \
  --tasks-file tasks_and_rubrics.jsonl
```

This will generate multiple CSVs:

- `agg_scores_inforecall.csv`
- `agg_scores_analysis.csv`
- `agg_scores_presentation.csv`
- `agg_scores_total.csv`
- `agg_scores_blocked.csv`
Each CSV summarizes model performance by task (`idx`), including:
- Per-dimension scores,
- Overall averages,
- Blocked-rate statistics.
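If each model was evaluated in a separate run, the per-run JSONL outputs first need to be concatenated into the single merged file expected by the aggregation step. A minimal sketch (the file names in the usage comment are illustrative):

```python
import json
from pathlib import Path

def merge_jsonl(inputs: list, output: str) -> int:
    """Concatenate several result JSONL files, validating each line as JSON."""
    n = 0
    with open(output, "w", encoding="utf-8") as out:
        for src in inputs:
            for line in Path(src).read_text(encoding="utf-8").splitlines():
                line = line.strip()
                if not line:
                    continue
                json.loads(line)  # raises on malformed lines before anything is written
                out.write(line + "\n")
                n += 1
    return n

# Usage (hypothetical file names):
# merge_jsonl(["result_modelA.jsonl", "result_modelB.jsonl"], "merged.jsonl")
```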
DeepResearch Bench II builds on the ideas and infrastructure of DeepResearch Bench and related benchmarks.
We thank all authors and annotators involved in collecting tasks, source articles, and rubrics.
If you use DeepResearch Bench II or this evaluation pipeline in your research, please cite:
```bibtex
@misc{li2026deepresearchbenchiidiagnosing,
      title={DeepResearch Bench II: Diagnosing Deep Research Agents via Rubrics from Expert Report},
      author={Ruizhe Li and Mingxuan Du and Benfeng Xu and Chiwei Zhu and Xiaorui Wang and Zhendong Mao},
      year={2026},
      eprint={2601.08536},
      archivePrefix={arXiv},
      url={https://arxiv.org/abs/2601.08536},
}
```


