DeepResearch Bench II: Diagnosing Deep Research Agents via Rubrics from Expert Report


If you like our project, please give us a star ⭐ on GitHub to follow the latest updates.

✨ News

  • [Feb 24, 2026] 📦 DeepResearch Bench II Dataset Released

  • [Feb 2026] 🌐 Official Website & Leaderboard Released

  • [Jan 2026] 📄 Paper Released on arXiv

  • [Nov 2025] 🎉 DeepResearch Bench II Evaluation Pipeline Released

    • This repo provides the official evaluation pipeline for DeepResearch Bench II, built on Gemini with fine-grained, verifiable rubrics derived from expert-written research reports.
    • It supports multimodal inputs (PDF/DOCX/images/text) and batched rubric-based evaluation for information recall, analysis, and presentation.

For complete experimental results, model comparisons, and ablation studies, please refer to the main paper (paper/main.pdf).


📖 Overview

Three-layer framework: recall, analysis, presentation


DeepResearch Bench II addresses key limitations of existing deep research benchmarks by combining:

  • Real-world, expert-authored research reports as the grounding signal.
  • Fine-grained, fully verifiable rubrics that do not rely on the judge model’s internal domain knowledge.
  • Three core dimensions of deep research quality:
    • 🔍 Information Recall – Can the agent identify, retrieve, and cross-check all key information needed to answer the task?
    • 🧠 Analysis – Can the agent synthesize retrieved information into higher-level conclusions and insights?
    • 📝 Presentation – Can the agent present the information in a structured, readable, and easily verifiable way?

This repository (DeepResearch-Bench-II) contains a lightweight evaluation pipeline that:

  • Takes model-generated research reports (PDF/DOCX/HTML/TXT/images),
  • Uses tasks_and_rubrics.jsonl to load task descriptions and rubrics, and
  • Invokes Gemini to score each rubric item in batches, producing:
    • Per-task, per-dimension rubric scores, and
    • Aggregated CSVs summarizing model performance.

Benchmark Construction

Topic and Task Design

DeepResearch Bench II is built on top of the original DeepResearch Bench topic distribution and task design:

  • We start from real-world user queries and task themes collected in the original benchmark.
  • For each seed task, we search for expert-written review reports addressing similar research questions in:
    • Reputable journals and top conferences,
    • High-quality institutional or governmental reports.

These source reports are:

  • Written by domain experts over weeks or months,
  • Validated by reviewers, editors, and the broader community,
  • Released under CC-BY-4.0 / CC-BY-4.0-NC licenses.

After license filtering and quality screening, we retain 132 expert-authored reports, which become the basis for:

  • Task formulations, and
  • Ground-truth, expert-aligned rubrics.

Topic distribution

Rubric Design from Expert Articles

From each expert article, we construct:

  • One or more deep research tasks that require both information collection and analysis.
  • A set of binary rubrics decomposed across the three dimensions:
    • Information Recall,
    • Analysis,
    • Presentation.

Each rubric is:

  1. Essential – captures information necessary to correctly answer the task.
  2. Atomic – checks a single fact or inference; complex points are split into smaller rubrics.
  3. Content-bearing – encodes the actual answer, not just a vague topic (e.g., “states that X increased from A to B between years Y and Z”).
  4. Numerically precise – numerical rubrics explicitly specify values and tolerated error ranges.
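As a concrete illustration of these four properties, a numerically precise rubric item might look like the sketch below. The field names here are our own illustration, not the official tasks_and_rubrics.jsonl schema:

```python
# Hypothetical rubric item -- field names are illustrative only,
# not the official tasks_and_rubrics.jsonl schema.
rubric = {
    "dimension": "information_recall",
    "statement": "States that X increased from A to B between years Y and Z.",
    "binary": True,             # judged as satisfied / not satisfied
    "numeric_tolerance": 0.05,  # accepted relative error for the values A and B
}
```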

Rubrics are built through a four-stage pipeline:

  1. LLM extraction from expert articles, guided by carefully designed prompts.
  2. Self-evaluation iteration – rejecting hallucinated or inconsistent rubrics using the source article as reference.
  3. Manual revision – human annotators refine wording, remove redundancy, and enforce atomicity.
  4. Expert review & refinement – domain experts ensure that rubrics faithfully represent the article’s core content.

Method overview


Evaluation Framework

DeepResearch Bench II uses LLM-as-judge with verifiable rubrics:

  1. The task + rubric are serialized into a structured JSON prompt.
  2. The model report (PDF/DOCX/image/text) is provided as the passage (possibly as multimodal attachments).
  3. Gemini is prompted to output, for each rubric item:
    • score ∈ {1, 0, -1},
    • reason, and
    • evidence (supporting sentences from the report).

Scoring semantics:

  • 1 – rubric satisfied with valid evidence and no use of blocked references,
  • 0 – rubric not mentioned at all,
  • -1 – rubric mentioned but evidence relies on explicitly blocked references.
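The scoring semantics above translate directly into aggregate statistics. The sketch below shows one plausible convention (our own illustration, not necessarily the official aggregation in aggregate_scores.py): only score == 1 counts as satisfied, and score == -1 is tracked separately as the blocked rate.

```python
def summarize(items):
    """Aggregate judge outputs where each item has score in {1, 0, -1}.

    Illustrative convention: satisfied rate counts only score == 1;
    blocked rate counts score == -1 separately.
    """
    n = len(items)
    satisfied = sum(1 for it in items if it["score"] == 1)
    blocked = sum(1 for it in items if it["score"] == -1)
    return {
        "satisfied_rate": satisfied / n if n else 0.0,
        "blocked_rate": blocked / n if n else 0.0,
    }
```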

The evaluation pipeline in this repo:

  • Handles multimodal inputs:
    • PDFs are uploaded as binary attachments.
    • DOCX files are parsed into text + tables (Markdown) + extracted images.
    • Images (PNG/JPEG/WebP/GIF/BMP/TIFF) are attached as inline data.
    • TXT/MD/HTML are loaded as plain text.
  • Supports batched evaluation:
    • Rubric items are split into batches of size CHUNK_SIZE (default 50).
    • Each batch is evaluated independently; results are merged and re-grouped by dimension.
  • Aggregates token usage statistics:
    • Per batch (usageMetadata),
    • Per file, and
    • Per model across the whole run.
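The batching step above can be sketched in a few lines. Assuming rubric items arrive as a flat list, splitting them into batches of at most CHUNK_SIZE (default 50) looks like:

```python
def chunk(items, size=50):
    """Split rubric items into batches of at most `size` (mirrors CHUNK_SIZE).

    Each batch can then be evaluated independently and the results merged.
    """
    return [items[i:i + size] for i in range(0, len(items), size)]
```

For example, 120 rubric items with the default size yield three batches of 50, 50, and 20 items.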

📊 Evaluation Results

This repository focuses on the evaluation pipeline.
Aggregated scores (per-task, per-dimension, and per-model) can be produced locally via aggregate_scores.py.

For full experimental details, including:

  • Cross-model comparison,
  • Dimension-wise analysis,
  • Error cases and ablations,

please refer to the paper (paper/main.pdf) and any public leaderboard associated with DeepResearch Bench II.


🛠️ Installation

Prerequisites

  • Python 3.9+
  • A Gemini-compatible API endpoint and token

1. Environment configuration (.env)

Create a .env file in the project root DeepResearch-Bench-II to store API configuration and runtime parameters:

cd DeepResearch-Bench-II
touch .env
vim .env  # or use your favorite editor

Required config (replace with your own values):

GEMINI_API_URL=https://your-api-endpoint.com/v1/chat/completions
GEMINI_API_TOKEN=your-api-token
GEMINI_MODEL=gemini-2.5-pro
GEMINI_REQUEST_ID=eval-request-id

PDF_DIR=report
OUT_JSONL=result.jsonl
TASKS_JSONL=tasks_and_rubrics.jsonl
CHUNK_SIZE=50
MAX_WORKERS=10
MAX_RETRIES=5
MAX_PAPER_CHARS=150000
LOG_FILE=run_evaluation.log
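If you want to inspect this configuration outside the pipeline, a minimal .env reader can be written with the standard library alone. This is our own sketch; the repo's scripts may load configuration differently:

```python
import os

def load_env(path=".env"):
    """Minimal .env loader: KEY=VALUE lines; blank lines and '#' comments ignored.

    Uses setdefault so variables already present in the environment win.
    """
    if not os.path.exists(path):
        return
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            os.environ.setdefault(key.strip(), value.strip())
```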

2. Install dependencies (supports uv / conda)

Option A: Use uv (recommended)

The project ships with pyproject.toml, so you can manage the virtual environment and dependencies via uv:

# Install uv (if not installed)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Create/sync virtual environment and install dependencies
cd DeepResearch-Bench-II
uv sync

How to check whether uv is installed correctly

Run any of the following commands in your terminal:

# 1. Check version (recommended)
uv --version

# 2. Check executable path
which uv

# 3. Show help
uv --help

  • If uv --version prints something like uv 0.x.y, it is installed correctly.
  • If you see command not found or similar, uv is not installed or not on your PATH.

Option B: Use conda

# Create and activate a conda environment
conda create -n drbench-II python=3.10 -y
conda activate drbench-II

# Install Python dependencies
cd DeepResearch-Bench-II
pip install requests python-docx

You can then run all commands inside this conda environment.


Project Structure

DeepResearch-Bench-II/
├── assets/                    # Images and figures for README
│   ├── distribution.png
│   ├── intro.png
│   ├── main_result.png
│   └── method.png
├── report/                    # Input directory for model-generated reports
│   └── <model_name>/         # Per-model subdirectories
│       ├── idx-1.pdf         # Model output for task 1
│       ├── idx-2.docx        # Model output for task 2
│       └── ...
├── gemini_client.py           # Gemini API client (handles API calls and multimodal input)
├── run_evaluation.py          # Main evaluation script (batched rubric scoring logic)
├── aggregate_scores.py        # Score aggregation utility (produces CSV summaries)
├── tasks_and_rubrics.jsonl    # Tasks and rubrics (132 expert-derived tasks)
├── pyproject.toml             # Dependency management (uv / pip / conda)
├── .env_example               # Example configuration file
├── .env                       # Local configuration (user-created, ignored by Git)
├── .gitignore                 # Git ignore rules
└── README.md                  # This documentation

Note: Place your model-generated reports under report/<model_name>/idx-*.pdf|docx|html|md|txt|....
The subdirectory name becomes the model identifier in output files.


Quick Start

1. Prepare your model outputs

Organize your model-generated reports under report with the following structure:

report/
├── ModelA/
│   ├── idx-1.pdf
│   ├── idx-2.pdf
│   └── ...
└── ModelB/
    ├── idx-1.pdf
    ├── idx-2.pdf
    └── ...

  • Subdirectory name = model name (used in output JSONL).
  • File name pattern = idx-<task_idx>.<ext> where <ext> can be pdf, docx, html, md, txt, or an image type.
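To sanity-check your directory layout before a run, a small script can enumerate (model, task) pairs from the naming convention above. This is an illustrative helper (our own sketch, not part of the official pipeline):

```python
import re
from pathlib import Path

# Matches idx-<task_idx>.<ext> for the extensions the pipeline accepts.
IDX_RE = re.compile(
    r"^idx-(\d+)\.(pdf|docx|html|md|txt|png|jpe?g|webp|gif|bmp|tiff)$",
    re.IGNORECASE,
)

def discover_reports(root="report"):
    """Map (model_name, task_idx) -> file path, following report/<model>/idx-N.<ext>."""
    reports = {}
    for model_dir in Path(root).iterdir():
        if not model_dir.is_dir():
            continue
        for f in model_dir.iterdir():
            m = IDX_RE.match(f.name)
            if m:
                reports[(model_dir.name, int(m.group(1)))] = f
    return reports
```

Comparing the discovered task indices against those in tasks_and_rubrics.jsonl quickly reveals missing or misnamed files.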

2. Run the evaluator

Run via uv (recommended)

cd DeepResearch-Bench-II
uv run python run_evaluation.py

Run directly with python

cd DeepResearch-Bench-II

# Use configuration from .env
python run_evaluation.py

# Or override configuration via CLI arguments
python run_evaluation.py \
    --pdf_dir grok \
    --out_jsonl result.jsonl \
    --chunk_size 50

This produces a JSONL file where each line has the form:

{"model": "ModelA", "idx": 1, "result": {...}}

3. Aggregate scores

After the evaluator has produced a results JSONL (e.g., result.jsonl; if you evaluated multiple models or runs separately, merge the files into one first), run:

python aggregate_scores.py \
  --input result.jsonl \
  --tasks-file tasks_and_rubrics.jsonl

This will generate multiple CSVs:

  • agg_scores_inforecall.csv
  • agg_scores_analysis.csv
  • agg_scores_presentation.csv
  • agg_scores_total.csv
  • agg_scores_blocked.csv

Each CSV summarizes model performance by task (idx), including:

  • Per-dimension scores,
  • Overall averages,
  • Blocked-rate statistics.

Acknowledgements

DeepResearch Bench II builds on the ideas and infrastructure of DeepResearch Bench and related benchmarks.
We thank all authors and annotators involved in collecting tasks, source articles, and rubrics.


Citation

If you use DeepResearch Bench II or this evaluation pipeline in your research, please cite:

@misc{li2026deepresearchbenchiidiagnosing,
      title={DeepResearch Bench II: Diagnosing Deep Research Agents via Rubrics from Expert Report}, 
      author={Ruizhe Li and Mingxuan Du and Benfeng Xu and Chiwei Zhu and Xiaorui Wang and Zhendong Mao},
      year={2026},
      eprint={2601.08536},
      archivePrefix={arXiv},
      url={https://arxiv.org/abs/2601.08536}, 
}

About

DeepResearch Bench II (DRB2) is the follow-up to DeepResearch Bench, with a stronger focus on measuring the gap between deep research systems and human experts. It does so by decomposing expert-written reports into hierarchical rubrics covering information recall, analysis, and presentation, and using them to evaluate model-generated research reports.
