- **[Feb 24, 2026] DeepResearch Bench II Dataset Released**
  - We have released the DeepResearch Bench II Dataset on Hugging Face, containing articles generated by a subset of the models we evaluated.
- **[Feb 2026] Official Website & Leaderboard Released**
  - The official DeepResearch Bench II Website is now live!
  - Check out the Leaderboard to see how SOTA deep research agents compare across 9,430 expert-written rubrics.
  - You are welcome to submit your model! Contact us at [email protected] or [email protected] to join the leaderboard.
- **[Jan 2026] Paper Released on arXiv**
  - Our paper is now available on arXiv (2601.08536).
- **[Nov 2025] DeepResearch Bench II Evaluation Pipeline Released**
  - This repo provides the official evaluation pipeline for DeepResearch Bench II, built on Gemini with fine-grained, verifiable rubrics derived from expert-written research reports.
  - It supports multimodal inputs (PDF/DOCX/images/text) and batched rubric-based evaluation for information recall, analysis, and presentation.
For complete experimental results, model comparisons, and ablation studies, please refer to the main paper (`paper/main.pdf`).
DeepResearch Bench II addresses key limitations of existing deep research benchmarks by combining:
- Real-world, expert-authored research reports as the grounding signal.
- Fine-grained, fully verifiable rubrics that do not rely on the judge model's internal domain knowledge.
- Three core dimensions of deep research quality:
  - **Information Recall**: Can the agent identify, retrieve, and cross-check all key information needed to answer the task?
  - **Analysis**: Can the agent synthesize retrieved information into higher-level conclusions and insights?
  - **Presentation**: Can the agent present the information in a structured, readable, and easily verifiable way?
This repository (DeepResearch-Bench-II) contains a lightweight evaluation pipeline that:
- Takes model-generated research reports (PDF/DOCX/HTML/TXT/images),
- Uses `tasks_and_rubrics.jsonl` to load task descriptions and rubrics, and
- Invokes Gemini to score each rubric item in batches, producing:
- Per-task, per-dimension rubric scores, and
- Aggregated CSVs summarizing model performance.
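As a concrete sketch of what the pipeline consumes, each line of `tasks_and_rubrics.jsonl` pairs one task with its rubric items. The field names below are illustrative assumptions, not the exact schema shipped with the repo:

```python
import json

# One illustrative line of tasks_and_rubrics.jsonl (field names are assumptions,
# not the exact schema shipped with the repo).
line = json.dumps({
    "idx": 1,
    "task": "Survey recent advances in X and quantify trend Y.",
    "rubrics": {
        "information_recall": [{"id": "ir-1", "text": "States that Y grew from A to B."}],
        "analysis": [{"id": "an-1", "text": "Concludes that the growth is driven by Z."}],
        "presentation": [{"id": "pr-1", "text": "Summarizes key figures in a table."}],
    },
})

# Loading mirrors how a JSONL file is read: one JSON object per line.
task = json.loads(line)
n_rubrics = sum(len(v) for v in task["rubrics"].values())
print(task["idx"], n_rubrics)  # 1 3
```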
DeepResearch Bench II is built on top of the original DeepResearch Bench topic distribution and task design:
- We start from real-world user queries and task themes collected in the original benchmark.
- For each seed task, we search for expert-written review reports addressing similar research questions in:
- Reputable journals and top conferences,
- High-quality institutional or governmental reports.
These source reports are:
- Written by domain experts over weeks or months,
- Validated by reviewers, editors, and the broader community,
- Released under CC-BY-4.0 / CC-BY-NC-4.0 licenses.
After license filtering and quality screening, we retain 132 expert-authored reports, which become the basis for:
- Task formulations, and
- Ground-truth, expert-aligned rubrics.
From each expert article, we construct:
- One or more deep research tasks that require both information collection and analysis.
- A set of binary rubrics decomposed across the three dimensions:
- Information Recall,
- Analysis,
- Presentation.
Each rubric is:
- **Essential**: captures information necessary to correctly answer the task.
- **Atomic**: checks a single fact or inference; complex points are split into smaller rubrics.
- **Content-bearing**: encodes the actual answer, not just a vague topic (e.g., "states that X increased from A to B between years Y and Z").
- **Numerically precise**: numerical rubrics explicitly specify values and tolerated error ranges.
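To make these properties concrete, here is a hypothetical numerically precise rubric item as a Python dict, together with a tolerance check. The field names and values are illustrative, not the repo's actual schema:

```python
# A hypothetical atomic, content-bearing, numerically precise rubric item.
rubric = {
    "dimension": "information_recall",
    # Content-bearing: the rubric encodes the answer itself, not just a topic.
    "text": "States that global capacity increased from 94 GW (2012) to 295 GW (2021).",
    # Numerically precise: explicit values plus a tolerated error range.
    "expected_values": {"start": 94, "end": 295, "tolerance_pct": 5},
    # Binary: the judge answers satisfied / not satisfied, nothing in between.
    "binary": True,
}

def within_tolerance(claimed: float, expected: float, tolerance_pct: float) -> bool:
    """Check a reported number against the rubric's tolerated error range."""
    return abs(claimed - expected) <= expected * tolerance_pct / 100

print(within_tolerance(300, 295, 5))  # True: 300 is within 5% of 295
```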
Rubrics are built through a four-stage pipeline:
- LLM extraction from expert articles, guided by carefully designed prompts.
- Self-evaluation iteration β rejecting hallucinated or inconsistent rubrics using the source article as reference.
- Manual revision β human annotators refine wording, remove redundancy, and enforce atomicity.
- Expert review & refinement β domain experts ensure that rubrics faithfully represent the articleβs core content.
DeepResearch Bench II uses LLM-as-judge with verifiable rubrics:
- The task + rubric are serialized into a structured JSON prompt.
- The model report (PDF/DOCX/image/text) is provided as the passage (possibly as multimodal attachments).
- Gemini is prompted to output, for each rubric item:
  - `score` ∈ {1, 0, -1}, `reason`, and `evidence` (supporting sentences from the report).
Scoring semantics:
- `1`: rubric satisfied with valid evidence and no use of blocked references,
- `0`: rubric not mentioned at all,
- `-1`: rubric mentioned but evidence relies on explicitly blocked references.
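Under this three-way scheme, per-dimension scores can be derived by counting outcomes. The sketch below takes the satisfied fraction as the score and the `-1` fraction as the blocked rate; the exact formulas used by `aggregate_scores.py` may differ:

```python
def summarize(scores: list) -> dict:
    """Summarize one dimension's rubric outcomes (1 satisfied, 0 missing, -1 blocked)."""
    n = len(scores)
    return {
        "score": sum(s == 1 for s in scores) / n,         # fraction of satisfied rubrics
        "blocked_rate": sum(s == -1 for s in scores) / n, # fraction relying on blocked refs
    }

print(summarize([1, 1, 0, -1]))  # {'score': 0.5, 'blocked_rate': 0.25}
```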
The evaluation pipeline in this repo:
- Handles multimodal inputs:
- PDFs are uploaded as binary attachments.
- DOCX files are parsed into text + tables (Markdown) + extracted images.
- Images (PNG/JPEG/WebP/GIF/BMP/TIFF) are attached as inline data.
- TXT/MD/HTML are loaded as plain text.
- Supports batched evaluation:
  - Rubric items are split into batches of size `CHUNK_SIZE` (default 50).
  - Each batch is evaluated independently; results are merged and re-grouped by dimension.
- Aggregates token usage statistics:
  - Per batch (`usageMetadata`),
  - Per file, and
  - Per model across the whole run.
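The batching step above amounts to slicing the rubric list into fixed-size chunks. A minimal sketch (the real script additionally merges results and re-groups them by dimension):

```python
CHUNK_SIZE = 50  # matches the default in the .env configuration

def make_batches(items: list, chunk_size: int = CHUNK_SIZE) -> list:
    """Split rubric items into independent evaluation batches."""
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]

batches = make_batches(list(range(120)))
print([len(b) for b in batches])  # [50, 50, 20]
```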
This repository focuses on the evaluation pipeline.
Aggregated scores (per-task, per-dimension, and per-model) can be produced locally via `aggregate_scores.py`.
For full experimental details, including:
- Cross-model comparison,
- Dimension-wise analysis,
- Error cases and ablations,
please refer to the paper (`paper/main.pdf`) and any public leaderboard associated with DeepResearch Bench II.
- Python 3.9+
- A Gemini-compatible API endpoint and token
Create a `.env` file in the project root `DeepResearch-Bench-II` to store API configuration and runtime parameters:

```bash
cd DeepResearch-Bench-II
touch .env
vim .env  # or use your favorite editor
```

Required config (replace with your own values):
```
GEMINI_API_URL=https://your-api-endpoint.com/v1/chat/completions
GEMINI_API_TOKEN=your-api-token
GEMINI_MODEL=gemini-2.5-pro
GEMINI_REQUEST_ID=eval-request-id
PDF_DIR=report
OUT_JSONL=result.jsonl
TASKS_JSONL=tasks_and_rubrics.jsonl
CHUNK_SIZE=50
MAX_WORKERS=10
MAX_RETRIES=5
MAX_PAPER_CHARS=150000
LOG_FILE=run_evaluation.log
```

The project ships with `pyproject.toml`, so you can manage the virtual environment and dependencies via `uv`:
```bash
# Install uv (if not installed)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Create/sync virtual environment and install dependencies
cd DeepResearch-Bench-II
uv sync
```

To verify that `uv` is available, run any of the following commands in your terminal:
```bash
# 1. Check version (recommended)
uv --version

# 2. Check executable path
which uv

# 3. Show help
uv --help
```

- If `uv --version` prints something like `uv 0.x.y`, it is installed correctly.
- If you see `command not found` or similar, `uv` is not installed or not on your `PATH`.
```bash
# Create and activate a conda environment
conda create -n drbench-II python=3.10 -y
conda activate drbench-II

# Install Python dependencies
cd DeepResearch-Bench-II
pip install requests python-docx
```

You can then run all commands inside this conda environment.
```
DeepResearch-Bench-II/
├── assets/                   # Images and figures for README
│   ├── distribution.png
│   ├── intro.png
│   ├── main_result.png
│   └── method.png
├── report/                   # Input directory for model-generated reports
│   └── <model_name>/         # Per-model subdirectories
│       ├── idx-1.pdf         # Model output for task 1
│       ├── idx-2.docx        # Model output for task 2
│       └── ...
├── gemini_client.py          # Gemini API client (handles API calls and multimodal input)
├── run_evaluation.py         # Main evaluation script (batched rubric scoring logic)
├── aggregate_scores.py       # Score aggregation utility (produces CSV summaries)
├── tasks_and_rubrics.jsonl   # Tasks and rubrics (132 expert-derived tasks)
├── pyproject.toml            # Dependency management (uv / pip / conda)
├── .env_example              # Example configuration file
├── .env                      # Local configuration (user-created, ignored by Git)
├── .gitignore                # Git ignore rules
└── README.md                 # This documentation
```

Note: Place your model-generated reports under `report/<model_name>/idx-*.pdf|docx|html|md|txt|...`. The subdirectory name becomes the model identifier in output files.
Organize your model-generated reports under `report/` with the following structure:

```
report/
├── ModelA/
│   ├── idx-1.pdf
│   ├── idx-2.pdf
│   └── ...
└── ModelB/
    ├── idx-1.pdf
    ├── idx-2.pdf
    └── ...
```

- Subdirectory name = model name (used in output JSONL).
- File name pattern = `idx-<task_idx>.<ext>`, where `<ext>` can be `pdf`, `docx`, `html`, `md`, `txt`, or an image type.
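Given this layout, the model name and task index are implicit in each path. The sketch below shows how they can be recovered; the actual parsing logic in `run_evaluation.py` may differ:

```python
import re
from pathlib import Path

def parse_report_path(path: str) -> tuple:
    """Extract (model_name, task_idx) from report/<model_name>/idx-<task_idx>.<ext>."""
    p = Path(path)
    m = re.fullmatch(r"idx-(\d+)\.\w+", p.name)
    if m is None:
        raise ValueError(f"unexpected report file name: {p.name}")
    return p.parent.name, int(m.group(1))

print(parse_report_path("report/ModelA/idx-2.pdf"))  # ('ModelA', 2)
```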
```bash
cd DeepResearch-Bench-II
uv run python run_evaluation.py
```

Or, with an activated environment:

```bash
cd DeepResearch-Bench-II

# Use configuration from .env
python run_evaluation.py

# Or override configuration via CLI arguments
python run_evaluation.py \
  --pdf_dir grok \
  --out_jsonl result.jsonl \
  --chunk_size 50
```

This produces a JSONL file where each line has the form:

```json
{"model": "ModelA", "idx": 1, "result": {...}}
```

After you have a merged JSONL of evaluation results (e.g., `merged.jsonl`), run:
```bash
python aggregate_scores.py \
  --input result.jsonl \
  --tasks-file tasks_and_rubrics.jsonl
```

This will generate multiple CSVs:

- `agg_scores_inforecall.csv`
- `agg_scores_analysis.csv`
- `agg_scores_presentation.csv`
- `agg_scores_total.csv`
- `agg_scores_blocked.csv`
Each CSV summarizes model performance by task (`idx`), including:
- Per-dimension scores,
- Overall averages,
- Blocked-rate statistics.
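If each model was evaluated in a separate run, the per-run JSONL outputs first need to be concatenated into the single merged file expected by the aggregation step. A minimal sketch (the file names in the usage comment are illustrative):

```python
import json
from pathlib import Path

def merge_jsonl(inputs: list, output: str) -> int:
    """Concatenate several result JSONL files, validating each line as JSON."""
    n = 0
    with open(output, "w", encoding="utf-8") as out:
        for src in inputs:
            for line in Path(src).read_text(encoding="utf-8").splitlines():
                line = line.strip()
                if not line:
                    continue
                json.loads(line)  # raises on malformed lines before anything is written
                out.write(line + "\n")
                n += 1
    return n

# Usage (hypothetical file names):
# merge_jsonl(["result_modelA.jsonl", "result_modelB.jsonl"], "merged.jsonl")
```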
DeepResearch Bench II builds on the ideas and infrastructure of DeepResearch Bench and related benchmarks.
We thank all authors and annotators involved in collecting tasks, source articles, and rubrics.
If you use DeepResearch Bench II or this evaluation pipeline in your research, please cite:
```bibtex
@misc{li2026deepresearchbenchiidiagnosing,
      title={DeepResearch Bench II: Diagnosing Deep Research Agents via Rubrics from Expert Report},
      author={Ruizhe Li and Mingxuan Du and Benfeng Xu and Chiwei Zhu and Xiaorui Wang and Zhendong Mao},
      year={2026},
      eprint={2601.08536},
      archivePrefix={arXiv},
      url={https://arxiv.org/abs/2601.08536},
}
```


