This repository contains the code and data for the paper "RottenReviews: Benchmarking Review Quality with Human and LLM-Based Judgments". Note that, because of its size, the full dataset cannot be hosted in this repository; the repository therefore contains the code for reproducibility, and the data are available on Google Drive.
By following the instructions below, you can download the dataset and run the files to either reproduce the results or use the dataset for your own research.
- Title
- Dataset
- Calculate the Metrics
- Human Annotation
- Predicting Overall Quality of a Review
- Abstract
- Citation
RottenReviews
├─ data
│ ├─ dataset-overview.txt
│ ├─ human-annotation-data/ (all the human-annotated data will appear here)
│ ├─ processed/ (all the processed and cleaned data will appear here)
│ └─ raw/ (all the raw data crawled from different venues will appear here)
├─ feature_analysis
│ ├─ Figure/ (all the figures and visualizations will appear here)
│ ├─ stats-qmetrics-distributions-f1000-swj.ipynb
│ └─ stats-qmetrics-distributions-iclr-neurips-HA.ipynb
├─ feature_extraction
│ └─ process-neurips2023.ipynb
├─ human_annotation
│ ├─ HA-decision-vs-overall-quality.ipynb
│ ├─ interface/ (web application for gathering human annotation data)
│ ├─ llm-process-human-annotation-data.ipynb
│ ├─ llm-self-correlation-map.ipynb
│ └─ llm-vs-human-kendaltau.ipynb
├─ images/ (all the figures presented in README.md)
├─ predict_review_quality_score
│ ├─ Folds/ (train and test data split into 5 folds)
│ ├─ all_folds_data.csv
│ ├─ classical-ml-vs-llms.ipynb
│ ├─ human_llms_qmetrics.csv
│ ├─ human_vs_llm.csv
│ └─ llama3-finetune/
├─ requirements.txt
├─ README.md
└─ .gitignore
In this section, we provide detailed instructions on how to download the dataset, an overview of the dataset file organization, and a sample of the raw data.
Note: You need to install the gdown package to download the dataset.
pip install gdown

If you already have the gdown package installed, you can use the following commands to download the dataset:
cd RottenReviews/
gdown --folder https://drive.google.com/drive/folders/1Uqfyl5uBKBdZem9kQHkhNSPMPnwqJrYV?usp=sharing

The table below gives an overview of the dataset files:

| Folder Name | File Name | File Size | Record Type | Number of Records | Format |
|---|---|---|---|---|---|
| raw | f1000research | 497 MB | Submission | 4,509 | JSON |
| raw | semantic-web-journal | 12.7 MB | Submission | 796 | JSON |
| raw | iclr-2024 | 148 MB | Submission | 7,262 | PKL |
| raw | neurips-2023 | 81.6 MB | Submission | 3,395 | PKL |
| processed | f1000research | 41.2 MB | Review | 9,482 | CSV |
| processed | semantic-web-journal | 14.6 MB | Review | 2,337 | CSV |
| processed | iclr-2024 | 147 MB | Review | 28,028 | JSON |
| processed | neurips-2023 | 80.6 MB | Review | 15,175 | JSON |
| processed | merged-200-papers | 3.3 MB | Submission | 200 | JSON |
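Once the download finishes, the raw files can be inspected with standard Python tooling. The snippet below is only a minimal sketch: the exact file names and extensions under data/raw/ are assumptions based on the table above and may differ from what gdown actually places there.

```python
# Minimal sketch for inspecting the downloaded raw data.
# NOTE: the file names under data/raw/ are assumptions based on the table
# above; adjust them to match the downloaded folder structure.
import json
import pickle

# Semantic Web Journal submissions are distributed as JSON.
with open("data/raw/semantic-web-journal.json", "r", encoding="utf-8") as f:
    swj_submissions = json.load(f)
print(f"SWJ submissions: {len(swj_submissions)}")

# ICLR 2024 submissions are distributed as a pickle (PKL) file.
with open("data/raw/iclr-2024.pkl", "rb") as f:
    iclr_submissions = pickle.load(f)
print(f"ICLR 2024 submissions: {len(iclr_submissions)}")
```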
Here's a sample of raw data from Semantic Web Journal.
{
"id": "3654-4868",
"date": "02/29/2024",
"type": "Full Paper",
"abstract": "Relation prediction in Knowledge...",
"pdf_link": "https://www.semantic-web-journal...",
"authors": [
"Jiangtao Ma",
"Yuke Ma",
"Fan Zhang1",
"..."
],
"editor": "Guest Editors KG Gen from Text 2023",
"status": [
"Reviewed"
],
"decision": "Major Revision",
"reviews": [
{
"reviewer": "Anonymous",
"date": "27/Aug/2024",
"suggestion": "Minor Revision",
"comment": "The paper proposed a new approach..."
},
{
"reviewer": "Anonymous",
"date": "03/Sep/2024",
"suggestion": "Major Revision",
"comment": "The subject on which the proposal..."
}
],
"version_prev": null,
"version_next": null,
"title": "HiHo: A Hierarchical and Homogenous Subgraph...",
"rev_link": "https://www.semantic-web-journal..."
}

In this work, we calculated two categories of metrics: (1) quantifiable metrics derived using off-the-shelf text processing methods, and (2) metrics obtained by prompting various LLMs to quantify specific aspects of review quality. Below, you will find instructions to reproduce the results, along with sample outputs and visualizations generated from these metrics.
To extract features, start by running the notebooks in the feature_extraction folder. Each notebook processes raw data from a specific venue, extracts defined features, and saves the processed files in the data/processed folder. Both quantifiable metrics and LLM-based metrics are calculated within the same notebook. Each notebook includes a dedicated section named "LLM" that separates the code for these two categories of metrics.
cd feature_extraction/
jupyter nbconvert --to notebook --execute process-iclr2024.ipynb

After extracting features, analyze the results by running notebooks in the feature_analysis folder. These notebooks plot metric distributions, correlation maps, and relationships between quantifiable metrics, human annotations, and LLM evaluations. Visualizations are saved in the feature_analysis/Figure directory.
cd feature_analysis/
jupyter nbconvert --to notebook --execute stats-qmetrics-distributions-iclr-neurips-HA.ipynb

Quantifiable metrics derived using off-the-shelf text processing methods are measurable attributes extracted directly from the text of reviews and submissions. These metrics include features such as review length, lexical diversity, sentiment polarity, and readability, among others. By leveraging established natural language processing (NLP) techniques and tools, these metrics provide objective insights into various aspects of review quality, enabling systematic analysis and comparison across datasets.
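As an illustration only, the snippet below shows how a few of these quantifiable metrics could be computed with common off-the-shelf libraries (textstat and TextBlob). The library choices and the MATTR window size are assumptions; the repository's feature_extraction notebooks may rely on different tools.

```python
# Illustrative computation of a handful of quantifiable review metrics.
# Library choices and the MATTR window size are assumptions, not necessarily
# what the feature_extraction notebooks use.
import textstat
from textblob import TextBlob

def quantifiable_metrics(review_text: str, window: int = 50) -> dict:
    words = review_text.split()
    lowered = [w.lower() for w in words]
    # Moving-average type-token ratio (MATTR): average unique-word ratio over a
    # sliding window, falling back to plain TTR for short reviews.
    if len(words) <= window:
        mattr = len(set(lowered)) / max(len(words), 1)
    else:
        ratios = [
            len(set(lowered[i:i + window])) / window
            for i in range(len(words) - window + 1)
        ]
        mattr = sum(ratios) / len(ratios)
    return {
        "length_words": len(words),
        "question_count": review_text.count("?"),
        "mattr": round(mattr, 4),
        "sentiment_polarity": round(TextBlob(review_text).sentiment.polarity, 4),
        "reading_ease": textstat.flesch_reading_ease(review_text),
    }

print(quantifiable_metrics("This is a meta-analysis of prior trials. Could the authors clarify the inclusion criteria?"))
```

A sample of a processed review record containing such metrics is shown below: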
{
"paper_id": "11-565",
"reviewer": "Daniel A Nation",
"review_date": "19 Jan 2023",
"review_suggestion": "Approved",
"length_words": 278,
"title": "Assessing the role of...",
"abstract": "Although observational studies...",
"days_to_submit": 240,
"review_text": "This is a meta-analysis of...",
"mattr": 0.7832,
"question_count": 1,
"citation_count": 0,
"sentiment_polarity": 0.1869,
"politeness_score": 0.8376,
"similarity_score": 0.9741,
"reading_ease": 19.97,
"hedgeing": 14.8,
"explicit_reference": 1,
"...": "...",
}

The table below summarizes the statistics of quantifiable metrics across four venues: NeurIPS, ICLR, F1000, and SWJ. Metrics such as review length, readability, sentiment polarity, and politeness vary significantly across venues, reflecting differences in review styles and submission characteristics. For instance, SWJ reviews are notably longer and more detailed, while NeurIPS and ICLR reviews exhibit higher lexical diversity and politeness scores. These variations highlight the diverse nature of peer review practices across academic venues.
| Metric | Dependency | NeurIPS | ICLR | F1000 | SWJ |
|---|---|---|---|---|---|
| Review Length | Review | 439.4 | 424.5 | 398.17 | 782.09 |
| # References | Review | 1.25 | 1.42 | 0.29 | 2.29 |
| # Section-specific Comments | Review | 1.43 | 1.73 | 1.78 | 7.27 |
| Semantic Alignment | Review | 0.90 | 0.90 | 0.88 | 0.90 |
| Timeliness | Review | 59.13 | 39.81 | 142.36 | 89.46 |
| Politeness | Review | 0.84 | 0.81 | 0.83 | 0.75 |
| Readability | Review | 38.02 | 37.65 | 36.60 | 43.86 |
| Lexical Diversity | Review | 0.77 | 0.77 | 0.76 | 0.76 |
| # Raised Questions | Review | 3.76 | 4.02 | 1.72 | 2.88 |
| Sentiment Polarity | Review | 0.11 | 0.11 | 0.15 | 0.10 |
| Hedging | Review | 0.005 | 0.009 | 0.013 | 0.007 |
| General Topic Alignment | Reviewer | N/A | N/A | 0.74 | 0.76 |
| Recency-Based Topic Alignment | Reviewer | N/A | N/A | 0.65 | 0.64 |
| In-depth Topical Alignment | Reviewer | N/A | N/A | 0.87 | 0.88 |
| Reviewer’s Citation | Reviewer | N/A | N/A | 4683.0 | 2476.08 |
| Reviewer’s Academic Tenure | Reviewer | N/A | N/A | 29.16 | 25.68 |
The figure below shows correlations between metrics like review length, sentiment polarity, readability, and politeness for the F1000 dataset.
Correlation between quantifiable metrics on F1000.
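The correlation map can be regenerated from the processed CSV. The sketch below is a minimal, assumption-laden version (the file name, column subset, correlation method, and output path are guesses); the actual figure code lives in stats-qmetrics-distributions-f1000-swj.ipynb.

```python
# Minimal sketch for reproducing a correlation heatmap over quantifiable
# metrics on F1000. File name, column subset, Spearman correlation, and the
# output path are assumptions.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("data/processed/f1000research.csv")
metrics = ["length_words", "sentiment_polarity", "reading_ease", "politeness_score", "mattr"]
corr = df[metrics].corr(method="spearman")

sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.title("Correlation between quantifiable metrics (F1000)")
plt.tight_layout()
plt.savefig("feature_analysis/Figure/f1000_qmetrics_correlation.png")
```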
As mentioned in Run notebooks, running the notebooks reproduces the results and prompts the LLMs to evaluate each review on the aspects defined in the prompt. A sample of the prompt and the list of models used are provided in the following subsections.
- Qwen-3: A state-of-the-art model optimized for natural language understanding and generation tasks.
- Phi-4: Known for its advanced reasoning capabilities and contextual comprehension.
- GPT-4o: OpenAI's latest iteration of the GPT series, designed for high-quality text generation.
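As a rough sketch of how such a model can be queried programmatically, the snippet below sends the prompt shown below to GPT-4o through the OpenAI Python client and parses the returned JSON block. The client usage, model identifier, and parsing logic are assumptions and may differ from what the repository's notebooks do.

```python
# Sketch of querying an LLM judge with the review-quality prompt and parsing
# its <review_assessment> JSON block. The prompt_template argument is assumed
# to hold the prompt shown below, with {title}, {abstract}, and {review_text}
# placeholders; the model name and client usage are assumptions.
import json
import re
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def judge_review(prompt_template: str, title: str, abstract: str, review_text: str) -> dict:
    prompt = prompt_template.format(title=title, abstract=abstract, review_text=review_text)
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,
        messages=[{"role": "user", "content": prompt}],
    )
    answer = response.choices[0].message.content
    # The prompt asks for a single JSON object wrapped in <review_assessment> tags.
    match = re.search(r"<review_assessment>\s*(\{.*\})\s*</review_assessment>", answer, re.DOTALL)
    return json.loads(match.group(1)) if match else {}
```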
We prompted all the models mentioned above with the same prompt to evaluate review quality. Below is the exact prompt used:
# REVIEW-QUALITY JUDGE
## 0 — ROLE
You are **ReviewInspector-LLM**, a rigorous, impartial meta-reviewer.
Your goal is to assess the quality of a single peer-review against a predefined set of criteria and to provide precise, structured evaluations.
## 1 — INPUTS
Title: {title}
Abstract: {abstract}
Review: {review_text}
## 2 — EVALUATION CRITERIA
Return **only** the scale value or label at right (no rationale text).
| # | Criterion | Allowed scale / label | Description |
| -- | ---------------------------- | ------------------------------------------- | -------------------------------------------------------------------------- |
| 1 | **Comprehensiveness** | integer **0-5** | Extent to which the review covers all key aspects of the paper. |
| 2 | **Usage of Technical Terms** | integer **0-5** | Appropriateness and frequency of domain-specific vocabulary. |
| 3 | **Factuality** | **factual / partially factual / unfactual** | Accuracy of the statements made in the review. |
| 4 | **Sentiment Polarity** | **negative / neutral / positive** | Overall sentiment conveyed by the reviewer. |
| 5 | **Politeness** | **polite / neutral / impolite** | Tone and manner of the review language. |
| 6 | **Vagueness** | **none / low / moderate / high / extreme** | Degree of ambiguity or lack of specificity in the review. |
| 7 | **Objectivity** | integer **0-5** | Presence of unbiased, evidence-based commentary. |
| 8 | **Fairness** | integer **0-5** | Perceived impartiality and balance in judgments. |
| 9 | **Actionability** | integer **0-5** | Helpfulness of the review in suggesting clear next steps. |
| 10 | **Constructiveness** | integer **0-5** | Degree to which the review offers improvements rather than just criticism. |
| 11 | **Relevance Alignment** | integer **0-5** | How well the review relates to the content and scope of the paper. |
| 12 | **Clarity and Readability** | integer **0-5** | Ease of understanding the review, including grammar and structure. |
| 13 | **Overall Quality** | integer **0-100** | Holistic evaluation of the review's usefulness and professionalism. |
## 3 — SCORING GUIDELINES
For 0-5 scales:
* 5 = Outstanding
* 4 = Strong
* 3 = Adequate
* 2 = Weak
* 1 = Very weak
* 0 = Absent/irrelevant
## 4 — ANALYSIS & COMPUTATION (silent)
1. Read and understand the review in the context of the paper title and abstract.
2. Extract quantitative and qualitative signals (e.g., term usage, factual consistency, tone, clarity).
3. Map observations to the corresponding scoring scales.
## 5 — OUTPUT FORMAT (strict)
Return **exactly one** JSON block wrapped in the tag below — **no comments or extra text**.
```json
<review_assessment>
{{
"paper_title": "{title}",
"criteria": {{
"Comprehensiveness": ...,
"Usage of Technical Terms": ...,
"Factuality": ...,
"Sentiment Polarity": ...,
"Politeness": ...,
"Vagueness": ...,
"Objectivity": ...,
"Fairness": ...,
"Actionability": ...,
"Constructiveness": ...,
"Relevance Alignment": ...,
"Clarity and Readability": ...,
"Overall Quality": ...
}},
}}
</review_assessment>
```

Below is a sample of the structured output returned for a single review:

{
"paper_id": "123",
"reviewer": "Reviewer_EGJf",
"Comprehensiveness": 2,
"Vagueness": "moderate",
"Objectivity": 2,
"Fairness": 3,
"Actionability": 1,
"Constructiveness": 2,
"Relevance Alignment": 2,
"Clarity and Readability": 3,
"Usage of Technical Terms": 2,
"Factuality": "partially factual",
"Sentiment Polarity": "neutral",
"Politeness": "polite",
"Overall Quality": 40
}

In this study, we conducted a human evaluation of reviews from four academic venues. We randomly selected 50 papers from each venue, resulting in a dataset of 200 papers along with their associated reviews. This dataset comprises 661 reviews and 200 papers, providing a comprehensive basis for analyzing review quality. The processed dataset is available in data/processed/merged-200-papers.json.
We implemented the interface for gathering human annotation data as a Flask web application. The source code is available in the human_annotation/interface directory, which contains all the files needed to deploy and run the annotation tool.
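For orientation, here is a minimal sketch of what an annotation-submission endpoint could look like. The route, payload fields, and file-naming scheme are illustrative assumptions; the actual implementation lives in human_annotation/interface.

```python
# Minimal Flask sketch of an annotation-submission endpoint that writes one
# record per submission, mirroring the record format shown further below.
# Route, payload fields, and file naming are illustrative assumptions.
import json
import os
from datetime import datetime
from flask import Flask, request, jsonify

app = Flask(__name__)
DATA_DIR = "data/human-annotation-data"

@app.route("/submit", methods=["POST"])
def submit_annotation():
    payload = request.get_json()
    now = datetime.now()
    record = {
        "paper_id": payload["paper_id"],
        "submitted_at": now.isoformat(),
        "metrics": payload["metrics"],
    }
    os.makedirs(DATA_DIR, exist_ok=True)
    out_path = os.path.join(
        DATA_DIR, f"annotation_{record['paper_id']}_{now.strftime('%Y%m%d%H%M%S')}.json"
    )
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(record, f, indent=2)
    return jsonify({"status": "ok"})

if __name__ == "__main__":
    app.run(debug=True)
```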
Here's a screenshot of the user interface of the human data gathering web app:
The recorded data from participants are stored in the data/human-annotation-data folder. Each file in this directory represents a single record, containing annotations provided by individual participants. These records include detailed evaluations of review quality across multiple dimensions, such as comprehensiveness, clarity, and fairness. Below is a sample of one record:
{
"paper_id": "170",
"submitted_at": "2025-05-15T18:57:57.054760",
"metrics": {
"review_Sangeeta-Saha_Comprehensiveness": 3,
"review_Sangeeta-Saha_Usage_of_Technical_Terms": 3,
"review_Sangeeta-Saha_Factuality": "partially factual",
"review_Sangeeta-Saha_Sentiment_Polarity": "negative",
"review_Sangeeta-Saha_Politeness": "neutral",
"review_Sangeeta-Saha_Vagueness": "moderate",
"review_Sangeeta-Saha_Objectivity": 4,
"review_Sangeeta-Saha_Fairness": 4,
"review_Sangeeta-Saha_Actionability": 2,
"review_Sangeeta-Saha_Constructiveness": 3,
"review_Sangeeta-Saha_Relevance_Alignment": 4,
"review_Sangeeta-Saha_Clarity_and_Readability": 1,
"review_Sangeeta-Saha_Overall_Quality": 60,
"review_J-H-Arias-Castro_Comprehensiveness": 3,
"review_J-H-Arias-Castro_Usage_of_Technical_Terms": 3,
"...": "..."
}
}

Left: Kendall’s τ correlation between human-evaluated and LLM-evaluated quality dimensions. Right: Kendall’s τ correlation between human-evaluated quality dimensions (Y-axis) and quantifiable metrics (X-axis).
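These correlations can be recomputed from the comparison files shipped with the repository. The snippet below is a minimal sketch using scipy, with column names assumed rather than taken from the actual CSV; see llm-vs-human-kendaltau.ipynb for the full analysis.

```python
# Sketch of computing Kendall's tau between human and LLM overall-quality
# scores. Column names are assumptions; inspect
# predict_review_quality_score/human_vs_llm.csv for the real layout.
import pandas as pd
from scipy.stats import kendalltau

df = pd.read_csv("predict_review_quality_score/human_vs_llm.csv")
tau, p_value = kendalltau(df["human_overall_quality"], df["llm_overall_quality"])
print(f"Kendall's tau = {tau:.3f} (p = {p_value:.4f})")
```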
We aimed to predict the overall quality of a review using the quantifiable metrics extracted in the previous steps, with labels for overall quality derived from the human annotation data. All relevant code for this section is available in the predict_review_quality_score directory.
Below is a list of the traditional models used for predicting overall review quality, along with a brief explanation of their setup using the scikit-learn library (a minimal training sketch follows the list):
- **Random Forest**: A robust ensemble learning method that combines multiple decision trees to improve prediction accuracy.
  - Key parameters: `random_state=42` (ensures reproducibility), `n_estimators=100` (number of trees in the forest).
- **Linear Regression**: A simple and interpretable model that fits a linear relationship between input features and the target variable.
  - Key parameters: default settings used for simplicity.
- **Neural Network**: A multi-layer perceptron (MLP) designed to capture complex non-linear relationships in the data.
  - Key parameters: `random_state=42` (ensures reproducibility), `hidden_layer_sizes=(54, 108, 108, 54)` (four hidden layers with varying neuron counts).
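The sketch below illustrates how these three baselines could be trained on one of the pre-split folds. The fold file names and the label column are assumptions; inspect the Folds/ directory and all_folds_data.csv for the actual layout.

```python
# Sketch of training the three traditional baselines on one fold of
# quantifiable metrics with a human-annotated Overall Quality label.
# Fold file names and the target column name are assumptions.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.neural_network import MLPRegressor

train = pd.read_csv("predict_review_quality_score/Folds/fold_1_train.csv")
test = pd.read_csv("predict_review_quality_score/Folds/fold_1_test.csv")
target = "Overall_Quality"  # human-annotated 0-100 score (assumed column name)
features = [c for c in train.columns if c != target]

models = {
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=42),
    "Linear Regression": LinearRegression(),
    "Neural Network": MLPRegressor(hidden_layer_sizes=(54, 108, 108, 54),
                                   random_state=42, max_iter=1000),
}

for name, model in models.items():
    model.fit(train[features], train[target])
    preds = model.predict(test[features])
    print(f"{name}: MAE = {mean_absolute_error(test[target], preds):.2f}")
```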
Each LLM quantifies the overall quality of a review on a 0–100 scale, as defined in the Prompt section. This score reflects the review's overall usefulness and professionalism across dimensions like comprehensiveness, clarity, and fairness. Using three base models and one fine-tuned model (as detailed in Models), we compared LLM-generated scores with human annotations to evaluate alignment and reliability in assessing review quality.
All the necessary code to fine-tune LLaMA is available in the predict_review_quality_score/llama3-finetune directory. You can fine-tune the model using the following command:
cd RottenReviews/predict_review_quality_score/llama3-finetune/
python llm_eval.py --train_data_path /path/to/train_data.csv \
--val_data_path /path/to/val_data.csv \
--model_name_or_path llama-base \
--output_dir /path/to/output_dir \
--batch_size 16 \
--learning_rate 5e-5 \
--num_train_epochs 3 \
--logging_dir /path/to/logging_dir

Kendall’s τ correlation between human-evaluated and model-predicted Overall Quality of peer reviews.
The quality of peer review plays a critical role in scientific publishing, yet remains poorly understood and challenging to evaluate at scale. In this work, we introduce RottenReviews, a benchmark designed to facilitate systematic assessment of review quality. RottenReviews comprises over 15,000 submissions from four distinct academic venues enriched with over 9,000 reviewer scholarly profiles and paper metadata. We define and compute a diverse set of quantifiable review-dependent and reviewer-dependent metrics, and compare them against structured assessments from large language models (LLMs) and expert human annotations. Our human-annotated subset includes over 700 paper–review pairs labeled across 13 explainable and conceptual dimensions of review quality. Our empirical findings reveal that LLMs, both zero-shot and fine-tuned, exhibit limited alignment with human expert evaluations of peer review quality. Surprisingly, simple interpretable models trained on quantifiable features outperform fine-tuned LLMs in predicting overall review quality.
TBA





