
Update eval framework to use SummarizerAgent #269

Merged
jjasinsk-amd merged 16 commits into main from jjasinsk/eval_summarization
Feb 6, 2026

Conversation

@jjasinsk-amd (Collaborator) commented Jan 27, 2026

Summary

This PR enhances the summarization evaluation pipeline by adding support for synthetic PDF creation and groundtruth generation, and by integrating a summarization agent for inference and evaluation across Lemonade and Claude models.

Key Changes

  • Adds a utility to generate synthetic PDF files for testing and benchmarking purposes.
  • Modifies groundtruth.py to generate accurate groundtruth for the PDF files used in summarization evaluation.
  • Updates batch_experiment.py to leverage the SummarizerAgent for summarizing content on both Lemonade Server and Claude.
  • Adjusts evaluation criteria to align with the output format and expectations of the summarization agent, ensuring consistent and meaningful benchmarking.

@github-actions github-actions bot added eval Evaluation framework changes performance Performance-critical changes labels Jan 27, 2026
@jjasinsk-amd (Collaborator, Author) commented:

@claude please review this PR

github-actions bot commented Jan 27, 2026

Claude finished @jjasinsk-amd's task


PR Review: Update eval framework to use SummarizerAgent


Summary

This PR integrates the SummarizerAgent into the evaluation framework, replacing direct PDF extraction and summarization logic with the centralized agent. The changes touch 4 files with significant refactoring.


🔴 Critical Issues

1. Security Finding - False Positive
The GitHub Advanced Security finding at pdf_document_generator.py:680 about "clear-text logging of sensitive information (password)" appears to be a false positive. The line in question:

print(f"  Total API tokens (input+output): {total_api_tokens:,}")

This logs token counts (usage metrics), not passwords or API keys. The scanner likely flagged "token" incorrectly. No action needed.


🟡 Important Observations

1. Good: SummarizerAgent Integration (batch_experiment.py:14-19, 149-153)
The PR correctly imports and instantiates SummarizerAgent for PDF extraction:

from gaia.agents.summarize.agent import SummarizerAgent
from gaia.agents.summarize.prompts import SUMMARY_STYLES, SYSTEM_PROMPTS
# ...
self.summarizer_agent = SummarizerAgent()

This promotes code reuse and centralizes PDF processing logic.

2. Good: Improved PDF Extraction (batch_experiment.py:155-170)
The new _extract_text_from_pdf method leverages SummarizerAgent's built-in caching:

pdf_text = self.summarizer_agent.get_summary_content_from_file(Path(pdf_path))

3. Good: Local LLM Support in Evaluator (eval.py:24-43)
The Evaluator class now supports both Claude and local LLM for evaluation, providing flexibility for different deployment scenarios.

4. Concern: Removed pypdf Import
The diff shows removal of pypdf import and direct PDF reading code. Ensure SummarizerAgent handles all edge cases that were previously handled directly (large PDFs, malformed PDFs, etc.).
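One defensive option is to keep the agent as the primary extraction path and fall back to a direct reader when the agent fails or returns empty text. This is a minimal sketch only; `agent_extract` and `pypdf_extract` are hypothetical stand-ins for the agent call and a direct pypdf-style reader, not helpers that exist in the PR:

```python
def extract_text(pdf_path, agent_extract, pypdf_extract):
    """Try the agent-based extractor first; fall back to a direct
    reader if the agent raises or returns no usable text."""
    try:
        text = agent_extract(pdf_path)
        if text and text.strip():
            return text
    except Exception:
        pass  # agent failed; fall through to the direct reader
    return pypdf_extract(pdf_path)
```

A wrapper like this would preserve the old pypdf behavior for malformed inputs while still routing the common case through the centralized agent.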

5. New File: pdf_document_generator.py
A new 700+ line file for generating synthetic PDF documents using Claude. The error handling and cost tracking are solid, and the template-based approach is well-structured, but the file is quite long and could benefit from being split into smaller modules.

🟢 Minor Suggestions

1. groundtruth.py Additions
The new UseCase.PDF enum and associated prompts are well-structured. The PDF-specific ground truth generation prompt looks comprehensive.

2. Evaluator Class Improvements (eval.py)

  • Good addition of _send_prompt() method for abstracting Claude vs local LLM
  • Comprehensive timing and progress tracking
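The Claude-vs-local routing described above can be as simple as a flag check in `_send_prompt()`. The sketch below assumes roughly that shape; the `_send_local`/`_send_claude` helpers and their placeholder bodies are invented for illustration and are not the PR's real method names:

```python
class Evaluator:
    """Sketch of use_local_llm routing (placeholder helper names)."""

    def __init__(self, use_local_llm=False):
        self.use_local_llm = use_local_llm

    def _send_prompt(self, prompt):
        # Route to the local LLM or to Claude based on the constructor flag.
        if self.use_local_llm:
            return self._send_local(prompt)
        return self._send_claude(prompt)

    def _send_local(self, prompt):
        return f"local:{prompt}"   # placeholder for a local-LLM call

    def _send_claude(self, prompt):
        return f"claude:{prompt}"  # placeholder for a Claude API call
```

Keeping the dispatch in one method means evaluation code never needs to know which backend is active.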

Questions for Author

  1. Were there any test failures when switching to SummarizerAgent for PDF extraction?
  2. Is the SummarizerAgent initialization at class level (__init__) intentional, or should it be lazy-loaded?
  3. Should the pdf_document_generator.py be in eval/ or would apps/ be more appropriate?
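On question 2, a lazy-loading alternative could look like the following sketch, which defers agent construction to first use. `SummarizerAgent` is stubbed here so the snippet stands alone; the real class lives in `gaia.agents.summarize.agent`:

```python
from functools import cached_property

class SummarizerAgent:
    """Stand-in stub for gaia.agents.summarize.agent.SummarizerAgent."""

class BatchExperimentRunner:
    # Lazy alternative to building the agent in __init__: the
    # (potentially heavy) agent is constructed on first access
    # and cached for all later calls.
    @cached_property
    def summarizer_agent(self):
        return SummarizerAgent()
```

This keeps runner construction cheap for code paths that never touch PDFs, at the cost of moving any agent-initialization failure to first use.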

Checklist

  • Code compiles/parses correctly
  • Security finding is false positive
  • Integration with existing SummarizerAgent is appropriate
  • Tests should be verified (recommend running eval tests)

Overall Assessment: Approve with minor suggestions. The integration looks clean and follows good practices for code reuse.


@jjasinsk-amd jjasinsk-amd marked this pull request as ready for review January 28, 2026 15:20
@kovtcharov-amd (Collaborator) commented:

@claude, a couple of things:

  • does this change break any pre-existing functionality, especially anything described in the documentation?
  • does the documentation need to be updated? what needs to be added/removed/modified?

github-actions bot commented Jan 28, 2026

Claude finished @kovtcharov-amd's task


Analyzing Breaking Changes and Documentation Impact

  • Read PR diff to understand all changes
  • Check existing documentation for eval framework
  • Identify any breaking changes to pre-existing functionality
  • Determine documentation updates needed

Does this change break pre-existing functionality?

No breaking changes to the documented CLI interface. The PR is backwards-compatible:

| Aspect | Status | Details |
| --- | --- | --- |
| CLI commands | No change | `gaia groundtruth`, `gaia batch-experiment`, `gaia eval` work identically |
| Existing config files | Compatible | `basic_summarization.json`, `basic_qa.json` unchanged |
| Default workflows | Compatible | Meeting/email/QA workflows work without modification |
| Evaluation output format | Enhanced | New performance metrics added; existing fields preserved |

Internal implementation changes (not user-facing):

  • BatchExperimentRunner now uses SummarizerAgent for PDF extraction instead of direct pypdf (batch_experiment.py:149)
  • Evaluator class now supports local LLM via use_local_llm parameter (eval.py:24-43)
  • Evaluation criteria adjusted for different content types (PDF, email, transcript) - more nuanced evaluation

Does documentation need to be updated?

Yes, updates needed in docs/reference/eval.mdx:

1. Add new --use-case pdf option (Priority: High)

The PR adds UseCase.PDF (groundtruth.py:23) but documentation doesn't mention it.

Add to "Ground Truth Creation" section (~line 567):

gaia groundtruth -d ./output/pdfs -p "*.pdf" --use-case pdf -o ./output/groundtruth

Update "Use case options" list (~line 589):

- `pdf` - PDF document summarization (supports embedded groundtruth workflow)

2. Add PDF Summarization Workflow (Priority: High)

Add new "Workflow 5: PDF Document Summarization" section after Workflow 4:

### Workflow 5: PDF Document Summarization

**Complete example: Evaluate PDF document summarization capabilities**

Step 1: Generate synthetic PDF documents

```bash
python -m gaia.eval.pdf_document_generator -o ./output/test_data/pdfs --count-per-type 2
```

Step 2: Create PDF ground truth

```bash
gaia groundtruth -d ./output/test_data/pdfs -p "*.pdf" --use-case pdf -o ./output/groundtruth
```

Step 3: Run experiments (use the pdf_summarization.json config)

```bash
gaia batch-experiment -c ./src/gaia/eval/configs/pdf_summarization.json -i ./output/groundtruth/consolidated_pdf_groundtruth.json -o ./output/experiments
```

3. Document new config file (Priority: Medium)

Mention pdf_summarization.json config in the Batch Experimentation section (~line 636):

Available config examples in `./src/gaia/eval/configs/`:
- `basic_summarization.json` - Meeting/email summarization
- `basic_qa.json` - Q&A experiments  
- `pdf_summarization.json` - PDF document summarization

4. Document pdf_document_generator.py utility (Priority: Medium)

Add "Synthetic PDF Generation" subsection under "Synthetic Data Generation" (~line 519):

#### PDF Documents
Generate synthetic PDF documents for testing:
```bash
python -m gaia.eval.pdf_document_generator -o ./output/test_data/pdfs --count-per-type 1
```

**Available document types:** technical_spec, business_proposal, research_report, project_plan, policy_document, white_paper, user_manual, financial_report

5. Update System Architecture diagram (Priority: Low)

The mermaid diagram already shows "Documents/PDFs" as input, so no change needed there.


Summary

| Item | Required Action |
| --- | --- |
| Breaking changes | None; the PR is backwards-compatible |
| CLI documentation | Add the `--use-case pdf` option |
| New workflow | Add a "PDF Document Summarization" workflow |
| Config reference | Document `pdf_summarization.json` |
| Utility tool | Document `pdf_document_generator.py` |

Recommendation: The PR should either include these documentation updates, or a follow-up issue should be created to track them.

@github-actions github-actions bot added documentation Documentation changes agents Agent system changes cli CLI changes labels Jan 29, 2026
@github-actions github-actions bot added dependencies Dependency updates chat Chat SDK changes labels Feb 5, 2026
@jjasinsk-amd jjasinsk-amd added this pull request to the merge queue Feb 6, 2026
Merged via the queue into main with commit 2cca205 Feb 6, 2026
51 checks passed
@jjasinsk-amd jjasinsk-amd deleted the jjasinsk/eval_summarization branch February 6, 2026 11:20
