AI Eval Forge: Mixed-Check Regression Testing for LLM and Agent Workflows

This repository is the public artifact bundle for the second paper built from the ai-eval-forge package.

What is in this repo

  • ai-eval-forge-mixed-check-regression-testing-preprint.pdf for the submission-ready manuscript
  • paper.md for the source draft
  • paper.bib for the current references
  • assets/workflow-figure.svg for the workflow figure
  • submission-metadata.json for reusable submission metadata
  • ai-eval-forge-preprint-package.zip for one-click uploads to preprint platforms

Abstract

Large-model and agent teams often need faster regression checks than broad benchmark suites can provide. This paper presents AI Eval Forge, a zero-dependency evaluation harness for mixed-check regression testing across LLM and agent workflows. The tool supports exact-match, substring, regex, token-F1, JSON validity, JSON field equality, citation coverage, and bounded custom-expression checks in a compact case format that works with JSON or JSONL. The contribution is not a new benchmark. It is a small, inspectable evaluation layer that helps teams compare runs, catch regressions, and summarize pass rate, score, cost, and latency without standing up a heavy evaluation stack. The paper describes the harness design, check model, reporting format, and practical role of mixed-check cases in real workflow testing.
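To make the case format concrete, here is a minimal sketch of how a mixed-check case might look in JSONL and how a harness could evaluate it. The field names ("input", "output", "checks") and the check-type strings are illustrative assumptions, not ai-eval-forge's actual schema; see the manuscript for the real case format.

    import json
    import re

    # Hypothetical single JSONL case line; the schema here is an assumption
    # made for illustration, not the tool's documented format.
    case_line = (
        '{"input": "List two primes.", '
        '"output": "2 and 3", '
        '"checks": ['
        '{"type": "substring", "value": "2"}, '
        '{"type": "regex", "pattern": "\\\\b3\\\\b"}]}'
    )

    def run_check(check, output):
        """Dispatch one check against a recorded model output."""
        kind = check["type"]
        if kind == "exact":
            return output == check["value"]
        if kind == "substring":
            return check["value"] in output
        if kind == "regex":
            return re.search(check["pattern"], output) is not None
        if kind == "json_valid":
            try:
                json.loads(output)
                return True
            except ValueError:
                return False
        raise ValueError(f"unknown check type: {kind}")

    case = json.loads(case_line)
    results = [run_check(c, case["output"]) for c in case["checks"]]
    print(f"passed {sum(results)}/{len(results)} checks")

Looping over many such lines and tallying passes alongside per-case cost and latency would yield the kind of run-level summary (pass rate, score, cost, latency) the abstract describes.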

Source package

The evaluated tool ships separately as the ai-eval-forge package; this repository carries only the publication artifacts for the paper.

Submission path

This bundle is prepared for:

  • Zenodo
  • OSF Preprints
  • SSRN

Citation

Cite the versioned preprint record once it is published.
