This repository is the public artifact bundle for the second paper based on the ai-eval-forge package.
The bundle contains:

- `ai-eval-forge-mixed-check-regression-testing-preprint.pdf` for the submission-ready manuscript
- `paper.md` for the source draft
- `paper.bib` for the current references
- `assets/workflow-figure.svg` for the workflow figure
- `submission-metadata.json` for reusable submission metadata
- `ai-eval-forge-preprint-package.zip` for one-click uploads to preprint platforms
Large-model and agent teams often need faster regression checks than broad benchmark suites can provide. This paper presents AI Eval Forge, a zero-dependency evaluation harness for mixed-check regression testing across LLM and agent workflows. The tool supports exact-match, substring, regex, token-F1, JSON validity, JSON field equality, citation coverage, and bounded custom-expression checks in a compact case format that works with JSON or JSONL. The contribution is not a new benchmark. It is a small, inspectable evaluation layer that helps teams compare runs, catch regressions, and summarize pass rate, score, cost, and latency without standing up a heavy evaluation stack. The paper describes the harness design, check model, reporting format, and practical role of mixed-check cases in real workflow testing.
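For illustration, the TypeScript sketch below shows the mixed-check idea: several check types applied to one case and summarized as a pass rate. It is not the package's actual API; the type and function names (`Check`, `EvalCase`, `runCase`) are hypothetical, and only four of the eight check types are shown.

```ts
// Minimal sketch of mixed-check evaluation. NOT the
// @mukundakatta/ai-eval-forge API; names here are illustrative.

type Check =
  | { kind: "exact"; expected: string }
  | { kind: "substring"; expected: string }
  | { kind: "regex"; pattern: string }
  | { kind: "json_valid" };

interface EvalCase {
  id: string;
  output: string; // model/agent output under test
  checks: Check[]; // mixed checks applied to the same output
}

function runCheck(output: string, check: Check): boolean {
  switch (check.kind) {
    case "exact":
      return output === check.expected;
    case "substring":
      return output.includes(check.expected);
    case "regex":
      return new RegExp(check.pattern).test(output);
    case "json_valid":
      try {
        JSON.parse(output);
        return true;
      } catch {
        return false;
      }
  }
}

function runCase(c: EvalCase): { id: string; passRate: number } {
  const passed = c.checks.filter((ch) => runCheck(c.output, ch)).length;
  return { id: c.id, passRate: passed / c.checks.length };
}

// Example: one case mixing a substring check and a JSON-validity check.
console.log(
  runCase({
    id: "invoice-extraction-1",
    output: '{"total": 42.5}',
    checks: [{ kind: "substring", expected: "total" }, { kind: "json_valid" }],
  })
);
```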
- npm package: @mukundakatta/ai-eval-forge
- GitHub repository: MukundaKatta/ai-eval-forge-js
This bundle is prepared for:
- Zenodo
- OSF Preprints
- SSRN
Once the preprint is published, cite the versioned record.