This repository is the public artifact bundle for the preprint:
- DOI: 10.5281/zenodo.20034550
- Zenodo record: zenodo.org/records/20034550
- SSRN abstract page: papers.ssrn.com/sol3/papers.cfm?abstract_id=6718599
- ORCID profile: 0009-0007-6071-3896

The bundle includes:
- `lightweight-evaluation-and-operational-scorecards-preprint.pdf` for the submission-ready manuscript
- `paper.md` for the source draft
- `paper.bib` for the current references
- `assets/workflow-figure.svg` for the workflow figure
- `artifact-manifest.md` and `related-links.md` for the public evaluation artifacts connected to the paper
- `submission-metadata.json` for reusable submission metadata

Tool-using AI agents are increasingly used in coding, browser automation, research assistance, and support workflows. In practice, however, many teams still evaluate these systems through isolated prompts, one-off demos, or broad benchmark references that do not translate well into deployment judgment. This paper presents a lightweight workflow for evaluating agent behavior that begins with scenario design, continues through explicit expected behavior and failure-mode definition, and ends with an operational scorecard that helps teams judge rollout readiness. The workflow is instantiated through compact public artifacts, including small datasets, interactive demo apps, and public analytics surfaces. The aim is not to compete with large benchmarks on scale. It is to improve repeatability, interpretability, and operational usefulness for builders who need evaluation methods that are small enough to maintain and concrete enough to use.
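
As a rough illustration of the workflow described above (scenario, expected behavior, failure modes, scorecard), the following minimal Python sketch may help. It is not taken from the paper or its datasets: the `Scenario` and `Result` classes, the `build_scorecard` function, every field name, and the 0.8 pass threshold are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Scenario:
    # Hypothetical schema: one evaluation scenario with its expected
    # behavior and the failure modes defined up front.
    scenario_id: str
    task: str
    expected_behavior: str
    failure_modes: list[str] = field(default_factory=list)


@dataclass
class Result:
    # Hypothetical record of one agent run against a scenario.
    scenario_id: str
    passed: bool
    observed_failures: list[str] = field(default_factory=list)


def build_scorecard(scenarios, results, pass_threshold=0.8):
    """Aggregate per-scenario results into a small readiness summary.

    pass_threshold is an assumed, illustrative cutoff, not a value
    recommended by the paper.
    """
    by_id = {r.scenario_id: r for r in results}
    # Only scenarios that actually have a recorded result are scored.
    evaluated = [by_id[s.scenario_id] for s in scenarios if s.scenario_id in by_id]
    pass_rate = sum(r.passed for r in evaluated) / len(evaluated) if evaluated else 0.0

    # Count how often each failure mode was observed.
    failure_counts = {}
    for r in evaluated:
        for mode in r.observed_failures:
            failure_counts[mode] = failure_counts.get(mode, 0) + 1

    return {
        "evaluated": len(evaluated),
        "pass_rate": pass_rate,
        "failure_counts": failure_counts,
        "ready": bool(evaluated) and pass_rate >= pass_threshold,
    }


# Made-up example data, for illustration only.
scenarios = [
    Scenario("s1", "Look up an order status", "Calls the order tool once",
             ["wrong_tool", "hallucinated_result"]),
    Scenario("s2", "Summarize a web page", "Opens the page before summarizing",
             ["wrong_tool", "skipped_step"]),
]
results = [
    Result("s1", True),
    Result("s2", False, ["wrong_tool"]),
]
card = build_scorecard(scenarios, results)
print(card)
```

In a sketch like this, the scorecard stays small enough to read at a glance: a pass rate, a tally of which predefined failure modes actually occurred, and a single readiness flag that a team could tune to its own risk tolerance.
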
- Kaggle dataset: Agent Eval Scenarios
- Kaggle notebook: building-a-lightweight-agent-eval-benchmark
- Hugging Face dataset mirror: agent-eval-scenarios
- Hugging Face Space: agent-eval-lab
- OpenRouter app page: Agent Eval Lab
- Modal endpoint: agent-eval-lab
Use the Zenodo DOI for citation:
Katta, M. R. (2026). Lightweight Evaluation and Operational Scorecards for Tool-Using AI Agents (Version v1). Zenodo. https://doi.org/10.5281/zenodo.20034550
The OSF Preprints / MetaArXiv submission is currently pending moderator approval. The SSRN submission is under review.