Lightweight Evaluation and Operational Scorecards for Tool-Using AI Agents

This repository is the public artifact bundle for the preprint of the same title. Identifiers and submission pages:

DOI: 10.5281/zenodo.20034550
Zenodo record: zenodo.org/records/20034550
SSRN abstract page: papers.ssrn.com/sol3/papers.cfm?abstract_id=6718599
ORCID profile: 0009-0007-6071-3896

What is in this repo

lightweight-evaluation-and-operational-scorecards-preprint.pdf for the submission-ready manuscript
paper.md for the source draft
paper.bib for the current references
assets/workflow-figure.svg for the workflow figure
artifact-manifest.md and related-links.md for the public evaluation artifacts connected to the paper
submission-metadata.json for reusable submission metadata

Abstract

Tool-using AI agents are increasingly used in coding, browser automation, research assistance, and support workflows. In practice, however, many teams still evaluate these systems through isolated prompts, one-off demos, or broad benchmark references that do not translate well into deployment judgment. This paper presents a lightweight workflow for evaluating agent behavior that begins with scenario design, continues through explicit expected behavior and failure-mode definition, and ends with an operational scorecard that helps teams judge rollout readiness. The workflow is instantiated through compact public artifacts, including small datasets, interactive demo apps, and public analytics surfaces. The aim is not to compete with large benchmarks on scale. It is to improve repeatability, interpretability, and operational usefulness for builders who need evaluation methods that are small enough to maintain and concrete enough to use.
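The workflow described in the abstract (scenario, expected behavior, failure modes, scorecard) can be pictured with a small sketch. This is an illustration only, not the paper's schema or code: the names `Scenario`, `RunResult`, `build_scorecard`, the field names, the example scenarios, and the `pass_threshold` value of 0.8 are all assumptions made here for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Scenario:
    """One evaluation scenario (hypothetical schema, not taken from the paper)."""
    scenario_id: str
    task: str
    expected_behavior: str
    failure_modes: list[str] = field(default_factory=list)


@dataclass
class RunResult:
    """Reviewer judgment for one agent run on one scenario (hypothetical)."""
    scenario_id: str
    passed: bool
    observed_failures: list[str] = field(default_factory=list)


def build_scorecard(scenarios, results, pass_threshold=0.8):
    """Aggregate results into a simple rollout-readiness summary.

    Assumes at most one result per scenario. pass_threshold is an
    illustrative value, not one recommended by the paper.
    """
    known = {s.scenario_id for s in scenarios}
    failure_counts = {}
    passed = 0
    scored = 0
    for r in results:
        if r.scenario_id not in known:
            continue  # skip results for scenarios that were never defined
        scored += 1
        if r.passed:
            passed += 1
        for mode in r.observed_failures:
            failure_counts[mode] = failure_counts.get(mode, 0) + 1
    pass_rate = passed / scored if scored else 0.0
    return {
        "scenarios_defined": len(known),
        "scenarios_scored": scored,
        "pass_rate": pass_rate,
        "failure_counts": failure_counts,
        # Ready only if every defined scenario was scored and enough passed.
        "ready": scored == len(known) and pass_rate >= pass_threshold,
    }


# Made-up example data, for illustration only.
scenarios = [
    Scenario("s1", "Rename a function across a repo",
             "Edits every call site", ["partial_edit"]),
    Scenario("s2", "Look up an order status",
             "Calls the order tool once", ["wrong_tool", "hallucinated_status"]),
]
results = [
    RunResult("s1", True),
    RunResult("s2", False, ["wrong_tool"]),
]
card = build_scorecard(scenarios, results)
print(card)
```

Keeping the scorecard as a plain dictionary is a deliberate simplification: it stays easy to serialize and to compare between runs, which fits the abstract's emphasis on methods that are small enough to maintain.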

Public artifacts connected to the paper

Kaggle dataset: Agent Eval Scenarios
Kaggle notebook: building-a-lightweight-agent-eval-benchmark
Hugging Face dataset mirror: agent-eval-scenarios
Hugging Face Space: agent-eval-lab
OpenRouter app page: Agent Eval Lab
Modal endpoint: agent-eval-lab

Citation

Use the Zenodo DOI for citation:

Katta, M. R. (2026). Lightweight Evaluation and Operational Scorecards for Tool-Using AI Agents (Version v1). Zenodo. https://doi.org/10.5281/zenodo.20034550

Notes

The OSF Preprints / MetaArXiv submission is currently pending moderator approval. The SSRN submission is under review.

Folders and files

assets
docs
CITATION.cff
LICENSE
README.md
abstract.txt
artifact-manifest.md
author-bio.txt
cover-note.txt
keywords.txt
lightweight-evaluation-and-operational-scorecards-preprint.pdf
next-edits.md
paper.bib
paper.md
related-links.md
render_preprint_pdf.py
submission-metadata.json
zenodo-upload-checklist.md
