MukundaKatta/lightweight-agent-eval-paper

Lightweight Evaluation and Operational Scorecards for Tool-Using AI Agents

This repository is the public artifact bundle for the preprint 'Lightweight Evaluation and Operational Scorecards for Tool-Using AI Agents'.

What is in this repo

  • lightweight-evaluation-and-operational-scorecards-preprint.pdf for the submission-ready manuscript
  • paper.md for the source draft
  • paper.bib for the current references
  • assets/workflow-figure.svg for the workflow figure
  • artifact-manifest.md and related-links.md for the public evaluation artifacts connected to the paper
  • submission-metadata.json for reusable submission metadata

Abstract

Tool-using AI agents are increasingly used in coding, browser automation, research assistance, and support workflows. In practice, however, many teams still evaluate these systems through isolated prompts, one-off demos, or broad benchmark references that do not translate well into deployment judgment. This paper presents a lightweight workflow for evaluating agent behavior that begins with scenario design, continues through explicit expected behavior and failure-mode definition, and ends with an operational scorecard that helps teams judge rollout readiness. The workflow is instantiated through compact public artifacts, including small datasets, interactive demo apps, and public analytics surfaces. The aim is not to compete with large benchmarks on scale. It is to improve repeatability, interpretability, and operational usefulness for builders who need evaluation methods that are small enough to maintain and concrete enough to use.
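
The workflow in the abstract (scenario design, then expected behavior and failure-mode definition, then an operational scorecard) can be illustrated with a minimal Python sketch. This is not the paper's implementation: the `Scenario` and `RunResult` schemas, the `build_scorecard` helper, the example scenario names, and the 0.9 readiness threshold are all hypothetical, chosen only to show how the three stages might fit together.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Scenario:
    """One evaluation scenario (hypothetical schema, not the paper's)."""
    name: str
    expected_behavior: str
    failure_modes: List[str] = field(default_factory=list)


@dataclass
class RunResult:
    """Outcome of one agent run, as labeled by a reviewer or a checker."""
    scenario: str
    passed: bool
    observed_failure: Optional[str] = None  # ideally one of the scenario's failure modes


def build_scorecard(scenarios, results, ready_threshold=0.9):
    """Aggregate run results into a small scorecard dictionary.

    ready_threshold is an assumed cutoff; a real team would set its own.
    """
    known = {s.name for s in scenarios}
    # Ignore results that do not belong to a defined scenario.
    scored = [r for r in results if r.scenario in known]
    pass_rate = sum(r.passed for r in scored) / len(scored) if scored else 0.0
    failures = Counter(
        r.observed_failure for r in scored if not r.passed and r.observed_failure
    )
    return {
        "runs": len(scored),
        "pass_rate": pass_rate,
        "failure_counts": dict(failures),
        "rollout_ready": bool(scored) and pass_rate >= ready_threshold,
    }


# Illustrative data only; these scenarios and outcomes are invented.
scenarios = [
    Scenario("refund-lookup", "Calls the order tool once, then answers",
             ["wrong_tool", "hallucinated_result"]),
    Scenario("file-edit", "Edits only the requested file", ["unrelated_edit"]),
]
results = [
    RunResult("refund-lookup", True),
    RunResult("refund-lookup", False, "wrong_tool"),
    RunResult("file-edit", True),
    RunResult("file-edit", True),
]
card = build_scorecard(scenarios, results)
print(card)
```

Keeping failure modes as named labels rather than free text is one way to make the scorecard repeatable across runs: the same label can be counted over time, and the readiness decision reduces to a visible threshold that a team can argue about and adjust.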

Public artifacts connected to the paper

Citation

Use the Zenodo DOI for citation:

Katta, M. R. (2026). Lightweight Evaluation and Operational Scorecards for Tool-Using AI Agents (Version v1). Zenodo. https://doi.org/10.5281/zenodo.20034550

Notes

The OSF Preprints / MetaArXiv submission is currently pending moderator approval. The SSRN submission is under review.
