@mukundakatta/ai-eval-forge

ai-eval-forge is a zero-dependency eval harness for LLM and agent regression testing.

It scores model outputs against mixed checks: exact match, substring, regex, token F1, JSON validity, JSON field equality, citation coverage, and custom JavaScript expressions.

Install

npm install -g @mukundakatta/ai-eval-forge

CLI

aef score cases.json --format markdown

cases.json can be an array or JSONL:

[
  {
    "id": "refund-policy",
    "input": "Can I refund after 45 days?",
    "expected": "Refunds are available within 30 days.",
    "actual": "Refunds are only available within 30 days.",
    "checks": [
      { "type": "contains", "value": "30 days" },
      { "type": "token_f1", "min": 0.6 }
    ]
  }
]

Library

import { evaluateSuite } from "@mukundakatta/ai-eval-forge";

const result = evaluateSuite(cases);
console.log(result.summary.passRate);

Checks reference

Type	Required keys	Behavior
`exact`	`value` (or `expected` on the case)	Trimmed string equality.
`contains`	`value`	Substring match; pass `caseInsensitive: true` to lowercase both sides.
`regex`	`pattern` (and optional `flags`)	`new RegExp(pattern, flags).test(actual)`.
`token_f1`	optional `min` (default 0.5)	Word-token F1 between `actual` and `expected`.
`json_valid`	none	`JSON.parse(actual)` does not throw.
`json_field`	`path`, `equals`	Parse JSON, walk dot-path (supports array indices like `items.0.name`), deep-equal to `equals`.
`citations`	`expected` (string array), optional `min` (default 1.0)	Each id appears in `actual` as `[id]`, `(id)`, or bare token. Coverage = found/expected.
`expr`	`expr`	Arbitrary JS evaluated against `{actual, expected, input, case}`. Truthy = pass.

The expr check runs arbitrary JavaScript via new Function and is intentionally a power-user feature. Do not use it with untrusted cases.

Programmatic check registration

You can register custom checks at runtime:

import { evaluateSuite, registerCheck } from "@mukundakatta/ai-eval-forge";

registerCheck("len_at_least", (check, ctx) => {
  const min = check.min ?? 1;
  const ok = String(ctx.actual ?? "").length >= min;
  return { ok, score: ok ? 1 : 0, message: `len=${String(ctx.actual).length}` };
});

const result = evaluateSuite([
  {
    id: "long-enough",
    actual: "this is a long-enough answer",
    checks: [{ type: "len_at_least", min: 10 }],
  },
]);

console.log(result.summary.passRate); // 1

A check function receives (check, ctx) and returns { ok, score?, message? }. The ctx object exposes actual, expected, input, and case (the raw case object).

Name		Name	Last commit message	Last commit date
Latest commit History 9 Commits
.github/workflows		.github/workflows
src		src
test		test
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
package.json		package.json

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

@mukundakatta/ai-eval-forge

Install

CLI

Library

Checks reference

Programmatic check registration

About

Uh oh!

Releases 1

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

@mukundakatta/ai-eval-forge

Install

CLI

Library

Checks reference

Programmatic check registration

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 1

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages