
feat(agents): add evaluation dataset creator #1279

Merged
WilliamBerryiii merged 25 commits into microsoft:main from bjcmit:feat/1267
Apr 24, 2026

Conversation

@bjcmit (Contributor) commented Apr 2, 2026

Description

This pull request adds a comprehensive new prompt, eval-dataset-creator.md, for generating evaluation datasets and documentation to support AI agent testing. The prompt guides users through a structured interview process to curate Q&A pairs, select evaluation metrics, and recommend tooling tailored to user skill level and agent characteristics. It also specifies the output directory structure and includes templates for all generated artifacts.

Key additions and improvements:

Evaluation Dataset Creation Workflow:

  • Introduces a multi-phase, interview-driven process for collecting agent context, capabilities, evaluation scenarios, and user requirements, ensuring high-quality and relevant dataset generation.
  • Mandates a review phase where sample Q&A pairs are validated with the user before finalizing the dataset.

Dataset and Documentation Artifacts:

  • Defines output structure in data/evaluation/ with separate subfolders for datasets (.json, .csv) and documentation (curation-notes.md, metric-selection.md, tool-recommendations.md).
  • Provides detailed JSON and CSV formats for the evaluation dataset, including metadata and balanced scenario distribution.
  • Supplies markdown templates for curation notes, metric selection, and tool recommendations, ensuring standardized and thorough documentation.

Tooling and Persona Guidance:

  • Recommends evaluation tooling (Microsoft Copilot Studio vs. Azure AI Foundry) matched to the stated persona (low-code vs. pro-code) and evaluation frequency.

Related Issue(s)

Closes #1267

Type of Change

Select all that apply:

Code & Documentation:

  • New feature (non-breaking change adding functionality)

AI Artifacts:

  • Reviewed contribution with prompt-builder agent and addressed all feedback
  • Copilot agent (.github/agents/*.agent.md)

Sample Prompts (for AI Artifact Contributions)

User Request:

# Standards review only (invoke the agent directly):
@eval-dataset-creator create an evaluation dataset 

Execution Flow:

Here’s a step-by-step breakdown of what happens when the Evaluation Dataset Creator agent is invoked, including tool usage and key decision points:


  1. Structured Interview (Phases 1–4)

Purpose: Gather all necessary context before generating any artifacts.

Phase 1: Agent Context

  • The agent asks six questions about the AI agent’s name, business scenario, KPIs, tasks, risks, and user adoption.
  • Decision Point: Wait for user responses before proceeding.

Phase 2: Agent Capabilities

  • Three questions about grounding sources, external tools/APIs, and response format.
  • Decision Point: Wait for user responses before proceeding.

Phase 3: Evaluation Scenarios

  • Five questions about typical, challenging, negative, and safety scenarios, plus limitations and topics to avoid.
  • Decision Point: Wait for user responses before proceeding.

Phase 4: Persona & Tooling

  • Two questions about development mode (low-code vs. pro-code) and evaluation frequency/type.
  • Decision Point: Wait for user responses before proceeding.

  2. Dataset Generation (Phase 5)
  • After the interview, the agent generates evaluation datasets:
    • JSON Format: Includes metadata, Q&A pairs, category, difficulty, tools expected, source references, and notes.
    • CSV Format: Same fields in flat form, with tools listed as a semicolon-delimited string (see the CSV sketch after this list).
  • Tool Usage: Writes files to data/evaluation/datasets/.
  • Decision Point: Ensures a minimum of 30 Q&A pairs with a balanced distribution across scenario types.

  3. Dataset Review & Feedback (Phase 6)
  • Presents 5–8 representative Q&A pairs (covering easy, hard, grounding, negative, safety).
  • Asks the user for feedback on each pair:
    • Is the expected response accurate?
    • Should it be more/less detailed?
    • Are elements missing or incorrect?
    • Should the pair be modified, kept, or removed?
  • Decision Point: Refines dataset based on feedback. If major changes are needed, offers to regenerate portions.

  4. Documentation & Finalization (Phase 7)
  • Generates three supporting documents in data/evaluation/docs/:
    • Curation Notes: Business context, scope, data sources, review process, dataset balance, maintenance schedule.
    • Metric Selection: Agent characteristics, selected metrics, definitions, rationale.
    • Tool Recommendations: Persona profile, recommended tool, comparison, getting started, next steps.
  • Tool Usage: Writes files to data/evaluation/docs/.
  • Decision Point: Presents summary of all artifacts for user validation.
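
For illustration, here is a hypothetical CSV row matching the format described in step 2. The exact column names are not specified in this PR, so the headers below are assumptions; note the semicolon-delimited tools field:

```csv
id,question,expected_response,category,difficulty,expected_tools,source_references,notes
1,"How do I submit a travel expense report?","Submit the report through the expense portal within 30 days, citing the travel policy.",typical,easy,policy-search;expense-api,travel-policy.md,Baseline happy-path question
```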

Decision Points & Tool Usage Summary

  • Interview: Structured Q&A, waits for user input before proceeding.
  • Dataset Generation: Automated file creation (JSON/CSV), ensures balance and completeness.
  • Review: Interactive feedback loop, offers regeneration if needed.
  • Documentation: Automated file creation for curation, metrics, and tooling.
  • Summary: Presents all artifacts for validation.

Output Artifacts:

data/evaluation/
├── datasets/
│   ├── <agent-name>-eval-dataset.json   # Full evaluation dataset (Q&A pairs + metadata)
│   └── <agent-name>-eval-dataset.csv    # Flat CSV version for Copilot Studio/manual review
└── docs/
    ├── <agent-name>-curation-notes.md        # Human-readable dataset rationale & scope
    ├── <agent-name>-metric-selection.md      # Metrics chosen + priorities + rationale
    └── <agent-name>-tool-recommendations.md  # MCS vs Azure AI Foundry guidance

data/evaluation/datasets/<agent-name>-eval-dataset.json (excerpt; the example pair and its field names are illustrative)

{
  "metadata": {
    "schema_version": "1",
    "agent_name": "example-agent",
    "created_date": "2026-04-02",
    "version": "1.0.0",
    "total_pairs": 30,
    "distribution": {
      "easy": 6,
      "grounding_source_checks": 3,
      "hard": 12,
      "negative": 6,
      "safety": 3
    },
    "persona": "pro-code",
    "evaluation_mode": ["manual", "batch"],
    "recommended_tool": "azure-ai-foundry"
  },
  "evaluation_pairs": [
    {
      "id": 1,
      "question": "How do I submit a travel expense report?",
      "expected_response": "Submit the report through the expense portal within 30 days, citing the travel policy.",
      "category": "typical",
      "difficulty": "easy",
      "expected_tools": ["policy-search", "expense-api"],
      "source_references": ["travel-policy.md"],
      "notes": "Baseline happy-path question."
    }
  ]
}

data/evaluation/docs/<agent-name>-curation-notes.md

# Curation Notes: Example Agent

## Business Context
Agent answers employee questions about expense and travel policy.

## Agent Scope
### In Scope
- Policy interpretation
- Step-by-step guidance
- Source citation

### Out of Scope
- Approvals
- Financial decisions

data/evaluation/docs/<agent-name>-metric-selection.md

# Metric Selection: Example Agent

## Selected Core Metrics
- Intent Resolution (High)
- Task Adherence (High)
- Groundedness (High)
- Response Completeness (Medium)

## Tool-Based Metrics
- Tool Call Accuracy (N/A)
- Latency (Medium)
- Token Cost (Medium)

data/evaluation/docs/<agent-name>-tool-recommendations.md

# Tool Recommendations: Example Agent

## Persona Profile
- Skill Level: Pro-Code Developer
- Evaluation Mode: Batch

## Recommended Tool
Azure AI Foundry

Selection Rationale:
Supports batch evaluation, groundedness metrics, and tool-call analysis.

Success Indicators:

  • All output artifacts exist and are non-empty
  • Datasets are formatted correctly, contain at least 30 pairs, have no empty expected_response fields, and the JSON and CSV contents match
  • Curation notes reflect business context and scope accurately
  • Metric priorities make sense for KPIs
  • Recommended tool matches the stated persona
  • Reality Check: Dataset imports into either Copilot Studio or Azure AI Foundry
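
A minimal sketch of how these indicators could be checked programmatically, assuming the JSON schema shown above and a hypothetical id column shared by both files (neither this script nor the exact pair field names ship with this PR):

```python
import csv
import json
from pathlib import Path

# Hypothetical file names following the <agent-name>-eval-dataset.* convention.
base = Path("data/evaluation/datasets")
data = json.loads((base / "example-agent-eval-dataset.json").read_text(encoding="utf-8"))
meta, pairs = data["metadata"], data["evaluation_pairs"]

# At least 30 pairs, and the metadata distribution must sum to the declared total.
assert len(pairs) >= 30, f"expected >= 30 pairs, got {len(pairs)}"
assert sum(meta["distribution"].values()) == meta["total_pairs"] == len(pairs)

# No empty expected_response fields.
assert all(p.get("expected_response", "").strip() for p in pairs)

# JSON and CSV should describe the same pairs (compared here by assumed 'id' key).
with (base / "example-agent-eval-dataset.csv").open(newline="", encoding="utf-8") as f:
    rows = list(csv.DictReader(f))
assert {r["id"] for r in rows} == {str(p["id"]) for p in pairs}, "JSON/CSV mismatch"

print("All dataset success indicators passed.")
```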

Testing

  • Ran /prompt-analyze 3 times with all findings addressed
  • Tested agent against an Out-of-Office (OOO) Rescheduler feature
  • All validation commands pass:
    • npm run lint:all
    • npm run lint:md-links
    • npm run validate:copyright ✅ (148/148 files, 100%)
    • npm run spell-check ✅ (281 files, 0 issues)
    • npm run plugin:generate ✅ (14 plugins, 0 errors)
    • npm run plugin:validate ✅ (0 errors)
    • npm run lint:collections-metadata ✅ (0 errors)

Checklist

Required Checks

  • Documentation is updated (if applicable)
  • Files follow existing naming conventions
  • Changes are backwards compatible (if applicable)
  • Tests added for new functionality (if applicable)

AI Artifact Contributions

  • Used /prompt-analyze to review contribution
  • Addressed all feedback from prompt-builder review
  • Verified contribution follows common standards and type-specific requirements

Required Automated Checks

The following validation commands must pass before merging:

  • Markdown linting: npm run lint:md
  • Spell checking: npm run spell-check
  • Frontmatter validation: npm run lint:frontmatter
  • Skill structure validation: npm run validate:skills
  • Link validation: npm run lint:md-links
  • PowerShell analysis: npm run lint:ps
  • Plugin freshness: npm run plugin:generate

Security Considerations

  • This PR does not contain any sensitive or NDA information
  • Any new dependencies have been reviewed for security issues
  • Security-related scripts follow the principle of least privilege

@bjcmit bjcmit requested a review from a team as a code owner April 2, 2026 17:20
@bjcmit bjcmit self-assigned this Apr 2, 2026
@WilliamBerryiii (Member) left a comment

Thank you for this PR, @bjcmit. The eval-dataset-creator agent is a solid addition to the data-science collection — the structured interview flow and dual-persona support are well thought out.

After review, there are a few suggested changes in the inline comments. Please take a look and let us know if you have any questions.

Comment thread .github/agents/data-science/eval-dataset-creator.agent.md Outdated
Comment thread .github/plugin/marketplace.json Outdated
Comment thread .github/agents/data-science/eval-dataset-creator.agent.md Outdated
codecov-commenter commented Apr 3, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 87.62%. Comparing base (4ca2bca) to head (0c90a1e).


@@            Coverage Diff             @@
##             main    #1279      +/-   ##
==========================================
- Coverage   87.63%   87.62%   -0.01%     
==========================================
  Files          65       65              
  Lines       10119    10119              
==========================================
- Hits         8868     8867       -1     
- Misses       1251     1252       +1     
| Flag | Coverage Δ |
|---|---|
| pester | 85.00% <ø> (-0.02%) ⬇️ |

Flags with carried forward coverage won't be shown.
see 1 file with indirect coverage changes


@bjcmit bjcmit requested a review from WilliamBerryiii April 8, 2026 21:39
@rezatnoMsirhC (Contributor) left a comment

Four non-blocking suggestions inline. The branch also needs to be rebased against main before merging.

Comment thread .github/agents/data-science/eval-dataset-creator.agent.md
Comment thread .github/agents/data-science/eval-dataset-creator.agent.md
Comment thread .github/agents/data-science/eval-dataset-creator.agent.md
Comment thread docs/docusaurus/src/data/__tests__/collectionCards.test.ts Outdated
@WilliamBerryiii (Member) commented

Cross referencing this with #1419 - I need to evaluate the overlap ...

@WilliamBerryiii WilliamBerryiii changed the title from "Add evaluation dataset creator" to "feat(agents): add evaluation dataset creator" Apr 23, 2026
@bjcmit bjcmit requested a review from rezatnoMsirhC April 24, 2026 01:06
@WilliamBerryiii WilliamBerryiii merged commit 7fa6e81 into microsoft:main Apr 24, 2026
51 checks passed
WilliamBerryiii pushed a commit that referenced this pull request Apr 24, 2026
## Pre-Release 3.3.101

### ✨ Features

- add removed maturity tier and retire owasp-docker (#1444)
- add evaluation dataset creator (#1279)
- align RAI planner with guide, remove scoring, improve UX (#1287)
- add PSGallery staleness check and BOM cleanup (#1379)
- ISA-95 network planner agent (#1177)
- auto-generate collection.md with maturity filtering (#1316)
- add folder-consistency check and standardize WARN outp… (#1350)
- add synth-data-generate prompt to data-science collection (#1419)
- add canonical deck workflow and customer-card rendering for design thinking (#1413)
- add Figma MCP integration for DT artifact export (#1222)
- introduce `owasp-docker` (#1245)
- replace hve-core-specific references with portable discovery-based language (#1335)
- introduce `owasp-cicd` (#1246)
- add secure-by-design knowledge skill (#1223)
- introduce `owasp-infrastructure` (#1244)
- introduce `owasp-mcp` (#1207)
- add OutputPath parameter to Invoke-LinkLanguageCheck.ps1 (#1229)
- add -OutputPath parameter to Validate-SkillStructure.ps1 (#1225)
- add maintainer-only skip-review label guard (#1293)
- add extension collections overview and integrate into getting started flow (#950)
- add agentic workflows for automated issue triage, implementation, PR review, dependency review, and doc-staleness detection (#1219)
- consolidate package-lock.json version sync into Update-VersionFiles.ps1 (#1240)
- add standards code review agent and full review orchestrator (#1174)
- standardize pytest-mock as Python mocking framework (#1170)
- add Jira backlog workflows and Jira/GitLab skills (#978)
- add centralized version bump script and supply-chain attestation (#1183)

### 🐛 Bug Fixes

- pin PowerShell-Yaml to 0.4.7 across all install sites (#1378)
- close fork-PR/workflow-file-PR secret-strip gap and normalize upload-artifact version (#1421)
- replace stream-based lookahead with array indexing in list-changed-files.sh (#1376)
- centralize ISO 8601 timestamp regex in CIHelpers (#1343)
- update stale documentation date in release-process.md (#1363)
- pin basic-ftp to 5.3.0 to resolve GHSA-rp42-5vxx-qpwr (#1374)
- add bot filter to dependency PR review workflow (#1362)
- resolve pip-audit findings in powerpoint, gitlab, and jira skill lock files (#1360)
- standardize Timestamp JSON key casing across all lint result files (#1314)
- add synchronize trigger to PR Review workflow (#1323)
- standardize timestamp in Validate-SkillStructure.ps1 to use Get-StandardTimestamp (#1280)
- add parallel subagent dispatch and structured JSON contracts to code-review-full (#1304)
- standardize timestamp in SecurityHelpers.psm1 to use Get-StandardTimestamp (#1284)
- standardize timestamps in Test-DependencyPinning.ps1 and SecurityClasses.psm1 (#1282)
- derive collection artifact counts from YAML at build time (#1275)
- standardize timestamp in FrontmatterValidation.psm1 to use Get-StandardTimestamp (#1285)
- standardize timestamp in Markdown-Link-Check.ps1 to use Get-StandardTimestamp (#1283)
- escape hyphens in Mermaid diagram on Collections page (#1262)
- add summary timestamp to PSScriptAnalyzer output (#1211)
- fix plugin compatibility and robustness for coding-standards code review agents (#1289)
- standardize timestamp in Test-CopyrightHeaders.ps1 to use Get-StandardTimestamp (#1278)
- standardize timestamp in Invoke-YamlLint.ps1 to use Get-StandardTimestamp (#1270)
- standardize timestamp in Invoke-LinkLanguageCheck.ps1 to use Get-StandardTimestamp (#1264)
- fix dependency-review path filters and sparse-checkout cone mode (#1259)
- replace invalid bare tool names with official tool identifiers (#1198)
- fix broken links and remove orphaned reference in code review docs (#1257)
- exclude Python env dirs from skill validation warnings (#1255)
- pin happy-dom and serialize-javascript to resolve Dependabot vulnerabilities (#1253)
- remove Mermaid diagram and add missing collection cards (#1247)
- disable MCP servers by default to prevent token limit errors (#1144)
- sync package-lock.json after pre-release version bump (#1236)
- separate mermaid node declarations and add dynamic diagram generation with tests (#1215)
- replace anchor links in meeting-analyst with bold text references (#1201)
- remove recursive symlinks in jira and gitlab skill directories (#1233)
- validate-installation scripts now check .github/skills directory (#1010) (#1206)
- resolve npm audit vulnerabilities via dependency overrides (#1200)
- add post-release triggers to scorecard workflow (#1186)
- add missing .md extensions to relative links in agent documentation (#1180)

### 📚 Documentation

- broaden Security Review description beyond OWASP (#1385)
- document maintainer advisory mode and skip-review label guard (#1386)
- document ExcludePaths/OutputPath for Invoke-LinkLanguageCheck (#1383)
- CLI getting-started: clarify plugin install commands as alternatives (-all vs base) (#1251)

### ♻️ Refactoring

- align agent and prompt folder names to collection identifier (#1210)

### 🔧 Maintenance

- pin PSScriptAnalyzer to 1.25.0 and sync stale workflow version comments (#1389)
- bump lxml from 6.0.2 to 6.1.0 in /.github/skills/experimental/powerpoint (#1424)
- bump @vscode/vsce from 3.7.1 to 3.9.1 in the npm-dependencies group (#1390)
- bump the github-actions group across 1 directory with 7 updates (#1391)
- bump follow-redirects from 1.15.11 to 1.16.0 in /docs/docusaurus (#1356)
- upgrade Node.js from 20 to 24 and bump cspell to v10 (#1353)
- bump basic-ftp from 5.2.0 to 5.2.1 (#1324)
- update github/gh-aw-actions requirement to 536ea1bad8c6715d098a9dc1afea8d403733acfe in the github-actions group across 1 directory (#1298)
- update security instruction attributions and compliance (#1294)
- bump the npm-dependencies group with 2 updates (#1297)
- pre-release 3.3.41 (#1252)
- streamline RAI Planner phase structure and documentation (#1273)
- bump happy-dom from 20.8.8 to 20.8.9 in /docs/docusaurus (#1237)
- pre-release 3.3.27 (#1191)
- bump pygments from 2.19.2 to 2.20.0 in /.github/skills/gitlab/gitlab (#1234)
- bump path-to-regexp from 0.1.12 to 0.1.13 in /docs/docusaurus (#1226)
- bump the github-actions group with 4 updates (#1231)
- add missing folders and alphabetize location lists (#1193)
- bump brace-expansion (#1224)
- bump handlebars from 4.7.8 to 4.7.9 in /docs/docusaurus (#1217)
- bump brace-expansion from 5.0.3 to 5.0.5 in /docs/docusaurus (#1213)
- pre-release 3.3.10 (#1187)
- bump markdownlint-cli2 from 0.21.0 to 0.22.0 in the npm-dependencies group (#1175)
- bump the github-actions group with 3 updates (#1176)
- pre-release 3.3.1 (#1165)

---
*Managed automatically by pre-release workflow.*

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

Development

Successfully merging this pull request may close these issues.

feat(skill): add evaluation dataset creator skill
