
feat(agents): add evaluation dataset creator #1279

Merged
WilliamBerryiii merged 25 commits into microsoft:main from bjcmit:feat/1267
Apr 24, 2026

Conversation

@bjcmit (Contributor) commented Apr 2, 2026

Description

This pull request adds a comprehensive new prompt, eval-dataset-creator.md, for generating evaluation datasets and documentation to support AI agent testing. The prompt guides users through a structured interview process to curate Q&A pairs, select evaluation metrics, and recommend tooling tailored to user skill level and agent characteristics. It also specifies the output directory structure and includes templates for all generated artifacts.

Key additions and improvements:

Evaluation Dataset Creation Workflow:

  • Introduces a multi-phase, interview-driven process for collecting agent context, capabilities, evaluation scenarios, and user requirements, ensuring high-quality and relevant dataset generation.
  • Mandates a review phase where sample Q&A pairs are validated with the user before finalizing the dataset.

Dataset and Documentation Artifacts:

  • Defines output structure in data/evaluation/ with separate subfolders for datasets (.json, .csv) and documentation (curation-notes.md, metric-selection.md, tool-recommendations.md).
  • Provides detailed JSON and CSV formats for the evaluation dataset, including metadata and balanced scenario distribution.
  • Supplies markdown templates for curation notes, metric selection, and tool recommendations, ensuring standardized and thorough documentation.

Tooling and Persona Guidance:

  • Recommends evaluation tooling (Microsoft Copilot Studio vs. Azure AI Foundry) matched to the stated persona (low-code vs. pro-code) and evaluation frequency.

Related Issue(s)

Closes #1267

Type of Change

Select all that apply:

Code & Documentation:

  • New feature (non-breaking change adding functionality)

AI Artifacts:

  • Reviewed contribution with prompt-builder agent and addressed all feedback
  • Copilot agent (.github/agents/*.agent.md)

Sample Prompts (for AI Artifact Contributions)

User Request:

# Standards review only (invoke the agent directly):
@eval-dataset-creator create an evaluation dataset 

Execution Flow:

Here’s a step-by-step breakdown of what happens when the Evaluation Dataset Creator agent is invoked, including tool usage and key decision points:


  1. Structured Interview (Phases 1–4)

Purpose: Gather all necessary context before generating any artifacts.

Phase 1: Agent Context

  • The agent asks six questions about the AI agent’s name, business scenario, KPIs, tasks, risks, and user adoption.
  • Decision Point: Wait for user responses before proceeding.

Phase 2: Agent Capabilities

  • Three questions about grounding sources, external tools/APIs, and response format.
  • Decision Point: Wait for user responses before proceeding.

Phase 3: Evaluation Scenarios

  • Five questions about typical, challenging, negative, and safety scenarios, plus limitations and topics to avoid.
  • Decision Point: Wait for user responses before proceeding.

Phase 4: Persona & Tooling

  • Two questions about development mode (low-code vs. pro-code) and evaluation frequency/type.
  • Decision Point: Wait for user responses before proceeding.

  2. Dataset Generation (Phase 5)
  • After the interview, the agent generates evaluation datasets:
    • JSON Format: Includes metadata, Q&A pairs, category, difficulty, tools expected, source references, and notes.
    • CSV Format: Same fields in flat form, with tools listed as a semicolon-delimited string (see the CSV sketch after this list).
  • Tool Usage: Writes files to data/evaluation/datasets/.
  • Decision Point: Ensures a minimum of 30 Q&A pairs with a balanced distribution across scenario types.

  3. Dataset Review & Feedback (Phase 6)
  • Presents 5–8 representative Q&A pairs (covering easy, hard, grounding, negative, safety).
  • Asks the user for feedback on each pair:
    • Is the expected response accurate?
    • Should it be more/less detailed?
    • Are elements missing or incorrect?
    • Should the pair be modified, kept, or removed?
  • Decision Point: Refines dataset based on feedback. If major changes are needed, offers to regenerate portions.

  4. Documentation & Finalization (Phase 7)
  • Generates three supporting documents in data/evaluation/docs/:
    • Curation Notes: Business context, scope, data sources, review process, dataset balance, maintenance schedule.
    • Metric Selection: Agent characteristics, selected metrics, definitions, rationale.
    • Tool Recommendations: Persona profile, recommended tool, comparison, getting started, next steps.
  • Tool Usage: Writes files to data/evaluation/docs/.
  • Decision Point: Presents summary of all artifacts for user validation.
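
For illustration, here is a hypothetical CSV row matching the format described in step 2. The exact column names are not specified in this PR, so the headers below are assumptions; note the semicolon-delimited tools field:

```csv
id,question,expected_response,category,difficulty,expected_tools,source_references,notes
1,"How do I submit a travel expense report?","Submit the report through the expense portal within 30 days, citing the travel policy.",typical,easy,policy-search;expense-api,travel-policy.md,Baseline happy-path question
```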

Decision Points & Tool Usage Summary

  • Interview: Structured Q&A, waits for user input before proceeding.
  • Dataset Generation: Automated file creation (JSON/CSV), ensures balance and completeness.
  • Review: Interactive feedback loop, offers regeneration if needed.
  • Documentation: Automated file creation for curation, metrics, and tooling.
  • Summary: Presents all artifacts for validation.

Output Artifacts:

data/evaluation/
├── datasets/
│   ├── <agent-name>-eval-dataset.json   # Full evaluation dataset (Q&A pairs + metadata)
│   └── <agent-name>-eval-dataset.csv    # Flat CSV version for Copilot Studio/manual review
└── docs/
    ├── <agent-name>-curation-notes.md        # Human-readable dataset rationale & scope
    ├── <agent-name>-metric-selection.md      # Metrics chosen + priorities + rationale
    └── <agent-name>-tool-recommendations.md  # MCS vs Azure AI Foundry guidance

data/evaluation/datasets/<agent-name>-eval-dataset.json (excerpt; the example pair and its field names are illustrative)

{
  "metadata": {
    "schema_version": "1",
    "agent_name": "example-agent",
    "created_date": "2026-04-02",
    "version": "1.0.0",
    "total_pairs": 30,
    "distribution": {
      "easy": 6,
      "grounding_source_checks": 3,
      "hard": 12,
      "negative": 6,
      "safety": 3
    },
    "persona": "pro-code",
    "evaluation_mode": ["manual", "batch"],
    "recommended_tool": "azure-ai-foundry"
  },
  "evaluation_pairs": [
    {
      "id": 1,
      "question": "How do I submit a travel expense report?",
      "expected_response": "Submit the report through the expense portal within 30 days, citing the travel policy.",
      "category": "typical",
      "difficulty": "easy",
      "expected_tools": ["policy-search", "expense-api"],
      "source_references": ["travel-policy.md"],
      "notes": "Baseline happy-path question."
    }
  ]
}

data/evaluation/docs/<agent-name>-curation-notes.md

# Curation Notes: Example Agent

## Business Context
Agent answers employee questions about expense and travel policy.

## Agent Scope
### In Scope
- Policy interpretation
- Step-by-step guidance
- Source citation

### Out of Scope
- Approvals
- Financial decisions

data/evaluation/docs/<agent-name>-metric-selection.md

# Metric Selection: Example Agent

## Selected Core Metrics
- Intent Resolution (High)
- Task Adherence (High)
- Groundedness (High)
- Response Completeness (Medium)

## Tool-Based Metrics
- Tool Call Accuracy (N/A)
- Latency (Medium)
- Token Cost (Medium)

data/evaluation/docs/<agent-name>-tool-recommendations.md

# Tool Recommendations: Example Agent

## Persona Profile
- Skill Level: Pro-Code Developer
- Evaluation Mode: Batch

## Recommended Tool
Azure AI Foundry

Selection Rationale:
Supports batch evaluation, groundedness metrics, and tool-call analysis.

Success Indicators:

  • All output artifacts exist and are non-empty
  • Datasets are formatted correctly, contain at least 30 pairs, have no empty expected_response fields, and the JSON and CSV contents match
  • Curation notes reflect business context and scope accurately
  • Metric priorities make sense for KPIs
  • Recommended tool matches the stated persona
  • Reality Check: Dataset imports into either Copilot Studio or Azure AI Foundry
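
A minimal sketch of how these indicators could be checked programmatically, assuming the JSON schema shown above and a hypothetical id column shared by both files (neither this script nor the exact pair field names ship with this PR):

```python
import csv
import json
from pathlib import Path

# Hypothetical file names following the <agent-name>-eval-dataset.* convention.
base = Path("data/evaluation/datasets")
data = json.loads((base / "example-agent-eval-dataset.json").read_text(encoding="utf-8"))
meta, pairs = data["metadata"], data["evaluation_pairs"]

# At least 30 pairs, and the metadata distribution must sum to the declared total.
assert len(pairs) >= 30, f"expected >= 30 pairs, got {len(pairs)}"
assert sum(meta["distribution"].values()) == meta["total_pairs"] == len(pairs)

# No empty expected_response fields.
assert all(p.get("expected_response", "").strip() for p in pairs)

# JSON and CSV should describe the same pairs (compared here by assumed 'id' key).
with (base / "example-agent-eval-dataset.csv").open(newline="", encoding="utf-8") as f:
    rows = list(csv.DictReader(f))
assert {r["id"] for r in rows} == {str(p["id"]) for p in pairs}, "JSON/CSV mismatch"

print("All dataset success indicators passed.")
```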

Testing

  • Ran /prompt-analyze 3 times with all findings addressed
  • Tested agent against an Out-of-Office (OOO) Rescheduler feature
  • All validation commands pass:
    • npm run lint:all
    • npm run lint:md-links
    • npm run validate:copyright ✅ (148/148 files, 100%)
    • npm run spell-check ✅ (281 files, 0 issues)
    • npm run plugin:generate ✅ (14 plugins, 0 errors)
    • npm run plugin:validate ✅ (0 errors)
    • npm run lint:collections-metadata ✅ (0 errors)

Checklist

Required Checks

  • Documentation is updated (if applicable)
  • Files follow existing naming conventions
  • Changes are backwards compatible (if applicable)
  • Tests added for new functionality (if applicable)

AI Artifact Contributions

  • Used /prompt-analyze to review contribution
  • Addressed all feedback from prompt-builder review
  • Verified contribution follows common standards and type-specific requirements

Required Automated Checks

The following validation commands must pass before merging:

  • Markdown linting: npm run lint:md
  • Spell checking: npm run spell-check
  • Frontmatter validation: npm run lint:frontmatter
  • Skill structure validation: npm run validate:skills
  • Link validation: npm run lint:md-links
  • PowerShell analysis: npm run lint:ps
  • Plugin freshness: npm run plugin:generate

Security Considerations

  • This PR does not contain any sensitive or NDA information
  • Any new dependencies have been reviewed for security issues
  • Security-related scripts follow the principle of least privilege

@bjcmit bjcmit requested a review from a team as a code owner April 2, 2026 17:20
@bjcmit bjcmit self-assigned this Apr 2, 2026
@WilliamBerryiii (Member) left a comment

Thank you for this PR, @bjcmit. The eval-dataset-creator agent is a solid addition to the data-science collection — the structured interview flow and dual-persona support are well thought out.

After review, there are a few suggested changes in the inline comments. Please take a look and let us know if you have any questions.

Comment thread .github/agents/data-science/eval-dataset-creator.agent.md Outdated
Comment thread .github/plugin/marketplace.json Outdated
Comment thread .github/agents/data-science/eval-dataset-creator.agent.md Outdated
codecov-commenter commented Apr 3, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 87.62%. Comparing base (4ca2bca) to head (0c90a1e).


@@            Coverage Diff             @@
##             main    #1279      +/-   ##
==========================================
- Coverage   87.63%   87.62%   -0.01%     
==========================================
  Files          65       65              
  Lines       10119    10119              
==========================================
- Hits         8868     8867       -1     
- Misses       1251     1252       +1     
| Flag | Coverage Δ |
|---|---|
| pester | 85.00% <ø> (-0.02%) ⬇️ |

Flags with carried forward coverage won't be shown.
see 1 file with indirect coverage changes


@bjcmit bjcmit requested a review from WilliamBerryiii April 8, 2026 21:39
@rezatnoMsirhC (Contributor) left a comment

Four non-blocking suggestions inline. The branch also needs to be rebased against main before merging.

Comment thread .github/agents/data-science/eval-dataset-creator.agent.md
Comment thread .github/agents/data-science/eval-dataset-creator.agent.md
Comment thread .github/agents/data-science/eval-dataset-creator.agent.md
Comment thread docs/docusaurus/src/data/__tests__/collectionCards.test.ts Outdated
@WilliamBerryiii (Member) commented

Cross referencing this with #1419 - I need to evaluate the overlap ...

@WilliamBerryiii WilliamBerryiii changed the title from "Add evaluation dataset creator" to "feat(agents): add evaluation dataset creator" Apr 23, 2026
@bjcmit bjcmit requested a review from rezatnoMsirhC April 24, 2026 01:06
@WilliamBerryiii WilliamBerryiii merged commit 7fa6e81 into microsoft:main Apr 24, 2026
51 checks passed
WilliamBerryiii pushed a commit that referenced this pull request Apr 24, 2026
## Pre-Release 3.3.101

### ✨ Features

- add removed maturity tier and retire owasp-docker (#1444)
- add evaluation dataset creator (#1279)
- align RAI planner with guide, remove scoring, improve UX (#1287)
- add PSGallery staleness check and BOM cleanup (#1379)
- ISA-95 network planner agent (#1177)
- auto-generate collection.md with maturity filtering (#1316)
- add folder-consistency check and standardize WARN outp… (#1350)
- add synth-data-generate prompt to data-science collection (#1419)
- add canonical deck workflow and customer-card rendering for design thinking (#1413)
- add Figma MCP integration for DT artifact export (#1222)
- introduce `owasp-docker` (#1245)
- replace hve-core-specific references with portable discovery-based language (#1335)
- introduce `owasp-cicd` (#1246)
- add secure-by-design knowledge skill (#1223)
- introduce `owasp-infrastructure` (#1244)
- introduce `owasp-mcp` (#1207)
- add OutputPath parameter to Invoke-LinkLanguageCheck.ps1 (#1229)
- add -OutputPath parameter to Validate-SkillStructure.ps1 (#1225)
- add maintainer-only skip-review label guard (#1293)
- add extension collections overview and integrate into getting started flow (#950)
- add agentic workflows for automated issue triage, implementation, PR review, dependency review, and doc-staleness detection (#1219)
- consolidate package-lock.json version sync into Update-VersionFiles.ps1 (#1240)
- add standards code review agent and full review orchestrator (#1174)
- standardize pytest-mock as Python mocking framework (#1170)
- add Jira backlog workflows and Jira/GitLab skills (#978)
- add centralized version bump script and supply-chain attestation (#1183)

### 🐛 Bug Fixes

- pin PowerShell-Yaml to 0.4.7 across all install sites (#1378)
- close fork-PR/workflow-file-PR secret-strip gap and normalize upload-artifact version (#1421)
- replace stream-based lookahead with array indexing in list-changed-files.sh (#1376)
- centralize ISO 8601 timestamp regex in CIHelpers (#1343)
- update stale documentation date in release-process.md (#1363)
- pin basic-ftp to 5.3.0 to resolve GHSA-rp42-5vxx-qpwr (#1374)
- add bot filter to dependency PR review workflow (#1362)
- resolve pip-audit findings in powerpoint, gitlab, and jira skill lock files (#1360)
- standardize Timestamp JSON key casing across all lint result files (#1314)
- add synchronize trigger to PR Review workflow (#1323)
- standardize timestamp in Validate-SkillStructure.ps1 to use Get-StandardTimestamp (#1280)
- add parallel subagent dispatch and structured JSON contracts to code-review-full (#1304)
- standardize timestamp in SecurityHelpers.psm1 to use Get-StandardTimestamp (#1284)
- standardize timestamps in Test-DependencyPinning.ps1 and SecurityClasses.psm1 (#1282)
- derive collection artifact counts from YAML at build time (#1275)
- standardize timestamp in FrontmatterValidation.psm1 to use Get-StandardTimestamp (#1285)
- standardize timestamp in Markdown-Link-Check.ps1 to use Get-StandardTimestamp (#1283)
- escape hyphens in Mermaid diagram on Collections page (#1262)
- add summary timestamp to PSScriptAnalyzer output (#1211)
- fix plugin compatibility and robustness for coding-standards code review agents (#1289)
- standardize timestamp in Test-CopyrightHeaders.ps1 to use Get-StandardTimestamp (#1278)
- standardize timestamp in Invoke-YamlLint.ps1 to use Get-StandardTimestamp (#1270)
- standardize timestamp in Invoke-LinkLanguageCheck.ps1 to use Get-StandardTimestamp (#1264)
- fix dependency-review path filters and sparse-checkout cone mode (#1259)
- replace invalid bare tool names with official tool identifiers (#1198)
- fix broken links and remove orphaned reference in code review docs (#1257)
- exclude Python env dirs from skill validation warnings (#1255)
- pin happy-dom and serialize-javascript to resolve Dependabot vulnerabilities (#1253)
- remove Mermaid diagram and add missing collection cards (#1247)
- disable MCP servers by default to prevent token limit errors (#1144)
- sync package-lock.json after pre-release version bump (#1236)
- separate mermaid node declarations and add dynamic diagram generation with tests (#1215)
- replace anchor links in meeting-analyst with bold text references (#1201)
- remove recursive symlinks in jira and gitlab skill directories (#1233)
- validate-installation scripts now check .github/skills directory (#1010) (#1206)
- resolve npm audit vulnerabilities via dependency overrides (#1200)
- add post-release triggers to scorecard workflow (#1186)
- add missing .md extensions to relative links in agent documentation (#1180)

### 📚 Documentation

- broaden Security Review description beyond OWASP (#1385)
- document maintainer advisory mode and skip-review label guard (#1386)
- document ExcludePaths/OutputPath for Invoke-LinkLanguageCheck (#1383)
- CLI getting-started: clarify plugin install commands as alternatives (-all vs base) (#1251)

### ♻️ Refactoring

- align agent and prompt folder names to collection identifier (#1210)

### 🔧 Maintenance

- pin PSScriptAnalyzer to 1.25.0 and sync stale workflow version comments (#1389)
- bump lxml from 6.0.2 to 6.1.0 in /.github/skills/experimental/powerpoint (#1424)
- bump @vscode/vsce from 3.7.1 to 3.9.1 in the npm-dependencies group (#1390)
- bump the github-actions group across 1 directory with 7 updates (#1391)
- bump follow-redirects from 1.15.11 to 1.16.0 in /docs/docusaurus (#1356)
- upgrade Node.js from 20 to 24 and bump cspell to v10 (#1353)
- bump basic-ftp from 5.2.0 to 5.2.1 (#1324)
- update github/gh-aw-actions requirement to 536ea1bad8c6715d098a9dc1afea8d403733acfe in the github-actions group across 1 directory (#1298)
- update security instruction attributions and compliance (#1294)
- bump the npm-dependencies group with 2 updates (#1297)
- pre-release 3.3.41 (#1252)
- streamline RAI Planner phase structure and documentation (#1273)
- bump happy-dom from 20.8.8 to 20.8.9 in /docs/docusaurus (#1237)
- pre-release 3.3.27 (#1191)
- bump pygments from 2.19.2 to 2.20.0 in /.github/skills/gitlab/gitlab (#1234)
- bump path-to-regexp from 0.1.12 to 0.1.13 in /docs/docusaurus (#1226)
- bump the github-actions group with 4 updates (#1231)
- add missing folders and alphabetize location lists (#1193)
- bump brace-expansion (#1224)
- bump handlebars from 4.7.8 to 4.7.9 in /docs/docusaurus (#1217)
- bump brace-expansion from 5.0.3 to 5.0.5 in /docs/docusaurus (#1213)
- pre-release 3.3.10 (#1187)
- bump markdownlint-cli2 from 0.21.0 to 0.22.0 in the npm-dependencies group (#1175)
- bump the github-actions group with 3 updates (#1176)
- pre-release 3.3.1 (#1165)

---
*Managed automatically by pre-release workflow.*

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

Development

Successfully merging this pull request may close these issues.

feat(skill): add evaluation dataset creator skill
