feat(agents): add evaluation dataset creator#1279
feat(agents): add evaluation dataset creator#1279WilliamBerryiii merged 25 commits intomicrosoft:mainfrom
Conversation
WilliamBerryiii
left a comment
There was a problem hiding this comment.
Thank you for this PR, @bjcmit. The eval-dataset-creator agent is a solid addition to the data-science collection — the structured interview flow and dual-persona support are well thought out.
After review, there are a few suggested changes in the inline comments. Please take a look and let us know if you have any questions.
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #1279 +/- ##
==========================================
- Coverage 87.63% 87.62% -0.01%
==========================================
Files 65 65
Lines 10119 10119
==========================================
- Hits 8868 8867 -1
- Misses 1251 1252 +1
Flags with carried forward coverage won't be shown. Click here to find out more. 🚀 New features to boost your workflow:
|
rezatnoMsirhC
left a comment
There was a problem hiding this comment.
Four non-blocking suggestions inline. The branch also needs to be rebased against main before merging.
|
Cross referencing this with #1419 - I need to evaluate the overlap ... |
## Pre-Release 3.3.101 ### ✨ Features - add removed maturity tier and retire owasp-docker (#1444) - add evaluation dataset creator (#1279) - align RAI planner with guide, remove scoring, improve UX (#1287) - add PSGallery staleness check and BOM cleanup (#1379) - ISA-95 network planner agent (#1177) - auto-generate collection.md with maturity filtering (#1316) - add folder-consistency check and standardize WARN outp… (#1350) - add synth-data-generate prompt to data-science collection (#1419) - add canonical deck workflow and customer-card rendering for design thinking (#1413) - add Figma MCP integration for DT artifact export (#1222) - introduce `owasp-docker` (#1245) - replace hve-core-specific references with portable discovery-based language (#1335) - introduce `owasp-cicd` (#1246) - add secure-by-design knowledge skill (#1223) - introduce `owasp-infrastructure` (#1244) - introduce `owasp-mcp` (#1207) - add OutputPath parameter to Invoke-LinkLanguageCheck.ps1 (#1229) - add -OutputPath parameter to Validate-SkillStructure.ps1 (#1225) - add maintainer-only skip-review label guard (#1293) - add extension collections overview and integrate into getting started flow (#950) - add agentic workflows for automated issue triage, implementation, PR review, dependency review, and doc-staleness detection (#1219) - consolidate package-lock.json version sync into Update-VersionFiles.ps1 (#1240) - add standards code review agent and full review orchestrator (#1174) - standardize pytest-mock as Python mocking framework (#1170) - add Jira backlog workflows and Jira/GitLab skills (#978) - add centralized version bump script and supply-chain attestation (#1183) ### 🐛 Bug Fixes - pin PowerShell-Yaml to 0.4.7 across all install sites (#1378) - close fork-PR/workflow-file-PR secret-strip gap and normalize upload-artifact version (#1421) - replace stream-based lookahead with array indexing in list-changed-files.sh (#1376) - centralize ISO 8601 timestamp regex in CIHelpers (#1343) - update stale documentation date in release-process.md (#1363) - pin basic-ftp to 5.3.0 to resolve GHSA-rp42-5vxx-qpwr (#1374) - add bot filter to dependency PR review workflow (#1362) - resolve pip-audit findings in powerpoint, gitlab, and jira skill lock files (#1360) - standardize Timestamp JSON key casing across all lint result files (#1314) - add synchronize trigger to PR Review workflow (#1323) - standardize timestamp in Validate-SkillStructure.ps1 to use Get-StandardTimestamp (#1280) - add parallel subagent dispatch and structured JSON contracts to code-review-full (#1304) - standardize timestamp in SecurityHelpers.psm1 to use Get-StandardTimestamp (#1284) - standardize timestamps in Test-DependencyPinning.ps1 and SecurityClasses.psm1 (#1282) - derive collection artifact counts from YAML at build time (#1275) - standardize timestamp in FrontmatterValidation.psm1 to use Get-StandardTimestamp (#1285) - standardize timestamp in Markdown-Link-Check.ps1 to use Get-StandardTimestamp (#1283) - escape hyphens in Mermaid diagram on Collections page (#1262) - add summary timestamp to PSScriptAnalyzer output (#1211) - fix plugin compatibility and robustness for coding-standards code review agents (#1289) - standardize timestamp in Test-CopyrightHeaders.ps1 to use Get-StandardTimestamp (#1278) - standardize timestamp in Invoke-YamlLint.ps1 to use Get-StandardTimestamp (#1270) - standardize timestamp in Invoke-LinkLanguageCheck.ps1 to use Get-StandardTimestamp (#1264) - fix dependency-review path filters and sparse-checkout cone mode (#1259) - replace invalid bare tool names with official tool identifiers (#1198) - fix broken links and remove orphaned reference in code review docs (#1257) - exclude Python env dirs from skill validation warnings (#1255) - pin happy-dom and serialize-javascript to resolve Dependabot vulnerabilities (#1253) - remove Mermaid diagram and add missing collection cards (#1247) - disable MCP servers by default to prevent token limit errors (#1144) - sync package-lock.json after pre-release version bump (#1236) - separate mermaid node declarations and add dynamic diagram generation with tests (#1215) - replace anchor links in meeting-analyst with bold text references (#1201) - remove recursive symlinks in jira and gitlab skill directories (#1233) - validate-installation scripts now check .github/skills directory (#1010) (#1206) - resolve npm audit vulnerabilities via dependency overrides (#1200) - add post-release triggers to scorecard workflow (#1186) - add missing .md extensions to relative links in agent documentation (#1180) ### 📚 Documentation - broaden Security Review description beyond OWASP (#1385) - document maintainer advisory mode and skip-review label guard (#1386) - document ExcludePaths/OutputPath for Invoke-LinkLanguageCheck (#1383) - CLI getting-started: clarify plugin install commands as alternatives (-all vs base) (#1251) ### ♻️ Refactoring - align agent and prompt folder names to collection identifier (#1210) ### 🔧 Maintenance - pin PSScriptAnalyzer to 1.25.0 and sync stale workflow version comments (#1389) - bump lxml from 6.0.2 to 6.1.0 in /.github/skills/experimental/powerpoint (#1424) - bump @vscode/vsce from 3.7.1 to 3.9.1 in the npm-dependencies group (#1390) - bump the github-actions group across 1 directory with 7 updates (#1391) - bump follow-redirects from 1.15.11 to 1.16.0 in /docs/docusaurus (#1356) - upgrade Node.js from 20 to 24 and bump cspell to v10 (#1353) - bump basic-ftp from 5.2.0 to 5.2.1 (#1324) - update github/gh-aw-actions requirement to 536ea1bad8c6715d098a9dc1afea8d403733acfe in the github-actions group across 1 directory (#1298) - update security instruction attributions and compliance (#1294) - bump the npm-dependencies group with 2 updates (#1297) - pre-release 3.3.41 (#1252) - streamline RAI Planner phase structure and documentation (#1273) - bump happy-dom from 20.8.8 to 20.8.9 in /docs/docusaurus (#1237) - pre-release 3.3.27 (#1191) - bump pygments from 2.19.2 to 2.20.0 in /.github/skills/gitlab/gitlab (#1234) - bump path-to-regexp from 0.1.12 to 0.1.13 in /docs/docusaurus (#1226) - bump the github-actions group with 4 updates (#1231) - add missing folders and alphabetize location lists (#1193) - bump brace-expansion (#1224) - bump handlebars from 4.7.8 to 4.7.9 in /docs/docusaurus (#1217) - bump brace-expansion from 5.0.3 to 5.0.5 in /docs/docusaurus (#1213) - pre-release 3.3.10 (#1187) - bump markdownlint-cli2 from 0.21.0 to 0.22.0 in the npm-dependencies group (#1175) - bump the github-actions group with 3 updates (#1176) - pre-release 3.3.1 (#1165) --- *Managed automatically by pre-release workflow.* Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Description
This pull request adds a comprehensive new prompt,
eval-dataset-creator.md, for generating evaluation datasets and documentation to support AI agent testing. The prompt guides users through a structured interview process to curate Q&A pairs, select evaluation metrics, and recommend tooling tailored to user skill level and agent characteristics. It also specifies the output directory structure and includes templates for all generated artifacts.Key additions and improvements:
Evaluation Dataset Creation Workflow:
Dataset and Documentation Artifacts:
data/evaluation/with separate subfolders for datasets (.json,.csv) and documentation (curation-notes.md,metric-selection.md,tool-recommendations.md).Tooling and Persona Guidance:
Related Issue(s)
Closes #1267
Type of Change
Select all that apply:
Code & Documentation:
Infrastructure & Configuration:
AI Artifacts:
prompt-builderagent and addressed all feedback.github/agents/*.agent.md)Sample Prompts (for AI Artifact Contributions)
User Request:
Execution Flow:
Here’s a step-by-step breakdown of what happens when the Evaluation Dataset Creator agent is invoked, including tool usage and key decision points:
Purpose: Gather all necessary context before generating any artifacts.
Phase 1: Agent Context**
Phase 2: Agent Capabilities
Phase 3: Evaluation Scenarios
Phase 4: Persona & Tooling
data/evaluation/datasets/.data/evaluation/docs/:data/evaluation/docs/.Decision Points & Tool Usage Summary
Output Artifacts:
data/evaluation/datasets/-eval-dataset.json
{ "metadata": { "schema_version": "1", "agent_name": "example-agent", "created_date": "2026-04-02", "version": "1.0.0", "total_pairs": 30, "distribution": { "easy": 6, "grounding_source_checks": 3, "hard": 12, "negative": 6, "safety": 3 }, "persona": "pro-code", "evaluation_mode": ["manual", "batch"], "recommended_tool": "azure-ai-foundry" }, "evaluation_pairs": [ {data/evaluation/docs/-curation-notes.md
data/evaluation/docs/-metric-selection.md
data/evaluation/docs/-tool-recommendations.md
Success Indicators:
Testing
/prompt-analyze3 times with all findings addressednpm run lint:all✅npm run lint:md-links✅npm run validate:copyright✅ (148/148 files, 100%)npm run spell-check✅ (281 files, 0 issues)npm run plugin:generate✅ (14 plugins, 0 errors)npm run plugin:validate✅ (0 errors)npm run lint:collections-metadata✅ (0 errors)Checklist
Required Checks
AI Artifact Contributions
/prompt-analyzeto review contributionprompt-builderreviewRequired Automated Checks
The following validation commands must pass before merging:
npm run lint:mdnpm run spell-checknpm run lint:frontmatternpm run validate:skillsnpm run lint:md-linksnpm run lint:psnpm run plugin:generateSecurity Considerations