
SecureCode AI/ML

A security training dataset for AI coding assistants: 750 examples of real-world AI/ML vulnerabilities, each pairing vulnerable code with a secure implementation and defense-in-depth guidance across the OWASP LLM Top 10 2025 categories.

Built by perfecXion.ai.

SecureCode Dataset Family

| Dataset | Examples | Focus | Link |
|---|---|---|---|
| SecureCode | 2,185 | Unified dataset (web + AI/ML) | HuggingFace |
| SecureCode Web | 1,435 | Traditional web security (OWASP Top 10 2021) | HuggingFace |
| SecureCode AI/ML | 750 | AI/ML security (OWASP LLM Top 10 2025) | HuggingFace |

Dataset

750 examples across all 10 OWASP LLM Top 10 2025 categories (75 per category):

| Category | Description | Examples |
|---|---|---|
| LLM01 | Prompt Injection | 75 |
| LLM02 | Sensitive Information Disclosure | 75 |
| LLM03 | Supply Chain Vulnerabilities | 75 |
| LLM04 | Data and Model Poisoning | 75 |
| LLM05 | Improper Output Handling | 75 |
| LLM06 | Excessive Agency | 75 |
| LLM07 | System Prompt Leakage | 75 |
| LLM08 | Vector and Embedding Weaknesses | 75 |
| LLM09 | Misinformation | 75 |
| LLM10 | Unbounded Consumption | 75 |

Languages and Frameworks

  • Python (680): LangChain, OpenAI, Anthropic, HuggingFace, LlamaIndex, FastAPI, Flask, Django, ChromaDB, Pinecone, Qdrant, Weaviate, Milvus, FAISS, vLLM, CrewAI, AutoGen, Dify, Gradio, Streamlit, Chainlit, BentoML, Ray Serve, MLflow, W&B
  • TypeScript (35): Vercel AI SDK, Next.js, Express
  • JavaScript (30): React, Node.js, browser-based AI apps
  • Other (5): Kotlin, Dart, Swift, Java, PHP

Quality

| Metric | Value |
|---|---|
| Valid JSON | 750/750 |
| Average quality score | 93.8/100 |
| Score range | 92-99 |
| 4-turn conversations | 750/750 |
| Grounding tier | T2 (all) |
| Security assertions | 5+ per example |
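The aggregate metrics above can be recomputed from the `quality_score` field of loaded examples. A minimal sketch, with a small inline list standing in for the real dataset:

```python
# Recompute the quality metrics from loaded examples.
# The inline list below is a placeholder for the 750 real examples.
examples = [{"quality_score": 93}, {"quality_score": 99}, {"quality_score": 92}]

scores = [e["quality_score"] for e in examples]
average = sum(scores) / len(scores)
print(f"Average quality score: {average:.1f}")
print(f"Score range: {min(scores)}-{max(scores)}")
```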

Schema

Each file is a single JSON object:

{
  "id": "llm01-descriptive-slug",
  "metadata": {
    "category": "OWASP LLM Top 10 2025 - LLM01: Prompt Injection",
    "subcategory": "Indirect Injection",
    "technique": "RAG Document Injection",
    "severity": "CRITICAL",
    "cwe": "CWE-74",
    "lang": "python",
    "owasp_llm_2025": "LLM01"
  },
  "context": {
    "description": "Vulnerability description",
    "impact": "Business and technical impact",
    "real_world_example": "Reference to documented incident"
  },
  "conversations": [
    {"role": "human", "content": "Developer question about building an AI system"},
    {"role": "assistant", "content": "Vulnerable code + secure implementation + defense-in-depth"},
    {"role": "human", "content": "Follow-up about testing and edge cases"},
    {"role": "assistant", "content": "Testing guidance, common mistakes, monitoring"}
  ],
  "validation": {
    "syntax_check": true,
    "security_logic_sound": true,
    "grounding_tier": "T2"
  },
  "security_assertions": ["5+ security property assertions"],
  "quality_score": 93,
  "references": [
    {"type": "cve", "id_or_url": "CVE-2024-XXXXX", "publisher": "NVD/MITRE"}
  ]
}
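A quick structural check against this schema can catch malformed examples before training. A minimal sketch; `check_example` is a hypothetical helper for illustration, not part of the dataset tooling:

```python
# Required fields taken from the schema shown above.
REQUIRED_TOP_LEVEL = {
    "id", "metadata", "context", "conversations",
    "validation", "security_assertions", "quality_score", "references",
}
REQUIRED_METADATA = {
    "category", "subcategory", "technique", "severity",
    "cwe", "lang", "owasp_llm_2025",
}

def check_example(example: dict) -> list[str]:
    """Return a list of problems found; an empty list means the example passes."""
    problems = []
    missing = REQUIRED_TOP_LEVEL - example.keys()
    if missing:
        problems.append(f"missing top-level keys: {sorted(missing)}")
    missing_meta = REQUIRED_METADATA - example.get("metadata", {}).keys()
    if missing_meta:
        problems.append(f"missing metadata keys: {sorted(missing_meta)}")
    if len(example.get("conversations", [])) != 4:
        problems.append("expected a 4-turn conversation")
    return problems
```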

Conversation Format

Each example follows a 4-turn conversation:

  1. Human: Developer asks how to build an AI feature
  2. Assistant: Shows vulnerable implementation, explains the risk, provides secure implementation with 5+ defense layers
  3. Human: Asks about testing, detection, or edge cases
  4. Assistant: Testing code, common mistakes to avoid, SAST/DAST guidance, monitoring recommendations

Usage

Load with Python

import json
from pathlib import Path

# Each file under data/ holds a single JSON object (see Schema),
# so json.load reads one complete example per file.
examples = []
for path in Path("data").glob("*.jsonl"):
    with open(path) as f:
        examples.append(json.load(f))

print(f"Loaded {len(examples)} examples")

Load with HuggingFace Datasets

from datasets import load_dataset

dataset = load_dataset("scthornton/securecode-aiml")

Filter by Category

llm01 = [e for e in examples if e["metadata"]["owasp_llm_2025"] == "LLM01"]
print(f"Prompt injection examples: {len(llm01)}")
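The same metadata field supports a per-category tally, which should show 75 examples in each of the ten categories on the full dataset. A sketch, with a small inline list standing in for the loaded `examples`:

```python
from collections import Counter

# Tally examples per OWASP LLM category. The inline list below is a
# placeholder for examples loaded as shown earlier.
examples = [
    {"metadata": {"owasp_llm_2025": "LLM01"}},
    {"metadata": {"owasp_llm_2025": "LLM01"}},
    {"metadata": {"owasp_llm_2025": "LLM05"}},
]
counts = Counter(e["metadata"]["owasp_llm_2025"] for e in examples)
print(counts.most_common())
```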

Extract Training Pairs

pairs = []
for example in examples:
    turns = example["conversations"]
    # Turns alternate human/assistant, so pair each human turn with
    # the assistant turn that follows it.
    for human, assistant in zip(turns[0::2], turns[1::2]):
        pairs.append((human["content"], assistant["content"]))

print(f"Extracted {len(pairs)} training pairs")
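Extracted pairs can then be written out as JSONL for fine-tuning pipelines. A sketch; the `prompt`/`response` field names and the `train_pairs.jsonl` filename are assumptions for illustration, not a project convention:

```python
import json

# One inline pair stands in for pairs extracted as shown above.
pairs = [("How do I sanitize RAG documents?", "Validate and strip untrusted content...")]

# Write one JSON object per line; field names are illustrative.
with open("train_pairs.jsonl", "w") as f:
    for prompt, response in pairs:
        f.write(json.dumps({"prompt": prompt, "response": response}) + "\n")
```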

Version History

| Dataset | Examples | Coverage | Status |
|---|---|---|---|
| SecureCode Web | 1,435 | Traditional web security (12 languages, 9 frameworks) | HuggingFace |
| SecureCode AI/ML | 750 | AI/ML security (OWASP LLM Top 10 2025, 30+ frameworks) | HuggingFace |
| SecureCode (Unified) | 2,185 | Everything combined with HF configs | HuggingFace |

Ethics and Intended Use

This dataset supports defensive security research. Every vulnerability example includes a corresponding secure implementation, so the dataset teaches developers to identify and fix AI/ML security vulnerabilities rather than to exploit them.

Intended uses: Training AI coding assistants, security education, vulnerability research, red team preparation.

Out of scope: Offensive exploitation, automated attack generation, circumventing security controls.

See SECURITY.md for responsible disclosure guidelines.

Citation

@dataset{thornton2026securecodeaiml,
  title={SecureCode AI/ML: AI/ML Security Training Dataset for the OWASP LLM Top 10 2025},
  author={Thornton, Scott},
  year={2026},
  publisher={perfecXion.ai},
  url={https://huggingface.co/datasets/scthornton/securecode-aiml}
}

License

MIT License. See LICENSE.


Contact

Scott Thornton — AI Security Researcher

Security Issues: Please report via SECURITY.md
