Security training dataset for AI coding assistants. 750 examples of real-world AI/ML vulnerabilities with vulnerable code, secure implementations, and defense-in-depth guidance across the OWASP LLM Top 10 2025 categories.
Built by perfecXion.ai.
| Dataset | Examples | Focus | Link |
|---|---|---|---|
| SecureCode | 2,185 | Unified dataset (web + AI/ML) | HuggingFace |
| SecureCode Web | 1,435 | Traditional web security (OWASP Top 10 2021) | HuggingFace |
| SecureCode AI/ML | 750 | AI/ML security (OWASP LLM Top 10 2025) | HuggingFace |
750 examples across all 10 OWASP LLM Top 10 2025 categories (75 per category):
| Category | Description | Examples |
|---|---|---|
| LLM01 | Prompt Injection | 75 |
| LLM02 | Sensitive Information Disclosure | 75 |
| LLM03 | Supply Chain Vulnerabilities | 75 |
| LLM04 | Data and Model Poisoning | 75 |
| LLM05 | Improper Output Handling | 75 |
| LLM06 | Excessive Agency | 75 |
| LLM07 | System Prompt Leakage | 75 |
| LLM08 | Vector and Embedding Weaknesses | 75 |
| LLM09 | Misinformation | 75 |
| LLM10 | Unbounded Consumption | 75 |
- Python (680): LangChain, OpenAI, Anthropic, HuggingFace, LlamaIndex, FastAPI, Flask, Django, ChromaDB, Pinecone, Qdrant, Weaviate, Milvus, FAISS, vLLM, CrewAI, AutoGen, Dify, Gradio, Streamlit, Chainlit, BentoML, Ray Serve, MLflow, W&B
- TypeScript (35): Vercel AI SDK, Next.js, Express
- JavaScript (30): React, Node.js, browser-based AI apps
- Other (5): Kotlin, Dart, Swift, Java, PHP
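The per-language counts above come from each example's `metadata.lang` field, so they can be recomputed locally. A minimal sketch (`lang_breakdown` is an illustrative helper name, not part of the dataset):

```python
from collections import Counter

def lang_breakdown(examples):
    """Count loaded examples per language using the metadata.lang field."""
    return Counter(e["metadata"]["lang"] for e in examples)
```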
| Metric | Value |
|---|---|
| Valid JSON | 750/750 |
| Average quality score | 93.8/100 |
| Score range | 92-99 |
| 4-turn conversations | 750/750 |
| Grounding tier | T2 (all) |
| Security assertions | 5+ per example |
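The quality metrics above can be re-derived from the files themselves, since every example carries a `quality_score` and must parse as valid JSON. A hedged sketch (the `summarize_quality` helper is hypothetical; it takes raw file contents as strings):

```python
import json

def summarize_quality(raw_texts):
    """Summarize quality metrics from raw JSON file contents.

    Returns (valid_json_count, mean_score, min_score, max_score);
    the score fields are None if nothing parsed.
    """
    scores = []
    valid = 0
    for text in raw_texts:
        try:
            example = json.loads(text)
        except json.JSONDecodeError:
            continue  # would indicate a corrupt example file
        valid += 1
        if "quality_score" in example:
            scores.append(example["quality_score"])
    if not scores:
        return valid, None, None, None
    return valid, sum(scores) / len(scores), min(scores), max(scores)
```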
Each file is a single JSON object:

```json
{
  "id": "llm01-descriptive-slug",
  "metadata": {
    "category": "OWASP LLM Top 10 2025 - LLM01: Prompt Injection",
    "subcategory": "Indirect Injection",
    "technique": "RAG Document Injection",
    "severity": "CRITICAL",
    "cwe": "CWE-74",
    "lang": "python",
    "owasp_llm_2025": "LLM01"
  },
  "context": {
    "description": "Vulnerability description",
    "impact": "Business and technical impact",
    "real_world_example": "Reference to documented incident"
  },
  "conversations": [
    {"role": "human", "content": "Developer question about building an AI system"},
    {"role": "assistant", "content": "Vulnerable code + secure implementation + defense-in-depth"},
    {"role": "human", "content": "Follow-up about testing and edge cases"},
    {"role": "assistant", "content": "Testing guidance, common mistakes, monitoring"}
  ],
  "validation": {
    "syntax_check": true,
    "security_logic_sound": true,
    "grounding_tier": "T2"
  },
  "security_assertions": ["5+ security property assertions"],
  "quality_score": 93,
  "references": [
    {"type": "cve", "id_or_url": "CVE-2024-XXXXX", "publisher": "NVD/MITRE"}
  ]
}
```

Each example follows a 4-turn conversation:
- Human: Developer asks how to build an AI feature
- Assistant: Shows vulnerable implementation, explains the risk, provides secure implementation with 5+ defense layers
- Human: Asks about testing, detection, or edge cases
- Assistant: Testing code, common mistakes to avoid, SAST/DAST guidance, monitoring recommendations
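Because the turns alternate strictly human/assistant, each conversation can be flattened into (prompt, response) training pairs. A minimal sketch under that assumption (`to_pairs` is a hypothetical helper, not part of the dataset tooling):

```python
def to_pairs(conversations):
    """Pair each human turn with the assistant turn that follows it.

    Assumes the strict human/assistant alternation described above;
    turns that break the pattern are skipped.
    """
    pairs = []
    for human, assistant in zip(conversations[0::2], conversations[1::2]):
        if human["role"] == "human" and assistant["role"] == "assistant":
            pairs.append((human["content"], assistant["content"]))
    return pairs
```

Each 4-turn example therefore yields two pairs: the initial question/secure-implementation exchange and the testing follow-up exchange.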
Load all examples from local files:

```python
import json
from pathlib import Path

examples = []
for path in Path("data").glob("*.jsonl"):
    with open(path) as f:
        examples.append(json.load(f))
print(f"Loaded {len(examples)} examples")
```

Or load directly from HuggingFace:

```python
from datasets import load_dataset

dataset = load_dataset("scthornton/securecode-aiml")
```

Filter by OWASP category:

```python
llm01 = [e for e in examples if e["metadata"]["owasp_llm_2025"] == "LLM01"]
print(f"Prompt injection examples: {len(llm01)}")
```

Iterate over conversation turns:

```python
for example in examples:
    for turn in example["conversations"]:
        if turn["role"] == "human":
            prompt = turn["content"]
        else:
            response = turn["content"]
```

| Dataset | Examples | Coverage | Status |
|---|---|---|---|
| SecureCode Web | 1,435 | Traditional web security (12 languages, 9 frameworks) | HuggingFace |
| SecureCode AI/ML | 750 | AI/ML security (OWASP LLM Top 10 2025, 30+ frameworks) | HuggingFace |
| SecureCode (Unified) | 2,185 | Everything combined with HF configs | HuggingFace |
This dataset is defensive security research. Every vulnerability example includes a corresponding secure implementation. The dataset teaches developers to identify and fix AI/ML security vulnerabilities.
Intended uses: Training AI coding assistants, security education, vulnerability research, red team preparation.
Out of scope: Offensive exploitation, automated attack generation, circumventing security controls.
See SECURITY.md for responsible disclosure guidelines.
```bibtex
@dataset{thornton2026securecodeaiml,
  title={SecureCode AI/ML: AI/ML Security Training Dataset for the OWASP LLM Top 10 2025},
  author={Thornton, Scott},
  year={2026},
  publisher={perfecXion.ai},
  url={https://huggingface.co/datasets/scthornton/securecode-aiml}
}
```

MIT License. See LICENSE.
Scott Thornton — AI Security Researcher
- Website: perfecxion.ai
- Email: [email protected]
- LinkedIn: linkedin.com/in/scthornton
- ORCID: 0009-0008-0491-0032
- GitHub: @scthornton
Security Issues: Please report via SECURITY.md