The rapid development of large AI models has triggered a wave of regulatory activity worldwide. The EU AI Act (fully enforced by August 2026) introduces risk-based classification of AI systems, mandatory conformity assessments for high-risk AI, and transparency obligations. The Cyber Resilience Act (CRA, fully applied by December 2027) adds cybersecurity requirements across the full product lifecycle, with compliance triggered by commercial use of open-source components.
Despite this regulatory urgency, no existing open-source evaluation framework is simultaneously aligned with EU conformity assessments, covers both AI safety and cybersecurity, and remains vendor-neutral and community-governed. Model developers and enterprises face fragmented tools, inconsistent standards, and high compliance costs.
BAAI (Beijing Academy of Artificial Intelligence) has built a proven AI evaluation ecosystem including the Open Chinese LLM Leaderboard (343+ models), FlagEval-Debate (world's first multilingual debate evaluation), FlagEval-Arena (21 language + 33 multimodal models), FlagEvalMM (open-source multimodal framework), and has actively participated in the IEEE P3419 international standard for large model evaluation.
Eclipse PanEval is proposed to fill this gap — bringing a unified, regulation-aligned, community-governed evaluation framework to the Eclipse Foundation.
Eclipse PanEval provides a unified, vendor-neutral framework to evaluate AI models for capability, safety, and cybersecurity in line with EU regulations.
In-scope:
- A three-dimensional evaluation framework based on "Capacity – Task – Metrics"
- Coverage of 4 major model categories: language, multimodal, vision, and speech models
- Support for 40+ evaluation tasks, including task solving, coding, multi-turn QA, factuality, image-text QA, depth estimation, speech perception, and more
- Safety & robustness evaluation as a cross-cutting dimension across all model types
- Alignment with EU AI Act and CRA compliance requirements
- AI-assisted subjective evaluation to improve efficiency and objectivity
- Open leaderboard and evaluation platform (https://flageval.baai.ac.cn)
Out-of-scope:
- Model training or fine-tuning
- Deployment infrastructure for production AI systems
- Legal compliance certification (Eclipse PanEval provides evaluation tooling, not legal advice)
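The "AI-assisted subjective evaluation" item above could, for instance, take the form of an LLM-as-judge scheme in which several judge models score the same response and their scores are aggregated to reduce individual-judge bias. The sketch below is purely illustrative: the judge names, the 1–5 scale, and the aggregation function are assumptions, not FlagEval's actual design.

```python
def aggregate_judge_scores(scores: dict[str, float], scale: float = 5.0) -> float:
    """Average per-judge scores and normalise to [0, 1].

    `scores` maps a judge-model identifier to its raw score on `scale`.
    """
    if not scores:
        raise ValueError("no judge scores provided")
    return sum(scores.values()) / (len(scores) * scale)

# Three hypothetical judge models rating the same model answer on a 1-5 scale
scores = {"judge_a": 4.0, "judge_b": 5.0, "judge_c": 3.0}
print(aggregate_judge_scores(scores))  # 0.8
```

Averaging over multiple judges is one common way to make subjective scoring more reproducible; a production framework would also need calibration against human ratings.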
Eclipse PanEval is an open-source large model evaluation platform and framework, designed to establish scientific, impartial, and open evaluation benchmarks, methodologies, and toolsets. It comprehensively assesses foundation model performance across language, multimodal, vision, and speech domains.
Core framework: A three-dimensional evaluation system based on "Capacity – Task – Metrics":
- Capacity: defines the scope of model capabilities ("What to evaluate?")
- Task: the form used to assess model capabilities ("How to evaluate?")
- Metrics: quantitative assessment from multiple perspectives ("How to measure?")
Eclipse PanEval covers 4 major model categories and 40+ evaluation tasks, with Safety & Robustness as a cross-cutting evaluation dimension for all categories.
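To make the three-dimensional structure concrete, the following minimal sketch models the Capacity–Task–Metrics relationship as plain data classes. All class and field names here are hypothetical illustrations of the concept, not the FlagEval API.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Metric:
    """How to measure: a named quantitative score."""
    name: str

@dataclass
class Task:
    """How to evaluate: a concrete task probing one capacity."""
    name: str
    capacity: str                      # what to evaluate, e.g. "coding"
    metrics: list[Metric] = field(default_factory=list)

@dataclass
class EvaluationSuite:
    """A set of tasks grouped under one model category."""
    category: str                      # "language", "multimodal", "vision", "speech"
    tasks: list[Task] = field(default_factory=list)

    def capacities(self) -> set[str]:
        """The distinct capacities this suite covers."""
        return {t.capacity for t in self.tasks}

# Example: a minimal language-model suite with two tasks
suite = EvaluationSuite(
    category="language",
    tasks=[
        Task("multi_turn_qa", capacity="dialogue", metrics=[Metric("accuracy")]),
        Task("code_generation", capacity="coding", metrics=[Metric("pass@1")]),
    ],
)
print(sorted(suite.capacities()))  # ['coding', 'dialogue']
```

In this reading, a capacity answers "what to evaluate," each task answers "how to evaluate" it, and each metric answers "how to measure" the result, so one capacity can be probed by several tasks and one task scored by several metrics.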
Eclipse PanEval is licensed under Apache 2.0, which is fully compatible with the Eclipse Foundation's licensing policies. There are no known trademark conflicts with the name "PanEval." All code contributed to the project is original work developed by BAAI or has been cleared for open-source release. No third-party code with incompatible licenses is included. The project does not incorporate any proprietary or commercially restricted components.
The Eclipse Foundation is one of the world's leading open-source foundations, with a strong focus on vendor-neutral governance, enterprise adoption, and international community building. Eclipse PanEval aligns closely with Eclipse's mission for the following reasons:
1. Regulatory alignment: The Eclipse Foundation is Europe-based and well-positioned to support AI projects that align with EU AI Act and CRA compliance requirements — a core goal of Eclipse PanEval.
2. Vendor-neutral governance: Eclipse's transparent, community-driven governance model ensures that Eclipse PanEval remains independent, impartial, and trustworthy — essential qualities for an evaluation framework.
3. Enterprise reach: Eclipse's established relationships with enterprises and industry partners will accelerate Eclipse PanEval's adoption as a standard evaluation framework for organizations entering the European AI market.
4. Global developer community: Hosting Eclipse PanEval at Eclipse enables collaboration with a diverse, international developer community, supporting the project's goal of becoming a globally recognized, open evaluation standard.
Eclipse PanEval's long-term technical development focuses on the following areas:
Evaluation Capability Expansion:
- Progressively expand from large language model evaluation to comprehensive coverage of multimodal, vision, and speech models
- Introduce an embodied intelligence evaluation framework to support emerging AI application scenarios such as robotics and autonomous driving
- Build an agent evaluation system to address the trend of AI evolving from single models to complex systems
Compliance & Safety:
- Deeply integrate compliance evaluation tooling aligned with the EU AI Act and the Cyber Resilience Act (CRA)
- Continuously enhance safety and robustness evaluation dimensions to drive the adoption of "trustworthy AI" standards
Community & Ecosystem:
- Expand multilingual evaluation support to serve a global developer community
- Continue collaboration with international standards organizations including IEEE and AI Verify Foundation
- Establish Eclipse PanEval as the globally recognized open standard for large model evaluation
Initial code contribution is expected to be completed within the first month after project approval, migrating the existing FlagEval codebase to the Eclipse Foundation-hosted repository.
Month 3–6: Release the first official version (v1.0), including a complete large language model evaluation pipeline and leaderboard functionality, available for community testing.
Month 7–9: Release v2.0, adding Large Model Arena evaluation, an embodied intelligence evaluation framework, and multimodal (VLM) evaluation modules.
Month 10–15: Release v3.0, introducing agent evaluation and AI Application System Framework Evaluation. Begin integration work with EU regulatory compliance tooling.
Overall progress will follow the pace of community collaboration, with no hard deadlines imposed.
BAAI will contribute the following existing code and assets to the project:
Core Evaluation Framework — FlagEval: Already open-sourced on GitHub (FlagOpen/FlagEval) under the Apache 2.0 license. Copyright is held by the Beijing Academy of Artificial Intelligence (BAAI).
Evaluation Benchmark Datasets: A diverse collection of benchmark datasets covering the language, multimodal, and speech model categories.
FlagEvalMM: An open-source multimodal model evaluation framework (flageval-baai/FlagEvalMM on GitHub) under the Apache 2.0 license, empowering developers with flexible multimodal evaluation capabilities.
Key third-party dependencies and their licenses:
- PyTorch (BSD License)
- Transformers / HuggingFace (Apache 2.0)
- NumPy / SciPy (BSD License)
- Some benchmark datasets are licensed under CC BY-NC-SA 4.0
All third-party code dependencies are compatible with Apache 2.0 and present no license conflicts. The CC BY-NC-SA 4.0 datasets carry a non-commercial restriction and are distributed under their own terms, separately from the Apache 2.0-licensed code.