Eclipse PanEval

Tuesday, March 24, 2026 - 04:47 by Qigang Zhu
This proposal is in the Project Proposal Phase (as defined in the Eclipse Development Process) and is written to declare its intent and scope. We solicit additional participation and input from the community. Please log in and add your feedback in the comments section.
Proposal State: Community Review
Background

The rapid development of large AI models has triggered a wave of regulatory activity worldwide. The EU AI Act (fully enforced by August 2026) introduces risk-based classification of AI systems, mandatory conformity assessments for high-risk AI, and transparency obligations. The Cyber Resilience Act (CRA, fully applied by December 2027) adds cybersecurity requirements across the full product lifecycle, with compliance triggered by commercial use of open-source components.

Despite this regulatory urgency, no existing open-source evaluation framework is simultaneously aligned with EU conformity assessments, covers both AI safety and cybersecurity, and remains vendor-neutral and community-governed. Model developers and enterprises face fragmented tools, inconsistent standards, and high compliance costs.

BAAI (Beijing Academy of Artificial Intelligence) has built a proven AI evaluation ecosystem including the Open Chinese LLM Leaderboard (343+ models), FlagEval-Debate (world's first multilingual debate evaluation), FlagEval-Arena (21 language + 33 multimodal models), FlagEvalMM (open-source multimodal framework), and has actively participated in the IEEE P3419 international standard for large model evaluation.

Eclipse PanEval is proposed to fill this gap — bringing a unified, regulation-aligned, community-governed evaluation framework to the Eclipse Foundation.

Scope

Eclipse PanEval provides a unified, vendor-neutral framework to evaluate AI models for capability, safety, and cybersecurity in line with EU regulations.

In-scope:
- A three-dimensional evaluation framework based on "Capacity – Task – Metrics"
- Coverage of four major model categories: language, multimodal, vision, and speech models
- Support for 40+ evaluation tasks, including task solving, coding, multi-turn QA, factuality, image-text QA, depth estimation, and speech perception
- Safety & robustness evaluation as a cross-cutting dimension across all model types
- Alignment with EU AI Act and CRA compliance requirements
- AI-assisted subjective evaluation to improve efficiency and objectivity
- Open leaderboard and evaluation platform (https://flageval.baai.ac.cn)

Out-of-scope:
- Model training or fine-tuning
- Deployment infrastructure for production AI systems
- Legal compliance certification (Eclipse PanEval provides evaluation tooling, not legal advice)

Description

Eclipse PanEval is an open-source large model evaluation platform and framework, designed to establish scientific, impartial, and open evaluation benchmarks, methodologies, and toolsets. It comprehensively assesses foundation model performance across language, multimodal, vision, and speech domains.

Core framework: A three-dimensional evaluation system based on "Capacity – Task – Metrics":
- Capacity: defines the scope of model capabilities ("What to evaluate?")
- Task: the form used to assess model capabilities ("How to evaluate?")
- Metrics: quantitative assessment from multiple perspectives ("How to measure?")

Eclipse PanEval covers four major model categories and 40+ evaluation tasks, with Safety & Robustness as a cross-cutting evaluation dimension for all categories.
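The three-dimensional structure above can be illustrated with a minimal sketch. Note that all class names, fields, and the `evaluate` helper below are hypothetical, chosen only to show how a capacity groups tasks and each task pairs examples with metrics; they are not PanEval's actual API.

```python
from dataclasses import dataclass
from typing import Callable

# Illustrative only: these names are NOT PanEval's real API.

@dataclass
class Metric:
    """How to measure: scores a (prediction, reference) pair."""
    name: str
    score: Callable[[str, str], float]

@dataclass
class Task:
    """How to evaluate: a task form with examples and metrics."""
    name: str
    examples: list[tuple[str, str]]  # (prompt, reference) pairs
    metrics: list[Metric]

@dataclass
class Capacity:
    """What to evaluate: a capability scope grouping tasks."""
    name: str
    tasks: list[Task]

def evaluate(capacity: Capacity, model: Callable[[str], str]) -> dict[str, float]:
    """Average each metric over every example of every task."""
    results: dict[str, float] = {}
    for task in capacity.tasks:
        for metric in task.metrics:
            scores = [metric.score(model(prompt), ref)
                      for prompt, ref in task.examples]
            results[f"{capacity.name}/{task.name}/{metric.name}"] = (
                sum(scores) / len(scores))
    return results

# Toy usage: an exact-match metric on a one-example QA task.
exact = Metric("exact_match",
               lambda pred, ref: float(pred.strip() == ref.strip()))
qa = Task("single-turn QA", [("Capital of France?", "Paris")], [exact])
knowledge = Capacity("knowledge", [qa])
print(evaluate(knowledge, lambda prompt: "Paris"))
# {'knowledge/single-turn QA/exact_match': 1.0}
```

The point of the sketch is the separation of concerns: adding a new model category or task does not change how metrics are defined or aggregated.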

Why Here?

The Eclipse Foundation is one of the world's leading open-source foundations, with a strong focus on vendor-neutral governance, enterprise adoption, and international community building. Eclipse PanEval aligns closely with Eclipse's mission for the following reasons:

1. Regulatory alignment: The Eclipse Foundation is Europe-based and well-positioned to support AI projects that align with EU AI Act and CRA compliance requirements — a core goal of Eclipse PanEval.

2. Vendor-neutral governance: Eclipse's transparent, community-driven governance model ensures that Eclipse PanEval remains independent, impartial, and trustworthy — essential qualities for an evaluation framework.

3. Enterprise reach: Eclipse's established relationships with enterprises and industry partners will accelerate Eclipse PanEval's adoption as a standard evaluation framework for organizations entering the European AI market.

4. Global developer community: Hosting Eclipse PanEval at Eclipse enables collaboration with a diverse, international developer community, supporting the project's goal of becoming a globally recognized, open evaluation standard.

Future Work

Eclipse PanEval's long-term technical development focuses on the following areas:

Evaluation Capability Expansion:
- Progressively expand from large language model evaluation to comprehensive coverage of multimodal, vision, and speech models
- Introduce an embodied intelligence evaluation framework to support emerging AI application scenarios such as robotics and autonomous driving
- Build an agent evaluation system to address the trend of AI evolving from single models to complex systems

Compliance & Safety:
- Deeply integrate compliance evaluation tooling aligned with the EU AI Act and the Cyber Resilience Act (CRA)
- Continuously enhance safety and robustness evaluation dimensions to drive the adoption of "trustworthy AI" standards

Community & Ecosystem:
- Expand multilingual evaluation support to serve a global developer community
- Continue collaboration with international standards organizations including IEEE and AI Verify Foundation
- Establish Eclipse PanEval as the globally recognized open standard for large model evaluation

Project Scheduling

Initial code contribution is expected to be completed within the first month after project approval, migrating the existing FlagEval codebase to the Eclipse Foundation-hosted repository.

Month 3–6: Release the first official version (v1.0), including a complete large language model evaluation pipeline and leaderboard functionality, available for community testing.

Month 7–9: Release v2.0, adding Large Model Arena evaluation, an embodied intelligence evaluation framework, and multimodal (VLM) evaluation modules.

Month 10–15: Release v3.0, introducing agent evaluation and AI Application System Framework Evaluation. Begin integration work with EU regulatory compliance tooling.

Overall progress will follow the pace of community collaboration, with no hard deadlines imposed.

Project Leads
Committers
Zheqi He
Bowen Qin
Xuejing Li
Hui Wang
Jingshu Zheng
Tongshuai Ren
Initial Contribution

BAAI will contribute the following existing code and assets to the project:

Core Evaluation Framework — FlagEval: Already open-sourced on GitHub (FlagOpen/FlagEval) under the Apache 2.0 license. Copyright is held by the Beijing Academy of Artificial Intelligence (BAAI).

Evaluation Benchmark Datasets: A diverse collection of benchmark datasets covering three major model categories (language, multimodal, and speech)

FlagEvalMM: An open-source multimodal model evaluation framework, available on GitHub (flageval-baai/FlagEvalMM) under the Apache 2.0 license, giving developers flexible multimodal evaluation capabilities.

Key third-party dependencies and their licenses:

- PyTorch (BSD License)

- Transformers / HuggingFace (Apache 2.0)

- NumPy / SciPy (BSD License)

- Some datasets are licensed under the CC BY-NC-SA 4.0 license

All third-party code dependencies are compatible with Apache 2.0 and present no license conflicts; CC BY-NC-SA-licensed datasets will be reviewed separately under the Eclipse IP due-diligence process.

Source Repository Type