Artificial intelligence/Guidelines

From Meta, a Wikimedia project coordination wiki

These guidelines describe recommended practices for developing and deploying tools that use Artificial Intelligence (AI) within Wikimedia projects such as Wikipedia, Wikidata, Wikisource, and Wikiversity.

These guidelines apply primarily to AI-assisted tools built by Wikimedians, but in principle they can guide any AI-related work that affects Wikimedia projects.

The goal is to ensure that AI-assisted tools used within the Wikimedia ecosystem are transparent, accountable, reproducible, and aligned with community governance.

These guidelines apply to:

  • researchers
  • developers
  • tool maintainers
  • bot operators
  • Wikimedia volunteers deploying machine learning or generative AI tools

Responsible use of AI in Wikimedia projects requires:

  1. clearly defined tasks
  2. transparent tool documentation through a tool card
  3. hosting choices appropriate to the deployment
  4. open and reproducible methods where feasible
  5. community bot approval where applicable
  6. human oversight
  7. data transparency
  8. evaluation
  9. monitoring after deployment.

Following these guidelines helps ensure that AI tools support Wikimedia's mission of providing reliable, transparent, and collaborative knowledge.

Terminology

These guidelines distinguish three layers of an AI-assisted tool, namely the model(s) on which it is based, the tool itself, and the task that the tool performs:

  • Model: the underlying machine learning artifact (for example, a vandalism classifier, a transliteration model, or a large language model). Models may be traditional or generative, large or small, general-purpose or specialized.
  • Tool or AI-Assisted Tool: a deployed system that performs a specific task on a Wikimedia project, typically by combining one or more models with prompts, workflow logic, retrieval steps, post-processing, and guardrails. Bots, userscripts, gadgets, and external services that interact with Wikimedia projects are tools in this sense.
  • Task: the specific editorial or analytical action a tool performs (for example, suggesting a category, flagging a possibly vandalistic edit, or proposing a citation).

Documentation and evaluation are typically most useful when applied to tools rather than to models, for two reasons. First, most Wikimedia contributors will not train their own models, while many Wikimedians write tools. Second, the same model may be used in many tools, and a single tool may use several models.

At the same time, the choice of model, and of how it is hosted, is a key decision that affects most of the issues covered by these guidelines.

A note on the terms Artificial Intelligence and Machine Learning

Machine learning (ML) is the broadest category. It describes computer programs built on statistical models that learn from data, rather than relying on explicitly programmed instructions.

Deep learning is a branch of ML that uses neural networks to detect patterns in raw data. Generative AI refers to the application of deep learning techniques to build models that can generate novel outputs.

This means that machine learning tools do not necessarily depend on generative AI models; they can also rely on other types of models.

In turn, while most generative AI models used by tools will be so-called large models, small and specialized models are also being built.

For AI-assisted tools, the choice of the right model is a key decision that will determine factors like transparency, environmental sustainability, efficiency or capacity.

1. Define a single task per tool

Each AI-assisted tool should focus on a single, clearly defined task.

Examples of well-defined tasks include:

  • detecting vandalism in edits
  • suggesting article categories
  • transliterating names between languages
  • identifying duplicate entities in Wikidata
  • suggesting references for statements

Tools should avoid combining multiple unrelated editorial actions without clear boundaries. Where a single tool performs more than one task, each task should be enumerated, documented, and reviewed separately, in the manner of multi-task bots such as AnomieBOT.

Defining specific tasks helps ensure:

  • easier evaluation
  • predictable behavior
  • easier community review
  • lower risk of unintended changes

For tools built on generative AI, the task is typically defined through prompt design, non-model guardrails, or workflow configuration. These definitions, prompts, and evaluation procedures should be documented as part of the tool's description (see section 2).

2. Document the tool with a tool card

Every AI-assisted tool should include a publicly accessible tool card. The tool card aggregates in one place, for purposes of transparency and education, the documentation required throughout these guidelines.

A detailed structure for tool cards is set out in the tool card template. At a high level, a tool card should cover:

  • Identity: kind of artifact (a wrapper around third-party model(s), a fine-tune, or a from-scratch model) and the task(s) performed (see section 1)
  • Models: the model(s) used, with a link to each model's card (on what model cards should contain, see Mitchell et al., Model Cards for Model Reporting (2019))
  • System: invocation pipeline (prompts, workflow, sampling settings, guardrails), hosting tier (see section 3), and data flow (see section 7)
  • Evaluation: ethical considerations (see section 5), per-task results (see section 8), known limitations and biases (distinguishing tool-level from model-inherited ones), and reproducibility (see section 4)
  • Stewardship: monitoring (see section 9), governance, and licensing

For multi-task tools, the per-task items should be enumerated per task, as described in section 1.
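
As a rough illustration, the high-level items above could be captured in a small machine-readable record. This is only a sketch: the class and field names (`ToolCard`, `TaskEntry`, `missing_fields`) are hypothetical and are not taken from the tool card template, which remains the authoritative structure.

```python
from dataclasses import dataclass, field, asdict


@dataclass
class TaskEntry:
    """Per-task items, listed once per task for multi-task tools."""
    name: str
    evaluation_summary: str = ""
    known_limitations: list[str] = field(default_factory=list)


@dataclass
class ToolCard:
    """Illustrative tool card record; all field names are assumptions."""
    tool_name: str
    artifact_kind: str          # e.g. "wrapper", "fine-tune", "from-scratch"
    models: dict[str, str]      # model name -> link to its model card
    hosting_tier: str           # e.g. "local", "community", "external"
    tasks: list[TaskEntry] = field(default_factory=list)
    monitoring: str = ""
    license: str = ""

    def missing_fields(self) -> list[str]:
        """Return the names of top-level fields that are still empty."""
        return [name for name, value in asdict(self).items() if not value]


card = ToolCard(
    tool_name="ExampleCategorySuggester",  # hypothetical tool
    artifact_kind="wrapper",
    models={"example-model": "https://example.org/model-card"},
    hosting_tier="community",
)
print(card.missing_fields())
```

A check like `missing_fields` could help a maintainer notice gaps before publishing the card; it does not replace the prose documentation itself.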

3. Choose hosting appropriate to the tool

AI-assisted tools should use the most transparent and community-aligned model hosting that is compatible with their use case and deployment models. In rough order of preference:

  1. Locally hosted — the model runs on the contributor's own machine. This offers the strongest privacy and reproducibility properties, and is well suited to analysis tools, offline workflows, and editor-side assistants.
  2. Community-hosted — the model runs on infrastructure operated by the Wikimedia Foundation, a Wikimedia affiliate, or another transparently governed nonprofit host. This is typically the appropriate tier for bots and shared tools where local hosting is not feasible.
  3. External or commercial hosting — the model runs on a third-party service. This is acceptable where the above options are not feasible. Open-weights models running on commercial cloud infrastructure are preferable, at this tier, to models under proprietary licenses.

When external or commercial services are used, the tool card should document:

  • the provider and the specific model used
  • what data is sent to the external service
  • any data retention or training-on-input behavior of the provider
  • privacy and licensing considerations

The use of external services for bots and other automated tools is subject to community approval (see section 5).

Model hosting

When contributors create new or fine-tuned models, those models should be openly shared and published on a transparent host such as Hugging Face together with their model cards, to support reproducibility.

4. Prefer open and reproducible methods

Tools should – wherever feasible – be built with open-source models and reproducible pipelines. At the same time, it should be recognized that some tasks may not achieve acceptable levels of performance with current open models.

If possible, tools should also be designed with swappable AI architectures that allow various models to be used by the tool.
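
One way to read "swappable" is that the tool depends on a narrow interface rather than on a particular model. The sketch below assumes a category-suggestion task; the names `CategoryModel`, `KeywordModel`, and `CategorySuggester` are hypothetical, and the keyword matcher is only a stand-in for a real model backend.

```python
from typing import Protocol


class CategoryModel(Protocol):
    """Hypothetical interface that any model backend must satisfy."""

    def suggest(self, article_text: str) -> list[str]: ...


class KeywordModel:
    """Stand-in backend: a trivial keyword matcher, not a real ML model."""

    def __init__(self, keywords: dict[str, str]) -> None:
        self.keywords = keywords  # keyword -> category name

    def suggest(self, article_text: str) -> list[str]:
        text = article_text.lower()
        return sorted({cat for kw, cat in self.keywords.items() if kw in text})


class CategorySuggester:
    """The tool depends only on the interface, so backends can be swapped."""

    def __init__(self, model: CategoryModel, max_suggestions: int = 3) -> None:
        self.model = model
        self.max_suggestions = max_suggestions

    def run(self, article_text: str) -> list[str]:
        # Post-processing lives in the tool, not in the model backend.
        return self.model.suggest(article_text)[: self.max_suggestions]


tool = CategorySuggester(KeywordModel({"river": "Rivers", "bridge": "Bridges"}))
print(tool.run("A bridge over the river."))
```

Replacing `KeywordModel` with a wrapper around an open or closed model would not require changes to `CategorySuggester`, which also makes side-by-side performance comparisons easier.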

What constitutes an open source AI model

There is currently no single settled definition of "open source" as applied to model weights, and the legal status of model licensing remains unsettled in several jurisdictions. Furthermore, there is no consensus on what data sharing standard should apply to open-source models.

A common-sense understanding of open-source models assumes that, at a minimum, they openly share model weights, provide transparent information on the data used to train the model, and share other components, such as the code used to train and run the system.

Several existing frameworks may be useful reference points for what counts as "open" at the model level:

Openness at the tool level

While there is not yet a settled standard for open-source AI models, it is clear how tools themselves can be open-sourced.

Tools deployed on Wikimedia projects should be open-sourced, which means openly sharing source code, documented prompts and workflows, and reproducible pipelines.

To ensure tools are open and reproducible, developers should provide:

  • source code
  • training or fine-tuning scripts, where applicable
  • dataset descriptions
  • instructions to reproduce results
  • performance testing to allow comparisons between open-weight and closed models

When a closed model is used as part of a tool, the tool card should:

  • document the open models that were evaluated and the evaluation process
  • describe the performance differences observed
  • link to the best known open alternative and its known performance gaps, to help guide contributors building open substitutes

5. Require bot approval

If an AI-assisted tool performs automated edits on Wikimedia projects, it must follow the standard bot approval process for the relevant project.

A bot request should include:

  • a description of the tool, including its task, model(s), prompts, and workflow
  • sample edits
  • testing results
  • rate limits for editing
  • a link to the tool card

Like all bots, AI-assisted automated tools should start with limited testing before full deployment.

The bot approval process focuses primarily on the behavior of the tool rather than on the specific model used. Bot operators should describe the workflow in which the model is used, demonstrate how outputs are reviewed or validated, and provide examples of expected edits.

Approval is granted by the relevant community according to existing bot policies.
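
Rate limits named in a bot request need to be enforced somewhere in the tool. The following is a minimal sketch of one way to do this, assuming a simple minimum interval between edits; `EditRateLimiter` is a hypothetical helper, and the actual limits are set by each project's bot policy, not by this code.

```python
import time


class EditRateLimiter:
    """Enforces a minimum interval between edits (hypothetical helper)."""

    def __init__(self, edits_per_minute: float,
                 clock=time.monotonic, sleep=time.sleep) -> None:
        self.min_interval = 60.0 / edits_per_minute
        self._clock = clock      # injectable so the limiter can be tested
        self._sleep = sleep
        self._last_edit = None   # time of the previous edit, if any

    def wait(self) -> float:
        """Block until the next edit is allowed; return seconds waited."""
        waited = 0.0
        if self._last_edit is not None:
            remaining = self.min_interval - (self._clock() - self._last_edit)
            if remaining > 0:
                self._sleep(remaining)
                waited = remaining
        self._last_edit = self._clock()
        return waited


# Usage: call wait() immediately before each automated edit.
limiter = EditRateLimiter(edits_per_minute=6)
limiter.wait()  # first call returns without sleeping
```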

6. Keep humans in the loop

AI-assisted tools should not replace human editorial oversight.

Instead, tools should operate in human-in-the-loop workflows, where humans review or validate model outputs before changes are made. Examples include:

  • a tool suggesting edits, with humans approving and submitting the edits themselves
  • a tool flagging possible vandalism, with moderators taking the final decision to label something as vandalism

Human review helps prevent:

  • incorrect edits
  • bias propagation
  • large-scale automated errors
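
A human-in-the-loop workflow can be sketched as a queue in which nothing is applied until a person has approved it. The example below is illustrative only: `Suggestion`, `ReviewQueue`, and the reviewer name are hypothetical, and a real tool would leave the approved edit to be submitted by the human editor.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Status(Enum):
    PENDING = "pending"
    APPROVED = "approved"
    REJECTED = "rejected"


@dataclass
class Suggestion:
    page: str
    proposed_change: str
    status: Status = Status.PENDING
    reviewer: Optional[str] = None


class ReviewQueue:
    """Holds model suggestions until a human accepts or rejects them."""

    def __init__(self) -> None:
        self._items: list[Suggestion] = []

    def propose(self, page: str, proposed_change: str) -> Suggestion:
        suggestion = Suggestion(page, proposed_change)
        self._items.append(suggestion)
        return suggestion

    def review(self, suggestion: Suggestion, reviewer: str, approve: bool) -> None:
        suggestion.status = Status.APPROVED if approve else Status.REJECTED
        suggestion.reviewer = reviewer

    def ready_to_apply(self) -> list[Suggestion]:
        # Only human-approved suggestions are ever returned.
        return [s for s in self._items if s.status is Status.APPROVED]


queue = ReviewQueue()
first = queue.propose("Page A", "add category X")
queue.propose("Page B", "add category Y")
queue.review(first, reviewer="ExampleReviewer", approve=True)
print([s.page for s in queue.ready_to_apply()])
```

Recording the reviewer alongside each decision also gives later audits a trail of who approved what.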

7. Ensure data transparency

Tools should clearly document the data they use, including:

  • data sent to any external service at inference time
  • data retained or logged by the tool
  • data used for retrieval, grounding, or fine-tuning
  • any preprocessing or filtering applied
  • known biases or gaps in the data

Datasets used by Wikimedia AI tools should preferably originate from:

  • Wikimedia projects
  • open datasets
  • freely licensed sources

Where contributors do train or fine-tune models, training-data transparency proposals developed in connection with the European AI Act may serve as a useful template.

8. Evaluate tools per task

Before deployment, each task performed by a tool should be evaluated using appropriate metrics. The same model may behave very differently across prompts, configurations, and surrounding workflow, so evaluation should be conducted per task and per tool, not only per underlying model.

Possible evaluation measures include:

  • accuracy
  • precision and recall
  • edit acceptance rate
  • human reviewer feedback

Performance should be evaluated on representative datasets, and tools should also be tested for known failure cases such as:

  • languages with fewer speakers
  • unusual article formats
  • rare entity types
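
To make failure cases visible, metrics can be broken down by group (for example by language) instead of being reported as a single aggregate. The sketch below assumes a binary task such as flagging vandalism; the function names, language codes, and sample records are invented for illustration.

```python
from collections import defaultdict


def precision_recall(predicted: list[bool], actual: list[bool]) -> tuple[float, float]:
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)
    fp = sum(1 for p, a in pairs if p and not a)
    fn = sum(1 for p, a in pairs if not p and a)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def per_group(records) -> dict[str, tuple[float, float]]:
    """records: iterable of (group, predicted, actual) triples."""
    grouped = defaultdict(lambda: ([], []))
    for group, predicted, actual in records:
        grouped[group][0].append(predicted)
        grouped[group][1].append(actual)
    return {g: precision_recall(p, a) for g, (p, a) in grouped.items()}


# Invented sample: "en" and "xx" stand for a large and a small language edition.
records = [
    ("en", True, True), ("en", True, False), ("en", False, False),
    ("xx", False, True), ("xx", True, True),
]
print(per_group(records))
```

In this invented sample the two groups fail differently: "en" has lower precision while "xx" has lower recall, a difference that a single pooled score would hide.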

9. Monitor tools after deployment

Tools should be regularly monitored to detect issues such as:

  • declining accuracy
  • unintended editing behavior
  • bias in predictions

Monitoring may include:

  • periodic audits
  • community feedback
  • automated logging of edits

If problems are detected, the tool should be paused until the issue is resolved.
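
One possible way to connect monitoring to pausing is to track a rolling acceptance rate and stop the tool when it falls below a threshold. This is a sketch under stated assumptions: `AcceptanceMonitor`, the window size, and the thresholds are hypothetical, and "accepted" here means an edit that was not reverted or rejected by reviewers.

```python
from collections import deque


class AcceptanceMonitor:
    """Pauses a tool when its recent acceptance rate drops too low."""

    def __init__(self, window: int = 100, min_rate: float = 0.9,
                 min_samples: int = 20) -> None:
        self._outcomes = deque(maxlen=window)  # most recent outcomes only
        self.min_rate = min_rate
        self.min_samples = min_samples
        self.paused = False

    def rate(self) -> float:
        return sum(self._outcomes) / len(self._outcomes)

    def record(self, accepted: bool) -> None:
        self._outcomes.append(accepted)
        enough_data = len(self._outcomes) >= self.min_samples
        if enough_data and self.rate() < self.min_rate:
            self.paused = True  # stays paused until a human calls resume()

    def resume(self) -> None:
        """Called by the operator once the issue has been resolved."""
        self._outcomes.clear()
        self.paused = False


monitor = AcceptanceMonitor(window=10, min_rate=0.8, min_samples=5)
for accepted in [True, True, True, True, False, False]:
    monitor.record(accepted)
print(monitor.paused)
```

The tool's edit loop would check `monitor.paused` before each action. Resuming is left as a deliberate manual step so that a person confirms the fix.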

10. Respect community governance

Tools should follow the policies and norms of each Wikimedia project.

Developers should engage with the relevant communities before deploying automated tools. Community consultation is especially important for tools that:

  • make automated edits
  • affect large numbers of pages
  • influence editorial workflows for editors who don't opt in
  • affect sensitive pages, such as biographies of living persons (BLPs) or controversial topics
  • make complex changes that are hard to review by looking at diffs

About

These guidelines were created by Houcemeddine Turki, Samuel J. Klein, and Athul R T, following a discussion on the Future of Wikimedia at the Age of Artificial Intelligence at Wikimania 2025, and were extensively revised by Luis Villa and Alek Tarkowski based on talk page feedback. The outcomes of this discussion have been presented at the Deoband Community Wikimedia's Wikipedia 25th Birthday. This initiative is funded by Wikimedia CH's Innovation Programme.

The guidelines are mainly based on the following references: