
[A/B Test] Run an A/B test to evaluate impact of showing multiple Checks within a single edit
Closed, Resolved · Public

Description

In T342930, we ran an A/B test of Reference Check that demonstrated it was effective at causing:

  • Newcomers to publish new content edits that include references while lowering the likelihood those edits would be reverted
  • Newcomers to be more likely to return to edit again

...all the while NOT causing degradations in other metrics like block rate, edit completion rate, etc.

This task involves running another A/B test (or potentially an A/B/C test) of the Reference Check with one key change: removing the constraint on how many Reference Checks people have the potential to see within a single edit.

Decision to be made

This A/B test will help us make the following decision: What – if any – changes in the Reference Check UX will we make to ensure people seeing multiple Checks within a single edit continue to experience the benefits the first iteration of the feature caused?

Hypotheses

KPI
  • Hypothesis: The quality of new content edits newcomers and Junior Contributors make in the main namespace will increase because a greater percentage of these edits will include a reference or an explicit acknowledgement as to why these edits lack references.
  • Metric(s) for evaluation: 1) Proportion of published edits that add new content and include a reference or an explicit acknowledgement of why a citation was not added; 2) Proportion of published edits that add new content (T333714) and are reverted within 48 hours (or have a high revision risk score if we use the revision risk model (T317700, T343938)). See the sketch below this list.

Curiosity #1
  • Hypothesis: New account holders will be more likely to publish an unreverted edit to the main namespace within 24 hours of creating an account because they will be made aware of the need to accompany new text they're attempting to publish with a reference when they don't first think/know to do so themselves.
  • Metric(s) for evaluation: Constructive activation.

Curiosity #2
  • Hypothesis: Newcomers and Junior Contributors will be more aware of the need to add a reference when contributing new content because the visual editor will prompt them to do so in cases where they have not done so themselves.
  • Metric(s) for evaluation: Increase in the proportion of newcomers and Junior Contributors who publish at least one new content edit that includes a reference.

Curiosity #3
  • Hypothesis: Newcomers and Junior Contributors will be more likely to return to publish a new content edit in the future that includes a reference because Edit Check will have caused them to realize references are required when contributing new content to Wikipedia.
  • Metric(s) for evaluation: 1) Proportion of newcomers and Junior Contributors who publish an edit within which Edit Check was activated and successfully return to make an unreverted edit to the main namespace within 7 and 30 days; 2) Proportion of newcomers and Junior Contributors who publish an edit within which Edit Check was activated and return to make a new content edit with a reference to the main namespace within 7 and 30 days.
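For concreteness, a minimal sketch of how the KPI proportions could be computed, assuming a hypothetical pandas DataFrame `edits` with one row per published new-content edit; the column names (`added_reference`, `acknowledged_no_reference`, `reverted_within_48h`) are illustrative, not the production schema:

```python
import pandas as pd

def kpi_summary(edits: pd.DataFrame) -> pd.Series:
    """KPI proportions over published new-content edits (columns are assumed)."""
    # 1) edit includes a reference OR an explicit "no citation needed" acknowledgement
    meets_policy = edits["added_reference"] | edits["acknowledged_no_reference"]
    return pd.Series({
        "pct_with_ref_or_ack": meets_policy.mean(),
        # 2) reverted within 48 hours of publication
        "pct_reverted_48h": edits["reverted_within_48h"].mean(),
    })
```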

Leading indicators

See T388731.

Guardrails

This section describes the metrics we will use to make sure other important parts/dimensions of the "editing ecosystem" are not being negatively impacted by people being able to see multiple Reference Checks in a single edit. The scenarios named in the list below emerged through T325851.

1) Edit quality decrease (T317700)
  • Metric(s) for evaluation: Proportion of published edits that add new content and are reverted within 48 hours (or have a low revision risk score if we use the revision risk model (T317700)). Will include a breakdown of the revert rate of published edits with and without a reference added.

2) Edit completion rate drastically decreases
  • Metric(s) for evaluation: Proportion of edits that are started (event.action = init) and are successfully published (event.action = saveSuccess). See the sketch below this list.

3) Edit abandonment rate drastically increases
  • Metric(s) for evaluation: Proportion of contributors who are presented Edit Check feedback and abandon their edits (indicated by event.action = abort and event.abort_type = abandon).

4) People shown Edit Check are blocked at higher rates
  • Metric(s) for evaluation: Proportion of contributors blocked after publishing an edit where Edit Check was shown.

5) High false positive rate
  • Metric(s) for evaluation: Proportion of contributors who dismiss adding a citation and select "I didn't add new information" or another indicator that their edit doesn't require a citation.
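As a rough illustration of guardrails 2) and 3), a minimal sketch assuming a hypothetical `events` DataFrame with columns `edit_session_id`, `action`, and `abort_type`, mirroring the event.action / event.abort_type fields named above (the schema is assumed, not the production one):

```python
import pandas as pd

def completion_and_abandonment(events: pd.DataFrame) -> pd.Series:
    """Per-session completion and abandonment rates from flat edit events."""
    by_session = events.groupby("edit_session_id")["action"]
    started = by_session.agg(lambda a: (a == "init").any())
    saved = by_session.agg(lambda a: (a == "saveSuccess").any())
    abandoned = (
        events.assign(
            abandon=(events["action"] == "abort") & (events["abort_type"] == "abandon")
        )
        .groupby("edit_session_id")["abandon"]
        .any()
    )
    return pd.Series({
        "edit_completion_rate": saved[started].mean(),      # guardrail 2
        "edit_abandonment_rate": abandoned[started].mean(), # guardrail 3
    })
```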

A/B Test: Decision Matrix

1) Scenario: Reference Check is disrupting, discouraging, or otherwise getting in the way of volunteers. Read: people are less likely to publish the edits they start.
  • Indicator(s): Significant drop in edit completion and spike in edit abandonment in edit sessions where Reference Check is activated.
  • Plan of action: Pause scaling plans; investigate changes to the UX.

2) Scenario: Reference Check is increasing the likelihood that people will publish destructive edits.
  • Indicator(s): Increase in the proportion of contributors blocked after publishing an edit where Reference Check was activated; increase in the proportion of published edits within which Reference Check was activated that are reverted within 48 hours, relative to new content edits within which Reference Check was NOT activated.
  • Plan of action: Pause scaling plans; review edits to identify patterns of abuse and propose changes to the UX to mitigate them.

3) Scenario: Reference Check is causing people to publish edits that align with project policies.
  • Indicator(s): Increase in the proportion of edits within which Reference Check was activated that include a reference and are not reverted within 48 hours, relative to new content edits without a reference within which Reference Check was NOT activated.
  • Plan of action: Move forward with scaling plans.

4) Scenario: Reference Check is effective at causing people to accompany new content edits with a reference, but those references are unreliable.
  • Indicator(s): Increase in the proportion of published edits within which Reference Check was activated that include a reference, and an increase or no change in the proportion of these edits that are reverted within 48 hours.
  • Plan of action: Block scaling plans on reference reliability work (T276857).

5) Scenario: Reference Check is not effective at causing people to accompany new content edits with a reference, but is not disruptive to volunteers.
  • Indicator(s): No change or a decrease in the proportion of published edits within which Reference Check was activated that include a reference, and A) no significant drop in edit completion or abandonment rate, or B) no significant spike in block or revert rate.
  • Plan of action: Move forward with scaling plans.
WARNING: For each metric named above, we need to be able to filter them by the number of Reference Checks shown within a given edit.
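Toward that filtering requirement, a minimal sketch, assuming each row of a hypothetical per-edit table also records `n_checks_shown` (how many Reference Checks were surfaced during that edit); all names are illustrative:

```python
import pandas as pd

def metric_by_check_count(edits: pd.DataFrame, metric: str) -> pd.Series:
    """Mean of a boolean metric column, bucketed by Reference Checks shown."""
    return edits.groupby("n_checks_shown")[metric].mean()

# e.g. metric_by_check_count(edits, "reverted_within_48h")
# yields the revert rate for edits with 0, 1, 2, ... checks shown
```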

Event Timeline

Update: 17 January 2024
Per what @MNeisler and I discussed offline, we are going to revise what we are trying to learn through this A/B test...

Rather than looking to this A/B test to help us decide on the maximum number of Reference Checks that can be shown within a single edit by default, we're instead going to use this test to learn: What – if any – changes in edit quality and completion do we observe when people have the potential to see multiple Reference Checks within a single edit?

If/when we come to learn that people seeing multiple Reference Checks within a single edit does cause significant degradations in edit quality and/or completion, we'll decide from there how to respond. Maybe we'll come to think that setting an upper bound on the number of Reference Checks shown within a single edit will be most effective at addressing these regressions. Maybe we'll come to think changes to the Reference Check UX need to be made. Maybe we'll come to think the logic that determines when Reference Checks are shown needs to change.

The point is, we're committing to remaining open to how/if we adjust the Reference Check UX until we first learn how changing the number of Reference Checks that people have the potential to see in an edit impacts them.

Note: in proposing to move forward with this approach, we recognize that there are likely to be outliers in the test data that we will need to exclude from the analysis, e.g. vandalism and, potentially, new articles, which might benefit from a bespoke/different implementation of Edit Check.

ppelberg updated the task description.

I've updated the task description to reflect the changes @MNeisler and I converged on in T379131#10472407.

Note: Megan, I'm assigning this task over to you to ensure the task description reads as you expected it to.

> A) Proportion of new content edits published without a reference and without being shown edit check (indicator of false negative)

I think this case shouldn't exist, because the logging that'd track this is triggered by the same thing that would show the edit check. So unless we're manually reviewing revisions to decide whether the "new content" tag is being correctly applied, they shouldn't differ.

Two questions regarding a couple of task description items:

  • Guardrail 4: Do we plan to have a quality assessment comparing the purpose of the edit to the number of checks declined? It could be interesting to check whether the "I decline everything Edit Check asks me to do to keep my text" pattern is characteristic of a certain editing behavior.
  • Decision matrix 4: Reference reliability is not currently measured (it is qualitative information). How do we cover it?

Good spot, @DLynch; I agree with what you're proposing here: remove metric A) from guardrail 5).

I say the above agreeing with the rationale you shared, while also noting that we've been intentional about prioritizing the reduction of false positives over false negatives, as evidenced by decisions like T381020.

MNeisler triaged this task as Medium priority. Mar 6 2025, 4:45 PM
MNeisler added a project: Product-Analytics.

Update
DECIDED: per today's offline discussion with @MNeisler, we are going to look at retention through the following time frames (see the sketch below):

  • 7 days
  • 30 days
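
As an illustration, a minimal sketch of the 7- and 30-day retention computation, assuming a hypothetical `edits` DataFrame with one row per published edit and columns `user_id` and `timestamp` (datetime); names are illustrative:

```python
import pandas as pd

def retention(edits: pd.DataFrame, window_days: int) -> float:
    """Share of editors who edit again within `window_days` of their first edit."""
    first = (
        edits.groupby("user_id")["timestamp"].min().rename("first_ts").reset_index()
    )
    merged = edits.merge(first, on="user_id")
    returned = merged[
        (merged["timestamp"] > merged["first_ts"])
        & (merged["timestamp"] <= merged["first_ts"] + pd.Timedelta(days=window_days))
    ]
    return returned["user_id"].nunique() / len(first)

# retention(edits, 7), retention(edits, 30)
```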

The Multi-Check (References) A/B test analysis report is ready for sharing and review. The report includes a summary of insights, KPI and secondary metric results with various breakdowns, a guardrail analysis, and an overview of the test design and methodology.

Please let me know if you have any questions.

cc @ppelberg