
[A/B Test] Run an A/B test to evaluate impact of showing multiple Checks within a single edit
Closed, Resolved · Public

Description

In T342930, we ran an A/B test of Reference Check that demonstrated it was effective at causing:

  • Newcomers to publish new content edits that include references while lowering the likelihood those edits would be reverted
  • Newcomers to be more likely to return to edit again

...all the while NOT causing degradations in other metrics like block rate, edit completion rate, etc.

This task involves running another A/B test (or potentially an A/B/C test) of the Reference Check with one key change: removing the constraint on how many Reference Checks people have the potential to see within a single edit.

Decision to be made

This A/B test will help us make the following decision: What – if any – changes in the Reference Check UX will we make to ensure people seeing multiple Checks within a single edit continue to experience the benefits the first iteration of the feature caused?

Hypotheses

KPI
  • Hypothesis: The quality of new content edits newcomers and Junior Contributors make in the main namespace will increase because a greater percentage of these edits will include a reference or an explicit acknowledgement as to why these edits lack references.
  • Metric(s) for evaluation: 1) Proportion of published edits that add new content and include a reference or an explicit acknowledgement of why a citation was not added; 2) Proportion of published edits that add new content (T333714) and are reverted within 48 hours (or have a high revision risk score if we use the revision risk model (T317700, T343938)). See the sketch below this list.

Curiosity #1
  • Hypothesis: New account holders will be more likely to publish an unreverted edit to the main namespace within 24 hours of creating an account because they will be made aware of the need to accompany new text they're attempting to publish with a reference when they don't first think/know to do so themselves.
  • Metric(s) for evaluation: Constructive activation.

Curiosity #2
  • Hypothesis: Newcomers and Junior Contributors will be more aware of the need to add a reference when contributing new content because the visual editor will prompt them to do so in cases where they have not done so themselves.
  • Metric(s) for evaluation: Increase in the proportion of newcomers and Junior Contributors who publish at least one new content edit that includes a reference.

Curiosity #3
  • Hypothesis: Newcomers and Junior Contributors will be more likely to return to publish a new content edit in the future that includes a reference because Edit Check will have caused them to realize references are required when contributing new content to Wikipedia.
  • Metric(s) for evaluation: 1) Proportion of newcomers and Junior Contributors who publish an edit within which Edit Check was activated and successfully return to make an unreverted edit to the main namespace within 7 and 30 days; 2) Proportion of newcomers and Junior Contributors who publish an edit within which Edit Check was activated and return to make a new content edit with a reference to the main namespace within 7 and 30 days.
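For concreteness, a minimal sketch of how the KPI proportions could be computed, assuming a hypothetical pandas DataFrame `edits` with one row per published new-content edit; the column names (`added_reference`, `acknowledged_no_reference`, `reverted_within_48h`) are illustrative, not the production schema:

```python
import pandas as pd

def kpi_summary(edits: pd.DataFrame) -> pd.Series:
    """KPI proportions over published new-content edits (columns are assumed)."""
    # 1) edit includes a reference OR an explicit "no citation needed" acknowledgement
    meets_policy = edits["added_reference"] | edits["acknowledged_no_reference"]
    return pd.Series({
        "pct_with_ref_or_ack": meets_policy.mean(),
        # 2) reverted within 48 hours of publication
        "pct_reverted_48h": edits["reverted_within_48h"].mean(),
    })
```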

Leading indicators

See T388731.

Guardrails

This section describes the metrics we will use to make sure other important parts/dimensions of the "editing ecosystem" are not being negatively impacted by people being able to see multiple Reference Checks in a single edit. The scenarios named in the list below emerged through T325851.

1) Edit quality decrease (T317700)
  • Metric(s) for evaluation: Proportion of published edits that add new content and are reverted within 48 hours (or have a low revision risk score if we use the revision risk model (T317700)). Will include a breakdown of the revert rate of published edits with and without a reference added.

2) Edit completion rate drastically decreases
  • Metric(s) for evaluation: Proportion of edits that are started (event.action = init) and are successfully published (event.action = saveSuccess). See the sketch below this list.

3) Edit abandonment rate drastically increases
  • Metric(s) for evaluation: Proportion of contributors who are presented Edit Check feedback and abandon their edits (indicated by event.action = abort and event.abort_type = abandon).

4) People shown Edit Check are blocked at higher rates
  • Metric(s) for evaluation: Proportion of contributors blocked after publishing an edit where Edit Check was shown.

5) High false positive rate
  • Metric(s) for evaluation: Proportion of contributors who dismiss adding a citation and select "I didn't add new information" or another indicator that their edit doesn't require a citation.
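As a rough illustration of guardrails 2) and 3), a minimal sketch assuming a hypothetical `events` DataFrame with columns `edit_session_id`, `action`, and `abort_type`, mirroring the event.action / event.abort_type fields named above (the schema is assumed, not the production one):

```python
import pandas as pd

def completion_and_abandonment(events: pd.DataFrame) -> pd.Series:
    """Per-session completion and abandonment rates from flat edit events."""
    by_session = events.groupby("edit_session_id")["action"]
    started = by_session.agg(lambda a: (a == "init").any())
    saved = by_session.agg(lambda a: (a == "saveSuccess").any())
    abandoned = (
        events.assign(
            abandon=(events["action"] == "abort") & (events["abort_type"] == "abandon")
        )
        .groupby("edit_session_id")["abandon"]
        .any()
    )
    return pd.Series({
        "edit_completion_rate": saved[started].mean(),      # guardrail 2
        "edit_abandonment_rate": abandoned[started].mean(), # guardrail 3
    })
```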

A/B Test: Decision Matrix

1) Scenario: Reference Check is disrupting, discouraging, or otherwise getting in the way of volunteers. Read: people are less likely to publish the edits they start.
  • Indicator(s): Significant drop in edit completion and spike in edit abandonment in edit sessions where Reference Check is activated.
  • Plan of action: Pause scaling plans; investigate changes to the UX.

2) Scenario: Reference Check is increasing the likelihood that people will publish destructive edits.
  • Indicator(s): Increase in the proportion of contributors blocked after publishing an edit where Reference Check was activated; increase in the proportion of published edits within which Reference Check was activated that are reverted within 48 hours, relative to new content edits within which Reference Check was NOT activated.
  • Plan of action: Pause scaling plans; review edits to identify patterns of abuse and propose changes to the UX to mitigate them.

3) Scenario: Reference Check is causing people to publish edits that align with project policies.
  • Indicator(s): Increase in the proportion of edits within which Reference Check was activated that include a reference and are not reverted within 48 hours, relative to new content edits without a reference within which Reference Check was NOT activated.
  • Plan of action: Move forward with scaling plans.

4) Scenario: Reference Check is effective at causing people to accompany new content edits with a reference, but those references are unreliable.
  • Indicator(s): Increase in the proportion of published edits within which Reference Check was activated that include a reference, and an increase or no change in the proportion of these edits that are reverted within 48 hours.
  • Plan of action: Block scaling plans on reference reliability work (T276857).

5) Scenario: Reference Check is not effective at causing people to accompany new content edits with a reference, but is not disruptive to volunteers.
  • Indicator(s): No change or a decrease in the proportion of published edits within which Reference Check was activated that include a reference, and A) no significant drop in edit completion or abandonment rate, or B) no significant spike in block or revert rate.
  • Plan of action: Move forward with scaling plans.
WARNING: For each metric named above, we need to be able to filter them by the number of Reference Checks shown within a given edit.
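Toward that filtering requirement, a minimal sketch, assuming each row of a hypothetical per-edit table also records `n_checks_shown` (how many Reference Checks were surfaced during that edit); all names are illustrative:

```python
import pandas as pd

def metric_by_check_count(edits: pd.DataFrame, metric: str) -> pd.Series:
    """Mean of a boolean metric column, bucketed by Reference Checks shown."""
    return edits.groupby("n_checks_shown")[metric].mean()

# e.g. metric_by_check_count(edits, "reverted_within_48h")
# yields the revert rate for edits with 0, 1, 2, ... checks shown
```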

Event Timeline

Update: 17 January 2024
Per what @MNeisler and I discussed offline, we are going to revise what we are trying to learn through this A/B test...

Rather than looking to this A/B test to help us decide on the maximum number of Reference Checks that can be shown within a single edit by default, we're instead going to use this test to learn: What – if any – changes in edit quality and completion do we observe when people have the potential to see multiple Reference Checks within a single edit?

If/when we come to learn that people seeing multiple Reference Checks within a single edit does cause significant degradations in edit quality and/or completion, we'll decide from there how to respond. Maybe we'll come to think that setting an upper bound on the number of Reference Checks shown within a single edit will be most effective at addressing these regressions. Maybe we'll come to think changes to the Reference Check UX need to be made. Maybe we'll come to think the logic that determines when Reference Checks are shown needs to change.

The point is, we're committing to remaining open to how/if we adjust the Reference Check UX until we first learn how changing the number of Reference Checks that people have the potential to see in an edit impacts them.

Note: in proposing to move forward with this approach, we recognize that there are likely to be outliers in the test data that we will need to exclude from the analysis, e.g. vandalism and, potentially, new articles, which might benefit from a bespoke/different implementation of Edit Check.

ppelberg updated the task description.

I've updated the task description to reflect the changes @MNeisler and I converged on in T379131#10472407.

Note: Megan, I'm assigning this task over to you to ensure the task description reads as you expected it to.

> A) Proportion of new content edits published without a reference and without being shown edit check (indicator of false negative)

I think this case shouldn't exist, because the logging that'd track this is triggered by the same thing that would show the edit check. So unless we're manually reviewing revisions to decide whether the "new content" tag is being correctly applied, they shouldn't differ.

Two questions regarding a couple of task description items:

  • Guardrail 4: Do we plan to have a quality assessment comparing the purpose of the edit to the number of checks declined? It could be interesting to check whether the "I decline everything Edit Check asks me to do to keep my text" pattern is characteristic of a certain editing behavior.
  • Decision matrix 4: Reference reliability is not currently measured (it is qualitative information). How do we cover it?

Good spot, @DLynch; I agree with what you're proposing here: remove metric A) from guardrail 5).

I say the above agreeing with the rationale you shared, while also noting that we've been intentional about prioritizing the reduction of false positives over false negatives, as evidenced by decisions like T381020.

MNeisler triaged this task as Medium priority. Mar 6 2025, 4:45 PM
MNeisler added a project: Product-Analytics.

Update
DECIDED: per today's offline discussion with @MNeisler, we are going to look at retention through the following time frames (see the sketch below):

  • 7 days
  • 30 days
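
As an illustration, a minimal sketch of the 7- and 30-day retention computation, assuming a hypothetical `edits` DataFrame with one row per published edit and columns `user_id` and `timestamp` (datetime); names are illustrative:

```python
import pandas as pd

def retention(edits: pd.DataFrame, window_days: int) -> float:
    """Share of editors who edit again within `window_days` of their first edit."""
    first = (
        edits.groupby("user_id")["timestamp"].min().rename("first_ts").reset_index()
    )
    merged = edits.merge(first, on="user_id")
    returned = merged[
        (merged["timestamp"] > merged["first_ts"])
        & (merged["timestamp"] <= merged["first_ts"] + pd.Timedelta(days=window_days))
    ]
    return returned["user_id"].nunique() / len(first)

# retention(edits, 7), retention(edits, 30)
```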

The Multi-Check (References) A/B test analysis report is ready for sharing and review. The report includes a summary of insights, KPI and secondary metric results with various breakdowns, a guardrail analysis, and an overview of the test design and methodology.

Please let me know if you have any questions.

cc @ppelberg