User talk:Alaexis/AI Source Verification

From Wikipedia, the free encyclopedia

UI is not aggressive enough ;-)

When I click on Verify Claim it changes from a blue check to a grey check, then nothing. "What's going on? IDK, should I wait? Move on?" Some sense that the tool is waiting for replies and has not simply fallen over would be very helpful. In general I think any status change to the tool should be reflected in the UI. "Starting verification. ... Opening https://glossary.ametsoc.org/wiki/Particle ... "

(Also a link to this page in the tool, say wrapping "Source Verifier" would be very handy). Johnjbarton (talk) 16:02, 9 March 2026 (UTC)[reply]

That's also a good idea. For now the only indication is that the button becomes disabled. Alaexis¿question? 19:50, 9 March 2026 (UTC)[reply]

Running across an entire article

I suspect this is probably going to require a local LLM like Llama or Mistral, but would there be a way to run something like this across an entire article and let it generate a report for the whole article with the same kind of output that this script is giving per citation? That would also allow the URL to be given just once for something like shortened footnotes. Rjjiii (talk) 17:38, 14 March 2026 (UTC)[reply]

FWIW we've been having a bit of a chat that touches on this at User talk:ClaudineChionh#Source verification in AFC. And I sometimes ask Opus (with these custom instructions) to do a sanity check on a draft/NPP article, especially if it's suffering from WP:REFBOMB. I think if we're considering a full-page scan or a bot that scans AfC submissions then there needs to be some consideration of API costs, maybe clear warnings for users who want to check all references on a single page, for example. ClaudineChionh (she/her · talk · email · global) 05:16, 15 March 2026 (UTC)[reply]
Hi @Rjjiii, yeah, we've been thinking about it. I think that API costs are not that high, even if we run it for all 2k articles in the AFC backlog. Even the current credits granted by Publicai may be sufficient. Keep in mind that inference costs go down all the time. WMF plans to host open-source models on Liftwing though I don't know if and when they plan to make them accessible to the community.
No matter what LLM is used, it will take minutes rather than seconds to generate such reports. The question is whether it's better to generate them automatically, perhaps triggered by AFC submission, or to have the reviewing editor trigger it manually, and then maybe get a notification when a report is generated. Probably it would be easier to start with the latter and in the meantime have the bot approved.
Ideally the check results would be persisted, so it would be possible to generate an updated report without re-running all the checks. But that wouldn't be in the first stage's scope. Alaexis¿question? 17:25, 15 March 2026 (UTC)[reply]
"No matter what LLM is used, it will take minutes rather than seconds to generate such reports." I think part of the benefit of LLMs responding in seconds is that it creates the illusion of conversation, and that's not too important to me personally, at least in this regard. This past week, I've had Claude vibe code something with Mixtral/Mistral, Ollama, Python, and pdfplumber and have been tinkering. Mixtral and Mistral take hours to get through even a small article locally. My thought process was more based around Good Article, Did You Know, and Featured Article reviews. Sometimes those sit for weeks or months, then the actual review can go on for weeks. In that context, it doesn't really matter if a request takes a few minutes, hours, or even days. Rjjiii (talk) 17:56, 15 March 2026 (UTC)[reply]
These are also good use cases - I didn't think about them, probably because I've never been involved in GA and FA nominations. Anyway, the code is the same, I'll try to build something and will let you know once it's ready for testing. Alaexis¿question? 22:25, 15 March 2026 (UTC)[reply]
@Rjjiii, @ClaudineChionh, I've added it to the dev version of the script. See the results here
The biggest problem - and the reason I didn't post the third report at the talk page - is that the script's logic looks at each citation separately. In the third draft, there were two sources supporting different parts of the same claim, and we got a "Not supported" verdict, which is technically true but unhelpful. Alaexis¿question? 13:28, 16 March 2026 (UTC)[reply]
Some time soon, I'm planning to do some GA/DYK/FAC reviews. If I use this, is there any specific input that would be helpful for you? Also, thanks for putting this together, Rjjiii (talk) 02:35, 18 March 2026 (UTC)[reply]
Just jot down all your thoughts about the UX and the outcome, that would help a lot. Alaexis¿question? 08:37, 18 March 2026 (UTC)[reply]
Sounds good. I waited a bit to see how the discussions at Wikipedia:Village pump (proposals)/RfC LLMCOMM guideline and Wikipedia:Writing articles with large language models/RfC turned out. I created a thread on my talk page to track reviews at User talk:Rjjiii § LLM disclosure. My first two observations are that this does not feel slow to me at all, and that it is checking by cited source rather than by citation. So in the first review, the sources are reused so heavily that the vast majority of citations are not checked. Rjjiii (talk) 21:20, 29 March 2026 (UTC)[reply]
@Rjjiii, thanks for the feedback.
This is a bug in the "Verify All Citations" functionality. I hope I'll fix it soon. If you check citations individually it will work fine. Alaexis¿question? 19:59, 1 April 2026 (UTC)[reply]
I somehow had the "entire article" version and applied it to SARS. Strong potential if one focuses on screening to find NOT SUPPORTED. My run did not get too far before I hit my limit. Johnjbarton (talk) 01:09, 30 March 2026 (UTC)[reply]
@Johnjbarton, if you run out of your own credits, have you tried using the publicai option? (though it's not available now, looks like we've run out of credits)
Btw I was thinking about showing only NOT SUPPORTED and PARTIALLY SUPPORTED by default to make it more actionable and reduce scrolling. The problem is that in spite of my best enchanting prompt-engineering efforts it still fairly often classifies not supported as UNAVAILABLE so it's usually worth looking through them too. Alaexis¿question? 19:51, 1 April 2026 (UTC)[reply]
I'm going to try a paid tier, just out of curiosity.
I'm unclear on what you are describing here, but I will just add that the AI summary of the case for/against support is often very helpful in further analysis. In fact the AI could be wrong about the case but the summary hints on how to think about the issues. Johnjbarton (talk) 20:41, 1 April 2026 (UTC)[reply]
Paid tier of your Gemini subscription? PublicAI is free. Alaexis¿question? 10:46, 3 April 2026 (UTC)[reply]
I'm just using Gemini, yes, to keep things simple enough. Johnjbarton (talk) 16:34, 3 April 2026 (UTC)[reply]
Using Villa's script: User_talk:Rjjiii#Citation_verification_report_(Alien_vs_Predator_(Atari_Jaguar_video_game)) Rjjiii (talk) 06:44, 26 April 2026 (UTC)[reply]
@Rjjiii oh hey, I missed this earlier! Report looks accurate - is it what you expected? —Luis (talk) 21:40, 1 May 2026 (UTC)[reply]
This is the same one that I'm discussing below. Also, I'm trying to leave a column of feedback to the right of the output to note how useful the LLM output is for each citation. Rjjiii (talk) 12:11, 2 May 2026 (UTC)[reply]

Echo citation

With the tool open it is very hard to get to the citation one seeks to verify. Clicking on the [1] goes to the Verify panel. The popup is not always easy to copy content from. Maybe the citation could be rendered in the Verify panel? Johnjbarton (talk) 03:17, 24 March 2026 (UTC)[reply]

Do you mean that when you've run "Verify all" and then you click on [1] the results disappear? What do you mean by "citation could be rendered in the Verify panel"? Alaexis¿question? 20:04, 1 April 2026 (UTC)[reply]
No this was prior to "Verify all".
For simple web citations, the link in the Verify Source box is enough to take independent action on the citation. But for a long journal cite, the tool may not pick the "right" URL. I was imagining rendering the whole citation in the box that shows the URL now, maybe with the URL used for verification highlighted in color. I'll try to post some examples to motivate this request. Johnjbarton (talk) 20:33, 1 April 2026 (UTC)[reply]

Works better than I expected.

I applied the tool to an article that I knew needed work on sources and recorded the results here: User:Johnjbarton/AISandbox. My grade is 7/9. The tool identified two sources that were no longer available or nonexistent. It failed two sources correctly. It passed three sources correctly. It failed to find one URL, probably because of Cloudflare. It verified one claim that I would flunk: the source was a primary for the "first exoplanet" but a primary can't verify "first". Johnjbarton (talk) 04:16, 24 March 2026 (UTC)[reply]

Happy to hear that! :) Alaexis¿question? 20:05, 1 April 2026 (UTC)[reply]

Selected claim highlighting is confusing

When a paragraph has multiple cites, eg Friction#cite_ref-66, the verifier picks the correct content to place in the "Selected Claim" box, but on the page it highlights the entire paragraph. Johnjbarton (talk) 19:59, 24 March 2026 (UTC)[reply]

Failed to parse AI response.

Same call, multiple times (repeatable). Does this bit from the Inspector console help? As far as I can tell the "result" from the Gemini call stops before the closing brace.

index.php?title=User:Alaexis/AI_Source_Verification.js&action=raw&ctype=text/javascript:1381 Verifier Status: Verification complete!
index.php?title=User:Alaexis/AI_Source_Verification.js&action=raw&ctype=text/javascript:1755 [Verifier] displayResult called with type: string value: ```json {"confidence": 55, "verdict": "PARTIALLY SUPPORTED", "comments": "The source supports the claim that young and small (M-type) stars produce extreme stellar flares and coronal mass ejections (C
index.php?title=User:Alaexis/AI_Source_Verification.js&action=raw&ctype=text/javascript:1779 JSON parsing failed: SyntaxError: Unexpected token '`', "```json {""... is not valid JSON
   at JSON.parse (<anonymous>)
   at WikipediaSourceVerifier.displayResult (index.php?title=User:Alaexis/AI_Source_Verification.js&action=raw&ctype=text/javascript:1777:28)
   at WikipediaSourceVerifier.verifyClaim (index.php?title=User:Alaexis/AI_Source_Verification.js&action=raw&ctype=text/javascript:1571:22)
displayResult @ index.php?title=User:Alaexis/AI_Source_Verification.js&action=raw&ctype=text/javascript:1779
index.php?title=User:Alaexis/AI_Source_Verification.js&action=raw&ctype=text/javascript:1780 Attempted to parse: ```json {"confidence": 55, "verdict": "PARTIALLY SUPPORTED", "comments": "The source supports the claim that young and small (M-type) stars produce extreme stellar flares and coronal mass ejections (CMEs) which can lead to the stripping of a planet's atmosphere, impacting habitability. However, the source does not mention 'old'
displayResult @ index.php?title=User:Alaexis/AI_Source_Verification.js&action=raw&ctype=text/javascript:1780
index.php?title=User:Alaexis/AI_Source_Verification.js&action=raw&ctype=text/javascript:1781 Original response: ```json {"confidence": 55, "verdict": "PARTIALLY SUPPORTED", "comments": "The source supports the claim that young and small (M-type) stars produce extreme stellar flares and coronal mass ejections (CMEs) which can lead to the stripping of a planet's atmosphere, impacting habitability. However, the source does not mention 'old'
displayResult @ index.php?title=User:Alaexis/AI_Source_Verification.js&action=raw&ctype=text/javascript:1781

Johnjbarton (talk) 23:42, 24 March 2026 (UTC)[reply]

Happened to me as well. Don't know if it can be fixed easily - now the json format enforcement is part of the prompt, perhaps another approach would be more reliable. Alaexis¿question? 20:07, 1 April 2026 (UTC)[reply]
This is a duplicate of #ERROR results in Gemini, which has more info. Johnjbarton (talk) 02:01, 2 May 2026 (UTC)[reply]
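
For illustration only, one way to make the parsing step above more tolerant is to strip a Markdown code fence before calling JSON.parse, and to report a cut-off response separately from other parse errors. This is a minimal sketch, not the script's actual code; the function name parseModelJson and the shape of its return value are hypothetical.

```javascript
// Hypothetical helper: not part of AI_Source_Verification.js.
// Strips an optional Markdown code fence, then attempts to parse JSON.
function parseModelJson(raw) {
  let text = String(raw).trim();
  // Remove a leading fence such as "```json"; the closing fence may be
  // missing when the response was cut off, so it is removed only if present.
  text = text.replace(/^```[a-zA-Z]*\s*/, '').replace(/\s*```$/, '').trim();
  try {
    return { ok: true, value: JSON.parse(text) };
  } catch (err) {
    // A body that opens an object but never closes it was probably truncated.
    const truncated = text.startsWith('{') && !text.endsWith('}');
    return { ok: false, truncated, error: err.message };
  }
}
```

With a result like this, the UI could distinguish "response was truncated, retry" from "response was not JSON at all", though whether that is worth the extra state is a design choice for the script author.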

No URL found

For this source

  • Hirota, Tomoya; King, Bryan H. (10 January 2023). "Autism Spectrum Disorder: A Review". JAMA. 329 (2): 157–168. doi:10.1001/jama.2022.23661. PMID 36625807.

the tool says "No URL found" and yet the user sees two URLs found. The DOI url leads to an abstract. Johnjbarton (talk) 23:42, 26 March 2026 (UTC)[reply]

The reason you see this message is that there is no url tag. The DOI is displayed as a link but there is no link in the source code. Of course it's easy enough to convert it into a url, but I think that most of the time it would lead to an abstract and it's insufficient for a verification. Alaexis¿question? 20:21, 1 April 2026 (UTC)[reply]
Because of the many issues faced by this tool, I think a "Not supported by abstract" or "Supported by abstract alone" verdict might be useful. Because most of the time we should be summarizing sources, and abstracts summarize sources, a lot of content does verify with the abstract. An "abstract" mode would be a screening tool and might be fast enough. Johnjbarton (talk) 20:38, 1 April 2026 (UTC)[reply]
Yeah, this is a good point. In essence, an abstract can serve as positive evidence but not as negative evidence. Now the tool doesn't support it - the assumption is that either a source is available (then the outcome is SUPPORTED, PARTIALLY or NOT SUPPORTED) or not available (UNAVAILABLE).
Implementing your suggestion requires distinguishing between sources that are fully and partially available, e.g., an abstract and open access article. On one hand it doesn't seem too complicated, especially if done only for cite journal citations. On the other hand it still adds complexity (determining the completeness of a source using heuristics or LLMs, adding new verdicts), so I'm a bit hesitant about doing it now.
It would be even better if we could get full access to TWL but that's even more complicated. Alaexis¿question? 10:45, 3 April 2026 (UTC)[reply]
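
As a rough illustration of the DOI point above: converting a DOI into a resolver URL is simple, and the result can carry a flag saying that the landing page may be only an abstract. This is a sketch under assumptions, not the tool's behaviour; the name doiToUrl and the abstractOnly flag are hypothetical.

```javascript
// Hypothetical helper: builds a doi.org resolver URL from a bare DOI string.
function doiToUrl(doi) {
  const cleaned = String(doi).trim().replace(/^doi:\s*/i, '');
  // DOIs start with "10.", then a registrant code, a slash, and a suffix.
  if (!/^10\.\d{4,9}\/\S+$/.test(cleaned)) {
    return null;
  }
  return {
    url: 'https://doi.org/' + encodeURI(cleaned),
    // Assumption: treat DOI landing pages as abstract-only by default, so a
    // match counts as positive evidence but a miss is not negative evidence.
    abstractOnly: true,
  };
}
```

For the JAMA citation in this section, doiToUrl('10.1001/jama.2022.23661') would give a URL under https://doi.org/ that a caller could then fetch like any other source.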

Verifying source is on-topic

One problem in weaker articles is sourced claims that are not directly about the article topic. For example, Habitable zone for complex life uses [[1]] to verify

  • Since then the list of exoplanets has grown to the thousands.

which the Source Verifier says is SUPPORTED, but the source actually says there are 361 habitable zone exoplanets. That one might be tough but many of these WP:SYNTH claims are easier to toss if we insist that the source be related to the article topic. Johnjbarton (talk) 00:02, 27 March 2026 (UTC)[reply]

For now this is out of scope. Providing additional context like the article title and preceding sentences is the key to solving this issue and similar problems (I've seen a claim with just a pronoun classified as NOT SUPPORTED because the model didn't know who the subject was). It's tricky since adding more context can impact the accuracy unpredictably. Wikimedia research folks work on a similar project and they have many more resources, hopefully they'll come up with the solution. Alaexis¿question? 20:18, 1 April 2026 (UTC)[reply]

Clicking another citation during verification

I got impatient and clicked on another citation after entering a verification request. The UI went into a new state, with "Verifying..." text and clock but not grey. I think the tool should indicate it has aborted the previous effort and is ready to try a new one. Johnjbarton (talk) 22:00, 27 March 2026 (UTC)[reply]

This should be easy to fix. Alaexis¿question? 20:31, 1 April 2026 (UTC)[reply]
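
One common pattern for the situation described above is to cancel the in-flight request whenever a new verification starts, using the standard AbortController API. This is only a sketch of that pattern; the class name VerificationSession is hypothetical and nothing here is taken from the script.

```javascript
// Hypothetical sketch: cancel the previous verification when a new one starts.
class VerificationSession {
  constructor() {
    this.controller = null;
  }

  // Returns an AbortSignal that can be passed to fetch(url, { signal }).
  start() {
    if (this.controller) {
      this.controller.abort(); // the earlier request is cancelled
    }
    this.controller = new AbortController();
    return this.controller.signal;
  }
}
```

The UI could listen for the abort on the old signal and show an "aborted, starting new verification" status, which addresses the request that the tool indicate what happened to the previous attempt.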

Interesting fail, "%" vs percent?

WP claim:

  • By the end of the mission the data suggested that at least 20% and perhaps as many as 50% of the stars visible at night have Earth-sized planets in their habitable zone.

Source

AI analysis

  • However, the source does not mention the specific statistic that 20% to 50% of stars have Earth-sized planets in their habitable zones.

Johnjbarton (talk) 02:44, 4 April 2026 (UTC)[reply]

Interesting. Please add links to the article (ideally the revision link) and the number of citation when reporting issues so that I can reproduce them. Alaexis¿question? 07:24, 4 April 2026 (UTC)[reply]
It is source 39 in the source-verifier report for the article Habitable zone. In this version it is source 31.
As I mentioned elsewhere, marking the version number or even version link on the Report would be great. The Report also has a link to source however, for testing against one source. Johnjbarton (talk) 02:38, 5 April 2026 (UTC)[reply]
@Johnjbarton, turned out this was due to the source length limit. Usually 12k characters is sufficient and it would've been sufficient in this case too if not for lots of invisible text on that page.
I've added a warning when a source hasn't been fetched completely and also I've added revision links to reports as you requested.
You've been using Citation Verifier for AfD, right? Alaexis¿question? 15:54, 8 April 2026 (UTC)[reply]
Ok thanks.
Re: AfD. No, I usually ignore sources in AfD. Most of those cases are whether any good sources exist to show notability. The ones we have are typically not important or we wouldn't have gone to AfD.
I think the sweet spot for me will be popular science articles where editors are enthusiastic for their point of view but lack experience or perspective. These articles have a higher proportion of web sources so the tool finds source content. The NOT SUPPORTED items also lead me to look at sections of articles I might not focus on. Johnjbarton (talk) 18:07, 8 April 2026 (UTC)[reply]
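
To make the source length limit discussed in this thread concrete, here is a minimal sketch of cutting a source to a character budget while recording that it was cut, so a warning can be shown. The names prepareSourceText and SOURCE_CHAR_LIMIT are hypothetical, and collapsing whitespace is only one guess at how padding might be reduced; it is not a description of how the script treats invisible text.

```javascript
// Hypothetical sketch of a length-limited source with a truncation flag.
const SOURCE_CHAR_LIMIT = 12000; // the 12k figure mentioned in this thread

function prepareSourceText(fullText, limit = SOURCE_CHAR_LIMIT) {
  // Collapse runs of whitespace so padding does not eat into the budget.
  const normalized = String(fullText).replace(/\s+/g, ' ').trim();
  const truncated = normalized.length > limit;
  return {
    text: truncated ? normalized.slice(0, limit) : normalized,
    truncated, // callers can show a "partially checked" warning when true
    originalLength: normalized.length,
  };
}
```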

Copy report

Awesome feature! If possible it would be great to add the page version info. I have been editing in one window while the Verifying is continuing in another. Johnjbarton (talk) 03:34, 4 April 2026 (UTC)[reply]

 Done Alaexis¿question? 21:20, 10 April 2026 (UTC)[reply]
Would it be possible to have the Copy Report button only copy those items that are not hidden? So, in the run I just did, I have 58 supported, 8 partial, 1 not supported, and 44 unavailable. I've got the supported and unavailable hidden, but they still show up in the copied report. It would reduce a lot of noise if they were not included. RoySmith (talk) 21:04, 12 April 2026 (UTC)[reply]
FWIW, in a run with 393 citations I had 38 ERROR entries, most of which were NOT SUPPORTED or PARTIAL. I didn't see any ERRORs that were SUPPORTED. I'm unsure which category the ERROR responses are placed in. Johnjbarton (talk) 23:48, 12 April 2026 (UTC)[reply]
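
The request above to copy only the visible items could be sketched as a filter applied before the report text is built. The data shapes here are assumptions made for the example: each result is taken to have number, verdict, claim, and comments fields, and hiddenVerdicts is the list of verdicts the user has toggled off. The function name buildVisibleReport is hypothetical.

```javascript
// Hypothetical sketch: build report text from only the verdicts that are
// currently shown. Field names are assumptions, not the script's data model.
function buildVisibleReport(results, hiddenVerdicts) {
  const hidden = new Set(hiddenVerdicts);
  return results
    .filter((r) => !hidden.has(r.verdict))
    .map((r) => `[${r.number}] ${r.verdict}\nClaim: ${r.claim}\nComments: ${r.comments}`)
    .join('\n\n');
}
```

In the run described above, hiding SUPPORTED and SOURCE UNAVAILABLE would leave only the 8 partial and 1 not supported entries in the copied text.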

Oh no, lost my work.

I ran the "Verify All Citations" and was working down the list of not supported. Then I clicked on a citation link in the article: poof the verification results are gone.

I think once we enter "Verify All" then the UI should be owned by "Verify All" until the user indicates they are done. Johnjbarton (talk) 20:29, 4 April 2026 (UTC)[reply]

@Johnjbarton, that must have been super frustrating! I've just released a fix, now it doesn't disappear when you click on a citation. Alaexis¿question? 22:06, 4 April 2026 (UTC)[reply]
Thanks! I overreacted, since all I had to do was press the button and wait again ;-) Johnjbarton (talk) 02:31, 5 April 2026 (UTC)[reply]
 Done Thanks Johnjbarton (talk) 01:58, 2 May 2026 (UTC)[reply]

Partial: Split?

I have noticed "Partial support" can mean that the source supports the last sentence, but not content before that. For example from the Report on Neutrino

  • The neutrino is so named because it is electrically neutral and because its rest mass is so small (-ino) that it was long thought to be zero. The rest mass of the neutrino is much smaller than that of the other known elementary particles (excluding massless particles).

Sourced to

The analysis is spot on and says "The source supports the second part of the claim...". Perhaps when the wikitext has two sentences and the verifier finds "partial support", we should offer "Split and re-verify" or "Verify sentences" rather than "Add failed verification". Split at the final sentence and verify both halves. I guess the results for the example above would be red/green. The red would be a candidate for a "{{citation needed}}" button.

The advantage of this kind of drill down feature is that we know this is a case where we have the sources. Johnjbarton (talk) 16:39, 9 April 2026 (UTC)[reply]

Yeah that happens a lot. I don't think I'll have time to work on this in the near future considering that the attached note usually clarifies what is supported and what isn't, as you noted yourself.
You're right that "Add failed verification" is not always the right CTA. Sometimes I'm adding "citation needed" tags to unsourced parts, sometimes I remove unsourced info and sometimes I add sources. Probably the CTA copy can be made more generic. Alaexis¿question? 21:30, 10 April 2026 (UTC)[reply]
I've changed it to "Edit section". Less action-y but more accurate. Alaexis¿question? 18:51, 19 April 2026 (UTC)[reply]
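
The "split at the final sentence" idea from this section could look roughly like the sketch below. It uses a deliberately naive sentence boundary (closing punctuation, whitespace, then a capital letter), so abbreviations followed by a capitalised word would be split in the wrong place. The function name splitAtFinalSentence is hypothetical.

```javascript
// Hypothetical helper: splits a claim into "everything before the last
// sentence" and "the last sentence", so each half can be verified separately.
function splitAtFinalSentence(claim) {
  const text = String(claim).trim();
  // Naive boundary: sentence-ending punctuation, whitespace, capital letter.
  const boundary = /[.!?]\s+(?=[A-Z])/g;
  let lastEnd = -1;
  let match;
  while ((match = boundary.exec(text)) !== null) {
    lastEnd = match.index + 1; // position just after the punctuation
  }
  if (lastEnd === -1) {
    return null; // single sentence: nothing to split
  }
  return {
    before: text.slice(0, lastEnd).trim(),
    last: text.slice(lastEnd).trim(),
  };
}
```

Applied to the Neutrino example, the first half would end at "long thought to be zero." and the second half would be the sentence about the rest mass, which is the part the verifier's comment said was supported.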

⚠ Source is long, only partially checked.

If the result is "Supported" I think the warning is confusing. We evidently have enough.

I guess you are downloading the full source and uploading only 12k to AI? We would probably get more valuable results if we included the conclusion section for sources with conclusions. I understand the cost benefit may be an issue. Johnjbarton (talk) 17:07, 9 April 2026 (UTC)[reply]

@Johnjbarton, I've disabled the warning for supported claims. Alaexis¿question? 13:57, 19 April 2026 (UTC)[reply]
thanks! Johnjbarton (talk) 16:36, 19 April 2026 (UTC)[reply]

Lead Verifier

One task where I think a tool rather similar to Source Verifier could work is to verify MOS:Lead. The two sentences:

  • Apart from basic facts, significant information should not appear in the lead if it is not covered in the remainder of the article. A lead section should be carefully sourced as appropriate, although it is common for citations to appear only in the body and not the lead.

are a type of source verification task, running sentences of the lead against the body. This task has the source at hand and almost always below 12k. Could even be a feature of Source Verifier. Unsourced sentences could be verified against the body. Sourced sentences in the lead could be processed by "Verify Claim". If the AI can pinpoint the body content for a lead sentence, that content could be subject to "Verify Claim" if sourced or else marked as {{cn}}. (The "Apart from basic facts," bit is not something I understand as a useful exception given WP:Verify.) Johnjbarton (talk) 22:52, 9 April 2026 (UTC)[reply]

@Johnjbarton, that's a great feature idea! Indeed the citation policy for the lead is different and ideally the tool should take it into account. Alaexis¿question? 20:03, 18 April 2026 (UTC)[reply]

% confidence?

I guess this is "how confident the AI is that the claim is true" and not "how confident we are that the AI assessment is accurate"? Johnjbarton (talk) 19:12, 10 April 2026 (UTC)[reply]

Yes. To be honest it's not really needed in the UI. Alaexis¿question? 21:46, 10 April 2026 (UTC)[reply]
My understanding of the latest research is that basically the LLMs have no real way to measure this anyway so it is only barely better than a guess. I would vote for removing it.
The question then is how to replace it... for my wikidata experimentation, I was doing this with (1) a slightly broader spread of confidence options for the model to indicate (see levels here) and (2) multi-model voting; in other words ask three small/cheap models "what do you think" and if they all agree, great; if they disagree, then that's a real flag for the editor. —Luis (talk) 18:50, 24 April 2026 (UTC)[reply]
I've removed it from the UI. if you were wondering why it was there in the first place, for some reason I got better responses from the models when I asked them to produce a %. Here's what WMF Research found "We stuck with requesting a confidence assessment from the model though made it more qualitative (low, medium, high). It seemed to help perhaps a little even though we don't use it." [2] Alaexis¿question? 17:00, 26 April 2026 (UTC)[reply]
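
The multi-model voting idea mentioned above can be sketched as a small tally: accept a verdict when all models agree, and flag the citation for the editor when they do not. This is an illustration of the general idea only; the function name combineVerdicts and the returned fields are hypothetical, and it is not taken from the Wikidata experiment or from this script.

```javascript
// Hypothetical sketch: combine verdicts from several models.
// Unanimous agreement is accepted; any disagreement is flagged for review.
function combineVerdicts(verdicts) {
  if (verdicts.length === 0) {
    return { verdict: null, unanimous: false, needsReview: true, tally: {} };
  }
  const counts = new Map();
  for (const v of verdicts) {
    counts.set(v, (counts.get(v) || 0) + 1);
  }
  // Pick the most common verdict; on a tie the first one seen wins.
  let top = null;
  let topCount = 0;
  for (const [v, n] of counts) {
    if (n > topCount) {
      top = v;
      topCount = n;
    }
  }
  const unanimous = counts.size === 1;
  return { verdict: top, unanimous, needsReview: !unanimous, tally: Object.fromEntries(counts) };
}
```

The tally gives the editor something more concrete than a percentage: "2 of 3 models said SUPPORTED" is a statement about observed disagreement rather than a self-reported confidence.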

Enhancement request: support sfn/harvnb style cites

Hi. Seems like this tool may be super useful. I ran it on my FAC-nominated article Nile, clicked "Verify all citations" and the output was 100% "unavailable" (475 sources). Most were like this:

 [3] SOURCE UNAVAILABLE (0%)
 Claim: 2,539 meters (8,330 ft)
 Comments: No URL found in reference

Probably operator error :-) I'm confused about "No URL found in reference" because all of my sources include URLs. I see that the tool instructions say "The tool can only verify sources available on the web. For paywalled or offline sources" ... and that may explain why the tool could not process many of the cites; but some of the cites _are_ fully available, free, online. Also: When I started it, it estimated 50 minute running time, but it completed in under 1 minute. Am I running the tool properly? Thanks in advance for any help! Noleander (talk) 19:34, 10 April 2026 (UTC)[reply]

While I have been successful on other articles, I got the same result on Nile. My guess is that the tool does not understand {{tl:harvnb}} citations. (I don't blame it ;-) Johnjbarton (talk) 20:01, 10 April 2026 (UTC)[reply]
Oh, so the tool only works when the Template:cite book or Template:cite journal is directly inside the "ref" bracket within the body text? As in:
... body text ...<ref>{{cite book |title =... | url = ... ..}}</ref>
but the tool does not (yet) process sfn/harvnb cites like:
... body text ...{{sfn|Smith|2000}}
References section: {{cite book |title =... | url = ... ..}}
Well, in that case, the tool author can consider this an enhancement request :-) Since all of my articles (and lots of FA articles) use the latter format. Noleander (talk) 20:10, 10 April 2026 (UTC)[reply]
Hi @Noleander, thanks for the feedback. Indeed, sfn citations are not supported yet. I've added it to the list of limitations for now.
Parsing citations is quite tricky, I'll try to see if I can add this capability. If I do it, would you be willing to test it? This would involve installing the dev version of the script, checking a few articles with such sources and letting me know if everything works correctly. Alaexis¿question? 21:38, 10 April 2026 (UTC)[reply]
I've stumbled across this a couple of times and would be willing to test it too. ClaudineChionh (she/her · talk · email · global) 22:24, 10 April 2026 (UTC)[reply]
Yes, I would be happy to test it. Just ping me when it is ready. Noleander (talk) 22:45, 10 April 2026 (UTC)[reply]
@ClaudineChionh, @Noleander, I've added it and it seems to be working - I did some spot checks in the Nile article. Please use the dev version User:Alaexis/AI_Source_Verification_test.js for testing. Alaexis¿question? 19:19, 11 April 2026 (UTC)[reply]
@Alaexis I ran the test version of your tool (just now) on the Nile article, and it did better than when I tried it a day or two ago. The summary of my latest test was:
Summary: 44 supported, 6 partially supported, 0 not supported, 426 source unavailable out of 476 citations. Generated by Citation Verifier using a PublicAI-hosted open-source LLM on 18:59, 11 April 2026 (UTC). Tokens used: 705,521 input, 15,409 output.
The tool appears to be working properly. Unfortunately, a lot of the sources for Nile are not available to the LLMs. So, for the Nile article, it was not able to provide a yes/no judgement on the vast majority of the sources. On the other hand, for the sources that _are_ readable, the tool seems to be working. I'm 99.999% confident that all the material in the Nile article is consistent with the sources, and the results bear that out: 44 "Supported" and 0 "not supported". So, in that sense, the tool is performing correctly. I see you made a small change to the article during your testing, thanks ... much appreciated. If I can do any more testing, or otherwise help, let me know. Noleander (talk) 21:20, 11 April 2026 (UTC)[reply]
I'm not very surprised, sfn templates are used a lot for books and journal articles which are usually not available online. Alaexis¿question? 05:42, 12 April 2026 (UTC)[reply]
@Alaexis I see you implemented parts of this. Have you looked at parsoid at all for the more general case? Not sure it really buys much, haven't looked closely enough, but any time someone else can do parsing for you... —Luis (talk) 19:35, 24 April 2026 (UTC)[reply]
@LuisVilla, no, I haven't. Worth exploring, but I have to say that I haven't seen many parsing errors - it seems like the current logic works fairly well. Supporting combined citations is a large change that would require touching this part of the code. Maybe using parsoid would make it easier. Alaexis¿question? 17:12, 26 April 2026 (UTC)[reply]
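
To illustrate the two citation layouts contrasted earlier in this section, here is a naive sketch of resolving a short footnote such as {{sfn|Smith|2000}} to the URL of the full citation in the References section, by matching surname plus year. It is regex-based and makes several simplifying assumptions: single-author works, no nested templates, no pipes inside parameter values, and only cite book, cite journal, and cite web. All function names are hypothetical, and this is not how the dev version of the script does it.

```javascript
// Hypothetical sketch: resolve {{sfn|Last|Year}} to the URL of the matching
// full citation. Naive parsing; nested templates are not handled.
function parseTemplateParams(inner) {
  const params = {};
  for (const part of inner.split('|').slice(1)) {
    const eq = part.indexOf('=');
    if (eq !== -1) {
      // Split on the first "=" only, so URLs with query strings survive.
      params[part.slice(0, eq).trim()] = part.slice(eq + 1).trim();
    }
  }
  return params;
}

function buildCitationIndex(wikitext) {
  const index = new Map();
  const cite = /\{\{\s*(cite (?:book|journal|web)[^{}]*)\}\}/gi;
  let m;
  while ((m = cite.exec(wikitext)) !== null) {
    const p = parseTemplateParams(m[1]);
    const surname = p.last || p.last1;
    const yearMatch = /\d{4}/.exec(p.year || p.date || '');
    if (surname && yearMatch && p.url) {
      index.set(surname + yearMatch[0], p.url);
    }
  }
  return index;
}

function resolveSfn(sfn, index) {
  const parts = sfn.replace(/^\{\{\s*|\s*\}\}$/g, '').split('|').map((s) => s.trim());
  // Positional parameters only: surname(s) followed by the year.
  const positional = parts.slice(1).filter((s) => !s.includes('='));
  return index.get(positional.join('')) || null;
}
```

A parser such as Parsoid, as suggested above, would avoid most of these limitations; the sketch is only meant to show why the short footnote by itself contains no URL and has to be joined to the reference list.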

Highlighting bug

  1. Open [3]
  2. Open Verify
  3. Find [24], click on it.
  • Expect: image caption as claim, highlighted
  • Actual: image caption as claim, everything else highlighted.

Johnjbarton (talk) 20:35, 10 April 2026 (UTC)[reply]

[edit]

I find I have to rely on the Report: it's too easy to lose track of the original page. However as soon as I start fixing problems identified by the Verifier, the citation numbers in the Report are off. So I use two pages, the Report and the analysed-version of the page. It would be very handy if the citation numbers in the Report were linked to the analyzed version, eg 4 Johnjbarton (talk) 03:08, 11 April 2026 (UTC)[reply]

Makes sense and it shouldn't be too hard. Alaexis¿question? 19:27, 11 April 2026 (UTC)[reply]
 Done Thanks! This works, but (of course!) led to a new problem #Positioning or timing problem for working with the Report. Johnjbarton (talk) 02:00, 2 May 2026 (UTC)[reply]
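
For readers curious how a report entry can point at the analysed revision rather than the live page, here is a minimal sketch that builds a permanent link from a title, a revision ID, and a citation number. The function name citationPermalink is hypothetical. In a user script the revision ID would typically come from mw.config.get('wgRevisionId'); here it is passed in as a parameter so the example is self-contained. The anchor format is an assumption that holds only for unnamed references.

```javascript
// Hypothetical helper: link a report entry to the revision that was analysed.
function citationPermalink(title, revisionId, citationNumber) {
  const params = new URLSearchParams({
    title: title.replace(/ /g, '_'),
    oldid: String(revisionId),
  });
  // Assumption: unnamed references get anchors of the form cite_ref-N;
  // named references use a different pattern and are not handled here.
  return `https://en.wikipedia.org/w/index.php?${params}#cite_ref-${citationNumber}`;
}
```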

Issue with the "Verify" tab depending on installation method

I installed this script earlier by pasting {{subst:iusc|User:Alaexis/Scripts/AI Source Verification.js}} into my common.js, and the "Verify" tab was not visible anywhere; not next to the "Article" and "Talk" tabs, nor any of the dropdown menus (I typically use Vector 2022, but I did switch to Vector 2010 to check, and it didn't work there either). I removed the code from my common.js, tried again using the script installer, and everything worked as expected. TeoTB (talk) 11:53, 17 April 2026 (UTC)[reply]

Hi, it looks like a caching issue. If this ever happens again (unlikely) I'd be grateful if you can share the contents of your browser console (Developer Tools -> Console). Alaexis¿question? 19:59, 18 April 2026 (UTC)[reply]
Will do. In any case, great work on the tool, and I hope to see more editors using it! I've tried it on a few articles and it works quite nicely, even though I'm using the free model. TeoTB (talk) 19:39, 20 April 2026 (UTC)[reply]
That's great to hear, thank you! Alaexis¿question? 20:11, 20 April 2026 (UTC)[reply]

Experience on an article in featured article review

I applied the Source Verifier to Black hole as part of its FAR review. This article had already been manually reviewed by several editors and in my opinion would have passed FA. I generated and saved a Report then another editor and I worked through 99 citations (out of 393 in the article) that had issues, and either fixed the article or checked off that the Verifier was too aggressive or off base.

We found a couple of claims that were plain wrong and numerous ones that were stated in a way that did not match the source. I found new sources for some of the claims. Quite a number of cases had a paragraph of text with two or three citations which together verified the paragraph, but the Verifier marked each individual citation off for different parts of the paragraph. Some of these were fixed by moving the citations, others were just checked off. Only about half of the citations in the paper were accessible to the Verifier, but the process of working the other half also verified most of the total.

This process was a lot of work but the Comments from the Verifier and the ability to step through the Report systematically made a huge improvement over manually verifying citations in this long technical article. Johnjbarton (talk) 02:30, 23 April 2026 (UTC)[reply]

Good to hear that. Probably WT:FACR is a good place to get feedback on source verification, I'm going to leave a note there. Alaexis¿question? 17:14, 25 April 2026 (UTC)[reply]

Faster pace of change? (aka a call for alpha testers)

Hi, all! I have started throwing some changes at @Alaexis. I have two definite goals and one maybe goal:

  1. make the script more robust and testable. For those who aren't software developers, the basic idea here is that if you have very good tests, you can make changes more quickly because you know the tests will catch mistakes before you ship them to the public. For example, if we make tweaks to make the tool faster or cheaper to run - do those break things? if we have automated tests, we can catch that before shipping it to you.
  2. make the tool better at the core job of cite checking. Once there are more tests we can potentially do a lot of interesting things to make the tool better. Just off the top of my head: feed the model relevant sections from the cited page, instead of just the first 12k characters (already built, but not integrated yet); integrate Cite Unseen for a check on cite quality; use multi-model voting for higher reliability (possibly at lower compute cost - ie faster and better for the planet); integrate with Wikipedia Library to peek behind paywalls; pull information from Citoid to supplement the cite check; etc. etc.
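
One way the "relevant sections" idea could work is sketched below. This is only an illustration, not the implementation mentioned above as already built: the function name `selectRelevant`, the word-overlap scoring, and the ASCII-only tokenizer are all assumptions made for the example.

```javascript
// Illustration only; not the implementation mentioned above.
// Picks the paragraphs that share the most words with the claim,
// up to an approximate character budget (separators are not counted).
function selectRelevant(sourceText, claim, budget = 12000) {
  // Assumption: words of 4+ ASCII letters/digits are a good enough signal.
  const words = new Set(claim.toLowerCase().match(/[a-z0-9]{4,}/g) || []);
  const scored = sourceText
    .split(/\n\s*\n/)
    .map((text, index) => {
      const tokens = new Set(text.toLowerCase().match(/[a-z0-9]{4,}/g) || []);
      let score = 0;
      for (const w of words) if (tokens.has(w)) score++;
      return { text: text.trim(), index, score };
    })
    .filter(p => p.text.length > 0);

  // Highest overlap first; earlier paragraphs win ties.
  const byScore = [...scored].sort((a, b) => b.score - a.score || a.index - b.index);
  const chosen = [];
  let used = 0;
  for (const p of byScore) {
    if (used + p.text.length > budget) continue;
    chosen.push(p);
    used += p.text.length;
  }
  // Restore document order so the excerpt still reads naturally.
  return chosen.sort((a, b) => a.index - b.index).map(p => p.text).join('\n\n');
}

const sampleSource = [
  'Site navigation and menus.',
  'Veliaj founded the MJAFT movement in 2003.',
  'Unrelated sports results from the weekend.',
].join('\n\n');
console.log(selectRelevant(sampleSource, 'Veliaj founded MJAFT in 2003', 50));
```

A scorer this simple would miss paraphrases and non-Latin scripts, so it is better read as a test fixture for the idea than as a proposal.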

The maybe goal is: make the tool better at growing the commons. Once a user has worked with the script and the article to validate a cite, it is possible to start doing creative things with this information. For example, it probably isn't very hard to build an "add this fact to Wikidata" button that would extract any factual knowledge within the claim and add it to Wikidata. I can imagine a lot of other things like that (WikiCite, Archive, and Zotero all come to mind.) I'm not sure this is the right tool for that, but I'd like to experiment.

If you're interested in helping move this forward, please help me test the changes! For right now, I've put a new version of the script into

User:LuisVilla/AI Source Verification-alpha-test.js

That tests GitHub change #118, which is the first step in "make more robust and testable".

Because the changes so far are small, this should work exactly the same as the old version. Please give it a shot and let me know if anything has changed. (or also, let me know if you've tried it and nothing changed - that's also good to know!) If it works reliably for everyone, then we can really pick up speed and make it even more accurate and useful.

Thanks- —Luis (talk) 17:46, 24 April 2026 (UTC)[reply]

I applied the Source Verifier to Habitable zone a second time and saved the Report in User:Johnjbarton/sandbox/Habitable_zone. Except for the fact that I ran out of money the results seem comparable to my previous run. Johnjbarton (talk) 20:47, 24 April 2026 (UTC)[reply]
@Johnjbarton thanks so much! I have some ideas to make it more efficient so you’ll be less likely to run out of money in the future ;) —Luis (talk) 04:33, 25 April 2026 (UTC)[reply]
Just in case anyone is curious, I was using the lowest paid tier of Google Gemini with a $5/month spend limit. I ran out on 23rd of the month. Johnjbarton (talk) 16:06, 25 April 2026 (UTC)[reply]
@LuisVilla Neat! I hadn't tried this tool before, and in just a few minutes of testing on articles that I've already identified as needing better citations (microblading and Erion Veliaj), I got several tips about potential partially-supported claims, which should help me fix them up more efficiently. I like the idea of integrating metadata from Cite Unseen. Related tools I'm thinking about: Meta:Research:Micro-task Generator for Organizers on Wikipedia, MW:VisualEditor/Suggestion Mode.
Tested so far in both the main version and yours, no differences observed:
  • Check a handful of different types of individual well-formatted citations, such as: Google Books, NYTimes, book citation with no URL, Internet Archive book, YouTube, website page.
  • Check citations to web pages in a couple of non-English languages.
  • Check an entire article and copy the plain text of the report.
  • Paste in source text to check a citation.
  • Try using default model and Gemini.
Difference observed, although I suspect this is noise - first row is your version, second row is original (using default model):
# Verdict Confidence Source Comments
[16] ERROR source Failed to parse AI response: { "confidence": 90, "verdict": "SUPPORTED", "comments": ""It was 2003, when Erion Veliaj founded the “MJAFT” movement and led it for 4 years, until November 2007." - The source text confirms Vel
[16] checkY Supported 90% source "It was 2003, when Erion Veliaj founded the ‘MJAFT’ movement and led it for 4 years, until November 2007." - The source text confirms Veliaj founded MJAFT in 2003, supporting the claim that he was one of its earliest activists.
Dreamyshade (talk) 00:32, 25 April 2026 (UTC)[reply]
Thanks for testing it! I’ll take a poke at that difference. Suspect it’s a transient failure of the model to emit reliable json but not sure.
This is definitely the sort of thing that the VE team could put into suggestion mode eventually, though doing it at scale will require a lot of compute no matter how much we can optimize it :) —Luis (talk) 04:32, 25 April 2026 (UTC)[reply]
@LuisVilla, thank you for making it much more robust and testable! Testing changes has been taking a lot of my time, so I hope we'll be able to ship new things much faster. I've merged the changes to the main branch.
@Johnjbarton, @Dreamyshade - thanks for testing. Alaexis¿question? 17:04, 25 April 2026 (UTC)[reply]
I did my first review using this script and the free LLM: User_talk:Rjjiii#Citation_verification_report_(Alien_vs_Predator_(Atari_Jaguar_video_game))
A few thoughts:
  • The LLM is often too willing to say that a source supports the Wikipedia article.
  • Confidence seems tied to support/partial support/not support. Is it? It should be possible to get 100% confident it doesn't support.
  • The LLM treats direct and unattributed quotation as directly supported which is kind of technically true?
  • The LLM seems not to know when an article claim has multiple citations.
Rjjiii (talk) 06:47, 26 April 2026 (UTC)[reply]
@Rjjiii, thanks for the feedback. #4 is a known limitation and supporting multiple citations is in the roadmap. The confidence label indeed was misleading, I'm going to remove it for now.
Can you share permalinks to the citations where you experienced #1 and #3? I want to test it with a different model. Alaexis¿question? 09:02, 26 April 2026 (UTC)[reply]
This version: https://en.wikipedia.org/w/index.php?title=Alien_vs_Predator_(Atari_Jaguar_video_game)&oldid=1350630398
Gave these results: User_talk:Rjjiii#Citation_verification_report_(Alien_vs_Predator_(Atari_Jaguar_video_game))
In the feedback column I noted issues.
For point #1 examples, check out citations: 19 regarding cartridge capacity and 32 regarding staff and development kits.
For point #3 examples, check out citations: 38 for "taking a bite out of an apple" and 38 again for closing paraphrasing about "playable" levels.
Hope the specific examples help, Rjjiii (talk) 17:44, 26 April 2026 (UTC)[reply]
@Rjjiii. Your first example is a classic LLM hallucination. Better models hallucinate much less. I've re-checked the first example using Claude Sonnet 4.6 and the analysis was spot on
There's not much that can be done about it short of paying for better models or waiting for better open-source models. Alaexis¿question? 08:36, 27 April 2026 (UTC)[reply]
However, the claim about 'increasing the cartridge capacity' as part of this proposal is not mentioned in the available source text. Ah, Claude's spot on. It's just the model then. Rjjiii (talk) 13:57, 27 April 2026 (UTC)[reply]
I do think there are some things that we can do to make this more reliable for smaller models, but it will require literal experimentation so we have to improve the test and regression framework first. —Luis (talk) 13:34, 29 April 2026 (UTC)[reply]
As to the "apple bite" example, I think that "if a design team member said that this happened in an interview (setting aside the source reliability concerns), then we can state it in wikivoice" is a defensible position. The prompt does try to detect more obvious inconsistencies (Distinguish between definitive statements and uncertain/hedged language. Claims stated as facts require sources that make definitive statements, not speculation or tentative assertions.) It would've been simpler if English had had explicit evidentiality markers, I suppose. If you have more examples like this please share them. Alaexis¿question? 09:20, 27 April 2026 (UTC)[reply]

The Wikipedia Library workflow

Here is a way that we might combine TWL and Source Verifier.

  1. While viewing the Report, e.g. User:Johnjbarton/sandbox/Habitable_zone, citation [13] is an abstract that partly confirms. Does the article have more?
  2. Click on the citation number, [13], to see (oops we have to click the Verifier closed) a DOI link and the places cited.
  3. Open the DOI link in a new page. (I use the proxy in Zotero, see Wikipedia:Citing_sources_with_Zotero#Zotero_as_a_proxy_for_The_Wikipedia_Library, to go to the full source).
  4. Copy some of the source article, eg Introduction
  5. Return to the tab with the article, click on the citation or a, b,c links.
  6. (oops we have to open Verify)
  7. (oops now the content is out of view, search for [13])
  8. Click on [13] in the article.
  9. (oops, since the Verifier finds the abstract, it won't let us paste content)

If we had a "Override content" button for the last oops we could take some trial runs on copy/paste verification. Johnjbarton (talk) 23:52, 24 April 2026 (UTC)[reply]

@Johnjbarton, override content makes a lot of sense. Re #2, why do you need to close Verifier? For me (I'm on Vector 2022) clicking or hovering on the citation number simply displays the citation and I can open-in-a-new-tab any links within it, it doesn't affect the sidebar. Alaexis¿question? 17:11, 25 April 2026 (UTC)[reply]
In #2 I am still on the Report from #1, and the [13] in the Report is linked to https://en.wikipedia.org/wiki/Special:PermanentLink/1350388531#cite_note-Lingam-2021-13. I was just describing my workflow based on the Report, which is how I envision TWL feature being most useful.
The "reopening" of the Verifier when revisiting the page is something that I find annoying, but I don't know if you can control that. Johnjbarton (talk) 17:17, 25 April 2026 (UTC)[reply]
Yeah, ideally there would be a seamless transition between the report mode and individual mode. In the latter you'd be able to paste content manually, control which url is used, have an external link to TWL, etc. Probably makes sense to start with the override. Alaexis¿question? 16:40, 26 April 2026 (UTC)[reply]
@Johnjbarton just added the override.  Done Alaexis¿question? 16:21, 30 April 2026 (UTC)[reply]
Thanks! I tried it on the example above and it worked perfectly! Johnjbarton (talk) 17:00, 30 April 2026 (UTC)[reply]

ERROR results in Gemini

I decided to try Gemini debugging via AI chat on the truncated JSON failures I see. For example User:Johnjbarton/sandbox/Habitable zone, citations [26], [27], and [28]. These fails are 100% reproducible at least when running against the same version of the article.

The fails are invalid JSON chopped off at 79 characters. The AI suggests:

  • The Problem: If the answer is complex, the model often runs out of "reasoning" room, gets confused, and just stops (often around that 70-80 token mark).

My interpretation of the fix it suggests is to replace the plain text instructions for the response in the System Prompt ("Respond in JSON format:" and "Confidence guide:") with structured data instructions via response_schema={ "type": "object", "properties": { "confidence": {"type": "number"}, "verdict": {"type": "string"}, "comments": {"type": "string"} } }

The AI says this approach will work in OpenAI and others and claims that the reduced system prompt token count will improve performance, and be generally wonderful. HTH Johnjbarton (talk) 20:51, 26 April 2026 (UTC)[reply]
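
Independently of any provider's schema support, the reply could be checked against the same three fields before it is shown. The sketch below is only an illustration: `parseVerdict` and its error messages are hypothetical and not taken from the script; the field names and types come from the schema quoted above.

```javascript
// Hypothetical helper, not part of the script: checks a raw model reply
// against the three fields named in the proposed response schema.
function parseVerdict(raw) {
  let data;
  try {
    data = JSON.parse(raw);
  } catch (err) {
    // Truncated replies (for example, cut off mid-string) end up here.
    return { ok: false, error: 'invalid JSON: ' + err.message };
  }
  if (data === null || typeof data !== 'object' || Array.isArray(data)) {
    return { ok: false, error: 'reply is not a JSON object' };
  }
  const expected = { confidence: 'number', verdict: 'string', comments: 'string' };
  for (const [field, type] of Object.entries(expected)) {
    if (typeof data[field] !== type) {
      return { ok: false, error: 'field "' + field + '" should be a ' + type };
    }
  }
  return { ok: true, value: data };
}

const complete = parseVerdict('{"confidence": 90, "verdict": "SUPPORTED", "comments": "ok"}');
const truncated = parseVerdict('{"confidence": 90, "verdict": "SUPPORTED", "comments": "It was 2003');
console.log(complete.ok, truncated.ok);
```

A check like this does not prevent truncation, but it would let the tool tell "the model stopped early" apart from other failures and, for instance, retry once.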

Wow, I thought that these were just random errors but it looks like you are right regarding the root cause. The problem with using response schemas is that each provider works differently and not all of them support them. So the tradeoff here is that we'd need to maintain different prompts for each provider. I'm thinking of a different solution: limiting Gemini's thinking budget to, let's say, 1000 tokens. This may cause some degradation of performance but probably thinking isn't that important for a classification task. The overall budget can be kept at 2k or even increased. Alaexis¿question? 11:14, 28 April 2026 (UTC)[reply]
I had hoped that maybe we could switch to openrouter, which has some infrastructure to handle at least some of this. But I don't see how we can switch to openrouter and still let people bring their own keys. Something to think about :(
@Alaexis should one of us open a GH issue to track this? —Luis (talk) 13:40, 29 April 2026 (UTC)[reply]
Yes, please open one.
I think all power users use BYOK. Alaexis¿question? 08:06, 30 April 2026 (UTC)[reply]
Yes, but if you BYOK for openrouter you can use a wider variety of models with one key/budget. I might go ahead and add support for that tonight. —Luis (talk) 22:09, 1 May 2026 (UTC)[reply]

"AI Guidelines" compliance?

Meta:AI Guidelines has been put together by a few folks and I thought it would be interesting to walk through what it would take for this tool to comply.

The big thing the model requires that this tool doesn't have is the "tool card"; if this is accurate, providing links and notices so that users can find the tool card would be pretty trivial.

AI Source Verification tool card [DRAFT]

Last updated: 2026-05-01 based on revisions to the Guidelines

This page documents the AI Source Verification user script, in line with the voluntary Wikimedia AI Guidelines. Its structure follows the five clusters set out in section 2 of those guidelines.

Identity

Scope of this card

AI Source Verification is a tool that wraps third-party language models, not a model itself. The sections below cover what is first-party to this tool (prompts, claim extraction, evaluation, data flow, governance) and link out to upstream model cards for what is not (training data, training procedure, model architecture).

Purpose of the tool

The purpose of the tool is to help logged-in editors (currently primarily on English Wikipedia) review citations, either one-by-one or through a report mode that can process an entire article. Automated editing is out of scope. The tool is not a bot and does not require the standard bot approval process. There is no batch-edit or auto-edit mode. However, because the code is open source, it could be reused for that purpose.

Models

Models used

The tool does not train or host any model. It dispatches to one of the following providers per request:
  • PublicAI (default; no API key required): routes to Qwen-SEA-LION v4 32B-IT, an open-weights (Apache-2.0) Southeast-Asian-language fine-tune of Qwen3 published by AI Singapore. PublicAI is a public-benefit, non-profit-funded inference provider.
  • Anthropic Claude Sonnet 4.6 (claude-sonnet-4-6): proprietary; see Anthropic's models overview and Anthropic's Transparency Hub for evaluations and system cards.
  • Google Gemini Flash (gemini-flash-latest): proprietary; see Google's Gemini models documentation. The -latest alias is a moving pointer that Google may re-route to a newer Flash variant; the tool follows whichever version Google currently serves under that alias.
  • OpenAI GPT-4o (gpt-4o): proprietary; see OpenAI's model documentation and the GPT-4o System Card. The gpt-4o name is a moving pointer that OpenAI re-routes to its current dated snapshot.

Benchmark-only models

The benchmark suite also evaluates the tool against additional open-weights models that are not currently dispatched to in production:
  • Apertus: open-weights model included for evaluation purposes only.
  • OLMo (Allen Institute for AI): open-weights model included for evaluation purposes only.
These models are tested via the benchmark in benchmark/ (in the repo) but are not exposed to editors through the user script.

Future model opportunities

One possible goal of this project is to improve citation assessment and prompt quality so that the tool can reach sufficient levels of performance when using smaller, open-weight models. This will require continued development of test cases and test frameworks. Locally hosted models, while preferred by the guidelines (see section 3), are difficult to integrate into a browser-based user script at this time.

System

System architecture

The current workflow is that, when given a Wikipedia citation (claim text + cited source URL), the tool fetches the source, compares it against the claim, and returns a verdict of Supported / Partially Supported / Not Supported / Source Unavailable to assist a human editor's review. The source code of the tool has two main components:
  • User script: Browser-side user script (source) stored in typical MediaWiki userscript mechanisms, and executed by the editor's browser.
  • Proxy: A Cloudflare Worker at publicai-proxy.alaexis.workers.dev (source) accepts requests from the user script, fetches the cited source, and forwards the claim and source text to the chosen LLM provider.

Prompts

The verification prompt and claim-extraction logic live in core/prompts.js and core/claim.js (source). Both are checked into public source control for transparency and change management.

Data flow

For each verification, the following is sent from the editor's browser to the proxy:
  • The article URL and citation number (to the proxy, for source fetching and logging).
  • Up to ~12k characters of source text fetched from the cited URL (to the proxy, and then to the selected model provider).
  • The extracted claim text (by the proxy, to the selected model provider).

No editor identity, IP, or session cookies are forwarded to model providers. The proxy stores per-verification metadata (article URL, citation number, source URL, provider, verdict, confidence) in a Postgres log used for evaluation. No article-text or source-text bodies are stored.

Evaluation

Ethical considerations

  • Human in the loop: verdicts are advisory; no edit is made without an editor's explicit action.
  • Editor friction: the tool's user interface has a load-bearing tension between ease of use and accuracy of verification: changes will have to be made with that in mind, as improvements to speed of use may reduce a user's desire to double-check the tool's work.
  • Provider choice: editors can choose an open-weights provider (PublicAI) by default and avoid forwarding text to proprietary providers if they prefer, but this may cause a reduction in accuracy.

Benchmark and results

Accuracy is measured against a public benchmark in benchmark/dataset.json (currently 77 claim/source pairs upstream; expansion incorporating a Wikimedia Research source-verification dataset and user-contributed data is in flight). Headline numbers, current models:
  • Detection of unsupported citations: ~70%.
  • Misfire rate on supported citations: <15%.
Per-provider metrics (exact accuracy, lenient accuracy, binary accuracy, latency, calibration) are produced by npm run analyze and reported in the repo.

Future evaluation opportunities

  • The current benchmarking dataset is overweight on verified citations and needs more rejected/unverified citations for robustness.
  • The current benchmarking report is not available on-wiki or in other human-readable form.
  • The data source has some multilingual cites, but not enough.
  • We could crowdsource the benchmarking data, for reuse in other projects.

Limitations and biases

Tool-level:
  • Online sources only; PDF extraction is best-effort. Integration of Wikimedia Library and Archive copies is wishlisted.
  • Mostly fails on paywalled or AI-blocking sources, which may privilege certain classes and sources of information.
  • Sentences with multiple interleaved citations can confuse claim extraction.
  • Source content is truncated at ~12k characters.
  • Model accuracy varies by provider; see benchmark.

Inherited from underlying models:
  • English-language bias in the proprietary models is well documented; performance on non-English Wikipedias has not been systematically evaluated.
  • Each provider's training data carries its own biases (over-representation of widely indexed web text, under-representation of paywalled scholarly sources, etc.) which can affect verdicts on certain source types. By design, the tool's prompts try to direct the model not to rely on pre-trained knowledge, but this cannot be entirely prevented, especially with smaller models.

Reproducibility

  • Source code: linked above for both the user script and the proxy.
  • Prompts: core/prompts.js in the user-script repo.
  • Benchmark dataset and runner: benchmark/ in the user-script repo; npm run benchmark reproduces results given the appropriate provider API keys.
  • Required environment: Node.js for the benchmark; provider API keys for whichever providers are being evaluated.

Stewardship

Monitoring

There is currently no scheduled re-evaluation or automated monitoring of accuracy in production. The proxy log captures per-verification metadata (article URL, citation number, source URL, provider, verdict, confidence) that could support such monitoring in the future.

However, the benchmark suite is run manually before significant changes.

Governance

  • Bug reports and feature requests: GitHub issues.
  • Discussion: talk page. Major behavior changes will be announced on the talk page before deployment.

Licensing

User script and benchmark suite: see the repository LICENSE. Proxy worker: see the repository LICENSE. Benchmark dataset: distributed with the repository under the same license. Underlying models: each upstream provider's terms apply to model outputs. Open-weights models (Apertus, Qwen, OLMo) carry their respective open licenses; proprietary providers (Claude, GPT, Gemini) carry their vendors' terms of service. —Luis (talk) 04:16, 30 April 2026 (UTC)[reply]

I would move "In specific,..." in the Purpose section to Data flow section. The details of the workflow could change and are not intrinsic to the purpose.
I would remove the Editor Friction section since I would oppose it, were it to be "announced on the talk page before deployment".
The biggest bias in the tool is against paywalled sources. They are not official second class sources in Wikipedia but cannot be verified by the tool. Johnjbarton (talk) 17:12, 30 April 2026 (UTC)[reply]
Say more about the editor friction concern @Johnjbarton? Definitely not thinking anything like pre-approval (among other problems, would go against WP:BOLD). But you're right, maybe this is not the right place for that.
Good point about paywalled sources being a source of a type of bias, I'll fix that momentarily. —Luis (talk) 17:25, 30 April 2026 (UTC)[reply]
I'm an editor: friction==bad. Source review is a world of hurt, more can't be better. Johnjbarton (talk) 17:35, 30 April 2026 (UTC)[reply]
OK, fair ;) I've rewritten to this, which I hope accurately captures that there is a tension here, even as, of course, if the tool makes your work harder then you won't use it :)
Ethical considerations: Editor friction: the tool's user interface has a load-bearing tension between ease of use and accuracy of verification: changes will have to be made with that in mind, as improvements to speed of use may reduce a user's desire to double-check the tool's work. —Luis (talk) 18:23, 1 May 2026 (UTC)[reply]
@LuisVilla perhaps the tool should be renamed to Verification Assistant. Limited access to sources means full verification is out of reach. If you recalibrate your goal, ease of use will rise on the list. Plus what significant consequence of failing to double check can we imagine? Johnjbarton (talk) 22:09, 1 May 2026 (UTC)[reply]
Yeah, there's a hard question about how "in the loop" humans need to be--especially since, as you noted in another comment, currently it is way too eager to approve citations. So yeah, maybe assistant would be a better name. Give me a few weeks to help improve the quality though ;) —Luis (talk) 03:36, 2 May 2026 (UTC)[reply]
New draft published. —Luis (talk) 19:33, 1 May 2026 (UTC)[reply]

Positioning or timing problem for working with the Report.

  1. Comet in a new tab and click on Verify to open the verifier.
  2. In a new tab, Open User:Johnjbarton/sandbox/Habitable_zone
  3. scroll down and click on [13] in the # column.

A workaround is to go to the address bar, click in it, and hit Enter.

I guess that the Verifier inserts in the DOM after the page has rendered, triggering a reflow. So maybe there is a way to re-visit the URL fragment after the tool opens. Johnjbarton (talk) 18:03, 30 April 2026 (UTC)[reply]
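
The "re-visit the URL fragment" idea could look roughly like the sketch below. It is an assumption-laden illustration, not the script's code: `rescrollToFragment` is a hypothetical name, and where exactly it would be called after the panel is inserted is left open.

```javascript
// Hypothetical helper: re-applies the URL fragment after the panel has been inserted.
// `doc` is a Document-like object and `hash` is a string such as location.hash.
function rescrollToFragment(doc, hash) {
  if (!hash || hash.length < 2) return false;
  let id;
  try {
    id = decodeURIComponent(hash.slice(1));
  } catch (err) {
    id = hash.slice(1); // malformed escape sequence: fall back to the raw text
  }
  const target = doc.getElementById(id);
  if (!target) return false;
  target.scrollIntoView();
  return true;
}

// In a browser this might be called once the panel is in place:
//   rescrollToFragment(document, location.hash);
```

Passing the document and hash in as arguments keeps the function testable outside a browser, which fits the goal of making the script more testable.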

@Johnjbarton, I couldn't reproduce it, I get exactly the expected view. If it still happens, can you explain what goes wrong? Alaexis¿question? 13:46, 4 May 2026 (UTC)[reply]
Oh sorry, I didn't realize there was an additional requirement. Source Verifier needs to be opened in some other tab. I changed the steps to match. Johnjbarton (talk) 14:55, 4 May 2026 (UTC)[reply]
Wanted to acknowledge that I could reproduce it. Will keep you posted. Alaexis¿question? 19:21, 5 May 2026 (UTC)[reply]

Experience applying to an AI-cleanup

I applied the Source Verifier to Gareth Thomas (materials scientist) when it was tagged for suspected {{AI-generated}}. The tag says the article

  • ...may include hallucinated information, copyright violations, claims not verified in cited sources, original research, or fictitious references.

It turned out that, in my opinion, the sourcing was generally fine and the tool helped by pointing out some areas of partial support that I cleaned up.

However, I learned of a limitation in the tool that should be mentioned in the documentation: copyright violations in the article content will generally result in green "SUPPORTED" markers. We have independent copyvio tools, but a less rigorous check during the source verification could be a possible enhancement. Johnjbarton (talk) 02:10, 2 May 2026 (UTC)[reply]

Which copyvio tools are you thinking of @Johnjbarton? Happy to poke and see if we can incorporate or reference them. —Luis (talk) 04:12, 2 May 2026 (UTC)[reply]
Wikipedia:Copyright_problems/Instructions#Useful_scripts_and_tools, though I have no experience with these. I assume these are all algorithmic tools. But it seems to me that an AI tool could do some simple copyvio check against a source cheaply as a side effect of a source verification.
From an older version of a page, I opened the Verifier, clicked on a citation, then copied the Selected Claim field. Then I clicked the link in the Source Content box. In that new tab in Chrome I typed and pasted the Claim:
  • Check the content to see if the following text could be a close paraphrase: This won favor with the Department of Energy, becoming a line item in the congressional budget of 1980. This led to the establishment of the National Center for Electron Microscopy at Lawrence Berkeley National Laboratory in 1983.
This claim was previously deleted as copyvio by an editor.
The AI response was very encouraging:
  • The text you provided is a close paraphrase, as it mirrors specific wording and factual details from official historical accounts of the National Center for Electron Microscopy (NCEM).
  • Specifically, the text matches information from a biographical tribute to Gareth Thomas published by the National Academy of Engineering (NAE), which states that a proposal for the center "won favor with the Department of Energy, becoming a line item in the congressional budget of 1980". [1]
Johnjbarton (talk) 16:08, 2 May 2026 (UTC)[reply]
Indeed there are many sourcing-related checks that we do as human editors and that AI could facilitate: verification per se, copyvio, source reliability, due weight, etc. Given limited resources, I'd rather have the tool do one thing well rather than trying to catch all problems. I understand that from your point of view it would be convenient to have one report with all sourcing-related issues. Alaexis¿question? 17:09, 2 May 2026 (UTC)[reply]

Dark mode

I have issues reading some text in dark mode. Specifically noticed it with ⚠ The source is long and can only be checked partially. Thank you, Wracking talk! 03:37, 2 May 2026 (UTC)[reply]

Hi @Wracking, just fixed it  Done. Alaexis¿question? 21:23, 3 May 2026 (UTC)[reply]

False positives rate - actual data

@Rjjiii you mentioned that you think the tool is too willing to accept bad cites as good cites. I think that's right but we needed data to prove that and begin to find the problems. So we have about 100 more samples now, and can start putting together some real data.

Provider             Not-supported ("ground truth")    Misclassified by LLM as Supported-of-any-kind    Rate
Gemini 2.5 Flash     46                                8                                                17.4%
Claude Sonnet 4.5    46                                9                                                19.6%
Qwen-SEA-LION        22*                               7                                                31.8%
Apertus 70B          22*                               12                                               54.5%

So in other words, depending on the model, anywhere from 17% to 55% of wrong citations are being reported as at least partially supported.

That's... not great! I have a lot of ideas on how we can fix that, as well as ideas for how we can get your help to test and improve that. But also it is 9pm on a Friday so probably that's... tomorrow. For now, just wanted to share this data for transparency's sake. —Luis (talk) 04:17, 2 May 2026 (UTC)[reply]

Thanks for posting the FPR. Just one quibble, I think it makes more sense to group Partially Supported with Unsupported. In both cases the user gets the alert and hopefully checks the source. Thus, misclassifying Not Supported as Partially Supported is less damaging since the error would most likely be fixed. At least this is how I work with the tool. Alaexis¿question? 17:12, 2 May 2026 (UTC)[reply]
I also group Partial with Unsupported (still like to see both). I think the data would be more helpful if stated the way the tool classifies them. So if the tool says SUPPORTED what is the probability that a human would say SUPPORTED. This number is important because I think editors will work the way I do: ignore the SUPPORTED entries and focus on the ones AI suggests are incorrect. Johnjbarton (talk) 19:25, 2 May 2026 (UTC)[reply]
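
Both readings of the data discussed in this thread can be computed from the same human-labelled pairs. The sketch below is illustrative only: the function name `rates`, the label strings, and the sample pairs are invented for the example and are not benchmark results.

```javascript
// Illustrative only: the sample pairs below are invented, not benchmark data.
// Each pair is { human, tool } with values 'SUPPORTED', 'PARTIAL' or 'NOT_SUPPORTED'.
function rates(pairs) {
  // Share of human-labelled NOT_SUPPORTED cites that the tool passed as SUPPORTED or PARTIAL.
  const notSupported = pairs.filter(p => p.human === 'NOT_SUPPORTED');
  const missed = notSupported.filter(p => p.tool !== 'NOT_SUPPORTED');
  // Share of tool-labelled SUPPORTED cites that a human also calls SUPPORTED.
  const toolSupported = pairs.filter(p => p.tool === 'SUPPORTED');
  const agreed = toolSupported.filter(p => p.human === 'SUPPORTED');
  return {
    missRate: notSupported.length ? missed.length / notSupported.length : null,
    supportedPrecision: toolSupported.length ? agreed.length / toolSupported.length : null,
  };
}

const samplePairs = [
  { human: 'SUPPORTED', tool: 'SUPPORTED' },
  { human: 'SUPPORTED', tool: 'SUPPORTED' },
  { human: 'SUPPORTED', tool: 'PARTIAL' },
  { human: 'NOT_SUPPORTED', tool: 'SUPPORTED' },
  { human: 'NOT_SUPPORTED', tool: 'PARTIAL' },
  { human: 'NOT_SUPPORTED', tool: 'NOT_SUPPORTED' },
  { human: 'NOT_SUPPORTED', tool: 'NOT_SUPPORTED' },
];
console.log(rates(samplePairs));
```

The first number corresponds to the rate column in the table above; the second is the "if the tool says SUPPORTED, would a human agree" figure, which matters most to editors who skip the SUPPORTED rows.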

Pop under bug

  1. Open the Verify panel
  2. Adjust the panel width to cover 1/2 of the page
  3. Click on Change Key button

Expect: UI able to change Key

Actual: UI only able to cancel dialog because it is under the Verify panel. Johnjbarton (talk) 17:05, 4 May 2026 (UTC)[reply]

Fixed. Alaexis¿question? 18:21, 9 May 2026 (UTC)[reply]

Timing bug

  1. Open this version of the page on Robin Nievera.
  2. Open Verifier. Find claim Nievera had recently signed with Viva Artists Agency, which he said "helped with rehearsals" for the show.
  3. Click on [27]. UI says "Content fetched successfully". Click Verify Claim.

Expected SUPPORTED. Actual SOURCE UNAVAILABLE The provided source text contains the article headline and website navigation/metadata, but the actual article body is missing (indicated by 'Loading content...'). There is no content available to verify the claim regarding Viva Artists Agency or rehearsals. If I copy the content into the "Paste source text manually" feature, SUPPORTED is returned.

If you manually load the source you will see an intermediate loading stage before the content. Johnjbarton (talk) 02:42, 12 May 2026 (UTC)[reply]

@Johnjbarton, exactly, the problem is with the intermediate loading stage. Everything is solvable, but every change to the fetching logic can affect something else, so it needs to be extensively tested. I wish someone had built a fetching tool on Toolforge that I could use rather than reinventing the wheel here. @LuisVilla - if you think we can tackle it cheaply and reliably I'd be happy to chat. Alaexis¿question? 11:45, 14 May 2026 (UTC)[reply]
I have some ideas to help (was testing some of them last night, actually) but there is only going to be so much we can do.
@Alaexis Maybe a different way to think about it: what UI experience could we offer in this situation? If we reliably detected "this page can't be seen by the bot", then... ? Ask the user to copy/paste into a field? Automatically try to fetch from the Archive (I think we can do that last one soon)? ...? —Luis (talk) 13:29, 14 May 2026 (UTC)[reply]
In this case the tool tried to fetch it from the IA. Whenever there is an IA link we use it.
Good question regarding the UI. Essentially, we want to distinguish between "fetch failed and the page is unavailable" and "fetch failed but the page can be accessed by the user". I think it would be tricky to do it reliably. Alaexis¿question? 14:40, 14 May 2026 (UTC)[reply]
I attempted to reproduce this, but now the Verifier says "No URL found. Please paste the source text below:" Johnjbarton (talk) 02:14, 15 May 2026 (UTC)[reply]
Alas, fetching website data is not an exact science. Alaexis¿question? 10:08, 15 May 2026 (UTC)[reply]
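
One way to picture the detection step discussed in this thread is a small classifier that runs on the fetched text before it is sent for verification. This is a rough sketch under stated assumptions, not the Verifier's actual fetching logic: the result shape `{ ok, text }`, the function name `classifyFetch`, and the 500-character threshold are all hypothetical. The only marker taken from the report is the literal "Loading content..." string.

```javascript
// Markers that suggest an intermediate loading page rather than the article.
// Only "Loading content..." comes from the bug report; extend with care.
const PLACEHOLDER_MARKERS = [/loading content(\.\.\.|…)/i];

// ASSUMED threshold: text shorter than this is treated as lacking a body.
const MIN_BODY_CHARS = 500;

// result: { ok: boolean, text: string } (hypothetical shape)
// Returns "fetch_failed", "placeholder", "too_short" or "ok".
function classifyFetch(result) {
  if (!result.ok) return "fetch_failed";
  const text = (result.text || "").trim();
  if (PLACEHOLDER_MARKERS.some((re) => re.test(text))) return "placeholder";
  if (text.length < MIN_BODY_CHARS) return "too_short";
  return "ok";
}

// Example: headline plus navigation, but the body never loaded.
const sample = {
  ok: true,
  text: "Some headline\nHome | News | Sports\nLoading content...",
};
console.log(classifyFetch(sample)); // "placeholder"
```

A "placeholder" or "too_short" result could then prompt the user to paste the source text instead of returning SOURCE UNAVAILABLE. A check like this will misfire on legitimately short pages, so it would need the same careful testing as any other change to the fetching logic.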

"Filter chip"

What is that? And where is it located on the script? RedShellMomentum 21:06, 13 May 2026 (UTC)[reply]

(NB: I haven't used the script for a couple of days and don't think I've seen this message.) I found the line in the code on GitHub:
emptyEl.textContent = 'All citations are hidden by the current filters. Click a filter chip above to show them.';
I don't think I've seen the term "filter chip" before. ClaudineChionh (she/her · talk · email · global) 21:59, 13 May 2026 (UTC)[reply]
Oh, of course, "filter chip" isn't a special term, it's just a Chip (GUI). (I clearly haven't had enough coffee.) ClaudineChionh (she/her · talk · email · global) 22:39, 13 May 2026 (UTC)[reply]
@RedShellMomentum, this only means that the filters you applied hide all the results. It's probably a bit too jargon-y and can be made clearer. Alaexis¿question? 10:46, 14 May 2026 (UTC)[reply]

URL not found?

In this version of Dark energy, citation 99 is

but the Verifier claims "No URL found. Please paste the source text below:" Why? The link is fine. Johnjbarton (talk) 18:22, 24 May 2026 (UTC)[reply]

You can access the link, but the proxy gets a 403 error when it tries to fetch data from the website, possibly due to some kind of anti-scraping protection. I guess I should change the error message to make that clear. Alaexis¿question? 06:22, 25 May 2026 (UTC)[reply]
Another case of URL not found: Schimeczek, Christoph; Nitsch, Felix; Kochems, Johannes; Nienhaus, Kristina (2026-02-28). "Avoiding avalanches: Effective dispatch planning for competing storage units in day-ahead electricity market simulations". Journal of Energy Storage. 148 120054. doi:10.1016/j.est.2025.120054. ISSN 2352-152X. Johnjbarton (talk) 20:48, 28 May 2026 (UTC)[reply]
And
  • Löffler, Konstantin; Hainsch, Karlo; Burandt, Thorsten; Kemfert, Claudia; von Hirschhausen, Christian (2017). "Designing a model for the global energy system — GENeSYS-MOD: an application of the Open-Source Energy Modeling System (OSeMOSYS)". Energies. 10 (1468): 1468. doi:10.3390/en10101468. hdl:10419/200750. ISSN 1996-1073. S2CID 6944356.
Johnjbarton (talk) 20:55, 28 May 2026 (UTC)[reply]
I think something is broken; most sources fail with "No URL found". Johnjbarton (talk) 21:02, 28 May 2026 (UTC)[reply]
@Johnjbarton - sorry to hear that. Just checked a random article and got the usual (low) % of errors.
When I open the first link (Schimeczek et al.) myself I get a captcha challenge, so apparently ScienceDirect has a stricter-than-usual access policy. That said, I'm accessing it from my own IP, which is not the same IP the proxy uses.
I don't think this can be solved comprehensively unless we get proper access to journal articles. @LuisVilla, any thoughts about these specific articles or about the general problem? Alaexis¿question? 21:01, 29 May 2026 (UTC)[reply]
Just an FYI: I started using the Zotero client-side proxy recently. So if I attempt to "reproduce" a URL fetch there is an additional layer of difference I guess. Does the Verifier pull sources into the browser? Does the proxy affect how it works? Johnjbarton (talk) 21:11, 29 May 2026 (UTC)[reply]
No, your use of Zotero shouldn't have any impact. The Verifier uses a proxy that is hosted on Cloudflare. Alaexis¿question? 07:08, 30 May 2026 (UTC)[reply]
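
On the error-message point raised earlier in this thread (the proxy receives a 403 but the tool reports "No URL found"), a clearer message could be chosen from the HTTP status the proxy got back. The sketch below is only an illustration: `describeFetchFailure` and all wording except the existing "No URL found" message are hypothetical, and it assumes the proxy passes the upstream status code through to the script.

```javascript
// Pick a user-facing message for a failed source fetch.
// url: the citation URL, or null/empty if the citation has none.
// status: the HTTP status the proxy received (ASSUMED to be available).
function describeFetchFailure(url, status) {
  const pastePrompt = "Please paste the source text below:";
  if (!url) {
    // Existing message, kept for the case where there really is no URL.
    return `No URL found. ${pastePrompt}`;
  }
  if (status === 401 || status === 403 || status === 429) {
    // Typical of anti-scraping protection, captchas or rate limiting.
    return `The site refused the automated request (HTTP ${status}). ` +
      `The link may still open in your browser. ${pastePrompt}`;
  }
  if (status === 404 || status === 410) {
    return `The page was not found (HTTP ${status}). ` +
      `Check the link or an archived copy. ${pastePrompt}`;
  }
  return `The source could not be fetched (HTTP ${status}). ${pastePrompt}`;
}

console.log(describeFetchFailure("https://example.org/article", 403));
```

Separating "no URL in the citation" from "URL present but blocked" would at least tell the editor that the link itself is fine and that pasting the text is the expected workaround.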