Investigate and evaluate hCaptcha to replace Wikimedia's Fancy Captcha
Open, In Progress, High, Public

Description

This is not complete and, as such, should be considered a WIP. Comments and questions below are welcome.

Following on from T249854: Add support for hCaptcha, and as a potential solution to T241921: Fix Wikimedia captchas (and the various older incantations).

hCaptcha is an alternative to reCaptcha, without the usual privacy concerns that come with it.

Cloudflare did move to hCaptcha, but has since launched its own alternative, Turnstile.

It may still require a change to Wikimedia's Privacy Policy, as it requires loading JS from an external website and submitting data back to them, but hCaptcha's Privacy Policy is seemingly more in line with what we'd want (IANAL, and it would obviously need WMF-Legal review). They're more interested in aggregate data than individual data, and try to discard other data as soon as they can.

hCaptcha is offering to donate websites' "earnings" from captchas being solved to the Wikimedia Foundation rather than keeping them for itself. While I imagine this won't solve all of Wikimedia's funding problems, it's nice that we're considered a good solution to the problem. Obviously, there's the potential for captcha solves on Wikimedia sites to also help generate income.

[Attached image: EVR9uTuXsAATveo.jpg, 21 KB]

The implementation is similar to reCaptcha, selecting images of a certain type etc.

Localisation is done for ~150 languages, and they're planning on open-sourcing UI translations on GitHub, so there's a chance to expand that further and help support more languages (which is one goal of the captcha replacement project, T7309: Localize captcha images, though removing the text strings to be identified and typed out does make that task somewhat redundant).

There's also a labelling service we could potentially use with MachineVision instead of the Google services. It could potentially be possible to use our own captchas to help label our own images from Commons, somewhat a mix of T87598: Create a CAPTCHA that is also a useful micro edit and T34695: Implement, Review and Deploy Wikicaptcha.

Questions:

  • Does this image matching captcha solution help our Accessibility issues?

Known caveats/issues:


Useful links:

Details

Other Assignee
EMill-WMF

Related Objects

Event Timeline

Neither are the Google services we use for MachineVision and for Google Translate in Content Translation, along with services from other companies for translation. I think there are probably more too, without digging too deeply. I obviously understand that interacting with those is more optional, whereas a captcha as part of the login flow (and other flows) is not. Also, in those cases, users aren't directly interacting with Google services; they do so via a "proxy" app/API. But in those cases, information like the IP address serves no benefit: a translation between two languages is the same wherever you are in the world.

But we do not rely on them to edit, and we have plenty of alternatives. Also, the requests go through proxies.

As it stands, no one has come forward with an appropriate Free/Open Source solution to the captcha problem (unless I and everyone else involved have missed someone posting something enlightening that solves it). And as is clear from the general lack of progress on our own captcha in over a decade, even with the best will in the world, the Foundation and my colleagues don't have all of the large quantities of knowledge and experience required to improve our captcha while keeping the benefits of l10n/i18n (which is generally something we do quite well) and accessibility. Even more importantly, we don't have the time and resources to make significant headway on such projects *at the same time as all the other work we have to do*.

I once suggested that Wikimedia develop one - T174874: Create a standalone Wikimedia CAPTCHA service

So in the same way, we don't use coreboot on our servers, and we use proprietary software, switches and routers (I could continue), because of the lack of appropriate alternatives. And of course, we don't use FOSS hardware; again, for the same reasons. It's just not practical.

But we do have control over our servers. We do not have control over hCaptcha's. At the very least, it should be something that can be installed on Wikimedia servers; even if it may contact hCaptcha servers, the captcha should work without them.

And do bear in mind that many community members don't feel as strongly as you do (or, in many cases, even care). How many use Windows? And therefore IE or Edge? Mac? Safari? iPhone? Non-free drivers and binaries on Linux systems? In some cases they're forced to (work machines, etc.), but many choose to. Granted, it's consuming resources using non-FOSS, but it's a similar vein of argument.

Again, nobody requires users to use Windows. But Wikimedia may be about to require (at least new) users to use a non-free third-party service.

In the whole of Wikimedia there are very few places where external scripts are loaded - the content from MachineVision and Google Translate is already filtered so that it cannot do anything bad. Here, hCaptcha could theoretically inject arbitrary scripts into Wikimedia pages.

We have a current effort to replace any external resources, for privacy concerns. See also T135963: Add support for Content-Security-Policy (CSP) headers in MediaWiki

Yes, I'm aware of this. But CSP has a whitelisting system for this particular kind of issue. CSP is meant to stop unwanted and not specifically allowed things from being loaded, not to stop the wanted things that make things work.
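For illustration, a CSP allowlist entry for an external captcha widget might look something like the following (the hCaptcha hostnames are the ones their public integration docs mention; treat the exact directive list as an assumption to be verified against the deployed widget):

```
Content-Security-Policy:
  script-src  'self' https://hcaptcha.com https://*.hcaptcha.com;
  frame-src           https://hcaptcha.com https://*.hcaptcha.com;
  style-src   'self' https://hcaptcha.com https://*.hcaptcha.com;
  connect-src 'self' https://hcaptcha.com https://*.hcaptcha.com
```

Everything not matching an allowed source would still be blocked, which is the point being made above: the allowlist admits the wanted third-party script without opening the door to anything else.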

In the whole of Wikimedia there are very few places where external scripts are loaded - the content from MachineVision and Google Translate is already filtered so that it cannot do anything bad. Here, hCaptcha could theoretically inject arbitrary scripts into Wikimedia pages.

And their functionality and data requirements are different.

Yes, hCaptcha could inject arbitrary scripts (hell, we've seen it happen on wikis enough times too - sure, it doesn't always last long, but it happens), either purposefully or accidentally due to some breach. But that's what contracts are for: if they are breached, there are legal ramifications.

Neither are the Google services we use for MachineVision and for Google Translate in Content Translation, along with services from other companies for translation. I think there are probably more too, without digging too deeply. I obviously understand that interacting with those is more optional, whereas a captcha as part of the login flow (and other flows) is not. Also, in those cases, users aren't directly interacting with Google services; they do so via a "proxy" app/API. But in those cases, information like the IP address serves no benefit: a translation between two languages is the same wherever you are in the world.

But we do not rely on them to edit, and we have plenty of alternatives. Also, the requests go through proxies.

Again, what they do and how they work are different. Solving the captcha (i.e. the action/work) is only part of the process. Removing information the backend works with, such as the IP, makes the service mostly useless. Please read my original responses.

Also, in most cases, most users will not see a captcha. Certainly, I imagine long-registered users won't have seen one on Wikimedia in a long time (unless creating an additional account, for example).

As it stands, no one has come forward with an appropriate Free/Open Source solution to the captcha problem (unless I and everyone else involved have missed someone posting something enlightening that solves it). And as is clear from the general lack of progress on our own captcha in over a decade, even with the best will in the world, the Foundation and my colleagues don't have all of the large quantities of knowledge and experience required to improve our captcha while keeping the benefits of l10n/i18n (which is generally something we do quite well) and accessibility. Even more importantly, we don't have the time and resources to make significant headway on such projects *at the same time as all the other work we have to do*.

I once suggested that Wikimedia develop one - T174874: Create a standalone Wikimedia CAPTCHA service

Great. But I've already answered this question. We only have limited time and resources. Your task was also explicitly declined, the same as many other ideas where people suggest we should branch out and do X.

As it stands, no one has come forward with an appropriate Free/Open Source solution to the captcha problem (unless I and everyone else involved have missed someone posting something enlightening that solves it). And as is clear from the general lack of progress on our own captcha in over a decade, even with the best will in the world, the Foundation and my colleagues don't have all of the large quantities of knowledge and experience required to improve our captcha while keeping the benefits of l10n/i18n (which is generally something we do quite well) and accessibility. Even more importantly, we don't have the time and resources to make significant headway on such projects *at the same time as all the other work we have to do*.

But we do have control over our servers. We do not have control over hCaptcha's. At the very least, it should be something that can be installed on Wikimedia servers; even if it may contact hCaptcha servers, the captcha should work without them.

Again, read my answer about how the captcha works. Passing things through our servers removes that useful information, so we might as well not bother.

How much control do we necessarily have over proprietary firmware etc. on them? How many Intel Management Engine-type exploits are out there? Sure, we can limit that by controlling egress, but that doesn't necessarily remove it completely.

Again, nobody requires users to use Windows. But Wikimedia may be about to require (at least new) users to use a non-free third-party service.

And in the same way that you think not using a FOSS solution is a big problem, other people do not. I suspect a decent number of people who use Wikipedia don't know what this means, nor do they care. They'll happily use it on other sites, which are doing whatever with their data. That doesn't mean you're wrong, but it certainly doesn't mean you're right either.

Again, nobody requires users to use Windows. But Wikimedia may be about to require (at least new) users to use a non-free third-party service.

This is explicitly not true; a huge number of businesses (I would argue "almost all", though obviously I don't have any hard statistics to back that up) force their employees to use Windows, for a variety of reasons (it's what the tech support on-hand is familiar with; apps the company relies on were written for Windows and it'd be expensive to update or replace them; the company values paid technical support; etc). You can argue that any or all of these should be non-concerns for any business, but you're screaming into an empty amphitheater in that case. Even ignoring this, pretty much any public computer is going to be Windows just because it has the broadest software support and the general public is by far most likely to already be familiar with it.

Many companies have a volume license of Windows, but it is not the case of WMF.

@Bugreporter: It is entirely irrelevant what WMF folks use on their machines. Please move off-topic Windows license discussions somewhere else. Thanks!

I don't think they would need the IP address. If all they want are statistics on the number of requests/solves from an IP address, they could be given an HMAC of the IP address with a secret salt. Plus probably the AS and country of the IP, since I'm sure that's also part of their risk analysis. They couldn't combine requests from WMF users with those from third parties - Wikimedia sites would be on their own island - but that's the goal. We have a big enough user base that I doubt combining it would really be needed. That, plus proxying the actual image loads (and not letting them insert arbitrary JavaScript, but using a known-good copy), I think would work wrt privacy. Still not ideal from a FOSS philosophical POV, though.
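A minimal sketch of the IP-blinding idea above, in Python. The salt value and function name are hypothetical; HMAC-SHA-256 is one reasonable choice of keyed hash:

```python
import hashlib
import hmac

# Hypothetical server-side secret; in practice this would live in
# private configuration, never be shared with the vendor, and be
# rotated periodically.
SECRET_SALT = b"replace-with-a-long-random-secret"

def blind_ip(ip: str) -> str:
    """Keyed hash of an IP address: stable enough for per-IP
    request/solve statistics, but not reversible without the salt."""
    return hmac.new(SECRET_SALT, ip.encode("utf-8"), hashlib.sha256).hexdigest()

# The same IP always maps to the same token, so per-"IP" counting
# still works on the receiving side:
assert blind_ip("198.51.100.7") == blind_ip("198.51.100.7")
# Different IPs map to different tokens:
assert blind_ip("198.51.100.7") != blind_ip("198.51.100.8")
```

Without the salt, the vendor cannot feasibly reverse the hash or even confirm a guessed IP, which is what distinguishes this from a plain unsalted hash of the address.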

From an operational perspective, a concern I have is the dependency created by using a single vendor for a service like this. If in 5-10 years' time, after several mergers and acquisitions, the captcha provider decided to stop providing the service under acceptable terms for us (e.g. they change their terms and are no longer willing to respect user privacy at all, in order to monetize users), what would we do? It's not like we could stop requiring captchas without an impact. At the very least, the current implementation would have to be kept at an appropriate level, so we could easily fall back to it in such a case (or, simply, if the vendor had an outage).

I don't think they would need the IP address. If all they want are statistics on the number of requests/solves from an IP address, they could be given an HMAC of the IP address with a secret salt.

hCaptcha does indeed support such a paradigm by allowing clients to pass blinded end-user IPs to their backend, where they are isolated from the rest of the statistical reputation-scoring hCaptcha performs within the context of their large pool of client data. I cannot find any public-facing documentation for this feature, but I can confirm that it exists and would be a requirement for any proposed Wikimedia implementation.

They couldn't combine requests from WMF users with those from third parties - Wikimedia sites would be on their own island - but that's the goal. We have a big enough user base that I doubt combining it would really be needed.

There would be a potential downgrade of the performance of hCaptcha's reputation-scoring relative to their standard implementation, but this would still be a vast improvement over FancyCaptcha, which essentially has none.

That, plus proxying the actual image loads (and not letting them insert arbitrary JavaScript, but using a known-good copy), I think would work wrt privacy. Still not ideal from a FOSS philosophical POV, though.

hCaptcha provides both first-party hosting and full-proxy options for their primary JavaScript widget and related resources, the latter of which should alleviate all user privacy issues within the context of Wikimedia's current privacy policy. In discussions with hCaptcha, they are also extremely comfortable with Wikimedia/WMF having as much access to relevant source code as possible for audit purposes. As you mentioned, this isn't fully in alignment with certain FOSS philosophies, but is likely the best outcome possible for such a vendor relationship. By contrast, Google currently does not and would likely be unwilling to satisfy any of these requirements with reCaptcha.

From an operational perspective, a concern I have is the dependency created by using a single vendor for a service like this. If in 5-10 years' time, after several mergers and acquisitions, the captcha provider decided to stop providing the service under acceptable terms for us (e.g. they change their terms and are no longer willing to respect user privacy at all, in order to monetize users), what would we do? It's not like we could stop requiring captchas without an impact. At the very least, the current implementation would have to be kept at an appropriate level, so we could easily fall back to it in such a case (or, simply, if the vendor had an outage).

This is indeed a concern, and one that the Security-Team addressed within a recent WMF-internal risk assessment. FancyCaptcha (or similar) would need to be maintained to some extent as either a fallback captcha system (in the case of service outages) or as a temporary replacement if hCaptcha's terms and/or ethos ever departed significantly from current expectations. This would all likely be codified via contractual agreements between the WMF and hCaptcha, if this option were to move forward.

sbassett moved this task from Back Orders to Watching on the Security-Team board.

Could hCaptcha allow us to create a custom version of the service that could be hosted on WMF servers? This would significantly reduce the risk of outage and suspension. A non-revocable legal agreement on running the service may also be needed. Note that even with this, it may still be much more controversial than T272111.

If hCaptcha were to be implemented within Wikimedia production, part of that process would involve creating a custom service managing the proxied transmission of fully-anonymized data to hCaptcha for evaluation. Ideally, said service would give us more flexibility in migrating to separate or fallback captcha systems, such as FancyCaptcha, if the need arose. I do not believe there would be a way to avoid sending any data to hCaptcha, as that is not possible with their current architecture. But as previously discussed, there are a number of ways (technical, legal, etc.) to make such transactions as secure and private as possible and fully compliant with the current Wikimedia privacy policy.
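As a sketch of what such a proxy service might send: hCaptcha's public API verifies tokens via a `siteverify` endpoint taking `secret`, `response`, and an optional `remoteip` field. The blinding scheme, salt, and function names below are hypothetical illustrations of combining that with the anonymization discussed above, not the actual Wikimedia design:

```python
import hashlib
import hmac

# Hypothetical site-local secrets; real values would come from
# private configuration.
HCAPTCHA_SECRET = "0x0000000000000000000000000000000000000000"
BLINDING_SALT = b"site-local-secret-salt"

def build_siteverify_payload(token: str, client_ip: str) -> dict:
    """Build the form body for hCaptcha's /siteverify call, replacing
    the raw client IP with a keyed hash so the address itself never
    leaves local infrastructure (relies on hCaptcha's blinded-IP
    support mentioned earlier in this thread)."""
    blinded_ip = hmac.new(BLINDING_SALT, client_ip.encode("utf-8"),
                          hashlib.sha256).hexdigest()
    return {
        "secret": HCAPTCHA_SECRET,
        "response": token,
        "remoteip": blinded_ip,  # blinded value, not the real address
    }

# The proxy service would then POST this body to
# https://api.hcaptcha.com/siteverify and act on the JSON "success"
# field in the response.
```

Keeping the verification call server-side also means the hCaptcha account secret is never exposed to browsers, and gives a single choke point where a fallback captcha system could be swapped in.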

Update: this is a fairly interesting blog post from Cloudflare discussing their migration from reCaptcha to hCaptcha. They had many similar concerns over user privacy.

Noting also that the description is rather out of date.

kostajh changed the task status from Open to In Progress.Nov 28 2024, 8:36 AM

As some of you may have seen on the annual plan Meta page, we have been integrating our infrastructure with hCaptcha; it's currently live on test2wiki, and we will post more announcements on wikis as well as on Diff in the coming days.

For more details, particularly the privacy safeguards and risks, see the project page: mw:hCaptcha.

I'd also like to invite you to subscribe to the Product Safety and Integrity team newsletter, where we'll keep you updated on the highlights of our projects as well as cross-project strategic thinking.

kostajh updated Other Assignee, added: EMill-WMF.