Fix PuSH to instances where WEB_DOMAIN != LOCAL_DOMAIN #2739
ClearlyClaire wants to merge 4 commits into mastodon:master
Conversation
Force-pushed from 8e5b5db to e6bef1a
Is modifying […]? Also, it would be nice to have a test case to make sure this kind of regression won't happen anymore.
I'm open to using a new Account#followers_subscription_domains or similar, but I'm not sure those will actually be used for subscription (except for Mastodon instances). Moreover, I can't really think of any use for Account#followers_domains, as the domain part of the acct: URI doesn't mean much anyway (the actor URI, the feeds, and the API endpoints are what matter, and they can be hosted on a different domain entirely, as is the case with the issue at hand). Either way, this should be documented.
As for the tests, I agree, but I haven't had the time to write them, and I'm on my way to my day job now.
Yes, but what matters is that Mastodon's code uses them for checking subscription validity. It makes the code easier to understand.
True, I can't think of any either. So we can rename […].
I agree. It should be noted that an assumption is made here.
This becomes hard to read because of "unless not" so maybe "next if hidden and not included"
If WEB_DOMAIN and LOCAL_DOMAIN must be related by a root domain (i.e. one of them is a subdomain) then maybe that can be the check instead?
Yes, it's a bit unreadable; I will change that. As for WEB_DOMAIN being a subdomain of LOCAL_DOMAIN, I'm uneasy about that: it is not an assumption we have made so far, nor one GNU social has made either. I think multiple GNU social users have acct: URIs pointing to entirely different domains (I can think of MMN-o, but some of those can be considered aliases). It would make the check easier, though.
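For reference, the "shared root domain" alternative being discussed could look like the following naive sketch. This is purely illustrative, not something the PR adopts, and it does no Public Suffix List handling at all:

```ruby
# Illustrative sketch of the "related by a root domain" check discussed above.
# This is NOT an assumption Mastodon or GNU social actually makes; it is a
# naive suffix comparison (no Public Suffix List, no IDN handling).
def same_root?(a, b)
  a == b || a.end_with?(".#{b}") || b.end_with?(".#{a}")
end
```

Under this check, `same_root?("web.example.com", "example.com")` would hold, while two unrelated domains would not match.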
app/models/account.rb (outdated)
I'm a bit concerned because this will be a huge array; every remote follower has a unique remote_url, so distinct is kinda useless here.
Ah, yeah, I just copied it; the “distinct” is indeed useless. I'm not sure how to optimize that. We could just .uniq the array, but I'm not sure it would help much. We could also add a “remote_url_domain” field to Account, so the URL is parsed only when refreshing the remote account details rather than at each private/direct message, but I'm not sure how much better that is.
One part of the change was also to compute this array only when pushing private/direct messages, and not public/unlisted ones.
I'll push a slightly optimized version of this, which should probably be OK for about a thousand followers across a few hundred remote domains, but not much more.
Force-pushed from 450b822 to 8ddab9d
I have updated my PR with a few optimizations and fixes, but it will probably not scale well to users with thousands of followers. The best way to handle this would probably be to store the parsed-and-normalized domain alongside the URL in a field of the “accounts” table, but that would probably make for a fairly long-running database migration script…
The last commit introduces a web_domain field to the accounts table as a cache for the normalized domain part of remote_url. This comes with a migration script that I fear will be quite long-running.
Force-pushed from 17a04a3 to e6f416c
You are absolutely right that this migration script is impossible to run. It will take hours if not days, and migrations place a lock on the entire schema meaning it would be hours of downtime. I am almost veering towards your initial approach of just iterating over everything.
Yeah, sorry about making you do the migration bit. See my comment on the other file.
I don't think that would ever take days, but hours is entirely possible, and that's not acceptable for a migration script. For the record, on my really small instance (only ~2k remote accounts known, but also low-end ARM hardware), the migration script took about 2 minutes to run.
There is a second approach. Instead of checking whether the destination is allowed during the loop here, do it in Pubsubhubbub::DeliveryWorker instead.
It should be made a hashtable, or at least .uniq'd, as @ThibG commented in the last review (though he said he is “not sure it will be very useful”). An O(n²) operation is still O(n²) across threads: a few threads are nothing compared to 10k² if the person has a lot of followers.
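The hashtable suggestion can be sketched with Ruby's stdlib Set. This is illustrative only; in the PR the data would come from the followers query, and `follower_domains` is a stand-in name:

```ruby
require 'set'

# Membership tests against a Set are O(1) on average, while Array#include?
# scans the whole array, which is what produces the O(n^2) behaviour described
# above when the check runs once per follower.
follower_domains = Set.new(%w[example.com social.test])

follower_domains.include?('example.com') # constant-time on average
```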
You're distributing to the same number of subscriptions either way; the difference is between checking a huge array for a value that needs to be transformed first (high memory use), or checking just one thing in the responsible job (low memory use).
Also, lots of small jobs are overall better for the system than a few long-running ones, because small jobs return consumed resources like connections faster, so jobs of different varieties can run more or less at the same time, while one long-running job may block a Sidekiq worker for a long time.
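Moving the check into the per-subscription job could look roughly like this. It is a hedged sketch: `DeliveryJob`, `allowed_domains`, and the callback handling are illustrative stand-ins, not Mastodon's actual Pubsubhubbub::DeliveryWorker:

```ruby
require 'uri'
require 'set'

# Sketch: each small delivery job checks only its own subscription's callback
# domain, instead of the distributor building one huge followers-domains array
# up front and passing it around.
class DeliveryJob
  def initialize(allowed_domains)
    @allowed = allowed_domains # e.g. a Set of follower domains
  end

  # Returns true when the payload would actually be delivered to the callback.
  def perform(callback_url, _payload)
    domain = URI.parse(callback_url).host
    @allowed.include?(domain)
  end
end
```

Each job then holds only one URL and does one cheap lookup, which matches the “lots of small jobs” argument above.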
I see what you mean. Not passing the table to the worker, but leveraging the DB index, thus having both low complexity and benefits of small jobs. Much more clever than mine!
EDIT: Realized DB index might not be constant time.
In my last proposal (well, before the last commit, which switches to a database field), I am already using “.uniq” to avoid .normalize()-ing URLs from the same domain, as that is the most expensive part of the computation. This does indeed require a large array at some point, though.
Checking it in Pubsubhubbub::DeliveryWorker would involve doing the whole thing for each PuSH subscriber (so, roughly, multiplying the total running time by the number of remote instances), and avoiding the big array means doing a .normalize() call for each follower, even when several of them happen to use the same domain…
@ThibG is right. I forgot (again) that PuSH subscriptions are anonymous. The DB index cannot be relied on. I believe filling the database gradually is a good idea.
If adding a column is expensive, what about creating a new table? Is it a conceivable situation for a user to have multiple “authorized” domains?
Yes, it is conceivable as it is only a guess. For instance, “authorized” domains include the domain part of the acct: URI too, and we might also allow the domain part of the PuSH or salmon endpoints…
Sorry, but I insist on the approach I described. Please go back to the approach before migrations/db changes, and move the check to the delivery worker.
I'm not sure what you want me to do here. We would not be able to share the domains array anymore, and would need to re-compute it instead. Or iterate over all the followers without deduplicating as early as possible, which would ease the RAM requirements but could make each job run much longer than the single “big” one.
EDIT: To clarify, so far I have a .uniq(&:host) call between parsing and normalizing, as normalizing is much more expensive than parsing, and deduplicating in this order provides a huge performance boost over normalizing each URL individually.
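The parse → uniq(&:host) → normalize ordering described above can be sketched as follows. This is a stdlib-only stand-in: downcasing the host stands in for Addressable's much more expensive normalization, and `urls` stands in for the followers' remote_url values:

```ruby
require 'uri'

# Dedupe on the cheaply parsed host BEFORE the expensive normalization step,
# so each distinct host is normalized only once.
def followers_domains(urls)
  urls.map { |url| URI.parse(url) }       # cheap parse
      .uniq(&:host)                       # dedupe before the expensive step
      .map { |uri| uri.host&.downcase }   # stand-in for normalized_host
      .compact
      .uniq
end
```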
Do you want me to use something like:

```ruby
def is_domain_subscribed?(domain)
  return true if followers.where(domain: domain).count.positive?
  return true if followers.where(web_domain: domain).count.positive?

  followers.reorder(nil)
           .where('accounts.domain IS NOT NULL AND accounts.web_domain IS NULL')
           .select('accounts.remote_url')
           .find_each do |follower|
    return true if domain == Addressable::URI.parse(follower.remote_url).normalized_host
  end

  false
end
```

As I said earlier, I fear this is going to be much slower than what I came up with earlier (at least without the web_domain field in the database). But you're right that it should consume less memory.
Updated with my last suggestion (using both the web_domain value from the database as well as computing a set of domains on the fly for remote followers that don't have the value set). I haven't tested it yet.
Fixed a mistake in an earlier commit (which used […]).
I'm investigating a few ways to significantly reduce the time we spend parsing and normalizing URIs: […]
EDIT: With those changes, parsing 600,000 URLs on my i5 takes ~35s (instead of the 3 minutes it took prior to those changes). The slightly modified migration script takes about 1m30s on my low-end ARM server. Maybe this is a viable solution again?
#2954 changes the distribution to work for everybody for public statuses again. I don't have a good solution to tracking "web_domain" since it's not something the protocol announces in any way. For all I know, the Atom feed, Salmon endpoint and PuSH hub of a user could all be on different domains. :(
On 9 May 2017 at 23:53:25 GMT+02:00, Eugen Rochko <[email protected]> wrote:
> #2954 changes the distribution to work for everybody for public statuses again.
Good! That will already be a much better situation, although it would be good if follower-only toots worked too.
> I don't have a good solution to tracking "web_domain" since it's not something the protocol announces in any way. For all I know, the Atom feed, Salmon endpoint and PuSH hub of a user could all be on different domains. :(
Yes, indeed. This notion of private toots is a departure from OStatus anyway, which does not impose any kind of relation on the different URLs. My proposal was for it to work with *Mastodon* instances where LOCAL_DOMAIN ≠ WEB_DOMAIN.
I have rebased my branch on latest master and completely revamped it: first, there is a bunch of fixes and performance improvements, and then the last commit, the most invasive one, introduces the “web_domain” field to the database. Unlike in my previous approaches, this is meant to be set only when the domain part of the remote_url differs from the domain part of the acct: URI. Moreover, the accounts are not updated by a migration script; the update mechanism relies on the normal remote-account refresh mechanism instead, thus having almost no impact on performance, at the cost of undelivered toots when all of the following conditions are met: […]
EDIT: I haven't thoroughly tested it yet.
Force-pushed from 2cce098 to 1d03265
Force-pushed from 0aff090 to 98aaa62
Let me bump this PR again now that 1.4.1 is released! As before, I have been running with these changes (regularly rebasing them on master) without any issue I am aware of.
@ThibG Yeah, sorry about the delays on this one - I'm always hesitant to change the db schema, especially when, as in this case, it's so specific to WEB_DOMAIN, which is a Mastodon-only thing and not even recommended. However, this led me to thinking: why don't we make more use of Webfinger aliases? Webfinger provides a canonical acct and an array of aliases. When using WEB_DOMAIN, we could list that among the aliases. On the other end, perhaps accounts should have an aliases string-array column, which would keep track of those. We still have the “that's a lot of SELECTs + parsing the domain out of the acct or URL” problem, but at least it's not hardcoded to web_domain.
Yes, I can understand why you would be hesitant to change the db schema for this. Webfinger aliases would contain entire URIs that would probably have to be parsed each time, which would probably be way too slow (my current solution is to save the domain itself in web_domain, at remote account creation/update, so that the lookup is fast). As for including it in the aliases, the profile URL (thus including the WEB_DOMAIN) is already part of the advertised aliases.
Force-pushed from efaedcc to b909cf9
Force-pushed from 26340e5 to 7c4ed60
Force-pushed from 171a38e to c58f4da
When we are only interested in the normalized host, calling normalized_host avoids normalizing the other components of the URI, as well as creating a new object.
…L_DOMAIN

While PuSH subscriptions are anonymous, and instance-wide in practice, Mastodon attempts to deliver private (follower-only) toots to instances known to host followers. To do so, it checks that the PuSH callback has the same host name as one of the followers. This assumption is insufficient for Mastodon instances for which WEB_DOMAIN ≠ LOCAL_DOMAIN.

This fix adds a “web_domain” field to Account, which will be set for remote accounts whose remote_url is hosted on a different domain than the domain part of their “acct:” URI. While this makes assumptions not present in OStatus, it is consistent with Mastodon's behaviour and does not limit federation in any way.
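Under the commit's description, the decision of when to set web_domain could be sketched like this (hypothetical helper name; in the PR the field is filled in the remote-account creation/update path, not by a standalone function):

```ruby
require 'uri'

# Set web_domain only when the remote_url's host differs from the acct: domain;
# otherwise leave it nil, so the plain domain match keeps working unchanged.
def web_domain_for(acct_domain, remote_url)
  host = URI.parse(remote_url).host&.downcase
  host == acct_domain ? nil : host
end
```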
I have a new idea: I'm going to implement HTTP signatures (there is a draft RFC for them). PuSH subscription requests will be signed on behalf of the user who triggered the subscription. We don't really need a specific user, but once we have one, we can store the domain in the subscriptions table, and then filter by it.
That seems sensible to me as long as it is optional (since that's not part of OStatus).
…ance domain (mastodon#4205)

* Fix mastodon#2672 - Connect signed PuSH subscription requests to instance domain

Resolves mastodon#2739

* Fix return of locate_subscription
* Fix tests
This is an attempt to fix #2672 and hasn't been tested yet.
This is done in two ways: […]
RSpec tests should be added, and the way the followers_domains array is built should probably be optimized.