
Fix PuSH to instances where WEB_DOMAIN != LOCAL_DOMAIN #2739

Closed
ClearlyClaire wants to merge 4 commits into mastodon:master from ClearlyClaire:fix-push


Conversation

@ClearlyClaire
Contributor

This is an attempt to fix #2672 and hasn't been tested yet.
This is done in two ways:

  • Push public toots without attempting to verify the callback URL against the followers
  • Add the host part of each follower's remote_url to the list of allowed domains. This makes assumptions not present in OStatus, but should work with other Mastodon instances

RSpec tests should be added, and the way followers_domains is built should probably be optimized.
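As a rough illustration of the second point, here is a hypothetical sketch in plain Ruby (not the PR's actual code; `Follower` and `allowed_domains` are made-up names) of building the allowed-domain list from both the acct: domain and the remote_url host:

```ruby
require 'uri'

# Hypothetical sketch: collect every domain a follower is reachable under,
# both its acct: domain and the host part of its remote_url.
Follower = Struct.new(:domain, :remote_url)

def allowed_domains(followers)
  followers.flat_map do |f|
    host = URI.parse(f.remote_url).host&.downcase
    [f.domain, host].compact
  end.uniq
end

followers = [
  Follower.new('social.example.org', 'https://mastodon.example.org/users/alice'),
  Follower.new('social.example.org', 'https://mastodon.example.org/users/bob'),
]
allowed_domains(followers) # => ["social.example.org", "mastodon.example.org"]
```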

@ClearlyClaire ClearlyClaire force-pushed the fix-push branch 3 times, most recently from 8e5b5db to e6bef1a Compare May 3, 2017 05:37
@ghost

ghost commented May 3, 2017

Is modifying Account#followers_domains a good idea? I don't find it used elsewhere, but people would likely assume it to mean the Account#domains of Account#followers. Maybe make something separate like Account#followers_subscription_domains?

Also it would be nice to have a test case to make sure this kind of regression won't happen anymore.

@ClearlyClaire
Contributor Author

ClearlyClaire commented May 3, 2017 via email

@ghost

ghost commented May 3, 2017

but you're not sure those will actually be used for subscription

Yes, but what matters is that the Mastodon code uses them for checking subscription validity. It is about making the code easier to understand.

I can't really think of any use for Account#followers_domains as the domain part of the acct: URI doesn't mean much anyway

True. I can't think of any either. So we can rename followers_domains instead of keeping it and making something new. No reason to keep using an under-specific name, after all. Maybe there is going to be some feature down the line that uses another set of domains, who knows?

Either way, this should be documented.

I agree. It should be noted that an assumption is made here.

Member

This becomes hard to read because of "unless not", so maybe "next if hidden and not included"
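For illustration, the suggested guard-clause style in a hedged sketch (the names `Sub`, `deliverable_domains`, and the loop are made up, not Mastodon's actual code):

```ruby
# Made-up illustration of the suggested "next if hidden and not included"
# guard clause, which reads better than an "unless not" double negative.
Sub = Struct.new(:domain, :hidden)

def deliverable_domains(subs, allowed)
  subs.each_with_object([]) do |s, out|
    next if s.hidden && !allowed.include?(s.domain) # skip hidden, unapproved subs
    out << s.domain
  end
end

subs = [Sub.new('a.example', true), Sub.new('b.example', false)]
deliverable_domains(subs, ['b.example']) # => ["b.example"]
```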

Member

If WEB_DOMAIN and LOCAL_DOMAIN must be related by a root domain (i.e. one of them is a subdomain) then maybe that can be the check instead?

Contributor Author

Yes, it's a bit unreadable; I will change that. As for WEB_DOMAIN being a subdomain of LOCAL_DOMAIN, I'm uneasy about that: it is not an assumption we have made so far, nor one GNU Social has made. I think multiple GNU Social users have acct: URIs pointing to entirely different domains (I can think of MMN-o, though some of those may be considered aliases). It would make the check easier, though.

Member

I'm a bit concerned because this will be a huge array, every remote follower has a unique remote_url, distinct is kinda useless here

Contributor Author

Ah, yeah, I just copied it, but the “distinct” is useless indeed. I'm not sure how to optimize that. We could just .uniq the array, but I'm not sure it will be very useful. We could also add a “remote_url_domain” field to Account, to parse the URL only when refreshing the remote account details, and not at each private/direct message, but I'm not sure how much better that is.
One part of the change was also to compute this array only when pushing private/direct messages, and not public/unlisted ones.

Contributor Author

I'll push a slightly optimized version of this, which should probably be ok for about a thousand followers across a few hundred remote domains, but not much more.

@ClearlyClaire ClearlyClaire force-pushed the fix-push branch 2 times, most recently from 450b822 to 8ddab9d Compare May 3, 2017 11:33
@ClearlyClaire
Contributor Author

I have updated my PR with a few optimizations and fixes, but it will probably not scale well to users with thousands of followers. The best way to handle this would probably be to store the parsed-and-normalized domain alongside the URL in a field of the “accounts” table, but that would probably make for a fairly long-running database migration script…

@ClearlyClaire
Contributor Author

The last commit introduces a web_domain field to the accounts table as a cache for the normalized domain part of remote_url. This comes with a migration script that I fear will be quite long-running.

@ClearlyClaire ClearlyClaire force-pushed the fix-push branch 4 times, most recently from 17a04a3 to e6f416c Compare May 3, 2017 16:51
Member

You are absolutely right that this migration script is impossible to run. It will take hours if not days, and migrations place a lock on the entire schema meaning it would be hours of downtime. I am almost veering towards your initial approach of just iterating over everything.

Member

Yeah, sorry about making you do the migration bit. See my comment on the other file.

Contributor Author

I don't think that would ever take days, but hours is entirely possible, and that's not acceptable for a migration script. For the record, on my really small instance (only ~2k remote accounts known, but also low-end ARM hardware), the migration script took about 2 minutes to run.

Member

There is a second approach. Instead of checking whether the destination is allowed during the loop here, do it in Pubsubhubbub::DeliveryWorker instead.


It should be made a hash table, or at least .uniq'd, as @ThibG commented on the last review (though he said he is 'not sure it will be very useful'). An O(n²) operation is still O(n²) split across threads: a few threads are nothing compared to 10k² if the person has a lot of followers.
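To make the complexity point concrete, here is a minimal stdlib-only sketch (hypothetical data, not Mastodon's code) of swapping the array for a Set, turning each membership test from an O(n) scan into an amortized O(1) hash lookup:

```ruby
require 'set'

# Hypothetical follower-domain list; with an Array, each include? is O(n).
domains = Array.new(10_000) { |i| "host#{i}.example" }

# A Set gives amortized O(1) membership tests instead.
domain_set = domains.to_set

domain_set.include?('host9999.example') # => true, via hash lookup
domain_set.include?('missing.example')  # => false
```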

Member

You're distributing to the same number of subscriptions either way, the difference is between checking a huge array for a value that needs to be transformed first (high memory use), or checking just one thing in the responsible job (low memory use)

Also, lots of small jobs are overall better for the system than a few long-running ones, because small jobs return consumed resources like connections faster, so jobs of different varieties can run more or less at the same time, while one long-running job may block a Sidekiq worker for a long time.


@ghost ghost May 6, 2017


I see what you mean. Not passing the table to the worker, but leveraging the DB index, thus having both low complexity and benefits of small jobs. Much more clever than mine!

EDIT: Realized DB index might not be constant time.

Contributor Author

In my last proposal (well, before the last commit which switches to a database field), I am already using “.uniq” to avoid .normalize()-ing URLs from the same domain, as it's the most expensive part of the computation. This does however require the use of a large array at some point, indeed.

Checking it in Pubsubhubbub::DeliveryWorker would involve doing the whole thing for each PuSH subscriber (so, roughly, multiplying the total running time by the number of remote instances), and avoiding the big array means doing a .normalize() call for each follower, even if several of them happen to use the same domain…


@ghost ghost May 6, 2017


@ThibG is right. I forgot (again) that PuSH subscriptions are anonymous. The DB index cannot be relied on. I believe filling the database gradually is a good idea.

If adding a column is expensive, what about creating a new table? Is it a conceivable situation for a user to have multiple 'authorized' domains?

Contributor Author

Yes, it is conceivable as it is only a guess. For instance, “authorized” domains include the domain part of the acct: URI too, and we might also allow the domain part of the PuSH or salmon endpoints…

Member

Sorry, but I insist on the approach I described. Please go back to the approach before migrations/db changes, and move the check to the delivery worker.

Contributor Author

@ClearlyClaire ClearlyClaire May 6, 2017


I'm not sure what you want me to do here. We would not be able to share the domains array anymore, and would need to re-compute it instead. Or iterate over all the followers without deduplicating as soon as possible, which will ease the RAM requirements, but would make each job potentially run much longer than the “big” one.

EDIT: To clarify, so far, I have a .uniq(&:host) call between parsing and normalizing, as normalizing is much more expensive than parsing, and deduplicating in this order provides a huge performance boost over iterating over each of them and normalizing them.
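The ordering described above can be sketched like this (hedged, stdlib-only stand-in: `expensive_normalize` is a placeholder for the real IDNA/Unicode normalization, not Mastodon's code):

```ruby
require 'uri'

urls = [
  'https://mastodon.example/users/alice',
  'https://mastodon.example/users/bob',
  'https://other.example/users/carol',
]

# Placeholder for the costly normalization step discussed above.
def expensive_normalize(host)
  host.downcase
end

# Parse (cheap) everything, dedupe by host, then normalize (expensive)
# only once per distinct host: two calls here instead of three.
hosts = urls.map { |u| URI.parse(u).host }.uniq
normalized = hosts.map { |h| expensive_normalize(h) }
# normalized == ["mastodon.example", "other.example"]
```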

Contributor Author

Do you want me to use something like:

  def is_domain_subscribed?(domain)
    return true if followers.where(domain: domain).exists?
    return true if followers.where(web_domain: domain).exists?
    # Fall back to parsing remote_url for rows that predate web_domain.
    # find_each needs the id column for batching, and yields Account records.
    followers.reorder(nil).where('accounts.domain IS NOT NULL AND accounts.web_domain IS NULL')
             .select('accounts.id, accounts.remote_url').find_each do |follower|
      return true if domain == Addressable::URI.parse(follower.remote_url).normalized_host
    end
    false
  end

As I said earlier, I fear this is going to be much slower than what I came up with earlier (at least without the web_domain field in the database). But you're right that it should consume less memory.

@ClearlyClaire
Contributor Author

Updated with my last suggestion (using both the web_domain value from the database as well as computing a set of domains on-the-fly for remote followers that don't have the value set). I haven't tested it yet.

@ClearlyClaire
Contributor Author

ClearlyClaire commented May 6, 2017

Fixed a mistake in an earlier commit (which used Addressable::URI's domain method instead of host). Now running on my instance, haven't seen any issue so far, but obviously haven't seen any improvement either, since the remote instances would have to run the fix.
In the meantime, I'm monitoring how fast the new web_domain field gets updated for known remote users.

@ClearlyClaire
Contributor Author

ClearlyClaire commented May 6, 2017

I'm investigating a few ways to significantly reduce the time we spend parsing and normalizing URIs:

  • Using uri.normalized_host instead of uri.normalize.host, which avoids normalizing the other components of the URI and creating a new object
  • Using the “idn-ruby” gem which implements IDNA natively instead of using the default pure Ruby implementation

EDIT: With those changes, parsing 600000 URLs on my i5 takes ~35s (instead of the 3 minutes it took prior to those changes). The slightly modified migration script takes about 1m30s on my low-end ARM server. Maybe this is a viable solution again?

@Gargron
Member

Gargron commented May 9, 2017

#2954 changes the distribution to work for everybody for public statuses again. I don't have a good solution to tracking "web_domain" since it's not something the protocol announces in any way. For all I know, the Atom feed, Salmon endpoint and PuSH hub of a user could all be on different domains. :(

@ClearlyClaire
Contributor Author

ClearlyClaire commented May 10, 2017 via email

@ClearlyClaire
Contributor Author

ClearlyClaire commented May 10, 2017

I have rebased my branch on latest master and completely revamped it: first, there is a bunch of fixes and performance improvements, and the last commit is the most invasive one and introduces the “web_domain” field to the database.

Unlike in my previous approaches, this is meant to be set only when the domain part of the remote_url and the domain of an Account differ, to reduce the need for deduplication.

Moreover, the accounts are not updated by a migration script; the update mechanism relies on the normal remote account update mechanism instead, thus having almost no impact on performance, at the cost of undelivered toots when all the following conditions are met:

  • The toot is private (followers-only)
  • The remote follower is older than the migration and did not get updated
  • The remote follower is on an instance where WEB_DOMAIN ≠ LOCAL_DOMAIN

EDIT: I haven't thoroughly tested it yet.
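A minimal sketch of the rule described above (hypothetical helper name, stdlib URI used instead of Addressable): web_domain is stored only when the remote_url host differs from the acct: domain, so most rows need no extra data:

```ruby
require 'uri'

# Hypothetical helper: return the web_domain value to cache, or nil when
# the remote_url host already matches the acct: domain.
def web_domain_for(acct_domain, remote_url)
  host = URI.parse(remote_url).host&.downcase
  host == acct_domain ? nil : host
end

web_domain_for('example.org', 'https://example.org/users/alice')        # => nil
web_domain_for('example.org', 'https://social.example.org/users/alice') # => "social.example.org"
```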

@ClearlyClaire ClearlyClaire force-pushed the fix-push branch 3 times, most recently from 2cce098 to 1d03265 Compare May 12, 2017 17:01
@ClearlyClaire ClearlyClaire force-pushed the fix-push branch 3 times, most recently from 0aff090 to 98aaa62 Compare May 29, 2017 10:14
@ClearlyClaire
Contributor Author

Let me bump this PR again now that 1.4.1 is released! As earlier, I have been running with these changes (regularly rebasing them on master) without any issue I am aware of.

@Gargron
Member

Gargron commented Jun 2, 2017

@ThibG Yeah, sorry about the delays on this one - I'm always hesitant to change the db schema, especially when, as in this case, it's so specific to WEB_DOMAIN, which is a Mastodon-only thing and not even recommended. However, this led me to thinking: why don't we make more use of Webfinger aliases? Webfinger provides a canonical acct, and an array of aliases. When using WEB_DOMAIN, we could list that among the aliases. On the other end, perhaps accounts should have an aliases string array column, which would keep track of those. We still have the "that's a lot of selects + parsing the domain out of the acct or URL" problem, but at least it's not hardcoded to web_domain
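For context on what "aliases" means here, a Webfinger response carries a canonical subject plus an array of alias URIs. A hypothetical response, shown as a Ruby hash (illustrative only, not from any real instance), might look like:

```ruby
# Hypothetical Webfinger JRD for a user whose WEB_DOMAIN differs from
# the acct: domain; the alias list could then advertise both locations.
jrd = {
  subject: 'acct:alice@example.org',
  aliases: [
    'https://example.org/users/alice',
    'https://social.example.org/@alice',
  ],
}

jrd[:aliases].length # => 2
```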

@ClearlyClaire
Contributor Author

Yes, I can understand why you would be hesitant to change db schema for this. Webfinger aliases would contain entire URIs that would probably have to be parsed each time, which will probably be way too slow (my current solution is to save the domain itself in web_domain, at remote account creation/update, so that the lookup is fast). As for including it in the aliases, the profile URL (thus including the WEB_DOMAIN) is already part of the advertised aliases.

@akihikodaki akihikodaki added the bug Something isn't working label Jun 3, 2017
@ClearlyClaire ClearlyClaire force-pushed the fix-push branch 4 times, most recently from efaedcc to b909cf9 Compare June 10, 2017 16:45
@ClearlyClaire ClearlyClaire force-pushed the fix-push branch 2 times, most recently from 26340e5 to 7c4ed60 Compare June 15, 2017 07:21
@ClearlyClaire ClearlyClaire force-pushed the fix-push branch 2 times, most recently from 171a38e to c58f4da Compare June 25, 2017 09:07
When we are only interested in the normalized host, calling normalized_host
avoids normalizing the other components of the URI as well as creating a
new object
…L_DOMAIN

While PuSH subscriptions are anonymous, and instance-wide in practice, Mastodon
attempts to deliver private (follower-only) toots to instances known to host
followers. To do so, it checks that the PuSH callback has the same host name as
one of the followers.

This assumption is insufficient for Mastodon instances for which
WEB_DOMAIN ≠ LOCAL_DOMAIN. This fix adds a “web_domain” field to Account, which
will be set for remote accounts whose remote_url is hosted on a different
domain than the domain part of their “acct:” URI.

While this makes assumptions not present in OStatus, it is consistent with
Mastodon's behaviour and does not limit federation in any way.
@Gargron
Member

Gargron commented Jul 10, 2017

I have a new idea. I'm going to implement HTTP signatures (there is a draft RFC for them). PuSH subscription requests will be signed on behalf of the user who triggered the subscription. We don't really need a specific user, but once we have one we can store the domain in the subscriptions table, and then filter by it.

@Gargron Gargron mentioned this pull request Jul 11, 2017
@ClearlyClaire
Contributor Author

That seems sensible to me as long as it is optional (since that's not part of OStatus).

Gargron added a commit that referenced this pull request Jul 14, 2017
…ain (#4205)

* Fix #2672 - Connect signed PuSH subscription requests to instance domain

Resolves #2739

* Fix return of locate_subscription

* Fix tests
@ClearlyClaire ClearlyClaire deleted the fix-push branch July 15, 2017 14:44
abcang pushed a commit to pixiv/mastodon that referenced this pull request Aug 22, 2017
…ance domain (mastodon#4205)

* Fix mastodon#2672 - Connect signed PuSH subscription requests to instance domain

Resolves mastodon#2739

* Fix return of locate_subscription

* Fix tests

Labels

bug Something isn't working


Development

Successfully merging this pull request may close these issues.

v1.3+ does not send toots to instances using a separate WEB_DOMAIN

3 participants