Integrate Snowflake IDs for statuses #1059
Description
Snowflake IDs are an idea that is used by Twitter, Discord, Instagram and many more. Here is the problem layout:
- Timelines are ordered by unique IDs, this allows keyset pagination (since_id, max_id)
- Timelines are meant to be chronological
- This is not an issue for local toots, because they get an assigned ID when they are created
- However, federated toots may arrive with various delays
- There are processes that "pull" remote toots into the system as well (thread resolving, the search for toot URL thing)
- In those cases, they also get an ID assigned when they are created locally, which can differ wildly from their created_at timestamp
A snowflake ID is a big integer that consists of a precise timestamp, a sequence mask, and a couple more optional parts to ensure uniqueness, depending on how many different generators run at the same time (i.e. different workers, shards, regions, etc.)
Because the outermost part of a snowflake ID is a timestamp, chronological sorting is built in. And while the timestamp is most safely-unique when it contains precise milliseconds, I think those can be padded for remote toots (whose timestamp does not carry ms precision)
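For concreteness, Twitter's reference implementation (linked below) packs a 41-bit millisecond timestamp into the topmost bits, followed by a 10-bit worker ID and a 12-bit sequence. A minimal Ruby sketch of taking such an ID apart (the bit layout and epoch follow Twitter's published scheme; the function name here is made up):

```ruby
# Twitter's custom epoch: milliseconds subtracted from the wall clock
# before the timestamp is packed into the top bits of the ID.
TWITTER_EPOCH_MS = 1_288_834_974_657

# Decompose a Twitter-style snowflake ID into its three parts.
def decompose_snowflake(id)
  {
    timestamp_ms: (id >> 22) + TWITTER_EPOCH_MS, # top 41 bits + epoch
    worker_id:    (id >> 12) & 0x3FF,            # middle 10 bits
    sequence:     id & 0xFFF                     # bottom 12 bits
  }
end
```

Because the timestamp occupies the most significant bits, comparing two IDs numerically is the same as comparing their creation times (with worker/sequence bits breaking ties), which is exactly what keyset pagination needs.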
- All status ID columns in the database schema are bigint
- A snowflake generator must be integrated for status IDs instead of Postgres' autoincrement sequence
- The snowflake generator must accept a timestamp parameter from the past
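As a rough illustration of the last requirement, here is a minimal per-process generator sketch in Ruby. The bit layout is Twitter-style and the class name and epoch are made up for this example; the point is that `next_id` accepts an arbitrary `Time`, so a remote toot's created_at from the past can be passed in (its missing sub-second precision simply becomes zero milliseconds):

```ruby
# Sketch of a snowflake-style ID generator whose timestamp can be
# supplied by the caller. Names, epoch, and layout are assumptions.
class SnowflakeGenerator
  EPOCH_MS = 1_420_070_400_000 # assumed custom epoch (2015-01-01 UTC)

  def initialize(worker_id = 0)
    @worker_id = worker_id & 0x3FF # 10 bits
    @sequence  = 0                 # 12 bits, wraps per millisecond
    @last_ms   = -1
    @mutex     = Mutex.new
  end

  # time defaults to now, but may be any Time in the past
  # (e.g. a federated status's created_at).
  def next_id(time = Time.now)
    @mutex.synchronize do
      ms = (time.to_f * 1000).to_i - EPOCH_MS
      if ms == @last_ms
        # Same millisecond as the previous ID: bump the sequence.
        # (A real implementation must handle the wrap-around case.)
        @sequence = (@sequence + 1) & 0xFFF
      else
        @sequence = 0
        @last_ms  = ms
      end
      (ms << 22) | (@worker_id << 12) | @sequence
    end
  end
end
```

Note that each web/sidekiq process would hold its own `@sequence`, so this per-process sketch still runs into the cross-process uniqueness problem described below.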
This is a somewhat complicated area, and I welcome any help & expertise on the issue. Here are a couple of resources for reference:
- http://rob.conery.io/2014/05/29/a-better-id-generator-for-postgresql/
- https://github.com/twitter/snowflake/tree/b3f6a3c6ca8e1b6847baa6ff42bf72201e2c2231
Another note is that Twitter, Instagram & co. can write separate high-performance daemons for ID generation. A separate process just for that would ensure that all the different web and sidekiq workers call the same ID generator, thus ensuring that the sequence part in the ID is unique. That would be a simple thing to do; however, there is a limit to how many different processes I want to impose on small instance admins (there are already three: web, sidekiq, streaming).
Another approach is having the ID generator in Ruby code. However, because a variable number of web and sidekiq workers could call this code in each of their separate processes, they wouldn't share the sequence part. As I mentioned earlier, this could be mitigated by a "worker index" part in the snowflake ID. However, the problem is that the various web and sidekiq workers do not know information like "I am worker number 2 out of 5".
The first linked article describes the generator at the Postgres schema level. This is the most promising angle, because Postgres is definitely the one source of truth in any deployment. However, I have no experience writing custom Postgres functions.
Thanks for reading and hope we can figure something out!