added :binary_data_format option #95

Merged
jcoglan merged 1 commit into faye:main from mackuba:binary_format
May 24, 2025
Conversation

@mackuba
Contributor

@mackuba mackuba commented May 8, 2025

TLDR: in a binary stream, Hybi#emit_message converts the data from a binary string to a byte array. This adds up to a lot of work when processing a very high-throughput stream, and is especially wasteful if the API user actually wants a string and has to convert the array back using pack('C*') (not sure if this is the best way). It also creates unnecessary temporary data, which adds work for the GC.
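For illustration, the round-trip forced on the API user by the current behaviour looks roughly like this (variable names are illustrative, not the actual driver internals):

```ruby
# Sketch of the String -> Array -> String round-trip under the current default.
payload   = "\x01\x02\x03".b        # binary string as read from the socket
as_array  = payload.bytes           # what Hybi#emit_message hands to the user
as_string = as_array.pack('C*')     # what the user must do before decoding (e.g. with CBOR)

as_string == payload  # => true, so both conversions were pure overhead
```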

This PR adds an option :binary_data_format => :array / :string (defaulting to :array); if you set it to :string, Hybi#emit_message just passes the string it has through to the message as is. There will be a second PR to faye-websocket related to this, which I'll add in a moment.

In my case, this change can result in as much as a 200-300% speedup (!) (measured while doing literally nothing more than reading and discarding data from the websocket; in the actual app the gain will obviously be smaller, with database calls etc., since reading the streamed data is a smaller share of the total work).


Backstory: I have a number of projects using the Bluesky API / AT Protocol, including the skyfall Ruby gem for streaming from their streaming APIs, built on Faye. The Bluesky "firehose" API emits all "events" happening anywhere in the network, i.e. any new/deleted post, like, follow and so on. It also includes quite a lot of overhead, because there's some low-level data there that might be needed by some clients but in practice usually isn't.

The current traffic (which varies during the day) is around 500 events/second, or around 20 Mbit/s. But we already had sustained traffic of almost 2000 evt/s for a while back in November, and both Bluesky's and third-party devs' code was often struggling then. If the network keeps growing, the traffic is expected to grow again, so I've been trying to optimize my code to prepare for that (incl. migrating from SQLite to Postgres).

I was recently profiling the processing code and checking where I could make some more speed gains. At the moment I can push it to around 4000 events/second, potentially to 5-6k, so I have some room to grow.

When I ran a profiler, I noticed that most of the time was spent inside the websocket library, specifically on converting strings to arrays and back. I dug into the code and found that the websocket driver has the data as a binary string, converts it to a byte array, which is passed to me, and I then need to convert it back to a binary string in order to pass it to the decoder (CBOR). How much could I save if the string were just passed to me without conversion? Turns out, a lot!
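To get a feel for the cost locally, a quick micro-benchmark of the round-trip alone can be run like this (the ~1 KB payload size and iteration count are illustrative assumptions, roughly on the order of a firehose event):

```ruby
require 'benchmark'

# Rough micro-benchmark: String -> Array -> String round-trip vs. pass-through.
payload = Random.new.bytes(1024)
n = 100_000

Benchmark.bm(13) do |x|
  x.report('round-trip:')   { n.times { payload.bytes.pack('C*') } }
  x.report('pass-through:') { n.times { payload } }
end
```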

In this PR, I've added the :binary_data_format => :array / :string option described above (defaulting to :array); with :string, Hybi#emit_message passes the string through as is. The second PR to faye-websocket, which I'll add in a moment, simply passes this option from the API user down to the driver.
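The effect of the option can be sketched like this (the method name and structure here are simplified for illustration, not the actual driver code):

```ruby
# Simplified sketch of what Hybi#emit_message does under each setting.
def emit_payload(buffer, binary_data_format)
  case binary_data_format
  when :array  then buffer.bytes   # old behaviour: Array of Integers
  when :string then buffer         # new option: pass the String through as is
  end
end

raw = "\xA1\x63key\x63val".b             # e.g. a CBOR-encoded payload
emit_payload(raw, :array).first          # => 161
emit_payload(raw, :string).equal?(raw)   # => true (no copy, no conversion)
```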

I made some simple test code using my library that you can run: https://github.com/mackuba/websocket-fix/blob/master/websocket_test.rb. Try it with e.g. 100,000 or 200,000 iterations (first arg), with and without -f (second arg).

Note: to see the difference in processing speed, you need at least a ~200 Mbit/s connection, so ideally run this on some kind of server. If you max out your connection speed, you will see a difference in CPU use instead.

Results of the test for me, first on my home connection:


% time ./websocket_test.rb 400000
Running in normal mode:
Connected (0.834173s to connect)

Estimated processing speed: 4394.1 evt/s

GC stats: {count: 681, time: 1145, marking_time: 80, sweeping_time: 1065, heap_allocated_pages: 93, heap_empty_pages: 0, heap_allocatable_slots: 0, heap_available_slots: 90335, heap_live_slots: 64919, heap_free_slots: 25416, heap_final_slots: 0, heap_marked_slots: 64877, heap_eden_pages: 93, total_allocated_pages: 93, total_freed_pages: 0, total_allocated_objects: 14701400, total_freed_objects: 14636481, malloc_increase_bytes: 976, malloc_increase_bytes_limit: 32225676, minor_gc_count: 672, major_gc_count: 9, compact_count: 0, read_barrier_faults: 0, total_moved_objects: 0, remembered_wb_unprotected_objects: 0, remembered_wb_unprotected_objects_limit: 647, old_objects: 64761, old_objects_limit: 129522, oldmalloc_increase_bytes: 976, oldmalloc_increase_bytes_limit: 40928508}

./websocket_test.rb 400000 50.64s user 1.89s system 57% cpu 1:32.03 total

% time ./websocket_test.rb 400000 -f
Running in FAST mode:
Connected (0.839828s to connect)

Estimated processing speed: 4502.4 evt/s

GC stats: {count: 517, time: 2025, marking_time: 222, sweeping_time: 1803, heap_allocated_pages: 99, heap_empty_pages: 0, heap_allocatable_slots: 0, heap_available_slots: 95660, heap_live_slots: 64920, heap_free_slots: 30740, heap_final_slots: 0, heap_marked_slots: 64878, heap_eden_pages: 99, total_allocated_pages: 99, total_freed_pages: 0, total_allocated_objects: 13500383, total_freed_objects: 13435463, malloc_increase_bytes: 976, malloc_increase_bytes_limit: 16777216, minor_gc_count: 512, major_gc_count: 5, compact_count: 0, read_barrier_faults: 0, total_moved_objects: 0, remembered_wb_unprotected_objects: 0, remembered_wb_unprotected_objects_limit: 647, old_objects: 64762, old_objects_limit: 129524, oldmalloc_increase_bytes: 976, oldmalloc_increase_bytes_limit: 19737900}

./websocket_test.rb 400000 -f 21.76s user 4.79s system 29% cpu 1:29.86 total


Results when running on a server:

$ time ./websocket_test.rb 400000
Running in normal mode:
Connected (0.685806801s to connect)

Estimated processing speed: 6559.5 evt/s

GC stats: {count: 674, time: 959, marking_time: 164, sweeping_time: 795, heap_allocated_pages: 84, heap_empty_pages: 0, heap_allocatable_slots: 0, heap_available_slots: 77650, heap_live_slots: 52010, heap_free_slots: 25640, heap_final_slots: 0, heap_marked_slots: 51968, heap_eden_pages: 84, total_allocated_pages: 84, total_freed_pages: 0, total_allocated_objects: 14677645, total_freed_objects: 14625635, malloc_increase_bytes: 976, malloc_increase_bytes_limit: 29129457, minor_gc_count: 666, major_gc_count: 8, compact_count: 0, read_barrier_faults: 0, total_moved_objects: 0, remembered_wb_unprotected_objects: 0, remembered_wb_unprotected_objects_limit: 518, old_objects: 51852, old_objects_limit: 103704, oldmalloc_increase_bytes: 976, oldmalloc_increase_bytes_limit: 34107091}

real 1m1.718s
user 0m57.342s
sys 0m2.746s

$ time ./websocket_test.rb 400000 -f
Running in FAST mode:
Connected (0.679570876s to connect)

Estimated processing speed: 26389.5 evt/s

GC stats: {count: 570, time: 618, marking_time: 121, sweeping_time: 496, heap_allocated_pages: 86, heap_empty_pages: 0, heap_allocatable_slots: 0, heap_available_slots: 79696, heap_live_slots: 52011, heap_free_slots: 27685, heap_final_slots: 0, heap_marked_slots: 51969, heap_eden_pages: 86, total_allocated_pages: 86, total_freed_pages: 0, total_allocated_objects: 13481274, total_freed_objects: 13429263, malloc_increase_bytes: 976, malloc_increase_bytes_limit: 16777216, minor_gc_count: 565, major_gc_count: 5, compact_count: 0, read_barrier_faults: 0, total_moved_objects: 0, remembered_wb_unprotected_objects: 0, remembered_wb_unprotected_objects_limit: 518, old_objects: 51853, old_objects_limit: 103706, oldmalloc_increase_bytes: 976, oldmalloc_increase_bytes_limit: 19737900}

real 0m15.885s
user 0m7.368s
sys 0m1.860s

(~25k events/s is comparable to what I got when testing a Rust library doing the same thing!)

@jcoglan
Collaborator

jcoglan commented May 24, 2025

Thanks a lot for finding and fixing this. On reflection, it was an ill-advised decision of mine to represent binary data as an array. I thought this would make it easier to distinguish and handle, but as you point out, most things in the Ruby ecosystem that parse binary formats expect the input to be a string, and that is the data type you get when reading from a file or socket. There's really no point allocating an array that you're going to have to convert back again.

With that in mind, I actually think that emitting a string should be the new default. This may be a breaking change, but if we don't do it I think it will be difficult for most users to get the performance benefit. websocket-driver is normally not invoked directly by applications but is buried deep inside other libraries and frameworks, which would need to update their code to take advantage of this. Therefore I think we should emit strings by default, but I've kept your option for reverting to the old behaviour of emitting arrays.

I've also updated the frame() method to treat Encoding::BINARY strings as binary messages, so that code for echo servers or other pass-through logic works correctly.
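The dispatch described here can be sketched as follows (a simplified illustration of the behaviour, not the actual driver code; strings tagged Encoding::BINARY are framed as binary messages, everything else as text):

```ruby
# WebSocket frame opcodes per RFC 6455.
OPCODES = { text: 0x1, binary: 0x2 }

# Choose the frame type based on the string's encoding.
def opcode_for(message)
  message.encoding == Encoding::BINARY ? OPCODES[:binary] : OPCODES[:text]
end

opcode_for('hello')           # => 1 (text frame)
opcode_for("\x00\x01\x02".b)  # => 2 (binary frame)
```

This means an echo server that receives a binary message as a BINARY-encoded string and writes it straight back out will re-frame it as binary rather than text.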

This change will ship as version 0.8.0, as it involves a change to the API.

@mackuba
Contributor Author

mackuba commented May 24, 2025

Awesome, thank you! 🙏🏻
