Skip to content

Optimize oj_dump_cstr using SSE4.2 and SSSE3.#973

Merged
ohler55 merged 19 commits intoohler55:developfrom
samyron:optimize-oj_dump_cstr-sse4
Feb 4, 2026
Merged

Optimize oj_dump_cstr using SSE4.2 and SSSE3.#973
ohler55 merged 19 commits intoohler55:developfrom
samyron:optimize-oj_dump_cstr-sse4

Conversation

@samyron
Copy link
Copy Markdown
Contributor

@samyron samyron commented Jun 11, 2025

Hello! Me again.. this time optimizing oj_dump_cstr on x86-64 platforms using SSE4.2 and SSSE3.

Apologies for taking so long to finish this.

Benchmarks

CPU: Intel(R) Core(TM) i7-8850H

Real world benchmarks

develop commit: 735514652c7b112fe8971a8962ab9aaf0bc95f67
optimize-oj_dump_cstr-sse4 commit: a3956438b154be15695d99d1e34c98b4f4481f86

== Encoding activitypub.json (52595 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [x86_64-linux]
Warming up --------------------------------------
               after     1.286k i/100ms
Calculating -------------------------------------
               after     12.567k (± 3.4%) i/s   (79.58 μs/i) -     63.014k in   5.020525s

Comparison:
              before:     8394.4 i/s
               after:    12566.7 i/s - 1.50x  faster


== Encoding citm_catalog.json (500298 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [x86_64-linux]
Warming up --------------------------------------
               after    70.000 i/100ms
Calculating -------------------------------------
               after    714.000 (± 5.2%) i/s    (1.40 ms/i) -      3.570k in   5.014227s

Comparison:
              before:      722.5 i/s
               after:      714.0 i/s - same-ish: difference falls within error


== Encoding twitter.json (466906 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [x86_64-linux]
Warming up --------------------------------------
               after   136.000 i/100ms
Calculating -------------------------------------
               after      1.385k (± 7.2%) i/s  (721.77 μs/i) -      6.936k in   5.042296s

Comparison:
              before:     1262.5 i/s
               after:     1385.5 i/s - same-ish: difference falls within error


== Encoding ohai.json (20145 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [x86_64-linux]
Warming up --------------------------------------
               after     1.647k i/100ms
Calculating -------------------------------------
               after     19.819k (± 7.0%) i/s   (50.46 μs/i) -     98.820k in   5.016088s

Comparison:
              before:    21454.0 i/s
               after:    19818.6 i/s - same-ish: difference falls within error

@ohler55
Copy link
Copy Markdown
Owner

ohler55 commented Jun 17, 2025

Is this still a WIP?

@samyron
Copy link
Copy Markdown
Contributor Author

samyron commented Jun 18, 2025

Is this still a WIP?

Yes.. I at least need to clean up the warnings. Two questions for you though:

  1. Are you okay you with runtime CPU detection to determine which method to use (on x86-64)? Or would you prefer using a compiler flag to enable/disable the feature?
  2. If using runtime CPU detection, would you like me to use that for the SSE4.2 code in the parser? If you would prefer the compile-time flag, would you like me to use the same --with-sse42 flag for this code too?

@ohler55
Copy link
Copy Markdown
Owner

ohler55 commented Jun 18, 2025

If the overhead of runtime is trivial then that would be fine but compile time is best if possible.

@samyron
Copy link
Copy Markdown
Contributor Author

samyron commented Jun 22, 2025

Apologies for the delays... it's been a busy week. I should be able to wrap this up within the next few days.

@ohler55
Copy link
Copy Markdown
Owner

ohler55 commented Jun 22, 2025

No worries. Same has happens to me on more than one occasion.

@samyron samyron marked this pull request as ready for review June 26, 2025 02:59
@samyron
Copy link
Copy Markdown
Contributor Author

samyron commented Jun 26, 2025

This should be ready for review. There is still a bit of duplication between the NEON and SSE4.2 code which can probably be made a bit more generic. I'm not entirely sure it's worth it but I'm happy to do so if you'd like.

@samyron
Copy link
Copy Markdown
Contributor Author

samyron commented Jan 22, 2026

Apologies for all of the commits / noise on this PR. Also... major apologies for this going dark for so long... I didn't expect to let this sit for months. I'm currently merging in the develop branch and I'm going to retest and benchmark. Will finish it up shortly.

@samyron
Copy link
Copy Markdown
Contributor Author

samyron commented Jan 24, 2026

Just updated the comment with fresh benchmarks. I think this is ready to go.

@samyron
Copy link
Copy Markdown
Contributor Author

samyron commented Jan 24, 2026

If #992 is approved this PR should be modified to use the runtime instruction set detection.

@ohler55
Copy link
Copy Markdown
Owner

ohler55 commented Feb 3, 2026

With the recent SSE PR merge is this still valid. Lots of churn in that area.

@samyron
Copy link
Copy Markdown
Contributor Author

samyron commented Feb 3, 2026

Yes.. though I now need to circle back and ensure this works with the runtime CPU detection. This is currently relying on the previous compile time #ifdef. Additionally, the other PR was focused on the parser, this one is for encoding.

@ohler55
Copy link
Copy Markdown
Owner

ohler55 commented Feb 3, 2026

I was about ready to make a release. If your PR can be updated quickly I'll hold off for a bit.

@samyron
Copy link
Copy Markdown
Contributor Author

samyron commented Feb 3, 2026

I can try to work on this tonight... I can post an update one way or another if you can hold off for a day... if you prefer not to wait I completely understand.

@ohler55
Copy link
Copy Markdown
Owner

ohler55 commented Feb 3, 2026

I've stall long enough another day will not be a problem.

@samyron
Copy link
Copy Markdown
Contributor Author

samyron commented Feb 4, 2026

This now uses the runtime SSE4.2 detection. This seems to pass all tests and still have the same performance mentioned above.

@samyron
Copy link
Copy Markdown
Contributor Author

samyron commented Feb 4, 2026

There is probably a clean-up ask to refactor the parser to use the SIMD_Impl initialized in oj.c (defined in simd.h).

@ohler55 ohler55 merged commit 88972f9 into ohler55:develop Feb 4, 2026
49 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants