Discussion about this post

Nicolas Duminil:

"Python builds the prototype, Java builds the system that survive scale."

I couldn't agree more. This statement should be repeated more often.

Bartek:

Hi Markus,

Thanks for the insightful article.

One question that came up while reading: isn’t one of the core limitations for AI inference on the JVM the lack of native GPU / accelerator support, including efficient GPU–CPU switching and device-level memory management? JVM multithreading and FFM help a lot on the CPU side, but they don’t fundamentally solve GPU offload or scheduling, which is still critical for high-throughput inference.
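To make the concern concrete, here is a minimal sketch of the FFM downcall pattern I mean. The library name `libaccel.so` and the kernel `accel_gemm_f32` are hypothetical stand-ins for a vendor runtime (cuBLAS, oneDNN, and the like); the JVM only brokers the call, while device scheduling stays entirely on the native side:

```java
import java.lang.foreign.*;
import java.lang.invoke.MethodHandle;

public class NativeGemm {
    public static void main(String[] args) throws Throwable {
        Linker linker = Linker.nativeLinker();
        // "libaccel.so" is a made-up library name for illustration.
        SymbolLookup accel = SymbolLookup.libraryLookup("libaccel.so", Arena.global());

        // Assumed native signature (hypothetical):
        // int accel_gemm_f32(const float* a, const float* b, float* c, int m, int n, int k)
        MethodHandle gemm = linker.downcallHandle(
                accel.find("accel_gemm_f32").orElseThrow(),
                FunctionDescriptor.of(ValueLayout.JAVA_INT,
                        ValueLayout.ADDRESS, ValueLayout.ADDRESS, ValueLayout.ADDRESS,
                        ValueLayout.JAVA_INT, ValueLayout.JAVA_INT, ValueLayout.JAVA_INT));

        int m = 2, n = 2, k = 2;
        try (Arena arena = Arena.ofConfined()) {
            // Off-heap buffers: the GC never scans or moves these segments.
            MemorySegment a = arena.allocate(ValueLayout.JAVA_FLOAT, (long) m * k);
            MemorySegment b = arena.allocate(ValueLayout.JAVA_FLOAT, (long) k * n);
            MemorySegment c = arena.allocate(ValueLayout.JAVA_FLOAT, (long) m * n);
            int status = (int) gemm.invokeExact(a, b, c, m, n, k);
            System.out.println("native gemm returned " + status);
        }
    }
}
```

FFM makes the bridge cheap and safe, but everything after the call boundary (kernel launch, stream scheduling, device memory) is still the native runtime's job.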

Related to that, I’m also wondering about loading very large model weights directly into the JVM:

Wouldn’t holding multi-GB weights inside the JVM heap put significant pressure on the Garbage Collector, especially for long-running inference services?

Even with off-heap memory or Panama bindings, do you see GC behavior, memory fragmentation, or pause times becoming a practical bottleneck at scale?
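For concreteness, here is the off-heap pattern I have in mind, a minimal sketch using the FFM API's mapped `MemorySegment`; `model.bin` is a placeholder path:

```java
import java.lang.foreign.*;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class WeightLoader {
    public static void main(String[] args) throws Exception {
        Path weights = Path.of("model.bin"); // hypothetical weight file
        try (FileChannel ch = FileChannel.open(weights, StandardOpenOption.READ);
             Arena arena = Arena.ofShared()) {
            // The mapped pages live outside the Java heap; the GC tracks only
            // the small MemorySegment wrapper, never the multi-GB weights.
            MemorySegment seg = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size(), arena);
            // Read a value without materializing any tensor on-heap.
            float first = seg.get(ValueLayout.JAVA_FLOAT, 0);
            System.out.printf("mapped %d bytes, first weight = %f%n", seg.byteSize(), first);
        }
    }
}
```

With this pattern, GC pause times depend on the per-request allocations, not the weights, though page-cache behavior and NUMA placement become the new tuning surface. I'm curious whether that matches what you see in practice.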

