We made state-of-the-art speech synthesis scalable, achieving a remarkable improvement in both latency and throughput.

  • 70% faster time to first audio
  • 60% lower cost
  • 200ms time to first audio

"Our collaboration with Modular is a glimpse into the future of accessible AI infrastructure. Our API now returns the first 2 seconds of synthesized audio on average ~70% faster compared to vanilla vLLM based implementation, at just 200ms for 2 second chunks. This allowed us to serve more QPS with lower latency and eventually offer the API at a ~60% lower price than would have been possible without using Modular’s stack."

Igor Poletaev

Chief Science Officer - Inworld

Problem

Inworld helps teams build AI products for consumer applications, with services that organically evolve throughout the product experience and scale to meet user demands.

A team of former DeepMind and Google engineers, Inworld was already pushing the limits of what is possible with existing AI infrastructure, and they came to Modular because they needed an even faster, more capable speech synthesis service. The computational intensity of generating realistic, low-latency speech creates significant technical challenges; specialized APIs were essential to enable scalable and economically viable voice AI applications.

Solving this required more than just optimizing their text-to-speech (TTS) model; it demanded a fundamental redesign of the entire inference stack. This is where the collaboration with Modular began.

Solution

Our partnership with Inworld represents a co-engineered approach, where both companies’ engineering teams worked together to integrate Modular’s MAX Framework with Inworld’s proprietary text-to-speech model. In less than eight weeks, we went from the start of the engagement to the world's most advanced speech pipeline on the NVIDIA Blackwell GPU architecture.

There were many technical hurdles to overcome to make the architecture scalable and achieve the fastest results possible, and using Modular's MAX and Mojo together was essential. MAX's streaming-aware scheduler, designed to minimize time-to-first-token (TTFT), coupled with its highly optimized kernel library, delivered ~1.6x faster performance for the Speech-Language Model (SpeechLM) component. Mojo makes it possible to define high-efficiency custom kernels, which allowed Inworld to build, for example, a tailored silence-detection kernel that runs directly on the GPU.
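As a rough illustration of what such a kernel computes (this is not Inworld's Mojo implementation; the frame size and threshold below are arbitrary), silence detection essentially flags frames whose short-term energy falls below a threshold. A minimal Python/NumPy sketch:

```python
import numpy as np

def detect_silence(audio: np.ndarray, frame_size: int = 512,
                   threshold_db: float = -50.0) -> np.ndarray:
    """Flag frames whose RMS energy falls below a dB threshold.

    A plain NumPy stand-in for the per-frame reduction that a GPU
    silence-detection kernel would perform; frame size and threshold
    here are illustrative, not production values.
    """
    # Trim to a whole number of frames and compute per-frame RMS energy.
    n_frames = len(audio) // frame_size
    frames = audio[: n_frames * frame_size].reshape(n_frames, frame_size)
    rms = np.sqrt(np.mean(frames.astype(np.float64) ** 2, axis=1))

    # Convert to decibels and compare against the silence threshold.
    rms_db = 20.0 * np.log10(rms + 1e-12)
    return rms_db < threshold_db

# Example: very quiet noise is flagged almost entirely as silence.
quiet = np.random.randn(24_000).astype(np.float32) * 1e-4
print(f"{detect_silence(quiet).mean():.0%} of frames marked silent")
```

Running this kind of per-frame reduction directly on the GPU, as Inworld's custom Mojo kernel does, presumably avoids copying synthesized audio back to the host just to trim silence.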

The SpeechLM architecture itself was a breakthrough, achieved by adapting and scaling a cutting-edge, open-source-inspired tech stack. The model is a Speech-Language Model (SpeechLM) built on an in-house neural audio codec and an LLM backbone.
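At a high level, a SpeechLM of this kind uses the LLM backbone to turn text into discrete audio-codec tokens, then uses the neural codec to decode those tokens into waveform samples. The sketch below illustrates that two-stage, streaming flow in Python; the stub functions, codebook size, and chunk sizes are hypothetical placeholders, not Inworld's actual components:

```python
import numpy as np

# Hypothetical stand-ins for the two stages: an LLM backbone that emits
# discrete audio-codec tokens, and a neural codec that decodes them to audio.
def generate_audio_tokens(text: str):
    """Stub backbone: yields one fake codec token per character, in order."""
    for ch in text:
        yield ord(ch) % 1024  # pretend codebook of size 1024

def decode_tokens(tokens: np.ndarray, samples_per_token: int = 480) -> np.ndarray:
    """Stub codec decoder: maps each token to a short run of samples."""
    return np.repeat(tokens.astype(np.float32) / 1024.0, samples_per_token)

def synthesize_streaming(text: str, chunk_tokens: int = 25):
    """Yield audio chunks as soon as enough codec tokens have been generated."""
    buffer = []
    for token in generate_audio_tokens(text):
        buffer.append(token)
        if len(buffer) >= chunk_tokens:
            yield decode_tokens(np.array(buffer))
            buffer = []
    if buffer:
        yield decode_tokens(np.array(buffer))

for i, chunk in enumerate(synthesize_streaming("Hello from a streaming SpeechLM sketch.")):
    print(f"chunk {i}: {len(chunk)} samples")
```

Because decoding can begin as soon as a small buffer of tokens is ready, the first audio chunk is returned long before the full utterance has been generated, which is the property that low time-to-first-audio depends on.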

Now, anyone who builds with Inworld gets the direct benefits of Modular:

Deliver truly instant interactions. Thanks to MAX's streaming-aware scheduler, your application gets the first chunk of audio in as little as 200ms, eliminating awkward pauses and keeping users immersed (see the client-side sketch after this list).

Scale your application without fear of cost. By optimizing the entire stack for high throughput, we cut the price by ~60%. You can now serve more users and deploy rich voice experiences at a cost that is ~22x lower than alternatives.

Ensure seamless performance under load. Our architecture is built for high throughput, ensuring your application can serve any QPS you need. The user experience remains seamless and responsive, even during traffic spikes.
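To make the latency benefit concrete, here is a minimal client-side sketch of consuming a streaming text-to-speech response and measuring time to first audio. The endpoint URL and request payload are hypothetical placeholders, not Inworld's actual API; consult their documentation for the real request format:

```python
import time
import requests  # assumes the `requests` package is installed

# Placeholder endpoint and payload for illustration only.
ENDPOINT = "https://api.example.com/tts/stream"
payload = {"text": "Hello there!", "voice": "default"}

start = time.monotonic()
first_chunk_at = None
audio = bytearray()

with requests.post(ENDPOINT, json=payload, stream=True, timeout=30) as resp:
    resp.raise_for_status()
    for chunk in resp.iter_content(chunk_size=None):  # chunks as they arrive
        if not chunk:
            continue
        if first_chunk_at is None:
            first_chunk_at = time.monotonic()
            print(f"time to first audio: {(first_chunk_at - start) * 1000:.0f} ms")
        audio.extend(chunk)

print(f"received {len(audio)} bytes of audio")
```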

Results

Deploying Inworld's model with the Modular Platform achieved a remarkable improvement in both latency and throughput. In streaming mode, the API now returns the first 2 seconds of synthesized audio ~70% faster on average than a vanilla vLLM-based implementation. This allowed Inworld to serve more QPS with lower latency and ultimately offer the API at a ~60% lower price than would have been possible without Modular's stack.
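As a back-of-the-envelope check of those figures (an inference from the quoted numbers, not a measured baseline), treating "~70% faster" as a ~70% latency reduction implies the vanilla vLLM baseline for the same 2-second chunk was roughly 670ms:

```python
modular_ms = 200   # quoted time to first audio for a 2-second chunk
reduction = 0.70   # quoted "~70% faster", read as a ~70% reduction
baseline_ms = modular_ms / (1 - reduction)
print(f"implied vLLM baseline: ~{baseline_ms:.0f} ms")  # ≈ 667 ms
```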

[Chart: latency (ms); lower is better]

About Inworld

Inworld develops AI products for builders of consumer applications, enabling applications that scale with user needs and evolve organically through the product experience. They are fundamentally redefining AI experiences by putting the user first.

Read more about the technical details of the engagement on Inworld's blog.

Request a demo of this use case

If you're deploying text-to-speech inference, request a demo today. We're excited to chat!


