InftyAI’s llmaz is an advanced inference platform designed to streamline the deployment and management of large language models (LLMs) on Kubernetes. By integrating state-of-the-art inference backends, llmaz brings cutting-edge research to the cloud, offering a production-ready solution for LLMs.
Key Features of llmaz:
- Easy-to-Use Kubernetes Integration: deploy and manage LLMs within Kubernetes clusters, leveraging Kubernetes’ robust orchestration capabilities.
- Advanced Inference Backends: Utilize state-of-the-art inference backends to ensure efficient and scalable model serving.
- Production-Ready: Designed for production environments, llmaz offers reliability and performance for enterprise applications.
Deploying a model with llmaz is straightforward. Here’s a toy example that deploys deepseek-ai/DeepSeek-R1; all you need to do is apply an OpenModel and a Playground.
```yaml
apiVersion: llmaz.io/v1alpha1
kind: OpenModel
metadata:
  name: deepseek-r1
spec:
  familyName: deepseek
  source:
    modelHub:
      modelID: deepseek-ai/DeepSeek-R1
  inferenceConfig:
    flavors:
      - name: default # Configure GPU type
        requests:
          nvidia.com/gpu: 1
---
apiVersion: inference.llmaz.io/v1alpha1
kind: Playground
metadata:
  name: deepseek-r1
spec:
  replicas: 1
  modelClaim:
    modelName: deepseek-r1
```
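To try it, save both manifests to a file (the name `quickstart.yaml` below is just an illustration) and apply them with `kubectl apply -f quickstart.yaml`. llmaz then creates the underlying serving workload, and you can watch the pods come up with `kubectl get pods`.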
Latest Release: v0.1.3
The latest release, v0.1.3, was published on April 23rd, 2025. This release includes several enhancements and bug fixes that improve the platform’s stability and performance. For detailed information on the changes introduced in this release, please refer to the release notes.
Integrations
- Broad Backend Support: llmaz supports a wide range of advanced inference backends for different scenarios, such as vLLM, Text Generation Inference, SGLang, and llama.cpp (see the backend-selection sketch after this list). Find the full list of supported backends here.
- Multiple Model Providers: llmaz can pull models from a wide range of providers, such as HuggingFace, ModelScope, and object stores.
- AI Gateway Support: integration with Envoy AI Gateway offers capabilities like token-based rate limiting and model routing.
- Built-in ChatUI: out-of-the-box chatbot support through the integration of Open WebUI, offering capabilities like function calling, RAG, web search, and more; see configurations here.
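As an example of backend selection, a Playground can pin which backend serves the model. The snippet below is only a minimal sketch: the `backendRuntimeConfig` block and its `backendName` field are assumptions based on typical llmaz examples and may differ across versions, so consult the backend documentation for the exact schema.

```yaml
apiVersion: inference.llmaz.io/v1alpha1
kind: Playground
metadata:
  name: deepseek-r1-sglang # hypothetical name, for illustration only
spec:
  replicas: 1
  modelClaim:
    modelName: deepseek-r1
  # Assumed schema: backendRuntimeConfig/backendName may vary by llmaz version;
  # when omitted, llmaz falls back to its default backend.
  backendRuntimeConfig:
    backendName: sglang
```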
llmaz, an easy-to-use yet advanced inference platform, uses LeaderWorkerSet as the underlying workload to support both single-host and multi-host inference scenarios.
llmaz supports horizontal scaling with HPA by default and will integrate with autoscaling components such as Cluster Autoscaler or Karpenter for smart scaling across different clouds.
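As a minimal sketch of what horizontal scaling can look like, the manifest below is a standard autoscaling/v2 HorizontalPodAutoscaler pointed at the Playground from the quickstart above; it assumes the Playground resource exposes the scale subresource (which HPA requires), and the CPU target is purely illustrative.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: deepseek-r1-hpa # hypothetical name, for illustration only
spec:
  scaleTargetRef:
    # Assumes the Playground exposes the scale subresource, which HPA requires.
    apiVersion: inference.llmaz.io/v1alpha1
    kind: Playground
    name: deepseek-r1
  minReplicas: 1
  maxReplicas: 4
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80
```

In practice, an inference-specific signal (for example, request queue depth or tokens per second reported by the backend) is usually a better scaling trigger than CPU utilization.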
About the Founder: Kante Yin
Kante Yin is a prominent figure in the Kubernetes community, serving as a SIG Scheduling Approver and a top committer of LWS and Kueue. His contributions to Kubernetes scheduling and workload management have been instrumental in advancing cloud-native technologies. Kante’s expertise and leadership continue to drive innovation in the Kubernetes ecosystem.
Compared to other inference platforms, llmaz stands out with its extensible, cloud-native design, making it lightweight and efficient. Its architecture is optimized for scalability and resource efficiency, enabling seamless integration into modern cloud environments while maintaining high performance.
OSPP 2025 (Open Source Promotion Plan)
The Open Source Promotion Plan (OSPP) is a summer program launched in 2020 and organized by the Open Source Software Supply Chain Promotion Plan of the Institute of Software, Chinese Academy of Sciences. It aims to encourage university students to actively participate in the development and maintenance of open source software, to cultivate and discover more outstanding developers, to promote the vigorous development of excellent open source communities, and to assist in building the open source software supply chain.
llmaz has two projects in OSPP 2025, listed below. Student registration and application: May 9 – June 9. You are welcome to join our community.
- KEDA-based Serverless Elastic Scaling for llmaz
- Enabling Efficient Model and Container Image Distribution in LLMaz with Dragonfly
For more information about llmaz and its features, visit the GitHub repository.