Welcome to production-stack!
K8S-native cluster-wide deployment for vLLM.
The vLLM Production Stack project provides a reference implementation of how to build an inference stack on top of vLLM, which allows you to:
🚀 Scale from a single vLLM instance to a distributed vLLM deployment without changing any application code
💻 Monitor the metrics of the stack through a web dashboard
😄 Enjoy the performance benefits brought by request routing and KV cache offloading
📈 Easily deploy the stack on AWS, GCP, or any other cloud provider
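The stack is packaged as a Helm chart, so a minimal deployment takes only a few commands. The sketch below is an assumption-laden illustration: the Helm repository URL, chart name, release name, and values-file path follow the project's public Helm setup but should be verified against the Getting Started guide.

```bash
# Minimal deployment sketch. The repository URL, chart name ("vllm/vllm-stack"),
# and values.yaml path are assumptions; consult Getting Started for the
# authoritative commands and example values files.
helm repo add vllm https://vllm-project.github.io/production-stack
helm repo update

# Install the stack into the current Kubernetes context; the values file
# selects which model(s) to serve and how many replicas to run.
helm install vllm vllm/vllm-stack -f values.yaml

# Verify that the router and serving-engine pods come up.
kubectl get pods
```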
Documentation
Getting Started
Deployment
Use Cases
Developer Guide
Community