Created this issue to discuss the semantic caching requirement at a high level. I would like to hear what the community thinks of this. Supporting caching at the gateway is not something everyone agrees on, so let's have a discussion here and figure out the best approach to follow.
Why Implement Semantic Caching for LLM/AI Gateways?
LLM calls are expensive and time-consuming, making a caching layer highly valuable for consumers of API gateways. Users expect caching capabilities at the gateway level to reduce backend load and improve response times. While most API gateways support request-based caching (based on request attributes), semantic caching extends this concept to the AI/LLM domain by focusing on the meaning of requests rather than their syntax.
Some LLM providers offer built-in caching, but these implementations often consume additional resources on the LLM backend. By introducing semantic caching at the gateway level, we can provide vendor-agnostic caching that works independently of the LLM provider. This approach not only improves resource utilization but also supports multi-LLM backends, where a single API may route requests to different LLMs. Gateway-level caching enhances performance, offering faster responses and reducing backend processing costs.
Technical Considerations for Semantic Caching
In semantic caching, cache keys are derived from the meaning of a request rather than its exact syntax. This typically involves using embeddings, together with metadata and context, to match semantically similar requests accurately while keeping genuinely different requests apart.
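To make the idea concrete, here is a minimal in-memory sketch of embedding-based cache lookup. The `embed_fn` is a hypothetical stand-in for a real embedding model (e.g. a sentence-transformer or a provider's embedding API), and the similarity threshold is an illustrative value, not a recommendation:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

class SemanticCache:
    """Toy semantic cache: a hit is the closest stored embedding above a threshold."""

    def __init__(self, embed_fn, threshold: float = 0.9):
        self.embed_fn = embed_fn    # hypothetical: maps a prompt to a vector
        self.threshold = threshold  # minimum similarity to count as a hit
        self.entries: list[tuple[list[float], str]] = []  # (embedding, response)

    def lookup(self, prompt: str):
        query = self.embed_fn(prompt)
        best_score, best_response = 0.0, None
        for embedding, response in self.entries:
            score = cosine_similarity(query, embedding)
            if score > best_score:
                best_score, best_response = score, response
        # Only similar-enough prompts are treated as cache hits.
        return best_response if best_score >= self.threshold else None

    def store(self, prompt: str, response: str) -> None:
        self.entries.append((self.embed_fn(prompt), response))
```

A real gateway would replace the linear scan with an approximate nearest-neighbour index, which is where the storage options below come in.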
Storage options for implementing semantic caching (a sketch combining these layers follows the list):
- Vector Databases: Store and manage embeddings for semantic similarity searches.
- In-Memory Stores (e.g., Redis): Handle fast lookups for associated metadata and cache entries.
- Scalable Object Storage: Manage large payloads, such as LLM responses, efficiently.
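As a rough sketch of how these three layers might cooperate on a single request, assuming hypothetical `vector_db`, `metadata_store`, and `blob_store` clients (illustrative interfaces only, not any specific library's API):

```python
# Hypothetical request flow: vector DB for similarity search, an in-memory
# store (e.g. Redis) for cache metadata, and object storage for large payloads.
def cached_completion(prompt, embed_fn, vector_db, metadata_store, blob_store,
                      llm_call, threshold=0.9):
    embedding = embed_fn(prompt)

    # 1. Vector DB: nearest-neighbour search over stored prompt embeddings.
    match = vector_db.nearest(embedding)  # assumed to return (cache_id, similarity)
    if match and match.similarity >= threshold:
        # 2. In-memory store: fast lookup of cache metadata (TTL, model, payload key).
        meta = metadata_store.get(match.cache_id)
        if meta and not meta.expired:
            # 3. Object storage: fetch the potentially large LLM response body.
            return blob_store.get(meta.payload_key)

    # Cache miss: call the backend LLM and populate all three layers.
    response = llm_call(prompt)
    cache_id = vector_db.insert(embedding)
    payload_key = blob_store.put(response)
    metadata_store.set(cache_id, payload_key)
    return response
```

Splitting the layers this way keeps the hot path (similarity search plus metadata lookup) fast while large response bodies live in cheaper, scalable storage.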