LLM Prompt Caching: The Hidden Lever for Speed, Cost, and Reliability
You’ve probably heard that large language models (LLMs) are expensive and slow. But here’s what many teams overlook: the easiest way to dramatically cut costs and latency is not fine-tuning, not switching providers, and not prompt engineering tricks.
It’s prompt caching: a simple, underutilized strategy that separates scrappy prototypes from scalable AI products.
In this article, I’ll break down:
- What prompt caching is (and why it matters)
- The main caching strategies (with code examples)
- Pros and cons of each approach
- Best practices from the top 1% of AI engineers
What is Prompt Caching?
At its core, prompt caching means storing the results of a model call so that repeated prompts don’t re-hit the LLM. Instead of re-paying for the same answer, you reuse it instantly.
This matters because:
- LLM calls are expensive (fractions of a cent add up fast at scale).
- Responses can take seconds to generate.
- Many prompts are repetitive, especially in chatbots, RAG pipelines, or batch jobs.
Types of Prompt Caching (with Code Examples)
Let’s walk through the main strategies, their trade-offs, and implementation patterns.
1. Exact-Match Cache
The simplest form. If the prompt + parameters are identical, you serve a cached result.
Pros
- Easy to implement
- Reliable and deterministic
Cons
- Only works if input is exactly the same
- Doesn’t help with “near-duplicate” prompts
Example (Python + Redis):
import hashlib
import redis
from openai import OpenAI

client = OpenAI()
r = redis.Redis()

def cache_key(prompt, params):
    # Hash the prompt together with the generation params so different
    # models/settings never share a key (sorted to avoid dict-order quirks).
    return hashlib.sha256((prompt + str(sorted(params.items()))).encode()).hexdigest()

def cached_completion(prompt, params):
    key = cache_key(prompt, params)
    if (cached := r.get(key)):
        return cached.decode("utf-8")  # cache hit: no LLM call, no cost
    response = client.chat.completions.create(
        **params, messages=[{"role": "user", "content": prompt}]
    )
    content = response.choices[0].message.content
    r.set(key, content)  # cache miss: store for next time
    return content
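Usage is a one-liner; on a repeat call with the same prompt and parameters, the answer comes back from Redis instead of the API (the model name here is just for illustration):

params = {"model": "gpt-4o", "max_tokens": 200}
print(cached_completion("What is prompt caching?", params))  # first call hits the API
print(cached_completion("What is prompt caching?", params))  # identical call is served from Redis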
2. Semantic Caching
Instead of exact text matching, you use embeddings to cache semantically similar prompts.
Pros
- Handles paraphrased or slightly different prompts
- Useful in RAG or FAQ-like systems
Cons
- More complex (requires embeddings + similarity search)
- Risk of serving “close but wrong” answers
Example (with FAISS):
from sentence_transformers import SentenceTransformer
import faiss

model = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dimensional embeddings
index = faiss.IndexFlatL2(384)
cache = {}

def semantic_cache(prompt, generate_fn):
    emb = model.encode([prompt])           # shape (1, 384), float32
    if index.ntotal > 0:
        D, I = index.search(emb, 1)        # nearest cached prompt
        if D[0][0] < 0.2:                  # L2 distance threshold (smaller = more similar)
            return cache[int(I[0][0])]
    response = generate_fn(prompt)
    index.add(emb)
    cache[len(cache)] = response           # FAISS row id doubles as the dict key
    return response
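To wire it into the earlier setup, pass the actual LLM call as generate_fn; this is just a sketch reusing the cached_completion helper from the exact-match example:

answer = semantic_cache(
    "How do I cache LLM prompts?",
    lambda p: cached_completion(p, {"model": "gpt-4o", "max_tokens": 200}),
)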
3. Partial / Template Caching
Cache parts of prompts or structured templates. Example: FAQ answers or system prompts.
Pros
- Saves cost for static or semi-static sections
- Great in pipelines with repeated context
Cons
- Requires engineering discipline (define reusable prompt parts)
- Doesn’t cover dynamic queries
Example (templated prompt caching):
TEMPLATE = "You are a helpful assistant. Answer concisely.\nQuestion: {q}"

def cached_template(q):
    # The static template plus the question becomes the exact-match cache key.
    prompt = TEMPLATE.format(q=q)
    return cached_completion(prompt, {"model": "gpt-4o", "max_tokens": 200})
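One caveat worth making explicit: the key is still an exact match on the rendered prompt, so paraphrases of the same question create separate entries; that is exactly the gap semantic caching closes. For illustration:

cached_template("What is prompt caching?")        # miss: calls the LLM, result cached
cached_template("What is prompt caching?")        # hit: served from Redis
cached_template("Explain prompt caching to me")   # miss: different wording, different key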
4. Hybrid Caching
Combine exact-match, semantic, and template caching. Many production systems layer them:
- Check exact match
- Fallback to semantic match
- Generate new and cache it
This balances cost, correctness, and coverage.
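Here is a minimal sketch of that layering, reusing the pieces defined earlier (the Redis client and cache_key from the exact-match example, and the FAISS index, embedding model, and cache dict from the semantic example):

def hybrid_completion(prompt, params):
    # Layer 1: exact match (cheapest, fully deterministic)
    key = cache_key(prompt, params)
    if (hit := r.get(key)):
        return hit.decode("utf-8")

    # Layer 2: semantic match (catches paraphrases)
    emb = model.encode([prompt])
    if index.ntotal > 0:
        D, I = index.search(emb, 1)
        if D[0][0] < 0.2:                 # same L2 distance threshold as above
            return cache[int(I[0][0])]

    # Layer 3: generate, then populate both caches
    response = client.chat.completions.create(
        **params, messages=[{"role": "user", "content": prompt}]
    )
    content = response.choices[0].message.content
    r.set(key, content)
    index.add(emb)
    cache[len(cache)] = content
    return content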
Best Practices for Prompt Caching
1. Set expiration times
- Cache “forever” for static prompts
- Use TTL (time-to-live) for dynamic ones
2. Version your cache keys
- Include model name + version
- Prevent serving stale responses after a model upgrade
3. Log cache hit/miss metrics
- Measure effectiveness (a 30%+ hit rate can translate into substantial monthly savings at scale); the sketch after this list combines TTLs, versioned keys, and hit/miss counters
4. Don’t over-cache dynamic queries
- For real-time Q&A (like customer chats), balance freshness vs. cost
5. Layer caches
- In-memory (fastest)
- Redis / database (shared across servers)
- Semantic (last fallback before hitting the LLM)
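Putting the first three practices together, here is a minimal sketch, assuming the Redis and OpenAI clients from the exact-match example: the key embeds a model/version tag, entries carry a TTL, and hits and misses are counted in Redis so you can actually measure the cache.

CACHE_VERSION = "gpt-4o:v1"   # placeholder tag; bump it when the model or prompt format changes
TTL_SECONDS = 3600            # dynamic entries expire after an hour

def versioned_key(prompt, params):
    raw = CACHE_VERSION + prompt + str(sorted(params.items()))
    return hashlib.sha256(raw.encode()).hexdigest()

def cached_completion_v2(prompt, params):
    key = versioned_key(prompt, params)
    if (hit := r.get(key)):
        r.incr("cache:hits")              # track hit rate over time
        return hit.decode("utf-8")
    r.incr("cache:misses")
    response = client.chat.completions.create(
        **params, messages=[{"role": "user", "content": prompt}]
    )
    content = response.choices[0].message.content
    r.set(key, content, ex=TTL_SECONDS)   # TTL keeps dynamic answers from going stale
    return content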
When You Should Not Cache
- Compliance-sensitive applications (where every answer must be regenerated fresh)
- Rapidly changing data (e.g., stock prices, breaking news)
- Low-volume prototypes (you’ll spend more time coding the cache than you’ll save in compute)
Key Takeaways
- Prompt caching is the lowest-hanging fruit for speeding up and scaling LLM apps.
- Use exact-match caching as your baseline, then layer on semantic caching for smarter reuse.
- Always version your cache and measure hit rates; otherwise, you’re flying blind.