
LLM Prompt Caching: The Hidden Lever for Speed, Cost, and Reliability

Stephen
engineering-leadership · ai-architecture · product-strategy · team-building


You’ve probably heard that large language models (LLMs) are expensive and slow. But here’s what many teams overlook: the easiest way to dramatically cut costs and latency is not fine-tuning, not switching providers, and not prompt engineering tricks.

It’s prompt caching: a simple, underutilized strategy that separates the scrappy prototypes from the scalable AI products.

In this article, I’ll break down:

  • What prompt caching is (and why it matters)
  • The main caching strategies (with code examples)
  • Pros and cons of each approach
  • Best practices from the top 1% of AI engineers

What is Prompt Caching?

At its core, prompt caching means storing the result of a model call so that a repeated prompt doesn’t trigger another LLM call. Instead of paying again for the same answer, you reuse it instantly.

This matters because:

  • LLM calls are expensive (fractions of a cent add up fast at scale).
  • Responses can take seconds to generate.
  • Many prompts are repetitive, especially in chatbots, RAG pipelines, or batch jobs.

Types of Prompt Caching (with Code Examples)

Let’s walk through the main strategies, their trade-offs, and implementation patterns.

1. Exact-Match Cache

The simplest form. If the prompt + parameters are identical, you serve a cached result.

Pros

  • Easy to implement
  • Reliable and deterministic

Cons

  • Only works if input is exactly the same
  • Doesn’t help with “near-duplicate” prompts

Example (Python + Redis):

import hashlib
import json

import redis
from openai import OpenAI

client = OpenAI()
r = redis.Redis()

def cache_key(prompt, params):
    # Serialize params deterministically so identical inputs always hash to the same key
    payload = prompt + json.dumps(params, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

def cached_completion(prompt, params):
    key = cache_key(prompt, params)
    if (cached := r.get(key)):  # cache hit: skip the API call entirely
        return cached.decode("utf-8")
    response = client.chat.completions.create(
        **params,
        messages=[{"role": "user", "content": prompt}],
    )
    content = response.choices[0].message.content
    r.set(key, content)  # cache miss: store the answer for next time
    return content
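
A quick sanity check of how you might call it (assuming the cached_completion helper above and a running Redis instance; the model name and token limit are just examples):

params = {"model": "gpt-4o", "max_tokens": 200}
print(cached_completion("Explain prompt caching in one sentence.", params))  # hits the API
print(cached_completion("Explain prompt caching in one sentence.", params))  # served from Redis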

2. Semantic Caching

Instead of exact text matching, you use embeddings to cache semantically similar prompts.

Pros

  • Handles paraphrased or slightly different prompts
  • Useful in RAG or FAQ-like systems

Cons

  • More complex (requires embeddings + similarity search)
  • Risk of serving “close but wrong” answers

Example (with FAISS):

from sentence_transformers import SentenceTransformer
import faiss

model = SentenceTransformer('all-MiniLM-L6-v2')
index = faiss.IndexFlatL2(384)  # embedding dimension of all-MiniLM-L6-v2

cache = {}  # maps FAISS row id -> cached response

def semantic_cache(prompt, generate_fn):
    emb = model.encode([prompt])
    if index.ntotal > 0:
        D, I = index.search(emb, 1)  # nearest neighbour: distances D, row ids I
        if D[0][0] < 0.2:  # L2 distance threshold: lower means more similar
            return cache[int(I[0][0])]
    response = generate_fn(prompt)
    index.add(emb)
    cache[len(cache)] = response  # row id matches insertion order
    return response
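
To wire it up, you could pass the exact-match helper from the previous example as the generator (illustrative only; the 0.2 threshold needs tuning for your embedding model and traffic):

answer = semantic_cache(
    "How do I reset my password?",
    lambda p: cached_completion(p, {"model": "gpt-4o", "max_tokens": 200}),
)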

3. Partial / Template Caching

Cache parts of prompts or structured templates. Example: FAQ answers or system prompts.

Pros

  • Saves cost for static or semi-static sections
  • Great in pipelines with repeated context

Cons

  • Requires engineering discipline (define reusable prompt parts)
  • Doesn’t cover dynamic queries

Example (templated prompt caching):

# Only the question varies; the static instruction prefix is shared,
# so repeated questions collapse to the same exact-match cache key.
TEMPLATE = "You are a helpful assistant. Answer concisely.\nQuestion: {q}"

def cached_template(q):
    prompt = TEMPLATE.format(q=q)
    return cached_completion(prompt, {"model": "gpt-4o", "max_tokens": 200})

4. Hybrid Caching

Combine exact-match, semantic, and template caching. Many production systems layer them:

  1. Check exact match
  2. Fall back to a semantic match
  3. Generate a new response and cache it

This balances cost, correctness, and coverage.
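
In code, a hybrid lookup might look like this rough sketch (reusing the cache_key, Redis client, FAISS index, and cache dict from the earlier examples; the distance threshold is illustrative):

def hybrid_completion(prompt, params):
    # 1. Exact match: cheap hash lookup in Redis
    key = cache_key(prompt, params)
    if (cached := r.get(key)):
        return cached.decode("utf-8")

    # 2. Semantic match: nearest-neighbour search over previously seen prompts
    emb = model.encode([prompt])
    if index.ntotal > 0:
        D, I = index.search(emb, 1)
        if D[0][0] < 0.2:  # illustrative L2 distance threshold
            return cache[int(I[0][0])]

    # 3. Miss on both layers: call the model, then populate both caches
    response = client.chat.completions.create(
        **params,
        messages=[{"role": "user", "content": prompt}],
    )
    content = response.choices[0].message.content
    r.set(key, content)
    index.add(emb)
    cache[len(cache)] = content
    return content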

Best Practices for Prompt Caching

1. Set expiration times

  • Cache “forever” for static prompts
  • Use TTL (time-to-live) for dynamic ones
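
With Redis, for instance, expiry is a single extra argument; the one-hour value below is just an example:

# Dynamic answers expire after an hour; truly static prompts can omit the TTL
r.set(key, content, ex=3600)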

2. Version your cache keys

  • Include model name + version
  • Prevent serving stale responses after a model upgrade
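
A minimal sketch, assuming the hashlib/json imports from the first example; the version string is something you would bump yourself when the model or prompt template changes:

CACHE_VERSION = "v2"  # bump on model upgrades or prompt template changes

def versioned_cache_key(prompt, params):
    payload = "|".join([CACHE_VERSION, params.get("model", ""), prompt, json.dumps(params, sort_keys=True)])
    return hashlib.sha256(payload.encode()).hexdigest()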

3. Log cache hit/miss metrics

  • Measure effectiveness (a 30%+ hit rate means roughly a third of your calls cost nothing and return instantly)
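
A lightweight way to track this is a pair of counters, for example in Redis (the metric names are made up; in production you would likely emit to your metrics system instead):

def record_cache_event(hit: bool):
    r.incr("llm_cache:hits" if hit else "llm_cache:misses")

def cache_hit_rate():
    hits = int(r.get("llm_cache:hits") or 0)
    misses = int(r.get("llm_cache:misses") or 0)
    total = hits + misses
    return hits / total if total else 0.0

Call record_cache_event(True) on the cache-hit path of cached_completion and record_cache_event(False) just before calling the model.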

4. Don’t over-cache dynamic queries

  • For real-time Q&A (like customer chats), balance freshness vs. cost

5. Layer caches

  • In-memory (fastest)
  • Redis / database (shared across servers)
  • Semantic (last fallback before hitting the LLM)
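
In practice, layering often just means checking a cheap in-process dictionary before Redis (a sketch; local_cache is hypothetical and unbounded here, so a real version would use an LRU with a size cap):

local_cache = {}  # per-process, fastest layer, lost on restart

def layered_get(key):
    if key in local_cache:
        return local_cache[key]
    if (cached := r.get(key)):
        value = cached.decode("utf-8")
        local_cache[key] = value  # promote to the in-memory layer
        return value
    return None  # fall through to semantic lookup or a fresh LLM call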

When You Should Not Cache

  • Compliance-sensitive applications (where every answer must be regenerated fresh)
  • Rapidly changing data (e.g., stock prices, breaking news)
  • Low-volume prototypes (you’ll spend more time coding the cache than you’ll save in compute)

Key Takeaways

  • Prompt caching is the lowest-hanging fruit for speeding up and scaling LLM apps.
  • Use exact-match caching as your baseline, then layer on semantic caching for smarter reuse.
  • Always version your cache and measure hit rates; otherwise, you’re flying blind.