How to Burn Money & Latency: A Guide to Breaking Prompt Caching

A hands-on experimental harness and tutorial exploring how prompting techniques, shared resources, and CLI tools impact input token caching.

Core Components

| Component | Description |
| --- | --- |
| `tutorial-plan.md` | Deep dive into CLI capabilities, provider specifics, academic grounding, and 17 detailed experiments. |
| `setup.sh` | Environmental validation and OTel smoke test. |
| `analyze.py` | Extracts cache hit/miss analytics from OTel JSONL traces. |
| `compare.py` | Generates a multi-experiment comparison report with token cost estimates and best-practice gain/loss analysis. |
| `run-all.sh` | Executes the standard experiment suite (skipping high-latency tests). |
| `run-multi-model.sh` | Runs all experiments across multiple models; produces per-model and combined sky-view reports. |
| `discover-models.sh` | Probes Copilot CLI to discover which models are currently available. |
| `ccusage-session.sh` | Wrapper for `ccusage` to get rich session-level cost/token reports. |
| `presentation-outline.md` | Skeleton for a 20–25 min presentation (caching + routing). |
| `cache-best-practices.md` | Quickcard with side-by-side best/worst practice examples. |

Experiment Suite & Patterns

| Category | Pattern | Command | Key Lesson |
| --- | --- | --- | --- |
| Basics | Baseline | `bash exp1-baseline.sh` | First call always creates the cache. |
| | Cache Hit | `bash exp2-cache-hit.sh` | Identical prompts = ~50% input savings. |
| | TTL Expiry | `bash exp11-ttl-expiry.sh` | Cache clears after ~5m of inactivity. |
| Invalidation | Timestamp | `bash exp3-timestamp-invalidation.sh` | Dynamic content at start kills the prefix. |
| | Model Switch | `bash exp7-model-switch.sh` | Different model = different cache namespace. |
| | Effort Change | `bash exp8-reasoning-effort.sh` | Reasoning changes can reset prefix compute. |
| | Hook Context | `bash exp13-hook-impact.sh` | Dynamic hook context (v1.0.65+) breaks prefix. |
| | Dynamic Tail | `bash exp14-dynamic-tail-mitigation.sh` | Moving dynamic data to the end preserves the stable prefix. |
| | RAG Ordering | `bash exp15-rag-ordering.sh` | Relevance-order churn breaks reuse; stable IDs preserve it. |
| | Schema Canonicalization | `bash exp16-schema-canonicalization.sh` | Semantically identical JSON can miss cache if byte order changes. |
| Growth | Instructions | `bash exp4-custom-instructions.sh` | `AGENTS.md` adds stable, cached tokens. |
| | MCP / Tools | `bash exp5-mcp-impact.sh` | Tool definitions are stable and cached. |
| | Skills | `exp9-skills-impact.sh` | Large skill files increase cached density. |
| Session | Multi-Turn | `bash exp6-multi-turn.sh` | Appending preserves the prior prefix. |
| | Tool Usage | `exp10-tool-execution.sh` | Tool results become part of cached history. |
| Benchmarks | Cross-Model | `exp12-cross-model.sh` | Compare Claude vs GPT vs Gemini behavior. |
| | Semantic Thresholds | `python3 exp17-semantic-threshold-simulation.py` | Static semantic thresholds trade hit rate against false positives. |
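
The RAG Ordering and Schema Canonicalization lessons can be illustrated with a minimal sketch. This is not the code behind `exp15` or `exp16`; the helper names `canonical_json` and `stable_chunk_order` and the sample data are hypothetical. It shows that two semantically identical schemas serialize to different bytes unless keys are sorted, and that ordering retrieved chunks by a stable ID yields the same context regardless of relevance order.

```python
import json


def canonical_json(obj) -> str:
    """Serialize with sorted keys and fixed separators so equal objects give equal bytes."""
    return json.dumps(obj, sort_keys=True, separators=(",", ":"), ensure_ascii=False)


def stable_chunk_order(chunks):
    """Order retrieved chunks by a stable ID instead of per-query relevance."""
    return sorted(chunks, key=lambda c: c["id"])


# Two semantically identical tool schemas whose keys were inserted in a different order.
schema_a = {"name": "search", "parameters": {"type": "object", "required": ["q"]}}
schema_b = {"parameters": {"required": ["q"], "type": "object"}, "name": "search"}

naive_equal = json.dumps(schema_a) == json.dumps(schema_b)
canonical_equal = canonical_json(schema_a) == canonical_json(schema_b)

# The same chunks, returned in a different relevance order on two calls.
call_1 = [{"id": "doc-2", "text": "B"}, {"id": "doc-1", "text": "A"}]
call_2 = [{"id": "doc-1", "text": "A"}, {"id": "doc-2", "text": "B"}]
same_context = stable_chunk_order(call_1) == stable_chunk_order(call_2)

print(naive_equal, canonical_equal, same_context)  # False True True
```

Sorting by ID trades relevance ordering for byte stability; whether that trade is acceptable depends on how sensitive the model is to chunk position.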

Quick Start

Install and authenticate the GitHub Copilot CLI:

```
# https://docs.github.com/copilot/how-tos/copilot-cli
copilot login
```

Run the setup and smoke test:
```
source ./setup.sh
```

Experiment Isolation (Recommended)

By default, the harness enables Experiment Isolation. This ensures that your personal Copilot skills, custom instructions, and MCP configurations (stored in ~/.copilot) do not bias the token usage results.

Enabled (Default): Sets a temporary COPILOT_HOME, unsets COPILOT_SKILLS_DIRS, etc.
Toggle: run `export COPILOT_ISOLATION=false` before sourcing `setup.sh` to use your system config instead.
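
The isolation idea can be sketched in Python for readers who drive the CLI from their own scripts. This is an illustration, not what `setup.sh` actually does: the function name `isolated_env` is hypothetical, and only the two variables named above (`COPILOT_HOME`, `COPILOT_SKILLS_DIRS`) are handled.

```python
import os
import tempfile


def isolated_env(base_env=None, isolation=True):
    """Return a copy of the environment that points COPILOT_HOME at a throwaway directory."""
    env = dict(os.environ if base_env is None else base_env)
    if not isolation:
        return env  # use the caller's system config unchanged
    env["COPILOT_HOME"] = tempfile.mkdtemp(prefix="copilot-isolated-")
    env.pop("COPILOT_SKILLS_DIRS", None)  # drop personal skill directories
    return env


# Hypothetical starting environment, for illustration only.
base = {"COPILOT_HOME": "/home/me/.copilot", "COPILOT_SKILLS_DIRS": "/home/me/skills"}
env = isolated_env(base)
print(env["COPILOT_HOME"] != base["COPILOT_HOME"], "COPILOT_SKILLS_DIRS" in env)
```

The returned dictionary could then be passed as `env=` to `subprocess.run` so that only the child process sees the isolated settings.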

⚠️ Cost Disclaimer (Please Read Before Running)

This repository exists to help you reduce LLM costs over time — but running the experiments does consume paid tokens.

Costs are approximate and depend on:

selected model(s)
prompt/output lengths
cache hit rate
how many experiments and repeats you run

Rough cost expectations (order-of-magnitude)

A single short experiment run typically costs from a few cents to low tens of cents.
run-all.sh can add up to several dollars on premium models.
run-multi-model.sh can multiply total spend significantly (same suite × number of models).

Default reference pricing used by `compare.py` (USD per 1M tokens)

claude-sonnet-4.6: input 3.00, cached-read 0.30, output 15.00
gpt-5.4: input 2.50, cached-read 1.25, output 10.00
gemini-3.5-flash: input 0.075, cached-read 0.01875, output 0.30

These are estimator defaults, not billing guarantees. Always confirm with your provider's current pricing.
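
To make the arithmetic concrete, here is a minimal estimator sketch built on the default prices listed above. It is not the `compare.py` implementation: the function name `estimate_cost` and the token counts are hypothetical, it assumes `input_tokens` already includes cached tokens, and it ignores any cache-write surcharge.

```python
# Default reference prices quoted above, in USD per 1M tokens.
PRICING = {
    "claude-sonnet-4.6": {"input": 3.00, "cached_read": 0.30, "output": 15.00},
    "gpt-5.4": {"input": 2.50, "cached_read": 1.25, "output": 10.00},
    "gemini-3.5-flash": {"input": 0.075, "cached_read": 0.01875, "output": 0.30},
}


def estimate_cost(model, input_tokens, cache_read_tokens, output_tokens):
    """Return (estimated_cost, no_cache_cost, savings) in USD.

    Assumes input_tokens includes the cached tokens; cache-write pricing is ignored.
    """
    p = PRICING[model]
    uncached = input_tokens - cache_read_tokens
    estimated = (
        uncached * p["input"]
        + cache_read_tokens * p["cached_read"]
        + output_tokens * p["output"]
    ) / 1_000_000
    no_cache = (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000
    return estimated, no_cache, no_cache - estimated


# Hypothetical call: 100k input tokens, 80k of them served from cache, 1k output tokens.
est, base_cost, saved = estimate_cost("claude-sonnet-4.6", 100_000, 80_000, 1_000)
print(f"estimated ${est:.3f}, no-cache ${base_cost:.3f}, savings ${saved:.3f}")
```

With these assumed counts the estimate is about $0.099 against a no-cache cost of about $0.315.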

How to avoid surprise spend

Start with exp1 and exp2 before running the full suite.
Prefer cheaper models first (or pass your own --pricing file to compare.py).
Avoid run-multi-model.sh until single-model results are stable.
Monitor usage after each batch (analyze.py, compare.py, optional ccusage).

Run a single experiment:
```
bash exp2-cache-hit.sh
```
Run the included experiments and compare:
```
bash run-all.sh
```

Analyze a specific OTel JSONL file:

```
python3 analyze.py "$HOME/cache-experiments/otel/exp2-cache-hit-*.jsonl"
```

Run all experiments with cost estimates and gain/loss analysis:

```
bash run-all.sh
# compare.py now outputs per-experiment token costs, per-model summary,
# and best-practice gain/loss analysis grounded in actual benchmark data
```

Discover available Copilot models:

```
bash discover-models.sh
bash discover-models.sh --json   # JSON array output
bash discover-models.sh --save models.txt
```

Run benchmarks across multiple models with sky-view comparison:

```
bash run-multi-model.sh claude-sonnet-4.6 gpt-5.4 gemini-3.5-flash
# or: bash run-multi-model.sh $(bash discover-models.sh)
# Output: $HOME/cache-experiments/results/multi-model/sky-view-comparison.md
```

(Optional) Cross-check with ccusage via npx:
```
npx ccusage@latest copilot session
npx ccusage@latest copilot session --json
```
ccusage reads traces from `~/.copilot/otel/*.jsonl`, or from the path in `COPILOT_OTEL_FILE_EXPORTER_PATH` if it was set during the experiment.

How to Read the Numbers

input_tokens — total input tokens for the call (includes cached ones)
cache_read_input_tokens / cache_read — tokens served from cache (cache HIT)
cache_creation_input_tokens / cache_creation — tokens written to cache (cache WRITE)
overall_cache_hit_rate — fraction of input tokens that were cache reads across all calls
Est. Cost (USD) — approximate cost based on per-token pricing (override with --pricing)
No-Cache Cost — what the same tokens would cost with zero caching
Savings — delta between actual cost and no-cache cost (the value of caching)
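
As a rough illustration of how `overall_cache_hit_rate` follows from the per-call fields, consider the sketch below. The flat per-call dictionaries are an assumption made for readability; the real OTel JSONL records may nest these values differently, and `analyze.py` remains the reference for how they are extracted.

```python
def overall_cache_hit_rate(calls):
    """Fraction of input tokens served from cache across all calls."""
    total_input = sum(c.get("input_tokens", 0) for c in calls)
    total_read = sum(c.get("cache_read_input_tokens", 0) for c in calls)
    return total_read / total_input if total_input else 0.0


# Hypothetical two-call session: the first call writes the cache, the second reads it.
calls = [
    {"input_tokens": 12000, "cache_read_input_tokens": 0, "cache_creation_input_tokens": 12000},
    {"input_tokens": 12000, "cache_read_input_tokens": 11000, "cache_creation_input_tokens": 0},
]

rate = overall_cache_hit_rate(calls)
print(f"{rate:.1%}")  # 45.8%
```

Because the first call can only write to the cache, a short session caps the overall rate well below the per-call hit rate of the later calls.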

Cost Estimation & Best-Practice Gain/Loss

compare.py now provides three analysis layers:

Per-experiment token cost — estimates USD cost per experiment using default or custom pricing, with a no-cache baseline for comparison.
Per-model summary — aggregates token usage and cost by model across all experiments.
Best-practice gain/loss — pairs anti-pattern experiments with their best-practice counterparts to show the estimated cost impact of each practice, grounded in actual benchmark data.

Custom pricing file format (per 1M tokens, USD):

```
{"claude-sonnet-4.6": {"input": 3.0, "cached_read": 0.30, "cached_write": 3.75, "output": 15.0}}
```

```
python3 compare.py $HOME/cache-experiments/otel --pricing my-pricing.json
```
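
A pricing file in this format can be generated and sanity-checked before it is passed to `compare.py`. The sketch below is only an illustration: the `validate_pricing` helper is hypothetical, and treating all four keys as required is an assumption, since `compare.py` may accept a subset.

```python
import json
import os
import tempfile

# Assumption: every model entry carries all four keys shown in the example above.
REQUIRED_KEYS = {"input", "cached_read", "cached_write", "output"}


def validate_pricing(pricing):
    """Return a list of problems; an empty list means the structure looks usable."""
    problems = []
    for model, rates in pricing.items():
        missing = REQUIRED_KEYS - set(rates)
        if missing:
            problems.append(f"{model}: missing {sorted(missing)}")
        for key, value in rates.items():
            if not isinstance(value, (int, float)) or value < 0:
                problems.append(f"{model}: {key} must be a non-negative number")
    return problems


pricing = {
    "claude-sonnet-4.6": {"input": 3.0, "cached_read": 0.30, "cached_write": 3.75, "output": 15.0}
}

# Write the file to a temporary directory, then read it back and check it.
path = os.path.join(tempfile.mkdtemp(), "my-pricing.json")
with open(path, "w", encoding="utf-8") as fh:
    json.dump(pricing, fh, indent=2)

with open(path, encoding="utf-8") as fh:
    loaded = json.load(fh)

print(validate_pricing(loaded))  # []
```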

Multi-Model Benchmarking

run-multi-model.sh runs the full experiment suite across multiple models sequentially and produces:

Per-model reports — each model gets its own report.md with token usage, cost estimates, and gain/loss analysis.
Sky-view comparison — a combined cross-model report showing hit rates, token counts, and cost estimates side by side, so you can verify whether caching assumptions hold across providers.

```
bash run-multi-model.sh claude-sonnet-4.6 gpt-5.4 gemini-3.5-flash
```

Output location: $HOME/cache-experiments/results/multi-model/

Key Takeaways

Caching is prefix-based and practically byte-for-byte exact for shared prefixes.
Place stable content at the beginning of the prompt; dynamic content at the end.
Tools, system instructions, MCP servers, retrieved context, and schemas can all become part of the cached prefix.
Canonicalize ordering for tools, JSON schemas, and RAG chunks.
Pin model versions; model switches invalidate cache.
Keep calls within the cache TTL (5 min default for Anthropic; 5–10 min default for OpenAI).
Append, don't modify, in multi-turn conversations.
Treat semantic caching as an approximate optimization unless it has verification/error guarantees.
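
The first two takeaways can be demonstrated without calling any model. The sketch below measures how much of two consecutive prompts is shared as an exact prefix when a timestamp is placed first versus last. Characters stand in for tokens here, and the prompt text and function names are hypothetical; real caches operate on tokens and apply provider-specific rules.

```python
import os


def shared_prefix_len(a: str, b: str) -> int:
    """Length of the exact common prefix of two prompts, in characters."""
    return len(os.path.commonprefix([a, b]))


# A stable block standing in for system instructions and tool definitions.
SYSTEM = "You are a code reviewer. Follow the project style guide.\n" * 20


def prompt_timestamp_first(ts: str) -> str:
    """Anti-pattern: dynamic content precedes the stable block."""
    return f"Current time: {ts}\n{SYSTEM}Question: summarize the diff."


def prompt_timestamp_last(ts: str) -> str:
    """Mitigation: dynamic content is appended after the stable block."""
    return f"{SYSTEM}Question: summarize the diff.\nCurrent time: {ts}"


t1, t2 = "2025-01-01T10:00:00Z", "2025-01-01T10:05:00Z"
bad = shared_prefix_len(prompt_timestamp_first(t1), prompt_timestamp_first(t2))
good = shared_prefix_len(prompt_timestamp_last(t1), prompt_timestamp_last(t2))

print(f"timestamp first: {bad} shared chars; timestamp last: {good} shared chars")
```

With the timestamp first, the prompts diverge inside the first line, so almost nothing after it can be reused; with the timestamp last, the entire stable block remains a shared prefix.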

Contributing

Contributions are welcome. Please read CONTRIBUTING.md before opening an issue or pull request.

Maintainers

Governance & Release Process

Code ownership: .github/CODEOWNERS
Release process: RELEASING.md
Changelog: CHANGELOG.md

Security

This project uses the organization-level security policy:

https://github.com/AmadeusITGroup/.github/blob/main/SECURITY.md

License

Licensed under the Apache License 2.0.

Contact

For general repository topics, open a GitHub issue. For security vulnerabilities, follow the organization security policy linked above.

Sources

All sources are documented in tutorial-plan.md under Part 2.6 Sources and the academic research corpus under research/. Provider-specific TTL/pricing details are separated from academic claims in the tutorial plan.
