A hands-on experimental harness and tutorial exploring how prompting techniques, shared resources, and CLI tools impact input token caching.
| Component | Description |
|---|---|
| `tutorial-plan.md` | Deep dive into CLI capabilities, provider specifics, academic grounding, and 17 detailed experiments. |
| `setup.sh` | Environmental validation and OTel smoke test. |
| `analyze.py` | Extracts cache hit/miss analytics from OTel JSONL traces. |
| `compare.py` | Generates a multi-experiment comparison report with token cost estimates and best-practice gain/loss analysis. |
| `run-all.sh` | Executes the standard experiment suite (skipping high-latency tests). |
| `run-multi-model.sh` | Runs all experiments across multiple models; produces per-model and combined sky-view reports. |
| `discover-models.sh` | Probes Copilot CLI to discover which models are currently available. |
| `ccusage-session.sh` | Wrapper for ccusage to get rich session-level cost/token reports. |
| `presentation-outline.md` | Skeleton for a 20–25 min presentation (caching + routing). |
| `cache-best-practices.md` | Quickcard with side-by-side best/worst practice examples. |
| Category | Pattern | Command | Key Lesson |
|---|---|---|---|
| Basics | Baseline | `bash exp1-baseline.sh` | First call always creates the cache. |
| Basics | Cache Hit | `bash exp2-cache-hit.sh` | Identical prompts = ~50% input savings. |
| Basics | TTL Expiry | `bash exp11-ttl-expiry.sh` | Cache clears after ~5 min of inactivity. |
| Invalidation | Timestamp | `bash exp3-timestamp-invalidation.sh` | Dynamic content at start kills the prefix. |
| Invalidation | Model Switch | `bash exp7-model-switch.sh` | Different model = different cache namespace. |
| Invalidation | Effort Change | `bash exp8-reasoning-effort.sh` | Reasoning changes can reset prefix compute. |
| Invalidation | Hook Context | `bash exp13-hook-impact.sh` | Dynamic hook context (v1.0.65+) breaks prefix. |
| Invalidation | Dynamic Tail | `bash exp14-dynamic-tail-mitigation.sh` | Moving dynamic data to the end preserves the stable prefix. |
| Invalidation | RAG Ordering | `bash exp15-rag-ordering.sh` | Relevance-order churn breaks reuse; stable IDs preserve it. |
| Invalidation | Schema Canonicalization | `bash exp16-schema-canonicalization.sh` | Semantically identical JSON can miss cache if byte order changes. |
| Growth | Instructions | `bash exp4-custom-instructions.sh` | AGENTS.md adds stable, cached tokens. |
| Growth | MCP / Tools | `bash exp5-mcp-impact.sh` | Tool definitions are stable and cached. |
| Growth | Skills | `bash exp9-skills-impact.sh` | Large skill files increase cached density. |
| Session | Multi-Turn | `bash exp6-multi-turn.sh` | Appending preserves the prior prefix. |
| Session | Tool Usage | `bash exp10-tool-execution.sh` | Tool results become part of cached history. |
| Benchmarks | Cross-Model | `bash exp12-cross-model.sh` | Compare Claude vs GPT vs Gemini behavior. |
| Benchmarks | Semantic Thresholds | `python3 exp17-semantic-threshold-simulation.py` | Static semantic thresholds trade hit rate against false positives. |
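
The RAG Ordering and Schema Canonicalization rows both come down to emitting byte-identical prompt text for semantically identical inputs. The following is a minimal sketch of that idea, not the code used by the experiment scripts. It assumes tool schemas are plain JSON-serializable dicts and that retrieved chunks carry a stable `id` field; both shapes, and the helper names `canonical_json` and `order_chunks`, are hypothetical.

```python
import json


def canonical_json(obj) -> str:
    """Serialize with sorted keys and fixed separators so equal objects give equal bytes."""
    return json.dumps(obj, sort_keys=True, separators=(",", ":"), ensure_ascii=False)


def order_chunks(chunks):
    """Order retrieved chunks by stable ID rather than by per-query relevance score."""
    return sorted(chunks, key=lambda c: c["id"])


# Two semantically identical schemas whose keys arrive in a different order.
schema_a = {"name": "search", "parameters": {"type": "object", "required": ["q"]}}
schema_b = {"parameters": {"required": ["q"], "type": "object"}, "name": "search"}

naive_match = json.dumps(schema_a) == json.dumps(schema_b)
canonical_match = canonical_json(schema_a) == canonical_json(schema_b)

# The same two chunks, ranked differently on two consecutive calls (hypothetical scores).
call_1 = [{"id": "doc-2", "score": 0.91}, {"id": "doc-1", "score": 0.88}]
call_2 = [{"id": "doc-1", "score": 0.93}, {"id": "doc-2", "score": 0.90}]
same_order = [c["id"] for c in order_chunks(call_1)] == [c["id"] for c in order_chunks(call_2)]

print(naive_match, canonical_match, same_order)  # False True True
```

Sorting by ID trades away relevance ordering inside the prompt, so whether this is acceptable depends on how sensitive the model is to chunk position.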
- Install and authenticate the GitHub Copilot CLI (see https://docs.github.com/copilot/how-tos/copilot-cli): `copilot login`
- Run the setup and smoke test: `source ./setup.sh`
By default, the harness enables Experiment Isolation. This ensures that your personal Copilot skills, custom instructions, and MCP configurations (stored in ~/.copilot) do not bias the token usage results.
- Enabled (Default): Sets a temporary `COPILOT_HOME`, unsets `COPILOT_SKILLS_DIRS`, etc.
- Toggle: `export COPILOT_ISOLATION=false` before sourcing `setup.sh` to use your system config.
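
To picture what isolation amounts to, here is a rough Python sketch that builds an environment for a child process with a throwaway `COPILOT_HOME` and no `COPILOT_SKILLS_DIRS`. It is an illustration only and does not reproduce what `setup.sh` actually does; the function name `isolated_env` and the example paths are hypothetical, and only the two variable names come from the description above.

```python
import os
import tempfile


def isolated_env(base_env=None, isolation=True):
    """Return a copy of the environment adjusted for an isolated run."""
    env = dict(os.environ if base_env is None else base_env)
    if not isolation:
        return env  # keep the caller's existing configuration unchanged
    # Point the CLI at an empty, temporary home directory.
    env["COPILOT_HOME"] = tempfile.mkdtemp(prefix="copilot-home-")
    # Drop personal skill directories so they cannot add tokens to the prompt.
    env.pop("COPILOT_SKILLS_DIRS", None)
    return env


# Hypothetical personal configuration.
personal = {"COPILOT_HOME": "/home/me/.copilot", "COPILOT_SKILLS_DIRS": "/home/me/skills"}
env = isolated_env(personal)
print(env["COPILOT_HOME"] != personal["COPILOT_HOME"], "COPILOT_SKILLS_DIRS" in env)  # True False
```

The returned dict could be passed as `env=` to `subprocess.run` so that the parent shell's configuration stays untouched.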
This repository exists to help you reduce LLM costs over time, but running the experiments themselves consumes paid tokens.
Costs are approximate and depend on:
- selected model(s)
- prompt/output lengths
- cache hit rate
- how many experiments and repeats you run

Rough cost ranges:

- A single short experiment run typically costs from a few cents to low tens of cents.
- `run-all.sh` can add up to several dollars on premium models.
- `run-multi-model.sh` can multiply total spend significantly (same suite × number of models).

Default estimator prices (USD per 1M tokens):

- `claude-sonnet-4.6`: input 3.00, cached-read 0.30, output 15.00
- `gpt-5.4`: input 2.50, cached-read 1.25, output 10.00
- `gemini-3.5-flash`: input 0.075, cached-read 0.01875, output 0.30

These are estimator defaults, not billing guarantees. Always confirm with your provider's current pricing.
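
To make the arithmetic concrete, the sketch below prices one hypothetical call with the default figures listed above. It assumes those figures are USD per 1M tokens, that the input count includes cached tokens, and it ignores cache-write surcharges. The token counts are made up for illustration, and `estimate_cost` is a hypothetical helper, not a function from `compare.py`.

```python
# Estimator defaults from the text, assumed to be USD per 1M tokens.
PRICING = {
    "claude-sonnet-4.6": {"input": 3.00, "cached_read": 0.30, "output": 15.00},
    "gpt-5.4": {"input": 2.50, "cached_read": 1.25, "output": 10.00},
    "gemini-3.5-flash": {"input": 0.075, "cached_read": 0.01875, "output": 0.30},
}


def estimate_cost(model, input_tokens, cached_tokens, output_tokens):
    """Return (actual, no_cache) cost in USD; input_tokens includes cached_tokens."""
    p = PRICING[model]
    uncached = input_tokens - cached_tokens
    actual = (
        uncached * p["input"]
        + cached_tokens * p["cached_read"]
        + output_tokens * p["output"]
    ) / 1_000_000
    no_cache = (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000
    return actual, no_cache


# Hypothetical call: 20,000 input tokens, 16,000 of them cache reads, 500 output tokens.
for model in PRICING:
    actual, no_cache = estimate_cost(model, 20_000, 16_000, 500)
    print(f"{model}: actual ${actual:.4f}, no-cache ${no_cache:.4f}, saved ${no_cache - actual:.4f}")
```

With these numbers the Claude call comes to about $0.0243 instead of $0.0675, which shows why the cached-read discount matters more than the headline input price when prefixes are long.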
- Start with `exp1` and `exp2` before running the full suite.
- Prefer cheaper models first (or pass your own `--pricing` file to `compare.py`).
- Avoid `run-multi-model.sh` until single-model results are stable.
- Monitor usage after each batch (`analyze.py`, `compare.py`, optional `ccusage`).

- Run a single experiment: `bash exp2-cache-hit.sh`
- Run the included experiments and compare: `bash run-all.sh`
- Analyze a specific OTel JSONL file: `python3 analyze.py "$HOME/cache-experiments/otel/exp2-cache-hit-*.jsonl"`
- Run all experiments with cost estimates and gain/loss analysis: `bash run-all.sh`. `compare.py` now outputs per-experiment token costs, a per-model summary, and best-practice gain/loss analysis grounded in actual benchmark data.
- Discover available Copilot models:
  - `bash discover-models.sh`
  - `bash discover-models.sh --json` (JSON array output)
  - `bash discover-models.sh --save models.txt`
- Run benchmarks across multiple models with sky-view comparison:
  - `bash run-multi-model.sh claude-sonnet-4.6 gpt-5.4 gemini-3.5-flash`
  - or: `bash run-multi-model.sh $(bash discover-models.sh)`
  - Output: `$HOME/cache-experiments/results/multi-model/sky-view-comparison.md`
- (Optional) Cross-check with `ccusage` via `npx`:
  - `npx ccusage@latest copilot session`
  - `npx ccusage@latest copilot session --json`

`ccusage` reads the file from `~/.copilot/otel/*.jsonl` or from the `COPILOT_OTEL_FILE_EXPORTER_PATH` set during the experiment.

- `input_tokens`: total input tokens for the call (includes cached ones)
- `cache_read_input_tokens` / `cache_read`: tokens served from cache (cache HIT)
- `cache_creation_input_tokens` / `cache_creation`: tokens written to cache (cache WRITE)
- `overall_cache_hit_rate`: fraction of input tokens that were cache reads across all calls
- `Est. Cost (USD)`: approximate cost based on per-token pricing (override with `--pricing`)
- `No-Cache Cost`: what the same tokens would cost with zero caching
- `Savings`: delta between actual cost and no-cache cost (the value of caching)

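
As a sketch of how `overall_cache_hit_rate` follows from the per-call fields, the snippet below divides total cache reads by total input tokens across calls. The record layout (flat dicts) is a hypothetical simplification; the actual OTel JSONL structure that `analyze.py` parses is not shown in this README, and the helper names here are illustrative.

```python
def cache_read_tokens(record):
    """Read the cache-hit token count under either field name, defaulting to 0."""
    return record.get("cache_read_input_tokens", record.get("cache_read", 0))


def overall_cache_hit_rate(records):
    """Fraction of input tokens served from cache across all calls."""
    total_input = sum(r.get("input_tokens", 0) for r in records)
    if total_input == 0:
        return 0.0  # avoid dividing by zero on an empty trace
    return sum(cache_read_tokens(r) for r in records) / total_input


# Hypothetical two-call session: the first call writes the cache, the second reads it.
calls = [
    {"input_tokens": 10_000, "cache_read_input_tokens": 0, "cache_creation_input_tokens": 9_500},
    {"input_tokens": 10_200, "cache_read": 9_500},
]
print(round(overall_cache_hit_rate(calls), 3))  # 0.47
```

Because the first call can only write to the cache, a two-call session tops out just under 50% even when the second call is a near-perfect hit; longer sessions amortize that first write.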
compare.py now provides three analysis layers:
- Per-experiment token cost — estimates USD cost per experiment using default or custom pricing, with a no-cache baseline for comparison.
- Per-model summary — aggregates token usage and cost by model across all experiments.
- Best-practice gain/loss — pairs anti-pattern experiments with their best-practice counterparts to show the estimated cost impact of each practice, grounded in actual benchmark data.
Custom pricing file format (per 1M tokens, USD):
`{"claude-sonnet-4.6": {"input": 3.0, "cached_read": 0.30, "cached_write": 3.75, "output": 15.0}}`

`python3 compare.py $HOME/cache-experiments/otel --pricing my-pricing.json`

`run-multi-model.sh` runs the full experiment suite across multiple models sequentially and produces:

- Per-model reports: each model gets its own `report.md` with token usage, cost estimates, and gain/loss analysis.
- Sky-view comparison: a combined cross-model report showing hit rates, token counts, and cost estimates side by side, so you can verify whether caching assumptions hold across providers.

`bash run-multi-model.sh claude-sonnet-4.6 gpt-5.4 gemini-3.5-flash`

Output location: `$HOME/cache-experiments/results/multi-model/`

- Caching is prefix-based and practically byte-for-byte exact for shared prefixes.
- Place stable content at the beginning of the prompt; dynamic content at the end.
- Tools, system instructions, MCP servers, retrieved context, and schemas can all become part of the cached prefix.
- Canonicalize ordering for tools, JSON schemas, and RAG chunks.
- Pin model versions; model switches invalidate cache.
- Keep calls within the cache TTL (5 min default for Anthropic; 5–10 min default for OpenAI).
- Append, don't modify, in multi-turn conversations.
- Treat semantic caching as an approximate optimization unless it has verification/error guarantees.
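
The first two takeaways can be illustrated without calling any provider. The sketch below measures how much of a prompt is shared with the previous call when a timestamp sits at the start versus at the end. It uses characters as a stand-in for tokens, which is an assumption: real caches operate on tokens and may only reuse prefixes in fixed-size blocks. The prompt text and helper names are hypothetical.

```python
def shared_prefix_len(a: str, b: str) -> int:
    """Length of the longest common prefix of two strings."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


SYSTEM = "You are a code reviewer. Follow the repository style guide.\n"


def dynamic_head(ts: str) -> str:
    """Anti-pattern: the changing timestamp comes before the stable instructions."""
    return f"Current time: {ts}\n{SYSTEM}Review this diff."


def dynamic_tail(ts: str) -> str:
    """Best practice: stable instructions first, changing timestamp last."""
    return f"{SYSTEM}Review this diff.\nCurrent time: {ts}"


head = shared_prefix_len(dynamic_head("09:59:58"), dynamic_head("10:00:01"))
tail = shared_prefix_len(dynamic_tail("09:59:58"), dynamic_tail("10:00:01"))
print(f"dynamic head: {head} shared chars, dynamic tail: {tail} shared chars")
```

With the timestamp first, the shared prefix ends at the label `Current time: `, so none of the stable instructions can be reused; with the timestamp last, the whole instruction block stays inside the shared prefix.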
Contributions are welcome. Please read CONTRIBUTING.md before opening an issue or pull request.
- Code ownership: `.github/CODEOWNERS`
- Release process: `RELEASING.md`
- Changelog: `CHANGELOG.md`

This project uses the organization-level security policy:
Licensed under the Apache License 2.0.
For general repository topics, open a GitHub issue. For security vulnerabilities, follow the organization security policy linked above.
All sources are documented in tutorial-plan.md under Part 2.6 Sources and the academic research corpus under research/. Provider-specific TTL/pricing details are separated from academic claims in the tutorial plan.