AmadeusITGroup/token-usage-optimisation-tutorial-cache-aspects

How to Burn Money & Latency: A Guide to Breaking Prompt Caching

A hands-on experimental harness and tutorial exploring how prompting techniques, shared resources, and CLI tools impact input token caching.

Core Components

| Component | Description |
| --- | --- |
| tutorial-plan.md | Deep dive into CLI capabilities, provider specifics, academic grounding, and 17 detailed experiments. |
| setup.sh | Environmental validation and OTel smoke test. |
| analyze.py | Extracts cache hit/miss analytics from OTel JSONL traces. |
| compare.py | Generates a multi-experiment comparison report with token cost estimates and best-practice gain/loss analysis. |
| run-all.sh | Executes the standard experiment suite (skipping high-latency tests). |
| run-multi-model.sh | Runs all experiments across multiple models; produces per-model and combined sky-view reports. |
| discover-models.sh | Probes Copilot CLI to discover which models are currently available. |
| ccusage-session.sh | Wrapper for ccusage to get rich session-level cost/token reports. |
| presentation-outline.md | Skeleton for a 20–25 min presentation (caching + routing). |
| cache-best-practices.md | Quickcard with side-by-side best/worst practice examples. |

Experiment Suite & Patterns

| Category | Pattern | Command | Key Lesson |
| --- | --- | --- | --- |
| Basics | Baseline | bash exp1-baseline.sh | First call always creates the cache. |
| | Cache Hit | bash exp2-cache-hit.sh | Identical prompts = ~50% input savings. |
| | TTL Expiry | bash exp11-ttl-expiry.sh | Cache clears after ~5m of inactivity. |
| Invalidation | Timestamp | bash exp3-timestamp-invalidation.sh | Dynamic content at start kills the prefix. |
| | Model Switch | bash exp7-model-switch.sh | Different model = different cache namespace. |
| | Effort Change | bash exp8-reasoning-effort.sh | Reasoning changes can reset prefix compute. |
| | Hook Context | bash exp13-hook-impact.sh | Dynamic hook context (v1.0.65+) breaks prefix. |
| | Dynamic Tail | bash exp14-dynamic-tail-mitigation.sh | Moving dynamic data to the end preserves the stable prefix. |
| | RAG Ordering | bash exp15-rag-ordering.sh | Relevance-order churn breaks reuse; stable IDs preserve it. |
| | Schema Canonicalization | bash exp16-schema-canonicalization.sh | Semantically identical JSON can miss cache if byte order changes. |
| Growth | Instructions | bash exp4-custom-instructions.sh | AGENTS.md adds stable, cached tokens. |
| | MCP / Tools | bash exp5-mcp-impact.sh | Tool definitions are stable and cached. |
| | Skills | bash exp9-skills-impact.sh | Large skill files increase cached density. |
| Session | Multi-Turn | bash exp6-multi-turn.sh | Appending preserves the prior prefix. |
| | Tool Usage | bash exp10-tool-execution.sh | Tool results become part of cached history. |
| Benchmarks | Cross-Model | bash exp12-cross-model.sh | Compare Claude vs GPT vs Gemini behavior. |
| | Semantic Thresholds | python3 exp17-semantic-threshold-simulation.py | Static semantic thresholds trade hit rate against false positives. |
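
The Timestamp and Dynamic Tail rows come down to how many leading bytes two consecutive prompts share. The following is a minimal Python sketch of that idea, assuming byte-exact prefix matching as this README describes it. `shared_prefix_len` and the prompt strings are hypothetical illustrations, not part of the harness, and real providers match on tokens (often in blocks) rather than raw bytes.

```python
def shared_prefix_len(a: str, b: str) -> int:
    """Return how many leading UTF-8 bytes the two prompts have in common."""
    count = 0
    for x, y in zip(a.encode("utf-8"), b.encode("utf-8")):
        if x != y:
            break
        count += 1
    return count


# Hypothetical stable instructions shared by every call.
STABLE = "You are a code reviewer. Follow the team style guide.\n"

# Anti-pattern: a changing timestamp sits in front of the stable text.
bad_1 = "Now: 2025-01-01T10:00:00Z\n" + STABLE + "Review file A."
bad_2 = "Now: 2025-01-01T10:00:07Z\n" + STABLE + "Review file A."

# Mitigation: the same timestamp is moved to the tail of the prompt.
good_1 = STABLE + "Review file A.\nNow: 2025-01-01T10:00:00Z"
good_2 = STABLE + "Review file A.\nNow: 2025-01-01T10:00:07Z"

print("timestamp first:", shared_prefix_len(bad_1, bad_2), "shared bytes")
print("timestamp last: ", shared_prefix_len(good_1, good_2), "shared bytes")
```

With the timestamp first, the shared prefix ends inside the timestamp, so none of the stable instructions can be reused. With the timestamp last, the whole stable block stays inside the shared prefix.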

Quick Start

  1. Install and authenticate the GitHub Copilot CLI:

    # https://docs.github.com/copilot/how-tos/copilot-cli
    copilot login
  2. Run the setup and smoke test:

    source ./setup.sh

Experiment Isolation (Recommended)

By default, the harness enables Experiment Isolation. This ensures that your personal Copilot skills, custom instructions, and MCP configurations (stored in ~/.copilot) do not bias the token usage results.

  • Enabled (Default): Sets a temporary COPILOT_HOME, unsets COPILOT_SKILLS_DIRS, etc.
  • Toggle: export COPILOT_ISOLATION=false before sourcing setup.sh to use your system config.
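
As an illustration of what isolation amounts to, here is a minimal Python sketch that builds an isolated environment for a child process. It is not the code in setup.sh (which is a shell script and may handle more variables than the two named above); `isolated_env` is a hypothetical helper, and only `COPILOT_HOME`, `COPILOT_SKILLS_DIRS`, and `COPILOT_ISOLATION` come from this README.

```python
import os
import tempfile


def isolated_env(base_env: dict, isolation: bool = True) -> dict:
    """Return a copy of base_env, optionally detached from personal Copilot config."""
    env = dict(base_env)
    if not isolation:
        return env  # use the system configuration unchanged
    # Point Copilot at an empty, throwaway home directory.
    env["COPILOT_HOME"] = tempfile.mkdtemp(prefix="copilot-isolated-")
    # Drop any extra skill directories inherited from the user's shell.
    env.pop("COPILOT_SKILLS_DIRS", None)
    return env


# Assumption: any value other than "false" leaves isolation enabled.
isolation_on = os.environ.get("COPILOT_ISOLATION", "true").lower() != "false"
env = isolated_env(dict(os.environ), isolation_on)
print("COPILOT_HOME for this run:", env.get("COPILOT_HOME"))
```

The returned dictionary could be passed as `env=` to `subprocess.run` so that only the experiment process sees the temporary home.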

⚠️ Cost Disclaimer (Please Read Before Running)

This repository exists to help you reduce LLM costs over time — but running the experiments does consume paid tokens.

Costs are approximate and depend on:

  • selected model(s)
  • prompt/output lengths
  • cache hit rate
  • how many experiments and repeats you run

Rough cost expectations (order-of-magnitude)

  • A single short experiment run typically costs from a few cents to low tens of cents.
  • run-all.sh can add up to several dollars on premium models.
  • run-multi-model.sh can multiply total spend significantly (same suite × number of models).

Default reference pricing used by compare.py (USD per 1M tokens)

  • claude-sonnet-4.6: input 3.00, cached-read 0.30, output 15.00
  • gpt-5.4: input 2.50, cached-read 1.25, output 10.00
  • gemini-3.5-flash: input 0.075, cached-read 0.01875, output 0.30

These are estimator defaults, not billing guarantees. Always confirm with your provider's current pricing.
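
The value of a cache hit depends on the gap between the input and cached-read prices, which differs per model. A small Python sketch of that arithmetic, using only the estimator defaults listed above; `cached_read_discount_pct` is a hypothetical helper, not a function from compare.py.

```python
# Estimator defaults from this README, USD per 1M tokens.
DEFAULT_PRICING = {
    "claude-sonnet-4.6": {"input": 3.00, "cached_read": 0.30, "output": 15.00},
    "gpt-5.4": {"input": 2.50, "cached_read": 1.25, "output": 10.00},
    "gemini-3.5-flash": {"input": 0.075, "cached_read": 0.01875, "output": 0.30},
}


def cached_read_discount_pct(prices: dict) -> float:
    """Percentage saved on an input token when it is read from cache."""
    return round(100 * (1 - prices["cached_read"] / prices["input"]), 1)


for model, prices in DEFAULT_PRICING.items():
    print(f"{model}: cached reads are {cached_read_discount_pct(prices)}% cheaper")
```

With these defaults the discount works out to 90% for claude-sonnet-4.6, 50% for gpt-5.4, and 75% for gemini-3.5-flash, so the same hit rate translates into different savings per model.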

How to avoid surprise spend

  1. Start with exp1 and exp2 before running the full suite.

  2. Prefer cheaper models first (or pass your own --pricing file to compare.py).

  3. Avoid run-multi-model.sh until single-model results are stable.

  4. Monitor usage after each batch (analyze.py, compare.py, optional ccusage).

Continuing the Quick Start after setup:

  3. Run a single experiment:

    bash exp2-cache-hit.sh
  4. Run the included experiments and compare:

    bash run-all.sh
  5. Analyze a specific OTel JSONL file:

    python3 analyze.py "$HOME/cache-experiments/otel/exp2-cache-hit-*.jsonl"
  6. Run all experiments with cost estimates and gain/loss analysis:

    bash run-all.sh
    # compare.py now outputs per-experiment token costs, per-model summary,
    # and best-practice gain/loss analysis grounded in actual benchmark data
  7. Discover available Copilot models:

    bash discover-models.sh
    bash discover-models.sh --json   # JSON array output
    bash discover-models.sh --save models.txt
  8. Run benchmarks across multiple models with sky-view comparison:

    bash run-multi-model.sh claude-sonnet-4.6 gpt-5.4 gemini-3.5-flash
    # or: bash run-multi-model.sh $(bash discover-models.sh)
    # Output: $HOME/cache-experiments/results/multi-model/sky-view-comparison.md
  9. (Optional) Cross-check with ccusage via npx:

    npx ccusage@latest copilot session
    npx ccusage@latest copilot session --json

    ccusage reads traces from ~/.copilot/otel/*.jsonl or from the path in COPILOT_OTEL_FILE_EXPORTER_PATH that was set during the experiment.

How to Read the Numbers

  • input_tokens — total input tokens for the call (includes cached ones)
  • cache_read_input_tokens / cache_read — tokens served from cache (cache HIT)
  • cache_creation_input_tokens / cache_creation — tokens written to cache (cache WRITE)
  • overall_cache_hit_rate — fraction of input tokens that were cache reads across all calls
  • Est. Cost (USD) — approximate cost based on per-token pricing (override with --pricing)
  • No-Cache Cost — what the same tokens would cost with zero caching
  • Savings — delta between actual cost and no-cache cost (the value of caching)
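
To make the hit-rate definition concrete, here is a minimal Python sketch that computes `overall_cache_hit_rate` from JSONL lines. It is not the logic of analyze.py. The nesting of the sample records is an assumption (real OTel exports may name and place these attributes differently), and `find_usage` is a hypothetical helper that simply looks for the field names listed above at any depth.

```python
import json
from typing import Iterable, Iterator

# Field names taken from the list above; either spelling may appear.
READ_KEYS = ("cache_read_input_tokens", "cache_read")


def find_usage(node) -> Iterator[dict]:
    """Yield every dict, at any nesting depth, that carries input_tokens."""
    if isinstance(node, dict):
        if "input_tokens" in node:
            yield node
        for value in node.values():
            yield from find_usage(value)
    elif isinstance(node, list):
        for item in node:
            yield from find_usage(item)


def overall_cache_hit_rate(jsonl_lines: Iterable[str]) -> float:
    """Cache-read tokens divided by total input tokens across all calls."""
    total_input = 0
    total_read = 0
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue
        for usage in find_usage(json.loads(line)):
            total_input += int(usage["input_tokens"])
            total_read += next((int(usage[k]) for k in READ_KEYS if k in usage), 0)
    return total_read / total_input if total_input else 0.0


# Hypothetical two-call trace: a cache write followed by a cache hit.
sample = [
    '{"attributes": {"input_tokens": 1000, "cache_read_input_tokens": 0, '
    '"cache_creation_input_tokens": 1000}}',
    '{"attributes": {"input_tokens": 1000, "cache_read_input_tokens": 900}}',
]
print(f"{overall_cache_hit_rate(sample):.1%}")  # 45.0%
```

The first call contributes to the denominator but not the numerator, which is why a two-call experiment with a 90% hit on the second call reports 45% overall.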

Cost Estimation & Best-Practice Gain/Loss

compare.py now provides three analysis layers:

  1. Per-experiment token cost — estimates USD cost per experiment using default or custom pricing, with a no-cache baseline for comparison.
  2. Per-model summary — aggregates token usage and cost by model across all experiments.
  3. Best-practice gain/loss — pairs anti-pattern experiments with their best-practice counterparts to show the estimated cost impact of each practice, grounded in actual benchmark data.
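
The three layers can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the implementation in compare.py: `Usage`, `cost_usd`, and `gain_loss` are hypothetical names, the token counts are made up, `input_tokens` is taken to include cached reads, and cache-write pricing is ignored for brevity.

```python
from dataclasses import dataclass


@dataclass
class Usage:
    input_tokens: int   # total input tokens, cached reads included
    cache_read: int     # tokens served from cache
    output_tokens: int


def cost_usd(u: Usage, price: dict, use_cache: bool = True) -> float:
    """Estimate cost from per-1M-token prices; use_cache=False gives the no-cache baseline."""
    cached = u.cache_read if use_cache else 0
    fresh = u.input_tokens - cached
    total = (
        fresh * price["input"]
        + cached * price["cached_read"]
        + u.output_tokens * price["output"]
    )
    return total / 1_000_000


def gain_loss(anti: Usage, best: Usage, price: dict) -> dict:
    """Pair an anti-pattern run with its best-practice counterpart."""
    anti_cost = cost_usd(anti, price)
    best_cost = cost_usd(best, price)
    return {"anti": anti_cost, "best": best_cost, "gain": anti_cost - best_cost}


# Estimator default for claude-sonnet-4.6 from this README.
PRICE = {"input": 3.00, "cached_read": 0.30, "output": 15.00}

# Hypothetical pair: timestamp-first prompt (no hits) vs dynamic-tail prompt.
anti_pattern = Usage(input_tokens=10_000, cache_read=0, output_tokens=500)
best_practice = Usage(input_tokens=10_000, cache_read=9_000, output_tokens=500)

report = gain_loss(anti_pattern, best_practice, PRICE)
print({name: round(value, 4) for name, value in report.items()})
```

Calling `cost_usd(best_practice, PRICE, use_cache=False)` returns the no-cache baseline for the same tokens, and the difference from the cached cost is the savings figure.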

Custom pricing file format (per 1M tokens, USD):

    {"claude-sonnet-4.6": {"input": 3.0, "cached_read": 0.30, "cached_write": 3.75, "output": 15.0}}

Pass the file to compare.py:

    python3 compare.py "$HOME/cache-experiments/otel" --pricing my-pricing.json

Multi-Model Benchmarking

run-multi-model.sh runs the full experiment suite across multiple models sequentially and produces:

  • Per-model reports — each model gets its own report.md with token usage, cost estimates, and gain/loss analysis.
  • Sky-view comparison — a combined cross-model report showing hit rates, token counts, and cost estimates side by side, so you can verify whether caching assumptions hold across providers.

    bash run-multi-model.sh claude-sonnet-4.6 gpt-5.4 gemini-3.5-flash

Output location: $HOME/cache-experiments/results/multi-model/

Key Takeaways

  • Caching is prefix-based: a hit requires the shared prefix to match essentially byte for byte.
  • Place stable content at the beginning of the prompt; dynamic content at the end.
  • Tools, system instructions, MCP servers, retrieved context, and schemas can all become part of the cached prefix.
  • Canonicalize ordering for tools, JSON schemas, and RAG chunks.
  • Pin model versions; model switches invalidate cache.
  • Keep calls within the cache TTL (5 min default for Anthropic; 5–10 min default for OpenAI).
  • Append, don't modify, in multi-turn conversations.
  • Treat semantic caching as an approximate optimization unless it has verification/error guarantees.
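
The canonicalization advice can be sketched in Python as follows. This is a minimal illustration, assuming the prompt embeds serialized JSON and retrieved chunks; `canonical_json`, `stable_chunk_order`, and the sample data are hypothetical and not taken from the harness.

```python
import json


def canonical_json(obj) -> str:
    """Serialize with sorted keys and fixed separators so equal data yields equal bytes."""
    return json.dumps(obj, sort_keys=True, separators=(",", ":"), ensure_ascii=False)


def stable_chunk_order(chunks: list) -> list:
    """Order retrieved chunks by a stable ID instead of per-query relevance rank."""
    return sorted(chunks, key=lambda chunk: chunk["id"])


# Two semantically identical schemas whose keys were inserted in different order.
schema_a = {"type": "object", "properties": {"path": {"type": "string"}}}
schema_b = {"properties": {"path": {"type": "string"}}, "type": "object"}

print(json.dumps(schema_a) == json.dumps(schema_b))      # False: bytes differ
print(canonical_json(schema_a) == canonical_json(schema_b))  # True: bytes match

# The same two chunks returned in a different relevance order on each query.
query_1 = [{"id": "doc-7", "text": "..."}, {"id": "doc-2", "text": "..."}]
query_2 = [{"id": "doc-2", "text": "..."}, {"id": "doc-7", "text": "..."}]
print(stable_chunk_order(query_1) == stable_chunk_order(query_2))  # True
```

Note that `sort_keys` only reorders object keys; arrays such as a schema's `required` list keep their order and would need explicit sorting. Sorting chunks by ID also discards the relevance ranking, which is a trade-off to weigh against the cache reuse it buys.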

Contributing

Contributions are welcome. Please read CONTRIBUTING.md before opening an issue or pull request.

Maintainers

Governance & Release Process

Security

This project uses the organization-level security policy:

License

Licensed under the Apache License 2.0.

Contact

For general repository topics, open a GitHub issue. For security vulnerabilities, follow the organization security policy linked above.

Sources

All sources are documented in tutorial-plan.md under Part 2.6 Sources and the academic research corpus under research/. Provider-specific TTL/pricing details are separated from academic claims in the tutorial plan.
