experiments: add Experiments MCP Tools page #37302
Adds a new page documenting the `experiments` toolset of the Datadog MCP Server, following the pattern established by security/mcp_server.md. Covers use cases (pre-launch flag audit, mid-run diagnostics, result exploration, code cleanup at conclusion, program-wide sweeps), setup instructions linking to central MCP docs, and a full reference of the 14 public tools in the toolset. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Simplify redundant overview paragraph to a single value statement
- Add alert callout that experiments toolset is not enabled by default
- Link check-flag-implementation to FF MCP page and note its toolset
- Defang jargony "specific mechanism in the diff" phrasing
- Soften time-series claim from "plot" to "examine time-bucketed results"

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Created a DOCS card for an editorial review.
rtrieu
left a comment
hi @trestletech-dd, thanks for this PR! i left some comments, and also have an ask:
the "Available tools" section on this page duplicates what should live on /mcp_server/tools. can you add the experiments tools there and remove that section from this page? the use cases, setup, and intro content should stay here, just not the tool reference.
once that's resolved, i think this PR will be close to being merge ready. let me know if you have any questions or want to chat!
| text: "Set Up the Datadog MCP Server" | ||
| - link: "mcp_server" | ||
| tag: "Documentation" | ||
| text: "Datadog MCP Server Overview" |
There was a problem hiding this comment.
| text: "Datadog MCP Server Overview" | |
| text: "Datadog MCP Server Overview" | |
| - link: "mcp_server/tools" | |
| tag: "Documentation" | |
| text: "Datadog MCP Server Tools" |
|
|
||
| **Before launching an experiment**, point an agent at [`check-flag-implementation`][5] (part of the `feature-flags` toolset) alongside your source code to audit the flag installation: | ||
|
|
||
| - Is the flag read with the right value type and context attributes for its targeting rules? Is the default value consistent with what production serves today? |
There was a problem hiding this comment.
| - Is the flag read with the right value type and context attributes for its targeting rules? Is the default value consistent with what production serves today? | |
| - Whether the flag is read with the correct value type and context attributes for its targeting rules, and whether the default value matches what production serves |
| **Before launching an experiment**, point an agent at [`check-flag-implementation`][5] (part of the `feature-flags` toolset) alongside your source code to audit the flag installation: | ||
|
|
||
| - Is the flag read with the right value type and context attributes for its targeting rules? Is the default value consistent with what production serves today? | ||
| - Does the code correctly emit the metric events the experiment depends on — or is there a path where a metric fires in one variant but not another, or fires twice? |
There was a problem hiding this comment.
| - Does the code correctly emit the metric events the experiment depends on — or is there a path where a metric fires in one variant but not another, or fires twice? | |
| - Whether the code emits metric events correctly in all variants, or whether there is a path where a metric fires in one variant but not another, or fires twice |
|
|
||
| - Is the flag read with the right value type and context attributes for its targeting rules? Is the default value consistent with what production serves today? | ||
| - Does the code correctly emit the metric events the experiment depends on — or is there a path where a metric fires in one variant but not another, or fires twice? | ||
| - Are there nearby events or user behaviors in the code that aren't captured by any metric, or segments worth adding because the code path diverges by platform or context? |
There was a problem hiding this comment.
| - Are there nearby events or user behaviors in the code that aren't captured by any metric, or segments worth adding because the code path diverges by platform or context? | |
| - Whether nearby events or behaviors in the code aren't captured by any metric, and whether segments are worth adding because the code path diverges by platform or context |
|
|
||
| ### Use cases | ||
|
|
||
| **Before launching an experiment**, point an agent at [`check-flag-implementation`][5] (part of the `feature-flags` toolset) alongside your source code to audit the flag installation: |
There was a problem hiding this comment.
| **Before launching an experiment**, point an agent at [`check-flag-implementation`][5] (part of the `feature-flags` toolset) alongside your source code to audit the flag installation: | |
| **Before launching an experiment**, point an agent at [`check_datadog_flag_implementation`][5] (part of the `feature-flags` toolset) alongside your source code to audit the flag installation: |
|
|
||
| **Before concluding**, use `explore-experiment-results` to build confidence in the interpretation. An agent can slice the primary metric by device type, country, plan tier, or any other assignment property to check whether the result holds across subgroups or is being carried by one cohort. It can also examine time-bucketed results to check whether the lift held steady over time or faded after the first few days. This segmentation work — which would otherwise require navigating multiple dashboard views — happens in a single conversational thread alongside the diagnostic and results data already in context. | ||
|
|
||
| **At conclusion**, an agent can take the winning variant decision, find the flag in the source, and draft the code change: inline the winning branch, remove the losing branch, delete the SDK call default that no longer needs a fallback. |
There was a problem hiding this comment.
| **At conclusion**, an agent can take the winning variant decision, find the flag in the source, and draft the code change: inline the winning branch, remove the losing branch, delete the SDK call default that no longer needs a fallback. | |
| **At conclusion**, an agent can record the winning variant decision, find the flag in the source, and draft the code change: inline the winning branch, remove the losing branch, delete the SDK call default that no longer needs a fallback. |
|
|
||
| **At conclusion**, an agent can take the winning variant decision, find the flag in the source, and draft the code change: inline the winning branch, remove the losing branch, delete the SDK call default that no longer needs a fallback. | ||
|
|
||
| **For program-wide operations**, an agent can sweep all running experiments for diagnostic warnings, surface stuck drafts with no allocation, and generate a standup-ready status summary. |
There was a problem hiding this comment.
| **For program-wide operations**, an agent can sweep all running experiments for diagnostic warnings, surface stuck drafts with no allocation, and generate a standup-ready status summary. | |
| **For program-wide operations**, an agent can sweep all running experiments for diagnostic warnings, surface draft experiments with no allocation, and generate a status summary. |
| : *Permissions required: `Product Analytics Experiments Read`* | ||
|
|
||
| `get-experiment-results` | ||
| : Returns computed per-variant, per-metric results. The `verdict` field (`better`, `worse`, `inconclusive`, or `unreliable`) is authoritative — do not re-derive significance from raw p-values or confidence intervals. |
There was a problem hiding this comment.
| : Returns computed per-variant, per-metric results. The `verdict` field (`better`, `worse`, `inconclusive`, or `unreliable`) is authoritative — do not re-derive significance from raw p-values or confidence intervals. | |
| : Returns computed per-variant, per-metric results. The `verdict` field (`better`, `worse`, `inconclusive`, or `unreliable`) is authoritative — do not recalculate significance from raw p-values or confidence intervals. |
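As an illustration of the guidance in this suggestion, here is a minimal sketch of a consumer that trusts the `verdict` field and never recomputes significance. Only the `verdict` field and its four values come from the page text; the row shape (`variant`, `metric`, `p_value`), the sample values, and the helper name `group_by_verdict` are hypothetical and are not the documented response format.

```python
from collections import defaultdict

# The four verdict values named in the tool description.
ALLOWED_VERDICTS = ("better", "worse", "inconclusive", "unreliable")


def group_by_verdict(results):
    """Group (variant, metric) pairs by the server-provided verdict.

    The row shape is an assumption for illustration. Raw statistics such
    as p_value are deliberately ignored: the verdict is treated as
    authoritative.
    """
    grouped = defaultdict(list)
    for row in results:
        verdict = row["verdict"]
        if verdict not in ALLOWED_VERDICTS:
            raise ValueError(f"unexpected verdict: {verdict!r}")
        grouped[verdict].append((row["variant"], row["metric"]))
    return dict(grouped)


# Hypothetical sample rows, not real tool output.
sample = [
    {"variant": "treatment", "metric": "checkout_conversion",
     "verdict": "better", "p_value": 0.03},
    {"variant": "treatment", "metric": "page_load_time",
     "verdict": "inconclusive", "p_value": 0.04},
]

print(group_by_verdict(sample))
```

Note that the second sample row stays `inconclusive` even though its raw p-value looks small, which is the kind of case the "do not recalculate" wording is guarding against.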
| ### Metric investigation | ||
|
|
||
| `get-metric-definition` | ||
| : Returns the definition of an experiment metric — the underlying event query, data source, and the recommended Datadog MCP tool to call next to investigate why the metric moved. For `datadog`-sourced metrics, the response includes a `recommended_tool_call` field pointing to `aggregate_rum_events` or `run_analytics_query` along with the structured filter and aggregation pieces needed to assemble the call. Not for Datadog infrastructure or APM metrics; use `get_datadog_metric` for those. |
There was a problem hiding this comment.
| : Returns the definition of an experiment metric — the underlying event query, data source, and the recommended Datadog MCP tool to call next to investigate why the metric moved. For `datadog`-sourced metrics, the response includes a `recommended_tool_call` field pointing to `aggregate_rum_events` or `run_analytics_query` along with the structured filter and aggregation pieces needed to assemble the call. Not for Datadog infrastructure or APM metrics; use `get_datadog_metric` for those. | |
| : Returns the definition of an experiment metric — the underlying event query, data source, and the recommended Datadog MCP tool for investigating why the metric moved. For `datadog`-sourced metrics, the response includes a `recommended_tool_call` field pointing to `aggregate_rum_events` or `run_analytics_query` along with the structured filter and aggregation parameters needed to assemble the call. Not for Datadog infrastructure or APM metrics; use `get_datadog_metric` for those. |
| The `experiments` toolset is not enabled by default. To enable it, add `experiments` to the `toolsets` parameter when connecting to the Datadog MCP Server. For example: | ||
|
|
||
| ```text | ||
| https://mcp.{{< region-param key="dd_site" >}}/api/unstable/mcp-server/mcp?toolsets=all,experiments |
is this the correct URL? is it meant to contain "unstable"?
Good callout; I need to get the experiments toolset onto the v1 mcp API, then I'll update this.
Actually, we won't be pushing the experiments MCP toolset to GA until Q3 and I'd like to get these docs out first to help customers. So, yes this is the correct URL.
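For reference, a minimal sketch of how the URL under discussion is assembled from a site value and a toolset list. The path and the `toolsets=all,experiments` query come from the diff above; the helper name `build_mcp_url` and the example site `datadoghq.com` (standing in for the `dd_site` region parameter) are assumptions for illustration.

```python
from urllib.parse import urlencode, urlunsplit

# Path as it appears in the diff; it includes "unstable" until the
# toolset moves to a GA endpoint.
MCP_PATH = "/api/unstable/mcp-server/mcp"


def build_mcp_url(dd_site, toolsets):
    """Build the MCP Server URL with a comma-separated toolsets parameter."""
    # safe="," keeps the commas literal instead of encoding them as %2C.
    query = urlencode({"toolsets": ",".join(toolsets)}, safe=",")
    return urlunsplit(("https", f"mcp.{dd_site}", MCP_PATH, query, ""))


# Hypothetical site value; substitute the site for your own region.
print(build_mcp_url("datadoghq.com", ["all", "experiments"]))
```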
Apply the straightforward review suggestions on content/en/experiments/mcp_tools.md:

- Add Datadog MCP Server Tools further_reading link
- Rename check-flag-implementation to check_datadog_flag_implementation
- Reword the pre-launch audit bullets
- Tighten the running, metric-movement, pre-conclusion, conclusion, and program-wide use case paragraphs
- Refine get-experiment-results and get-metric-definition tool descriptions

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
What does this PR do? What is the motivation?
Adds `content/en/experiments/mcp_tools.md`, a new top-level page in the Experiments section documenting the `experiments` toolset of the Datadog MCP Server.
The page follows the pattern established by `security/mcp_server.md` and covers the `experiments` toolset, grouped by function (listing, lifecycle, diagnostics/results, metric investigation, troubleshooting).
Prior art for toolset-specific MCP pages in their own domain:
Merge instructions
Merge readiness:
AI assistance
Initial draft and structure written with Claude Code; content reviewed and refined iteratively.
Additional notes
- The `experiments` toolset is separate from the `feature-flags` toolset; the page links to `feature_flags/feature_flag_mcp_server` for flag management tools used alongside experiments.
- The `check-flag-implementation` tool referenced in the pre-launch use case lives in the `feature-flags` toolset; worth a docs team eye on whether that cross-toolset reference needs a clarifying note.
- [...] (`Product Analytics Experiments Read`) should be verified against whatever format the docs team standardizes on.