experiments: add Experiments MCP Tools page by trestletech-dd · Pull Request #37302 · DataDog/documentation

trestletech-dd · 2026-06-05T17:39:16Z

What does this PR do? What is the motivation?

Adds content/en/experiments/mcp_tools.md, a new top-level page in the Experiments section documenting the experiments toolset of the Datadog MCP Server.

The page follows the pattern established by security/mcp_server.md and covers:

- Overview explaining what the toolset enables, with emphasis on the combination of experiment state and source code access
- Use cases across the experiment lifecycle: pre-launch flag audit, mid-run diagnostics, result exploration/segmentation, code cleanup at conclusion, and program-wide health sweeps
- Setup linking to central MCP setup docs (no duplicated connection instructions)
- Available tools: full reference of all 14 public tools in the experiments toolset, grouped by function (listing, lifecycle, diagnostics/results, metric investigation, troubleshooting)

Prior art for toolset-specific MCP pages in their own domain:

Merge instructions

Merge readiness:

Ready for merge

AI assistance

Initial draft and structure written with Claude Code; content reviewed and refined iteratively.

Additional notes

- The experiments toolset is separate from the feature-flags toolset; the page links to feature_flags/feature_flag_mcp_server for flag management tools used alongside experiments.
- The check-flag-implementation tool referenced in the pre-launch use case lives in the feature-flags toolset. It is worth a docs team review of whether that cross-toolset reference needs a clarifying note.
- Permissions display strings (for example, Product Analytics Experiments Read) should be verified against whatever format the docs team standardizes on.

Adds a new page documenting the `experiments` toolset of the Datadog MCP Server, following the pattern established by security/mcp_server.md. Covers use cases (pre-launch flag audit, mid-run diagnostics, result exploration, code cleanup at conclusion, program-wide sweeps), setup instructions linking to central MCP docs, and a full reference of the 14 public tools in the toolset. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

github-actions · 2026-06-05T17:59:20Z

Preview links (active after the `build_preview` check completes)

New or renamed files

https://docs-staging.datadoghq.com/jeff.allen/experiments-mcp-tools/experiments/mcp_tools


- Simplify redundant overview paragraph to a single value statement
- Add alert callout that experiments toolset is not enabled by default
- Link check-flag-implementation to FF MCP page and note its toolset
- Defang jargony "specific mechanism in the diff" phrasing
- Soften time-series claim from "plot" to "examine time-bucketed results"

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>


dd-octo-sts-98cdbc · 2026-06-09T13:11:25Z

🐑 PR Shepherd is maintaining this PR

I watch your PR and automatically fix CI failures, rebase your branch, handle flaky tests, and push it to the merge queue when it's ready.

More about what I do → Guide

To pause me on this PR, add the flow-skip label.

rtrieu · 2026-06-09T17:43:18Z

Created a DOCS card for an editorial review.

rtrieu

hi @trestletech-dd, thanks for this PR! i left some comments, and also have an ask:

the "Available tools" section on this page duplicates what should live on /mcp_server/tools. can you add the experiments tools there and remove that section from this page? the use cases, setup, and intro content should stay here, just not the tool reference.

once that's resolved, i think this PR will be close to being merge ready. let me know if you have any questions or want to chat!

rtrieu · 2026-06-17T16:54:55Z

+  text: "Set Up the Datadog MCP Server"
+- link: "mcp_server"
+  tag: "Documentation"
+  text: "Datadog MCP Server Overview"


Suggested change

text: "Datadog MCP Server Overview"

  text: "Datadog MCP Server Overview"
- link: "mcp_server/tools"
  tag: "Documentation"
  text: "Datadog MCP Server Tools"

rtrieu · 2026-06-17T16:58:23Z

+
+**Before launching an experiment**, point an agent at [`check-flag-implementation`][5] (part of the `feature-flags` toolset) alongside your source code to audit the flag installation:
+
+- Is the flag read with the right value type and context attributes for its targeting rules? Is the default value consistent with what production serves today?


Suggested change

- Is the flag read with the right value type and context attributes for its targeting rules? Is the default value consistent with what production serves today?

- Whether the flag is read with the correct value type and context attributes for its targeting rules, and whether the default value matches what production serves

rtrieu · 2026-06-17T16:58:37Z

+**Before launching an experiment**, point an agent at [`check-flag-implementation`][5] (part of the `feature-flags` toolset) alongside your source code to audit the flag installation:
+
+- Is the flag read with the right value type and context attributes for its targeting rules? Is the default value consistent with what production serves today?
+- Does the code correctly emit the metric events the experiment depends on — or is there a path where a metric fires in one variant but not another, or fires twice?


Suggested change

- Does the code correctly emit the metric events the experiment depends on — or is there a path where a metric fires in one variant but not another, or fires twice?

- Whether the code emits metric events correctly in all variants, or whether there is a path where a metric fires in one variant but not another, or fires twice

rtrieu · 2026-06-17T16:58:50Z

+
+- Is the flag read with the right value type and context attributes for its targeting rules? Is the default value consistent with what production serves today?
+- Does the code correctly emit the metric events the experiment depends on — or is there a path where a metric fires in one variant but not another, or fires twice?
+- Are there nearby events or user behaviors in the code that aren't captured by any metric, or segments worth adding because the code path diverges by platform or context?


Suggested change

- Are there nearby events or user behaviors in the code that aren't captured by any metric, or segments worth adding because the code path diverges by platform or context?

- Whether nearby events or behaviors in the code aren't captured by any metric, and whether segments are worth adding because the code path diverges by platform or context

rtrieu · 2026-06-17T17:06:26Z

+
+### Use cases
+
+**Before launching an experiment**, point an agent at [`check-flag-implementation`][5] (part of the `feature-flags` toolset) alongside your source code to audit the flag installation:


Suggested change

**Before launching an experiment**, point an agent at [`check-flag-implementation`][5] (part of the `feature-flags` toolset) alongside your source code to audit the flag installation:

**Before launching an experiment**, point an agent at [`check_datadog_flag_implementation`][5] (part of the `feature-flags` toolset) alongside your source code to audit the flag installation:

rtrieu · 2026-06-17T17:09:08Z

+
+**Before concluding**, use `explore-experiment-results` to build confidence in the interpretation. An agent can slice the primary metric by device type, country, plan tier, or any other assignment property to check whether the result holds across subgroups or is being carried by one cohort. It can also examine time-bucketed results to check whether the lift held steady over time or faded after the first few days. This segmentation work — which would otherwise require navigating multiple dashboard views — happens in a single conversational thread alongside the diagnostic and results data already in context.
+
+**At conclusion**, an agent can take the winning variant decision, find the flag in the source, and draft the code change: inline the winning branch, remove the losing branch, delete the SDK call default that no longer needs a fallback.


Suggested change

**At conclusion**, an agent can take the winning variant decision, find the flag in the source, and draft the code change: inline the winning branch, remove the losing branch, delete the SDK call default that no longer needs a fallback.

**At conclusion**, an agent can record the winning variant decision, find the flag in the source, and draft the code change: inline the winning branch, remove the losing branch, delete the SDK call default that no longer needs a fallback.

rtrieu · 2026-06-17T17:09:50Z

+
+**At conclusion**, an agent can take the winning variant decision, find the flag in the source, and draft the code change: inline the winning branch, remove the losing branch, delete the SDK call default that no longer needs a fallback.
+
+**For program-wide operations**, an agent can sweep all running experiments for diagnostic warnings, surface stuck drafts with no allocation, and generate a standup-ready status summary.


Suggested change

**For program-wide operations**, an agent can sweep all running experiments for diagnostic warnings, surface stuck drafts with no allocation, and generate a standup-ready status summary.

**For program-wide operations**, an agent can sweep all running experiments for diagnostic warnings, surface draft experiments with no allocation, and generate a status summary.

rtrieu · 2026-06-17T17:10:32Z

+: *Permissions required: `Product Analytics Experiments Read`*
+
+`get-experiment-results`
+: Returns computed per-variant, per-metric results. The `verdict` field (`better`, `worse`, `inconclusive`, or `unreliable`) is authoritative — do not re-derive significance from raw p-values or confidence intervals.


Suggested change

: Returns computed per-variant, per-metric results. The `verdict` field (`better`, `worse`, `inconclusive`, or `unreliable`) is authoritative — do not re-derive significance from raw p-values or confidence intervals.

: Returns computed per-variant, per-metric results. The `verdict` field (`better`, `worse`, `inconclusive`, or `unreliable`) is authoritative — do not recalculate significance from raw p-values or confidence intervals.
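
To illustrate the reworded guidance, here is a minimal Python sketch of a consumer that trusts `verdict` instead of recalculating significance. Only the four verdict values come from the tool description above; the response shape (a list of dicts with `variant`, `metric`, and `verdict` keys) and the sample data are hypothetical stand-ins, not the documented output of `get-experiment-results`.

```python
from collections import defaultdict

# The four verdict values named in the tool description.
KNOWN_VERDICTS = {"better", "worse", "inconclusive", "unreliable"}


def group_by_verdict(results):
    """Group (variant, metric) pairs by the server-computed verdict.

    `results` is a hypothetical list of dicts with `variant`, `metric`,
    and `verdict` keys. Any p-values or confidence intervals in a row
    are deliberately ignored: the verdict is treated as authoritative.
    """
    grouped = defaultdict(list)
    for row in results:
        verdict = row["verdict"]
        if verdict not in KNOWN_VERDICTS:
            raise ValueError(f"unexpected verdict: {verdict!r}")
        grouped[verdict].append((row["variant"], row["metric"]))
    return dict(grouped)


# Hypothetical sample data for illustration only.
sample = [
    {"variant": "treatment", "metric": "checkout_rate", "verdict": "better", "p_value": 0.2},
    {"variant": "treatment", "metric": "page_latency", "verdict": "inconclusive"},
]
summary = group_by_verdict(sample)
```

The first sample row carries a `p_value` that a naive reader might call non-significant; the sketch still files it under `better` because it never inspects that field.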

rtrieu · 2026-06-17T17:10:52Z

+### Metric investigation
+
+`get-metric-definition`
+: Returns the definition of an experiment metric — the underlying event query, data source, and the recommended Datadog MCP tool to call next to investigate why the metric moved. For `datadog`-sourced metrics, the response includes a `recommended_tool_call` field pointing to `aggregate_rum_events` or `run_analytics_query` along with the structured filter and aggregation pieces needed to assemble the call. Not for Datadog infrastructure or APM metrics; use `get_datadog_metric` for those.


Suggested change

: Returns the definition of an experiment metric — the underlying event query, data source, and the recommended Datadog MCP tool to call next to investigate why the metric moved. For `datadog`-sourced metrics, the response includes a `recommended_tool_call` field pointing to `aggregate_rum_events` or `run_analytics_query` along with the structured filter and aggregation pieces needed to assemble the call. Not for Datadog infrastructure or APM metrics; use `get_datadog_metric` for those.

: Returns the definition of an experiment metric — the underlying event query, data source, and the recommended Datadog MCP tool for investigating why the metric moved. For `datadog`-sourced metrics, the response includes a `recommended_tool_call` field pointing to `aggregate_rum_events` or `run_analytics_query` along with the structured filter and aggregation parameters needed to assemble the call. Not for Datadog infrastructure or APM metrics; use `get_datadog_metric` for those.
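
As a rough sketch of how an agent-side helper might act on this description, the Python below picks the follow-up tool from a metric definition. The tool names and the `recommended_tool_call` field come from the description above; the `source`, `tool`, and `arguments` keys and the sample payload are assumptions made for illustration, since the actual response schema is not shown in this thread.

```python
# Follow-up tool names taken from the tool description.
FOLLOW_UP_TOOLS = {"aggregate_rum_events", "run_analytics_query"}


def next_tool_call(metric_definition):
    """Return (tool_name, arguments) for the follow-up call, or None.

    `metric_definition` is a hypothetical dict. The `source`, `tool`,
    and `arguments` keys are assumed names, not a documented schema.
    """
    if metric_definition.get("source") != "datadog":
        return None  # only datadog-sourced metrics carry a recommendation
    recommended = metric_definition.get("recommended_tool_call")
    if not recommended:
        return None
    tool = recommended["tool"]
    if tool not in FOLLOW_UP_TOOLS:
        raise ValueError(f"unexpected follow-up tool: {tool!r}")
    return tool, recommended.get("arguments", {})


# Hypothetical payload for illustration only.
definition = {
    "source": "datadog",
    "recommended_tool_call": {
        "tool": "aggregate_rum_events",
        "arguments": {"filter": "@type:action", "aggregation": "count"},
    },
}
call = next_tool_call(definition)
```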

rtrieu · 2026-06-17T17:12:01Z

+The `experiments` toolset is not enabled by default. To enable it, add `experiments` to the `toolsets` parameter when connecting to the Datadog MCP Server. For example:
+
+```text
+https://mcp.{{< region-param key="dd_site" >}}/api/unstable/mcp-server/mcp?toolsets=all,experiments


is this the correct URL? is it meant to contain "unstable"?


Good callout; I need to get the experiments toolset onto the v1 mcp API, then I'll update this.

Actually, we won't be pushing the experiments MCP toolset to GA until Q3 and I'd like to get these docs out first to help customers. So, yes this is the correct URL.
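
For readers following the URL discussion, this is a small Python sketch of how the connection URL in the diff is assembled from a site and a list of toolsets. The path and the `toolsets` parameter are copied from the snippet under review; the helper name `build_mcp_url` is hypothetical, and `datadoghq.com` is used only as an example value for the `dd_site` region parameter.

```python
from urllib.parse import urlencode


def build_mcp_url(site, toolsets):
    """Build the MCP Server URL with a comma-separated `toolsets` parameter.

    The `/api/unstable/mcp-server/mcp` path mirrors the URL in the diff,
    which the author confirmed is correct until the toolset reaches GA.
    """
    # safe="," keeps the commas literal instead of encoding them as %2C.
    query = urlencode({"toolsets": ",".join(toolsets)}, safe=",")
    return f"https://mcp.{site}/api/unstable/mcp-server/mcp?{query}"


url = build_mcp_url("datadoghq.com", ["all", "experiments"])
```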

Apply the straightforward review suggestions on content/en/experiments/mcp_tools.md:

- Add Datadog MCP Server Tools further_reading link
- Rename check-flag-implementation to check_datadog_flag_implementation
- Reword the pre-launch audit bullets
- Tighten the running, metric-movement, pre-conclusion, conclusion, and program-wide use case paragraphs
- Refine get-experiment-results and get-metric-definition tool descriptions

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

experiments: add mcp_tools to nav menu

93f6d48

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

github-actions Bot added the "Architecture" label (Everything related to the Doc backend) Jun 5, 2026

trestletech-dd and others added 3 commits June 8, 2026 11:01

Move note to header

8ea31eb

experiments/mcp_tools: add Bits AI link and permissions reference

61693d1

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

trestletech-dd marked this pull request as ready for review June 9, 2026 13:11

trestletech-dd requested a review from a team as a code owner June 9, 2026 13:11

rtrieu added the "editorial review" label (Waiting on a more in-depth review) Jun 9, 2026

rtrieu requested changes Jun 17, 2026

