experiments: add Experiments MCP Tools page #37302

Open: trestletech-dd wants to merge 6 commits into master from jeff.allen/experiments-mcp-tools

trestletech-dd (Contributor) commented Jun 5, 2026

What does this PR do? What is the motivation?

Adds content/en/experiments/mcp_tools.md, a new top-level page in the Experiments section documenting the experiments toolset of the Datadog MCP Server.

The page follows the pattern established by security/mcp_server.md and covers:

  • Overview explaining what the toolset enables, with emphasis on the combination of experiment state + source code access
  • Use cases across the experiment lifecycle: pre-launch flag audit, mid-run diagnostics, result exploration/segmentation, code cleanup at conclusion, and program-wide health sweeps
  • Setup linking to central MCP setup docs (no duplicated connection instructions)
  • Available tools — full reference of all 14 public tools in the experiments toolset, grouped by function (listing, lifecycle, diagnostics/results, metric investigation, troubleshooting)

Prior art for toolset-specific MCP pages in their own domain:

Merge instructions

Merge readiness:

  • Ready for merge

AI assistance

Initial draft and structure written with Claude Code; content reviewed and refined iteratively.

Additional notes

  • The experiments toolset is separate from the feature-flags toolset; the page links to feature_flags/feature_flag_mcp_server for flag management tools used alongside experiments.
  • The check-flag-implementation tool referenced in the pre-launch use case lives in the feature-flags toolset — worth a docs team eye on whether that cross-toolset reference needs a clarifying note.
  • Permissions display strings (e.g. Product Analytics Experiments Read) should be verified against whatever format the docs team standardizes on.

Adds a new page documenting the `experiments` toolset of the Datadog MCP
Server, following the pattern established by security/mcp_server.md.

Covers use cases (pre-launch flag audit, mid-run diagnostics, result
exploration, code cleanup at conclusion, program-wide sweeps), setup
instructions linking to central MCP docs, and a full reference of the
14 public tools in the toolset.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
github-actions Bot (Contributor) commented Jun 5, 2026

Preview links (active after the build_preview check completes)

New or renamed files

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
github-actions Bot added the Architecture label (Everything related to the Doc backend) Jun 5, 2026
trestletech-dd and others added 3 commits June 8, 2026 11:01
- Simplify redundant overview paragraph to a single value statement
- Add alert callout that experiments toolset is not enabled by default
- Link check-flag-implementation to FF MCP page and note its toolset
- Defang jargony "specific mechanism in the diff" phrasing
- Soften time-series claim from "plot" to "examine time-bucketed results"

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
trestletech-dd marked this pull request as ready for review June 9, 2026 13:11
trestletech-dd requested a review from a team as a code owner June 9, 2026 13:11
dd-octo-sts-98cdbc Bot commented Jun 9, 2026

🐑 PR Shepherd is maintaining this PR

I watch your PR and automatically fix CI failures, rebase your branch, handle flaky tests, and push it to the merge queue when it's ready.

More about what I do → Guide

To pause me on this PR, add the flow-skip label.

rtrieu added the editorial review label (Waiting on a more in-depth review) Jun 9, 2026

rtrieu (Contributor) commented Jun 9, 2026

Created a DOCS card for an editorial review.

rtrieu (Contributor) left a comment

hi @trestletech-dd, thanks for this PR! i left some comments, and also have an ask:

the "Available tools" section on this page duplicates what should live on /mcp_server/tools. can you add the experiments tools there and remove that section from this page? the use cases, setup, and intro content should stay here, just not the tool reference.

once that's resolved, i think this PR will be close to being merge ready. let me know if you have any questions or want to chat!

text: "Set Up the Datadog MCP Server"
- link: "mcp_server"
tag: "Documentation"
text: "Datadog MCP Server Overview"

Suggested change
text: "Datadog MCP Server Overview"
text: "Datadog MCP Server Overview"
- link: "mcp_server/tools"
tag: "Documentation"
text: "Datadog MCP Server Tools"

Comment thread content/en/experiments/mcp_tools.md Outdated

**Before launching an experiment**, point an agent at [`check-flag-implementation`][5] (part of the `feature-flags` toolset) alongside your source code to audit the flag installation:

- Is the flag read with the right value type and context attributes for its targeting rules? Is the default value consistent with what production serves today?

Suggested change
- Is the flag read with the right value type and context attributes for its targeting rules? Is the default value consistent with what production serves today?
- Whether the flag is read with the correct value type and context attributes for its targeting rules, and whether the default value matches what production serves

Comment thread content/en/experiments/mcp_tools.md Outdated
**Before launching an experiment**, point an agent at [`check-flag-implementation`][5] (part of the `feature-flags` toolset) alongside your source code to audit the flag installation:

- Is the flag read with the right value type and context attributes for its targeting rules? Is the default value consistent with what production serves today?
- Does the code correctly emit the metric events the experiment depends on — or is there a path where a metric fires in one variant but not another, or fires twice?

Suggested change
- Does the code correctly emit the metric events the experiment depends on — or is there a path where a metric fires in one variant but not another, or fires twice?
- Whether the code emits metric events correctly in all variants, or whether there is a path where a metric fires in one variant but not another, or fires twice

Comment thread content/en/experiments/mcp_tools.md Outdated

- Is the flag read with the right value type and context attributes for its targeting rules? Is the default value consistent with what production serves today?
- Does the code correctly emit the metric events the experiment depends on — or is there a path where a metric fires in one variant but not another, or fires twice?
- Are there nearby events or user behaviors in the code that aren't captured by any metric, or segments worth adding because the code path diverges by platform or context?

Suggested change
- Are there nearby events or user behaviors in the code that aren't captured by any metric, or segments worth adding because the code path diverges by platform or context?
- Whether nearby events or behaviors in the code aren't captured by any metric, and whether segments are worth adding because the code path diverges by platform or context
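
For illustration, the second audit item (a metric that fires in one variant but not another) can look like this in application code. This is a minimal Python sketch, not code from the PR: `checkout_buggy`, `checkout_fixed`, the `checkout_completed` event name, and the `track` callback are all hypothetical and do not correspond to any Datadog SDK.

```python
from typing import Callable, List

Track = Callable[[str], None]


def checkout_buggy(variant: str, track: Track) -> None:
    # Hypothetical handler: only the treatment branch emits the metric event,
    # so the control arm would appear to have no conversions at all.
    if variant == "treatment":
        track("checkout_completed")


def checkout_fixed(variant: str, track: Track) -> None:
    # The event is emitted exactly once, outside the variant branch.
    track("checkout_completed")


def events_for(handler: Callable[[str, Track], None], variant: str) -> List[str]:
    """Run a handler for one variant and collect the events it emits."""
    events: List[str] = []
    handler(variant, events.append)
    return events
```

Comparing `events_for(handler, "control")` with `events_for(handler, "treatment")` is the kind of per-variant check an agent with source access could reason about before launch.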

Comment thread content/en/experiments/mcp_tools.md Outdated

### Use cases

**Before launching an experiment**, point an agent at [`check-flag-implementation`][5] (part of the `feature-flags` toolset) alongside your source code to audit the flag installation:

Suggested change
**Before launching an experiment**, point an agent at [`check-flag-implementation`][5] (part of the `feature-flags` toolset) alongside your source code to audit the flag installation:
**Before launching an experiment**, point an agent at [`check_datadog_flag_implementation`][5] (part of the `feature-flags` toolset) alongside your source code to audit the flag installation:

Comment thread content/en/experiments/mcp_tools.md Outdated

**Before concluding**, use `explore-experiment-results` to build confidence in the interpretation. An agent can slice the primary metric by device type, country, plan tier, or any other assignment property to check whether the result holds across subgroups or is being carried by one cohort. It can also examine time-bucketed results to check whether the lift held steady over time or faded after the first few days. This segmentation work — which would otherwise require navigating multiple dashboard views — happens in a single conversational thread alongside the diagnostic and results data already in context.

**At conclusion**, an agent can take the winning variant decision, find the flag in the source, and draft the code change: inline the winning branch, remove the losing branch, delete the SDK call default that no longer needs a fallback.

Suggested change
**At conclusion**, an agent can take the winning variant decision, find the flag in the source, and draft the code change: inline the winning branch, remove the losing branch, delete the SDK call default that no longer needs a fallback.
**At conclusion**, an agent can record the winning variant decision, find the flag in the source, and draft the code change: inline the winning branch, remove the losing branch, delete the SDK call default that no longer needs a fallback.
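
As a rough picture of the cleanup described in this paragraph, the sketch below shows a flag-gated function before and after the winning branch is inlined. Everything here is assumed for illustration: `StubFlags`, its `get_boolean` method, and the `new-checkout-banner` flag key are hypothetical and are not a real SDK API.

```python
class StubFlags:
    """Hypothetical stand-in for a feature flag client; not a real SDK API."""

    def __init__(self, values: dict):
        self._values = values

    def get_boolean(self, key: str, default: bool) -> bool:
        return self._values.get(key, default)


def banner_before(flags: StubFlags) -> str:
    # Before cleanup: the flag read carries a fallback default.
    if flags.get_boolean("new-checkout-banner", default=False):
        return "new banner"  # winning branch
    return "old banner"  # losing branch


def banner_after() -> str:
    # After cleanup: the winning branch is inlined; the flag read, its
    # default, and the losing branch are removed.
    return "new banner"
```

The cleaned-up function should behave the same as the old one did with the flag on, which is a simple property to assert in a test before deleting the flag.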

Comment thread content/en/experiments/mcp_tools.md Outdated

**At conclusion**, an agent can take the winning variant decision, find the flag in the source, and draft the code change: inline the winning branch, remove the losing branch, delete the SDK call default that no longer needs a fallback.

**For program-wide operations**, an agent can sweep all running experiments for diagnostic warnings, surface stuck drafts with no allocation, and generate a standup-ready status summary.

Suggested change
**For program-wide operations**, an agent can sweep all running experiments for diagnostic warnings, surface stuck drafts with no allocation, and generate a standup-ready status summary.
**For program-wide operations**, an agent can sweep all running experiments for diagnostic warnings, surface draft experiments with no allocation, and generate a status summary.

Comment thread content/en/experiments/mcp_tools.md Outdated
: *Permissions required: `Product Analytics Experiments Read`*

`get-experiment-results`
: Returns computed per-variant, per-metric results. The `verdict` field (`better`, `worse`, `inconclusive`, or `unreliable`) is authoritative — do not re-derive significance from raw p-values or confidence intervals.

Suggested change
: Returns computed per-variant, per-metric results. The `verdict` field (`better`, `worse`, `inconclusive`, or `unreliable`) is authoritative — do not re-derive significance from raw p-values or confidence intervals.
: Returns computed per-variant, per-metric results. The `verdict` field (`better`, `worse`, `inconclusive`, or `unreliable`) is authoritative — do not recalculate significance from raw p-values or confidence intervals.
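
A small sketch of what treating `verdict` as authoritative could look like in client code. Only the `verdict` field and its four values come from the description above; the `metric`, `variant`, and `p_value` keys and the sample rows are assumptions made up for this example, not the tool's documented response schema.

```python
# Hypothetical rows: every key except `verdict` is an assumption.
results = [
    {"metric": "checkout_conversion", "variant": "treatment",
     "verdict": "better", "p_value": 0.03},
    {"metric": "page_load_time", "variant": "treatment",
     "verdict": "unreliable", "p_value": 0.01},
]


def summarize(rows):
    """Group metric names by the server-computed verdict.

    The p_value is deliberately ignored: a small p-value on an
    `unreliable` result must not be promoted to a win.
    """
    summary = {"better": [], "worse": [], "inconclusive": [], "unreliable": []}
    for row in rows:
        summary[row["verdict"]].append(row["metric"])
    return summary
```

In the sample data, the second row has the smaller p-value yet is marked `unreliable`, which is the case where recalculating significance locally would mislead.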

Comment thread content/en/experiments/mcp_tools.md Outdated
### Metric investigation

`get-metric-definition`
: Returns the definition of an experiment metric — the underlying event query, data source, and the recommended Datadog MCP tool to call next to investigate why the metric moved. For `datadog`-sourced metrics, the response includes a `recommended_tool_call` field pointing to `aggregate_rum_events` or `run_analytics_query` along with the structured filter and aggregation pieces needed to assemble the call. Not for Datadog infrastructure or APM metrics; use `get_datadog_metric` for those.

Suggested change
: Returns the definition of an experiment metric — the underlying event query, data source, and the recommended Datadog MCP tool to call next to investigate why the metric moved. For `datadog`-sourced metrics, the response includes a `recommended_tool_call` field pointing to `aggregate_rum_events` or `run_analytics_query` along with the structured filter and aggregation pieces needed to assemble the call. Not for Datadog infrastructure or APM metrics; use `get_datadog_metric` for those.
: Returns the definition of an experiment metric — the underlying event query, data source, and the recommended Datadog MCP tool for investigating why the metric moved. For `datadog`-sourced metrics, the response includes a `recommended_tool_call` field pointing to `aggregate_rum_events` or `run_analytics_query` along with the structured filter and aggregation parameters needed to assemble the call. Not for Datadog infrastructure or APM metrics; use `get_datadog_metric` for those.
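
One way a client might act on `recommended_tool_call` is sketched below. The field name `recommended_tool_call` and the two tool names come from the description above; the inner `tool` and `arguments` keys, the `source` key, and the example filter values are hypothetical, since the actual response shape is not shown in this PR.

```python
from typing import Optional, Tuple

# Tool names taken from the description above.
KNOWN_TOOLS = {"aggregate_rum_events", "run_analytics_query"}


def next_call(definition: dict) -> Optional[Tuple[str, dict]]:
    """Return (tool_name, arguments) for the follow-up call, or None."""
    rec = definition.get("recommended_tool_call")
    if rec is None:
        return None
    tool = rec["tool"]  # assumed key name
    if tool not in KNOWN_TOOLS:
        raise ValueError(f"unexpected tool: {tool}")
    return tool, rec.get("arguments", {})  # assumed key name


# Hypothetical metric definition for a `datadog`-sourced metric.
example = {
    "source": "datadog",
    "recommended_tool_call": {
        "tool": "aggregate_rum_events",
        "arguments": {"filter": "@type:action", "aggregation": "count"},
    },
}
```

Checking the tool name against a known set keeps the client from calling an unexpected tool if the response shape changes.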

The `experiments` toolset is not enabled by default. To enable it, add `experiments` to the `toolsets` parameter when connecting to the Datadog MCP Server. For example:

```text
https://mcp.{{< region-param key="dd_site" >}}/api/unstable/mcp-server/mcp?toolsets=all,experiments
```

Contributor commented:

is this the correct URL? is it meant to contain "unstable"?

Contributor commented:

> is this the correct URL? is it meant to contain "unstable"?

Good callout; I need to get the experiments toolset onto the v1 mcp API, then I'll update this.

Contributor commented:

Actually, we won't be pushing the experiments MCP toolset to GA until Q3 and I'd like to get these docs out first to help customers. So, yes this is the correct URL.
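
For readers assembling the connection URL in code, a minimal sketch follows. It assumes the `region-param` shortcode resolves to a Datadog site value such as `datadoghq.com` (used here only as an example), and it keeps the `unstable` path confirmed in this thread. The `mcp_url` helper is hypothetical.

```python
from urllib.parse import urlencode


def mcp_url(site: str, toolsets: list) -> str:
    """Build the MCP Server URL with a comma-separated toolsets parameter."""
    # safe="," keeps the comma literal instead of encoding it as %2C.
    query = urlencode({"toolsets": ",".join(toolsets)}, safe=",")
    return f"https://mcp.{site}/api/unstable/mcp-server/mcp?{query}"


# Example site value is an assumption; substitute your own Datadog site.
url = mcp_url("datadoghq.com", ["all", "experiments"])
print(url)
```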

Apply the straightforward review suggestions on content/en/experiments/mcp_tools.md:
- Add Datadog MCP Server Tools further_reading link
- Rename check-flag-implementation to check_datadog_flag_implementation
- Reword the pre-launch audit bullets
- Tighten the running, metric-movement, pre-conclusion, conclusion, and program-wide use case paragraphs
- Refine get-experiment-results and get-metric-definition tool descriptions

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Labels

- Architecture: Everything related to the Doc backend
- editorial review: Waiting on a more in-depth review

3 participants