Skip to content

logical,txnmode: add throughput and latency metrics#171191

Open
DarrylWong wants to merge 4 commits into
cockroachdb:masterfrom
DarrylWong:ldr-txn-wire-up-metrics
Open

logical,txnmode: add throughput and latency metrics#171191
DarrylWong wants to merge 4 commits into
cockroachdb:masterfrom
DarrylWong:ldr-txn-wire-up-metrics

Conversation

@DarrylWong
Copy link
Copy Markdown
Contributor

@DarrylWong DarrylWong commented May 29, 2026

The first two commits are part of #169672. We only require the first commit which could go in this PR as well i.e. the other PR is not blocking.


This change wires up the following metrics for txn mode:

  1. logical_replication.events_ingested
  2. logical_replication.events_ingested_by_label
  3. logical_replication.logical_bytes
  4. logical_replication.commit_latency
  5. logical_replication.batch_hist_nanos

DLQ metrics are skipped for now as the DLQ path does not exist yet.

Release note: none
Epic: https://cockroachlabs.atlassian.net/browse/CRDB-61283
Informs: #169872

@trunk-io
Copy link
Copy Markdown
Contributor

trunk-io Bot commented May 29, 2026

Merging to master in this repository is managed by Trunk.

  • To merge this pull request, check the box to the left or comment /trunk merge below.

After your PR is submitted to the merge queue, this comment will be automatically updated with its status. If the PR fails, failure details will also be posted here

@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

@DarrylWong
Copy link
Copy Markdown
Contributor Author

image

We want to start adding metrics to ldr txn mode, including
the existing metrics. Txn mode is implemented as individual
sub systems which will cause a dependency cycle if we
attempt to import the original logical package. This change
extracts the existing metrics struct to its own package.

Release note: None
@DarrylWong DarrylWong force-pushed the ldr-txn-wire-up-metrics branch 2 times, most recently from c8eac63 to 8e45b11 Compare May 29, 2026 19:43
@DarrylWong DarrylWong changed the title Ldr txn wire up metrics logical,txnmode: add throughput and latency metrics May 29, 2026
@DarrylWong DarrylWong force-pushed the ldr-txn-wire-up-metrics branch from 8e45b11 to ed4cca5 Compare May 29, 2026 21:03
@DarrylWong DarrylWong requested a review from jeffswenson May 29, 2026 21:04
@DarrylWong DarrylWong marked this pull request as ready for review May 29, 2026 21:04
@DarrylWong DarrylWong requested review from a team as code owners May 29, 2026 21:04
@DarrylWong DarrylWong requested review from DrewKimball and removed request for a team May 29, 2026 21:04
if waitingTxn.remainingDeps == 0 {
if waitingTxn.EventHorizon.LessEq(a.getGlobalFrontierLocked()) {
readyBuffer.AddLast(waitingTxn.Transaction)
a.metrics.TxnApplierBlockedTxns.Dec(1)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think of using labeled metrics and model the different states of a transaction in the applier as different labels. I think the states are broadly:

  • ready := can be applied, is waiting for an applier to pick it up
  • applying := was dispatched to an applier
  • txn-wait := is waiting on a transaction
  • horizion-wait := is waiting on the horizion

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Neat, I didn't know about metric states.

How would you feel if we had just one metric for ready+applying txns? Ready+applying is easy to calculate, but splitting them up would require some non trivial refactoring; the readyBuffer is a locally scoped var on the aggregator that we'd have to extract to the applier and then wrap around a mutex. My current thinking is that this isn't worth the complexity because I don't think the split up metrics are that useful of a signal.

It would let you observe the case where we have ready txns, but for some reason we never hand it off to the applier. Given the handoff is just a select statement over a channel this seems unlikely? I'm imagining the more realistic stuck boundary is that we can't apply a txn for some reason and that blocks everything else. i.e. if ready+applying is growing/stalled, then it seems very likely it's applying thats the one blocked.

Let me know what you think though, happy to take a stab at lifting it out.

if waitingTxn.EventHorizon.LessEq(a.getGlobalFrontierLocked()) {
readyBuffer.AddLast(waitingTxn.Transaction)
a.metrics.TxnApplierBlockedTxns.Dec(1)
a.metrics.TxnApplierReadyTxns.Inc(1)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

consider: instead of trying to have all of these explicit increments and decrements, we could define a set of child metrics and have a goroutine periodically lock the applier stats and calculate the metrics at that time. I think that might be simpler than trying to keep everything perfectly in sync and it keeps the metric logic out of the critical code path.

LabeledScanningRanges: metric.NewExportedGaugeVec(metaLabeledScanningRanges, []string{"label"}),
LabeledCatchupRanges: metric.NewExportedGaugeVec(metaLabeledCatchupRanges, []string{"label"}),

TxnApplierBlockedTxns: metric.NewGauge(metaTxnApplierBlockedTxns),
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new metrics should also accept the scope label.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

}
}
a.metrics.AppliedRowUpdates.Inc(int64(txn.applyResult.AppliedRows))
if a.metricsLabel != "" {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we have the txnwriter track the commit latency and throughput metrics? The applier is probably the most complex part of txnldr, so I would like to keep as much complexity out of it as possible.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, but I'll call out that we add one extra loop by putting it on this layer, since we now need to process the results of our apply whereas before we just returned them directly. I think this should be negligible overhead though.

@DarrylWong DarrylWong force-pushed the ldr-txn-wire-up-metrics branch 2 times, most recently from 770597a to 8c8c113 Compare June 1, 2026 21:45
This change adds metrics to see in flight transactions in the
applier pipeline. The number is broken down into txn blocked,
horizon blocked and ready txns.

Txn blocked means the transaction is not yet able to be committed
it is waiting on a peer transaction to resolve first. Horizon blocked
means that the transaction's dependencies are resolved but we are
waiting on the event horizon to pass. Ready txns is the number of txns
that are ready to be commited but haven't yet, i.e. on the ready buffer.

Release note: None
This change wires up the following metrics for txn mode:
  1. logical_replication.events_ingested
  2. logical_replication.events_ingested_by_label
  3. logical_replication.logical_bytes
  4. logical_replication.commit_latency
  5. logical_replication.batch_hist_nanos

DLQ metrics are skipped for now as the DLQ path does not
exist yet.
@DarrylWong DarrylWong force-pushed the ldr-txn-wire-up-metrics branch from 8c8c113 to 7ff99c2 Compare June 1, 2026 21:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants