logical,txnmode: add throughput and latency metrics#171191
Conversation
|
Merging to
After your PR is submitted to the merge queue, this comment will be automatically updated with its status. If the PR fails, failure details will also be posted here |
We want to start adding metrics to ldr txn mode, including the existing metrics. Txn mode is implemented as individual sub systems which will cause a dependency cycle if we attempt to import the original logical package. This change extracts the existing metrics struct to its own package. Release note: None
c8eac63 to
8e45b11
Compare
8e45b11 to
ed4cca5
Compare
| if waitingTxn.remainingDeps == 0 { | ||
| if waitingTxn.EventHorizon.LessEq(a.getGlobalFrontierLocked()) { | ||
| readyBuffer.AddLast(waitingTxn.Transaction) | ||
| a.metrics.TxnApplierBlockedTxns.Dec(1) |
There was a problem hiding this comment.
What do you think of using labeled metrics and model the different states of a transaction in the applier as different labels. I think the states are broadly:
- ready := can be applied, is waiting for an applier to pick it up
- applying := was dispatched to an applier
- txn-wait := is waiting on a transaction
- horizion-wait := is waiting on the horizion
There was a problem hiding this comment.
Neat, I didn't know about metric states.
How would you feel if we had just one metric for ready+applying txns? Ready+applying is easy to calculate, but splitting them up would require some non trivial refactoring; the readyBuffer is a locally scoped var on the aggregator that we'd have to extract to the applier and then wrap around a mutex. My current thinking is that this isn't worth the complexity because I don't think the split up metrics are that useful of a signal.
It would let you observe the case where we have ready txns, but for some reason we never hand it off to the applier. Given the handoff is just a select statement over a channel this seems unlikely? I'm imagining the more realistic stuck boundary is that we can't apply a txn for some reason and that blocks everything else. i.e. if ready+applying is growing/stalled, then it seems very likely it's applying thats the one blocked.
Let me know what you think though, happy to take a stab at lifting it out.
| if waitingTxn.EventHorizon.LessEq(a.getGlobalFrontierLocked()) { | ||
| readyBuffer.AddLast(waitingTxn.Transaction) | ||
| a.metrics.TxnApplierBlockedTxns.Dec(1) | ||
| a.metrics.TxnApplierReadyTxns.Inc(1) |
There was a problem hiding this comment.
consider: instead of trying to have all of these explicit increments and decrements, we could define a set of child metrics and have a goroutine periodically lock the applier stats and calculate the metrics at that time. I think that might be simpler than trying to keep everything perfectly in sync and it keeps the metric logic out of the critical code path.
| LabeledScanningRanges: metric.NewExportedGaugeVec(metaLabeledScanningRanges, []string{"label"}), | ||
| LabeledCatchupRanges: metric.NewExportedGaugeVec(metaLabeledCatchupRanges, []string{"label"}), | ||
|
|
||
| TxnApplierBlockedTxns: metric.NewGauge(metaTxnApplierBlockedTxns), |
There was a problem hiding this comment.
The new metrics should also accept the scope label.
| } | ||
| } | ||
| a.metrics.AppliedRowUpdates.Inc(int64(txn.applyResult.AppliedRows)) | ||
| if a.metricsLabel != "" { |
There was a problem hiding this comment.
Can we have the txnwriter track the commit latency and throughput metrics? The applier is probably the most complex part of txnldr, so I would like to keep as much complexity out of it as possible.
There was a problem hiding this comment.
Done, but I'll call out that we add one extra loop by putting it on this layer, since we now need to process the results of our apply whereas before we just returned them directly. I think this should be negligible overhead though.
770597a to
8c8c113
Compare
This change adds metrics to see in flight transactions in the applier pipeline. The number is broken down into txn blocked, horizon blocked and ready txns. Txn blocked means the transaction is not yet able to be committed it is waiting on a peer transaction to resolve first. Horizon blocked means that the transaction's dependencies are resolved but we are waiting on the event horizon to pass. Ready txns is the number of txns that are ready to be commited but haven't yet, i.e. on the ready buffer. Release note: None
This change wires up the following metrics for txn mode: 1. logical_replication.events_ingested 2. logical_replication.events_ingested_by_label 3. logical_replication.logical_bytes 4. logical_replication.commit_latency 5. logical_replication.batch_hist_nanos DLQ metrics are skipped for now as the DLQ path does not exist yet.
8c8c113 to
7ff99c2
Compare

The first two commits are part of #169672. We only require the first commit which could go in this PR as well i.e. the other PR is not blocking.
This change wires up the following metrics for txn mode:
DLQ metrics are skipped for now as the DLQ path does not exist yet.
Release note: none
Epic: https://cockroachlabs.atlassian.net/browse/CRDB-61283
Informs: #169872