Skip to content

KMS Retry support#1999

Draft
rozza wants to merge 3 commits into
mongodb:mainfrom
rozza:JAVA-5391
Draft

KMS Retry support#1999
rozza wants to merge 3 commits into
mongodb:mainfrom
rozza:JAVA-5391

Conversation

@rozza

@rozza rozza commented Jun 11, 2026

Copy link
Copy Markdown
Member

Added KMS Retry support.

3 Commits:

  • Update CAPI javadoc and added rules to: ‎mongodb-crypt/AGENTS.md‎
  • Implemented KMS Retries
  • TlsChannelImpl fixes

JAVA-5391 / JAVA-5772

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds end-to-end support for libmongocrypt’s KMS retry behavior across the sync and reactive-streams drivers, and updates the vendored TLS channel to correctly surface close_notify when it arrives in the same network segment as application data (addressing the KMS HTTP response parsing edge case referenced by JAVA-5391 / JAVA-5411).

Changes:

  • Enable and surface libmongocrypt KMS retry primitives (retryable HTTP errors, network failures, backoff sleep) via new JNA bindings and MongoKeyDecryptor APIs.
  • Update sync/reactive KMS I/O loops to use retry-aware feeding (feedWithRetry), budget exhaustion (fail), and pre-request sleep (sleepMicroseconds), including CSOT-aware timeout behavior.
  • Add prose/integration tests and Evergreen wiring for KMS retry scenarios, plus a unit test suite for the TLS close_notify + data coalescing behavior.

Reviewed changes

Copilot reviewed 16 out of 17 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
mongodb-crypt/src/test/resources/kms-reply-429.txt Adds a canned HTTP 429 KMS response used to validate retryable HTTP error handling.
mongodb-crypt/src/test/java/com/mongodb/crypt/capi/MongoCryptTest.java Adds unit tests validating retry-on-HTTP-error, retry-on-network-error, and retry budget exhaustion.
mongodb-crypt/src/main/com/mongodb/internal/crypt/capi/MongoKeyDecryptorImpl.java Implements new retry/backoff APIs by calling newly bound libmongocrypt KMS functions.
mongodb-crypt/src/main/com/mongodb/internal/crypt/capi/MongoKeyDecryptor.java Extends the decryptor interface with sleep/backoff and retry signaling methods.
mongodb-crypt/src/main/com/mongodb/internal/crypt/capi/MongoCryptImpl.java Enables libmongocrypt KMS retry behavior during initialization.
mongodb-crypt/src/main/com/mongodb/internal/crypt/capi/CAPI.java Adds new JNA native method declarations for KMS retry and expands/updates binding documentation.
mongodb-crypt/AGENTS.md Documents contribution rules and mapping guidelines for CAPI.java JNA bindings/javadoc.
driver-sync/src/test/functional/com/mongodb/client/ClientSideEncryptionKmsRetryProseTest.java Adds sync-driver entrypoint for the KMS retry prose tests.
driver-sync/src/test/functional/com/mongodb/client/AbstractClientSideEncryptionKmsRetryProseTest.java Implements KMS retry prose tests (network retry, HTTP retry, budget exhaustion, CSOT timing).
driver-sync/src/main/com/mongodb/client/internal/Crypt.java Integrates retry-aware KMS decryption into the sync encryption state machine, including backoff and timeout mapping.
driver-reactive-streams/src/test/functional/com/mongodb/reactivestreams/client/ClientSideEncryptionKmsRetryProseTest.java Adds reactive-streams entrypoint for the KMS retry prose tests.
driver-reactive-streams/src/main/com/mongodb/reactivestreams/client/internal/crypt/KeyManagementService.java Adds reactive retry/backoff handling and retry-aware KMS response feeding.
driver-core/src/test/unit/com/mongodb/internal/connection/tlschannel/TlsChannelCloseNotifyTest.java Adds regression tests for TLS channel read semantics when close_notify and data coalesce.
driver-core/src/main/com/mongodb/internal/connection/tlschannel/impl/TlsChannelImpl.java Updates vendored TLS-channel read/unwrap behavior to return bytes produced alongside close_notify and eagerly unwrap buffered records.
.evergreen/run-kms-retry-tests.sh Adds a CI script to run the new sync + reactive KMS retry prose tests.
.evergreen/.evg.yml Wires a new Evergreen function/task/buildvariant to execute the KMS retry test script.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread mongodb-crypt/src/main/com/mongodb/internal/crypt/capi/CAPI.java
Comment thread mongodb-crypt/AGENTS.md Outdated
@rozza rozza force-pushed the JAVA-5391 branch 3 times, most recently from d4e6fbf to 7cad542 Compare June 11, 2026 14:06
rozza added 2 commits June 11, 2026 15:08
Enable libmongocrypt's KMS retry behaviour and wire it through the sync
and reactive drivers. Transient KMS failures are retried with backoff
delays supplied by libmongocrypt; retry is enabled unconditionally.

- Add CAPI bindings: mongocrypt_setopt_retry_kms,
  mongocrypt_kms_ctx_usleep, mongocrypt_kms_ctx_feed_with_retry and
  mongocrypt_kms_ctx_fail, exposed via sleepMicroseconds(),
  feedWithRetry() and fail() on MongoKeyDecryptor
- HTTP 429 responses are signalled by feedWithRetry and network errors
  by fail(); in both cases the attempt ends early and libmongocrypt
  re-presents the KMS context for retry with the next backoff delay.
  When the retry budget is exhausted the original error is surfaced
- The sync Crypt sleeps the backoff before each attempt, surfaces
  MongoOperationTimeoutException only once the CSOT deadline has
  actually expired (a fixed-timeout connect or handshake stall with
  budget remaining stays retryable) and treats unexpected EOF as a
  retryable network error
- The reactive KeyManagementService applies the backoff via
  Mono.delay, checks the CSOT deadline before consuming retry budget,
  and releases buffers and closes streams on all completion paths;
  the shared backoff budget check lives in MongoCryptHelper
- Add the spec Section 24 KMS retry prose tests (sync and reactive)
  plus a CSOT-during-retry test; each test resets the failpoint
  server and control requests send Connection: close so the
  single-threaded server cannot stall KMS connections or leak armed
  failpoints between tests and suites
- Add mongodb-crypt unit tests for the new bindings, a sync unit test
  covering the CSOT classification of KMS network failures, and an
  Evergreen task running both suites

JAVA-5391
@rozza rozza requested a review from Copilot June 11, 2026 14:11
@rozza rozza changed the title Java 5391 KMS Retry support Jun 11, 2026

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 19 out of 20 changed files in this pull request and generated no new comments.

Comments suppressed due to low confidence (1)

mongodb-crypt/src/test/resources/kms-reply-429.txt:7

  • Content-Length: 32 implies the HTTP body is exactly the JSON string, but this resource currently ends with an extra blank line. getHttpResourceAsByteBuffer joins lines with \r\n, so the trailing empty line appends an extra CRLF after the body, making the response longer than the declared Content-Length and potentially breaking/parsing the retry logic in a non-representative way.

Remove the trailing blank line so the file ends immediately after the JSON body (matching how kms-reply.txt is formatted).

Content-Length: 32

{"__type":"ThrottlingException"}

When a KMS server responds and immediately closes the connection, the
response data and the close_notify alert arrive in the same network
segment as separate TLS records. The vendored tls-channel discarded
data produced in the same unwrap as a close (fixed upstream in 1.0.0)
and only surfaced a buffered close_notify on the next read. KMS HTTP
responses without Content-Length need data + EOF in the same exchange
for libmongocrypt to parse the response and signal a retry (JAVA-5391),
so reads now also eagerly drain TLS records already buffered alongside
produced data, folding every produced byte into the returned count. If
a buffered record turns out to be corrupt, the data decrypted before it
is still returned and the failure surfaces on the next read.

Covered by TlsChannelCloseNotifyTest using deterministic in-memory
channels over TLSv1.2 and TLSv1.3: data + close_notify in one segment,
close in a separate segment, responses spanning multiple TLS records,
destination buffers smaller than a record or filling up mid-response,
and a corrupt record following the data.

The eager drain has been proposed upstream in
marianobarrios/tls-channel#351, which this
implementation and test coverage mirror; JAVA-5411 tracks refreshing
the vendored tls-channel copy.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants