KMS Retry support#1999
Conversation
There was a problem hiding this comment.
Pull request overview
This PR adds end-to-end support for libmongocrypt’s KMS retry behavior across the sync and reactive-streams drivers, and updates the vendored TLS channel to correctly surface close_notify when it arrives in the same network segment as application data (addressing the KMS HTTP response parsing edge case referenced by JAVA-5391 / JAVA-5411).
Changes:
- Enable and surface libmongocrypt KMS retry primitives (retryable HTTP errors, network failures, backoff sleep) via new JNA bindings and
MongoKeyDecryptorAPIs. - Update sync/reactive KMS I/O loops to use retry-aware feeding (
feedWithRetry), budget exhaustion (fail), and pre-request sleep (sleepMicroseconds), including CSOT-aware timeout behavior. - Add prose/integration tests and Evergreen wiring for KMS retry scenarios, plus a unit test suite for the TLS
close_notify+ data coalescing behavior.
Reviewed changes
Copilot reviewed 16 out of 17 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| mongodb-crypt/src/test/resources/kms-reply-429.txt | Adds a canned HTTP 429 KMS response used to validate retryable HTTP error handling. |
| mongodb-crypt/src/test/java/com/mongodb/crypt/capi/MongoCryptTest.java | Adds unit tests validating retry-on-HTTP-error, retry-on-network-error, and retry budget exhaustion. |
| mongodb-crypt/src/main/com/mongodb/internal/crypt/capi/MongoKeyDecryptorImpl.java | Implements new retry/backoff APIs by calling newly bound libmongocrypt KMS functions. |
| mongodb-crypt/src/main/com/mongodb/internal/crypt/capi/MongoKeyDecryptor.java | Extends the decryptor interface with sleep/backoff and retry signaling methods. |
| mongodb-crypt/src/main/com/mongodb/internal/crypt/capi/MongoCryptImpl.java | Enables libmongocrypt KMS retry behavior during initialization. |
| mongodb-crypt/src/main/com/mongodb/internal/crypt/capi/CAPI.java | Adds new JNA native method declarations for KMS retry and expands/updates binding documentation. |
| mongodb-crypt/AGENTS.md | Documents contribution rules and mapping guidelines for CAPI.java JNA bindings/javadoc. |
| driver-sync/src/test/functional/com/mongodb/client/ClientSideEncryptionKmsRetryProseTest.java | Adds sync-driver entrypoint for the KMS retry prose tests. |
| driver-sync/src/test/functional/com/mongodb/client/AbstractClientSideEncryptionKmsRetryProseTest.java | Implements KMS retry prose tests (network retry, HTTP retry, budget exhaustion, CSOT timing). |
| driver-sync/src/main/com/mongodb/client/internal/Crypt.java | Integrates retry-aware KMS decryption into the sync encryption state machine, including backoff and timeout mapping. |
| driver-reactive-streams/src/test/functional/com/mongodb/reactivestreams/client/ClientSideEncryptionKmsRetryProseTest.java | Adds reactive-streams entrypoint for the KMS retry prose tests. |
| driver-reactive-streams/src/main/com/mongodb/reactivestreams/client/internal/crypt/KeyManagementService.java | Adds reactive retry/backoff handling and retry-aware KMS response feeding. |
| driver-core/src/test/unit/com/mongodb/internal/connection/tlschannel/TlsChannelCloseNotifyTest.java | Adds regression tests for TLS channel read semantics when close_notify and data coalesce. |
| driver-core/src/main/com/mongodb/internal/connection/tlschannel/impl/TlsChannelImpl.java | Updates vendored TLS-channel read/unwrap behavior to return bytes produced alongside close_notify and eagerly unwrap buffered records. |
| .evergreen/run-kms-retry-tests.sh | Adds a CI script to run the new sync + reactive KMS retry prose tests. |
| .evergreen/.evg.yml | Wires a new Evergreen function/task/buildvariant to execute the KMS retry test script. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
d4e6fbf to
7cad542
Compare
…nd documented the rules in AGENTS.md
Enable libmongocrypt's KMS retry behaviour and wire it through the sync and reactive drivers. Transient KMS failures are retried with backoff delays supplied by libmongocrypt; retry is enabled unconditionally. - Add CAPI bindings: mongocrypt_setopt_retry_kms, mongocrypt_kms_ctx_usleep, mongocrypt_kms_ctx_feed_with_retry and mongocrypt_kms_ctx_fail, exposed via sleepMicroseconds(), feedWithRetry() and fail() on MongoKeyDecryptor - HTTP 429 responses are signalled by feedWithRetry and network errors by fail(); in both cases the attempt ends early and libmongocrypt re-presents the KMS context for retry with the next backoff delay. When the retry budget is exhausted the original error is surfaced - The sync Crypt sleeps the backoff before each attempt, surfaces MongoOperationTimeoutException only once the CSOT deadline has actually expired (a fixed-timeout connect or handshake stall with budget remaining stays retryable) and treats unexpected EOF as a retryable network error - The reactive KeyManagementService applies the backoff via Mono.delay, checks the CSOT deadline before consuming retry budget, and releases buffers and closes streams on all completion paths; the shared backoff budget check lives in MongoCryptHelper - Add the spec Section 24 KMS retry prose tests (sync and reactive) plus a CSOT-during-retry test; each test resets the failpoint server and control requests send Connection: close so the single-threaded server cannot stall KMS connections or leak armed failpoints between tests and suites - Add mongodb-crypt unit tests for the new bindings, a sync unit test covering the CSOT classification of KMS network failures, and an Evergreen task running both suites JAVA-5391
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 19 out of 20 changed files in this pull request and generated no new comments.
Comments suppressed due to low confidence (1)
mongodb-crypt/src/test/resources/kms-reply-429.txt:7
Content-Length: 32implies the HTTP body is exactly the JSON string, but this resource currently ends with an extra blank line.getHttpResourceAsByteBufferjoins lines with\r\n, so the trailing empty line appends an extra CRLF after the body, making the response longer than the declared Content-Length and potentially breaking/parsing the retry logic in a non-representative way.
Remove the trailing blank line so the file ends immediately after the JSON body (matching how kms-reply.txt is formatted).
Content-Length: 32
{"__type":"ThrottlingException"}
When a KMS server responds and immediately closes the connection, the response data and the close_notify alert arrive in the same network segment as separate TLS records. The vendored tls-channel discarded data produced in the same unwrap as a close (fixed upstream in 1.0.0) and only surfaced a buffered close_notify on the next read. KMS HTTP responses without Content-Length need data + EOF in the same exchange for libmongocrypt to parse the response and signal a retry (JAVA-5391), so reads now also eagerly drain TLS records already buffered alongside produced data, folding every produced byte into the returned count. If a buffered record turns out to be corrupt, the data decrypted before it is still returned and the failure surfaces on the next read. Covered by TlsChannelCloseNotifyTest using deterministic in-memory channels over TLSv1.2 and TLSv1.3: data + close_notify in one segment, close in a separate segment, responses spanning multiple TLS records, destination buffers smaller than a record or filling up mid-response, and a corrupt record following the data. The eager drain has been proposed upstream in marianobarrios/tls-channel#351, which this implementation and test coverage mirror; JAVA-5411 tracks refreshing the vendored tls-channel copy.
Added KMS Retry support.
3 Commits:
JAVA-5391 / JAVA-5772