utf8 bug fix #47008
Open
dibahlfi wants to merge 2 commits into
Pull request overview
This PR improves Cosmos Python SDK reliability around request framing and response decoding by (1) correcting Content-Length calculation for UTF-8 payloads and (2) introducing an opt-in, environment-variable-driven fallback when decoding malformed UTF-8 response bodies.
Changes:
- Compute `Content-Length` using UTF-8 byte length (`len(body.encode("utf-8"))`) in both sync and async request paths.
- Add `_response_decoding.decode_response_body()` to keep strict UTF-8 decoding by default while enabling `REPLACE`/`IGNORE` fallback via `COSMOS.CHARSET_DECODER_ERROR_ACTION_ON_MALFORMED_INPUT`.
- Add unit tests covering decoding behavior (strict + opt-in fallback) and regression tests for Content-Length wiring in sync/async paths.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| sdk/cosmos/azure-cosmos/azure/cosmos/_synchronized_request.py | Uses shared decode helper for response bodies; fixes sync Content-Length UTF-8 byte accounting. |
| sdk/cosmos/azure-cosmos/azure/cosmos/aio/_asynchronous_request.py | Uses shared decode helper for response bodies; fixes async Content-Length UTF-8 byte accounting. |
| sdk/cosmos/azure-cosmos/azure/cosmos/_response_decoding.py | New helper module implementing strict decode + opt-in permissive fallback with logging. |
| sdk/cosmos/azure-cosmos/tests/test_response_decoding.py | New tests validating strict behavior, actionable error hinting, and env-var-driven fallback modes. |
| sdk/cosmos/azure-cosmos/tests/test_content_length_encoding.py | New regression tests validating UTF-8 byte-length Content-Length (including sync/async wiring). |
| sdk/cosmos/azure-cosmos/CHANGELOG.md | Documents the Content-Length fix and the opt-in malformed UTF-8 decode fallback. |
Member
Author
@sdkReviewAgent-2
xinlian12
reviewed
May 20, 2026
Member
✅ Review complete (44:43). Posted 2 inline comment(s). Steps: ✓ context, correctness, cross-sdk, design, history, past-prs, synthesis, test-coverage
Member
Author
/azp run python - cosmos - tests
Member
Author
@sdkReviewAgent-2
Azure Pipelines successfully started running 1 pipeline(s).
This PR introduces two independent fixes in Cosmos Python request/response handling.
Fix 1: Content-Length calculation for request bodies
The Content-Length header is updated to use the UTF-8 encoded byte length instead of the Unicode character count.
Previously, len(body) was used, which returns the number of Unicode characters rather than the number of bytes. For non-ASCII payloads, this under-reports the actual size. For example, the compact JSON string {"name":"café"} has a character length of 15 but a UTF-8 byte length of 16 because "é" encodes to two bytes.
This mismatch can lead to incorrect wire-level payload sizing and potential request failures. The fix ensures accurate byte-level accounting across both sync and async request paths.
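The difference between the two counts can be reproduced with the standard library alone. The snippet below is a minimal sketch, not the SDK code itself: the helper name `utf8_content_length` and the `headers` dict are illustrative assumptions, and only the `len(body.encode("utf-8"))` expression comes from this PR.

```python
import json


def utf8_content_length(body: str) -> int:
    """Return the number of bytes the body occupies once encoded as UTF-8.

    Hypothetical helper for illustration; the PR applies the same
    expression inline in the sync and async request paths.
    """
    return len(body.encode("utf-8"))


# Compact JSON with the non-ASCII character kept as-is (not \u-escaped).
body = json.dumps({"name": "café"}, ensure_ascii=False, separators=(",", ":"))

char_count = len(body)                  # counts Unicode characters
byte_count = utf8_content_length(body)  # counts bytes on the wire

# Illustrative header dict; the value must be the byte count.
headers = {"Content-Length": str(byte_count)}

print(body, char_count, byte_count)
```

For this body the character count is 15 and the byte count is 16, so a header built from `len(body)` would be one byte short.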
Fix 2: Optional fallback for malformed UTF-8 in responses
Adds an opt-in mechanism to handle response bodies that contain invalid UTF-8 sequences.
Default behavior remains unchanged: strict decoding raises UnicodeDecodeError when malformed UTF-8 is encountered.
The environment variable COSMOS.CHARSET_DECODER_ERROR_ACTION_ON_MALFORMED_INPUT controls the fallback behavior:
- REPLACE: invalid bytes are replaced with the Unicode replacement character (U+FFFD)
- IGNORE: invalid bytes are skipped during decoding
This enables workloads to continue when encountering malformed data without changing default strict behavior.
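The three decoding modes map naturally onto Python's built-in `errors` argument of `bytes.decode`. The following is a rough sketch of that mapping, not the actual `decode_response_body()` implementation: the function name `decode_body_sketch`, its `environ` parameter, and the case-insensitive handling of the variable's value are assumptions made for illustration. Only the environment variable name and the REPLACE/IGNORE values come from this PR.

```python
import os

ENV_VAR = "COSMOS.CHARSET_DECODER_ERROR_ACTION_ON_MALFORMED_INPUT"

# Map the opt-in values onto Python's codec error handlers.
_ERROR_HANDLERS = {"REPLACE": "replace", "IGNORE": "ignore"}


def decode_body_sketch(data: bytes, environ=None) -> str:
    """Decode a response body as UTF-8, strictly unless a fallback is opted in.

    Hypothetical stand-in for the PR's helper. `environ` lets callers pass
    a plain dict instead of os.environ (an assumption, handy for testing).
    """
    env = os.environ if environ is None else environ
    # Assumption: surrounding whitespace and letter case are ignored.
    action = env.get(ENV_VAR, "").strip().upper()
    errors = _ERROR_HANDLERS.get(action, "strict")
    return data.decode("utf-8", errors=errors)


malformed = b"caf\xff"  # 0xFF can never appear in valid UTF-8

replaced = decode_body_sketch(malformed, {ENV_VAR: "REPLACE"})
ignored = decode_body_sketch(malformed, {ENV_VAR: "IGNORE"})

try:
    decode_body_sketch(malformed, {})  # variable unset: strict decoding
    strict_raised = False
except UnicodeDecodeError:
    strict_raised = True
```

With the variable unset the call raises UnicodeDecodeError, REPLACE yields "caf" followed by U+FFFD, and IGNORE yields "caf". Note that both fallbacks lose information, which is why strict decoding stays the default.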
Compatibility and risk
No change to default decoding behavior (still strict).
Fallback behavior is explicitly opt-in via environment variable.
Content-Length change only corrects byte calculation and does not alter request semantics.