
utf8 bug fix #47008

Open
dibahlfi wants to merge 2 commits into main from users/dibahl/utf8-decoding-fix

Conversation

@dibahlfi
Member

@dibahlfi dibahlfi commented May 19, 2026

This PR introduces two independent fixes in Cosmos Python request/response handling.

Fix 1: Content-Length calculation for request bodies
The Content-Length header now uses the UTF-8 encoded byte length of the body instead of its Unicode character count.
Previously, len(body) was used, which returns the number of characters. For non-ASCII payloads, this under-reports the actual size. For example, the string {"name":"café"} has a character length of 15 but a UTF-8 byte length of 16, because "é" encodes to two bytes.
This mismatch can lead to incorrect wire-level payload sizing and potential request failures. The fix ensures accurate byte-level accounting across both sync and async request paths.
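
To illustrate the difference between character count and byte count, here is a minimal sketch using the compact form of the example payload. The helper name `content_length` is hypothetical and is not taken from the SDK; it only demonstrates the `len(body.encode("utf-8"))` technique.

```python
def content_length(body: str) -> int:
    """Return the number of bytes the body occupies on the wire as UTF-8.

    Hypothetical helper for illustration; not the SDK's actual function.
    """
    return len(body.encode("utf-8"))


body = '{"name":"café"}'

char_count = len(body)             # counts Unicode code points
byte_count = content_length(body)  # counts UTF-8 encoded bytes

print(char_count, byte_count)  # prints "15 16"
```

The two values agree for pure-ASCII bodies, which is why the old calculation only misbehaves once a payload contains multi-byte characters.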

Fix 2: Optional fallback for malformed UTF-8 in responses
Adds an opt-in mechanism to handle response bodies that contain invalid UTF-8 sequences.
Default behavior remains unchanged: strict decoding raises UnicodeDecodeError when malformed UTF-8 is encountered.
The environment variable COSMOS.CHARSET_DECODER_ERROR_ACTION_ON_MALFORMED_INPUT controls the fallback behavior:
  • REPLACE: invalid bytes are replaced with the Unicode replacement character (U+FFFD)
  • IGNORE: invalid bytes are skipped during decoding
This lets workloads continue when they encounter malformed data, without changing the default strict behavior.
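
The strict-by-default, opt-in fallback described above could look roughly like the following sketch. It is not the code in this PR: the function name `decode_body_sketch`, the case-insensitive handling of the variable's value, and the choice to re-raise on unrecognized values are all assumptions made for illustration. Only the environment variable name and the REPLACE/IGNORE actions come from the description.

```python
import os

# Environment variable name as given in the PR description.
ENV_VAR = "COSMOS.CHARSET_DECODER_ERROR_ACTION_ON_MALFORMED_INPUT"

# Map the documented actions to Python's built-in codec error handlers.
_ACTIONS = {"REPLACE": "replace", "IGNORE": "ignore"}


def decode_body_sketch(data: bytes) -> str:
    """Decode strictly first; fall back only if the user opted in.

    Hypothetical helper for illustration; not the SDK's actual function.
    """
    try:
        return data.decode("utf-8")  # strict decoding is the default path
    except UnicodeDecodeError:
        # Assumption: the value is matched case-insensitively.
        action = os.environ.get(ENV_VAR, "").strip().upper()
        errors = _ACTIONS.get(action)
        if errors is None:
            raise  # no opt-in (or unknown value): keep the strict failure
        return data.decode("utf-8", errors=errors)
```

With the variable unset, `decode_body_sketch(b"caf\xff")` raises UnicodeDecodeError. With it set to REPLACE the result is "caf\ufffd", and with IGNORE it is "caf". Valid UTF-8 never reaches the fallback branch, so well-formed responses decode identically in all three modes.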

Compatibility and risk
  • No change to default decoding behavior (still strict).
  • Fallback behavior is explicitly opt-in via environment variable.
  • Content-Length change only corrects byte calculation and does not alter request semantics.

Copilot AI review requested due to automatic review settings May 19, 2026 23:56
@dibahlfi dibahlfi requested a review from a team as a code owner May 19, 2026 23:56
Contributor

Copilot AI left a comment

Pull request overview

This PR improves Cosmos Python SDK reliability around request framing and response decoding by (1) correcting Content-Length calculation for UTF-8 payloads and (2) introducing an opt-in, environment-variable-driven fallback when decoding malformed UTF-8 response bodies.

Changes:

  • Compute Content-Length using UTF-8 byte length (len(body.encode("utf-8"))) in both sync and async request paths.
  • Add _response_decoding.decode_response_body() to keep strict UTF-8 decoding by default while enabling REPLACE/IGNORE fallback via COSMOS.CHARSET_DECODER_ERROR_ACTION_ON_MALFORMED_INPUT.
  • Add unit tests covering decoding behavior (strict + opt-in fallback) and regression tests for Content-Length wiring in sync/async paths.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| sdk/cosmos/azure-cosmos/azure/cosmos/_synchronized_request.py | Uses shared decode helper for response bodies; fixes sync Content-Length UTF-8 byte accounting. |
| sdk/cosmos/azure-cosmos/azure/cosmos/aio/_asynchronous_request.py | Uses shared decode helper for response bodies; fixes async Content-Length UTF-8 byte accounting. |
| sdk/cosmos/azure-cosmos/azure/cosmos/_response_decoding.py | New helper module implementing strict decode + opt-in permissive fallback with logging. |
| sdk/cosmos/azure-cosmos/tests/test_response_decoding.py | New tests validating strict behavior, actionable error hinting, and env-var-driven fallback modes. |
| sdk/cosmos/azure-cosmos/tests/test_content_length_encoding.py | New regression tests validating UTF-8 byte-length Content-Length (including sync/async wiring). |
| sdk/cosmos/azure-cosmos/CHANGELOG.md | Documents the Content-Length fix and the opt-in malformed UTF-8 decode fallback. |

Comment thread sdk/cosmos/azure-cosmos/azure/cosmos/_synchronized_request.py
Comment thread sdk/cosmos/azure-cosmos/azure/cosmos/aio/_asynchronous_request.py
Comment thread sdk/cosmos/azure-cosmos/azure/cosmos/_response_decoding.py
Comment thread sdk/cosmos/azure-cosmos/tests/test_response_decoding.py Outdated
@dibahlfi
Member Author

@sdkReviewAgent-2

@dibahlfi dibahlfi changed the title from "utf8 bug fix - initial commit" to "utf8 bug fix" May 20, 2026
Comment thread sdk/cosmos/azure-cosmos/azure/cosmos/_response_decoding.py Outdated
Comment thread sdk/cosmos/azure-cosmos/azure/cosmos/_synchronized_request.py
@xinlian12
Member

Review complete (44:43)

Posted 2 inline comment(s).

Steps: ✓ context, correctness, cross-sdk, design, history, past-prs, synthesis, test-coverage

@dibahlfi
Member Author

/azp run python - cosmos - tests

@dibahlfi
Member Author

@sdkReviewAgent-2

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

3 participants