@dheeraj-vanamala dheeraj-vanamala commented Nov 30, 2025

Description

This PR fixes issue #4517 where the OTLP gRPC exporter fails to reconnect to the collector after a restart (returning UNAVAILABLE).

Changes:

  • Detected StatusCode.UNAVAILABLE in the export loop.
  • Added logic to close the existing channel and re-initialize it before retrying.
  • Added a regression test test_unavailable_reconnects to verify the reconnection behavior.
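The changes above can be pictured with a minimal sketch. It is not the code from this PR: `FakeChannel`, `ReconnectingExporter`, and `channel_factory` are hypothetical names, `StatusCode` is a stand-in for `grpc.StatusCode` so the example runs without grpcio installed, and the backoff between retries is left out.

```python
import enum
from typing import Callable, List


class StatusCode(enum.Enum):
    """Stand-in for grpc.StatusCode, limited to the two codes used here."""

    OK = "ok"
    UNAVAILABLE = "unavailable"


class FakeChannel:
    """Hypothetical stand-in for a gRPC channel and its export stub."""

    def __init__(self, healthy: bool) -> None:
        self.healthy = healthy
        self.closed = False

    def export(self, batch: List[str]) -> StatusCode:
        return StatusCode.OK if self.healthy else StatusCode.UNAVAILABLE

    def close(self) -> None:
        self.closed = True


class ReconnectingExporter:
    """Illustrative exporter that rebuilds its channel on UNAVAILABLE."""

    def __init__(
        self,
        channel_factory: Callable[[], FakeChannel],
        max_attempts: int = 3,
    ) -> None:
        self._channel_factory = channel_factory
        self._max_attempts = max_attempts
        self._channel = channel_factory()

    def export(self, batch: List[str]) -> StatusCode:
        status = StatusCode.UNAVAILABLE
        for _ in range(self._max_attempts):
            status = self._channel.export(batch)
            if status is not StatusCode.UNAVAILABLE:
                return status
            # Close the stale channel and build a fresh one before retrying.
            self._channel.close()
            self._channel = self._channel_factory()
        return status


# Usage: the first channel always fails, the second one succeeds.
channels = iter([FakeChannel(healthy=False), FakeChannel(healthy=True)])
created = []


def channel_factory() -> FakeChannel:
    channel = next(channels)
    created.append(channel)
    return channel


exporter = ReconnectingExporter(channel_factory)
result = exporter.export(["span-1"])
```

The retry count is bounded, so a collector that stays down still ends in a failed export rather than an endless loop.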

Fixes #4517
Fixes #4529

Type of change

  • Bug fix (non-breaking change which fixes an issue)

How Has This Been Tested?

I added a new regression test case test_unavailable_reconnects in exporter/opentelemetry-exporter-otlp-proto-grpc/tests/test_otlp_exporter_mixin.py.

  • test_unavailable_reconnects: Verifies that the exporter closes and re-initializes the gRPC channel when the server returns StatusCode.UNAVAILABLE.
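A regression test of this shape could look roughly like the sketch below. It is an illustration, not the test from the PR: `export_with_reconnect` is a hypothetical helper standing in for the exporter's retry loop, and the status values are plain strings rather than `grpc.StatusCode` members.

```python
import unittest
from unittest import mock

# Stand-ins for grpc.StatusCode.UNAVAILABLE and grpc.StatusCode.OK.
UNAVAILABLE = "UNAVAILABLE"
OK = "OK"


def export_with_reconnect(make_channel, max_attempts=2):
    """Hypothetical helper: retry an export, rebuilding the channel on UNAVAILABLE."""
    channel = make_channel()
    status = UNAVAILABLE
    for _ in range(max_attempts):
        status = channel.export()
        if status != UNAVAILABLE:
            break
        channel.close()
        channel = make_channel()
    return status


class TestUnavailableReconnects(unittest.TestCase):
    def test_unavailable_reconnects(self):
        # The first channel keeps returning UNAVAILABLE; its replacement works.
        stale = mock.Mock()
        stale.export.return_value = UNAVAILABLE
        fresh = mock.Mock()
        fresh.export.return_value = OK
        make_channel = mock.Mock(side_effect=[stale, fresh])

        self.assertEqual(export_with_reconnect(make_channel), OK)
        # The stale channel was closed and a second channel was created.
        stale.close.assert_called_once_with()
        self.assertEqual(make_channel.call_count, 2)
```

The two assertions at the end cover both halves of the fix: the old channel is closed, and a new one is created before the retry.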

Does This PR Require a Contrib Repo Change?

  • No.

Checklist:

  • Followed the style guidelines of this project
  • Changelogs have been updated
  • Unit tests have been added
  • Documentation has been updated

@dheeraj-vanamala dheeraj-vanamala requested a review from a team as a code owner November 30, 2025 15:26

linux-foundation-easycla bot commented Nov 30, 2025

CLA Signed

The committers listed above are authorized under a signed CLA.

  • ✅ login: xrmx / name: Riccardo Magliocchetti (9edfeea)

@dheeraj-vanamala dheeraj-vanamala force-pushed the issue-4517/grpc-reconnection branch from c670f77 to b7620d0 Compare November 30, 2025 16:00
@dheeraj-vanamala dheeraj-vanamala force-pushed the issue-4517/grpc-reconnection branch from b7620d0 to 436ecc9 Compare November 30, 2025 16:13

dheeraj-vanamala commented Nov 30, 2025

I understand this issue is related to the upstream gRPC bug (grpc/grpc#38290).

I've analyzed that issue in depth, and the root cause appears to be a regression in the gRPC 'backup poller' (introduced in grpcio>=1.68.0) which fails to recover connections when the primary EventEngine is disabled (common in Python for fork safety).

While upstream fixes are being explored (e.g., grpc/grpc#38480), the issue has persisted for months, leaving exporters stuck in an UNAVAILABLE state indefinitely after collector restarts.

This PR implements a robust mitigation: detecting the persistent UNAVAILABLE state and forcing a channel re-initialization. This effectively resets the underlying poller state, allowing the exporter to recover immediately without requiring a full application restart. This approach provides stability for users while the complex upstream fix is finalized.

…mments

- Remove aggressive gRPC keepalive and retry settings to rely on defaults.
- Fix compression precedence logic to correctly handle NoCompression (0).
- Refactor channel initialization to be stateless (remove _channel_reconnection_enabled).
- Update documentation to refer to 'OTLP-compatible receiver'
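The compression item is a common Python pitfall, and a short sketch may help. The assumption here is that the earlier logic used a truthiness check; the commit message does not show the actual code. `Compression` is a stand-in mirroring the values of `grpc.Compression`, and both function names are hypothetical.

```python
import enum
from typing import Optional


class Compression(enum.IntEnum):
    """Stand-in mirroring grpc.Compression, where NoCompression is 0."""

    NoCompression = 0
    Deflate = 1
    Gzip = 2


def pick_compression_buggy(
    explicit: Optional[Compression], env_default: Compression
) -> Compression:
    # `or` treats NoCompression (0) as falsy, so an explicit request for
    # no compression is silently replaced by the environment default.
    return explicit or env_default


def pick_compression(
    explicit: Optional[Compression], env_default: Compression
) -> Compression:
    # Only fall back to the environment default when nothing was passed.
    return env_default if explicit is None else explicit


buggy = pick_compression_buggy(Compression.NoCompression, Compression.Gzip)
fixed = pick_compression(Compression.NoCompression, Compression.Gzip)
fallback = pick_compression(None, Compression.Gzip)
```

With these inputs the truthiness version ignores the explicit argument, while the `is None` version keeps it and only falls back when no argument was given.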
@dheeraj-vanamala dheeraj-vanamala changed the title Fix: Reinitialize gRPC channel on UNAVAILABLE error (Fixes #4517) Fix: Reinitialize gRPC channel on UNAVAILABLE error (Fixes #4517) (Fixes #4529) Dec 9, 2025
@xrmx xrmx moved this to Ready for review in @xrmx's Python PR digest Dec 17, 2025
@dheeraj-vanamala dheeraj-vanamala requested a review from xrmx January 7, 2026 17:07

xrmx commented Jan 22, 2026

@dheeraj-vanamala you have to fix tox -e typecheck and also please add a changelog entry

The warnings all seem to be about missing None checks.
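For illustration only, this is the kind of None check a type checker usually asks for when an attribute is declared optional. The `Channel` and `Exporter` classes are hypothetical and are not taken from the PR.

```python
from typing import Optional


class Channel:
    """Hypothetical stand-in for a gRPC channel."""

    def __init__(self) -> None:
        self.closed = False

    def close(self) -> None:
        self.closed = True


class Exporter:
    """Hypothetical exporter whose channel may be absent after shutdown."""

    def __init__(self) -> None:
        self._channel: Optional[Channel] = Channel()

    def shutdown(self) -> None:
        # Narrow Optional[Channel] to Channel before calling a method on it;
        # without this check a type checker reports a possible None access.
        if self._channel is not None:
            self._channel.close()
            self._channel = None


exporter = Exporter()
first_channel = exporter._channel
exporter.shutdown()
exporter.shutdown()  # A second call is a no-op because the channel is None.
```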

@xrmx xrmx changed the title Fix: Reinitialize gRPC channel on UNAVAILABLE error (Fixes #4517) (Fixes #4529) Fix: Reinitialize gRPC channel on UNAVAILABLE error Jan 22, 2026
Development

Successfully merging this pull request may close these issues.

otel-collector pause breaks the grpc OTLPSpanExporter permanently. Transient error StatusCode.UNAVAILABLE