
Retry backoff in ClientApiClient.refresh_ignored_field_lists is stuck at 0 — unthrottled retry loop during server outages #9

Description

@TrApY

Summary

In lib/clients/metaapi/client_api_client.py (PyPI metaapi-cloud-sdk==29.1.1, latest at the time of writing), the error-retry backoff of refresh_ignored_field_lists never grows from zero, producing an unthrottled retry loop for the whole duration of a server-side outage.

The bug

except Exception as err:
    self._logger.error(f'Failed to update hashing ignored field list {format_error(err)}')
    self._ignored_field_lists_caches[region]['retryIntervalInSeconds'] = min(
        self._ignored_field_lists_caches[region].get('retryIntervalInSeconds', 0) * 2, 300
    )
    await asyncio.sleep(self._ignored_field_lists_caches[region]['retryIntervalInSeconds'])

retryIntervalInSeconds is only set to its base value (self._retry_interval_in_seconds) after a successful refresh. When the cache entry is fresh (created in the same call) and the request fails, the key is still missing, so get('retryIntervalInSeconds', 0) returns 0. The result 0 * 2 = 0 is then stored, so every later failure doubles 0 again and the interval never grows. The except branch calls asyncio.sleep(0) and retries immediately, in a tight loop, until the endpoint recovers.
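The stuck-at-zero arithmetic can be reproduced in isolation. The following is a minimal sketch, not the SDK's code: `buggy_next_interval` and `simulate_failures` are hypothetical helpers that mirror only the `min(... * 2, 300)` expression from the snippet above, with a plain dict standing in for the per-region cache entry.

```python
def buggy_next_interval(cache_entry: dict, cap: int = 300) -> int:
    """Hypothetical helper mirroring the expression in the except branch."""
    return min(cache_entry.get('retryIntervalInSeconds', 0) * 2, cap)


def simulate_failures(cache_entry: dict, failures: int) -> list:
    """Apply the buggy update once per consecutive failure and record each sleep."""
    intervals = []
    for _ in range(failures):
        cache_entry['retryIntervalInSeconds'] = buggy_next_interval(cache_entry)
        intervals.append(cache_entry['retryIntervalInSeconds'])
    return intervals


# Fresh cache entry: the key is missing, so the interval is doubled from 0.
fresh = simulate_failures({}, 5)
# Entry that already holds a base value of 1 (as after a successful refresh).
seeded = simulate_failures({'retryIntervalInSeconds': 1}, 5)

print(fresh)   # [0, 0, 0, 0, 0]
print(seeded)  # [2, 4, 8, 16, 32]
```

The doubling itself works once a non-zero value is present; only the fresh-entry path degenerates into sleep(0).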

Each loop iteration also goes through HttpClient.request's own 5 internal retries, so the net effect during an outage is a continuous stream of requests plus one ERROR log per ~30s cycle, with zero pause between cycles.

Observed impact

During the mt-client-api-v1.london.agiliumtrade.ai 503 outage on 2026-06-10 (~22:00–23:00 UTC), a bot using the streaming API logged ~5,800 error lines/hour from this loop. The constant churn saturated the asyncio event loop enough that APScheduler jobs were skipped (Run time of job ... was missed by 0:01:02) and unrelated outbound HTTP calls timed out. It also flooded our error tracker.

Suggested fix

Seed the backoff with the base interval instead of 0:

previous = self._ignored_field_lists_caches[region].get('retryIntervalInSeconds', 0)
self._ignored_field_lists_caches[region]['retryIntervalInSeconds'] = min(
    max(previous * 2, self._retry_interval_in_seconds), 300
)

With a base interval of 1s, this yields the intended 1 → 2 → 4 → … → 300s progression on consecutive failures (and keeps the existing reset-to-base on success).
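The progression produced by the suggested expression can be checked with a small standalone sketch. The names `next_retry_interval` and `backoff_sequence` are hypothetical, and the base of 1 second is an assumption standing in for self._retry_interval_in_seconds; only the `min(max(previous * 2, base), 300)` formula comes from the fix above.

```python
def next_retry_interval(previous: float, base: float = 1, cap: float = 300) -> float:
    """Suggested formula: double the previous interval, floor at base, cap at 300s.

    base=1 is an assumed stand-in for self._retry_interval_in_seconds.
    """
    return min(max(previous * 2, base), cap)


def backoff_sequence(failures: int, base: float = 1, cap: float = 300) -> list:
    """Intervals slept on consecutive failures, starting from a fresh cache entry."""
    cache_entry = {}  # fresh entry: 'retryIntervalInSeconds' not yet written
    intervals = []
    for _ in range(failures):
        previous = cache_entry.get('retryIntervalInSeconds', 0)
        cache_entry['retryIntervalInSeconds'] = next_retry_interval(previous, base, cap)
        intervals.append(cache_entry['retryIntervalInSeconds'])
    return intervals


sequence = backoff_sequence(11)
print(sequence)  # [1, 2, 4, 8, 16, 32, 64, 128, 256, 300, 300]
```

Under these assumptions the first failure already sleeps for the base interval instead of 0, and the cap is reached on the tenth consecutive failure.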

Happy to provide more logs/details if useful. Thanks!
