
Python: flaky integration tests blocking merge queue (Foundry Hosting + Foundry embeddings) #5553

@moonbox3

Description


Two integration tests are flaking in the merge queue and blocking merges into main. Both are already marked @pytest.mark.flaky and have a 3-attempt retry policy, but all three attempts fail back-to-back. They have been temporarily skipped to unblock the queue (see follow-up PR) and need a real fix.

Observed failing on at least PRs #5301 and #5531 — these flakes are systemic, not PR-specific. Example failing run: https://github.com/microsoft/agent-framework/actions/runs/25083550113

Test 1 — test_temperature_and_max_tokens

  • Path: packages/foundry_hosting/tests/test_responses_int.py::TestOptions::test_temperature_and_max_tokens
  • Job: Python Tests - Foundry Hosting Integration

The test posts a /responses request with max_output_tokens=50 and asserts that the response contains exactly one message-type output:

output_messages = [o for o in body["output"] if o["type"] == "message"]
assert len(output_messages) == 1

It fails with assert 0 == 1: the response status is completed (HTTP 200), but the output array contains no type == "message" entries. The captured server log shows output_count=1, so an output item exists; it's just not a message (most likely a lone reasoning item, with the visible message cut off by the 50-token cap). All three retry attempts failed.

Suggested fix: raise max_output_tokens so the model has room to emit a user-visible message after any reasoning output, or relax the assertion to allow any output type while still verifying status/limits were honored.
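A minimal sketch of the relaxed-assertion variant, assuming the Responses API body shape quoted above; the usage-field access is illustrative, not the actual test code:

assert body["status"] == "completed"
assert len(body["output"]) >= 1  # some output item came back (reasoning or message)
output_messages = [o for o in body["output"] if o["type"] == "message"]
if output_messages:
    assert len(output_messages) == 1  # at most one visible message is still expected
# the token cap must be honored regardless of which output types were emitted
assert body["usage"]["output_tokens"] <= 50

Raising max_output_tokens is the lower-risk variant: it keeps the strict one-message assertion and simply gives reasoning models headroom to emit a message after their reasoning item.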

Test 2 — test_text_embedding_live

  • Path: packages/foundry/tests/foundry/test_foundry_embedding_client.py::TestFoundryEmbeddingIntegration::test_text_embedding_live
  • Job: Python Integration Tests - Foundry

Calls FoundryEmbeddingClient().get_embeddings(["Hello, world!"]) against the live endpoint. Fails with:

json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

The Azure SDK got an empty response body from azure.ai.inference and tried to json.loads(''). All three retry attempts failed. This looks like a transient endpoint issue, or a missing-credential / 204 response that the client isn't surfacing as a clearer error.

Suggested fix: investigate why the embeddings endpoint returns an empty body in CI (auth? region? service-side flake?); also consider catching the empty-response case in the client and raising a typed error rather than letting json.loads blow up.
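A hedged sketch of that guard, assuming the client sees the raw body as text before decoding; the helper and exception names here are hypothetical, not part of the agent-framework or Azure SDK API:

import json


class EmptyEmbeddingResponseError(RuntimeError):
    """Raised when the embeddings endpoint returns a success status with no body."""


def decode_embedding_body(raw: str) -> dict:
    # Surface the empty-body case as a typed, actionable error instead of
    # letting json.loads raise a bare JSONDecodeError three retries in a row.
    if not raw.strip():
        raise EmptyEmbeddingResponseError(
            "Embeddings endpoint returned an empty body; "
            "check credentials, deployment region, and service health."
        )
    return json.loads(raw)

This would turn the CI failure into a message that points at auth or service state rather than a JSON parse traceback.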

Why this also keeps biting unrelated PRs

Both tests are gated by the core path filter in .github/workflows/python-merge-tests.yml:

core:
  - 'python/packages/core/agent_framework/_*.py'
  - 'python/packages/core/agent_framework/_workflows/**'
  - 'python/packages/core/agent_framework/exceptions.py'
  - 'python/packages/core/agent_framework/observability.py'

Any change under _workflows/** triggers every provider's live integration suite via coreChanged == 'true', so a workflows-only PR ends up gated on Foundry / Foundry Hosting / OpenAI / etc. flakes. Worth a follow-up to narrow this filter (e.g. split workflows out of core) once the flakes themselves are fixed.
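One possible shape for the narrowed filter, splitting workflows out of core (illustrative; the new key would need matching coreChanged-style logic in the jobs that consume it):

core:
  - 'python/packages/core/agent_framework/_*.py'
  - 'python/packages/core/agent_framework/exceptions.py'
  - 'python/packages/core/agent_framework/observability.py'
workflows:
  - 'python/packages/core/agent_framework/_workflows/**'

With that split, a workflows-only PR could gate on workflow-focused suites instead of fanning out to every provider's live integration job.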

Action items

  • Fix test_temperature_and_max_tokens (Foundry Hosting)
  • Fix test_text_embedding_live (Foundry embeddings)
  • Re-enable both tests (remove @pytest.mark.skip)
  • (Optional follow-up) Narrow the core path filter so workflows-only PRs don't fan out to every provider
