Python: Enable Ollama integration tests in CI and rename report #5454
Conversation
…n Test Report

- Install Ollama, cache models (qwen2.5:0.5b + nomic-embed-text), and start server in the Misc integration job for both workflow files
- Set OLLAMA_MODEL and OLLAMA_EMBEDDING_MODEL env vars so the 5 Ollama tests are no longer skipped
- Rename Flaky Test Report to Integration Test Report throughout (job names, artifact names, cache keys, file names, script titles/docstrings)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Automated Code Review
Reviewers: 4 | Confidence: 91%
✓ Correctness
This PR makes two changes: (1) adds Ollama setup (install, cache, start, pull models) to the integration and merge test CI workflows, and (2) renames the 'Flaky Test Report' job/artifacts/caches to 'Integration Test Report' across both workflows and the Python reporting scripts. Both changes are straightforward and consistent. The Ollama startup loop correctly waits up to 30 seconds before proceeding, and if it fails, the subsequent `ollama pull` will surface a clear CI error. The rename is applied consistently across job IDs, cache keys, file paths, artifact names, and code strings. No other jobs reference the renamed job ID in their `needs:` blocks, so no dependency breakage. The `scripts/flaky_report/` module directory was intentionally not renamed, and module-path references remain correct.
✓ Security &amp; Reliability
This PR adds Ollama model setup to integration and merge test CI workflows and renames the 'flaky test report' to 'integration test report'. The changes are low-risk. The Ollama install uses the standard `curl | sh` CI pattern over HTTPS, and the startup polling loop has an implicit failure mode (if Ollama never starts, `ollama pull` fails under `set -e`). The cache key rename from `flaky-report-history-*` to `integration-report-history-*` will orphan existing cache entries, meaning the first run loses historical trend data — this is expected but worth noting. No security or reliability bugs found.
✓ Test Coverage
This PR makes two changes: (1) adds Ollama setup to CI workflows so existing Ollama integration tests run against a real instance, and (2) renames 'Flaky Test Report' to 'Integration Test Report' in workflows and the flaky_report scripts. For (1), existing tests in `test_ollama_chat_client.py` and `test_ollama_embedding_client.py` already gate on the `OLLAMA_MODEL`/`OLLAMA_EMBEDDING_MODEL` env vars, so enabling those in CI is correctly covered by pre-existing tests. For (2), the `python/scripts/flaky_report/` module has no unit tests at all — this is a pre-existing gap, not introduced by this PR. The string changes are purely cosmetic (report titles, filenames, docstrings) and low-risk. No new behavior is introduced that lacks test coverage.
✓ Design Approach
I did not find a blocking design flaw in the report renaming; it stays within the existing aggregate-report contract. The main design concern is in the new Ollama CI bootstrap: hard-coding specific model tags directly into the workflows couples the test infrastructure to one chosen local model configuration instead of keeping model selection in environment-level configuration like the other providers.
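One way to reduce that coupling, sketched below with assumed step layout, is to declare the model tags once in workflow-level `env:` so the setup steps and the test env vars read from a single source of truth:

```yaml
# Illustrative sketch, not the current workflow contents.
env:
  OLLAMA_MODEL: qwen2.5:0.5b
  OLLAMA_EMBEDDING_MODEL: nomic-embed-text

# ...later, the pull step would reference the shared values:
#   run: |
#     ollama pull "$OLLAMA_MODEL"
#     ollama pull "$OLLAMA_EMBEDDING_MODEL"
```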
Suggestions
- The `python/scripts/flaky_report/` module (`aggregate.py`, `__main__.py`) has zero unit tests. While this is a pre-existing gap and the current PR only changes display strings, consider adding basic tests for `generate_trend_report()`, `load_current_run()`, and the CLI entry point to prevent regressions in future changes.
Automated review by giles17's agents
Pull request overview
Enables Ollama-backed Python integration tests in CI by provisioning an Ollama server + models during the “Misc integration” jobs, and renames the aggregated trend report from “Flaky Test Report” to “Integration Test Report” to reflect its broader scope.
Changes:
- Add Ollama install/start/model-pull steps and set `OLLAMA_MODEL`/`OLLAMA_EMBEDDING_MODEL` in CI workflows.
- Rename the trend report job/artifacts/history/output filenames from “flaky” to “integration”.
- Update report tool docstrings/title text to match the new report naming.
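Per the PR description, the pulled models are also cached between runs. A cache step like the following would do that (step name and cache key are assumptions, not copied from the workflows):

```yaml
# Illustrative sketch of the model cache step.
- name: Cache Ollama models
  uses: actions/cache@v4
  with:
    path: ~/.ollama/models
    key: ollama-models-${{ runner.os }}-qwen2.5-0.5b-nomic-embed-text
```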
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| `.github/workflows/python-merge-tests.yml` | Installs/starts Ollama for misc integration job; renames report job/artifacts/cache/history/output. |
| `.github/workflows/python-integration-tests.yml` | Mirrors the same Ollama setup + report rename in the dedicated integration workflow. |
| `python/scripts/flaky_report/aggregate.py` | Updates the generated markdown report title string. |
| `python/scripts/flaky_report/__main__.py` | Updates CLI docstring/example to the new “integration” naming. |
| `python/scripts/flaky_report/__init__.py` | Updates module docstring to “integration test report”. |
```yaml
          python-version: ${{ env.UV_PYTHON }}
          os: ${{ runner.os }}
      - name: Install Ollama
        run: curl -fsSL https://ollama.com/install.sh | sh
```
The Ollama installer is executed via curl ... | sh, which is an unpinned remote script and introduces a supply-chain risk in CI. Consider pinning to a specific Ollama release (download a versioned artifact) and verifying its checksum/signature before installation, or use a package-manager-based install when available.
Suggested change:

```diff
-        run: curl -fsSL https://ollama.com/install.sh | sh
+        run: |
+          set -euo pipefail
+          OLLAMA_VERSION="v0.5.7"
+          OLLAMA_ARCHIVE="ollama-linux-amd64.tgz"
+          OLLAMA_BASE_URL="https://github.com/ollama/ollama/releases/download/${OLLAMA_VERSION}"
+          curl -fsSLo "${OLLAMA_ARCHIVE}" "${OLLAMA_BASE_URL}/${OLLAMA_ARCHIVE}"
+          curl -fsSLo sha256sums.txt "${OLLAMA_BASE_URL}/sha256sums.txt"
+          grep " ${OLLAMA_ARCHIVE}$" sha256sums.txt | sha256sum -c -
+          sudo tar -C /usr/local -xzf "${OLLAMA_ARCHIVE}"
+          rm -f "${OLLAMA_ARCHIVE}" sha256sums.txt
```
```yaml
        run: |
          ollama serve &
          for i in $(seq 1 30); do
            if curl -sf http://localhost:11434/api/tags > /dev/null 2>&1; then
              break
            fi
            sleep 1
          done
          ollama pull qwen2.5:0.5b
          ollama pull nomic-embed-text
```
The readiness loop will fall through after 30 seconds even if the Ollama server never becomes reachable, and then proceeds to `ollama pull`, which can create flaky failures if startup is slow. After the loop, explicitly validate the server is up (and fail with a clear message if not), or exit early when the timeout is reached (similar to the Cosmos emulator readiness check used elsewhere in this workflow).
```yaml
          python-version: ${{ env.UV_PYTHON }}
          os: ${{ runner.os }}
      - name: Install Ollama
        run: curl -fsSL https://ollama.com/install.sh | sh
```
The Ollama installer is executed via curl ... | sh, which is an unpinned remote script and introduces a supply-chain risk in CI. Consider pinning to a specific Ollama release (download a versioned artifact) and verifying its checksum/signature before installation, or use a package-manager-based install when available.
Suggested change:

```diff
-        run: curl -fsSL https://ollama.com/install.sh | sh
+        shell: bash
+        run: |
+          set -euo pipefail
+          OLLAMA_VERSION="0.5.7"
+          OLLAMA_ASSET="ollama-linux-amd64.tgz"
+          OLLAMA_BASE_URL="https://github.com/ollama/ollama/releases/download/v${OLLAMA_VERSION}"
+          curl -fsSLo "/tmp/${OLLAMA_ASSET}" "${OLLAMA_BASE_URL}/${OLLAMA_ASSET}"
+          curl -fsSLo /tmp/ollama-checksums.txt "${OLLAMA_BASE_URL}/sha256sum.txt"
+          grep " ${OLLAMA_ASSET}\$" /tmp/ollama-checksums.txt | sed "s# ${OLLAMA_ASSET}\$# /tmp/${OLLAMA_ASSET}#" | sha256sum --check -
+          sudo tar -C /usr -xzf "/tmp/${OLLAMA_ASSET}"
```
```shell
ollama serve &
for i in $(seq 1 30); do
  if curl -sf http://localhost:11434/api/tags > /dev/null 2>&1; then
    break
  fi
  sleep 1
done
ollama pull qwen2.5:0.5b
ollama pull nomic-embed-text
```
The readiness loop will fall through after 30 seconds even if the Ollama server never becomes reachable, and then proceeds to `ollama pull`, which can create flaky failures if startup is slow. After the loop, explicitly validate the server is up (and fail with a clear message if not), or exit early when the timeout is reached (similar to the Cosmos emulator readiness check used elsewhere in this workflow).
Suggested change:

```diff
-ollama serve &
-for i in $(seq 1 30); do
-  if curl -sf http://localhost:11434/api/tags > /dev/null 2>&1; then
-    break
-  fi
-  sleep 1
-done
-ollama pull qwen2.5:0.5b
-ollama pull nomic-embed-text
+set -euo pipefail
+ollama serve &
+for i in $(seq 1 30); do
+  if curl -sf http://localhost:11434/api/tags > /dev/null 2>&1; then
+    ollama pull qwen2.5:0.5b
+    ollama pull nomic-embed-text
+    exit 0
+  fi
+  sleep 1
+done
+echo "Ollama server did not become ready within 30 seconds." >&2
+exit 1
```
```diff
 """CLI entry point for the integration test report tool.

 Usage:
     uv run python -m scripts.flaky_report <reports-dir> <history-file> <output-file>

 Example (from python/ directory):
     uv run python -m scripts.flaky_report \\
-        ../flaky-reports/ \\
-        flaky-report-history.json \\
-        flaky-test-report.md
+        ../test-results/ \\
+        integration-report-history.json \\
+        integration-test-report.md
```
The docstring now calls this an “integration test report tool”, but the module entry point remains `scripts.flaky_report`. To avoid confusion for users, consider adding a brief note that the module/package name is kept for backward compatibility even though the report scope has expanded.
The 0.5b model was too small to reliably follow simple prompts like 'Say Hello World', causing test assertion failures. The 1.5b model follows instructions more reliably while still being small enough for fast CI pulls (~1GB). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Remove hard skips from 4 tests in test_11_workflow_parallel.py
- Remove hard skip from test_conditional_branching in test_06_dt_multi_agent_orchestration_conditionals.py
- Increase pytest --timeout from 360 to 480 for Functions+DurableTask CI job
- Updated in both python-merge-tests.yml and python-integration-tests.yml

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- test_11_workflow_parallel (4 tests): xdist worker crashes during execution
- test_conditional_branching: orchestration fails with RuntimeError, not a timeout
- Keep 480s timeout bump for remaining Functions tests

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…enAI

Both samples passed a bearer token provider via api_key= which caused the client to route to api.openai.com instead of Azure OpenAI, resulting in 401 Unauthorized. Changed to credential= which correctly triggers Azure routing and picks up AZURE_OPENAI_ENDPOINT from the environment.

- samples/azure_functions/11_workflow_parallel/function_app.py: 1 fix
- samples/durabletask/06_multi_agent_orchestration_conditionals/worker.py: 2 fixes
- Re-enable 4 parallel workflow tests and 1 conditional branching test

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The 4 parallel workflow tests crash because xdist worksteal distributes them across separate workers, each spawning its own func process against shared emulators. Auth fix (api_key->credential) was valid and stays. test_conditional_branching now passes with the auth fix. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Wrap skip reason strings to stay within 120 char line limit. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Kill any auto-started Ollama before launching serve (fixes port conflict: 'address already in use')
- Retry ollama pull up to 3 times with 15s backoff (fixes 429 rate limit failures)
- Applied to both python-merge-tests.yml and python-integration-tests.yml

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
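The pull-retry hardening described in this commit can be sketched as a small shell helper. The generic `retry` wrapper below is illustrative; the workflows inline the equivalent loop:

```shell
# retry DELAY CMD...: run CMD up to 3 times, sleeping DELAY seconds between
# failed attempts; returns the last exit status if all attempts fail.
retry() {
  delay="$1"; shift
  attempt=1
  while true; do
    "$@" && return 0
    status=$?
    if [ "$attempt" -ge 3 ]; then
      return "$status"
    fi
    echo "attempt ${attempt} failed; retrying in ${delay}s..." >&2
    attempt=$((attempt + 1))
    sleep "$delay"
  done
}

# In CI this would wrap the model pulls, e.g.:
#   pkill -x ollama 2>/dev/null || true   # stop any auto-started instance first
#   retry 15 ollama pull qwen2.5:0.5b
#   retry 15 ollama pull nomic-embed-text
```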
- Foundry agent: add allow_preview=True to custom client test
- Foundry hosting: raise max_output_tokens 50->200, add temperature, relax assertion in test_temperature_and_max_tokens
- Foundry embedding: update skip reason with root cause (endpoint mismatch)
- OpenAI file search: fix vector store indexing race condition by polling file_counts before querying; fix get_streaming_response -> get_response(stream=True)
- Azure OpenAI file search: remove skip (transient 500 resolved)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
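The file-search race fix described above (poll `file_counts` until indexing completes before querying) boils down to a generic wait-until helper like this sketch; the names are illustrative, and the real test polls the OpenAI vector store object:

```python
import time

def wait_until(check, timeout=30.0, interval=0.5):
    """Poll `check()` until it returns a truthy value or `timeout` elapses.

    Returns the truthy value on success, or None on timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = check()
        if result:
            return result
        if time.monotonic() >= deadline:
            return None
        time.sleep(interval)

# Hypothetical usage against a vector store (API shape assumed):
#   wait_until(lambda: client.vector_stores.retrieve(vs_id).file_counts.completed == n)
```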
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Motivation and Context
The Misc integration CI job already runs `packages/ollama/tests`, but all 5 Ollama tests were always skipped because no Ollama server was available and the `OLLAMA_MODEL`/`OLLAMA_EMBEDDING_MODEL` env vars were empty. This PR enables those tests by installing Ollama in CI, and also renames the report job from "Flaky Test Report" to "Integration Test Report" to better reflect its scope.

Description
Ollama CI Enablement
Adds Ollama setup steps to the Misc integration job in both workflow files:
- Uses `actions/cache@v4` to cache `~/.ollama/models`, so model pulls only happen on the first run (~500MB). Subsequent runs restore from cache in ~10-15 seconds.
- Starts `ollama serve` in the background, waits for it to be ready, then pulls `qwen2.5:0.5b` (chat, ~400MB) and `nomic-embed-text` (embedding, ~50MB). Small models were chosen to minimize CI time while still exercising the integration path.
- Sets `OLLAMA_MODEL=qwen2.5:0.5b` and `OLLAMA_EMBEDDING_MODEL=nomic-embed-text` so the skip conditions in the test files are satisfied.

This enables 5 previously-skipped tests:
- `test_cmc_integration_with_chat_completion`
- `test_cmc_integration_with_tool_call`
- `test_cmc_streaming_integration_with_tool_call`
- `test_cmc_streaming_integration_with_chat_completion`
- `test_ollama_embedding_integration`

Report Rename
Renames "Flaky Test Report" to "Integration Test Report" throughout the workflow files and report scripts.
Files Changed
- `.github/workflows/python-merge-tests.yml` — Ollama setup steps + report rename
- `.github/workflows/python-integration-tests.yml` — Same changes (kept in sync)
- `python/scripts/flaky_report/__init__.py` — Docstring update
- `python/scripts/flaky_report/__main__.py` — Docstring and example update
- `python/scripts/flaky_report/aggregate.py` — Report title update

Contribution Checklist