Python: Enable Ollama integration tests in CI and rename report #5454
Conversation
…n Test Report

- Install Ollama, cache models (qwen2.5:0.5b + nomic-embed-text), and start server in the Misc integration job for both workflow files
- Set OLLAMA_MODEL and OLLAMA_EMBEDDING_MODEL env vars so the 5 Ollama tests are no longer skipped
- Rename Flaky Test Report to Integration Test Report throughout (job names, artifact names, cache keys, file names, script titles/docstrings)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Automated Code Review
Reviewers: 4 | Confidence: 91%
✓ Correctness
This PR makes two changes: (1) adds Ollama setup (install, cache, start, pull models) to the integration and merge test CI workflows, and (2) renames the 'Flaky Test Report' job/artifacts/caches to 'Integration Test Report' across both workflows and the Python reporting scripts. Both changes are straightforward and consistent. The Ollama startup loop correctly waits up to 30 seconds before proceeding, and if it fails, the subsequent `ollama pull` will surface a clear CI error. The rename is applied consistently across job IDs, cache keys, file paths, artifact names, and code strings. No other jobs reference the renamed job ID in their `needs:` blocks, so no dependency breakage. The `scripts/flaky_report/` module directory was intentionally not renamed, and module-path references remain correct.
✓ Security &amp; Reliability
This PR adds Ollama model setup to integration and merge test CI workflows and renames the 'flaky test report' to 'integration test report'. The changes are low-risk. The Ollama install uses the standard `curl | sh` CI pattern over HTTPS, and the startup polling loop has an implicit failure mode (if Ollama never starts, `ollama pull` fails under `set -e`). The cache key rename from `flaky-report-history-*` to `integration-report-history-*` will orphan existing cache entries, meaning the first run loses historical trend data — this is expected but worth noting. No security or reliability bugs found.
✓ Test Coverage
This PR makes two changes: (1) adds Ollama setup to CI workflows so existing Ollama integration tests run against a real instance, and (2) renames 'Flaky Test Report' to 'Integration Test Report' in workflows and the flaky_report scripts. For (1), existing tests in `test_ollama_chat_client.py` and `test_ollama_embedding_client.py` already gate on the `OLLAMA_MODEL`/`OLLAMA_EMBEDDING_MODEL` env vars, so enabling those in CI is correctly covered by pre-existing tests. For (2), the `python/scripts/flaky_report/` module has no unit tests at all — this is a pre-existing gap, not introduced by this PR. The string changes are purely cosmetic (report titles, filenames, docstrings) and low-risk. No new behavior is introduced that lacks test coverage.
✓ Design Approach
I did not find a blocking design flaw in the report renaming; it stays within the existing aggregate-report contract. The main design concern is in the new Ollama CI bootstrap: hard-coding specific model tags directly into the workflows couples the test infrastructure to one chosen local model configuration instead of keeping model selection in environment-level configuration like the other providers.
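One way to reduce that coupling, sketched below with assumed step layout, is to declare the model tags once in workflow-level `env:` so the setup steps and the test env vars read from a single source of truth:

```yaml
# Illustrative sketch, not the current workflow contents.
env:
  OLLAMA_MODEL: qwen2.5:0.5b
  OLLAMA_EMBEDDING_MODEL: nomic-embed-text

# ...later, the pull step would reference the shared values:
#   run: |
#     ollama pull "$OLLAMA_MODEL"
#     ollama pull "$OLLAMA_EMBEDDING_MODEL"
```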
Suggestions
- The `python/scripts/flaky_report/` module (`aggregate.py`, `__main__.py`) has zero unit tests. While this is a pre-existing gap and the current PR only changes display strings, consider adding basic tests for `generate_trend_report()`, `load_current_run()`, and the CLI entry point to prevent regressions in future changes.
Automated review by giles17's agents
Pull request overview
Enables Ollama-backed Python integration tests in CI by provisioning an Ollama server + models during the “Misc integration” jobs, and renames the aggregated trend report from “Flaky Test Report” to “Integration Test Report” to reflect its broader scope.
Changes:
- Add Ollama install/start/model-pull steps and set `OLLAMA_MODEL`/`OLLAMA_EMBEDDING_MODEL` in CI workflows.
- Rename the trend report job/artifacts/history/output filenames from “flaky” to “integration”.
- Update report tool docstrings/title text to match the new report naming.
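Per the PR description, the pulled models are also cached between runs. A cache step like the following would do that (step name and cache key are assumptions, not copied from the workflows):

```yaml
# Illustrative sketch of the model cache step.
- name: Cache Ollama models
  uses: actions/cache@v4
  with:
    path: ~/.ollama/models
    key: ollama-models-${{ runner.os }}-qwen2.5-0.5b-nomic-embed-text
```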
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| `.github/workflows/python-merge-tests.yml` | Installs/starts Ollama for misc integration job; renames report job/artifacts/cache/history/output. |
| `.github/workflows/python-integration-tests.yml` | Mirrors the same Ollama setup + report rename in the dedicated integration workflow. |
| `python/scripts/flaky_report/aggregate.py` | Updates the generated markdown report title string. |
| `python/scripts/flaky_report/__main__.py` | Updates CLI docstring/example to the new “integration” naming. |
| `python/scripts/flaky_report/__init__.py` | Updates module docstring to “integration test report”. |
```yaml
          python-version: ${{ env.UV_PYTHON }}
          os: ${{ runner.os }}
      - name: Install Ollama
        run: curl -fsSL https://ollama.com/install.sh | sh
```
The Ollama installer is executed via curl ... | sh, which is an unpinned remote script and introduces a supply-chain risk in CI. Consider pinning to a specific Ollama release (download a versioned artifact) and verifying its checksum/signature before installation, or use a package-manager-based install when available.
Suggested change:

```diff
-        run: curl -fsSL https://ollama.com/install.sh | sh
+        run: |
+          set -euo pipefail
+          OLLAMA_VERSION="v0.5.7"
+          OLLAMA_ARCHIVE="ollama-linux-amd64.tgz"
+          OLLAMA_BASE_URL="https://github.com/ollama/ollama/releases/download/${OLLAMA_VERSION}"
+          curl -fsSLo "${OLLAMA_ARCHIVE}" "${OLLAMA_BASE_URL}/${OLLAMA_ARCHIVE}"
+          curl -fsSLo sha256sums.txt "${OLLAMA_BASE_URL}/sha256sums.txt"
+          grep " ${OLLAMA_ARCHIVE}$" sha256sums.txt | sha256sum -c -
+          sudo tar -C /usr/local -xzf "${OLLAMA_ARCHIVE}"
+          rm -f "${OLLAMA_ARCHIVE}" sha256sums.txt
```
```yaml
        run: |
          ollama serve &
          for i in $(seq 1 30); do
            if curl -sf http://localhost:11434/api/tags > /dev/null 2>&1; then
              break
            fi
            sleep 1
          done
          ollama pull qwen2.5:0.5b
          ollama pull nomic-embed-text
```
The readiness loop will fall through after 30 seconds even if the Ollama server never becomes reachable, and then proceeds to `ollama pull`, which can create flaky failures if startup is slow. After the loop, explicitly validate the server is up (and fail with a clear message if not), or exit early when the timeout is reached (similar to the Cosmos emulator readiness check used elsewhere in this workflow).
```yaml
          python-version: ${{ env.UV_PYTHON }}
          os: ${{ runner.os }}
      - name: Install Ollama
        run: curl -fsSL https://ollama.com/install.sh | sh
```
The Ollama installer is executed via curl ... | sh, which is an unpinned remote script and introduces a supply-chain risk in CI. Consider pinning to a specific Ollama release (download a versioned artifact) and verifying its checksum/signature before installation, or use a package-manager-based install when available.
Suggested change:

```diff
-        run: curl -fsSL https://ollama.com/install.sh | sh
+        shell: bash
+        run: |
+          set -euo pipefail
+          OLLAMA_VERSION="0.5.7"
+          OLLAMA_ASSET="ollama-linux-amd64.tgz"
+          OLLAMA_BASE_URL="https://github.com/ollama/ollama/releases/download/v${OLLAMA_VERSION}"
+          curl -fsSLo "/tmp/${OLLAMA_ASSET}" "${OLLAMA_BASE_URL}/${OLLAMA_ASSET}"
+          curl -fsSLo /tmp/ollama-checksums.txt "${OLLAMA_BASE_URL}/sha256sum.txt"
+          grep " ${OLLAMA_ASSET}\$" /tmp/ollama-checksums.txt | sed "s# ${OLLAMA_ASSET}\$# /tmp/${OLLAMA_ASSET}#" | sha256sum --check -
+          sudo tar -C /usr -xzf "/tmp/${OLLAMA_ASSET}"
```
```shell
ollama serve &
for i in $(seq 1 30); do
  if curl -sf http://localhost:11434/api/tags > /dev/null 2>&1; then
    break
  fi
  sleep 1
done
ollama pull qwen2.5:0.5b
ollama pull nomic-embed-text
```
The readiness loop will fall through after 30 seconds even if the Ollama server never becomes reachable, and then proceeds to `ollama pull`, which can create flaky failures if startup is slow. After the loop, explicitly validate the server is up (and fail with a clear message if not), or exit early when the timeout is reached (similar to the Cosmos emulator readiness check used elsewhere in this workflow).
Suggested change:

```diff
-ollama serve &
-for i in $(seq 1 30); do
-  if curl -sf http://localhost:11434/api/tags > /dev/null 2>&1; then
-    break
-  fi
-  sleep 1
-done
-ollama pull qwen2.5:0.5b
-ollama pull nomic-embed-text
+set -euo pipefail
+ollama serve &
+for i in $(seq 1 30); do
+  if curl -sf http://localhost:11434/api/tags > /dev/null 2>&1; then
+    ollama pull qwen2.5:0.5b
+    ollama pull nomic-embed-text
+    exit 0
+  fi
+  sleep 1
+done
+echo "Ollama server did not become ready within 30 seconds." >&2
+exit 1
```
```diff
 """CLI entry point for the integration test report tool.

 Usage:
     uv run python -m scripts.flaky_report <reports-dir> <history-file> <output-file>

 Example (from python/ directory):
     uv run python -m scripts.flaky_report \\
-        ../flaky-reports/ \\
-        flaky-report-history.json \\
-        flaky-test-report.md
+        ../test-results/ \\
+        integration-report-history.json \\
+        integration-test-report.md
```
The docstring now calls this an “integration test report tool”, but the module entry point remains `scripts.flaky_report`. To avoid confusion for users, consider adding a brief note that the module/package name is kept for backward compatibility even though the report scope has expanded.
The 0.5b model was too small to reliably follow simple prompts like 'Say Hello World', causing test assertion failures. The 1.5b model follows instructions more reliably while still being small enough for fast CI pulls (~1GB). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Remove hard skips from 4 tests in test_11_workflow_parallel.py
- Remove hard skip from test_conditional_branching in test_06_dt_multi_agent_orchestration_conditionals.py
- Increase pytest --timeout from 360 to 480 for Functions+DurableTask CI job
- Updated in both python-merge-tests.yml and python-integration-tests.yml

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- test_11_workflow_parallel (4 tests): xdist worker crashes during execution
- test_conditional_branching: orchestration fails with RuntimeError, not a timeout
- Keep 480s timeout bump for remaining Functions tests

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…enAI

Both samples passed a bearer token provider via api_key= which caused the client to route to api.openai.com instead of Azure OpenAI, resulting in 401 Unauthorized. Changed to credential= which correctly triggers Azure routing and picks up AZURE_OPENAI_ENDPOINT from the environment.

- samples/azure_functions/11_workflow_parallel/function_app.py: 1 fix
- samples/durabletask/06_multi_agent_orchestration_conditionals/worker.py: 2 fixes
- Re-enable 4 parallel workflow tests and 1 conditional branching test

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The 4 parallel workflow tests crash because xdist worksteal distributes them across separate workers, each spawning its own func process against shared emulators. Auth fix (api_key->credential) was valid and stays. test_conditional_branching now passes with the auth fix. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Wrap skip reason strings to stay within 120 char line limit. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Kill any auto-started Ollama before launching serve (fixes port conflict: 'address already in use')
- Retry ollama pull up to 3 times with 15s backoff (fixes 429 rate limit failures)
- Applied to both python-merge-tests.yml and python-integration-tests.yml

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
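The pull-retry hardening described in this commit can be sketched as a small shell helper. The generic `retry` wrapper below is illustrative; the workflows inline the equivalent loop:

```shell
# retry DELAY CMD...: run CMD up to 3 times, sleeping DELAY seconds between
# failed attempts; returns the last exit status if all attempts fail.
retry() {
  delay="$1"; shift
  attempt=1
  while true; do
    "$@" && return 0
    status=$?
    if [ "$attempt" -ge 3 ]; then
      return "$status"
    fi
    echo "attempt ${attempt} failed; retrying in ${delay}s..." >&2
    attempt=$((attempt + 1))
    sleep "$delay"
  done
}

# In CI this would wrap the model pulls, e.g.:
#   pkill -x ollama 2>/dev/null || true   # stop any auto-started instance first
#   retry 15 ollama pull qwen2.5:0.5b
#   retry 15 ollama pull nomic-embed-text
```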
- Foundry agent: add allow_preview=True to custom client test
- Foundry hosting: raise max_output_tokens 50->200, add temperature, relax assertion in test_temperature_and_max_tokens
- Foundry embedding: update skip reason with root cause (endpoint mismatch)
- OpenAI file search: fix vector store indexing race condition by polling file_counts before querying; fix get_streaming_response -> get_response(stream=True)
- Azure OpenAI file search: remove skip (transient 500 resolved)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
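The file-search race fix described above (poll `file_counts` until indexing completes before querying) boils down to a generic wait-until helper like this sketch; the names are illustrative, and the real test polls the OpenAI vector store object:

```python
import time

def wait_until(check, timeout=30.0, interval=0.5):
    """Poll `check()` until it returns a truthy value or `timeout` elapses.

    Returns the truthy value on success, or None on timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = check()
        if result:
            return result
        if time.monotonic() >= deadline:
            return None
        time.sleep(interval)

# Hypothetical usage against a vector store (API shape assumed):
#   wait_until(lambda: client.vector_stores.retrieve(vs_id).file_counts.completed == n)
```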
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Motivation and Context
The Misc integration CI job already runs `packages/ollama/tests`, but all 5 Ollama tests were always skipped because no Ollama server was available and the `OLLAMA_MODEL`/`OLLAMA_EMBEDDING_MODEL` env vars were empty. This PR enables those tests by installing Ollama in CI, and also renames the report job from "Flaky Test Report" to "Integration Test Report" to better reflect its scope.

Description
Ollama CI Enablement
Adds Ollama setup steps to the Misc integration job in both workflow files:
- Uses `actions/cache@v4` to cache `~/.ollama/models`, so model pulls only happen on the first run (~500MB). Subsequent runs restore from cache in ~10-15 seconds.
- Starts `ollama serve` in the background, waits for it to be ready, then pulls `qwen2.5:0.5b` (chat, ~400MB) and `nomic-embed-text` (embedding, ~50MB). Small models were chosen to minimize CI time while still exercising the integration path.
- Sets `OLLAMA_MODEL=qwen2.5:0.5b` and `OLLAMA_EMBEDDING_MODEL=nomic-embed-text` so the skip conditions in the test files are satisfied.

This enables 5 previously-skipped tests:
- `test_cmc_integration_with_chat_completion`
- `test_cmc_integration_with_tool_call`
- `test_cmc_streaming_integration_with_tool_call`
- `test_cmc_streaming_integration_with_chat_completion`
- `test_ollama_embedding_integration`

Report Rename
Renames "Flaky Test Report" to "Integration Test Report" throughout the workflow files and report scripts.
Files Changed
- `.github/workflows/python-merge-tests.yml` — Ollama setup steps + report rename
- `.github/workflows/python-integration-tests.yml` — Same changes (kept in sync)
- `python/scripts/flaky_report/__init__.py` — Docstring update
- `python/scripts/flaky_report/__main__.py` — Docstring and example update
- `python/scripts/flaky_report/aggregate.py` — Report title update

Contribution Checklist