Pull request overview
This PR adds a new indexability test suite that validates AI agents (Claude, ChatGPT) and web crawlers can access all documentation content on the live Weaviate docs site. The tests are split into HTML structure tests (no API keys needed) and agent-based tests (requiring Anthropic/OpenAI API keys), with a corresponding CI workflow that runs weekly.
Changes:
- New test suite (`test_docs_indexability.py`) with ~48 HTML structure tests and 6 agent tests covering page status codes, meta tags, heading hierarchy, tabbed content, code blocks, collapsibles, images, `llms.txt`, and `sitemap.xml`
- New CI workflow (`.github/workflows/indexability_tests.yml`) running weekly on Sunday, with corresponding pytest markers and dependencies
- Supporting files including README documentation and a debug utility script
Reviewed changes
Copilot reviewed 6 out of 7 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| `tests/test_docs_indexability.py` | New test suite with HTML structure and AI agent indexability tests |
| `tests/README-INDEXABILITY.md` | Documentation for the indexability test suite |
| `.github/workflows/indexability_tests.yml` | Weekly CI workflow for running indexability tests |
| `pytest.ini` | Adds `indexability` and `indexability_agents` markers |
| `pyproject.toml` | Adds `anthropic`, `beautifulsoup4`, `openai` dependencies |
| `uv.lock` | Lock file updates for new dependencies |
| `tools/chatgpt_fetch_quickstart.py` | Debug script for inspecting ChatGPT `web_search` output |
    - name: Run agent tests
      id: agent-tests
      continue-on-error: true
      if: env.ANTHROPIC_API_KEY != '' && env.OPENAI_API_KEY != ''
The if condition references env.ANTHROPIC_API_KEY and env.OPENAI_API_KEY, but these environment variables are defined in this step's own env: block (lines 47-49). In GitHub Actions, the step-level env: is not available during evaluation of the same step's if: condition — the if: is evaluated before the step's environment is set up. Since these secrets are not defined in the workflow-level env: block either, both values will always be empty, and the agent tests will never run.
The `secrets` context cannot be used in a step-level `if:`, so referencing the secrets directly in the condition (`if: ${{ secrets.ANTHROPIC_API_KEY != '' && secrets.OPENAI_API_KEY != '' }}`) would fail workflow validation. Map both secrets in the job-level `env:` block instead and keep the existing condition:
    if: env.ANTHROPIC_API_KEY != '' && env.OPENAI_API_KEY != ''
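However the workflow condition ends up being written, the suite can also guard itself so that a missing key produces a skip rather than a failure. The following is a minimal sketch, not code from the PR: the helper name `agent_keys_present` and the `REQUIRED_KEYS` tuple are hypothetical, and only the two environment variable names come from the workflow.

```python
import os

# The two variable names come from the workflow; everything else is illustrative.
REQUIRED_KEYS = ("ANTHROPIC_API_KEY", "OPENAI_API_KEY")


def agent_keys_present(environ=None):
    """Return True only if every required API key is set and non-blank.

    `environ` defaults to os.environ; passing a plain dict makes the
    check easy to exercise without touching the real environment.
    """
    environ = os.environ if environ is None else environ
    return all(environ.get(name, "").strip() for name in REQUIRED_KEYS)
```

With pytest, a check like this could back a `pytest.mark.skipif(not agent_keys_present(), reason=...)` marker on the agent tests, so a local run without keys reports skips instead of errors.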
tests/README-INDEXABILITY.md
    - **Fetch quickstart code tabs** — fetches `/weaviate/quickstart` and must extract actual code lines for all 5 languages (Python, TypeScript, Go, Java, C#). Asserts on language-specific code tokens like `near_text` (Python), `nearText` (TS), `NearTextArgBuilder` (Go), etc.
    - Fetch a page with collapsible sections and read the content
    - Fetch `/llms.txt` and identify it as Weaviate documentation

    ### ChatGPT agent tests (Part 3)

    Uses GPT-4.1 Mini with the `web_search_preview` tool to verify ChatGPT can:

    - **Search quickstart code tabs** — searches for `/weaviate/quickstart` and must extract actual code lines for all 5 languages. Uses the same language-specific code token assertions as the Claude test.
The README documentation has multiple inaccuracies compared to the actual test code:
- It states the tests assert on "language-specific code tokens like `near_text` (Python), `nearText` (TS), `NearTextArgBuilder` (Go), etc." but the actual code asserts on vectorizer configuration lines like `Configure.Vectors.text2vec_weaviate()`, `vectors.text2VecWeaviate()`, etc. (see `QUICKSTART_VECTORIZER_LINES` in `test_docs_indexability.py` lines 297-303).
- Both lines 42 and 50 say the tests require "all 5 languages", but the ChatGPT test actually only requires 3 out of 5 (line 501 of `test_docs_indexability.py`).

These appear to be leftover from an earlier version of the tests and should be updated to match the current implementation.
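The difference between "all 5 languages" and "3 out of 5" comes down to a threshold on how many expected lines appear in the agent's output. The sketch below illustrates that kind of check under stated assumptions: only the two strings quoted in this review are taken from the PR, their assignment to Python and TypeScript is assumed, and the names `VECTORIZER_LINES`, `languages_found`, and `meets_threshold` are hypothetical stand-ins rather than the suite's actual `QUICKSTART_VECTORIZER_LINES` logic.

```python
# Only these two strings are quoted in the review; the language keys are an
# assumption, and the real mapping in the suite covers five languages.
VECTORIZER_LINES = {
    "python": "Configure.Vectors.text2vec_weaviate()",
    "typescript": "vectors.text2VecWeaviate()",
}


def languages_found(agent_output, expected=VECTORIZER_LINES):
    """Return the sorted languages whose expected line appears verbatim."""
    return sorted(lang for lang, line in expected.items() if line in agent_output)


def meets_threshold(agent_output, minimum, expected=VECTORIZER_LINES):
    """True if at least `minimum` expected lines were found in the output."""
    return len(languages_found(agent_output, expected)) >= minimum
```

A strict test would pass `minimum=len(expected)`, while a more lenient one would pass a smaller number. The README should state whichever value each test actually uses.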
    if: always()
    uses: ./.github/actions/handle-test-results
    with:
      test-outcome: ${{ steps.html-tests.outcome }}
The `test-outcome` only considers the HTML tests outcome (`steps.html-tests.outcome`), ignoring the agent tests outcome entirely. If the HTML tests pass but agent tests fail, the workflow will report success and no failure notification will be sent. Consider combining both outcomes, e.g., reporting failure if either step failed: `test-outcome: ${{ steps.html-tests.outcome == 'success' && (steps.agent-tests.outcome == 'success' || steps.agent-tests.outcome == 'skipped') && 'success' || 'failure' }}`
Suggested change:
    - test-outcome: ${{ steps.html-tests.outcome }}
    + test-outcome: ${{ steps.html-tests.outcome == 'success' && (steps.agent-tests.outcome == 'success' || steps.agent-tests.outcome == 'skipped') && 'success' || 'failure' }}
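The suggested expression packs a small decision rule into one line, which is easier to check when spelled out. Here is a sketch of the same logic in Python, using a hypothetical function name `combined_outcome`; the four outcome strings are the values a step's `outcome` can take in GitHub Actions.

```python
def combined_outcome(html_outcome, agent_outcome):
    """Mirror the suggested workflow expression.

    Reports 'success' only when the HTML tests succeeded and the agent
    tests either succeeded or were skipped; anything else is 'failure'.
    """
    agent_ok = agent_outcome in ("success", "skipped")
    return "success" if html_outcome == "success" and agent_ok else "failure"


# Print the full truth table for the four possible step outcomes.
for html in ("success", "failure", "cancelled", "skipped"):
    for agent in ("success", "failure", "cancelled", "skipped"):
        print(f"{html:>9} + {agent:<9} -> {combined_outcome(html, agent)}")
```

Note that under this rule a skipped or cancelled HTML step is also reported as a failure, which matches the expression as written.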
    import os
    import time
    from functools import lru_cache
`lru_cache` is imported but never used. The caching is implemented manually via the `_page_cache` dict. This unused import should be removed.
Suggested change:
    - from functools import lru_cache
    ("/weaviate/search/similarity", {"tabs", "code"}),
    ("/weaviate/search/hybrid", {"tabs", "code"}),
    ("/weaviate/connections/connect-cloud", {"tabs", "code"}),
    ("/weaviate/config-refs/collections", {"details", "table"}),
The `"table"` feature tag is declared in `TEST_PAGES` and documented in the README as an available feature tag, but no test uses it (there is no `"table" in f` filtering). This is a dead feature tag with no effect. Either add a corresponding test (e.g., `test_tables_present`, verifying that `<table>` elements exist on pages with this feature), or remove the tag from this entry and the README to avoid confusion.
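A test for the `"table"` tag would need two pieces: selecting the pages that carry the tag, and counting `<table>` elements in the fetched HTML. The sketch below shows both, as an illustration rather than the suite's implementation. It uses the standard-library `html.parser` instead of BeautifulSoup to stay dependency-free, and the helper names `pages_with_feature` and `count_tables` are hypothetical. The two `TEST_PAGES` entries are copied from the diff above.

```python
from html.parser import HTMLParser

# Two entries copied from the diff; the real list in the suite is longer.
TEST_PAGES = [
    ("/weaviate/search/hybrid", {"tabs", "code"}),
    ("/weaviate/config-refs/collections", {"details", "table"}),
]


class _TableCounter(HTMLParser):
    """Counts opening <table> tags while parsing."""

    def __init__(self):
        super().__init__()
        self.tables = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names before calling this hook.
        if tag == "table":
            self.tables += 1


def count_tables(html):
    """Return the number of <table> elements in an HTML string."""
    parser = _TableCounter()
    parser.feed(html)
    return parser.tables


def pages_with_feature(feature, pages=TEST_PAGES):
    """Return the paths whose feature set contains `feature`."""
    return [path for path, f in pages if feature in f]
```

A `test_tables_present` test could then be parametrized over `pages_with_feature("table")` and assert `count_tables(html) >= 1` for each fetched page.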
What's being changed:
Add a test suite that validates AI agents and web crawlers can access all documentation content on the live site.

HTML structure tests (`pytest -m indexability`, no API keys needed):
- … `<details>` sections, and images are present and non-empty
- `llms.txt` and `sitemap.xml` are accessible with substantial content

Agent tests (`pytest -m indexability_agents`, requires `ANTHROPIC_API_KEY` and `OPENAI_API_KEY`):
- Claude (`web_fetch`) fetches the quickstart page and extracts the exact vectorizer config line for all 5 languages (Python, TypeScript, Go, Java, C#)
- … `text2vec-contextionary` from inside a collapsible section
- … `llms.txt`
- ChatGPT (`web_search_preview`) finds the quickstart URL and identifies multi-language code tabs

Files:
- `tests/test_docs_indexability.py`
- `tests/README-INDEXABILITY.md`
- `.github/workflows/indexability_tests.yml`
- `pytest.ini`: `indexability` and `indexability_agents` markers
- `pyproject.toml`: `beautifulsoup4`, `anthropic`, `openai` deps
- `tools/chatgpt_fetch_quickstart.py`

Type of change:

How has this been tested?
- `uv run pytest -m indexability -v`: 47 HTML structure tests pass
- `uv run pytest -m indexability_agents -v`: Claude tests pass (all 5 vectorizer lines found); ChatGPT finds quickstart URL and languages but can only see the active tab's code (known `web_search_preview` limitation)