[Improvement] LLM indexability by g-despot · Pull Request #372 · weaviate/docs

g-despot · 2026-03-06T10:47:14Z

What's being changed:

Add a test suite that validates AI agents and web crawlers can access all documentation content on the live site.

HTML structure tests (`pytest -m indexability`, no API keys needed):

- All 10 representative pages return HTTP 200 with proper meta tags and heading hierarchy
- Tabbed code blocks render ALL tab panels in HTML (not just the active tab)
- Code blocks, collapsible `<details>` sections, and images are present and non-empty
- `llms.txt` and `sitemap.xml` are accessible with substantial content
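To make the "all tab panels" check concrete, here is a minimal sketch of how such an assertion could be written. It is not the code from `tests/test_docs_indexability.py`: it uses only the standard library (the PR itself adds `beautifulsoup4`), and it assumes tab panels are marked with `role="tabpanel"`, which is an assumption about the rendered markup rather than something stated in this PR. The class and function names are hypothetical.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TabPanelCollector(HTMLParser):
    """Collect the text inside every element marked role="tabpanel" (assumed markup)."""

    def __init__(self):
        super().__init__()
        self.panels = []    # stripped text of each finished panel
        self._depth = 0     # nesting depth inside the current panel (0 = outside)
        self._buffer = []   # text fragments of the panel being read

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self._depth:
            self._depth += 1
        elif dict(attrs).get("role") == "tabpanel":
            self._depth = 1
            self._buffer = []

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self._depth:
            return
        self._depth -= 1
        if self._depth == 0:
            self.panels.append("".join(self._buffer).strip())

    def handle_data(self, data):
        if self._depth:
            self._buffer.append(data)


def tab_panel_texts(html):
    """Return the text of every tab panel found in the HTML string."""
    parser = TabPanelCollector()
    parser.feed(html)
    parser.close()
    return parser.panels


# Hand-written sample: the second panel is hidden but still present in the HTML.
SAMPLE = """
<div role="tabpanel"><pre><code>import weaviate</code></pre></div>
<div role="tabpanel" hidden><pre><code>import weaviate from 'weaviate-client'</code></pre></div>
"""

panels = tab_panel_texts(SAMPLE)
assert len(panels) == 2 and all(panels), "every tab panel should carry content"
```

A crawler that does not execute JavaScript only sees what is in the served HTML, so the check counts panels regardless of whether they are visually hidden.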

Agent tests (`pytest -m indexability_agents`, requires `ANTHROPIC_API_KEY` and `OPENAI_API_KEY`):

- Claude (`web_fetch`) fetches the quickstart page and extracts the exact vectorizer config line for all 5 languages (Python, TypeScript, Go, Java, C#)
- Claude fetches the collections config-ref page and reads `text2vec-contextionary` from inside a collapsible section
- Claude fetches and reads `llms.txt`
- ChatGPT (`web_search_preview`) finds the quickstart URL and identifies multi-language code tabs

Files:

| File | Action |
| --- | --- |
| `tests/test_docs_indexability.py` | New: 47 HTML tests + 6 agent tests |
| `tests/README-INDEXABILITY.md` | New: documentation for the test suite |
| `.github/workflows/indexability_tests.yml` | New: weekly CI (Sunday 22:00 UTC) |
| `pytest.ini` | Add `indexability` and `indexability_agents` markers |
| `pyproject.toml` | Add `beautifulsoup4`, `anthropic`, `openai` deps |
| `tools/chatgpt_fetch_quickstart.py` | New: debug script to inspect ChatGPT web_search output |

Type of change:

Feature or enhancements (non-breaking change to add functionality)

How has this been tested?

- `uv run pytest -m indexability -v`: 47 HTML structure tests pass
- `uv run pytest -m indexability_agents -v`: Claude tests pass (all 5 vectorizer lines found); ChatGPT finds the quickstart URL and languages but can only see the active tab's code (known `web_search_preview` limitation)

orca-security-eu

Orca Security Scan Summary

| Status | Check | Issues by priority |
| --- | --- | --- |
| Passed | Infrastructure as Code | 0 0 0 0 |
| Passed | SAST | 0 0 0 0 |
| Passed | Secrets | 0 0 0 0 |
| Passed | Vulnerabilities | 0 0 0 0 |

Copilot

Pull request overview

This PR adds a new indexability test suite that validates AI agents (Claude, ChatGPT) and web crawlers can access all documentation content on the live Weaviate docs site. The tests are split into HTML structure tests (no API keys needed) and agent-based tests (requiring Anthropic/OpenAI API keys), with a corresponding CI workflow that runs weekly.

Changes:

- New test suite (`test_docs_indexability.py`) with ~48 HTML structure tests and 6 agent tests covering page status codes, meta tags, heading hierarchy, tabbed content, code blocks, collapsibles, images, `llms.txt`, and `sitemap.xml`
- New CI workflow (`.github/workflows/indexability_tests.yml`) running weekly on Sunday, with corresponding pytest markers and dependencies
- Supporting files including README documentation and a debug utility script

Reviewed changes

Copilot reviewed 6 out of 7 changed files in this pull request and generated 5 comments.

Show a summary per file

| File | Description |
| --- | --- |
| `tests/test_docs_indexability.py` | New test suite with HTML structure and AI agent indexability tests |
| `tests/README-INDEXABILITY.md` | Documentation for the indexability test suite |
| `.github/workflows/indexability_tests.yml` | Weekly CI workflow for running indexability tests |
| `pytest.ini` | Adds `indexability` and `indexability_agents` markers |
| `pyproject.toml` | Adds `anthropic`, `beautifulsoup4`, `openai` dependencies |
| `uv.lock` | Lock file updates for new dependencies |
| `tools/chatgpt_fetch_quickstart.py` | Debug script for inspecting ChatGPT web_search output |


Copilot · 2026-03-06T10:57:40Z

.github/workflows/indexability_tests.yml

+      - name: Run agent tests
+        id: agent-tests
+        continue-on-error: true
+        if: env.ANTHROPIC_API_KEY != '' && env.OPENAI_API_KEY != ''


The if condition references env.ANTHROPIC_API_KEY and env.OPENAI_API_KEY, but these environment variables are defined in this step's own env: block (lines 47-49). In GitHub Actions, the step-level env: is not available during evaluation of the same step's if: condition — the if: is evaluated before the step's environment is set up. Since these secrets are not defined in the workflow-level env: block either, both values will always be empty, and the agent tests will never run.

Use secrets directly in the condition instead: if: ${{ secrets.ANTHROPIC_API_KEY != '' && secrets.OPENAI_API_KEY != '' }}

Suggested change

-        if: env.ANTHROPIC_API_KEY != '' && env.OPENAI_API_KEY != ''
+        if: ${{ secrets.ANTHROPIC_API_KEY != '' && secrets.OPENAI_API_KEY != '' }}
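However the workflow condition ends up being written, a complementary guard could live in the test code so the agent tests skip cleanly when keys are absent. The sketch below is an illustration only: `REQUIRED_KEYS` and `missing_api_keys` are hypothetical names, not identifiers from this PR.

```python
import os

# The two keys the PR description says the agent tests require.
REQUIRED_KEYS = ("ANTHROPIC_API_KEY", "OPENAI_API_KEY")


def missing_api_keys(environ=None):
    """Return the required key names that are unset or empty.

    `environ` defaults to the process environment; passing a plain dict
    makes the helper easy to exercise without touching real secrets.
    """
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_KEYS if not env.get(name)]


# An empty string counts as missing, mirroring the `!= ''` check in the workflow.
assert missing_api_keys({"ANTHROPIC_API_KEY": "x", "OPENAI_API_KEY": ""}) == ["OPENAI_API_KEY"]
```

In a pytest module, the result could feed a marker such as `pytest.mark.skipif(bool(missing_api_keys()), reason="agent API keys not set")`, so a local run without keys reports skips instead of failures.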

Copilot · 2026-03-06T10:57:41Z

tests/README-INDEXABILITY.md

+- **Fetch quickstart code tabs** — fetches `/weaviate/quickstart` and must extract actual code lines for all 5 languages (Python, TypeScript, Go, Java, C#). Asserts on language-specific code tokens like `near_text` (Python), `nearText` (TS), `NearTextArgBuilder` (Go), etc.
+- Fetch a page with collapsible sections and read the content
+- Fetch `/llms.txt` and identify it as Weaviate documentation
+
+### ChatGPT agent tests (Part 3)
+
+Uses GPT-4.1 Mini with the `web_search_preview` tool to verify ChatGPT can:
+
+- **Search quickstart code tabs** — searches for `/weaviate/quickstart` and must extract actual code lines for all 5 languages. Uses the same language-specific code token assertions as the Claude test.


The README documentation has multiple inaccuracies compared to the actual test code:

It states the tests assert on "language-specific code tokens like `near_text` (Python), `nearText` (TS), `NearTextArgBuilder` (Go), etc." but the actual code asserts on vectorizer configuration lines like `Configure.Vectors.text2vec_weaviate()`, `vectors.text2VecWeaviate()`, etc. (see `QUICKSTART_VECTORIZER_LINES` in `test_docs_indexability.py` lines 297-303).

Both lines 42 and 50 say the tests require "all 5 languages", but the ChatGPT test actually only requires 3 out of 5 (line 501 of `test_docs_indexability.py`).

These appear to be leftover from an earlier version of the tests and should be updated to match the current implementation.
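The difference between the two thresholds can be sketched as follows. This is an illustration under assumptions, not the code in `test_docs_indexability.py`: the mapping holds only the two vectorizer lines quoted above (their assignment to Python and TypeScript is inferred from naming style), and the helper names and `min_required` parameter are hypothetical.

```python
# Only the two lines quoted in this review are listed; the remaining
# languages are omitted because their exact lines are not shown here.
VECTORIZER_LINES = {
    "Python": "Configure.Vectors.text2vec_weaviate()",
    "TypeScript": "vectors.text2VecWeaviate()",
}


def languages_found(response_text, expected=VECTORIZER_LINES):
    """Return the languages whose vectorizer line appears verbatim in the text."""
    return [lang for lang, line in expected.items() if line in response_text]


def assert_languages(response_text, min_required, expected=VECTORIZER_LINES):
    """Fail unless at least `min_required` languages were extracted."""
    found = languages_found(response_text, expected)
    missing = sorted(set(expected) - set(found))
    assert len(found) >= min_required, (
        f"found {len(found)} of {len(expected)} languages, missing: {missing}"
    )
    return found


# Simulated agent answer that only contains the Python line.
answer = "The quickstart uses Configure.Vectors.text2vec_weaviate() for Python."
assert assert_languages(answer, min_required=1) == ["Python"]
```

With the full five-entry mapping, a strict test would pass `min_required=5`, while a more tolerant one would pass `min_required=3`; documenting which value each test uses keeps the README and the code in step.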

Copilot · 2026-03-06T10:57:41Z

.github/workflows/indexability_tests.yml

+        if: always()
+        uses: ./.github/actions/handle-test-results
+        with:
+          test-outcome: ${{ steps.html-tests.outcome }}


The `test-outcome` only considers the HTML tests outcome (`steps.html-tests.outcome`), ignoring the agent tests outcome entirely. If the HTML tests pass but agent tests fail, the workflow will report success and no failure notification will be sent. Consider combining both outcomes, e.g., reporting failure if either step failed: `test-outcome: ${{ steps.html-tests.outcome == 'success' && (steps.agent-tests.outcome == 'success' || steps.agent-tests.outcome == 'skipped') && 'success' || 'failure' }}`

Suggested change

-          test-outcome: ${{ steps.html-tests.outcome }}
+          test-outcome: ${{ steps.html-tests.outcome == 'success' && (steps.agent-tests.outcome == 'success' || steps.agent-tests.outcome == 'skipped') && 'success' || 'failure' }}
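The logic of the suggested expression is easier to audit as a small truth table. The Python sketch below mirrors it one-to-one; `combined_outcome` is a hypothetical name used only for illustration. Note that a step's `outcome` is its result before `continue-on-error` is applied, so a failing agent step still shows up as `failure` here.

```python
def combined_outcome(html_outcome, agent_outcome):
    """Mirror the suggested expression: succeed only if the HTML tests
    succeeded and the agent tests either succeeded or were skipped."""
    html_ok = html_outcome == "success"
    agent_ok = agent_outcome in ("success", "skipped")
    return "success" if html_ok and agent_ok else "failure"


# Walk through the combinations that matter for the notification step.
assert combined_outcome("success", "success") == "success"
assert combined_outcome("success", "skipped") == "success"   # keys not configured
assert combined_outcome("success", "failure") == "failure"   # previously reported as success
assert combined_outcome("failure", "success") == "failure"
```

The third case is the one the original `test-outcome` value would have missed.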

Copilot · 2026-03-06T10:57:41Z

tests/test_docs_indexability.py

+
+import os
+import time
+from functools import lru_cache


`lru_cache` is imported but never used. The caching is implemented manually via the `_page_cache` dict. This unused import should be removed.

Suggested change

-from functools import lru_cache

Copilot · 2026-03-06T10:57:41Z

tests/test_docs_indexability.py

+    ("/weaviate/search/similarity", {"tabs", "code"}),
+    ("/weaviate/search/hybrid", {"tabs", "code"}),
+    ("/weaviate/connections/connect-cloud", {"tabs", "code"}),
+    ("/weaviate/config-refs/collections", {"details", "table"}),


The `"table"` feature tag is declared in `TEST_PAGES` and documented in the README as an available feature tag, but no test uses it (there's no `"table" in f` filtering). This is a dead feature tag with no effect. Either add a corresponding test (e.g., `test_tables_present` that verifies `<table>` elements exist on pages with this feature), or remove the tag from this entry and the README to avoid confusion.
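A test along the lines suggested here might look like the following sketch. It is an assumption-laden illustration: it uses the standard-library HTML parser rather than `beautifulsoup4`, checks a hand-written HTML string instead of fetching the live page, and `pages_with`, `TagCounter`, `count_tag`, and `check_tables_present` are hypothetical names.

```python
from html.parser import HTMLParser

# Two entries copied from the diff above; the real list is longer.
TEST_PAGES = [
    ("/weaviate/search/hybrid", {"tabs", "code"}),
    ("/weaviate/config-refs/collections", {"details", "table"}),
]


def pages_with(feature, pages=TEST_PAGES):
    """Select the paths whose feature set contains the given tag."""
    return [path for path, features in pages if feature in features]


class TagCounter(HTMLParser):
    """Count opening tags with a given name."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.count += 1


def count_tag(html, tag):
    counter = TagCounter(tag)
    counter.feed(html)
    counter.close()
    return counter.count


def check_tables_present(html):
    """Body of a possible test_tables_present: require at least one <table>."""
    assert count_tag(html, "table") > 0, "expected at least one <table> element"


assert pages_with("table") == ["/weaviate/config-refs/collections"]
check_tables_present("<table><tr><td>vectorizer</td></tr></table>")
```

In the real suite, the selected paths would typically be fed to `pytest.mark.parametrize` and the HTML would come from the existing page-fetching helper.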

Initial LLM indexability testing

94e6e20

g-despot requested a review from Copilot March 6, 2026 10:48

orca-security-eu bot reviewed Mar 6, 2026


Copilot started reviewing on behalf of g-despot March 6, 2026 10:48

Copilot AI reviewed Mar 6, 2026


Update tests

89a118e

g-despot wants to merge 2 commits into main from llm-indexability


2 participants
