Run the Q&A pipeline in-process (remove Render Workflows) #2
Open
Ho1yShif wants to merge 37 commits into
- workflows/app.py: add _stage_result helper to replace 8x repeated PipelineStageResult construction; consolidate the 6 page-injection tasks behind _run_ingest_script; tighten docstrings and fix the stale execute_pipeline / evaluation.py line references
- workflows/serialization.py: remove unused evaluations_*/stages_to_json helpers
- backend/pipeline/quality_gate.py: delete commented-out accuracy-gate block, condense the rationale note
- README.md: dedupe the local-dev/env-group blockquotes, drop the repeated Blueprints note, trim generic capability bullets
- remove the stale duplicate env.example (.env.example is canonical)
No behavior changes. Verified: all modules import, dev server registers all 15 tasks, _stage_result builds correct stage shapes.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…lities - Revised project description to include "Render Workflows" for better context. - Added a new section detailing Render capabilities, including durable workflow tasks and PostgreSQL integration. - Removed outdated example questions and streamlined the architecture explanation. - Clarified local development requirements for asking questions and running the stack. These changes improve the documentation's accuracy and usability for developers.
…setup instructions - Changed the deployment link to point to the updated repository for Pydantic Agents Workflows. - Removed the manual setup section to streamline the README and focus on essential information for developers. These updates enhance the clarity and relevance of the documentation.
…ification
Address audit feedback on the retrieval/generation layer and reorganize the workflow tasks into distinct, non-duplicative verification capabilities.
retrieval.py
- Replace the five near-identical detect_*/inject_* pairs (~350 lines of duplicated fetchrow + metadata-parse + prepend) with a declarative INJECTION_RULES table iterated by one inject_curated_docs helper.
- Pin injected docs at a named INJECTED_DOC_SCORE constant (was a bare 0.95 literal) so the editorial pinning is explicit, not disguised as a real match.
- Fix the multi-query boost off-by-one: boost i == 0 (the original question, per query_expansion.py) instead of i == 1 (first expanded variation).
generation.py
- Rewrite ANSWER_GENERATION_INSTRUCTIONS from an 87-line all-caps prompt to a lean, neutral set of grounding rules: drop the marketing steering and per-topic carve-outs, and remove the hardcoded "20" so it can't desync from rag_top_k.
- Delete the unused stream_answer (and its export / AsyncGenerator import), which also removes the copy-pasted context + feedback block.
accuracy.py / evaluation.py
- Sharpen the two Claude judges into distinct roles: Accuracy owns factual grounding (errors/corrections -> feedback loop); Quality owns developer experience (clarity/completeness/usefulness -> gate score).
workflows/app.py + docs
- Re-group the orchestrator into Generate / Grounding / Accuracy+Quality phases and update README/PIPELINE.md to the 3-capability framing and neutral answers.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…low-showcase refactor: data-driven RAG injection, neutral prompt, 3-capability verification
The pipeline reported a near-constant "~21 relevant documents" for every question because retrieval was a fixed top-k quota, not a relevance filter, plus a forced +1 injection. This makes the count reflect actual relevance and puts all scores on one interpretable scale.
hybrid_search (database.py)
- Stop overwriting the cosine similarity with the raw RRF score. Return cosine (0-1) as similarity_score; use RRF only to order the fused set.
- Apply the similarity_threshold as a real gate on the FINAL returned set (not just the semantic candidate pool), so the result count is dynamic and <= k.
- Remove pre-existing unused imports (numpy, DocumentChunk).
retrieve_documents (retrieval.py)
- Multi-query merge: dedup by max cosine, drop the off-scale 1.15 boost (it could exceed 1.0 and was the off-by-one), prefer original-question hits only as an ordering tiebreak, sort by cosine, cap at rag_top_k.
inject_curated_docs (retrieval.py)
- Replace-weakest policy: if a topic's curated doc was already retrieved, leave it; otherwise insert at top (score 1.0) and, only if over the rag_top_k ceiling, drop the lowest-ranked retrieved doc. Never pads the count, never duplicates. INJECTED_DOC_SCORE 0.95 -> 1.0 (top of the cosine scale).
config.py
- Document rag_top_k as a ceiling and similarity_threshold as a real relevance gate (with tuning guidance); drop unused Optional import.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…etrieval fix: relevance-gated retrieval + replace-weakest injection
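The split between "RRF orders, cosine is reported and gated" can be sketched as follows. This is a minimal illustration, not the repository's hybrid_search: the function name, the RRF constant of 60, and the choice to treat keyword-only hits as having cosine 0 are all assumptions made for the example.

```python
from typing import Dict, List, Tuple

RRF_K = 60  # assumed smoothing constant; the commit does not state the value used


def fuse_and_gate(
    semantic: List[Tuple[str, float]],  # (doc_id, cosine similarity), best first
    keyword: List[str],                 # doc_ids from keyword search, best first
    similarity_threshold: float,
    top_k: int,
) -> List[Tuple[str, float]]:
    """Order the fused set by RRF, but report and gate on cosine similarity."""
    cosine: Dict[str, float] = dict(semantic)
    rrf: Dict[str, float] = {}
    for ranking in ([doc_id for doc_id, _ in semantic], keyword):
        for rank, doc_id in enumerate(ranking, start=1):
            rrf[doc_id] = rrf.get(doc_id, 0.0) + 1.0 / (RRF_K + rank)
    # RRF decides the order only; it never replaces the reported score.
    ordered = sorted(rrf, key=lambda d: rrf[d], reverse=True)
    # Gate the FINAL set on cosine, so the count is dynamic and <= top_k.
    # Assumption: keyword-only hits have no cosine here and fail the gate.
    gated = [(d, cosine[d]) for d in ordered if cosine.get(d, 0.0) >= similarity_threshold]
    return gated[:top_k]
```

With this shape, a document that ranks first by RRF can still carry a lower cosine score than the one below it, which is the intended reading: position comes from fusion, the number comes from similarity.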
Updated the RAG configuration to adjust the `rag_top_k` value and introduced a new `relevance_cutoff_fraction` for improved document selection. Implemented an adaptive relevance cutoff in the retrieval pipeline to dynamically filter documents based on their similarity scores relative to the best match. This change aims to enhance the quality of retrieved documents by reducing the inclusion of marginally relevant results. Additionally, modified the ProgressTracker component to conditionally display status messages for better user feedback.
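A minimal sketch of an adaptive cutoff of this kind, assuming the rule is "keep documents scoring at least `relevance_cutoff_fraction` times the best score". The function name and the exact rule are assumptions; the commit only says the filter is relative to the best match.

```python
from typing import List, Tuple


def apply_relevance_cutoff(
    scored: List[Tuple[str, float]],
    relevance_cutoff_fraction: float,
) -> List[Tuple[str, float]]:
    """Drop documents whose score falls too far below the best match."""
    if not scored:
        return []
    best = max(score for _, score in scored)
    floor = best * relevance_cutoff_fraction  # the cutoff moves with the best hit
    return [(doc_id, score) for doc_id, score in scored if score >= floor]
```

Because the floor scales with the top score, a question with one strong match keeps few documents, while a question with many comparable matches keeps more.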
The ingestion side fanned out 6 near-identical @app.task wrappers (one per page), each spinning up an instance to do a single embed + insert, backed by ~1,280 lines of ~90% copy-paste across 6 scripts. Replace that with a source-oriented, data-driven design.
Ingestion (8 tasks -> 3):
- backend/ingestion.py: shared embed_documents() + replace_source() helpers (the delete-by-source + insert block was copy-pasted across all 7 scripts).
- data/sources.py: SOURCES registry with curated-page / pricing-table / tutorials-index build strategies; curated content inlined as constants.
- workflows/app.py: ingest_core (unchanged), ingest_source(name) (one parameterized, retried task), ingest_all (fans ingest_source over the registry). Removed _run_ingest_script + the 6 add_* wrappers.
- data/scripts/ingest_pages.py + Makefile: local-dev parity without the deleted scripts.
Backend pipeline dedup (light helpers):
- backend/pipeline/_agents.py: anthropic_agent()/openai_agent() builders used by all 6 agents (drops a deprecated OpenAIModel usage).
- observability.usage_and_cost(): replaces ~6 copy-pasted token/cost blocks.
- evaluation.agreement_level(): extracted and reused in app.py.
Docs/Makefile updated; deleted the 6 data/scripts/add_*_page.py files.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Added a new table `pipeline_progress` to the database for tracking live progress of in-flight pipeline runs, keyed by a unique token. Updated the `ask_question` endpoint to generate and return this token, allowing the UI to display real-time stage updates. Enhanced the `get_answer` endpoint to return cumulative progress updates when the token is provided. Modified the frontend to display progress messages and updated API calls to support the new progress tracking feature.
…tion-workflow-tasks Live pipeline progress tracking + ingestion consolidation
Modified the `_fetch_curated_docs` function to fetch multiple rows for source-based lookups, allowing for chunked document retrieval. Updated the `_curated_build` function to create chunked documents from curated markdown files, enhancing the semantic retrieval process. Adjusted the logic in `inject_curated_docs` to deduplicate based on content rather than title, ensuring more efficient document injection. This refactor improves the handling of curated documents and optimizes the retrieval pipeline.
…verage
Claim verification embeds each claim and does a top-k similarity search over the
corpus; a claim is only verified if a retrieved passage substantiates it. Curated
docs were ingested as one whole-page chunk (e.g. the 2.6KB workflows_docs page →
a single chunk in prod), so a narrow claim like "billed prorated by the second"
had low cosine against the diluted multi-topic embedding and couldn't surface its
supporting passage — verifying at 0% despite the fact being present in our content.
Add chunk_markdown_by_heading: split curated markdown into one focused chunk per
##/### section, keeping the heading vocabulary ("Pricing", "Beta Limitations")
with the body so claims match the section that states them. Oversized sections
fall back to paragraph chunking; tiny fragments merge into the previous chunk;
heading-less files fall back to chunk_document. workflows_docs now yields 8
fact-level chunks instead of 1.
Requires re-ingesting the curated sources (delete-by-source replace) for the live
corpus to pick up the finer chunks.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Section-aware chunking for curated docs (raise verification coverage)
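The heading-based split described in this commit might look roughly like the sketch below. It is an illustration only: the `_sketch` suffix marks it as hypothetical, `min_chars` is an assumed threshold, the oversized-section paragraph fallback is omitted, and the heading-less case simply returns one chunk where the commit delegates to chunk_document.

```python
import re
from typing import List

# Matches "## Heading" and "### Heading" at the start of a line (not "#" or "####").
HEADING = re.compile(r"^#{2,3}[ \t]+\S", re.MULTILINE)


def chunk_markdown_by_heading_sketch(markdown: str, min_chars: int = 40) -> List[str]:
    """Split markdown into one chunk per ##/### section, heading kept with body."""
    starts = [m.start() for m in HEADING.finditer(markdown)]
    if not starts:
        # The commit falls back to chunk_document here; this sketch returns one chunk.
        text = markdown.strip()
        return [text] if text else []
    bounds = ([0] + starts) if starts[0] > 0 else starts  # keep any preamble
    bounds = bounds + [len(markdown)]
    chunks: List[str] = []
    for begin, end in zip(bounds, bounds[1:]):
        section = markdown[begin:end].strip()
        if not section:
            continue
        if chunks and len(section) < min_chars:
            chunks[-1] += "\n\n" + section  # tiny fragment merges into previous chunk
        else:
            chunks.append(section)
    return chunks
```

Keeping the heading line inside each chunk is what lets a narrow claim match the section that states it, since the heading vocabulary is embedded together with the body.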
…race ingest_all fans out an ingest_source task per source, each on a fresh instance calling vector_store.initialize(). The full schema DDL re-ran on every instance, so concurrent CREATE OR REPLACE FUNCTION / DROP+CREATE TRIGGER against the same Postgres catalog rows intermittently raised "tuple concurrently updated" (retried and eventually succeeded, but flaky). Wrap the DDL block in a transaction guarded by a transaction-scoped advisory lock (pg_advisory_xact_lock) on a stable key, so only one instance runs schema init at a time; the rest wait and re-run the now-idempotent statements. Fixes the race for every concurrent caller (ingest fan-out, gateway startup + QA run, parallel ingest_source triggers), not just ingest_all. Pool creation stays outside the lock; all statements are unchanged. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Serialize schema init with advisory lock (fix concurrent ingest race)
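The locking pattern can be sketched as below. This assumes an asyncpg-style connection (an object exposing `transaction()` as an async context manager and `execute(sql, *args)`); the function name and the lock key value are hypothetical, and the DDL statements are passed in rather than copied from the repository.

```python
from typing import Any, Sequence

SCHEMA_LOCK_KEY = 4242  # hypothetical stable key; any fixed bigint shared by all callers works


async def init_schema_serialized(conn: Any, ddl_statements: Sequence[str]) -> None:
    """Run idempotent DDL under a transaction-scoped advisory lock."""
    async with conn.transaction():
        # Blocks until no other session holds the key. The lock is released
        # automatically when the transaction commits or rolls back.
        await conn.execute("SELECT pg_advisory_xact_lock($1)", SCHEMA_LOCK_KEY)
        for statement in ddl_statements:
            await conn.execute(statement)
```

Because the lock is transaction-scoped, there is no unlock call to forget: a crashed instance releases it when its connection drops, and waiting instances then re-run statements that are already idempotent.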
The ivfflat(lists=100) index had catastrophic recall on the ~1.3k-row corpus: with pgvector's default ivfflat.probes=1, each query scanned only ~1 of 100 lists (~1% of rows), so the true #1 nearest neighbor was frequently never retrieved. Claims whose supporting chunk matched at cosine >0.7 (exact rank #1) still verified at 0% confidence because the approximate index never returned the chunk. Switch to HNSW (pgvector 0.8.0), which gives effectively exact recall at this scale with no probe tuning, fixing both similarity_search (verification) and hybrid_search (answer retrieval). Drop the old index first so existing deployments migrate on their next initialize(); the build runs inside the existing schema-init advisory lock. Also raise the verifier's candidate pool from 5 to 10 as defense-in-depth. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
fix: replace ivfflat vector index with HNSW (fixes 0% claim verification)
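The recall arithmetic in this commit can be made concrete with a rough estimate. The helper below is hypothetical and assumes rows are spread evenly across lists, which real ivfflat clustering does not guarantee.

```python
def ivfflat_rows_scanned(total_rows: int, lists: int, probes: int) -> float:
    """Approximate rows examined per query, assuming evenly sized lists."""
    return total_rows * min(probes, lists) / lists


# Roughly the corpus described above: ~1.3k rows, lists=100, default probes=1.
approx_rows = ivfflat_rows_scanned(1300, 100, 1)
```

Under that assumption each query looks at about 1% of the corpus, so a true nearest neighbor that landed in any other list is never considered.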
Update the documents schema snippet in HYBRID_SEARCH.md to use the HNSW index (with a note on why ivfflat's default probes=1 tanked recall) and note the pgvector >= 0.5.0 requirement in the README prerequisites. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
docs: reflect HNSW vector index in schema + prerequisites
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Refactor the iteration tracking in the run_qa_pipeline function to use a dedicated variable, iterations_run, for clarity. This change ensures that the correct iteration count is logged and reported, improving the accuracy of the QA pipeline's performance metrics.
The quality gate (`quality_gate_decision`) and the iterative refinement loop never functioned in practice: with MAX_ITERATIONS=1 the gate short-circuits on its first check (`current_iteration >= max_iterations` is `1 >= 1`), so the quality-threshold and evaluator-disagreement branches and the feedback loop were unreachable. All 23 historical sessions in the live DB ran exactly one iteration, and the stored stage data shows the gate only ever emitting "Maximum iterations (1) reached".
Collapse `run_qa_pipeline` to a single linear 7-stage pass and remove the now-dead surface area:
- delete `backend/pipeline/quality_gate.py` and the `while` loop
- drop the unused `feedback` param from answer generation
- remove the `iterations` field/metric everywhere (AnswerResponse, track_pipeline_metrics, DB column + idempotent drop migration, frontend)
- remove quality_threshold / accuracy_threshold / agreement_threshold / max_iterations from config.py, the /stats endpoint, and render.yaml
- update README + docs to the 7-stage single-pass pipeline
Grounding (claims extraction + verification), the accuracy check, and the dual-model OpenAI+Anthropic evaluation with cross-provider agreement are all unchanged.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Eliminate dead configuration parameters related to the quality gate and refinement loop, including QUALITY_THRESHOLD, ACCURACY_THRESHOLD, AGREEMENT_THRESHOLD, and MAX_ITERATIONS, as they are no longer applicable following recent refactoring of the QA pipeline.
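The short-circuit can be reproduced with a small stand-in for the gate. Only the first check and its message come from the commit text; the function name, the other branches, and their return strings are hypothetical placeholders.

```python
def quality_gate_decision_sketch(
    current_iteration: int,
    max_iterations: int,
    quality_score: float,
    quality_threshold: float,
) -> str:
    """Stand-in gate: the iteration check runs before any quality check."""
    if current_iteration >= max_iterations:
        return f"Maximum iterations ({max_iterations}) reached"
    # Everything below is unreachable when max_iterations == 1,
    # because the first iteration already satisfies 1 >= 1.
    if quality_score >= quality_threshold:
        return "quality threshold met"  # hypothetical message
    return "refine"  # hypothetical message


# With max_iterations=1 the score is irrelevant to the outcome.
decisions = {quality_gate_decision_sketch(1, 1, score, 80.0) for score in (0.0, 50.0, 100.0)}
```

Every score produces the same decision, which matches the stored stage data showing a single message across all sessions.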
The hand-rolled fetch+parse in backend/prices.py crashed on the real
genai-prices schema: tiered prices (e.g. gpt-5.4, where input_mtok is a
{base, tiers} dict) and constraint-based prices (e.g. o3, where prices is
a list). The first such model aborted the entire provider parse, and because
the try block wrapped both the HTTP call and the parse, a successful 200 +
failed parse was mislabeled "GitHub price fetch failed" -- silently falling
back to bundled files frozen at 2026-03-27. This also ran on every task,
making wasted GitHub calls in the hot path.
Replace the custom fetch+parse with the official genai-prices package, which
ships bundled, auto-updated price data and a lookup API -- immune to schema
drift and fully offline (no network in the cost-tracking path).
- backend/prices.py: thin model_cost() wrapper over calc_price, with a
graceful fallback rate so cost tracking never crashes
- backend/observability.py: repoint the 3 cost wrappers at model_cost
- workflows/app.py: drop the per-task load_prices() call + import
- deps: add genai-prices, remove now-unused pyyaml
- remove stale bundled backend/prices/*.yml files
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
fix: use official genai-prices package for model pricing
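The mislabeling described above comes from one try block covering two different failure modes. The sketch below illustrates that failure mode with separate blocks; it is not the fix this commit chose (which swaps in the genai-prices package), and every name in it is hypothetical.

```python
from typing import Callable, Dict, Tuple


def load_prices_sketch(
    fetch: Callable[[], str],
    parse: Callable[[str], Dict[str, float]],
    bundled: Dict[str, float],
) -> Tuple[Dict[str, float], str]:
    """Fetch then parse, reporting which step failed instead of conflating them."""
    try:
        raw = fetch()
    except OSError as exc:  # network-level failure
        return bundled, f"price fetch failed: {exc}"
    try:
        return parse(raw), "ok"
    except (KeyError, TypeError, ValueError) as exc:  # schema drift, bad payload
        return bundled, f"price parse failed: {exc}"
```

With a single try around both calls, a successful HTTP 200 followed by a parse error would surface under the fetch label, which is how the stale fallback went unnoticed.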
Remove dead quality gate and refinement loop
The Sources list showed the same document several times because retrieval
returns chunks and the chunk list was passed straight through as `sources`
with no document-level grouping (amplified by curated injection fetching every
chunk of a matched page). This collapses chunks sharing (source, title) into one
source entry for display while still feeding the full chunk list to generation.
Also closes three high-severity observability gaps surfaced by an audit:
- Add `logfire.force_flush()` (via a `flush_on_exit` decorator) to every workflow
task. Each task runs on its own short-lived instance, so without an explicit
flush buffered spans were lost and the Logs tab (`/sessions/{id}/logs`) returned
an empty trace.
- Report `tokens_used` for the claims_verification and quality_evaluation stages,
which computed token counts but dropped them — leaving 2 of 7 stages with null
token attribution.
Sources are collapsed at response assembly, so generation still sees every chunk
and History inherits the collapsed view via the persisted sources.
Verified: py_compile + import smoke on changed Python, a direct unit test of
collapse_sources, and `npm run build` (type-check + static export).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…observability Collapse duplicate source chunks + close observability gaps
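Collapsing chunks into display sources by (source, title) might look like the sketch below. The `_sketch` suffix marks it as hypothetical; the dict shape and the choice to keep the highest similarity_score per group are assumptions, since the commit only specifies the grouping key.

```python
from typing import Dict, List, Tuple


def collapse_sources_sketch(chunks: List[dict]) -> List[dict]:
    """Group chunks sharing (source, title) into one display entry, first-seen order."""
    grouped: Dict[Tuple[str, str], dict] = {}
    for chunk in chunks:
        key = (chunk["source"], chunk["title"])
        score = chunk.get("similarity_score", 0.0)
        if key not in grouped:
            grouped[key] = {"source": key[0], "title": key[1], "similarity_score": score}
        else:
            # Assumption: the entry shows the best score among its chunks.
            grouped[key]["similarity_score"] = max(grouped[key]["similarity_score"], score)
    return list(grouped.values())
```

The input list is left untouched, so the full chunk list can still be passed to generation while only the collapsed view is returned for display.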
Follow-up to PR #11. Closes the remaining low/medium audit items in one
reviewable pass — no behavior change to the happy-path pipeline.
- Remove broken `make test` references (no test suite in this example repo):
drop the Makefile target/help/.PHONY entries and pytest dev deps, relock.
- retrieve_documents now returns a real `queries_count` (the rag_retrieval
stage metadata previously always reported queries_expanded=1).
- Remove dead `evaluate_quality` (orchestrator uses the granular raters) and
its now-unused imports; remove the dead `iteration` param from _stage_result.
- Harden backend/api/logs.py: validate trace_id against ^[0-9a-f]{32}$ (->400)
before SQL interpolation; make the query window a Settings field (7->30 days)
so logs stay fetchable for older sessions.
- De-duplicate the instrument_stage async/sync wrappers via shared
_record_stage_success / _record_stage_failure helpers.
- Cross-validate embedding_model vs embedding_dimensions at startup.
- Drop the post-hoc docs[0].source citation fallback in verification.py — a
verified claim with no judge-cited passage keeps supporting_docs empty rather
than fabricating an attribution (frontend already guards empty lists).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
fix: close deferred audit items (cleanup)
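The trace_id check from this commit can be sketched with the standard library. The pattern is the one quoted in the commit message; the function name is hypothetical, and mapping a failed check to a 400 response is left to the caller.

```python
import re

TRACE_ID_RE = re.compile(r"^[0-9a-f]{32}$")


def is_valid_trace_id(trace_id: str) -> bool:
    """True only for exactly 32 lowercase hex characters."""
    # fullmatch (rather than match) also rejects a trailing newline,
    # which "$" alone would tolerate.
    return TRACE_ID_RE.fullmatch(trace_id) is not None
```

Validating before interpolation means anything that reaches the SQL string is drawn from a 16-character alphabet with a fixed length, so quoting tricks have nothing to work with.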
The "Recent questions" panel showed every question ever asked across the whole app, because qa_sessions had no owner and GET /history had no WHERE clause. Scope history per user via an anonymous browser client ID (the app has no auth): a localStorage UUID sent with each request, stamped onto each saved session, and used to filter and scope all history operations.
- database.py: add client_id column + idempotent migration + composite (client_id, created_at DESC) index; thread client_id through save_session, get_recent_sessions (WHERE client_id), delete_session (scoped), and delete_all_sessions (scoped). Legacy NULL-owner rows are excluded by the equality filter, so they stay hidden from everyone.
- models.py / main.py: add client_id to QuestionRequest and pass it to the workflow; require client_id on GET/DELETE /history (400 otherwise) so we never fall back to a global query.
- workflows/app.py: thread client_id through run_qa_pipeline -> _persist_session.
- frontend/lib/api.ts: getClientId() (localStorage UUID, SSR-guarded) wired into askQuestion and all history calls. HistoryPanel needs no changes.
Verified against local Postgres: migration applies, lists are per-client, legacy rows hidden, cross-client delete rejected, clear-all scoped.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ient Scope "Recent questions" history to the anonymous browser client
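The point about legacy NULL-owner rows follows from SQL's three-valued logic: `client_id = ?` is never true for a NULL column. The sketch below demonstrates it with an in-memory SQLite database as a stand-in for Postgres; the `question` column and the helper names are assumptions, while qa_sessions, client_id, and created_at come from the commit.

```python
import sqlite3
from typing import List


def make_demo_db() -> sqlite3.Connection:
    """Build a tiny qa_sessions table with one legacy row and two owned rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE qa_sessions (question TEXT, client_id TEXT, created_at INTEGER)")
    conn.executemany(
        "INSERT INTO qa_sessions VALUES (?, ?, ?)",
        [("legacy", None, 1), ("mine", "client-a", 2), ("theirs", "client-b", 3)],
    )
    return conn


def recent_sessions(conn: sqlite3.Connection, client_id: str) -> List[str]:
    """Return this client's questions, newest first."""
    rows = conn.execute(
        "SELECT question FROM qa_sessions WHERE client_id = ? ORDER BY created_at DESC",
        (client_id,),
    ).fetchall()
    return [row[0] for row in rows]
```

No client ID value can match the legacy row, so it stays hidden without a separate `IS NOT NULL` clause.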
Raise the query expansion agent's max_tokens from 300 to 1000 so gpt-4.1-nano can finish its structured QueryExpansionOutput payload instead of aborting with UnexpectedModelBehavior before any output. Also make expand_query() failures non-fatal: on any error, log a warning and fall back to the original question alone so a retrieval enhancement hiccup no longer takes down the whole Q&A pipeline. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…-limit fix: prevent query expansion token-limit crash
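The non-fatal fallback can be sketched as a thin wrapper. The names here are hypothetical, and `expand` stands in for whatever agent call the repository makes; the only behavior taken from the commit is "on any error, log a warning and fall back to the original question alone".

```python
import logging
from typing import Awaitable, Callable, List

logger = logging.getLogger(__name__)


async def expand_query_safe(
    question: str,
    expand: Callable[[str], Awaitable[List[str]]],
) -> List[str]:
    """Return query variations, or just the original question if expansion fails."""
    try:
        variations = await expand(question)
    except Exception as exc:  # deliberately broad: expansion is an optional enhancement
        logger.warning("query expansion failed, using original question only: %s", exc)
        return [question]
    # Keep the original question first, followed by any distinct variations.
    return [question] + [v for v in variations if v != question]
```

Catching broadly is a judgment call that fits here because the caller has a safe default; retrieval simply runs with one query instead of several.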
…trypoint The Dashboard pre-deploy chained deleted per-page scripts (add_pricing_page.py && add_ai_agent_template_page.py), failing on the latter which never existed. Codify the data-driven replacement in the Blueprint so the setting is version-controlled and can't drift: preDeployCommand runs `ingest_pages.py`, which ingests all live sources in data/sources.py directly (no Render Workflows). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
POST /ask returned 503 because it delegated the pipeline to a separate
Render Workflows service via render.workflows.start_task(...), which needs
the sync:false Blueprint var WORKFLOW_SLUG. With it unset the endpoint
raised 503. Per project direction this stack must not use Render Workflows
and must stay deployable on free Render tiers.
Run the pipeline in-process instead:
- Add backend/pipeline/orchestrator.py: run_qa_pipeline ported from the
deleted workflows/app.py, calling the stage functions directly (no
@app.task wrappers, no JSON serialization round-trips). The accuracy +
dual-model quality checks still overlap via a single asyncio.gather.
- Add a pipeline_runs table + set_run_status/get_run_status so GET
/ask/{run_id} can read terminal status without polling Workflows.
- Rewrite POST /ask to launch the orchestrator as a background task and
GET /ask/{run_id} to read run status + live progress. The response
shape is unchanged, so the frontend needs no changes.
- Drop the render-sdk dependency, render_api_key/workflow_slug config,
and the workflows/ directory.
- Repoint the ingest cron to the in-process data/scripts/ingest_pages.py,
which now loads the core corpus then the live sources on a no-arg run.
- render.yaml: drop the pipeline-trigger env group and rename services
from pydantic-agents-workflows-* to pydantic-agents-*.
- Update README and .env.example off the Workflows model.
Verified locally (1339 docs): POST /ask -> 202 (no 503); polled to done
across all 7 stages with live progress; answer with 12 sources, 31 claims,
quality 89.5 / accuracy 97; session persisted and scoped per client;
unknown run_id -> 404; single-source in-process ingest works.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
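The in-process shape described in this commit (launch in the background, poll a status record, overlap the two checks with one asyncio.gather) can be sketched with plain asyncio. Everything here is a stand-in: the in-memory RUNS dict replaces the pipeline_runs table, the stage functions are stubs, and the "running" and "failed" status names are assumptions ("done" appears in the verification notes).

```python
import asyncio
import uuid
from typing import Any, Dict, Optional

RUNS: Dict[str, Dict[str, Any]] = {}  # stand-in for the pipeline_runs table
_TASKS: set = set()                   # keep references so tasks are not garbage-collected


async def check_accuracy(answer: str) -> float:
    await asyncio.sleep(0)  # stub for a model call
    return 1.0


async def rate_quality(answer: str) -> float:
    await asyncio.sleep(0)  # stub for the dual-model evaluation
    return 1.0


async def run_qa_pipeline(run_id: str, question: str) -> None:
    try:
        answer = f"answer to: {question}"  # stub for retrieval + generation
        # The two checks are independent, so they overlap in one gather.
        accuracy, quality = await asyncio.gather(check_accuracy(answer), rate_quality(answer))
        RUNS[run_id] = {"status": "done", "answer": answer,
                        "accuracy": accuracy, "quality": quality}
    except Exception as exc:
        RUNS[run_id] = {"status": "failed", "error": str(exc)}


def start_run(question: str) -> str:
    """POST /ask analogue: record the run, schedule it, return immediately."""
    run_id = uuid.uuid4().hex
    RUNS[run_id] = {"status": "running"}
    task = asyncio.create_task(run_qa_pipeline(run_id, question))
    _TASKS.add(task)
    task.add_done_callback(_TASKS.discard)
    return run_id


def get_run(run_id: str) -> Optional[Dict[str, Any]]:
    """GET /ask/{run_id} analogue: None maps to a 404 for unknown IDs."""
    return RUNS.get(run_id)
```

`start_run` must be called from inside a running event loop, which is the normal situation in an async web handler. Writing the terminal status from within the pipeline itself is what lets the read endpoint stay a simple lookup.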
Problem
Clicking a question in the deployed UI returned 503 → frontend "API error".
`POST /ask` delegated the pipeline to a separate Render Workflows service via `render.workflows.start_task(...)`, which requires the `sync:false` Blueprint var `WORKFLOW_SLUG`. When that's unset, the endpoint raised `503 WORKFLOW_SLUG is not configured`. This stack should not depend on Render Workflows and should stay deployable on free Render tiers.
Fix: run the pipeline in-process
- `backend/pipeline/orchestrator.py` (new): `run_qa_pipeline` ported from the deleted `workflows/app.py`, calling the stage functions directly (no `@app.task` wrappers, no JSON serialization round-trips). The accuracy + dual-model quality checks still overlap via a single `asyncio.gather`.
- `backend/database.py`: new `pipeline_runs` table + `set_run_status`/`get_run_status` so `GET /ask/{run_id}` reads terminal status without polling Workflows (uses `updated_at`, consistent with the sibling `pipeline_progress` table).
- `backend/main.py`: `POST /ask` launches the orchestrator as a background task; `GET /ask/{run_id}` reads run status + live progress. Response shape unchanged → no frontend changes.
- Dropped the `render-sdk` dependency, `render_api_key`/`workflow_slug` config, and the entire `workflows/` directory.
- Repointed the ingest cron to the in-process `data/scripts/ingest_pages.py`, which now loads the core corpus then the live sources on a no-arg run.
- `render.yaml`: removed the `pipeline-trigger` env group; renamed services `pydantic-agents-workflows-*` → `pydantic-agents-*`. Stays on free tiers (starter web + static + basic-1gb Postgres).
- Updated `README.md` and `.env.example` off the Workflows model.
All "positive changes" from the workflows lineage (HNSW index, per-client history, query-expansion fix, consolidated ingest) were already on `main` and are preserved.
Verification (local, 1339 docs)
- `POST /ask` → 202 (no 503); polled to `done` across all 7 stages with live progress updates.
- `/history` correctly scoped per `client_id` (owner sees it; other clients don't).
- Unknown `run_id` → 404.
- Single-source in-process ingest works (`ingest_pages.py pricing`).
- `uv sync` (render-sdk removed), `compileall` clean, `ruff` shows only the pre-existing `E402` (deliberate `load_dotenv()`-first import order).
🤖 Generated with Claude Code