
Add repeatable output validation checklist + validator script (datasheets/page_tests) #2

@ajlennon

Goal

Make output checking repeatable and cheap so we can say with confidence what improved or regressed, especially against the embedded criteria: no hallucinations and table correctness.

We already have:

  • VALIDATION_PLAN.md (high-level criteria)
  • --pages for cheap iteration
  • PAGE TEST MODE in scripts/datasheets_convert.sh, which writes to outputs/page_tests/... (git-ignored)

This issue is to add a standard, runnable check that we run after every conversion (full or page-slice).

Proposal

  1. Add a script: scripts/validate_datasheet.py (or scripts/validate_output.py) that can validate either:
  • a datasheets canonical output folder (e.g. datasheets/manufacturers/nxp/AN13917/outputs/pdf2md/auto)
  • a page test folder (e.g. outputs/page_tests/nxp/AN13917/outputs/pdf2md/auto)
  2. Document the minimal manual checks (tables vs cropped images) with a small sampling checklist.

Automated checks (script)

Given an output dir containing index.md + section .md files + optional images/:

  • Link integrity:

    • Parse index.md links and assert targets exist
  • Red flags scan (fail if found unless explicitly allowed):

    • LLM_API_ERROR
    • [ERROR:
    • <<VERBATIM_TABLE_ placeholders (should never remain in final output)
  • Strict-mode reporting:

    • Count files containing [STRICT_MODE:
    • For each, surface the "New technical tokens detected:" line
  • HTML entity leakage:

    • Count occurrences of &amp;# and &#\d+; in .md
  • Table-first presence:

    • Count <!-- VERBATIM_TABLE_START --> / <!-- VERBATIM_TABLE_END --> blocks
    • Optionally verify each verbatim block contains at least one markdown table row delimiter (|---) OR at least one ./images/page_*_table_*.png reference
  • Output summary report:

    • Print totals: sections, images, verbatim tables, strict rejections, failures
    • Exit code non-zero on failures
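
The content scans listed above could be sketched roughly as follows. This is a minimal, stdlib-only sketch rather than the final script: the function names (`scan_markdown`, `scan_dir`), the result keys, the assumption that all `.md` files sit directly in the output dir, and the choice to treat unbalanced verbatim markers as a failure are all illustrative and not part of the issue.

```python
import re
from pathlib import Path

# Marker strings come from the checklist above; everything else is an assumption.
RED_FLAGS = ("LLM_API_ERROR", "[ERROR:", "<<VERBATIM_TABLE_")
ENTITY_RE = re.compile(r"&amp;#|&#\d+;")
STRICT_MARK = "[STRICT_MODE:"
TOKENS_PREFIX = "New technical tokens detected:"
TABLE_START = "<!-- VERBATIM_TABLE_START -->"
TABLE_END = "<!-- VERBATIM_TABLE_END -->"


def scan_markdown(text: str) -> dict:
    """Return the findings for the contents of one Markdown file."""
    return {
        "red_flags": [flag for flag in RED_FLAGS if flag in text],
        "strict": STRICT_MARK in text,
        "token_lines": [
            line.strip() for line in text.splitlines() if TOKENS_PREFIX in line
        ],
        "entities": len(ENTITY_RE.findall(text)),
        "table_starts": text.count(TABLE_START),
        "table_ends": text.count(TABLE_END),
    }


def scan_dir(output_dir: Path) -> dict:
    """Aggregate scan_markdown() over every .md file directly in output_dir."""
    totals = {"md_files": 0, "strict_files": 0, "entities": 0,
              "verbatim_tables": 0, "failures": []}
    for md in sorted(output_dir.glob("*.md")):
        result = scan_markdown(md.read_text(encoding="utf-8"))
        totals["md_files"] += 1
        totals["strict_files"] += int(result["strict"])
        totals["entities"] += result["entities"]
        # Count only complete START/END pairs as verbatim tables.
        totals["verbatim_tables"] += min(result["table_starts"], result["table_ends"])
        for flag in result["red_flags"]:
            totals["failures"].append(f"{md.name}: found {flag!r}")
        if result["table_starts"] != result["table_ends"]:
            # Assumption: an unmatched marker is treated as a failure.
            totals["failures"].append(f"{md.name}: unbalanced verbatim table markers")
    return totals
```

Keeping `scan_markdown` a pure function of the text makes each check easy to unit-test without touching the filesystem; `scan_dir` only adds the file walk and the totals needed for the summary report.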

Manual sampling checks (doc)

  • Pick 3–5 tables (units-heavy + mapping-heavy)
  • For each table block, open the referenced page_*_table_*.png and verify:
    • row/col count
    • headers
    • a few numeric cells + units
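
To make the manual sampling reproducible, the script could also list which table blocks to inspect. The sketch below is one possible helper, not something the issue specifies: `table_blocks`, `pick_samples`, the seeded random choice, and the exact image-path regex (derived from the `./images/page_*_table_*.png` pattern) are assumptions.

```python
import random
import re

BLOCK_RE = re.compile(
    r"<!-- VERBATIM_TABLE_START -->(.*?)<!-- VERBATIM_TABLE_END -->", re.DOTALL
)
# Assumed concrete form of the ./images/page_*_table_*.png pattern.
IMAGE_RE = re.compile(r"\./images/page_\w+?_table_\w+\.png")


def table_blocks(text):
    """Return (block_body, [image_paths]) for each verbatim table block."""
    return [
        (match.group(1), IMAGE_RE.findall(match.group(1)))
        for match in BLOCK_RE.finditer(text)
    ]


def pick_samples(blocks, k=5, seed=0):
    """Choose up to k blocks that reference a cropped image, reproducibly."""
    with_images = [block for block in blocks if block[1]]
    rng = random.Random(seed)  # fixed seed so two reviewers get the same sample
    return rng.sample(with_images, min(k, len(with_images)))
```

A random sample will not guarantee the units-heavy and mapping-heavy mix asked for above, so the reviewer would still swap in specific tables by hand where needed.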

CLI

Example usage:

python scripts/validate_datasheet.py --output-dir datasheets/manufacturers/nxp/AN13917/outputs/pdf2md/auto
python scripts/validate_datasheet.py --output-dir outputs/page_tests/nxp/AN13917/outputs/pdf2md/auto
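
A possible shape for the entry point behind these commands, using only `argparse` from the stdlib. Only `--output-dir` comes from the examples above; `build_parser`, `main`, and the specific exit codes are assumptions, and the individual checks are left out of this sketch.

```python
import argparse
import sys
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        description="Validate a pdf2md output folder (canonical or page test)."
    )
    # --output-dir is the flag used in the example commands.
    parser.add_argument(
        "--output-dir", required=True, type=Path,
        help="folder containing index.md and the section .md files",
    )
    return parser


def main(argv=None) -> int:
    """Return 0 when the layout looks valid, 1 otherwise."""
    args = build_parser().parse_args(argv)
    index = args.output_dir / "index.md"
    if not index.is_file():
        print(f"FAIL: {index} not found", file=sys.stderr)
        return 1
    # The individual checks would run here; this sketch only verifies the layout.
    print(f"OK: found {index}")
    return 0
```

In the real script this would be wired up with `raise SystemExit(main())` under an `if __name__ == "__main__":` guard, so the return value becomes the process exit code.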

Acceptance criteria

  • Script exists and runs in a fresh venv (no extra deps beyond stdlib)
  • Link integrity check works
  • Fails on leftover placeholders (<<VERBATIM_TABLE_..>>)
  • Reports strict-mode rejections and their token diffs
  • Reports HTML entity leakage counts
  • Produces a concise summary and non-zero exit on failure
  • README (or VALIDATION_PLAN.md) includes the exact command(s) to run
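
For the link integrity criterion, one way to check that every relative link in index.md resolves is sketched below. The function name `missing_link_targets`, the inline-link regex, and the decision to skip external URLs and in-page anchors are assumptions; reference-style Markdown links are not handled.

```python
import re
from pathlib import Path
from urllib.parse import unquote, urlsplit

# Matches inline links and images: [text](target) or [text](target "title").
LINK_RE = re.compile(r"\[[^\]]*\]\(([^)\s]+)[^)]*\)")


def missing_link_targets(output_dir: Path) -> list:
    """Return relative link targets in index.md that do not exist on disk."""
    text = (output_dir / "index.md").read_text(encoding="utf-8")
    missing = []
    for raw in LINK_RE.findall(text):
        parts = urlsplit(raw)
        if parts.scheme or parts.netloc or not parts.path:
            continue  # external URL or in-page anchor: out of scope here
        # Drop any #fragment and decode %20-style escapes before checking.
        if not (output_dir / unquote(parts.path)).exists():
            missing.append(raw)
    return missing
```

An empty return value means the check passes; any entries would be reported as failures in the summary.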

Notes

  • Default behavior should be conservative: fail on obvious corruption, leftover placeholders, and error markers; treat strict-mode rejections as warnings (configurable).
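
The fail-versus-warn policy could be kept in one small function so the default stays conservative and the override is explicit. Everything named here (`Finding`, `HARD_FAIL_KINDS`, the kind strings, and the `strict_is_error` switch) is hypothetical and only illustrates the idea.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    kind: str      # hypothetical labels, e.g. "placeholder", "strict_mode"
    message: str


# Assumed mapping: these kinds always fail the run.
HARD_FAIL_KINDS = {"placeholder", "llm_error", "error_marker", "broken_link"}


def exit_code(findings, strict_is_error=False) -> int:
    """Return 1 if any finding should fail the run, else 0."""
    for finding in findings:
        if finding.kind in HARD_FAIL_KINDS:
            return 1
        # Strict-mode rejections only fail when explicitly configured to.
        if finding.kind == "strict_mode" and strict_is_error:
            return 1
    return 0
```

With this split, a strict-mode rejection still shows up in the report by default but does not change the exit code unless the caller opts in.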

Labels: documentation, enhancement
