forked from Skybound-Logic/pdf2md
Open
Labels
documentation (Improvements or additions to documentation), enhancement (New feature or request)
Description
Goal
Make output checking repeatable and cheap so we can confidently say what improved/regressed (esp. embedded criteria: no hallucinations + table correctness).
We already have:
- `VALIDATION_PLAN.md` (high-level criteria)
- `--pages` for cheap iteration
- `scripts/datasheets_convert.sh` PAGE TEST MODE that writes to `outputs/page_tests/...` (git-ignored)
This issue is to add a standard, runnable check that we run after every conversion (full or page-slice).
Proposal
- Add a script, `scripts/validate_datasheet.py` (or `scripts/validate_output.py`), that can validate either:
  - a datasheets canonical output folder (e.g. `datasheets/manufacturers/nxp/AN13917/outputs/pdf2md/auto`)
  - a page test folder (e.g. `outputs/page_tests/nxp/AN13917/outputs/pdf2md/auto`)
- Document the minimal manual checks (tables vs cropped images) with a small sampling checklist.
Automated checks (script)
Given an output dir containing index.md + section .md files + optional images/:
- Link integrity:
  - Parse `index.md` links and assert targets exist
- Red flags scan (fail if found unless explicitly allowed):
  - `LLM_API_ERROR`
  - `[ERROR:`
  - `<<VERBATIM_TABLE_` placeholders (should never remain in final output)
- Strict-mode reporting:
  - Count files containing `[STRICT_MODE:`
  - For each, surface the `New technical tokens detected:` line
- HTML entity leakage:
  - Count occurrences of `&#` and `&#\d+;` in `.md`
- Table-first presence:
  - Count `<!-- VERBATIM_TABLE_START -->` / `<!-- VERBATIM_TABLE_END -->` blocks
  - Optionally verify each verbatim block contains at least one markdown table row delimiter (`|---`) OR at least one `./images/page_*_table_*.png` reference
- Output summary report:
  - Print totals: sections, images, verbatim tables, strict rejections, failures
  - Exit code non-zero on failures
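A rough, stdlib-only sketch of how the checks above could look, to make the proposal concrete. The function names (`broken_links`, `scan_text`), the result keys, and the link regex are assumptions for illustration, not an agreed design. The markers and the `|---` delimiter are copied from the list above.

```python
import re
from pathlib import Path
from urllib.parse import unquote

# Markers copied from the checklist above; any red-flag hit counts as a failure.
RED_FLAGS = ("LLM_API_ERROR", "[ERROR:", "<<VERBATIM_TABLE_")
STRICT_MARKER = "[STRICT_MODE:"
TOKENS_PREFIX = "New technical tokens detected:"
ENTITY_RE = re.compile(r"&#\d+;")
BLOCK_RE = re.compile(
    r"<!-- VERBATIM_TABLE_START -->(.*?)<!-- VERBATIM_TABLE_END -->", re.DOTALL
)
TABLE_IMAGE_RE = re.compile(r"\./images/page_[^/\s)]*_table_[^/\s)]*\.png")
# Assumption: index.md uses inline markdown links of the form [text](target).
LINK_RE = re.compile(r"\[[^\]]*\]\(([^)\s]+)\)")


def broken_links(output_dir):
    """Return link targets in index.md that do not exist under output_dir."""
    output_dir = Path(output_dir)
    text = (output_dir / "index.md").read_text(encoding="utf-8")
    missing = []
    for target in LINK_RE.findall(text):
        # Skip external URLs (http:, mailto:, ...) and in-page anchors.
        if target.startswith("#") or re.match(r"[a-zA-Z][a-zA-Z0-9+.-]*:", target):
            continue
        relative = unquote(target.split("#", 1)[0])
        if not (output_dir / relative).exists():
            missing.append(target)
    return missing


def scan_text(text):
    """Run the per-file content checks on one markdown file's text."""
    blocks = BLOCK_RE.findall(text)
    return {
        "red_flags": [flag for flag in RED_FLAGS if flag in text],
        "strict": STRICT_MARKER in text,
        "token_lines": [
            line.strip() for line in text.splitlines() if TOKENS_PREFIX in line
        ],
        "entities_raw": text.count("&#"),
        "entities_numeric": len(ENTITY_RE.findall(text)),
        "tables": len(blocks),
        # A block with neither a table delimiter nor a table image is suspect.
        "suspect_tables": sum(
            1 for b in blocks if "|---" not in b and not TABLE_IMAGE_RE.search(b)
        ),
    }
```

The real script would call `scan_text` on every `.md` file in the output dir, sum the counts into the summary report, and treat any non-empty `red_flags` or `broken_links` result as a failure.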
Manual sampling checks (doc)
- Pick 3–5 tables (units-heavy + mapping-heavy)
- For each table block, open the referenced `page_*_table_*.png` and verify:
  - row/col count
  - headers
  - a few numeric cells + units
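To keep the manual sampling repeatable, a small helper could list which table images to open. This is only a sketch: `table_samples` is a hypothetical name, it assumes section `.md` files sit directly in the output dir, and it picks at random with a fixed seed, so the reviewer still has to steer the selection toward units-heavy and mapping-heavy tables.

```python
import random
import re
from pathlib import Path

BLOCK_RE = re.compile(
    r"<!-- VERBATIM_TABLE_START -->(.*?)<!-- VERBATIM_TABLE_END -->", re.DOTALL
)
IMAGE_RE = re.compile(r"images/(page_[^/\s)]*_table_[^/\s)]*\.png)")


def table_samples(output_dir, k=5, seed=0):
    """Return up to k (markdown file name, table image name) pairs to review."""
    candidates = []
    for md in sorted(Path(output_dir).glob("*.md")):
        text = md.read_text(encoding="utf-8")
        # Only images referenced inside verbatim table blocks are candidates.
        for block in BLOCK_RE.findall(text):
            for image in IMAGE_RE.findall(block):
                candidates.append((md.name, image))
    rng = random.Random(seed)  # fixed seed so reruns pick the same tables
    return sorted(rng.sample(candidates, min(k, len(candidates))))
```

Printing the returned pairs next to the checklist gives each reviewer the same set of tables for a given output dir.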
CLI
Example usage:
python scripts/validate_datasheet.py --output-dir datasheets/manufacturers/nxp/AN13917/outputs/pdf2md/auto
python scripts/validate_datasheet.py --output-dir outputs/page_tests/nxp/AN13917/outputs/pdf2md/auto
Acceptance criteria
- Script exists and runs in a fresh venv (no extra deps beyond stdlib)
- Link integrity check works
- Fails on leftover placeholders (`<<VERBATIM_TABLE_..>>`)
- Reports strict-mode rejections and their token diffs
- Reports HTML entity leakage counts
- Produces a concise summary and non-zero exit on failure
- README (or `VALIDATION_PLAN.md`) includes the exact command(s) to run
Notes
- Default behavior should be conservative: fail on obvious corruption, leftover placeholders, and error markers; treat strict-mode rejections as warnings (configurable).
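One way the conservative default could map onto the CLI, as a sketch only. `--output-dir` comes from the example usage above. `--strict-as-error` and `--allow` are hypothetical flag names made up here to illustrate the "configurable" and "unless explicitly allowed" parts.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(description="Validate pdf2md output")
    parser.add_argument("--output-dir", required=True)
    # Hypothetical flag: promote strict-mode rejections from warning to failure.
    parser.add_argument("--strict-as-error", action="store_true")
    # Hypothetical flag: red-flag markers to tolerate, repeatable.
    parser.add_argument("--allow", action="append", default=[], metavar="MARKER")
    return parser


def exit_code(failures, strict_rejections, strict_as_error=False):
    """Return 1 on any failure; strict rejections only fail when promoted."""
    if failures > 0:
        return 1
    if strict_as_error and strict_rejections > 0:
        return 1
    return 0
```

With this shape, a plain run fails only on corruption, placeholders, and error markers, while CI could opt into the stricter policy by passing the extra flag.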