fix(extract): emit a block per cell paragraph for table contents (SD-2653)#2907
fix(extract): emit a block per cell paragraph for table contents (SD-2653)#2907caio-pizzol wants to merge 4 commits intomainfrom
Conversation
doc.extract() previously returned every table as a single block with every
cell's text concatenated and no separators, and the table's nodeId did not
navigate. Recurse into tables so each paragraph inside a cell becomes its
own ExtractBlock carrying a new tableContext { tableNodeId, rowIndex,
colIndex }. Paragraph paraIds round-trip through DOCX export, so the IDs
are stable — which is the property callers need for RAG.
- colIndex uses TableMap so merged cells (gridSpan) report correct logical
columns.
- SDT / content-control wrappers inside cells are recursed transparently,
keeping their inner paragraphs addressable.
SD-2653
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 5e951f2a77
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| for (let i = 0; i < candidates.length; i++) { | ||
| const candidate = candidates[i]; | ||
| if (candidate.nodeType === 'table') { | ||
| collectFromTable(candidate.node, candidate.pos, [i], candidate.nodeId, blocks); |
There was a problem hiding this comment.
Pass canonical top-level path into collectFromTable
collectTopLevelBlocks() resolves node IDs using the real document child index, but this call passes [i] where i is only the index in the filtered candidates array. When unsupported top-level nodes (for example passthroughBlock) appear before a table, that path diverges from the canonical traversal path used by resolveBlockNodeId/buildBlockIndex, so fallback IDs (notably table IDs derived from volatile sdBlockIds) are hashed from the wrong path and may no longer resolve through scrollToElement/block lookup.
Useful? React with 👍 / 👎.
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
|
📦 Preview published: npm install superdoc@pr-2907 |
… contract sync Three follow-up fixes from code review of the initial SD-2653 commit. - **Top-level SDT transparency.** Content-control wrappers (`sdt`, `structuredContentBlock`) at the document root were still flattened to one opaque block via `textContent`, reintroducing the exact concatenation bug for form-style templates. `collectBlocks` now treats them the same way `emitFromContainer` already did inside tables, recursing so inner paragraphs emit as their own blocks with stable IDs. New behavior test. - **Merge info on `tableContext`.** Adds optional `colspan` / `rowspan` so callers can reconstruct row layout and distinguish "absent because merged" from "absent because empty" under `gridSpan`. - **Contract / schema sync.** Schemas and operation definitions still described the pre-SD-2653 shape (referenced `type: "table"` blocks, omitted `tableContext`). Regenerated reference docs + SDK codegen so the published docs, tool catalogs, and SDK intent-dispatch match runtime. Common-workflows RAG page now documents the table-extraction shape with a row-grouping example. All 16 behavior tests and 1362 document-api unit tests pass. SD-2653
…ndex collectBlocks iterated the filtered candidate list from collectTopLevelBlocks and passed that index as the outer traversal path. When an unsupported top-level node (e.g. passthroughBlock) sits before a table, the filtered index diverges from the raw PM child index that buildBlockIndex and the block-address resolver use. Any fallback-derived ID hashed from the wrong path would then miss lookups through scrollToElement. Narrow in practice (paraId/sdBlockId cover almost everything) but a real latent inconsistency flagged by code review. Walk editor.state.doc.child(i) directly so the outer path is always the canonical PM index. SD-2653
The previous commits added SDT-wrapper transparency, merged-cell coordinate
handling, colspan/rowspan fields, and the emitFromContainer helper. None of
those are in the customer's file; all were "just in case" coverage for
scenarios real users haven't reported yet.
Strip back to: walk tables, emit each cell's paragraphs with
{tableNodeId, rowIndex, colIndex}. Nested tables recurse because they're
literally the same bug one level deeper.
Everything else becomes its own follow-up when a real customer reports it:
- SDT / content-control wrappers emit as one opaque block (known limitation)
- colspan/rowspan not surfaced; colIndex is the physical child index
- TableMap logical-column computation removed
Reference doc + schema regenerated to match.
SD-2653
doc.extract()previously returned every table as a single block whose text concatenated every cell with no separators ("FlatsCurvesAnglesSquare404Circle010..."), and the table'snodeIdwouldn't resolve throughscrollToElement. That left tables unusable for RAG chunking and undetectable for navigation. Extract now recurses into each cell and emits its paragraphs as their own blocks, each carrying a newtableContext: { tableNodeId, rowIndex, colIndex }. ParagraphparaIds round-trip through DOCX export, so emitted nodeIds are stable across reopen — which is the property callers actually need.type: "table"block is gone. Callers that need "which table did this come from" usetableContext.tableNodeId.colIndexusesTableMapso merged cells (gridSpan) report the correct logical column.tableContext.Verified:
pnpm --filter @superdoc/document-api test→ 1362 pass.pnpm exec playwright test tests/navigation/extract.spec.ts --project=chromium→ 15 pass (7 SD-2525 + 8 new SD-2653 covering multi-paragraph cells, empty cells, headings in cells, nested tables, and a DOCX-imported fixture from the customer file).Review: focus on
collectFromTable+emitFromContainercorrectness (position math, TableMap column lookup, transparent-wrapper recursion). IgnorecollectComments/collectTrackedChanges— not touched.