Skip to content

fix(extract): emit a block per cell paragraph for table contents (SD-2653)#2907

Closed
caio-pizzol wants to merge 4 commits intomainfrom
caio/sd-2653-bug-docextract-flattens-table-cells-into-a-single
Closed

fix(extract): emit a block per cell paragraph for table contents (SD-2653)#2907
caio-pizzol wants to merge 4 commits intomainfrom
caio/sd-2653-bug-docextract-flattens-table-cells-into-a-single

Conversation

@caio-pizzol
Copy link
Copy Markdown
Contributor

doc.extract() previously returned every table as a single block whose text concatenated every cell with no separators ("FlatsCurvesAnglesSquare404Circle010..."), and the table's nodeId wouldn't resolve through scrollToElement. That left tables unusable for RAG chunking and undetectable for navigation. Extract now recurses into each cell and emits its paragraphs as their own blocks, each carrying a new tableContext: { tableNodeId, rowIndex, colIndex }. Paragraph paraIds round-trip through DOCX export, so emitted nodeIds are stable across reopen — which is the property callers actually need.

  • The opaque type: "table" block is gone. Callers that need "which table did this come from" use tableContext.tableNodeId.
  • colIndex uses TableMap so merged cells (gridSpan) report the correct logical column.
  • SDT / content-control wrappers inside cells recurse transparently — their inner paragraphs keep tableContext.

Verified: pnpm --filter @superdoc/document-api test → 1362 pass. pnpm exec playwright test tests/navigation/extract.spec.ts --project=chromium → 15 pass (7 SD-2525 + 8 new SD-2653 covering multi-paragraph cells, empty cells, headings in cells, nested tables, and a DOCX-imported fixture from the customer file).

Review: focus on collectFromTable + emitFromContainer correctness (position math, TableMap column lookup, transparent-wrapper recursion). Ignore collectComments / collectTrackedChanges — not touched.

doc.extract() previously returned every table as a single block with every
cell's text concatenated and no separators, and the table's nodeId did not
navigate. Recurse into tables so each paragraph inside a cell becomes its
own ExtractBlock carrying a new tableContext { tableNodeId, rowIndex,
colIndex }. Paragraph paraIds round-trip through DOCX export, so the IDs
are stable — which is the property callers need for RAG.

- colIndex uses TableMap so merged cells (gridSpan) report correct logical
  columns.
- SDT / content-control wrappers inside cells are recursed transparently,
  keeping their inner paragraphs addressable.

SD-2653
@linear
Copy link
Copy Markdown

linear Bot commented Apr 22, 2026

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5e951f2a77

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

for (let i = 0; i < candidates.length; i++) {
const candidate = candidates[i];
if (candidate.nodeType === 'table') {
collectFromTable(candidate.node, candidate.pos, [i], candidate.nodeId, blocks);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Pass canonical top-level path into collectFromTable

collectTopLevelBlocks() resolves node IDs using the real document child index, but this call passes [i] where i is only the index in the filtered candidates array. When unsupported top-level nodes (for example passthroughBlock) appear before a table, that path diverges from the canonical traversal path used by resolveBlockNodeId/buildBlockIndex, so fallback IDs (notably table IDs derived from volatile sdBlockIds) are hashed from the wrong path and may no longer resolve through scrollToElement/block lookup.

Useful? React with 👍 / 👎.

@codecov-commenter
Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

@github-actions
Copy link
Copy Markdown
Contributor

📦 Preview published: superdoc@1.27.0-pr.2907.1776884943

npm install superdoc@pr-2907

Caio Pizzol added 3 commits April 22, 2026 17:09
… contract sync

Three follow-up fixes from code review of the initial SD-2653 commit.

- **Top-level SDT transparency.** Content-control wrappers (`sdt`,
  `structuredContentBlock`) at the document root were still flattened to
  one opaque block via `textContent`, reintroducing the exact concatenation
  bug for form-style templates. `collectBlocks` now treats them the same
  way `emitFromContainer` already did inside tables, recursing so inner
  paragraphs emit as their own blocks with stable IDs. New behavior test.

- **Merge info on `tableContext`.** Adds optional `colspan` / `rowspan`
  so callers can reconstruct row layout and distinguish "absent because
  merged" from "absent because empty" under `gridSpan`.

- **Contract / schema sync.** Schemas and operation definitions still
  described the pre-SD-2653 shape (referenced `type: "table"` blocks,
  omitted `tableContext`). Regenerated reference docs + SDK codegen so the
  published docs, tool catalogs, and SDK intent-dispatch match runtime.

Common-workflows RAG page now documents the table-extraction shape with
a row-grouping example. All 16 behavior tests and 1362 document-api unit
tests pass.

SD-2653
…ndex

collectBlocks iterated the filtered candidate list from collectTopLevelBlocks
and passed that index as the outer traversal path. When an unsupported
top-level node (e.g. passthroughBlock) sits before a table, the filtered
index diverges from the raw PM child index that buildBlockIndex and the
block-address resolver use. Any fallback-derived ID hashed from the wrong
path would then miss lookups through scrollToElement. Narrow in practice
(paraId/sdBlockId cover almost everything) but a real latent inconsistency
flagged by code review.

Walk editor.state.doc.child(i) directly so the outer path is always the
canonical PM index.

SD-2653
The previous commits added SDT-wrapper transparency, merged-cell coordinate
handling, colspan/rowspan fields, and the emitFromContainer helper. None of
those are in the customer's file; all were "just in case" coverage for
scenarios real users haven't reported yet.

Strip back to: walk tables, emit each cell's paragraphs with
{tableNodeId, rowIndex, colIndex}. Nested tables recurse because they're
literally the same bug one level deeper.

Everything else becomes its own follow-up when a real customer reports it:

- SDT / content-control wrappers emit as one opaque block (known limitation)
- colspan/rowspan not surfaced; colIndex is the physical child index
- TableMap logical-column computation removed

Reference doc + schema regenerated to match.

SD-2653
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants