Fix: handle missing w:styleId in DOCX to prevent KeyError #1417

Ashhhh010101 · 2025-09-16T20:09:24Z

Issue

Some DOCX files may have headings, table cells, or footnotes without a styleId (w:styleId).
The original converter crashes with:

KeyError: 'w:styleId'

Fix

Wrapped Mammoth conversion in try/except to catch missing w:styleId.

If missing, fallback safely and continue conversion.

ct-jaryn · 2025-09-17T09:08:39Z

@MonkeyCode-AI review it

MonkeyCode-AI

已添加对 PR 的评审意见。

packages/markitdown/src/markitdown/converters/_docx_converter.py

MonkeyCode-AI · 2025-09-17T09:10:00Z

packages/markitdown/src/markitdown/converters/_docx_converter.py

+                # Ignore missing style IDs and convert anyway
+                html = mammoth.convert_to_html(pre_process_stream, style_map=style_map, ignore_empty_styles=True).value
+            else:
+                raise


# Patch: handle missing styleId safely try: html = mammoth.convert_to_html(pre_process_stream, style_map=style_map).value except KeyError as e: if str(e) == "'w:styleId'": # Ignore missing style IDs and convert anyway html = mammoth.convert_to_html(pre_process_stream, style_map=style_map, ignore_empty_styles=True).value else: raise

这段代码通过 try/except 块处理了可能因缺少 w:styleId 而导致的 KeyError。这是个不错的防御性编程实践。不过，可以考虑以下几点改进：

更明确的异常处理：检查异常信息是否为 'w:styleId' 可能会因语言环境或 mammoth 版本变化而失效。如果可能，尝试通过检查元素属性来确定是否缺少样式 ID。

日志记录：建议添加一些日志记录，以便在处理异常情况时能够跟踪发生了什么。

文档：虽然代码中有注释，但在 docstring 或开发者文档中描述这种特殊情况的处理方式会更有帮助。

总体来说，这个修复是有效的，提高了代码的健壮性。

Ashhhh010101 · 2025-09-17T18:34:47Z

@microsoft-github-policy-service agree

MonkeyCode-AI · 2025-09-19T09:33:34Z

⏳ MonkeyCode-AI 正在分析，请稍等片刻...

MonkeyCode-AI · 2026-01-06T17:44:03Z

💬 代码审查

评论: 本次改动通过捕获缺失 w:styleId 的 KeyError 并降级重试避免崩溃，但当前实现存在异常判断脆弱、流重试可能失效及依赖参数兼容性风险等问题需完善。

📊 统计

总问题数: 5
🔴 严重: 0
🟡 警告: 3
🔵 建议: 2
⚪ 细节: 0

🟡 警告 (3)

使用 `str(e)` 判断 KeyError 键值不稳健 packages/markitdown/src/markitdown/converters/_docx_converter.py:77

描述: 当前通过 str(e) == "'w:styleId'" 来判定是否为缺失 w:styleId 的 KeyError，这种判断对不同 Python 版本/实现、异常包装层或 Key 的表示形式较脆弱（例如 KeyError 可能在 e.args[0] 中是字符串但 str(e) 可能包含不同引号/格式）。这会导致本应走降级路径的场景未被捕获，从而仍然崩溃。

建议: 改用更稳健的判断方式，例如检查 e.args 或直接比对 key：if e.args and e.args[0] in ('w:styleId', "w:styleId")。同时可考虑捕获 Mammoth 可能抛出的其他异常类型（如 ValueError）并提供更清晰的错误信息。

修复代码:

try:
    html = mammoth.convert_to_html(pre_process_stream, style_map=style_map).value
except KeyError as e:
    key = e.args[0] if e.args else None
    if key == 'w:styleId':
        html = mammoth.convert_to_html(
            pre_process_stream,
            style_map=style_map,
            ignore_empty_styles=True,
        ).value
    else:
        raise

异常分支重试可能复用已被消耗的流且存在重复解析开销 packages/markitdown/src/markitdown/converters/_docx_converter.py:77

描述: 在 except 分支中再次调用 mammoth.convert_to_html(...) 会重复解析/遍历 DOCX，增加不必要的 CPU 开销；更重要的是，pre_process_stream 可能是类文件流对象，第一次失败后其游标位置可能已移动到末尾或中间，第二次调用可能读取不到内容而导致空输出或新的异常（取决于 pre_process_docx 的返回类型及 Mammoth 的读取方式）。

建议: 在重试前确保流可重复读取：优先在进入转换前将 pre_process_stream 读入 bytes 并用 io.BytesIO 包装，或在重试前 seek(0)（前提是可 seek）。也可以考虑只调用一次转换：在预处理阶段修复缺失 styleId 或在 Mammoth 层面通过参数一次性启用容错（若合适）。

修复代码:

import io

pre_process_stream = pre_process_docx(file_stream)
# 确保可重读
if hasattr(pre_process_stream, 'read'):
    data = pre_process_stream.read()
    pre_process_stream = io.BytesIO(data)

try:
    html = mammoth.convert_to_html(pre_process_stream, style_map=style_map).value
except KeyError as e:
    key = e.args[0] if e.args else None
    if key == 'w:styleId':
        pre_process_stream.seek(0)
        html = mammoth.convert_to_html(
            pre_process_stream,
            style_map=style_map,
            ignore_empty_styles=True,
        ).value
    else:
        raise

对 Mammoth 参数 `ignore_empty_styles` 的兼容性与语义风险未验证 packages/markitdown/src/markitdown/converters/_docx_converter.py:77

描述: 使用了 ignore_empty_styles=True 参数，但该参数是否为 Mammoth 公共 API、在项目锁定的 Mammoth 版本中是否存在/语义是否与“缺失 w:styleId”匹配不明确。如果参数不存在会抛出 TypeError，反而引入新的崩溃；即使存在，也可能掩盖其他样式相关问题导致输出质量下降且不易诊断。

建议: 确认并在依赖版本范围内固定/检查该参数：

在 pyproject.toml/requirements 中约束 mammoth 版本。
或用 inspect.signature/hasattr 做兼容性分支。
同时在降级路径记录 warning 日志，便于排查输入文件质量问题。

修复代码:

import inspect

sig = inspect.signature(mammoth.convert_to_html)
kwargs2 = {'style_map': style_map}
if 'ignore_empty_styles' in sig.parameters:
    kwargs2['ignore_empty_styles'] = True
html = mammoth.convert_to_html(pre_process_stream, **kwargs2).value

🔵 建议 (2)

降级转换缺少日志导致问题难以定位 packages/markitdown/src/markitdown/converters/_docx_converter.py:77

描述: 捕获并“忽略”缺失 styleId 后直接继续转换，但没有任何日志/指标记录；当转换结果不符合预期时难以追踪是否发生了降级路径，影响可维护性与可观测性。

建议: 在走降级路径时输出结构化日志（至少 warning 级别），包含文件名/来源（如可用）、异常信息与采取的降级策略；或将该信息附加到转换结果的元数据中（若框架支持）。

修复代码:

import logging
logger = logging.getLogger(__name__)
...
except KeyError as e:
    key = e.args[0] if e.args else None
    if key == 'w:styleId':
        logger.warning('DOCX 缺失 w:styleId，启用 ignore_empty_styles 降级转换')
        ...

异常覆盖面可能不足，与 PR 描述存在偏差 packages/markitdown/src/markitdown/converters/_docx_converter.py:77

描述: PR 描述声称“Wrapped Mammoth conversion in try/except to catch missing w:styleId”，但仅对 KeyError 做了处理；实际可能还会出现 AttributeError/IndexError/ValueError 等与样式缺失相关的异常（取决于 Mammoth 版本和 DOCX 结构），当前实现仍可能在类似坏文件上崩溃。

建议: 根据 issue 中真实样本补充异常类型与回归测试；至少在注释/文档中明确本修复覆盖范围仅限 KeyError('w:styleId')，避免误导。

MonkeyCode-AI · 2026-01-07T18:16:50Z

我是 MonkeyCode AI 编程助手，你可以在 GitHub 任意仓库的任意 Issue 或 PR 中 at @MonkeyCode-AI 来呼唤我。

任务执行细节请参考: https://monkeycode-ai.com

❌ 代码审查

需要修改: 本次改动通过捕获 KeyError 试图兜底缺失 w:styleId，但引入了对依赖参数兼容性和可重试流的潜在崩溃风险，需补齐兼容与测试后再合并。

📊 统计

总问题数: 5
🔴 严重: 1
🟡 警告: 2
🔵 建议: 2
⚪ 细节: 0

🔴 严重 (1)

新增参数 ignore_empty_styles 可能导致依赖不兼容 packages/markitdown/src/markitdown/converters/_docx_converter.py:88-91

描述: 使用 ignore_empty_styles=True 依赖 Mammoth 的特定版本/参数支持。若项目锁定的 Mammoth 版本不支持该参数，会抛出 TypeError: unexpected keyword argument，从而在某些环境中引入新崩溃。

建议: 在 pyproject/requirements 中明确提升并锁定支持该参数的 Mammoth 版本；或在代码中兼容处理（捕获 TypeError 并采用其它降级策略，如去掉 style_map 重试、或直接跳过样式处理）。同时补充回归测试覆盖不同 Mammoth 版本行为（如通过依赖约束确保一致）。

修复代码:

try:
    html = mammoth.convert_to_html(pre_stream, style_map=style_map).value
except KeyError as e:
    if (e.args and e.args[0] == "w:styleId"):
        pre_stream.seek(0)
        try:
            html = mammoth.convert_to_html(pre_stream, style_map=style_map, ignore_empty_styles=True).value
        except TypeError:
            # 兼容旧版 mammoth：退化策略（示例）
            pre_stream.seek(0)
            html = mammoth.convert_to_html(pre_stream, style_map=style_map).value
    else:
        raise

🟡 警告 (2)

异常匹配方式脆弱，可能无法正确兜底 packages/markitdown/src/markitdown/converters/_docx_converter.py:83-91

描述: 通过 try/except KeyError 并比较 str(e) == "'w:styleId'" 来分支处理不够稳健：不同 Python 版本/实现或上游库抛错方式可能导致字符串表现不同；并且 KeyError 的 args[0] 才是更可靠的键名。此处可能导致未命中分支而仍然崩溃。

建议: 用 e.args 判断键名，例如 if e.args and e.args[0] in ('w:styleId', '{...}styleId'):；或更保守地对 Mammoth 转换阶段捕获更明确的异常类型/错误信息，并记录日志。

修复代码:

try:
    html = mammoth.convert_to_html(pre_process_stream, style_map=style_map).value
except KeyError as e:
    key = e.args[0] if e.args else None
    if key == "w:styleId":
        html = mammoth.convert_to_html(
            pre_process_stream,
            style_map=style_map,
            ignore_empty_styles=True,
        ).value
    else:
        raise

重试时可能复用已消费的流，导致兜底失败 packages/markitdown/src/markitdown/converters/_docx_converter.py:83-91

描述: 在捕获到 w:styleId 缺失后再次调用 mammoth.convert_to_html(...)，会对同一个 pre_process_stream 做二次解析。若 pre_process_stream 是已被消费的流（file-like object），第一次失败可能已经移动了指针，第二次调用可能得到空内容或产生其它异常，导致兜底逻辑失效。

建议: 确保传入 Mammoth 的对象可重复读取：在第一次调用前缓存 bytes（pre_process_stream.read()）并用 io.BytesIO 包装，或在重试前 seek(0)（若支持）。更佳方式是始终使用可重入的 bytes 缓冲，避免二次调用因流状态变化而出错。

修复代码:

import io

pre = pre_process_docx(file_stream)
# 确保可重试
if hasattr(pre, "read"):
    data = pre.read()
    pre_stream = io.BytesIO(data)
else:
    pre_stream = io.BytesIO(pre)

try:
    html = mammoth.convert_to_html(pre_stream, style_map=style_map).value
except KeyError as e:
    key = e.args[0] if e.args else None
    if key == "w:styleId":
        pre_stream.seek(0)
        html = mammoth.convert_to_html(pre_stream, style_map=style_map, ignore_empty_styles=True).value
    else:
        raise

🔵 建议 (2)

缺少日志与可观测性，静默降级可能掩盖质量问题 packages/markitdown/src/markitdown/converters/_docx_converter.py:83-91

描述: 当前兜底仅针对 w:styleId 的 KeyError，未记录任何上下文信息（文件、位置、转换选项），线上排查会困难；且“静默忽略”可能掩盖 DOCX 结构问题或样式丢失，导致结果质量下降但不可见。

建议: 增加日志（debug/warn）说明发生了样式缺失并启用了降级；必要时把是否降级作为可配置项（例如 strict_styles）。如果有监控/错误上报，建议上报一次性事件而非完全吞掉。

修复代码:

import logging
logger = logging.getLogger(__name__)
...
except KeyError as e:
    key = e.args[0] if e.args else None
    if key == "w:styleId":
        logger.warning("DOCX 缺少 w:styleId，已启用 ignore_empty_styles 降级转换")
        ...

缺少针对缺失 w:styleId 场景的回归测试 packages/markitdown/src/markitdown/converters/_docx_converter.py:77-91

描述: 缺少单元/回归测试来覆盖：1) 缺失 w:styleId 的 DOCX 不再崩溃；2) 正常 DOCX 行为不变；3) 流可重试逻辑；4) 依赖版本不支持参数时的兼容路径（如采用兼容策略）。没有测试容易在后续重构或依赖升级时回归。

建议: 添加最小化的 DOCX fixture（可用 python-docx 生成或存放精简样例）触发缺失 styleId；断言转换不抛异常且输出包含预期内容。对异常分支加测试以锁定行为。

Fix: handle missing w:styleId in DOCX to prevent KeyError

b86f139

MonkeyCode-AI reviewed Sep 17, 2025

View reviewed changes

Ashhhh010101 added 2 commits November 17, 2025 01:50

Merge branch 'main' into fix-docx-styleid-crash

0899019

Merge branch 'main' into fix-docx-styleid-crash

45dc1d5

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix: handle missing w:styleId in DOCX to prevent KeyError #1417

Fix: handle missing w:styleId in DOCX to prevent KeyError #1417

Ashhhh010101 commented Sep 16, 2025

Uh oh!

ct-jaryn commented Sep 17, 2025

Uh oh!

MonkeyCode-AI left a comment

Uh oh!

Uh oh!

MonkeyCode-AI Sep 17, 2025

Uh oh!

Ashhhh010101 commented Sep 17, 2025

Uh oh!

MonkeyCode-AI commented Sep 19, 2025

Uh oh!

MonkeyCode-AI commented Jan 6, 2026

Uh oh!

MonkeyCode-AI commented Jan 7, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Fix: handle missing w:styleId in DOCX to prevent KeyError #1417

Are you sure you want to change the base?

Fix: handle missing w:styleId in DOCX to prevent KeyError #1417

Conversation

Ashhhh010101 commented Sep 16, 2025

Issue

Fix

Uh oh!

ct-jaryn commented Sep 17, 2025

Uh oh!

MonkeyCode-AI left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

MonkeyCode-AI Sep 17, 2025

Choose a reason for hiding this comment

Uh oh!

Ashhhh010101 commented Sep 17, 2025

Uh oh!

MonkeyCode-AI commented Sep 19, 2025

Uh oh!

MonkeyCode-AI commented Jan 6, 2026

💬 代码审查

📊 统计

🟡 警告 (3)

🔵 建议 (2)

Uh oh!

MonkeyCode-AI commented Jan 7, 2026

❌ 代码审查

📊 统计

🔴 严重 (1)

🟡 警告 (2)

🔵 建议 (2)

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants