fix: handle deeply nested HTML that triggers RecursionError by jigangz · Pull Request #1644 · microsoft/markitdown

jigangz · 2026-03-28T05:57:00Z

Summary

Fix large HTML files (>3MB) with deep DOM nesting silently returning unconverted HTML instead of markdown.

Problem

When converting deeply nested HTML documents (e.g., SEC EDGAR filings like Tesla's DEF 14A proxy statement), markdownify's recursive DOM traversal exceeds Python's default recursion limit (1000 frames, reached at roughly 400 nesting levels). The RecursionError is caught by the top-level _convert() dispatcher's except Exception block, and the request falls through to PlainTextConverter, which returns the raw HTML as-is with no error or warning.

Root cause chain:

HtmlConverter.convert() → markdownify.convert_soup() (recursive traversal)
Deep nesting (>~400 levels) → RecursionError
_convert() catches it via except Exception, stores in failed_attempts
PlainTextConverter.accepts() matches text/html via text/ prefix → true
PlainTextConverter.convert() returns raw HTML bytes as text
Caller gets "markdown" that is actually unconverted HTML
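The chain above can be reproduced in miniature without markitdown or markdownify. The sketch below is only an illustration under stated assumptions: `make_document`, `recursive_walk`, and the two toy converters are hypothetical stand-ins, not the library's code. It shows how a broad `except Exception` in a dispatcher turns a RecursionError into a silent fall-through.

```python
import sys


def make_document(depth):
    """Return (raw_html, toy_dom) nested `depth` levels deep.

    The toy DOM is a list-of-lists with a string leaf; it stands in for a parsed tree.
    """
    raw_html = "<div>" * depth + "leaf text" + "</div>" * depth
    dom = "leaf text"
    for _ in range(depth):
        dom = [dom]
    return raw_html, dom


def recursive_walk(node):
    """Stand-in for a recursive DOM traversal: call depth grows with nesting depth."""
    if isinstance(node, str):
        return node
    return "".join(recursive_walk(child) for child in node)


def html_converter(raw_html, dom):
    return recursive_walk(dom)


def plain_text_converter(raw_html, dom):
    # Accepts anything textual and returns it untouched.
    return raw_html


def convert(raw_html, dom):
    """Try converters in order; the broad except clause hides RecursionError."""
    failed_attempts = []
    for converter in (html_converter, plain_text_converter):
        try:
            return converter(raw_html, dom)
        except Exception as exc:  # RecursionError is an Exception subclass
            failed_attempts.append((converter.__name__, exc))
    raise RuntimeError(f"all converters failed: {failed_attempts}")


shallow = convert(*make_document(3))
deep_raw, deep_dom = make_document(sys.getrecursionlimit() * 2)
deep = convert(deep_raw, deep_dom)
print(shallow)           # the converted leaf text
print(deep == deep_raw)  # the deep input comes back as unconverted HTML
```

The caller sees no exception in either case, which is why the failure went unnoticed.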

Fix

Catch RecursionError in HtmlConverter.convert() and fall back to BeautifulSoup's iterative get_text() method, which handles arbitrary nesting depths. A UserWarning is emitted so callers know the output is plain text rather than full markdown.
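As a rough sketch of the shape of this fix, the example below catches RecursionError around a recursive renderer and falls back to a stack-based text extraction while emitting a UserWarning. It uses a toy tree of `(tag, children)` tuples rather than BeautifulSoup, and every name in it (`nest`, `recursive_markdown`, `iterative_text`, `convert`) is hypothetical; the actual change lives in _html_converter.py and may differ in detail.

```python
import sys
import warnings


def nest(depth):
    """Toy DOM: (tag, children) tuples wrapping an emphasised leaf in `depth` divs."""
    node = ("em", ["hello"])
    for _ in range(depth):
        node = ("div", [node])
    return node


def recursive_markdown(node):
    """Recursive renderer: one or more Python frames per nesting level."""
    if isinstance(node, str):
        return node
    tag, children = node
    inner = "".join(recursive_markdown(child) for child in children)
    return f"*{inner}*" if tag == "em" else inner


def iterative_text(node):
    """Explicit-stack text extraction; depth is not bounded by the call stack."""
    parts, stack = [], [node]
    while stack:
        current = stack.pop()
        if isinstance(current, str):
            parts.append(current)
        else:
            # Push children reversed so they are popped in document order.
            stack.extend(reversed(current[1]))
    return "".join(parts)


def convert(node):
    try:
        return recursive_markdown(node)
    except RecursionError:
        warnings.warn(
            "HTML too deeply nested for markdown conversion; returning plain text",
            UserWarning,
            stacklevel=2,
        )
        return iterative_text(node)


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    shallow = convert(nest(3))                         # rendered with markup
    deep = convert(nest(sys.getrecursionlimit() * 2))  # plain-text fallback
```

The fallback loses the emphasis markup but keeps the text, and the recorded warning tells the caller which path was taken.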

Changes

packages/markitdown/src/markitdown/converters/_html_converter.py: catch RecursionError, fall back to get_text(), emit warning
packages/markitdown/tests/test_module_misc.py: add test_deeply_nested_html_fallback verifying the fallback behavior and warning

Fixes #1636

…t#1636)

Large HTML files with deep DOM nesting (e.g., SEC EDGAR filings) cause markdownify's recursive DOM traversal to exceed Python's default recursion limit (1000). Previously this RecursionError was caught by the top-level _convert() dispatcher, which then fell through to PlainTextConverter, silently returning the raw HTML as 'markdown' with no warning.

This fix catches RecursionError in HtmlConverter.convert() and falls back to BeautifulSoup's iterative get_text() method, which handles arbitrary nesting depths. A warning is emitted so callers know the output is plain text rather than full markdown.

Root cause chain:

1. HtmlConverter.convert() calls markdownify.convert_soup() (recursive)
2. Deeply nested HTML (>~400 levels) triggers RecursionError
3. _convert() catches all Exceptions, stores in failed_attempts
4. PlainTextConverter.accepts() matches text/html via 'text/' prefix
5. PlainTextConverter.convert() returns raw HTML bytes as text
6. Caller receives 'markdown' that is actually unconverted HTML

jigangz · 2026-03-29T15:23:30Z

@microsoft-github-policy-service agree

VANDRANKI

Good fix for a real production issue. SEC EDGAR filings and similar regulatory documents are exactly the kind of deeply nested HTML that hits this in practice, and silently returning raw HTML instead of markdown is a worse outcome than a graceful fallback with a warning. The approach is correct.

Suggestion: move `import warnings` to the top of the file

import warnings is currently inside the except RecursionError block. Imports inside exception handlers work correctly in Python but are unconventional: the import happens at exception time rather than at module load time, and static analysis tools may flag it. Moving it to the top of the file with the other imports costs nothing and keeps the import layout consistent with the rest of the codebase.

Question: test recursion depth is environment-dependent

The test constructs HTML with 500 levels of nesting and comments that this is "deep enough to trigger RecursionError". Whether 500 levels actually triggers the error depends on:

Python's current sys.getrecursionlimit() (default 1000, but varies by environment)
The number of stack frames markdownify consumes per DOM level (typically more than 1)

In some environments (PyPy, or Python builds with a raised recursion limit), 500 levels may not trigger the error at all, causing the test to pass the fallback assertion vacuously. A more reliable approach is to temporarily lower the recursion limit in the test itself:

import sys
old_limit = sys.getrecursionlimit()
try:
    sys.setrecursionlimit(100)
    result = markitdown.convert_stream(...)
finally:
    sys.setrecursionlimit(old_limit)

This makes the test deterministic regardless of the environment's default limit.
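The same save-and-restore pattern can be packaged as a context manager so that several tests can share it. This is a sketch only: `recursion_limit` and `depth_probe` are hypothetical helpers, not part of markitdown. One caveat to keep in mind is that sys.setrecursionlimit() itself raises RecursionError when the requested limit is too low for the current call depth, so a very small value may fail under a test runner that already has a deep stack.

```python
import sys
from contextlib import contextmanager


@contextmanager
def recursion_limit(limit):
    """Temporarily set the interpreter recursion limit, restoring it on exit."""
    old_limit = sys.getrecursionlimit()
    sys.setrecursionlimit(limit)
    try:
        yield
    finally:
        sys.setrecursionlimit(old_limit)


def depth_probe(n):
    """Recurse n levels; used here only to provoke RecursionError."""
    return 0 if n == 0 else 1 + depth_probe(n - 1)


limit_before = sys.getrecursionlimit()
overflowed = False
with recursion_limit(200):
    try:
        depth_probe(500)
    except RecursionError:
        overflowed = True
limit_after = sys.getrecursionlimit()
```

Because the limit is set inside the test, the overflow no longer depends on the host's default.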

Suggestion: no way to opt out of the fallback

The fallback to get_text() is always applied when a RecursionError occurs. For callers who need markdown specifically (not plain text) and would prefer an exception over degraded output, there is currently no way to opt out. A strict=False parameter on HtmlConverter that, when set to True, re-raises the RecursionError instead of falling back would give callers the choice. Not blocking for this PR, but worth a follow-up issue.
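One possible shape for such an opt-out is sketched below. The `convert_or_fallback` wrapper and its callable arguments are hypothetical and exist only to show the control flow; the real parameter would sit on the converter itself.

```python
import warnings


def convert_or_fallback(convert, fallback, document, *, strict=False):
    """Run `convert`; on RecursionError either re-raise (strict) or degrade."""
    try:
        return convert(document)
    except RecursionError:
        if strict:
            # Caller asked for markdown or nothing: let the error propagate.
            raise
        warnings.warn(
            "conversion exceeded the recursion limit; using fallback output",
            UserWarning,
            stacklevel=2,
        )
        return fallback(document)


def always_overflows(document):
    """Toy converter that simulates a recursion overflow."""
    raise RecursionError("maximum recursion depth exceeded")


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    lenient = convert_or_fallback(always_overflows, str.upper, "abc")

try:
    convert_or_fallback(always_overflows, str.upper, "abc", strict=True)
    strict_raised = False
except RecursionError:
    strict_raised = True
```

Defaulting to `strict=False` keeps the graceful fallback as the normal behavior while still letting callers choose an exception over degraded output.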

The core fix is correct and the test covers the main case well. The import placement is a quick cleanup and the recursion depth concern is worth resolving to make the test reliable across environments.

- Move 'import warnings' to module top level (was inside except block)
- Make test environment-independent by temporarily lowering the recursion limit with sys.setrecursionlimit(200) instead of relying on depth=500 being sufficient on all platforms; original limit restored in finally block
- Add strict=True keyword argument to opt out of the plain-text fallback and let RecursionError propagate to the caller

jigangz

Thanks for the thorough review! I've addressed all three points in the latest commit:

import warnings moved to top of file: removed the inline import from inside the except block and added it at the module level.
Environment-independent test: the test now temporarily lowers the recursion limit with sys.setrecursionlimit(200) (restoring it in a finally block) so the RecursionError is raised regardless of the host's default limit. This makes it reliable across platforms and CI environments.
strict parameter added: convert() now accepts a strict=True keyword argument. When set, the RecursionError is re-raised instead of falling back to plain-text extraction, giving callers explicit opt-out control.

jigangz marked this pull request as ready for review March 28, 2026 06:00

VANDRANKI reviewed Apr 4, 2026

jigangz commented Apr 5, 2026

jigangz wants to merge 2 commits into microsoft:main from jigangz:fix/large-html-silent-failure
