Skip to content

fix: handle deeply nested HTML that triggers RecursionError#1644

Open
jigangz wants to merge 2 commits intomicrosoft:mainfrom
jigangz:fix/large-html-silent-failure
Open

fix: handle deeply nested HTML that triggers RecursionError#1644
jigangz wants to merge 2 commits intomicrosoft:mainfrom
jigangz:fix/large-html-silent-failure

Conversation

@jigangz
Copy link
Copy Markdown

@jigangz jigangz commented Mar 28, 2026

Summary

Fix large HTML files (>3MB) with deep DOM nesting silently returning unconverted HTML instead of markdown.

Problem

When converting deeply nested HTML documents (e.g., SEC EDGAR filings like Tesla's DEF 14A proxy statement), markdownify's recursive DOM traversal exceeds Python's default recursion limit (~400 nesting levels). The RecursionError is caught by the top-level _convert() dispatcher's except Exception block, and the request falls through to PlainTextConverter which returns the raw HTML as-is — with no error or warning.

Root cause chain:

  1. HtmlConverter.convert()markdownify.convert_soup() (recursive traversal)
  2. Deep nesting (>~400 levels) → RecursionError
  3. _convert() catches it via except Exception, stores in failed_attempts
  4. PlainTextConverter.accepts() matches text/html via text/ prefix → true
  5. PlainTextConverter.convert() returns raw HTML bytes as text
  6. Caller gets "markdown" that is actually unconverted HTML

Fix

Catch RecursionError in HtmlConverter.convert() and fall back to BeautifulSoup's iterative get_text() method, which handles arbitrary nesting depths. A UserWarning is emitted so callers know the output is plain text rather than full markdown.

Changes

  • packages/markitdown/src/markitdown/converters/_html_converter.py: catch RecursionError, fall back to get_text(), emit warning
  • packages/markitdown/tests/test_module_misc.py: add test_deeply_nested_html_fallback verifying the fallback behavior and warning

Fixes #1636

…t#1636)

Large HTML files with deep DOM nesting (e.g., SEC EDGAR filings) cause
markdownify's recursive DOM traversal to exceed Python's default
recursion limit (1000). Previously this RecursionError was caught by
the top-level _convert() dispatcher, which then fell through to
PlainTextConverter — silently returning the raw HTML as 'markdown'
with no warning.

This fix catches RecursionError in HtmlConverter.convert() and falls
back to BeautifulSoup's iterative get_text() method, which handles
arbitrary nesting depths. A warning is emitted so callers know the
output is plain text rather than full markdown.

Root cause chain:
1. HtmlConverter.convert() calls markdownify.convert_soup() (recursive)
2. Deeply nested HTML (>~400 levels) triggers RecursionError
3. _convert() catches all Exceptions, stores in failed_attempts
4. PlainTextConverter.accepts() matches text/html via 'text/' prefix
5. PlainTextConverter.convert() returns raw HTML bytes as text
6. Caller receives 'markdown' that is actually unconverted HTML
@jigangz jigangz marked this pull request as ready for review March 28, 2026 06:00
@jigangz
Copy link
Copy Markdown
Author

jigangz commented Mar 29, 2026

@microsoft-github-policy-service agree

Copy link
Copy Markdown

@VANDRANKI VANDRANKI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good fix for a real production issue. SEC EDGAR filings and similar regulatory documents are exactly the kind of deeply nested HTML that hits this in practice, and silently returning raw HTML instead of markdown is a worse outcome than a graceful fallback with a warning. The approach is correct.

Suggestion: move import warnings to the top of the file

import warnings is currently inside the except RecursionError block. Imports inside exception handlers work correctly in Python but are unconventional: the import happens at exception time rather than at module load time, and static analysis tools may flag it. Moving it to the top of the file with the other imports costs nothing and keeps the import layout consistent with the rest of the codebase.

Question: test recursion depth is environment-dependent

The test constructs HTML with 500 levels of nesting and comments that this is "deep enough to trigger RecursionError". Whether 500 levels actually triggers the error depends on:

  • Python's current sys.getrecursionlimit() (default 1000, but varies by environment)
  • The number of stack frames markdownify consumes per DOM level (typically more than 1)

In some environments (PyPy, or Python builds with a raised recursion limit), 500 levels may not trigger the error at all, causing the test to pass the fallback assertion vacuously. A more reliable approach is to temporarily lower the recursion limit in the test itself:

import sys
old_limit = sys.getrecursionlimit()
try:
    sys.setrecursionlimit(100)
    result = markitdown.convert_stream(...)
finally:
    sys.setrecursionlimit(old_limit)

This makes the test deterministic regardless of the environment's default limit.

Suggestion: no way to opt out of the fallback

The fallback to get_text() is always applied when a RecursionError occurs. For callers who need markdown specifically (not plain text) and would prefer an exception over degraded output, there is currently no way to opt out. A strict=False parameter on HtmlConverter that, when set to True, re-raises the RecursionError instead of falling back would give callers the choice. Not blocking for this PR, but worth a follow-up issue.


The core fix is correct and the test covers the main case well. The import placement is a quick cleanup and the recursion depth concern is worth resolving to make the test reliable across environments.

- Move 'import warnings' to module top level (was inside except block)
- Make test environment-independent by temporarily lowering
  sys.setrecursionlimit(200) instead of relying on depth=500 being
  sufficient on all platforms; original limit restored in finally block
- Add strict=True keyword argument to opt out of the plain-text
  fallback and let RecursionError propagate to the caller
Copy link
Copy Markdown
Author

@jigangz jigangz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the thorough review! I've addressed all three points in the latest commit:

  1. import warnings moved to top of file — removed the inline import from inside the except block and added it at the module level.

  2. Environment-independent test — the test now temporarily lowers sys.setrecursionlimit(200) (restoring it in a finally block) so the RecursionError is guaranteed regardless of the host's default limit. This makes it reliable on all platforms and CI environments.

  3. strict parameter addedconvert() now accepts a strict=True keyword argument. When set, the RecursionError is re-raised instead of falling back to plain-text extraction, giving callers explicit opt-out control.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Large HTML files (>3MB) silently return unconverted HTML

2 participants