fix: handle deeply nested HTML that triggers RecursionError#1644
fix: handle deeply nested HTML that triggers RecursionError#1644jigangz wants to merge 2 commits intomicrosoft:mainfrom
Conversation
…t#1636) Large HTML files with deep DOM nesting (e.g., SEC EDGAR filings) cause markdownify's recursive DOM traversal to exceed Python's default recursion limit (1000). Previously this RecursionError was caught by the top-level _convert() dispatcher, which then fell through to PlainTextConverter — silently returning the raw HTML as 'markdown' with no warning. This fix catches RecursionError in HtmlConverter.convert() and falls back to BeautifulSoup's iterative get_text() method, which handles arbitrary nesting depths. A warning is emitted so callers know the output is plain text rather than full markdown. Root cause chain: 1. HtmlConverter.convert() calls markdownify.convert_soup() (recursive) 2. Deeply nested HTML (>~400 levels) triggers RecursionError 3. _convert() catches all Exceptions, stores in failed_attempts 4. PlainTextConverter.accepts() matches text/html via 'text/' prefix 5. PlainTextConverter.convert() returns raw HTML bytes as text 6. Caller receives 'markdown' that is actually unconverted HTML
|
@microsoft-github-policy-service agree |
VANDRANKI
left a comment
There was a problem hiding this comment.
Good fix for a real production issue. SEC EDGAR filings and similar regulatory documents are exactly the kind of deeply nested HTML that hits this in practice, and silently returning raw HTML instead of markdown is a worse outcome than a graceful fallback with a warning. The approach is correct.
Suggestion: move import warnings to the top of the file
import warnings is currently inside the except RecursionError block. Imports inside exception handlers work correctly in Python but are unconventional: the import happens at exception time rather than at module load time, and static analysis tools may flag it. Moving it to the top of the file with the other imports costs nothing and keeps the import layout consistent with the rest of the codebase.
Question: test recursion depth is environment-dependent
The test constructs HTML with 500 levels of nesting and comments that this is "deep enough to trigger RecursionError". Whether 500 levels actually triggers the error depends on:
- Python's current
sys.getrecursionlimit()(default 1000, but varies by environment) - The number of stack frames markdownify consumes per DOM level (typically more than 1)
In some environments (PyPy, or Python builds with a raised recursion limit), 500 levels may not trigger the error at all, causing the test to pass the fallback assertion vacuously. A more reliable approach is to temporarily lower the recursion limit in the test itself:
import sys
old_limit = sys.getrecursionlimit()
try:
sys.setrecursionlimit(100)
result = markitdown.convert_stream(...)
finally:
sys.setrecursionlimit(old_limit)This makes the test deterministic regardless of the environment's default limit.
Suggestion: no way to opt out of the fallback
The fallback to get_text() is always applied when a RecursionError occurs. For callers who need markdown specifically (not plain text) and would prefer an exception over degraded output, there is currently no way to opt out. A strict=False parameter on HtmlConverter that, when set to True, re-raises the RecursionError instead of falling back would give callers the choice. Not blocking for this PR, but worth a follow-up issue.
The core fix is correct and the test covers the main case well. The import placement is a quick cleanup and the recursion depth concern is worth resolving to make the test reliable across environments.
- Move 'import warnings' to module top level (was inside except block) - Make test environment-independent by temporarily lowering sys.setrecursionlimit(200) instead of relying on depth=500 being sufficient on all platforms; original limit restored in finally block - Add strict=True keyword argument to opt out of the plain-text fallback and let RecursionError propagate to the caller
jigangz
left a comment
There was a problem hiding this comment.
Thanks for the thorough review! I've addressed all three points in the latest commit:
-
import warningsmoved to top of file — removed the inline import from inside theexceptblock and added it at the module level. -
Environment-independent test — the test now temporarily lowers
sys.setrecursionlimit(200)(restoring it in afinallyblock) so theRecursionErroris guaranteed regardless of the host's default limit. This makes it reliable on all platforms and CI environments. -
strictparameter added —convert()now accepts astrict=Truekeyword argument. When set, theRecursionErroris re-raised instead of falling back to plain-text extraction, giving callers explicit opt-out control.
Summary
Fix large HTML files (>3MB) with deep DOM nesting silently returning unconverted HTML instead of markdown.
Problem
When converting deeply nested HTML documents (e.g., SEC EDGAR filings like Tesla's DEF 14A proxy statement),
markdownify's recursive DOM traversal exceeds Python's default recursion limit (~400 nesting levels). TheRecursionErroris caught by the top-level_convert()dispatcher'sexcept Exceptionblock, and the request falls through toPlainTextConverterwhich returns the raw HTML as-is — with no error or warning.Root cause chain:
HtmlConverter.convert()→markdownify.convert_soup()(recursive traversal)RecursionError_convert()catches it viaexcept Exception, stores infailed_attemptsPlainTextConverter.accepts()matchestext/htmlviatext/prefix → truePlainTextConverter.convert()returns raw HTML bytes as textFix
Catch
RecursionErrorinHtmlConverter.convert()and fall back to BeautifulSoup's iterativeget_text()method, which handles arbitrary nesting depths. AUserWarningis emitted so callers know the output is plain text rather than full markdown.Changes
packages/markitdown/src/markitdown/converters/_html_converter.py: catchRecursionError, fall back toget_text(), emit warningpackages/markitdown/tests/test_module_misc.py: addtest_deeply_nested_html_fallbackverifying the fallback behavior and warningFixes #1636