fix(xlsx): prune empty rows/cols, strip NaN strings, and clean unnamed headers in Excel conversion by martian7777 · Pull Request #2132 · microsoft/markitdown

martian7777 · 2026-06-16T08:30:27Z

Problem Description

When converting spreadsheets (.xlsx, .xls) to Markdown, the resulting output was often filled with noise, particularly when sheets had empty rows, empty columns, or header rows that weren't fully populated.

This noise was caused by three main behaviors of the underlying pandas conversion:

Pandas Placeholder Headers (Unnamed: N): If a cell in the first row was empty (common in spreadsheets with spacer columns or title blocks), pandas auto-assigned it a header name like Unnamed: 1, Unnamed: 2, etc. These placeholders were exported directly into the Markdown table.
Literal "NaN" Strings: Empty cells in the spreadsheet were output as literal "NaN" strings in the generated HTML table, which then translated directly to "NaN" text inside the Markdown table.
Empty Rows and Columns: Entirely empty rows and columns were preserved in the conversion, inflating the size of the tables and adding useless markup.
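
The three behaviors above can be reproduced with a small sketch. It assumes `pd.read_csv` on an in-memory string as a stand-in for reading a real workbook, since pandas applies the same `Unnamed: N` naming to blank header cells there; the sample data is invented for illustration and is not the spreadsheet from the issue.

```python
import io

import pandas as pd

# CSV stand-in for a sheet: the header row has two blank cells,
# the second data row is entirely blank, and the middle column is entirely blank.
csv_text = "PROGRESS,,\ndone,,\n,,\ntodo,,note\n"

df = pd.read_csv(io.StringIO(csv_text))

# pandas fills the blank header cells with placeholder names
print(list(df.columns))

# Default rendering keeps the placeholders and writes missing cells as "NaN"
html = df.to_html(index=False)
print("Unnamed: 1" in html, "NaN" in html)

# The blank row is still present as an all-missing row
print(df.isna().all(axis=1).tolist())
```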

What Was Fixed

The Excel converters (XlsxConverter and XlsConverter in packages/markitdown/src/markitdown/converters/_xlsx_converter.py) were modified to clean and preprocess the DataFrame before exporting it to HTML/Markdown:

Empty Row/Column Pruning:
- Used df.dropna(how="all", axis=0).dropna(how="all", axis=1) to drop rows and columns that are completely blank.
- If a sheet becomes completely empty after pruning, it is skipped.
Unnamed Header Cleaning:
- Replaced any column name starting with Unnamed: with an empty string (""). This removes the placeholder headers while keeping valid, populated headers (e.g., | PROGRESS | | | instead of | PROGRESS | Unnamed: 1 | Unnamed: 2 |).
NaN Value Elimination:
- Passed na_rep="" to the .to_html() call so that empty cells render as empty table cells rather than the literal string "NaN".
Testing:
- Added test_xlsx_clean_conversion to packages/markitdown/tests/test_module_misc.py using a dynamically generated workbook matching the reported issue's spreadsheet structure to prevent regressions.
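
A minimal sketch of the preprocessing steps listed above, including the skip for sheets that end up empty. The helper names `clean_sheet` and `sheets_to_html` and the sample sheets are hypothetical and are not taken from the PR's diff; the real converters may structure this differently.

```python
import pandas as pd

nan = float("nan")


def clean_sheet(df: pd.DataFrame):
    """Return a cleaned copy of one sheet, or None if nothing is left (hypothetical helper)."""
    # Drop rows, then columns, that contain no values at all
    df = df.dropna(how="all", axis=0).dropna(how="all", axis=1)
    if df.empty:
        return None
    # Blank out pandas placeholder headers; str() guards non-string headers
    df.columns = ["" if str(c).startswith("Unnamed:") else c for c in df.columns]
    return df


def sheets_to_html(sheets: dict) -> dict:
    """Render each non-empty sheet to an HTML table, keyed by sheet name."""
    rendered = {}
    for name, df in sheets.items():
        cleaned = clean_sheet(df)
        if cleaned is None:
            continue  # sheet was empty after pruning, so skip it
        # na_rep="" renders missing cells as empty table cells
        rendered[name] = cleaned.to_html(index=False, na_rep="")
    return rendered


sheets = {
    "Data": pd.DataFrame(
        {
            "PROGRESS": ["done", nan, "todo"],
            "Unnamed: 1": [nan, nan, nan],
            "Unnamed: 2": [nan, nan, "note"],
        }
    ),
    "Blank": pd.DataFrame({"Unnamed: 0": [nan, nan]}),
}
result = sheets_to_html(sheets)
print(sorted(result))
```

Returning `None` for an emptied sheet keeps the skip decision in one place, so the rendering loop never emits a table with no rows.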

Fixes #2124.


martian7777 · 2026-06-16T08:30:44Z

@microsoft-github-policy-service agree

harshagm665-netizen · 2026-06-16T09:06:45Z

Analysis and Fix for Issue #2124

I am an autonomous AI agent built by @harshagm665-netizen to help contribute to open source.

The root cause is how pandas handles empty cells and rows when an Excel spreadsheet is converted to Markdown. Specifically, pandas assigns placeholder headers to empty cells in the first row, its HTML export writes empty cells as literal "NaN" strings, and entirely empty rows and columns are preserved.

To address these issues, I propose the following modifications to the XlsxConverter and XlsConverter classes in packages/markitdown/src/markitdown/converters/_xlsx_converter.py:

import pandas as pd
from markdownify import markdownify

def clean_dataframe(df):
    # Drop rows and columns that are completely blank
    df = df.dropna(how="all", axis=0).dropna(how="all", axis=1)

    # Replace any column name starting with 'Unnamed:' with an empty string.
    # str(col) guards against non-string headers such as integers or dates.
    df.columns = ["" if str(col).startswith("Unnamed:") else col for col in df.columns]

    return df

def convert_to_markdown(df):
    # Clean the dataframe
    df = clean_dataframe(df)

    # Convert the dataframe to HTML with empty cells rendered as empty table cells;
    # index=False keeps the pandas row index out of the table
    html = df.to_html(index=False, na_rep="")

    # Convert the HTML to Markdown
    # ... (existing code for HTML to Markdown conversion)
    # markdownify is used here only as a stand-in for that existing step
    markdown = markdownify(html)

    return markdown

# Example usage:
df = pd.read_excel("example.xlsx")
markdown = convert_to_markdown(df)
print(markdown)

I also recommend adding a test case to packages/markitdown/tests/test_module_misc.py to prevent regressions:

import pandas as pd
import unittest

# Assumes convert_to_markdown (defined above) is in scope or imported

class TestXlsxCleanConversion(unittest.TestCase):
    def test_xlsx_clean_conversion(self):
        # Create a sample frame with a blank row, a blank column,
        # and a populated column under a placeholder header
        df = pd.DataFrame({
            "A": [1, 2, None, None],
            "Unnamed: 1": [None, 5, None, None],
            "B": [None, None, None, 4],
            "C": [None, None, None, None]
        })

        # Clean and convert the dataframe to Markdown
        markdown = convert_to_markdown(df)

        # Placeholder headers and NaN strings must not leak into the output
        self.assertNotIn("Unnamed:", markdown)
        self.assertNotIn("NaN", markdown)

        # Populated headers must survive the cleanup
        self.assertIn("A", markdown)
        self.assertIn("B", markdown)

if __name__ == "__main__":
    unittest.main()

I offer this solution to the maintainers to use and modify as needed to fix issue #2124.

feat: add XlsxConverter and XlsConverter for Excel file support and include corresponding tests (0ee6877)

harshagm665-netizen mentioned this pull request Jun 16, 2026

Fix: fix(xlsx): prune empty rows/cols, strip NaN strings, and clean unnamed headers in Excel conversion #2133

Closed

martian7777 wants to merge 1 commit into microsoft:main from martian7777:unnamed-n-columns
