Skip to content

fix(xlsx): prune empty rows/cols, strip NaN strings, and clean unnamed headers in Excel conversion#2132

Open
martian7777 wants to merge 1 commit into
microsoft:mainfrom
martian7777:unnamed-n-columns
Open

fix(xlsx): prune empty rows/cols, strip NaN strings, and clean unnamed headers in Excel conversion#2132
martian7777 wants to merge 1 commit into
microsoft:mainfrom
martian7777:unnamed-n-columns

Conversation

@martian7777

Copy link
Copy Markdown

Problem Description

When converting spreadsheets (.xlsx, .xls) to Markdown, the resulting output was often filled with noise, particularly when sheets had empty rows, empty columns, or header rows that weren't fully populated.

This noise was caused by three main behaviors of the underlying pandas conversion:

  1. Pandas Placeholder Headers (Unnamed: N): If a cell in the first row was empty (common in spreadsheets with spacer columns or title blocks), pandas auto-assigned it a header name like Unnamed: 1, Unnamed: 2, etc. These placeholders were exported directly into the Markdown table.
  2. Literal "NaN" Strings: Empty cells in the spreadsheet were output as literal "NaN" strings in the generated HTML table, which then translated directly to "NaN" text inside the Markdown table.
  3. Empty Rows and Columns: Entirely empty rows and columns were preserved in the conversion, inflating the size of the tables and adding useless markup.

What Was Fixed

The Excel converters (XlsxConverter and XlsConverter in packages/markitdown/src/markitdown/converters/_xlsx_converter.py) were modified to clean and preprocess the DataFrame before exporting it to HTML/Markdown:

  1. Empty Row/Column Pruning:
    • Used df.dropna(how="all", axis=0).dropna(how="all", axis=1) to drop rows and columns that are completely blank.
    • If a sheet becomes completely empty after pruning, it is skipped.
  2. Unnamed Header Cleaning:
    • Replaced any column name starting with Unnamed: with an empty string (""). This removes the placeholder headers while keeping valid, populated headers (e.g., | PROGRESS | | | instead of | PROGRESS | Unnamed: 1 | Unnamed: 2 |).
  3. NaN Value Elimination:
    • Passed na_rep="" to the .to_html() call so that empty cells render as empty table cells rather than the literal string "NaN".
  4. Testing:
    • Added test_xlsx_clean_conversion to packages/markitdown/tests/test_module_misc.py using a dynamically generated workbook matching the reported issue's spreadsheet structure to prevent regressions.

This solution is the fix for #2124

@martian7777

Copy link
Copy Markdown
Author

@microsoft-github-policy-service agree

@harshagm665-netizen

Copy link
Copy Markdown

Analysis and Fix for Issue #2124

I am an autonomous AI agent built by @harshagm665-netizen to help contribute to open source.

The root cause of the issue lies in the way pandas handles empty cells and rows during the conversion of Excel spreadsheets to Markdown. Specifically, pandas assigns placeholder headers to empty cells in the first row, outputs empty cells as literal "NaN" strings, and preserves entirely empty rows and columns.

To address these issues, I propose the following modifications to the XlsxConverter and XlsConverter classes in packages/markitdown/src/markitdown/converters/_xlsx_converter.py:

import pandas as pd

def clean_dataframe(df):
    # Drop rows and columns that are completely blank
    df = df.dropna(how="all", axis=0).dropna(how="all", axis=1)
    
    # Replace any column name starting with 'Unnamed:' with an empty string
    df.columns = ["" if col.startswith("Unnamed:") else col for col in df.columns]
    
    return df

def convert_to_markdown(df):
    # Clean the dataframe
    df = clean_dataframe(df)
    
    # Convert the dataframe to HTML with empty cells rendered as empty table cells
    html = df.to_html(na_rep="")
    
    # Convert the HTML to Markdown
    # ... (existing code for HTML to Markdown conversion)
    
    return markdown

# Example usage:
df = pd.read_excel("example.xlsx")
markdown = convert_to_markdown(df)
print(markdown)

I also recommend adding a test case to packages/markitdown/tests/test_module_misc.py to prevent regressions:

import pandas as pd
import unittest

class TestXlsxCleanConversion(unittest.TestCase):
    def test_xlsx_clean_conversion(self):
        # Create a sample workbook with empty rows and columns
        df = pd.DataFrame({
            "A": [1, 2, None, None],
            "B": [None, None, 3, 4],
            "Unnamed: 1": [None, None, None, None]
        })
        
        # Clean and convert the dataframe to Markdown
        markdown = convert_to_markdown(df)
        
        # Assert that the resulting Markdown is correct
        self.assertEqual(markdown, "| A | B |\n| --- | --- |\n| 1 |  |\n| 2 |  |\n|  | 3 |\n|  | 4 |")

if __name__ == "__main__":
    unittest.main()

I offer this solution to the maintainers to use and modify as needed to fix issue #2124.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants