Skip to content

fix(file-based): support multi-sheet Excel workbooks via sheet_name config#1051

Draft
devin-ai-integration[bot] wants to merge 2 commits into
mainfrom
devin/1781264638-fix-excel-multisheet
Draft

fix(file-based): support multi-sheet Excel workbooks via sheet_name config#1051
devin-ai-integration[bot] wants to merge 2 commits into
mainfrom
devin/1781264638-fix-excel-multisheet

Conversation

@devin-ai-integration

Copy link
Copy Markdown
Contributor

Summary

The file-based CDK Excel parser previously relied on the pandas default sheet_name=0, so only the first worksheet was ever read from multi-sheet workbooks. This caused silent data loss for connectors like source-microsoft-sharepoint when users had Excel files with multiple sheets.

This PR adds an optional sheet_name field to ExcelFormat:

class ExcelFormat(BaseModel):
    sheet_name: str = Field(default="0", ...)

Accepted values: "0" (default, first sheet — preserves existing behavior), a sheet name string, a zero-indexed position, or "*" to read all sheets.

The new _parse_excel_file_resolve_sheet_name path converts the string config into a pandas-compatible sheet_name argument (int, str, or None). When sheet_name=None (all sheets), pandas returns a dict[str, DataFrame] instead of a single DataFrame; the parser iterates over all returned DataFrames for both record emission and schema inference (merging columns across sheets).

Both Calamine-first and OpenPyXL fallback paths now accept and forward the sheet_name parameter.

Resolves https://github.com/airbytehq/oncall/issues/12598:

Supersedes #1024 (stale draft from a prior session).

Declarative-First Evaluation

The affected connectors (source-microsoft-sharepoint, etc.) are Python file-based CDK sources, not manifest-only connectors. The fix operates at the Excel parsing layer before any records exist, so declarative features (RecordFilter, AddFields, RemoveFields, paginators, routers, error handlers, $ref overrides) cannot address this.

Breaking Change Evaluation

Not a breaking change. The new sheet_name field is optional and defaults to "0", preserving the previous first-sheet-only behavior. No streams, fields, primary keys, cursors, existing config fields, or state formats are removed or changed.

Test Coverage

New tests added to test_excel_parser.py:

  • test_parse_records_selects_configured_sheet — parametrized for default first sheet, named sheet, and all sheets
  • test_infer_schema_with_sheet_selection — parametrized for single-sheet and merged multi-sheet schema inference
  • test_resolve_sheet_name — parametrized unit test for the config → pandas argument conversion

All tests use real in-memory multi-sheet Excel workbooks (via xlsxwriter) instead of mocks.

Link to Devin session: https://app.devin.ai/sessions/6a77303e47e040da803aeee819febb5b

…onfig

Add an optional sheet_name field to ExcelFormat config that controls
which worksheet(s) the parser reads:
- '0' (default): first sheet only (preserves existing behavior)
- '<name>': a specific sheet by name
- '*': all sheets in the workbook

Both Calamine and OpenPyXL fallback paths now pass sheet_name through
to pandas. Schema inference merges columns across selected sheets.

Co-Authored-By: bot_apk <apk@cognition.ai>
@devin-ai-integration

Copy link
Copy Markdown
Contributor Author

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

  • Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
  • Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.

⚙️ Control Options:

  • Disable automatic comment, CI, and merge conflict monitoring

@github-actions

Copy link
Copy Markdown

👋 Greetings, Airbyte Team Member!

Here are some helpful tips and reminders for your convenience.

💡 Show Tips and Tricks

Testing This CDK Version

You can test this version of the CDK using the following:

# Run the CLI from this branch:
uvx 'git+https://github.com/airbytehq/airbyte-python-cdk.git@devin/1781264638-fix-excel-multisheet#egg=airbyte-python-cdk[dev]' --help

# Update a connector to use the CDK from this branch ref:
cd airbyte-integrations/connectors/source-example
poe use-cdk-branch devin/1781264638-fix-excel-multisheet

PR Slash Commands

Airbyte Maintainers can execute the following slash commands on your PR:

  • /autofix - Fixes most formatting and linting issues
  • /poetry-lock - Updates poetry.lock file
  • /test - Runs connector tests with the updated CDK
  • /prerelease - Triggers a prerelease publish with default arguments
  • /poe build - Regenerate git-committed build artifacts, such as the pydantic models which are generated from the manifest JSON schema in YAML.
  • /poe <command> - Runs any poe command in the CDK environment
📚 Show Repo Guidance

Helpful Resources

📝 Edit this welcome message.

@github-actions

github-actions Bot commented Jun 12, 2026

Copy link
Copy Markdown

PyTest Results (Fast)

4 118 tests  +9   4 106 ✅ +9   7m 27s ⏱️ +8s
    1 suites ±0      12 💤 ±0 
    1 files   ±0       0 ❌ ±0 

Results for commit 57ca4e1. ± Comparison against base commit fd95ecf.

♻️ This comment has been updated with latest results.

Co-Authored-By: bot_apk <apk@cognition.ai>
@github-actions

Copy link
Copy Markdown

PyTest Results (Full)

4 121 tests  +9   4 109 ✅ +10   12m 8s ⏱️ + 2m 43s
    1 suites ±0      12 💤 ± 0 
    1 files   ±0       0 ❌  -  1 

Results for commit 57ca4e1. ± Comparison against base commit fd95ecf.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

0 participants