FEAT: Add arrow fetch support#354

Merged
gargsaumya merged 35 commits into microsoft:main from ffelixg:arrow_fetch
Apr 2, 2026

Conversation

@ffelixg
Contributor

@ffelixg ffelixg commented Nov 30, 2025

Work Item / Issue Reference

GitHub Issue: #130


Summary

Hey, you mentioned in issue #130 that you were willing to consider community contributions for adding Apache Arrow support, so here you go. I have focused only on fetching data into Arrow structures from the database.

The function signatures I chose are:

  • arrow_batch(chunk_size=10000): Fetches a single pyarrow.RecordBatch; this is the base for the other two methods.
  • arrow(chunk_size=10000): Fetches the entire result set as a single pyarrow.Table.
  • arrow_reader(chunk_size=10000): Returns a pyarrow.RecordBatchReader for streaming results without loading the entire dataset into RAM.

Using fetch_arrow... instead of just arrow... could also be a good option, but I think the terse version is not too ambiguous.
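As a plain-Python illustration of how the three methods relate (chunk-wise fetching as the base, a lazy reader on top of it, and full materialization on top of that), here is a toy sketch; the names and logic are stand-ins, not the driver's actual implementation:

```python
# Toy stand-ins using a list instead of a cursor: arrow_batch() maps to
# fetch_batch, arrow_reader() to fetch_reader, and arrow() to fetch_all.
def fetch_batch(rows, offset, chunk_size=10000):
    """One chunk of the result set (analogue of arrow_batch)."""
    return rows[offset:offset + chunk_size]

def fetch_reader(rows, chunk_size=10000):
    """Stream chunks lazily without materializing everything (arrow_reader)."""
    offset = 0
    while offset < len(rows):
        yield fetch_batch(rows, offset, chunk_size)
        offset += chunk_size

def fetch_all(rows, chunk_size=10000):
    """Materialize every chunk into one result (arrow)."""
    return [x for batch in fetch_reader(rows, chunk_size) for x in batch]
```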

Technical details

I am not very familiar with C++, but I did have some prior practice for this task from implementing my own ODBC driver in Zig (a very good language for projects like this!). The implementation is written almost entirely in C++ in the FetchArrowBatch_wrap function, which produces PyCapsules that are then consumed by arrow_batch and turned into actual Arrow objects.

The function itself is very large. I'm sure it could be factored in a better way, even sharing some code with the other fetch methods, but my goal was to keep the whole thing as straightforward as possible.

I have also implemented my own loop for SQLGetData for LOB columns. Unlike with the Python fetch methods, I don't use the result directly, but instead copy it into the same buffer I would use in the bound-columns case. Maybe that's an abstraction that would make sense there as well.

Notes on data types

I noticed that you use SQL_C_TYPE_TIME for time(x) columns. The Arrow fetch does the same, but I think it would be better to use SQL_C_SS_TIME2, since that supports fractional seconds.
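For reference, a sketch of decoding such a value in Python; the struct layout (three unsigned shorts, two alignment padding bytes, then an unsigned 32-bit fraction in nanoseconds) is my assumption about SQL_SS_TIME2_STRUCT, so double-check it against the driver headers:

```python
import struct

def decode_time2(buf):
    # Assumed SQL_SS_TIME2_STRUCT layout: hour/minute/second as unsigned
    # shorts, 2 padding bytes, then the fraction as an unsigned 32-bit
    # count of nanoseconds.
    hour, minute, second, fraction_ns = struct.unpack("<HHH2xI", buf)
    return hour, minute, second, fraction_ns // 1000  # microseconds

# Round-trip a sample value: 13:45:30.1234567
sample = struct.pack("<HHH2xI", 13, 45, 30, 123_456_700)
```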

Datetimeoffset is a bit tricky, since SQL Server stores timezone information alongside each cell, while Arrow tables expect a fixed timezone for the entire column. I don't really see any solution other than converting everything to UTC and returning a UTC column, so that's what I did.
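A minimal stdlib sketch of that normalization (the sample offsets are made up): each cell carries its own offset on the wire, and everything gets converted to one UTC column.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical cells as a datetimeoffset column might yield them:
# the same wall-clock time stored with two different offsets.
cells = [
    datetime(2026, 4, 2, 12, 0, tzinfo=timezone(timedelta(hours=2))),
    datetime(2026, 4, 2, 12, 0, tzinfo=timezone(timedelta(hours=-5))),
]

# Normalize to a single timezone so the column has one fixed tz,
# as Arrow timestamp columns require.
utc_column = [c.astimezone(timezone.utc) for c in cells]
```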

SQL_C_CHAR columns get copied directly into Arrow utf8 arrays. Maybe some encoding options would be useful.

Performance

I think the main performance win to be gained is not interacting with any Python data structures in the hot path. That is satisfied. Further optimizations, which I did not make, are:

  • Releasing the GIL for the entire fetch loop
  • Sharing the bound fetch buffer across repeated fetch calls
  • Improving the hot-loop switching

Instead of looping over rows and columns and then switching on the data type for each cell, you could:

  • Put the row loop inside each switch case (fastest I think, but would bloat the code a lot more)
  • Use function pointers like you recently did for Python fetching (has overhead because of the indirect function call I think, and the code is more scattered)
  • Replace both loops and the switch with computed gotos. That's what I opted for in my ODBC driver (the Zig equivalent is a labeled switch) and I am quite happy with how it came out. Performance seems very good and it allows you to abstract the fetching process on a row-by-row basis. I don't know how well that would translate to C++.

Overall, the Arrow performance seems not too far off from what I achieved with zodbc.
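As a toy illustration of the "function pointer" variant in Python (the real code is C++, and the type names here are placeholders): the type dispatch happens once per column, and the inner row loop never branches on the data type.

```python
# Converter table: one entry per (assumed) SQL C type, resolved once per
# column instead of switching on the type for every cell.
CONVERTERS = {
    "SQL_INTEGER": int,
    "SQL_DOUBLE": float,
    "SQL_VARCHAR": str,
}

def convert_column(values, sql_type):
    convert = CONVERTERS[sql_type]       # single indirect lookup per column
    return [convert(v) for v in values]  # tight, branch-free row loop
```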

Copilot AI review requested due to automatic review settings November 30, 2025 21:00
@ffelixg
Contributor Author

ffelixg commented Nov 30, 2025

@microsoft-github-policy-service agree

Contributor

Copilot AI left a comment


Pull request overview

This PR adds Apache Arrow fetch support to the mssql-python driver, enabling efficient columnar data retrieval from SQL Server. The implementation provides three new cursor methods (arrow_batch(), arrow(), and arrow_reader()) that convert result sets into Apache Arrow data structures using the Arrow C Data Interface, bypassing Python object creation in the hot path for improved performance.

Key changes:

  • Implemented Arrow fetch functionality in C++ that directly converts ODBC result sets to Arrow format
  • Added three Python API methods for different Arrow data consumption patterns (single batch, full table, streaming reader)
  • Added comprehensive test coverage for various data types, LOB columns, and edge cases

Reviewed changes

Copilot reviewed 3 out of 4 changed files in this pull request and generated 9 comments.

  • mssql_python/pybind/ddbc_bindings.cpp: Core C++ implementation. Added the FetchArrowBatch_wrap() function with Arrow C Data Interface structures, column buffer management, data type conversion logic, and memory management for Arrow structures
  • mssql_python/cursor.py: Python API layer. Added arrow_batch(), arrow(), and arrow_reader() methods that wrap the C++ bindings and handle pyarrow imports
  • tests/test_004_cursor.py: Comprehensive test suite covering wide tables, LOB columns, individual data types, empty result sets, datetime handling, and batch operations
  • requirements.txt: Added pyarrow as a dependency for development and testing


@sumitmsft self-assigned this Dec 1, 2025
@sumitmsft added the enhancement (New feature or request) label Dec 1, 2025
@sumitmsft
Contributor

Hi @ffelixg

Thanks for raising this PR. Please allow us time to review and share our comments.

Appreciate your diligence in strengthening this project.

Sumit

@sumitmsft added the inADO, under development, and community (PR or Issue raised by community members) labels Dec 1, 2025
@sumitmsft
Contributor

sumitmsft commented Dec 4, 2025

Hello @ffelixg

My team and I are in the process of reviewing your PR. While we get started, it would be great to have some preliminary information from you on the following items:

  1. Have you created any design document for this feature (high/low level)? Could you please attach it here or share it with us at the email address mentioned below?
  2. What is your motivation for bringing Arrow support to mssql-python? Could you help us understand the use case(s) you're trying to address?
  3. Is there a way to connect with you over a Microsoft Teams call, so that we can work closely on this feature together? You can reach out to us at mssql-python@microsoft.com with your contact details and your consent for us to connect with you.

Regards,
Sumit

@ffelixg
Contributor Author

ffelixg commented Dec 4, 2025

Hello @sumitmsft,

I'm happy to hear that.

  1. I don't have any design document beyond what I wrote in the PR description. Are there any areas in particular you would like me to provide more information on?
  2. I assume the motivation is mostly in line with what most Arrow users like about Arrow. Mainly I believe that Arrow is the correct format for anything that works in batches of data and has C extensions on both the producer and consumer side. For example, Arrow gives you great interop with things like DuckDB, Polars, and pandas on the analytics/ML side. I also want Python to be the obvious one-stop shop for ETL workloads, and for that, plain Python types don't work well, both for performance and for reliability. You still have plenty of situations where you want to fetch one result set with Python types and the next with Arrow types, so it has to be in the same driver as well.
  3. Yes, for sure. I have sent you an email.

Regards,
Felix

@bewithgaurav
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

Collaborator

@bewithgaurav bewithgaurav left a comment


@ffelixg - Thanks for the contribution! :)
Before we get started on this PR, there are a few dev build workflows we need to fix.
Could you please take a look at the Azure DevOps workflows which are failing (they go by the check MSSQL-Python-PR-Validation)?

  • Build issues on Windows (ref)
  • A few tests failing on macOS (ref)

@ffelixg
Contributor Author

ffelixg commented Dec 9, 2025

Hey,

the Windows issue was due to me using a 128-bit integer type, which isn't supported by MSVC. To address that, I've added a custom 128-bit int type implemented as two 64-bit ints with some overloaded operators. I'm not super happy about that; it seems to me that using an existing library for this kind of thing would be better. If you prefer to use a library, I'd leave the choice of which one to use up to you though.

The 128-bit type is only needed for decimals, so an alternative solution would be to use the numeric struct instead of strings for fetching. That one has a near-identical bit representation compared to Arrow and wouldn't require a lot of modification. But that's a change that would affect fetching to Python objects as well, since the two paths should probably stay in sync.
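To sketch why the numeric struct is attractive here (field semantics assumed from the ODBC SQL_NUMERIC_STRUCT convention; verify against the headers): the value is a 16-byte little-endian unsigned magnitude plus a sign and scale, which is close to Arrow's decimal128 representation.

```python
from decimal import Decimal

def numeric_to_decimal(scale, sign, val16):
    # val16: 16-byte little-endian magnitude, as in SQL_NUMERIC_STRUCT.val;
    # sign: 1 = positive, 0 = negative (assumed ODBC convention).
    unscaled = int.from_bytes(val16, "little")
    if sign == 0:
        unscaled = -unscaled
    return Decimal(unscaled).scaleb(-scale)

# 123.45 stored as unscaled magnitude 12345 with scale 2
val = (12345).to_bytes(16, "little")
```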

The macOS issue seems to be due to std::mktime failing. I've added an implementation of days_from_civil to eliminate that call. I think a newer C++ standard would include that function. CPython also has an implementation of it in _datetimemodule.c, but it sadly isn't directly accessible, only when going through Python objects, which would of course be slow.
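For context, the algorithm in question (Howard Hinnant's days_from_civil) is short; here is a Python transcription, using floor division in place of the C++ truncating-division adjustments:

```python
from datetime import date

def days_from_civil(y, m, d):
    """Days since 1970-01-01 for a proleptic Gregorian date, without mktime."""
    y -= m <= 2                          # March-based year: Jan/Feb belong to the prior year
    era = y // 400                       # 400-year Gregorian cycle (floor division)
    yoe = y - era * 400                  # year within the era: [0, 399]
    doy = (153 * (m + (-3 if m > 2 else 9)) + 2) // 5 + d - 1  # day within March-based year
    doe = yoe * 365 + yoe // 4 - yoe // 100 + doy              # day within the era
    return era * 146097 + doe - 719468   # 719468 = days from 0000-03-01 to 1970-01-01
```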

I noticed some CI errors related to zip not taking strict=True, which only affects Python 3.9 and below. I know you don't officially support 3.9, but if you're testing it and it works otherwise, I could write that differently I guess.

ffelixg and others added 7 commits December 28, 2025 16:40
@sumitmsft
Contributor

sumitmsft commented Mar 19, 2026

The new arrow_batch(), arrow(), and arrow_reader() methods should be added to the type stub file mssql_python/mssql_python.pyi; without stub entries, IDEs won't offer autocompletion and type checkers will report errors when users call these methods. Something like:

    # Arrow Extension Methods (requires pyarrow)
    def arrow_batch(self, batch_size: int = 8192) -> "pyarrow.RecordBatch": ...
    def arrow(self, batch_size: int = 8192) -> "pyarrow.Table": ...
    def arrow_reader(self, batch_size: int = 8192) -> "pyarrow.RecordBatchReader": ...

@sumitmsft
Contributor

@ffelixg I have put in some of my review comments. Request you to look at them. Most of them are good-to-haves, so there are no blocking issues.

@ffelixg
Contributor Author

ffelixg commented Mar 22, 2026

Thanks for the review! I've addressed your comments and added the stubs. I can confirm that it fixed complaints from ty. I totally missed the pyi file, because somehow mypy also seems to be looking at the definition inside cursor.py, so it didn't throw any errors.

@gargsaumya
Contributor

Hi @ffelixg, the PR review is complete. Could you please update the branch so we can run the tests and proceed with the merge once everything passes? It looks like the pipeline is currently failing due to linting errors, so could you please address those as part of the update? Thanks!

@ffelixg
Contributor Author

ffelixg commented Mar 31, 2026

Hey @gargsaumya, nice! I merged main and fixed the formatting, as well as a small Windows compilation error that snuck in during one of the recent tweaks.

Just a heads up: SQL_VARIANT, which was added in the most recent commit on main, will raise an unsupported data type exception when fetching as Arrow for now. I'm also not quite sure what SQL_VARIANT should look like in Arrow, since Arrow isn't really built for dynamic types like that.

These two currently open PRs will also necessitate minor adjustments in the Arrow fetch path: #479 #478. Do you plan on merging them before or after this?

@gargsaumya
Contributor


Thanks! I agree about SQL_VARIANT; that shouldn't be an issue for now.

Regarding the PRs you mentioned and this one, there's no priority order. Whichever PR completes review and has a clean pipeline will be merged first. The remaining PR authors can then pull in the latest changes and update their PRs accordingly.
I've started the pipeline run for this PR; if it passes, I'll go ahead and merge it.

@gargsaumya
Contributor

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@ffelixg
Contributor Author

ffelixg commented Apr 1, 2026

There seems to be an issue with a test regarding ODBC handle lifetimes. I don't see how any of the changes I made would relate to that, though. Is it possible that the test is flaky? Maybe we could try running CI again? @gargsaumya

@gargsaumya
Contributor

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@gargsaumya merged commit b786900 into microsoft:main Apr 2, 2026
27 of 28 checks passed
@gargsaumya
Contributor

Hi @ffelixg,
We have merged this PR. Thank you for the substantial work you've put into this feature; it's much appreciated.

Could you share some performance comparisons for this change? It would be helpful to see metrics (e.g., before vs after numbers, benchmarks, or any relevant stats) to better understand the impact this feature is delivering.

@ffelixg
Contributor Author

ffelixg commented Apr 3, 2026

Hey @gargsaumya,
That's awesome, thanks to you and the rest of the team for making this happen as well! I will message you separately about the benchmarks.


Labels

community (PR or Issue raised by community members), enhancement (New feature or request), inADO


6 participants