Conversation

@DanDits
Collaborator

@DanDits DanDits commented Jan 26, 2026

No description provided.

dittmar added 3 commits January 26, 2026 23:18
…only empty strings, by working on the input dataframe and not an empty dataframe
…tions and to allow working around issues with pandas csv parsing and writing
@DanDits
Collaborator Author

DanDits commented Jan 26, 2026

Should something be mentioned in the CHANGELOG? If we merge this, the only user-visible changes will be the slightly adjusted example, the "support" for Python 3.11, and some type annotation improvements. In that sense the CSV changes are neither new features nor behavior changes; they are fixes that achieve the expected behavior in various edge cases.

lines.append(_format_row(columns_list, columns_list, None, None, None, None))

# Write data rows
for _, row in df.iterrows():
Collaborator


iterrows is very slow. It would probably be 5-10x faster to use itertuples or transform into a numpy array like so:

values = df.to_numpy(dtype=object, na_value=None)
for row in values:
    lines.append(_format_row(list(row), ...))
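To make the two alternatives comparable, here is a minimal sketch of both. `format_row` is a hypothetical one-argument stand-in for `_format_row` (whose real signature takes more arguments), and the sample dataframe is made up. I have not benchmarked this against the PR, so the actual speedup should be measured on realistic data.

```python
import pandas as pd


def format_row(values):
    # Hypothetical stand-in for the PR's _format_row: join the cells with
    # commas and write missing values as empty strings.
    return ",".join("" if pd.isna(v) else str(v) for v in values)


def rows_via_itertuples(df):
    # index=False drops the index; name=None yields plain tuples instead of
    # namedtuples, which avoids issues with unusual column names.
    return [format_row(row) for row in df.itertuples(index=False, name=None)]


def rows_via_numpy(df):
    # One object-dtype array for the whole frame; missing values become None.
    values = df.to_numpy(dtype=object, na_value=None)
    return [format_row(row) for row in values]


df = pd.DataFrame({"a": [1, 2], "b": ["x", None]})
print(rows_via_itertuples(df))
print(rows_via_numpy(df))
```

One difference to keep in mind: `to_numpy(..., na_value=None)` normalizes missing values to `None`, while `itertuples` passes through whatever the column stores (for example `NaN`), so the row formatter has to handle both if we go with `itertuples`.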

# Quoted value - extract content (can contain newlines)
pos += 1
value = []
while pos < len(csv_data):
Collaborator


This walks through the whole payload character by character. For large data, this will be very slow.
We could use str.find() instead to jump to the next quote (it should be implemented in C).

Something along the lines of

while pos < len(csv_data):
    next_quote = csv_data.find('"', pos)
    if next_quote == -1:
        break  # no closing quote left; treat as unterminated value
    value_parts.append(csv_data[pos:next_quote])
    pos = next_quote + 1
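A slightly fuller sketch of the same idea, as a standalone function. The name `read_quoted_value` is hypothetical, and I am assuming that a doubled quote (`""`) inside a quoted value stands for one literal quote, as in RFC 4180; if the parser in this PR uses a different escaping rule, that branch needs to change.

```python
def read_quoted_value(csv_data, pos):
    """Read a quoted value starting at pos, the index just after the opening quote.

    Returns the unescaped value and the index just after the closing quote.
    """
    parts = []
    while True:
        # str.find scans in C, so we jump straight to the next quote
        # instead of inspecting every character in Python.
        next_quote = csv_data.find('"', pos)
        if next_quote == -1:
            raise ValueError("unterminated quoted value")
        parts.append(csv_data[pos:next_quote])
        if csv_data.startswith('""', next_quote):
            # Assumed convention: a doubled quote is an escaped literal quote.
            parts.append('"')
            pos = next_quote + 2
        else:
            # A single quote closes the value.
            return "".join(parts), next_quote + 1


value, end = read_quoted_value('"a""b\nc",next', 1)
print(repr(value), end)
```

Newlines inside the quoted value need no special handling here, since the loop only ever looks for quotes.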

Collaborator

@buddemat buddemat left a comment


I have two comments concerning performance, which I think should be addressed. I have not looked at the tests.

From what I read in the channels, both @julianjanssen and @ArneBab see some issues with the "full custom csv import" approach. We might want to discuss this once more?

3 participants