Custom CSV handling, small improvements to types and enrichment example #16
base: main
…only empty strings, by working on the input dataframe and not an empty dataframe
…tions and to allow working around issues with pandas csv parsing and writing
Should something be mentioned in the CHANGELOG? If we merge this, the only user-visible changes will be the slightly adjusted example, the "support" for Python 3.11, and some type annotation improvements. In that sense the CSV changes are not new features or behavior changes, but rather fixes to achieve the expected behavior in various edge cases.
lines.append(_format_row(columns_list, columns_list, None, None, None, None))

# Write data rows
for _, row in df.iterrows():
iterrows is very slow. It would probably be 5-10x faster to use itertuples or to convert the dataframe into a NumPy array, like so:
values = df.to_numpy(dtype=object, na_value=None)
for row in values:
    lines.append(_format_row(list(row), ...))
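To make the two suggested alternatives concrete, here is a minimal sketch that builds the per-row value lists both ways. The helper names `rows_via_itertuples` and `rows_via_numpy` and the sample dataframe are hypothetical, not part of this PR, and no timings are claimed; the actual speedup would need to be measured on real data.

```python
import pandas as pd


def rows_via_itertuples(df: pd.DataFrame) -> list[list]:
    # index=False drops the index column, name=None yields plain tuples
    # instead of namedtuples, which avoids per-row namedtuple construction.
    return [list(row) for row in df.itertuples(index=False, name=None)]


def rows_via_numpy(df: pd.DataFrame) -> list[list]:
    # A single conversion to an object array, then plain iteration over rows.
    return [list(row) for row in df.to_numpy(dtype=object)]


# Hypothetical sample data, only for illustration.
df = pd.DataFrame({"name": ["a", "b"], "count": [1, 2]})
itertuples_rows = rows_via_itertuples(df)
numpy_rows = rows_via_numpy(df)
```

Either list could then be passed row by row to the existing formatting function in place of the `iterrows` loop.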
# Quoted value - extract content (can contain newlines)
pos += 1
value = []
while pos < len(csv_data):
This looks at the whole payload character by character. For large data, this will be very slow.
We could use str.find() instead to find the next quote (should be implemented in C).
Something along the lines of:
while pos < len(csv_data):
    next_quote = csv_data.find('"', pos)
    value_parts.append(csv_data[pos:next_quote])
    pos = next_quote + 1
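A slightly fuller sketch of the same idea, for discussion only: it also handles the case where `str.find()` returns -1 and treats a doubled quote (`""`) as an escaped quote. The function name `read_quoted_field` is hypothetical, and the doubled-quote escaping is an assumption (RFC 4180 style); the parser in this PR may use a different convention.

```python
def read_quoted_field(csv_data: str, pos: int) -> tuple[str, int]:
    """Parse a quoted field whose opening quote is at csv_data[pos].

    Returns the unescaped value and the index just after the closing quote.
    Assumes a literal quote inside the field is written as two quotes.
    """
    if not csv_data.startswith('"', pos):
        raise ValueError("expected an opening quote")
    pos += 1
    value_parts = []
    while True:
        # Jump straight to the next quote instead of scanning char by char.
        next_quote = csv_data.find('"', pos)
        if next_quote == -1:
            raise ValueError("unterminated quoted field")
        value_parts.append(csv_data[pos:next_quote])
        if csv_data.startswith('"', next_quote + 1):
            # Doubled quote: keep one literal quote and continue scanning.
            value_parts.append('"')
            pos = next_quote + 2
        else:
            # Single quote: this closes the field.
            return "".join(value_parts), next_quote + 1


value, end = read_quoted_field('"a ""b""\nc",next', 0)
```

Collecting slices in a list and joining once at the end avoids repeated string concatenation, and embedded newlines need no special handling because only quotes are searched for.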
buddemat left a comment
I have two comments concerning performance, which I guess should be addressed. I have not looked at the tests.
From what I read in the channels, both @julianjanssen and @ArneBab see some issues with the "full custom csv import" approach. We might want to discuss this once more?
No description provided.