Skip to content

fix: preserve 'NA' string values in CSV parsing instead of converting to None#1058

Open
devin-ai-integration[bot] wants to merge 3 commits into
mainfrom
devin/1782155104-fix-na-string-preservation
Open

fix: preserve 'NA' string values in CSV parsing instead of converting to None#1058
devin-ai-integration[bot] wants to merge 3 commits into
mainfrom
devin/1782155104-fix-na-string-preservation

Conversation

@devin-ai-integration

@devin-ai-integration devin-ai-integration Bot commented Jun 22, 2026

Copy link
Copy Markdown
Contributor

Summary

ResponseToFileExtractor._read_with_chunks uses pd.read_csv() which by default interprets strings like 'NA', 'N/A', 'NaN', 'null' as NaN values (keep_default_na=True). This causes legitimate field values — e.g. 'NA' meaning "North America" in a Salesforce picklist — to be silently converted to None/NULL in synced data.

Fix: pass keep_default_na=False so pandas treats all cell values as literal strings (since we already set dtype=object). Empty CSV fields (which pandas would previously auto-tag as NaN) are now explicitly converted to None via .replace({"": None}) to maintain backward-compatible behavior.

- pd.read_csv(data, chunksize=chunk_size, iterator=True, dialect="unix", dtype=object)
+ pd.read_csv(data, chunksize=chunk_size, iterator=True, dialect="unix", dtype=object, keep_default_na=False)

- chunk.replace({nan: None})
+ chunk.replace({nan: None, "": None})

Link to Devin session: https://app.devin.ai/sessions/59e21b90c6c34df7bcae4bddc1947b44

… to None

pandas.read_csv() by default treats strings like 'NA', 'N/A', 'NaN' as
NaN values (via keep_default_na=True). This caused legitimate string
values like 'NA' (North America) in Salesforce fields to be incorrectly
converted to None/NULL during bulk API CSV parsing.

Adding keep_default_na=False prevents pandas from interpreting these
strings as missing values. Empty CSV fields are still correctly converted
to None via explicit empty string replacement.

Co-Authored-By: lucas.leadbetter@airbyte.io <5595530+lleadbet@users.noreply.github.com>
@devin-ai-integration

Copy link
Copy Markdown
Contributor Author

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

  • Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
  • Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.

⚙️ Control Options:

  • Disable automatic comment, CI, and merge conflict monitoring

@github-actions

Copy link
Copy Markdown

👋 Greetings, Airbyte Team Member!

Here are some helpful tips and reminders for your convenience.

💡 Show Tips and Tricks

Testing This CDK Version

You can test this version of the CDK using the following:

# Run the CLI from this branch:
uvx 'git+https://github.com/airbytehq/airbyte-python-cdk.git@devin/1782155104-fix-na-string-preservation#egg=airbyte-python-cdk[dev]' --help

# Update a connector to use the CDK from this branch ref:
cd airbyte-integrations/connectors/source-example
poe use-cdk-branch devin/1782155104-fix-na-string-preservation

PR Slash Commands

Airbyte Maintainers can execute the following slash commands on your PR:

  • /autofix - Fixes most formatting and linting issues
  • /poetry-lock - Updates poetry.lock file
  • /test - Runs connector tests with the updated CDK
  • /prerelease - Triggers a prerelease publish with default arguments
  • /poe build - Regenerate git-committed build artifacts, such as the pydantic models which are generated from the manifest JSON schema in YAML.
  • /poe <command> - Runs any poe command in the CDK environment
📚 Show Repo Guidance

Helpful Resources

📝 Edit this welcome message.

@github-actions

github-actions Bot commented Jun 22, 2026

Copy link
Copy Markdown

PyTest Results (Fast)

4 121 tests  +2   4 110 ✅ +2   8m 14s ⏱️ -2s
    1 suites ±0      11 💤 ±0 
    1 files   ±0       0 ❌ ±0 

Results for commit edc81bf. ± Comparison against base commit 977f051.

♻️ This comment has been updated with latest results.

@github-actions

github-actions Bot commented Jun 22, 2026

Copy link
Copy Markdown

PyTest Results (Full)

4 124 tests  +2   4 112 ✅ +2   12m 0s ⏱️ +24s
    1 suites ±0      12 💤 ±0 
    1 files   ±0       0 ❌ ±0 

Results for commit edc81bf. ± Comparison against base commit 977f051.

♻️ This comment has been updated with latest results.

@tolik0

Copy link
Copy Markdown
Contributor

Reviewed the fix — it is correct and well-targeted. One suggestion on the tests:

The two added tests only exercise "NA", but keep_default_na=True was silently corrupting all 19 of pandas' default NA tokens — N/A, NULL, None, nan, null, #N/A, <NA>, etc. Those other tokens are arguably the under-reported part of this bug (this customer only happened to notice "NA").

I verified the fix against pandas 2.2.3 and it preserves all of them:

Cell value Before fix After fix
NA, N/A, NULL, None, nan None (corrupted) preserved as string ✓
empty unquoted (,) None None
empty quoted ("") None None
whitespace (" ") " " " "

Suggestion: extend test_na_string_values_are_preserved to include a couple of the other tokens (e.g. N/A, NULL, nan) so the test documents the full intent and guards against a regression on the broader set, not just "NA". Cheap insurance, non-blocking.

The {nan: None} entry in the .replace(...) is now effectively dead (with keep_default_na=False + dtype=object pandas never emits NaN), but it is harmless defensive code — fine to leave as-is.

Covers NA, N/A, NULL, None, nan, null, #N/A, <NA> per reviewer feedback.

Co-Authored-By: lucas.leadbetter@airbyte.io <5595530+lleadbet@users.noreply.github.com>
@devin-ai-integration

Copy link
Copy Markdown
Contributor Author

Anatolii Yatsuk (@tolik0) ☑️ Resolved in 6e4989f. Extended test_na_string_values_are_preserved to cover 8 of pandas' default NA tokens: NA, N/A, NULL, None, nan, null, #N/A, <NA>.

Agreed on the {nan: None} in .replace() being effectively dead code now — leaving it as harmless defensive code.


Devin session

Co-Authored-By: lucas.leadbetter@airbyte.io <5595530+lleadbet@users.noreply.github.com>
@tolik0 Anatolii Yatsuk (tolik0) marked this pull request as ready for review June 23, 2026 10:42
Copilot AI review requested due to automatic review settings June 23, 2026 10:42
@tolik0

Anatolii Yatsuk (tolik0) commented Jun 23, 2026

Copy link
Copy Markdown
Contributor

/prerelease

Prerelease Job Info

This job triggers the publish workflow with default arguments to create a prerelease.

Prerelease job started... Check job output.

✅ Prerelease workflow triggered successfully.

View the publish workflow run: https://github.com/airbytehq/airbyte-python-cdk/actions/runs/28020507330

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates the declarative ResponseToFileExtractor CSV parsing behavior to avoid Pandas’ default conversion of common “NA-like” string tokens (e.g., "NA", "N/A", "null") into NaN/None, which can silently corrupt legitimate source values (e.g., Salesforce picklists).

Changes:

  • Pass keep_default_na=False to pd.read_csv(...) so NA-like string tokens are preserved as literal strings.
  • Explicitly map empty-string cells ("" / empty fields) to None to preserve prior behavior for missing values.
  • Add unit tests covering both preservation of NA-like tokens and conversion of empty fields to None.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File Description
airbyte_cdk/sources/declarative/extractors/response_to_file_extractor.py Disables Pandas default NA token coercion and normalizes empty strings to None during chunk conversion.
unit_tests/sources/declarative/extractors/test_response_to_file_extractor.py Adds regression tests ensuring NA-like strings are preserved and empty fields still become None.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants