@darynaishchenko
Contributor

@darynaishchenko darynaishchenko commented Nov 12, 2025

https://github.com/airbytehq/oncall/issues/9301

Summary by CodeRabbit

  • New Features

    • Added an option to use only the first discovered file for schema discovery to speed up inference.
  • Bug Fixes / Validation

    • Validation now detects and rejects conflicting schema-discovery options and surfaces a clearer validation error type.
    • Connection checks report configuration validation errors more explicitly; other failures are wrapped consistently. Stream input-schema validation flow was simplified.
  • Tests

    • Added unit and scenario tests covering the first-found-file discovery option and related multi-stream scenarios.

@darynaishchenko darynaishchenko self-assigned this Nov 12, 2025
@github-actions github-actions bot added the enhancement New feature or request label Nov 12, 2025
@github-actions

👋 Greetings, Airbyte Team Member!

Here are some helpful tips and reminders for your convenience.

Testing This CDK Version

You can test this version of the CDK using the following:

# Run the CLI from this branch:
uvx 'git+https://github.com/airbytehq/airbyte-python-cdk.git@daryna/file-based/update-discover#egg=airbyte-python-cdk[dev]' --help

# Update a connector to use the CDK from this branch ref:
cd airbyte-integrations/connectors/source-example
poe use-cdk-branch daryna/file-based/update-discover

Helpful Resources

PR Slash Commands

Airbyte Maintainers can execute the following slash commands on your PR:

  • /autofix - Fixes most formatting and linting issues
  • /poetry-lock - Updates poetry.lock file
  • /test - Runs connector tests with the updated CDK
  • /prerelease - Triggers a prerelease publish with default arguments
  • /poe build - Regenerate git-committed build artifacts, such as the pydantic models which are generated from the manifest JSON schema in YAML.
  • /poe <command> - Runs any poe command in the CDK environment

@coderabbitai
Contributor

coderabbitai bot commented Nov 12, 2025

📝 Walkthrough

Walkthrough

Added a boolean use_first_found_file_for_schema_discovery field to the stream config, with a root-level validator that enforces discovery-option compatibility. When the flag is enabled, DefaultFileBasedStream restricts schema discovery to the first discovered file. Also added a new error enum member, updated tests and scenarios, removed the inline _validate_input_schema validation, and adjusted check_connection error handling.

Changes

Cohort / File(s) | Summary
Configuration & Validation: airbyte_cdk/sources/file_based/config/file_based_stream_config.py | Imported Dict and root_validator; added public field use_first_found_file_for_schema_discovery: bool = Field(False, ...). Added @root_validator validate_discovery_related_fields to enforce mutual exclusivity/compatibility between discovery options and raise ConfigValidationError with ERROR_VALIDATION_STREAM_DISCOVERY_OPTIONS on conflicts.
Stream Implementation: airbyte_cdk/sources/file_based/stream/default_file_based_stream.py | _get_raw_json_schema now checks config.use_first_found_file_for_schema_discovery; when true, logs and restricts discovery input to the first discovered file (uses itertools.islice), causing inference to run on at most one file.
Source behavior & validation flow: airbyte_cdk/sources/file_based/file_based_source.py | Removed _validate_input_schema and its inline call in streams. check_connection now explicitly catches ConfigValidationError and wraps other exceptions as a config error with a generic message.
Errors: airbyte_cdk/sources/file_based/exceptions.py | Added ERROR_VALIDATION_STREAM_DISCOVERY_OPTIONS enum member for discovery-option validation conflicts.
CSV scenario schema & tests: unit_tests/sources/file_based/scenarios/csv_scenarios.py, unit_tests/sources/file_based/test_file_based_scenarios.py | Added use_first_found_file_for_schema_discovery boolean to CSV format schema (default false). Updated and added multi-stream scenarios and expectations to use ERROR_VALIDATION_STREAM_DISCOVERY_OPTIONS.
Unit tests (stream): unit_tests/sources/file_based/stream/test_default_file_based_stream.py | Added test_use_first_found_file_for_schema_discovery which mocks multiple remote files, enables the flag, asserts infer_schema is invoked once on the first file, and verifies produced schema fields/types.
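
As a rough illustration of the stream change summarized above, the sketch below shows how itertools.islice can cap discovery input at one file without pulling the rest of a lazy listing. The names files_for_discovery and lazy_listing, and the plain string file names, are hypothetical stand-ins; this is not the CDK's actual _get_raw_json_schema code.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def files_for_discovery(files: Iterable[str], use_first_found_file: bool) -> List[str]:
    """Return the files that schema inference should look at (illustrative only)."""
    iterator: Iterator[str] = iter(files)
    if use_first_found_file:
        # islice stops after one item, so a lazy listing is not read any further.
        return list(islice(iterator, 1))
    return list(iterator)


def lazy_listing(pulled: List[str]) -> Iterator[str]:
    """Hypothetical lazy file listing that records every name it hands out."""
    for name in ("a.csv", "b.csv", "c.csv"):
        pulled.append(name)
        yield name


pulled_names: List[str] = []
selected = files_for_discovery(lazy_listing(pulled_names), use_first_found_file=True)
print(selected)      # ['a.csv']
print(pulled_names)  # ['a.csv']
```

Whether the real listing is lazy depends on the stream reader, so the saving shown here is an assumption about the file source rather than a guarantee.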

Sequence Diagram(s)

sequenceDiagram
  participant Config as FileBasedStreamConfig
  participant Stream as DefaultFileBasedStream
  participant FS as FileFetcher
  participant Infer as SchemaInferencer
  rect rgb(245,250,255)
    Note over Config,Stream: Schema discovery (flag-aware)
  end
  Stream->>Config: read discovery flags
  alt use_first_found_file_for_schema_discovery == true
    Stream->>FS: get_files() (iterator)
    FS-->>Stream: files iterator
    Stream->>Stream: take first file (islice(...,1))
    Stream->>Infer: infer_schema(first_file)
    Infer-->>Stream: schema
  else
    Stream->>FS: get_files() (iterator)
    FS-->>Stream: files iterator
    Stream->>Infer: infer_schema(files up to recent_n_files...)
    Infer-->>Stream: schema
  end
  Stream->>Stream: consolidate & return schema
sequenceDiagram
  participant Source as FileBasedSource
  participant Streams as StreamFactory
  participant Validator as FileBasedStreamConfig (root_validator)
  rect rgb(250,245,245)
    Note over Source,Validator: Validation and connection check changes
  end
  Source->>Streams: construct streams
  Streams->>Validator: validate stream configs (root_validator)
  alt validation error
    Validator-->>Source: raise ConfigValidationError (ERROR_VALIDATION_STREAM_DISCOVERY_OPTIONS)
    Source->>Source: check_connection catches ConfigValidationError and returns formatted error
  else other error
    Streams-->>Source: other exception
    Source->>Source: wrap as config error and return generic failure
  end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

  • Inspect validate_discovery_related_fields for all edge-case flag combinations and error message consistency.
  • Verify Field metadata/default and serialization effects in file_based_stream_config.py.
  • Confirm itertools.islice usage in _get_raw_json_schema does not exhaust iterators needed elsewhere.
  • Review removal of _validate_input_schema and altered check_connection exception handling for regressions.
  • Run updated unit tests and new scenarios to ensure expected error codes and behavior.
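
On the iterator point in the checklist above, a small standalone sketch (plain strings instead of real file objects, an assumption made for brevity) shows why it is worth checking: islice advances the underlying iterator, so any later consumer of the same iterator only sees what is left.

```python
from itertools import islice

# One shared iterator, as a stream might hold for its file listing.
files = iter(["a.csv", "b.csv", "c.csv"])

first = list(islice(files, 1))  # consumes one item from the shared iterator
remaining = list(files)         # a later consumer only gets the rest

print(first)      # ['a.csv']
print(remaining)  # ['b.csv', 'c.csv']
```

If other code needs the full listing, it would have to request a fresh iterator rather than reuse the sliced one.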

Want a short changelog entry or a one-line config-doc snippet for this flag, wdyt?

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name | Status | Explanation | Resolution
Docstring Coverage | ⚠️ Warning | Docstring coverage is 40.00% which is insufficient. The required threshold is 80.00%. | You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
Check name | Status | Explanation
Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled.
Title check | ✅ Passed | The title clearly and concisely describes the main feature being added: a new option to use only the first found file for schema discovery in file-based sources.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f0443aa and 8615bb9.

📒 Files selected for processing (3)
  • airbyte_cdk/sources/file_based/config/file_based_stream_config.py (1 hunks)
  • airbyte_cdk/sources/file_based/stream/default_file_based_stream.py (1 hunks)
  • unit_tests/sources/file_based/stream/test_default_file_based_stream.py (1 hunks)
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: pnilan
Repo: airbytehq/airbyte-python-cdk PR: 0
File: :0-0
Timestamp: 2024-12-11T16:34:46.319Z
Learning: In the airbytehq/airbyte-python-cdk repository, ignore all `__init__.py` files when providing a recommended reviewing order.
📚 Learning: 2024-12-11T16:34:46.319Z
Learnt from: pnilan
Repo: airbytehq/airbyte-python-cdk PR: 0
File: :0-0
Timestamp: 2024-12-11T16:34:46.319Z
Learning: In the airbytehq/airbyte-python-cdk repository, the `declarative_component_schema.py` file is auto-generated from `declarative_component_schema.yaml` and should be ignored in the recommended reviewing order.

Applied to files:

  • airbyte_cdk/sources/file_based/config/file_based_stream_config.py
🧬 Code graph analysis (2)
unit_tests/sources/file_based/stream/test_default_file_based_stream.py (5)
airbyte_cdk/sources/file_based/file_based_stream_reader.py (3)
  • config (46-47)
  • config (51-61)
  • get_matching_files (79-99)
airbyte_cdk/sources/file_based/discovery_policy/abstract_discovery_policy.py (2)
  • get_max_n_files_for_schema_inference (21-21)
  • n_concurrent_requests (18-18)
airbyte_cdk/sources/file_based/discovery_policy/default_discovery_policy.py (2)
  • get_max_n_files_for_schema_inference (24-33)
  • n_concurrent_requests (21-22)
airbyte_cdk/sources/file_based/stream/default_file_based_stream.py (2)
  • infer_schema (338-342)
  • get_json_schema (246-269)
airbyte_cdk/sources/file_based/remote_file.py (1)
  • RemoteFile (11-18)
airbyte_cdk/sources/file_based/stream/default_file_based_stream.py (3)
airbyte_cdk/sources/streams/core.py (1)
  • logger (128-129)
airbyte_cdk/sources/file_based/stream/abstract_file_based_stream.py (1)
  • get_files (92-96)
airbyte_cdk/sources/file_based/stream/concurrent/adapters.py (1)
  • get_files (165-166)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (15)
  • GitHub Check: Check: source-hardcoded-records
  • GitHub Check: Check: source-google-drive
  • GitHub Check: Check: source-shopify
  • GitHub Check: Check: source-pokeapi
  • GitHub Check: Check: destination-motherduck
  • GitHub Check: Check: source-intercom
  • GitHub Check: SDM Docker Image Build
  • GitHub Check: Manifest Server Docker Image Build
  • GitHub Check: Pytest (Fast)
  • GitHub Check: Pytest (All, Python 3.10, Ubuntu)
  • GitHub Check: Pytest (All, Python 3.13, Ubuntu)
  • GitHub Check: Pytest (All, Python 3.11, Ubuntu)
  • GitHub Check: Pytest (All, Python 3.12, Ubuntu)
  • GitHub Check: Analyze (python)
  • GitHub Check: Analyze (python)

@github-actions

github-actions bot commented Nov 12, 2025

PyTest Results (Fast)

3 813 tests  +2   3 801 ✅ +1   6m 41s ⏱️ -31s
    1 suites ±0      12 💤 +1 
    1 files   ±0       0 ❌ ±0 

Results for commit 31b6be0. ± Comparison against base commit f0443aa.

This pull request removes 6 and adds 8 tests. Note that renamed tests count towards both.
unit_tests.sources.declarative.parsers.test_model_to_component_factory ‑ test_datetime_based_cursor
unit_tests.sources.declarative.parsers.test_model_to_component_factory ‑ test_given_data_feed_and_client_side_incremental_then_raise_error
unit_tests.sources.declarative.retrievers.test_simple_retriever ‑ test_limit_stream_slices
unit_tests.sources.declarative.retrievers.test_simple_retriever ‑ test_simple_retriever_resumable_full_refresh_cursor_page_increment[test_initial_sync_no_state]
unit_tests.sources.declarative.retrievers.test_simple_retriever ‑ test_simple_retriever_resumable_full_refresh_cursor_page_increment[test_reset_with_next_page_token]
unit_tests.sources.declarative.retrievers.test_simple_retriever ‑ test_simple_retriever_resumable_full_refresh_cursor_reset_skip_completed_stream
unit_tests.sources.declarative.retrievers.test_pagination_tracker.TestPaginationTracker ‑ test_cursor_called_with_original_slice_when_reduce_slice_range
unit_tests.sources.file_based.stream.test_default_file_based_stream.DefaultFileBasedStreamTest ‑ test_use_first_found_file_for_schema_discovery
unit_tests.sources.file_based.test_file_based_scenarios ‑ test_file_based_check[recent_n_files_to_read_for_schema_discovery_with_use_first_found_file_for_schema_discovery_fails_connection_check_multi_stream_scenario]
unit_tests.sources.file_based.test_file_based_scenarios ‑ test_file_based_check[schemaless_with_use_first_found_file_for_schema_discovery_fails_connection_check_multi_stream_scenario]
unit_tests.sources.file_based.test_file_based_scenarios ‑ test_file_based_discover[recent_n_files_to_read_for_schema_discovery_with_use_first_found_file_for_schema_discovery_fails_connection_check_multi_stream_scenario]
unit_tests.sources.file_based.test_file_based_scenarios ‑ test_file_based_discover[schemaless_with_use_first_found_file_for_schema_discovery_fails_connection_check_multi_stream_scenario]
unit_tests.sources.file_based.test_file_based_scenarios ‑ test_file_based_read[recent_n_files_to_read_for_schema_discovery_with_use_first_found_file_for_schema_discovery_fails_connection_check_multi_stream_scenario]
unit_tests.sources.file_based.test_file_based_scenarios ‑ test_file_based_read[schemaless_with_use_first_found_file_for_schema_discovery_fails_connection_check_multi_stream_scenario]
This pull request skips 1 test.
unit_tests.sources.declarative.test_concurrent_declarative_source ‑ test_read_with_concurrent_and_synchronous_streams

♻️ This comment has been updated with latest results.

@github-actions

github-actions bot commented Nov 12, 2025

PyTest Results (Full)

3 816 tests  +2   3 804 ✅ +2   10m 52s ⏱️ -12s
    1 suites ±0      12 💤 ±0 
    1 files   ±0       0 ❌ ±0 

Results for commit 31b6be0. ± Comparison against base commit f0443aa.

This pull request removes 6 and adds 8 tests. Note that renamed tests count towards both.
unit_tests.sources.declarative.parsers.test_model_to_component_factory ‑ test_datetime_based_cursor
unit_tests.sources.declarative.parsers.test_model_to_component_factory ‑ test_given_data_feed_and_client_side_incremental_then_raise_error
unit_tests.sources.declarative.retrievers.test_simple_retriever ‑ test_limit_stream_slices
unit_tests.sources.declarative.retrievers.test_simple_retriever ‑ test_simple_retriever_resumable_full_refresh_cursor_page_increment[test_initial_sync_no_state]
unit_tests.sources.declarative.retrievers.test_simple_retriever ‑ test_simple_retriever_resumable_full_refresh_cursor_page_increment[test_reset_with_next_page_token]
unit_tests.sources.declarative.retrievers.test_simple_retriever ‑ test_simple_retriever_resumable_full_refresh_cursor_reset_skip_completed_stream
unit_tests.sources.declarative.retrievers.test_pagination_tracker.TestPaginationTracker ‑ test_cursor_called_with_original_slice_when_reduce_slice_range
unit_tests.sources.file_based.stream.test_default_file_based_stream.DefaultFileBasedStreamTest ‑ test_use_first_found_file_for_schema_discovery
unit_tests.sources.file_based.test_file_based_scenarios ‑ test_file_based_check[recent_n_files_to_read_for_schema_discovery_with_use_first_found_file_for_schema_discovery_fails_connection_check_multi_stream_scenario]
unit_tests.sources.file_based.test_file_based_scenarios ‑ test_file_based_check[schemaless_with_use_first_found_file_for_schema_discovery_fails_connection_check_multi_stream_scenario]
unit_tests.sources.file_based.test_file_based_scenarios ‑ test_file_based_discover[recent_n_files_to_read_for_schema_discovery_with_use_first_found_file_for_schema_discovery_fails_connection_check_multi_stream_scenario]
unit_tests.sources.file_based.test_file_based_scenarios ‑ test_file_based_discover[schemaless_with_use_first_found_file_for_schema_discovery_fails_connection_check_multi_stream_scenario]
unit_tests.sources.file_based.test_file_based_scenarios ‑ test_file_based_read[recent_n_files_to_read_for_schema_discovery_with_use_first_found_file_for_schema_discovery_fails_connection_check_multi_stream_scenario]
unit_tests.sources.file_based.test_file_based_scenarios ‑ test_file_based_read[schemaless_with_use_first_found_file_for_schema_discovery_fails_connection_check_multi_stream_scenario]

♻️ This comment has been updated with latest results.

Contributor

@maxi297 maxi297 left a comment

A couple comments just to make sure we cover everything here

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (2)
unit_tests/sources/file_based/scenarios/csv_scenarios.py (2)

488-493: Consider enhancing the description for clarity.

The description "Helps to avoid long discovery step" could be more explicit about the exact behavior. What do you think about making it clearer that schema inference will be limited to exactly the first file found? For example:

-                                    "description": "When enabled, the source will use the first found file for schema discovery. Helps to avoid long discovery step",
+                                    "description": "When enabled, the source will use only the first found file for schema discovery, rather than sampling multiple files. This can significantly speed up the discovery step when working with many files.",

This would help users understand both what happens (single file only) and why it matters (performance optimization), wdyt?


488-493: Consider adding a CSV scenario that exercises this flag.

While the new field is correctly added to the expected spec, I noticed that none of the CSV scenarios in this file actually set use_first_found_file_for_schema_discovery: True in their configuration.

Would it be valuable to add a CSV-specific test scenario that demonstrates this flag in action? For example, a scenario similar to multi_csv_scenario but with this flag enabled, expecting the schema to only include columns from the first file. This would:

  1. Provide clear documentation of the feature's behavior with CSV files
  2. Ensure the flag works correctly with CSV-specific configurations
  3. Complement the unit test in test_default_file_based_stream.py mentioned in the AI summary

Wdyt?
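
To make the expected outcome of such a scenario concrete, here is a minimal sketch using only the standard library. The helper infer_columns and the in-memory FILES mapping are hypothetical and do not use the scenario builder or the CDK's CSV parser; they only show how the discovered columns would differ when inference is limited to the first file.

```python
import csv
import io
from itertools import islice
from typing import Dict, List

# Hypothetical in-memory files; the second file has an extra column.
FILES: Dict[str, str] = {
    "a.csv": "col1,col2\nval11,val12\n",
    "b.csv": "col1,col2,col3\nval21,val22,val23\n",
}


def infer_columns(files: Dict[str, str], first_only: bool) -> List[str]:
    """Collect header names from the selected files, preserving first-seen order."""
    names = iter(files)  # dicts keep insertion order
    selected = islice(names, 1) if first_only else names
    columns: List[str] = []
    for name in selected:
        header = next(csv.reader(io.StringIO(files[name])), [])
        for column in header:
            if column not in columns:
                columns.append(column)
    return columns


print(infer_columns(FILES, first_only=True))   # ['col1', 'col2']
print(infer_columns(FILES, first_only=False))  # ['col1', 'col2', 'col3']
```

A real scenario would presumably assert on the discovered catalog instead, but the contrast between the two results is the behavior the test would pin down.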

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 377884c and 10a3b4e.

📒 Files selected for processing (1)
  • unit_tests/sources/file_based/scenarios/csv_scenarios.py (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: pnilan
Repo: airbytehq/airbyte-python-cdk PR: 0
File: :0-0
Timestamp: 2024-12-11T16:34:46.319Z
Learning: In the airbytehq/airbyte-python-cdk repository, ignore all `__init__.py` files when providing a recommended reviewing order.
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (14)
  • GitHub Check: Check: source-google-drive
  • GitHub Check: Check: source-shopify
  • GitHub Check: Check: source-pokeapi
  • GitHub Check: Check: source-intercom
  • GitHub Check: Check: source-hardcoded-records
  • GitHub Check: Check: destination-motherduck
  • GitHub Check: Pytest (Fast)
  • GitHub Check: Pytest (All, Python 3.10, Ubuntu)
  • GitHub Check: Manifest Server Docker Image Build
  • GitHub Check: SDM Docker Image Build
  • GitHub Check: Pytest (All, Python 3.12, Ubuntu)
  • GitHub Check: Pytest (All, Python 3.13, Ubuntu)
  • GitHub Check: Pytest (All, Python 3.11, Ubuntu)
  • GitHub Check: Analyze (python)

@darynaishchenko
Contributor Author

darynaishchenko commented Nov 13, 2025

/autofix

Auto-Fix Job Info

This job attempts to auto-fix any linting or formatting issues. If any fixes are made,
those changes will be automatically committed and pushed back to the PR.

Note: This job can only be run by maintainers. On PRs from forks, this command requires
that the PR author has enabled the Allow edits from maintainers option.

PR auto-fix job started... Check job output.

✅ Changes applied successfully.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (2)
airbyte_cdk/sources/file_based/config/file_based_stream_config.py (2)

77-81: Minor grammar improvement in the description.

The description on line 79 could be slightly improved. Consider:

  • "Helps to avoid a long discovery step" or
  • "Helps to avoid long discovery steps"

This was mentioned in a previous review comment, so just wanted to follow up. Wdyt?

Apply this diff:

     use_first_found_file_for_schema_discovery: bool = Field(
         title="Use First Found File For Schema Discover",
-        description="When enabled, the source will use the first found file for schema discovery. Helps to avoid long discovery step",
+        description="When enabled, the source will use the first found file for schema discovery. Helps to avoid a long discovery step.",
         default=False,
     )

111-117: Consider splitting the validation condition for clarity.

The boolean logic on lines 111-114 is correct but somewhat complex. Would you consider splitting it into two separate conditions for readability? For example:

if recent_n_files_to_read_for_schema_discovery and use_first_found_file_for_schema_discovery:
    raise ConfigValidationError(
        FileBasedSourceError.ERROR_VALIDATION_STREAM_DISCOVERY_OPTIONS
    )

if [schemaless, input_schema, use_first_found_file_for_schema_discovery].count(True) > 1:
    raise ConfigValidationError(
        FileBasedSourceError.ERROR_VALIDATION_STREAM_DISCOVERY_OPTIONS
    )

This would make the two distinct validation rules more explicit. But the current implementation is functionally correct, so this is just a readability suggestion. Wdyt?
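
For reference, the two rules can be sketched as a standalone function using only the standard library. DiscoveryOptionsError and validate_discovery_options are hypothetical stand-ins for ConfigValidationError and the pydantic root validator, and the sketch assumes input_schema may be a string or mapping rather than a bool. Under that assumption the values are normalized to booleans first, because list.count(True) compares by equality and would not count a non-empty string or dict.

```python
from typing import Any, Dict


class DiscoveryOptionsError(ValueError):
    """Hypothetical stand-in for the CDK's ConfigValidationError."""


def validate_discovery_options(values: Dict[str, Any]) -> Dict[str, Any]:
    """Reject conflicting schema-discovery options (illustrative only)."""
    schemaless = bool(values.get("schemaless"))
    has_input_schema = bool(values.get("input_schema"))
    recent_n = values.get("recent_n_files_to_read_for_schema_discovery")
    first_file = bool(values.get("use_first_found_file_for_schema_discovery"))

    # Rule 1: sampling the N most recent files conflicts with using only the first file.
    if recent_n and first_file:
        raise DiscoveryOptionsError(
            "recent_n_files_to_read_for_schema_discovery cannot be combined with "
            "use_first_found_file_for_schema_discovery"
        )

    # Rule 2: at most one of schemaless, input_schema, and the first-file flag.
    if [schemaless, has_input_schema, first_file].count(True) > 1:
        raise DiscoveryOptionsError(
            "only one of schemaless, input_schema, and "
            "use_first_found_file_for_schema_discovery may be set"
        )
    return values
```

Whether the normalization matters in the actual validator depends on how it reads those fields, which this thread does not show.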

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 10a3b4e and 6cb4153.

📒 Files selected for processing (5)
  • airbyte_cdk/sources/file_based/config/file_based_stream_config.py (3 hunks)
  • airbyte_cdk/sources/file_based/exceptions.py (1 hunks)
  • airbyte_cdk/sources/file_based/file_based_source.py (1 hunks)
  • unit_tests/sources/file_based/scenarios/csv_scenarios.py (3 hunks)
  • unit_tests/sources/file_based/test_file_based_scenarios.py (3 hunks)
🧰 Additional context used
🧠 Learnings (3)
📚 Learning: 2024-11-18T23:40:06.391Z
Learnt from: ChristoGrab
Repo: airbytehq/airbyte-python-cdk PR: 58
File: airbyte_cdk/sources/declarative/yaml_declarative_source.py:0-0
Timestamp: 2024-11-18T23:40:06.391Z
Learning: When modifying the `YamlDeclarativeSource` class in `airbyte_cdk/sources/declarative/yaml_declarative_source.py`, avoid introducing breaking changes like altering method signatures within the scope of unrelated PRs. Such changes should be addressed separately to minimize impact on existing implementations.

Applied to files:

  • airbyte_cdk/sources/file_based/file_based_source.py
  • airbyte_cdk/sources/file_based/config/file_based_stream_config.py
📚 Learning: 2024-11-15T01:04:21.272Z
Learnt from: aaronsteers
Repo: airbytehq/airbyte-python-cdk PR: 58
File: airbyte_cdk/cli/source_declarative_manifest/_run.py:62-65
Timestamp: 2024-11-15T01:04:21.272Z
Learning: The files in `airbyte_cdk/cli/source_declarative_manifest/`, including `_run.py`, are imported from another repository, and changes to these files should be minimized or avoided when possible to maintain consistency.

Applied to files:

  • airbyte_cdk/sources/file_based/config/file_based_stream_config.py
📚 Learning: 2024-12-11T16:34:46.319Z
Learnt from: pnilan
Repo: airbytehq/airbyte-python-cdk PR: 0
File: :0-0
Timestamp: 2024-12-11T16:34:46.319Z
Learning: In the airbytehq/airbyte-python-cdk repository, the `declarative_component_schema.py` file is auto-generated from `declarative_component_schema.yaml` and should be ignored in the recommended reviewing order.

Applied to files:

  • airbyte_cdk/sources/file_based/config/file_based_stream_config.py
🧬 Code graph analysis (3)
airbyte_cdk/sources/file_based/file_based_source.py (3)
airbyte_cdk/sources/file_based/exceptions.py (2)
  • ConfigValidationError (75-76)
  • FileBasedSourceError (12-39)
airbyte_cdk/utils/traced_exception.py (1)
  • AirbyteTracedException (25-145)
airbyte_cdk/sources/declarative/models/declarative_component_schema.py (1)
  • FailureType (546-549)
airbyte_cdk/sources/file_based/config/file_based_stream_config.py (1)
airbyte_cdk/sources/file_based/exceptions.py (2)
  • ConfigValidationError (75-76)
  • FileBasedSourceError (12-39)
unit_tests/sources/file_based/scenarios/csv_scenarios.py (3)
unit_tests/sources/file_based/scenarios/scenario_builder.py (7)
  • set_expected_check_error (188-192)
  • set_expected_discover_error (194-198)
  • set_expected_read_error (200-204)
  • build (35-41)
  • build (225-258)
  • TestScenario (44-115)
  • TestScenarioBuilder (118-273)
airbyte_cdk/sources/file_based/exceptions.py (2)
  • FileBasedSourceError (12-39)
  • ConfigValidationError (75-76)
unit_tests/sources/file_based/in_memory_files_source.py (1)
  • InMemoryFilesSource (50-99)
🪛 GitHub Actions: Linters
airbyte_cdk/sources/file_based/config/file_based_stream_config.py

[error] 93-93: mypy: Function is missing a type annotation [no-untyped-def]

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (13)
  • GitHub Check: Check: source-google-drive
  • GitHub Check: Check: source-pokeapi
  • GitHub Check: Check: source-intercom
  • GitHub Check: Check: source-hardcoded-records
  • GitHub Check: Check: destination-motherduck
  • GitHub Check: Check: source-shopify
  • GitHub Check: Pytest (Fast)
  • GitHub Check: Pytest (All, Python 3.10, Ubuntu)
  • GitHub Check: Pytest (All, Python 3.12, Ubuntu)
  • GitHub Check: Pytest (All, Python 3.13, Ubuntu)
  • GitHub Check: Pytest (All, Python 3.11, Ubuntu)
  • GitHub Check: Manifest Server Docker Image Build
  • GitHub Check: SDM Docker Image Build
🔇 Additional comments (8)
airbyte_cdk/sources/file_based/exceptions.py (1)

26-26: LGTM!

The error message clearly describes the mutual exclusivity constraint for schema discovery options. It aligns well with the validation logic being added in the root validator.

unit_tests/sources/file_based/test_file_based_scenarios.py (2)

84-87: LGTM!

The new scenario imports follow the existing naming conventions and clearly indicate they test validation failures for conflicting discovery options.


212-213: LGTM!

The new scenarios are correctly wired into the appropriate test lists and will provide coverage for the validation error paths when conflicting discovery options are used.

Also applies to: 319-320

airbyte_cdk/sources/file_based/file_based_source.py (1)

159-172: LGTM!

The refined exception handling correctly distinguishes configuration validation errors from other errors, preserving the specific error message from the root validator. The generic exception handler provides a safety net while maintaining consistent error reporting.
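
The pattern described here can be sketched roughly as follows. This is a simplified model, not the code in file_based_source.py: ConfigValidationError is redefined locally, build_streams is a hypothetical callable, and the function returns a tuple, whereas the real method may raise a traced exception instead.

```python
from typing import Callable, Optional, Tuple

GENERIC_FAILURE = "Connection check failed with an unexpected error."


class ConfigValidationError(Exception):
    """Local stand-in for the CDK exception of the same name."""


def check_connection(build_streams: Callable[[], object]) -> Tuple[bool, Optional[str]]:
    """Report validation errors with their own message and everything else generically."""
    try:
        build_streams()
    except ConfigValidationError as exc:
        # Preserve the specific message produced by the config validator.
        return False, str(exc)
    except Exception:
        # Safety net: any other failure is reported with one consistent message.
        return False, GENERIC_FAILURE
    return True, None
```

Ordering matters in this shape: the specific handler has to come before the broad one, otherwise the validator's message would be replaced by the generic text.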

airbyte_cdk/sources/file_based/config/file_based_stream_config.py (1)

8-8: LGTM!

The root_validator import is correctly added from pydantic.v1, consistent with the existing imports.

unit_tests/sources/file_based/scenarios/csv_scenarios.py (3)

488-493: LGTM!

The expected spec correctly includes the new use_first_found_file_for_schema_discovery field with appropriate type, default, title, and description. The placement after recent_n_files_to_read_for_schema_discovery is logical.


2123-2131: LGTM!

The updated error expectations correctly reflect that the schemaless + input_schema conflict is now caught by the root validator and returns ERROR_VALIDATION_STREAM_DISCOVERY_OPTIONS. Both single-stream and multi-stream scenarios are consistently updated.

Also applies to: 2228-2236


2239-2452: LGTM!

The new test scenarios comprehensively cover the validation error paths:

  1. Conflict between recent_n_files_to_read_for_schema_discovery and use_first_found_file_for_schema_discovery
  2. Conflict between schemaless and use_first_found_file_for_schema_discovery

Both scenarios appropriately test multi-stream configurations and expect the correct ERROR_VALIDATION_STREAM_DISCOVERY_OPTIONS error across check, discover, and read operations.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (2)
unit_tests/sources/file_based/scenarios/csv_scenarios.py (2)

2239-2344: Verify this scenario is wired into the test runner.

Past review comments flag this scenario variable as unused. Please ensure it's properly exported and included in the test execution, either in unit_tests/sources/file_based/test_file_based_scenarios.py or another appropriate test file, wdyt?


2347-2452: Verify this scenario is wired into the test runner.

Similar to the scenario at line 2239, past review comments indicate this scenario variable is unused. Please ensure it's properly exported and included in test execution, wdyt?

🧹 Nitpick comments (3)
unit_tests/sources/file_based/scenarios/csv_scenarios.py (1)

488-493: Consider polishing the description for clarity, wdyt?

The description "Helps to avoid long discovery step" is missing an article. Would "Helps to avoid a long discovery step" or "Helps to avoid long discovery steps" read more naturally?

airbyte_cdk/sources/file_based/config/file_based_stream_config.py (2)

77-81: Minor typo and grammar polish needed, wdyt?

Two small issues:

  1. The title has "Use First Found File For Schema Discover" - should be "Schema Discovery" (missing the 'y')
  2. The description "Helps to avoid long discovery step" is missing an article - perhaps "Helps to avoid a long discovery step" would be clearer?

Apply this diff:

     use_first_found_file_for_schema_discovery: bool = Field(
-        title="Use First Found File For Schema Discover",
+        title="Use First Found File For Schema Discovery",
-        description="When enabled, the source will use the first found file for schema discovery. Helps to avoid long discovery step",
+        description="When enabled, the source will use the first found file for schema discovery. Helps to avoid a long discovery step.",
         default=False,
     )

96-96: Typo in the comment, wdyt?

Line 96 has "compatability" which should be "compatibility".

Apply this diff:

-        Validates schema discovery options compatability.
+        Validates schema discovery options compatibility.
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6cb4153 and 01af8ff.

📒 Files selected for processing (2)
  • airbyte_cdk/sources/file_based/config/file_based_stream_config.py (3 hunks)
  • unit_tests/sources/file_based/scenarios/csv_scenarios.py (3 hunks)
🧰 Additional context used
🧠 Learnings (3)
📚 Learning: 2024-11-15T01:04:21.272Z
Learnt from: aaronsteers
Repo: airbytehq/airbyte-python-cdk PR: 58
File: airbyte_cdk/cli/source_declarative_manifest/_run.py:62-65
Timestamp: 2024-11-15T01:04:21.272Z
Learning: The files in `airbyte_cdk/cli/source_declarative_manifest/`, including `_run.py`, are imported from another repository, and changes to these files should be minimized or avoided when possible to maintain consistency.

Applied to files:

  • airbyte_cdk/sources/file_based/config/file_based_stream_config.py
📚 Learning: 2024-11-18T23:40:06.391Z
Learnt from: ChristoGrab
Repo: airbytehq/airbyte-python-cdk PR: 58
File: airbyte_cdk/sources/declarative/yaml_declarative_source.py:0-0
Timestamp: 2024-11-18T23:40:06.391Z
Learning: When modifying the `YamlDeclarativeSource` class in `airbyte_cdk/sources/declarative/yaml_declarative_source.py`, avoid introducing breaking changes like altering method signatures within the scope of unrelated PRs. Such changes should be addressed separately to minimize impact on existing implementations.

Applied to files:

  • airbyte_cdk/sources/file_based/config/file_based_stream_config.py
📚 Learning: 2024-12-11T16:34:46.319Z
Learnt from: pnilan
Repo: airbytehq/airbyte-python-cdk PR: 0
File: :0-0
Timestamp: 2024-12-11T16:34:46.319Z
Learning: In the airbytehq/airbyte-python-cdk repository, the `declarative_component_schema.py` file is auto-generated from `declarative_component_schema.yaml` and should be ignored in the recommended reviewing order.

Applied to files:

  • airbyte_cdk/sources/file_based/config/file_based_stream_config.py
🧬 Code graph analysis (2)
unit_tests/sources/file_based/scenarios/csv_scenarios.py (3)
unit_tests/sources/file_based/scenarios/scenario_builder.py (7)
  • set_expected_check_error (188-192)
  • set_expected_discover_error (194-198)
  • set_expected_read_error (200-204)
  • build (35-41)
  • build (225-258)
  • TestScenario (44-115)
  • TestScenarioBuilder (118-273)
airbyte_cdk/sources/file_based/exceptions.py (2)
  • FileBasedSourceError (12-39)
  • ConfigValidationError (75-76)
unit_tests/sources/file_based/in_memory_files_source.py (1)
  • InMemoryFilesSource (50-99)
airbyte_cdk/sources/file_based/config/file_based_stream_config.py (1)
airbyte_cdk/sources/file_based/exceptions.py (2)
  • ConfigValidationError (75-76)
  • FileBasedSourceError (12-39)
🪛 GitHub Actions: Linters
airbyte_cdk/sources/file_based/config/file_based_stream_config.py

[error] 93-93: mypy: Missing type parameters for generic type "dict" [type-arg]
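
A minimal sketch of how this kind of mypy error is usually resolved, assuming the flagged line annotates a validator argument with a bare `dict`. The function name `validate_options` is hypothetical and is not taken from the PR.

```python
from typing import Any, Dict


# With generics checking enabled, mypy reports a bare `dict` annotation as
#   Missing type parameters for generic type "dict"  [type-arg]
# for a signature such as:
#
#     def validate_options(values: dict) -> dict: ...
#
# Spelling out the key and value types resolves the error.
def validate_options(values: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical validator body: returns the values unchanged."""
    return values
```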

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (14)
  • GitHub Check: Check: destination-motherduck
  • GitHub Check: Check: source-intercom
  • GitHub Check: Check: source-hardcoded-records
  • GitHub Check: Check: source-google-drive
  • GitHub Check: Check: source-shopify
  • GitHub Check: Check: source-pokeapi
  • GitHub Check: Pytest (Fast)
  • GitHub Check: Manifest Server Docker Image Build
  • GitHub Check: SDM Docker Image Build
  • GitHub Check: Pytest (All, Python 3.12, Ubuntu)
  • GitHub Check: Pytest (All, Python 3.13, Ubuntu)
  • GitHub Check: Pytest (All, Python 3.11, Ubuntu)
  • GitHub Check: Pytest (All, Python 3.10, Ubuntu)
  • GitHub Check: Analyze (python)
🔇 Additional comments (3)
unit_tests/sources/file_based/scenarios/csv_scenarios.py (1)

2123-2131: LGTM!

The error type change from CONFIG_VALIDATION_ERROR to ERROR_VALIDATION_STREAM_DISCOVERY_OPTIONS correctly reflects the more specific validation for mutually exclusive discovery options.

airbyte_cdk/sources/file_based/config/file_based_stream_config.py (2)

8-8: LGTM!

The root_validator import is correctly added and necessary for the new cross-field validation logic.


102-119: LGTM on the validation logic!

The mutual exclusivity validation correctly enforces that the new use_first_found_file_for_schema_discovery cannot be combined with other discovery options, while maintaining backward compatibility for existing recent_n_files_to_read_for_schema_discovery combinations. The comment clearly explains the rationale for this approach.
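
The check described above can be sketched as a plain function over the raw field values, which is the shape a pydantic `root_validator` body receives. This is a sketch under assumptions, not the merged code: `ConfigValidationError` here is a local stand-in for the CDK exception, and the list of conflicting options (`input_schema`, `schemaless`, `recent_n_files_to_read_for_schema_discovery`) is assumed rather than copied from the PR.

```python
from typing import Any, Dict


class ConfigValidationError(Exception):
    """Local stand-in for the CDK exception of the same name."""


# Assumed names of the other discovery-related options.
OTHER_DISCOVERY_OPTIONS = (
    "input_schema",
    "schemaless",
    "recent_n_files_to_read_for_schema_discovery",
)


def validate_discovery_options(values: Dict[str, Any]) -> Dict[str, Any]:
    """Reject the new flag when it is combined with another discovery option."""
    use_first_file = bool(values.get("use_first_found_file_for_schema_discovery"))
    conflicting = [name for name in OTHER_DISCOVERY_OPTIONS if values.get(name)]
    # Only the new flag is restricted; existing combinations stay valid.
    if use_first_file and conflicting:
        raise ConfigValidationError(
            "use_first_found_file_for_schema_discovery cannot be combined with: "
            + ", ".join(conflicting)
        )
    return values
```

Restricting only the new flag means configurations that already combine `recent_n_files_to_read_for_schema_discovery` with other options keep passing validation, which matches the backward-compatibility point above.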

@darynaishchenko
Contributor Author

darynaishchenko commented Nov 13, 2025

/autofix

Auto-Fix Job Info

This job attempts to auto-fix any linting or formatting issues. If any fixes are made,
those changes will be automatically committed and pushed back to the PR.

Note: This job can only be run by maintainers. On PRs from forks, this command requires
that the PR author has enabled the Allow edits from maintainers option.

PR auto-fix job started... Check job output.

✅ Changes applied successfully.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

📜 Review details

📥 Commits

Reviewing files that changed from the base of the PR and between f723764 and 0d5e443.

📒 Files selected for processing (1)
  • airbyte_cdk/sources/file_based/config/file_based_stream_config.py (3 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
airbyte_cdk/sources/file_based/config/file_based_stream_config.py (1)
airbyte_cdk/sources/file_based/exceptions.py (2)
  • ConfigValidationError (75-76)
  • FileBasedSourceError (12-39)
🔇 Additional comments (1)
airbyte_cdk/sources/file_based/config/file_based_stream_config.py (1)

6-6: LGTM!

The new imports (Dict from typing and root_validator from pydantic) are necessary for the validation logic below and are correctly added.

Also applies to: 8-8

Contributor

@maxi297 maxi297 left a comment

LGTM!

@darynaishchenko darynaishchenko changed the title from "feat(file-based): use only first found file for discover" to "feat(file-based): option to use only first found file for discover" Nov 13, 2025
@darynaishchenko darynaishchenko merged commit 5d9125f into main Nov 13, 2025
27 of 30 checks passed
@darynaishchenko darynaishchenko deleted the daryna/file-based/update-discover branch November 13, 2025 15:10
Labels

enhancement New feature or request

3 participants