feat(file-based): option to use only first found file for discover #841

darynaishchenko · 2025-11-12T13:58:23Z

https://github.com/airbytehq/oncall/issues/9301

Summary by CodeRabbit

New Features
- Added an option to use only the first discovered file for schema discovery to speed up inference.
Bug Fixes / Validation
- Validation now detects and rejects conflicting schema-discovery options and surfaces a clearer validation error type.
- Connection checks report configuration validation errors more explicitly; other failures are wrapped consistently. Stream input-schema validation flow was simplified.
Tests
- Added unit and scenario tests covering the first-found-file discovery option and related multi-stream scenarios.

github-actions · 2025-11-12T13:58:38Z

👋 Greetings, Airbyte Team Member!

Here are some helpful tips and reminders for your convenience.

Testing This CDK Version

You can test this version of the CDK using the following:

# Run the CLI from this branch:
uvx 'git+https://github.com/airbytehq/airbyte-python-cdk.git@daryna/file-based/update-discover#egg=airbyte-python-cdk[dev]' --help

# Update a connector to use the CDK from this branch ref:
cd airbyte-integrations/connectors/source-example
poe use-cdk-branch daryna/file-based/update-discover

Helpful Resources

CDK API Reference

PR Slash Commands

Airbyte Maintainers can execute the following slash commands on your PR:

/autofix - Fixes most formatting and linting issues
/poetry-lock - Updates poetry.lock file
/test - Runs connector tests with the updated CDK
/prerelease - Triggers a prerelease publish with default arguments
/poe build - Regenerate git-committed build artifacts, such as the pydantic models which are generated from the manifest JSON schema in YAML.
/poe <command> - Runs any poe command in the CDK environment

📝 Edit this welcome message.

coderabbitai · 2025-11-12T14:02:08Z

📝 Walkthrough

Walkthrough

Added boolean use_first_found_file_for_schema_discovery to stream config with a root-level validator enforcing discovery-option compatibility. DefaultFileBasedStream can restrict schema discovery to the first discovered file when enabled. Added new error enum member, updated tests/scenarios, and removed the _validate_input_schema inline validation with adjusted check_connection error handling.

Changes

Cohort / File(s)	Summary
Configuration & Validation `airbyte_cdk/sources/file_based/config/file_based_stream_config.py`	Imported `Dict` and `root_validator`; added public field `use_first_found_file_for_schema_discovery: bool = Field(False, ...)`. Added `@root_validator` `validate_discovery_related_fields` to enforce mutual exclusivity/compatibility between discovery options and raise `ConfigValidationError` with `ERROR_VALIDATION_STREAM_DISCOVERY_OPTIONS` on conflicts.
Stream Implementation `airbyte_cdk/sources/file_based/stream/default_file_based_stream.py`	`_get_raw_json_schema` now checks `config.use_first_found_file_for_schema_discovery`; when true, logs and restricts discovery input to the first discovered file (uses `itertools.islice`), causing inference to run on at most one file.
Source behavior & validation flow `airbyte_cdk/sources/file_based/file_based_source.py`	Removed `_validate_input_schema` and its inline call in `streams`. `check_connection` now explicitly catches `ConfigValidationError` and wraps other exceptions as a config error with a generic message.
Errors `airbyte_cdk/sources/file_based/exceptions.py`	Added `ERROR_VALIDATION_STREAM_DISCOVERY_OPTIONS` enum member for discovery-option validation conflicts.
CSV scenario schema & tests `unit_tests/sources/file_based/scenarios/csv_scenarios.py`, `unit_tests/sources/file_based/test_file_based_scenarios.py`	Added `use_first_found_file_for_schema_discovery` boolean to CSV format schema (default `false`). Updated and added multi-stream scenarios and expectations to use `ERROR_VALIDATION_STREAM_DISCOVERY_OPTIONS`.
Unit tests (stream) `unit_tests/sources/file_based/stream/test_default_file_based_stream.py`	Added `test_use_first_found_file_for_schema_discovery` which mocks multiple remote files, enables the flag, asserts `infer_schema` is invoked once on the first file, and verifies produced schema fields/types.

Sequence Diagram(s)

sequenceDiagram
  participant Config as FileBasedStreamConfig
  participant Stream as DefaultFileBasedStream
  participant FS as FileFetcher
  participant Infer as SchemaInferencer
  rect rgb(245,250,255)
    Note over Config,Stream: Schema discovery (flag-aware)
  end
  Stream->>Config: read discovery flags
  alt use_first_found_file_for_schema_discovery == true
    Stream->>FS: get_files() (iterator)
    FS-->>Stream: files iterator
    Stream->>Stream: take first file (islice(...,1))
    Stream->>Infer: infer_schema(first_file)
    Infer-->>Stream: schema
  else
    Stream->>FS: get_files() (iterator)
    FS-->>Stream: files iterator
    Stream->>Infer: infer_schema(files up to recent_n_files...)
    Infer-->>Stream: schema
  end
  Stream->>Stream: consolidate & return schema

sequenceDiagram
  participant Source as FileBasedSource
  participant Streams as StreamFactory
  participant Validator as FileBasedStreamConfig (root_validator)
  rect rgb(250,245,245)
    Note over Source,Validator: Validation and connection check changes
  end
  Source->>Streams: construct streams
  Streams->>Validator: validate stream configs (root_validator)
  alt validation error
    Validator-->>Source: raise ConfigValidationError (ERROR_VALIDATION_STREAM_DISCOVERY_OPTIONS)
    Source->>Source: check_connection catches ConfigValidationError and returns formatted error
  else other error
    Streams-->>Source: other exception
    Source->>Source: wrap as config error and return generic failure
  end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Inspect validate_discovery_related_fields for all edge-case flag combinations and error message consistency.
Verify Field metadata/default and serialization effects in file_based_stream_config.py.
Confirm itertools.islice usage in _get_raw_json_schema does not exhaust iterators needed elsewhere.
Review removal of _validate_input_schema and altered check_connection exception handling for regressions.
Run updated unit tests and new scenarios to ensure expected error codes and behavior.

Want a short changelog entry or a one-line config-doc snippet for this flag, wdyt?

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 40.00% which is insufficient. The required threshold is 80.00%.	You can run `@coderabbitai generate docstrings` to improve docstring coverage.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly and concisely describes the main feature being added: a new option to use only the first found file for schema discovery in file-based sources.

✨ Finishing touches

📝 Generate docstrings

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment
Commit unit tests in branch daryna/file-based/update-discover

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 3

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f0443aa and 8615bb9.

📒 Files selected for processing (3)

airbyte_cdk/sources/file_based/config/file_based_stream_config.py (1 hunks)
airbyte_cdk/sources/file_based/stream/default_file_based_stream.py (1 hunks)
unit_tests/sources/file_based/stream/test_default_file_based_stream.py (1 hunks)

🧰 Additional context used

🧠 Learnings (2)

📓 Common learnings

Learnt from: pnilan
Repo: airbytehq/airbyte-python-cdk PR: 0
File: :0-0
Timestamp: 2024-12-11T16:34:46.319Z
Learning: In the airbytehq/airbyte-python-cdk repository, ignore all `__init__.py` files when providing a recommended reviewing order.

📚 Learning: 2024-12-11T16:34:46.319Z

Learnt from: pnilan
Repo: airbytehq/airbyte-python-cdk PR: 0
File: :0-0
Timestamp: 2024-12-11T16:34:46.319Z
Learning: In the airbytehq/airbyte-python-cdk repository, the `declarative_component_schema.py` file is auto-generated from `declarative_component_schema.yaml` and should be ignored in the recommended reviewing order.

Applied to files:

airbyte_cdk/sources/file_based/config/file_based_stream_config.py

🧬 Code graph analysis (2)

unit_tests/sources/file_based/stream/test_default_file_based_stream.py (5)

airbyte_cdk/sources/file_based/file_based_stream_reader.py (3)

config (46-47)

config (51-61)

get_matching_files (79-99)

airbyte_cdk/sources/file_based/discovery_policy/abstract_discovery_policy.py (2)

get_max_n_files_for_schema_inference (21-21)

n_concurrent_requests (18-18)

airbyte_cdk/sources/file_based/discovery_policy/default_discovery_policy.py (2)

get_max_n_files_for_schema_inference (24-33)

n_concurrent_requests (21-22)

airbyte_cdk/sources/file_based/stream/default_file_based_stream.py (2)

infer_schema (338-342)

get_json_schema (246-269)

airbyte_cdk/sources/file_based/remote_file.py (1)

RemoteFile (11-18)

airbyte_cdk/sources/file_based/stream/default_file_based_stream.py (3)

airbyte_cdk/sources/streams/core.py (1)

logger (128-129)

airbyte_cdk/sources/file_based/stream/abstract_file_based_stream.py (1)

get_files (92-96)

airbyte_cdk/sources/file_based/stream/concurrent/adapters.py (1)

get_files (165-166)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (15)

GitHub Check: Check: source-hardcoded-records
GitHub Check: Check: source-google-drive
GitHub Check: Check: source-shopify
GitHub Check: Check: source-pokeapi
GitHub Check: Check: destination-motherduck
GitHub Check: Check: source-intercom
GitHub Check: SDM Docker Image Build
GitHub Check: Manifest Server Docker Image Build
GitHub Check: Pytest (Fast)
GitHub Check: Pytest (All, Python 3.10, Ubuntu)
GitHub Check: Pytest (All, Python 3.13, Ubuntu)
GitHub Check: Pytest (All, Python 3.11, Ubuntu)
GitHub Check: Pytest (All, Python 3.12, Ubuntu)
GitHub Check: Analyze (python)
GitHub Check: Analyze (python)

airbyte_cdk/sources/file_based/config/file_based_stream_config.py

airbyte_cdk/sources/file_based/stream/default_file_based_stream.py

unit_tests/sources/file_based/stream/test_default_file_based_stream.py

github-actions · 2025-11-12T14:06:09Z

PyTest Results (Fast)

3 813 tests +2 3 801 ✅ +1 6m 41s ⏱️ -31s
1 suites ±0 12 💤 +1
1 files ±0 0 ❌ ±0

Results for commit 31b6be0. ± Comparison against base commit f0443aa.

This pull request removes 6 and adds 8 tests. Note that renamed tests count towards both.

unit_tests.sources.declarative.parsers.test_model_to_component_factory ‑ test_datetime_based_cursor
unit_tests.sources.declarative.parsers.test_model_to_component_factory ‑ test_given_data_feed_and_client_side_incremental_then_raise_error
unit_tests.sources.declarative.retrievers.test_simple_retriever ‑ test_limit_stream_slices
unit_tests.sources.declarative.retrievers.test_simple_retriever ‑ test_simple_retriever_resumable_full_refresh_cursor_page_increment[test_initial_sync_no_state]
unit_tests.sources.declarative.retrievers.test_simple_retriever ‑ test_simple_retriever_resumable_full_refresh_cursor_page_increment[test_reset_with_next_page_token]
unit_tests.sources.declarative.retrievers.test_simple_retriever ‑ test_simple_retriever_resumable_full_refresh_cursor_reset_skip_completed_stream

unit_tests.sources.declarative.retrievers.test_pagination_tracker.TestPaginationTracker ‑ test_cursor_called_with_original_slice_when_reduce_slice_range
unit_tests.sources.file_based.stream.test_default_file_based_stream.DefaultFileBasedStreamTest ‑ test_use_first_found_file_for_schema_discovery
unit_tests.sources.file_based.test_file_based_scenarios ‑ test_file_based_check[recent_n_files_to_read_for_schema_discovery_with_use_first_found_file_for_schema_discovery_fails_connection_check_multi_stream_scenario]
unit_tests.sources.file_based.test_file_based_scenarios ‑ test_file_based_check[schemaless_with_use_first_found_file_for_schema_discovery_fails_connection_check_multi_stream_scenario]
unit_tests.sources.file_based.test_file_based_scenarios ‑ test_file_based_discover[recent_n_files_to_read_for_schema_discovery_with_use_first_found_file_for_schema_discovery_fails_connection_check_multi_stream_scenario]
unit_tests.sources.file_based.test_file_based_scenarios ‑ test_file_based_discover[schemaless_with_use_first_found_file_for_schema_discovery_fails_connection_check_multi_stream_scenario]
unit_tests.sources.file_based.test_file_based_scenarios ‑ test_file_based_read[recent_n_files_to_read_for_schema_discovery_with_use_first_found_file_for_schema_discovery_fails_connection_check_multi_stream_scenario]
unit_tests.sources.file_based.test_file_based_scenarios ‑ test_file_based_read[schemaless_with_use_first_found_file_for_schema_discovery_fails_connection_check_multi_stream_scenario]

This pull request skips 1 test.

unit_tests.sources.declarative.test_concurrent_declarative_source ‑ test_read_with_concurrent_and_synchronous_streams

♻️ This comment has been updated with latest results.

github-actions · 2025-11-12T14:11:00Z

PyTest Results (Full)

3 816 tests +2 3 804 ✅ +2 10m 52s ⏱️ -12s
1 suites ±0 12 💤 ±0
1 files ±0 0 ❌ ±0

Results for commit 31b6be0. ± Comparison against base commit f0443aa.

This pull request removes 6 and adds 8 tests. Note that renamed tests count towards both.

unit_tests.sources.declarative.parsers.test_model_to_component_factory ‑ test_datetime_based_cursor
unit_tests.sources.declarative.parsers.test_model_to_component_factory ‑ test_given_data_feed_and_client_side_incremental_then_raise_error
unit_tests.sources.declarative.retrievers.test_simple_retriever ‑ test_limit_stream_slices
unit_tests.sources.declarative.retrievers.test_simple_retriever ‑ test_simple_retriever_resumable_full_refresh_cursor_page_increment[test_initial_sync_no_state]
unit_tests.sources.declarative.retrievers.test_simple_retriever ‑ test_simple_retriever_resumable_full_refresh_cursor_page_increment[test_reset_with_next_page_token]
unit_tests.sources.declarative.retrievers.test_simple_retriever ‑ test_simple_retriever_resumable_full_refresh_cursor_reset_skip_completed_stream

unit_tests.sources.declarative.retrievers.test_pagination_tracker.TestPaginationTracker ‑ test_cursor_called_with_original_slice_when_reduce_slice_range
unit_tests.sources.file_based.stream.test_default_file_based_stream.DefaultFileBasedStreamTest ‑ test_use_first_found_file_for_schema_discovery
unit_tests.sources.file_based.test_file_based_scenarios ‑ test_file_based_check[recent_n_files_to_read_for_schema_discovery_with_use_first_found_file_for_schema_discovery_fails_connection_check_multi_stream_scenario]
unit_tests.sources.file_based.test_file_based_scenarios ‑ test_file_based_check[schemaless_with_use_first_found_file_for_schema_discovery_fails_connection_check_multi_stream_scenario]
unit_tests.sources.file_based.test_file_based_scenarios ‑ test_file_based_discover[recent_n_files_to_read_for_schema_discovery_with_use_first_found_file_for_schema_discovery_fails_connection_check_multi_stream_scenario]
unit_tests.sources.file_based.test_file_based_scenarios ‑ test_file_based_discover[schemaless_with_use_first_found_file_for_schema_discovery_fails_connection_check_multi_stream_scenario]
unit_tests.sources.file_based.test_file_based_scenarios ‑ test_file_based_read[recent_n_files_to_read_for_schema_discovery_with_use_first_found_file_for_schema_discovery_fails_connection_check_multi_stream_scenario]
unit_tests.sources.file_based.test_file_based_scenarios ‑ test_file_based_read[schemaless_with_use_first_found_file_for_schema_discovery_fails_connection_check_multi_stream_scenario]

♻️ This comment has been updated with latest results.

maxi297

A couple comments just to make sure we cover everything here

airbyte_cdk/sources/file_based/stream/default_file_based_stream.py

airbyte_cdk/sources/file_based/config/file_based_stream_config.py

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (2)

unit_tests/sources/file_based/scenarios/csv_scenarios.py (2)
488-493: Consider enhancing the description for clarity.

The description "Helps to avoid long discovery step" could be more explicit about the exact behavior. What do you think about making it clearer that schema inference will be limited to exactly the first file found? For example:
-                                    "description": "When enabled, the source will use the first found file for schema discovery. Helps to avoid long discovery step",
+                                    "description": "When enabled, the source will use only the first found file for schema discovery, rather than sampling multiple files. This can significantly speed up the discovery step when working with many files.",
This would help users understand both what happens (single file only) and why it matters (performance optimization), wdyt?

488-493: Consider adding a CSV scenario that exercises this flag.

While the new field is correctly added to the expected spec, I noticed that none of the CSV scenarios in this file actually set use_first_found_file_for_schema_discovery: True in their configuration.

Would it be valuable to add a CSV-specific test scenario that demonstrates this flag in action? For example, a scenario similar to multi_csv_scenario but with this flag enabled, expecting the schema to only include columns from the first file. This would:

Provide clear documentation of the feature's behavior with CSV files

Ensure the flag works correctly with CSV-specific configurations

Complement the unit test in test_default_file_based_stream.py mentioned in the AI summary

Wdyt?

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 377884c and 10a3b4e.

📒 Files selected for processing (1)

unit_tests/sources/file_based/scenarios/csv_scenarios.py (1 hunks)

🧰 Additional context used

🧠 Learnings (1)

📓 Common learnings

Learnt from: pnilan
Repo: airbytehq/airbyte-python-cdk PR: 0
File: :0-0
Timestamp: 2024-12-11T16:34:46.319Z
Learning: In the airbytehq/airbyte-python-cdk repository, ignore all `__init__.py` files when providing a recommended reviewing order.

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (14)

GitHub Check: Check: source-google-drive
GitHub Check: Check: source-shopify
GitHub Check: Check: source-pokeapi
GitHub Check: Check: source-intercom
GitHub Check: Check: source-hardcoded-records
GitHub Check: Check: destination-motherduck
GitHub Check: Pytest (Fast)
GitHub Check: Pytest (All, Python 3.10, Ubuntu)
GitHub Check: Manifest Server Docker Image Build
GitHub Check: SDM Docker Image Build
GitHub Check: Pytest (All, Python 3.12, Ubuntu)
GitHub Check: Pytest (All, Python 3.13, Ubuntu)
GitHub Check: Pytest (All, Python 3.11, Ubuntu)
GitHub Check: Analyze (python)

darynaishchenko · 2025-11-13T13:19:27Z

/autofix

Auto-Fix Job Info

This job attempts to auto-fix any linting or formating issues. If any fixes are made,
those changes will be automatically committed and pushed back to the PR.

Note: This job can only be run by maintainers. On PRs from forks, this command requires
that the PR author has enabled the Allow edits from maintainers option.

PR auto-fix job started... Check job output.

✅ Changes applied successfully.

unit_tests/sources/file_based/scenarios/csv_scenarios.py

airbyte_cdk/sources/file_based/config/file_based_stream_config.py

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (2)

airbyte_cdk/sources/file_based/config/file_based_stream_config.py (2)
77-81: Minor grammar improvement in the description.

The description on line 79 could be slightly improved. Consider:

"Helps to avoid a long discovery step" or

"Helps to avoid long discovery steps"

This was mentioned in a previous review comment, so just wanted to follow up. Wdyt?

Apply this diff:
     use_first_found_file_for_schema_discovery: bool = Field(
         title="Use First Found File For Schema Discover",
-        description="When enabled, the source will use the first found file for schema discovery. Helps to avoid long discovery step",
+        description="When enabled, the source will use the first found file for schema discovery. Helps to avoid a long discovery step.",
         default=False,
     )
111-117: Consider splitting the validation condition for clarity.

The boolean logic on lines 111-114 is correct but somewhat complex. Would you consider splitting it into two separate conditions for readability? For example:
if recent_n_files_to_read_for_schema_discovery and use_first_found_file_for_schema_discovery:
    raise ConfigValidationError(
        FileBasedSourceError.ERROR_VALIDATION_STREAM_DISCOVERY_OPTIONS
    )

if [schemaless, input_schema, use_first_found_file_for_schema_discovery].count(True) > 1:
    raise ConfigValidationError(
        FileBasedSourceError.ERROR_VALIDATION_STREAM_DISCOVERY_OPTIONS
    )
This would make the two distinct validation rules more explicit. But the current implementation is functionally correct, so this is just a readability suggestion. Wdyt?

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 10a3b4e and 6cb4153.

📒 Files selected for processing (5)

airbyte_cdk/sources/file_based/config/file_based_stream_config.py (3 hunks)
airbyte_cdk/sources/file_based/exceptions.py (1 hunks)
airbyte_cdk/sources/file_based/file_based_source.py (1 hunks)
unit_tests/sources/file_based/scenarios/csv_scenarios.py (3 hunks)
unit_tests/sources/file_based/test_file_based_scenarios.py (3 hunks)

🧰 Additional context used

🧠 Learnings (3)

📚 Learning: 2024-11-18T23:40:06.391Z

Learnt from: ChristoGrab
Repo: airbytehq/airbyte-python-cdk PR: 58
File: airbyte_cdk/sources/declarative/yaml_declarative_source.py:0-0
Timestamp: 2024-11-18T23:40:06.391Z
Learning: When modifying the `YamlDeclarativeSource` class in `airbyte_cdk/sources/declarative/yaml_declarative_source.py`, avoid introducing breaking changes like altering method signatures within the scope of unrelated PRs. Such changes should be addressed separately to minimize impact on existing implementations.

Applied to files:

airbyte_cdk/sources/file_based/file_based_source.py
airbyte_cdk/sources/file_based/config/file_based_stream_config.py

📚 Learning: 2024-11-15T01:04:21.272Z

Learnt from: aaronsteers
Repo: airbytehq/airbyte-python-cdk PR: 58
File: airbyte_cdk/cli/source_declarative_manifest/_run.py:62-65
Timestamp: 2024-11-15T01:04:21.272Z
Learning: The files in `airbyte_cdk/cli/source_declarative_manifest/`, including `_run.py`, are imported from another repository, and changes to these files should be minimized or avoided when possible to maintain consistency.

Applied to files:

airbyte_cdk/sources/file_based/config/file_based_stream_config.py

📚 Learning: 2024-12-11T16:34:46.319Z

Learnt from: pnilan
Repo: airbytehq/airbyte-python-cdk PR: 0
File: :0-0
Timestamp: 2024-12-11T16:34:46.319Z
Learning: In the airbytehq/airbyte-python-cdk repository, the `declarative_component_schema.py` file is auto-generated from `declarative_component_schema.yaml` and should be ignored in the recommended reviewing order.

Applied to files:

airbyte_cdk/sources/file_based/config/file_based_stream_config.py

🧬 Code graph analysis (3)

airbyte_cdk/sources/file_based/file_based_source.py (3)

airbyte_cdk/sources/file_based/exceptions.py (2)

ConfigValidationError (75-76)

FileBasedSourceError (12-39)

airbyte_cdk/utils/traced_exception.py (1)

AirbyteTracedException (25-145)

airbyte_cdk/sources/declarative/models/declarative_component_schema.py (1)

FailureType (546-549)

airbyte_cdk/sources/file_based/config/file_based_stream_config.py (1)

airbyte_cdk/sources/file_based/exceptions.py (2)

ConfigValidationError (75-76)

FileBasedSourceError (12-39)

unit_tests/sources/file_based/scenarios/csv_scenarios.py (3)

unit_tests/sources/file_based/scenarios/scenario_builder.py (7)

set_expected_check_error (188-192)

set_expected_discover_error (194-198)

set_expected_read_error (200-204)

build (35-41)

build (225-258)

TestScenario (44-115)

TestScenarioBuilder (118-273)

airbyte_cdk/sources/file_based/exceptions.py (2)

FileBasedSourceError (12-39)

ConfigValidationError (75-76)

unit_tests/sources/file_based/in_memory_files_source.py (1)

InMemoryFilesSource (50-99)

🪛 GitHub Actions: Linters

airbyte_cdk/sources/file_based/config/file_based_stream_config.py

[error] 93-93: mypy: Function is missing a type annotation [no-untyped-def]

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (13)

GitHub Check: Check: source-google-drive
GitHub Check: Check: source-pokeapi
GitHub Check: Check: source-intercom
GitHub Check: Check: source-hardcoded-records
GitHub Check: Check: destination-motherduck
GitHub Check: Check: source-shopify
GitHub Check: Pytest (Fast)
GitHub Check: Pytest (All, Python 3.10, Ubuntu)
GitHub Check: Pytest (All, Python 3.12, Ubuntu)
GitHub Check: Pytest (All, Python 3.13, Ubuntu)
GitHub Check: Pytest (All, Python 3.11, Ubuntu)
GitHub Check: Manifest Server Docker Image Build
GitHub Check: SDM Docker Image Build

🔇 Additional comments (8)

airbyte_cdk/sources/file_based/exceptions.py (1)

26-26: LGTM!

The error message clearly describes the mutual exclusivity constraint for schema discovery options. It aligns well with the validation logic being added in the root validator.

unit_tests/sources/file_based/test_file_based_scenarios.py (2)

84-87: LGTM!

The new scenario imports follow the existing naming conventions and clearly indicate they test validation failures for conflicting discovery options.

212-213: LGTM!

The new scenarios are correctly wired into the appropriate test lists and will provide coverage for the validation error paths when conflicting discovery options are used.

Also applies to: 319-320

airbyte_cdk/sources/file_based/file_based_source.py (1)

159-172: LGTM!

The refined exception handling correctly distinguishes configuration validation errors from other errors, preserving the specific error message from the root validator. The generic exception handler provides a safety net while maintaining consistent error reporting.

airbyte_cdk/sources/file_based/config/file_based_stream_config.py (1)

8-8: LGTM!

The root_validator import is correctly added from pydantic.v1, consistent with the existing imports.

unit_tests/sources/file_based/scenarios/csv_scenarios.py (3)

488-493: LGTM!

The expected spec correctly includes the new use_first_found_file_for_schema_discovery field with appropriate type, default, title, and description. The placement after recent_n_files_to_read_for_schema_discovery is logical.

2123-2131: LGTM!

The updated error expectations correctly reflect that the schemaless + input_schema conflict is now caught by the root validator and returns ERROR_VALIDATION_STREAM_DISCOVERY_OPTIONS. Both single-stream and multi-stream scenarios are consistently updated.

Also applies to: 2228-2236

2239-2452: LGTM!

The new test scenarios comprehensively cover the validation error paths:

Conflict between recent_n_files_to_read_for_schema_discovery and use_first_found_file_for_schema_discovery

Conflict between schemaless and use_first_found_file_for_schema_discovery

Both scenarios appropriately test multi-stream configurations and expect the correct ERROR_VALIDATION_STREAM_DISCOVERY_OPTIONS error across check, discover, and read operations.

airbyte_cdk/sources/file_based/config/file_based_stream_config.py

coderabbitai

Actionable comments posted: 1

♻️ Duplicate comments (2)

unit_tests/sources/file_based/scenarios/csv_scenarios.py (2)

2239-2344: Verify this scenario is wired into the test runner.

Past review comments flag this scenario variable as unused. Please ensure it's properly exported and included in the test execution, either in unit_tests/sources/file_based/test_file_based_scenarios.py or another appropriate test file, wdyt?

2347-2452: Verify this scenario is wired into the test runner.

Similar to the scenario at line 2239, past review comments indicate this scenario variable is unused. Please ensure it's properly exported and included in test execution, wdyt?

🧹 Nitpick comments (3)

unit_tests/sources/file_based/scenarios/csv_scenarios.py (1)

488-493: Consider polishing the description for clarity, wdyt?

The description "Helps to avoid long discovery step" is missing an article. Would "Helps to avoid a long discovery step" or "Helps to avoid long discovery steps" read more naturally?
airbyte_cdk/sources/file_based/config/file_based_stream_config.py (2)
77-81: Minor typo and grammar polish needed, wdyt?

Two small issues:

The title has "Use First Found File For Schema Discover" - should be "Schema Discovery" (missing the 'y')

The description "Helps to avoid long discovery step" is missing an article - perhaps "Helps to avoid a long discovery step" would be clearer?

Apply this diff:
     use_first_found_file_for_schema_discovery: bool = Field(
-        title="Use First Found File For Schema Discover",
+        title="Use First Found File For Schema Discovery",
-        description="When enabled, the source will use the first found file for schema discovery. Helps to avoid long discovery step",
+        description="When enabled, the source will use the first found file for schema discovery. Helps to avoid a long discovery step.",
         default=False,
     )
96-96: Typo in the comment, wdyt?

Line 96 has "compatability" which should be "compatibility".

Apply this diff:
-        Validates schema discovery options compatability.
+        Validates schema discovery options compatibility.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6cb4153 and 01af8ff.

📒 Files selected for processing (2)

airbyte_cdk/sources/file_based/config/file_based_stream_config.py (3 hunks)
unit_tests/sources/file_based/scenarios/csv_scenarios.py (3 hunks)

🧰 Additional context used

🧠 Learnings (3)

📚 Learning: 2024-11-15T01:04:21.272Z

Learnt from: aaronsteers
Repo: airbytehq/airbyte-python-cdk PR: 58
File: airbyte_cdk/cli/source_declarative_manifest/_run.py:62-65
Timestamp: 2024-11-15T01:04:21.272Z
Learning: The files in `airbyte_cdk/cli/source_declarative_manifest/`, including `_run.py`, are imported from another repository, and changes to these files should be minimized or avoided when possible to maintain consistency.

Applied to files:

airbyte_cdk/sources/file_based/config/file_based_stream_config.py

📚 Learning: 2024-11-18T23:40:06.391Z

Learnt from: ChristoGrab
Repo: airbytehq/airbyte-python-cdk PR: 58
File: airbyte_cdk/sources/declarative/yaml_declarative_source.py:0-0
Timestamp: 2024-11-18T23:40:06.391Z
Learning: When modifying the `YamlDeclarativeSource` class in `airbyte_cdk/sources/declarative/yaml_declarative_source.py`, avoid introducing breaking changes like altering method signatures within the scope of unrelated PRs. Such changes should be addressed separately to minimize impact on existing implementations.

Applied to files:

airbyte_cdk/sources/file_based/config/file_based_stream_config.py

📚 Learning: 2024-12-11T16:34:46.319Z

Learnt from: pnilan
Repo: airbytehq/airbyte-python-cdk PR: 0
File: :0-0
Timestamp: 2024-12-11T16:34:46.319Z
Learning: In the airbytehq/airbyte-python-cdk repository, the `declarative_component_schema.py` file is auto-generated from `declarative_component_schema.yaml` and should be ignored in the recommended reviewing order.

Applied to files:

airbyte_cdk/sources/file_based/config/file_based_stream_config.py

🧬 Code graph analysis (2)

unit_tests/sources/file_based/scenarios/csv_scenarios.py (3)

unit_tests/sources/file_based/scenarios/scenario_builder.py (7)

set_expected_check_error (188-192)

set_expected_discover_error (194-198)

set_expected_read_error (200-204)

build (35-41)

build (225-258)

TestScenario (44-115)

TestScenarioBuilder (118-273)

airbyte_cdk/sources/file_based/exceptions.py (2)

FileBasedSourceError (12-39)

ConfigValidationError (75-76)

unit_tests/sources/file_based/in_memory_files_source.py (1)

InMemoryFilesSource (50-99)

airbyte_cdk/sources/file_based/config/file_based_stream_config.py (1)

airbyte_cdk/sources/file_based/exceptions.py (2)

ConfigValidationError (75-76)

FileBasedSourceError (12-39)

🪛 GitHub Actions: Linters

airbyte_cdk/sources/file_based/config/file_based_stream_config.py

[error] 93-93: mypy: Missing type parameters for generic type "dict" [type-arg]

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (14)

GitHub Check: Check: destination-motherduck
GitHub Check: Check: source-intercom
GitHub Check: Check: source-hardcoded-records
GitHub Check: Check: source-google-drive
GitHub Check: Check: source-shopify
GitHub Check: Check: source-pokeapi
GitHub Check: Pytest (Fast)
GitHub Check: Manifest Server Docker Image Build
GitHub Check: SDM Docker Image Build
GitHub Check: Pytest (All, Python 3.12, Ubuntu)
GitHub Check: Pytest (All, Python 3.13, Ubuntu)
GitHub Check: Pytest (All, Python 3.11, Ubuntu)
GitHub Check: Pytest (All, Python 3.10, Ubuntu)
GitHub Check: Analyze (python)

🔇 Additional comments (3)

unit_tests/sources/file_based/scenarios/csv_scenarios.py (1)

2123-2131: LGTM!

The error type change from CONFIG_VALIDATION_ERROR to ERROR_VALIDATION_STREAM_DISCOVERY_OPTIONS correctly reflects the more specific validation for mutually exclusive discovery options.

airbyte_cdk/sources/file_based/config/file_based_stream_config.py (2)

8-8: LGTM!

The root_validator import is correctly added and necessary for the new cross-field validation logic.

102-119: LGTM on the validation logic!

The mutual exclusivity validation correctly enforces that the new use_first_found_file_for_schema_discovery cannot be combined with other discovery options, while maintaining backward compatibility for existing recent_n_files_to_read_for_schema_discovery combinations. The comment clearly explains the rationale for this approach.

airbyte_cdk/sources/file_based/config/file_based_stream_config.py

darynaishchenko · 2025-11-13T13:49:01Z

/autofix

Auto-Fix Job Info

This job attempts to auto-fix any linting or formating issues. If any fixes are made,
those changes will be automatically committed and pushed back to the PR.

Note: This job can only be run by maintainers. On PRs from forks, this command requires
that the PR author has enabled the Allow edits from maintainers option.

PR auto-fix job started... Check job output.

✅ Changes applied successfully.

airbyte_cdk/sources/file_based/config/file_based_stream_config.py

coderabbitai

Actionable comments posted: 2

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f723764 and 0d5e443.

📒 Files selected for processing (1)

airbyte_cdk/sources/file_based/config/file_based_stream_config.py (3 hunks)

🧰 Additional context used

🧠 Learnings (3)

📚 Learning: 2024-11-15T01:04:21.272Z

Learnt from: aaronsteers
Repo: airbytehq/airbyte-python-cdk PR: 58
File: airbyte_cdk/cli/source_declarative_manifest/_run.py:62-65
Timestamp: 2024-11-15T01:04:21.272Z
Learning: The files in `airbyte_cdk/cli/source_declarative_manifest/`, including `_run.py`, are imported from another repository, and changes to these files should be minimized or avoided when possible to maintain consistency.

Applied to files:

airbyte_cdk/sources/file_based/config/file_based_stream_config.py

📚 Learning: 2024-11-18T23:40:06.391Z

Learnt from: ChristoGrab
Repo: airbytehq/airbyte-python-cdk PR: 58
File: airbyte_cdk/sources/declarative/yaml_declarative_source.py:0-0
Timestamp: 2024-11-18T23:40:06.391Z
Learning: When modifying the `YamlDeclarativeSource` class in `airbyte_cdk/sources/declarative/yaml_declarative_source.py`, avoid introducing breaking changes like altering method signatures within the scope of unrelated PRs. Such changes should be addressed separately to minimize impact on existing implementations.

Applied to files:

airbyte_cdk/sources/file_based/config/file_based_stream_config.py

📚 Learning: 2024-12-11T16:34:46.319Z

Learnt from: pnilan
Repo: airbytehq/airbyte-python-cdk PR: 0
File: :0-0
Timestamp: 2024-12-11T16:34:46.319Z
Learning: In the airbytehq/airbyte-python-cdk repository, the `declarative_component_schema.py` file is auto-generated from `declarative_component_schema.yaml` and should be ignored in the recommended reviewing order.

Applied to files:

airbyte_cdk/sources/file_based/config/file_based_stream_config.py

🧬 Code graph analysis (1)

airbyte_cdk/sources/file_based/config/file_based_stream_config.py (1)

airbyte_cdk/sources/file_based/exceptions.py (2)

ConfigValidationError (75-76)

FileBasedSourceError (12-39)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (14)

GitHub Check: Check: source-intercom
GitHub Check: Check: source-shopify
GitHub Check: Check: destination-motherduck
GitHub Check: Check: source-google-drive
GitHub Check: Check: source-hardcoded-records
GitHub Check: Check: source-pokeapi
GitHub Check: Pytest (All, Python 3.10, Ubuntu)
GitHub Check: Pytest (All, Python 3.12, Ubuntu)
GitHub Check: Pytest (All, Python 3.11, Ubuntu)
GitHub Check: Pytest (All, Python 3.13, Ubuntu)
GitHub Check: Pytest (Fast)
GitHub Check: SDM Docker Image Build
GitHub Check: Manifest Server Docker Image Build
GitHub Check: Analyze (python)

🔇 Additional comments (1)

airbyte_cdk/sources/file_based/config/file_based_stream_config.py (1)

6-6: LGTM!

The new imports (Dict from typing and root_validator from pydantic) are necessary for the validation logic below and are correctly added.

Also applies to: 8-8

airbyte_cdk/sources/file_based/config/file_based_stream_config.py

airbyte_cdk/sources/file_based/file_based_source.py

maxi297

LGTM!

use only first found file for discover

8615bb9

darynaishchenko self-assigned this Nov 12, 2025

github-actions bot added the enhancement New feature or request label Nov 12, 2025

coderabbitai bot requested changes Nov 12, 2025

View reviewed changes

airbyte_cdk/sources/file_based/config/file_based_stream_config.py Outdated Show resolved Hide resolved

airbyte_cdk/sources/file_based/stream/default_file_based_stream.py Show resolved Hide resolved

unit_tests/sources/file_based/stream/test_default_file_based_stream.py Show resolved Hide resolved

refactor code

377884c

coderabbitai bot approved these changes Nov 12, 2025

View reviewed changes

darynaishchenko requested a review from maxi297 November 12, 2025 14:36

update expected spec

10a3b4e

maxi297 reviewed Nov 12, 2025

View reviewed changes

airbyte_cdk/sources/file_based/stream/default_file_based_stream.py Show resolved Hide resolved

airbyte_cdk/sources/file_based/config/file_based_stream_config.py Outdated Show resolved Hide resolved

coderabbitai bot reviewed Nov 12, 2025

View reviewed changes

darynaishchenko added 2 commits November 13, 2025 15:16

add schema discovery options validation

6b28a5c

add unit tests

2aaeb49

Auto-fix lint and format issues

6cb4153

github-code-quality bot found potential problems Nov 13, 2025

View reviewed changes

unit_tests/sources/file_based/scenarios/csv_scenarios.py Dismissed Show dismissed Hide dismissed

unit_tests/sources/file_based/scenarios/csv_scenarios.py Dismissed Show dismissed Hide dismissed

airbyte_cdk/sources/file_based/config/file_based_stream_config.py Fixed Show fixed Hide fixed

coderabbitai bot requested changes Nov 13, 2025

View reviewed changes

fix unit tests

01af8ff

github-code-quality bot found potential problems Nov 13, 2025

View reviewed changes

airbyte_cdk/sources/file_based/config/file_based_stream_config.py Fixed Show fixed Hide fixed

fix typing

f723764

coderabbitai bot requested changes Nov 13, 2025

View reviewed changes

airbyte_cdk/sources/file_based/config/file_based_stream_config.py Outdated Show resolved Hide resolved

coderabbitai bot approved these changes Nov 13, 2025

View reviewed changes

github-code-quality bot found potential problems Nov 13, 2025

View reviewed changes

airbyte_cdk/sources/file_based/config/file_based_stream_config.py Dismissed Show dismissed Hide dismissed

Auto-fix lint and format issues

0d5e443

coderabbitai bot requested changes Nov 13, 2025

View reviewed changes

airbyte_cdk/sources/file_based/config/file_based_stream_config.py Show resolved Hide resolved

airbyte_cdk/sources/file_based/config/file_based_stream_config.py Show resolved Hide resolved

darynaishchenko added 3 commits November 13, 2025 15:58

update description

ad750a1

fix unit test

d312c9c

fix typo

31b6be0

coderabbitai bot approved these changes Nov 13, 2025

View reviewed changes

darynaishchenko commented Nov 13, 2025

View reviewed changes

airbyte_cdk/sources/file_based/file_based_source.py Show resolved Hide resolved

darynaishchenko requested a review from maxi297 November 13, 2025 14:21

maxi297 approved these changes Nov 13, 2025

View reviewed changes

darynaishchenko changed the title ~~feat(file-based): use only first found file for discover~~ feat(file-based): option to use only first found file for discover Nov 13, 2025

darynaishchenko merged commit 5d9125f into main Nov 13, 2025
27 of 30 checks passed

darynaishchenko deleted the daryna/file-based/update-discover branch November 13, 2025 15:10

feat(file-based): option to use only first found file for discover #841

feat(file-based): option to use only first found file for discover #841

Uh oh!

Conversation

darynaishchenko commented Nov 12, 2025 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

github-actions bot commented Nov 12, 2025

👋 Greetings, Airbyte Team Member!

Testing This CDK Version

Helpful Resources

PR Slash Commands

Uh oh!

coderabbitai bot commented Nov 12, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Pre-merge checks and finishing touches

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

github-actions bot commented Nov 12, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PyTest Results (Fast)

Uh oh!

github-actions bot commented Nov 12, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PyTest Results (Full)

Uh oh!

maxi297 left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

darynaishchenko commented Nov 13, 2025 • edited by github-actions bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

darynaishchenko commented Nov 13, 2025 • edited by github-actions bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

maxi297 left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

darynaishchenko commented Nov 12, 2025 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Nov 12, 2025 •

edited

Loading

github-actions bot commented Nov 12, 2025 •

edited

Loading

github-actions bot commented Nov 12, 2025 •

edited

Loading

darynaishchenko commented Nov 13, 2025 •

edited by github-actions bot

Loading

darynaishchenko commented Nov 13, 2025 •

edited by github-actions bot

Loading