feat: skip uncomparable files, ignore unreliable columns, implement column-level stats, improve performance for large files and balance the row-change cap #15
…nd primary-key handling
Pull request overview
This PR significantly expands gtfs_diff by adding remote HTTP(S) folder support, large-file routing to a DuckDB backend, and new “not_compared” behaviors (id churn detection and missing-PK handling) while also introducing richer per-file/per-column change statistics.
Changes:
- Add DuckDB-backed diffing for large eligible files (including URL-based reading via `httpfs`) with parity tests against the in-memory engine.
- Add HTTP(S) “folder URL” feed support with probing/existence checks and optional file filtering.
- Add id-churn detection + foreign-key-driven ignored columns, plus optional/conditional primary-key column handling and enhanced change stats.
Reviewed changes
Copilot reviewed 12 out of 13 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| tests/test_engine.py | Adds extensive unit/integration coverage for URL feeds, DuckDB parity/routing, id-churn + FK-ignored columns, optional PK behavior, and change stats. |
| tests/test_cli.py | Updates CLI expectations for missing-PK behavior and adds CLI option coverage for DuckDB toggles/thresholds. |
| src/gtfs_diff/tracing.py | Extracts shared progress tracing into its own module. |
| src/gtfs_diff/gtfs_definitions.py | Expands/adjusts PK definitions, introduces optional PK columns, FK graph, and id-churn threshold plumbing. |
| src/gtfs_diff/engine.py | Adds remote feed openers/probing, not_compared handling, FK-aware processing order, id-churn gating, optional DuckDB routing, and stats plumbing. |
| src/gtfs_diff/engine_duckdb.py | Introduces DuckDB backend for “modified” diffs with capped row collection, id-churn gate, ignored FK columns, and spill cleanup. |
| src/gtfs_diff/diff_helpers.py | Centralizes shared pure diff logic (ordering, id-churn detection, ignored columns, cap splitting, stats assembly). |
| src/gtfs_diff/csv_utils.py | Extracts shared CSV parsing/indexing/value-diff utilities and optional-PK column behavior. |
| src/gtfs_diff/cli.py | Adds CLI options for file filtering, id-churn thresholds, DuckDB routing, and column-stats toggling; supports URL inputs. |
| requirements.txt | Adds DuckDB runtime dependency pin for the requirements-based environment. |
| README.md | Updates documentation for URL feeds, DuckDB backend, id-churn/not_compared semantics, and stats fields/options. |
| pyproject.toml | Declares DuckDB as a dependency (and dev dependency). |
| docs/architecture.md | Documents remote feeds, optional PKs, DuckDB routing/parity strategy, id-churn detection, and FK propagation/ignored columns. |
    sql = (
        f"SELECT {', '.join(select_parts)} "
        f"FROM base_t b JOIN new_t n ON {pk_join} "
        f"WHERE {distinct_pred}"
    )

    Each candidate name is probed for presence (so missing files are correctly
    treated as added/deleted) and fetched lazily only when its opener is called.
    """
    candidates = list(files) if files else sorted(SUPPORTED_FILES)

    @@ -0,0 +1,474 @@
    """DuckDB-backed diff for very large GTFS files.
Did you benchmark duckdb with small files? It would simplify the code.
Is there any major platform where duckdb is not available?
The idea was just not to break everything and have to rewrite the entire test suite. The threshold for using DuckDB is customizable, but you're right that it would simplify the code (one less engine).
I can open an issue to address this post-MVP, depending on the performance of the tool?
Not sure it's worth it now. Maybe if we have major work to do on the in-memory engine we can reconsider.
jcpitre left a comment
Approved, but frankly there are so many changes that I can't say I dug very deep into it. We have to trust the tests.
Summary
This PR makes the diff engine degrade gracefully instead of failing or producing misleading output when a feed can't be cleanly compared. Files that can't be matched are reported as `not_compared` (rather than aborting the run), columns that reference those files are excluded, and the per-file row-change cap is now split fairly across added / modified / deleted so every report shows a little of everything.

What's new
- … `not_compared` instead of producing a misleading diff.
- … `not_compared` and processing continues.
- … `not_compared` file are excluded from the diff, since their values can't be trusted.
- … `column_stats`).

Performance: DuckDB backend for large files
Small files stay on the fast in-memory path, but large files are now routed to a DuckDB backend so big feeds no longer blow up memory or stall:
- … ~50 MB uncompressed (e.g. `stop_times.txt`, which can exceed 10 M rows) are diffed on disk by DuckDB instead of loading every row into memory. Memory stays roughly flat regardless of feed size.
- … `http(s)`, DuckDB reads it directly from the URL rather than fully materializing it first.
- … `diff_feeds(large_file_threshold_bytes=...)`.

Testing
Tested against the two latest STM datasets:
Example report with dummy data
Files ignored due to id churn
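For readers without the screenshot, the id-churn gate can be pictured with a rough sketch like the one below. It is not the engine's code: the function name `looks_like_id_churn`, the default `threshold`, and the exact churn formula are assumptions made for illustration only.

```python
def looks_like_id_churn(base_ids, new_ids, threshold=0.9):
    """Return True when almost no primary keys survive between two versions.

    Hypothetical helper: the name, the default threshold and the formula
    are illustrative assumptions, not the gtfs_diff implementation.
    """
    base_ids, new_ids = set(base_ids), set(new_ids)
    if not base_ids or not new_ids:
        # An empty side is a plain add/delete case, not churn.
        return False
    shared = base_ids & new_ids
    # Share of the smaller file's ids that have no match on the other side.
    churn = 1 - len(shared) / min(len(base_ids), len(new_ids))
    return churn >= threshold


# Regenerated ids: nothing matches, so the file would be flagged.
print(looks_like_id_churn(["t1", "t2", "t3"], ["x1", "x2", "x3"]))
# Pure additions: every base id still exists, so the file is compared normally.
print(looks_like_id_churn(["t1", "t2", "t3"], ["t1", "t2", "t3", "t4"]))
```

Dividing by the smaller side keeps a file that only gained rows from being mistaken for a churned one.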
Files ignored due to missing PK
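As a hedged illustration of this case, the check below reads only the CSV header and lists the primary-key columns that are absent. The `PRIMARY_KEYS` excerpt and the helper name `missing_pk_columns` are assumptions for this example; the real definitions live in the engine and may differ.

```python
import csv
import io

# Illustrative excerpt only; the engine keeps its own primary-key definitions.
PRIMARY_KEYS = {
    "stops.txt": ("stop_id",),
    "stop_times.txt": ("trip_id", "stop_sequence"),
}


def missing_pk_columns(file_name, csv_text):
    """List the primary-key columns of file_name that the header lacks."""
    header = next(csv.reader(io.StringIO(csv_text)), [])
    header = [name.strip() for name in header]
    return [col for col in PRIMARY_KEYS[file_name] if col not in header]


# stop_sequence is absent, so rows cannot be matched reliably.
print(missing_pk_columns("stop_times.txt", "trip_id,arrival_time\nT1,08:00:00\n"))
```

A non-empty result is the situation in which a file would be reported as `not_compared` rather than aborting the run.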
Columns ignored due to ignored files
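The propagation can be sketched as a lookup in a foreign-key map. The `FOREIGN_KEYS` excerpt and the `ignored_columns` helper below are hypothetical names for illustration, and the sketch assumes a simple one-hop rule, which may be simpler than what the engine does.

```python
# Illustrative excerpt of a foreign-key graph: (file, column) -> referenced file.
FOREIGN_KEYS = {
    ("stop_times.txt", "trip_id"): "trips.txt",
    ("stop_times.txt", "stop_id"): "stops.txt",
    ("trips.txt", "route_id"): "routes.txt",
    ("trips.txt", "shape_id"): "shapes.txt",
}


def ignored_columns(file_name, not_compared_files):
    """Columns of file_name that point at a file reported as not_compared."""
    return sorted(
        column
        for (source_file, column), target in FOREIGN_KEYS.items()
        if source_file == file_name and target in not_compared_files
    )


# If trips.txt could not be compared, stop_times.trip_id is not trustworthy.
print(ignored_columns("stop_times.txt", {"trips.txt"}))
```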
Split cap so the diff contains a little of all (added / modified / deleted), e.g. with a cap of 9
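One way to picture the split is the sketch below, which gives each change kind an equal share and hands unused budget to the kinds that still have rows. The function name `split_cap` and the redistribution rule are assumptions for illustration; the engine's actual policy may differ.

```python
def split_cap(cap, added, modified, deleted):
    """Split a row-change cap across added / modified / deleted rows.

    Hypothetical sketch: equal shares first, then leftover budget goes to
    the kinds that still have uncovered rows.
    """
    available = {"added": added, "modified": modified, "deleted": deleted}
    shares = {kind: 0 for kind in available}
    remaining = cap
    while remaining > 0:
        # Kinds whose rows are not yet fully covered by their share.
        open_kinds = [k for k in available if shares[k] < available[k]]
        if not open_kinds:
            break
        per_kind = max(remaining // len(open_kinds), 1)
        for kind in open_kinds:
            if remaining == 0:
                break
            grant = min(per_kind, available[kind] - shares[kind], remaining)
            shares[kind] += grant
            remaining -= grant
    return shares


# With a cap of 9 and plenty of rows of each kind, every kind gets 3.
print(split_cap(9, 100, 100, 100))
# A kind with few rows gives its unused budget to the others.
print(split_cap(9, 1, 100, 100))
```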
Added column stats
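As a rough illustration of what per-column statistics capture, the sketch below counts how many modified rows changed each column. The helper name and the input shape (pairs of base/new row dicts) are assumptions; the report's real `column_stats` structure may carry more fields.

```python
from collections import Counter


def count_column_changes(modified_rows):
    """Count, per column, how many modified rows changed that column.

    modified_rows: iterable of (base_row, new_row) dict pairs that share a
    primary key. Hypothetical helper, for illustration only.
    """
    counts = Counter()
    for base_row, new_row in modified_rows:
        for column in base_row.keys() | new_row.keys():
            if base_row.get(column) != new_row.get(column):
                counts[column] += 1
    return dict(counts)


pairs = [
    ({"stop_id": "A", "stop_name": "Main", "stop_lat": "1.0"},
     {"stop_id": "A", "stop_name": "Main St", "stop_lat": "1.0"}),
    ({"stop_id": "B", "stop_name": "Oak", "stop_lat": "2.0"},
     {"stop_id": "B", "stop_name": "Oak Ave", "stop_lat": "2.5"}),
]
print(count_column_changes(pairs))
```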
AI description:
This pull request significantly expands and updates the documentation for the GTFS Diff Engine, especially in the `README.md` and architecture documentation. The main focus is on clarifying new features, usage patterns, and options introduced in recent releases, such as support for public HTTP(S) folder URLs as feed sources, the built-in DuckDB backend for large files, configurable id-churn detection, per-column modification statistics, and more flexible file selection. The documentation now provides more detailed parameter explanations, usage examples, and output schema descriptions, making it easier for users to understand and leverage the full capabilities of the tool.

Major documentation improvements:
Support for new feed sources and file selection:
- … `files` parameter in both the CLI and Python API, including auto-discovery of GTFS files for URLs and selective comparison. [1] [2]
Large file handling and DuckDB backend:
Id-churn detection and configuration:
- … `not_compared` files), including new CLI/API options for global and per-file churn thresholds, and how files with regenerated primary keys or missing key columns are handled. [1] [2] [3]
Per-column and file-level statistics:
- … `stats.rows_changed_percentage` and `stats.column_stats`, with details on their calculation, interpretation, and configuration. [1] [2]
Primary key and foreign key handling:
Other enhancements:
These changes make the documentation much more comprehensive and user-friendly, reflecting the latest features and best practices for using the GTFS Diff Engine.