feat: skip uncomparable files, ignore unreliable columns, implement column level stats, improve performances for large files and balance the row-change cap#15

Merged: cka-y merged 15 commits into main from feat/10 on Jun 11, 2026

@cka-y cka-y commented Jun 10, 2026

Contributor

Summary

This PR makes the diff engine degrade gracefully instead of failing or producing misleading output when a feed can't be cleanly compared. Files that can't be matched are reported as not_compared (rather than aborting the run), columns that reference those files are excluded, and the per-file row-change cap is now split fairly across added / modified / deleted so every report shows a little of everything.

What's new

  • Skip files with churned identifiers (id churn). When a file's primary key is regenerated across versions, row-level matching is unreliable, so the file is reported as not_compared instead of producing a misleading diff.
  • Skip files missing required primary keys. A missing required PK no longer stops the whole run — the file is reported as not_compared and processing continues.
  • Ignore columns that reference skipped files. Foreign-key columns pointing at a not_compared file are excluded from the diff, since their values can't be trusted.
  • Fair cap split. The row-change cap is divided across change types so the diff contains a little of all three (added / modified / deleted) — e.g. with a cap of 9, each type gets up to 3.
  • Per-column modification stats. Modified files now report per-column modification counts and percentages (column_stats).
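
To make the fair cap split concrete, here is a minimal sketch of one way such a split could work. The function name `split_cap` and the redistribution of unused budget to the other change types are assumptions for illustration; the PR only states that with a cap of 9 each type gets up to 3.

```python
def split_cap(cap, counts):
    """Split a row-change cap across added / modified / deleted.

    counts maps each change type to the number of changed rows available.
    Assumption: budget that one type cannot use is re-shared among the
    types that still have rows left.
    """
    kinds = ["added", "modified", "deleted"]
    alloc = {k: 0 for k in kinds}
    remaining = cap
    active = [k for k in kinds if counts.get(k, 0) > 0]
    while remaining > 0 and active:
        # Equal share per active type; the first `extra` types get one more.
        share, extra = divmod(remaining, len(active))
        for i, kind in enumerate(active):
            want = share + (1 if i < extra else 0)
            take = min(want, counts[kind] - alloc[kind])
            alloc[kind] += take
            remaining -= take
        # Drop types that have no rows left to report.
        active = [k for k in active if alloc[k] < counts[k]]
    return alloc


print(split_cap(9, {"added": 100, "modified": 100, "deleted": 100}))
# {'added': 3, 'modified': 3, 'deleted': 3}
```

With only one added row available, the same call would hand the spare budget to the other two types (1 / 4 / 4) under the redistribution assumption above.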

Performance: DuckDB backend for large files

Small files stay on the fast in-memory path, but large files are now routed to a DuckDB backend so big feeds no longer blow up memory or stall:

  • On-disk diffing, bounded memory. Files whose larger side exceeds ~50 MB uncompressed (e.g. stop_times.txt, which can exceed 10 M rows) are diffed on disk by DuckDB instead of loading every row into memory. Memory stays roughly flat regardless of feed size.
  • Reads remote files in place. When a file is served over http(s), DuckDB reads it directly from the URL rather than fully materializing it first.
  • Automatic and transparent. Routing is based on file size and primary-key shape — no flags required. The threshold is tunable via diff_feeds(large_file_threshold_bytes=...).
  • Safe fallback. If DuckDB is unavailable, a file isn't eligible, or the backend errors, the engine transparently falls back to the in-memory path, so results are identical either way.
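
The routing and fallback described above can be sketched roughly as follows. This is not the engine's code: `pick_backend` and `diff_with_fallback` are hypothetical names, the 50 MB default is taken from the description, and the primary-key-shape eligibility check mentioned above is left out for brevity.

```python
import importlib.util

# Assumed default, mirroring the "~50 MB uncompressed" figure above.
LARGE_FILE_THRESHOLD_BYTES = 50 * 1024 * 1024


def pick_backend(base_size, new_size,
                 threshold=LARGE_FILE_THRESHOLD_BYTES,
                 duckdb_available=None):
    """Return "duckdb" when the larger side exceeds the threshold."""
    if duckdb_available is None:
        # Detect the optional dependency without importing it.
        duckdb_available = importlib.util.find_spec("duckdb") is not None
    if duckdb_available and max(base_size, new_size) > threshold:
        return "duckdb"
    return "memory"


def diff_with_fallback(base_size, new_size, diff_in_memory, diff_on_disk,
                       **routing):
    """Try the on-disk path for large files, else use the in-memory path."""
    if pick_backend(base_size, new_size, **routing) == "duckdb":
        try:
            return diff_on_disk()
        except Exception:
            pass  # Backend error: fall through to the in-memory path.
    return diff_in_memory()
```

Passing the two diff paths in as callables keeps the sketch testable without DuckDB installed.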

Testing

Tested against the two latest STM datasets:

gtfs-diff https://files.mobilitydatabase.org/mdb-2126/mdb-2126-202605300005/extracted \
  https://files.mobilitydatabase.org/mdb-2126/mdb-2126-202606040044/extracted \
  --output results-STM.json \
  --cap 10000

Example report with dummy data

Files ignored due to id churn

image

Files ignored due to missing PK

Screenshot 2026-06-10 at 11 03 35 AM

Columns ignored due to ignored files

Screenshot 2026-06-10 at 11 04 01 AM

Split cap so the diff contains a little of all (added / modified / deleted), e.g. with a cap of 9

Screenshot 2026-06-10 at 11 04 35 AM

Added column stats

Screenshot 2026-06-10 at 11 08 31 AM

AI description:

This pull request significantly expands and updates the documentation for the GTFS Diff Engine, especially in the README.md and architecture documentation. The main focus is on clarifying new features, usage patterns, and options introduced in recent releases, such as support for public HTTP(S) folder URLs as feed sources, the built-in DuckDB backend for large files, configurable id-churn detection, per-column modification statistics, and more flexible file selection. The documentation now provides more detailed parameter explanations, usage examples, and output schema descriptions, making it easier for users to understand and leverage the full capabilities of the tool.

Major documentation improvements:

Support for new feed sources and file selection:

  • Updated all references to feed sources to include support for public HTTP(S) folder URLs, in addition to zip archives and directories. Clarified how file selection works for both local and remote feeds, including behavior with non-listable folders and missing files. [1] [2]
  • Added detailed explanations and examples for using the files parameter in both the CLI and Python API, including auto-discovery of GTFS files for URLs and selective comparison. [1] [2]
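
As an illustration of the probing behaviour for folder URLs, the sketch below builds one lazy opener per candidate file that actually exists under the base URL. The names `build_openers` and `head_exists` and the HEAD-request probe are assumptions, not the engine's API; the real implementation may probe differently.

```python
import urllib.error
import urllib.request
from urllib.parse import urljoin


def head_exists(url, timeout=10):
    """Probe a URL with a HEAD request (one possible presence check)."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, ValueError):
        return False


def build_openers(base_url, candidates, exists=head_exists, fetch=None):
    """Map each present candidate file to a zero-argument lazy opener."""
    if fetch is None:
        fetch = urllib.request.urlopen
    root = base_url.rstrip("/") + "/"
    openers = {}
    for name in candidates:
        url = urljoin(root, name)
        if exists(url):
            # Bind url as a default so each opener keeps its own target.
            openers[name] = lambda url=url: fetch(url)
    return openers
```

Nothing is downloaded until an opener is called, so files missing on one side can be reported as added or deleted without fetching the other side first.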

Large file handling and DuckDB backend:

  • Documented the built-in DuckDB backend for memory-efficient diffing of very large files, including new CLI and API options for controlling when DuckDB is used, and how remote files are accessed directly by DuckDB. [1] [2] [3]

Id-churn detection and configuration:

  • Added documentation for id-churn detection (not_compared files), including new CLI/API options for global and per-file churn thresholds, and how files with regenerated primary keys or missing key columns are handled. [1] [2] [3]
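
A rough sketch of what an id-churn check could look like is given below. The ratio definition (share of the smaller key set that has no match on the other side) and the 0.9 default threshold are assumptions chosen for illustration; the engine's actual formula and defaults are not shown in this PR description.

```python
def id_churn_ratio(base_keys, new_keys):
    """Fraction of the smaller key set that has no match on the other side."""
    base, new = set(base_keys), set(new_keys)
    if not base or not new:
        return 0.0  # A wholly added or deleted file is not churn.
    shared = len(base & new)
    return 1.0 - shared / min(len(base), len(new))


def is_id_churn(base_keys, new_keys, threshold=0.9):
    """Flag a file as not comparable when almost no keys survive."""
    return id_churn_ratio(base_keys, new_keys) >= threshold


print(is_id_churn(["t1", "t2", "t3"], ["a9", "b7", "c4"]))  # True
print(is_id_churn(["t1", "t2", "t3"], ["t1", "t2", "t4"]))  # False
```

A file flagged this way would be reported as not_compared rather than as a wall of deletions and additions.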

Per-column and file-level statistics:

  • Expanded documentation of per-file and per-column change statistics, including new output fields such as stats.rows_changed_percentage and stats.column_stats, with details on their calculation, interpretation, and configuration. [1] [2]
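
For per-column statistics, a minimal sketch of the counting step might look like this. The output keys `modified_count` and `modified_percentage`, and the choice to compute percentages relative to the number of modified rows, are assumptions; the real column_stats schema is defined in the README.

```python
from collections import Counter


def column_stats(modified_rows):
    """Count, per column, how many modified rows changed that column.

    modified_rows is an iterable of (base_row, new_row) dict pairs that
    share a primary key.
    """
    counts = Counter()
    total = 0
    for base_row, new_row in modified_rows:
        total += 1
        for column in base_row.keys() | new_row.keys():
            if base_row.get(column) != new_row.get(column):
                counts[column] += 1
    return {
        column: {
            "modified_count": n,
            "modified_percentage": round(100 * n / total, 2),
        }
        for column, n in sorted(counts.items())
    }


pairs = [
    ({"stop_id": "1", "stop_name": "A", "stop_lat": "45.5"},
     {"stop_id": "1", "stop_name": "A2", "stop_lat": "45.6"}),
    ({"stop_id": "2", "stop_name": "B", "stop_lat": "45.7"},
     {"stop_id": "2", "stop_name": "B2", "stop_lat": "45.7"}),
]
print(column_stats(pairs))
```

Here stop_name changes in both rows and stop_lat in one, so the sketch reports 100.0 and 50.0 percent respectively.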

Primary key and foreign key handling:

  • Updated the table of primary key columns for supported GTFS files, clarified conditional/optional key handling, and described how missing or not-compared files affect foreign key diffs.
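
The foreign-key propagation can be pictured with a small sketch. The `FOREIGN_KEYS` map below is a simplified, hand-written subset of GTFS references, and `ignored_columns` is a hypothetical helper; the engine's own FK graph lives in gtfs_definitions.py and may be shaped differently.

```python
# Simplified subset of GTFS references: file -> {column: referenced file}.
FOREIGN_KEYS = {
    "trips.txt": {"route_id": "routes.txt", "shape_id": "shapes.txt"},
    "stop_times.txt": {"trip_id": "trips.txt", "stop_id": "stops.txt"},
}


def ignored_columns(not_compared, foreign_keys=FOREIGN_KEYS):
    """Return, per file, the FK columns that point at a not_compared file."""
    result = {}
    for file_name, columns in foreign_keys.items():
        untrusted = sorted(
            column for column, target in columns.items()
            if target in not_compared
        )
        if untrusted:
            result[file_name] = untrusted
    return result


print(ignored_columns({"trips.txt"}))
# {'stop_times.txt': ['trip_id']}
```

If trips.txt has churned ids, every trip_id in stop_times.txt would look modified, which is why such columns are excluded from the diff.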

Other enhancements:

  • Improved CLI usage documentation, including new options and updated help text. [1] [2]
  • Added more complete and accurate Python API examples, including for advanced options.
  • Clarified memory efficiency strategies and how the engine chooses between in-memory and DuckDB backends.

These changes make the documentation much more comprehensive and user-friendly, reflecting the latest features and best practices for using the GTFS Diff Engine.

Copilot AI left a comment

Pull request overview

This PR significantly expands gtfs_diff by adding remote HTTP(S) folder support, large-file routing to a DuckDB backend, and new “not_compared” behaviors (id churn detection and missing-PK handling) while also introducing richer per-file/per-column change statistics.

Changes:

  • Add DuckDB-backed diffing for large eligible files (including URL-based reading via httpfs) with parity tests against the in-memory engine.
  • Add HTTP(S) “folder URL” feed support with probing/existence checks and optional file filtering.
  • Add id-churn detection + foreign-key-driven ignored columns, plus optional/conditional primary-key column handling and enhanced change stats.

Reviewed changes

Copilot reviewed 12 out of 13 changed files in this pull request and generated 2 comments.

Show a summary per file
| File | Description |
| --- | --- |
| tests/test_engine.py | Adds extensive unit/integration coverage for URL feeds, DuckDB parity/routing, id-churn + FK-ignored columns, optional PK behavior, and change stats. |
| tests/test_cli.py | Updates CLI expectations for missing-PK behavior and adds CLI option coverage for DuckDB toggles/thresholds. |
| src/gtfs_diff/tracing.py | Extracts shared progress tracing into its own module. |
| src/gtfs_diff/gtfs_definitions.py | Expands/adjusts PK definitions, introduces optional PK columns, FK graph, and id-churn threshold plumbing. |
| src/gtfs_diff/engine.py | Adds remote feed openers/probing, not_compared handling, FK-aware processing order, id-churn gating, optional DuckDB routing, and stats plumbing. |
| src/gtfs_diff/engine_duckdb.py | Introduces DuckDB backend for “modified” diffs with capped row collection, id-churn gate, ignored FK columns, and spill cleanup. |
| src/gtfs_diff/diff_helpers.py | Centralizes shared pure diff logic (ordering, id-churn detection, ignored columns, cap splitting, stats assembly). |
| src/gtfs_diff/csv_utils.py | Extracts shared CSV parsing/indexing/value-diff utilities and optional-PK column behavior. |
| src/gtfs_diff/cli.py | Adds CLI options for file filtering, id-churn thresholds, DuckDB routing, and column-stats toggling; supports URL inputs. |
| requirements.txt | Adds DuckDB runtime dependency pin for the requirements-based environment. |
| README.md | Updates documentation for URL feeds, DuckDB backend, id-churn/not_compared semantics, and stats fields/options. |
| pyproject.toml | Declares DuckDB as a dependency (and dev dependency). |
| docs/architecture.md | Documents remote feeds, optional PKs, DuckDB routing/parity strategy, id-churn detection, and FK propagation/ignored columns. |

Comment on lines +349 to +353
sql = (
    f"SELECT {', '.join(select_parts)} "
    f"FROM base_t b JOIN new_t n ON {pk_join} "
    f"WHERE {distinct_pred}"
)
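
For readers without the full file, here is a hedged sketch of how the three pieces interpolated above (`select_parts`, `pk_join`, `distinct_pred`) might be assembled. The helper names, the `base_`/`new_` column aliases, and the use of `IS DISTINCT FROM` for NULL-safe comparison are assumptions; only the overall SELECT / JOIN / WHERE shape comes from the snippet.

```python
def quote(name):
    """Quote an identifier for SQL, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def build_modified_rows_sql(pk_cols, value_cols):
    """Build a query returning rows whose PK matches but values differ."""
    if not pk_cols or not value_cols:
        raise ValueError("need at least one PK column and one value column")
    select_parts = [f"b.{quote(c)}" for c in pk_cols]
    for c in value_cols:
        select_parts.append(f"b.{quote(c)} AS {quote('base_' + c)}")
        select_parts.append(f"n.{quote(c)} AS {quote('new_' + c)}")
    # Rows are paired on every primary-key column.
    pk_join = " AND ".join(f"b.{quote(c)} = n.{quote(c)}" for c in pk_cols)
    # IS DISTINCT FROM treats NULL vs value as a difference.
    distinct_pred = " OR ".join(
        f"b.{quote(c)} IS DISTINCT FROM n.{quote(c)}" for c in value_cols
    )
    return (
        f"SELECT {', '.join(select_parts)} "
        f"FROM base_t b JOIN new_t n ON {pk_join} "
        f"WHERE {distinct_pred}"
    )


print(build_modified_rows_sql(["stop_id"], ["stop_name"]))
```

Quoting every identifier matters here because column names come from CSV headers rather than from trusted code.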
Comment thread src/gtfs_diff/engine.py
Comment on lines +392 to +396
Each candidate name is probed for presence (so missing files are correctly
treated as added/deleted) and fetched lazily only when its opener is called.
"""
candidates = list(files) if files else sorted(SUPPORTED_FILES)

@cka-y cka-y changed the title Feat/10 feat: skip uncomparable files, ignore unreliable columns, improve performances for large files and balance the row-change cap Jun 10, 2026
@cka-y cka-y changed the title feat: skip uncomparable files, ignore unreliable columns, improve performances for large files and balance the row-change cap feat: skip uncomparable files, ignore unreliable columns, implement column level stats, improve performances for large files and balance the row-change cap Jun 10, 2026
@cka-y cka-y marked this pull request as ready for review June 10, 2026 15:19
@@ -0,0 +1,474 @@
"""DuckDB-backed diff for very large GTFS files.

Contributor

Did you benchmark duckdb with small files? It would simplify the code.
Is there any major platform where duckdb is not available?

Contributor Author

The idea was simply to avoid breaking everything and having to rewrite the entire test suite. The threshold for using DuckDB is customizable, but you're right that it would simplify the code (one less engine).

I can open an issue to address this post-MVP, depending on how the tool performs?

Contributor

Not sure it's worth it now. Maybe if we have major work to do on the in-memory engine we can reconsider.

Comment thread src/gtfs_diff/engine.py Outdated
Comment thread src/gtfs_diff/engine_duckdb.py

@jcpitre jcpitre left a comment

Contributor

Approved, but frankly there are so many changes that I can't say I dug very deep into it. We have to trust the tests.

@cka-y cka-y merged commit e3ca12f into main Jun 11, 2026
6 checks passed
@cka-y cka-y deleted the feat/10 branch June 11, 2026 15:46

Development

Successfully merging this pull request may close these issues.

  • Add per-file and per-column change statistics to diff output
  • Implement per-file "not compared" status for unreliable diffs

3 participants