feat: skip uncomparable files, ignore unreliable columns, implement column level stats, improve performances for large files and balance the row-change cap by cka-y · Pull Request #15 · MobilityData/gtfs-diff-engine

cka-y · 2026-06-10T12:14:44Z

Summary

This PR makes the diff engine degrade gracefully instead of failing or producing misleading output when a feed can't be cleanly compared. Files that can't be matched are reported as not_compared (rather than aborting the run), columns that reference those files are excluded, and the per-file row-change cap is now split fairly across added / modified / deleted so every report shows a little of everything.

What's new

Skip files with churned identifiers (id churn). When a file's primary key is regenerated across versions, row-level matching is unreliable, so the file is reported as not_compared instead of producing a misleading diff.
Skip files missing required primary keys. A missing required PK no longer stops the whole run — the file is reported as not_compared and processing continues.
Ignore columns that reference skipped files. Foreign-key columns pointing at a not_compared file are excluded from the diff, since their values can't be trusted.
Fair cap split. The row-change cap is divided across change types so the diff contains a little of all three (added / modified / deleted) — e.g. with a cap of 9, each type gets up to 3.
Per-column modification stats. Modified files now report per-column modification counts and percentages (column_stats).
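The fair cap split described above can be sketched as follows. This is a minimal illustration, not the engine's actual code: the function name `split_cap` is hypothetical, and handing a type's unused share to the other types is an assumption (the description only states that, with a cap of 9, each type gets up to 3).

```python
def split_cap(cap: int, available: dict[str, int]) -> dict[str, int]:
    """Divide `cap` across change types as evenly as possible.

    `available` maps a change type to the number of changed rows it has.
    Assumption: quota a type cannot use is shared among the remaining types.
    """
    alloc = {kind: 0 for kind in available}
    remaining = cap
    # Only types that still have unreported rows compete for quota.
    active = [kind for kind, n in available.items() if n > 0]
    while remaining > 0 and active:
        share, extra = divmod(remaining, len(active))
        for i, kind in enumerate(active):
            quota = share + (1 if i < extra else 0)
            take = min(quota, available[kind] - alloc[kind])
            alloc[kind] += take
            remaining -= take
        active = [kind for kind in active if alloc[kind] < available[kind]]
    return alloc


print(split_cap(9, {"added": 100, "modified": 100, "deleted": 100}))
```

With plenty of rows of every type, a cap of 9 yields 3 per type, matching the example in the description.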

Performance: DuckDB backend for large files

Small files stay on the fast in-memory path, but large files are now routed to a DuckDB backend so big feeds no longer blow up memory or stall:

On-disk diffing, bounded memory. Files whose larger side exceeds ~50 MB uncompressed (e.g. stop_times.txt, which can exceed 10 M rows) are diffed on disk by DuckDB instead of loading every row into memory. Memory stays roughly flat regardless of feed size.
Reads remote files in place. When a file is served over http(s), DuckDB reads it directly from the URL rather than fully materializing it first.
Automatic and transparent. Routing is based on file size and primary-key shape — no flags required. The threshold is tunable via diff_feeds(large_file_threshold_bytes=...).
Safe fallback. If DuckDB is unavailable, a file isn't eligible, or the backend errors, the engine transparently falls back to the in-memory path, so results are identical either way.
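The routing and fallback rules above could look roughly like this. It is a sketch under stated assumptions: `choose_backend`, `diff_file`, `pk_eligible`, and the exact byte value of the threshold are hypothetical names and values; only the ~50 MB figure, the size-plus-primary-key criterion, and the fallback behaviour come from the description.

```python
import importlib.util

# "~50 MB" per the description; the exact byte value is an assumption.
DEFAULT_THRESHOLD_BYTES = 50 * 1024 * 1024


def choose_backend(base_size, new_size, threshold=DEFAULT_THRESHOLD_BYTES,
                   pk_eligible=True, duckdb_available=None):
    """Pick a backend from the larger side's size and primary-key eligibility."""
    if duckdb_available is None:
        # Detect DuckDB without importing it.
        duckdb_available = importlib.util.find_spec("duckdb") is not None
    if max(base_size, new_size) > threshold and pk_eligible and duckdb_available:
        return "duckdb"
    return "in_memory"


def diff_file(base_size, new_size, run_duckdb, run_in_memory, duckdb_available=None):
    """Run the chosen backend; fall back to in-memory if DuckDB errors."""
    backend = choose_backend(base_size, new_size, duckdb_available=duckdb_available)
    if backend == "duckdb":
        try:
            return run_duckdb()
        except Exception:
            pass  # hypothetical fallback: any backend error reverts to in-memory
    return run_in_memory()
```

Passing the two backends as callables keeps the routing decision testable without DuckDB installed.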

Testing

Tested against the two latest STM datasets:

gtfs-diff https://files.mobilitydatabase.org/mdb-2126/mdb-2126-202605300005/extracted \
  https://files.mobilitydatabase.org/mdb-2126/mdb-2126-202606040044/extracted \
  --output results-STM.json \
  --cap 10000

Example report with dummy data

Files ignored due to id churn

Files ignored due to missing PK

Columns ignored due to ignored files

Split cap so the diff contains a little of all (added / modified / deleted), e.g. with a cap of 9

Added column stats

AI description:

This pull request significantly expands and updates the documentation for the GTFS Diff Engine, especially in the README.md and architecture documentation. The main focus is on clarifying new features, usage patterns, and options introduced in recent releases, such as support for public HTTP(S) folder URLs as feed sources, the built-in DuckDB backend for large files, configurable id-churn detection, per-column modification statistics, and more flexible file selection. The documentation now provides more detailed parameter explanations, usage examples, and output schema descriptions, making it easier for users to understand and leverage the full capabilities of the tool.

Major documentation improvements:

Support for new feed sources and file selection:

Updated all references to feed sources to include support for public HTTP(S) folder URLs, in addition to zip archives and directories. Clarified how file selection works for both local and remote feeds, including behavior with non-listable folders and missing files. [1] [2]
Added detailed explanations and examples for using the files parameter in both the CLI and Python API, including auto-discovery of GTFS files for URLs and selective comparison. [1] [2]

Large file handling and DuckDB backend:

Documented the built-in DuckDB backend for memory-efficient diffing of very large files, including new CLI and API options for controlling when DuckDB is used, and how remote files are accessed directly by DuckDB. [1] [2] [3]

Id-churn detection and configuration:

Added documentation for id-churn detection (not_compared files), including new CLI/API options for global and per-file churn thresholds, and how files with regenerated primary keys or missing key columns are handled. [1] [2] [3]
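One way to picture the not_compared decision is the sketch below. The PR does not show how churn is measured, so the metric here (share of primary keys that do not survive between versions), the default threshold of 0.9, the function name `classify_file`, and the reason strings are all assumptions for illustration.

```python
def classify_file(base_rows, new_rows, pk_columns, churn_threshold=0.9):
    """Return ("not_compared", reason) or ("comparable", None).

    Rows are dicts, as produced by csv.DictReader.
    """
    # A missing required key column makes row matching impossible;
    # report the file instead of aborting the whole run.
    for rows in (base_rows, new_rows):
        if rows and any(col not in rows[0] for col in pk_columns):
            return "not_compared", "missing_primary_key"

    base_keys = {tuple(r[c] for c in pk_columns) for r in base_rows}
    new_keys = {tuple(r[c] for c in pk_columns) for r in new_rows}
    if base_keys and new_keys:
        retained = len(base_keys & new_keys)
        # Assumed metric: fraction of keys that did not survive.
        churn = 1 - retained / min(len(base_keys), len(new_keys))
        if churn >= churn_threshold:
            return "not_compared", "id_churn"
    return "comparable", None


base = [{"trip_id": "a"}, {"trip_id": "b"}, {"trip_id": "c"}]
regenerated = [{"trip_id": "x"}, {"trip_id": "y"}, {"trip_id": "z"}]
print(classify_file(base, regenerated, ["trip_id"]))
```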

Per-column and file-level statistics:

Expanded documentation of per-file and per-column change statistics, including new output fields such as stats.rows_changed_percentage and stats.column_stats, with details on their calculation, interpretation, and configuration. [1] [2]
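As a rough illustration of per-column statistics, the sketch below counts how often each column differs across modified rows. The output field names (`modified_count`, `modified_percentage`) and the choice of denominator (the number of modified rows) are assumptions; the real `stats.column_stats` schema is defined in the project's README, not here.

```python
from collections import Counter


def column_stats(modified_pairs):
    """Count, per column, how many modified rows changed that column.

    `modified_pairs` is a list of (old_row, new_row) dicts sharing a primary key.
    Assumption: percentages are relative to the number of modified rows.
    """
    counts = Counter()
    for old, new in modified_pairs:
        for col in old.keys() & new.keys():
            if old[col] != new[col]:
                counts[col] += 1
    n_modified = len(modified_pairs)
    return {
        col: {
            "modified_count": n,
            "modified_percentage": round(100 * n / n_modified, 2),
        }
        for col, n in sorted(counts.items())
    }


pairs = [
    ({"id": "1", "name": "A", "lat": "1"}, {"id": "1", "name": "B", "lat": "1"}),
    ({"id": "2", "name": "C", "lat": "2"}, {"id": "2", "name": "D", "lat": "3"}),
]
print(column_stats(pairs))
```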

Primary key and foreign key handling:

Updated the table of primary key columns for supported GTFS files, clarified conditional/optional key handling, and described how missing or not-compared files affect foreign key diffs.
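The effect of a not_compared file on foreign-key columns can be sketched with a small lookup. The `FOREIGN_KEYS` mapping below is a hand-written subset of standard GTFS references and `ignored_columns` is a hypothetical helper; the engine's own FK graph lives in gtfs_definitions.py and may be shaped differently. The sketch also does not model any further propagation between files.

```python
# (file, column) -> referenced file; a small illustrative subset of GTFS references.
FOREIGN_KEYS = {
    ("trips.txt", "route_id"): "routes.txt",
    ("stop_times.txt", "trip_id"): "trips.txt",
    ("stop_times.txt", "stop_id"): "stops.txt",
}


def ignored_columns(not_compared):
    """Map each file to the FK columns that point at a not_compared file."""
    result = {}
    for (file_name, column), target in FOREIGN_KEYS.items():
        if target in not_compared:
            result.setdefault(file_name, []).append(column)
    return result


# If stops.txt could not be compared, stop_times.stop_id values cannot be trusted.
print(ignored_columns({"stops.txt"}))
```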

Other enhancements:

Improved CLI usage documentation, including new options and updated help text. [1] [2]
Added more complete and accurate Python API examples, including for advanced options.
Clarified memory efficiency strategies and how the engine chooses between in-memory and DuckDB backends.

These changes make the documentation much more comprehensive and user-friendly, reflecting the latest features and best practices for using the GTFS Diff Engine.

Copilot

Pull request overview

This PR significantly expands gtfs_diff by adding remote HTTP(S) folder support, large-file routing to a DuckDB backend, and new “not_compared” behaviors (id churn detection and missing-PK handling) while also introducing richer per-file/per-column change statistics.

Changes:

Add DuckDB-backed diffing for large eligible files (including URL-based reading via httpfs) with parity tests against the in-memory engine.
Add HTTP(S) “folder URL” feed support with probing/existence checks and optional file filtering.
Add id-churn detection + foreign-key-driven ignored columns, plus optional/conditional primary-key column handling and enhanced change stats.

Reviewed changes

Copilot reviewed 12 out of 13 changed files in this pull request and generated 2 comments.

File	Description
tests/test_engine.py	Adds extensive unit/integration coverage for URL feeds, DuckDB parity/routing, id-churn + FK-ignored columns, optional PK behavior, and change stats.
tests/test_cli.py	Updates CLI expectations for missing-PK behavior and adds CLI option coverage for DuckDB toggles/thresholds.
src/gtfs_diff/tracing.py	Extracts shared progress tracing into its own module.
src/gtfs_diff/gtfs_definitions.py	Expands/adjusts PK definitions, introduces optional PK columns, FK graph, and id-churn threshold plumbing.
src/gtfs_diff/engine.py	Adds remote feed openers/probing, not_compared handling, FK-aware processing order, id-churn gating, optional DuckDB routing, and stats plumbing.
src/gtfs_diff/engine_duckdb.py	Introduces DuckDB backend for “modified” diffs with capped row collection, id-churn gate, ignored FK columns, and spill cleanup.
src/gtfs_diff/diff_helpers.py	Centralizes shared pure diff logic (ordering, id-churn detection, ignored columns, cap splitting, stats assembly).
src/gtfs_diff/csv_utils.py	Extracts shared CSV parsing/indexing/value-diff utilities and optional-PK column behavior.
src/gtfs_diff/cli.py	Adds CLI options for file filtering, id-churn thresholds, DuckDB routing, and column-stats toggling; supports URL inputs.
requirements.txt	Adds DuckDB runtime dependency pin for the requirements-based environment.
README.md	Updates documentation for URL feeds, DuckDB backend, id-churn/not_compared semantics, and stats fields/options.
pyproject.toml	Declares DuckDB as a dependency (and dev dependency).
docs/architecture.md	Documents remote feeds, optional PKs, DuckDB routing/parity strategy, id-churn detection, and FK propagation/ignored columns.

+    sql = (
+        f"SELECT {', '.join(select_parts)} "
+        f"FROM base_t b JOIN new_t n ON {pk_join} "
+        f"WHERE {distinct_pred}"
+    )
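The fragment above does not show how `select_parts`, `pk_join`, and `distinct_pred` are assembled. One plausible construction, offered purely as an assumption, joins on the primary-key columns and keeps rows where any value column differs, using `IS DISTINCT FROM` so that NULLs compare safely. The helper names `quote_ident` and `build_modified_sql` and the `base_`/`new_` output aliases are hypothetical.

```python
def quote_ident(name: str) -> str:
    """Quote a column name as a SQL identifier, doubling embedded quotes."""
    return '"' + name.replace('"', '""') + '"'


def build_modified_sql(pk_cols, value_cols):
    """Build a query returning rows present in both tables whose values differ."""
    select_parts = [f"b.{quote_ident(c)} AS {quote_ident(c)}" for c in pk_cols]
    for c in value_cols:
        select_parts.append(f"b.{quote_ident(c)} AS {quote_ident('base_' + c)}")
        select_parts.append(f"n.{quote_ident(c)} AS {quote_ident('new_' + c)}")
    pk_join = " AND ".join(f"b.{quote_ident(c)} = n.{quote_ident(c)}" for c in pk_cols)
    # NULL-safe inequality; with no value columns nothing can be "modified".
    distinct_pred = " OR ".join(
        f"b.{quote_ident(c)} IS DISTINCT FROM n.{quote_ident(c)}" for c in value_cols
    ) or "FALSE"
    return (
        f"SELECT {', '.join(select_parts)} "
        f"FROM base_t b JOIN new_t n ON {pk_join} "
        f"WHERE {distinct_pred}"
    )


print(build_modified_sql(["stop_id"], ["stop_name"]))
```

The resulting string could then be passed to a DuckDB connection once tables or views named `base_t` and `new_t` exist.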


+    Each candidate name is probed for presence (so missing files are correctly
+    treated as added/deleted) and fetched lazily only when its opener is called.
+    """
+    candidates = list(files) if files else sorted(SUPPORTED_FILES)
+
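The probe-then-fetch-lazily behaviour in this docstring can be illustrated with closures. This is a sketch, not the engine's code: `build_openers`, the `exists` and `fetch` callables, and the three-file stand-in for `SUPPORTED_FILES` are assumptions made so the example runs without network access.

```python
# Stand-in subset; the real SUPPORTED_FILES is defined by the engine.
SUPPORTED_FILES = {"stops.txt", "routes.txt", "trips.txt"}


def build_openers(base_url, files, exists, fetch):
    """Return {file name: opener} for every candidate that exists remotely.

    `exists(url)` probes for presence; `fetch(url)` downloads the content.
    Nothing is downloaded until an opener is actually called.
    """
    candidates = list(files) if files else sorted(SUPPORTED_FILES)
    openers = {}
    for name in candidates:
        url = f"{base_url.rstrip('/')}/{name}"
        if exists(url):
            # Bind url as a default argument so each closure keeps its own value.
            openers[name] = lambda url=url: fetch(url)
    return openers


fetched = []
openers = build_openers(
    "https://example.org/feed/",
    None,
    exists=lambda url: url.endswith("stops.txt"),
    fetch=lambda url: fetched.append(url) or b"stop_id\n",
)
print(sorted(openers), fetched)
```

Files absent from one side simply have no opener there, which lets the caller treat them as added or deleted.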


jcpitre · 2026-06-10T17:09:35Z

@@ -0,0 +1,474 @@
+"""DuckDB-backed diff for very large GTFS files.


Did you benchmark duckdb with small files? It would simplify the code.
Is there any major platform where duckdb is not available?

The idea was simply to avoid breaking everything and having to rewrite the entire test suite. The threshold for using duckdb is customizable, but you're right that it would simplify the code (one less engine).

I can open an issue to address this post-MVP, depending on the performance of the tool?

Not sure it's worth it now. Maybe if we have major work to do on the in-memory engine we can reconsider.

jcpitre

Approved, but frankly there are so many changes that I can't say I dug very deep into it. We have to trust the tests.

cka-y added 11 commits June 2, 2026 14:09

feat: automatic pydantic model generation from json schema

23195a9

added lint + improve ci

be7f731

fix: zip without strict

2c0bfae

update models

6d5af75

fix: moved conf file

f96ef39

fix: moved conf file

176b896

fix: added extra allowed

4c25ea7

merge main

a778822

feat: scalable GTFS diff engine with DuckDB backend, cap splitting, a…

4b17102

…nd primary-key handling

Merge branch 'main' into feat/10

9a625ea

fix: lint

869937e

cka-y requested a review from Copilot June 10, 2026 12:19

Copilot started reviewing on behalf of cka-y June 10, 2026 12:20

This was linked to issues Jun 10, 2026

Implement per-file "not compared" status for unreliable diffs #10

Closed

Add per-file and per-column change statistics to diff output #11

Closed

Copilot AI reviewed Jun 10, 2026

cka-y changed the title ~~Feat/10~~ feat: skip uncomparable files, ignore unreliable columns, improve performances for large files and balance the row-change cap Jun 10, 2026

cka-y marked this pull request as ready for review June 10, 2026 15:19

jcpitre reviewed Jun 10, 2026

fix: jc env variable request

66bf88d

jcpitre reviewed Jun 10, 2026

Comment thread src/gtfs_diff/engine.py Outdated

jcpitre reviewed Jun 10, 2026

Comment thread src/gtfs_diff/engine_duckdb.py

jcpitre approved these changes Jun 10, 2026

cka-y added 3 commits June 11, 2026 11:28

fix: continue processing after duplicate pk

1e552fb

fix: PR comments

16c5ed8

fix: revert model

d0c226e

cka-y merged commit e3ca12f into main Jun 11, 2026
6 checks passed

cka-y deleted the feat/10 branch June 11, 2026 15:46

cka-y mentioned this pull request Jun 11, 2026

Diff is stopped if any file has missing keys #7

Closed

cka-y mentioned this pull request Jun 11, 2026

Release gtfs-diff-engine v0.2.0 #16

Closed

cka-y merged 15 commits into main from feat/10