
BigQuery -> FCNet geo-prior pipeline #80

Open
mohamedelabbas1996 wants to merge 7 commits into feature/bq-squashfs-pipeline from feature/geoprior


mohamedelabbas1996 commented Jun 5, 2026


Summary

Adds a self-contained geographic-prior package at src/geoprior/. The geo-prior is a SINR-style FCNet that models p(species | latitude, longitude, date). It takes no images, only coordinates, dates, and labels, and its output is used to re-rank or gate the image classifier's predictions. The package covers the whole path from BigQuery occurrence data to a trained model and a top-5 fusion evaluation, with all configuration in .env and the network code vendored in-tree.
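For reviewers unfamiliar with prior fusion, here is a minimal sketch of how a geo-prior can re-rank classifier output. It is an illustration, not the code in fusion_eval_top5.py: the function name `fuse_top_k`, the log-space product rule, and the `eps` smoothing are all assumptions made for this example.

```python
import math


def fuse_top_k(classifier_probs, geo_prior, k=5, eps=1e-6):
    """Return the top-k class ids after weighting classifier scores by a geo-prior.

    Hypothetical helper: both inputs are per-class probabilities that share
    the same class-id ordering.
    """
    if len(classifier_probs) != len(geo_prior):
        raise ValueError("both vectors must cover the same class space")
    # Sum of logs == log of the product; eps avoids log(0) for a zero prior.
    scores = [
        math.log(p + eps) + math.log(g + eps)
        for p, g in zip(classifier_probs, geo_prior)
    ]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


# The classifier alone prefers class 0, but the prior says class 0 is
# unlikely at this location and date, so class 1 moves to the top.
classifier_probs = [0.40, 0.35, 0.15, 0.10]
geo_prior = [0.01, 0.60, 0.30, 0.09]
print(fuse_top_k(classifier_probs, geo_prior, k=2))  # [1, 2]
```

A multiplicative rule like this re-ranks; a gating variant would instead drop classes whose prior falls below a threshold before ranking.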

Pure addition: 18 files, +1,799 / -0. No existing code is modified (aside from two .gitignore lines).

Pipeline stages

| Stage | Script | Does |
|---|---|---|
| 0 | build_geoprior_categ_map.py | BigQuery -> frozen species -> class_id map (regenerate/verify, never silently overwrite) |
| 1 | build_geoprior_json.py | BigQuery + vision splits + map -> COCO-style train/val/test.json (excludes vision val/test gbif_ids to prevent leakage) |
| 2 | train_geoprior.py | JSON -> trained FCNet (+ W&B logging, per-epoch checkpoints) |
| 3 | predict_geoprior.py | per-occurrence priors for fusion |
| 4 | fusion_eval_top5.py | geo-prior + classifier top-5 fusion eval |
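The leakage guard in Stage 1 can be pictured with the sketch below. It assumes each occurrence is a dict with a `gbif_id` key and that the vision splits are available as plain id lists. Both the record shape and the function name `exclude_vision_holdout` are hypothetical, chosen only to illustrate the idea.

```python
def exclude_vision_holdout(occurrences, vision_val_ids, vision_test_ids):
    """Drop occurrences whose gbif_id appears in the vision val or test split.

    Hypothetical helper: keeps the geo-prior training data disjoint from the
    images later used to evaluate the fused model.
    """
    held_out = set(vision_val_ids) | set(vision_test_ids)
    return [occ for occ in occurrences if occ["gbif_id"] not in held_out]


occurrences = [
    {"gbif_id": 101, "species": "Aglais io"},
    {"gbif_id": 102, "species": "Vanessa atalanta"},
    {"gbif_id": 103, "species": "Pieris rapae"},
]
kept = exclude_vision_holdout(occurrences, vision_val_ids=[102], vision_test_ids=[103])
print([occ["gbif_id"] for occ in kept])  # [101]
```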

Changes

Pipeline scripts: the 5 stage scripts above, run as modules from repo root (python -m src.geoprior.<script>).

Frozen artifact + provenance

  • geoprior_categ_map.json: frozen class-space contract, 12,317 species (snapshot 2026-05, 6,864,466 geocoded occurrences). Model output indices are bound to this alphabetical ordering, so it must not change while a model trained against it is in use.
  • geoprior_categ_map.PROVENANCE.md: records the exact BigQuery tables + query and how to regenerate/verify.
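To make the "regenerate and verify, never silently overwrite" contract concrete, here is a minimal sketch. It assumes the map is a flat JSON object of species name to class id and that Python's default string sort matches the committed ordering. The helper names `build_categ_map` and `write_or_verify` are hypothetical and not taken from build_geoprior_categ_map.py.

```python
import json
import tempfile
from pathlib import Path


def build_categ_map(species_names):
    """Assign class ids by sorted species name (assumed ordering rule)."""
    return {name: idx for idx, name in enumerate(sorted(set(species_names)))}


def write_or_verify(new_map, frozen_path):
    """Write the map if none exists; otherwise verify and refuse to overwrite on drift."""
    path = Path(frozen_path)
    if not path.exists():
        path.write_text(json.dumps(new_map, indent=2, sort_keys=True))
        return "written"
    frozen = json.loads(path.read_text())
    if frozen != new_map:
        raise RuntimeError(
            f"category map drift: {len(frozen)} frozen vs {len(new_map)} rebuilt entries"
        )
    return "verified"


with tempfile.TemporaryDirectory() as tmp:
    frozen = Path(tmp) / "categ_map.json"
    first = write_or_verify(build_categ_map(["Vanessa atalanta", "Aglais io"]), frozen)
    second = write_or_verify(build_categ_map(["Aglais io", "Vanessa atalanta"]), frozen)
    try:
        # A new species shifts later class ids, so this must be rejected.
        write_or_verify(
            build_categ_map(["Aglais io", "Pieris rapae", "Vanessa atalanta"]), frozen
        )
        drift_detected = False
    except RuntimeError:
        drift_detected = True

print(first, second, drift_detected)  # written verified True
```

Raising on drift rather than rewriting matters here because a trained model's output indices would silently point at different species if the file changed underneath it.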

In-tree network: geoprior_fagner/ (models.py, losses.py, dataloader.py), frozen at upstream commit ff4ccd1 (Apache-2.0). It replaces the previous hardcoded external path, so the package is portable.

Config & reproducibility

  • config.py + .env.sample: every path, identifier, and secret lives in .env (gitignored). Nothing is hardcoded, and WANDB_API_KEY and GOOGLE_APPLICATION_CREDENTIALS are never committed.
  • requirements.txt: pinned dependency lower bounds (BigQuery build stage + Torch/timm/wandb train stage).
  • README.md: full setup, per-stage commands, and config table.
  • .gitignore: ignore geo-prior checkpoints (*.pth).
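As a rough picture of the .env-driven configuration, the sketch below parses KEY=VALUE lines and lets real environment variables take precedence. The precedence rule, the helper names `load_env_file` and `require`, and the key `GEOPRIOR_OUTPUT_DIR` are assumptions for illustration. The actual config.py may differ.

```python
import os
import tempfile


def load_env_file(path, environ=None):
    """Parse KEY=VALUE lines from a .env file; environment variables win (assumed rule)."""
    environ = os.environ if environ is None else environ
    values = {}
    try:
        with open(path, encoding="utf-8") as fh:
            for raw in fh:
                line = raw.strip()
                # Skip blanks, comments, and lines without an assignment.
                if not line or line.startswith("#") or "=" not in line:
                    continue
                key, _, value = line.partition("=")
                values[key.strip()] = value.strip().strip('"').strip("'")
    except FileNotFoundError:
        pass  # A missing .env is allowed; required keys are checked by require().
    values.update({k: environ[k] for k in values if k in environ})
    return values


def require(config, key):
    """Fail loudly when a required setting is missing or empty."""
    if not config.get(key):
        raise RuntimeError(f"missing required setting: {key}")
    return config[key]


with tempfile.TemporaryDirectory() as tmp:
    env_path = os.path.join(tmp, ".env")
    with open(env_path, "w", encoding="utf-8") as fh:
        fh.write('# sample\nGEOPRIOR_OUTPUT_DIR=/data/out\nWANDB_API_KEY="from-file"\n')
    config = load_env_file(env_path, environ={"WANDB_API_KEY": "from-env"})

print(require(config, "GEOPRIOR_OUTPUT_DIR"))  # /data/out
```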

Notes for reviewers

  • Base is currently feature/bq-squashfs-pipeline (Stage 1 sources gbif_occurrence_location produced by that branch), not main.
  • MIN_OCC_PER_SPECIES = 0: all species kept, no occurrence-count threshold filter.
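For context on what the threshold would do if raised, here is a minimal sketch of an occurrence-count filter. The record shape (dicts with a `species` key) and the function name `filter_by_occurrence_count` are hypothetical. Only the constant name and its default of 0 come from this PR.

```python
from collections import Counter

MIN_OCC_PER_SPECIES = 0  # default from this PR: keep every species


def filter_by_occurrence_count(occurrences, min_occ=MIN_OCC_PER_SPECIES):
    """Keep occurrences of species that have at least min_occ records."""
    counts = Counter(occ["species"] for occ in occurrences)
    return [occ for occ in occurrences if counts[occ["species"]] >= min_occ]


occurrences = [
    {"species": "Aglais io"},
    {"species": "Aglais io"},
    {"species": "Pieris rapae"},
]
print(len(filter_by_occurrence_count(occurrences)))  # 3
print(len(filter_by_occurrence_count(occurrences, min_occ=2)))  # 2
```

With the default of 0 the filter is a no-op, which matches the note above that all species are kept.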

mohamedelabbas1996 and others added 6 commits June 4, 2026 20:27
Initial commit of the global geo-prior (FCNet) pipeline as used for the
geoprior-fcnet-global-12317cls-v1 run:

- src/dataset_tools/build_geoprior_json.py: BigQuery -> COCO-style
  train/val/test JSON for geo-prior training.
- research/geoprior/train_geoprior.py: FCNet training wrapper around
  fagner-lepsAI/geo_prior, with wandb logging + per-epoch checkpoints.
- research/geoprior/predict_geoprior.py: per-occurrence prior generation.
- research/geoprior/fusion_eval_top5.py: geo-prior + classifier top-5
  fusion evaluation.

Scripts committed as-is; cleanup and reproducibility work (category-map
builder, vendored geo_prior modules, README, pinned deps, .gitignore) to
follow in subsequent commits.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

The species->class_id map is the frozen class-space contract for the
geo-prior FCNet model (output indices are bound to this alphabetical
ordering). Commit it as a versioned artifact and add a builder that
regenerates+verifies it from BigQuery rather than blindly overwriting.

- research/geoprior/geoprior_categ_map.json: frozen 12,317-class map
  (snapshot public_gbif_2026-05; 6,864,466 geocoded occurrences).
- research/geoprior/geoprior_categ_map.PROVENANCE.md: source BQ tables,
  definition, and regenerate/verify commands.
- src/dataset_tools/build_geoprior_categ_map.py: rebuilds the map (and
  label_map/metadata/master lists) from leps-ai BQ tables; verifies
  against the frozen artifact and refuses to overwrite on drift.

Verified: the builder reproduces the committed map exactly (12,317
entries identical) and its BQ count query matches master_species_with_counts.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Move the geo-prior scripts and frozen artifacts out of the shared
src/dataset_tools/ (BQ/vision pipeline) and research/ into a dedicated
src/geoprior/ package. Pure relocation — no content changes.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

- geoprior_fagner/: the geo-prior FCNet network (FCNet, losses, dataloader)
  kept in-tree, frozen at upstream commit ff4ccd1 (Apache-2.0). Removes the
  dependency on an external clone.
- config.py: central configuration read from environment / src/geoprior/.env.
  No machine-specific paths or secrets in code.
- .env.sample: annotated configuration template (real .env is gitignored).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…rdcoded paths

- All scripts now import the in-tree geoprior_fagner network and read paths,
  BigQuery project/dataset, and W&B settings from config (no /mnt or /home
  paths, no sys.path hack, no secrets in code).
- build_geoprior_json now reads the in-repo frozen category map (was /mnt).
- Fix stale docstring: MIN_OCC_PER_SPECIES default is 0 (keep all species).
- .gitignore: ignore model checkpoints (*.pth).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

- README.md: end-to-end documentation (layout, .env config, each pipeline
  stage, inputs, model details, reproducibility).
- requirements.txt: pinned dependencies for the build and training stages.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
coderabbitai Bot commented Jun 5, 2026


Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.


@mohamedelabbas1996 mohamedelabbas1996 changed the base branch from main to feature/bq-squashfs-pipeline June 5, 2026 22:11
@mohamedelabbas1996 mohamedelabbas1996 changed the title from "Feature/geoprior" to "feat(geoprior): self-contained BigQuery -> FCNet geo-prior pipeline" Jun 11, 2026
Run the repo's pinned pre-commit hooks (black 23.3.0, isort 5.12.0 with
the black profile, flake8 6.0.0) over src/geoprior/. Changes are
formatting-only: line reflow, double-quote normalization, import
ordering, plus dropping a stray f-prefix on a non-interpolated string
(F541) in fusion_eval_top5.py. No logic changes.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@mohamedelabbas1996 mohamedelabbas1996 changed the title from "feat(geoprior): self-contained BigQuery -> FCNet geo-prior pipeline" to "BigQuery -> FCNet geo-prior pipeline" Jun 11, 2026
@mohamedelabbas1996 mohamedelabbas1996 marked this pull request as ready for review June 11, 2026 20:23