BigQuery -> FCNet geo-prior pipeline #80
Open
mohamedelabbas1996 wants to merge 7 commits into
Initial commit of the global geo-prior (FCNet) pipeline as used for the geoprior-fcnet-global-12317cls-v1 run:

- src/dataset_tools/build_geoprior_json.py: BigQuery -> COCO-style train/val/test JSON for geo-prior training.
- research/geoprior/train_geoprior.py: FCNet training wrapper around fagner-lepsAI/geo_prior, with wandb logging + per-epoch checkpoints.
- research/geoprior/predict_geoprior.py: per-occurrence prior generation.
- research/geoprior/fusion_eval_top5.py: geo-prior + classifier top-5 fusion evaluation.

Scripts committed as-is; cleanup and reproducibility work (category-map builder, vendored geo_prior modules, README, pinned deps, .gitignore) to follow in subsequent commits.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The species->class_id map is the frozen class-space contract for the geo-prior FCNet model (output indices are bound to this alphabetical ordering). Commit it as a versioned artifact and add a builder that regenerates+verifies it from BigQuery rather than blindly overwriting.

- research/geoprior/geoprior_categ_map.json: frozen 12,317-class map (snapshot public_gbif_2026-05; 6,864,466 geocoded occurrences).
- research/geoprior/geoprior_categ_map.PROVENANCE.md: source BQ tables, definition, and regenerate/verify commands.
- src/dataset_tools/build_geoprior_categ_map.py: rebuilds the map (and label_map/metadata/master lists) from leps-ai BQ tables; verifies against the frozen artifact and refuses to overwrite on drift.

Verified: the builder reproduces the committed map exactly (12,317 entries identical) and its BQ count query matches master_species_with_counts.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
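The regenerate-and-verify behaviour described in this commit could look roughly like the sketch below. It is not the code from build_geoprior_categ_map.py: the function names (build_categ_map, verify_or_refuse), the JSON layout (species name mapped to an integer index), and the plain Python list standing in for the BigQuery result are all assumptions made for illustration.

```python
import json
from pathlib import Path


def build_categ_map(species_names):
    """Assign class ids by alphabetical order of the unique species names."""
    return {name: idx for idx, name in enumerate(sorted(set(species_names)))}


def verify_or_refuse(new_map, frozen_path):
    """Write the map if no frozen artifact exists; otherwise only verify it.

    Raises RuntimeError on any drift instead of overwriting the frozen file.
    """
    frozen_path = Path(frozen_path)
    if not frozen_path.exists():
        frozen_path.write_text(json.dumps(new_map, indent=2, sort_keys=True))
        return "written"

    frozen = json.loads(frozen_path.read_text())
    if frozen != new_map:
        added = sorted(set(new_map) - set(frozen))
        removed = sorted(set(frozen) - set(new_map))
        # Species present in both maps whose index changed.
        moved = sorted(
            name
            for name in set(new_map) & set(frozen)
            if new_map[name] != frozen[name]
        )
        raise RuntimeError(
            f"category map drift: {len(added)} added, {len(removed)} removed, "
            f"{len(moved)} re-indexed; refusing to overwrite the frozen artifact"
        )
    return "verified"
```

Because ids follow alphabetical order, adding a single species shifts the index of every species that sorts after it, which is why a trained model's output layer stops matching the map as soon as the species list drifts.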
Move the geo-prior scripts and frozen artifacts out of the shared src/dataset_tools/ (BQ/vision pipeline) and research/ into a dedicated src/geoprior/ package. Pure relocation: no content changes.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
- geoprior_fagner/: the geo-prior FCNet network (FCNet, losses, dataloader) kept in-tree, frozen at upstream commit ff4ccd1 (Apache-2.0). Removes the dependency on an external clone.
- config.py: central configuration read from environment / src/geoprior/.env. No machine-specific paths or secrets in code.
- .env.sample: annotated configuration template (real .env is gitignored).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
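A central config module that reads from the environment with a .env fallback might be structured like the sketch below. This is an illustration, not the contents of config.py: the setting names (GEOPRIOR_BQ_PROJECT, GEOPRIOR_BQ_DATASET, GEOPRIOR_OUTPUT_DIR), the GeoPriorConfig dataclass, and the hand-rolled .env parser are all hypothetical, and the actual module may use a library for parsing instead.

```python
import os
from dataclasses import dataclass
from pathlib import Path


def load_env_file(path):
    """Parse KEY=VALUE lines from a .env file; skip blanks and # comments."""
    values = {}
    path = Path(path)
    if not path.exists():
        return values
    for line in path.read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


@dataclass(frozen=True)
class GeoPriorConfig:
    bq_project: str
    bq_dataset: str
    output_dir: Path


def load_config(env_file="src/geoprior/.env", environ=None):
    """Build the config; real environment variables override the .env file."""
    environ = os.environ if environ is None else environ
    merged = {**load_env_file(env_file), **environ}
    required = ("GEOPRIOR_BQ_PROJECT", "GEOPRIOR_BQ_DATASET")
    missing = [key for key in required if key not in merged]
    if missing:
        raise KeyError(f"missing required settings: {', '.join(missing)}")
    return GeoPriorConfig(
        bq_project=merged["GEOPRIOR_BQ_PROJECT"],
        bq_dataset=merged["GEOPRIOR_BQ_DATASET"],
        output_dir=Path(merged.get("GEOPRIOR_OUTPUT_DIR", "outputs")),
    )
```

Failing fast on missing required settings keeps machine-specific values out of the code while still surfacing a misconfigured .env before any BigQuery or training work starts.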
…rdcoded paths

- All scripts now import the in-tree geoprior_fagner network and read paths, BigQuery project/dataset, and W&B settings from config (no /mnt or /home paths, no sys.path hack, no secrets in code).
- build_geoprior_json now reads the in-repo frozen category map (was /mnt).
- Fix stale docstring: MIN_OCC_PER_SPECIES default is 0 (keep all species).
- .gitignore: ignore model checkpoints (*.pth).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
- README.md: end-to-end documentation (layout, .env config, each pipeline stage, inputs, model details, reproducibility).
- requirements.txt: pinned dependencies for the build and training stages.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Run the repo's pinned pre-commit hooks (black 23.3.0, isort 5.12.0 with the black profile, flake8 6.0.0) over src/geoprior/. Changes are formatting-only: line reflow, double-quote normalization, import ordering, plus dropping a stray f-prefix on a non-interpolated string (F541) in fusion_eval_top5.py. No logic changes.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Summary

Adds a self-contained geographic-prior package at src/geoprior/. The geo-prior is a SINR-style FCNet that models p(species | latitude, longitude, date). It uses no images, only coordinates, dates, and labels, and is used to re-rank or gate the image classifier's predictions. The package covers the whole path from BigQuery occurrence data to a trained model and a top-5 fusion evaluation, with all configuration in .env and the network code vendored in-tree.

Pure addition: 18 files, +1,799 / -0. No existing code is modified (aside from two .gitignore lines).

Pipeline stages

- build_geoprior_categ_map.py: species -> class_id map (regenerate/verify, never silent-overwrite)
- build_geoprior_json.py: train/val/test.json (excludes vision val/test gbif_ids to prevent leakage)
- train_geoprior.py
- predict_geoprior.py
- fusion_eval_top5.py

Changes

Pipeline scripts: the 5 stage scripts above, run as modules from repo root (python -m src.geoprior.<script>).

Frozen artifact + provenance

- geoprior_categ_map.json: frozen class-space contract, 12,317 species (snapshot 2026-05, 6,864,466 geocoded occurrences). Model output indices are bound to this alphabetical ordering, so it must not change while a model trained against it is in use.
- geoprior_categ_map.PROVENANCE.md: records the exact BigQuery tables + query and how to regenerate/verify.

In-tree network: geoprior_fagner/ (models.py, losses.py, dataloader.py) frozen at the upstream commit, replacing the previous hardcoded external path so the package is portable.

Config & reproducibility

- config.py + .env.sample: every path/identifier/secret lives in .env (gitignored). Nothing hardcoded; WANDB_API_KEY and GOOGLE_APPLICATION_CREDENTIALS are never committed.
- requirements.txt: pinned dependency lower bounds (BigQuery build stage + Torch/timm/wandb train stage).
- README.md: full setup, per-stage commands, and config table.
- .gitignore: ignore geo-prior checkpoints (*.pth).

Notes for reviewers

- Targets feature/bq-squashfs-pipeline (Stage 1 sources gbif_occurrence_location produced by that branch), not main.
- MIN_OCC_PER_SPECIES = 0: all species kept, no occurrence-count threshold filter.
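
To make the re-ranking idea from the summary concrete, here is a minimal sketch of one common way to fuse a geo-prior with classifier scores and measure top-5 accuracy. The PR text does not state which fusion rule fusion_eval_top5.py uses, so the elementwise product below is an assumption, as are the function names fuse_and_rank and top_k_accuracy. Both arrays are assumed to share the class ordering of the frozen category map.

```python
import numpy as np


def fuse_and_rank(classifier_probs, geo_prior, k=5, eps=1e-12):
    """Multiply classifier probabilities by the geo-prior; return top-k class ids.

    Both inputs have shape (n_samples, n_classes) with identical class ordering.
    """
    classifier_probs = np.asarray(classifier_probs, dtype=np.float64)
    geo_prior = np.asarray(geo_prior, dtype=np.float64)
    if classifier_probs.shape != geo_prior.shape:
        raise ValueError("classifier scores and prior must have the same shape")
    fused = classifier_probs * geo_prior
    # Renormalize each row so the fused scores sum to (approximately) one.
    fused = fused / (fused.sum(axis=1, keepdims=True) + eps)
    # Sorting the negated scores ascending yields descending fused scores.
    return np.argsort(-fused, axis=1, kind="stable")[:, :k]


def top_k_accuracy(ranked, labels):
    """Fraction of samples whose true label appears among the ranked ids."""
    labels = np.asarray(labels).reshape(-1, 1)
    return float((ranked == labels).any(axis=1).mean())


# Toy usage: the prior promotes class 1 over the classifier's favourite, class 0.
ranked = fuse_and_rank([[0.5, 0.3, 0.2]], [[0.1, 0.8, 0.1]], k=2)
accuracy = top_k_accuracy(ranked, [1])
```

A product rule treats the prior as a soft gate: classes the prior considers implausible at a location are pushed down the ranking but never removed outright unless their prior is exactly zero.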