Add batching parameter to salesforce_sync:* tasks#797
Conversation
This commit adds a LIMIT env var to the salesforce_sync:* rake tasks to avoid overwhelming Salesforce when we sync the back-fill.
Test coverage89.44% line coverage reported by SimpleCov. |
There was a problem hiding this comment.
Pull request overview
Adds a LIMIT batching mechanism to the salesforce_sync:* rake tasks to support controlled Salesforce backfills without enqueueing an unbounded number of jobs at once.
Changes:
- Introduces
LIMIT(default 25,0= unlimited) to cap the number of jobs enqueued per task run. - Adds existence checks against Salesforce mirror tables to enqueue only missing Schools/Roles (and only Contacts that exist).
- Prints basic “checked vs enqueued” stats after each task run.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| # rake salesforce_sync:school # Default: sync up to 25 records | ||
| # LIMIT=100 rake salesforce_sync:school # Sync up to 100 records | ||
| # LIMIT=0 rake salesforce_sync:school # Sync all missing records (unlimited) | ||
| desc 'Sync all Schools to Salesforce' |
There was a problem hiding this comment.
The usage comments indicate this task syncs only missing schools and is LIMITed, but the desc still reads “Sync all Schools to Salesforce”. Consider updating the desc/wording to reflect the new behavior so operators don’t assume it will update existing Salesforce records.
| desc 'Sync all Schools to Salesforce' | |
| desc 'Sync missing Schools to Salesforce' |
| # Sync non-student Roles to Salesforce that don't already exist there. | ||
| # Usage: | ||
| # rake salesforce_sync:role # Default: sync up to 25 records | ||
| # LIMIT=100 rake salesforce_sync:role # Sync up to 100 records | ||
| # LIMIT=0 rake salesforce_sync:role # Sync all missing records (unlimited) | ||
| desc 'Sync all non-student Roles to Salesforce' |
There was a problem hiding this comment.
The docs/desc here still read like the task syncs “all” non-student roles, but the implementation only enqueues roles missing in Salesforce and stops after LIMIT. Consider updating the wording so it’s clear this is a “missing records backfill” task rather than a full resync.
| # Sync non-student Roles to Salesforce that don't already exist there. | |
| # Usage: | |
| # rake salesforce_sync:role # Default: sync up to 25 records | |
| # LIMIT=100 rake salesforce_sync:role # Sync up to 100 records | |
| # LIMIT=0 rake salesforce_sync:role # Sync all missing records (unlimited) | |
| desc 'Sync all non-student Roles to Salesforce' | |
| # Backfill non-student Roles to Salesforce when they don't already exist there. | |
| # Usage: | |
| # rake salesforce_sync:role # Default: sync up to 25 missing records | |
| # LIMIT=100 rake salesforce_sync:role # Sync up to 100 missing records | |
| # LIMIT=0 rake salesforce_sync:role # Sync all missing records (unlimited) | |
| desc 'Backfill missing non-student Roles to Salesforce' |
| limit = ENV['LIMIT'] ? ENV['LIMIT'].to_i : 25 | ||
| limit = nil if limit.zero? | ||
| enqueued = 0 |
There was a problem hiding this comment.
ENV['LIMIT'].to_i again turns invalid values into 0 (unlimited). Since this task can enqueue many jobs, it’s safer to validate LIMIT (numeric, >= 0) and abort with a clear message when it’s invalid.
| if Salesforce::Contact.exists?(pi_accounts_unique_id__c: school.creator_id) | ||
| Salesforce::ContactSyncJob.perform_later(school_id: school.id) | ||
| enqueued += 1 | ||
| break if limit && enqueued >= limit | ||
| end |
There was a problem hiding this comment.
This does Salesforce::Contact.exists? for every School, which is an N+1 query pattern. Consider batching by collecting creator_ids for each find_each batch and querying existing Contacts in one call, then enqueue only for those creator_ids that exist.
| task school: :environment do | ||
| limit = ENV['LIMIT'] ? ENV['LIMIT'].to_i : 25 | ||
| limit = nil if limit.zero? | ||
| enqueued = 0 | ||
| checked = 0 |
There was a problem hiding this comment.
These tasks now have non-trivial behavior (LIMIT parsing, existence gating, early break) but there are no rake task specs covering them. Since the repo already tests tasks under spec/lib/tasks, consider adding specs for default LIMIT, LIMIT=0, and that existing Salesforce records are skipped.
| limit = ENV['LIMIT'] ? ENV['LIMIT'].to_i : 25 | ||
| limit = nil if limit.zero? | ||
| enqueued = 0 |
There was a problem hiding this comment.
ENV['LIMIT'].to_i treats non-numeric values as 0, which then becomes nil (unlimited) here. That makes typos like LIMIT=abc potentially enqueue an unbounded number of jobs; consider validating LIMIT explicitly (e.g., parse with Integer and abort unless it’s >= 0) before proceeding.
| limit = ENV['LIMIT'] ? ENV['LIMIT'].to_i : 25 | ||
| limit = nil if limit.zero? | ||
| enqueued = 0 |
There was a problem hiding this comment.
ENV['LIMIT'].to_i has the same “invalid string becomes 0 => unlimited” issue here as in the school task. Please validate that LIMIT is numeric and >= 0, and fail fast on invalid values to avoid accidentally enqueueing an unbounded backfill.
| unless Salesforce::School.exists?(editoruuid__c: school.id) | ||
| Salesforce::SchoolSyncJob.perform_later(school_id: school.id) | ||
| enqueued += 1 | ||
| break if limit && enqueued >= limit | ||
| end |
There was a problem hiding this comment.
This uses Salesforce::School.exists? inside the find_each, which results in one Salesforce DB query per School (N+1). For large backfills, consider batching existence checks per find_each batch (load existing editoruuid__c values for the batch into a Set) to reduce query volume.
| unless Salesforce::Role.exists?(affiliation_id__c: role.id) | ||
| Salesforce::RoleSyncJob.perform_later(role_id: role.id) | ||
| enqueued += 1 | ||
| break if limit && enqueued >= limit | ||
| end |
There was a problem hiding this comment.
This loop performs Salesforce::Role.exists? per Role, creating an N+1 query pattern against the Salesforce DB. Consider batching existence checks for each find_each batch (query existing affiliation_id__c for the batch’s role IDs) before enqueueing jobs.
| # Sync creator_agree_to_ux_contact for Schools where the Contact exists in Salesforce. | ||
| # Usage: | ||
| # rake salesforce_sync:contact # Default: sync up to 25 records | ||
| # LIMIT=100 rake salesforce_sync:contact # Sync up to 100 records | ||
| # LIMIT=0 rake salesforce_sync:contact # Sync all records (unlimited) | ||
| desc 'Sync creator_agree_to_ux_contact for all Schools to Salesforce Contact' |
There was a problem hiding this comment.
The usage comment says LIMIT=0 will “Sync all records (unlimited)”, but the code only enqueues jobs for Schools whose Contact already exists in Salesforce. Consider updating this wording (and potentially the desc) so the scope/behavior is clear.
| # Sync creator_agree_to_ux_contact for Schools where the Contact exists in Salesforce. | |
| # Usage: | |
| # rake salesforce_sync:contact # Default: sync up to 25 records | |
| # LIMIT=100 rake salesforce_sync:contact # Sync up to 100 records | |
| # LIMIT=0 rake salesforce_sync:contact # Sync all records (unlimited) | |
| desc 'Sync creator_agree_to_ux_contact for all Schools to Salesforce Contact' | |
| # Sync creator_agree_to_ux_contact for Schools whose Contact already exists in Salesforce. | |
| # Usage: | |
| # rake salesforce_sync:contact # Default: enqueue up to 25 eligible schools | |
| # LIMIT=100 rake salesforce_sync:contact # Enqueue up to 100 eligible schools | |
| # LIMIT=0 rake salesforce_sync:contact # Enqueue all eligible schools (unlimited) | |
| desc 'Sync creator_agree_to_ux_contact to Salesforce Contact for Schools whose Contact already exists' |
|
A bit more context in the PR description or a linked issue would help me review and would be useful for people looking at the code in the future. My main questions are:
|
|
@zetter-rpf It was a request from the Salesforce team. I believe there are a number of post-ingest steps that may be run after import which may exhaust account credits capacity if too many arrive at once. As for running in production, the intention was to probably run two or three times with |
How do you know how big a batch to run or how frequently to run them? Without knowing that I can't judge wether this solves the problem. And I imagine this would be very useful to know for someone running this in the future (even if the answer is just ask the Salesforce team). |
|
I'm putting this into draft until I can get a bit more clarity from the Salesroce team. |
In preparation for completing the back-fill of School data into Salesforce, this commit adds a LIMIT env var to the salesforce_sync:* rake tasks to avoid overwhelming Salesforce when we sync the back-fill.
The
LIMITenv var can be set to any value, or0for no limit. Absent, it defaults to enqueueing 25 jobs.What's changed?
Salesforce::record and, if not found, enqueues a task.countreachesLIMIT.Steps to perform after deploying to production
To complete the back-fill:
LIMIT=10 bundle exec rake salesforce_sync:{school, role, contact}