
fix(gcp): use redis backend for celery autoscaler when broker is redis#803

Merged: arniechops merged 2 commits into main from arnavchopra/fix-celery-autoscaler-redis-backend on Apr 5, 2026


Conversation

@arniechops (Collaborator) commented Apr 3, 2026

Summary

  • When the celery broker is redis, the autoscaler calls celery_app() without specifying backend_protocol, which defaults to "s3"
  • The S3 backend path tries to create a boto3 AWS session, which fails on GCP with ProfileNotFound: The config profile (default) could not be found
  • This causes the celery-autoscaler-redis pod to enter CrashLoopBackOff on GCP clusters
  • Fix: pass backend_protocol="redis" when broker is redis (matching how the servicebus path already passes backend_protocol="abs")
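To make the failure mode concrete, here is a dependency-free sketch of the dispatch described above. `resolve_backend_url`, `celery_app_stub`, and the local `ProfileNotFound` class are hypothetical stand-ins, not the real `_get_backend_url_and_conf` or `celery_app` from `celery_autoscaler.py`, and the URLs are placeholders. The only behavior modeled is what the summary relies on: an `"s3"` default that needs an AWS profile, versus `"redis"` and `"abs"` paths that do not.

```python
class ProfileNotFound(Exception):
    """Local stand-in for botocore's ProfileNotFound, so the sketch has no dependencies."""


def resolve_backend_url(backend_protocol: str, cloud_provider: str) -> str:
    """Hypothetical simplification of _get_backend_url_and_conf.

    Models only which branches need an AWS session; all URLs are placeholders.
    """
    if backend_protocol == "redis":
        # Redis result backend on db index 1: no AWS session involved.
        return "redis://redis.example:6379/1"
    if backend_protocol == "abs":
        # Azure Blob Storage result backend: no AWS session involved.
        return "azureblockblob://<connection-string>"
    if backend_protocol == "s3":
        # The S3 path builds a boto3 session from the "default" AWS profile,
        # which does not exist on a GCP node.
        if cloud_provider != "aws":
            raise ProfileNotFound("The config profile (default) could not be found")
        return "s3://<bucket>/<prefix>"
    raise ValueError(f"unknown backend_protocol: {backend_protocol!r}")


def celery_app_stub(broker_type: str, cloud_provider: str, backend_protocol: str = "s3") -> str:
    """Hypothetical stand-in for celery_app(); note the "s3" default."""
    # broker_type is accepted only to mirror the call sites: the backend choice
    # depends solely on backend_protocol, which is exactly why the redis broker
    # silently inherited the S3 backend.
    return resolve_backend_url(backend_protocol, cloud_provider)
```

With these stubs, `celery_app_stub("redis", "gcp")` raises `ProfileNotFound` because it inherits the `"s3"` default, while `celery_app_stub("redis", "gcp", backend_protocol="redis")` returns the Redis URL. That single keyword argument is the whole fix.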

Test plan

  • Deploy to GCP cluster and verify celery-autoscaler-redis-0 starts without crash-looping
  • Verify celery task autoscaling still works on AWS (no regression — AWS uses SQS broker, not redis)

Greptile Summary

This PR fixes a GCP-specific crash in the celery-autoscaler-redis pod by passing backend_protocol="redis" when constructing the Celery inspect app for the Redis broker path. Previously, the code fell back to the default backend_protocol="s3", which triggered a boto3 AWS session initialization (via _get_backend_url_and_conf) that fails on GCP with ProfileNotFound.

Key changes:

  • In the broker_type == "redis" branch of main(), celery_app() is now called with backend_protocol="redis", consistent with how the servicebus branch already passes backend_protocol="abs".
  • The fix is minimal and targeted — no other code paths are affected. AWS deployments using SQS are unaffected.
  • The _get_backend_url_and_conf function shows that "redis" backend resolves to get_redis_endpoint(1) (db index 1), which is a valid, always-available endpoint in a Redis-based deployment.
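One way to guard this fix against accidental reversion, alongside the manual test plan, is to check the keyword arguments each broker branch passes. The sketch below assumes the branches in `main()` could be pulled into a helper; `build_inspect_app` is hypothetical, and `unittest.mock.Mock` stands in for `celery_app`. The keyword names (`broker_type`, `backend_protocol`, `task_visibility`, `aws_role`) follow the `celery_app(...)` call quoted in the review comment on this PR.

```python
from typing import Any, Callable, Optional
from unittest.mock import Mock


def build_inspect_app(
    broker_type: str, celery_app: Callable[..., Any], **celery_kwargs: Any
) -> Optional[Any]:
    """Hypothetical extraction of the broker branches in main().

    Every broker that needs an inspect app passes backend_protocol explicitly,
    so none of them can fall through to the "s3" default.
    """
    if broker_type == "redis":
        # Added by this PR: keep the result backend on Redis.
        return celery_app(None, broker_type=broker_type, backend_protocol="redis", **celery_kwargs)
    if broker_type == "servicebus":
        # Pre-existing: Azure Blob Storage result backend.
        return celery_app(None, broker_type=broker_type, backend_protocol="abs", **celery_kwargs)
    if broker_type == "sqs":
        # No inspect app: the autoscaler reads queue state directly from SQS.
        return None
    raise ValueError(f"unsupported broker_type: {broker_type!r}")


if __name__ == "__main__":
    factory = Mock()
    build_inspect_app("redis", factory, task_visibility=1, aws_role=None)
    # Fails if backend_protocol="redis" is ever dropped from the redis branch.
    factory.assert_called_once_with(
        None, broker_type="redis", backend_protocol="redis", task_visibility=1, aws_role=None
    )
```

Returning `None` for SQS mirrors the "inspect = empty" branch in the flowchart, and raising on an unknown broker makes a future broker type fail loudly instead of silently picking up the `"s3"` default.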

Confidence Score: 5/5

Safe to merge — targeted one-line fix that correctly mirrors the existing servicebus pattern and avoids a well-understood crash on GCP.

The change is small, well-motivated, and aligns with the existing convention (servicebus uses backend_protocol='abs'). The backend_protocol='redis' path in _get_backend_url_and_conf is straightforward and does not touch AWS credentials. The only remaining feedback is a P2 style suggestion to add a clarifying comment. No logic or correctness issues found.

No files require special attention — the single changed file is straightforward.

Important Files Changed

| Filename | Overview |
|---|---|
| model-engine/model_engine_server/core/celery/celery_autoscaler.py | Adds backend_protocol='redis' when constructing the Celery inspect app for the Redis broker path, mirroring the existing backend_protocol='abs' pattern used for the Service Bus path. This prevents a boto3 AWS profile lookup from crashing the autoscaler on GCP. |

Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[main: autoscaler_broker] --> B{broker_type?}
    B -- redis --> C["celery_app with backend_protocol='redis'"]
    B -- sqs --> D["inspect = empty, reads directly from SQS"]
    B -- servicebus --> E["celery_app with backend_protocol='abs'"]

    C --> F[_get_backend_url_and_conf]
    E --> F

    F -- redis --> G["get_redis_endpoint db=1, no AWS session needed"]
    F -- s3 old default --> H{cloud_provider?}
    H -- aws --> I["session + boto3, works on AWS"]
    H -- gcp --> J["boto3 default profile, ProfileNotFound crash on GCP"]
    F -- abs --> K["azureblockblob URL, no AWS session needed"]
```
Prompt To Fix All With AI
This is a comment left during a code review.
Path: model-engine/model_engine_server/core/celery/celery_autoscaler.py
Line: 606-614

Comment:
**Add comment explaining the workaround**

The `backend_protocol="redis"` is a non-obvious workaround — without a comment, future readers won't know why it's needed here but not defaulted elsewhere. The `servicebus` block (line 622) has the same gap. A short inline comment (similar to the PR description) would make the intent clear and prevent the fix from being accidentally reverted.

```suggestion
                app=celery_app(
                    None,
                    broker_type=broker_type,
                    task_visibility=db_index,
                    aws_role=aws_profile,
                    # Use redis backend to avoid S3/boto3 session initialization,
                    # which fails on GCP with ProfileNotFound when no AWS credentials exist.
                    backend_protocol="redis",
                )
```

**Rule Used:** Add comments to explain complex or non-obvious log... ([source](https://app.greptile.com/review/custom-context?memory=928586f9-9432-435e-a385-026fa49318a2))

**Learnt From**
[scaleapi/scaleapi#126958](https://github.com/scaleapi/scaleapi/pull/126958)

How can I resolve this? If you propose a fix, please make it concise.

Reviews (2): Last reviewed commit: "fix"

@arniechops arniechops merged commit af2d2da into main Apr 5, 2026
8 checks passed
@arniechops arniechops deleted the arnavchopra/fix-celery-autoscaler-redis-backend branch April 5, 2026 21:29