Support gateways with multiple replicas #3960

Open
jvstme wants to merge 1 commit into master from gateway_replicas
@jvstme jvstme commented Jun 12, 2026

Collaborator

#3959

A gateway can now have multiple replicas for
improved availability.

```yaml
type: gateway
name: example-gateway

backend: aws
region: eu-west-1

domain: example.com

certificate: null
replicas: 2
```

To balance requests between gateway replicas, add
DNS records for each replica or set up a load
balancer outside of `dstack`. Replica hostnames
are displayed in `dstack` CLI and UI.

```shell
$ dstack gateway list
 NAME             BACKEND          HOSTNAME        DOMAIN       DEFAULT  STATUS
 example-gateway                                   example.com  ✓        running
    replica=0     aws (eu-west-1)  34.244.128.46
    replica=1     aws (eu-west-1)  18.201.201.174
```
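
As a rough illustration of the DNS option, the sketch below builds one A record per replica from the hostnames shown by `dstack gateway list`. The `ARecord` type, the `build_replica_records` helper, the wildcard record name, and the 60-second TTL are all assumptions made for this example; they are not part of `dstack`, and the records would still have to be created with your DNS provider.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ARecord:
    name: str   # record name, e.g. "*.example.com" (wildcard is an assumption)
    value: str  # IPv4 address of one gateway replica
    ttl: int    # seconds; hypothetical default, tune for your setup


def build_replica_records(domain: str, replica_ips: list[str], ttl: int = 60) -> list[ARecord]:
    """Return one A record per gateway replica (hypothetical helper)."""
    if not replica_ips:
        raise ValueError("at least one replica IP is required")
    return [ARecord(name=f"*.{domain}", value=ip, ttl=ttl) for ip in replica_ips]


# Replica IPs copied from the `dstack gateway list` output above.
records = build_replica_records("example.com", ["34.244.128.46", "18.201.201.174"])
for record in records:
    print(f"{record.name} {record.ttl} IN A {record.value}")
```

Note that plain DNS round-robin does not health-check targets, so clients may still be directed to an unavailable replica; an external load balancer avoids that.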

Limitations:
- Changing the number of replicas or redeploying
  replicas is not supported.
- HTTPS is not supported. Use an external load
  balancer for TLS termination.
- An unavailable gateway replica prevents any new
  services or service replicas from being added.
- All replicas are bound to the same backend and
  region.

Implementation notes:
- `GatewayComputeModel` now represents a gateway
  replica.
- In this version, the terms "compute" and
  "replica" are used interchangeably. The plan is
  to switch to using exclusively "replica" later.
- In this version, replica provisioning and
  termination are still done in the gateway
  pipeline, for all replicas at once. The plan is
  to introduce gateway replica pipelines later to
  allow for independent replica processing.

jvstme requested a review from r4victor on June 12, 2026, 01:10

logger.debug(
    "%s replica %d: creating gateway compute", fmt(gateway_model), replica_num
)
gateway_compute_model = await gateways_services.create_gateway_compute(

Collaborator

What if one replica fails while others are provisioned? The successfully provisioned replicas need to be cleaned up.

Collaborator Author

The gateway will enter the `failed` status, but any successfully provisioned replicas will remain in the database and can be deleted with `dstack gateway delete`.

This is consistent with the existing handling of single-replica gateway provisioning failures. If `dstack` creates an instance for a gateway but later fails to connect to it, the instance is not cleaned up automatically and can only be removed along with the failed gateway using `dstack gateway delete`.

While this behavior may be counterintuitive and worth revisiting, I think it will be easier to address after gateway replica statuses and pipelines are introduced, which I plan to implement in the next iteration.
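
For reference, the automatic cleanup raised in this thread could look roughly like the sketch below. It is not what this PR does: `provision_replica`, `terminate_replica`, and `provision_all_or_nothing` are hypothetical stand-ins that only simulate provisioning, and the real pipeline would also need to persist state between steps.

```python
import asyncio


class ProvisioningError(Exception):
    pass


async def provision_replica(replica_num: int, fail_on: set[int]) -> str:
    """Hypothetical stand-in for real provisioning; returns a fake instance ID."""
    await asyncio.sleep(0)
    if replica_num in fail_on:
        raise ProvisioningError(f"replica {replica_num} failed")
    return f"instance-{replica_num}"


async def terminate_replica(instance_id: str, terminated: list[str]) -> None:
    """Hypothetical stand-in for real termination; records what was cleaned up."""
    await asyncio.sleep(0)
    terminated.append(instance_id)


async def provision_all_or_nothing(
    replicas: int, fail_on: set[int], terminated: list[str]
) -> list[str]:
    provisioned: list[str] = []
    try:
        for replica_num in range(replicas):
            provisioned.append(await provision_replica(replica_num, fail_on))
    except ProvisioningError:
        # Roll back the replicas that did come up, then re-raise the failure.
        await asyncio.gather(*(terminate_replica(i, terminated) for i in provisioned))
        raise
    return provisioned


terminated: list[str] = []
try:
    asyncio.run(provision_all_or_nothing(3, fail_on={2}, terminated=terminated))
except ProvisioningError as e:
    print(f"failed: {e}; cleaned up: {sorted(terminated)}")
```

The trade-off is that rollback itself can fail, which is one reason per-replica statuses would make this easier to handle.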

" Set to `null` to disable. Defaults to `type: lets-encrypt`"
),
] = LetsEncryptGatewayCertificate()
replicas: Annotated[

Collaborator

Let's put a technical upper bound on the number of provisioned replicas, e.g. 20 (to avoid provisioning 1000 replicas in a loop).
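
A minimal sketch of such a bound, using a plain dataclass rather than the project's actual configuration model. `MAX_GATEWAY_REPLICAS`, `GatewayReplicasConfig`, and the default of 1 are hypothetical names and values chosen for illustration; the limit of 20 is simply the example from the comment above.

```python
from dataclasses import dataclass

MAX_GATEWAY_REPLICAS = 20  # hypothetical limit, taken from the review comment


@dataclass
class GatewayReplicasConfig:
    """Hypothetical stand-in for the `replicas` field of a gateway configuration."""

    replicas: int = 1  # assumed default

    def __post_init__(self) -> None:
        # Reject values outside the allowed range before any provisioning starts.
        if not 1 <= self.replicas <= MAX_GATEWAY_REPLICAS:
            raise ValueError(
                f"replicas must be between 1 and {MAX_GATEWAY_REPLICAS}, got {self.replicas}"
            )


config = GatewayReplicasConfig(replicas=2)
print(config.replicas)
```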

Comment on lines +630 to +632
stats = await conn.get_stats(project_name, run_name)
if stats is None:  # Stats not fetched yet
    return None

Collaborator

If any one replica becomes unavailable, autoscaling breaks. I expect this needs to be fixed for HA, so it's worth adding a TODO/FIXME.
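
One possible direction for such a fix is to merge stats from the replicas that did respond instead of returning `None` as soon as one is missing. The sketch below assumes a hypothetical stats shape (a mapping from window length in seconds to request count) and a hypothetical `merge_replica_stats` helper; the real stats objects in the PR may look different.

```python
from typing import Optional

# Hypothetical shape: {window_seconds: request_count} for one gateway replica.
ReplicaStats = dict[int, int]


def merge_replica_stats(per_replica: list[Optional[ReplicaStats]]) -> Optional[ReplicaStats]:
    """Sum request counts per window across replicas, skipping replicas without stats."""
    available = [stats for stats in per_replica if stats is not None]
    if not available:
        return None  # No replica has stats yet
    merged: ReplicaStats = {}
    for stats in available:
        for window_seconds, requests in stats.items():
            merged[window_seconds] = merged.get(window_seconds, 0) + requests
    return merged


# The second replica is unavailable, so only the first contributes.
print(merge_replica_stats([{30: 10, 60: 25}, None]))
```

Skipping a replica undercounts traffic that replica served, so an autoscaler built this way may scale down too eagerly while a replica is unreachable.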

" NOTE: if you just updated dstack from pre-0.19.25 to 0.19.25+,"
" expect to see this warning once for every running service replica"
),
for conn in connections:

@r4victor r4victor Jun 15, 2026

Collaborator

What happens if the job is registered on some gateway replicas but not on others? It seems the job won't be considered registered, so it won't be unregistered from the gateway replicas where registration succeeded. The same applies to service registration.

Not sure about the consequences.
