Support gateways with multiple replicas #3960
A gateway can now have multiple replicas for
improved availability.
```yaml
type: gateway
name: example-gateway
backend: aws
region: eu-west-1
domain: example.com
certificate: null
replicas: 2
```
To balance requests between gateway replicas, add
a DNS record for each replica or set up a load
balancer outside of `dstack`. Replica hostnames
are displayed in the `dstack` CLI and UI.
```shell
$ dstack gateway list
 NAME             BACKEND          HOSTNAME        DOMAIN       DEFAULT  STATUS
 example-gateway                                   example.com  ✓        running
    replica=0     aws (eu-west-1)  34.244.128.46
    replica=1     aws (eu-west-1)  18.201.201.174
```
Limitations:
- Changing the number of replicas or redeploying
replicas is not supported.
- HTTPS is not supported. Use an external load
balancer for TLS termination.
- An unavailable gateway replica prevents any new
services or service replicas from being added.
- All replicas are bound to the same backend and
region.
Implementation notes:
- `GatewayComputeModel` now represents a gateway
replica.
- In this version, the terms "compute" and
"replica" are used interchangeably. The plan is
to switch to using exclusively "replica" later.
- In this version, replica provisioning and
termination are still done in the gateway
pipeline, for all replicas at once. The plan is
to introduce gateway replica pipelines later to
allow for independent replica processing.
```python
logger.debug(
    "%s replica %d: creating gateway compute", fmt(gateway_model), replica_num
)
gateway_compute_model = await gateways_services.create_gateway_compute(
```
What if one replica fails but others are provisioned? The successfully provisioned replicas need to be cleaned up.
The gateway will enter the `failed` status, but any successfully provisioned replicas will remain in the database and can be deleted with `dstack gateway delete`.
This is consistent with the existing handling of single-replica gateway provisioning failures. If `dstack` creates an instance for a gateway but later fails to connect to it, the instance is not cleaned up automatically and can only be removed along with the failed gateway using `dstack gateway delete`.
While this behavior may be counterintuitive and worth revisiting, I think it will be easier to address after gateway replica statuses and pipelines are introduced, which I plan to implement in the next iteration.
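For illustration only, the cleanup asked about above could take roughly the following shape. This is a sketch, not what this PR does (failed gateways keep their replicas until `dstack gateway delete`). The `create` and `terminate` callables and the function name are hypothetical stand-ins, not `dstack` APIs.

```python
import asyncio
from typing import Awaitable, Callable, List


async def provision_all_or_rollback(
    num_replicas: int,
    create: Callable[[int], Awaitable[str]],      # hypothetical: returns a replica ID
    terminate: Callable[[str], Awaitable[None]],  # hypothetical: terminates by replica ID
) -> List[str]:
    """Provision replicas one by one; if any fails, terminate the ones already created."""
    created: List[str] = []
    try:
        for replica_num in range(num_replicas):
            created.append(await create(replica_num))
    except Exception:
        # Best-effort rollback: return_exceptions=True keeps a failed
        # termination from masking the original provisioning error.
        await asyncio.gather(
            *(terminate(replica_id) for replica_id in created),
            return_exceptions=True,
        )
        raise
    return created
```

The original error is re-raised after the rollback, so the caller can still move the gateway to the failed status.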
```python
        " Set to `null` to disable. Defaults to `type: lets-encrypt`"
    ),
] = LetsEncryptGatewayCertificate()
replicas: Annotated[
```
Let's put a technical upper bound on the number of provisioned replicas, e.g. 20 (to avoid provisioning 1000 replicas in a loop).
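A minimal sketch of such a bound, written as a plain function so it runs without dependencies. The constant name, the function name, and the lower bound of 1 are assumptions; 20 is only the example value from the comment above. In the real configuration model this would more likely be a constraint on the `replicas` field.

```python
MAX_GATEWAY_REPLICAS = 20  # assumption: example value suggested in review


def validate_replicas(value: int) -> int:
    """Reject replica counts outside 1..MAX_GATEWAY_REPLICAS (hypothetical helper)."""
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(value, bool) or not isinstance(value, int):
        raise TypeError("replicas must be an integer")
    if not 1 <= value <= MAX_GATEWAY_REPLICAS:
        raise ValueError(
            f"replicas must be between 1 and {MAX_GATEWAY_REPLICAS}, got {value}"
        )
    return value
```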
```python
stats = await conn.get_stats(project_name, run_name)
if stats is None:  # Stats not fetched yet
    return None
```
If any one replica becomes unavailable, it breaks autoscaling. I expect this needs to be fixed for HA, so it's worth adding a TODO/FIXME.
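One possible direction for that TODO, sketched under assumptions: aggregate stats from the replicas that respond and return `None` only when none do. The `Stats` shape (window in seconds mapped to a request count) and the `fetchers` parameter are hypothetical and do not reflect the actual return type of `get_stats`.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Optional, Sequence

Stats = Dict[int, int]  # hypothetical shape: window (seconds) -> request count


async def collect_stats(
    fetchers: Sequence[Callable[[], Awaitable[Optional[Stats]]]],
) -> Optional[Stats]:
    """Sum per-replica stats, skipping replicas that fail or have no stats yet."""
    results = await asyncio.gather(*(f() for f in fetchers), return_exceptions=True)
    total: Stats = {}
    reported = 0
    for result in results:
        if isinstance(result, BaseException) or result is None:
            continue  # unavailable replica or stats not fetched yet
        reported += 1
        for window, count in result.items():
            total[window] = total.get(window, 0) + count
    # None only if no replica reported anything
    return total if reported else None
```

Skipping a replica undercounts the requests it served, so a real fix would need to decide whether partial stats are acceptable input for scaling decisions.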
```python
        " NOTE: if you just updated dstack from pre-0.19.25 to 0.19.25+,"
        " expect to see this warning once for every running service replica"
    ),
for conn in connections:
```
What happens if registration succeeds on some gateway replicas but not on others? It seems the job won't be considered registered and won't be unregistered from the gateway replicas where registration succeeded. The same applies to service registration.
Not sure about the consequences.
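To make the concern concrete, here is a sketch of tracking registration per gateway replica, so that a partial failure still leaves enough information to unregister from the replicas that succeeded. All names are hypothetical; the `register` and `unregister` callables stand in for the real per-connection calls.

```python
from typing import Callable, Iterable, List, Set, Tuple


def register_on_replicas(
    replica_nums: Iterable[int],
    register: Callable[[int], None],  # hypothetical: raises on failure
) -> Tuple[Set[int], List[int]]:
    """Try every replica; return (replicas that succeeded, replicas that failed)."""
    succeeded: Set[int] = set()
    failed: List[int] = []
    for replica_num in replica_nums:
        try:
            register(replica_num)
        except Exception:
            failed.append(replica_num)
        else:
            succeeded.add(replica_num)
    return succeeded, failed


def unregister_from_replicas(
    succeeded: Set[int],
    unregister: Callable[[int], None],  # hypothetical: raises on failure
) -> Set[int]:
    """Unregister only where registration succeeded; return replicas still registered."""
    remaining = set(succeeded)
    for replica_num in sorted(succeeded):
        try:
            unregister(replica_num)
        except Exception:
            continue  # leave it in `remaining` so it can be retried later
        remaining.discard(replica_num)
    return remaining
```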
#3959