Support router as replica with pipelines #3721
2fe5e14 to bafd2d9
| class ServiceRouterWorkerSyncFetcher(Fetcher[ServiceRouterWorkerSyncPipelineItem]): | ||
| @sentry_utils.instrument_named_task("pipeline_tasks.ServiceRouterWorkerSyncFetcher.fetch") |
There was a problem hiding this comment.
I recently added @sentry_utils.instrument_pipeline_task. Use it to avoid hardcoding the pipeline_tasks prefix.
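For readers unfamiliar with the pattern, here is a minimal sketch of how a prefixing decorator like this could work. It is not the actual sentry_utils code: both decorators below are hypothetical stand-ins, the real signatures may differ, and the `recorded_task_names` list only simulates instrumentation.

```python
import asyncio
import functools

# Assumption: the prefix is the one hardcoded in the decorator call above.
PIPELINE_TASKS_PREFIX = "pipeline_tasks"

# Stand-in for real instrumentation: we only record the task names.
recorded_task_names = []


def instrument_named_task(name):
    """Hypothetical stand-in: record the task name, then run the coroutine."""

    def decorator(func):
        @functools.wraps(func)
        async def wrapper(*args, **kwargs):
            recorded_task_names.append(name)
            return await func(*args, **kwargs)

        return wrapper

    return decorator


def instrument_pipeline_task(name):
    """Hypothetical wrapper that adds the prefix so callers do not repeat it."""
    return instrument_named_task(f"{PIPELINE_TASKS_PREFIX}.{name}")


class DemoFetcher:
    @instrument_pipeline_task("ServiceRouterWorkerSyncFetcher.fetch")
    async def fetch(self):
        return [1, 2, 3]


result = asyncio.run(DemoFetcher().fetch())
```

With this shape, the prefix lives in one place and each call site only names the task itself.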
| run_model = sync_row.run | ||
| if run_model is None: | ||
| await session.delete(sync_row) | ||
| await session.commit() | ||
| return |
There was a problem hiding this comment.
How can run_model be None here?
There was a problem hiding this comment.
I was thinking of the case where the run row is hard-deleted, so that sync_row.run becomes None. If this is not possible, we can delete this block.
There was a problem hiding this comment.
But you defined run_id as non-optional with ondelete="CASCADE" - how can it be possible?
There was a problem hiding this comment.
You are right. I think I should delete this block.
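The point about a non-nullable foreign key with ON DELETE CASCADE can be illustrated with a small standalone sketch. It uses SQLite through the standard library rather than the project's SQLAlchemy models, and the table and column names are simplified assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite enforces foreign keys only when this pragma is enabled.
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE runs (id INTEGER PRIMARY KEY)")
conn.execute(
    """
    CREATE TABLE service_router_worker_sync (
        id INTEGER PRIMARY KEY,
        run_id INTEGER NOT NULL REFERENCES runs(id) ON DELETE CASCADE
    )
    """
)
conn.execute("INSERT INTO runs (id) VALUES (1)")
conn.execute("INSERT INTO service_router_worker_sync (id, run_id) VALUES (10, 1)")

# Hard-deleting the run also removes the dependent sync row.
conn.execute("DELETE FROM runs WHERE id = 1")
remaining_sync_rows = conn.execute(
    "SELECT COUNT(*) FROM service_router_worker_sync"
).fetchone()[0]
```

Because the sync row disappears together with its run, a surviving sync row always has a run to point at, which is why the None check is unreachable under these constraints.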
| .options( | ||
| selectinload(RunModel.project), | ||
| selectinload(RunModel.jobs).selectinload(JobModel.project), | ||
| selectinload(RunModel.jobs) | ||
| .selectinload(JobModel.instance) | ||
| .selectinload(InstanceModel.project), | ||
| ) | ||
| ) |
There was a problem hiding this comment.
This is potentially a very inefficient select: a run can have thousands of job submissions. Select only the jobs that the processing needs, i.e. only the router replica job. Also, every selectinload will be a separate query here, and I'm not sure that's justified. joinedload may be better suited for a one-to-one relationship. Also, try to avoid loading all of the models' columns and use load_only to select only the necessary ones.
There was a problem hiding this comment.
Please check whether the query proposed below addresses the concerns:
- Avoid loading thousands of job submissions: RunModel.jobs is no longer loaded unconditionally. selectinload(RunModel.jobs.and_(...)) restricts the loaded jobs to RUNNING, registered replicas, which are the only ones sync_router_workers_for_run_model() can use (router job selection and worker list building both ignore non-running and unregistered jobs).
- selectinload is intentional: RunModel.jobs is a one-to-many collection; using joinedload would duplicate the RunModel row per job.
- joinedload for one-to-one/many-to-one: RunModel.project, JobModel.project, JobModel.instance, and InstanceModel.project are loaded with joinedload because these are scalar relationships from run, job, and instance.
- Use load_only: this limits the loaded columns to those required by sync_router_workers_for_run_model(run_for_sync) and _get_service_replica_client(job_model).
res = await session.execute(
    select(RunModel)
    .where(RunModel.id == item.run_id)
    .options(
        load_only(RunModel.id, RunModel.run_spec),
        selectinload(
            RunModel.jobs.and_(
                JobModel.status == JobStatus.RUNNING,
                JobModel.registered == true(),
            )
        )
        .load_only(
            JobModel.id,
            JobModel.status,
            JobModel.registered,
            JobModel.job_spec_data,
            JobModel.job_provisioning_data,
            JobModel.job_runtime_data,
        )
        .options(
            joinedload(JobModel.project).load_only(ProjectModel.id, ProjectModel.ssh_private_key),
            joinedload(JobModel.instance)
            .load_only(InstanceModel.id, InstanceModel.remote_connection_info)
            .joinedload(InstanceModel.project)
            .load_only(ProjectModel.id, ProjectModel.ssh_private_key),
        ),
    )
)
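To make the selectinload versus joinedload trade-off concrete, here is a rough sketch of the SQL shapes involved. It uses plain SQLite with simplified, assumed table and column names, and only approximates what the ORM would emit.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE runs (id INTEGER PRIMARY KEY, run_spec TEXT)")
conn.execute(
    "CREATE TABLE jobs (id INTEGER PRIMARY KEY, run_id INTEGER, status TEXT, registered INTEGER)"
)
conn.execute("INSERT INTO runs VALUES (1, 'spec')")
conn.executemany(
    "INSERT INTO jobs VALUES (?, ?, ?, ?)",
    [(1, 1, "running", 1), (2, 1, "running", 1), (3, 1, "terminated", 0)],
)

# Join shape: the run's columns are repeated once per matching job row.
joined_rows = conn.execute(
    "SELECT runs.id, runs.run_spec, jobs.id FROM runs JOIN jobs ON jobs.run_id = runs.id"
).fetchall()

# Select-in shape: one query for the run, one filtered query for its jobs.
run_rows = conn.execute("SELECT id, run_spec FROM runs WHERE id = 1").fetchall()
run_ids = [row[0] for row in run_rows]
placeholders = ",".join("?" for _ in run_ids)
filtered_jobs = conn.execute(
    f"SELECT id FROM jobs WHERE run_id IN ({placeholders}) "
    "AND status = 'running' AND registered = 1 ORDER BY id",
    run_ids,
).fetchall()
```

In this toy data the join returns three rows that all repeat the run's columns, while the select-in shape returns the run once plus only the two running, registered jobs.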
There was a problem hiding this comment.
looks good, at least at a glance
| router_jobs = [ | ||
| j | ||
| for j in run_model.jobs | ||
| if job_belongs_to_group(j, group_name) and j.status == JobStatus.RUNNING | ||
| ] | ||
| if not router_jobs or not is_replica_registered(router_jobs): | ||
| return None | ||
| return router_jobs[0] |
There was a problem hiding this comment.
Can there be multiple router jobs? If so, how does that work?
There was a problem hiding this comment.
For the first iteration, I suggest restricting the router replica group to count: 1 via configuration validation. The current sync logic effectively assumes a single active router job. We can extend this later to support multiple router replicas for HA.
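A minimal sketch of what such a validation could look like, assuming a simplified replica group shape. `ReplicaGroup` and `validate_router_replica_groups` are hypothetical names for illustration, not the existing dstack configuration models.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ReplicaGroup:
    """Hypothetical, simplified stand-in for a replica group configuration."""

    name: str
    count: int
    router: Optional[dict] = None


def validate_router_replica_groups(groups: List[ReplicaGroup]) -> None:
    """Reject any router replica group whose count is not exactly 1."""
    for group in groups:
        if group.router is not None and group.count != 1:
            raise ValueError(
                f"Replica group {group.name!r} has a router and must set count: 1, "
                f"got count: {group.count}"
            )


valid_groups = [
    ReplicaGroup(name="router", count=1, router={"type": "sglang"}),
    ReplicaGroup(name="workers", count=4),
]
validate_router_replica_groups(valid_groups)  # passes silently
```

Rejecting the configuration up front keeps the sync logic's single-router assumption explicit instead of relying on router_jobs[0].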
| def run_spec_has_router_replica_group(run_spec: RunSpec) -> bool: | ||
| if run_spec.configuration.type != "service": | ||
| return False | ||
| cfg = run_spec.configuration | ||
| if not isinstance(cfg, ServiceConfiguration): | ||
| return False | ||
| return any(g.router is not None for g in cfg.replica_groups) | ||
| async def ensure_service_router_worker_sync_row( |
There was a problem hiding this comment.
Why put these router-specific functions at the top of runs services?
There was a problem hiding this comment.
I kept them there because they are used by the run lifecycle. Should I move them to src/dstack/_internal/server/services/router_worker_sync.py?
There was a problem hiding this comment.
I mean at least they should not be at the top of the file.
| if not run_spec_has_router_replica_group(run_spec): | ||
| return | ||
| res = await session.execute( | ||
| select(ServiceRouterWorkerSyncModel.id).where( | ||
| ServiceRouterWorkerSyncModel.run_id == run_model.id | ||
| ) | ||
| ) | ||
| if res.scalar_one_or_none() is not None: | ||
| return |
There was a problem hiding this comment.
How can it be that ServiceRouterWorkerSyncModel already exists for a run if ensure_service_router_worker_sync_row is called only on run submit?
| return | ||
| run_model = sync_row.run | ||
| if run_model is None: | ||
| await session.delete(sync_row) |
There was a problem hiding this comment.
We generally use soft deletes in the dstack server for easier debugging and to keep historical data. Assuming there will be very few ServiceRouterWorkerSyncModel rows (one per service replica router), I'd also soft-delete it for consistency.
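As a rough illustration of the soft-delete approach, the sketch below marks a row as deleted instead of removing it. It uses SQLite directly, and the `deleted` column name is an assumption rather than the model's actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE service_router_worker_sync (
        id INTEGER PRIMARY KEY,
        run_id INTEGER NOT NULL,
        deleted INTEGER NOT NULL DEFAULT 0  -- assumed soft-delete flag
    )
    """
)
conn.execute("INSERT INTO service_router_worker_sync (id, run_id) VALUES (10, 1)")

# Soft delete: flag the row instead of issuing DELETE.
conn.execute("UPDATE service_router_worker_sync SET deleted = 1 WHERE id = 10")

# Processing queries skip soft-deleted rows...
active_rows = conn.execute(
    "SELECT id FROM service_router_worker_sync WHERE deleted = 0"
).fetchall()
# ...while the row itself stays available for debugging and history.
all_rows = conn.execute("SELECT id, deleted FROM service_router_worker_sync").fetchall()
```

The cost is that every processing query has to filter on the flag, which is acceptable when the table holds only a handful of rows.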
| ) | ||
| class ServiceRouterWorkerSyncModel(PipelineModelMixin, BaseModel): |
There was a problem hiding this comment.
Let's put it somewhere at the end of the file so that "core" models come first.
| @@ -0,0 +1,49 @@ | |||
| """SSH-tunneled async HTTP client to a job's service port (same path as probes).""" | |||
There was a problem hiding this comment.
put this file in jobs services?
| @@ -0,0 +1,345 @@ | |||
| """Reconcile SGLang router /workers with dstack's registered worker replicas (async, SSH-tunneled).""" | |||
There was a problem hiding this comment.
put this file in runs services
r4victor
left a comment
There was a problem hiding this comment.
Did a quick review of the pipeline code. Haven't looked into the worker sync logic.
e155d17 to 7b268cb
| async def _stream_response_body_bytes(resp: Response, max_bytes: int) -> bytes: | ||
| buf = bytearray() | ||
| async for chunk in resp.aiter_bytes(): | ||
| buf.extend(chunk) | ||
| if len(buf) > max_bytes: | ||
| raise _ResponseTooLargeError() | ||
| return bytes(buf) |
There was a problem hiding this comment.
(nit) We have the join_byte_stream_checked function that appears to do the same thing
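For context, a size-checked stream reader of this kind can be sketched as below. This is not the existing join_byte_stream_checked helper, whose signature is not shown here; `read_stream_checked` and `ResponseTooLargeError` are hypothetical names.

```python
import asyncio
from typing import AsyncIterator


class ResponseTooLargeError(Exception):
    """Raised when the stream exceeds the allowed size (hypothetical name)."""


async def read_stream_checked(chunks: AsyncIterator[bytes], max_bytes: int) -> bytes:
    """Join chunks into one bytes object, failing once max_bytes is exceeded."""
    buf = bytearray()
    async for chunk in chunks:
        buf.extend(chunk)
        if len(buf) > max_bytes:
            raise ResponseTooLargeError(f"response exceeds {max_bytes} bytes")
    return bytes(buf)


async def demo_chunks() -> AsyncIterator[bytes]:
    # Simulates a streamed HTTP response body.
    for chunk in (b"abc", b"def"):
        yield chunk


body = asyncio.run(read_stream_checked(demo_chunks(), max_bytes=10))
```

Checking the size after each chunk bounds memory use to roughly max_bytes plus one chunk, rather than buffering an arbitrarily large body first.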
3bc04df to 8fe01e5
| fleets: [pd-disagg] | ||
| # Custom probe is required for PD disaggregation | ||
| # Custom probe is required for PD disaggregation. |
There was a problem hiding this comment.
(nit) By the way, is it still required? I thought sync_router_workers_for_run_model can gracefully handle the router or workers not being ready, and perform the registration eventually, once they become ready
There was a problem hiding this comment.
Yes, this is still required. The probe queries /v1/chat/completions to register the job, but the router cannot serve /v1/chat/completions until workers are registered. Meanwhile, the router-worker sync pipeline only considers RUNNING jobs that are also registered=True.
There was a problem hiding this comment.
Oh, I see, so our default probe is the problem. But I assume it's possible to work around it by either setting probes: [], or not setting model. If that's the case, a custom probe is more of a recommendation, not a strict requirement.
Anyways, I think we were going to improve the UX here by introducing a different default probe for services with the SGLang router. Not in this PR, of course.
| set_processed_update_map_fields(early_cleanup_update_map) | ||
| set_unlock_update_map_fields(early_cleanup_update_map) | ||
| now = get_current_datetime() | ||
| resolve_now_placeholders(early_cleanup_update_map, now=now) | ||
| await session.execute( | ||
| update(ServiceRouterWorkerSyncModel) | ||
| .where( | ||
| ServiceRouterWorkerSyncModel.id == item.id, | ||
| ServiceRouterWorkerSyncModel.lock_token == item.lock_token, | ||
| ) | ||
| .values(**early_cleanup_update_map) | ||
| await _update_sync_row_or_log_lock_token_changed( |
There was a problem hiding this comment.
(nit) Identical set_processed_update_map_fields, set_unlock_update_map_fields, and resolve_now_placeholders calls are also repeated in three places in this method. It's worth moving them inside _update_sync_row_or_log_lock_token_changed
@jvstme Done
The design document for this PR is here.