[Feature] get lora capacity info by kevssim · Pull Request #201 · modelscope/twinkle

kevssim · 2026-05-22T06:40:17Z

PR type

Bug Fix
New Feature
Document Updates
More Models or Datasets Support

PR information

Summary

Adds a capacity_info endpoint for querying current LoRA capacity across registered model replicas.

API

Tinker-Compatible Endpoint

GET /api/v1/capacity_info

Twinkle Endpoint

GET /api/v1/twinkle/capacity_info

Response format:

{
  "max_loras": 5,
  "used_loras": 0,
  "free_loras": 5
}

Fields:

max_loras: Total LoRA capacity across registered replicas.
used_loras: Number of currently loaded LoRA adapters.
free_loras: Remaining available LoRA slots.
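
A minimal sketch of how a caller might consume this response, assuming only the three fields shown above. The `CapacityInfo` dataclass and the `parse_capacity_info` helper are hypothetical names used for illustration, not part of the Twinkle client, and the `used_loras + free_loras == max_loras` check is an assumption inferred from the field descriptions rather than a documented guarantee.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class CapacityInfo:
    """Hypothetical local mirror of the capacity_info response body."""
    max_loras: int
    used_loras: int
    free_loras: int

    def has_room(self, needed: int = 1) -> bool:
        # True when at least `needed` LoRA slots are still free.
        return self.free_loras >= needed


def parse_capacity_info(body: str) -> CapacityInfo:
    """Parse the JSON body and check the assumed used + free == max invariant."""
    data = json.loads(body)
    info = CapacityInfo(
        max_loras=int(data["max_loras"]),
        used_loras=int(data["used_loras"]),
        free_loras=int(data["free_loras"]),
    )
    if info.used_loras + info.free_loras != info.max_loras:
        raise ValueError(f"inconsistent capacity info: {info}")
    return info


# Sample body copied from the response format above.
example = parse_capacity_info('{"max_loras": 5, "used_loras": 0, "free_loras": 5}')
print(example.has_room(2))
```

A caller could use something like `has_room` to decide whether to request another adapter or back off.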


gemini-code-assist

Code Review

This pull request implements a new /capacity_info endpoint to monitor global LoRA capacity and updates the model registration logic. The review identifies a critical omission of the session_id during model registration, which is required for automatic session cleanup. It also suggests removing redundant synchronous registration logic that uses blocking Ray calls to prevent potential initialization hangs, and recommends returning Pydantic models in the client for improved type safety.

Copilot

Pull request overview

This PR adds a server-side capacity_info API for querying global LoRA capacity (max/used/free) aggregated across registered model replicas, and wires it through both Twinkle-native and Tinker-compatible gateways as well as the Python client.

Changes:

Add get_capacity_info() to server state/model manager and expose it via new gateway routes (/twinkle/capacity_info and /capacity_info).
Register model replicas with the shared ServerState on startup to make capacity tracking meaningful across replicas.
Add CapacityInfoResponse to the client types and a TwinkleClient.get_capacity_info() convenience method.
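
The aggregation described above could look roughly like the following sketch. `ReplicaRecord` and `aggregate_capacity` are hypothetical names, and the per-replica bookkeeping (a slot limit plus a set of loaded adapter ids) is an assumption; the actual `get_capacity_info()` in `model_manager.py` may be structured differently.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class ReplicaRecord:
    """Hypothetical per-replica bookkeeping: slot limit plus loaded adapter ids."""
    max_loras: int
    model_ids: Set[str] = field(default_factory=set)


def aggregate_capacity(replicas: Dict[str, ReplicaRecord]) -> Dict[str, int]:
    """Sum capacity and usage over all registered replicas."""
    max_loras = sum(r.max_loras for r in replicas.values())
    used_loras = sum(len(r.model_ids) for r in replicas.values())
    return {
        "max_loras": max_loras,
        "used_loras": used_loras,
        # Clamp at zero so an over-counted replica never reports negative room.
        "free_loras": max(max_loras - used_loras, 0),
    }


replicas = {
    "replica-0": ReplicaRecord(max_loras=3, model_ids={"adapter-a"}),
    "replica-1": ReplicaRecord(max_loras=2),
}
summary = aggregate_capacity(replicas)
print(summary)
```

With no replicas registered, every field is zero, which matches the cold-start reading discussed later in this thread.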

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| `src/twinkle/server/utils/state/server_state.py` | Exposes capacity info via `ServerState` and `ServerStateProxy`, plus a blocking replica registration helper. |
| `src/twinkle/server/utils/state/model_manager.py` | Implements global capacity aggregation across registered replicas. |
| `src/twinkle/server/model/twinkle_handlers.py` | Updates Twinkle add-adapter flow to register/unregister models in server state for capacity tracking. |
| `src/twinkle/server/model/app.py` | Ensures replicas register capacity on startup (and during lifespan) to populate global capacity stats. |
| `src/twinkle/server/gateway/twinkle_gateway_handlers.py` | Adds Twinkle route `/twinkle/capacity_info` with a typed response model. |
| `src/twinkle/server/gateway/tinker_gateway_handlers.py` | Adds Tinker-compatible route `/capacity_info` returning a raw dict. |
| `src/twinkle_client/types/server.py` | Adds `CapacityInfoResponse` schema. |
| `src/twinkle_client/types/__init__.py` | Re-exports `CapacityInfoResponse`. |
| `src/twinkle_client/manager.py` | Adds a `get_capacity_info()` client helper method. |

Yunnglin

Thanks for the PR! The capacity_info feature looks solid overall. A few comments:

Must Fix

1. Remove the Tinker gateway /capacity_info endpoint (tinker_gateway_handlers.py)

Tinker SDK does not have a corresponding client method for capacity_info, and there's no plan to add one. This endpoint should only be exposed via the Twinkle gateway (/twinkle/capacity_info). Please remove the Tinker-compatible route.

2. Missing session_id in register_model payload (twinkle_handlers.py)

run_config.model_dump() does not include session_id. As a result, ModelRecord.session_id will be None, and when the session expires, cleanup_expired won't cascade-remove the model from _replica_models. This means capacity_info.used_loras will only increase and never decrease on session expiry.

Suggested fix:

payload = run_config.model_dump()
payload['session_id'] = session_id
await self.state.register_model(
    payload,
    token=token,
    model_id=adapter_name,
    replica_id=self.replica_id,
)
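
To illustrate why the missing session_id matters, here is a toy model of the cascade. `ToyState` is a hypothetical stand-in, not the real `ServerState`; it only assumes that cleanup matches records by their session_id, as the comment above describes.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass
class ModelRecord:
    model_id: str
    replica_id: str
    session_id: Optional[str]


class ToyState:
    """Toy stand-in for the server state; not the real ServerState."""

    def __init__(self) -> None:
        self.models: Dict[str, ModelRecord] = {}

    def register_model(self, model_id: str, replica_id: str,
                       session_id: Optional[str] = None) -> None:
        self.models[model_id] = ModelRecord(model_id, replica_id, session_id)

    def cleanup_expired(self, expired_session_ids: Iterable[str]) -> int:
        """Drop every model whose session expired; return how many were removed."""
        expired = set(expired_session_ids)
        stale = [mid for mid, rec in self.models.items() if rec.session_id in expired]
        for mid in stale:
            del self.models[mid]
        return len(stale)

    @property
    def used_loras(self) -> int:
        return len(self.models)


state = ToyState()
state.register_model("with-session", "replica-0", session_id="s1")
state.register_model("no-session", "replica-0")  # session_id stays None
removed = state.cleanup_expired(["s1"])
print(removed, state.used_loras)
```

The record registered without a session_id survives the cleanup, so it keeps occupying a slot after its session is gone.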

Discussion

3. Is _register_replica_at_startup (blocking) necessary? (app.py)

The lifespan handler already calls _ensure_replica_registered() asynchronously, and _on_request_start also has lazy registration. Adding a blocking ray.get() in __init__ introduces a potential hang risk if the state actor isn't ready. Could we just rely on the lifespan + lazy path and remove _register_replica_at_startup along with register_replica_blocking? Would like to hear your thoughts on the trade-off here.
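
A minimal sketch of the non-blocking alternative, assuming registration is an async call that can be retried. `ReplicaRegistrar` and `ensure_registered` are hypothetical names for illustration; the real `_ensure_replica_registered()` may differ in how it tracks state and handles errors.

```python
import asyncio
from typing import Awaitable, Callable, List, Tuple


class ReplicaRegistrar:
    """Hypothetical helper: registers a replica at most once, never blocking __init__."""

    def __init__(self, register_fn: Callable[[], Awaitable[None]]) -> None:
        self._register_fn = register_fn  # async callable that performs the registration
        self._registered = False
        self._lock = asyncio.Lock()

    async def ensure_registered(self) -> bool:
        if self._registered:  # fast path once registration has succeeded
            return True
        async with self._lock:
            if self._registered:  # another task finished while we waited
                return True
            try:
                await self._register_fn()
            except Exception:
                return False  # stay unregistered so a later request retries
            self._registered = True
            return True


async def demo() -> Tuple[List[str], List[bool]]:
    calls: List[str] = []

    async def fake_register() -> None:
        calls.append("register")

    # Built inside the running loop so the lock binds to it on older Pythons.
    registrar = ReplicaRegistrar(fake_register)
    results = await asyncio.gather(*(registrar.ensure_registered() for _ in range(3)))
    return calls, list(results)


calls, results = asyncio.run(demo())
print(calls, results)
```

Both the lifespan hook and a per-request hook could call `ensure_registered()`; a failure only delays registration until the next call instead of hanging startup.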

Suggestions

4. Client get_capacity_info return type (manager.py)

For consistency with other client methods, consider returning CapacityInfoResponse directly instead of dict. Also move the import to the top of the file.

5. Differentiate log messages for replica registration failures (app.py)

Both _register_replica_at_startup and the lifespan handler log the same message. Consider making them distinct for easier debugging.

kevssim · 2026-05-28T07:40:46Z

Point-by-point replies:

Must Fix

1. Remove the Tinker gateway /capacity_info endpoint ✅

Removed the /capacity_info endpoint from tinker_gateway_handlers.py; only /twinkle/capacity_info remains.

2. Missing session_id in register_model payload ✅

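Fixed, though in a slightly different way from the suggestion: session_id is promoted to an explicit kwarg of ServerState.register_model (consistent with the existing token/model_id/replica_id style), rather than mutating the dict at the caller with payload['session_id'] = session_id. As a result:

The caller no longer needs to inject a sideband value outside the typed CreateModelRequest.
The register_model call in tinker_handlers.py now also passes session_id=body.session_id explicitly, which removes the implicit difference between the two paths (previously the Tinker path carried body.session_id embedded in the payload, while the Twinkle path never passed it at all).
The client type CreateModelRequest stays unchanged, so the client schema is not polluted.

After the change: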

# twinkle_handlers.py
await self.state.register_model(
    run_config.model_dump(),
    token=token,
    model_id=adapter_name,
    replica_id=self.replica_id,
    session_id=session_id,
)

# tinker_handlers.py
_model_id = await self.state.register_model(
    body.model_dump(),
    token=token,
    replica_id=self.replica_id,
    session_id=body.session_id,
)

# server_state.py: register_model falls back to payload.get('session_id') internally,
# to stay compatible with existing callers that embed session_id in the payload
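
The fallback rule described in the comment above can be sketched as a small helper. `resolve_session_id` is a hypothetical name for illustration; the actual logic lives inside `ServerState.register_model` and may be written differently.

```python
from typing import Any, Dict, Optional


def resolve_session_id(payload: Dict[str, Any],
                       session_id: Optional[str] = None) -> Optional[str]:
    """Prefer the explicit kwarg; fall back to a session_id embedded in the payload."""
    if session_id is not None:
        return session_id
    return payload.get("session_id")


# Explicit kwarg wins over an embedded value.
print(resolve_session_id({"session_id": "from-payload"}, "from-kwarg"))
# Callers that still embed session_id in the payload keep working.
print(resolve_session_id({"session_id": "from-payload"}))
```

Keeping the fallback means existing callers do not break while new call sites move to the explicit keyword argument.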

Discussion

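3. Trade-off of _register_replica_at_startup ✅ Agreed, removed

Removed _register_replica_at_startup and register_replica_blocking; registration now relies solely on the lifespan + lazy path.

The blocking registration was originally added because there is a window between the end of the replica's __init__ and the end of lifespan startup, during which capacity_info reads max_loras=0 / used_loras=0 / free_loras=0. On re-evaluation, the impact looks limited:

The window only exists during a cold start with no requests at all.
Any request that reaches ModelManagement triggers lazy registration, so the values self-heal immediately.
An undersized free_loras errs in the conservative direction: it cannot mislead clients into over-allocating, and at worst the service just looks "full".

Compared with the hard failure of ray.get() in __init__ hanging the replica when the ServerState actor starts slowly, this occasional under-read is acceptable. I agree with the trade-off.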

Suggestions

4. Client get_capacity_info return type ✅

TwinkleClient.get_capacity_info now returns CapacityInfoResponse, and the CapacityInfoResponse import has been moved to the top of the file.

5. Differentiate log messages: N/A

After item 3 removed _register_replica_at_startup, the duplicated log site no longer exists, so no further change is needed.

kevssim added 2 commits May 22, 2026 11:08

wip (7164772)

Fix capacity info cold start registration (cb6b88f)

kevssim requested a review from Yunnglin May 22, 2026 06:40

gemini-code-assist Bot reviewed May 22, 2026

Comment thread src/twinkle/server/model/twinkle_handlers.py
Comment thread src/twinkle/server/model/app.py (outdated)
Comment thread src/twinkle/server/model/app.py (outdated)
Comment thread src/twinkle/server/utils/state/server_state.py (outdated)
Comment thread src/twinkle_client/manager.py (outdated)

Yunnglin requested a review from Copilot May 26, 2026 03:30

Copilot started reviewing on behalf of Yunnglin May 26, 2026 03:30

Copilot AI reviewed May 26, 2026

Comment thread src/twinkle/server/model/twinkle_handlers.py
Comment thread src/twinkle_client/manager.py (outdated)

Yunnglin reviewed May 26, 2026

fix (012648a)

Yunnglin approved these changes May 28, 2026

kevssim merged commit ddb9627 into modelscope:main May 28, 2026
1 of 3 checks passed

kevssim merged 3 commits into modelscope:main from kevssim:get_num_lora_slot

3 participants
