[Feature] get lora capacity info#201

Merged: kevssim merged 3 commits into modelscope:main from kevssim:get_num_lora_slot on May 28, 2026.

@kevssim kevssim commented May 22, 2026

PR type

  • Bug Fix
  • New Feature
  • Document Updates
  • More Models or Datasets Support

PR information

Summary

Adds a capacity_info endpoint for querying current LoRA capacity across registered model replicas.

API

Tinker-Compatible Endpoint

GET /api/v1/capacity_info

Twinkle Endpoint

GET /api/v1/twinkle/capacity_info

Response format:

{
  "max_loras": 5,
  "used_loras": 0,
  "free_loras": 5
}

Fields:

  • max_loras: Total LoRA capacity across registered replicas.
  • used_loras: Number of currently loaded LoRA adapters.
  • free_loras: Remaining available LoRA slots.
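
A minimal sketch of how a caller might consume this response, assuming only the three fields listed above. `CapacityInfo`, `parse_capacity_info`, and `has_room_for` are hypothetical names used for illustration and are not part of the Twinkle client; the HTTP request itself is left out.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class CapacityInfo:
    """Typed view of the capacity_info response body (hypothetical helper)."""
    max_loras: int
    used_loras: int
    free_loras: int


def parse_capacity_info(body: str) -> CapacityInfo:
    """Parse the JSON body; raises KeyError if a documented field is missing."""
    data = json.loads(body)
    return CapacityInfo(
        max_loras=int(data["max_loras"]),
        used_loras=int(data["used_loras"]),
        free_loras=int(data["free_loras"]),
    )


def has_room_for(info: CapacityInfo, adapters: int = 1) -> bool:
    """True if at least `adapters` LoRA slots are reported as free."""
    return info.free_loras >= adapters


# The sample body from the PR description.
sample = '{"max_loras": 5, "used_loras": 0, "free_loras": 5}'
info = parse_capacity_info(sample)
print(info, has_room_for(info, 2))
```

A caller could check `has_room_for(info)` before requesting a new adapter, treating the answer as advisory because capacity can change between the query and the request.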

@kevssim kevssim requested a review from Yunnglin May 22, 2026 06:40

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request implements a new /capacity_info endpoint to monitor global LoRA capacity and updates the model registration logic. The review identifies a critical omission of the session_id during model registration, which is required for automatic session cleanup. It also suggests removing redundant synchronous registration logic that uses blocking Ray calls to prevent potential initialization hangs, and recommends returning Pydantic models in the client for improved type safety.

Comment thread src/twinkle/server/model/twinkle_handlers.py
Comment thread src/twinkle/server/model/app.py Outdated
Comment thread src/twinkle/server/model/app.py Outdated
Comment thread src/twinkle/server/utils/state/server_state.py Outdated
Comment thread src/twinkle_client/manager.py Outdated

Copilot AI left a comment

Pull request overview

This PR adds a server-side capacity_info API for querying global LoRA capacity (max/used/free) aggregated across registered model replicas, and wires it through both Twinkle-native and Tinker-compatible gateways as well as the Python client.

Changes:

  • Add get_capacity_info() to server state/model manager and expose it via new gateway routes (/twinkle/capacity_info and /capacity_info).
  • Register model replicas with the shared ServerState on startup to make capacity tracking meaningful across replicas.
  • Add CapacityInfoResponse to the client types and a TwinkleClient.get_capacity_info() convenience method.
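
The aggregation described above might look roughly like the following sketch. It is not the code from `model_manager.py`: the two input mappings, the function name `aggregate_capacity`, and the rule that free slots equal total minus used are all assumptions made for illustration.

```python
from typing import Dict, Set


def aggregate_capacity(
    replica_max_loras: Dict[str, int],
    replica_models: Dict[str, Set[str]],
) -> Dict[str, int]:
    """Sum LoRA capacity over registered replicas (hypothetical sketch).

    replica_max_loras: replica_id -> number of LoRA slots on that replica.
    replica_models:    replica_id -> ids of adapters currently loaded there.
    """
    max_loras = sum(replica_max_loras.values())
    # Count only adapters that live on a registered replica.
    used_loras = sum(
        len(replica_models.get(replica_id, set()))
        for replica_id in replica_max_loras
    )
    # Assumption: free slots are total minus used, never negative.
    free_loras = max(max_loras - used_loras, 0)
    return {
        "max_loras": max_loras,
        "used_loras": used_loras,
        "free_loras": free_loras,
    }


two_replicas = aggregate_capacity(
    {"replica-a": 3, "replica-b": 2},
    {"replica-a": {"adapter-1"}},
)
no_replicas = aggregate_capacity({}, {})
print(two_replicas, no_replicas)
```

With no replicas registered, every field comes out as zero, which is why replicas need to register before the numbers mean anything.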

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.

File | Description
src/twinkle/server/utils/state/server_state.py | Exposes capacity info via ServerState and ServerStateProxy, plus a blocking replica registration helper.
src/twinkle/server/utils/state/model_manager.py | Implements global capacity aggregation across registered replicas.
src/twinkle/server/model/twinkle_handlers.py | Updates Twinkle add-adapter flow to register/unregister models in server state for capacity tracking.
src/twinkle/server/model/app.py | Ensures replicas register capacity on startup (and during lifespan) to populate global capacity stats.
src/twinkle/server/gateway/twinkle_gateway_handlers.py | Adds Twinkle route /twinkle/capacity_info with a typed response model.
src/twinkle/server/gateway/tinker_gateway_handlers.py | Adds Tinker-compatible route /capacity_info returning a raw dict.
src/twinkle_client/types/server.py | Adds CapacityInfoResponse schema.
src/twinkle_client/types/__init__.py | Re-exports CapacityInfoResponse.
src/twinkle_client/manager.py | Adds a get_capacity_info() client helper method.

Comment thread src/twinkle/server/model/twinkle_handlers.py
Comment thread src/twinkle_client/manager.py Outdated

@Yunnglin Yunnglin left a comment

Thanks for the PR! The capacity_info feature looks solid overall. A few comments:


Must Fix

1. Remove the Tinker gateway /capacity_info endpoint (tinker_gateway_handlers.py)

Tinker SDK does not have a corresponding client method for capacity_info, and there's no plan to add one. This endpoint should only be exposed via the Twinkle gateway (/twinkle/capacity_info). Please remove the Tinker-compatible route.

2. Missing session_id in register_model payload (twinkle_handlers.py)

run_config.model_dump() does not include session_id. As a result, ModelRecord.session_id will be None, and when the session expires, cleanup_expired won't cascade-remove the model from _replica_models. This means capacity_info.used_loras will only increase and never decrease on session expiry.

Suggested fix:

payload = run_config.model_dump()
payload['session_id'] = session_id
await self.state.register_model(
    payload,
    token=token,
    model_id=adapter_name,
    replica_id=self.replica_id,
)
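
To make the failure mode concrete, here is a toy model of the leak. It is a sketch only: `ToyState` is a hypothetical stand-in for the server state, and the real `cleanup_expired` and `ModelRecord` may differ. The sketch assumes cleanup removes a model only when its recorded `session_id` matches an expired session.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Set


@dataclass
class ModelRecord:
    model_id: str
    session_id: Optional[str]


class ToyState:
    """Hypothetical stand-in for the server state, for illustration only."""

    def __init__(self) -> None:
        self.models: Dict[str, ModelRecord] = {}

    def register_model(self, payload: dict, model_id: str) -> None:
        # session_id is read from the payload; it is None when absent.
        self.models[model_id] = ModelRecord(model_id, payload.get("session_id"))

    def cleanup_expired(self, expired_sessions: Set[str]) -> None:
        # Only records tied to an expired session are removed.
        self.models = {
            model_id: record
            for model_id, record in self.models.items()
            if record.session_id not in expired_sessions
        }

    @property
    def used_loras(self) -> int:
        return len(self.models)


leaky = ToyState()
leaky.register_model({"rank": 8}, model_id="adapter-1")  # no session_id
leaky.cleanup_expired({"session-1"})

fixed = ToyState()
fixed.register_model({"rank": 8, "session_id": "session-1"}, model_id="adapter-1")
fixed.cleanup_expired({"session-1"})

print(leaky.used_loras, fixed.used_loras)
```

In the leaky case the record has `session_id=None`, so it never matches an expired session and stays counted as used.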

Discussion

3. Is _register_replica_at_startup (blocking) necessary? (app.py)

The lifespan handler already calls _ensure_replica_registered() asynchronously, and _on_request_start also has lazy registration. Adding a blocking ray.get() in __init__ introduces a potential hang risk if the state actor isn't ready. Could we just rely on the lifespan + lazy path and remove _register_replica_at_startup along with register_replica_blocking? Would like to hear your thoughts on the trade-off here.
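
The lazy path mentioned above can be sketched as an idempotent async guard. This is an illustration under assumptions, not the code in `app.py`: `LazyRegistrar` and `fake_register` are hypothetical, and no Ray calls are involved.

```python
import asyncio
from typing import Awaitable, Callable, List


class LazyRegistrar:
    """Runs an async registration at most once, on first use (hypothetical)."""

    def __init__(self, register: Callable[[], Awaitable[None]]) -> None:
        self._register = register
        self._registered = False
        self._lock = asyncio.Lock()

    async def ensure_registered(self) -> None:
        if self._registered:
            return  # fast path once registration has succeeded
        async with self._lock:
            # Re-check under the lock so concurrent callers register only once.
            if not self._registered:
                await self._register()
                self._registered = True


calls: List[str] = []


async def fake_register() -> None:
    await asyncio.sleep(0)  # stands in for a non-blocking call to the state actor
    calls.append("registered")


async def main() -> None:
    registrar = LazyRegistrar(fake_register)
    # Both the lifespan hook and each request could await the same guard.
    await asyncio.gather(*(registrar.ensure_registered() for _ in range(3)))


asyncio.run(main())
print(calls)
```

Because the guard is awaited rather than blocked on, a slow state actor delays registration without stalling construction of the replica. If registration raises, the flag stays unset and the next caller retries.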


Suggestions

4. Client get_capacity_info return type (manager.py)

For consistency with other client methods, consider returning CapacityInfoResponse directly instead of dict. Also move the import to the top of the file.

5. Differentiate log messages for replica registration failures (app.py)

Both _register_replica_at_startup and the lifespan handler log the same message. Consider making them distinct for easier debugging.


kevssim commented May 28, 2026

Replies to each item:


Must Fix

1. Remove the Tinker gateway /capacity_info endpoint

Removed the /capacity_info endpoint from tinker_gateway_handlers.py; only /twinkle/capacity_info remains.

2. Missing session_id in register_model payload

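Fixed, but in a slightly different way from the suggestion: session_id is promoted to an explicit kwarg of ServerState.register_model (consistent with the existing token/model_id/replica_id style), instead of mutating the dict in the caller with payload['session_id'] = session_id. As a result:

  • The caller no longer needs to inject a sideband field outside the typed CreateModelRequest.
  • The register_model call in tinker_handlers.py now also passes session_id=body.session_id explicitly, which removes the implicit difference between the two paths (previously the Tinker path carried body.session_id embedded in the payload, while the Twinkle path never passed it through at all).
  • The client type CreateModelRequest stays unchanged, so the client schema is not polluted.

After the change: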

# twinkle_handlers.py
await self.state.register_model(
    run_config.model_dump(),
    token=token,
    model_id=adapter_name,
    replica_id=self.replica_id,
    session_id=session_id,
)

# tinker_handlers.py
_model_id = await self.state.register_model(
    body.model_dump(),
    token=token,
    replica_id=self.replica_id,
    session_id=body.session_id,
)

# server_state.py: register_model falls back to payload.get('session_id') internally,
# for compatibility with existing callers that embed session_id in the payload
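
The fallback described in the comment above could be sketched as follows. `resolve_session_id` is a hypothetical helper, not a function in `server_state.py`, and the precedence shown (explicit kwarg first, then the payload) is an assumption inferred from the word "fallback".

```python
from typing import Any, Dict, Optional


def resolve_session_id(
    payload: Dict[str, Any],
    session_id: Optional[str] = None,
) -> Optional[str]:
    """Prefer the explicit kwarg; otherwise use a session_id embedded in the payload."""
    if session_id is not None:
        return session_id
    return payload.get("session_id")


explicit = resolve_session_id({"rank": 8}, session_id="session-1")
embedded = resolve_session_id({"rank": 8, "session_id": "session-2"})
missing = resolve_session_id({"rank": 8})
print(explicit, embedded, missing)
```

Under this scheme, callers that embed session_id in the payload keep working, and new callers can pass it explicitly without touching the payload.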

Discussion

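3. Trade-off of _register_replica_at_startup: ✅ agreed, removed

Removed _register_replica_at_startup and register_replica_blocking; registration now relies only on the lifespan + lazy path.

The blocking registration was originally added because there is a window between the end of the replica's __init__ and the completion of lifespan startup, during which capacity_info reads max_loras=0 / used_loras=0 / free_loras=0. After re-evaluating, the impact looks limited:

  • The window exists only on a cold start with no requests at all.
  • Any request that reaches ModelManagement triggers lazy registration, and the values self-heal immediately.
  • An under-reported free_loras errs on the conservative side: it cannot mislead a client into over-allocating; at worst the service just looks "full".

Compared with the hard failure of a ray.get() in __init__ hanging the replica when the ServerState actor is slow to start, this occasional under-read is acceptable. Agreed on the trade-off.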


Suggestions

4. Client get_capacity_info return type

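TwinkleClient.get_capacity_info now returns CapacityInfoResponse, and the CapacityInfoResponse import has been moved to the top of the file.

5. Differentiate log messages: N/A

With _register_replica_at_startup removed in item 3, the duplicate log site no longer exists, so nothing further is needed.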

@kevssim kevssim merged commit ddb9627 into modelscope:main May 28, 2026
1 of 3 checks passed
3 participants