
Conversation

Contributor

Copilot AI commented Dec 15, 2025

The average token count is hardcoded in rate_limiter.py, making concurrency estimates inaccurate. Token consumption differs widely between Insert and Query scenarios, so a fixed conservative value caps system throughput.

Changes

  • src/config.py: each service's config class gains an avg_tokens_per_request field (see the sketch after this list)

    • LLM/DS_OCR: 3500 (default)
    • Embedding: 20000
    • Rerank: 500
  • src/rate_limiter.py:

    • get_rate_limiter() gains an avg_tokens_per_request parameter
    • default values extracted into a SERVICE_DEFAULTS dictionary
  • src/tenant_config.py: each service's merge method passes avg_tokens_per_request through

  • src/multi_tenant.py / src/deepseek_ocr_client.py: call sites updated to pass the new parameter

  • env.example: document the new environment variables
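
A minimal sketch of the new field and the extracted defaults, assuming a dataclass-based config. Only the field name, the default values, and the SERVICE_DEFAULTS name come from the change list above; the surrounding class structure is illustrative, not the repository's actual code:

import os
from dataclasses import dataclass, field

# Defaults extracted from rate_limiter.py into a single table
# (per the change list above).
SERVICE_DEFAULTS = {
    "llm": {"avg_tokens_per_request": 3500},
    "ds_ocr": {"avg_tokens_per_request": 3500},
    "embedding": {"avg_tokens_per_request": 20000},
    "rerank": {"avg_tokens_per_request": 500},
}

def _env_int(name: str, default: int) -> int:
    # An environment variable, when set, wins over the code default.
    raw = os.getenv(name)
    return int(raw) if raw else default

@dataclass
class LLMConfig:
    avg_tokens_per_request: int = field(
        default_factory=lambda: _env_int(
            "LLM_AVG_TOKENS_PER_REQUEST",
            SERVICE_DEFAULTS["llm"]["avg_tokens_per_request"],
        )
    )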

Configuration priority

  1. Tenant config (API)
  2. Environment variable (LLM_AVG_TOKENS_PER_REQUEST)
  3. Code default
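
A hypothetical helper showing this resolution order; the function name and signature are illustrative, not taken from the repository:

import os

def resolve_avg_tokens_per_request(tenant_value, env_var, code_default):
    # 1. Tenant config set via the API wins outright.
    if tenant_value is not None:
        return tenant_value
    # 2. Otherwise use the environment variable, if set.
    raw = os.getenv(env_var)
    if raw:
        return int(raw)
    # 3. Finally fall back to the code default.
    return code_default

# resolve_avg_tokens_per_request(4000, 'LLM_AVG_TOKENS_PER_REQUEST', 3500) -> 4000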

Usage example

# Insert-heavy workload: lower the average token estimate to raise concurrency
LLM_AVG_TOKENS_PER_REQUEST=2500
// Tenant API
PUT /tenants/{tenant_id}/config
{
  "llm_config": {
    "avg_tokens_per_request": 4000
  }
}
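
As a rough check on the effect, using the SiliconFlow TPM of 40000 cited in the review thread below: lowering the LLM average from 3500 to 2500 raises the startup concurrency estimate from int(40000 / 3500) = 11 to 40000 / 2500 = 16 requests.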
Original prompt

This section details the original issue you should resolve

<issue_title>Average tokens per request hardcoded - inaccurate concurrency calculation</issue_title>
<issue_description>## Problem description
The average token count is hardcoded, making concurrency calculation inaccurate.

## Affected files

  • src/rate_limiter.py, lines 420-423

## Hardcoded values

  • llm: 3500
  • embedding: 500
  • rerank: 500
  • ds_ocr: 3500

## Problem

These values vary by scenario (Query vs. Insert) but are fixed at conservative levels, causing unnecessary rate limiting and reducing system throughput.

## Solution

The average token count should be configurable per tenant/service.</issue_description>
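
For context, the hardcoded block being replaced (rate_limiter.py, lines 420-423) plausibly resembled the following reconstruction; it is inferred from the values listed in the issue, not copied from the actual source:

# Reconstruction, not the actual source: fixed per-service averages.
AVG_TOKENS_PER_REQUEST = {
    "llm": 3500,
    "embedding": 500,
    "rerank": 500,
    "ds_ocr": 3500,
}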

Comments on the Issue (you are @copilot in this section)



Copilot AI and others added 2 commits December 15, 2025 09:43
Co-authored-by: BukeLy <19304666+BukeLy@users.noreply.github.com>
Co-authored-by: BukeLy <19304666+BukeLy@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Update average tokens per request for better concurrency calculation" to "feat: support per-tenant/per-service average token configuration to optimize concurrency calculation" Dec 15, 2025
Copilot AI requested a review from BukeLy December 15, 2025 09:48
Owner

@BukeLy BukeLy left a comment

ds_ocr.avg_tokens_per_request
ds_ocr.tokens_per_minute
What are these two parameters for? Are they used for estimation when TPM/RPM is not set? Also, I recall I introduced the Transformers tokenizer for token counting, so why are these two parameters still needed?

Comment on lines -83 to +84
- tokens_per_minute=getattr(self.config, 'tokens_per_minute', config.ds_ocr.tokens_per_minute)
+ tokens_per_minute=getattr(self.config, 'tokens_per_minute', config.ds_ocr.tokens_per_minute),
+ avg_tokens_per_request=getattr(self.config, 'avg_tokens_per_request', config.ds_ocr.avg_tokens_per_request)
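
The getattr fallback in this hunk lets a tenant-level avg_tokens_per_request on self.config take precedence over the global config.ds_ocr default, matching the priority order described in the PR body above.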

Contributor Author

These two parameters serve different purposes:

  1. tokens_per_minute - the API provider's TPM limit (e.g. SiliconFlow's 40000), used by the RateLimiter to cap token consumption per minute and prevent 429 errors.

  2. avg_tokens_per_request - used at startup to estimate a safe concurrency level. calculate_optimal_concurrent() computes: concurrent = TPM / avg_tokens / max_in_flight

Although count_tokens() uses tiktoken to count each request's actual tokens precisely (deducted from the TPM quota at runtime), avg_tokens_per_request estimates at initialization how many concurrent requests should be allowed. These are two distinct uses:

  • Runtime: count_tokens() measures actual consumption precisely
  • Initialization: avg_tokens_per_request estimates the concurrency ceiling

Average token counts differ widely between scenarios (Insert vs. Query), so this estimate is now configurable to optimize concurrency.
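
A sketch of the startup estimate described in this reply, following the quoted formula; the function name appears in the reply, but the exact signature and the max_in_flight semantics are assumptions:

def calculate_optimal_concurrent(tokens_per_minute: int,
                                 avg_tokens_per_request: int,
                                 max_in_flight: float = 1.0) -> int:
    # Requests that fit into one minute's token budget (TPM / avg_tokens)...
    requests_per_minute = tokens_per_minute / avg_tokens_per_request
    # ...divided by the assumed in-flight factor, per the formula above.
    return max(1, int(requests_per_minute / max_in_flight))

# With the reply's SiliconFlow example: 40000 TPM and the 3500 default
# give int(40000 / 3500) = 11 concurrent requests.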

