feat: support configuring average token counts per tenant/service to optimize concurrency calculation #27
base: main
Conversation
Co-authored-by: BukeLy <19304666+BukeLy@users.noreply.github.com>
BukeLy
left a comment
ds_ocr.avg_tokens_per_request
ds_ocr.tokens_per_minute

What are these two parameters for? Are they for estimation when TPM/RPM isn't set? Also, I remember I brought in the Transform tokenizer for token counting, so why are these two parameters still needed?
tokens_per_minute=getattr(self.config, 'tokens_per_minute', config.ds_ocr.tokens_per_minute)
tokens_per_minute=getattr(self.config, 'tokens_per_minute', config.ds_ocr.tokens_per_minute),
avg_tokens_per_request=getattr(self.config, 'avg_tokens_per_request', config.ds_ocr.avg_tokens_per_request)
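The `getattr` fallback in the diff lines above can be sketched as follows. This is a minimal illustration, assuming simplified stand-in classes; the class names and default values here are hypothetical, not the project's actual definitions:

```python
# Minimal sketch of the fallback pattern in the diff above: a tenant-level
# config may override avg_tokens_per_request; otherwise the global service
# default (config.ds_ocr.avg_tokens_per_request in the diff) applies.
# Class names and numbers are illustrative, not the project's actual ones.
from dataclasses import dataclass
from typing import Optional

@dataclass
class ServiceDefaults:          # stands in for config.ds_ocr
    tokens_per_minute: int = 40000
    avg_tokens_per_request: int = 5000

@dataclass
class TenantConfig:             # stands in for self.config
    avg_tokens_per_request: Optional[int] = None

def resolve_avg_tokens(tenant: TenantConfig, service: ServiceDefaults) -> int:
    # Mirrors getattr(self.config, 'avg_tokens_per_request', default):
    # the tenant value wins when present, else the service default applies.
    value = getattr(tenant, "avg_tokens_per_request", None)
    return value if value is not None else service.avg_tokens_per_request

print(resolve_avg_tokens(TenantConfig(avg_tokens_per_request=2500), ServiceDefaults()))  # 2500
print(resolve_avg_tokens(TenantConfig(), ServiceDefaults()))                             # 5000
```

Note that the real diff relies on `getattr`'s third argument to handle a missing attribute, whereas this sketch also treats an explicit `None` as "unset".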
ds_ocr.avg_tokens_per_request
ds_ocr.tokens_per_minute

What are these two parameters for? Are they for estimation when TPM/RPM isn't set? Also, I remember I brought in the Transform tokenizer for token counting, so why are these two parameters still needed? @copilot
These two parameters serve different purposes:

- `tokens_per_minute` - the API provider's TPM limit (e.g. SiliconFlow's 40000), used by the RateLimiter to cap token consumption per minute and prevent 429 errors.
- `avg_tokens_per_request` - used at startup to estimate a safe concurrency level, computed in `calculate_optimal_concurrent()` as `concurrent = TPM / avg_tokens / max_in_flight`.

While `count_tokens()` uses tiktoken to count each request's actual tokens (deducted from the TPM quota at runtime), `avg_tokens_per_request` is an initialization-time estimate of how many concurrent requests should be allowed. They serve two distinct roles:

- Runtime: `count_tokens()` measures actual consumption precisely
- Initialization: `avg_tokens_per_request` estimates the concurrency ceiling

Since average token counts differ widely between scenarios (Insert vs Query), this estimate is now configurable to tune concurrency.
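The startup estimate described above can be sketched as a small function. The function name and the formula `concurrent = TPM / avg_tokens / max_in_flight` come from this PR's discussion; the signature, the `max_in_flight` handling, and the example numbers are illustrative assumptions:

```python
# Hypothetical sketch of the startup concurrency estimate. The formula is
# taken from the PR discussion; the signature and defaults are assumptions.
def calculate_optimal_concurrent(tokens_per_minute: int,
                                 avg_tokens_per_request: int,
                                 max_in_flight: float = 1.0) -> int:
    """Estimate how many requests can safely run concurrently:
    concurrent = TPM / avg_tokens / max_in_flight
    """
    if avg_tokens_per_request <= 0:
        raise ValueError("avg_tokens_per_request must be positive")
    concurrent = tokens_per_minute / avg_tokens_per_request / max_in_flight
    return max(1, int(concurrent))  # always allow at least one request

# e.g. with a 40000 TPM limit and a 2500-token average per request:
print(calculate_optimal_concurrent(40000, 2500))  # 16
```

This makes the trade-off visible: lowering `avg_tokens_per_request` (as in an Insert-heavy workload) raises the estimated concurrency, while a conservative hardcoded value would cap it.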
The average token count was hardcoded in `rate_limiter.py`, making the concurrency calculation inaccurate. Token consumption differs widely between Insert and Query scenarios, and the fixed conservative value capped system throughput.

Changes

- `src/config.py`: add an `avg_tokens_per_request` field to each service config class
- `src/rate_limiter.py`: `get_rate_limiter()` takes a new `avg_tokens_per_request` parameter; `SERVICE_DEFAULTS` dictionary
- `src/tenant_config.py`: each service's merge method passes `avg_tokens_per_request` through
- `src/multi_tenant.py` / `src/deepseek_ocr_client.py`: update call sites to pass the parameter
- `env.example`: document the environment variable

Configuration precedence
`LLM_AVG_TOKENS_PER_REQUEST`)

Usage example

# Insert-heavy workload: lower the average token estimate to raise concurrency
LLM_AVG_TOKENS_PER_REQUEST=2500

Original prompt