Skip to content

[FDConfig] Reduce FD_CUSTOM_AR_MAX_SIZE_MB default from 64 to 8#6997

Open
gongweibao wants to merge 1 commit intoPaddlePaddle:developfrom
gongweibao:fixcustomarsize
Open

[FDConfig] Reduce FD_CUSTOM_AR_MAX_SIZE_MB default from 64 to 8#6997
gongweibao wants to merge 1 commit intoPaddlePaddle:developfrom
gongweibao:fixcustomarsize

Conversation

@gongweibao
Copy link
Collaborator

Motivation

The default FD_CUSTOM_AR_MAX_SIZE_MB of 64MB is unnecessarily large for most single-GPU and small-model deployments. Reducing it to 8MB lowers shared memory allocation overhead. Multi-GPU or large-model scenarios that need bigger buffers can set the env var explicitly.

Modifications

  • fastdeploy/envs.py: Change default value from 64 to 8, update comment accordingly.
  • tests/e2e/4cards_cases/test_determinism_long.py: Explicitly set FD_CUSTOM_AR_MAX_SIZE_MB=64 (was using 57 as fallback; now aligned with other test files).

Usage or Command

# Use default 8MB buffer
python -m fastdeploy.entrypoints.openai.api_server ...

# Override for multi-GPU with large tensors
export FD_CUSTOM_AR_MAX_SIZE_MB=64

Checklist

  • Add at least a tag in the PR title.
  • Format your code, run pre-commit before commit.
  • Add unit tests. No new unit tests needed — this is a config default change; existing e2e tests cover the behavior.
  • Provide accuracy results.

🤖 Generated with Claude Code

Most single-GPU and small-model deployments do not need 64MB custom
all-reduce buffers. Lowering the default to 8MB reduces unnecessary
shared memory allocation. Tests that require larger buffers now
explicitly set the value.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings March 24, 2026 13:16
@paddle-bot
Copy link

paddle-bot bot commented Mar 24, 2026

Thanks for your contribution!

@CLAassistant
Copy link

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.


gongweibao seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account.
You have signed the CLA already but the status is still pending? Let us recheck it.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

该 PR 调整自定义 all-reduce 共享 buffer 的默认配置:将 FD_CUSTOM_AR_MAX_SIZE_MB 默认值从 64MB 下调到 8MB,以降低多数小规模部署的共享内存占用,并在确定性相关 e2e 用例中显式使用更大的 buffer 以保持稳定性。

Changes:

  • FD_CUSTOM_AR_MAX_SIZE_MB 默认值从 64 改为 8,并更新对应注释
  • tests/e2e/4cards_cases/test_determinism_long.py 中该环境变量的默认回退值改为 64

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File Description
fastdeploy/envs.py 下调 custom all-reduce buffer 默认值并更新注释
tests/e2e/4cards_cases/test_determinism_long.py 调整 determinism long e2e 用例中 buffer 的默认回退值

Comment on lines +228 to 229
# Custom all-reduce max buffer size in MB (default 8MB).
# Increase this to avoid NCCL fallback for large tensors in deterministic mode.
Copy link

Copilot AI Mar 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这里的注释与实际行为不一致:在 FD_DETERMINISTIC_MODE=1 时,如果输入 tensor 超过 max_size,会直接抛 RuntimeError(communication._ensure_deterministic_ready),不会“fallback 到 NCCL”。建议把注释改成“超大 tensor 会报错/需要调大该值以满足 deterministic all-reduce 的 max_size 限制”,避免误导。

Suggested change
# Custom all-reduce max buffer size in MB (default 8MB).
# Increase this to avoid NCCL fallback for large tensors in deterministic mode.
# Custom deterministic all-reduce max buffer size in MB (default 8MB).
# When FD_DETERMINISTIC_MODE=1, tensors larger than this limit will raise an error
# instead of falling back to NCCL. Increase this value to avoid max_size errors.

Copilot uses AI. Check for mistakes.
# Increase this to avoid NCCL fallback for large tensors in deterministic mode.
# E.g. FD_CUSTOM_AR_MAX_SIZE_MB=128 for 128MB.
"FD_CUSTOM_AR_MAX_SIZE_MB": lambda: int(os.getenv("FD_CUSTOM_AR_MAX_SIZE_MB", "64")),
"FD_CUSTOM_AR_MAX_SIZE_MB": lambda: int(os.getenv("FD_CUSTOM_AR_MAX_SIZE_MB", "8")),
Copy link

Copilot AI Mar 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FD_CUSTOM_AR_MAX_SIZE_MB 这里直接 int(os.getenv(...)),未校验取值范围;如果被设置为 0/负数,会导致 CustomAllreduce 分配共享 buffer 时 size_in_bytes 非法并在更底层报错,排查困难。建议新增一个校验(类似 _validate_split_kv_size),确保该值为正整数(例如 >=1),并在不合法时给出明确异常信息。

Copilot uses AI. Check for mistakes.
"CUDA_VISIBLE_DEVICES": os.environ.get("CUDA_VISIBLE_DEVICES", "0,1,2,3"),
"FD_DETERMINISTIC_MODE": "1",
"FD_CUSTOM_AR_MAX_SIZE_MB": os.environ.get("FD_CUSTOM_AR_MAX_SIZE_MB", "57"),
"FD_CUSTOM_AR_MAX_SIZE_MB": os.environ.get("FD_CUSTOM_AR_MAX_SIZE_MB", "64"),
Copy link

Copilot AI Mar 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

该测试这里允许外部环境变量覆盖 FD_CUSTOM_AR_MAX_SIZE_MB;如果 CI/本地恰好设置为更小值(比如现在默认 8),可能导致 deterministic all-reduce 在大 tensor 场景直接报错而引入不稳定。考虑与同目录其他 determinism e2e 测试保持一致,直接固定为 "64"(如 _test_determinism_offline.py),让用例结果不受外部环境影响。

Suggested change
"FD_CUSTOM_AR_MAX_SIZE_MB": os.environ.get("FD_CUSTOM_AR_MAX_SIZE_MB", "64"),
"FD_CUSTOM_AR_MAX_SIZE_MB": "64",

Copilot uses AI. Check for mistakes.
@codecov-commenter
Copy link

Codecov Report

✅ All modified and coverable lines are covered by tests.
⚠️ Please upload report for BASE (develop@6f5aa88). Learn more about missing BASE report.

Additional details and impacted files
@@            Coverage Diff             @@
##             develop    #6997   +/-   ##
==========================================
  Coverage           ?   73.85%           
==========================================
  Files              ?      399           
  Lines              ?    56045           
  Branches           ?     8849           
==========================================
  Hits               ?    41392           
  Misses             ?    11727           
  Partials           ?     2926           
Flag Coverage Δ
GPU 73.85% <ø> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants