【Hackathon 10th Spring No.45】[Build] SM-tier compile guards for T4/V100 support#6941
Conversation
|
Thanks for your contribution! |
141b8e5 to
520b220
Compare
520b220 to
8f74ea3
Compare
|
Aware of PR #6488 which targets the same task. This PR takes a lighter approach (+47 lines vs +73) with a smaller guard surface. Happy to defer to whichever implementation the maintainers prefer — this PR is conflict-free against current |
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## develop #6941 +/- ##
==========================================
Coverage ? 72.07%
==========================================
Files ? 399
Lines ? 55950
Branches ? 8828
==========================================
Hits ? 40324
Misses ? 12785
Partials ? 2841
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
c615280 to
c00b87a
Compare
…ed compile guards Wholesale replace cpp_extensions.cc and setup_ops.py with our AI Studio V100-verified implementation (pipeline p-1051a228d3c7). Changes vs merged PaddlePaddle#6488: - cpp_extensions.cc: Add #ifdef ENABLE_SCALED_MM_C2X guard for 5 cutlass/FP8 ops (linker error on SM70 without guard) - cpp_extensions.cc: Add #ifdef ENABLE_SM80_EXT_OPS guard for 7 tail MoE ops (linker error on SM70/SM75 without guard) - setup_ops.py: Fix ENABLE_SM80_EXT_OPS placement (cc>=80, not cc>=75) - setup_ops.py: Remove get_compile_parallelism() scope creep (26 lines, functionally identical to 1-liner)
c00b87a to
ebe5356
Compare
Bug report: merged PR #6488 has two linker-error bugsSince #6488 was merged as Bug 1 — 5 unguarded cutlass/FP8 ops: Bug 2 — 7 unguarded tail MoE/MLA ops: Additional issue — scope creep: This PR replaces both See PR #6977 for a minimal additive-only alternative (4 lines, |
|
补充说明一下当前 CI 状态:
如果 reviewer 希望我再补充更细的验证说明或补一个更轻量的 targeted test,我可以继续跟进;当前这条红灯更接近 policy mismatch,不是编译守卫逻辑失效。 |
Motivation
Task 45 requires FastDeploy's
custom_opsto compile on T4 (SM75) and V100 (SM70) GPUs. Currently,cpp_extensions.ccregisters all 117 ops unconditionally, causing link errors when SM80+-only CUDA kernels (MoE, MLA, speculative decoding, append attention) are absent from the build.This PR adds conditional compilation guards to
cpp_extensions.ccand corresponding macro definitions insetup_ops.py, gating SM80+ op bindings behindENABLE_SM80_EXT_OPS, SM75+ ops behindENABLE_SM75_EXT_OPS/ENABLE_SCALED_MM_C2X, and SM70'sgelu_tanhbehindDISABLE_GELU_TANH_OP.Modifications
cpp_extensions.cc(+28 lines)14 guard blocks wrapping 78 of 117 ops (updated after merge with latest upstream):
ENABLE_SM80_EXT_OPSENABLE_SM75_EXT_OPSENABLE_SCALED_MM_C2XDISABLE_GELU_TANH_OPThe remaining 39 ops (per_token_quant, get_padding_offset, fused_rotary_position_encoding, noaux_tc, etc.) compile on all SM tiers and remain unguarded.
setup_ops.py(+19 lines, -1 line)ENABLE_SM75_EXT_OPSadded to bothcc_compile_argsandnvcc_compile_argsatcc >= 75— also addsmoe_deepgemm_permute.cuandmoe_deepgemm_depermute.cusources (these kernels have no BF16 dependency)ENABLE_SM80_EXT_OPSadded to bothcc_compile_argsandnvcc_compile_argsatcc >= 80DISABLE_GELU_TANH_OPadded to both compile args when SM70 is in the target architectures — also removesgelu_tanh.cufrom sources to avoid compiling unsupported SM75 Tanh instructionssm_versionscomputed once and reused (avoids redundantget_sm_version()call)dict.fromkeys()beforesetup()to prevent duplicate translation units from overlappingfind_end_files()callsUsage or Command
Verification script (run from repo root)
Hardware Verification (AI Studio V100)
Guard counts verified on Tesla V100-SXM2-32GB via AI Studio CLI pipeline:
p-1051a228d3c7Guard balance:
#if*=18, #endif=18— balanced.Full V100 nvcc compilation blocked by GFW (cutlass submodule requires GitHub access from AI Studio). Guard structure and macro gating verified independently on hardware.
Accuracy Tests
#if*= 18#endif.Pipeline Evidence:
Checklist
pre-commit) passed for modified files.