Fix hf_quant_config with kv cache type #557
base: main
Conversation
Signed-off-by: jenchen13 <jennifchen@nvidia.com>
Codecov Report
✅ All modified and coverable lines are covered by tests.

```
@@           Coverage Diff           @@
##             main     #557   +/-   ##
=======================================
  Coverage   74.37%   74.37%
=======================================
  Files         182      182
  Lines       18219    18219
=======================================
  Hits        13550    13550
  Misses       4669     4669
```
```python
kv_cache_quantization = None
if get_kv_cache_dtype(self.model) == KV_CACHE_FP8:
    # Only FP8 KV Cache is supported in VLLM for now
    kv_cache_quantization = "FP8"
```
Could you also add FP4 KV support? TRT-LLM actually supports FP4 KV now.
just added
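The added code isn't visible in this excerpt; a minimal sketch of what the FP4 branch could look like, extending the snippet above (KV_CACHE_NVFP4 and the "NVFP4" label are assumed names, not taken from the merged diff):

```python
# Hypothetical extension of the branch above. KV_CACHE_NVFP4 and the
# "NVFP4" string are assumptions; only KV_CACHE_FP8 appears in the
# excerpted diff.
kv_cache_quantization = None
kv_cache_dtype = get_kv_cache_dtype(self.model)
if kv_cache_dtype == KV_CACHE_FP8:
    kv_cache_quantization = "FP8"
elif kv_cache_dtype == KV_CACHE_NVFP4:
    kv_cache_quantization = "NVFP4"
```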
What does this PR do?
Type of change: Bug fix
Overview: Fix hf_quant_config with the correct KV cache type for FP8.
Usage
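The template snippet was left unfilled; below is a hedged sketch of the quantize-then-export flow this fix affects. mtq.quantize, mtq.FP8_DEFAULT_CFG, and export_hf_checkpoint are existing modelopt APIs, but whether this exact config quantizes the KV cache, and the model and paths shown, are assumptions:

```python
# Sketch under assumptions: quantize a small HF model to FP8 and export
# an HF-style checkpoint. After this fix, hf_quant_config.json in the
# export directory should record the KV cache quantization type.
import modelopt.torch.quantization as mtq
from modelopt.torch.export import export_hf_checkpoint
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "facebook/opt-125m"  # placeholder model, not from the PR
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

def calibrate(m):
    # Minimal calibration loop; real usage would run representative data.
    inputs = tokenizer("calibration sample", return_tensors="pt")
    m(**inputs)

mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop=calibrate)
export_hf_checkpoint(model, export_dir="exported_model")
```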
Testing
Will test export with FP8 KV cache enabled.
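A quick post-export check (the hf_quant_config.json layout below follows the TRT-LLM-style format and is an assumption, not quoted from the PR):

```python
import json

# Inspect the exported quant config; expect the KV cache entry to be
# populated after this fix (e.g. "FP8", or "NVFP4" per the review thread).
with open("exported_model/hf_quant_config.json") as f:
    cfg = json.load(f)
print(cfg["quantization"]["kv_cache_quant_algo"])
```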
Before your PR is "Ready for review"
Additional Information