# Add DS/QWEN Examples #2333
Status: Open

yiliu30 wants to merge 19 commits into `master` from `ds-qwen`.
## Commits (19)
- df06537 add qwen-ds example (yiliu30)
- 859f13f [pre-commit.ci] auto fixes from pre-commit.com hooks (pre-commit-ci[bot])
- d417867 fix (yiliu30)
- 290b982 Merge branch 'ds-qwen' of https://github.com/intel/neural-compressor … (yiliu30)
- 6b5314c clean (yiliu30)
- 3b6585c format (yiliu30)
- db73aba update (yiliu30)
- 1126b00 add eval (yiliu30)
- 59ed49a fix (yiliu30)
- 94a955e fix (yiliu30)
- e5afc7c Merge branch 'master' into ds-qwen (chensuyue)
- be09232 update (yiliu30)
- d1086a1 update (yiliu30)
- 0ac9c2a update (yiliu30)
- e566cc3 rename folder (yiliu30)
- 9809b29 update (yiliu30)
- 82c9f45 add req (yiliu30)
- 484e6a3 fix (yiliu30)
- a21b42c fix (yiliu30)
## Files changed

**...nlp/huggingface_models/language-modeling/quantization/auto_round/qwen/README.md** (new file, 62 additions)
### Quantize Model

- Export model path

```bash
export MODEL=Qwen/Qwen3-235B-A22B
```

> [!TIP]
> For quicker experimentation (shorter quantization and evaluation time, lower memory),
> you can start with the smaller `export MODEL=Qwen/Qwen3-30B-A3B` model before moving to larger variants.
- MXFP8

```bash
bash run_quant.sh --model $MODEL -t mxfp8 --output_dir ./qmodels
```

- MXFP4

```bash
bash run_quant.sh --model $MODEL -t mxfp4 --output_dir ./qmodels
```
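`run_quant.sh` itself is not shown in this diff; assuming it forwards these flags to the bundled `quantize.py`, an equivalent direct invocation would look like this (a sketch, not part of the PR):

```bash
# Hypothetical direct call; assumes run_quant.sh wraps quantize.py this way.
python quantize.py --model $MODEL -t mxfp8 --use_autoround_format --output_dir ./qmodels
```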
## Evaluation

Install the vLLM fork used by these examples:

```bash
git clone -b fused-moe-ar --single-branch --quiet https://github.com/yiliu30/vllm-fork.git && cd vllm-fork
VLLM_USE_PRECOMPILED=1 pip install --editable . -vvv
```
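To confirm that Python picks up the editable install (a quick sanity check, not part of the PR):

```bash
python -c "import vllm; print(vllm.__version__, vllm.__file__)"
```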
### Prompt Tests

Usage:
```bash
bash ./run_generate.sh -s [mxfp4|mxfp8] -tp [tensor_parallel_size] -m [model_path]
```

- MXFP8
```bash
bash ./run_generate.sh -s mxfp8 -tp 4 -m /path/to/qwen_mxfp8
```
- MXFP4
```bash
bash ./run_generate.sh -s mxfp4 -tp 4 -m /path/to/qwen_mxfp4
```
### Evaluation

Usage:
```bash
bash run_evaluation.sh -m [model_path] -s [mxfp4|mxfp8] -t [task_name] -tp [tensor_parallel_size] -b [batch_size]
```

- MXFP8
```bash
bash run_evaluation.sh -s mxfp8 -t piqa,hellaswag,mmlu -tp 4 -b 512 -m /path/to/qwen_mxfp8
bash run_evaluation.sh -s mxfp8 -t gsm8k -tp 4 -b 256 -m /path/to/qwen_mxfp8
```
- MXFP4
```bash
bash run_evaluation.sh -s mxfp4 -t piqa,hellaswag,mmlu -tp 4 -b 512 -m /path/to/qwen_mxfp4
bash run_evaluation.sh -s mxfp4 -t gsm8k -tp 4 -b 256 -m /path/to/qwen_mxfp4
```
**...pytorch/nlp/huggingface_models/language-modeling/quantization/auto_round/qwen/generate.py** (new file, 67 additions)
```python
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
# Copied from https://github.com/vllm-project/vllm/

from vllm import LLM, EngineArgs
from vllm.utils.argparse_utils import FlexibleArgumentParser


def create_parser():
    parser = FlexibleArgumentParser()
    # Add engine args
    EngineArgs.add_cli_args(parser)
    parser.set_defaults(model="meta-llama/Llama-3.2-1B-Instruct")
    # Add sampling params
    sampling_group = parser.add_argument_group("Sampling parameters")
    sampling_group.add_argument("--max-tokens", type=int)
    sampling_group.add_argument("--temperature", type=float)
    sampling_group.add_argument("--top-p", type=float)
    sampling_group.add_argument("--top-k", type=int)

    return parser


def main(args: dict):
    # Pop arguments not used by LLM
    max_tokens = args.pop("max_tokens")
    temperature = args.pop("temperature")
    top_p = args.pop("top_p")
    top_k = args.pop("top_k")

    # Create an LLM
    llm = LLM(**args)

    # Create a sampling params object
    sampling_params = llm.get_default_sampling_params()
    if max_tokens is not None:
        sampling_params.max_tokens = max_tokens
    if temperature is not None:
        sampling_params.temperature = temperature
    if top_p is not None:
        sampling_params.top_p = top_p
    if top_k is not None:
        sampling_params.top_k = top_k

    # Generate texts from the prompts. The output is a list of RequestOutput
    # objects that contain the prompt, generated text, and other information.
    prompts = [
        "Hello, my name is",
        "The president of the United States is",
        "The capital of France is",
        "The future of AI is",
    ]
    outputs = llm.generate(prompts, sampling_params)
    # Print the outputs.
    print("-" * 50)
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}\nGenerated text: {generated_text!r}")
        print("-" * 50)


if __name__ == "__main__":
    parser = create_parser()
    args: dict = vars(parser.parse_args())
    main(args)
```
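A usage sketch for this script (the model path and parallelism are illustrative; `--model` and `--tensor-parallel-size` are standard vLLM engine flags registered by `EngineArgs.add_cli_args`):

```bash
python generate.py --model /path/to/qwen_mxfp8 \
    --tensor-parallel-size 4 \
    --max-tokens 64 --temperature 0.0
```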
**...pytorch/nlp/huggingface_models/language-modeling/quantization/auto_round/qwen/quantize.py** (new file, 149 additions)
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import transformers
import logging
from auto_round import AutoRound

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


topologies_config = {
    # "ds_mxfp8": {
    #     "scheme": "MXFP8",
    #     "fp_layers": "lm_head",
    #     "iters": 0,
    # },
    # "ds_mxfp4": {
    #     "scheme": "MXFP4",
    #     "fp_layers": "lm_head,self_attn",
    #     "iters": 0,
    # },
    "mxfp8": {
        "scheme": "MXFP8",
        "fp_layers": "lm_head,mlp.gate",
        "iters": 0,
    },
    "mxfp4": {
        "scheme": "MXFP4",
        "fp_layers": "lm_head,mlp.gate,self_attn",
        "iters": 0,  # TODO: set to 200 before merge
    },
}


def quant_model_ar(args):
    config = topologies_config[args.t]

    logger.info(f"Using fp_layers: {config['fp_layers']}")
    autoround = AutoRound(
        model=args.model,
        scheme=config["scheme"],
        enable_torch_compile=args.enable_torch_compile,
        iters=config["iters"],
        fp_layers=config["fp_layers"],
    )
    logger.info(f"Save quantized model to {args.output_dir}")
    format_type = "auto_round" if args.use_autoround_format else "llm_compressor"
    autoround.quantize_and_save(
        format=format_type,
        output_dir=f"{args.output_dir}/quantized_model_{args.t}",
    )


def get_model_and_tokenizer(model_name):
    # Load model and tokenizer
    fp32_model = AutoModelForCausalLM.from_pretrained(
        model_name,
        device_map="cpu",
        trust_remote_code=True,
    )
    tokenizer = AutoTokenizer.from_pretrained(
        model_name,
        trust_remote_code=True,
    )
    return fp32_model, tokenizer


def quant_model(args):
    from neural_compressor.torch.quantization import (
        AutoRoundConfig,
        convert,
        prepare,
    )

    config = topologies_config[args.t]
    export_format = "auto_round" if args.use_autoround_format else "llm_compressor"
    output_dir = f"{args.output_dir}/quantized_model_{args.t}"
    fp32_model, tokenizer = get_model_and_tokenizer(args.model)
    quant_config = AutoRoundConfig(
        tokenizer=tokenizer,
        # nsamples=32,
        # seqlen=10,
        # iters=1,
        # amp=False,
        # scale_dtype="fp16",
        scheme=config["scheme"],
        enable_torch_compile=args.enable_torch_compile,
        iters=config["iters"],
        fp_layers=config["fp_layers"],
        export_format=export_format,
        output_dir=output_dir,
    )

    # Quantize: prepare the model, then convert it in place.
    model = prepare(model=fp32_model, quant_config=quant_config)
    inc_model = convert(model)
    logger.info(f"Quantized model saved to {output_dir}")


if __name__ == "__main__":
    import argparse

    # Parse command-line arguments
    parser = argparse.ArgumentParser(description="Select a quantization scheme.")
    parser.add_argument(
        "--model",
        type=str,
        help="Path to the pre-trained model or model identifier from Hugging Face Hub.",
    )
    parser.add_argument(
        "-t",
        type=str,
        choices=topologies_config.keys(),
        default="mxfp4",
        help="Quantization scheme to use. Available options: " + ", ".join(topologies_config.keys()),
    )

    parser.add_argument(
        "--enable_torch_compile",
        action="store_true",
        help="Enable torch compile for the model.",
    )
    parser.add_argument(
        "--use_autoround_format",
        action="store_true",
        help="Use AutoRound format for saving the quantized model.",
    )

    parser.add_argument(
        "--skip_attn",
        action="store_true",
        help="Skip quantizing attention layers.",
    )
    parser.add_argument(
        "--iters",
        type=int,
        default=0,
        help="Number of iterations for quantization.",
    )
    parser.add_argument(
        "--output_dir",
        type=str,
        default="./",
        help="Directory to save the quantized model.",
    )

    args = parser.parse_args()

    quant_model(args)
```
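The script exports in `llm_compressor` format by default and in AutoRound format with `--use_autoround_format`; `fp_layers` names the modules kept in their original precision rather than quantized. A usage sketch (model and output paths are illustrative):

```bash
# Quantize the smaller Qwen3-30B-A3B to MXFP4; the checkpoint lands in ./qmodels/quantized_model_mxfp4.
python quantize.py --model Qwen/Qwen3-30B-A3B -t mxfp4 --output_dir ./qmodels
```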
**...ch/nlp/huggingface_models/language-modeling/quantization/auto_round/qwen/requirements.txt** (new file, 4 additions)
```text
lm-eval==0.4.9.1
loguru
# TODO: (yiliu30) replace it with release version
auto_round @ git+https://github.com/intel/auto-round.git@more-ar-ext
```
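Install the pinned dependencies before quantizing or evaluating:

```bash
pip install -r requirements.txt
```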
**...pytorch/nlp/huggingface_models/language-modeling/quantization/auto_round/qwen/run_eval.sh** (new file, 71 additions)
> **Review comment (Contributor):** clean up test code.
```bash
#!/bin/bash
# Check if a model name is passed as an argument, otherwise use the default model path
if [ -z "$1" ]; then
    # model_path="Meta-Llama-3-8B-Instruct-W4A16-G128-AutoRound"
    # model_path="/storage/yiliu7/quantized_model_ds_mxfp8"
    model_path="/storage/yiliu7/quantized_model_ds_mxfp4"
    model_path="/storage/yiliu7/quantized_model_ds_mxfp4"
    model_path="/storage/yiliu7/quantized_model_ds_mxfp8"
    # model_path="qmodels/quantized_model_ds_mxfp8"
    # model_path="./small-qmodels/quantized_model_qwen_mxfp8/"
    # model_path="/storage/yiliu7/quantized_model_qwen_mxfp4"
    # model_path="/storage/yiliu7/quantized_model_qwen_mxfp8"
else
    model_path="$1"
fi

tp_size=8
model_name=$(basename ${model_path})
output_dir="${model_name}-tp${tp_size}-gsm8k-acc"
# task_name="gsm8k"
# batch_size=256
batch_size=512
task_name="piqa,hellaswag,mmlu"
# task_name="mmlu_high_school_biology"

echo "Evaluating model: ${model_path} on task: ${task_name}, output dir: ${output_dir}"
# VLLM_ATTENTION_BACKEND=TRITON_ATTN \
mkdir -p ${output_dir}
# VLLM_ATTENTION_BACKEND=FLASHINFER \

# - MXFP4 evaluation
# /storage/yiliu7/quantized_model_qwen_mxfp4 4x200
# VLLM_AR_MXFP4_MODULAR_MOE=1 \
# VLLM_MXFP4_PRE_UNPACK_TO_FP8=1 \
# VLLM_ENABLE_STATIC_MOE=0 \
# VLLM_MXFP4_PRE_UNPACK_WEIGHTS=0 \
# VLLM_USE_DEEP_GEMM=0 \
# VLLM_ENABLE_AR_EXT=1 \
# VLLM_ENABLE_V1_MULTIPROCESSING=1 \
# lm_eval --model vllm \
#     --model_args "pretrained=${model_path},tensor_parallel_size=${tp_size},max_model_len=8192,max_num_batched_tokens=32768,max_num_seqs=128,add_bos_token=True,gpu_memory_utilization=0.8,dtype=bfloat16,max_gen_toks=2048,enable_prefix_caching=False,enable_expert_parallel=True" \
#     --tasks $task_name \
#     --batch_size 16 \
#     --limit 256 \
#     --log_samples \
#     --seed 42 \
#     --output_path ${output_dir} \
#     --show_config 2>&1 | tee ${output_dir}/log.txt

# - MXFP8 evaluation
# !!! Please set below knobs strictly for MXFP8 model evaluation !!!
# /storage/yiliu7/quantized_model_qwen_mxfp8 4x200
VLLM_ENABLE_AR_EXT=1 \
VLLM_AR_MXFP4_MODULAR_MOE=0 \
VLLM_MXFP4_PRE_UNPACK_WEIGHTS=0 \
VLLM_MXFP4_PRE_UNPACK_TO_FP8=0 \
VLLM_ENABLE_STATIC_MOE=0 \
VLLM_USE_DEEP_GEMM=0 \
VLLM_ENABLE_V1_MULTIPROCESSING=1 \
lm_eval --model vllm \
    --model_args "pretrained=${model_path},tensor_parallel_size=${tp_size},max_model_len=8192,max_num_batched_tokens=32768,max_num_seqs=128,add_bos_token=True,gpu_memory_utilization=0.8,dtype=bfloat16,max_gen_toks=2048,enable_prefix_caching=False" \
    --tasks $task_name \
    --batch_size $batch_size \
    --log_samples \
    --seed 42 \
    --output_path ${output_dir} \
    --show_config 2>&1 | tee ${output_dir}/log.txt
```
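A usage sketch (the hard-coded `/storage/yiliu7/...` defaults above are machine-specific; pass your own quantized model path instead):

```bash
bash run_eval.sh /path/to/qwen_mxfp8
```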
> **Review comment:** Clean up invalid code.