LP: vllm benchmarking with quantised models #3207
Conversation
Signed-off-by: Anna Mayne <anna.mayne@arm.com>
fadara01
left a comment
Thank you for your work!
I added some initial comments
| lm_eval --model vllm --model_args pretrained=RedHatAI/Meta-Llama-3.1-8B-quantized.w8a8,dtype=bfloat16,max_model_len=4096 --tasks gsm8k --batch_size 4 --limit 10
| ```
| We would expect to see the precision is slightly lower with INT8.
we should expect to see numbers similar to the ones reported here for int8: https://huggingface.co/RedHatAI/Meta-Llama-3.1-8B-quantized.w8a8
...are you saying we need to point to this explicitly in the article?
| ## Set up access to LLama3.1-8B models
|
| To access the Llama models hosted by Hugging Face, you will need to install the Hugging Face cli so that you can authenticate yourself and the harness can download what it needs. You should create an account on https://huggingface.co/ and follow the instructions [in the Hugging Face cli guide](https://huggingface.co/docs/huggingface_hub/en/guides/cli) to set up your access token. You can then install the cli and login:
is it worth adding an instruction that you should also sign the licence agreement etc for the meta-llama model?
Requesting access to the model is covered in the paragraph below. Is there an additional step I've forgotten?
| * Accuracy: --limit mmlu=10,gsm8k=500
|
| ### Throughput ratios: INT8/BF16
| | Requests/s | Total Tokens/s | Output Tokens/s |
given that we ran a serving benchmark, I think we should report latency here too.
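For reference, the bench CLI already reports TTFT/TPOT/ITL percentiles in serve mode, so it may just be a case of copying them into the table. A sketch of the kind of run I mean (flag names assumed from recent vLLM releases; verify against `vllm bench serve --help`):
```bash
# Sketch only: run against an already-running `vllm serve` instance.
# The report includes TTFT/TPOT/ITL latency percentiles as well as
# the request/token throughput numbers quoted in the table above.
vllm bench serve \
  --model RedHatAI/Meta-Llama-3.1-8B-quantized.w8a8 \
  --dataset-name random \
  --random-input-len 512 \
  --random-output-len 128 \
  --num-prompts 200
```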
nikhil-arm
left a comment
I think we need to redo the inference and benchmarking pages from scratch.
Also, I did not find any mention of Whisper, which was one of the requirements if I understand correctly.
Llama and/or Whisper, I think.
…page to use custom scripts and added whisper inference back in. Accuracy results in benchmarking page are now full runs. Signed-off-by: Anna Mayne <anna.mayne@arm.com>
| ## Create and activate a Python virtual environment
|
| It’s best practice to install vLLM inside an isolated environment to prevent conflicts between system and project dependencies:
nit: It's => It is
or It is considered
| ## Install vLLM for CPU
|
| Install a recent CPU specific build of vLLM:
Also "recent" could we be more specific? Do we need to be? (e.g. >v 0.x.y)
| ## Set up access to LLama3.1-8B models
|
| To access the Llama models hosted by Hugging Face, you will need to install the Hugging Face CLI so that you can authenticate yourself and the harness can download what it needs. You should create an account on https://huggingface.co/ and follow the instructions [in the Hugging Face CLI guide](https://huggingface.co/docs/huggingface_hub/en/guides/cli) to set up your access token. You can then install the CLI and login:
"to set up your access token" => "to install the CLI and setup your access token"
"You can then install the CLI and login:"
Maybe...
"You can then authenticate yourself with the HuggingFace API:"
...or "you can then authenticate with HF"
...or "you can then login to HF"
| To access the Llama models hosted by Hugging Face, you will need to install the Hugging Face CLI so that you can authenticate yourself and the harness can download what it needs. You should create an account on https://huggingface.co/ and follow the instructions [in the Hugging Face CLI guide](https://huggingface.co/docs/huggingface_hub/en/guides/cli) to set up your access token. You can then install the CLI and login:
| ```bash
| curl -LsSf https://hf.co/cli/install.sh | bash
I don't think we need this - we should just direct people to the HF docs for how to install.
It avoids us telling people to curl some script we don't own into bash.
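Agreed. A minimal sketch of the pip-based alternative (package and extra names as documented by Hugging Face):
```bash
# Install the CLI from PyPI rather than curl-ing an install script,
# then authenticate with the access token created on huggingface.co.
pip install -U "huggingface_hub[cli]"
huggingface-cli login
```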
| ## Understanding quantisation
|
| Quantised models have their weights converted to a lower precision data type, which reduces the memory requirements of the model and can improve performance significantly. In the [Run vLLM inference with INT4 quantization on Arm servers](/learning-paths/servers-and-cloud-computing/vllm-acceleration/) Learning Path we have covered how to quantise a model yourself. There are also many publicly available quantised versions of popular models, such as https://huggingface.co/RedHatAI/Meta-Llama-3.1-8B-quantized.w8a8 and https://huggingface.co/RedHatAI/whisper-large-v3-quantized.w8a8, which we will be using in this Learning Path.
nit: not sure you need the comma after type
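It might also be worth showing readers how little is needed to consume one of these prequantised checkpoints, e.g. (sketch, using the model named in the paragraph):
```bash
# vLLM reads the quantisation config shipped in the model repo, so the
# prequantised checkpoint needs no extra flags beyond the usual ones.
vllm serve RedHatAI/Meta-Llama-3.1-8B-quantized.w8a8 --dtype bfloat16
```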
| python w8a8_quant.py
| ```
| Where w8a8_quant.py contains:
is this script taken from anywhere upstream (e.g. HF?)
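If it's adapted from the llm-compressor examples on those RedHatAI model cards, a link would help. For comparison, those examples are roughly this shape (a sketch from memory; import paths, modifier names, and arguments are unverified assumptions, so check the upstream docs):
```python
# Sketch of a W8A8 one-shot quantisation recipe in the style of the
# llm-compressor examples. All names here are assumptions, not a
# verified API; the dataset and calibration values are placeholders.
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
from llmcompressor.transformers import oneshot

recipe = [
    SmoothQuantModifier(smoothing_strength=0.8),
    GPTQModifier(targets="Linear", scheme="W8A8", ignore=["lm_head"]),
]

oneshot(
    model="meta-llama/Llama-3.1-8B",
    dataset="open_platypus",
    recipe=recipe,
    output_dir="Llama-3.1-8B-W8A8",
    max_seq_length=2048,
    num_calibration_samples=512,
)
```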
| vllm serve meta-llama/Llama-3.1-8B
| ```
| Then we can create a test script that sends a request to the server using the OpenAI library. Copy the below to a file named llama_test.py.
nit: "Copy the below" => "Copy the python script below"
| python llama_test.py
| ```
| You have now run inference using both the non-quantised and quantised Llama3.1-8B models.
suggestion: could get away with a "!" at the end of this line... if that's not too colloquial...
| vllm serve openai/whisper-large-v3
| ```
| Then we can create a test script that sends a request with an audio file to the server using the OpenAI library. Copy the below to a file named whisper_test.py.
| python whisper_test.py
| ```
| You now have the quantised and non-quantised Llama and Whisper models on your local machine. You have installed vLLM and demonstrated you can run inference on your models. Now you can move on to benchmarking the Llama models and compare their performance.
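For completeness, a minimal whisper_test.py along the same lines might be (sketch; assumes vLLM exposes the OpenAI-compatible transcription endpoint, and sample.wav is a placeholder for any short local audio clip):
```python
# Minimal whisper_test.py sketch: recent vLLM builds expose an
# OpenAI-compatible /v1/audio/transcriptions endpoint for Whisper.
# "sample.wav" is a placeholder filename, not shipped with the article.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="empty")

with open("sample.wav", "rb") as audio:
    transcription = client.audio.transcriptions.create(
        model="openai/whisper-large-v3",
        file=audio,
    )

print(transcription.text)
```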
| Repeat with the quantised model. The smaller model allows us to increase the concurrency. You should see a significant improvement in the throughput results (increased tokens/s).
| ```bash
| vllm serve \
Should we add a quick explanation for some of the free params (e.g. --max-num-batched-tokens) and why we've picked these values (even if the reason is just that it's a good, tractable place to start)?
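Even a commented version of the command would cover it, e.g. (values illustrative, not the article's tuned numbers):
```bash
# Illustrative starting values, not tuned numbers:
#   --max-model-len           caps context length, bounding KV-cache memory
#   --max-num-batched-tokens  tokens processed per engine step; larger favours throughput over latency
#   --max-num-seqs            maximum concurrent sequences in a batch
vllm serve RedHatAI/Meta-Llama-3.1-8B-quantized.w8a8 \
  --max-model-len 4096 \
  --max-num-batched-tokens 8192 \
  --max-num-seqs 64
```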
| ## Llama accuracy benchmarking
|
| The lm-evaluation-harness is the standard way to measure model accuracy across common academic benchmarks (for example MMLU, HellaSwag, GSM8K) and runtimes (such as Hugging Face, vLLM, and llama.cpp). In this section, you’ll run accuracy tests for both BF16 and INT8 deployments of your Llama models served by vLLM on Arm-based servers.
Could we add links out for the benchmarks and runtimes mentioned?
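+1 on links. If we want the accuracy runs to hit the same running server as the throughput runs, the harness supports that too (sketch; local-completions arguments as I remember them from the lm-eval docs, worth double-checking):
```bash
# Sketch: point the harness at the running vLLM server instead of
# loading the model in-process. Argument names follow the lm-eval
# docs for OpenAI-compatible local servers; verify before relying on them.
lm_eval --model local-completions \
  --model_args model=meta-llama/Llama-3.1-8B,base_url=http://localhost:8000/v1/completions,num_concurrent=4 \
  --tasks gsm8k
```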
| who_is_this_for: This is an introductory topic for developers interested in running inference on quantised models. This Learning Path shows you how to run inference on Llama 3.1-8B and Whisper, with and without quantisation, and benchmark Llama performance and accuracy with vLLM's bench CLI and the LM Evaluation Harness.
|
| learning_objectives:
| - Install a recent release of vLLM
Question: Is there a convention for not punctuating lists in the LPs? (This list and the others higher up don't have any punctuation at the ends of lines.)
Before submitting a pull request for a new Learning Path, please review Create a Learning Path
Please do not include any confidential information in your contribution. This includes confidential microarchitecture details and unannounced product information.
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of the Creative Commons Attribution 4.0 International License.