LP: vllm benchmarking with quantised models #3207

Open

almayne wants to merge 5 commits into ArmDeveloperEcosystem:main from almayne:vllm_bench_quantised

Conversation


@almayne almayne commented Apr 24, 2026

Before submitting a pull request for a new Learning Path, please review Create a Learning Path

  • I have reviewed Create a Learning Path

Please do not include any confidential information in your contribution. This includes confidential microarchitecture details and unannounced product information.

  • I have checked my contribution for confidential information

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of the Creative Commons Attribution 4.0 International License.

Signed-off-by: Anna Mayne <anna.mayne@arm.com>

@fadara01 fadara01 left a comment


Thank you for your work!

I added some initial comments

```bash
lm_eval --model vllm --model_args pretrained=RedHatAI/Meta-Llama-3.1-8B-quantized.w8a8,dtype=bfloat16,max_model_len=4096 --tasks gsm8k --batch_size 4 --limit 10
```

We would expect the precision to be slightly lower with INT8.

we should expect to see numbers similar to the ones reported here for int8: https://huggingface.co/RedHatAI/Meta-Llama-3.1-8B-quantized.w8a8


...are you saying we need to point to this explicitly in the article?


## Set up access to Llama 3.1-8B models

To access the Llama models hosted by Hugging Face, you will need to install the Hugging Face CLI so that you can authenticate yourself and the harness can download what it needs. You should create an account on https://huggingface.co/ and follow the instructions [in the Hugging Face CLI guide](https://huggingface.co/docs/huggingface_hub/en/guides/cli) to set up your access token. You can then install the CLI and login:
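For reference, the install-and-login step typically looks like this (a sketch using the standard Hugging Face CLI commands documented in the guide linked above):

```bash
# Install the Hugging Face CLI, then authenticate with the token
# created at huggingface.co/settings/tokens (pasted when prompted).
pip install -U "huggingface_hub[cli]"
huggingface-cli login
```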

is it worth adding an instruction that you should also sign the licence agreement etc for the meta-llama model?

Author

Requesting access to the model is covered in the paragraph below. Is there an additional step I've forgotten?

* Accuracy: --limit mmlu=10,gsm8k=500

### Throughput ratios: INT8/BF16
| Requests/s | Total Tokens/s | Output Tokens/s |

given that we ran a serving benchmark, I think we should report latency here too.

Contributor

@nikhil-arm nikhil-arm left a comment


I think we need to redo the inference and benchmarking pages from scratch.
Also I did not find any mention of Whisper, which was one of the requirements if I understand correctly

@nSircombe

I think we need to redo the inference and benchmarking pages from scratch. Also I did not find any mention of Whisper, which was one of the requirements if I understand correctly

Llama and/or Whisper I think.

almayne added 3 commits April 27, 2026 08:30
Signed-off-by: Anna Mayne <anna.mayne@arm.com>
Signed-off-by: Anna Mayne <anna.mayne@arm.com>
Signed-off-by: Anna Mayne <anna.mayne@arm.com>
…page to use custom scripts and added whisper inference back in. Accuracy results in benchmarking page are now full runs.

Signed-off-by: Anna Mayne <anna.mayne@arm.com>

## Create and activate a Python virtual environment

It’s best practice to install vLLM inside an isolated environment to prevent conflicts between system and project dependencies:
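A minimal sketch of that step (the environment name `vllm-env` is illustrative):

```bash
python3 -m venv vllm-env        # create the isolated environment
source vllm-env/bin/activate    # activate it for this shell session
pip install --upgrade pip       # start from a current pip
```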

nit: It's => It is
or It is considered


## Install vLLM for CPU

Install a recent CPU specific build of vLLM:
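The exact command is elided in this excerpt; as a rough sketch of what a CPU source build looks like (based on the upstream vLLM CPU installation docs — file names and flags change between releases, so verify against the current docs):

```bash
# Hypothetical sketch of a vLLM CPU build; check the vLLM docs for your release.
git clone https://github.com/vllm-project/vllm.git
cd vllm
pip install -r requirements/cpu.txt --extra-index-url https://download.pytorch.org/whl/cpu
VLLM_TARGET_DEVICE=cpu pip install -v .
```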

nit: hyphen, "CPU-specific"


Also "recent" could we be more specific? Do we need to be? (e.g. >v 0.x.y)


## Set up access to Llama 3.1-8B models

To access the Llama models hosted by Hugging Face, you will need to install the Hugging Face CLI so that you can authenticate yourself and the harness can download what it needs. You should create an account on https://huggingface.co/ and follow the instructions [in the Hugging Face CLI guide](https://huggingface.co/docs/huggingface_hub/en/guides/cli) to set up your access token. You can then install the CLI and login:

"to set up your access token" => "to install the CLI and setup your access token"


"You can then install the CLI and login:"
Maybe...
"You can then authenticate yourself with the HuggingFace API:"

...or "you can then authenticate with HF"
...or "you can then login to HF"


To access the Llama models hosted by Hugging Face, you will need to install the Hugging Face CLI so that you can authenticate yourself and the harness can download what it needs. You should create an account on https://huggingface.co/ and follow the instructions [in the Hugging Face CLI guide](https://huggingface.co/docs/huggingface_hub/en/guides/cli) to set up your access token. You can then install the CLI and login:
```bash
curl -LsSf https://hf.co/cli/install.sh | bash
```

I don't think we need this - we should just direct people to the HF docs for how to install.


it avoids us telling people to curl some script we don't own into bash


## Understanding quantisation

Quantised models have their weights converted to a lower precision data type, which reduces the memory requirements of the model and can improve performance significantly. In the [Run vLLM inference with INT4 quantization on Arm servers](/learning-paths/servers-and-cloud-computing/vllm-acceleration/) Learning Path we have covered how to quantise a model yourself. There are also many publicly available quantised versions of popular models, such as https://huggingface.co/RedHatAI/Meta-Llama-3.1-8B-quantized.w8a8 and https://huggingface.co/RedHatAI/whisper-large-v3-quantized.w8a8, which we will be using in this Learning Path.
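Serving one of these published checkpoints is then a drop-in swap of the model ID, for example:

```bash
# Serve the published INT8 (W8A8) checkpoint instead of the BF16 original.
vllm serve RedHatAI/Meta-Llama-3.1-8B-quantized.w8a8
```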

nit: not sure you need the comma after type

```bash
python w8a8_quant.py
```

Where w8a8_quant.py contains:
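The script itself is elided in this excerpt. For orientation, a W8A8 recipe built with the llm-compressor library (the tooling behind RedHatAI's published quantised checkpoints) typically looks something like the following — a hypothetical sketch, not the PR's actual script; import paths and arguments vary between llm-compressor releases:

```python
# Hypothetical w8a8_quant.py sketch (not the PR's actual script).
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
from llmcompressor.transformers import oneshot

# SmoothQuant rebalances activation outliers; GPTQ then quantises weights
# and activations to INT8 (W8A8), leaving the lm_head in higher precision.
recipe = [
    SmoothQuantModifier(smoothing_strength=0.8),
    GPTQModifier(targets="Linear", scheme="W8A8", ignore=["lm_head"]),
]

oneshot(
    model="meta-llama/Llama-3.1-8B",
    dataset="open_platypus",          # calibration data
    recipe=recipe,
    output_dir="Llama-3.1-8B-w8a8",
    max_seq_length=2048,
    num_calibration_samples=512,
)
```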

is this script taken from anywhere upstream (e.g. HF?)

```bash
vllm serve meta-llama/Llama-3.1-8B
```

Then we can create a test script that sends a request to the server using the OpenAI library. Copy the below to a file named llama_test.py.
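The script body is elided here; a minimal sketch of such a request against vLLM's OpenAI-compatible endpoint (default localhost:8000; the prompt is illustrative) could be:

```python
# Hypothetical llama_test.py sketch (not the PR's actual script).
from openai import OpenAI

# vLLM serves an OpenAI-compatible API; the key is required by the client but unused.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.completions.create(
    model="meta-llama/Llama-3.1-8B",
    prompt="The capital of France is",
    max_tokens=32,
)
print(completion.choices[0].text)
```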

nit: "Copy the below" => "Copy the python script below"

```bash
python llama_test.py
```

You have now run inference using both the non-quantised and quantised Llama3.1-8B models.

suggestion: could get away with a "!" at the end of this line... if that's not too colloquial...

```bash
vllm serve openai/whisper-large-v3
```

Then we can create a test script that sends a request with an audio file to the server using the OpenAI library. Copy the below to a file named whisper_test.py.
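Again the script body is elided; a sketch of the transcription request against vLLM's OpenAI-compatible audio endpoint (the audio file name is illustrative) might be:

```python
# Hypothetical whisper_test.py sketch (not the PR's actual script).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Send a local audio file to the /v1/audio/transcriptions endpoint.
with open("sample.wav", "rb") as audio:
    transcript = client.audio.transcriptions.create(
        model="openai/whisper-large-v3",
        file=audio,
    )
print(transcript.text)
```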

copy the python script below

```bash
python whisper_test.py
```

You now have the quantised and non-quantised Llama and Whisper models on your local machine. You have installed vLLM and demonstrated that you can run inference on your models. Now you can move on to benchmarking the Llama models and comparing their performance.

"!"


Repeat with the quantised model. The smaller model allows us to increase the concurrency. You should see a significant improvement in the throughput results (increased tokens/s).
```bash
vllm serve \

Should we add a quick explanation for some of the free params (e.g. --max-num-batched-tokens) and why we've picked these values (even if the reason is just that it's a good, tractable place to start)?


## Llama accuracy benchmarking

The lm-evaluation-harness is the standard way to measure model accuracy across common academic benchmarks (for example MMLU, HellaSwag, GSM8K) and runtimes (such as Hugging Face, vLLM, and llama.cpp). In this section, you’ll run accuracy tests for both BF16 and INT8 deployments of your Llama models served by vLLM on Arm-based servers.
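For example, the BF16 baseline run would mirror the INT8 command shown earlier, swapping in the unquantised checkpoint (same harness flags assumed):

```bash
lm_eval --model vllm \
  --model_args pretrained=meta-llama/Llama-3.1-8B,dtype=bfloat16,max_model_len=4096 \
  --tasks gsm8k --batch_size 4 --limit 10
```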

links out for benchmarks and runtimes mentioned?

who_is_this_for: This is an introductory topic for developers interested in running inference on quantised models. This Learning Path shows you how to run inference on Llama 3.1-8B and Whisper, with and without quantisation, and benchmark Llama performance and accuracy with vLLM's bench CLI and the LM Evaluation Harness.

learning_objectives:
- Install a recent release of vLLM

Question: Is there a convention for not punctuating lists in the LPs? (this list and the others higher up don't have any on the end of lines).
