LP: vllm benchmarking with quantised models #3207

Open

almayne wants to merge 5 commits into ArmDeveloperEcosystem:main from almayne:vllm_bench_quantised

Conversation


@almayne almayne commented Apr 24, 2026

Before submitting a pull request for a new Learning Path, please review Create a Learning Path

  • I have reviewed Create a Learning Path

Please do not include any confidential information in your contribution. This includes confidential microarchitecture details and unannounced product information.

  • I have checked my contribution for confidential information

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of the Creative Commons Attribution 4.0 International License.

Signed-off-by: Anna Mayne <anna.mayne@arm.com>

@fadara01 fadara01 left a comment


Thank you for your work!

I added some initial comments

```bash
lm_eval --model vllm --model_args pretrained=RedHatAI/Meta-Llama-3.1-8B-quantized.w8a8,dtype=bfloat16,max_model_len=4096 --tasks gsm8k --batch_size 4 --limit 10
```

We would expect the precision to be slightly lower with INT8.

we should expect to see numbers similar to the ones reported here for int8: https://huggingface.co/RedHatAI/Meta-Llama-3.1-8B-quantized.w8a8


...are you saying we need to point to this explicitly in the article?


## Set up access to Llama 3.1-8B models

To access the Llama models hosted by Hugging Face, you will need to install the Hugging Face CLI so that you can authenticate yourself and the harness can download what it needs. You should create an account on https://huggingface.co/ and follow the instructions [in the Hugging Face CLI guide](https://huggingface.co/docs/huggingface_hub/en/guides/cli) to set up your access token. You can then install the CLI and login:
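For reference, the install-and-login step typically looks like this (a sketch using the standard Hugging Face CLI commands documented in the guide linked above):

```bash
# Install the Hugging Face CLI, then authenticate with the token
# created at huggingface.co/settings/tokens (pasted when prompted).
pip install -U "huggingface_hub[cli]"
huggingface-cli login
```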

is it worth adding an instruction that you should also sign the licence agreement etc for the meta-llama model?

Author

Requesting access to the model is covered in the paragraph below. Is there an additional step I've forgotten?

* Accuracy: --limit mmlu=10,gsm8k=500

### Throughput ratios: INT8/BF16
| Requests/s | Total Tokens/s | Output Tokens/s |

given that we ran a serving benchmark, I think we should report latency here too.

Contributor

@nikhil-arm nikhil-arm left a comment


I think we need to redo the inference and benchmarking pages from scratch.
Also I did not find any mention of Whisper, which was one of the requirements if I understand correctly

@nSircombe

I think we need to redo the inference and benchmarking pages from scratch. Also I did not find any mention of Whisper, which was one of the requirements if I understand correctly

Llama and/or Whisper I think.

almayne added 3 commits April 27, 2026 08:30
Signed-off-by: Anna Mayne <anna.mayne@arm.com>
Signed-off-by: Anna Mayne <anna.mayne@arm.com>
Signed-off-by: Anna Mayne <anna.mayne@arm.com>
…page to use custom scripts and added whisper inference back in. Accuracy results in benchmarking page are now full runs.

Signed-off-by: Anna Mayne <anna.mayne@arm.com>

## Create and activate a Python virtual environment

It’s best practice to install vLLM inside an isolated environment to prevent conflicts between system and project dependencies:
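A minimal sketch of that step (the environment name `vllm-env` is illustrative):

```bash
python3 -m venv vllm-env        # create the isolated environment
source vllm-env/bin/activate    # activate it for this shell session
pip install --upgrade pip       # start from a current pip
```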

nit: It's => It is
or It is considered


## Install vLLM for CPU

Install a recent CPU specific build of vLLM:
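The exact command is elided in this excerpt; as a rough sketch of what a CPU source build looks like (based on the upstream vLLM CPU installation docs — file names and flags change between releases, so verify against the current docs):

```bash
# Hypothetical sketch of a vLLM CPU build; check the vLLM docs for your release.
git clone https://github.com/vllm-project/vllm.git
cd vllm
pip install -r requirements/cpu.txt --extra-index-url https://download.pytorch.org/whl/cpu
VLLM_TARGET_DEVICE=cpu pip install -v .
```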

nit: hyphen, "CPU-specific"


Also "recent" could we be more specific? Do we need to be? (e.g. >v 0.x.y)


## Set up access to Llama 3.1-8B models

To access the Llama models hosted by Hugging Face, you will need to install the Hugging Face CLI so that you can authenticate yourself and the harness can download what it needs. You should create an account on https://huggingface.co/ and follow the instructions [in the Hugging Face CLI guide](https://huggingface.co/docs/huggingface_hub/en/guides/cli) to set up your access token. You can then install the CLI and login:

"to set up your access token" => "to install the CLI and setup your access token"


"You can then install the CLI and login:"
Maybe...
"You can then authenticate yourself with the HuggingFace API:"

...or "you can then authenticate with HF"
...or "you can then login to HF"


To access the Llama models hosted by Hugging Face, you will need to install the Hugging Face CLI so that you can authenticate yourself and the harness can download what it needs. You should create an account on https://huggingface.co/ and follow the instructions [in the Hugging Face CLI guide](https://huggingface.co/docs/huggingface_hub/en/guides/cli) to set up your access token. You can then install the CLI and login:
```bash
curl -LsSf https://hf.co/cli/install.sh | bash
```

I don't think we need this - we should just direct people to the HF docs for how to install.


it avoids us telling people to curl some script we don't own into bash


## Understanding quantisation

Quantised models have their weights converted to a lower precision data type, which reduces the memory requirements of the model and can improve performance significantly. In the [Run vLLM inference with INT4 quantization on Arm servers](/learning-paths/servers-and-cloud-computing/vllm-acceleration/) Learning Path we have covered how to quantise a model yourself. There are also many publicly available quantised versions of popular models, such as https://huggingface.co/RedHatAI/Meta-Llama-3.1-8B-quantized.w8a8 and https://huggingface.co/RedHatAI/whisper-large-v3-quantized.w8a8, which we will be using in this Learning Path.
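Serving one of these published checkpoints is then a drop-in swap of the model ID, for example:

```bash
# Serve the published INT8 (W8A8) checkpoint instead of the BF16 original.
vllm serve RedHatAI/Meta-Llama-3.1-8B-quantized.w8a8
```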

nit: not sure you need the comma after type

```bash
python w8a8_quant.py
```

Where w8a8_quant.py contains:
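The script itself is elided in this excerpt. For orientation, a W8A8 recipe built with the llm-compressor library (the tooling behind RedHatAI's published quantised checkpoints) typically looks something like the following — a hypothetical sketch, not the PR's actual script; import paths and arguments vary between llm-compressor releases:

```python
# Hypothetical w8a8_quant.py sketch (not the PR's actual script).
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
from llmcompressor.transformers import oneshot

# SmoothQuant rebalances activation outliers; GPTQ then quantises weights
# and activations to INT8 (W8A8), leaving the lm_head in higher precision.
recipe = [
    SmoothQuantModifier(smoothing_strength=0.8),
    GPTQModifier(targets="Linear", scheme="W8A8", ignore=["lm_head"]),
]

oneshot(
    model="meta-llama/Llama-3.1-8B",
    dataset="open_platypus",          # calibration data
    recipe=recipe,
    output_dir="Llama-3.1-8B-w8a8",
    max_seq_length=2048,
    num_calibration_samples=512,
)
```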

is this script taken from anywhere upstream (e.g. HF?)

```bash
vllm serve meta-llama/Llama-3.1-8B
```

Then we can create a test script that sends a request to the server using the OpenAI library. Copy the below to a file named llama_test.py.
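The script body is elided here; a minimal sketch of such a request against vLLM's OpenAI-compatible endpoint (default localhost:8000; the prompt is illustrative) could be:

```python
# Hypothetical llama_test.py sketch (not the PR's actual script).
from openai import OpenAI

# vLLM serves an OpenAI-compatible API; the key is required by the client but unused.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.completions.create(
    model="meta-llama/Llama-3.1-8B",
    prompt="The capital of France is",
    max_tokens=32,
)
print(completion.choices[0].text)
```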

nit: "Copy the below" => "Copy the python script below"

```bash
python llama_test.py
```

You have now run inference using both the non-quantised and quantised Llama3.1-8B models.

suggestion: could get away with a "!" at the end of this line... if that's not too colloquial...

```bash
vllm serve openai/whisper-large-v3
```

Then we can create a test script that sends a request with an audio file to the server using the OpenAI library. Copy the below to a file named whisper_test.py.
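Again the script body is elided; a sketch of the transcription request against vLLM's OpenAI-compatible audio endpoint (the audio file name is illustrative) might be:

```python
# Hypothetical whisper_test.py sketch (not the PR's actual script).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Send a local audio file to the /v1/audio/transcriptions endpoint.
with open("sample.wav", "rb") as audio:
    transcript = client.audio.transcriptions.create(
        model="openai/whisper-large-v3",
        file=audio,
    )
print(transcript.text)
```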

copy the python script below

```bash
python whisper_test.py
```

You now have the quantised and non-quantised Llama and Whisper models on your local machine. You have installed vLLM and demonstrated that you can run inference on your models. Now you can move on to benchmarking the Llama models and comparing their performance.

"!"


Repeat with the quantised model. The smaller model allows us to increase the concurrency. You should see a significant improvement in the throughput results (increased tokens/s).
```bash
vllm serve \

Should we add a quick explanation for some of the free params (e.g. --max-num-batched-tokens) and why we've picked these values (even if the reason is just that it's a good, tractable place to start)?


## Llama accuracy benchmarking

The lm-evaluation-harness is the standard way to measure model accuracy across common academic benchmarks (for example MMLU, HellaSwag, GSM8K) and runtimes (such as Hugging Face, vLLM, and llama.cpp). In this section, you’ll run accuracy tests for both BF16 and INT8 deployments of your Llama models served by vLLM on Arm-based servers.
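For example, the BF16 baseline run would mirror the INT8 command shown earlier, swapping in the unquantised checkpoint (same harness flags assumed):

```bash
lm_eval --model vllm \
  --model_args pretrained=meta-llama/Llama-3.1-8B,dtype=bfloat16,max_model_len=4096 \
  --tasks gsm8k --batch_size 4 --limit 10
```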

links out for benchmarks and runtimes mentioned?

who_is_this_for: This is an introductory topic for developers interested in running inference on quantised models. This Learning Path shows you how to run inference on Llama 3.1-8B and Whisper, with and without quantisation, and benchmark Llama performance and accuracy with vLLM's bench CLI and the LM Evaluation Harness.

learning_objectives:
- Install a recent release of vLLM

Question: Is there a convention for not punctuating lists in the LPs? (this list and the others higher up don't have any on the end of lines).
