diff --git a/content/learning-paths/servers-and-cloud-computing/vllm-benchmark-quantisation/1-overview-and-setup.md b/content/learning-paths/servers-and-cloud-computing/vllm-benchmark-quantisation/1-overview-and-setup.md
new file mode 100644
index 0000000000..d5f4df5d83
--- /dev/null
+++ b/content/learning-paths/servers-and-cloud-computing/vllm-benchmark-quantisation/1-overview-and-setup.md
@@ -0,0 +1,70 @@
+---
+title: Set up vLLM
+weight: 2
+
+### FIXED, DO NOT MODIFY
+layout: learningpathall
+---
+
+## What is vLLM?
+
+[vLLM](https://docs.vllm.ai/en/latest/) is an open-source, high-throughput inference and serving engine for large language models (LLMs). It is designed to maximise hardware efficiency, making LLM inference faster, more memory-efficient, and scalable.
+
+## Understanding the models
+
+Llama 3.1 8B is an open-weight, text-only LLM with 8 billion parameters that can understand and generate text. You can view the model card at https://huggingface.co/meta-llama/Llama-3.1-8B.
+
+Whisper large-v3 is an automatic speech recognition (ASR) and speech translation model. It has 1.55 billion parameters and can both transcribe many languages and translate them to English. You can find the model card at https://huggingface.co/openai/whisper-large-v3.
+
+## Set up your environment
+
+Before you begin, make sure your environment meets these requirements:
+
+- Python 3.12 on Ubuntu 22.04 LTS or newer
+- At least 32 vCPUs, 96 GB RAM, and 64 GB of free disk space
+
+This Learning Path was tested on a 96-core machine with 128-bit SVE, 192 GB of RAM, and 500 GB of attached storage.
+
+## Install build dependencies
+
+Install the following packages, which are required for running inference with vLLM on Arm64:
+```bash
+sudo apt-get update -y
+sudo apt-get install -y python3.12-venv python3.12-dev
+```
+
+Now install tcmalloc, a fast memory allocator from Google's gperftools, which improves performance under high concurrency:
+```bash
+sudo apt-get install -y libtcmalloc-minimal4
+```
+
+## Create and activate a Python virtual environment
+
+It is best practice to install vLLM inside an isolated environment to prevent conflicts between system and project dependencies:
+```bash
+python3.12 -m venv vllm_env
+source vllm_env/bin/activate
+python -m pip install --upgrade pip
+```
+
+## Install vLLM for CPU
+
+Install a CPU-specific build of vLLM:
+```bash
+export VLLM_VERSION=0.20.0
+pip install https://github.com/vllm-project/vllm/releases/download/v${VLLM_VERSION}/vllm-${VLLM_VERSION}+cpu-cp38-abi3-manylinux_2_35_aarch64.whl --extra-index-url https://download.pytorch.org/whl/cpu
+```
+
+If you wish to build vLLM from source, you can follow the instructions in the [Build and Run vLLM on Arm Servers Learning Path](/learning-paths/servers-and-cloud-computing/vllm/vllm-setup/).
+
+## Set up access to Llama 3.1 8B
+
+To access the Llama models hosted on Hugging Face, you will need to install the Hugging Face CLI so that you can authenticate yourself and download the models and datasets used in this Learning Path. Create an account on https://huggingface.co/ and follow the instructions [in the Hugging Face CLI guide](https://huggingface.co/docs/huggingface_hub/en/guides/cli) to install the CLI and set up your access token. You can then log in to Hugging Face:
+```bash
+hf auth login
+```
+
+Paste your access token into the terminal when prompted. To access Llama 3.1 8B you also need to request access on the Hugging Face website. Visit https://huggingface.co/meta-llama/Llama-3.1-8B, select "Expand to review and access", and complete the form; you should be granted access in a matter of minutes.
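+
+To confirm that your token works and that access has been granted, you can try downloading a single small file from the gated repository. This is an optional sanity check; `config.json` is used here only because it is a tiny file:
+```bash
+hf download meta-llama/Llama-3.1-8B config.json
+```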
+
+Your environment is now set up to run inference with vLLM. Next, you'll review model quantisation, and then you'll use vLLM to run inference on both quantised and non-quantised Llama and Whisper models.
diff --git a/content/learning-paths/servers-and-cloud-computing/vllm-benchmark-quantisation/2-quantisation-recipe.md b/content/learning-paths/servers-and-cloud-computing/vllm-benchmark-quantisation/2-quantisation-recipe.md
new file mode 100644
index 0000000000..531564b880
--- /dev/null
+++ b/content/learning-paths/servers-and-cloud-computing/vllm-benchmark-quantisation/2-quantisation-recipe.md
@@ -0,0 +1,104 @@
+---
+title: Quantisation Recipe
+weight: 3
+
+### FIXED, DO NOT MODIFY
+layout: learningpathall
+---
+
+## Understanding quantisation
+
+Quantised models have their weights converted to a lower-precision data type, which reduces the memory requirements of the model and can improve performance significantly. The [Run vLLM inference with INT4 quantization on Arm servers](/learning-paths/servers-and-cloud-computing/vllm-acceleration/) Learning Path covers how to quantise a model yourself. There are also many publicly available quantised versions of popular models, such as https://huggingface.co/RedHatAI/Meta-Llama-3.1-8B-quantized.w8a8 and https://huggingface.co/RedHatAI/whisper-large-v3-quantized.w8a8, which you will use in this Learning Path.
+
+The notation w8a8 means that the weights have been quantised to 8-bit integers and the activations (the input data) are dynamically quantised to 8-bit integers as well. This allows the kernels to use Arm's 8-bit integer matrix multiply feature, I8MM. You can learn more about this in the [KleidiAI and matrix multiplication](/learning-paths/cross-platform/kleidiai-explainer/) Learning Path.
+
+The w8a8 models used in this Learning Path only apply quantisation to the weights and activations in the linear layers of the transformer blocks. Activations are quantised per-token, and weights are quantised per-channel; that is, each output channel has its own scaling factor that maps between the INT8 and BF16 representations.
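+
+To make the per-channel weight scheme concrete, here is a minimal, self-contained sketch of symmetric per-channel INT8 quantisation. It is purely illustrative and not part of the recipe below; the weight matrix is random and the shapes are arbitrary:
+```python
+import torch
+
+# Toy weight matrix with shape (output channels, input channels)
+w = torch.randn(4, 8)
+
+# One scale per output channel: the channel's max |w| maps to 127
+scales = w.abs().amax(dim=1, keepdim=True) / 127.0
+
+# Quantise: divide by the scale, round, and clamp to the INT8 range
+w_int8 = torch.clamp(torch.round(w / scales), -128, 127).to(torch.int8)
+
+# Dequantise to recover an approximation of the original weights
+w_deq = w_int8.float() * scales
+print("max absolute error:", (w - w_deq).abs().max().item())
+```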
+
+## Quantising your own models
+
+If you would prefer to generate your own w8a8 quantised models, the recipe below is provided as an example. This is an optional activity and not a core part of this Learning Path, as it can take several hours to run.
+
+Install the required packages before running the quantisation script:
+```bash
+pip install compressed-tensors==0.14.0.1
+pip install llmcompressor==0.10.0.1
+pip install datasets==4.6.0
+```
+
+Create a file named w8a8_quant.py containing:
+```python
+from transformers import AutoModelForCausalLM, AutoTokenizer
+from datasets import load_dataset
+from llmcompressor import oneshot
+from llmcompressor.modifiers.quantization import GPTQModifier
+from compressed_tensors.quantization import QuantizationType, QuantizationStrategy
+
+model_id = "meta-llama/Meta-Llama-3.1-8B"
+
+# Calibration settings: more samples and longer sequences improve
+# calibration quality at the cost of runtime
+num_samples = 256
+max_seq_len = 4096
+
+tokenizer = AutoTokenizer.from_pretrained(model_id)
+
+def preprocess_fn(example):
+    return {"text": example["text"]}
+
+# Load the calibration dataset and take a random subset
+ds = load_dataset("neuralmagic/LLM_compression_calibration", split="train")
+ds = ds.shuffle().select(range(num_samples))
+ds = ds.map(preprocess_fn)
+
+# w8a8 scheme: static per-channel INT8 weights and dynamic per-token INT8 activations
+scheme = {
+    "targets": ["Linear"],
+    "weights": {
+        "num_bits": 8,
+        "type": QuantizationType.INT,
+        "strategy": QuantizationStrategy.CHANNEL,
+        "symmetric": True,
+        "dynamic": False,
+        "group_size": None,
+    },
+    "input_activations": {
+        "num_bits": 8,
+        "type": QuantizationType.INT,
+        "strategy": QuantizationStrategy.TOKEN,
+        "dynamic": True,
+        "symmetric": False,
+        "observer": None,
+    },
+    "output_activations": None,
+}
+
+# GPTQ quantises the linear layers; the lm_head is kept in full precision
+recipe = GPTQModifier(
+    targets="Linear",
+    config_groups={"group_0": scheme},
+    ignore=["lm_head"],
+    dampening_frac=0.01,
+    block_size=512,
+)
+
+model = AutoModelForCausalLM.from_pretrained(
+    model_id,
+    device_map="auto",
+    trust_remote_code=True,
+)
+
+# Apply the quantisation recipe in one shot using the calibration data
+oneshot(
+    model=model,
+    dataset=ds,
+    recipe=recipe,
+    max_seq_length=max_seq_len,
+    num_calibration_samples=num_samples,
+)
+model.save_pretrained("Meta-Llama-3.1-8B-quantized.w8a8")
+```
+
+Run the script:
+```bash
+python w8a8_quant.py
+```
+
+When this has completed, you will need to copy the tokeniser-specific files from the original model before you can run inference on your quantised model:
+
+```bash
+cp ~/.cache/huggingface/hub/models--meta-llama--Meta-Llama-3.1-8B/snapshots/*/*token* Meta-Llama-3.1-8B-quantized.w8a8/
+```
diff --git a/content/learning-paths/servers-and-cloud-computing/vllm-benchmark-quantisation/3-run-inference.md b/content/learning-paths/servers-and-cloud-computing/vllm-benchmark-quantisation/3-run-inference.md
new file mode 100644
index 0000000000..addfec7d19
--- /dev/null
+++ b/content/learning-paths/servers-and-cloud-computing/vllm-benchmark-quantisation/3-run-inference.md
@@ -0,0 +1,148 @@
+---
+title: Run inference with vLLM
+weight: 4
+
+### FIXED, DO NOT MODIFY
+layout: learningpathall
+---
+
+## Run inference on Llama 3.1 8B
+
+You will use vLLM to serve an OpenAI-compatible API and use it to run inference on Llama 3.1 8B. This demonstrates that the local environment is set up correctly.
+
+Start vLLM's OpenAI-compatible API server using Llama 3.1 8B, and leave it running while you use a second terminal for the client commands:
+```bash
+vllm serve meta-llama/Llama-3.1-8B
+```
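+
+Once the server reports that it is ready, you can sanity-check the endpoint with a plain HTTP request before writing any Python. This assumes the server is listening on the default port of 8000:
+```bash
+curl http://localhost:8000/v1/completions \
+  -H "Content-Type: application/json" \
+  -d '{"model": "meta-llama/Llama-3.1-8B", "prompt": "The capital of France is", "max_tokens": 8}'
+```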
+
+Next, create a test script that sends a request to the server using the OpenAI Python library. Copy the script below into a file named llama_test.py:
+
+```python
+import time
+from openai import OpenAI
+from transformers import AutoTokenizer
+
+# vLLM's OpenAI-compatible server
+client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
+
+model = "meta-llama/Llama-3.1-8B"  # model being served by vLLM
+
+# Define a chat template for the model
+llama3_template = "{% set loop_messages = messages %}{% for message in loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n'+ message['content'] | trim + '<|eot_id|>' %}{% if loop.first and message['role'] != 'system' %}{{ '<|start_header_id|>system<|end_header_id|>\n\n'+ 'You are a helpful assistant.' + '<|eot_id|>' }}{% endif %}{{ content }}{% endfor %}{{ '<|start_header_id|>assistant<|end_header_id|>\n\n' }}"
+
+# Define your prompt
+message = [{"role": "user", "content": "Explain Big O notation with two examples."}]
+
+def run(prompt):
+    resp = client.completions.create(
+        model=model,
+        prompt=prompt,
+        max_tokens=128,  # the maximum number of tokens that can be generated in the completion
+    )
+    return resp.choices[0].text
+
+def main():
+    t0 = time.time()
+
+    # Apply the chat template to the prompt before sending it to the server
+    tokenizer = AutoTokenizer.from_pretrained(model)
+    tokenizer.chat_template = llama3_template
+    prompt = tokenizer.apply_chat_template(message, tokenize=False)
+    result = run(prompt)
+
+    print(f"\n=== Output ===\n{result}\n")
+    print(f"Completed in: {time.time() - t0:.2f}s")
+
+if __name__ == "__main__":
+    main()
+```
+
+Now run the script with:
+```bash
+python llama_test.py
+```
+
+This returns the text generated by the model from your prompt. In the server logs you can see the throughput, measured in tokens per second.
+
+You can do the same for the pre-quantised model loaded directly from Hugging Face. Stop the running server with Ctrl+C, then start it again with the quantised model:
+```bash
+vllm serve RedHatAI/Meta-Llama-3.1-8B-quantized.w8a8
+```
+
+Update your test script to use the quantised model:
+```python
+model = "RedHatAI/Meta-Llama-3.1-8B-quantized.w8a8"
+```
+
+Run inference on the quantised model:
+```bash
+python llama_test.py
+```
+
+You have now run inference using both the non-quantised and quantised Llama 3.1 8B models!
+
+## Run inference on Whisper
+
+You will use a similar approach to run inference on the Whisper models. Stop the Llama server, install the vLLM audio extras, and then start vLLM's OpenAI-compatible API server using Whisper large-v3:
+```bash
+pip install "vllm[audio]"
+
+vllm serve openai/whisper-large-v3
+```
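+
+Once the Whisper server is ready, you can again sanity-check it with curl. This is a sketch that assumes a local audio file named sample.wav and the default port of 8000; vLLM exposes the OpenAI-style transcriptions route for Whisper models:
+```bash
+curl http://localhost:8000/v1/audio/transcriptions \
+  -F model=openai/whisper-large-v3 \
+  -F file=@sample.wav \
+  -F language=en
+```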
+
+Next, create a test script that sends a request with an audio file to the server using the OpenAI library. Copy the script below into a file named whisper_test.py:
+
+```python
+import time
+from openai import OpenAI
+from vllm.assets.audio import AudioAsset
+
+# vLLM's OpenAI-compatible server
+client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
+
+model = "openai/whisper-large-v3"  # model being served by vLLM
+
+# This uses one of vLLM's bundled sample assets; you can replace it
+# with the path to an audio file of your choosing
+audio_filepath = str(AudioAsset("winning_call").get_local_path())
+
+def transcribe_audio():
+    with open(audio_filepath, "rb") as audio:
+        transcription = client.audio.transcriptions.create(
+            model=model,
+            file=audio,
+            language="en",
+            response_format="json",
+            temperature=0.0,
+        )
+    return transcription.text
+
+def main():
+    t0 = time.time()
+    out = transcribe_audio()
+    print(f"\n=== Output ===\n{out}\n")
+    print(f"Completed in: {time.time() - t0:.2f}s")
+
+if __name__ == "__main__":
+    main()
+```
+
+Now run the script with:
+```bash
+python whisper_test.py
+```
+
+You can do the same for the pre-quantised model loaded directly from Hugging Face. Stop the running server, then start it with the quantised model:
+```bash
+vllm serve RedHatAI/whisper-large-v3-quantized.w8a8
+```
+
+Update your test script to use the quantised model:
+```python
+model = "RedHatAI/whisper-large-v3-quantized.w8a8"
+```
+
+Run inference on the quantised model:
+```bash
+python whisper_test.py
+```
+
+You now have the quantised and non-quantised Llama and Whisper models on your local machine! You have installed vLLM and demonstrated that you can run inference on your models. Now you can move on to benchmarking the Llama models and comparing their performance.
diff --git a/content/learning-paths/servers-and-cloud-computing/vllm-benchmark-quantisation/4-benchmarking.md b/content/learning-paths/servers-and-cloud-computing/vllm-benchmark-quantisation/4-benchmarking.md
new file mode 100644
index 0000000000..b07fd27a75
--- /dev/null
+++ b/content/learning-paths/servers-and-cloud-computing/vllm-benchmark-quantisation/4-benchmarking.md
@@ -0,0 +1,116 @@
+---
+title: Evaluate Llama 3.1 8B throughput and accuracy
+weight: 5
+
+### FIXED, DO NOT MODIFY
+layout: learningpathall
+---
+
+## Llama performance benchmarking
+
+You will use the vLLM bench CLI to measure the throughput of the models. First, install the required library, then start the server in the background and keep it running:
+```bash
+pip install "vllm[bench]"
+
+vllm serve meta-llama/Llama-3.1-8B \
+  --max-num-batched-tokens 8192 \
+  --max-model-len 4096 &
+```
+
+vLLM uses dynamic continuous batching to maximise hardware utilisation. Two key parameters govern this process:
+
+- `--max-model-len` is the maximum sequence length (number of tokens per request). No single prompt or generated sequence can exceed this limit. The value here is large enough for the selected model and dataset.
+- `--max-num-batched-tokens` is the total number of tokens processed in one batch across all requests. The sum of input and output tokens from all concurrent requests must stay within this limit. The value here, combined with the concurrency limit shown below, gives optimal throughput and latency.
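+
+Because the server was started in the background, it needs time to load the model before it can accept requests. One way to confirm it is ready (assuming the default port of 8000) is to query the models endpoint until it responds:
+```bash
+curl http://localhost:8000/v1/models
+```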
+
+Now that the server is running, you can benchmark it using the public ShareGPT dataset:
+```bash
+wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
+
+vllm bench serve \
+  --model meta-llama/Llama-3.1-8B \
+  --dataset-name sharegpt \
+  --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json \
+  --num-prompts 256 \
+  --request-rate 8 \
+  --max-concurrency 10 \
+  --top-p 1 --temperature 0 \
+  --percentile-metrics ttft,tpot \
+  --metric-percentiles 50,95,99 \
+  --save-result --result-dir bench_out --result-filename serve.json
+```
+This benchmark uses greedy decoding (`--top-p 1 --temperature 0`), which selects the highest-probability token at each step instead of sampling from a distribution of likely tokens.
+
+The results of interest are request throughput, output token throughput, total token throughput, TTFT (time to first token), and TPOT (time per output token). The target here is a mean TPOT below 100 ms, so the maximum concurrency should be set as high as possible while still meeting that requirement.
+
+Repeat with the quantised model. The smaller model allows you to increase the concurrency, and you should see a significant improvement in throughput (more tokens per second). Stop the BF16 server first (for example with `kill %1` if it is your only background job), then run:
+```bash
+vllm serve RedHatAI/Meta-Llama-3.1-8B-quantized.w8a8 \
+  --max-num-batched-tokens 8192 \
+  --max-model-len 4096 &
+
+vllm bench serve \
+  --model RedHatAI/Meta-Llama-3.1-8B-quantized.w8a8 \
+  --dataset-name sharegpt \
+  --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json \
+  --num-prompts 256 \
+  --request-rate 8 \
+  --max-concurrency 24 \
+  --top-p 1 --temperature 0 \
+  --percentile-metrics ttft,tpot \
+  --metric-percentiles 50,95,99 \
+  --save-result --result-dir bench_out --result-filename serve_int8.json
+```
+Note that a different result filename is used here so that the BF16 results in serve.json are not overwritten.
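+
+The benchmark summary is printed to the terminal, and because `--save-result` is passed, the same numbers are written as JSON under bench_out/. A quick way to compare the two runs is sketched below; the exact key names in the JSON can vary between vLLM versions, so this simply prints every scalar metric rather than assuming particular keys:
+```bash
+python3 - <<'EOF'
+import json
+
+# Load both result files saved by vllm bench serve
+for name in ("bench_out/serve.json", "bench_out/serve_int8.json"):
+    with open(name) as f:
+        data = json.load(f)
+    # Keep only scalar metrics so the comparison stays readable
+    scalars = {k: v for k, v in data.items() if isinstance(v, (int, float))}
+    print(name)
+    for key, value in sorted(scalars.items()):
+        print(f"  {key}: {value}")
+EOF
+```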
+
+## Llama accuracy benchmarking
+
+The lm-evaluation-harness is the standard way to measure model accuracy across common academic benchmarks (for example [MMLU](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/lm_eval/tasks/mmlu), [HellaSwag](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/lm_eval/tasks/hellaswag), and [GSM8K](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/lm_eval/tasks/gsm8k)) and runtimes (such as [Hugging Face](https://github.com/huggingface/transformers), [vLLM](https://github.com/vllm-project/vllm), and [llama.cpp](https://github.com/ggml-org/llama.cpp)). In this section, you'll run accuracy tests for both the BF16 and INT8 deployments of your Llama models served by vLLM on Arm-based servers.
+
+You will:
+- Install the lm-eval harness with vLLM support
+- Run benchmarks on a BF16 model and an INT8 (weight-quantised) model
+- Interpret key metrics and compare quality across precisions
+
+First, install the required libraries for benchmarking with lm_eval:
+```bash
+pip install ray "lm_eval[vllm]"
+```
+
+Stop any vLLM servers still running in the background before starting the accuracy runs, because lm_eval loads the model in its own process. You can use a limited number of prompts to validate your environment by appending `--limit 10` to the command below. A proper accuracy benchmark should be run over the whole dataset, though this can be time consuming and is considered optional for this Learning Path. The benchmark is also slower the first time through, as the test data associated with your selected tasks is downloaded:
+```bash
+lm_eval --model vllm --model_args pretrained=meta-llama/Llama-3.1-8B,dtype=bfloat16,max_model_len=4096 --tasks mmlu,gsm8k --batch_size auto
+```
+
+The [MMLU task](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/lm_eval/tasks/mmlu) is a set of multiple-choice questions split into subject-area subgroups. It measures the ability of an LLM to understand questions and select the right answers.
+
+The [GSM8K task](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/lm_eval/tasks/gsm8k) is a set of math problems that tests an LLM's mathematical reasoning ability.
+
+Repeat with the quantised model:
+```bash
+lm_eval --model vllm --model_args pretrained=RedHatAI/Meta-Llama-3.1-8B-quantized.w8a8,dtype=bfloat16,max_model_len=4096 --tasks mmlu,gsm8k --batch_size auto
+```
+
+Expect INT8 inference to show a slight accuracy drop compared to BF16. For reference results and expected accuracy differences, see the Red Hat model card: https://huggingface.co/RedHatAI/Meta-Llama-3.1-8B-quantized.w8a8#accuracy
+
+## Summary of results
+
+The benchmarking results you generate will depend on the hardware you are using. The values below, provided as an example only, were measured on a 96-core machine with 128-bit SVE and 192 GB of RAM. Using the INT8 quantised Llama 3.1 8B model, throughput improves by more than 2x at a cost of up to ~8% in accuracy.
+
+### Throughput ratios: INT8/BF16
+| Requests/s | Output Tokens/s | Total Tokens/s |
+| -------- | -------- | -------- |
+| 2.7x | 2.2x | 2.5x |
+
+### Accuracy recovery: INT8/BF16
+| MMLU | GSM8K |
+| -------- | -------- |
+| 97% | 92% |
+
+## Next steps
+
+Now that you have your environment set up for running inference, benchmarking, and quantising different models, you can experiment with:
+- Benchmarking accuracy with different tasks
+- Different quantisation techniques
+- Different models
+
+Your results will allow you to balance accuracy and performance when making decisions about model deployment.
diff --git a/content/learning-paths/servers-and-cloud-computing/vllm-benchmark-quantisation/_index.md b/content/learning-paths/servers-and-cloud-computing/vllm-benchmark-quantisation/_index.md
new file mode 100644
index 0000000000..c07f0ea7bc
--- /dev/null
+++ b/content/learning-paths/servers-and-cloud-computing/vllm-benchmark-quantisation/_index.md
@@ -0,0 +1,66 @@
+---
+title: Run vLLM inference with quantised models and benchmark on Arm servers
+
+minutes_to_complete: 60
+
+who_is_this_for: This is an introductory topic for developers interested in running inference on quantised models. This Learning Path shows you how to run inference on Llama 3.1 8B and Whisper, with and without quantisation, and benchmark Llama performance and accuracy with vLLM's bench CLI and the LM Evaluation Harness.
+
+learning_objectives:
+    - Install a recent release of vLLM
+    - Run both quantised and non-quantised variants of Llama 3.1 8B and Whisper using vLLM
+    - Evaluate and compare model performance and accuracy using vLLM's bench CLI and the LM Evaluation Harness
+
+prerequisites:
+    - An Arm-based Linux server (Ubuntu 22.04+ recommended) with a minimum of 32 vCPUs, 96 GB RAM, and 64 GB free disk space
+    - Python 3.12 and basic familiarity with Hugging Face Transformers and quantisation
+
+author: Anna Mayne, Nikhil Gupta, Marek Michałowski
+
+### Tags
+skilllevels: Introductory
+subjects: ML
+armips:
+    - Neoverse
+tools_software_languages:
+    - vLLM
+    - LM Evaluation Harness
+    - LLM
+    - Generative AI
+    - Python
+    - PyTorch
+    - Hugging Face
+operatingsystems:
+    - Linux
+
+
+
+further_reading:
+    - resource:
+        title: vLLM Documentation
+        link: https://docs.vllm.ai/
+        type: documentation
+    - resource:
+        title: vLLM GitHub Repository
+        link: https://github.com/vllm-project/vllm
+        type: website
+    - resource:
+        title: Hugging Face Model Hub
+        link: https://huggingface.co/models
+        type: website
+    - resource:
+        title: Build and Run vLLM on Arm Servers
+        link: /learning-paths/servers-and-cloud-computing/vllm/
+        type: website
+    - resource:
+        title: LM Evaluation Harness (GitHub)
+        link: https://github.com/EleutherAI/lm-evaluation-harness
+        type: website
+
+
+
+### FIXED, DO NOT MODIFY
+# ================================================================================
+weight: 1 # _index.md always has weight of 1 to order correctly
+layout: "learningpathall" # All files under learning paths have this same wrapper
+learning_path_main_page: "yes" # This should be surfaced when looking for related content. Only set for _index.md of learning path content.
+---
diff --git a/content/learning-paths/servers-and-cloud-computing/vllm-benchmark-quantisation/_next-steps.md b/content/learning-paths/servers-and-cloud-computing/vllm-benchmark-quantisation/_next-steps.md
new file mode 100644
index 0000000000..727b395ddd
--- /dev/null
+++ b/content/learning-paths/servers-and-cloud-computing/vllm-benchmark-quantisation/_next-steps.md
@@ -0,0 +1,8 @@
+---
+# ================================================================================
+# FIXED, DO NOT MODIFY THIS FILE
+# ================================================================================
+weight: 21 # The weight controls the order of the pages. _index.md always has weight 1.
+title: "Next Steps" # Always the same, html page title.
+layout: "learningpathall" # All files under learning paths have this same wrapper for Hugo processing.
+---