I was trying to run tree-based speculative decoding via the server_gpu_experiments script in the specinfer-ae repository using OPT models. However, the process gets stuck partway through and makes no further progress, as shown in the output below. My GPU environment and the script I used are also included.
Has anyone encountered or solved this issue before?
Additionally, are there any other examples of running tree-based speculative decoding with a prompt file similar to chatgpt.json?
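For context, here is a hedged sketch of how one might build a custom prompt file. I'm assuming chatgpt.json is a flat JSON array of prompt strings, one per request; the filename and format details below are guesses based on how the script passes the file to spec_infer, not confirmed from the repository:

```python
import json

# Hypothetical minimal prompt file in the shape I assume chatgpt.json uses:
# a flat JSON array of prompt strings, one entry per request.
prompts = [
    "Give three tips for staying healthy.",
    "Explain quantum computing in simple terms.",
]

# Write the prompts out, then read them back to confirm the round trip.
with open("my_prompts.json", "w") as f:
    json.dump(prompts, f, indent=2)

with open("my_prompts.json") as f:
    loaded = json.load(f)

print(len(loaded))  # number of prompts in the file
```

If the assumption holds, the resulting file could be passed to spec_infer via the same `-prompt` flag used in the script below.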
GPU Information:
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03 Driver Version: 535.54.03 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------|
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA RTX A6000 On | 00000000:B1:00.0 Off | Off |
| 30% 32C P8 26W / 300W | 35MiB / 49140MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
Script: I am running the following bash script:
#! /usr/bin/env bash
set -e
set -x
# Cd into directory holding this script
cd "${BASH_SOURCE[0]%/*}"
export UCX_DIR="$PWD/ucx-1.15.0/install"
export PATH=$UCX_DIR/bin:$PATH
export LD_LIBRARY_PATH=$UCX_DIR/lib:$LD_LIBRARY_PATH
./download_dataset.sh
./download_models.sh
batch_sizes=( 16 )
mkdir -p ./FlexFlow/inference/output
ncpus=1
ngpus=1
fsize=21890
zsize=80000
max_sequence_length=128
ssm_model_name="facebook/opt-125m"
llm_model_name="facebook/opt-6.7b"
for bs in "${batch_sizes[@]}"
do
./FlexFlow/build/inference/spec_infer/spec_infer -ll:cpu $ncpus -ll:util $ncpus -ll:gpu $ngpus -ll:fsize $fsize -ll:zsize $zsize -llm-model $llm_model_name -ssm-model $ssm_model_name -prompt ./FlexFlow/inference/prompt/chatgpt_$bs.json --max-requests-per-batch $bs --max-sequence-length $max_sequence_length
done
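The trace below shows download_dataset.sh running an inline `python -` step right after fetching chatgpt.json, presumably to produce the per-batch-size chatgpt_16.json that the loop above consumes. A hedged reconstruction of what that step might do, assuming it simply slices the first `bs` prompts out of the downloaded array (the stand-in file and the slicing logic are my assumptions):

```python
import json

# Stand-in for the downloaded chatgpt.json (the real file holds many prompts).
sample = [f"prompt {i}" for i in range(32)]
with open("chatgpt.json", "w") as f:
    json.dump(sample, f)

# Hypothetical reconstruction of the inline `python -` step in
# download_dataset.sh: slice the first <bs> prompts into chatgpt_<bs>.json
# for each batch size the experiment script iterates over.
for bs in [16]:
    with open("chatgpt.json") as f:
        prompts = json.load(f)
    with open(f"chatgpt_{bs}.json", "w") as f:
        json.dump(prompts[:bs], f, indent=2)

with open("chatgpt_16.json") as f:
    print(len(json.load(f)))  # 16 prompts in the batch file
```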
Output:
$ ./server_gpu_experiments.sh
+ cd /workspace/.
+ export UCX_DIR=/workspace/ucx-1.15.0/install
+ UCX_DIR=/workspace/ucx-1.15.0/install
+ export PATH=/workspace/ucx-1.15.0/install/bin:/root/.cargo/bin:/usr/local/nvm/versions/node/v16.20.2/bin:/usr/local/lib/python3.10/dist-packages/torch_tensorrt/bin:/usr/local/mpi/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/local/ucx/bin:/opt/tensorrt/bin
+ PATH=/workspace/ucx-1.15.0/install/bin:/root/.cargo/bin:/usr/local/nvm/versions/node/v16.20.2/bin:/usr/local/lib/python3.10/dist-packages/torch_tensorrt/bin:/usr/local/mpi/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/local/ucx/bin:/opt/tensorrt/bin
+ export LD_LIBRARY_PATH=/workspace/ucx-1.15.0/install/lib:/usr/local/cuda/compat/lib.real:/usr/local/lib/python3.10/dist-packages/torch/lib:/usr/local/lib/python3.10/dist-packages/torch_tensorrt/lib:/usr/local/cuda/compat/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
+ LD_LIBRARY_PATH=/workspace/ucx-1.15.0/install/lib:/usr/local/cuda/compat/lib.real:/usr/local/lib/python3.10/dist-packages/torch/lib:/usr/local/lib/python3.10/dist-packages/torch_tensorrt/lib:/usr/local/cuda/compat/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
+ ./download_dataset.sh
+ cd .
+ cd FlexFlow
+ rm -rf inference/prompt
+ mkdir -p inference/prompt
+ cd inference/prompt
+ wget https://specinfer.s3.us-east-2.amazonaws.com/prompts/chatgpt.json
--2024-10-16 04:35:01-- https://specinfer.s3.us-east-2.amazonaws.com/prompts/chatgpt.json
Resolving specinfer.s3.us-east-2.amazonaws.com (specinfer.s3.us-east-2.amazonaws.com)... 3.5.129.104, 3.5.128.25, 52.219.109.114, ...
Connecting to specinfer.s3.us-east-2.amazonaws.com (specinfer.s3.us-east-2.amazonaws.com)|3.5.129.104|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 40769 (40K) [application/json]
Saving to: ‘chatgpt.json’
0K .......... .......... .......... ......... 100% 245K=0.2s
2024-10-16 04:35:02 (245 KB/s) - ‘chatgpt.json’ saved [40769/40769]
+ python -
+ rm chatgpt.json
+ ./download_models.sh
+ cd .
+ export UCX_DIR=/workspace/ucx-1.15.0/install
+ UCX_DIR=/workspace/ucx-1.15.0/install
+ export PATH=/workspace/ucx-1.15.0/install/bin:/workspace/ucx-1.15.0/install/bin:/root/.cargo/bin:/usr/local/nvm/versions/node/v16.20.2/bin:/usr/local/lib/python3.10/dist-packages/torch_tensorrt/bin:/usr/local/mpi/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/local/ucx/bin:/opt/tensorrt/bin
+ PATH=/workspace/ucx-1.15.0/install/bin:/workspace/ucx-1.15.0/install/bin:/root/.cargo/bin:/usr/local/nvm/versions/node/v16.20.2/bin:/usr/local/lib/python3.10/dist-packages/torch_tensorrt/bin:/usr/local/mpi/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/local/ucx/bin:/opt/tensorrt/bin
+ export LD_LIBRARY_PATH=/workspace/ucx-1.15.0/install/lib:/workspace/ucx-1.15.0/install/lib:/usr/local/cuda/compat/lib.real:/usr/local/lib/python3.10/dist-packages/torch/lib:/usr/local/lib/python3.10/dist-packages/torch_tensorrt/lib:/usr/local/cuda/compat/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
+ LD_LIBRARY_PATH=/workspace/ucx-1.15.0/install/lib:/workspace/ucx-1.15.0/install/lib:/usr/local/cuda/compat/lib.real:/usr/local/lib/python3.10/dist-packages/torch/lib:/usr/local/lib/python3.10/dist-packages/torch_tensorrt/lib:/usr/local/cuda/compat/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
++ date +%s
+ start_time=1729053302
+ python ./FlexFlow/inference/utils/download_hf_model.py --half-precision-only facebook/opt-125m
/usr/local/lib/python3.10/dist-packages/torch/__init__.py:613: UserWarning: torch.set_default_tensor_type() is deprecated as of PyTorch 2.1, please use torch.set_default_dtype() and torch.set_default_device() as alternatives. (Triggered internally at /opt/pytorch/pytorch/torch/csrc/tensor/python_tensor.cpp:451.)
_C._set_default_tensor_type(t)
Creating directory /root/.cache/flexflow/weights/facebook/opt-125m/half-precision (if it doesn't exist)...
Loading 'facebook/opt-125m' model weights from the cache...
Loading tokenizer...
Loading 'facebook/opt-125m' tokenizer from the cache...
Creating directory /root/.cache/flexflow/configs/facebook/opt-125m (if it doesn't exist)...
Saving facebook/opt-125m configs to file /root/.cache/flexflow/configs/facebook/opt-125m/config.json...
+ python ./FlexFlow/inference/utils/download_hf_model.py --half-precision-only facebook/opt-6.7b
/usr/local/lib/python3.10/dist-packages/torch/__init__.py:613: UserWarning: torch.set_default_tensor_type() is deprecated as of PyTorch 2.1, please use torch.set_default_dtype() and torch.set_default_device() as alternatives. (Triggered internally at /opt/pytorch/pytorch/torch/csrc/tensor/python_tensor.cpp:451.)
_C._set_default_tensor_type(t)
Creating directory /root/.cache/flexflow/weights/facebook/opt-6.7b/half-precision (if it doesn't exist)...
Loading 'facebook/opt-6.7b' model weights from the cache...
Loading tokenizer...
Loading 'facebook/opt-6.7b' tokenizer from the cache...
Creating directory /root/.cache/flexflow/configs/facebook/opt-6.7b (if it doesn't exist)...
Saving facebook/opt-6.7b configs to file /root/.cache/flexflow/configs/facebook/opt-6.7b/config.json...
++ date +%s
+ end_time=1729053310
+ execution_time=8
+ echo 'Total download time: 8 seconds'
Total download time: 8 seconds
+ batch_sizes=(16)
+ mkdir -p ./FlexFlow/inference/output
++ date +%s
+ start_time=1729053310
+ ncpus=1
+ ngpus=1
+ fsize=21890
+ zsize=80000
+ max_sequence_length=128
+ ssm_model_name=facebook/opt-125m
+ llm_model_name=facebook/opt-6.7b
+ for bs in "${batch_sizes[@]}"
+ ./FlexFlow/build/inference/spec_infer/spec_infer -ll:cpu 1 -ll:util 1 -ll:gpu 1 -ll:fsize 21890 -ll:zsize 80000 -llm-model facebook/opt-6.7b -ssm-model facebook/opt-125m -prompt ./FlexFlow/inference/prompt/chatgpt_16.json --max-requests-per-batch 16 --max-sequence-length 128
[1729053311.514464] [36506f48411f:1356 :0] parser.c:2036 UCX WARN unused environment variable: UCX_DIR (maybe: UCX_TLS?)
[1729053311.514464] [36506f48411f:1356 :0] parser.c:2036 UCX WARN (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
...
Num of SSMs: 1
[0 - 7f0eec038000] 0.969221 {3}{RequestManager}: [1005740]New request tokens: 2 45714 10 5313 15923 59 2512 2190 13 110 659 4
[0]45714
[1]10
[2]5313
[3]15923
[4]59
[5]2512
[6]2190
[7]13
[8]110
[9]659
[10]4
Num of SSMs: 1
[0 - 7f0eec038000] 0.969258 {3}{RequestManager}: [1005741]New request tokens: 2 45714 10 5313 15923 59 2512 2190 13 110 659 4
[0]45714
[1]10
[2]5313
[3]15923
[4]59
[5]2512
[6]2190
[7]13
[8]110
[9]659
[10]4
Num of SSMs: 1
[0 - 7f0eec038000] 0.969298 {3}{RequestManager}: [1005742]New request tokens: 2 45714 10 5313 15923 59 2512 2190 13 110 659 4
[0]45714
[1]10
[2]5313
[3]15923
[4]59
[5]2512
[6]2190
[7]13
[8]110
[9]659
[10]4
Num of SSMs: 1
[0 - 7f0eec038000] 0.969334 {3}{RequestManager}: [1005743]New request tokens: 2 45714 10 5313 15923 59 2512 2190 13 110 659 4