Stuck During Tree-Based Speculative Decoding with OPT Model #4

@SeungjaeLim

Description

I am trying to run tree-based speculative decoding via the server_gpu_experiments.sh script in the specinfer-ae repository, using OPT models. However, the process gets stuck partway through and makes no further progress, as shown in the output below. My GPU environment and the script I ran are also included.

Has anyone encountered or solved this issue before?
Additionally, are there any other examples of running tree-based speculative decoding with a prompt similar to chatgpt.json?
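For context, my working assumption (not verified against the repo, so please correct me) is that the prompt file is simply a JSON array of prompt strings, and that the chatgpt_<bs>.json naming matches what the experiment script passes to spec_infer via -prompt. If that's right, a custom prompt file could be generated like this:

```python
import json

# Hypothetical sketch: build a batch-sized prompt file in what I assume
# is the same format as chatgpt.json -- a JSON array of prompt strings.
# One prompt per request in the batch.
bs = 16
prompts = ["Tell me a story about a robot."] * bs
with open(f"custom_{bs}.json", "w") as f:
    json.dump(prompts, f)
```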

GPU Information:

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------|
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA RTX A6000               On  | 00000000:B1:00.0 Off |                  Off |
| 30%   32C    P8              26W / 300W |     35MiB / 49140MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                   

Script: I am running the following bash script:

#! /usr/bin/env bash
set -e
set -x

# Cd into directory holding this script
cd "${BASH_SOURCE[0]%/*}"

export UCX_DIR="$PWD/ucx-1.15.0/install"
export PATH=$UCX_DIR/bin:$PATH
export LD_LIBRARY_PATH=$UCX_DIR/lib:$LD_LIBRARY_PATH

./download_dataset.sh
./download_models.sh

batch_sizes=( 16 )

mkdir -p ./FlexFlow/inference/output

ncpus=1
ngpus=1
fsize=21890
zsize=80000
max_sequence_length=128
ssm_model_name="facebook/opt-125m"
llm_model_name="facebook/opt-6.7b"

for bs in "${batch_sizes[@]}"
do
    ./FlexFlow/build/inference/spec_infer/spec_infer -ll:cpu $ncpus -ll:util $ncpus -ll:gpu $ngpus -ll:fsize $fsize -ll:zsize $zsize -llm-model $llm_model_name -ssm-model $ssm_model_name -prompt ./FlexFlow/inference/prompt/chatgpt_$bs.json --max-requests-per-batch $bs --max-sequence-length $max_sequence_length
done
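As a side note, in case the hang is memory-related: my reading of Legion's -ll:fsize and -ll:zsize flags is that both are in MB (fsize for GPU framebuffer memory, zsize for pinned zero-copy host memory). A quick sanity check of my own, with the A6000 capacity hard-coded from the nvidia-smi output above:

```python
# Sanity-check the Legion memory flags from the script against the
# hardware. Unit interpretation (MB) is my assumption from Legion's
# -ll:fsize/-ll:zsize flags; the GPU total comes from nvidia-smi above.
fsize_mb = 21890       # requested GPU framebuffer memory
zsize_mb = 80000       # requested pinned (zero-copy) host memory
gpu_total_mib = 49140  # NVIDIA RTX A6000, per nvidia-smi

assert fsize_mb < gpu_total_mib, "fsize exceeds GPU memory"
print(f"fsize {fsize_mb} MB fits in {gpu_total_mib} MiB of GPU memory")
print(f"zsize {zsize_mb} MB must also fit in free host RAM (check free -m)")
```

The fsize request fits the GPU comfortably, but zsize asks for roughly 80 GB of pinned host memory, so I'd be curious whether others hit this hang on machines with less host RAM.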

Output:

$  ./server_gpu_experiments.sh 
+ cd /workspace/.
+ export UCX_DIR=/workspace/ucx-1.15.0/install
+ UCX_DIR=/workspace/ucx-1.15.0/install
+ export PATH=/workspace/ucx-1.15.0/install/bin:/root/.cargo/bin:/usr/local/nvm/versions/node/v16.20.2/bin:/usr/local/lib/python3.10/dist-packages/torch_tensorrt/bin:/usr/local/mpi/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/local/ucx/bin:/opt/tensorrt/bin
+ PATH=/workspace/ucx-1.15.0/install/bin:/root/.cargo/bin:/usr/local/nvm/versions/node/v16.20.2/bin:/usr/local/lib/python3.10/dist-packages/torch_tensorrt/bin:/usr/local/mpi/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/local/ucx/bin:/opt/tensorrt/bin
+ export LD_LIBRARY_PATH=/workspace/ucx-1.15.0/install/lib:/usr/local/cuda/compat/lib.real:/usr/local/lib/python3.10/dist-packages/torch/lib:/usr/local/lib/python3.10/dist-packages/torch_tensorrt/lib:/usr/local/cuda/compat/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
+ LD_LIBRARY_PATH=/workspace/ucx-1.15.0/install/lib:/usr/local/cuda/compat/lib.real:/usr/local/lib/python3.10/dist-packages/torch/lib:/usr/local/lib/python3.10/dist-packages/torch_tensorrt/lib:/usr/local/cuda/compat/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
+ ./download_dataset.sh
+ cd .
+ cd FlexFlow
+ rm -rf inference/prompt
+ mkdir -p inference/prompt
+ cd inference/prompt
+ wget https://specinfer.s3.us-east-2.amazonaws.com/prompts/chatgpt.json
--2024-10-16 04:35:01--  https://specinfer.s3.us-east-2.amazonaws.com/prompts/chatgpt.json
Resolving specinfer.s3.us-east-2.amazonaws.com (specinfer.s3.us-east-2.amazonaws.com)... 3.5.129.104, 3.5.128.25, 52.219.109.114, ...
Connecting to specinfer.s3.us-east-2.amazonaws.com (specinfer.s3.us-east-2.amazonaws.com)|3.5.129.104|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 40769 (40K) [application/json]
Saving to: ‘chatgpt.json’

     0K .......... .......... .......... .........            100%  245K=0.2s

2024-10-16 04:35:02 (245 KB/s) - ‘chatgpt.json’ saved [40769/40769]

+ python -
+ rm chatgpt.json
+ ./download_models.sh
+ cd .
+ export UCX_DIR=/workspace/ucx-1.15.0/install
+ UCX_DIR=/workspace/ucx-1.15.0/install
+ export PATH=/workspace/ucx-1.15.0/install/bin:/workspace/ucx-1.15.0/install/bin:/root/.cargo/bin:/usr/local/nvm/versions/node/v16.20.2/bin:/usr/local/lib/python3.10/dist-packages/torch_tensorrt/bin:/usr/local/mpi/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/local/ucx/bin:/opt/tensorrt/bin
+ PATH=/workspace/ucx-1.15.0/install/bin:/workspace/ucx-1.15.0/install/bin:/root/.cargo/bin:/usr/local/nvm/versions/node/v16.20.2/bin:/usr/local/lib/python3.10/dist-packages/torch_tensorrt/bin:/usr/local/mpi/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/local/ucx/bin:/opt/tensorrt/bin
+ export LD_LIBRARY_PATH=/workspace/ucx-1.15.0/install/lib:/workspace/ucx-1.15.0/install/lib:/usr/local/cuda/compat/lib.real:/usr/local/lib/python3.10/dist-packages/torch/lib:/usr/local/lib/python3.10/dist-packages/torch_tensorrt/lib:/usr/local/cuda/compat/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
+ LD_LIBRARY_PATH=/workspace/ucx-1.15.0/install/lib:/workspace/ucx-1.15.0/install/lib:/usr/local/cuda/compat/lib.real:/usr/local/lib/python3.10/dist-packages/torch/lib:/usr/local/lib/python3.10/dist-packages/torch_tensorrt/lib:/usr/local/cuda/compat/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
++ date +%s
+ start_time=1729053302
+ python ./FlexFlow/inference/utils/download_hf_model.py --half-precision-only facebook/opt-125m
/usr/local/lib/python3.10/dist-packages/torch/__init__.py:613: UserWarning: torch.set_default_tensor_type() is deprecated as of PyTorch 2.1, please use torch.set_default_dtype() and torch.set_default_device() as alternatives. (Triggered internally at /opt/pytorch/pytorch/torch/csrc/tensor/python_tensor.cpp:451.)
  _C._set_default_tensor_type(t)
Creating directory /root/.cache/flexflow/weights/facebook/opt-125m/half-precision (if it doesn't exist)...
Loading 'facebook/opt-125m' model weights from the cache...
Loading tokenizer...
Loading 'facebook/opt-125m' tokenizer from the cache...
Creating directory /root/.cache/flexflow/configs/facebook/opt-125m (if it doesn't exist)...
Saving facebook/opt-125m configs to file /root/.cache/flexflow/configs/facebook/opt-125m/config.json...
+ python ./FlexFlow/inference/utils/download_hf_model.py --half-precision-only facebook/opt-6.7b
/usr/local/lib/python3.10/dist-packages/torch/__init__.py:613: UserWarning: torch.set_default_tensor_type() is deprecated as of PyTorch 2.1, please use torch.set_default_dtype() and torch.set_default_device() as alternatives. (Triggered internally at /opt/pytorch/pytorch/torch/csrc/tensor/python_tensor.cpp:451.)
  _C._set_default_tensor_type(t)
Creating directory /root/.cache/flexflow/weights/facebook/opt-6.7b/half-precision (if it doesn't exist)...
Loading 'facebook/opt-6.7b' model weights from the cache...
Loading tokenizer...
Loading 'facebook/opt-6.7b' tokenizer from the cache...
Creating directory /root/.cache/flexflow/configs/facebook/opt-6.7b (if it doesn't exist)...
Saving facebook/opt-6.7b configs to file /root/.cache/flexflow/configs/facebook/opt-6.7b/config.json...
++ date +%s
+ end_time=1729053310
+ execution_time=8
+ echo 'Total download time: 8 seconds'
Total download time: 8 seconds
+ batch_sizes=(16)
+ mkdir -p ./FlexFlow/inference/output
++ date +%s
+ start_time=1729053310
+ ncpus=1
+ ngpus=1
+ fsize=21890
+ zsize=80000
+ max_sequence_length=128
+ ssm_model_name=facebook/opt-125m
+ llm_model_name=facebook/opt-6.7b
+ for bs in "${batch_sizes[@]}"
+ ./FlexFlow/build/inference/spec_infer/spec_infer -ll:cpu 1 -ll:util 1 -ll:gpu 1 -ll:fsize 21890 -ll:zsize 80000 -llm-model facebook/opt-6.7b -ssm-model facebook/opt-125m -prompt ./FlexFlow/inference/prompt/chatgpt_16.json --max-requests-per-batch 16 --max-sequence-length 128
[1729053311.514464] [36506f48411f:1356 :0]          parser.c:2036 UCX  WARN  unused environment variable: UCX_DIR (maybe: UCX_TLS?)
[1729053311.514464] [36506f48411f:1356 :0]          parser.c:2036 UCX  WARN  (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)

...

Num of SSMs: 1
[0 - 7f0eec038000]    0.969221 {3}{RequestManager}: [1005740]New request tokens: 2 45714 10 5313 15923 59 2512 2190 13 110 659 4
[0]45714
[1]10
[2]5313
[3]15923
[4]59
[5]2512
[6]2190
[7]13
[8]110
[9]659
[10]4
Num of SSMs: 1
[0 - 7f0eec038000]    0.969258 {3}{RequestManager}: [1005741]New request tokens: 2 45714 10 5313 15923 59 2512 2190 13 110 659 4
[0]45714
[1]10
[2]5313
[3]15923
[4]59
[5]2512
[6]2190
[7]13
[8]110
[9]659
[10]4
Num of SSMs: 1
[0 - 7f0eec038000]    0.969298 {3}{RequestManager}: [1005742]New request tokens: 2 45714 10 5313 15923 59 2512 2190 13 110 659 4
[0]45714
[1]10
[2]5313
[3]15923
[4]59
[5]2512
[6]2190
[7]13
[8]110
[9]659
[10]4
Num of SSMs: 1
[0 - 7f0eec038000]    0.969334 {3}{RequestManager}: [1005743]New request tokens: 2 45714 10 5313 15923 59 2512 2190 13 110 659 4
