
After moving the model and inputs to xpu, the model now runs on my Intel laptop, but the inference time is about 588 seconds, which is far too long for me. I suspect the GPU is not actually being used. May I ask what the problem is here? Thank you very much for any response.
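For reference, this is how I would check that the XPU device is visible at all (a minimal sketch; `torch.xpu` becomes available once `intel_extension_for_pytorch` is imported):

```python
import torch
import intel_extension_for_pytorch as ipex  # noqa: F401 - registers the 'xpu' device

# If this prints False, PyTorch cannot see the Intel GPU, which would point to
# a driver/oneAPI setup problem rather than the model code.
print(torch.xpu.is_available())
print(torch.xpu.device_count())
print(torch.xpu.get_device_name(0))
```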
The code is as follows:
```python
import torch
import intel_extension_for_pytorch as ipex
from bigdl.llm.transformers import AutoModelForCausalLM, AutoModel
from transformers import AutoTokenizer
import time
import numpy as np
from gpu_benchmark_util import BenchmarkWrapper

# model_path = r"D:\rag\test_api\Baichuan2-7B-Chat"  # Baichuan2 variant (not used here)
model_path = r"C:\Users\Administrator\yishuo\chatglm2-6b"

# Chinese RAG-style test prompt: a role description, a question, and a small
# inline knowledge base.
prompt = """ 你是human_prime2,你是一个高级智能实体,你融合了最先进的算法和深度学习网络,专为跨越星际的知识探索与智慧收集而设计。
你回答以下问题时必须跟哲学相结合,必须在15字内回答完,你会尽量参考知识库来回答。
以下是问题:请介绍钱.
以下是知识库:[{'对话': '什么是"帮费"?', '回复': '"帮费"是为中央各库采买物料时,为护送官员以及送部的饭食银拨配的额外款项。'}, {'对话': '怎么说?', '回复': '如果技术能够复制我们的外貌,它也许能够复制我们的思想和感受。'}, {'对话': '你好。', '回复': '嘿,你好!你看起来长得和我可真像啊!'}].
"""

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
# Baichuan2 variant:
# model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True,
#                                              optimize_model=True, load_in_4bit=True).bfloat16().eval()
model = AutoModel.from_pretrained(model_path, trust_remote_code=True,
                                  optimize_model=True, load_in_4bit=True).eval()
input_ids = tokenizer.encode(prompt, return_tensors="pt")
print("finished loading")

model = model.to('xpu')
# Keep the embedding layer on the CPU to save iGPU memory.
# model.model.embed_tokens.to('cpu')    # Baichuan2 layout
model.transformer.embedding.to('cpu')   # chatglm2-6b layout
input_ids = input_ids.to('xpu')
print("finished moving to xpu")

model = BenchmarkWrapper(model)
with torch.inference_mode():
    # warm up two times since IPEX is used
    for i in range(7):
        st = time.time()
        output = model.generate(input_ids, num_beams=1, do_sample=False, max_new_tokens=32)
        end = time.time()
        print(f'Inference time: {end-st} s')
        output_str = tokenizer.decode(output[0], skip_special_tokens=True)
        print(output_str)
```
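Since the first iterations include one-time IPEX kernel compilation, a fairer number would be the average over the warm runs only. Here is a sketch of that timing loop, reusing `model` and `input_ids` from above (and the only place the `numpy` import is actually needed):

```python
# Time each generate() call and average only the post-warmup iterations.
times = []
with torch.inference_mode():
    for i in range(7):
        st = time.time()
        output = model.generate(input_ids, num_beams=1,
                                do_sample=False, max_new_tokens=32)
        times.append(time.time() - st)
# Skip the first two runs, which include kernel compilation overhead.
print(f'mean latency over warm runs: {np.mean(times[2:]):.2f} s')
```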