
After moving the model and inputs to xpu, the model now runs on my Intel laptop, but the inference time is about 588 seconds, which is far too long for me. I suspect the GPU is not actually being used. May I ask what the problem is here? Thank you very much for any response.
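For reference, this is how I would check that the XPU device is visible at all (a minimal sketch; `torch.xpu` becomes available once `intel_extension_for_pytorch` is imported):

```python
import torch
import intel_extension_for_pytorch as ipex  # noqa: F401 - registers the 'xpu' device

# If this prints False, PyTorch cannot see the Intel GPU, which would point to
# a driver/oneAPI setup problem rather than the model code.
print(torch.xpu.is_available())
print(torch.xpu.device_count())
print(torch.xpu.get_device_name(0))
```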
The code is as follows:
```python
import torch
import intel_extension_for_pytorch as ipex
from bigdl.llm.transformers import AutoModelForCausalLM, AutoModel
from transformers import AutoTokenizer
import time
import numpy as np
from gpu_benchmark_util import BenchmarkWrapper

# model_path = r"D:\rag\test_api\Baichuan2-7B-Chat"  # Baichuan2 variant (not used here)
model_path = r"C:\Users\Administrator\yishuo\chatglm2-6b"

# Chinese RAG-style test prompt: a role description, a question, and a small
# inline knowledge base.
prompt = """ 你是human_prime2,你是一个高级智能实体,你融合了最先进的算法和深度学习网络,专为跨越星际的知识探索与智慧收集而设计。
你回答以下问题时必须跟哲学相结合,必须在15字内回答完,你会尽量参考知识库来回答。
以下是问题:请介绍钱.
以下是知识库:[{'对话': '什么是"帮费"?', '回复': '"帮费"是为中央各库采买物料时,为护送官员以及送部的饭食银拨配的额外款项。'}, {'对话': '怎么说?', '回复': '如果技术能够复制我们的外貌,它也许能够复制我们的思想和感受。'}, {'对话': '你好。', '回复': '嘿,你好!你看起来长得和我可真像啊!'}].
"""

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
# Baichuan2 variant:
# model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True,
#                                              optimize_model=True, load_in_4bit=True).bfloat16().eval()
model = AutoModel.from_pretrained(model_path, trust_remote_code=True,
                                  optimize_model=True, load_in_4bit=True).eval()
input_ids = tokenizer.encode(prompt, return_tensors="pt")
print("finished loading")

model = model.to('xpu')
# Keep the embedding layer on the CPU to save iGPU memory.
# model.model.embed_tokens.to('cpu')    # Baichuan2 layout
model.transformer.embedding.to('cpu')   # chatglm2-6b layout
input_ids = input_ids.to('xpu')
print("finished moving to xpu")

model = BenchmarkWrapper(model)
with torch.inference_mode():
    # warm up two times since IPEX is used
    for i in range(7):
        st = time.time()
        output = model.generate(input_ids, num_beams=1, do_sample=False, max_new_tokens=32)
        end = time.time()
        print(f'Inference time: {end-st} s')
        output_str = tokenizer.decode(output[0], skip_special_tokens=True)
        print(output_str)
```
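Since the first iterations include one-time IPEX kernel compilation, a fairer number would be the average over the warm runs only. Here is a sketch of that timing loop, reusing `model` and `input_ids` from above (and the only place the `numpy` import is actually needed):

```python
# Time each generate() call and average only the post-warmup iterations.
times = []
with torch.inference_mode():
    for i in range(7):
        st = time.time()
        output = model.generate(input_ids, num_beams=1,
                                do_sample=False, max_new_tokens=32)
        times.append(time.time() - st)
# Skip the first two runs, which include kernel compilation overhead.
print(f'mean latency over warm runs: {np.mean(times[2:]):.2f} s')
```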