Regarding Qualcomm backend inference performance optimization #18968
guoguo1314 started this conversation in General
Hello, Qualcomm team. In my limited understanding, inference performance with the current ExecuTorch Qualcomm backend is determined largely by the model transformation stage, not by the runtime stage. At runtime, the work is essentially some simple I/O reads and writes (KV cache, attention masks, etc.) plus a single QNN SDK call, qnn_graph_execute, that runs the compiled graph. The I/O is relatively cheap, and qnn_graph_execute itself cannot be modified, so I don't see much room to optimize the runtime side (a sketch of what I mean is below).

If that is right, then inference performance comes down mostly to the model transformation, and the most important part of that is graph optimization. I have also looked at LiteRT's graph optimizations; some of them already appear to be implemented in ExecuTorch, such as converting multi-head attention into single-head computation. Is there further benefit to be gained from graph optimization here?
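To make concrete what I mean by "the runtime stage is thin", here is a minimal sketch of the per-token decode loop as I picture it. All names below (load_compiled_graph, set_input, graph_execute, get_output) and the buffer shapes are hypothetical stand-ins for illustration only, not the real ExecuTorch or QNN SDK API:

```cpp
#include <iostream>
#include <vector>

// Hypothetical handle and helpers standing in for the backend runtime.
struct GraphHandle {};
GraphHandle load_compiled_graph(const char* /*path*/) { return GraphHandle{}; }
void set_input(GraphHandle&, int /*index*/, const std::vector<float>& /*data*/) {}
void graph_execute(GraphHandle&) { /* one opaque call, analogous to qnn_graph_execute */ }
std::vector<float> get_output(GraphHandle&, int /*index*/) {
  return std::vector<float>(32000, 0.0f);  // e.g. logits over a made-up vocab size
}

int main() {
  // 1. All the heavy lifting (op fusion, quantization, layout, scheduling)
  //    happened offline during model transformation; at run time we only
  //    hold an opaque handle to the compiled graph.
  GraphHandle graph = load_compiled_graph("llama_qnn.bin");

  const int max_new_tokens = 4;
  std::vector<float> token_embedding(4096, 0.0f);  // current-token input
  std::vector<float> attention_mask(2048, 0.0f);   // updated each step
  std::vector<float> kv_cache(2048 * 64, 0.0f);    // persisted between steps

  for (int step = 0; step < max_new_tokens; ++step) {
    // 2. Cheap host-side I/O: write the current token, mask, and KV cache
    //    into the graph's input buffers.
    set_input(graph, 0, token_embedding);
    set_input(graph, 1, attention_mask);
    set_input(graph, 2, kv_cache);

    // 3. A single execute call; everything inside it was fixed at
    //    compile/transform time, so there is nothing to tune here.
    graph_execute(graph);

    // 4. Read logits and the updated KV cache back out, then pick the next token.
    std::vector<float> logits = get_output(graph, 0);
    kv_cache = get_output(graph, 1);
    std::cout << "step " << step << ": got " << logits.size() << " logits\n";
  }
  return 0;
}
```

Steps 2 and 4 are simple memory reads/writes, and step 3 is the single API call we cannot change, which is why I think the interesting optimization work all lives in the transformation stage.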
Besides this, do you have any other good optimization suggestions?
Do you think that writing operators in llama.cpp might outperform ExecuTorch's approach of calling the QNN SDK for inference? If I have expressed anything incorrectly, please correct me. Thank you!
@shewu-quic @chunit-quic @haowhsu-quic @winskuo-quic @DannyYuyang-quic