Regarding Qualcomm backend inference performance optimization #18968
guoguo1314 started this conversation in General
Hello, Qualcomm team. In my limited understanding, inference performance with the current ExecuTorch Qualcomm backend is determined largely by the model transformation stage, not by the runtime stage. At runtime, the work is essentially some simple I/O reads and writes (KV cache, attention masks, etc.) plus a single QNN SDK call, qnn_graph_execute, that runs the compiled graph. The I/O is relatively cheap, and qnn_graph_execute itself cannot be modified, so I don't see much room to optimize the runtime side (a sketch of what I mean is below).

If that is right, then inference performance comes down mostly to the model transformation, and the most important part of that is graph optimization. I have also looked at LiteRT's graph optimizations; some of them already appear to be implemented in ExecuTorch, such as converting multi-head attention into single-head computation. Is there further benefit to be gained from graph optimization here?
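To make concrete what I mean by "the runtime stage is thin", here is a minimal sketch of the per-token decode loop as I picture it. All names below (load_compiled_graph, set_input, graph_execute, get_output) and the buffer shapes are hypothetical stand-ins for illustration only, not the real ExecuTorch or QNN SDK API:

```cpp
#include <iostream>
#include <vector>

// Hypothetical handle and helpers standing in for the backend runtime.
struct GraphHandle {};
GraphHandle load_compiled_graph(const char* /*path*/) { return GraphHandle{}; }
void set_input(GraphHandle&, int /*index*/, const std::vector<float>& /*data*/) {}
void graph_execute(GraphHandle&) { /* one opaque call, analogous to qnn_graph_execute */ }
std::vector<float> get_output(GraphHandle&, int /*index*/) {
  return std::vector<float>(32000, 0.0f);  // e.g. logits over a made-up vocab size
}

int main() {
  // 1. All the heavy lifting (op fusion, quantization, layout, scheduling)
  //    happened offline during model transformation; at run time we only
  //    hold an opaque handle to the compiled graph.
  GraphHandle graph = load_compiled_graph("llama_qnn.bin");

  const int max_new_tokens = 4;
  std::vector<float> token_embedding(4096, 0.0f);  // current-token input
  std::vector<float> attention_mask(2048, 0.0f);   // updated each step
  std::vector<float> kv_cache(2048 * 64, 0.0f);    // persisted between steps

  for (int step = 0; step < max_new_tokens; ++step) {
    // 2. Cheap host-side I/O: write the current token, mask, and KV cache
    //    into the graph's input buffers.
    set_input(graph, 0, token_embedding);
    set_input(graph, 1, attention_mask);
    set_input(graph, 2, kv_cache);

    // 3. A single execute call; everything inside it was fixed at
    //    compile/transform time, so there is nothing to tune here.
    graph_execute(graph);

    // 4. Read logits and the updated KV cache back out, then pick the next token.
    std::vector<float> logits = get_output(graph, 0);
    kv_cache = get_output(graph, 1);
    std::cout << "step " << step << ": got " << logits.size() << " logits\n";
  }
  return 0;
}
```

Steps 2 and 4 are simple memory reads/writes, and step 3 is the single API call we cannot change, which is why I think the interesting optimization work all lives in the transformation stage.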
Besides this, do you have any other good optimization suggestions?
Do you think that writing operators in llama.cpp might outperform ExecuTorch's approach of calling the QNN SDK for inference? If I have expressed anything incorrectly, please correct me. Thank you!
@shewu-quic @chunit-quic @haowhsu-quic @winskuo-quic @DannyYuyang-quic