Question: on SM8750/sun, why do v79 HVX FP8 vcvt/vmpy work in hexagon-sim but return all zeros in FastRPC user PD? #4

@happyyzy

This is not a llama.cpp bug report.

I am asking here because this Qualcomm fork appears to be maintained by the people with the most relevant Hexagon / ggml-hexagon expertise, and I would like to confirm whether the behavior below is expected on current production devices.

Research Stage

Background Research

Previous existing literature and research

I am testing raw Hexagon v79 HVX FP8 instructions (vcvt(...f8) and vmpy(...f8, ...f8)) with a minimal standalone probe, outside of llama.cpp model evaluation.

Environment:

  • Device: SM8750 / sun
  • FastRPC capability query reports DSP arch 0x8c79
  • Hexagon SDK: 6.5.0.0
  • Hexagon tools: 19.0.07
  • Android NDK: r25c
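
For readers checking their own device against the environment above, here is a minimal sketch of how the reported value 0x8c79 maps to v79. It assumes (this is not stated in the issue) that the low byte of the FastRPC arch capability value carries the Hexagon architecture version; the function names are hypothetical helpers, not SDK APIs.

```c
#include <stdint.h>
#include <stdio.h>

/* ASSUMPTION: the low byte of the FastRPC arch capability value is the
   Hexagon architecture version written in hex (0x79 for v79). */
unsigned hexagon_arch_from_capability(uint32_t capability) {
    return capability & 0xffu;
}

/* Hypothetical helper: writes a name such as "v79" into out. */
void hexagon_arch_name(uint32_t capability, char *out, size_t out_len) {
    snprintf(out, out_len, "v%x", hexagon_arch_from_capability(capability));
}
```

With the value from this report, `hexagon_arch_from_capability(0x8c79)` yields 0x79, which is consistent with building the probe for v79.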

What I have already verified:

  1. The generated v79 object code does contain FP8 instructions such as:

    • v?.f8 = vcvt(...)
    • v?:?.hf = vcvt(v?.f8)
    • v?:?.hf = vmpy(v?.f8, v?.f8)
  2. The same source works correctly in hexagon-sim -mv79.
    In the simulator, the FP8 conversion / multiply path produces correct non-zero results.

  3. On device, inside FastRPC user PD, the same probe returns all zeros:

    • hf -> f8 gives all-zero bytes
    • f8 -> hf gives all-zero half values
    • f8 * f8 -> hf gives all-zero half values
  4. This remains true for:

    • intrinsic-generated code
    • inline asm
    • direct raw FP8 byte input followed by vcvt / vmpy
  5. I also tried:

    • qurt_hvx_lock(QURT_HVX_MODE_128B)
    • multiple qfloat codegen modes:
      • strict-ieee
      • ieee
      • lossy
      • legacy

    None of these changed the on-device result.
  6. I also traced existing vendor HTP/QNN paths on the same device and observed:

    • the working vendor QNN HTP runtime goes through the CDSP unsigned PD
    • if I force the same path to a signed PD, remote_handle_open / remote_handle64_open fails before execution
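
For the raw FP8 byte inputs mentioned in item 4, a host-side reference decoder can supply expected values that do not depend on the simulator or the device. The sketch below ASSUMES an E4M3 layout (1 sign bit, 4 exponent bits with bias 7, 3 mantissa bits, NaN at 0x7f / 0xff). The issue does not state which FP8 layout the v79 f8 type uses, so this is an illustration only, and `fp8_e4m3_to_float` is a hypothetical helper.

```c
#include <stdint.h>

/* Decodes one FP8 byte under the ASSUMED E4M3 layout.
   Sets *is_nan to 1 for the 0x7f / 0xff encodings and returns 0.0f for them. */
float fp8_e4m3_to_float(uint8_t byte, int *is_nan) {
    int sign = (byte >> 7) & 1;
    int exp  = (byte >> 3) & 0xf;
    int man  = byte & 0x7;
    float value;
    int shift;

    *is_nan = (exp == 0xf && man == 0x7);
    if (*is_nan) {
        return 0.0f;
    }

    if (exp == 0) {
        /* Zero or subnormal: (man / 8) * 2^-6, no implicit leading 1. */
        value = (float)man / 8.0f;
        exp = 1;
    } else {
        /* Normal: (1 + man / 8) * 2^(exp - 7). */
        value = 1.0f + (float)man / 8.0f;
    }

    /* Scale by 2^(exp - 7) with exact multiplications, avoiding libm. */
    shift = exp - 7;
    while (shift > 0) { value *= 2.0f; --shift; }
    while (shift < 0) { value *= 0.5f; ++shift; }

    return sign ? -value : value;
}
```

Under this assumed layout the byte 0x38 decodes to 1.0, so a probe fed 0x38 bytes that gets all-zero half values back cannot be explained by the input simply encoding zero.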

So at the moment I cannot tell whether this is:

  • an expected production-device limitation
  • an unsigned-PD limitation
  • a DSP image / firmware limitation
  • or something else specific to the runtime environment

Artifact bundle (logs and minimal sources):

Hypothesis

My current working hypothesis is:

  • code generation is correct
  • simulator behavior is correct
  • the issue is in the actual device execution environment, not in the C/intrinsic source itself

More specifically, it looks like on this SM8750/sun production image, the HVX FP8 datapath is not actually usable from the FastRPC user-PD path that is otherwise available to normal workloads.

However, I do not know whether that is:

  • expected platform behavior
  • a policy restriction
  • or something that should work on supported Qualcomm Hexagon runtimes

Implementation

Minimal standalone repro only.

This is not tied to a llama.cpp model or to llama.cpp FP8 code, and I am not claiming that this repository itself introduced the issue.

The reason for posting here is purely to ask the maintainers with Qualcomm Hexagon expertise:

Is it expected that v79 HVX FP8 vcvt / vmpy work in the simulator but return all zeros on-device in the normal FastRPC user-PD path on SM8750/sun?

And if this is expected:

What execution environment is actually required for correct HVX FP8 execution on such a device?
For example:

  • signed PD only?
  • a different vendor runtime path?
  • a specific DSP image capability?
  • not supported at all on production user-accessible paths?

Analysis

Observed facts:

  • simulator: correct non-zero FP8 results
  • on-device FastRPC user PD: all-zero FP8 results
  • forcing signed PD on the vendor QNN path causes open failure, so I could not validate FP8 there

This strongly suggests that the problem is not in the source-level implementation of the FP8 instructions, but in the platform/runtime environment available on the device.

Relevant log output

On-device probe:
- hf -> f8 : all zeros
- f8 -> hf : all zeros
- f8 * f8 -> hf : all zeros

Simulator (`hexagon-sim -mv79`) output:
- same source produces correct non-zero FP8 conversion/multiply results
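
The comparison summarized above can be sketched as a small host-side check that separates "all zeros" from an ordinary mismatch. This is not the actual probe code: the names are hypothetical, and it assumes the device output and the simulator output have been captured as byte buffers of equal length.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum probe_result { PROBE_MATCH, PROBE_ALL_ZERO, PROBE_MISMATCH };

/* Returns 1 if every byte in buf is zero, otherwise 0. */
int buffer_is_all_zero(const uint8_t *buf, size_t len) {
    size_t i;
    for (i = 0; i < len; ++i) {
        if (buf[i] != 0) {
            return 0;
        }
    }
    return 1;
}

/* Compares the device output with a reference buffer (for example the
   simulator output) and reports whether the device buffer matches, is
   entirely zero, or differs in some other way. */
enum probe_result classify_probe_output(const uint8_t *device,
                                        const uint8_t *reference,
                                        size_t len) {
    if (memcmp(device, reference, len) == 0) {
        return PROBE_MATCH;
    }
    if (buffer_is_all_zero(device, len)) {
        return PROBE_ALL_ZERO;
    }
    return PROBE_MISMATCH;
}
```

Reporting `PROBE_ALL_ZERO` separately from `PROBE_MISMATCH` keeps the symptom described here distinct from a rounding or format difference between the simulator and the device.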

I can provide:

  • full on-device probe log
  • simulator output
  • disassembly snippet showing emitted FP8 instructions
  • signed/unsigned PD tracing logs
