
fix: display the real details for aliases when requested, even if the alias is an uncompressed instruction#2923

Draft
moste00 wants to merge 3 commits into
capstone-engine:nextfrom
moste00:fix/real_details_for_uncompressed_aliases

@moste00
Contributor

@moste00 moste00 commented May 14, 2026

Your checklist for this pull request

  • I've documented or updated the documentation of every API function and struct this PR changes.
  • I've added tests that prove my fix is effective or that my feature works (if possible)

Detailed description

Background:

We depart from LLVM in what we count as aliases. LLVM only counts so-called "Pseudo-Instructions": non-compressed, specialized uses of normal instructions. For example, LLVM considers the 4-byte ret a pseudoinstruction that is just a specialized use of the jalr instruction.

Capstone expands the meaning of "alias" to also cover compressed-instruction equivalence. For example, Capstone considers c.add to be an alias of the corresponding add instruction, whereas LLVM does NOT consider those two instructions to be aliases in the ordinary sense.

The problem:

Previously we only populated the real details when an instruction was an alias, but this was checked via printAliasInstr, an LLVM-derived function that only considers the restricted LLVM sense of the word "alias". The consequence: compressed equivalents don't get the details of the instruction they're equivalent to, even when CS_OPT_DETAIL_REAL is set.

This change refactors the real-details logic to also cover Capstone's wider usage of "alias", namely compressed instructions and their uncompressed equivalents.

Test plan

...

Closing issues

...

@github-actions github-actions Bot added the RISCV Arch label May 14, 2026
@Rot127
Collaborator

Rot127 commented May 15, 2026

Capstone expands the meaning of "alias" to also cover compressed-instruction equivalence. For example, Capstone considers c.add to be an alias of the corresponding add instruction, whereas LLVM does NOT consider those two instructions to be aliases in the ordinary sense.

When did we add this?
In #2869 ?

I need to look at it in more detail.
But we really should not deviate from LLVM, unless we can say "not categorizing it as an alias is an LLVM bug".

How are aliases defined in the ISA?

@moste00
Contributor Author

moste00 commented May 16, 2026

When did we add this?
In #2869 ?

I need to look at it in more detail.

It's much older than that; it's probably as old as the noalias flag itself, since the very beginning in November 2025 or so.

The noalias flag was reflecting that before noaliascompressed was introduced.

But we really should not deviate from LLVM, unless we can say "not categorizing it as an alias is an LLVM bug".

How are aliases defined in the ISA?

The short answer is that they aren't. The ISA never defines such a thing as an alias; it defines two things: pseudo instructions and compressed equivalents.

1- Pseudo instructions are basically assembly-time macros that let you write ret even though no such instruction exists: there is no binary encoding for ret. The assembler replaces it with a jalr and the CPU never knows the difference.

2- Compressed equivalents are actual instructions with actual encodings. The CPU decoder is aware of them, but they happen to semantically correspond exactly to a restricted use of an equivalent non-compressed instruction (e.g. the compressed add corresponds to a += b, something the ordinary add can also express).

To my humble intuition, those two things look very much the same from a user perspective. They're both "this instruction is actually the same as this other one", with the meaning of "the same" being defined in two slightly different ways each time.
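
To make the two cases concrete, here is a minimal C sketch that encodes ret both ways. The function names are hypothetical (they are not Capstone or LLVM APIs); the bit layouts are the standard I-type JALR and CR-type C.JR formats. Read as little-endian bytes, the two results are the 67 80 00 00 and 82 80 sequences passed to cstool later in this thread.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper. I-type JALR layout:
 * imm[11:0] | rs1 | funct3=000 | rd | opcode=1100111 */
uint32_t encode_jalr(unsigned rd, unsigned rs1, int32_t imm)
{
	assert(rd < 32 && rs1 < 32 && imm >= -2048 && imm <= 2047);
	return ((uint32_t)(imm & 0xfff) << 20) | ((uint32_t)rs1 << 15) |
	       ((uint32_t)rd << 7) | 0x67u;
}

/* Hypothetical helper. CR-type C.JR layout:
 * funct4=1000 | rs1 | rs2=00000 | op=10, with rs1 != x0 */
uint16_t encode_c_jr(unsigned rs1)
{
	assert(rs1 != 0 && rs1 < 32);
	return (uint16_t)((0x8u << 12) | (rs1 << 7) | 0x2u);
}

/* ret as a pseudo instruction: jalr x0, 0(ra), with ra = x1
 *   encode_jalr(0, 1, 0) == 0x00008067
 * ret as a compressed instruction: c.jr ra
 *   encode_c_jr(1)       == 0x8082 */
```

Both words are real encodings; ret itself has none, which is the distinction drawn in points 1 and 2 above.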

Is there any precedent in other architectures that would let us go one way or the other? I know for a fact ARM has Thumb mode, which is their compressed mode, but I don't know if they have their own notion of pseudo instructions.

(PS: note that this entire PR is about separating the details from the alias text. That is, we can still decide NOT to consider compressed instructions as aliases while still allowing the real details flag to populate their details with those of the non-compressed equivalent.

This is very convenient for Rizin and any downstream consumers of Capstone, as it allows you to basically ignore all the compressed instructions; after all, every single one corresponds to a special case of a non-compressed instruction.)

@slate5
Contributor

slate5 commented May 16, 2026

Hi @moste00, can you please give more precise examples of where LLVM returns an instruction that is/isn't an alias as expected? I'm a bit confused about what the desired result should be.
This is c.add example:

$ riscv64-linux-gnu-as -march=rv64gc -al - <<< 'add sp, sp, s0'
   1 0000 2291     	add	sp,sp,s0
$ riscv64-linux-gnu-objdump -d -M no-aliases a.out
   0:	9122            c.add	sp,s0

The assembler (as) shows the instruction as an alias (pseudoinstr), add sp,sp,s0, while objdump shows the real instruction. I tested LLVM, and it follows the same logic as the GNU utils. Now cstool:

$ cstool riscv64 2291
 0  22 91        add	sp, sp, s0
$ cstool riscv64+noalias 2291
 0  22 91        c.add	sp, s0

I don't see the inconsistency at first...
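
For reference, the mapping between the two strings above can be sketched at the bit level. This is a hypothetical stand-alone helper, not Capstone's or LLVM's decompression code; it only handles the non-hint form of C.ADD and assumes the standard CR-type and R-type layouts.

```c
#include <assert.h>
#include <stdint.h>

/* Returns 1 if hw is the non-hint C.ADD form:
 * funct4=1001 | rd/rs1 != x0 | rs2 != x0 | op=10 */
int is_c_add(uint16_t hw)
{
	unsigned rd = (hw >> 7) & 0x1f;
	unsigned rs2 = (hw >> 2) & 0x1f;
	return (hw >> 12) == 0x9 && (hw & 0x3) == 0x2 && rd != 0 && rs2 != 0;
}

/* Expand C.ADD rd, rs2 into the 32-bit word for ADD rd, rd, rs2. */
uint32_t expand_c_add(uint16_t hw)
{
	assert(is_c_add(hw));
	uint32_t rd = (hw >> 7) & 0x1f;
	uint32_t rs2 = (hw >> 2) & 0x1f;
	/* R-type: funct7=0000000 | rs2 | rs1 | funct3=000 | rd | opcode=0110011 */
	return (rs2 << 20) | (rd << 15) | (rd << 7) | 0x33u;
}

/* The halfword from the objdump line, 0x9122 (c.add sp, s0), expands to
 * 0x00810133, which is add sp, sp, s0 in its 4-byte form. */
```

The 2-byte and 4-byte encodings differ, yet they decode to the same operation, which is why both tools can print either string for the same bytes.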

@moste00
Contributor Author

moste00 commented May 16, 2026

Hi @moste00, can you please give more precise examples of where LLVM returns an instruction that is/isn't an alias as expected?

I just mean that LLVM core doesn't understand any compressed instruction as an alias. Maybe the CLI tools implement this on top of the core (as they should, IMO), but the core itself has a function called printAliasInstr, and this function doesn't think that compressed equivalents are aliases. Aliases are ONLY pseudoinstructions, things with no encodings.

There IS an equivalent of printAliasInstr for the compressed instructions, which is uncompressInst, which will give you the equivalent non-compressed instruction of the compressed instruction you passed. But that's not an "alias" as LLVM defines it, it's a decompression.

Like you noticed, most CLI tools probably intuitively know that the user doesn't care about this pedantic distinction, and quietly just redefine "alias" to mean both things, but LLVM doesn't think that decompressed instructions are aliases, so we will be departing from them there.

(There are some consequences if we do this, for example we would have no alias ID for decompressed instructions, alias IDs are only assigned to the "fake" pseudo instructions that LLVM considers as aliases, compressed instructions are real from LLVM's POV, they have a real instruction ID and no alias ID.)

@slate5
Contributor

slate5 commented May 16, 2026

Yea, we both understand that these aliases (pseudoinstructions) are just a programmer's convenience and, in a way, a relief from hard-coded decisions on which architecture will execute this. E.g., you just write add sp,sp,s0 and the assembler decides if it can be replaced by a compressed instruction or not. If it can be, then this is an alias for a compressed instruction, otherwise, this is a real instruction.

If you want to have an alias ID for compressed instructions, then we would have to add a table for it, right? Or even better, just link them somehow to the existing table of aliases, because there is not really a compressed alias instruction. It's just an alias that is or is not compressed. As you said, from the user perspective, an alias represents a functionality, and nobody cares whether that functionality took 2 or 4 bytes of memory :)

@slate5
Contributor

slate5 commented May 16, 2026

Also, I didn't reiterate that there is no difference between the CLI tools and Capstone: the CLI tools show the same string as cstool does, add sp,sp,s0 or c.add sp,s0, depending on whether you ask for -M no-aliases.
But if I understood you right, you wanna have alias ID?

@moste00
Contributor Author

moste00 commented May 16, 2026

Yea, we both understand that these aliases (pseudoinstructions) are just a programmer's convenience and, in a way, a relief from hard-coded decisions on which architecture will execute this. E.g., you just write add sp,sp,s0 and the assembler decides if it can be replaced by a compressed instruction or not. If it can be, then this is an alias for a compressed instruction, otherwise, this is a real instruction.

This is my view, but another view is that we should do EXACTLY what LLVM core does, and LLVM core doesn't see compressed instructions as aliases. Maybe we can give them another flag, for example decompressed? Or redefine noaliascompressed such that it's not a subset of noalias (effectively defining two types of aliases, normal aliases and compressed aliases, mutually exclusive).

If you want to have an alias ID for compressed instructions, then we should have to add a table for it, right?

Yes but this is its own deviation from LLVM too, we will define a manual table and maintain it with no auto-sync from LLVM. So whatever path you go, you will always have to face that you're going against LLVM convention.

@slate5
Contributor

slate5 commented May 16, 2026

Let's backtrack a bit. I'm confused a lot 😅
I just tested the famous ret (aka, jalr zero, ra or c.jr ra), and this is how cstool detects it:

cstool -d riscv64 67800000
 0  67 80 00 00  ret	
	ID: 31 (jalr)
	Is alias: 1698 (ret) with ALIAS operand set

	Groups: jump 

cstool -d riscv64 8280
 0  82 80        ret	
	ID: 513 (c_jr)
	Is alias: 1698 (ret) with ALIAS operand set

	Groups: HasStdExtCOrZca jump 

alias ID is ret (1698) for both

@slate5
Contributor

slate5 commented May 16, 2026

Ah, so the problem is with those that are aliased only as compressed instructions, while the real instruction counterpart doesn't have an alias...

@moste00
Contributor Author

moste00 commented May 16, 2026

@slate5 good point, actually now I'm confused too :D

I didn't test ret before, but I tested another instruction (sext.w or something, the alias is sign extension but the core operation is compressed addition) and it had an invalid alias ID. So perhaps my statement doesn't apply to all decompressed instructions, but it certainly applies to some of them.

Anyway, let's wait for @Rot127 to make the final judgement call on this, preferably according to the precedent set by ARM. Then we will see the way forward.

@slate5
Contributor

slate5 commented May 16, 2026

Hehe, sext.w (c.addiw t0,0) works well for me XD

$ cstool -rd riscv64 8122
 0  81 22        sext.w	t0, t0
	ID: 495 (c_addiw)
	Is alias: 1684 (sext.w) with REAL operand set
	op_count: 2
		operands[0].type: REG = t0
		operands[0].access: READ | WRITE
		operands[1].type: IMM = 0x0
		operands[1].access: READ

	Groups: HasStdExtCOrZca IsRV64 

I think the only "issue" is when you have an "alias" that, in itself, is nothing but the same mnemonic of the real instruction. And then, it only makes sense to call it an "alias" (i.e., alternative name) if it represents a compressed instruction. For example, sext.w can be used as an alias to both addiw and c.addiw, while addi doesn't exist as an alias to a full instruction and only exists as an "alias" to a compressed one (addi t0,t0,2 can be the real instruction and there is no pseudo version of it except if it represents a compressed one, c.addi t0,2)

So, it kinda makes sense, after all, R in RISC-V means reduced, not simple :)

@Rot127
Collaborator

Rot127 commented May 17, 2026

preferably according to the precedent set by ARM

ARM has aliases :D There it is easy.

Yes but this is its own deviation from LLVM too, we will define a manual table and maintain it with no auto-sync from LLVM

Please don't introduce another table we need to maintain, unless it is easy to generate automatically.

The purpose of Auto-Sync is to just use the LLVM code as much as possible. Patching a line here and there is fine. Or extending our LLVM backends to generate it for us, of course.

I think the only "issue" is when you have an "alias" that, in itself, is nothing but the same mnemonic of the real instruction

That case is actually a bug (from our POV, not necessarily for LLVM).
If, for example, there is an "alias" instruction with the mnemonic addi this should be fixed.

It usually means that the LLVM definitions have an alias and a real instruction defined with the same mnemonic. You can search for InstAlias.*<mnemonic> in the RISCV.*.td files in the llvm-capstone repo.
We can change these definitions (remove the InstAlias) to fix it. But please leave a comment there that it is a Capstone edit.


Personally, I wouldn't want the compressed instructions to be counted as "alias".
An alias should really just be a different mnemonic or a "shortcut" writing for an instruction.

First of all, because this is what it usually means for all other archs. So we can have some consistency between them.
And second, because the alias must execute semantically the exact same way as its real counter part.

If one implements some tool with Capstone, they may not care about the mnemonic.
So knowing that sext.w is semantically equivalent to addiw might be enough for them.
This is why we have the alias feature: people can just get the operands of the real instruction and use them, knowing that any alias of it is semantically equivalent.

IF the compressed instructions are semantically equivalent to the full version of them, we could say that they are an alias. But since the encoding bytes differ, I would prefer to add an extra decompressed flag for them and treat them as real.

So something like that:

Compressed and not-compressed

  • Compressed instructions are real instructions.
  • They are distinct from their "not-compressed" equivalents because the encoding differs.
  • Compressed instructions have a flag "is_compressed" set to true.
  • Optionally: It stores the ID of the not-compressed instruction somewhere (if we can somehow generate the mapping table for it nicely).

Alias

  • Alias instructions only differ in mnemonic and/or used operands from the real instruction.
  • Alias and real instruction byte encodings are always the same.
  • An alias can have two real instruction parents: not-compressed and compressed.

The topology is something like this:

alias:       ret
             / \
          is alias of
           /      \
real:    c_jr    jalr

Difference:

Bytes:      67800000
Alias ID:   ret
Real ID:    jalr
Detail:     cs_insn.details.is_compressed == false
            cs_insn.size == 4
            if (get_alias_details)
               cs_insn.op_count == 0
            else
               cs_insn.op_count == 1

Bytes:      8280
Alias ID:   ret
Real ID:    c_jr
Detail:     cs_insn.details.is_compressed == true
            cs_insn.size == 2
            if (get_alias_details)
               cs_insn.op_count == 0
            else
               cs_insn.op_count == 1
  • Anything which doesn't follow this definition is a bug.

wdyt?
Have I overlooked/over-read something?
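
The is_compressed and size values in the listing above follow directly from the encoding, which can be sketched as below. The helper names are hypothetical and are not part of the Capstone API; the sketch also ignores the longer-than-32-bit encodings that the ISA reserves.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* RISC-V instructions are stored little-endian, so bytes[0] carries the two
 * lowest opcode bits. Any value other than 0b11 marks a 16-bit instruction. */
int is_compressed(const uint8_t *bytes, size_t len)
{
	assert(bytes != NULL && len >= 2);
	return (bytes[0] & 0x3) != 0x3;
}

/* Instruction size in bytes under the assumption stated above. */
size_t insn_size(const uint8_t *bytes, size_t len)
{
	return is_compressed(bytes, len) ? 2 : 4;
}

/* 67 80 00 00 (jalr form of ret): not compressed, size 4
 * 82 80       (c.jr form of ret): compressed, size 2 */
```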

@Rot127 Rot127 marked this pull request as draft May 18, 2026 09:58
@moste00
Contributor Author

moste00 commented May 21, 2026

preferably according to the precedent set by ARM

ARM has aliases :D There it is easy.

xD very correct, indeed.

Personally, I wouldn't want the compressed instructions to be counted as "alias". An alias should really just be a different mnemonic or a "shortcut" writing for an instruction.

First of all, because this is what it usually means for all other archs. So we can have some consistency between them. And second, because the alias must execute semantically the exact same way as its real counter part.

This is reasonable. The thing is, compressed instructions satisfy the second condition exactly. Unless I'm misreading the spec/programmer's manual, it really does seem to say that a compressed equivalent MUST have the same effect as the uncompressed instruction it is derived from. That's the intention in the first place: to give a size shortcut for common idioms.

IF the compressed instructions are semantically equivalent to the full version of them, we could say that they are an alias. But since the encoding bytes differ, I would prefer to add an extra decompressed flag for them and treat them as real.

Very reasonable.

  • Optionally: It stores the ID of the not-compressed instruction somewhere (if we can somehow generate the mapping table for it nicely).

We can; the uncompressInst function is basically this table.

wdyt? Have I overlooked/over-read something?

My original use case remains :( I need to be able to treat compressed instructions as basically their non-compressed equivalents, or else lifting would become very painful and repetitive. So one of 3 things:

1- The r real details flag treats compressed instructions as a "quasi-alias": they're not an alias, sure, but the real details flag would still replace the details of an is_compressed instruction with the non-compressed details.

2- There is a separate flag that does the same thing as (1) but is not r, maybe rc (real compressed)?

3- There is a separate operands array in RISC-V other than the usual one; the real details flag operates on the usual one, the other flag operates on the other one.

Basically, I'm just circling and circling over the idea that I need to be able to obtain the non-compressed details, and since Rizin is just a serious test-drive of Capstone, probably many other tools depending on Capstone will have the same need.

@Rot127
Collaborator

Rot127 commented May 22, 2026

My original use case remains :( I need to be able to treat compressed instructions as basically their non-compressed equivalents, or else lifting would become very painful and repetitive. So one of 3 things:

Sorry, I lost this context while reading.

Idea 2 seems good to me, but I would flip it around.

By default -r shows details of the real instruction for aliases AND compressed instructions. And we add an additional flag (--rc or something) which makes -r show the real details ONLY for proper aliases, but not for compressed ones.

Because I think your lifting use case is way more common and should require only one flag instead of two.
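
One possible reading of this proposal, written out as a small decision function. Everything here is an assumption made for illustration: the option names, the enum, and the function are hypothetical and do not exist in Capstone, and the final flag name was still open at this point in the discussion.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical option bits standing in for -r and the proposed extra flag. */
enum { OPT_REAL = 1 << 0, OPT_REAL_ALIAS_ONLY = 1 << 1 };

/* Which operand set ends up in the instruction details. */
enum detail_source {
	DETAIL_AS_DECODED,   /* alias operand set, or the instruction's own */
	DETAIL_REAL,         /* operands of the real instruction behind an alias */
	DETAIL_UNCOMPRESSED  /* operands of the non-compressed equivalent */
};

enum detail_source pick_details(unsigned opts, bool is_alias, bool is_compressed)
{
	/* The restriction only has a meaning when real details are requested. */
	assert(!(opts & OPT_REAL_ALIAS_ONLY) || (opts & OPT_REAL));

	if (!(opts & OPT_REAL))
		return DETAIL_AS_DECODED;
	if (is_compressed && !(opts & OPT_REAL_ALIAS_ONLY))
		return DETAIL_UNCOMPRESSED;
	return is_alias ? DETAIL_REAL : DETAIL_AS_DECODED;
}
```

With this shape, a lifter sets only OPT_REAL and never sees compressed operand sets, while a tool that wants the compressed instruction treated as real adds the second flag.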

@moste00
Contributor Author

moste00 commented May 27, 2026

@Rot127 One final question: Does this mean we no longer treat noalias as a suppression of the compressed instruction text? Since we don't classify compressed instructions as aliases, it would imply that noalias would no longer suppress their text.

noaliascompressed would still be present, but preferably renamed to nocompressed so as to not imply that compressed instructions are a subset of aliases? WDYT?

@moste00
Contributor Author

moste00 commented May 27, 2026

@Rot127 Also, one more note: It's never the case in LLVM that an alias has 2 parents, each alias in LLVM's alias table maps to exactly 1 parent, and most of those parents are the non-compressed.

So this presents another difficulty (if we choose to handle it; ignoring it is always an option). Some instructions that "logically" should be aliases, for example a c_addi that logically performs a move, will not be counted as aliases in the new classification.

We could handle this: Uncompress the instruction, then if the uncompressed form maps to an alias and the user hasn't requested alias suppression, print the alias. This way the c_addi will first uncompress to an addi, which, if it has the right operands, will then alias-map to a mv, and the net effect is that c_addi is successfully mapped to mv if they're equivalent.

More work, and this whole topic is surprisingly fractal in complexity and edge cases.
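
A minimal sketch of that two-step resolution, using the sext.w case from earlier in the thread rather than mv (C.ADDIW rd, 0 expands to ADDIW rd, rd, 0, whose alias is sext.w). The functions and the enum are hypothetical and only model one instruction; they are not how uncompressInst or printAliasInstr are implemented. The C.ADDIW encoding shown is the RV64 one.

```c
#include <assert.h>
#include <stdint.h>

enum mnemonic { MN_UNKNOWN, MN_ADDIW, MN_SEXT_W };

/* C.ADDIW (RV64): funct3=001 | imm[5] | rd != x0 | imm[4:0] | op=01 */
int is_c_addiw(uint16_t hw)
{
	return (hw >> 13) == 0x1 && (hw & 0x3) == 0x1 && ((hw >> 7) & 0x1f) != 0;
}

/* Step 1: uncompress C.ADDIW rd, imm to the word for ADDIW rd, rd, imm. */
uint32_t expand_c_addiw(uint16_t hw)
{
	assert(is_c_addiw(hw));
	uint32_t rd = (hw >> 7) & 0x1f;
	int32_t imm = (int32_t)((((hw >> 12) & 0x1) << 5) | ((hw >> 2) & 0x1f));
	if (imm & 0x20)
		imm -= 0x40; /* sign-extend the 6-bit immediate */
	/* I-type: imm[11:0] | rs1 | funct3=000 | rd | opcode=0011011 */
	return ((uint32_t)(imm & 0xfff) << 20) | (rd << 15) | (rd << 7) | 0x1bu;
}

/* Step 2: alias lookup on the 32-bit word. ADDIW with imm == 0 is sext.w. */
enum mnemonic alias_for(uint32_t word)
{
	if ((word & 0x707fu) != 0x1bu) /* funct3=000, opcode=0011011 */
		return MN_UNKNOWN;
	return (word >> 20) == 0 ? MN_SEXT_W : MN_ADDIW;
}

enum mnemonic resolve(uint16_t hw)
{
	return is_c_addiw(hw) ? alias_for(expand_c_addiw(hw)) : MN_UNKNOWN;
}
```

Here resolve(0x2281), the halfword behind the bytes 81 22 tested above, yields MN_SEXT_W, while the same instruction with a non-zero immediate stays MN_ADDIW. The extra cost per compressed instruction in this sketch is one expansion plus one alias lookup.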

@github-actions github-actions Bot added the CS-core-files auto-sync label May 27, 2026
@Rot127
Collaborator

Rot127 commented May 28, 2026

since we don't classify compressed instructions as aliases, it would imply that noalias would no longer suppress their text.

Yes, I think this follows from it. nocompressed sounds good to me as addition.

We could handle this: Uncompress the instruction, then if the uncompressed form maps to an alias and the user hasn't requested alias suppression, print the alias. This way the c_addi will first uncompress to an addi, which, if it has the right operands, will then alias-map to a mv, and the net effect is that c_addi is successfully mapped to mv if they're equivalent.

That is a tricky one indeed. Generally the assembly output should be as LLVM does it. Being comparable to it is one of the features we have.

How is the uncompression done? Does it cost a lot of runtime?
Because if it is relatively low (or can alternatively be disabled), then I am fine with it. Of course, the alias details must have the correct (compressed) ID set as the "real instruction".

@slate5 Feel free to state your opinion as well btw.
