-
Notifications
You must be signed in to change notification settings - Fork 163
feat: sometimes interpret char[] mutations as single bytes #1013
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
ec0033e to
5c73f71
Compare
46f44d5 to
6762580
Compare
simonresch
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thanks!
6762580 to
046a4e9
Compare
| if (prng.choice()) { | ||
| return (char[]) toPrimitive.apply(bytes); | ||
| } else { | ||
| char[] chars = new String(bytes, Charset.forName("CESU-8")).toCharArray(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't seem to respect tha length provided by
@WithLengthannotation.
Oh, it does, never mind!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Out of curiosity: Doesn't the current code here (specifically convertWithLength) assume that 2 byte = 1 char?
Wouldn't this assumption be violated here because:
- 1 CESU-8
bytecan be converted to 1char(for ASCII chars) → array becomes too large - 3 CESU-8
bytecan be converted to 1char(for chars >= U+0800) → array becomes too short
Maybe I am overlooking something though; sorry for the trouble in that case.
(Also side note: Would it make sense to store the Charset.forName("CESU-8") in a static final field?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also couldn't see how the length gets respected here, and did some fuzzing with various @WithLength(min=..., max=...) values, and couldn't get the array length out of bounds. So it seems to be respected, for some reason!
Would it make sense to store the Charset.forName("CESU-8") in a static final field?
It would make sense, I think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@oetr The length annotation is enforced by the innerMutator, which is used by toPrimitive and toPrimitiveAfterMutate: see https://github.com/CodeIntelligenceTesting/jazzer/blob/CIF-1863-string-compares-on-char-arrays/src/main/java/com/code_intelligence/jazzer/mutation/mutator/lang/PrimitiveArrayMutatorFactory.java#L95
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Marcono1234 You are right regarding the convertWithLength assuming two bytes per char, which could lead to enforcing incorrect length constraints when interepreting the byte array as "CESU-8"-encoded. I adjusted the convertWithLength method to assume a maximum of 6 bytes for a char (the maximum number of bytes used with CESU-8). Then, when we do the actual conversion to a char array, we ensure to enforce the length constraints. @oetr Could you have a look at the last commit?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice catch @Marcono1234 , thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't this * 6 for a complete Unicode code point, specifically a code point >= 0x10000?
However a Java char is UTF-16, so those code points >= 0x10000 would actually require 2 char.
So wouldn't it suffice to do * 3 here?
Or at worst * 4 in case a malformed 4-byte CESU-8 encoding can lead to a single replacement ? char.
046a4e9 to
f240d7a
Compare
When mutating char[] randomly interpret the bytes from libFuzzer as individual (single byte) chars. This helps to make use of libFuzzers table of recent compare entries (encoded as CESU8) if the char[] is used as a String inside the fuzz test.
f240d7a to
ef7ea96
Compare
oetr
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, just one issue from my side!
| Optional<WithLength> withLength = Optional.ofNullable(type.getAnnotation(WithLength.class)); | ||
| int minLength = withLength.map(WithLength::min).orElse(DEFAULT_MIN_LENGTH); | ||
| int maxLength = withLength.map(WithLength::max).orElse(DEFAULT_MAX_LENGTH); | ||
| withLength.ifPresent(System.err::println); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can be removed
| } | ||
|
|
||
| if (chars.length < minLength) { | ||
| return Arrays.copyOf(chars, minLength); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here, the newly padded chars might not be in range.
Maybe we can do a single-pass copy and forceInRange here? Or is Java smart enough to optimize it at runtime?
WDYT about the following:
int targetLength = Math.min(Math.max(chars.length, minLength), maxLength);
if (chars.length == targetLength) {
for (int i = 0; i < chars.length; i++) {
chars[i] = (char) forceInRange(chars[i], minRange, maxRange);
}
return chars;
} else {
char[] result = new char[targetLength];
for (int i = 0; i < targetLength; i++) {
result[i] = i < chars.length
? (char) forceInRange(chars[i], minRange, maxRange)
: (char) forceInRange(0, minRange, maxRange);
}
return result;
}
When mutating char[] randomly interpret the bytes from libFuzzer as individual (single byte) chars. This helps to make use of libFuzzers table of recent compare entries (encoded as CESU8) if the char[] is used as a String inside the fuzz test.