Skip to content

yes: use as_encoded_bytes#12306

Open
oech3 wants to merge 1 commit into
uutils:mainfrom
oech3:yes-win-utf
Open

yes: use as_encoded_bytes#12306
oech3 wants to merge 1 commit into
uutils:mainfrom
oech3:yes-win-utf

Conversation

@oech3
Copy link
Copy Markdown
Contributor

@oech3 oech3 commented May 15, 2026

Closes #12236

@codspeed-hq
Copy link
Copy Markdown

codspeed-hq Bot commented May 15, 2026

Merging this PR will degrade performance by 12.08%

❌ 2 regressed benchmarks
✅ 315 untouched benchmarks
⏩ 46 skipped benchmarks1

Warning

Please fix the performance issues or acknowledge them on CodSpeed.

Performance Changes

Mode Benchmark BASE HEAD Efficiency
Memory cp_recursive_deep_tree[(120, 4)] 562.5 KB 699.2 KB -19.55%
Simulation ls_recursive_balanced_tree[(6, 4, 15)] 48.9 ms 50.9 ms -3.92%

Tip

Investigate this regression by commenting @codspeedbot fix this regression on this PR, or directly use the CodSpeed MCP with your agent.


Comparing oech3:yes-win-utf (aeec6b1) with main (40c36f6)

Open in CodSpeed

Footnotes

  1. 46 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports.

@oech3 oech3 marked this pull request as ready for review May 15, 2026 09:12
@sylvestre
Copy link
Copy Markdown
Contributor

please add a test to make sure we don't regress

@oech3
Copy link
Copy Markdown
Contributor Author

oech3 commented May 15, 2026

I don't know what letter corresponds with invalid utf8.

@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 15, 2026

GNU testsuite comparison:

Skip an intermittent issue tests/tail/tail-n0f (fails in this run but passes in the 'main' branch)
Skipping an intermittent issue tests/cut/bounded-memory (passes in this run but fails in the 'main' branch)

Comment thread tests/by-util/test_yes.rs

#[test]
#[cfg(any(unix, target_os = "wasi"))]
fn test_non_utf8() {
Copy link
Copy Markdown
Contributor Author

@oech3 oech3 May 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ChrisDenton I cannot write test working on Windows

Copy link
Copy Markdown
Contributor

@ChrisDenton ChrisDenton May 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Windows OsStr does not support non UTF-8. Well, it does support something called "WTF-8" but it's an extension of UTF-8 (so \xFF isn't valid). Also it's an internal encoding and not guaranteed.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sylvestre Would you merge this PR as mere refactoring without adding tests?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it cause crush at production? Or is invalid args rejected by OS?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it cause crush at production? Or is invalid args rejected by OS?

It is UB, so it's undefined what will happen. The standard library needs to be able to convert an OsStr (which is [u8]) to a platform string (which for Windows is [u16]). So the OsStr needs to have a special encoding to make that possible.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does as_encoded_bytes have defined behavior at production at least for?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It has defined cross-platform behaviour only for the UTF-8 subset. You can split on UTF-8. You can join UTF-8. Anything else is platform specific. On Unix it's fine to use bytes. On Windows the encoding isn't guaranteed except that it's a superset of UTF-8.

Production is where most issues will arise since it's optimized for performance.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We only need yes > binary. But it is impossible to test since yes does not use file as input.

Comment thread src/uu/yes/src/yes.rs Outdated
@oech3 oech3 changed the title yes: use as_encoded_bytes on Windows yes: use as_encoded_bytes May 15, 2026
Comment thread src/uu/yes/src/yes.rs Outdated
@xtqqczze
Copy link
Copy Markdown
Contributor

I don't know what letter corresponds with invalid utf8.

I think the byte 0xff is never used in the UTF-8 encoding.

https://manpages.ubuntu.com/manpages/stonking/man7/utf-8.7.html

@oech3
Copy link
Copy Markdown
Contributor Author

oech3 commented May 15, 2026

But I can't put it as args.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

yes: support invalid unicode on Windows

4 participants