[SPARK-55621][PYTHON] Fix ambiguous and unnecessary unicode usage by gaogaotiantian · Pull Request #54410 · apache/spark

gaogaotiantian · 2026-02-20T22:41:09Z

What changes were proposed in this pull request?

Fixed all unnecessary and ambiguous Unicode character usage.
A set of ruff rules is also added to prevent future regressions.

Why are the changes needed?

We should avoid non-ASCII Unicode characters as much as possible. There are a few reasons for this:

Sometimes it's just wrong, e.g. ‘index’ vs 'index'.
Some editors (e.g. VSCode) highlight it as a warning, and some editors/terminals might not display it well.
It's difficult to stay consistent because people don't know how to type these characters.
Docstrings may actually be displayed somewhere while users are using the API, and Unicode could cause problems there.
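
As a rough illustration of the first point (this is not part of the PR), the sketch below locates non-ASCII characters in a piece of text. The helper name `find_non_ascii` is hypothetical, and the sample text is adapted from the docstring touched in this PR, with the Unicode quotes written as escapes so the snippet itself stays ASCII.

```python
def find_non_ascii(text):
    """Return (line, column, code point) for each non-ASCII character in text."""
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col_no, ch in enumerate(line, start=1):
            if ord(ch) > 127:
                hits.append((line_no, col_no, f"U+{ord(ch):04X}"))
    return hits


# First line uses Unicode quotes (written as escapes), second uses ASCII quotes.
sample = "If \u2018ignore\u2019, propagate NA values\nIf 'ignore', propagate NA values"
for hit in find_non_ascii(sample):
    print(hit)
```

Only the first line of the sample produces hits; the ASCII-quoted second line is left alone.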

Does this PR introduce any user-facing change?

No.

How was this patch tested?

ruff check passed.

Was this patch authored or co-authored using generative AI tooling?

No.

HyukjinKwon · 2026-02-22T22:33:41Z

python/pyspark/pandas/indexes/base.py

            Mapping correspondence.
        na_action : {None, 'ignore'}
-            If ‘ignore’, propagate NA values, without passing them to the mapping correspondence.
+            If 'ignore', propagate NA values, without passing them to the mapping correspondence.


I thought backticks are legitimate in Sphinx.

Backticks (`) are legitimate syntax, but ‘ is not a backtick. It's a Unicode quote.
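
For readers who cannot tell these characters apart on screen, a small sketch using the standard `unicodedata` module prints their code points and official names. The helper `describe` is a hypothetical name used only for this illustration.

```python
import unicodedata


def describe(ch):
    """Format a single character as 'U+XXXX NAME'."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


# Backtick, ASCII apostrophe, and the two Unicode single quotes (as escapes).
for ch in ("`", "'", "\u2018", "\u2019"):
    print(describe(ch))
```

The backtick is GRAVE ACCENT (U+0060), while the character removed from the docstring is LEFT SINGLE QUOTATION MARK (U+2018).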

allisonwang-db

Thanks for fixing this! Good to know.

holdenk

Approved, although I'm not sure we need the comment changes. I think we should be open to dropping them in the future if we find that having Unicode in comments helps illustrate behaviour.

holdenk · 2026-02-23T19:52:44Z

python/pyspark/tests/upstream/pyarrow/test_pyarrow_type_coercion.py

        # ==== 3.2 Nullable Extension Types ====
        # (data, target_type, expected_values)
        nullable_cases = [
-            # Int types → float


In-line Unicode comments don't seem as bad as string/docstring issues.

This is actually not enforced by ruff. The added ruff rules only check for "ambiguous unicode usage", like the quote I mentioned above. I made this fix by hand. The character was added pretty recently, and I believe that's because LLMs like to generate icons like this.

I don't think having such characters in comments is horrible, and in some cases it might actually be helpful. But Unicode characters may cause issues on some IDEs/machines/editors, and it's not worth using → instead of ->. I don't even know how to type → myself :)

That being said, the new rules will not block Unicode characters like this one in the future - people can still use them. This specific change is a side effect of my cleaning up Unicode character usage in this PR.
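
The distinction drawn here, between characters that can be mistaken for ASCII and characters that are merely non-ASCII, can be sketched as follows. This is an illustration under stated assumptions: the `LOOKS_LIKE_ASCII` mapping is hand-picked and is not ruff's actual table of confusable characters, and `classify` and `replace_arrows` are hypothetical helpers, not code from this PR.

```python
# Hand-picked for illustration only; this is NOT ruff's confusables table.
LOOKS_LIKE_ASCII = {
    "\u2018": "'",  # LEFT SINGLE QUOTATION MARK
    "\u2019": "'",  # RIGHT SINGLE QUOTATION MARK
}


def classify(ch):
    """Classify one character as 'ascii', 'ambiguous', or 'other non-ascii'."""
    if ord(ch) < 128:
        return "ascii"
    if ch in LOOKS_LIKE_ASCII:
        return "ambiguous"
    return "other non-ascii"


def replace_arrows(comment):
    """Manual cleanup of the kind described above: RIGHTWARDS ARROW to '->'."""
    return comment.replace("\u2192", "->")


print(classify("\u2018"))
print(classify("\u2192"))
print(replace_arrows("# Int types \u2192 float"))
```

In this sketch the arrow falls into the "other non-ascii" bucket, which mirrors the point that it is not something the ambiguity rules target and had to be cleaned up separately.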

holdenk · 2026-02-23T19:52:48Z

pyproject.toml

+    # ambiguous unicode character
+    "RUF001",  # string
+    "RUF002",  # docstring
+    "RUF003",  # comment


In-line Unicode comments don't seem as bad as string/docstring issues.

Fix ambiguous and unnecessary unicode usage

571c4ff

HyukjinKwon reviewed Feb 22, 2026

allisonwang-db approved these changes Feb 23, 2026

holdenk approved these changes Feb 23, 2026

HyukjinKwon approved these changes Feb 24, 2026

gaogaotiantian wants to merge 1 commit into apache:master from gaogaotiantian:fix-ascii
