Skip to content

Conversation

@jorgecarleitao
Copy link
Member

I would like to propose that we outline and enforce guidelines on the arrow crate implementation with respect to the usage of unsafe.

The background of this proposal are PRs #8645 and #8829. In both cases, while addressing an unrelated issue, they hit undefined behavior (UB) due to an incorrect usage of unsafe in the code base. This UB was very time-consuming to identify and debug: combined, they accounted for more than 12hs of my time.

Safety against undefined behavior is the core premise of the Rust language. In many cases, the maintenance burden (time to find and fix bugs) does not justify the performance improvements and the decrease in motivation in handling them (they are just painful due to how difficult they are to debug). In particular, IMO those 12 hours would have been better spent in other parts of the code if unsafe would have not been used in the first place, which would have been likely translated in faster code or more features.

There are situations where unsafe is necessary, and the guidelines outline these cases. However, I also see many uses of unsafe that are not necessary nor properly documented.

The goal of these guidelines is to motivate contributors of the Rust implementation to be conscious about the maintenance cost of unsafe, and outline specific necessary conditions for any new unsafe to be introduced in the code base.

@andygrove andygrove changed the title [Rust] [Proposal] Add guidelines about usage of unsafe ARROW-10889: [Rust] [Proposal] Add guidelines about usage of unsafe Dec 12, 2020
@andygrove
Copy link
Member

This is funny timing. I just filed ARROW-10889 for this and then saw this PR!

@codecov-io
Copy link

codecov-io commented Dec 12, 2020

Codecov Report

Merging #8901 (ff35ce8) into master (5353c28) will increase coverage by 7.78%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #8901      +/-   ##
==========================================
+ Coverage   75.48%   83.26%   +7.78%     
==========================================
  Files         181      195      +14     
  Lines       41649    48066    +6417     
==========================================
+ Hits        31439    40023    +8584     
+ Misses      10210     8043    -2167     
Impacted Files Coverage Δ
rust/parquet/src/data_type.rs 78.72% <0.00%> (-5.89%) ⬇️
rust/parquet/src/util/memory.rs 89.57% <0.00%> (-1.04%) ⬇️
rust/parquet/src/file/statistics.rs 93.80% <0.00%> (-0.60%) ⬇️
rust/arrow/src/datatypes.rs 76.45% <0.00%> (-0.41%) ⬇️
rust/parquet/src/encodings/rle.rs 92.68% <0.00%> (-0.35%) ⬇️
rust/parquet/src/schema/parser.rs 89.86% <0.00%> (-0.34%) ⬇️
rust/arrow/src/ffi.rs 70.28% <0.00%> (-0.34%) ⬇️
rust/parquet/src/record/api.rs 98.01% <0.00%> (-0.14%) ⬇️
rust/arrow/src/json/reader.rs 83.09% <0.00%> (-0.09%) ⬇️
rust/arrow/src/csv/writer.rs 78.82% <0.00%> (ø)
... and 98 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 48fee66...ff35ce8. Read the comment docs.

@github-actions
Copy link

Copy link
Member

@andygrove andygrove left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great. Thanks @jorgecarleitao !

@jorgecarleitao jorgecarleitao marked this pull request as ready for review December 12, 2020 19:48
@andygrove
Copy link
Member

It would be really cool to have a GitHub Action that detected unsafe in a PR and auto-posted a link to this documentation, as well as tagging the PR with an unsafe label. I do not know how feasible that is though.

@Dandandan
Copy link
Contributor

Very cool proposal. Will take some time soon to read it in detail. I think currently there is a lot of code violating what is proposed, so that's clearly needed 👍

I think it also makes sense to keep track of the amount of unsafe that is used, and try to reduce it as much as possible by introducing safe APIs that are as performant. An example might be: introducing a safe iterator to avoid bounds checking.

Also I think the unsafe code can /should be more tested, e.g. by having more debug_asserts etc. checking for invariants and checking for undefined behavior in the CI.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe 10% is a bit arbitrary and depends on context? Maybe it should.bd demonstrated / be likely that it has an impact on real world usage of arrow?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that this wording would be stronger without a specific bound. I would rather say something like "usage of unsafe for performance reasons is justified only when all other alternatives have been exhausted and the performance benefits are sufficiently large. Performance benefits should be quantified with a bench".

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand the sentiment.

My concern is that I am not sure I would like to maintain a complex unsafe code for a 6% improvement at this particular phase of the project.

The rational for a concrete number is to impose a bound that we consider to not have the capacity to handle the maintenance cost for "small" improvements, and thus people should not try to focus on those types of improvements (again, at this particular phase of the project).

My concern with "performance benefits are sufficiently large" is that anything bigger than zero is always "sufficiently large" when compared to zero.

What if we write something like

usage of unsafe for performance reasons is justified only when all other alternatives have been exhausted and the performance benefits are sufficiently large (>~10%)

so that we allow other values, but we offer a number for reference?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could the improvement be expressed by the "hotness" of the code? I can imagine that a 6% improvement in the Buffers is much more valuable than a 11% percentage improvement in a specific kernel.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like

usage of unsafe for performance reasons is justified only when all other alternatives have been exhausted and the performance benefits are sufficiently large (e.g. >~10%)

@ritchie46 I agree with the sentiments that some performance improvements are more important than others, which is why I was suggesting leaving room for interpretation in the writeup

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think it would be preferable to have some room to trade safety for performance and vice versa. A change that makes everything a few percent faster might be acceptable, while a 50% improvement on a micro benchmark might not in some cases. And in some cases you maybe want to reduce the amount of safety, but might want to accept a certain negative performance impact.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a great idea -- thank you for writing it up @jorgecarleitao !

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://doc.rust-lang.org/reference/behavior-considered-undefined.html might be a reasonable citation, though it doesn't quantify the sources of unsafetly.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest being more explicit here -- maybe "Omitting bounds checking for performance" or something

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

stylistically, I think we could remove the 'Usage of unsafe for ...' sentences. The rationale for use of unsafe is already explained above so this repetition seems redundant to me.

I also think such wording may give the impression that unsafe is always allowed for these operations, when I think the intent is "unsafe can be used as a last resort"

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that this wording would be stronger without a specific bound. I would rather say something like "usage of unsafe for performance reasons is justified only when all other alternatives have been exhausted and the performance benefits are sufficiently large. Performance benefits should be quantified with a bench".

@alamb
Copy link
Contributor

alamb commented Dec 14, 2020

@jorgecarleitao I am not sure what has happened to this PR but it now seemingly includes many changes that are unrelated to the readme improvements:
Screen Shot 2020-12-14 at 1 24 21 PM

jorgecarleitao and others added 5 commits December 14, 2020 18:28
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
@jorgecarleitao
Copy link
Member Author

Some mistake on my part. Fixed

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this looks good. @sunchao @paddyhoran or @nevi-me would you like to express an opinion on this proposal?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like

usage of unsafe for performance reasons is justified only when all other alternatives have been exhausted and the performance benefits are sufficiently large (e.g. >~10%)

@ritchie46 I agree with the sentiments that some performance improvements are more important than others, which is why I was suggesting leaving room for interpretation in the writeup

Copy link
Contributor

@Dandandan Dandandan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For me the changes are clear enough and provide a good guideline for using unsafe.

@andygrove
Copy link
Member

I saw a tweet yesterday from @timClicks showing an example of how the imageproc project documents their unsafe code and I really like it. I think we should do something similiar. Maybe we could add this to the guidelines @jorgecarleitao ?

            // JUSTIFICATION
            //  Benefit
            //      Using checked indexing here makes bench_integral_image_rgb take 1.05x as long
            //      (The results are noisy, but this seems to be reproducible. I've not checked the generated assembly.)
            //  Correctness
            //      x and y are within bounds by definition of in_width and in_height
            let input = unsafe { image.unsafe_get_pixel(x, y) };

@Dandandan
Copy link
Contributor

I saw a tweet yesterday from @timClicks showing an example of how the imageproc project documents their unsafe code and I really like it. I think we should do something similiar. Maybe we could add this to the guidelines @jorgecarleitao ?

            // JUSTIFICATION
            //  Benefit
            //      Using checked indexing here makes bench_integral_image_rgb take 1.05x as long
            //      (The results are noisy, but this seems to be reproducible. I've not checked the generated assembly.)
            //  Correctness
            //      x and y are within bounds by definition of in_width and in_height
            let input = unsafe { image.unsafe_get_pixel(x, y) };

Saw that too :) I think it's an interesting approach. In this PR #8900 I added a comment with "SAFETY", but that misses the benefit part.

Copy link
Contributor

@paddyhoran paddyhoran left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@jorgecarleitao
Copy link
Member Author

Thanks all for the suggestions. Really good stuff.

I have incorporated all the changes:

  • the change of the requirement for performance was re-written as suggested by @alamb and @Dandandan
  • added a code snipped with the idea from @andygrove
  • fixed typos identified by @paddyhoran

I have used the term soundness instead of correctness because the former is already defined on an official guide: https://rust-lang.github.io/unsafe-code-guidelines/glossary.html#soundness-of-code--of-a-library and is based on existing literature, while the latter seems to be used only by imageproc.

Copy link
Contributor

@nevi-me nevi-me left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have any additional comments

@alamb alamb closed this in 900f6fe Dec 16, 2020
@alamb
Copy link
Contributor

alamb commented Dec 16, 2020

I think this PR has enough approvals / 👍 so I merged it in. We can continue to iterate on it in the future but this is a great base to work from. Thank you @jorgecarleitao and everyone else who contributed. A true team effort 👍

@timClicks
Copy link

Thanks all for working hard to keep Arrow as robust as possible

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

9 participants