Add tests for src/snmalloc/override/new.cc #839

Merged
mjp41 merged 31 commits into microsoft:main from paulhdk:test-new
Apr 29, 2026
@paulhdk
Contributor

@paulhdk paulhdk commented Mar 29, 2026

First-time contributor taking a stab at #85.

Questions:

  • The standard explicitly requires that new and delete don't introduce data races. Should we explicitly test that, e.g., with ThreadSanitizer?
  • I noticed in passing that rust.cc does not have any tests. Is there a reason why there aren't any, or should that be tracked in a separate issue?

closes: #85

@paulhdk
Contributor Author

paulhdk commented Mar 29, 2026

@microsoft-github-policy-service agree

@mjp41
Member

mjp41 commented Mar 29, 2026

Hi @paulhdk, thanks for taking this.

The standard explicitly requires that new and delete don't introduce data races. Should we explicitly test that, e.g., with ThreadSanitizer?

We run TSan in CI. So any concurrent test will get checked with TSan.
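As a rough illustration of what such a concurrent test could look like, here is a minimal sketch. The helper name `concurrent_new_delete` is hypothetical and not part of this PR; it only shows the shape of a test that TSan could check once it is built with `-fsanitize=thread`.

```cpp
#include <cassert>
#include <cstddef>
#include <new>
#include <thread>
#include <vector>

// Hypothetical stress helper (an assumption for illustration, not from the PR).
// Each worker repeatedly allocates and frees through new[]/delete[].
// Returns the total number of completed allocate/free rounds.
std::size_t concurrent_new_delete(std::size_t threads, std::size_t rounds)
{
  // One counter per thread, so no two threads write the same object.
  std::vector<std::size_t> completed(threads, 0);
  std::vector<std::thread> workers;
  workers.reserve(threads);

  for (std::size_t t = 0; t < threads; ++t)
  {
    workers.emplace_back([&completed, t, rounds] {
      for (std::size_t i = 0; i < rounds; ++i)
      {
        // Vary the size a little so several size classes are exercised.
        char* p = new char[16 + (i % 64)];
        p[0] = static_cast<char>(i); // touch the memory
        delete[] p;
        ++completed[t]; // each thread only updates its own slot
      }
    });
  }

  for (auto& w : workers)
    w.join();

  std::size_t total = 0;
  for (std::size_t c : completed)
    total += c;
  return total;
}
```

Calling `concurrent_new_delete(4, 1000)` runs four workers for a thousand rounds each. A sanitizer build would report any race inside the allocation functions, while the test itself only checks that every round completed.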

I noticed in passing that rust.cc does not have any tests. Is there a reason why there aren't any, or should that be tracked in a separate issue?

This should also be done, but a separate PR.

We still support building with C++17, and that has caused some failures in CI. Also, the code needs a clang-format pass. The build generates a target, clangformat, that formats the code correctly. If you can't get that to work, the CI failure should display a patch.

Member

@mjp41 mjp41 left a comment

Thanks. I have made a few suggestions that will hopefully get it through CI.

The TSan build is failing due to duplicate symbols. We might have to disable this test on TSan sadly.

Comment thread src/test/func/new/new.cc
Comment thread src/test/helpers.h Outdated
Comment thread src/test/func/new/new.cc Outdated
@paulhdk
Contributor Author

paulhdk commented Mar 30, 2026

Thanks. I have made a few suggestions that will hopefully get it through CI.

Thank you! You beat me to it. Probably shouldn't have pushed before addressing all of your previous comments.

The TSan build is failing due to duplicate symbols. We might have to disable this test on TSan sadly.

So, no additional data race tests needed in that case?

@paulhdk
Contributor Author

paulhdk commented Mar 30, 2026

Just pushed another fix, @mjp41.

I believe the Ubuntu builds are failing because they run with SNMALLOC_USE_SELF_VENDORED_STL=ON.

Not sure about the Windows builds.

@paulhdk
Contributor Author

paulhdk commented Apr 6, 2026

Did some more testing, and I'm able to reproduce the CI failures locally. I still believe that SNMALLOC_USE_SELF_VENDORED_STL=ON is at fault: the RTTI of the thrown exception (self-vendored STL) and of the expected exception (system STL) don't match, which is why the catch (std::bad_alloc&) falls through when SNMALLOC_USE_SELF_VENDORED_STL=ON is set.

An easy fix to pass the tests would be a plain catch(...) block, but that wouldn't solve the root cause.

Do you consider this a bug, @mjp41? What do you suggest we do?
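To make the "falls through" case visible rather than masking it with a plain catch(...), a test could record which handler actually matched. The sketch below is an assumption for illustration: `Outcome` and `classify` are hypothetical names, not helpers from this PR.

```cpp
#include <cassert>
#include <cstddef>
#include <new>

// Hypothetical result type: what happened when the callable ran.
enum class Outcome
{
  Returned,       // no exception was thrown
  CaughtBadAlloc, // the std::bad_alloc handler matched
  CaughtOther     // something was thrown, but it did not match std::bad_alloc
};

// Hypothetical helper: run f and report which handler, if any, matched.
template<typename F>
Outcome classify(F&& f)
{
  try
  {
    f();
    return Outcome::Returned;
  }
  catch (const std::bad_alloc&)
  {
    return Outcome::CaughtBadAlloc;
  }
  catch (...)
  {
    // Reaching this handler is the mismatch described above: an exception
    // arrived, but its type was not recognised as std::bad_alloc.
    return Outcome::CaughtOther;
  }
}
```

A test could then expect `Outcome::CaughtBadAlloc` for an oversized allocation and report `Outcome::CaughtOther` as a distinct failure, which keeps the root cause observable.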

@SchrodingerZhu
Collaborator

The self-vendored STL is meant for freestanding builds, so I don't expect it to have full exception-handling facilities.

@mjp41
Member

mjp41 commented Apr 7, 2026

The self-vendored STL is meant for freestanding builds, so I don't expect it to have full exception-handling facilities.

So should we disable this test if the self-vendored setting is on?

@SchrodingerZhu
Collaborator

I think so.

@paulhdk
Contributor Author

paulhdk commented Apr 7, 2026

PTAL

Member

@mjp41 mjp41 left a comment

LGTM, thanks for addressing this long standing issue

Comment thread src/test/func/new/new.cc Outdated
Comment thread src/test/func/new/new.cc Outdated
@paulhdk
Contributor Author

paulhdk commented Apr 20, 2026

Looks like the netbsd tests are failing. Maybe an integer overflow somewhere?
I'd be happy to spend some time debugging this. Does anyone (@mjp41, @SchrodingerZhu) have some pointers for me?

@SchrodingerZhu
Copy link
Copy Markdown
Collaborator

Could you add -v (either to ninja or CMAKE_CXX_FLAGS) to dump the compilation command?

@paulhdk
Contributor Author

paulhdk commented Apr 21, 2026

I'm trying to run the netbsd CI job locally with act to get the compile command, but no luck so far.
Have you managed to run the GitHub CI locally? That would probably make debugging a lot easier

@mjp41
Member

mjp41 commented Apr 24, 2026

I'm trying to run the netbsd CI job locally with act to get the compile command, but no luck so far. Have you managed to run the GitHub CI locally? That would probably make debugging a lot easier

Sorry, I haven't. @devnexen normally knows what is going on with different BSDs.

@devnexen
Collaborator

I'm trying to run the netbsd CI job locally with act to get the compile command, but no luck so far. Have you managed to run the GitHub CI locally? That would probably make debugging a lot easier

Hi @paulhdk, I don't think it's an integer overflow. The request is (size_t)-1 (~16 EiB), so snmalloc rejects the size internally long before any arithmetic on it could wrap.

Looking at the log again: we land on the EXPECT with caught_bad_alloc == false. If bad_alloc had been thrown and unmatched, the process would have terminated before that line ever printed, so the throw just didn't happen on this lane.

Two plausible causes:

  1. snmalloc's huge-size path returned nullptr to the caller without going through Throw::failure; that would be a real bug worth fixing.
  2. The throw ran but unwinding silently bailed out. I'd lean toward this one given pkgsrc gcc10 plus -static-libstdc++. Same family of issue as the self-vendored STL lane earlier in this PR.

Quickest way to tell them apart: drop a write(2, "throw\n", 6) right before the throw std::bad_alloc() in override/new.cc and rerun the netbsd job. If "throw" shows up in the log, it's a toolchain/unwinding problem; if it doesn't, the failure handler is being bypassed and the bug is on our side.

Hope it helps.
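The diagnostic suggested above can be sketched in isolation as follows. This is not the code in override/new.cc: `fail_with_marker` and `marker_then_caught` are hypothetical stand-ins, and a POSIX system providing `write` in `<unistd.h>` is assumed.

```cpp
#include <cassert>
#include <cstddef>
#include <new>
#include <unistd.h>

// Hypothetical stand-in for an allocation failure path (an assumption, not
// snmalloc's actual handler). It emits a marker, then throws.
[[noreturn]] void fail_with_marker()
{
  static const char marker[] = "throw\n";
  // write goes straight to file descriptor 2 (stderr): no stdio buffering
  // and no heap allocation, so it is usable inside an allocator.
  ssize_t ignored = write(STDERR_FILENO, marker, sizeof(marker) - 1);
  (void)ignored;
  throw std::bad_alloc();
}

// Returns true if the exception thrown after the marker reached the handler.
bool marker_then_caught()
{
  bool caught = false;
  try
  {
    fail_with_marker();
  }
  catch (const std::bad_alloc&)
  {
    caught = true;
  }
  return caught;
}
```

If the marker is printed but `caught` stays false, the throw ran and the problem lies in unwinding or handler matching. If the marker never appears, the failure path was not reached at all.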

paulhdk added a commit to paulhdk/snmalloc that referenced this pull request Apr 27, 2026
@paulhdk
Contributor Author

paulhdk commented Apr 27, 2026

I'm trying to run the netbsd CI job locally with act to get the compile command, but no luck so far. Have you managed to run the GitHub CI locally? That would probably make debugging a lot easier

Hi @paulhdk, I don't think it's an integer overflow. The request is (size_t)-1 (~16 EiB), so snmalloc rejects the size internally long before any arithmetic on it could wrap.

Looking at the log again: we land on the EXPECT with caught_bad_alloc == false. If bad_alloc had been thrown and unmatched, the process would have terminated before that line ever printed, so the throw just didn't happen on this lane.

Two plausible causes:

1. snmalloc's huge-size path returned nullptr to the caller without going through Throw::failure — that would be a real bug worth fixing.

2. The throw ran but unwinding silently bailed out. I'd lean toward this one given pkgsrc gcc10 plus -static-libstdc++. Same family of issue as the self-vendored STL lane earlier in this PR.

Quickest way to tell them apart: drop a write(2, "throw\n", 6) right before the throw std::bad_alloc() in override/new.cc and rerun the netbsd job. If "throw" shows up in the log, it's a toolchain/unwinding problem; if it doesn't, the failure handler is being bypassed and the bug is on our side.

I just pushed a commit with the suggested print statement. @mjp41, could you re-run the CI please?

Hope it helps.

Thank you, @devnexen - very helpful!

@paulhdk
Contributor Author

paulhdk commented Apr 27, 2026

I'm not seeing the "throw" message in the netbsd job.
But it appears to be some unrelated issue because I'm not seeing it in the macOS job either, even though it is showing up as expected when I run the test on my local machine.

FWICT, stderr is not explicitly redirected for the jobs running the tests. Can anyone think of a reason why stderr/the "throw" message is not showing in the logs?

In the meantime, I'll continue trying to set up the CI on my machine with act.

@mjp41
Member

mjp41 commented Apr 27, 2026

I'm not seeing the "throw" message in the netbsd job. But it appears to be some unrelated issue because I'm not seeing it in the macOS job either, even though it is showing up as expected when I run the test on my local machine.

FWICT, stderr is not explicitly redirected for the jobs running the tests. Can anyone think of a reason why stderr/the "throw" message is not showing in the logs?

In the meantime, I'll continue trying to set up the CI on my machine with act.

So this is what I can see.

image

I wonder if we can add SNMALLOC_TRACING temporarily to the NetBSD build, and then we might see a bit more. I guess there are three possibilities

  • there is a nullptr return path we haven't correctly intercepted
  • the Throw class is not being chosen correctly for errors
  • we aren't correctly overriding new on NetBSD.

I think the experiments so far suggest it isn't the second one.

Although, do we need a flush after write?

Comment thread src/snmalloc/override/new.cc
@paulhdk
Contributor Author

paulhdk commented Apr 27, 2026

I'm not seeing the "throw" message in the netbsd job. But it appears to be some unrelated issue because I'm not seeing it in the macOS job either, even though it is showing up as expected when I run the test on my local machine.

FWICT, stderr is not explicitly redirected for the jobs running the tests. Can anyone think of a reason why stderr/the "throw" message is not showing in the logs?

In the meantime, I'll continue trying to set up the CI on my machine with act.

So this is what I can see.

image

I wonder if we can add SNMALLOC_TRACING temporarily to the NetBSD build, and then we might see a bit more.

I would have to add that to the netbsd job's cmake-flags in .github/workflows/main.yml, correct?

I guess there are three possibilities

  • there is a nullptr return path we haven't correctly intercepted
  • the Throw class is not being chosen correctly for errors
  • we aren't correctly overriding new on NetBSD.

I think the experiments so far suggest it isn't the second one.

Although, do we need a flush after write?

Is there a particular reason we're using the builtin write() for our debug prints? If not, we could just use std::cout with a std::endl, which would give us a flush.

@mjp41
Member

mjp41 commented Apr 27, 2026

I'm not seeing the "throw" message in the netbsd job. But it appears to be some unrelated issue because I'm not seeing it in the macOS job either, even though it is showing up as expected when I run the test on my local machine.

FWICT, stderr is not explicitly redirected for the jobs running the tests. Can anyone think of a reason why stderr/the "throw" message is not showing in the logs?

In the meantime, I'll continue trying to set up the CI on my machine with act.

So this is what I can see.

image

I wonder if we can add SNMALLOC_TRACING temporarily to the NetBSD build, and then we might see a bit more.

I would have to add that to the netbsd job's cmake-flags in .github/workflows/main.yml, correct?

Yes. It should put quite a lot of information about what is happening.

I guess there are three possibilities

  • there is a nullptr return path we haven't correctly intercepted
  • the Throw class is not being chosen correctly for errors
  • we aren't correctly overriding new on NetBSD.

I think the experiments so far suggest it isn't the second one.

Although, do we need a flush after write?

Is there a particular reason we're using the builtin write() for our debug prints? If not, we could just use std::cout with a std::endl, which would give us a flush.

I read the docs, and write doesn't need a flush when the target is stdout or stderr.

We can't use std::cout as it can allocate. Being an allocator is tricky as most libraries assume allocation works. So an allocator can't call them.

We have snmalloc::message which is what we normally use internally for logging. That adds a few bits on top of write for mild formatting without allocation. It is also cross platform.
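To illustrate the constraint, here is a minimal sketch of allocation-free logging built directly on `write`. It is not `snmalloc::message` and does not reflect its interface: `format_line` and `log_line` are hypothetical helpers, and a POSIX `write` is assumed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <unistd.h>

// Hypothetical helper: writes "<prefix><value>\n" into buf without touching
// the heap. Output is truncated at cap. Returns the number of bytes stored.
std::size_t format_line(
  char* buf, std::size_t cap, const char* prefix, std::size_t value)
{
  std::size_t n = 0;

  // Copy the prefix.
  for (const char* p = prefix; *p != '\0' && n < cap; ++p)
    buf[n++] = *p;

  // Produce decimal digits in reverse order into a small stack array.
  char digits[24]; // a 64-bit value needs at most 20 decimal digits
  std::size_t d = 0;
  do
  {
    digits[d++] = static_cast<char>('0' + (value % 10));
    value /= 10;
  } while (value != 0);

  // Append the digits in the right order, then a newline.
  while (d > 0 && n < cap)
    buf[n++] = digits[--d];
  if (n < cap)
    buf[n++] = '\n';
  return n;
}

// Hypothetical helper: format on the stack and hand the bytes to write.
void log_line(const char* prefix, std::size_t value)
{
  char buf[128];
  std::size_t n = format_line(buf, sizeof(buf), prefix, value);
  ssize_t ignored = write(STDERR_FILENO, buf, n);
  (void)ignored;
}
```

Everything here lives on the stack, so the helper stays safe to call from inside operator new, where a stream such as std::cout might allocate and re-enter the allocator.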

@paulhdk
Contributor Author

paulhdk commented Apr 27, 2026

I'm not seeing the "throw" message in the netbsd job. But it appears to be some unrelated issue because I'm not seeing it in the macOS job either, even though it is showing up as expected when I run the test on my local machine.
FWICT, stderr is not explicitly redirected for the jobs running the tests. Can anyone think of a reason why stderr/the "throw" message is not showing in the logs?
In the meantime, I'll continue trying to set up the CI on my machine with act.

So this is what I can see.
image
I wonder if we can add SNMALLOC_TRACING temporarily to the NetBSD build, and then we might see a bit more.

I would have to add that to the netbsd job's cmake-flags in .github/workflows/main.yml, correct?

Yes. It should put quite a lot of information about what is happening.

Done. Could you give the CI another go, please?

I guess there are three possibilities

  • there is a nullptr return path we haven't correctly intercepted
  • the Throw class is not being chosen correctly for errors
  • we aren't correctly overriding new on NetBSD.

I think the experiments so far suggest it isn't the second one.
Although, do we need a flush after write?

Is there a particular reason we're using the builtin write() for our debug prints? If not, we could just use std::cout with a std::endl, which would give us a flush.

I read the docs, and write doesn't need a flush when the target is stdout or stderr.

We can't use std::cout as it can allocate. Being an allocator is tricky as most libraries assume allocation works. So an allocator can't call them.

We have snmalloc::message which is what we normally use internally for logging. That adds a few bits on top of write for mild formatting without allocation. It is also cross platform.

Makes sense - thanks for explaining!

Comment thread src/snmalloc/override/new.cc
Comment thread src/snmalloc/override/new.cc
Comment thread src/snmalloc/override/new.cc
Comment thread src/snmalloc/override/new.cc
Comment thread src/snmalloc/override/new.cc
Comment thread src/snmalloc/override/new.cc Outdated
Comment thread src/snmalloc/override/new.cc Outdated
Comment thread src/snmalloc/override/new.cc Outdated
Comment thread src/snmalloc/override/new.cc Outdated
Comment thread src/snmalloc/override/new.cc Outdated
Comment thread src/snmalloc/override/new.cc Outdated
Comment thread src/snmalloc/override/new.cc Outdated
@mjp41
Member

mjp41 commented Apr 28, 2026

I'm hoping that #841 will fix the NetBSD implementation. Claude felt that it would ensure the override takes effect.

paulhdk and others added 6 commits April 28, 2026 18:31
Co-authored-by: Matthew Parkinson <mjp41@users.noreply.github.com>
Co-authored-by: Matthew Parkinson <mjp41@users.noreply.github.com>
Co-authored-by: Matthew Parkinson <mjp41@users.noreply.github.com>
mjp41 added 2 commits April 29, 2026 09:13
On NetBSD with GCC 10, taking the address of operator new/delete as
function pointer template arguments can resolve to the libstdc++
versions rather than the snmalloc overrides defined in the same
translation unit. This happens because the function pointer is
resolved via PLT indirection on some ELF platforms.

Replace the function-pointer template parameters with lambda arguments
that make direct calls to operator new/delete. Direct calls are always
resolved to the definitions in the translation unit, avoiding the
platform-dependent symbol resolution issue.
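The two styles contrasted in the commit message above can be sketched as follows. This is an illustrative assumption, not the test code from this PR: `round_trip_ptr` and `round_trip_call` are hypothetical names. On a typical toolchain both variants behave the same, and the difference described above only shows up on affected platforms.

```cpp
#include <cassert>
#include <cstddef>
#include <new>

// Variant 1 (hypothetical): the allocation functions are passed as function
// pointer template arguments, so their addresses must be taken.
template<void* (*Alloc)(std::size_t), void (*Dealloc)(void*)>
bool round_trip_ptr(std::size_t size)
{
  void* p = Alloc(size);
  bool ok = p != nullptr;
  Dealloc(p);
  return ok;
}

// Variant 2 (hypothetical): callables are passed instead. A lambda body can
// call operator new/delete directly, with no address-of involved.
template<typename Alloc, typename Dealloc>
bool round_trip_call(std::size_t size, Alloc alloc, Dealloc dealloc)
{
  void* p = alloc(size);
  bool ok = p != nullptr;
  dealloc(p);
  return ok;
}
```

Usage would look like `round_trip_ptr<&::operator new, &::operator delete>(64)` for the first variant and `round_trip_call(64, [](std::size_t n) { return ::operator new(n); }, [](void* p) { ::operator delete(p); })` for the second.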
@mjp41 mjp41 force-pushed the test-new branch 2 times, most recently from 3f7022d to dfde462 Compare April 29, 2026 08:59
mjp41 added 3 commits April 29, 2026 10:15
GCC recognises operator new/delete as C++ replaceable allocation
functions and can transform them into malloc/free or elide paired
new/delete calls at higher optimisation levels. This bypasses the
snmalloc override and causes the test to fail (e.g. operator new
returns NULL from malloc instead of throwing std::bad_alloc).

-fno-builtin does not prevent this because GCC handles C++ replaceable
allocation functions separately from C built-in functions. Compiling
the test at -O0 disables all such transformations.
GCC 10 on NetBSD 9.2 has a bug where it incorrectly optimises calls
to replaceable C++ allocation functions (operator new/delete) inside
templates and lambdas, replacing them with built-in malloc/free even
when a user-provided replacement is visible in the same translation
unit. This causes the new/delete override test to fail.

Upgrade to NetBSD 10.1 (the vmactions/netbsd-vm default) and GCC 14
from pkgsrc, which does not have this issue.
The GCC 10 compiler bug that required this workaround is addressed by
upgrading the NetBSD CI to GCC 14.
@mjp41
Member

mjp41 commented Apr 29, 2026

@paulhdk sorry for taking over here, but with the microsoft organisation policy that new contributors cannot trigger CI to run, I think this would have taken a very long time.

Thank you for the work, and hopefully this will now pass CI.

If you could take a look it would be great to check I haven't broken something.

@paulhdk
Contributor Author

paulhdk commented Apr 29, 2026

@paulhdk sorry for taking over here, but with the microsoft organisation policy that new contributors cannot trigger CI to run, I think this would have taken a very long time.

Makes sense. I'll just have to find another issue to work on that I can debug myself then 🙂 Maybe the Rust tests we discussed earlier in this thread?

Thank you for the work, and hopefully this will now pass CI.

Thank you, @mjp41 (+ @devnexen, + @SchrodingerZhu) for helping get it across the finish line!

So, the fix ended up being a combination of #841, upgrading GCC in the netbsd CI, and using lambdas instead of function pointers?

If you could take a look it would be great to check I haven't broken something.

Is the hardcoded pragma to suppress warnings on GCC at the beginning of the tests (still) necessary?

Otherwise, LGTM!

@mjp41
Member

mjp41 commented Apr 29, 2026

@paulhdk sorry for taking over here, but with the microsoft organisation policy that new contributors cannot trigger CI to run, I think this would have taken a very long time.

Makes sense. I'll just have to find another issue to work on that I can debug myself then 🙂 Maybe the Rust tests we discussed earlier in this thread?

That would be great, if you want to take that on. You got this working perfectly except for NetBSD, and I think that was mostly due to it being pinned to an ancient version.

Hopefully, once this PR lands you will get automatically running CI, but it might require more PRs.

Thank you for the work, and hopefully this will now pass CI.

Thank you, @mjp41 (+ @devnexen, + @SchrodingerZhu) for helping get it across the finish line!

Thank you.

So, the fix ended up being a combination of #841, upgrading GCC in the netbsd CI, and using lambdas instead of function pointers?

If you could take a look it would be great to check I haven't broken something.

Is the hardcoded pragma to suppress warnings on GCC at the beginning of the tests (still) necessary?

I am not sure anymore. I think GCC on some platforms was getting confused and emitting that warning. But I did feel a bit like a dog chasing its tail on this.

I am also not sure the lambda rewrite was required. We could go back to your address-taking version, as that didn't need the warning suppression?

Avoid having to hard-code a suppression of certain GCC warnings
@paulhdk
Contributor Author

paulhdk commented Apr 29, 2026

[...]

Is the hardcoded pragma to suppress warnings on GCC at the beginning of the tests (still) necessary?

I am not sure anymore. I think GCC on some platforms was getting confused and emitting that warning. But I did feel a bit like a dog chasing its tail on this.

I am also not sure the lambda rewrite was required. We could go back to your address-taking version, as that didn't need the warning suppression?

I just reverted those changes so we can give CI another go. I think it would definitely improve readability, but let's see if CI passes first.

@mjp41
Member

mjp41 commented Apr 29, 2026

@paulhdk for other issues. There is the Rust testing you mentioned, which would be awesome.

If you want something fairly self-contained but pretty low-level there is: #594. There is a proposed solution in there, but no one has tried it. That would get you familiar with a bit of snmalloc platform abstraction.

A really meaty issue would be to take on some of #740. The first bullet would be pretty tough, but both myself and @SchrodingerZhu would be interested and able to provide you information and guidance. I'd suggest making a sub-issue to discuss how it would be done, if you want to try that.

Member

@mjp41 mjp41 left a comment

Awesome. Thanks for your first contribution to snmalloc.

@mjp41 mjp41 merged commit 6d6c8f0 into microsoft:main Apr 29, 2026
192 checks passed
@paulhdk
Contributor Author

paulhdk commented Apr 29, 2026

@paulhdk for other issues. There is the Rust testing you mentioned, which would be awesome.

If you want something fairly self-contained but pretty low-level there is: #594. There is a proposed solution in there, but no one has tried it. That would get you familiar with a bit of snmalloc platform abstraction.

This one also sounds like a great excuse to invest some more time in figuring out how to run the CI locally with act and then document it within snmalloc for others to use. I believe the current documentation for running the CI locally is outdated.

A really meaty issue would be to take on some of #740. The first bullet would be pretty tough, but both myself and @SchrodingerZhu would be interested and able to provide you information and guidance. I'd suggest making a sub-issue to discuss how it would be done, if you want to try that.

Awesome!

All of these sound super interesting! Let's see how far I can get 🙂

@paulhdk paulhdk deleted the test-new branch April 29, 2026 15:03
@mjp41
Member

mjp41 commented Apr 30, 2026

@paulhdk for other issues. There is the Rust testing you mentioned, which would be awesome.
If you want something fairly self-contained but pretty low-level there is: #594. There is a proposed solution in there, but no one has tried it. That would get you familiar with a bit of snmalloc platform abstraction.

This one also sounds like a great excuse to invest some more time in figuring out how to run the CI locally with act and then document it within snmalloc for others to use. I believe the current documentation for running the CI locally is outdated.

I didn't know we had documentation on that, so it is definitely going to be out of date ;-)

A really meaty issue would be to take on some of #740. The first bullet would be pretty tough, but both myself and @SchrodingerZhu would be interested and able to provide you information and guidance. I'd suggest making a sub-issue to discuss how it would be done, if you want to try that.

Awesome!

All of these sound super interesting! Let's see how far I can get 🙂

Awesome. Please raise issues if you start work, so we don't duplicate effort.

@paulhdk
Contributor Author

paulhdk commented Apr 30, 2026

@paulhdk for other issues. There is the Rust testing you mentioned, which would be awesome.
If you want something fairly self-contained but pretty low-level there is: #594. There is a proposed solution in there, but no one has tried it. That would get you familiar with a bit of snmalloc platform abstraction.

This one also sounds like a great excuse to invest some more time in figuring out how to run the CI locally with act and then document it within snmalloc for others to use. I believe the current documentation for running the CI locally is outdated.

I didn't know we had documentation on that, so it is definitely going to be out of date ;-)

A really meaty issue would be to take on some of #740. The first bullet would be pretty tough, but both myself and @SchrodingerZhu would be interested and able to provide you information and guidance. I'd suggest making a sub-issue to discuss how it would be done, if you want to try that.

Awesome!
All of these sound super interesting! Let's see how far I can get 🙂

Awesome. Please raise issues if you start work, so we don't duplicate effort.

Thank you, will do!

Development

Successfully merging this pull request may close these issues.

new.cc does not have functional tests

4 participants