WIP: Rtapi cleanup v2 by hdiethelm · Pull Request #3919 · LinuxCNC/linuxcnc

hdiethelm · 2026-04-09T20:43:18Z

Continuation of #3908 reverted in #3918

Target: Move some classes out of the huge uspace_rtapi_app.cc

uspace_rtapi_main: Contains the main function and helpers
uspace_rtapi_app: Contains the RtapiApp class
uspace_posix: Contains the PosixApp class

Other fixes:

Don't start master just for exit: 371793c
Remove unused rtapi_task::ratio: 3097578
- Set but only read to force a fixed ratio which probably was not the intention
Slightly different lock to avoid needing reference to instance: c8af827
All real-time implementations are now libraries and handled the same way: 62e5ea6
No hard-coded library paths: e1bbaf4

hdiethelm · 2026-04-09T20:47:56Z

Still open:

libuspace-xxx.so.0 -> liblinuxcnc-rtapi-xxx.so ?
- Longest name would be liblinuxcnc-rtapi-xenomai-evl.so.0
Check if slightly different lock to avoid needing reference to instance: c8af827 is a good idea
Review

And

In the Submakefile it reads:
$(call TOOBJSDEPS, $(USPACE_POSIX_SRCS)): EXTRAFLAGS += -pthread -fPIC
Shouldn't that be (not adding to EXTRAFLAGS):
$(call TOOBJSDEPS, $(USPACE_POSIX_SRCS)): EXTRAFLAGS = -pthread -fPIC

Hmm, I don't understand makefiles in depth. I just copied what was already there a few lines below and edited it to match my lib. As far as I understand it, this just adds these flags to the global EXTRAFLAGS in the Makefile for that one compile command only. I did not see any duplicated flags.

NTULINUX · 2026-04-11T07:36:19Z

Works here on Gentoo, rip, system install, clang and gcc builds, whole shebang!

edit: Have not yet tested with RTAI.

hdiethelm · 2026-04-11T10:46:25Z

Thanks for testing!

I renamed the libraries to:

liblinuxcnc-uspace-posix.so.0 liblinuxcnc-uspace-xenomai-evl.so.0
liblinuxcnc-uspace-rtai.so.0    liblinuxcnc-uspace-xenomai.so.0

The names are a bit long, but I think that is fine. I'm open to other suggestions.

Additionally, I reduced the globals.

I tested all 5 configurations in a VM and they all work. There are two issues but I don't think they are due to this PR:

LXRT doesn't start on an isolated CPU. Workaround: RTAPI_CPU_NUMBER=1 linuxcnc
RTAI doesn't unload the modules in the correct order and fails to unload all

hdiethelm · 2026-04-11T10:54:11Z

@NTULINUX: Do you have a test setup you can share, or is it all set up manually? A series of Dockerfiles would be nice, so different OSes can be tested to see whether something starts. I messed up my VM slightly by using make install and only afterwards figured out that there is no make uninstall, but I was able to restore it.
If there is nothing yet, I can create some; it should not be too involved. I often use podman for similar things where I don't want to mess up my host. Podman is similar to docker but doesn't need root.

NTULINUX · 2026-04-11T13:34:21Z

I have VM images up right now but they're in flux. I'm going to post new VMs with all the right fixes in a bit. Currently tracking this PR here:

#3925

My ebuilds at the moment are broken but we're in the middle of sorting it all out for good. Will share link to new VM soon with everything tied together, cleanly.

src/rtapi/uspace_rtapi_main.cc

BsAtHome · 2026-04-10T21:43:36Z

src/rtapi/uspace_rtapi_main.cc

+
+    default: // pretty bad
+        rtapi_print_msg(RTAPI_MSG_ERR, "rtapi_app: caught signal %d - dumping core\n", sig);
+        sleep(1); // let syslog drain


Calling sleep() in a signal handler may be problematic because it can be implemented using SIGALRM. Signals inside a signal handler are a very bad concept.

Do you have a good solution for this? I don't really understand what the idea behind this code is.

The whole "exit" procedure done in a signal handler is faulty and wrong. You normally run your main handler in a select/poll loop where one of the input descriptors is the read-end of a pipe. You write (one byte) into this pipe from the signal handler and you know what to do outside the signal handler in the main handler loop. Then you can use a simple switch outside the signal handler to determine what to do.

The code "tries" to do a flush, but does it in the wrong place. Anything called in there is bound to be wrong or not guaranteed to work as you need it to. BTW, dumping core should not be preceded by printing a message because, for all you know, the printing of a message causes the signal (SEGV, for example) and everything is corrupt. Dumping core needs to be just that: dump core.
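
For reference, the pipe approach described above (often called the self-pipe trick) can be sketched roughly as follows. This is a minimal illustration, not code from this PR: the names g_sig_pipe, on_signal, setup_signal_pipe and wait_for_signal are hypothetical, and a real main loop would poll its other descriptors alongside the pipe.

```cpp
#include <cerrno>
#include <fcntl.h>
#include <poll.h>
#include <signal.h>
#include <unistd.h>

// Hypothetical names; none of this is taken from uspace_rtapi_main.cc.
static int g_sig_pipe[2] = {-1, -1};

// Signal handler: only write() is called, which is async-signal-safe.
static void on_signal(int sig) {
    int saved_errno = errno;
    unsigned char b = static_cast<unsigned char>(sig);
    ssize_t n = ::write(g_sig_pipe[1], &b, 1);
    (void)n; // nothing sensible can be done about a failure here
    errno = saved_errno;
}

// Create the pipe and install the handler for one signal. Returns 0 on success.
static int setup_signal_pipe(int sig) {
    if (::pipe(g_sig_pipe) != 0)
        return -1;
    for (int fd : g_sig_pipe) {
        // Non-blocking so the handler can never stall on a full pipe.
        int fl = ::fcntl(fd, F_GETFL);
        ::fcntl(fd, F_SETFL, fl | O_NONBLOCK);
        ::fcntl(fd, F_SETFD, FD_CLOEXEC);
    }
    struct sigaction sa = {};
    sa.sa_handler = on_signal;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;
    return ::sigaction(sig, &sa, nullptr);
}

// Main-loop side: returns the signal number, 0 on timeout, -1 on error.
static int wait_for_signal(int timeout_ms) {
    struct pollfd pfd = {g_sig_pipe[0], POLLIN, 0};
    int rc;
    do {
        rc = ::poll(&pfd, 1, timeout_ms);
    } while (rc < 0 && errno == EINTR); // poll() is interrupted by the handler
    if (rc <= 0)
        return rc;
    unsigned char b = 0;
    if (::read(g_sig_pipe[0], &b, 1) != 1)
        return -1;
    return b;
}
```

The main loop can then switch on the value returned by wait_for_signal() and do printing, flushing and shutdown outside the handler.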

BsAtHome · 2026-04-10T21:50:42Z

src/rtapi/uspace_rtapi_main.cc

+        //If called in master mode with exit command, no need to start master
+        //and exit again
+        if (args.size() == 1 && args[0] == "exit") {
+            return 0;
+        }


If you exit here, then why perform the socket() and bind() calls?

This code is a bit funny: Bind is used to detect if a master is running. It returns 0 if no master is already running, so a new one is started.

Now I had the case that, when exiting, everything was initialized again, because the master exits automatically as soon as there is no instance any more: Line 417.

Due to that, the master was started again just to call exit, and "Note: Using POSIX realtime" was shown a second time before exiting when I closed the latency test.

But isn't that just as race-prone as any other method?

The concept "the first becomes master" is flawed when you try to exit.

Which kind of race do you have in mind? I think bind is probably atomic; I would have to research it.

The issue I fixed is just: "Don't start a master if none is running just to immediately close it because the command was exit"

The only one that came to my mind is when there are multiple clients at the same time: one sends exit while the other sends a loadrt, for example. In that case, it is random whether the master exits and then starts again to execute loadrt, or the other way around, loadrt and then exit.

But this "If there is no master, I will become master" feels a bit wrong in general. Better would be: "A master is started at the start of linuxcnc and closed at the end", but this is too much for this PR, I think, and probably needs changes in other places too.

By the time you determine that there is no master, someone else can have become master. By the time you determine that there is a master, it may have exited before sending your command.

But you are right, this needs to be rethought and redone from the ground up. Let's leave it for now.

By the time you determine that there is no master, someone else can have become master.

As long as bind is atomic, this cannot happen. If bind is not atomic, two applications could be server on the same socket, which would be a Linux bug.
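
To illustrate the "whoever binds first is master" idea discussed here, a minimal sketch follows. It assumes a Linux abstract-namespace Unix socket; the function name try_become_master and the socket name are hypothetical and not taken from the PR.

```cpp
#include <cerrno>
#include <cstddef>
#include <cstring>
#include <string>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

// Try to claim a Linux abstract Unix socket name (hypothetical helper).
// Returns the bound fd if this caller now owns the name, or -1 with
// errno == EADDRINUSE if another socket already holds it.
static int try_become_master(const std::string &name) {
    struct sockaddr_un addr = {};
    addr.sun_family = AF_UNIX;
    // Abstract namespace: sun_path[0] stays '\0', the name follows it.
    if (name.empty() || name.size() > sizeof(addr.sun_path) - 1) {
        errno = EINVAL;
        return -1;
    }
    std::memcpy(addr.sun_path + 1, name.data(), name.size());
    socklen_t len = static_cast<socklen_t>(
        offsetof(struct sockaddr_un, sun_path) + 1 + name.size());

    int fd = ::socket(AF_UNIX, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;
    if (::bind(fd, reinterpret_cast<struct sockaddr *>(&addr), len) != 0) {
        int saved = errno; // keep bind()'s errno across close()
        ::close(fd);
        errno = saved;
        return -1;
    }
    return fd;
}
```

The kernel decides which of two concurrent callers gets the name, so two masters on one name are not possible. The sketch does not address the other race raised in this thread: a master that was present during the check can still exit before the client's command arrives.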

By the time you determine that there is a master, it may have exited before sending your command.

That can happen only if the application removes everything and then starts to add new things, because the master exits on exit or when there are no instances any more. Probably no linuxcnc app should do that, but you can do it manually in halcmd. I have to test this.

But you are right, this needs to be rethought and redone from the ground up. Let's leave it for now.

Ok, we do that later. I think it's not a bad issue, just hard-to-read code with possible issues.

BsAtHome · 2026-04-10T21:54:27Z

src/rtapi/uspace_rtapi_main.cc

+                break;
+            if (i == 0)
+                srand48(t0.tv_sec ^ t0.tv_usec);
+            usleep(lrand48() % 100000);


Can there be unforeseen interaction with SIGALRM here? I don't think you want to involve signals. Also, wouldn't there be a minimum wait wanted here? A random wait can return close to zero all the time and you only loop three times.

That one looks funny. A signal will break it, the for loop is only 0.3s (100000 != 1000000), and gettimeofday can fail if you change the system time. Improved.

I did not find a function for diff of timespec, so I added one above main.

BTW: That one could also be simplified by just using (now->tv_sec - start->tv_sec) < 3 but this would be something between 2 and 4s.
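
A timespec difference helper of the kind mentioned above could look roughly like this. It is a sketch under assumptions, not the function added in the PR: the names timespec_diff_ns and timed_out are hypothetical, and CLOCK_MONOTONIC is used so that changing the system time does not affect the timeout.

```cpp
#include <cstdint>
#include <time.h>

// Difference (end - start) in nanoseconds; may be negative.
static int64_t timespec_diff_ns(const struct timespec &start,
                                const struct timespec &end) {
    return (static_cast<int64_t>(end.tv_sec) - start.tv_sec) * 1000000000LL +
           (end.tv_nsec - start.tv_nsec);
}

// True once timeout_ns has elapsed since start on the monotonic clock.
static bool timed_out(const struct timespec &start, int64_t timeout_ns) {
    struct timespec now;
    clock_gettime(CLOCK_MONOTONIC, &now);
    return timespec_diff_ns(start, now) >= timeout_ns;
}
```

Comparing full nanosecond differences gives a timeout of 3 s rather than "something between 2 and 4 s", which is what comparing tv_sec alone would give.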

BsAtHome · 2026-04-10T21:58:22Z

src/rtapi/uspace_rtapi_main.cc

+    pthread_cancel(queue_thread);
+    pthread_join(queue_thread, nullptr);
+    rtapi_msg_queue.consume_all([](const message_t &m) {
+        fputs(m.msg, m.level == RTAPI_MSG_ALL ? stdout : stderr);


Why fputs()?

It writes out all pending messages before exiting.
Same on line 81, where the normal message writer task function is. It's a bit risky when the 0 termination is missing but otherwise fine. Any suggestions?
m.level == RTAPI_MSG_ALL ? stdout : stderr sends to stdout if the message is from rtapi_print() -> RTAPI_MSG_ALL, otherwise stderr. Looks also fine.

As long as it isn't in a signal handler it should be fine. Just wondering.

However, now you mention rtapi_print (and friends), there are rtapi_print_msg calls in the signal handler. That may pose a real problem. The default handler prints using stdio and that is a no-no in a signal handler.

This one is not in a signal handler, and the standard handler for uspace_rtapi writes to a fifo. Only uspace_ulapi prints directly, which is fine because it has no signal handler.

This "just link a different function with the same name" is hard to track.

But motion.c uses rtapi_set_msg_handler() to change it. I have to track this. I think this is a real time thread.

Yes, motion.c is part of motmod and is part of the interface layer in realtime from commands from non-realtime.

I tracked the motion.c part a bit more. I don't fully understand why this is done that way, but it goes to a fifo, so it seems to be fine.

I have to correct myself:
https://github.com/hdiethelm/linuxcnc-fork/blob/rtapi_cleanup_v2/src/rtapi/uspace_rtapi_main.cc#L927
rtapi_print_msg() uses vfprintf() in the main thread. So it should also not be used in the signal handler.

The "simple" alternative for writing static/fixed messages in a signal handler is something like:

static void sighandler(int sig)
{
    ...
    static const char msg[] = "This is a message from a sighandler to stderr\n";
    ::write(2, msg, sizeof(msg)-1); // minus one to prevent the \0 from printing
    ...
}

andypugh · 2026-04-12T18:46:49Z

Some discussion in the Sunday video meet-up has suggested that we should look at incorporating this into the rtapi cleanup:

#918

hdiethelm · 2026-04-12T20:51:06Z

I commented and corrected the things I have changed.
Now do we really want to do all of that in this MR?

I always prefer to have refactoring (moving code around) and bug fixing/functional changes as separated as possible. That makes review and testing easier, you just check that the moved code arrived as it was and so avoids having merge conflicts due to the branch staying open for too long.

But I have to admit, I also did some changes that I just was not able to leave it as it was and which simplified moving code due to removed dependencies and of course created a bug doing so.

BsAtHome · 2026-04-12T20:56:50Z

See: https://www.man7.org/linux/man-pages/man7/signal-safety.7.html for the calls you are allowed to make in signal handlers.

BsAtHome · 2026-04-12T21:04:15Z

I commented and corrected the things I have changed. Now do we really want to do all of that in this MR?

Well, that is a good question. You are doing "cleanup" and that would imply refactor and fixes, IMO.

I always prefer to have refactoring (moving code around) and bug fixing/functional changes as separated as possible. That makes review and testing easier, you just check that the moved code arrived as it was and so avoids having merge conflicts due to the branch staying open for too long.

That is a possibility too, but there are soooo (key got stuck) many problems with this code that it is a good question why not hit those ~~two birds with one stone~~ two lines with one keypress.

But I have to admit, I also did some changes that I just was not able to leave it as it was and which simplified moving code due to removed dependencies and of course created a bug doing so.

And I appreciate the changes! They are very necessary. Before we're done it needs to be tested, of course. But when the code gets better structured, then that should also become easier, I hope.

But, if you want to split it, then I think we need to have a very clear split where you move without actual changes and then refactor. My opinion is that moving code also classifies as refactoring and therefore it would be a missed opportunity if not fixed in one go.

BsAtHome · 2026-04-12T21:10:34Z

BTW, I've been wanting to vacuum this code for a long time but have not gotten around to that point yet. You know, INI-file reader just done, HAL types/access revamp in the pipeline, tcl9 stuff that still needs fixing, build system cleanup, and so on.

(and also got CI's -Werror merged)

hdiethelm · 2026-04-12T21:31:59Z

I commented and corrected the things I have changed. Now do we really want to do all of that in this MR?

Well, that is a good question. You are doing "cleanup" and that would imply refactor and fixes, IMO.

Might be I did not specify my (initial) intentions well enough. I just found the rtapi_uspace hard to manage while implementing xenomai and wanted to split it up without changing the existing code if not needed. But that diverged anyway already a bit.

With the new structure, it should also be easier to implement rtapi_udp_sendto() or something similar needed for RTNet.

I always prefer to have refactoring (moving code around) and bug fixing/functional changes as separated as possible. That makes review and testing easier, you just check that the moved code arrived as it was and so avoids having merge conflicts due to the branch staying open for too long.

That is a possibility too, but there are soooo (key got stuck) many problems with this code that it is a good question why not hit those ~~two birds with one stone~~ two lines with one keypress.

But I have to admit, I also did some changes that I just was not able to leave it as it was and which simplified moving code due to removed dependencies and of course created a bug doing so.

And I appreciate the changes! They are very necessary. Before we're done it needs to be tested, of course. But when the code gets better structured, then that should also become easier, I hope.

But, if you want to split it, then I think we need to have a very clear split where you move without actual changes and then refactor. My opinion is that moving code also classifies as refactoring and therefore it would be a missed opportunity if not fixed in one go.

I have to look into it a bit more next week if I find time. But from my point of view it would make sense to split the more involved changes in a new PR or also two.

But there is also always some testing effort behind which I don't know how much you have to do from your side.

Also there is: #3925
Just randomly, I also found that DLSYM halpr_find_comp_by_name https://github.com/hdiethelm/linuxcnc-fork/blob/rtapi_cleanup_v2/src/rtapi/uspace_rtapi_main.cc#L112 does not match the signature (char* vs const char*).

I was now wondering if there is not a better solution for this. Often I use the following pattern to define the function and the matching type in the same header, and from then on use only FUN_T* for function pointers. So you naturally change both. By some define magic, it should be possible to do both in one, but then it gets hard to read.

typedef int FUN_T(void *, void *);
int fun(void *, void *);

You know of a good way to test if both definitions match?

It would also be nicer to check at compile time if a .so has the needed main/exit functions.
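
One possible way to check at compile time that a declaration and its function type agree is sketched below, reusing the FUN_T/fun names from the pattern above. fun2 and the function bodies are hypothetical and only there to make the example complete.

```cpp
#include <type_traits>

// One typedef describes the signature; in a real project this would live
// in the shared header.
typedef int FUN_T(void *, void *);

// The declaration as written by hand.
int fun(void *, void *);

// Compile-time check: fails to build if fun() and FUN_T drift apart,
// for example char * on one side and const char * on the other.
static_assert(std::is_same<decltype(fun), FUN_T>::value,
              "fun() does not match FUN_T");

// Alternative: declare the function through the typedef itself, so there
// is only one place where the signature is spelled out.
FUN_T fun2;

// Hypothetical definitions, only so that the example links.
int fun(void *, void *) { return 0; }
int fun2(void *, void *) { return 1; }
```

This only helps where the header with the declaration is visible. A symbol that is looked up purely by name through dlsym() carries no type information, so the check has to sit next to the real declaration.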

This reverts commit a24f173.

From libuspace-* to liblinuxcnc-uspace-*

Changed size check to -2 due to 1 byte is needed at the beginning and 1 for \0 at the end

BsAtHome · 2026-04-14T20:12:28Z

I'll need to make a new look over it after all these changes to see if I missed stuff.

hdiethelm · 2026-04-14T22:08:38Z

I'll need to make a new look over it after all these changes to see if I missed stuff.

Yes, I changed again a lot based on your review...

I rewrote the socket serialize/deserialize anyway. After looking for too long at this read_strings, I could not let it be that way. It's still a bit handwavy but I also did not want to import a new library just for this.

Still open from your review:

The signal handling. I don't really see the intent behind it. Because it is in harden_rt(), it could have some influence on real time. But otherwise, I would just delete it altogether. I see no point in a graceful exit when a signal is received, except for debugging purposes. And for that there are better solutions.

I would like to postpone this to separate PRs:

Master/slave issue
Rootless
Maybe a better solution for function pointers?

Unless, of course, you see something else/new.

hdiethelm · 2026-04-14T22:43:06Z

I quickly tested the signal handler by doing a segfault on purpose in the load command handler:

../bin/rtapi_app load test
Note: Using POSIX realtime
rtapi_app: caught signal 11 - dumping core

Of course there is no core dump whatsoever, also not on master.

hdiethelm · 2026-04-14T23:25:38Z

So lacking better options, I went for: cd41d50
Do you think that's acceptable?
An option would be to add:

@@ -661,7 +662,10 @@ static void signal_handler(int sig, siginfo_t * /*si*/, void * /*uctx*/) {
         WRITE_STDERR_STR("rtapi_app: UNKNOWN - shutting down\n");
         break;
     }
-    
+    rtapi_msg_queue.consume_all([](const message_t &m) {
+        WRITE_STDERR_STR(m.msg);
+    });
+
     _exit(-1);
 }

so the last messages are shown. But I am not sure if this conforms to signal-safety.

However, it works. I blocked the normal consumer thread and then did killall rtapi_app to get messages printed in the signal handler:

Note: Using POSIX realtime
rtapi_app: SIGTERM - shutting down
Unexpected realtime delay on task 0 with period 25000
This Message will only display once per session.
Run the Latency Test and resolve before continuing.

Core dump did not work any way

BsAtHome · 2026-04-15T06:48:43Z

I quickly tested the signal handler by doing a segfault on purpose in the load command handler:
rtapi_app: caught signal 11 - dumping core
Of course there is no core dump whatsoever, also not on master.

If you run on a system with systemd (most likely) you may want to use coredumpctl to access the coredump (run coredumpctl debug to start gdb with the dump).

BsAtHome · 2026-04-15T07:09:57Z

WRITE_STDERR_STR("rtapi_app: UNKNOWN - shutting down\n");

That should work too. As long as the string is a constant and strlen is replaced by the compiler builtin, then the length expands as a constant too. Just a minor detail, you should put the expansion of a macro in parentheses.
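
A parenthesized macro along these lines might look as follows. This is a hypothetical reconstruction for illustration; the WRITE_STDERR_STR in commit cd41d50 may be defined differently (the comment above suggests it uses strlen). The helper write_literal and the macro WRITE_FD_STR are names invented for this sketch.

```cpp
#include <cstddef>
#include <sys/types.h>
#include <unistd.h>

// Thin wrapper around write(), which is async-signal-safe.
static inline ssize_t write_literal(int fd, const char *buf, size_t len) {
    return ::write(fd, buf, len);
}

// The whole expansion and every argument are parenthesized.
// "" s only compiles when s is a string literal, so sizeof(s) - 1 is the
// literal's length without the trailing '\0', known at compile time.
#define WRITE_FD_STR(fd, s) (write_literal((fd), ("" s), sizeof(s) - 1))
#define WRITE_STDERR_STR(s) WRITE_FD_STR(STDERR_FILENO, s)
```

Because this variant accepts only literals, a call such as WRITE_STDERR_STR(m.msg) would not compile with it. Writing a runtime buffer needs an explicit length instead.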

However, it works. I blocked the normal consumer thread and then did killall rtapi_app to get messages printed in the signal handler:
[snip]

You are now catching more signals than the original. That is wrong. You actually want to core dump on "bad" signals (anything without explicit meaning for the running program). The system should behave predictably, and if that does not happen you want to be able to backtrack why it didn't. That is why it core dumps, so you can backtrack.
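
If a shared handler is kept, one way to still get the default action (and with it a core dump, where enabled) for "bad" signals is to restore the default disposition and re-raise. The sketch below is an illustration under assumptions, not the PR's handler: all names are hypothetical, and run_child_and_raise exists only to demonstrate the behaviour in a child process.

```cpp
#include <signal.h>
#include <sys/resource.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static volatile sig_atomic_t g_stop_requested = 0;

// Signals with an explicit meaning only set a flag. Anything else is handed
// back to the kernel; signal() and raise() are both async-signal-safe.
static void shared_handler(int sig) {
    if (sig == SIGTERM || sig == SIGINT) {
        g_stop_requested = 1;
        return;
    }
    ::signal(sig, SIG_DFL);
    ::raise(sig); // delivered with the default action once the handler returns
}

static int install_handler(int sig) {
    struct sigaction sa = {};
    sa.sa_handler = shared_handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    return ::sigaction(sig, &sa, nullptr);
}

// Demo helper: fork, install the handler in the child, raise sig there and
// return the child's wait status to the parent.
static int run_child_and_raise(int sig) {
    pid_t pid = ::fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        struct rlimit no_core = {0, 0};
        ::setrlimit(RLIMIT_CORE, &no_core); // keep the demo from writing core files
        install_handler(sig);
        ::raise(sig);
        ::_exit(g_stop_requested ? 0 : 1);
    }
    int status = 0;
    if (::waitpid(pid, &status, 0) != pid)
        return -1;
    return status;
}
```

With this, a child that receives SIGTERM exits normally, while one that receives SIGABRT is reported by waitpid() as killed by SIGABRT, just as if no handler had been installed.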

BsAtHome reviewed Apr 12, 2026


hdiethelm added 9 commits April 13, 2026 21:39

Reapply "Rtapi cleanup"

7df21d0

This reverts commit a24f173.

Cleanup: Library doesn't need a hardcoded path

bdbf604

Cleanup: Get rid of global task_array and add static where possible

352c782

Cleanup: Get rid of globals ruid/euid

98b45d6

Cleanup: Rename libs

4c1a0de

From libuspace-* to liblinuxcnc-uspace-*

Cleanup: Get rid of globals find_rt_cpu_number / set_namef

67d61ba

Cleanup: Review: const std::string &

1ed7649

Cleanup: Correct sizeof()

26ea00a

Cleanup: Correct _exit

f165b58

hdiethelm force-pushed the rtapi_cleanup_v2 branch from 44ab357 to f165b58 Compare April 13, 2026 19:44

hdiethelm added 3 commits April 13, 2026 22:24

Cleanup: Review: get_fifo_path nicer

9dfe1e2

Cleanup: Review: Fix timeout

75925a7

Cleanup: Improve get_fifo_path_to_addr

2692a61

Changed size check to -2 due to 1 byte is needed at the beginning and 1 for \0 at the end

hdiethelm added 2 commits April 14, 2026 23:52

Cleanup: Rewrite socket protocol

379b802

Cleanup: Run clang-format

6c1aa3d

Cleanup: Signal handler: Use only allowed functions

cd41d50

Core dump did not work any way

hdiethelm force-pushed the rtapi_cleanup_v2 branch from 9cf3880 to cd41d50 Compare April 14, 2026 23:31
