Refactor TCPSocket::setup Timeout and Error Handling #511
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## master #511 +/- ##
==========================================
+ Coverage 77.14% 77.20% +0.06%
==========================================
Files 116 116
Lines 6354 6415 +61
Branches 2764 2792 +28
==========================================
+ Hits 4902 4953 +51
- Misses 1101 1106 +5
- Partials 351 356 +5
Cursor Bugbot has reviewed your changes using default mode and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 6618266.
To be honest, I'm not sure whether this actually improves things a lot. Am I correct that you want to solve UniversalRobots/Universal_Robots_ROS2_Driver#838? From my understanding, that issue isn't about the blocking connect itself. For example, you can start the primary pipeline example from the client library with a wrong IP (or an IP address that doesn't host a primary interface) and just close it using CTRL-C. Thus, I would expect that UniversalRobots/Universal_Robots_ROS2_Driver#838 is more of a problem of signal forwarding inside the ROS node. This PR seems to complicate the socket connection quite a lot, and I'd like to challenge the requirement for actually doing this.
Hi @urfeex. Well... this PR doesn't fully resolve #838 (UniversalRobots/Universal_Robots_ROS2_Driver#838) on its own.
Regarding the
With this PR implemented, the output on the terminal when using a wrong IP for that example will be:
~/ws_rol/ur_c/Universal_Robots_Client_Library/b/examples tcp_socket_timeout > ./primary_pipeline_example 1.2.3.4
[1780382828.827426] ERROR /home/mirserv/ws_rolling/ur_client_library/Universal_Robots_Client_Library/src/comm/tcp_socket.cpp 270: Failed to connect to robot on IP 1.2.3.4:30001. Reason: Connection timed out. Retrying in 10 seconds.
[1780382839.328329] ERROR /home/mirserv/ws_rolling/ur_client_library/Universal_Robots_Client_Library/src/comm/tcp_socket.cpp 270: Failed to connect to robot on IP 1.2.3.4:30001. Reason: Connection timed out. Retrying in 10 seconds.
[1780382849.829394] ERROR /home/mirserv/ws_rolling/ur_client_library/Universal_Robots_Client_Library/src/comm/tcp_socket.cpp 270: Failed to connect to robot on IP 1.2.3.4:30001. Reason: Connection timed out. Retrying in 10 seconds.
[1780382860.330338] ERROR /home/mirserv/ws_rolling/ur_client_library/Universal_Robots_Client_Library/src/comm/tcp_socket.cpp 270: Failed to connect to robot on IP 1.2.3.4:30001. Reason: Connection timed out. Retrying in 10 seconds.
Without the PR, it blocks without showing any output:
~/ws_rol/ur_c/Universal_Robots_Client_Library/b/examples master > ./primary_pipeline_example 1.2.3.4
^C
~/ws_rol/ur_c/Universal_Robots_Client_Library/b/examples master >
In a ROS 2 node like
Currently, when
Universal_Robots_Client_Library/src/comm/tcp_socket.cpp Lines 75 to 76 in 627cf20
Because of this, the thread is blocked. This makes the
This PR introduces a timeout using
To fix this at the node level, I actually tested moving
Let me know if this makes sense or if you think I should take a different approach.
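For readers wondering where the "Reason: ..." text in the log lines quoted in this reply could come from, here is a minimal sketch. It assumes a POSIX errno value wrapped in std::generic_category(); the helper name describeSocketError is hypothetical and is not part of the client library, and the exact wording of the message depends on the platform's C library.

```cpp
#include <cerrno>
#include <string>
#include <system_error>

// Hypothetical helper (not the library's API): wrap an errno value in a
// std::error_code and format it like the "Reason: ..." part of the retry log.
std::string describeSocketError(int err)
{
  // Assumption: POSIX errno values mapped through the generic category.
  const std::error_code ec(err, std::generic_category());
  // message() returns the platform's description of the error.
  return "Reason: " + ec.message() + ".";
}
```

On a glibc-based Linux system, describeSocketError(ETIMEDOUT) would typically read "Reason: Connection timed out.", but that string is platform-dependent and should not be relied on for program logic. Comparing the error_code against std::errc::timed_out is the portable check.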

Refactor TCPSocket::setup
Description
This PR refactors the TCPSocket::setup() method to resolve the blocking behavior during network failures.
The implementation now takes an asynchronous, non-blocking approach to enforce a connection timeout and capture socket errors.
Step-by-Step Modifications
1. Signature Update (Timeout Parameter)
Added a timeout parameter (std::chrono::milliseconds) to the setup() function. This makes it possible to define exactly how long to wait for a connection before giving up. Without it, the code can block indefinitely, keeping the thread hanging even if the maximum number of tries is set to 1. The default value has been set to 500 ms.
2. Error Tracking (std::error_code)
Added a std::error_code socket_error variable.
getLastSocketErrorCode() is called immediately after any failure to store the exact system reason (e.g., ECONNREFUSED, EHOSTUNREACH) so it can be reported later.
3. Non-Blocking Connection Logic
The socket is switched to non-blocking mode using fcntl() and O_NONBLOCK.
::connect() is called. If it reports EINPROGRESS, the code now hands control over to ::select().
select() waits for the socket to become writable, strictly bounded by the newly provided timeout parameter.
If select() indicates readiness, ::getsockopt(..., SO_ERROR, ...) is used to verify whether the connection actually succeeded or failed in the background.
4. Proper Resource Cleanup on Failure
If socket_fd_ is created but the connection attempt fails, it is now properly closed.
Added ::ur_close(socket_fd_) and a reset to INVALID_SOCKET inside the failure branches within the for loop to properly release resources. Also added error capturing if ::socket() creation fails directly.
5. Improved Logging & Debugging
Replaced the generic hint ("Please check that the robot is booted and reachable...") with the exact system string translated by socket_error.message().
Retry logs now show "Reason: Connection refused", "Reason: Operation not permitted", or "Reason: Connection timed out", improving troubleshooting capabilities for end users.
Testing Performed
select() and fails immediately.
timeout duration before logging a timeout error.
Note
Medium Risk
Changes core TCP connection establishment logic (non-blocking connect + select()/getsockopt()), which can affect connectivity behavior across platforms and edge cases like fd limits/timeouts.
Overview
TCPSocket::setup() now accepts an additional timeout (default 500ms) and uses a non-blocking connect with select() to enforce connection timeouts instead of potentially blocking indefinitely.
Connection attempts now capture and surface the underlying std::error_code reason in retry logs, add extra failure-path cleanup (close/reset fd), and include a non-Windows guard for FD_SETSIZE overflow before using select().
Reviewed by Cursor Bugbot for commit 6b46158. Bugbot is set up for automated code reviews on this repo.
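The sequence described under "Non-Blocking Connection Logic" and "Proper Resource Cleanup on Failure" (non-blocking connect, select() bounded by a timeout, SO_ERROR check, close on every failure path, FD_SETSIZE guard) can be sketched roughly as follows. This is not the code from the PR: connectWithTimeout is a hypothetical free function, it is POSIX-only, it handles IPv4 literals only, and it uses plain ::close() and -1 where the library uses ::ur_close() and INVALID_SOCKET.

```cpp
#include <arpa/inet.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <unistd.h>

#include <cerrno>
#include <chrono>
#include <cstdint>
#include <string>
#include <system_error>

// Hypothetical helper: returns a connected fd, or -1 with `ec` holding the reason.
int connectWithTimeout(const std::string& ip, uint16_t port,
                       std::chrono::milliseconds timeout, std::error_code& ec)
{
  ec.clear();
  sockaddr_in addr{};
  addr.sin_family = AF_INET;
  addr.sin_port = htons(port);
  if (::inet_pton(AF_INET, ip.c_str(), &addr.sin_addr) != 1)
  {
    ec = std::make_error_code(std::errc::invalid_argument);
    return -1;
  }

  const int fd = ::socket(AF_INET, SOCK_STREAM, 0);
  if (fd < 0)
  {
    ec = std::error_code(errno, std::generic_category());
    return -1;
  }

  // Every failure path records the reason and releases the descriptor.
  auto fail = [&](int err) {
    ec = std::error_code(err, std::generic_category());
    ::close(fd);
    return -1;
  };

  // select() cannot watch descriptors at or above FD_SETSIZE.
  if (fd >= FD_SETSIZE)
    return fail(EMFILE);

  const int flags = ::fcntl(fd, F_GETFL, 0);
  if (flags < 0 || ::fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0)
    return fail(errno);

  const int rc = ::connect(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr));
  if (rc != 0 && errno != EINPROGRESS)
    return fail(errno);  // immediate failure, no need to wait

  if (rc != 0)
  {
    // Connection is in progress: wait until the socket is writable or time runs out.
    fd_set write_fds;
    FD_ZERO(&write_fds);
    FD_SET(fd, &write_fds);
    timeval tv;
    tv.tv_sec = static_cast<time_t>(timeout.count() / 1000);
    tv.tv_usec = static_cast<suseconds_t>((timeout.count() % 1000) * 1000);

    const int ready = ::select(fd + 1, nullptr, &write_fds, nullptr, &tv);
    if (ready == 0)
      return fail(ETIMEDOUT);
    if (ready < 0)
      return fail(errno);  // includes EINTR; a real implementation might retry

    // Writable does not mean connected: SO_ERROR carries the actual outcome.
    int so_error = 0;
    socklen_t len = sizeof(so_error);
    if (::getsockopt(fd, SOL_SOCKET, SO_ERROR, &so_error, &len) < 0)
      return fail(errno);
    if (so_error != 0)
      return fail(so_error);
  }

  ::fcntl(fd, F_SETFL, flags);  // restore the original (blocking) mode
  return fd;
}
```

A caller would check the return value and, on -1, log ec.message() before the next retry. Whether the PR restores blocking mode after a successful connect is not stated in the description; the sketch does so only as one reasonable choice.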