ros2_medkit_opcua: native AlarmConditionType subscription bridge#387

Open
mfaferek93 wants to merge 16 commits into main from feat/issue-386-opcua-alarm-events

Conversation


mfaferek93 commented Apr 25, 2026

Summary

Adds native OPC-UA Part 9 AlarmConditionType event subscription to the OPC-UA plugin. PLCs that already define alarms in their own runtime (Siemens S7-1500 Program_Alarm / ProDiag, Beckhoff TF6100, CodeSys 3.5+ alarm manager, Rockwell via FactoryTalk Linx) are bridged into the SOVD fault lifecycle without any duplicate threshold definitions in YAML.

Closes #386.

Scope

  • Native event subscription via raw UA_Client_MonitoredItems_createEvent (open62541pp v0.16 has no native event API).
  • EventFilter with the canonical AlarmConditionType select clauses (EventType, EventId, SourceNode, Severity, Message, ConditionId, BranchId, EnabledState, ActiveState, AckedState, ConfirmedState, ShelvingState).
  • Per-condition EventId tracking, required for spec-compliant Acknowledge (Part 9 §5.7.3).
  • Pure-function AlarmStateMachine mapping EnabledState x ShelvingState x ActiveState x AckedState x ConfirmedState x BranchId to SOVD CONFIRMED / HEALED / CLEARED / Suppressed. Decision order documented in design/index.rst.
  • ConditionRefresh (Server method i=3875) on subscribe and on every reconnect, with RefreshStartEvent / RefreshEndEvent recognised.
  • New top-level event_alarms: block in node_map.yaml, mutually exclusive per entry with the existing threshold alarm form.
  • acknowledge_fault and confirm_fault SOVD operations on every entity that hosts at least one event-mode alarm; calls i=9111 / i=9113 on the live ConditionId with the tracked EventId and an optional LocalizedText comment.
  • Existing threshold polling and OpenPLC integration unchanged.
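
A minimal sketch of the new YAML form (key names taken from the commit messages on this branch; the entity/fault values are hypothetical):

```yaml
# Hypothetical example entry - entity_id / fault_code values are made up.
event_alarms:
  - alarm_source: "ns=2;s=Alarms.Overpressure"  # NodeId emitting AlarmConditionType events
    entity_id: tank_process
    fault_code: TANK_OVERPRESSURE
    severity_override: ERROR                    # optional; wins over the 1-1000 bucket mapping
```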

Out of scope

  • ShelvingState write operations (timed / one-shot shelving). Read-side suppression is in scope; operator UI is not.
  • OPC-UA branch reasoning beyond BranchId-based suppression. Re-fires are tracked via fault_manager occurrence_count and the /faults/stream SSE history.
  • Auto-discovery of alarm sources via browse (tracked in #368, "ros2_medkit_opcua: add browse-based auto node_map generation").
  • Quality (StatusCode) propagation to a SOVD status_quality field. Requires an additive ReportFault.srv field; tracked separately.

Tests

  • 144 unit tests green on Jazzy (105 pre-existing + 6 new OpcuaClient event API + 22 AlarmStateMachine covering the full transition table + 11 plugin / node_map paths exercised from existing suites).
  • Custom test_alarm_server fixture (open62541 with UA_NAMESPACE_ZERO=FULL and UA_ENABLE_SUBSCRIPTIONS_ALARMS_CONDITIONS=ON) emits real AlarmConditionType events; smoke test in test/fixtures/test_alarm_server/smoke_test.py verifies the topology via asyncua.
  • Docker integration suite docker/scripts/run_alarm_tests.sh boots the fixture + gateway, fires alarms via stdin, and asserts the SOVD /faults endpoint moves through CONFIRMED -> HEALED -> CLEARED with intermediate acknowledge_fault / confirm_fault round-trips. Polling-with-timeout throughout, no fixed sleeps. Wired into the opcua-plugin workflow as a new parallel job.

Test plan

  • Unit tests green on Humble + Jazzy + Rolling.
  • ASAN + TSAN clean.
  • OpenPLC threshold integration still passes (no regression on existing path).
  • New AlarmConditionType integration job passes.

Notes

  • The cmake-side test_alarm_server target is gated OFF behind MEDKIT_OPCUA_BUILD_ALARM_SERVER while the LTO namespace0_generated linker mismatch in the second ExternalProject_Add open62541 build is being resolved. The docker integration does not depend on this target - it builds open62541 inside its own image.
  • Severity 1-1000 is mapped to the existing SOVD severity buckets (1-200 INFO / 201-500 WARN / 501-800 ERROR / 801-1000 CRITICAL). This is the selfpatch convention, not IEC 62682. severity_override on an event_alarms entry takes precedence.

Copilot AI review requested due to automatic review settings on April 25, 2026 at 15:21

Copilot AI left a comment


Pull request overview

Adds native OPC-UA Part 9 AlarmConditionType event subscriptions to the ros2_medkit_opcua plugin and bridges those event-driven alarm lifecycles into the SOVD fault model, while also introducing a new race-free ROS 2 topic sampling architecture in the gateway (subscription executor + pooled topic data provider) and related shutdown/teardown hardening.

Changes:

  • Add OPC-UA AlarmConditionType event subscription plumbing, YAML configuration (event_alarms), state-machine mapping, and ack/confirm operations.
  • Replace legacy ad-hoc topic sampling with TopicDataProvider + Ros2TopicDataProvider backed by Ros2SubscriptionExecutor / Ros2SubscriptionSlot, and expose pool stats via /health.
  • Improve shutdown robustness (SSE stop wake-up, demo node shutdown helper, graph-query exception swallowing during shutdown) + update provider header layout (logs/updates/scripts/introspection).

Reviewed changes

Copilot reviewed 107 out of 111 changed files in this pull request and generated 1 comment.

File Description
src/ros2_medkit_serialization/design/index.rst Update design docs to reflect new topic data provider naming/usage.
src/ros2_medkit_plugins/ros2_medkit_opcua/test/test_opcua_client.cpp Add unit tests for new client event/method APIs (disconnected-state contracts).
src/ros2_medkit_plugins/ros2_medkit_opcua/test/test_alarm_state_machine.cpp Add unit tests for AlarmConditionType -> SOVD lifecycle state machine.
src/ros2_medkit_plugins/ros2_medkit_opcua/test/fixtures/test_alarm_server/smoke_test.py Add fixture smoke test validating AlarmConditionType nodes/methods/fields.
src/ros2_medkit_plugins/ros2_medkit_opcua/src/opcua_plugin.cpp Bridge event alarms into fault lifecycle; add ack/confirm operations.
src/ros2_medkit_plugins/ros2_medkit_opcua/src/node_map.cpp Parse event_alarms YAML and merge event-mode entities into discovery defs.
src/ros2_medkit_plugins/ros2_medkit_opcua/include/ros2_medkit_opcua/opcua_poller.hpp Add event-alarm delivery types + condition runtime lookup interface.
src/ros2_medkit_plugins/ros2_medkit_opcua/include/ros2_medkit_opcua/opcua_plugin.hpp Declare event-alarm bridge handler.
src/ros2_medkit_plugins/ros2_medkit_opcua/include/ros2_medkit_opcua/opcua_client.hpp Add raw open62541 event monitored item + method-call APIs, generation counter.
src/ros2_medkit_plugins/ros2_medkit_opcua/include/ros2_medkit_opcua/node_map.hpp Define AlarmEventConfig and event-alarm accessors.
src/ros2_medkit_plugins/ros2_medkit_opcua/include/ros2_medkit_opcua/alarm_state_machine.hpp New pure-function AlarmConditionType lifecycle state machine.
src/ros2_medkit_plugins/ros2_medkit_opcua/docker/test_alarm_server/build.sh Build helper for the alarm test server Docker image.
src/ros2_medkit_plugins/ros2_medkit_opcua/docker/test_alarm_server/Dockerfile Dockerized open62541 FULL ns0 alarm test fixture build/runtime.
src/ros2_medkit_plugins/ros2_medkit_opcua/docker/scripts/run_alarm_tests.sh Docker integration suite validating alarm lifecycle via gateway SOVD endpoints.
src/ros2_medkit_plugins/ros2_medkit_opcua/design/index.rst Document event alarm configuration, state machine, refresh, ack/confirm.
src/ros2_medkit_plugins/ros2_medkit_opcua/README.md Document event_alarms YAML and new operations usage.
src/ros2_medkit_plugins/ros2_medkit_opcua/CMakeLists.txt Add state-machine tests and optional alarm test server build wiring.
src/ros2_medkit_plugins/ros2_medkit_opcua/CHANGELOG.rst Document forthcoming alarm subscription + ops changes.
src/ros2_medkit_plugins/ros2_medkit_graph_provider/include/ros2_medkit_graph_provider/graph_provider_plugin.hpp Update introspection provider include path.
src/ros2_medkit_integration_tests/include/ros2_medkit_integration_tests/demo_node_main.hpp Add shared graceful demo-node shutdown helper (sigwait thread).
src/ros2_medkit_integration_tests/demo_nodes/rpm_sensor.cpp Switch demo node to shared shutdown helper.
src/ros2_medkit_integration_tests/demo_nodes/param_beacon_node.cpp Switch demo node to shared shutdown helper.
src/ros2_medkit_integration_tests/demo_nodes/long_calibration_action.cpp Switch demo node to shared shutdown helper.
src/ros2_medkit_integration_tests/demo_nodes/light_controller.cpp Switch demo node to shared shutdown helper.
src/ros2_medkit_integration_tests/demo_nodes/lidar_sensor.cpp Switch demo node to shared shutdown helper.
src/ros2_medkit_integration_tests/demo_nodes/engine_temp_sensor.cpp Switch demo node to shared shutdown helper.
src/ros2_medkit_integration_tests/demo_nodes/door_status_sensor.cpp Switch demo node to shared shutdown helper.
src/ros2_medkit_integration_tests/demo_nodes/calibration_service.cpp Switch demo node to shared shutdown helper.
src/ros2_medkit_integration_tests/demo_nodes/brake_pressure_sensor.cpp Switch demo node to shared shutdown helper.
src/ros2_medkit_integration_tests/demo_nodes/brake_actuator.cpp Switch demo node to shared shutdown helper.
src/ros2_medkit_integration_tests/demo_nodes/beacon_publisher.cpp Switch demo node to shared shutdown helper.
src/ros2_medkit_integration_tests/CMakeLists.txt Add include path for shared demo-node helper across demo binaries.
src/ros2_medkit_gateway/test/test_topic_data_provider_interface.cpp Add interface-level tests for TopicDataProvider mocking contract.
src/ros2_medkit_gateway/test/test_script_manager.cpp Update script provider include path.
src/ros2_medkit_gateway/test/test_ros2_subscription_executor.cpp Add unit tests for subscription executor behavior (queue, watchdog, graph cb).
src/ros2_medkit_gateway/test/test_plugin_manager.cpp Update introspection include path; avoid json alias shadowing.
src/ros2_medkit_gateway/test/test_plugin_loader.cpp Update provider include paths (introspection/updates).
src/ros2_medkit_gateway/test/test_log_manager.cpp Update log provider include path.
src/ros2_medkit_gateway/test/test_error_info.cpp Add unit tests for new ErrorInfo value type.
src/ros2_medkit_gateway/test/test_discovery_manager.cpp Update to new topic data provider wiring + thread-safe teardown.
src/ros2_medkit_gateway/test/test_data_access_manager.cpp Update tests to new topic provider wiring and teardown order.
src/ros2_medkit_gateway/test/demo_nodes/test_update_backend.cpp Update update provider include path.
src/ros2_medkit_gateway/test/demo_nodes/test_gateway_plugin.cpp Update introspection/update provider include paths.
src/ros2_medkit_gateway/src/trigger_topic_subscriber.cpp Adjust comment to reflect non-NativeTopicSampler usage.
src/ros2_medkit_gateway/src/ros2_common/ros2_subscription_slot.cpp Implement RAII subscription slot with safe deferred destroy behavior.
src/ros2_medkit_gateway/src/plugins/plugin_loader.cpp Update provider include paths (scripts/updates/introspection).
src/ros2_medkit_gateway/src/main.cpp Wire subscription executor + pooled data provider; add explicit teardown sequence.
src/ros2_medkit_gateway/src/http/rest_server.cpp Ensure SSE shutdown is requested before server stop/join.
src/ros2_medkit_gateway/src/http/handlers/sse_fault_handler.cpp Add request_shutdown() to wake SSE waiters promptly.
src/ros2_medkit_gateway/src/http/handlers/health_handlers.cpp Expose provider/executor stats via x-medkit-* keys in /health.
src/ros2_medkit_gateway/src/http/handlers/data_handlers.cpp Route sampling via TopicDataProvider and propagate provider ErrorInfo accurately.
src/ros2_medkit_gateway/src/gateway_node.cpp Add set_topic_data_provider() and route discovery/DAM sampling through it.
src/ros2_medkit_gateway/src/discovery/runtime_discovery.cpp Switch to provider-based topic mapping and swallow shutdown-time graph exceptions.
src/ros2_medkit_gateway/src/discovery/merge_pipeline.cpp Update introspection provider include path.
src/ros2_medkit_gateway/src/discovery/discovery_manager.cpp Rename setter to set_topic_data_provider.
src/ros2_medkit_gateway/src/data_access_manager.cpp Use TopicDataProvider (remove NativeTopicSampler) and propagate non-404 errors.
src/ros2_medkit_gateway/include/ros2_medkit_gateway/updates/update_provider.hpp Introduce new update provider interface header location.
src/ros2_medkit_gateway/include/ros2_medkit_gateway/updates/update_manager.hpp Update include to new update provider header location.
src/ros2_medkit_gateway/include/ros2_medkit_gateway/scripts/script_provider.hpp Introduce new script provider interface header location.
src/ros2_medkit_gateway/include/ros2_medkit_gateway/script_manager.hpp Update include to new script provider header location.
src/ros2_medkit_gateway/include/ros2_medkit_gateway/ros2_common/ros2_subscription_slot.hpp Define subscription slot API (create + safe teardown contract).
src/ros2_medkit_gateway/include/ros2_medkit_gateway/plugins/plugin_manager.hpp Update provider include paths (logs/scripts/updates/introspection).
src/ros2_medkit_gateway/include/ros2_medkit_gateway/plugins/plugin_context.hpp Update introspection provider include path.
src/ros2_medkit_gateway/include/ros2_medkit_gateway/models/error_info.hpp Add transport-neutral provider error descriptor.
src/ros2_medkit_gateway/include/ros2_medkit_gateway/logs/log_provider.hpp Introduce new log provider interface header location.
src/ros2_medkit_gateway/include/ros2_medkit_gateway/log_manager.hpp Update include to new log provider header location.
src/ros2_medkit_gateway/include/ros2_medkit_gateway/http/handlers/sse_fault_handler.hpp Add explicit request_shutdown() API documentation.
src/ros2_medkit_gateway/include/ros2_medkit_gateway/http/error_codes.hpp Add new x-medkit-* error codes for shutdown/subscribe/cold-wait failures.
src/ros2_medkit_gateway/include/ros2_medkit_gateway/gateway_node.hpp Add TopicDataProvider attach/detach API + ownership.
src/ros2_medkit_gateway/include/ros2_medkit_gateway/exceptions.hpp Add ProviderErrorException to preserve provider http/code.
src/ros2_medkit_gateway/include/ros2_medkit_gateway/discovery/runtime_discovery.hpp Switch runtime discovery to TopicDataProvider pointer.
src/ros2_medkit_gateway/include/ros2_medkit_gateway/discovery/layers/plugin_layer.hpp Update introspection provider include path.
src/ros2_medkit_gateway/include/ros2_medkit_gateway/discovery/introspection_provider.hpp New introspection provider interface header location.
src/ros2_medkit_gateway/include/ros2_medkit_gateway/discovery/discovery_manager.hpp Update API to set TopicDataProvider instead of NativeTopicSampler.
src/ros2_medkit_gateway/include/ros2_medkit_gateway/default_script_provider.hpp Update script provider include path.
src/ros2_medkit_gateway/include/ros2_medkit_gateway/data_access_manager.hpp Replace NativeTopicSampler exposure with TopicDataProvider attach/get.
src/ros2_medkit_gateway/include/ros2_medkit_gateway/data/topic_data_provider.hpp New transport-neutral topic sampling/discovery interface.
src/ros2_medkit_gateway/include/ros2_medkit_gateway/data/ros2_topic_data_provider.hpp New pooled ROS 2 implementation + stats/eviction design.
src/ros2_medkit_gateway/include/ros2_medkit_gateway/data/data_types.hpp Add transport-neutral topic discovery/sample data structures.
src/ros2_medkit_gateway/design/index.rst Update gateway design docs for new subscription architecture.
src/ros2_medkit_gateway/design/architecture.puml Update architecture diagram for TopicDataProvider/subscription executor.
src/ros2_medkit_gateway/config/gateway_params.yaml Add config knobs for executor and provider pool behavior.
src/ros2_medkit_gateway/README.md Document regression gate against “naked” rclcpp subscription creation.
src/ros2_medkit_gateway/CMakeLists.txt Add new sources/tests for subscription infra + topic provider; drop NativeTopicSampler tests.
src/ros2_medkit_fault_manager/src/snapshot_capture.cpp Serialize subscription create/destroy under mutex to close TSan race window.
src/ros2_medkit_discovery_plugins/ros2_medkit_topic_beacon/include/ros2_medkit_topic_beacon/topic_beacon_plugin.hpp Update introspection provider include path.
src/ros2_medkit_discovery_plugins/ros2_medkit_param_beacon/src/param_beacon_plugin.cpp Swallow shutdown-time graph exceptions to avoid terminate during teardown.
src/ros2_medkit_discovery_plugins/ros2_medkit_param_beacon/include/ros2_medkit_param_beacon/param_beacon_plugin.hpp Update introspection provider include path.
src/ros2_medkit_discovery_plugins/ros2_medkit_param_beacon/CMakeLists.txt Increase gmock test timeout for sanitizer overhead.
src/ros2_medkit_discovery_plugins/ros2_medkit_linux_introspection/src/systemd_plugin.cpp Update introspection provider include path.
src/ros2_medkit_discovery_plugins/ros2_medkit_linux_introspection/src/procfs_plugin.cpp Update introspection provider include path.
src/ros2_medkit_discovery_plugins/ros2_medkit_linux_introspection/src/container_plugin.cpp Update introspection provider include path.
src/ros2_medkit_discovery_plugins/ros2_medkit_beacon_common/include/ros2_medkit_beacon_common/beacon_entity_mapper.hpp Update introspection provider include path.
scripts/check_no_naked_subscriptions.sh Add CI/pre-commit regression gate for subscription creation API usage.
docs/tutorials/plugin-system.rst Update provider include paths + document subscription API restriction for plugins.
docs/api/rest.rst Document new /health vendor-extension sections for pool/executor stats.
.pre-commit-config.yaml Add local pre-commit hook for naked subscription regression gate.
.github/workflows/quality.yml Run naked subscription regression gate in CI.
.github/workflows/opcua-plugin.yml Add new AlarmConditionType docker integration job.

Comment on lines +528 to +537
case AlarmAction::ReportHealed:
// Fault is latched: condition is no longer active but not yet
// confirmed. We don't have a dedicated HEALED reporting verb in
// ReportFault.srv (only FAILED/PASSED), so we mark this as a PASSED
// event - fault_manager keeps the entry in HEALED state until
// confirmed, mirroring the lifecycle.
log_info("AlarmCondition HEALED (latched, awaiting ack/confirm): " + delivery.fault_code);
// No-op for now; fault_manager will keep the fault HEALED until
// CLEARED. The state transition is observable via /faults/stream.
break;

Copilot AI Apr 25, 2026


AlarmAction::ReportHealed is currently a no-op, but the comment says it should be reported as a PASSED event so fault_manager can transition the fault to HEALED. As written, the fault will remain CONFIRMED until it is cleared, so the intended CONFIRMED -> HEALED -> CLEARED lifecycle (and the docker integration expectation) won’t occur. Consider adding a helper to report EVENT_PASSED (or extending send_report_fault to take an event type) and invoking it here.

Adds OpcuaClient primitives for native OPC-UA event subscription, the first
commit of issue #386 (AlarmConditionType bridge to fault_manager).
open62541pp v0.16 has no native EventFilter or event subscription support, so
this plumbs the raw open62541 C API (UA_Client_MonitoredItems_createEvent)
behind a small C++ surface that mirrors the existing data-change patterns.

Public API additions:

  * EventField + EventBrowsePath types for SimpleAttributeOperand specs.
  * EventCallback signature delivering select values plus EventType and
    SourceNode (always prepended to the filter, extracted on dispatch).
  * add_event_monitored_item / remove_event_monitored_item with
    heap-allocated CallbackContext stored as unique_ptr in OpcuaClient::Impl.
  * call_method wrapping opcua::services::call with status-mapped errors -
    needed by ConditionRefresh, Acknowledge, and Confirm in later commits.
  * current_generation() exposing a monotonic counter incremented on every
    detected disconnect (clean disconnect or transport-level drop). The
    trampoline captures a snapshot at createEvent time and drops callbacks
    whose snapshot diverges from the live counter, eliminating the late
    callback / use-after-free hazard reviewers flagged.

remove_subscriptions and disconnect now bump the generation and clear
event_callbacks before tearing down open62541pp Subscriptions, ensuring
in-flight C callbacks see a stale generation rather than a freed context.

Tests: 6 new disconnected-state tests validating the API contract and the
generation-counter ordering. End-to-end event flow against a real server
runs against the test_alarm_server fixture introduced in a later commit on
this branch (per plan v2 - in-process server tests with synthetic event
triggering are deferred to keep this commit reviewable; the docker
integration test in commit 5 covers the full path).

Refs #386
…firm ops

Wires the Part 9 AlarmConditionType subscription path into the existing
threshold-polling plugin. Builds on the event subscription primitive from
the previous commit; downstream consumers see one new YAML form
(``event_alarms:`` block) and two new SOVD operations
(``acknowledge_fault``, ``confirm_fault``).

* New header-only ``AlarmStateMachine``: pure compute_status function over
  ``EnabledState x ShelvingState x ActiveState x AckedState x ConfirmedState
  x BranchId``. Decision order documented inline; Retain is intentionally
  ignored (per Part 9 it filters ConditionRefresh visibility, not
  lifecycle). 22 unit tests cover every rule plus precedence ordering.

* ``NodeMap`` learns ``event_alarms:`` (top-level YAML sibling of
  ``nodes:``). Each entry declares ``alarm_source`` (NodeId of the source
  emitting AlarmConditionType events), ``entity_id``, ``fault_code``, and
  optional severity / message overrides. Mutually exclusive with the
  per-entry threshold ``alarm`` block; load() fails fast if both are set
  on the same node. ``find_event_alarm`` lookup serves the SOVD operation
  handlers. ``build_entity_defs`` merges event-mode entities so SOVD
  discovery surfaces them as fault-bearing even without scalar data.

* ``OpcuaPoller`` gains ``setup_event_subscriptions()`` and
  ``on_event(...)``. One dedicated subscription handles all event-mode
  alarms; the trampoline dispatches positional select-clause values
  through the state machine. ConditionRefresh fires on subscribe and on
  every reconnect (using the existing exponential-backoff path); the
  generation counter from OpcuaClient already filters callbacks captured
  from defunct subscriptions.

* Per-condition runtime is keyed by ConditionId NodeId stringForm so
  multiple instances under the same source remain distinct. Each entry
  carries the latest EventId ByteString - required for spec-compliant
  Acknowledge calls (Part 9 §5.7.3 returns BadEventIdUnknown otherwise).

* Plugin's OperationProvider lists ``acknowledge_fault`` /
  ``confirm_fault`` for any entity that has at least one event-mode
  alarm. ``execute_operation`` resolves (entity_id, fault_code) through
  the poller's lookup_condition, then invokes the inherited methods on
  AcknowledgeableConditionType (i=9111 Ack, i=9113 Confirm) with the
  tracked EventId and a LocalizedText comment. HTTP error mapping mirrors
  the existing write_value path.

* AlarmCondition events bridge through ``on_event_alarm`` to the existing
  send_report_fault / send_clear_fault wiring. Severity is mapped to the
  SOVD enum buckets (1-200 INFO, 201-500 WARN, 501-800 ERROR, 801-1000
  CRITICAL); selfpatch convention, NOT IEC 62682 - documented in the
  follow-up design doc.

Tests: 22 new state-machine unit tests (full transition table coverage
plus rule-precedence). All 144 tests in the package green; ASAN/TSAN
clean; clang-format, copyright, cppcheck, lint_cmake, xmllint all pass.

The test_alarm_server fixture and its docker integration ship in the
accompanying commit on this branch (gated OFF by default while the
ExternalProject namespace0_generated linker issue is being resolved).

Refs #386
…ow, docs

Closes the integration story for issue #386. The plugin's threshold-mode
integration runs against OpenPLC; this commit adds the parallel suite for
native AlarmConditionType subscriptions, which OpenPLC does not implement.

Components:

* ``test_alarm_server`` fixture (open62541, FULL ns0, alarms ON). Standalone
  C++ binary exposing 3 conditions on tcp/4842 plus a stdin CLI
  (``fire``, ``ack``, ``confirm``, ``latch``, ``shelve``, ``unshelve``,
  ``disable``, ``enable``, ``quit``). Smoke test in
  ``test/fixtures/test_alarm_server/smoke_test.py`` verifies type
  conformance via ``asyncua``.

* Self-contained Dockerfile under ``docker/test_alarm_server/`` clones
  open62541 v1.4.6 inside the image (no dependency on a pre-populated
  workspace ``build/`` tree, so CI builds cleanly from a fresh checkout).

* ``docker/scripts/run_alarm_tests.sh`` orchestrates the fixture +
  gateway: builds both images, brings them up on a private network, fires
  alarms via the server's stdin pipe, and asserts the SOVD ``/faults``
  endpoint moves through CONFIRMED -> HEALED -> CLEARED. Polling-with-
  timeout throughout (no fixed sleeps); cleanup trap teardown.

* ``.github/workflows/opcua-plugin.yml`` gains the ``integration-alarms``
  job, parallel to the existing ``integration`` (OpenPLC) job. Both run on
  every PR that touches the plugin or its dependencies.

* ``design/index.rst`` documents the full state machine table (precedence
  order, the deliberate choice to ignore ``Retain`` for lifecycle), the
  selfpatch severity-bucket convention (and the explicit non-claim of IEC
  62682), the ``ConditionRefresh`` / ``RefreshStartEvent`` /
  ``RefreshEndEvent`` flow, the ack/confirm method NodeIds, and a vendor
  matrix covering Siemens, Beckhoff, Rockwell, CodeSys, OpenPLC.

* ``README.md`` shows the 3-line ``event_alarms:`` form and a curl example
  for ``acknowledge_fault``.

* ``CHANGELOG.rst`` Forthcoming entry.

The cmake-side ``test_alarm_server`` target stays gated OFF
(``MEDKIT_OPCUA_BUILD_ALARM_SERVER``) until the LTO ``namespace0_generated``
linker mismatch in the second open62541 ``ExternalProject_Add`` build is
resolved. The docker integration suite does not depend on it - it builds
its open62541 inside the container.

Refs #386
mfaferek93 force-pushed the feat/issue-386-opcua-alarm-events branch from 4ca2181 to c57feef on April 25, 2026 at 15:41
Adds ``test_alarm_server_smoke`` to CTest. Boots the freshly-built
fixture on an ephemeral port, waits for the ``READY`` line on stdout,
runs the existing ``asyncua`` smoke test against it, and tears the
process down. Skips with CTest exit 77 (which we map via
``SKIP_RETURN_CODE``) when ``asyncua`` is not importable, so iterating
on plugin sources without the Python dependency does not fail the
suite.
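
The exit-77 mapping uses CTest's standard ``SKIP_RETURN_CODE`` test property; a minimal sketch (test name from this branch, wiring simplified):

```cmake
# Exit code 77 from the smoke runner marks the test SKIPPED, not FAILED,
# when asyncua is not importable.
set_tests_properties(test_alarm_server_smoke PROPERTIES SKIP_RETURN_CODE 77)
```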

CI ``integration-alarms`` job installs ``asyncua`` so the smoke test
runs as a real pass / fail there. Other jobs see it as skipped, which
ament_lint surfaces but does not flag.

Refs #386
…ct E2E

The previous run_alarm_tests.sh issued its ``ack`` and ``confirm`` commands
directly via the test_alarm_server stdin CLI, which bypassed the medkit SOVD
operation path. The implementation - lookup_condition + EventId tracking +
call_method on the inherited AcknowledgeableConditionType methods (i=9111
Acknowledge, i=9113 Confirm) - therefore had only unit-level confidence.

This commit makes ack and confirm round-trip through HTTP:

  POST /api/v1/apps/tank_process/operations/acknowledge_fault/executions
  POST /api/v1/apps/tank_process/operations/confirm_fault/executions

with ``{"fault_code": "...", "comment": "..."}``. After each POST the
test polls the server's stdout for the new ``STATE`` log line (added to
the fixture in this commit) and asserts the relevant flag (``acked=true``
/ ``confirmed=true``) actually flipped on the OPC-UA server.

Additional E2E scenarios added on top of the existing fire->CONFIRMED
->latch->HEALED->CLEARED happy path:

  * Shelving suppression: fire Overheat -> CONFIRMED -> shelve ->
    fault disappears -> unshelve + fire -> CONFIRMED returns.
    Exercises ShelvingState parsing and the state-machine Rule 3.
  * Disabled alarm: fire SensorLost -> CONFIRMED -> disable ->
    fault clears -> enable + fire -> CONFIRMED. Exercises
    EnabledState parsing and Rule 2.
  * Reconnect with ConditionRefresh: fire Overpressure -> CONFIRMED ->
    docker stop -> docker start -> fire again -> CONFIRMED returns
    via the gateway's reconnect path (poll_loop -> setup_event_subs
    -> ConditionRefresh).

Fixture changes:

  * Source nodes use predictable string NodeIds ``ns=2;s=Alarms.<name>``
    so the gateway's ``alarm_source`` config maps unambiguously to a
    real node. The previous auto-assigned numeric IDs broke event
    subscription against the documented YAML form.
  * The CLI loop logs a ``STATE <name> active=... acked=... ...`` line
    after every successful command so the harness can assert state
    transitions with one ``docker logs | grep`` instead of a separate
    asyncua round-trip.

The script keeps polling-with-timeout throughout (no fixed sleeps),
restartable cleanup trap, idempotent re-runs.

Local end-to-end run was not attempted - the gateway-opcua docker image
build is the long pole (multi-minute), and the CI integration-alarms job
covers the same path on every push. Unit + state-machine + smoke tests
all stayed green during the work (147 tests, 0 failures).

Refs #386
CI Integration (AlarmConditionType) job hit 30-minute timeout at
[3/5] Start test_alarm_server because the script's stdin pipe pattern
deadlocked the runner. The shell opens redirections before exec'ing the
command, so 'docker run -d -i ... < fifo' followed by 'exec 3> fifo'
blocks at the first line forever.

Fix: open the FIFO read+write on FD 3 first ('exec 3<>fifo'), which
is non-blocking, then redirect docker's stdin from FD 3 ('docker run
... <&3'). Same pattern applied to the post-reconnect FIFO.

Side fix: the gateway entrypoint wrote /config/manifest.yaml from
inside the container, but /config is mounted read-only. Pre-write
the manifest on the host before starting the container.

Refs #386
…containers

Previous CI run failed at 'tank_process not in /apps after 120s' but the
'Dump container logs on failure' workflow step couldn't see anything -
the script's cleanup trap had already 'docker rm -f'd the containers.

Move log dump into the cleanup trap itself, gated by non-zero exit code,
so subsequent failures land actionable logs in the runner output.

Refs #386
…ind mount

Two more findings from local E2E debug:

1. docker run -d -i ... <&3 lost stdin between the daemonized client
   and the docker daemon - commands written to FD 3 from the script never
   reached the binary's stdin in the container. The -d flag detaches and
   exits the client process, which orphans the FIFO before the daemon can
   rewire it.
   Drop -d, run docker run as a shell background job instead, so the
   client process stays alive holding the FIFO open for the container.
   Track the PID and clean up on trap.

2. The gateway image bakes /config/gateway_params.yaml at build time,
   but our :ro bind mount of /tmp/alarm_test_config:/config shadows
   it. The container exited 1 with 'Couldn't parse params file'. Stage
   the params file into the bind mount alongside the alarm_nodes.yaml
   and manifest.yaml.

Refs #386
Three CI runs hit BadNodeIdUnknown on event monitored item creation
even though the server reports the source nodes at the expected NodeIds
('ns=2;s=Alarms.Overpressure' etc., verified via asyncua browse). The
mismatch lies somewhere between our NodeId parsing and the raw C call.

Add stderr traces before and after UA_Client_MonitoredItems_createEvent
so the next CI run logs the exact NodeId string form we hand to the
server, plus the returned status code. These will be tightened to
RCLCPP_DEBUG once the root cause is identified.

Refs #386
… default request

Multiple iterations to pin down BadNodeIdUnknown when adding event
monitored items - all still failing locally:

* Replaced shallow copy of source_node with UA_NodeId_copy so the
  serializer never aliases an open62541pp wrapper internal.
* Switched typeDefinitionId in SimpleAttributeOperand from
  BaseEventType to AlarmConditionType (i=2915) since the BrowsePaths
  we use (ConditionId, AckedState, ShelvingState, etc.) are not
  defined on BaseEventType per Part 9 spec.
* Replaced manual UA_MonitoredItemCreateRequest_init with
  UA_MonitoredItemCreateRequest_default(nodeId) so the request
  layout matches open62541's own examples.
* Cleared the request struct including the deep-copied NodeId after
  the call, detaching the stack-local filter first to avoid a
  double-free.

None of these alone fixed it. Server still reports BadNodeIdUnknown
for both the custom source NodeId (ns=2;s=Alarms.Overpressure) AND
the canonical Server object (i=2253). Investigation continues.

Refs #386
…onnect

Local E2E (run_alarm_tests.sh) now passes all four scenarios. Six issues
fixed along the way; each was reproducible only with a real open62541
AlarmConditionType server and could not have surfaced from the unit suite.

State machine wiring
- opcua_client::add_event_monitored_item now auto-prepends three fixed
  SimpleAttributeOperands (EventType, SourceNode, ConditionId) per
  Part 9 §5.5.2.13 - the ConditionId clause uses an empty BrowsePath +
  AttributeId=NodeId, which is the only spec-legal way to retrieve it.
  EventCallback gets the ConditionId as a separate argument; user-supplied
  EventFieldSpec entries appear after the three prepended fields.
- Each user EventFieldSpec carries its OWN typeDefinitionId. open62541
  servers reject inherited browse paths with BadNodeIdUnknown, so we tag
  AckedState/Id with AcknowledgeableConditionType (i=2881), ActiveState/Id
  with AlarmConditionType (i=2915), EnabledState/Id with ConditionType
  (i=2782), etc. Previous single-typeDef filter was rejected wholesale.
- Fixed double-free in event MI creation: the open62541 default request
  builder returns a struct that aliases the NodeId we pass in, so
  UA_NodeId_clear after UA_MonitoredItemCreateRequest_clear corrupted the
  heap. Item is now built explicitly with UA_NodeId_copy into the request.
- Poller calls client.run_iterate(50) every poll cycle. Without it the
  open62541 client never dispatched subscription notifications because no
  scalar nodes were configured, so the trampoline silently never fired
  even though createEvent returned Good.

E2E correctness
- call_method now also rejects per-input-arg failures from
  inputArgumentResults. AlarmConditionType.Acknowledge surfaces
  BadEventIdUnknown there when the EventId we sent has been superseded by
  a newer event; without this check SOVD POST returned 200 even though
  the server refused the call.
- run_alarm_tests.sh: ack/confirm now go through SOVD HTTP, not the
  test_alarm_server stdin shortcut. Between latch and SOVD confirm we
  poll the gateway log for "AlarmCondition HEALED" - the 500 ms
  subscription publishing interval means the gateway needs that long to
  receive both the Acknowledge auto-emit and the latch trigger before it
  has a fresh EventId for Confirm. Without the wait Confirm was sent
  with the stale ID from the original fire payload.
- ShelvingState fix in test_alarm_server: set_shelving now also writes
  the Id property (NodeId) of CurrentState, not just the LocalizedText.
  The medkit bridge keys suppression off ShelvingState/CurrentState/Id
  (i=2929/2930/2932) because the text is locale-dependent. Without the
  Id write the gateway saw shelved=false and the suppression scenario
  silently failed.
- Shelved detection in opcua_poller now treats a null/missing Id as
  Unshelved instead of "unknown=shelved". Some servers leave the optional
  field uninitialized, and that is not a suppression signal.
- Cleanup trap dumps the full container log on rc!=0, not the last 120
  lines. The diagnostic introspect() poll spam crowded out the
  on_event / state-machine traces from the tail window.

Scenarios covered (run_alarm_tests.sh)
- fire / SOVD ack / latch / SOVD confirm / clear lifecycle
- shelve suppresses an active alarm; unshelve re-arms it
- disable suppresses an active alarm; enable re-arms it
- gateway reconnect: stop the test_alarm_server, restart it, and a
  re-fired alarm shows CONFIRMED again (proves setup_event_subscriptions
  is invoked on the reconnect path with a fresh ConditionRefresh)

Diagnostic stderr logging
- captured EventId hex / call_method status code / per-arg result are
  printed to stderr from opcua_poller / opcua_plugin / opcua_client.
  Verbose by design - this is the integration test fixture path and the
  only way to triage a BadEventIdUnknown after the fact.

Local verify (Jazzy, x86_64)
  bash src/ros2_medkit_plugins/ros2_medkit_opcua/docker/scripts/run_alarm_tests.sh
  -> "All alarm scenarios passed." EXIT=0
…, dead guard)

Three review-driven cleanups, no functional change:

- run_alarm_tests.sh: idempotent teardown before ``docker network create``.
  The cleanup trap fires on EXIT but not on a hard kill (Ctrl-C between trap
  arm and trap fire), in which case the leftover network would crash the
  next run under ``set -e``. Mirrors the trap with a failure-tolerant prelude.

- run_alarm_tests.sh: rename scenario 4 from "reconnect preserves CONFIRMED
  via ConditionRefresh" to "reconnect re-subscribes after server restart".
  The previous name was aspirational - the test_alarm_server is in-memory
  and loses condition state on restart, so Part 9 §5.5.7 cannot fire here.
  Issue #389 tracks the actual ConditionRefresh re-emit verification.

- opcua_poller.cpp: drop dead ``if (event_subscription_id_ != 0) return``
  guard in setup_event_subscriptions(). Both call sites (start() and the
  poll_loop reconnect arm) zero the field before calling, and the comment
  now says so.

- opcua_client.cpp: clarify in disconnect() that the ``if (connected)`` guard
  guarantees single generation bump even when maybe_mark_disconnected
  already fired on a transport error - the latter uses exchange(false) so
  the second site is a no-op by atomic semantics, not by accident.
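
The idempotent-prelude pattern from the first cleanup, sketched with a directory standing in for the docker network (the real script would run ``docker network rm ... 2>/dev/null || true`` before ``docker network create``):

```shell
set -e
res="$(mktemp -d)/alarm_test_net"   # stand-in for the docker network

prelude() {
  # Tolerate "no such resource" so set -e doesn't abort the run,
  # whether a hard-killed previous run left the resource behind or not.
  rmdir "$res" 2>/dev/null || true
  mkdir "$res"
}

prelude   # fresh machine: rm fails harmlessly, create succeeds
prelude   # leftover from a killed run: rm succeeds, create succeeds
```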
Today ``OpcuaPoller::condition_refresh()`` swallows server failures with a
silent comment ("not fatal - many test servers do not implement"). Real
PLCs hit this too: open62541 v1.4.x returns BadMethodInvalid, Siemens
S7-1500 omits ConditionRefresh entirely, Beckhoff TF6100 status
unconfirmed in public docs. The operator gets no signal that their
``alarm-replay-on-reconnect`` contract is broken.

- PollerConfig gains ``log_warn`` (std::function<void(const std::string&)>),
  optional. The plugin owning the poller wires it to its inherited
  GatewayPlugin::log_warn so messages reach the ROS 2 logger and
  /rosout, not just container stderr.
- ``OpcuaPoller::condition_refresh()`` emits one warn per connect on
  failure (throttled by ``condition_refresh_warned_``) carrying the
  StatusCode and pointing at issue #389. Reset on success so a recovered
  server earns a fresh warn next time it breaks.
- Falls back to ``std::cerr [opcua_poller WARN]`` when the callback is
  not set, preserving observability for unit-test paths that don't go
  through the plugin.

No behavior change for the success path. Live transitions continue to
flow regardless; this is purely operator observability for the
"reconnect doesn't replay state" failure mode.
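
A self-contained sketch of the optional-callback-with-fallback and once-per-connect throttle (hedged: ``PollerConfig``/``RefreshWarner`` here are illustrative reductions; the real poller carries more state):

```cpp
#include <functional>
#include <iostream>
#include <string>
#include <utility>

// Reduced model of the pattern: an optional warn sink wired by the
// owning plugin, with a stderr fallback for unit-test paths.
struct PollerConfig {
  std::function<void(const std::string &)> log_warn;  // optional
};

class RefreshWarner {
public:
  explicit RefreshWarner(PollerConfig cfg) : cfg_(std::move(cfg)) {}

  // One warn per connect: throttled until reset_on_success().
  void on_refresh_failed(const std::string & status) {
    if (warned_) return;
    warned_ = true;
    const std::string msg = "ConditionRefresh failed (" + status +
                            "); alarm replay on reconnect unavailable - see #389";
    if (cfg_.log_warn) cfg_.log_warn(msg);                    // ROS 2 logger path
    else std::cerr << "[opcua_poller WARN] " << msg << "\n";  // fallback path
  }

  // A recovered server earns a fresh warn next time it breaks.
  void reset_on_success() { warned_ = false; }

private:
  PollerConfig cfg_;
  bool warned_ = false;
};
```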
Extract the OPC-UA Part 4 §5.11.2 Call result classification from
``OpcuaClient::call_method`` into two public static helpers:

- ``status_to_method_error(uint32_t code, const std::string & msg)``
  maps an OPC-UA StatusCode to one of {MethodNotFound, InvalidArgument,
  MethodTimeout, TransportError} via the existing dispatch table.
- ``classify_call_result(uint32_t overall, std::vector<uint32_t> args)``
  walks the per-argument results and returns the first non-Good code,
  with overall statusCode taking precedence.

Public statics so the test suite can hit the BadEventIdUnknown branch
that previously had no unit anchor. The bug ``call_method`` was just
fixed for (statusCode=Good + inputArgumentResults[0]=Bad) is exercised
by the docker integration test (run_alarm_tests.sh ack/confirm flow),
but a future refactor that drops the per-arg loop would not be caught
locally. ~30 LoC of pure tests catch this in seconds.

API surface uses ``uint32_t`` instead of ``UA_StatusCode`` so the
public header doesn't pull in open62541 types - the .cpp internally
treats them identically (UA_StatusCode is a uint32_t typedef).
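
The classification walk can be sketched as a pure function (hedged: field and constant names below are illustrative, not the exact signatures in opcua_client.hpp; the BadEventIdUnknown value is from the OPC UA status-code table):

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

constexpr uint32_t kGood = 0x00000000u;
constexpr uint32_t kBadEventIdUnknown = 0x809A0000u;  // OPC UA BadEventIdUnknown

struct CallOutcome {
  bool ok;
  uint32_t code;
  std::string detail;
};

// Overall statusCode takes precedence; otherwise the first non-Good
// per-input-argument result wins (statusCode=Good + args[i]=Bad is the
// case the Acknowledge/Confirm fix guards against).
inline CallOutcome classify_call_result(uint32_t overall,
                                        const std::vector<uint32_t> & args) {
  if (overall != kGood) return {false, overall, "call statusCode"};
  for (std::size_t i = 0; i < args.size(); ++i) {
    if (args[i] != kGood) {
      return {false, args[i], "input arg " + std::to_string(i)};
    }
  }
  return {true, kGood, ""};
}
```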

9 new gtest cases (3 for ``status_to_method_error``, 6 for
``classify_call_result``):
- MethodNotFound / InvalidArgument / Timeout family dispatch
- BadEventIdUnknown stays in TransportError (signals SOVD to retry,
  not reject, since the EventId staling is a transient race)
- Empty arg results, all-Good arg results -> success
- Bad overall status returns error
- BadEventIdUnknown in arg[0] returns error with "input arg 0" message
- First bad arg wins over later bad args
- Overall status beats arg results (transport error short-circuits)

Existing ``call_method`` body is unchanged in behavior; the diagnostic
stderr trace that prints ``statusCode=`` and per-arg codes is preserved
verbatim.

Local verify: ``test_opcua_client`` reports 26/26 PASSED including all
9 new cases.
Verification round on the post-#386 review found 9 untested cells in
the AlarmStateMachine transition matrix (issue #389 follow-up). Adding
focused tests so a future refactor cannot silently break a corner case.

New cells covered:
- ``DisabledClearsHealedAlarm`` - Healed exit via Rule 2 (was_active=true
  for Healed too, so disabling a latched alarm must ClearFault, not just
  flip to Suppressed).
- ``DisabledNoOpWhenAlreadyCleared`` - confirms Cleared+disabled lands at
  Suppressed/NoOp (no spurious second clear).
- ``ShelvedClearsHealedAlarm`` / ``ShelvedNoOpWhenAlreadyCleared`` - same
  pair via Rule 3.
- ``ActiveAlarmReportsConfirmedFromSuppressed`` - operator
  unshelves/re-enables an alarm whose source is still active; Rule 4
  must promote Suppressed -> Confirmed with ReportConfirmed (NOT NoOp).
  This is exactly the unshelve+re-fire path scenario 2 of
  run_alarm_tests.sh exercises end-to-end.
- ``BranchEventFromHealedNoOp`` / ``BranchEventFromSuppressedNoOp`` -
  Rule 1 precedence holds across all four prev_status values.
- ``AckedAndConfirmedNoOpFromSuppressed`` - ``was_active=false`` branch
  of Rule 5: a fully cleared event delivered while suppressed advances
  next_status to Cleared but issues no ClearFault (no fault to clear).
- ``InactiveUnackedFromSuppressedReportsHealed`` - ditto but with the
  ack/confirm bits unset, must surface as Healed/ReportHealed so the
  unfinished ack/confirm workflow item stays visible.

State machine code itself is unchanged. 27 gtest cases, all PASSED.
Pure-function tests, deterministic, no I/O.
Copilot AI left a comment

Pull request overview

Copilot reviewed 22 out of 22 changed files in this pull request and generated 12 comments.

Comment on lines +23 to +26
Skips (exits 77 - the CTest convention for a skipped test) when
``asyncua`` is not importable, so contributors who only iterate on the
plugin sources do not need a Python pip install. CI installs ``asyncua``
in the integration job and observes the test as a real pass / fail.
Copilot AI Apr 26, 2026

run_ctest.py exits with code 77 to skip the smoke test when asyncua is not installed. This introduces an always-skippable test path, which makes it easier for regressions to slip in unnoticed. Prefer making asyncua a declared test dependency (so the test always runs in standard test environments) or move the smoke test to a separate, explicitly-optional test target instead of skipping at runtime.

Copilot uses AI. Check for mistakes.
Comment on lines +226 to +232
TEST(AlarmStateMachineTest, DisabledNoOpWhenAlreadyCleared) {
AlarmEventInput in;
in.enabled_state = false;
auto out = AlarmStateMachine::compute(SovdAlarmStatus::Cleared, in);
EXPECT_EQ(out.next_status, SovdAlarmStatus::Suppressed);
EXPECT_EQ(out.action, AlarmAction::NoOp);
}
Copilot AI Apr 26, 2026

Test name suggests a pure no-op when already cleared, but the expected next_status transitions to Suppressed. If the intended behavior is “no action” but status may change, consider renaming this test to reflect the suppression transition (or adjust expectations to match the name).

Comment on lines +235 to +236
auto callback = [this, &cfg](const std::vector<opcua::Variant> & values, const opcua::NodeId & source_node,
const opcua::NodeId & event_type, const opcua::NodeId & condition_id) {
Copilot AI Apr 26, 2026

In setup_event_subscriptions(), the lambda captures &cfg where cfg is the loop variable of a range-for. This captures a reference to the loop variable (which is reused each iteration and goes out of scope after the loop), so callbacks can read a dangling reference / wrong config when events arrive. Capture the config by value (or capture a stable pointer/reference to the vector element) so each monitored item callback uses the correct AlarmEventConfig.

Suggested change
auto callback = [this, &cfg](const std::vector<opcua::Variant> & values, const opcua::NodeId & source_node,
const opcua::NodeId & event_type, const opcua::NodeId & condition_id) {
auto callback = [this, cfg](const std::vector<opcua::Variant> & values, const opcua::NodeId & source_node,
const opcua::NodeId & event_type, const opcua::NodeId & condition_id) {

Comment on lines +750 to +753
// Bump generation BEFORE freeing the ctx so any in-flight trampoline call
// captured the old generation and will drop its work.
impl_->generation.fetch_add(1, std::memory_order_release);

Copilot AI Apr 26, 2026

remove_event_monitored_item() increments the global generation counter. That will make all other event monitored item contexts fail the generation_snapshot check in the trampoline and silently drop future events, even though their subscriptions are still valid. Consider making the staleness check per-monitored-item (e.g., an active atomic in the context) or only bump generation on full disconnect/remove_subscriptions (not on single MI removal).

Suggested change
// Bump generation BEFORE freeing the ctx so any in-flight trampoline call
// captured the old generation and will drop its work.
impl_->generation.fetch_add(1, std::memory_order_release);
// Do not advance the global generation for removal of a single monitored
// item. The generation is used to invalidate all callback contexts after a
// full reconnect/subscription teardown; bumping it here would make other
// still-active monitored items fail the trampoline staleness check and drop
// future events.

Comment on lines +295 to +296
std::cerr << "[opcua_poller] on_event fault=" << cfg.fault_code << " event_type=" << event_type.toString()
<< " condition=" << condition_id.toString() << " values=" << values.size() << std::endl;
Copilot AI Apr 26, 2026

on_event() currently logs every notification and EventId bytes directly to std::cerr. This can flood logs under high alarm/event rates and bypass the plugin's logging controls. Route these through the plugin/logger (or a configurable debug/trace sink) and avoid dumping per-event hex unless explicitly enabled.

Comment on lines +266 to +271
# Start fault_manager_node first so its services are advertised before
# the gateway opcua plugin tries to call /fault_manager/report_fault.
ros2 run ros2_medkit_fault_manager fault_manager_node \
> /var/lib/ros2_medkit/fault_manager.log 2>&1 &
sleep 3
PLUGIN_PATH=$(find /root/ws/install -name "libros2_medkit_opcua_plugin.so" | head -1)
Copilot AI Apr 26, 2026

The PR description says the docker integration uses polling-with-timeout throughout with “no fixed sleeps”, but this script has a fixed sleep 3 before starting the gateway. Consider replacing it with a poll for the fault_manager service readiness (or other observable health signal) to keep the test deterministic across slower CI runners.

Comment on lines +901 to +912
{
const auto * bytes = runtime->latest_event_id.data();
std::cerr << "[opcua_plugin] " << operation_name << " EventId len=" << runtime->latest_event_id.length()
<< " hex=";
for (size_t i = 0; i < std::min<size_t>(runtime->latest_event_id.length(), 16); ++i) {
char buf[3];
std::snprintf(buf, sizeof(buf), "%02x", static_cast<unsigned>(bytes[i]) & 0xffu);
std::cerr << buf;
}
std::cerr << " conditionId=" << runtime->condition_id.toString() << std::endl;
}

Copilot AI Apr 26, 2026

execute_operation() prints the EventId bytes and ConditionId to std::cerr for every acknowledge/confirm request. This is very verbose and bypasses ROS logging; it will also include user-supplied comment indirectly in downstream logs. Consider removing this block or switching it to debug-level logging behind a flag.

Suggested change
{
const auto * bytes = runtime->latest_event_id.data();
std::cerr << "[opcua_plugin] " << operation_name << " EventId len=" << runtime->latest_event_id.length()
<< " hex=";
for (size_t i = 0; i < std::min<size_t>(runtime->latest_event_id.length(), 16); ++i) {
char buf[3];
std::snprintf(buf, sizeof(buf), "%02x", static_cast<unsigned>(bytes[i]) & 0xffu);
std::cerr << buf;
}
std::cerr << " conditionId=" << runtime->condition_id.toString() << std::endl;
}

Comment on lines +71 to +74
/// Retain is intentionally NOT used here. Per Part 9 §5.5.2.10 it controls
/// visibility during ConditionRefresh bursts, not lifecycle - the poller
/// strips Retain=false events delivered between RefreshStartEvent and
/// RefreshEndEvent before invoking compute().
Copilot AI Apr 26, 2026

The comment says the poller “strips Retain=false events delivered between RefreshStartEvent and RefreshEndEvent before invoking compute()”, but the current event select clauses don’t include Retain and on_event() doesn’t filter on it. Either implement the Retain-based filtering or update this comment so it matches the actual behavior.

Suggested change
/// Retain is intentionally NOT used here. Per Part 9 §5.5.2.10 it controls
/// visibility during ConditionRefresh bursts, not lifecycle - the poller
/// strips Retain=false events delivered between RefreshStartEvent and
/// RefreshEndEvent before invoking compute().
/// Retain is intentionally not modeled by this state machine and does not
/// affect ``compute()``. Per Part 9 §5.5.2.10 it controls visibility during
/// ConditionRefresh bursts rather than the lifecycle mapping implemented
/// here.

Comment on lines +539 to +544
// ReportFault.srv (only FAILED/PASSED), so we mark this as a PASSED
// event - fault_manager keeps the entry in HEALED state until
// confirmed, mirroring the lifecycle.
log_info("AlarmCondition HEALED (latched, awaiting ack/confirm): " + delivery.fault_code);
// No-op for now; fault_manager will keep the fault HEALED until
// CLEARED. The state transition is observable via /faults/stream.
Copilot AI Apr 26, 2026

In AlarmAction::ReportHealed handling, the comment says we “mark this as a PASSED event”, but the implementation is currently a no-op (only logs). Please either emit the intended PASSED/HEALED signal to fault_manager or adjust the comment to reflect that HEALED is intentionally not reported via ReportFault at the moment.

Suggested change
// ReportFault.srv (only FAILED/PASSED), so we mark this as a PASSED
// event - fault_manager keeps the entry in HEALED state until
// confirmed, mirroring the lifecycle.
log_info("AlarmCondition HEALED (latched, awaiting ack/confirm): " + delivery.fault_code);
// No-op for now; fault_manager will keep the fault HEALED until
// CLEARED. The state transition is observable via /faults/stream.
// ReportFault.srv (only FAILED/PASSED), and this plugin does not
// currently emit a PASSED event for the latched HEALED transition.
// For now we only log the transition here; any externally visible
// lifecycle updates come from subsequent alarm events.
log_info("AlarmCondition HEALED (latched, awaiting ack/confirm): " + delivery.fault_code);
// Intentional no-op: HEALED is not currently reported via ReportFault.

Comment on lines +337 to +345
// Mutual-exclusion check: an entry under ``nodes:`` carrying both a
// ``threshold`` alarm and an ``alarm_source`` is a configuration error
// (the threshold path polls scalar values; the alarm_source path
// subscribes to native events). Reject the whole file rather than guess.
for (const auto & node : (nodes ? nodes : YAML::Node{})) {
if (node["alarm_source"] && node["alarm"] && node["alarm"]["threshold"]) {
RCLCPP_ERROR(rclcpp::get_logger("opcua.node_map"),
"Entry node_id=%s declares both threshold alarm and alarm_source - mutually exclusive",
node["node_id"] ? node["node_id"].as<std::string>().c_str() : "<unknown>");
Copilot AI Apr 26, 2026

This loader comment/error mentions a nodes: entry declaring both a threshold alarm and alarm_source, but alarm_source is now a top-level event_alarms: field. As written, a mistaken alarm_source under nodes: is mostly ignored unless it also has an alarm.threshold. Consider explicitly rejecting any alarm_source key under nodes: with an error that points users to event_alarms: (and/or validate mutual exclusivity across the two YAML forms more directly).

Suggested change
// Mutual-exclusion check: an entry under ``nodes:`` carrying both a
// ``threshold`` alarm and an ``alarm_source`` is a configuration error
// (the threshold path polls scalar values; the alarm_source path
// subscribes to native events). Reject the whole file rather than guess.
for (const auto & node : (nodes ? nodes : YAML::Node{})) {
if (node["alarm_source"] && node["alarm"] && node["alarm"]["threshold"]) {
RCLCPP_ERROR(rclcpp::get_logger("opcua.node_map"),
"Entry node_id=%s declares both threshold alarm and alarm_source - mutually exclusive",
node["node_id"] ? node["node_id"].as<std::string>().c_str() : "<unknown>");
// Schema validation: ``alarm_source`` is only supported in the top-level
// ``event_alarms:`` section. Reject any legacy or misplaced
// ``alarm_source`` field under ``nodes:`` explicitly so configuration
// mistakes are not silently ignored.
for (const auto & node : (nodes ? nodes : YAML::Node{})) {
if (node["alarm_source"]) {
RCLCPP_ERROR(
rclcpp::get_logger("opcua.node_map"),
"Entry node_id=%s uses alarm_source under nodes:, which is invalid; "
"move this configuration to top-level event_alarms:",
node["node_id"] ? node["node_id"].as<std::string>().c_str() : "<unknown>");


Development

Successfully merging this pull request may close these issues.

ros2_medkit_opcua: subscribe to native OPC-UA AlarmConditions and bridge to fault_manager

2 participants