ros2_medkit_opcua: native AlarmConditionType subscription bridge #387
mfaferek93 wants to merge 16 commits into main from
Conversation
Pull request overview
Adds native OPC-UA Part 9 AlarmConditionType event subscriptions to the ros2_medkit_opcua plugin and bridges those event-driven alarm lifecycles into the SOVD fault model, while also introducing a new race-free ROS 2 topic sampling architecture in the gateway (subscription executor + pooled topic data provider) and related shutdown/teardown hardening.
Changes:
- Add OPC-UA AlarmConditionType event subscription plumbing, YAML configuration (event_alarms), state-machine mapping, and ack/confirm operations.
- Replace legacy ad-hoc topic sampling with TopicDataProvider + Ros2TopicDataProvider backed by Ros2SubscriptionExecutor/Ros2SubscriptionSlot, and expose pool stats via /health.
- Improve shutdown robustness (SSE stop wake-up, demo node shutdown helper, graph-query exception swallowing during shutdown) and update the provider header layout (logs/updates/scripts/introspection).
Reviewed changes
Copilot reviewed 107 out of 111 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| src/ros2_medkit_serialization/design/index.rst | Update design docs to reflect new topic data provider naming/usage. |
| src/ros2_medkit_plugins/ros2_medkit_opcua/test/test_opcua_client.cpp | Add unit tests for new client event/method APIs (disconnected-state contracts). |
| src/ros2_medkit_plugins/ros2_medkit_opcua/test/test_alarm_state_machine.cpp | Add unit tests for AlarmConditionType -> SOVD lifecycle state machine. |
| src/ros2_medkit_plugins/ros2_medkit_opcua/test/fixtures/test_alarm_server/smoke_test.py | Add fixture smoke test validating AlarmConditionType nodes/methods/fields. |
| src/ros2_medkit_plugins/ros2_medkit_opcua/src/opcua_plugin.cpp | Bridge event alarms into fault lifecycle; add ack/confirm operations. |
| src/ros2_medkit_plugins/ros2_medkit_opcua/src/node_map.cpp | Parse event_alarms YAML and merge event-mode entities into discovery defs. |
| src/ros2_medkit_plugins/ros2_medkit_opcua/include/ros2_medkit_opcua/opcua_poller.hpp | Add event-alarm delivery types + condition runtime lookup interface. |
| src/ros2_medkit_plugins/ros2_medkit_opcua/include/ros2_medkit_opcua/opcua_plugin.hpp | Declare event-alarm bridge handler. |
| src/ros2_medkit_plugins/ros2_medkit_opcua/include/ros2_medkit_opcua/opcua_client.hpp | Add raw open62541 event monitored item + method-call APIs, generation counter. |
| src/ros2_medkit_plugins/ros2_medkit_opcua/include/ros2_medkit_opcua/node_map.hpp | Define AlarmEventConfig and event-alarm accessors. |
| src/ros2_medkit_plugins/ros2_medkit_opcua/include/ros2_medkit_opcua/alarm_state_machine.hpp | New pure-function AlarmConditionType lifecycle state machine. |
| src/ros2_medkit_plugins/ros2_medkit_opcua/docker/test_alarm_server/build.sh | Build helper for the alarm test server Docker image. |
| src/ros2_medkit_plugins/ros2_medkit_opcua/docker/test_alarm_server/Dockerfile | Dockerized open62541 FULL ns0 alarm test fixture build/runtime. |
| src/ros2_medkit_plugins/ros2_medkit_opcua/docker/scripts/run_alarm_tests.sh | Docker integration suite validating alarm lifecycle via gateway SOVD endpoints. |
| src/ros2_medkit_plugins/ros2_medkit_opcua/design/index.rst | Document event alarm configuration, state machine, refresh, ack/confirm. |
| src/ros2_medkit_plugins/ros2_medkit_opcua/README.md | Document event_alarms YAML and new operations usage. |
| src/ros2_medkit_plugins/ros2_medkit_opcua/CMakeLists.txt | Add state-machine tests and optional alarm test server build wiring. |
| src/ros2_medkit_plugins/ros2_medkit_opcua/CHANGELOG.rst | Document forthcoming alarm subscription + ops changes. |
| src/ros2_medkit_plugins/ros2_medkit_graph_provider/include/ros2_medkit_graph_provider/graph_provider_plugin.hpp | Update introspection provider include path. |
| src/ros2_medkit_integration_tests/include/ros2_medkit_integration_tests/demo_node_main.hpp | Add shared graceful demo-node shutdown helper (sigwait thread). |
| src/ros2_medkit_integration_tests/demo_nodes/rpm_sensor.cpp | Switch demo node to shared shutdown helper. |
| src/ros2_medkit_integration_tests/demo_nodes/param_beacon_node.cpp | Switch demo node to shared shutdown helper. |
| src/ros2_medkit_integration_tests/demo_nodes/long_calibration_action.cpp | Switch demo node to shared shutdown helper. |
| src/ros2_medkit_integration_tests/demo_nodes/light_controller.cpp | Switch demo node to shared shutdown helper. |
| src/ros2_medkit_integration_tests/demo_nodes/lidar_sensor.cpp | Switch demo node to shared shutdown helper. |
| src/ros2_medkit_integration_tests/demo_nodes/engine_temp_sensor.cpp | Switch demo node to shared shutdown helper. |
| src/ros2_medkit_integration_tests/demo_nodes/door_status_sensor.cpp | Switch demo node to shared shutdown helper. |
| src/ros2_medkit_integration_tests/demo_nodes/calibration_service.cpp | Switch demo node to shared shutdown helper. |
| src/ros2_medkit_integration_tests/demo_nodes/brake_pressure_sensor.cpp | Switch demo node to shared shutdown helper. |
| src/ros2_medkit_integration_tests/demo_nodes/brake_actuator.cpp | Switch demo node to shared shutdown helper. |
| src/ros2_medkit_integration_tests/demo_nodes/beacon_publisher.cpp | Switch demo node to shared shutdown helper. |
| src/ros2_medkit_integration_tests/CMakeLists.txt | Add include path for shared demo-node helper across demo binaries. |
| src/ros2_medkit_gateway/test/test_topic_data_provider_interface.cpp | Add interface-level tests for TopicDataProvider mocking contract. |
| src/ros2_medkit_gateway/test/test_script_manager.cpp | Update script provider include path. |
| src/ros2_medkit_gateway/test/test_ros2_subscription_executor.cpp | Add unit tests for subscription executor behavior (queue, watchdog, graph cb). |
| src/ros2_medkit_gateway/test/test_plugin_manager.cpp | Update introspection include path; avoid json alias shadowing. |
| src/ros2_medkit_gateway/test/test_plugin_loader.cpp | Update provider include paths (introspection/updates). |
| src/ros2_medkit_gateway/test/test_log_manager.cpp | Update log provider include path. |
| src/ros2_medkit_gateway/test/test_error_info.cpp | Add unit tests for new ErrorInfo value type. |
| src/ros2_medkit_gateway/test/test_discovery_manager.cpp | Update to new topic data provider wiring + thread-safe teardown. |
| src/ros2_medkit_gateway/test/test_data_access_manager.cpp | Update tests to new topic provider wiring and teardown order. |
| src/ros2_medkit_gateway/test/demo_nodes/test_update_backend.cpp | Update update provider include path. |
| src/ros2_medkit_gateway/test/demo_nodes/test_gateway_plugin.cpp | Update introspection/update provider include paths. |
| src/ros2_medkit_gateway/src/trigger_topic_subscriber.cpp | Adjust comment to reflect non-NativeTopicSampler usage. |
| src/ros2_medkit_gateway/src/ros2_common/ros2_subscription_slot.cpp | Implement RAII subscription slot with safe deferred destroy behavior. |
| src/ros2_medkit_gateway/src/plugins/plugin_loader.cpp | Update provider include paths (scripts/updates/introspection). |
| src/ros2_medkit_gateway/src/main.cpp | Wire subscription executor + pooled data provider; add explicit teardown sequence. |
| src/ros2_medkit_gateway/src/http/rest_server.cpp | Ensure SSE shutdown is requested before server stop/join. |
| src/ros2_medkit_gateway/src/http/handlers/sse_fault_handler.cpp | Add request_shutdown() to wake SSE waiters promptly. |
| src/ros2_medkit_gateway/src/http/handlers/health_handlers.cpp | Expose provider/executor stats via x-medkit-* keys in /health. |
| src/ros2_medkit_gateway/src/http/handlers/data_handlers.cpp | Route sampling via TopicDataProvider and propagate provider ErrorInfo accurately. |
| src/ros2_medkit_gateway/src/gateway_node.cpp | Add set_topic_data_provider() and route discovery/DAM sampling through it. |
| src/ros2_medkit_gateway/src/discovery/runtime_discovery.cpp | Switch to provider-based topic mapping and swallow shutdown-time graph exceptions. |
| src/ros2_medkit_gateway/src/discovery/merge_pipeline.cpp | Update introspection provider include path. |
| src/ros2_medkit_gateway/src/discovery/discovery_manager.cpp | Rename setter to set_topic_data_provider. |
| src/ros2_medkit_gateway/src/data_access_manager.cpp | Use TopicDataProvider (remove NativeTopicSampler) and propagate non-404 errors. |
| src/ros2_medkit_gateway/include/ros2_medkit_gateway/updates/update_provider.hpp | Introduce new update provider interface header location. |
| src/ros2_medkit_gateway/include/ros2_medkit_gateway/updates/update_manager.hpp | Update include to new update provider header location. |
| src/ros2_medkit_gateway/include/ros2_medkit_gateway/scripts/script_provider.hpp | Introduce new script provider interface header location. |
| src/ros2_medkit_gateway/include/ros2_medkit_gateway/script_manager.hpp | Update include to new script provider header location. |
| src/ros2_medkit_gateway/include/ros2_medkit_gateway/ros2_common/ros2_subscription_slot.hpp | Define subscription slot API (create + safe teardown contract). |
| src/ros2_medkit_gateway/include/ros2_medkit_gateway/plugins/plugin_manager.hpp | Update provider include paths (logs/scripts/updates/introspection). |
| src/ros2_medkit_gateway/include/ros2_medkit_gateway/plugins/plugin_context.hpp | Update introspection provider include path. |
| src/ros2_medkit_gateway/include/ros2_medkit_gateway/models/error_info.hpp | Add transport-neutral provider error descriptor. |
| src/ros2_medkit_gateway/include/ros2_medkit_gateway/logs/log_provider.hpp | Introduce new log provider interface header location. |
| src/ros2_medkit_gateway/include/ros2_medkit_gateway/log_manager.hpp | Update include to new log provider header location. |
| src/ros2_medkit_gateway/include/ros2_medkit_gateway/http/handlers/sse_fault_handler.hpp | Add explicit request_shutdown() API documentation. |
| src/ros2_medkit_gateway/include/ros2_medkit_gateway/http/error_codes.hpp | Add new x-medkit-* error codes for shutdown/subscribe/cold-wait failures. |
| src/ros2_medkit_gateway/include/ros2_medkit_gateway/gateway_node.hpp | Add TopicDataProvider attach/detach API + ownership. |
| src/ros2_medkit_gateway/include/ros2_medkit_gateway/exceptions.hpp | Add ProviderErrorException to preserve provider http/code. |
| src/ros2_medkit_gateway/include/ros2_medkit_gateway/discovery/runtime_discovery.hpp | Switch runtime discovery to TopicDataProvider pointer. |
| src/ros2_medkit_gateway/include/ros2_medkit_gateway/discovery/layers/plugin_layer.hpp | Update introspection provider include path. |
| src/ros2_medkit_gateway/include/ros2_medkit_gateway/discovery/introspection_provider.hpp | New introspection provider interface header location. |
| src/ros2_medkit_gateway/include/ros2_medkit_gateway/discovery/discovery_manager.hpp | Update API to set TopicDataProvider instead of NativeTopicSampler. |
| src/ros2_medkit_gateway/include/ros2_medkit_gateway/default_script_provider.hpp | Update script provider include path. |
| src/ros2_medkit_gateway/include/ros2_medkit_gateway/data_access_manager.hpp | Replace NativeTopicSampler exposure with TopicDataProvider attach/get. |
| src/ros2_medkit_gateway/include/ros2_medkit_gateway/data/topic_data_provider.hpp | New transport-neutral topic sampling/discovery interface. |
| src/ros2_medkit_gateway/include/ros2_medkit_gateway/data/ros2_topic_data_provider.hpp | New pooled ROS 2 implementation + stats/eviction design. |
| src/ros2_medkit_gateway/include/ros2_medkit_gateway/data/data_types.hpp | Add transport-neutral topic discovery/sample data structures. |
| src/ros2_medkit_gateway/design/index.rst | Update gateway design docs for new subscription architecture. |
| src/ros2_medkit_gateway/design/architecture.puml | Update architecture diagram for TopicDataProvider/subscription executor. |
| src/ros2_medkit_gateway/config/gateway_params.yaml | Add config knobs for executor and provider pool behavior. |
| src/ros2_medkit_gateway/README.md | Document regression gate against “naked” rclcpp subscription creation. |
| src/ros2_medkit_gateway/CMakeLists.txt | Add new sources/tests for subscription infra + topic provider; drop NativeTopicSampler tests. |
| src/ros2_medkit_fault_manager/src/snapshot_capture.cpp | Serialize subscription create/destroy under mutex to close TSan race window. |
| src/ros2_medkit_discovery_plugins/ros2_medkit_topic_beacon/include/ros2_medkit_topic_beacon/topic_beacon_plugin.hpp | Update introspection provider include path. |
| src/ros2_medkit_discovery_plugins/ros2_medkit_param_beacon/src/param_beacon_plugin.cpp | Swallow shutdown-time graph exceptions to avoid terminate during teardown. |
| src/ros2_medkit_discovery_plugins/ros2_medkit_param_beacon/include/ros2_medkit_param_beacon/param_beacon_plugin.hpp | Update introspection provider include path. |
| src/ros2_medkit_discovery_plugins/ros2_medkit_param_beacon/CMakeLists.txt | Increase gmock test timeout for sanitizer overhead. |
| src/ros2_medkit_discovery_plugins/ros2_medkit_linux_introspection/src/systemd_plugin.cpp | Update introspection provider include path. |
| src/ros2_medkit_discovery_plugins/ros2_medkit_linux_introspection/src/procfs_plugin.cpp | Update introspection provider include path. |
| src/ros2_medkit_discovery_plugins/ros2_medkit_linux_introspection/src/container_plugin.cpp | Update introspection provider include path. |
| src/ros2_medkit_discovery_plugins/ros2_medkit_beacon_common/include/ros2_medkit_beacon_common/beacon_entity_mapper.hpp | Update introspection provider include path. |
| scripts/check_no_naked_subscriptions.sh | Add CI/pre-commit regression gate for subscription creation API usage. |
| docs/tutorials/plugin-system.rst | Update provider include paths + document subscription API restriction for plugins. |
| docs/api/rest.rst | Document new /health vendor-extension sections for pool/executor stats. |
| .pre-commit-config.yaml | Add local pre-commit hook for naked subscription regression gate. |
| .github/workflows/quality.yml | Run naked subscription regression gate in CI. |
| .github/workflows/opcua-plugin.yml | Add new AlarmConditionType docker integration job. |
case AlarmAction::ReportHealed:
  // Fault is latched: condition is no longer active but not yet
  // confirmed. We don't have a dedicated HEALED reporting verb in
  // ReportFault.srv (only FAILED/PASSED), so we mark this as a PASSED
  // event - fault_manager keeps the entry in HEALED state until
  // confirmed, mirroring the lifecycle.
  log_info("AlarmCondition HEALED (latched, awaiting ack/confirm): " + delivery.fault_code);
  // No-op for now; fault_manager will keep the fault HEALED until
  // CLEARED. The state transition is observable via /faults/stream.
  break;
AlarmAction::ReportHealed is currently a no-op, but the comment says it should be reported as a PASSED event so fault_manager can transition the fault to HEALED. As written, the fault will remain CONFIRMED until it is cleared, so the intended CONFIRMED -> HEALED -> CLEARED lifecycle (and the docker integration expectation) won’t occur. Consider adding a helper to report EVENT_PASSED (or extending send_report_fault to take an event type) and invoking it here.
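A minimal sketch of the reviewer's suggestion, with hypothetical names (ReportEventType and the report callback are illustrative, not the plugin's actual API): on ReportHealed, emit a PASSED event rather than a no-op so fault_manager can transition the entry to HEALED.

```cpp
#include <cassert>
#include <functional>
#include <string>

// Hypothetical event kinds mirroring ReportFault.srv's FAILED/PASSED verbs.
enum class ReportEventType { Failed, Passed };

// Illustrative handler: on ReportHealed, report a PASSED event instead of
// doing nothing, so fault_manager can move the entry to HEALED until it is
// confirmed/cleared. The 'report' sink stands in for send_report_fault.
void on_report_healed(const std::string& fault_code,
                      const std::function<void(ReportEventType, const std::string&)>& report) {
  // PASSED is the closest available verb - there is no dedicated HEALED one.
  report(ReportEventType::Passed, fault_code);
}
```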
Adds OpcuaClient primitives for native OPC-UA event subscription, the first commit of issue #386 (AlarmConditionType bridge to fault_manager). open62541pp v0.16 has no native EventFilter or event subscription support, so this plumbs the raw open62541 C API (UA_Client_MonitoredItems_createEvent) behind a small C++ surface that mirrors the existing data-change patterns.

Public API additions:
* EventField + EventBrowsePath types for SimpleAttributeOperand specs.
* EventCallback signature delivering select values plus EventType and SourceNode (always prepended to the filter, extracted on dispatch).
* add_event_monitored_item / remove_event_monitored_item with a heap-allocated CallbackContext stored as unique_ptr in OpcuaClient::Impl.
* call_method wrapping opcua::services::call with status-mapped errors - needed by ConditionRefresh, Acknowledge, and Confirm in later commits.
* current_generation() exposing a monotonic counter incremented on every detected disconnect (clean disconnect or transport-level drop). The trampoline captures a snapshot at createEvent time and drops callbacks whose snapshot diverges from the live counter, eliminating the late-callback / use-after-free hazard reviewers flagged.

remove_subscriptions and disconnect now bump the generation and clear event_callbacks before tearing down open62541pp Subscriptions, ensuring in-flight C callbacks see a stale generation rather than a freed context.

Tests: 6 new disconnected-state tests validating the API contract and the generation-counter ordering. End-to-end event flow against a real server runs against the test_alarm_server fixture introduced in a later commit on this branch (per plan v2 - in-process server tests with synthetic event triggering are deferred to keep this commit reviewable; the docker integration test in commit 5 covers the full path).

Refs #386
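The generation-counter scheme described in this commit can be sketched as follows. The names (CallbackContext, GenerationGuard, dispatch) are illustrative, not the actual OpcuaClient members; the point is the snapshot-compare-drop ordering.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <functional>

// Sketch of the stale-callback guard: each callback context snapshots the
// generation at creation time; any detected disconnect bumps the live
// counter, so a late callback from a defunct session compares unequal and
// is dropped instead of touching freed state.
struct CallbackContext {
  uint64_t generation_snapshot;         // captured at createEvent time
  std::function<void()> user_callback;  // the event handler to protect
};

class GenerationGuard {
 public:
  uint64_t current_generation() const { return generation_.load(); }

  // Called on clean disconnect or transport-level drop.
  void on_disconnect() { generation_.fetch_add(1); }

  // The C trampoline would call this; stale contexts are silently skipped.
  bool dispatch(const CallbackContext& ctx) const {
    if (ctx.generation_snapshot != generation_.load()) return false;  // stale
    ctx.user_callback();
    return true;
  }

 private:
  std::atomic<uint64_t> generation_{0};
};
```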
…firm ops

Wires the Part 9 AlarmConditionType subscription path into the existing threshold-polling plugin. Builds on the event subscription primitive from the previous commit; downstream consumers see one new YAML form (``event_alarms:`` block) and two new SOVD operations (``acknowledge_fault``, ``confirm_fault``).

* New header-only ``AlarmStateMachine``: a pure compute_status function over ``EnabledState x ShelvingState x ActiveState x AckedState x ConfirmedState x BranchId``. Decision order documented inline; Retain is intentionally ignored (per Part 9 it filters ConditionRefresh visibility, not lifecycle). 22 unit tests cover every rule plus precedence ordering.
* ``NodeMap`` learns ``event_alarms:`` (a top-level YAML sibling of ``nodes:``). Each entry declares ``alarm_source`` (NodeId of the source emitting AlarmConditionType events), ``entity_id``, ``fault_code``, and optional severity / message overrides. Mutually exclusive with the per-entry threshold ``alarm`` block; load() fails fast if both are set on the same node. ``find_event_alarm`` lookup serves the SOVD operation handlers. ``build_entity_defs`` merges event-mode entities so SOVD discovery surfaces them as fault-bearing even without scalar data.
* ``OpcuaPoller`` gains ``setup_event_subscriptions()`` and ``on_event(...)``. One dedicated subscription handles all event-mode alarms; the trampoline dispatches positional select-clause values through the state machine. ConditionRefresh fires on subscribe and on every reconnect (using the existing exponential-backoff path); the generation counter from OpcuaClient already filters callbacks captured from defunct subscriptions.
* Per-condition runtime is keyed by the ConditionId NodeId stringForm, so multiple instances under the same source remain distinct. Each entry carries the latest EventId ByteString - required for spec-compliant Acknowledge calls (Part 9 §5.7.3 returns BadEventIdUnknown otherwise).
* The plugin's OperationProvider lists ``acknowledge_fault`` / ``confirm_fault`` for any entity that has at least one event-mode alarm. ``execute_operation`` resolves (entity_id, fault_code) through the poller's lookup_condition, then invokes the inherited methods on AcknowledgeableConditionType (i=9111 Ack, i=9113 Confirm) with the tracked EventId and a LocalizedText comment. HTTP error mapping mirrors the existing write_value path.
* AlarmCondition events bridge through ``on_event_alarm`` to the existing send_report_fault / send_clear_fault wiring. Severity is mapped to the SOVD enum buckets (1-200 INFO, 201-500 WARN, 501-800 ERROR, 801-1000 CRITICAL); a selfpatch convention, NOT IEC 62682 - documented in the follow-up design doc.

Tests: 22 new state-machine unit tests (full transition table coverage plus rule precedence). All 144 tests in the package green; ASAN/TSAN clean; clang-format, copyright, cppcheck, lint_cmake, xmllint all pass. The test_alarm_server fixture and its docker integration ship in the accompanying commit on this branch (gated OFF by default while the ExternalProject namespace0_generated linker issue is being resolved).

Refs #386
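The severity bucketing above is a plain threshold ladder over OPC-UA's 1-1000 severity range (a project convention, explicitly not IEC 62682). A sketch with an illustrative enum name standing in for the gateway's actual severity type:

```cpp
#include <cassert>
#include <cstdint>

// Illustrative SOVD-style severity enum; the gateway's real type may differ.
enum class FaultSeverity { Info, Warn, Error, Critical };

// Bucket an OPC-UA event Severity (1..1000) per the commit message:
// 1-200 INFO, 201-500 WARN, 501-800 ERROR, 801-1000 CRITICAL.
FaultSeverity map_severity(uint16_t opcua_severity) {
  if (opcua_severity <= 200) return FaultSeverity::Info;
  if (opcua_severity <= 500) return FaultSeverity::Warn;
  if (opcua_severity <= 800) return FaultSeverity::Error;
  return FaultSeverity::Critical;
}
```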
…ow, docs

Closes the integration story for issue #386. The plugin's threshold-mode integration runs against OpenPLC; this commit adds the parallel suite for native AlarmConditionType subscriptions, which OpenPLC does not implement.

Components:
* ``test_alarm_server`` fixture (open62541, FULL ns0, alarms ON). Standalone C++ binary exposing 3 conditions on tcp/4842 plus a stdin CLI (``fire``, ``ack``, ``confirm``, ``latch``, ``shelve``, ``unshelve``, ``disable``, ``enable``, ``quit``). A smoke test in ``test/fixtures/test_alarm_server/smoke_test.py`` verifies type conformance via ``asyncua``.
* Self-contained Dockerfile under ``docker/test_alarm_server/`` clones open62541 v1.4.6 inside the image (no dependency on a pre-populated workspace ``build/`` tree, so CI builds cleanly from a fresh checkout).
* ``docker/scripts/run_alarm_tests.sh`` orchestrates the fixture + gateway: builds both images, brings them up on a private network, fires alarms via the server's stdin pipe, and asserts the SOVD ``/faults`` endpoint moves through CONFIRMED -> HEALED -> CLEARED. Polling-with-timeout throughout (no fixed sleeps); cleanup-trap teardown.
* ``.github/workflows/opcua-plugin.yml`` gains the ``integration-alarms`` job, parallel to the existing ``integration`` (OpenPLC) job. Both run on every PR that touches the plugin or its dependencies.
* ``design/index.rst`` documents the full state machine table (precedence order, the deliberate choice to ignore ``Retain`` for lifecycle), the selfpatch severity-bucket convention (and the explicit non-claim of IEC 62682), the ``ConditionRefresh`` / ``RefreshStartEvent`` / ``RefreshEndEvent`` flow, the ack/confirm method NodeIds, and a vendor matrix covering Siemens, Beckhoff, Rockwell, CodeSys, OpenPLC.
* ``README.md`` shows the 3-line ``event_alarms:`` form and a curl example for ``acknowledge_fault``.
* ``CHANGELOG.rst`` Forthcoming entry.

The cmake-side ``test_alarm_server`` target stays gated OFF (``MEDKIT_OPCUA_BUILD_ALARM_SERVER``) until the LTO ``namespace0_generated`` linker mismatch in the second open62541 ``ExternalProject_Add`` build is resolved. The docker integration suite does not depend on it - it builds its open62541 inside the container.

Refs #386
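The ``event_alarms:`` form mentioned for the README might look like the fragment below. The three keys match the fields named in the commits on this branch; the concrete values are illustrative only.

```yaml
event_alarms:
  - alarm_source: "ns=2;s=Alarms.Overpressure"  # NodeId of the source emitting AlarmConditionType events
    entity_id: tank_process                     # SOVD entity the fault attaches to (illustrative value)
    fault_code: OVERPRESSURE                    # fault code reported to fault_manager (illustrative value)
    # optional per-entry severity / message overrides may follow
```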
Adds ``test_alarm_server_smoke`` to CTest. Boots the freshly-built fixture on an ephemeral port, waits for the ``READY`` line on stdout, runs the existing ``asyncua`` smoke test against it, and tears the process down. Skips with CTest exit 77 (which we map via ``SKIP_RETURN_CODE``) when ``asyncua`` is not importable, so iterating on plugin sources without the Python dependency does not fail the suite. CI ``integration-alarms`` job installs ``asyncua`` so the smoke test runs as a real pass / fail there. Other jobs see it as skipped, which ament_lint surfaces but does not flag. Refs #386
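The skip mechanism described above relies on CTest's SKIP_RETURN_CODE test property; a hedged sketch of the wiring (the driver script name is illustrative):

```cmake
# Illustrative CTest wiring: when the smoke-test driver exits with code 77
# (asyncua not importable), CTest reports the test as "Skipped" rather than
# failing the suite.
add_test(NAME test_alarm_server_smoke
         COMMAND ${CMAKE_CURRENT_SOURCE_DIR}/test/fixtures/test_alarm_server/run_smoke.sh)
set_tests_properties(test_alarm_server_smoke PROPERTIES SKIP_RETURN_CODE 77)
```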
…ct E2E
The previous run_alarm_tests.sh drove its ``ack`` and ``confirm`` steps through
the test_alarm_server stdin CLI, which bypassed the medkit SOVD operation
path. The implementation - lookup_condition + EventId tracking +
call_method on the inherited AcknowledgeableConditionType methods (i=9111
Acknowledge, i=9113 Confirm) - therefore had only unit-level confidence.
This commit makes ack and confirm round-trip through HTTP:
POST /api/v1/apps/tank_process/operations/acknowledge_fault/executions
POST /api/v1/apps/tank_process/operations/confirm_fault/executions
with ``{"fault_code": "...", "comment": "..."}``. After each POST the
test polls the server's stdout for the new ``STATE`` log line (added to
the fixture in this commit) and asserts the relevant flag (``acked=true``
/ ``confirmed=true``) actually flipped on the OPC-UA server.
Additional E2E scenarios added on top of the existing fire->CONFIRMED
->latch->HEALED->CLEARED happy path:
* Shelving suppression: fire Overheat -> CONFIRMED -> shelve ->
fault disappears -> unshelve + fire -> CONFIRMED returns.
Exercises ShelvingState parsing and the state-machine Rule 3.
* Disabled alarm: fire SensorLost -> CONFIRMED -> disable ->
fault clears -> enable + fire -> CONFIRMED. Exercises
EnabledState parsing and Rule 2.
* Reconnect with ConditionRefresh: fire Overpressure -> CONFIRMED ->
docker stop -> docker start -> fire again -> CONFIRMED returns
via the gateway's reconnect path (poll_loop -> setup_event_subs
-> ConditionRefresh).
Fixture changes:
* Source nodes use predictable string NodeIds ``ns=2;s=Alarms.<name>``
so the gateway's ``alarm_source`` config maps unambiguously to a
real node. The previous auto-assigned numeric IDs broke event
subscription against the documented YAML form.
* The CLI loop logs a ``STATE <name> active=... acked=... ...`` line
after every successful command so the harness can assert state
transitions with one ``docker logs | grep`` instead of a separate
asyncua round-trip.
The script keeps polling-with-timeout throughout (no fixed sleeps),
restartable cleanup trap, idempotent re-runs.
Local end-to-end run was not attempted - the gateway-opcua docker image
build is the long pole (multi-minute) and the CI integration-alarms job
covers the same path on every push. Unit + state-machine + smoke tests
all stayed green during the work (147 tests, 0 failures).
Refs #386
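The polling-with-timeout discipline the script follows (no fixed sleeps) can be sketched as a small shell helper. The function name and the usage endpoint are illustrative, not the script's actual code:

```shell
#!/usr/bin/env bash
set -euo pipefail

# poll_until <timeout_s> <cmd...>: re-run cmd every 0.5 s until it succeeds
# or the deadline passes. Returns non-zero on timeout, instead of sleeping a
# fixed interval and hoping the state change already happened.
poll_until() {
  local timeout_s=$1; shift
  local deadline=$((SECONDS + timeout_s))
  while ! "$@"; do
    if (( SECONDS >= deadline )); then
      echo "poll_until: timed out after ${timeout_s}s: $*" >&2
      return 1
    fi
    sleep 0.5
  done
}

# Usage sketch (endpoint is illustrative):
# poll_until 120 curl -fsS http://localhost:8080/api/v1/apps/tank_process >/dev/null
```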
CI Integration (AlarmConditionType) job hit 30-minute timeout at
[3/5] Start test_alarm_server because the script's stdin pipe pattern
deadlocked the runner. The shell opens redirections before exec'ing the
command, so 'docker run -d -i ... < fifo' followed by 'exec 3> fifo'
blocks at the first line forever.
Fix: open the FIFO read+write on FD 3 first ('exec 3<>fifo'), which
is non-blocking, then redirect docker's stdin from FD 3 ('docker run
... <&3'). Same pattern applied to the post-reconnect FIFO.
Side fix: the gateway entrypoint wrote /config/manifest.yaml from
inside the container, but /config is mounted read-only. Pre-write
the manifest on the host before starting the container.
Refs #386
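The fix works because opening a FIFO read-write returns immediately on Linux, while a read-only or write-only open blocks until the other end arrives. A runnable sketch of the pattern (the payload and consumer are stand-ins; the real script hands FD 3 to ``docker run ... <&3``):

```shell
#!/usr/bin/env bash
set -euo pipefail

fifo=$(mktemp -u)
mkfifo "$fifo"

# Open the FIFO read+write on FD 3 FIRST: unlike 'exec 3> fifo' or a
# '< fifo' redirection, this does not block waiting for a peer.
exec 3<>"$fifo"

# A consumer can now take its stdin from FD 3; here we write a command and
# read it back through the same descriptor to show nothing deadlocks.
echo "fire Overpressure" >&3
read -r line <&3
echo "consumer got: $line"

exec 3>&-       # close FD 3
rm -f "$fifo"
```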
…containers

The previous CI run failed at 'tank_process not in /apps after 120s', but the 'Dump container logs on failure' workflow step couldn't see anything - the script's cleanup trap had already 'docker rm -f'd the containers. Move the log dump into the cleanup trap itself, gated by a non-zero exit code, so subsequent failures land actionable logs in the runner output.

Refs #386
…ind mount

Two more findings from local E2E debug:

1. docker run -d -i ... <&3 lost stdin between the daemonized client and the docker daemon - commands written to FD 3 from the script never reached the binary's stdin in the container. The -d flag closes the client process, which orphans the FIFO before the daemon can rewire it. Drop -d and run docker run as a shell background job instead, so the client process stays alive holding the FIFO open for the container. Track the PID and clean up on trap.
2. The gateway image bakes /config/gateway_params.yaml at build time, but our :ro bind mount of /tmp/alarm_test_config:/config shadows it. The container exited 1 with 'Couldn't parse params file'. Stage the params file into the bind mount alongside alarm_nodes.yaml and manifest.yaml.

Refs #386
Three CI runs hit BadNodeIdUnknown on event monitored item creation
even though the server reports the source nodes at the expected NodeIds
('ns=2;s=Alarms.Overpressure' etc, verified via asyncua browse). The
mismatch is somewhere between our parse + raw-C call.
Add stderr trace before and after UA_Client_MonitoredItems_createEvent
so the next CI run logs the exact NodeId stringForm we hand to the
server, plus the status code. Will be tightened to RCLCPP_DEBUG once
the root cause is identified.
Refs #386
… default request

Multiple iterations to pin down BadNodeIdUnknown when adding event monitored items - all still failing locally:

* Replaced the shallow copy of source_node with UA_NodeId_copy so the serializer never aliases an open62541pp wrapper internal.
* Switched typeDefinitionId in SimpleAttributeOperand from BaseEventType to AlarmConditionType (i=2915), since the BrowsePaths we use (ConditionId, AckedState, ShelvingState, etc.) are not defined on BaseEventType per the Part 9 spec.
* Replaced manual UA_MonitoredItemCreateRequest_init with UA_MonitoredItemCreateRequest_default(nodeId) so the request layout matches open62541's own examples.
* Cleared the request struct including the deep-copied NodeId after the call, detaching the stack-local filter first to avoid a double-free.

None of these alone fixed it. The server still reports BadNodeIdUnknown for both the custom source NodeId (ns=2;s=Alarms.Overpressure) AND the canonical Server object (i=2253). Investigation continues.

Refs #386
…onnect

Local E2E (run_alarm_tests.sh) now passes all four scenarios. Six issues fixed along the way; each was reproducible only with a real open62541 AlarmConditionType server and could not have surfaced from the unit suite.

State machine wiring
- opcua_client::add_event_monitored_item now auto-prepends three fixed SimpleAttributeOperands (EventType, SourceNode, ConditionId) per Part 9 §5.5.2.13 - the ConditionId clause uses an empty BrowsePath + AttributeId=NodeId, which is the only spec-legal way to retrieve it. EventCallback gets the ConditionId as a separate argument; user-supplied EventFieldSpec entries appear after the three prepended fields.
- Each user EventFieldSpec carries its OWN typeDefinitionId. open62541 servers reject inherited browse paths with BadNodeIdUnknown, so we tag AckedState/Id with AcknowledgeableConditionType (i=2881), ActiveState/Id with AlarmConditionType (i=2915), EnabledState/Id with ConditionType (i=2782), etc. The previous single-typeDef filter was rejected wholesale.
- Fixed a double-free in event MI creation: the open62541 default request builder returns a struct that aliases the NodeId we pass in, so UA_NodeId_clear after UA_MonitoredItemCreateRequest_clear corrupted the heap. The item is now built explicitly with UA_NodeId_copy into the request.
- The poller calls client.run_iterate(50) every poll cycle. Without it the open62541 client never dispatched subscription notifications because no scalar nodes were configured, so the trampoline silently never fired even though createEvent returned Good.

E2E correctness
- call_method now also rejects per-input-arg failures from inputArgumentResults. AlarmConditionType.Acknowledge surfaces BadEventIdUnknown there when the EventId we sent has been superseded by a newer event; without this check the SOVD POST returned 200 even though the server refused the call.
- run_alarm_tests.sh: ack/confirm now go through SOVD HTTP, not the test_alarm_server stdin shortcut. Between latch and SOVD confirm we poll the gateway log for "AlarmCondition HEALED" - the 500 ms subscription publishing interval means the gateway needs that long to receive both the Acknowledge auto-emit and the latch trigger before it has a fresh EventId for Confirm. Without the wait, Confirm was sent with the stale ID from the original fire payload.
- ShelvingState fix in test_alarm_server: set_shelving now also writes the Id property (NodeId) of CurrentState, not just the LocalizedText. The medkit bridge keys suppression off ShelvingState/CurrentState/Id (i=2929/2930/2932) because the text is locale-dependent. Without the Id write the gateway saw shelved=false and the suppression scenario silently failed.
- Shelved detection in opcua_poller now treats a null/missing Id as Unshelved instead of "unknown=shelved". Some servers leave the optional field uninitialized, and that is not a suppression signal.
- The cleanup trap dumps the full container log on rc!=0, not just the last 120 lines. The diagnostic introspect() poll spam crowded the on_event / state-machine traces out of the tail window.

Scenarios covered (run_alarm_tests.sh)
- fire / SOVD ack / latch / SOVD confirm / clear lifecycle
- shelve suppresses an active alarm; unshelve re-arms it
- disable suppresses an active alarm; enable re-arms it
- gateway reconnect: stop the test_alarm_server, restart it, and a re-fired alarm shows CONFIRMED again (proves setup_event_subscriptions is invoked on the reconnect path with a fresh ConditionRefresh)

Diagnostic stderr logging
- captured EventId hex / call_method status code / per-arg result are printed to stderr from opcua_poller / opcua_plugin / opcua_client. Verbose by design - this is the integration test fixture path and the only way to triage a BadEventIdUnknown after the fact.

Local verify (Jazzy, x86_64)
bash src/ros2_medkit_plugins/ros2_medkit_opcua/docker/scripts/run_alarm_tests.sh
-> "All alarm scenarios passed." EXIT=0
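The per-field typeDefinitionId tagging described above can be illustrated as a plain lookup (a sketch only; the real code builds UA_SimpleAttributeOperand structs for the raw open62541 call, and this table is deliberately partial):

```cpp
#include <cassert>
#include <map>
#include <string>

// Each select clause must name the type that actually DEFINES the browse
// path, not a base type - otherwise open62541 servers answer
// BadNodeIdUnknown. NodeIds follow the commit message:
//   AcknowledgeableConditionType = i=2881, AlarmConditionType = i=2915,
//   ConditionType = i=2782; BaseEventType (i=2041) is the fallback.
const std::map<std::string, std::string> kFieldTypeDefs = {
    {"AckedState/Id",   "i=2881"},  // defined on AcknowledgeableConditionType
    {"ActiveState/Id",  "i=2915"},  // defined on AlarmConditionType
    {"EnabledState/Id", "i=2782"},  // defined on ConditionType
};

std::string type_def_for(const std::string& browse_path) {
  auto it = kFieldTypeDefs.find(browse_path);
  return it == kFieldTypeDefs.end() ? "i=2041" : it->second;
}
```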
…, dead guard)

Review-driven cleanups, no functional change:
- run_alarm_tests.sh: idempotent teardown before ``docker network create``. The cleanup trap fires on EXIT but not on a hard kill (Ctrl-C between trap arm and trap fire), in which case the leftover network would crash the next run under ``set -e``. Mirrors the trap with a noise-tolerant prelude.
- run_alarm_tests.sh: rename scenario 4 from "reconnect preserves CONFIRMED via ConditionRefresh" to "reconnect re-subscribes after server restart". The previous name was aspirational - the test_alarm_server is in-memory and loses condition state on restart, so Part 9 §5.5.7 cannot fire here. Issue #389 tracks the actual ConditionRefresh re-emit verification.
- opcua_poller.cpp: drop the dead ``if (event_subscription_id_ != 0) return`` guard in setup_event_subscriptions(). Both call sites (start() and the poll_loop reconnect arm) zero the field before calling, and the comment now says so.
- opcua_client.cpp: clarify in disconnect() that the ``if (connected)`` guard guarantees a single generation bump even when maybe_mark_disconnected already fired on a transport error - the latter uses exchange(false), so the second site is a no-op by atomic semantics, not by accident.
Today ``OpcuaPoller::condition_refresh()`` swallows server failures with a
silent comment ("not fatal - many test servers do not implement"). Real
PLCs hit this too: open62541 v1.4.x returns BadMethodInvalid, Siemens
S7-1500 omits ConditionRefresh entirely, Beckhoff TF6100 status
unconfirmed in public docs. The operator gets no signal that their
``alarm-replay-on-reconnect`` contract is broken.
- PollerConfig gains ``log_warn`` (std::function<void(const std::string&)>),
optional. The plugin owning the poller wires it to its inherited
GatewayPlugin::log_warn so messages reach the ROS 2 logger and
/rosout, not just container stderr.
- ``OpcuaPoller::condition_refresh()`` emits one warn per connect on
failure (throttled by ``condition_refresh_warned_``) carrying the
StatusCode and pointing at issue #389. Reset on success so a recovered
server earns a fresh warn next time it breaks.
- Falls back to ``std::cerr [opcua_poller WARN]`` when the callback is
not set, preserving observability for unit-test paths that don't go
through the plugin.
No behavior change for the success path. Live transitions continue to
flow regardless; this is purely operator observability for the
"reconnect doesn't replay state" failure mode.
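The callback-plus-fallback wiring described above can be sketched standalone. This is a minimal illustration only: the real ``PollerConfig`` and ``OpcuaPoller`` carry far more state, and the names here merely mirror the description.

```cpp
#include <atomic>
#include <functional>
#include <iostream>
#include <string>
#include <utility>

// Hypothetical reduction of the PollerConfig / condition_refresh wiring.
struct PollerConfig {
  std::function<void(const std::string &)> log_warn;  // optional sink, wired by the plugin
};

class Poller {
public:
  explicit Poller(PollerConfig cfg) : cfg_(std::move(cfg)) {}

  // Emits at most one warning per connect; reset_warn() on success re-arms it.
  void warn_once(const std::string & msg) {
    if (warned_.exchange(true)) return;  // throttled: already warned this connect
    if (cfg_.log_warn) {
      cfg_.log_warn(msg);                // reaches the ROS 2 logger via the plugin
    } else {
      // Fallback preserves observability on unit-test paths without a plugin.
      std::cerr << "[opcua_poller WARN] " << msg << std::endl;
    }
  }

  void reset_warn() { warned_.store(false); }

private:
  PollerConfig cfg_;
  std::atomic<bool> warned_{false};
};
```

The throttle flag plays the role of ``condition_refresh_warned_``: a recovered server resets it, so the next failure earns a fresh warning.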
Extract the OPC-UA Part 4 §5.11.2 Call result classification from
``OpcuaClient::call_method`` into two public static helpers:
- ``status_to_method_error(uint32_t code, const std::string & msg)``
maps an OPC-UA StatusCode to one of {MethodNotFound, InvalidArgument,
MethodTimeout, TransportError} via the existing dispatch table.
- ``classify_call_result(uint32_t overall, std::vector<uint32_t> args)``
walks the per-argument results and returns the first non-Good code,
with overall statusCode taking precedence.
Public statics so the test suite can hit the BadEventIdUnknown branch
that previously had no unit anchor. The bug ``call_method`` was just
fixed for (statusCode=Good + inputArgumentResults[0]=Bad) is exercised
by the docker integration test (run_alarm_tests.sh ack/confirm flow),
but a future refactor that drops the per-arg loop would not get caught
locally. ~30 LoC of pure tests catch this in seconds.
API surface uses ``uint32_t`` instead of ``UA_StatusCode`` so the
public header doesn't pull in open62541 types - the .cpp internally
treats them identically (UA_StatusCode is a uint32_t typedef).
9 new gtest cases (3 for ``status_to_method_error``, 6 for
``classify_call_result``):
- MethodNotFound / InvalidArgument / Timeout family dispatch
- BadEventIdUnknown stays in TransportError (signals SOVD to retry,
not reject, since EventId staleness is a transient race)
- Empty arg results, all-Good arg results -> success
- Bad overall status returns error
- BadEventIdUnknown in arg[0] returns error with "input arg 0" message
- First bad arg wins over later bad args
- Overall status beats arg results (transport error short-circuits)
Existing ``call_method`` body is unchanged in behavior; the diagnostic
stderr trace that prints ``statusCode=`` and per-arg codes is preserved
verbatim.
Local verify: ``test_opcua_client`` reports 26/26 PASSED including all
9 new cases.
Verification round on the post-#386 review found 9 untested cells in the AlarmStateMachine transition matrix (issue #389 follow-up). Adding focused tests so a future refactor cannot silently break a corner case.

New cells covered:
- ``DisabledClearsHealedAlarm`` - Healed exit via Rule 2 (was_active=true for Healed too, so disabling a latched alarm must ClearFault, not just flip to Suppressed).
- ``DisabledNoOpWhenAlreadyCleared`` - confirms Cleared+disabled lands at Suppressed/NoOp (no spurious second clear).
- ``ShelvedClearsHealedAlarm`` / ``ShelvedNoOpWhenAlreadyCleared`` - same pair via Rule 3.
- ``ActiveAlarmReportsConfirmedFromSuppressed`` - operator unshelves/re-enables an alarm whose source is still active; Rule 4 must promote Suppressed -> Confirmed with ReportConfirmed (NOT NoOp). This is exactly the unshelve+re-fire path that scenario 2 of run_alarm_tests.sh exercises end-to-end.
- ``BranchEventFromHealedNoOp`` / ``BranchEventFromSuppressedNoOp`` - Rule 1 precedence holds across all four prev_status values.
- ``AckedAndConfirmedNoOpFromSuppressed`` - the ``was_active=false`` branch of Rule 5: a fully cleared event delivered while suppressed advances next_status to Cleared but issues no ClearFault (no fault to clear).
- ``InactiveUnackedFromSuppressedReportsHealed`` - ditto but with the ack/confirm bits unset; must surface as Healed/ReportHealed so the unfinished ack/confirm workflow item stays visible.

State machine code itself is unchanged. 27 gtest cases, all PASSED. Pure-function tests, deterministic, no I/O.
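For readers following the rule numbering, here is a heavily reduced sketch of just the Rule 1-4 precedence skeleton as these test names describe it. Enum and field names are illustrative, and the real ``AlarmStateMachine::compute`` covers the full matrix, so treat this as a reading aid rather than the implementation.

```cpp
enum class SovdAlarmStatus { Confirmed, Healed, Cleared, Suppressed };
enum class AlarmAction { NoOp, ReportConfirmed, ReportHealed, ClearFault };

struct AlarmEventInput {
  bool branch = false;         // BranchId non-null
  bool enabled_state = true;   // EnabledState/Id
  bool shelved = false;        // ShelvingState
  bool active = false;         // ActiveState/Id
};

struct Result {
  SovdAlarmStatus next_status;
  AlarmAction action;
};

// Precedence sketch only; remaining matrix cells are elided.
Result compute(SovdAlarmStatus prev, const AlarmEventInput & in) {
  const bool was_active =
      prev == SovdAlarmStatus::Confirmed || prev == SovdAlarmStatus::Healed;
  if (in.branch) {
    return {prev, AlarmAction::NoOp};  // Rule 1: branch events never move the trunk
  }
  if (!in.enabled_state || in.shelved) {
    // Rules 2-3: suppression; a latched (active) fault must also be cleared
    return {SovdAlarmStatus::Suppressed,
            was_active ? AlarmAction::ClearFault : AlarmAction::NoOp};
  }
  if (prev == SovdAlarmStatus::Suppressed && in.active) {
    // Rule 4: unshelve/re-enable while the source is still active
    return {SovdAlarmStatus::Confirmed, AlarmAction::ReportConfirmed};
  }
  return {prev, AlarmAction::NoOp};  // remaining cells elided in this sketch
}
```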
> Skips (exits 77 - the CTest convention for a skipped test) when
> ``asyncua`` is not importable, so contributors who only iterate on the
> plugin sources do not need a Python pip install. CI installs ``asyncua``
> in the integration job and observes the test as a real pass / fail.
run_ctest.py exits with code 77 to skip the smoke test when asyncua is not installed. This introduces an always-skippable test path, which makes it easier for regressions to slip in unnoticed. Prefer making asyncua a declared test dependency (so the test always runs in standard test environments) or move the smoke test to a separate, explicitly-optional test target instead of skipping at runtime.
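If the runtime skip is kept, one mitigation is to declare the code to CTest so the skip is at least reported as "Skipped" rather than "Passed" in test summaries. A hedged fragment (the test target name here is illustrative):

```cmake
# SKIP_RETURN_CODE is a standard CTest test property: a test exiting with
# this code is reported as "Skipped", keeping always-skippable paths visible.
set_tests_properties(test_alarm_server_smoke PROPERTIES SKIP_RETURN_CODE 77)
```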
```cpp
TEST(AlarmStateMachineTest, DisabledNoOpWhenAlreadyCleared) {
  AlarmEventInput in;
  in.enabled_state = false;
  auto out = AlarmStateMachine::compute(SovdAlarmStatus::Cleared, in);
  EXPECT_EQ(out.next_status, SovdAlarmStatus::Suppressed);
  EXPECT_EQ(out.action, AlarmAction::NoOp);
}
```
Test name suggests a pure no-op when already cleared, but the expected next_status transitions to Suppressed. If the intended behavior is “no action” but status may change, consider renaming this test to reflect the suppression transition (or adjust expectations to match the name).
```cpp
auto callback = [this, &cfg](const std::vector<opcua::Variant> & values, const opcua::NodeId & source_node,
                             const opcua::NodeId & event_type, const opcua::NodeId & condition_id) {
```
In setup_event_subscriptions(), the lambda captures &cfg where cfg is the loop variable of a range-for. This captures a reference to the loop variable (which is reused each iteration and goes out of scope after the loop), so callbacks can read a dangling reference / wrong config when events arrive. Capture the config by value (or capture a stable pointer/reference to the vector element) so each monitored item callback uses the correct AlarmEventConfig.
Suggested change:

```cpp
auto callback = [this, cfg](const std::vector<opcua::Variant> & values, const opcua::NodeId & source_node,
                            const opcua::NodeId & event_type, const opcua::NodeId & condition_id) {
```
```cpp
// Bump generation BEFORE freeing the ctx so any in-flight trampoline call
// captured the old generation and will drop its work.
impl_->generation.fetch_add(1, std::memory_order_release);
```
remove_event_monitored_item() increments the global generation counter. That will make all other event monitored item contexts fail the generation_snapshot check in the trampoline and silently drop future events, even though their subscriptions are still valid. Consider making the staleness check per-monitored-item (e.g., an active atomic in the context) or only bump generation on full disconnect/remove_subscriptions (not on single MI removal).
Suggested change:

```cpp
// Do not advance the global generation for removal of a single monitored
// item. The generation is used to invalidate all callback contexts after a
// full reconnect/subscription teardown; bumping it here would make other
// still-active monitored items fail the trampoline staleness check and drop
// future events.
```
```cpp
std::cerr << "[opcua_poller] on_event fault=" << cfg.fault_code << " event_type=" << event_type.toString()
          << " condition=" << condition_id.toString() << " values=" << values.size() << std::endl;
```
on_event() currently logs every notification and EventId bytes directly to std::cerr. This can flood logs under high alarm/event rates and bypass the plugin's logging controls. Route these through the plugin/logger (or a configurable debug/trace sink) and avoid dumping per-event hex unless explicitly enabled.
```bash
# Start fault_manager_node first so its services are advertised before
# the gateway opcua plugin tries to call /fault_manager/report_fault.
ros2 run ros2_medkit_fault_manager fault_manager_node \
  > /var/lib/ros2_medkit/fault_manager.log 2>&1 &
sleep 3
PLUGIN_PATH=$(find /root/ws/install -name "libros2_medkit_opcua_plugin.so" | head -1)
```
The PR description says the docker integration uses polling-with-timeout throughout with “no fixed sleeps”, but this script has a fixed sleep 3 before starting the gateway. Consider replacing it with a poll for the fault_manager service readiness (or other observable health signal) to keep the test deterministic across slower CI runners.
```cpp
{
  const auto * bytes = runtime->latest_event_id.data();
  std::cerr << "[opcua_plugin] " << operation_name << " EventId len=" << runtime->latest_event_id.length()
            << " hex=";
  for (size_t i = 0; i < std::min<size_t>(runtime->latest_event_id.length(), 16); ++i) {
    char buf[3];
    std::snprintf(buf, sizeof(buf), "%02x", static_cast<unsigned>(bytes[i]) & 0xffu);
    std::cerr << buf;
  }
  std::cerr << " conditionId=" << runtime->condition_id.toString() << std::endl;
}
```
execute_operation() prints the EventId bytes and ConditionId to std::cerr for every acknowledge/confirm request. This is very verbose and bypasses ROS logging; it will also include user-supplied comment indirectly in downstream logs. Consider removing this block or switching it to debug-level logging behind a flag.
Suggested change: delete the diagnostic block above.
```cpp
/// Retain is intentionally NOT used here. Per Part 9 §5.5.2.10 it controls
/// visibility during ConditionRefresh bursts, not lifecycle - the poller
/// strips Retain=false events delivered between RefreshStartEvent and
/// RefreshEndEvent before invoking compute().
```
The comment says the poller “strips Retain=false events delivered between RefreshStartEvent and RefreshEndEvent before invoking compute()”, but the current event select clauses don’t include Retain and on_event() doesn’t filter on it. Either implement the Retain-based filtering or update this comment so it matches the actual behavior.
Suggested change:

```cpp
/// Retain is intentionally not modeled by this state machine and does not
/// affect ``compute()``. Per Part 9 §5.5.2.10 it controls visibility during
/// ConditionRefresh bursts rather than the lifecycle mapping implemented
/// here.
```
```cpp
// ReportFault.srv (only FAILED/PASSED), so we mark this as a PASSED
// event - fault_manager keeps the entry in HEALED state until
// confirmed, mirroring the lifecycle.
log_info("AlarmCondition HEALED (latched, awaiting ack/confirm): " + delivery.fault_code);
// No-op for now; fault_manager will keep the fault HEALED until
// CLEARED. The state transition is observable via /faults/stream.
```
In AlarmAction::ReportHealed handling, the comment says we “mark this as a PASSED event”, but the implementation is currently a no-op (only logs). Please either emit the intended PASSED/HEALED signal to fault_manager or adjust the comment to reflect that HEALED is intentionally not reported via ReportFault at the moment.
Suggested change:

```cpp
// ReportFault.srv (only FAILED/PASSED), and this plugin does not
// currently emit a PASSED event for the latched HEALED transition.
// For now we only log the transition here; any externally visible
// lifecycle updates come from subsequent alarm events.
log_info("AlarmCondition HEALED (latched, awaiting ack/confirm): " + delivery.fault_code);
// Intentional no-op: HEALED is not currently reported via ReportFault.
```
```cpp
// Mutual-exclusion check: an entry under ``nodes:`` carrying both a
// ``threshold`` alarm and an ``alarm_source`` is a configuration error
// (the threshold path polls scalar values; the alarm_source path
// subscribes to native events). Reject the whole file rather than guess.
for (const auto & node : (nodes ? nodes : YAML::Node{})) {
  if (node["alarm_source"] && node["alarm"] && node["alarm"]["threshold"]) {
    RCLCPP_ERROR(rclcpp::get_logger("opcua.node_map"),
                 "Entry node_id=%s declares both threshold alarm and alarm_source - mutually exclusive",
                 node["node_id"] ? node["node_id"].as<std::string>().c_str() : "<unknown>");
```
This loader comment/error mentions a nodes: entry declaring both a threshold alarm and alarm_source, but alarm_source is now a top-level event_alarms: field. As written, a mistaken alarm_source under nodes: is mostly ignored unless it also has an alarm.threshold. Consider explicitly rejecting any alarm_source key under nodes: with an error that points users to event_alarms: (and/or validate mutual exclusivity across the two YAML forms more directly).
Suggested change:

```cpp
// Schema validation: ``alarm_source`` is only supported in the top-level
// ``event_alarms:`` section. Reject any legacy or misplaced
// ``alarm_source`` field under ``nodes:`` explicitly so configuration
// mistakes are not silently ignored.
for (const auto & node : (nodes ? nodes : YAML::Node{})) {
  if (node["alarm_source"]) {
    RCLCPP_ERROR(
      rclcpp::get_logger("opcua.node_map"),
      "Entry node_id=%s uses alarm_source under nodes:, which is invalid; "
      "move this configuration to top-level event_alarms:",
      node["node_id"] ? node["node_id"].as<std::string>().c_str() : "<unknown>");
```
Summary
Adds native OPC-UA Part 9 `AlarmConditionType` event subscription to the OPC-UA plugin. PLCs that already define alarms in their own runtime (Siemens S7-1500 `Program_Alarm` / ProDiag, Beckhoff TF6100, CodeSys 3.5+ alarm manager, Rockwell via FactoryTalk Linx) are bridged into the SOVD fault lifecycle without any duplicate threshold definitions in YAML.

Closes #386.
Scope
- Event monitored items via `UA_Client_MonitoredItems_createEvent` (open62541pp v0.16 has no native event API).
- `EventFilter` with the canonical AlarmConditionType select clauses (`EventType`, `EventId`, `SourceNode`, `Severity`, `Message`, `ConditionId`, `BranchId`, `EnabledState`, `ActiveState`, `AckedState`, `ConfirmedState`, `ShelvingState`).
- `EventId` tracking, required for spec-compliant `Acknowledge` (Part 9 §5.7.3).
- `AlarmStateMachine` mapping `EnabledState x ShelvingState x ActiveState x AckedState x ConfirmedState x BranchId` to SOVD `CONFIRMED / HEALED / CLEARED / Suppressed`. Decision order documented in `design/index.rst`.
- `ConditionRefresh` (Server method `i=3875`) on subscribe and on every reconnect, with `RefreshStartEvent` / `RefreshEndEvent` recognised.
- `event_alarms:` block in `node_map.yaml`, mutually exclusive per entry with the existing threshold `alarm` form.
- `acknowledge_fault` and `confirm_fault` SOVD operations on every entity that hosts at least one event-mode alarm; calls `i=9111` / `i=9113` on the live `ConditionId` with the tracked `EventId` and an optional `LocalizedText` comment.

Out of scope
- `ShelvingState` write operations (timed / one-shot shelving). Read-side suppression is in scope; operator UI is not.
- `BranchId`-based suppression. Re-fires are tracked via `fault_manager` `occurrence_count` and the `/faults/stream` SSE history.
- `Quality` (StatusCode) propagation to a SOVD `status_quality` field. Requires an additive `ReportFault.srv` field; tracked separately.

Tests
- Unit: new `OpcuaClient` event API tests + 22 `AlarmStateMachine` cases covering the full transition table + 11 plugin / node_map paths exercised from existing suites.
- `test_alarm_server` fixture (open62541 with `UA_NAMESPACE_ZERO=FULL` and `UA_ENABLE_SUBSCRIPTIONS_ALARMS_CONDITIONS=ON`) emits real `AlarmConditionType` events; smoke test in `test/fixtures/test_alarm_server/smoke_test.py` verifies the topology via `asyncua`.
- `docker/scripts/run_alarm_tests.sh` boots the fixture + gateway, fires alarms via stdin, and asserts the SOVD `/faults` endpoint moves through CONFIRMED -> HEALED -> CLEARED with intermediate `acknowledge_fault` / `confirm_fault` round-trips. Polling-with-timeout throughout, no fixed sleeps. Wired into the `opcua-plugin` workflow as a new parallel job.

Test plan
Notes
- The `test_alarm_server` target is gated OFF behind `MEDKIT_OPCUA_BUILD_ALARM_SERVER` while the LTO `namespace0_generated` linker mismatch in the second `ExternalProject_Add` open62541 build is being resolved. The docker integration does not depend on this target - it builds open62541 inside its own image.
- `severity_override` on an `event_alarms` entry takes precedence.
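For reference, a hypothetical `event_alarms` entry might look like the following. The field names (`alarm_source`, `fault_code`, `severity_override`) are inferred from this PR's description, not the authoritative `node_map.yaml` schema:

```yaml
# Hypothetical event_alarms entry; field names are illustrative.
event_alarms:
  - alarm_source: "ns=2;s=Tank1.LevelAlarm"   # ConditionSource node hosting the AlarmConditionType
    fault_code: "TANK1_LEVEL_HIGH"            # SOVD fault code the lifecycle is bridged to
    severity_override: 800                    # optional; takes precedence over the event's Severity
```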