ros2_medkit_opcua: native AlarmConditionType subscription bridge#387

Open
mfaferek93 wants to merge 16 commits into main from feat/issue-386-opcua-alarm-events

Conversation


mfaferek93 commented Apr 25, 2026

Summary

Adds native OPC-UA Part 9 AlarmConditionType event subscription to the OPC-UA plugin. PLCs that already define alarms in their own runtime (Siemens S7-1500 Program_Alarm / ProDiag, Beckhoff TF6100, CodeSys 3.5+ alarm manager, Rockwell via FactoryTalk Linx) are bridged into the SOVD fault lifecycle without any duplicate threshold definitions in YAML.

Closes #386.

Scope

  • Native event subscription via raw UA_Client_MonitoredItems_createEvent (open62541pp v0.16 has no native event API).
  • EventFilter with the canonical AlarmConditionType select clauses (EventType, EventId, SourceNode, Severity, Message, ConditionId, BranchId, EnabledState, ActiveState, AckedState, ConfirmedState, ShelvingState).
  • Per-condition EventId tracking, required for spec-compliant Acknowledge (Part 9 §5.7.3).
  • Pure-function AlarmStateMachine mapping EnabledState x ShelvingState x ActiveState x AckedState x ConfirmedState x BranchId to SOVD CONFIRMED / HEALED / CLEARED / Suppressed. Decision order documented in design/index.rst.
  • ConditionRefresh (Server method i=3875) on subscribe and on every reconnect, with RefreshStartEvent / RefreshEndEvent recognised.
  • New top-level event_alarms: block in node_map.yaml, mutually exclusive per entry with the existing threshold alarm form.
  • acknowledge_fault and confirm_fault SOVD operations on every entity that hosts at least one event-mode alarm; calls i=9111 / i=9113 on the live ConditionId with the tracked EventId and an optional LocalizedText comment.
  • Existing threshold polling and OpenPLC integration unchanged.
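
A minimal sketch of the new YAML form (key names taken from the commit messages on this branch; the entity/fault values are hypothetical):

```yaml
# Hypothetical example entry - entity_id / fault_code values are made up.
event_alarms:
  - alarm_source: "ns=2;s=Alarms.Overpressure"  # NodeId emitting AlarmConditionType events
    entity_id: tank_process
    fault_code: TANK_OVERPRESSURE
    severity_override: ERROR                    # optional; wins over the 1-1000 bucket mapping
```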

Out of scope

  • ShelvingState write operations (timed / one-shot shelving). Read-side suppression is in scope; operator UI is not.
  • OPC-UA branch reasoning beyond BranchId-based suppression. Re-fires are tracked via fault_manager occurrence_count and the /faults/stream SSE history.
  • Auto-discovery of alarm sources via browse (tracked in #368, "ros2_medkit_opcua: add browse-based auto node_map generation").
  • Quality (StatusCode) propagation to a SOVD status_quality field. Requires an additive ReportFault.srv field; tracked separately.

Tests

  • 144 unit tests green on Jazzy (105 pre-existing + 6 new OpcuaClient event API + 22 AlarmStateMachine covering the full transition table + 11 plugin / node_map paths exercised from existing suites).
  • Custom test_alarm_server fixture (open62541 with UA_NAMESPACE_ZERO=FULL and UA_ENABLE_SUBSCRIPTIONS_ALARMS_CONDITIONS=ON) emits real AlarmConditionType events; smoke test in test/fixtures/test_alarm_server/smoke_test.py verifies the topology via asyncua.
  • Docker integration suite docker/scripts/run_alarm_tests.sh boots the fixture + gateway, fires alarms via stdin, and asserts the SOVD /faults endpoint moves through CONFIRMED -> HEALED -> CLEARED with intermediate acknowledge_fault / confirm_fault round-trips. Polling-with-timeout throughout, no fixed sleeps. Wired into the opcua-plugin workflow as a new parallel job.

Test plan

  • Unit tests green on Humble + Jazzy + Rolling.
  • ASAN + TSAN clean.
  • OpenPLC threshold integration still passes (no regression on existing path).
  • New AlarmConditionType integration job passes.

Notes

  • The cmake-side test_alarm_server target is gated OFF behind MEDKIT_OPCUA_BUILD_ALARM_SERVER while the LTO namespace0_generated linker mismatch in the second ExternalProject_Add open62541 build is being resolved. The docker integration does not depend on this target - it builds open62541 inside its own image.
  • Severity 1-1000 is mapped to the existing SOVD severity buckets (1-200 INFO / 201-500 WARN / 501-800 ERROR / 801-1000 CRITICAL). This is the selfpatch convention, not IEC 62682. severity_override on an event_alarms entry takes precedence.

Copilot AI review requested due to automatic review settings on April 25, 2026 at 15:21

Copilot AI left a comment


Pull request overview

Adds native OPC-UA Part 9 AlarmConditionType event subscriptions to the ros2_medkit_opcua plugin and bridges those event-driven alarm lifecycles into the SOVD fault model, while also introducing a new race-free ROS 2 topic sampling architecture in the gateway (subscription executor + pooled topic data provider) and related shutdown/teardown hardening.

Changes:

  • Add OPC-UA AlarmConditionType event subscription plumbing, YAML configuration (event_alarms), state-machine mapping, and ack/confirm operations.
  • Replace legacy ad-hoc topic sampling with TopicDataProvider + Ros2TopicDataProvider backed by Ros2SubscriptionExecutor / Ros2SubscriptionSlot, and expose pool stats via /health.
  • Improve shutdown robustness (SSE stop wake-up, demo node shutdown helper, graph-query exception swallowing during shutdown) + update provider header layout (logs/updates/scripts/introspection).

Reviewed changes

Copilot reviewed 107 out of 111 changed files in this pull request and generated 1 comment.

File Description
src/ros2_medkit_serialization/design/index.rst Update design docs to reflect new topic data provider naming/usage.
src/ros2_medkit_plugins/ros2_medkit_opcua/test/test_opcua_client.cpp Add unit tests for new client event/method APIs (disconnected-state contracts).
src/ros2_medkit_plugins/ros2_medkit_opcua/test/test_alarm_state_machine.cpp Add unit tests for AlarmConditionType -> SOVD lifecycle state machine.
src/ros2_medkit_plugins/ros2_medkit_opcua/test/fixtures/test_alarm_server/smoke_test.py Add fixture smoke test validating AlarmConditionType nodes/methods/fields.
src/ros2_medkit_plugins/ros2_medkit_opcua/src/opcua_plugin.cpp Bridge event alarms into fault lifecycle; add ack/confirm operations.
src/ros2_medkit_plugins/ros2_medkit_opcua/src/node_map.cpp Parse event_alarms YAML and merge event-mode entities into discovery defs.
src/ros2_medkit_plugins/ros2_medkit_opcua/include/ros2_medkit_opcua/opcua_poller.hpp Add event-alarm delivery types + condition runtime lookup interface.
src/ros2_medkit_plugins/ros2_medkit_opcua/include/ros2_medkit_opcua/opcua_plugin.hpp Declare event-alarm bridge handler.
src/ros2_medkit_plugins/ros2_medkit_opcua/include/ros2_medkit_opcua/opcua_client.hpp Add raw open62541 event monitored item + method-call APIs, generation counter.
src/ros2_medkit_plugins/ros2_medkit_opcua/include/ros2_medkit_opcua/node_map.hpp Define AlarmEventConfig and event-alarm accessors.
src/ros2_medkit_plugins/ros2_medkit_opcua/include/ros2_medkit_opcua/alarm_state_machine.hpp New pure-function AlarmConditionType lifecycle state machine.
src/ros2_medkit_plugins/ros2_medkit_opcua/docker/test_alarm_server/build.sh Build helper for the alarm test server Docker image.
src/ros2_medkit_plugins/ros2_medkit_opcua/docker/test_alarm_server/Dockerfile Dockerized open62541 FULL ns0 alarm test fixture build/runtime.
src/ros2_medkit_plugins/ros2_medkit_opcua/docker/scripts/run_alarm_tests.sh Docker integration suite validating alarm lifecycle via gateway SOVD endpoints.
src/ros2_medkit_plugins/ros2_medkit_opcua/design/index.rst Document event alarm configuration, state machine, refresh, ack/confirm.
src/ros2_medkit_plugins/ros2_medkit_opcua/README.md Document event_alarms YAML and new operations usage.
src/ros2_medkit_plugins/ros2_medkit_opcua/CMakeLists.txt Add state-machine tests and optional alarm test server build wiring.
src/ros2_medkit_plugins/ros2_medkit_opcua/CHANGELOG.rst Document forthcoming alarm subscription + ops changes.
src/ros2_medkit_plugins/ros2_medkit_graph_provider/include/ros2_medkit_graph_provider/graph_provider_plugin.hpp Update introspection provider include path.
src/ros2_medkit_integration_tests/include/ros2_medkit_integration_tests/demo_node_main.hpp Add shared graceful demo-node shutdown helper (sigwait thread).
src/ros2_medkit_integration_tests/demo_nodes/rpm_sensor.cpp Switch demo node to shared shutdown helper.
src/ros2_medkit_integration_tests/demo_nodes/param_beacon_node.cpp Switch demo node to shared shutdown helper.
src/ros2_medkit_integration_tests/demo_nodes/long_calibration_action.cpp Switch demo node to shared shutdown helper.
src/ros2_medkit_integration_tests/demo_nodes/light_controller.cpp Switch demo node to shared shutdown helper.
src/ros2_medkit_integration_tests/demo_nodes/lidar_sensor.cpp Switch demo node to shared shutdown helper.
src/ros2_medkit_integration_tests/demo_nodes/engine_temp_sensor.cpp Switch demo node to shared shutdown helper.
src/ros2_medkit_integration_tests/demo_nodes/door_status_sensor.cpp Switch demo node to shared shutdown helper.
src/ros2_medkit_integration_tests/demo_nodes/calibration_service.cpp Switch demo node to shared shutdown helper.
src/ros2_medkit_integration_tests/demo_nodes/brake_pressure_sensor.cpp Switch demo node to shared shutdown helper.
src/ros2_medkit_integration_tests/demo_nodes/brake_actuator.cpp Switch demo node to shared shutdown helper.
src/ros2_medkit_integration_tests/demo_nodes/beacon_publisher.cpp Switch demo node to shared shutdown helper.
src/ros2_medkit_integration_tests/CMakeLists.txt Add include path for shared demo-node helper across demo binaries.
src/ros2_medkit_gateway/test/test_topic_data_provider_interface.cpp Add interface-level tests for TopicDataProvider mocking contract.
src/ros2_medkit_gateway/test/test_script_manager.cpp Update script provider include path.
src/ros2_medkit_gateway/test/test_ros2_subscription_executor.cpp Add unit tests for subscription executor behavior (queue, watchdog, graph cb).
src/ros2_medkit_gateway/test/test_plugin_manager.cpp Update introspection include path; avoid json alias shadowing.
src/ros2_medkit_gateway/test/test_plugin_loader.cpp Update provider include paths (introspection/updates).
src/ros2_medkit_gateway/test/test_log_manager.cpp Update log provider include path.
src/ros2_medkit_gateway/test/test_error_info.cpp Add unit tests for new ErrorInfo value type.
src/ros2_medkit_gateway/test/test_discovery_manager.cpp Update to new topic data provider wiring + thread-safe teardown.
src/ros2_medkit_gateway/test/test_data_access_manager.cpp Update tests to new topic provider wiring and teardown order.
src/ros2_medkit_gateway/test/demo_nodes/test_update_backend.cpp Update update provider include path.
src/ros2_medkit_gateway/test/demo_nodes/test_gateway_plugin.cpp Update introspection/update provider include paths.
src/ros2_medkit_gateway/src/trigger_topic_subscriber.cpp Adjust comment to reflect non-NativeTopicSampler usage.
src/ros2_medkit_gateway/src/ros2_common/ros2_subscription_slot.cpp Implement RAII subscription slot with safe deferred destroy behavior.
src/ros2_medkit_gateway/src/plugins/plugin_loader.cpp Update provider include paths (scripts/updates/introspection).
src/ros2_medkit_gateway/src/main.cpp Wire subscription executor + pooled data provider; add explicit teardown sequence.
src/ros2_medkit_gateway/src/http/rest_server.cpp Ensure SSE shutdown is requested before server stop/join.
src/ros2_medkit_gateway/src/http/handlers/sse_fault_handler.cpp Add request_shutdown() to wake SSE waiters promptly.
src/ros2_medkit_gateway/src/http/handlers/health_handlers.cpp Expose provider/executor stats via x-medkit-* keys in /health.
src/ros2_medkit_gateway/src/http/handlers/data_handlers.cpp Route sampling via TopicDataProvider and propagate provider ErrorInfo accurately.
src/ros2_medkit_gateway/src/gateway_node.cpp Add set_topic_data_provider() and route discovery/DAM sampling through it.
src/ros2_medkit_gateway/src/discovery/runtime_discovery.cpp Switch to provider-based topic mapping and swallow shutdown-time graph exceptions.
src/ros2_medkit_gateway/src/discovery/merge_pipeline.cpp Update introspection provider include path.
src/ros2_medkit_gateway/src/discovery/discovery_manager.cpp Rename setter to set_topic_data_provider.
src/ros2_medkit_gateway/src/data_access_manager.cpp Use TopicDataProvider (remove NativeTopicSampler) and propagate non-404 errors.
src/ros2_medkit_gateway/include/ros2_medkit_gateway/updates/update_provider.hpp Introduce new update provider interface header location.
src/ros2_medkit_gateway/include/ros2_medkit_gateway/updates/update_manager.hpp Update include to new update provider header location.
src/ros2_medkit_gateway/include/ros2_medkit_gateway/scripts/script_provider.hpp Introduce new script provider interface header location.
src/ros2_medkit_gateway/include/ros2_medkit_gateway/script_manager.hpp Update include to new script provider header location.
src/ros2_medkit_gateway/include/ros2_medkit_gateway/ros2_common/ros2_subscription_slot.hpp Define subscription slot API (create + safe teardown contract).
src/ros2_medkit_gateway/include/ros2_medkit_gateway/plugins/plugin_manager.hpp Update provider include paths (logs/scripts/updates/introspection).
src/ros2_medkit_gateway/include/ros2_medkit_gateway/plugins/plugin_context.hpp Update introspection provider include path.
src/ros2_medkit_gateway/include/ros2_medkit_gateway/models/error_info.hpp Add transport-neutral provider error descriptor.
src/ros2_medkit_gateway/include/ros2_medkit_gateway/logs/log_provider.hpp Introduce new log provider interface header location.
src/ros2_medkit_gateway/include/ros2_medkit_gateway/log_manager.hpp Update include to new log provider header location.
src/ros2_medkit_gateway/include/ros2_medkit_gateway/http/handlers/sse_fault_handler.hpp Add explicit request_shutdown() API documentation.
src/ros2_medkit_gateway/include/ros2_medkit_gateway/http/error_codes.hpp Add new x-medkit-* error codes for shutdown/subscribe/cold-wait failures.
src/ros2_medkit_gateway/include/ros2_medkit_gateway/gateway_node.hpp Add TopicDataProvider attach/detach API + ownership.
src/ros2_medkit_gateway/include/ros2_medkit_gateway/exceptions.hpp Add ProviderErrorException to preserve provider http/code.
src/ros2_medkit_gateway/include/ros2_medkit_gateway/discovery/runtime_discovery.hpp Switch runtime discovery to TopicDataProvider pointer.
src/ros2_medkit_gateway/include/ros2_medkit_gateway/discovery/layers/plugin_layer.hpp Update introspection provider include path.
src/ros2_medkit_gateway/include/ros2_medkit_gateway/discovery/introspection_provider.hpp New introspection provider interface header location.
src/ros2_medkit_gateway/include/ros2_medkit_gateway/discovery/discovery_manager.hpp Update API to set TopicDataProvider instead of NativeTopicSampler.
src/ros2_medkit_gateway/include/ros2_medkit_gateway/default_script_provider.hpp Update script provider include path.
src/ros2_medkit_gateway/include/ros2_medkit_gateway/data_access_manager.hpp Replace NativeTopicSampler exposure with TopicDataProvider attach/get.
src/ros2_medkit_gateway/include/ros2_medkit_gateway/data/topic_data_provider.hpp New transport-neutral topic sampling/discovery interface.
src/ros2_medkit_gateway/include/ros2_medkit_gateway/data/ros2_topic_data_provider.hpp New pooled ROS 2 implementation + stats/eviction design.
src/ros2_medkit_gateway/include/ros2_medkit_gateway/data/data_types.hpp Add transport-neutral topic discovery/sample data structures.
src/ros2_medkit_gateway/design/index.rst Update gateway design docs for new subscription architecture.
src/ros2_medkit_gateway/design/architecture.puml Update architecture diagram for TopicDataProvider/subscription executor.
src/ros2_medkit_gateway/config/gateway_params.yaml Add config knobs for executor and provider pool behavior.
src/ros2_medkit_gateway/README.md Document regression gate against “naked” rclcpp subscription creation.
src/ros2_medkit_gateway/CMakeLists.txt Add new sources/tests for subscription infra + topic provider; drop NativeTopicSampler tests.
src/ros2_medkit_fault_manager/src/snapshot_capture.cpp Serialize subscription create/destroy under mutex to close TSan race window.
src/ros2_medkit_discovery_plugins/ros2_medkit_topic_beacon/include/ros2_medkit_topic_beacon/topic_beacon_plugin.hpp Update introspection provider include path.
src/ros2_medkit_discovery_plugins/ros2_medkit_param_beacon/src/param_beacon_plugin.cpp Swallow shutdown-time graph exceptions to avoid terminate during teardown.
src/ros2_medkit_discovery_plugins/ros2_medkit_param_beacon/include/ros2_medkit_param_beacon/param_beacon_plugin.hpp Update introspection provider include path.
src/ros2_medkit_discovery_plugins/ros2_medkit_param_beacon/CMakeLists.txt Increase gmock test timeout for sanitizer overhead.
src/ros2_medkit_discovery_plugins/ros2_medkit_linux_introspection/src/systemd_plugin.cpp Update introspection provider include path.
src/ros2_medkit_discovery_plugins/ros2_medkit_linux_introspection/src/procfs_plugin.cpp Update introspection provider include path.
src/ros2_medkit_discovery_plugins/ros2_medkit_linux_introspection/src/container_plugin.cpp Update introspection provider include path.
src/ros2_medkit_discovery_plugins/ros2_medkit_beacon_common/include/ros2_medkit_beacon_common/beacon_entity_mapper.hpp Update introspection provider include path.
scripts/check_no_naked_subscriptions.sh Add CI/pre-commit regression gate for subscription creation API usage.
docs/tutorials/plugin-system.rst Update provider include paths + document subscription API restriction for plugins.
docs/api/rest.rst Document new /health vendor-extension sections for pool/executor stats.
.pre-commit-config.yaml Add local pre-commit hook for naked subscription regression gate.
.github/workflows/quality.yml Run naked subscription regression gate in CI.
.github/workflows/opcua-plugin.yml Add new AlarmConditionType docker integration job.

Comment on lines +528 to +537
case AlarmAction::ReportHealed:
// Fault is latched: condition is no longer active but not yet
// confirmed. We don't have a dedicated HEALED reporting verb in
// ReportFault.srv (only FAILED/PASSED), so we mark this as a PASSED
// event - fault_manager keeps the entry in HEALED state until
// confirmed, mirroring the lifecycle.
log_info("AlarmCondition HEALED (latched, awaiting ack/confirm): " + delivery.fault_code);
// No-op for now; fault_manager will keep the fault HEALED until
// CLEARED. The state transition is observable via /faults/stream.
break;

Copilot AI Apr 25, 2026


AlarmAction::ReportHealed is currently a no-op, but the comment says it should be reported as a PASSED event so fault_manager can transition the fault to HEALED. As written, the fault will remain CONFIRMED until it is cleared, so the intended CONFIRMED -> HEALED -> CLEARED lifecycle (and the docker integration expectation) won’t occur. Consider adding a helper to report EVENT_PASSED (or extending send_report_fault to take an event type) and invoking it here.

Adds OpcuaClient primitives for native OPC-UA event subscription, the first
commit of issue #386 (AlarmConditionType bridge to fault_manager).
open62541pp v0.16 has no native EventFilter or event subscription support, so
this plumbs the raw open62541 C API (UA_Client_MonitoredItems_createEvent)
behind a small C++ surface that mirrors the existing data-change patterns.

Public API additions:

  * EventField + EventBrowsePath types for SimpleAttributeOperand specs.
  * EventCallback signature delivering select values plus EventType and
    SourceNode (always prepended to the filter, extracted on dispatch).
  * add_event_monitored_item / remove_event_monitored_item with
    heap-allocated CallbackContext stored as unique_ptr in OpcuaClient::Impl.
  * call_method wrapping opcua::services::call with status-mapped errors -
    needed by ConditionRefresh, Acknowledge, and Confirm in later commits.
  * current_generation() exposing a monotonic counter incremented on every
    detected disconnect (clean disconnect or transport-level drop). The
    trampoline captures a snapshot at createEvent time and drops callbacks
    whose snapshot diverges from the live counter, eliminating the late
    callback / use-after-free hazard reviewers flagged.

remove_subscriptions and disconnect now bump the generation and clear
event_callbacks before tearing down open62541pp Subscriptions, ensuring
in-flight C callbacks see a stale generation rather than a freed context.

Tests: 6 new disconnected-state tests validating the API contract and the
generation-counter ordering. End-to-end event flow against a real server
runs against the test_alarm_server fixture introduced in a later commit on
this branch (per plan v2 - in-process server tests with synthetic event
triggering are deferred to keep this commit reviewable; the docker
integration test in commit 5 covers the full path).

Refs #386
…firm ops

Wires the Part 9 AlarmConditionType subscription path into the existing
threshold-polling plugin. Builds on the event subscription primitive from
the previous commit; downstream consumers see one new YAML form
(``event_alarms:`` block) and two new SOVD operations
(``acknowledge_fault``, ``confirm_fault``).

* New header-only ``AlarmStateMachine``: pure compute_status function over
  ``EnabledState x ShelvingState x ActiveState x AckedState x ConfirmedState
  x BranchId``. Decision order documented inline; Retain is intentionally
  ignored (per Part 9 it filters ConditionRefresh visibility, not
  lifecycle). 22 unit tests cover every rule plus precedence ordering.

* ``NodeMap`` learns ``event_alarms:`` (top-level YAML sibling of
  ``nodes:``). Each entry declares ``alarm_source`` (NodeId of the source
  emitting AlarmConditionType events), ``entity_id``, ``fault_code``, and
  optional severity / message overrides. Mutually exclusive with the
  per-entry threshold ``alarm`` block; load() fails fast if both are set
  on the same node. ``find_event_alarm`` lookup serves the SOVD operation
  handlers. ``build_entity_defs`` merges event-mode entities so SOVD
  discovery surfaces them as fault-bearing even without scalar data.

* ``OpcuaPoller`` gains ``setup_event_subscriptions()`` and
  ``on_event(...)``. One dedicated subscription handles all event-mode
  alarms; the trampoline dispatches positional select-clause values
  through the state machine. ConditionRefresh fires on subscribe and on
  every reconnect (using the existing exponential-backoff path); the
  generation counter from OpcuaClient already filters callbacks captured
  from defunct subscriptions.

* Per-condition runtime is keyed by ConditionId NodeId stringForm so
  multiple instances under the same source remain distinct. Each entry
  carries the latest EventId ByteString - required for spec-compliant
  Acknowledge calls (Part 9 §5.7.3 returns BadEventIdUnknown otherwise).

* Plugin's OperationProvider lists ``acknowledge_fault`` /
  ``confirm_fault`` for any entity that has at least one event-mode
  alarm. ``execute_operation`` resolves (entity_id, fault_code) through
  the poller's lookup_condition, then invokes the inherited methods on
  AcknowledgeableConditionType (i=9111 Ack, i=9113 Confirm) with the
  tracked EventId and a LocalizedText comment. HTTP error mapping mirrors
  the existing write_value path.

* AlarmCondition events bridge through ``on_event_alarm`` to the existing
  send_report_fault / send_clear_fault wiring. Severity is mapped to the
  SOVD enum buckets (1-200 INFO, 201-500 WARN, 501-800 ERROR, 801-1000
  CRITICAL); selfpatch convention, NOT IEC 62682 - documented in the
  follow-up design doc.

Tests: 22 new state-machine unit tests (full transition table coverage
plus rule-precedence). All 144 tests in the package green; ASAN/TSAN
clean; clang-format, copyright, cppcheck, lint_cmake, xmllint all pass.

The test_alarm_server fixture and its docker integration ship in the
accompanying commit on this branch (gated OFF by default while the
ExternalProject namespace0_generated linker issue is being resolved).

Refs #386
…ow, docs

Closes the integration story for issue #386. The plugin's threshold-mode
integration runs against OpenPLC; this commit adds the parallel suite for
native AlarmConditionType subscriptions, which OpenPLC does not implement.

Components:

* ``test_alarm_server`` fixture (open62541, FULL ns0, alarms ON). Standalone
  C++ binary exposing 3 conditions on tcp/4842 plus a stdin CLI
  (``fire``, ``ack``, ``confirm``, ``latch``, ``shelve``, ``unshelve``,
  ``disable``, ``enable``, ``quit``). Smoke test in
  ``test/fixtures/test_alarm_server/smoke_test.py`` verifies type
  conformance via ``asyncua``.

* Self-contained Dockerfile under ``docker/test_alarm_server/`` clones
  open62541 v1.4.6 inside the image (no dependency on a pre-populated
  workspace ``build/`` tree, so CI builds cleanly from a fresh checkout).

* ``docker/scripts/run_alarm_tests.sh`` orchestrates the fixture +
  gateway: builds both images, brings them up on a private network, fires
  alarms via the server's stdin pipe, and asserts the SOVD ``/faults``
  endpoint moves through CONFIRMED -> HEALED -> CLEARED. Polling-with-
  timeout throughout (no fixed sleeps); cleanup trap teardown.

* ``.github/workflows/opcua-plugin.yml`` gains the ``integration-alarms``
  job, parallel to the existing ``integration`` (OpenPLC) job. Both run on
  every PR that touches the plugin or its dependencies.

* ``design/index.rst`` documents the full state machine table (precedence
  order, the deliberate choice to ignore ``Retain`` for lifecycle), the
  selfpatch severity-bucket convention (and the explicit non-claim of IEC
  62682), the ``ConditionRefresh`` / ``RefreshStartEvent`` /
  ``RefreshEndEvent`` flow, the ack/confirm method NodeIds, and a vendor
  matrix covering Siemens, Beckhoff, Rockwell, CodeSys, OpenPLC.

* ``README.md`` shows the 3-line ``event_alarms:`` form and a curl example
  for ``acknowledge_fault``.

* ``CHANGELOG.rst`` Forthcoming entry.

The cmake-side ``test_alarm_server`` target stays gated OFF
(``MEDKIT_OPCUA_BUILD_ALARM_SERVER``) until the LTO ``namespace0_generated``
linker mismatch in the second open62541 ``ExternalProject_Add`` build is
resolved. The docker integration suite does not depend on it - it builds
its open62541 inside the container.

Refs #386
mfaferek93 force-pushed the feat/issue-386-opcua-alarm-events branch from 4ca2181 to c57feef on April 25, 2026 at 15:41
Adds ``test_alarm_server_smoke`` to CTest. Boots the freshly-built
fixture on an ephemeral port, waits for the ``READY`` line on stdout,
runs the existing ``asyncua`` smoke test against it, and tears the
process down. Skips with CTest exit 77 (which we map via
``SKIP_RETURN_CODE``) when ``asyncua`` is not importable, so iterating
on plugin sources without the Python dependency does not fail the
suite.
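
The exit-77 mapping uses CTest's standard ``SKIP_RETURN_CODE`` test property; a minimal sketch (test name from this branch, wiring simplified):

```cmake
# Exit code 77 from the smoke runner marks the test SKIPPED, not FAILED,
# when asyncua is not importable.
set_tests_properties(test_alarm_server_smoke PROPERTIES SKIP_RETURN_CODE 77)
```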

CI ``integration-alarms`` job installs ``asyncua`` so the smoke test
runs as a real pass / fail there. Other jobs see it as skipped, which
ament_lint surfaces but does not flag.

Refs #386
…ct E2E

The previous run_alarm_tests.sh issued its ``ack`` and ``confirm`` commands
directly via the test_alarm_server stdin CLI, which bypassed the medkit SOVD
operation path. The implementation - lookup_condition + EventId tracking +
call_method on the inherited AcknowledgeableConditionType methods (i=9111
Acknowledge, i=9113 Confirm) - therefore had only unit-level confidence.

This commit makes ack and confirm round-trip through HTTP:

  POST /api/v1/apps/tank_process/operations/acknowledge_fault/executions
  POST /api/v1/apps/tank_process/operations/confirm_fault/executions

with ``{"fault_code": "...", "comment": "..."}``. After each POST the
test polls the server's stdout for the new ``STATE`` log line (added to
the fixture in this commit) and asserts the relevant flag (``acked=true``
/ ``confirmed=true``) actually flipped on the OPC-UA server.

Additional E2E scenarios added on top of the existing fire->CONFIRMED
->latch->HEALED->CLEARED happy path:

  * Shelving suppression: fire Overheat -> CONFIRMED -> shelve ->
    fault disappears -> unshelve + fire -> CONFIRMED returns.
    Exercises ShelvingState parsing and the state-machine Rule 3.
  * Disabled alarm: fire SensorLost -> CONFIRMED -> disable ->
    fault clears -> enable + fire -> CONFIRMED. Exercises
    EnabledState parsing and Rule 2.
  * Reconnect with ConditionRefresh: fire Overpressure -> CONFIRMED ->
    docker stop -> docker start -> fire again -> CONFIRMED returns
    via the gateway's reconnect path (poll_loop -> setup_event_subs
    -> ConditionRefresh).

Fixture changes:

  * Source nodes use predictable string NodeIds ``ns=2;s=Alarms.<name>``
    so the gateway's ``alarm_source`` config maps unambiguously to a
    real node. The previous auto-assigned numeric IDs broke event
    subscription against the documented YAML form.
  * The CLI loop logs a ``STATE <name> active=... acked=... ...`` line
    after every successful command so the harness can assert state
    transitions with one ``docker logs | grep`` instead of a separate
    asyncua round-trip.

The script keeps polling-with-timeout throughout (no fixed sleeps),
restartable cleanup trap, idempotent re-runs.

Local end-to-end run was not attempted - the gateway-opcua docker image
build is the long pole (multi-minute), and the CI integration-alarms job
covers the same path on every push. Unit + state-machine + smoke tests
all stayed green during the work (147 tests, 0 failures).

Refs #386
CI Integration (AlarmConditionType) job hit 30-minute timeout at
[3/5] Start test_alarm_server because the script's stdin pipe pattern
deadlocked the runner. The shell opens redirections before exec'ing the
command, so 'docker run -d -i ... < fifo' followed by 'exec 3> fifo'
blocks at the first line forever.

Fix: open the FIFO read+write on FD 3 first ('exec 3<>fifo'), which
is non-blocking, then redirect docker's stdin from FD 3 ('docker run
... <&3'). Same pattern applied to the post-reconnect FIFO.

Side fix: the gateway entrypoint wrote /config/manifest.yaml from
inside the container, but /config is mounted read-only. Pre-write
the manifest on the host before starting the container.

Refs #386
…containers

Previous CI run failed at 'tank_process not in /apps after 120s' but the
'Dump container logs on failure' workflow step couldn't see anything -
the script's cleanup trap had already 'docker rm -f'd the containers.

Move log dump into the cleanup trap itself, gated by non-zero exit code,
so subsequent failures land actionable logs in the runner output.

Refs #386
…ind mount

Two more findings from local E2E debug:

1. docker run -d -i ... <&3 lost stdin between the daemonized client
   and the docker daemon - commands written to FD 3 from the script never
   reached the binary's stdin in the container. The -d flag detaches and
   exits the client process, which orphans the FIFO before the daemon can
   rewire it.
   Drop -d, run docker run as a shell background job instead, so the
   client process stays alive holding the FIFO open for the container.
   Track the PID and clean up on trap.

2. The gateway image bakes /config/gateway_params.yaml at build time,
   but our :ro bind mount of /tmp/alarm_test_config:/config shadows
   it. The container exited 1 with 'Couldn't parse params file'. Stage
   the params file into the bind mount alongside the alarm_nodes.yaml
   and manifest.yaml.

Refs #386
Three CI runs hit BadNodeIdUnknown on event monitored item creation
even though the server reports the source nodes at the expected NodeIds
('ns=2;s=Alarms.Overpressure' etc., verified via asyncua browse). The
mismatch lies somewhere between our NodeId parsing and the raw C call.

Add stderr traces before and after UA_Client_MonitoredItems_createEvent
so the next CI run logs the exact NodeId string form we hand to the
server, plus the returned status code. These will be tightened to
RCLCPP_DEBUG once the root cause is identified.

Refs #386
… default request

Multiple iterations to pin down BadNodeIdUnknown when adding event
monitored items - all still failing locally:

* Replaced shallow copy of source_node with UA_NodeId_copy so the
  serializer never aliases an open62541pp wrapper internal.
* Switched typeDefinitionId in SimpleAttributeOperand from
  BaseEventType to AlarmConditionType (i=2915) since the BrowsePaths
  we use (ConditionId, AckedState, ShelvingState, etc.) are not
  defined on BaseEventType per Part 9 spec.
* Replaced manual UA_MonitoredItemCreateRequest_init with
  UA_MonitoredItemCreateRequest_default(nodeId) so the request
  layout matches open62541's own examples.
* Cleared the request struct including the deep-copied NodeId after
  the call, detaching the stack-local filter first to avoid a
  double-free.

None of these alone fixed it. Server still reports BadNodeIdUnknown
for both the custom source NodeId (ns=2;s=Alarms.Overpressure) AND
the canonical Server object (i=2253). Investigation continues.

Refs #386
…onnect

Local E2E (run_alarm_tests.sh) now passes all four scenarios. Six issues
fixed along the way; each was reproducible only with a real open62541
AlarmConditionType server and could not have surfaced from the unit suite.

State machine wiring
- opcua_client::add_event_monitored_item now auto-prepends three fixed
  SimpleAttributeOperands (EventType, SourceNode, ConditionId) per
  Part 9 §5.5.2.13 - the ConditionId clause uses an empty BrowsePath +
  AttributeId=NodeId, which is the only spec-legal way to retrieve it.
  EventCallback gets the ConditionId as a separate argument; user-supplied
  EventFieldSpec entries appear after the three prepended fields.
- Each user EventFieldSpec carries its OWN typeDefinitionId. open62541
  servers reject inherited browse paths with BadNodeIdUnknown, so we tag
  AckedState/Id with AcknowledgeableConditionType (i=2881), ActiveState/Id
  with AlarmConditionType (i=2915), EnabledState/Id with ConditionType
  (i=2782), etc. Previous single-typeDef filter was rejected wholesale.
- Fixed double-free in event MI creation: the open62541 default request
  builder returns a struct that aliases the NodeId we pass in, so
  UA_NodeId_clear after UA_MonitoredItemCreateRequest_clear corrupted the
  heap. Item is now built explicitly with UA_NodeId_copy into the request.
- Poller calls client.run_iterate(50) every poll cycle. Without it the
  open62541 client never dispatched subscription notifications because no
  scalar nodes were configured, so the trampoline silently never fired
  even though createEvent returned Good.

E2E correctness
- call_method now also rejects per-input-arg failures from
  inputArgumentResults. AlarmConditionType.Acknowledge surfaces
  BadEventIdUnknown there when the EventId we sent has been superseded by
  a newer event; without this check SOVD POST returned 200 even though
  the server refused the call.
- run_alarm_tests.sh: ack/confirm now go through SOVD HTTP, not the
  test_alarm_server stdin shortcut. Between latch and SOVD confirm we
  poll the gateway log for "AlarmCondition HEALED" - the 500 ms
  subscription publishing interval means the gateway needs that long to
  receive both the Acknowledge auto-emit and the latch trigger before it
  has a fresh EventId for Confirm. Without the wait Confirm was sent
  with the stale ID from the original fire payload.
- ShelvingState fix in test_alarm_server: set_shelving now also writes
  the Id property (NodeId) of CurrentState, not just the LocalizedText.
  The medkit bridge keys suppression off ShelvingState/CurrentState/Id
  (i=2929/2930/2932) because the text is locale-dependent. Without the
  Id write the gateway saw shelved=false and the suppression scenario
  silently failed.
- Shelved detection in opcua_poller now treats a null/missing Id as
  Unshelved instead of "unknown=shelved". Some servers leave the optional
  field uninitialized, and that is not a suppression signal.
- Cleanup trap dumps the full container log on rc!=0, not the last 120
  lines. The diagnostic introspect() poll spam crowded out the
  on_event / state-machine traces from the tail window.

Scenarios covered (run_alarm_tests.sh)
- fire / SOVD ack / latch / SOVD confirm / clear lifecycle
- shelve suppresses an active alarm; unshelve re-arms it
- disable suppresses an active alarm; enable re-arms it
- gateway reconnect: stop the test_alarm_server, restart it, and a
  re-fired alarm shows CONFIRMED again (proves setup_event_subscriptions
  is invoked on the reconnect path with a fresh ConditionRefresh)

Diagnostic stderr logging
- captured EventId hex / call_method status code / per-arg result are
  printed to stderr from opcua_poller / opcua_plugin / opcua_client.
  Verbose by design - this is the integration test fixture path and the
  only way to triage a BadEventIdUnknown after the fact.

Local verify (Jazzy, x86_64)
  bash src/ros2_medkit_plugins/ros2_medkit_opcua/docker/scripts/run_alarm_tests.sh
  -> "All alarm scenarios passed." EXIT=0
…, dead guard)

Three review-driven cleanups, no functional change:

- run_alarm_tests.sh: idempotent teardown before ``docker network create``.
  The cleanup trap fires on EXIT but not on a hard kill (Ctrl-C between trap
  arm and trap fire), in which case the leftover network would crash the
  next run under ``set -e``. Mirrors the trap with a failure-tolerant prelude.

- run_alarm_tests.sh: rename scenario 4 from "reconnect preserves CONFIRMED
  via ConditionRefresh" to "reconnect re-subscribes after server restart".
  The previous name was aspirational - the test_alarm_server is in-memory
  and loses condition state on restart, so Part 9 §5.5.7 cannot fire here.
  Issue #389 tracks the actual ConditionRefresh re-emit verification.

- opcua_poller.cpp: drop dead ``if (event_subscription_id_ != 0) return``
  guard in setup_event_subscriptions(). Both call sites (start() and the
  poll_loop reconnect arm) zero the field before calling, and the comment
  now says so.

- opcua_client.cpp: clarify in disconnect() that the ``if (connected)`` guard
  guarantees single generation bump even when maybe_mark_disconnected
  already fired on a transport error - the latter uses exchange(false) so
  the second site is a no-op by atomic semantics, not by accident.
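
The idempotent-prelude pattern from the first cleanup, sketched with a directory standing in for the docker network (the real script would run ``docker network rm ... 2>/dev/null || true`` before ``docker network create``):

```shell
set -e
res="$(mktemp -d)/alarm_test_net"   # stand-in for the docker network

prelude() {
  # Tolerate "no such resource" so set -e doesn't abort the run,
  # whether a hard-killed previous run left the resource behind or not.
  rmdir "$res" 2>/dev/null || true
  mkdir "$res"
}

prelude   # fresh machine: rm fails harmlessly, create succeeds
prelude   # leftover from a killed run: rm succeeds, create succeeds
```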
Today ``OpcuaPoller::condition_refresh()`` swallows server failures with a
silent comment ("not fatal - many test servers do not implement"). Real
PLCs hit this too: open62541 v1.4.x returns BadMethodInvalid, Siemens
S7-1500 omits ConditionRefresh entirely, Beckhoff TF6100 status
unconfirmed in public docs. The operator gets no signal that their
``alarm-replay-on-reconnect`` contract is broken.

- PollerConfig gains ``log_warn`` (std::function<void(const std::string&)>),
  optional. The plugin owning the poller wires it to its inherited
  GatewayPlugin::log_warn so messages reach the ROS 2 logger and
  /rosout, not just container stderr.
- ``OpcuaPoller::condition_refresh()`` emits one warn per connect on
  failure (throttled by ``condition_refresh_warned_``) carrying the
  StatusCode and pointing at issue #389. Reset on success so a recovered
  server earns a fresh warn next time it breaks.
- Falls back to ``std::cerr [opcua_poller WARN]`` when the callback is
  not set, preserving observability for unit-test paths that don't go
  through the plugin.

No behavior change for the success path. Live transitions continue to
flow regardless; this is purely operator observability for the
"reconnect doesn't replay state" failure mode.
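
A self-contained sketch of the optional-callback-with-fallback and once-per-connect throttle (hedged: ``PollerConfig``/``RefreshWarner`` here are illustrative reductions; the real poller carries more state):

```cpp
#include <functional>
#include <iostream>
#include <string>
#include <utility>

// Reduced model of the pattern: an optional warn sink wired by the
// owning plugin, with a stderr fallback for unit-test paths.
struct PollerConfig {
  std::function<void(const std::string &)> log_warn;  // optional
};

class RefreshWarner {
public:
  explicit RefreshWarner(PollerConfig cfg) : cfg_(std::move(cfg)) {}

  // One warn per connect: throttled until reset_on_success().
  void on_refresh_failed(const std::string & status) {
    if (warned_) return;
    warned_ = true;
    const std::string msg = "ConditionRefresh failed (" + status +
                            "); alarm replay on reconnect unavailable - see #389";
    if (cfg_.log_warn) cfg_.log_warn(msg);                    // ROS 2 logger path
    else std::cerr << "[opcua_poller WARN] " << msg << "\n";  // fallback path
  }

  // A recovered server earns a fresh warn next time it breaks.
  void reset_on_success() { warned_ = false; }

private:
  PollerConfig cfg_;
  bool warned_ = false;
};
```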
Extract the OPC-UA Part 4 §5.11.2 Call result classification from
``OpcuaClient::call_method`` into two public static helpers:

- ``status_to_method_error(uint32_t code, const std::string & msg)``
  maps an OPC-UA StatusCode to one of {MethodNotFound, InvalidArgument,
  MethodTimeout, TransportError} via the existing dispatch table.
- ``classify_call_result(uint32_t overall, std::vector<uint32_t> args)``
  walks the per-argument results and returns the first non-Good code,
  with overall statusCode taking precedence.

Public statics so the test suite can hit the BadEventIdUnknown branch
that previously had no unit anchor. The bug ``call_method`` was just
fixed for (statusCode=Good + inputArgumentResults[0]=Bad) is exercised
by the docker integration test (run_alarm_tests.sh ack/confirm flow),
but a future refactor that drops the per-arg loop would not be caught
locally. ~30 LoC of pure tests catch this in seconds.

API surface uses ``uint32_t`` instead of ``UA_StatusCode`` so the
public header doesn't pull in open62541 types - the .cpp internally
treats them identically (UA_StatusCode is a uint32_t typedef).
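
The classification walk can be sketched as a pure function (hedged: field and constant names below are illustrative, not the exact signatures in opcua_client.hpp; the BadEventIdUnknown value is from the OPC UA status-code table):

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

constexpr uint32_t kGood = 0x00000000u;
constexpr uint32_t kBadEventIdUnknown = 0x809A0000u;  // OPC UA BadEventIdUnknown

struct CallOutcome {
  bool ok;
  uint32_t code;
  std::string detail;
};

// Overall statusCode takes precedence; otherwise the first non-Good
// per-input-argument result wins (statusCode=Good + args[i]=Bad is the
// case the Acknowledge/Confirm fix guards against).
inline CallOutcome classify_call_result(uint32_t overall,
                                        const std::vector<uint32_t> & args) {
  if (overall != kGood) return {false, overall, "call statusCode"};
  for (std::size_t i = 0; i < args.size(); ++i) {
    if (args[i] != kGood) {
      return {false, args[i], "input arg " + std::to_string(i)};
    }
  }
  return {true, kGood, ""};
}
```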

9 new gtest cases (3 for ``status_to_method_error``, 6 for
``classify_call_result``):
- MethodNotFound / InvalidArgument / Timeout family dispatch
- BadEventIdUnknown stays in TransportError (signals SOVD to retry,
  not reject, since the EventId staling is a transient race)
- Empty arg results, all-Good arg results -> success
- Bad overall status returns error
- BadEventIdUnknown in arg[0] returns error with "input arg 0" message
- First bad arg wins over later bad args
- Overall status beats arg results (transport error short-circuits)

Existing ``call_method`` body is unchanged in behavior; the diagnostic
stderr trace that prints ``statusCode=`` and per-arg codes is preserved
verbatim.

Local verify: ``test_opcua_client`` reports 26/26 PASSED including all
9 new cases.
Verification round on the post-#386 review found 9 untested cells in
the AlarmStateMachine transition matrix (issue #389 follow-up). Adding
focused tests so a future refactor cannot silently break a corner case.

New cells covered:
- ``DisabledClearsHealedAlarm`` - Healed exit via Rule 2 (was_active=true
  for Healed too, so disabling a latched alarm must ClearFault, not just
  flip to Suppressed).
- ``DisabledNoOpWhenAlreadyCleared`` - confirms Cleared+disabled lands at
  Suppressed/NoOp (no spurious second clear).
- ``ShelvedClearsHealedAlarm`` / ``ShelvedNoOpWhenAlreadyCleared`` - same
  pair via Rule 3.
- ``ActiveAlarmReportsConfirmedFromSuppressed`` - operator
  unshelves/re-enables an alarm whose source is still active; Rule 4
  must promote Suppressed -> Confirmed with ReportConfirmed (NOT NoOp).
  This is exactly the unshelve+re-fire path scenario 2 of
  run_alarm_tests.sh exercises end-to-end.
- ``BranchEventFromHealedNoOp`` / ``BranchEventFromSuppressedNoOp`` -
  Rule 1 precedence holds across all four prev_status values.
- ``AckedAndConfirmedNoOpFromSuppressed`` - ``was_active=false`` branch
  of Rule 5: a fully cleared event delivered while suppressed advances
  next_status to Cleared but issues no ClearFault (no fault to clear).
- ``InactiveUnackedFromSuppressedReportsHealed`` - ditto but with the
  ack/confirm bits unset, must surface as Healed/ReportHealed so the
  unfinished ack/confirm workflow item stays visible.

State machine code itself is unchanged. 27 gtest cases, all PASSED.
Pure-function tests, deterministic, no I/O.
Copilot AI left a comment

Pull request overview

Copilot reviewed 22 out of 22 changed files in this pull request and generated 12 comments.

Comment on lines +23 to +26
Skips (exits 77 - the CTest convention for a skipped test) when
``asyncua`` is not importable, so contributors who only iterate on the
plugin sources do not need a Python pip install. CI installs ``asyncua``
in the integration job and observes the test as a real pass / fail.
Copilot AI Apr 26, 2026

run_ctest.py exits with code 77 to skip the smoke test when asyncua is not installed. This introduces an always-skippable test path, which makes it easier for regressions to slip in unnoticed. Prefer making asyncua a declared test dependency (so the test always runs in standard test environments) or move the smoke test to a separate, explicitly-optional test target instead of skipping at runtime.

Copilot uses AI. Check for mistakes.
Comment on lines +226 to +232
TEST(AlarmStateMachineTest, DisabledNoOpWhenAlreadyCleared) {
AlarmEventInput in;
in.enabled_state = false;
auto out = AlarmStateMachine::compute(SovdAlarmStatus::Cleared, in);
EXPECT_EQ(out.next_status, SovdAlarmStatus::Suppressed);
EXPECT_EQ(out.action, AlarmAction::NoOp);
}
Copilot AI Apr 26, 2026

Test name suggests a pure no-op when already cleared, but the expected next_status transitions to Suppressed. If the intended behavior is “no action” but status may change, consider renaming this test to reflect the suppression transition (or adjust expectations to match the name).

Comment on lines +235 to +236
auto callback = [this, &cfg](const std::vector<opcua::Variant> & values, const opcua::NodeId & source_node,
const opcua::NodeId & event_type, const opcua::NodeId & condition_id) {
Copilot AI Apr 26, 2026

In setup_event_subscriptions(), the lambda captures &cfg where cfg is the loop variable of a range-for. This captures a reference to the loop variable (which is reused each iteration and goes out of scope after the loop), so callbacks can read a dangling reference / wrong config when events arrive. Capture the config by value (or capture a stable pointer/reference to the vector element) so each monitored item callback uses the correct AlarmEventConfig.

Suggested change
auto callback = [this, &cfg](const std::vector<opcua::Variant> & values, const opcua::NodeId & source_node,
const opcua::NodeId & event_type, const opcua::NodeId & condition_id) {
auto callback = [this, cfg](const std::vector<opcua::Variant> & values, const opcua::NodeId & source_node,
const opcua::NodeId & event_type, const opcua::NodeId & condition_id) {

Comment on lines +750 to +753
// Bump generation BEFORE freeing the ctx so any in-flight trampoline call
// captured the old generation and will drop its work.
impl_->generation.fetch_add(1, std::memory_order_release);

Copilot AI Apr 26, 2026

remove_event_monitored_item() increments the global generation counter. That will make all other event monitored item contexts fail the generation_snapshot check in the trampoline and silently drop future events, even though their subscriptions are still valid. Consider making the staleness check per-monitored-item (e.g., an active atomic in the context) or only bump generation on full disconnect/remove_subscriptions (not on single MI removal).

Suggested change
// Bump generation BEFORE freeing the ctx so any in-flight trampoline call
// captured the old generation and will drop its work.
impl_->generation.fetch_add(1, std::memory_order_release);
// Do not advance the global generation for removal of a single monitored
// item. The generation is used to invalidate all callback contexts after a
// full reconnect/subscription teardown; bumping it here would make other
// still-active monitored items fail the trampoline staleness check and drop
// future events.

Comment on lines +295 to +296
std::cerr << "[opcua_poller] on_event fault=" << cfg.fault_code << " event_type=" << event_type.toString()
<< " condition=" << condition_id.toString() << " values=" << values.size() << std::endl;
Copilot AI Apr 26, 2026

on_event() currently logs every notification and EventId bytes directly to std::cerr. This can flood logs under high alarm/event rates and bypass the plugin's logging controls. Route these through the plugin/logger (or a configurable debug/trace sink) and avoid dumping per-event hex unless explicitly enabled.

Comment on lines +266 to +271
# Start fault_manager_node first so its services are advertised before
# the gateway opcua plugin tries to call /fault_manager/report_fault.
ros2 run ros2_medkit_fault_manager fault_manager_node \
> /var/lib/ros2_medkit/fault_manager.log 2>&1 &
sleep 3
PLUGIN_PATH=$(find /root/ws/install -name "libros2_medkit_opcua_plugin.so" | head -1)
Copilot AI Apr 26, 2026

The PR description says the docker integration uses polling-with-timeout throughout with “no fixed sleeps”, but this script has a fixed sleep 3 before starting the gateway. Consider replacing it with a poll for the fault_manager service readiness (or other observable health signal) to keep the test deterministic across slower CI runners.

Comment on lines +901 to +912
{
const auto * bytes = runtime->latest_event_id.data();
std::cerr << "[opcua_plugin] " << operation_name << " EventId len=" << runtime->latest_event_id.length()
<< " hex=";
for (size_t i = 0; i < std::min<size_t>(runtime->latest_event_id.length(), 16); ++i) {
char buf[3];
std::snprintf(buf, sizeof(buf), "%02x", static_cast<unsigned>(bytes[i]) & 0xffu);
std::cerr << buf;
}
std::cerr << " conditionId=" << runtime->condition_id.toString() << std::endl;
}

Copilot AI Apr 26, 2026

execute_operation() prints the EventId bytes and ConditionId to std::cerr for every acknowledge/confirm request. This is very verbose and bypasses ROS logging; it will also include user-supplied comment indirectly in downstream logs. Consider removing this block or switching it to debug-level logging behind a flag.

Suggested change
{
const auto * bytes = runtime->latest_event_id.data();
std::cerr << "[opcua_plugin] " << operation_name << " EventId len=" << runtime->latest_event_id.length()
<< " hex=";
for (size_t i = 0; i < std::min<size_t>(runtime->latest_event_id.length(), 16); ++i) {
char buf[3];
std::snprintf(buf, sizeof(buf), "%02x", static_cast<unsigned>(bytes[i]) & 0xffu);
std::cerr << buf;
}
std::cerr << " conditionId=" << runtime->condition_id.toString() << std::endl;
}

Comment on lines +71 to +74
/// Retain is intentionally NOT used here. Per Part 9 §5.5.2.10 it controls
/// visibility during ConditionRefresh bursts, not lifecycle - the poller
/// strips Retain=false events delivered between RefreshStartEvent and
/// RefreshEndEvent before invoking compute().
Copilot AI Apr 26, 2026

The comment says the poller “strips Retain=false events delivered between RefreshStartEvent and RefreshEndEvent before invoking compute()”, but the current event select clauses don’t include Retain and on_event() doesn’t filter on it. Either implement the Retain-based filtering or update this comment so it matches the actual behavior.

Suggested change
/// Retain is intentionally NOT used here. Per Part 9 §5.5.2.10 it controls
/// visibility during ConditionRefresh bursts, not lifecycle - the poller
/// strips Retain=false events delivered between RefreshStartEvent and
/// RefreshEndEvent before invoking compute().
/// Retain is intentionally not modeled by this state machine and does not
/// affect ``compute()``. Per Part 9 §5.5.2.10 it controls visibility during
/// ConditionRefresh bursts rather than the lifecycle mapping implemented
/// here.

Comment on lines +539 to +544
// ReportFault.srv (only FAILED/PASSED), so we mark this as a PASSED
// event - fault_manager keeps the entry in HEALED state until
// confirmed, mirroring the lifecycle.
log_info("AlarmCondition HEALED (latched, awaiting ack/confirm): " + delivery.fault_code);
// No-op for now; fault_manager will keep the fault HEALED until
// CLEARED. The state transition is observable via /faults/stream.
Copilot AI Apr 26, 2026

In AlarmAction::ReportHealed handling, the comment says we “mark this as a PASSED event”, but the implementation is currently a no-op (only logs). Please either emit the intended PASSED/HEALED signal to fault_manager or adjust the comment to reflect that HEALED is intentionally not reported via ReportFault at the moment.

Suggested change
// ReportFault.srv (only FAILED/PASSED), so we mark this as a PASSED
// event - fault_manager keeps the entry in HEALED state until
// confirmed, mirroring the lifecycle.
log_info("AlarmCondition HEALED (latched, awaiting ack/confirm): " + delivery.fault_code);
// No-op for now; fault_manager will keep the fault HEALED until
// CLEARED. The state transition is observable via /faults/stream.
// ReportFault.srv (only FAILED/PASSED), and this plugin does not
// currently emit a PASSED event for the latched HEALED transition.
// For now we only log the transition here; any externally visible
// lifecycle updates come from subsequent alarm events.
log_info("AlarmCondition HEALED (latched, awaiting ack/confirm): " + delivery.fault_code);
// Intentional no-op: HEALED is not currently reported via ReportFault.

Comment on lines +337 to +345
// Mutual-exclusion check: an entry under ``nodes:`` carrying both a
// ``threshold`` alarm and an ``alarm_source`` is a configuration error
// (the threshold path polls scalar values; the alarm_source path
// subscribes to native events). Reject the whole file rather than guess.
for (const auto & node : (nodes ? nodes : YAML::Node{})) {
if (node["alarm_source"] && node["alarm"] && node["alarm"]["threshold"]) {
RCLCPP_ERROR(rclcpp::get_logger("opcua.node_map"),
"Entry node_id=%s declares both threshold alarm and alarm_source - mutually exclusive",
node["node_id"] ? node["node_id"].as<std::string>().c_str() : "<unknown>");
Copilot AI Apr 26, 2026

This loader comment/error mentions a nodes: entry declaring both a threshold alarm and alarm_source, but alarm_source is now a top-level event_alarms: field. As written, a mistaken alarm_source under nodes: is mostly ignored unless it also has an alarm.threshold. Consider explicitly rejecting any alarm_source key under nodes: with an error that points users to event_alarms: (and/or validate mutual exclusivity across the two YAML forms more directly).

Suggested change
// Mutual-exclusion check: an entry under ``nodes:`` carrying both a
// ``threshold`` alarm and an ``alarm_source`` is a configuration error
// (the threshold path polls scalar values; the alarm_source path
// subscribes to native events). Reject the whole file rather than guess.
for (const auto & node : (nodes ? nodes : YAML::Node{})) {
if (node["alarm_source"] && node["alarm"] && node["alarm"]["threshold"]) {
RCLCPP_ERROR(rclcpp::get_logger("opcua.node_map"),
"Entry node_id=%s declares both threshold alarm and alarm_source - mutually exclusive",
node["node_id"] ? node["node_id"].as<std::string>().c_str() : "<unknown>");
// Schema validation: ``alarm_source`` is only supported in the top-level
// ``event_alarms:`` section. Reject any legacy or misplaced
// ``alarm_source`` field under ``nodes:`` explicitly so configuration
// mistakes are not silently ignored.
for (const auto & node : (nodes ? nodes : YAML::Node{})) {
if (node["alarm_source"]) {
RCLCPP_ERROR(
rclcpp::get_logger("opcua.node_map"),
"Entry node_id=%s uses alarm_source under nodes:, which is invalid; "
"move this configuration to top-level event_alarms:",
node["node_id"] ? node["node_id"].as<std::string>().c_str() : "<unknown>");


Development

Successfully merging this pull request may close these issues.

ros2_medkit_opcua: subscribe to native OPC-UA AlarmConditions and bridge to fault_manager

2 participants