feat: nav2 sensor fix OTA demo #58

Draft
bburda wants to merge 32 commits into main from feat/ota-nav2-sensor-fix

bburda commented Apr 26, 2026

Summary

  • Adds demos/ota_nav2_sensor_fix/ end-to-end OTA demo
  • Bundles a dev-grade ota_update_plugin C++ gateway plugin (UpdateProvider + GatewayPlugin)
  • Update / Install / Uninstall operations derived from SOVD ISO 17978-3 components metadata (updated_components / added_components / removed_components)
  • Minimal FastAPI artifact server + pack_artifact.py CLI for building tarballs and catalog entries
  • Two-service docker-compose.yml (gateway + update server); nav2 / Foxglove are bring-your-own (documented in README)

Out of scope (deliberate, dev-grade positioning)

  • Artifact signing / verification
  • Atomic swap or A/B partition rollout
  • Persistent update state across gateway restarts
  • Fleet-wide staging
  • Audit logging
  • Automated health-gated rollback

Test plan / verification

Unit & integration tests (all clean):

  • pytest -v for pack_artifact.py (16 tests)
  • pytest -v for ota_update_server (5 tests)
  • colcon test for ota_update_plugin (24 GTest cases)
  • All four demo ROS 2 packages build clean under -Wall -Wextra -Wpedantic -Wshadow -Wconversion
  • build_artifacts.sh produces a 3-entry catalog + tarballs end-to-end

End-to-end smoke (docker compose up --build, plugin against live gateway):

  • Plugin loads and registers as UpdateProvider (gateway logs: "Update backend provided by plugin")
  • Boot poll fetches /catalog and registers all 3 catalog entries
  • Update flow: PUT /updates/fixed_lidar_2_1_0/prepare && /execute kills broken_lidar_node and spawns fixed_lidar_node (verified via PID checks)
  • Install flow: PUT /updates/obstacle_classifier_v2_1_0_0/prepare && /execute swaps files and spawns obstacle_classifier_node
  • Uninstall flow: PUT /updates/broken_lidar_legacy_remove/execute is accepted by the gateway, but the SOVD UpdateManager state machine stops at the prepared phase because of the plugin's no-op prepare() semantics. The plugin's uninstall code path is unit-tested; full integration needs either an explicit prepare step from the panel UI or a small adjustment to the plugin's prepare() for uninstall. Tracked as a follow-up; this does not block the demo's main Update narrative.
  • obstacle_classifier_node runtime: spawned process crashes on visualization_msgs fastcdr ABI mismatch in the runtime image. The install mechanism is verified; the runtime crash is a ros-jazzy-visualization-msgs packaging issue separate from the plugin. Workaround: rebuild the runtime stage with matching fastcdr versions.
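
The update and install flows above come down to two PUT requests per update id. A minimal Python sketch of that sequence follows. It assumes the gateway is reachable at http://localhost:8080/api/v1 (the host and path prefix are assumptions, not taken from the demo scripts) and that neither call needs a request body; `build_flow` and `run_flow` are hypothetical helper names.

```python
import urllib.request

# Assumed base URL; adjust host, port and prefix to match your gateway.
GATEWAY = "http://localhost:8080/api/v1"


def build_flow(update_id: str, base: str = GATEWAY) -> list:
    """Return the PUT requests for prepare followed by execute."""
    return [
        urllib.request.Request(f"{base}/updates/{update_id}/{step}", method="PUT")
        for step in ("prepare", "execute")
    ]


def run_flow(update_id: str, base: str = GATEWAY) -> None:
    """Send prepare, then execute; urlopen raises on an HTTP error status."""
    for request in build_flow(update_id, base):
        with urllib.request.urlopen(request, timeout=10) as response:
            print(request.full_url, response.status)
```

Calling `run_flow("fixed_lidar_2_1_0")` against a running stack would issue both requests in order. A real client would probably check the update's status between the two steps rather than firing them back to back.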

Notes

  • Builds against selfpatch/ros2_medkit main for the gateway sources (clone happens at image build time)
  • __has_include shim in ota_update_plugin.hpp covers both gateway header layouts (providers/ vs updates/)
  • pgrep matches against /proc/<pid>/cmdline argv[0] basename (not comm, which the kernel truncates to 15 chars)
  • WSL2 + Docker Desktop bind mounts are unreliable; artifacts/ is baked into the update_server image at build time
  • Pairs with the ros2_medkit_foxglove_extension Updates panel PR (fix(docker): add missing ros2_medkit components and submodules #6 of that repo)
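
To illustrate the pgrep note above: the plugin itself is C++, but the matching rule can be sketched in a few lines of Python. This is an illustration of the technique rather than the plugin's code; `argv0_basename` and `pgrep` are hypothetical names, and the scan only works on Linux.

```python
import os


def argv0_basename(cmdline: bytes) -> str:
    """Return the basename of argv[0] from raw /proc/<pid>/cmdline bytes."""
    argv0 = cmdline.split(b"\0", 1)[0]  # arguments are NUL-separated
    return os.path.basename(argv0.decode(errors="replace"))


def pgrep(executable: str, proc_root: str = "/proc") -> list:
    """List PIDs whose argv[0] basename equals `executable`."""
    pids = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as /proc/meminfo
        try:
            with open(os.path.join(proc_root, entry, "cmdline"), "rb") as handle:
                if argv0_basename(handle.read()) == executable:
                    pids.append(int(entry))
        except OSError:
            continue  # the process exited or is not readable
    return sorted(pids)
```

Matching on comm would compare against at most 15 characters, so a 17-character name such as broken_lidar_node could never match exactly.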

bburda added 30 commits April 26, 2026 17:47
Implements GatewayPlugin + UpdateProvider for the OTA demo. Polls a
FastAPI catalog at boot and supports update / install / uninstall
operations derived from SOVD ISO 17978-3 metadata.

Process model: SIGTERM old executable, swap files on disk, fork+exec
new executable. No lifecycle commands. Operation kind is classified
from updated_components / added_components / removed_components.
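
The classification step can be sketched in Python as below. The three field names come from the metadata described above; the rule that exactly one populated list selects the operation, with everything else falling through to unknown, is an assumption and may differ from what OperationDispatcher does.

```python
def classify(entry: dict) -> str:
    """Map SOVD component metadata to an operation kind."""
    updated = bool(entry.get("updated_components"))
    added = bool(entry.get("added_components"))
    removed = bool(entry.get("removed_components"))
    # Assumption: exactly one populated list selects the operation;
    # none, or several at once, is treated as unknown.
    flags = (updated, added, removed)
    if flags == (True, False, False):
        return "update"
    if flags == (False, True, False):
        return "install"
    if flags == (False, False, True):
        return "uninstall"
    return "unknown"
```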

Components:
- OtaUpdatePlugin: list/get/register/delete/prepare/execute/supports_automated
- CatalogClient: cpp-httplib GET /catalog and artifact download, with parse_url
- OperationDispatcher: SOVD metadata -> Update/Install/Uninstall/Unknown
- ProcessRunner: pgrep via /proc, kill_by_executable with SIGTERM->SIGKILL
  fallback, fork+exec spawn
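
The SIGTERM->SIGKILL fallback can be illustrated with Python's subprocess module. This sketch only handles processes the caller started itself, whereas ProcessRunner kills by executable name; the 2-second grace period and the names `spawn` and `stop` are assumptions.

```python
import signal
import subprocess


def spawn(argv: list) -> subprocess.Popen:
    """Start a new process (subprocess uses fork+exec on POSIX)."""
    return subprocess.Popen(argv)


def stop(proc: subprocess.Popen, grace_s: float = 2.0) -> bool:
    """Send SIGTERM, wait up to grace_s, then fall back to SIGKILL.

    Returns True if the SIGKILL fallback was needed.
    """
    proc.send_signal(signal.SIGTERM)
    try:
        proc.wait(timeout=grace_s)
        return False
    except subprocess.TimeoutExpired:
        proc.send_signal(signal.SIGKILL)
        proc.wait()  # reap the child so it does not linger as a zombie
        return True
```

Waiting on the child matters here: a terminated child that has not been reaped still has a PID, so a liveness check based only on the PID would wrongly conclude that SIGTERM did not work.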

21 gtests pass (7 dispatcher, 6 parse_url, 8 plugin smoke).

Adds optional --replaces-executable flag to pack_artifact.py and threads it
into the catalog entry as x_medkit_replaces_executable when kind=update.
This lets the gateway plugin kill the OLD executable (broken_lidar_node)
before spawning the NEW one (fixed_lidar_node) when the two live in
separate ROS 2 packages.
…process

When a SOVD update package swaps a node across ROS 2 packages (e.g.
broken_lidar -> fixed_lidar), the OLD process binary basename differs
from the new one. Read x_medkit_replaces_executable from the entry
metadata before issuing the kill, falling back to x_medkit_executable
when the field is absent (in-package upgrades).
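
A minimal Python sketch of that fallback, assuming the entry metadata is available as a plain dict (`kill_target` is a hypothetical name; the plugin does this in C++):

```python
from typing import Optional


def kill_target(entry: dict) -> Optional[str]:
    """Pick the executable to kill before spawning the new one."""
    # Cross-package swaps name the OLD executable explicitly; in-package
    # upgrades omit the field, so the new executable's name is reused.
    return entry.get("x_medkit_replaces_executable") or entry.get("x_medkit_executable")
```

For the broken_lidar -> fixed_lidar case this returns broken_lidar_node, while an entry that carries only x_medkit_executable returns that value.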
…ime libs in image

- ProcessRunner::pgrep now reads the argv[0] basename from /proc/<pid>/cmdline
  instead of /proc/<pid>/comm, which the kernel truncates to 15 chars, so
  'broken_lidar_node' would never match.
- plugin_exports.cpp exports get_update_provider so the gateway's plugin_loader
  can resolve the UpdateProvider interface across the dlopen boundary without
  relying on dynamic_cast.
- Dockerfile.gateway: drop --symlink-install (broke multi-stage COPY) and add
  runtime libs (libcpp-httplib, libsystemd, nlohmann-json3, lifecycle, test_msgs).
- ota_update_server Dockerfile: bake artifacts/ into image (WSL2 + Docker
  Desktop bind mounts unreliable).
- Compose: gateway port configurable via OTA_GATEWAY_PORT (default 8080).

Verified via end-to-end smoke against the live stack:
- Plugin loads and reports as UpdateProvider
- Boot poll registers all 3 catalog entries
- Update flow kills broken_lidar_node and spawns fixed_lidar_node

pack_artifact.py was emitting 'name' (not in SOVD ISO 17978-3 - spec uses
'update_name') and 'version' (not a SOVD field at all). Spec-compliant
clients (ros2_medkit_web_ui, the Foxglove updates panel) expect
update_name; vendor-specific data lives under x_medkit_*.
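
The rename can be pictured with a short Python sketch. It is not the code in pack_artifact.py; `to_sovd_entry` is a hypothetical helper, and x_medkit_version is the vendor field name that the smoke test later checks for.

```python
def to_sovd_entry(legacy: dict) -> dict:
    """Rename the non-spec keys; all other keys pass through unchanged."""
    entry = dict(legacy)  # copy so the caller's dict is not mutated
    if "name" in entry:
        entry["update_name"] = entry.pop("name")
    if "version" in entry:
        entry["x_medkit_version"] = entry.pop("version")
    return entry
```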

Confirmed against the live demo gateway: the web UI renders the updated
shape, and all 3 catalog entries are visible end-to-end.
…nst gateway

Verifies the canonical SOVD client flow that the Foxglove updates panel
mirrors: connect form, /api/v1/updates returns {items: [<id>]}, per-id
/status calls, all 3 catalog entries render in the dashboard.

Adopt the same script convention as sensor_diagnostics, multi_ecu_aggregation,
and turtlebot3_integration:

  ./run-demo.sh           build artifacts + bring up gateway + nodes + update
                          server (daemon mode by default, --attached for fg)
  ./stop-demo.sh          tear down (-v removes volumes, --images removes
                          built images)
  ./check-demo.sh         show registered updates + per-id status + live
                          plugin-managed processes inside the gateway
                          container
  ./trigger-update.sh     broken_lidar -> fixed_lidar (the headline)
  ./trigger-install.sh    install obstacle_classifier_v2 from scratch
  ./trigger-uninstall.sh  remove broken_lidar_legacy

OTA_GATEWAY_PORT (or OTA_GATEWAY_URL for full override) lets the user
sidestep collisions with another gateway on host port 8080.
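
The override order can be sketched in Python, although the demo scripts themselves are shell. The http scheme and localhost host are assumptions, and `gateway_url` is a hypothetical name.

```python
import os


def gateway_url(env=os.environ) -> str:
    """Resolve the gateway base URL from the two override variables."""
    url = env.get("OTA_GATEWAY_URL")
    if url:
        return url.rstrip("/")  # a full override wins over the port
    port = env.get("OTA_GATEWAY_PORT", "8080")
    return f"http://localhost:{port}"  # assumed scheme and host
```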

README quickstart updated to point at run-demo.sh.

bburda added 2 commits April 27, 2026 10:28
…tern

tests/smoke_test_ota.sh asserts:
  - gateway /health 200
  - gateway log says 'Update backend provided by plugin' (no 'no provider' warn)
  - GET /updates returns SOVD {items: [<id>]} envelope with all 3 catalog ids
  - GET /updates/{id} detail uses spec field names: update_name (not 'name'),
    x_medkit_version (not bare 'version'), updated_components for kind,
    x_medkit_replaces_executable threaded through pack_artifact
  - update flow: PUT prepare + execute kills broken_lidar_node and
    spawns fixed_lidar_node inside the gateway container
  - install flow: spawns obstacle_classifier_node
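
The envelope assertion above can be expressed as a small Python check. The smoke test itself is a shell script, so this is only an illustration; `missing_ids` is a hypothetical helper, and the three ids are the ones used in the trigger flows of this PR.

```python
EXPECTED_IDS = {
    "fixed_lidar_2_1_0",
    "obstacle_classifier_v2_1_0_0",
    "broken_lidar_legacy_remove",
}


def missing_ids(envelope: dict, expected: set = EXPECTED_IDS) -> set:
    """Return the expected ids absent from a {"items": [<id>]} envelope."""
    items = envelope.get("items")
    if not isinstance(items, list):
        raise ValueError("response is not an {items: [...]} envelope")
    return set(expected) - set(items)
```

An empty result means every catalog id is registered; a bare list or a differently named key raises instead of silently passing.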

ci.yml gets a build-and-test-ota job following the same shape as the other
per-demo jobs: checkout -> install Python + ROS Jazzy on the runner ->
build_artifacts.sh -> docker compose up -d --build -> run smoke ->
log dumps on failure -> teardown.