Implement Choria Transport (Phases 1 and 2) #206

Open
nmburgan wants to merge 7 commits into main from choria

Conversation


nmburgan commented Mar 27, 2026

Adds a Choria transport to OpenBolt, enabling task execution, command running, and script execution on nodes via Choria's NATS pub/sub messaging as an alternative to SSH and WinRM. This implements phases 1 and 2 of the transport plan (docs/choria-transport-plan.md).

Rather than opening direct connections to each node, OpenBolt sends MCollective RPC requests through a NATS broker. Nodes running the Choria server execute the requests via Ruby MCollective agents and return results over the same messaging bus. This scales well to large fleets and works through NAT/firewalls since nodes only need outbound connectivity to the broker.

Documentation (docs/)

  • choria-transport.md: User guide covering configuration, usage, and examples
  • choria-transport-dev.md: Developer guide for architecture, data flow, and code patterns
  • choria-transport-plan.md: Project plan with phased roadmap and progress tracking
  • choria-transport-testing.md: Test environment setup with OpenVox/Choria infrastructure configuration and manual verification steps

Phase 1: bolt_tasks agent

Phase 1 delivers task execution via the bolt_tasks agent, which downloads
task files from an OpenVox/Puppet Server and executes them on target nodes.

  • run_task via bolt_tasks agent with async execution and polling
  • run_command, run_script return clear per-target errors when the shell
    agent is not available (rather than crashing)
  • upload, download return "not yet supported" errors
  • Connectivity checking via rpcutil.ping
  • Agent detection with per-target caching
  • Client configuration with NATS, TLS, and collective overrides
  • Config class with validation for all transport options
  • Transport and config registration in OpenBolt's executor and config systems
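
As a rough illustration, the Phase 1 options could appear in an inventory file along these lines (a sketch based on the option names in this PR; docs/choria-transport.md is the authoritative list):

```yaml
# Hypothetical inventory fragment -- option names follow the PR text.
config:
  transport: choria
  choria:
    config-file: /etc/choria/client.cfg
    collective: mcollective
    task-agent: bolt_tasks
targets:
  - node1.example.com
  - node2.example.com
```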

Phase 2: shell agent

Phase 2 adds command and script execution via the shell agent, plus an
alternative task execution path that uploads task files directly instead of
downloading from an OpenVox/Puppet Server.

  • run_command with async execution, timeout, and process kill on timeout
  • run_script with remote tmpdir creation, script upload via base64, and
    cleanup
  • run_task via shell agent with support for all input methods (environment,
    stdin, both)
  • Deterministic agent selection via task-agent config and --choria-task-agent
    CLI flag (no automatic fallback between agents)
  • Batched shell polling via shell.list + shell.statuses for scalability
  • Platform-aware command builders for POSIX and Windows (PowerShell)
  • Interpreter support via the interpreters config option
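
The base64 upload step used by run_script can be sketched as follows. This is hypothetical Ruby, not the PR's actual command_builders.rb: the function name, argument list, and exact command sequence are illustrative.

```ruby
require 'base64'
require 'shellwords'

# Sketch of a POSIX script-runner command. The script body is base64
# encoded so it survives transport as a plain string, decoded into a
# remote tmpdir, marked executable, and then executed with its args.
def posix_script_command(tmpdir, script_name, script_body, args = [])
  remote_path = File.join(tmpdir, script_name)
  encoded     = Base64.strict_encode64(script_body) # no embedded newlines
  [
    "mkdir -p #{Shellwords.escape(tmpdir)}",
    "echo #{encoded} | base64 -d > #{Shellwords.escape(remote_path)}",
    "chmod 0700 #{Shellwords.escape(remote_path)}",
    Shellwords.join([remote_path] + args)
  ].join(' && ')
end
```

A Windows builder would follow the same shape but emit PowerShell (e.g., `[Convert]::FromBase64String` and `Set-Content`) instead of `base64 -d`.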

Shared infrastructure

  • Client management (client.rb): MCollective RPC client setup with auto-detected or explicit Choria config, NATS/TLS overrides, collective routing, and thread-safe one-time initialization
  • Agent discovery (agent_discovery.rb): Per-target agent detection and version checking with caching, OS family detection for platform-specific command building
  • Command builders (command_builders.rb): POSIX and Windows command construction for tasks, scripts, file uploads, directory management, and environment variable injection
  • Helpers (helpers.rb): Shared polling with configurable retries and timeout, result building, input validation (path traversal, env key injection, null bytes)
  • Config (config/transport/choria.rb): Transport configuration with validation for all options including SSL overrides, timeout settings, agent selection, tmpdir, and interpreters
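
A minimal sketch of the kind of input validation helpers.rb performs. The names and exact rules below are illustrative assumptions, not the PR's actual code:

```ruby
# Hypothetical validation helpers in the spirit of helpers.rb.
module ChoriaInputValidation
  # Env keys must look like plain identifiers so they cannot inject
  # into a shell assignment.
  ENV_KEY = /\A[A-Za-z_][A-Za-z0-9_]*\z/

  module_function

  # Reject null bytes anywhere in a value destined for a remote shell.
  def safe_value?(value)
    !value.include?("\0")
  end

  def safe_env_key?(key)
    !!(key =~ ENV_KEY)
  end

  # Reject names that could traverse out of the remote tmpdir.
  def safe_filename?(name)
    !name.include?('/') && !name.include?("\0") &&
      name != '.' && name != '..'
  end
end
```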

Transport config options

See docs for a list of all config options. I tried to expose all of the relevant knobs, but if you can think of others that should get added, let me know.

Key design decisions

  • Use the Base class rather than Simple: The Simple class assumes the model used by the SSH and WinRM transports, where one thread per target handles the connection, execution, and cleanup for that single target. That doesn't work with Choria's architecture and would be far too inefficient. This transport uses the Base class and implements its own batching. It means more code, but it is necessary to take advantage of Choria's scalability.
  • Deterministic agent routing: No fallback from bolt_tasks to shell. If the configured agent isn't available, the target gets a clear error. Mixed fleets (some nodes with shell, some without) produce per-target success/failure results, not crashes. I considered trying to do an automatic fallback from bolt_tasks to shell if the task isn't available on the server, but this added a fair bit of extra complexity, and it's probably better to give the user more control over exactly how a task is run anyway.
  • Partitioned functionality and graceful failure handling: Aligning with Choria’s philosophy, functionality is narrowly scoped by the agents installed on target nodes. If a node doesn’t have the agent needed for an action, the action fails for that target in a graceful way.
  • Batched polling and fetching of results: shell.list for O(1) status checks per poll round, shell.statuses for batched output retrieval. Avoids per-target polling overhead. The shell.statuses action is a new action for the shell agent, which is why version 1.2.0 is required. Otherwise, fetching results from nodes at scale would have been very slow and cumbersome.
  • Collective-based batching, not concurrency-based: Unlike SSH/WinRM, targets are grouped by Choria collective (typically one group), and each batch uses MCollective's native multi-node RPC fanout. OpenBolt's --concurrency flag doesn't apply; all targets in a collective are addressed in a single RPC call.
  • Shell DDL bundled with OpenBolt: The shell agent DDL is shipped in lib/mcollective/agent/shell.ddl and preloaded during client setup, so users don't need to install it separately on the controller. The bolt_tasks DDL comes from the choria-mcorpc-support gem which is already an OpenBolt dependency.
  • Code readability and maintainability: I tried to strike the right balance between keeping the code easy to follow and encapsulating logic where it makes sense, without adding too many layers of indirection for any one action. Keep nesting to a sane level, don't mutate objects passed by reference into remote function calls, and don't require keeping too much state in your head to understand what the code is doing.
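
The batched polling loop can be sketched generically, with the per-round RPC call (shell.list) abstracted as a callable so the control flow can be followed without a live broker. This is hypothetical code, not the PR's helpers.rb; the function and its parameters are illustrative:

```ruby
require 'timeout'

# One list call per poll round checks all pending handles at once,
# instead of polling each target individually.
def poll_until_done(pending, timeout:, interval: 1, &list_finished)
  deadline = Time.now + timeout
  done = []
  until pending.empty?
    if Time.now > deadline
      raise Timeout::Error, "#{pending.size} targets still running"
    end
    finished = list_finished.call(pending) # e.g. one shell.list RPC
    done.concat(finished)
    pending -= finished
    sleep interval unless pending.empty?
  end
  done
end
```

In the real transport, the finished handles from a round would then be fetched in one batched shell.statuses call rather than one request per target.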

Future phases (not in this PR)

  • Phase 3: Getting this integrated into the foreman_openbolt/smart_proxy_openbolt Foreman plugins.
  • Phase 4: Implementing upload/download with a new file-transfer Choria agent.
  • Phase 5: Full plan support, including apply blocks

Implements phases 1 and 2 of the Choria transport, enabling OpenBolt to
run tasks, commands, and scripts on nodes via Choria's NATS pub/sub
messaging as an alternative to SSH and WinRM.

Phase 1 (bolt_tasks agent): Downloads task files to targets from an
OpenVox/Puppet Server and executes them using the bolt_tasks Choria agent.

Phase 2 (shell agent): Executes commands, scripts, and tasks through the
Choria shell agent. This allows running tasks not available on an
OpenVox/Puppet server.

Everything is implemented as asynchronously as possible, aligning with
Choria's model, and is built to run at scale across many thousands of
nodes at once.

See docs in a later commit for details on the phases of this project as
well as user-facing and developer documentation.

Attempts to minimize stubbing (although we still need a fair bit) and
use the choria-mcorpc-support gem as much as possible.
- choria-transport.md: User guide covering configuration, usage, and examples
- choria-transport-dev.md: Developer guide for architecture, data flow, and patterns
- choria-transport-plan.md: Project plan with phased roadmap and progress tracking
- choria-transport-testing.md: Test environment setup for manual verification

Add CLI flags for all Choria transport options so they can be passed on
the command line. CLI flags use a choria- prefix for clarity (e.g.,
--choria-config-file, --choria-ssl-ca), while internal option keys
remain unprefixed so inventory files stay clean (e.g.,
choria: { config-file: /path }).

Rename choria-agent to task-agent since it only applies to task
execution. The CLI flag becomes --choria-task-agent.

New CLI flags:
  --choria-task-agent, --choria-config-file, --choria-ssl-ca,
  --choria-ssl-cert, --choria-ssl-key, --choria-collective,
  --choria-puppet-environment, --choria-rpc-timeout,
  --choria-task-timeout, --choria-command-timeout,
  --nats-servers, --nats-connection-timeout

The nats-* flags are not prefixed since they are already clearly
Choria-specific. Shared options (cleanup, tmpdir, host, interpreters)
are unchanged.
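
The prefix mapping itself is simple; a hypothetical sketch of how a CLI switch maps back to its internal option key (illustrative only, not the actual BoltOptionParser code):

```ruby
# Switches carrying the choria- prefix, per the list above.
CHORIA_PREFIXED = %w[
  choria-task-agent choria-config-file choria-ssl-ca choria-ssl-cert
  choria-ssl-key choria-collective choria-puppet-environment
  choria-rpc-timeout choria-task-timeout choria-command-timeout
].freeze

# Strip the prefix to recover the inventory/config key; nats-* and
# shared options pass through unchanged.
def internal_key(cli_switch)
  cli_switch.sub(/\Achoria-/, '')
end
```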

BoltOptionParser::OPTIONS[:choria] needs CLI switch names (e.g.,
choria-config-file), not internal keys (config-file), so that
remove_excluded_opts correctly includes them in --help output.
Also fix task-agent -> choria-task-agent in the task run flags list.

The 11 new Choria flags added to ACTION_OPTS increase the parameter
count for bolt apply, bolt command, and bolt file.
