256 changes: 256 additions & 0 deletions .github/workflows/rolling-update.yml
@@ -0,0 +1,256 @@
name: Rolling update

# Manually-triggered production rollout. Joins the Tailnet, SSHes over
# MagicDNS into each node, and invokes scripts/rolling-update.sh.
# See docs/design/2026_04_24_proposed_deploy_via_tailscale.md.

on:
  workflow_dispatch:
    inputs:
      ref:
        description: Git ref (tag or sha) to deploy. Also used as the image tag unless image_tag is set.
        required: true
        type: string
      image_tag:
        description: Override the image tag (default = ref). Used for rollbacks.
        required: false
        type: string
        default: ""
      nodes:
        description: Comma-separated raft IDs to roll (e.g. "n1,n2"). Empty = all nodes in NODES_RAFT_MAP.
        required: false
        type: string
        default: ""
      dry_run:
        description: Render the plan and run a reachability check only; do NOT touch containers.
        required: true
        type: boolean
        default: true

permissions:
  contents: read
  id-token: write # required by tailscale/github-action OIDC flow
Comment on lines +30 to +32
P1 Badge Add packages:read for GHCR manifest check

This workflow narrows GITHUB_TOKEN to contents and id-token, which implicitly removes package scope, but the Verify image exists on ghcr.io step authenticates to GHCR and inspects a manifest with that token. In environments where the image is private (or package auth is otherwise required), docker login/docker manifest inspect will fail with authorization errors before rollout begins, so deploys are blocked even when the image exists.

  packages: read # required by `docker manifest inspect` on ghcr.io private images

concurrency:
  group: rolling-update
  cancel-in-progress: false

jobs:
  deploy:
    runs-on: ubuntu-latest
    # Approval gate: see GitHub environment settings for required reviewers.
    # Dry-runs also use this environment so the secret wiring is identical;
    # the environment's approval rule should be configured to auto-approve
    # dry-runs if that distinction is desired (GitHub UI: "Deployment
    # protection rules").
    environment: production
    timeout-minutes: 60

    steps:
      # The deploy script (scripts/rolling-update.sh) is executed from the
      # checkout below, after the tailnet join and SSH key load. If `ref`
      # were unvalidated, anyone with workflow_dispatch permission could
      # point it at a fork commit containing a modified script that
      # harvests the SSH key / Tailscale OAuth secret. Validate that
      # `ref` resolves to (a) the repository's default branch, (b) a
      # tag on the repo, or (c) a full commit sha reachable from the
      # default branch, before we hand it any secret. Branches other
      # than the default are rejected so review-gated default is the only
      # entry point besides immutable tags.
      - name: Validate ref is default branch, a tag, or a default-branch commit
        env:
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
          REF: ${{ inputs.ref }}
        run: |
          set -euo pipefail
          default_branch=$(gh api "repos/${{ github.repository }}" --jq '.default_branch')
          default_sha=$(gh api "repos/${{ github.repository }}/commits/$default_branch" --jq '.sha')
          if [[ "$REF" == "$default_branch" || "$REF" == "$default_sha" ]]; then
            echo "ref is the default branch ($default_branch / $default_sha)"
            exit 0
          fi
          if gh api "repos/${{ github.repository }}/git/refs/tags/$REF" >/dev/null 2>&1; then
            echo "ref is a tag"
            exit 0
          fi
          # Also accept a full commit sha that is reachable from the default
          # branch's HEAD, so historical default-branch commits remain
          # deployable for rollback. Comparing default_sha...REF reports
          # "behind" or "identical" only when REF is an ancestor of (or equal
          # to) the default branch's HEAD. Branch names and shas that live
          # only on other branches fall through to the error below.
          if [[ "$REF" =~ ^[0-9a-f]{40}$ ]]; then
            status=$(gh api "repos/${{ github.repository }}/compare/${default_sha}...${REF}" --jq '.status' 2>/dev/null || true)
            if [[ "$status" == "behind" || "$status" == "identical" ]]; then
              echo "ref is a commit on the default branch"
              exit 0
            fi
          fi
          echo "::error::ref '$REF' is not the default branch, a tag, or a commit reachable from '$default_branch'. Branches other than '$default_branch' are disallowed to prevent arbitrary-code execution with production secrets."
          exit 1

      - name: Checkout
        uses: actions/checkout@v6
        with:
          ref: ${{ inputs.ref }}
Comment on lines +85 to +87
P1 Badge Restrict deploy script checkout to trusted refs

The workflow checks out ${{ inputs.ref }} and later executes ./scripts/rolling-update.sh after loading production credentials (SSH key, Tailscale OAuth secret) and joining the tailnet, so ref is effectively arbitrary code execution, not just a deployment selector. If someone can dispatch runs and supply a branch/tag they control, a modified script in that ref can exfiltrate secrets or perform unintended actions once the environment is approved. Keep execution on a trusted protected revision (or strictly validate allowed refs) and use the input only for image selection.
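
As a rough illustration of the "strictly validate allowed refs" option, the sketch below classifies a requested ref against an allow-list before anything is checked out. `classify_ref` and its arguments are hypothetical names, not part of this PR. It assumes the default branch name and the tag list were already fetched elsewhere (for example through the GitHub API), and a full sha would still need a separate ancestry check against the default branch.

```shell
# Hypothetical helper: classify a requested ref without touching the network.
# Arguments: <ref> <default-branch-name> <newline-separated tag names>
classify_ref() {
  local ref="$1" default_branch="$2" tags="$3" tag
  if [[ "$ref" == "$default_branch" ]]; then
    echo "default-branch"
    return 0
  fi
  while IFS= read -r tag; do
    if [[ -n "$tag" && "$tag" == "$ref" ]]; then
      echo "tag"
      return 0
    fi
  done <<< "$tags"
  # Only a full 40-character lowercase hex sha is a candidate commit; the
  # caller must still confirm it is reachable from the default branch.
  if [[ "$ref" =~ ^[0-9a-f]{40}$ ]]; then
    echo "sha-needs-ancestry-check"
    return 0
  fi
  # Everything else (other branch names, short shas, refs/... paths) is refused.
  echo "rejected"
  return 1
}
```

Anything that is not the default branch, a known tag, or a full sha is rejected outright, so a feature-branch name never reaches the checkout step.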

          persist-credentials: false

      - name: Install jq
        run: sudo apt-get install -y --no-install-recommends jq

      - name: Verify image exists on ghcr.io
        env:
          IMAGE_BASE: ${{ vars.IMAGE_BASE }}
          IMAGE_TAG: ${{ inputs.image_tag || inputs.ref }}
          GHCR_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        run: |
          set -euo pipefail
          if [[ -z "$IMAGE_BASE" ]]; then
            echo "::error::IMAGE_BASE repository variable is not set"
            exit 1
          fi
          echo "Checking $IMAGE_BASE:$IMAGE_TAG"
          echo "$GHCR_TOKEN" | docker login ghcr.io -u "${{ github.actor }}" --password-stdin >/dev/null
          if ! docker manifest inspect "$IMAGE_BASE:$IMAGE_TAG" >/dev/null; then
            echo "::error::image $IMAGE_BASE:$IMAGE_TAG not found on ghcr.io"
            exit 1
          fi

      - name: Join Tailnet (ephemeral)
        uses: tailscale/github-action@v3
        with:
          oauth-client-id: ${{ secrets.TS_OAUTH_CLIENT_ID }}
          oauth-secret: ${{ secrets.TS_OAUTH_SECRET }}
          tags: tag:ci-deploy

      - name: Configure SSH
        env:
          SSH_KEY: ${{ secrets.DEPLOY_SSH_PRIVATE_KEY }}
          KNOWN_HOSTS: ${{ secrets.DEPLOY_KNOWN_HOSTS }}
        run: |
          set -euo pipefail
          mkdir -p ~/.ssh
          chmod 700 ~/.ssh
          printf '%s\n' "$SSH_KEY" > ~/.ssh/id_ed25519
          chmod 600 ~/.ssh/id_ed25519
          printf '%s\n' "$KNOWN_HOSTS" > ~/.ssh/known_hosts
          chmod 644 ~/.ssh/known_hosts
          # Sanity: no stray CRLF in the key, no empty file.
          test -s ~/.ssh/id_ed25519 || { echo "::error::DEPLOY_SSH_PRIVATE_KEY is empty"; exit 1; }
          ssh-keygen -lf ~/.ssh/id_ed25519 >/dev/null

      - name: Render NODES and SSH_TARGETS
        id: render
        env:
          NODES_RAFT_MAP: ${{ vars.NODES_RAFT_MAP }}
          SSH_TARGETS_MAP: ${{ vars.SSH_TARGETS_MAP }}
          NODES_FILTER: ${{ inputs.nodes }}
        run: |
          set -euo pipefail
          if [[ -z "$NODES_RAFT_MAP" || -z "$SSH_TARGETS_MAP" ]]; then
            echo "::error::NODES_RAFT_MAP or SSH_TARGETS_MAP is not set in the production environment variables"
            exit 1
          fi
          if [[ -n "$NODES_FILTER" ]]; then
            # Filter NODES_RAFT_MAP and SSH_TARGETS_MAP to the requested subset.
            # Reject any filter ID that does not appear in the map: silently
            # dropping unknown IDs would let a typo like "n1,n9" proceed as
            # a one-node rollout of n1 alone, which is a staged-deploy
            # footgun.
            IFS=',' read -r -a wanted <<< "$NODES_FILTER"
            IFS=',' read -r -a entries <<< "$NODES_RAFT_MAP"
            declare -a known_ids=()
            for e in "${entries[@]}"; do
              known_ids+=("${e%%=*}")
            done
            unknown=""
            for w in "${wanted[@]}"; do
              found=0
              for k in "${known_ids[@]}"; do
                if [[ "$k" == "$w" ]]; then found=1; break; fi
              done
              if [[ $found -eq 0 ]]; then unknown+="${unknown:+, }$w"; fi
            done
            if [[ -n "$unknown" ]]; then
              echo "::error::nodes filter '$NODES_FILTER' references unknown raft IDs: $unknown. Known IDs: ${known_ids[*]}"
              exit 1
            fi
            filter_csv() {
              local all="$1"
              local filter="$2"
              local out=""
              IFS=',' read -r -a list_entries <<< "$all"
              IFS=',' read -r -a list_wanted <<< "$filter"
              for e in "${list_entries[@]}"; do
                key="${e%%=*}"
                for w in "${list_wanted[@]}"; do
                  if [[ "$key" == "$w" ]]; then
                    out+="${e},"
                    break
                  fi
                done
              done
              echo "${out%,}"
            }
            NODES_RAFT_MAP="$(filter_csv "$NODES_RAFT_MAP" "$NODES_FILTER")"
            SSH_TARGETS_MAP="$(filter_csv "$SSH_TARGETS_MAP" "$NODES_FILTER")"
            if [[ -z "$NODES_RAFT_MAP" ]]; then
              echo "::error::nodes filter '$NODES_FILTER' matches nothing in NODES_RAFT_MAP"
              exit 1
Comment on lines +189 to +191
P2 Badge Reject partially invalid node filters

The filter logic silently drops unknown raft IDs and only errors when no IDs match, so an input like nodes: n1,n9 proceeds as a rollout of n1 only. That can leave operators believing multiple nodes were upgraded when one was skipped, which is risky for staged deploys/rollbacks and can leave the cluster in an unintended mixed-version state. Treat inputs.nodes as an exact set and fail if any requested ID is missing from the configured map(s).

Comment on lines +189 to +191
P1 Badge Fail when requested node IDs are only partially matched

The nodes filter currently errors only when zero IDs match, so a typo like n1,n9 (or whitespace variant n1, n2) silently drops unmatched IDs and proceeds with a partial rollout. That means operators can believe they updated a specific subset while one or more intended nodes were skipped, which is risky during security or incident-driven deploys. Please validate that every requested ID is present in NODES_RAFT_MAP and fail if any are missing.
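
One way to treat the requested IDs as an exact set is sketched below. `missing_ids` is a hypothetical helper, not code from this PR; it assumes the map uses the comma-separated `id=value` form seen in NODES_RAFT_MAP, and it trims whitespace around each requested ID so that `n1, n2` is handled the same as `n1,n2`.

```shell
# Hypothetical helper: print the requested IDs that are missing from a
# comma-separated "id=value" map, after trimming surrounding whitespace.
# Arguments: <requested-ids-csv> <id=value-map-csv>
missing_ids() {
  local requested="$1" map="$2" w e found
  local -a wanted entries missing=()
  IFS=',' read -r -a wanted <<< "$requested"
  IFS=',' read -r -a entries <<< "$map"
  for w in "${wanted[@]}"; do
    # Trim leading, then trailing, whitespace from the requested ID.
    w="${w#"${w%%[![:space:]]*}"}"
    w="${w%"${w##*[![:space:]]}"}"
    found=0
    for e in "${entries[@]}"; do
      if [[ "${e%%=*}" == "$w" ]]; then found=1; break; fi
    done
    if [[ $found -eq 0 ]]; then missing+=("$w"); fi
  done
  # Space-separated list of missing IDs; empty output means every ID matched.
  echo "${missing[*]:-}"
}
```

A caller could fail the step whenever the output is non-empty, for example `[[ -z "$(missing_ids "$NODES_FILTER" "$NODES_RAFT_MAP")" ]] || exit 1`, and run the same check against SSH_TARGETS_MAP.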

            fi
          fi
          {
            echo "NODES=$NODES_RAFT_MAP"
            echo "SSH_TARGETS=$SSH_TARGETS_MAP"
          } >> "$GITHUB_OUTPUT"
          echo "::group::Deploy plan"
          echo "NODES=$NODES_RAFT_MAP"
          echo "SSH_TARGETS=$SSH_TARGETS_MAP"
          echo "::endgroup::"

      - name: Tailscale reachability check
        env:
          SSH_TARGETS: ${{ steps.render.outputs.SSH_TARGETS }}
        run: |
          set -euo pipefail
          IFS=',' read -r -a entries <<< "$SSH_TARGETS"
          failed=0
          for e in "${entries[@]}"; do
Comment on lines +208 to +210
P2 Badge Check reachability for all rollout nodes, not only SSH map entries

The reachability step iterates only over SSH_TARGETS, but rolling-update.sh resolves missing SSH mappings by falling back to each node host from NODES (ssh_target_by_id). If SSH_TARGETS_MAP is incomplete, dry-run can report success without probing some actual rollout targets, and the job can then fail mid-roll when it reaches an unvalidated node. Preflight should derive targets from NODES + SSH_TARGETS using the same fallback semantics or enforce one-to-one mapping coverage first.
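
A minimal sketch of the first option, deriving probe targets from NODES with SSH_TARGETS taking precedence, might look like the following. `probe_hosts` is a hypothetical name, and the entry formats (`id=host:port` for NODES, `id=[user@]host` for SSH_TARGETS) are assumptions inferred from the parsing in this step; the real fallback lives in `ssh_target_by_id` in rolling-update.sh and may differ in detail.

```shell
# Hypothetical sketch: list one probe host per node in NODES, preferring the
# SSH_TARGETS entry for that ID and falling back to the node's own host.
# Assumed formats: NODES="id=host:port,..." SSH_TARGETS="id=[user@]host,..."
probe_hosts() {
  local nodes="$1" ssh_targets="$2" n t id host
  local -a node_entries target_entries
  IFS=',' read -r -a node_entries <<< "$nodes"
  IFS=',' read -r -a target_entries <<< "$ssh_targets"
  for n in "${node_entries[@]}"; do
    id="${n%%=*}"
    host="${n#*=}"            # fallback: the node's own address
    for t in "${target_entries[@]}"; do
      if [[ "${t%%=*}" == "$id" ]]; then
        host="${t#*=}"        # an explicit SSH mapping wins
        break
      fi
    done
    host="${host%%:*}"        # drop a trailing :port
    host="${host##*@}"        # drop a leading user@
    echo "$host"
  done
}
```

Iterating over NODES rather than SSH_TARGETS means every node in the rollout gets probed, even when its SSH mapping is missing.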

            host="${e##*=}"
            host="${host%%:*}"
            # strip user@ if present
            host="${host##*@}"
            if tailscale ping --c 2 --timeout 3s "$host" >/dev/null 2>&1; then
              echo " ok $host"
            else
              echo "::error::$host not reachable over tailnet"
              failed=1
            fi
          done
          if [[ "$failed" -ne 0 ]]; then
            exit 1
          fi

      - name: Dry-run summary
        if: ${{ inputs.dry_run }}
        env:
          NODES: ${{ steps.render.outputs.NODES }}
          SSH_TARGETS: ${{ steps.render.outputs.SSH_TARGETS }}
          IMAGE_BASE: ${{ vars.IMAGE_BASE }}
          IMAGE_TAG: ${{ inputs.image_tag || inputs.ref }}
          SSH_USER: ${{ vars.SSH_USER }}
        run: |
          set -euo pipefail
          cat <<EOF
          ==== DRY RUN: no containers were touched ====
          image: ${IMAGE_BASE}:${IMAGE_TAG}
          SSH user: ${SSH_USER}
          NODES: ${NODES}
          SSH_TARGETS: ${SSH_TARGETS}
          ref: ${{ inputs.ref }}
          Re-run with dry_run=false to apply.
          EOF

      - name: Roll cluster
        if: ${{ !inputs.dry_run }}
        env:
          NODES: ${{ steps.render.outputs.NODES }}
          SSH_TARGETS: ${{ steps.render.outputs.SSH_TARGETS }}
          SSH_USER: ${{ vars.SSH_USER }}
          IMAGE: ${{ vars.IMAGE_BASE }}:${{ inputs.image_tag || inputs.ref }}
          SSH_STRICT_HOST_KEY_CHECKING: "yes"
        run: |
          set -euo pipefail
          ./scripts/rolling-update.sh