
chore: sync main → develop after misrouted docs PRs #108

Open

saadqbal wants to merge 17 commits into develop from main

Conversation


@saadqbal saadqbal commented May 6, 2026

Summary

Backflow PRs that landed on `main` instead of `develop`:

These should have been opened against `develop`. Bringing main into develop now to prevent merge friction on the next Prod: PR. (#106 was retargeted before merge so it lands on develop directly.)

Closes #107

🤖 Generated with Claude Code


Note

Low risk: changes are limited to GitHub Actions workflow wiring and documentation/commands, with no runtime code or data-path modifications.

Overview
Adds two GitHub Actions workflow callers (for fr-gate on PRs to staging/main/master and for /fr-pass issue comments) that delegate to workflows in tracebloc/.github and inherit repo secrets.

Updates README.md and docs/INSTALL.md to clarify Helm-based installation (including helm repo add/helm install), prominently document the standalone quick installer commands, and link to in-repo deployment/security/migration docs with a NetworkPolicy requirement callout.

Reviewed by Cursor Bugbot for commit be344fb. Bugbot is set up for automated code reviews on this repo.

saadqbal and others added 14 commits April 24, 2026 21:18
* Add NetworkPolicy locking down training-pod egress

Training pods run untrusted ML code uploaded by external data scientists.
This policy selects on the tracebloc.io/workload=training label (injected
by jobs-manager in the companion client-runtime PR) and:

  - Denies all ingress (nothing should connect TO a training pod).
  - Allows DNS to the cluster DNS service.
  - Allows external TCP/443 only; blocks all pod-to-pod, ClusterIP, and
    in-cluster pod traffic via ipBlock with cluster-CIDR exclusions.

Training pods can still reach tracebloc backend, Azure Service Bus, and
App Insights (external HTTPS). They can no longer reach mysql-client,
the K8s API server, the jobs-manager pod IP, or other training pods.

Per-platform defaults:
  AKS:  enabled=true  (requires Azure NPM or Calico at cluster create)
  EKS:  enabled=false (AWS VPC CNI does not enforce NetworkPolicy; safer
                       to explicitly disable than silently have no effect)
  BM:   enabled=true  (requires Calico / Cilium / kube-router)
  OC:   enabled=true  (OVN-Kubernetes enforces by default; custom DNS
                       selector and OpenShift pod/service CIDRs)

The dnsSelector default is empty with a template-side fallback to
{k8s-app: kube-dns} to avoid Helm's map-merge semantics surprising
customers who override it (OpenShift's selector would otherwise be
unioned with the default rather than replacing it).

- templates/network-policy-training.yaml: new policy (gated on
  networkPolicy.training.enabled)
- values.yaml + values.schema.json: new networkPolicy.training block
- ci/{aks,eks,bm,oc}-values.yaml: per-platform overrides with notes
- tests/network_policy_test.yaml: 8 helm-unittest cases covering
  rendering, ingress denial, DNS allow, external HTTPS allow, cluster
  CIDR blocking, and the OpenShift selector override

No effect until the companion client-runtime PR lands, which adds the
tracebloc.io/workload=training label to spawned training pods.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
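
For readers who want the shape of the policy, here is a minimal sketch of the egress rules described above. It is illustrative, not the chart's actual template: the workload label comes from this commit, while the DNS selector and cluster CIDRs are example values an operator would set per platform.

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: training-egress            # illustrative name
spec:
  podSelector:
    matchLabels:
      tracebloc.io/workload: training
  policyTypes: [Ingress, Egress]   # Ingress listed with no rules => all ingress denied
  egress:
    # DNS to the cluster DNS service
    - to:
        - namespaceSelector: {}
          podSelector:
            matchLabels:
              k8s-app: kube-dns    # template-side fallback; OpenShift overrides this
      ports:
        - { protocol: UDP, port: 53 }
        - { protocol: TCP, port: 53 }
    # External HTTPS only; cluster ranges carved out so pod-to-pod and
    # ClusterIP traffic on 443 stays blocked
    - to:
        - ipBlock:
            cidr: 0.0.0.0/0
            except:
              - 10.0.0.0/8         # example cluster CIDRs; set per platform
              - 172.16.0.0/12
      ports:
        - { protocol: TCP, port: 443 }
```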

* Add optional Namespace resource with Pod Security Admission labels (#43)

* Add optional Namespace resource with Pod Security Admission labels

Layers Kubernetes Pod Security Admission on top of the per-pod
securityContext work for defense-in-depth. Off by default -- enabling
requires a greenfield install, since the chart does not currently own
the release namespace on existing deployments.

When namespace.create is true, the chart templates a Namespace with:

    pod-security.kubernetes.io/warn: restricted
    pod-security.kubernetes.io/audit: restricted
    helm.sh/resource-policy: keep

Warn + audit surface any pod-spec violation as a kubectl warning and
an audit-log event, without rejecting the pod. This gives us a
tripwire for future regressions in our own pod specs (jobs-manager,
mysql, resource-monitor, training pods) and for any third-party pods
in the same namespace.

Enforce mode is deliberately left UNSET. Two of our own workloads
would be rejected under enforce: restricted:

  - mysql init containers run as UID 0 (needed to chown the PVC
    before the main container -- UID 999 -- starts)
  - resource-monitor DaemonSet mounts hostPath /proc and /sys

Enabling enforce before those are refactored (or moved to a separate
namespace) would break the chart. Customers who want full enforcement
can set namespace.podSecurity.enforce = restricted after auditing
their own deployment; the current defaults keep them safe.

helm.sh/resource-policy: keep prevents helm uninstall from deleting
the Namespace, which would otherwise take the PVC-backed training
data and MySQL state with it.

- templates/namespace.yaml: new, gated on namespace.create (default false)
- values.yaml: new namespace block with long comments
- values.schema.json: schema entries for namespace.create + podSecurity
- tests/namespace_test.yaml: 8 helm-unittest cases (toggle off, toggle
  on, keep annotation, labels, version strings, enforce omitted when
  empty, enforce present when set, baseline override, namespace name
  respects release)
- docs/INSTALL.md: section explaining the greenfield vs existing-ns
  paths with copy-pasteable kubectl label commands

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
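
A rough sketch of what the rendered Namespace looks like with the defaults described above (the namespace name is an example; enforce is intentionally absent):

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: tracebloc-templates                      # example release namespace
  labels:
    pod-security.kubernetes.io/warn: restricted
    pod-security.kubernetes.io/audit: restricted
    # pod-security.kubernetes.io/enforce is deliberately not set (see above)
  annotations:
    helm.sh/resource-policy: keep                # survives helm uninstall
```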

* Fix kubeVersion constraint to accept cloud pre-release suffixes

Helm's semver parser excludes pre-release versions from >= ranges by
default, so ">=1.24.0" rejected EKS ("1.34.4-eks-f69f56f"), GKE
("-gke-*"), and AKS release-tagged versions. Changing to ">=1.24.0-0"
explicitly opts the constraint into matching pre-releases, which is
how managed-Kubernetes providers encode their vendor suffix.

Surfaced while dry-run-installing PR #43 against a dev EKS cluster.

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Asad Iqbal <asad.dsoft@gmail.com>
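
The resulting Chart.yaml constraint, for reference; the `-0` pre-release floor is what opts the range into vendor-suffixed versions:

```yaml
# Chart.yaml
kubeVersion: ">=1.24.0-0"   # matches e.g. 1.34.4-eks-f69f56f, *-gke-*, AKS release tags
```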

* Add consolidated SECURITY.md covering the training-pod sandbox (#44)

Brings together the threat model, defense layers, per-platform
caveats, operator responsibilities, residual risks, and verification
steps into one reviewable artifact. Covers the complete hardening
posture as shipped across the chart + jobs-manager + new-arch
training images.

Sections:

  1.  Threat model: trusted platform, untrusted external-data-
      scientist submissions. Explicit in-scope / out-of-scope.
  2.  Seven design goals (G1-G7) for the training-pod sandbox,
      each mapped to current status on new-arch vs. legacy.
  3.  Architecture overview.
  4.  Defense layers -- credential isolation, network egress,
      K8s API access, container runtime hardening, storage
      isolation, cross-tenant forgeability, admission tripwire.
  5.  Per-platform caveats -- NetworkPolicy CNI matrix (AKS/EKS/
      bare-metal/OpenShift), PSA version requirements, OpenShift
      DNS selector override, runAsUser + arbitrary UIDs, bare-
      metal hostPath note.
  6.  What operators must do themselves -- rotate secrets, verify
      CNI enforces, label existing namespaces, monitor audit,
      upgrade ordering, refactor path for enforce: restricted.
  7.  Verification -- copy-pasteable kubectl snippets for each
      defense layer.
  8.  Residual risks with explicit ownership -- global SB conn
      strings (backend), HTTPS egress (platform endgame), token
      TTL (backend), legacy arch (migration team), PSA enforce
      (chart refactor), CNI silent no-op (operator), kernel
      escape (out of scope), resource DoS (out of scope).
  9.  Compromise response playbook.
  10. Where each defense is implemented (code-path map for
      reviewers).
  11. Document history.

Also:

- README.md: add Security subsection under Deployment Guide
  linking to docs/SECURITY.md.
- docs/INSTALL.md: prerequisite note about CNI enforcement.

No code changes; documentation only.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Add docs/MIGRATIONS.md and CLAUDE.md for Helm chart migration safety (#47)

Document the helm.sh/resource-policy=keep gotcha: Helm reads the
annotation from the stored release manifest, not live resources, so
kubectl annotate alone does not protect PVCs from helm uninstall.
Includes the 2026-04-22 tracebloc-templates migration as a case study
and three mitigation options (helm upgrade, strip ownership, or rely
on PV Retain + recreate).

* docs(client): add pre-Helm resource-monitor cleanup step to MIGRATION.md (#49)

Early-era edges were installed with a hand-rolled `resource-monitor`
DaemonSet via raw `kubectl apply` before the per-platform charts existed.
The unified chart's `tracebloc-resource-monitor` DaemonSet replaces it,
but the legacy DS is unmanaged and keeps running after migration, mounting
hostPath /proc + /sys and blocking PSA `enforce=restricted` on the namespace.

Adds a step-6 section documenting the kubectl cleanup (DS + SA + ClusterRole
+ ClusterRoleBinding, all named `resource-monitor`) with a safety check to
confirm the ClusterRole/Binding aren't shared before deletion.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

* feat(mysql): drop root init-containers, add PSA-restricted securityContext (#48)

* feat(mysql): drop root init-containers, add PSA-restricted securityContext

Unblocks pod-security.kubernetes.io/enforce: restricted on the release
namespace. Previously the mysql-client pod had two init-containers
running as UID 0 to chown /var/lib/mysql and /var/log/mysql to 999:999
before mysqld started. PSA restricted rejects runAsUser: 0 on any
container, so these init-containers were the last blocker to promoting
the namespace from warn/audit to enforce.

The pod already had `fsGroup: 999` + `fsGroupChangePolicy: OnRootMismatch`
at the pod level, which kubelet uses to chgrp mounted volumes on first
mount. Once that is in place the init-container chowns are redundant:

- On existing PVCs (already owned 999:999 from the prior init-container
  chown) OnRootMismatch sees the correct root ownership and skips the
  recursive chgrp — mount is instant, no behavior change.
- On fresh PVCs kubelet applies fsGroup before the main container starts.
- On emptyDir (the logs volume) kubelet applies fsGroup at volume
  creation.

Also adds a container-level securityContext with all six fields PSA
restricted requires:
- runAsNonRoot: true
- runAsUser / runAsGroup: 999 (matches the mysql:5.7.41 base image's
  default user, and the entrypoint skips its root-to-mysql gosu re-exec
  when already running as 999)
- allowPrivilegeEscalation: false
- capabilities: drop all
- seccompProfile: RuntimeDefault

Scope: client chart only (now the universal chart covering eks/aks/bm/oc).

Caveats for customers:
- Requires a CSI driver with fsGroupPolicy=File or ReadWriteOnceWithFSType
  (EBS, AzureDisk, GCE-PD, CephRBD all qualify). NFS v3 and some
  object-backed drivers do not; chart docs should flag this in a
  follow-up.

Deferred to separate PR:
- readOnlyRootFilesystem on the mysql container (needs emptyDir mounts
  for /tmp, /run/mysqld, /var/lib/mysql-files; real regression risk).

* fix(mysql): restore chown init-container for hostPath (bare-metal)

kubelet does not apply fsGroup ownership to hostPath volumes
(kubernetes/kubernetes#138411), so bare-metal installs need a
privileged bootstrap to chown /var/lib/mysql to 999:999 on first
start. Gated on .Values.hostPath.enabled so CSI-backed deployments
(EKS/AKS/OC) keep the clean no-init, PSA-restricted-compliant form.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
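
Sketch of the pod- and container-level security settings described above, with field values taken from the commit text (surrounding pod spec omitted):

```yaml
# pod-level (pre-existing per the commit)
securityContext:
  fsGroup: 999
  fsGroupChangePolicy: OnRootMismatch
---
# container-level (added by this change)
securityContext:
  runAsNonRoot: true
  runAsUser: 999
  runAsGroup: 999
  allowPrivilegeEscalation: false
  capabilities:
    drop: ["ALL"]
  seccompProfile:
    type: RuntimeDefault
```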

* Move tracebloc-resource-monitor to dedicated privileged namespace (#50)

* Move tracebloc-resource-monitor to dedicated privileged namespace

Pod Security Admission's `restricted` profile bans hostPath volumes
outright, and the resource-monitor DaemonSet needs hostPath /proc and
/sys to read node-level metrics. Previously, setting
`pod-security.kubernetes.io/enforce: restricted` on the release
namespace (tracebloc-templates) would have rejected the DaemonSet, and
`warn=restricted` + `audit=restricted` were already spamming violation
warnings for it.

This isolates the DaemonSet in a new dedicated namespace
(tracebloc-node-agents, configurable via `nodeAgents.namespace.name`)
that carries `pod-security.kubernetes.io/{enforce,warn,audit}:
privileged` labels. The release namespace is no longer constrained by
the node-agent and can run `enforce: restricted` once the mysql init
refactor lands.

Changes:
- templates/node-agents-namespace.yaml: new, gated on
  nodeAgents.namespace.create (default true) and resourceMonitor
- templates/resource-monitor-daemonset.yaml: deploy into node-agents ns
- templates/resource-monitor-rbac.yaml: SA + (Cluster)RoleBinding in
  node-agents ns
- templates/resource-monitor-scc.yaml: SCC users + CRB subject updated
  (OpenShift path)
- values.yaml + values.schema.json: new `nodeAgents.namespace` block
- templates/namespace.yaml + docs/INSTALL.md: drop resource-monitor
  from the enforce-blocker list; document the new node-agents ns
- tests/node_agents_namespace_test.yaml: 12 new unittest cases

Upgrade impact: existing installs will see the DaemonSet / SA /
(Cluster)RoleBinding deleted from the release namespace and recreated
in the node-agents namespace during `helm upgrade`. Brief (~seconds)
gap in node metrics during rollout; no persistent data involved.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* Mirror secrets into node-agents ns; keep namespace RBAC in release ns

Two follow-ups from review of the namespace-split change:

1. Secrets are namespace-scoped — a pod in `tracebloc-node-agents`
   cannot `secretKeyRef` a Secret that only exists in the release
   namespace. The resource-monitor DaemonSet was referencing CLIENT_ID /
   CLIENT_PASSWORD from `tracebloc.secretName` and the registry pull
   secret, both of which template only into `.Release.Namespace`, so
   pods would have failed to start with CreateContainerConfigError.

   templates/secrets.yaml and templates/docker-registry-secret.yaml now
   template a second copy into `nodeAgents.namespace.name` when:
     resourceMonitor != false  AND  node-agents ns != release ns

   The mirror is skipped when the two namespaces collide (e.g. operator
   points nodeAgents.namespace.name back at the release namespace) so
   Helm does not try to create two resources with the same name.

2. When clusterScope: false, the Role must live in the RELEASE
   namespace because that is where the monitored workloads run — a
   namespace-scoped Role only grants access to its own namespace.
   Previously this PR put the Role in `tracebloc-node-agents`, which
   would have silently broken the resource-monitor for anyone not
   using ClusterRole. Role + RoleBinding are now back in
   `.Release.Namespace`; the RoleBinding subject still points at the
   ServiceAccount in the node-agents namespace (cross-namespace
   subjects in RoleBindings are valid).

Tests updated accordingly; 5 new cases cover mirror-on, mirror-off
(resourceMonitor=false), mirror-off (namespaces collide), dockercfg
mirror, and the corrected Role/RoleBinding placement.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
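
A minimal sketch of the corrected RBAC placement (resource names are illustrative): the Role and RoleBinding live in the release namespace, while the subject references the ServiceAccount in the node-agents namespace.

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: tracebloc-resource-monitor        # illustrative
  namespace: tracebloc-templates          # release namespace: where the monitored workloads run
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: tracebloc-resource-monitor
subjects:
  - kind: ServiceAccount
    name: tracebloc-resource-monitor
    namespace: tracebloc-node-agents      # cross-namespace subject is valid in a RoleBinding
```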

* fix(resource-monitor): pin NAMESPACE env to release ns; guard node-agents ns==release ns

Two review fixes from the PSA hardening change:

1. NAMESPACE env var was using Downward API fieldPath: metadata.namespace,
   which now resolves to the node-agents namespace (where the DaemonSet
   pods live) instead of the release namespace (where the monitored
   workloads live). Replace with the literal Release.Namespace so the
   monitor continues to watch the right namespace regardless of where
   its own pods run.

2. node-agents-namespace.yaml would stamp privileged PSA labels onto the
   release namespace if an operator set nodeAgents.namespace.name to the
   release namespace (and with namespace.create=true it would render two
   Namespace docs with the same name — a render-time collision). Add an
   equality guard so the template is a no-op in that configuration.

Adds one test covering the NAMESPACE env fix; tests: 74/74 pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
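
The before/after shape of the env var fix, sketched (container spec context omitted):

```yaml
env:
  - name: NAMESPACE
    # before: valueFrom: { fieldRef: { fieldPath: metadata.namespace } }
    #         which now resolves to tracebloc-node-agents, where the DaemonSet pods live
    value: {{ .Release.Namespace | quote }}   # after: pinned to the release namespace
```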

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

* feat(mysql): set readOnlyRootFilesystem on mysql-client (#52)

Completes container runtime hardening (G4) for mysql-client. Adds three
emptyDir mounts for the paths mysqld writes to at runtime that are NOT
already on PVC or log volumes:

- /var/run/mysqld       pid file + unix socket
- /tmp                  temp tables, sort buffers, LOAD DATA staging
- /var/lib/mysql-files  default secure_file_priv dir (touched at start)

Verified via helm upgrade on EKS (tb-client-dev-templates /
tracebloc-templates): pod Ready, readOnlyRootFilesystem=true, `touch /etc/x`
rejected as Read-only, mysqld.sock + mysqld.pid present under /var/run/mysqld,
existing DB data intact in /var/lib/mysql.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
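
Sketch of the added hardening and writable mounts (volume names are illustrative; volumeMounts sit on the container, volumes on the pod):

```yaml
securityContext:
  readOnlyRootFilesystem: true
volumeMounts:
  - { name: run-mysqld,  mountPath: /var/run/mysqld }        # pid file + unix socket
  - { name: tmp,         mountPath: /tmp }                    # temp tables, sort buffers
  - { name: mysql-files, mountPath: /var/lib/mysql-files }    # secure_file_priv dir
volumes:
  - { name: run-mysqld,  emptyDir: {} }
  - { name: tmp,         emptyDir: {} }
  - { name: mysql-files, emptyDir: {} }
```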

* feat(psa): enforce=restricted by default on CSI; bare-metal overrides (#51)

- values.yaml: namespace.podSecurity.enforce flipped to "restricted".
- ci/bm-values.yaml: overrides enforce to "" because kubelet does not
  apply fsGroup to hostPath volumes (kubernetes/kubernetes#138411),
  forcing the chart to render a privileged init-mysql-data chown
  container that PSA restricted would reject. warn+audit remain on.
- namespace.yaml docstring + SECURITY.md (§4.7, §6.3, §6.6, §8.5)
  updated to document the CSI-default / bare-metal-override split.

Verified with helm template --set namespace.create=true against both
eks-values.yaml (enforce rendered) and bm-values.yaml (enforce absent).

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
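
The values split described above, sketched:

```yaml
# values.yaml (default, CSI-backed platforms)
namespace:
  podSecurity:
    enforce: "restricted"
---
# ci/bm-values.yaml (bare-metal override)
namespace:
  podSecurity:
    enforce: ""        # the root chown init-container for hostPath would be rejected
```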

* feat(installer): slim k3d and add dev overrides for local testing (#54)

The tracebloc client is outbound-only: jobs-manager and pods-monitor
dial out to the platform, and the only in-cluster Service is mysql-client
(ClusterIP). The bundled k3s ingress/LB stack and metrics-server are
unused overhead, and the chart ships its own StorageClass.

Drop the loadbalancer port mappings (HTTP_PORT/HTTPS_PORT) plus their
validation/help/log references, and pass --k3s-arg "--disable=..." for
traefik, servicelb, metrics-server, and local-storage to k3d cluster
create. Applied symmetrically in scripts/install-k8s.ps1.

Also add two env vars for local-chart testing in install-client-helm.sh:

  TRACEBLOC_CHART_PATH    install from a local chart path instead of the
                          published tracebloc/client Helm repo (skips
                          helm repo add/update)
  TRACEBLOC_VALUES_FILE   use the caller-supplied values file as-is and
                          skip the clientId/password prompts + values.yaml
                          generation

With both set, the installer can exercise the full flow end-to-end
against unreleased chart changes before publishing.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

* feat(client): harden image pinning and credentials (v1.0.4) (#53)

Address the High-severity findings from the client chart security review:

- Add digest support to tracebloc.image helper and images.* values for
  jobs-manager, pods-monitor, mysql-client, and busybox. When a digest is
  set, the image is rendered as repo@sha256:... and imagePullPolicy drops
  to IfNotPresent (immutable pin, auditable rollout).
- Replace the hard-coded mysql-client:latest with a configurable tag that
  defaults to "prod". The schema rejects "latest" outright; operators
  wanting absolute pinning should set images.mysqlClient.digest.
- Harden the bare-metal mysql init-container: still runs as root (kubelet
  does not apply fsGroup to hostPath volumes, k8s#138411), but now with
  drop: [ALL] + add: [CHOWN], allowPrivilegeEscalation: false,
  readOnlyRootFilesystem: true, and seccompProfile: RuntimeDefault.
- Remove deceptive "<CLIENT_ID>" / "<CLIENT_PASSWORD>" placeholder defaults.
  The defaults are now empty strings; the schema and template both reject
  empty values and <...> placeholder patterns so deployments fail fast
  instead of silently encoding a placeholder into the Secret.

Bump chart version 1.0.3 -> 1.0.4. All 76 unit tests pass.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
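
A hypothetical sketch of the digest-aware image rendering described above; the real `tracebloc.image` helper's name handling and argument shape may differ:

```yaml
{{/* Renders repo@sha256:... when a digest is set, else repo:tag */}}
{{- define "tracebloc.image.sketch" -}}
{{- if .digest -}}
{{- printf "%s@%s" .repository .digest -}}
{{- else -}}
{{- printf "%s:%s" .repository .tag -}}
{{- end -}}
{{- end -}}
```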

* feat(client): require metrics-server for resource-monitor (v1.0.5) (#55)

The tracebloc-resource-monitor DaemonSet queries the metrics.k8s.io API
for node CPU/memory. Without metrics-server registered, the DaemonSet
crash-loops with 404s against /apis/metrics.k8s.io/v1beta1 — silently,
every few seconds. Found during a bare-metal smoke test on a k3d cluster
where metrics-server had been explicitly disabled.

- scripts/lib/cluster.sh: drop --disable=metrics-server from the k3d
  create args. k3s bundles metrics-server; the earlier comment claiming
  the chart "ships its own" was wrong — the DaemonSet is a consumer of
  metrics-server, not a replacement.
- client/templates/resource-monitor-daemonset.yaml: add a pre-install
  `lookup` that fails the release up front when resourceMonitor is true
  but v1beta1.metrics.k8s.io is not registered. Guarded by a kube-system
  probe so offline `helm template` still renders.
- client/values.yaml: document the dependency inline on resourceMonitor,
  with per-platform install notes (k3d/AKS bundled; EKS/OC/bare-metal
  need manual install).
- docs/SECURITY.md: call out the dependency and the escape hatch
  (resourceMonitor: false) in the architecture section.
- Chart.yaml: 1.0.4 -> 1.0.5.

Verified on a fresh k3d cluster (no --disable=metrics-server): metrics
API comes up in ~30s, smoke install succeeds, resource-monitor reaches
Running with zero ERROR/404 lines. Pre-flight fail path also verified
against a metrics-less cluster.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
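
Sketch of the pre-flight guard (exact template wording assumed). `lookup` returns empty under plain `helm template`, so the kube-system probe keeps offline rendering working:

```yaml
{{- if .Values.resourceMonitor }}
{{- if lookup "v1" "Namespace" "" "kube-system" }}
{{- if not (lookup "apiregistration.k8s.io/v1" "APIService" "" "v1beta1.metrics.k8s.io") }}
{{- fail "resourceMonitor is enabled but v1beta1.metrics.k8s.io is not registered; install metrics-server or set resourceMonitor: false" }}
{{- end }}
{{- end }}
{{- end }}
```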

* fix(mysql): drop chmod from hostPath init (v1.0.6) (#56)

The init-container runs as UID 0 with capabilities drop:[ALL] add:[CHOWN].
After 'chown 999:999' transfers ownership, the subsequent 'chmod 755' runs
as a non-owner without CAP_FOWNER and returns EPERM on re-install where
the hostPath dir already exists from a prior run. Reversing the order
does not help (chmod first still fails once the dir is 999-owned from
any previous successful run).

kubelet creates hostPath dirs at 0755 via DirectoryOrCreate, so the chmod
was a no-op on fresh installs and broken on re-installs. Drop it.

Verified on k3d/AWS VM:
- fresh install: kubelet-created root:root dir -> chown succeeds -> 999:999
- re-install: pre-existing 999:999 dir with data -> chown no-op -> data intact

* Chore/merge main into develop (#58)

* Update README.md

* Add narrow CODEOWNERS for security-sensitive paths

* Remove the metrics-server disable argument from k3d cluster creation in install-k8s.ps1 so the resource-monitor DaemonSet, which relies on the metrics API, works correctly. This aligns the PowerShell installer with the earlier change that made metrics-server a requirement for monitoring.

---------

Co-authored-by: lukasWuttke <54042461+LukasWodka@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

* Merge pull request #60 from tracebloc/fix/resource-monitor-digest-pinning

fix(client): pin resource-monitor by digest (v1.0.7)

* chore: add auto-add to engineer kanban workflow (#45)

* Add auto-add to engineer kanban workflow

* fix(ci): pin actions/add-to-project to v1.0.2

@v1 is not a valid tag — action publishes full semver only. Pin to v1.0.2.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
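
Minimal caller step shape for reference (the project URL and secret name are placeholders): the action publishes only full semver tags, so the pin has to be exact.

```yaml
steps:
  - uses: actions/add-to-project@v1.0.2      # "@v1" does not exist for this action
    with:
      project-url: https://github.com/orgs/tracebloc/projects/1   # placeholder
      github-token: ${{ secrets.ADD_TO_PROJECT_PAT }}             # placeholder secret name
```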

* fix(client): reject empty clusterCidrs on training NetworkPolicy (v1.0.8) (#61)

* fix(client): reject empty clusterCidrs on training NetworkPolicy (v1.0.8)

When `networkPolicy.training.enabled: true` and `clusterCidrs: []`, the
template's range loop produced no items, so `except:` rendered as null.
Kubernetes interprets a null `except` as "no exceptions" to `cidr: 0.0.0.0/0`,
silently granting training pods unrestricted port-443 egress to MySQL, the
K8s API, jobs-manager, and every other in-cluster destination the policy
is meant to block.

Gate the misconfiguration at two levels:
- `values.schema.json`: add `minItems: 1` to clusterCidrs (fires at
  helm install/upgrade validation)
- `network-policy-training.yaml`: add a `{{ fail }}` guard as
  defense-in-depth for schema-bypass paths (helm template --validate=false)
- `tests/network_policy_test.yaml`: add a unit test asserting the failure

Credit: bug bot finding.
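
Sketch of the template-side guard (wording assumed), sitting inside the existing enabled block:

```yaml
{{- if .Values.networkPolicy.training.enabled }}
{{- if not .Values.networkPolicy.training.clusterCidrs }}
{{- fail "networkPolicy.training.clusterCidrs must contain at least one CIDR; an empty list renders a null 'except' and allows unrestricted egress on 443" }}
{{- end }}
{{- end }}
```

On the schema side this corresponds to `minItems: 1` on the `clusterCidrs` array, which fires during helm install/upgrade validation.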

* fix(client): tolerate missing images.resourceMonitor on --reuse-values upgrade

Caught by a live k3d upgrade 1.0.6 → 1.0.8: releases installed before
PR #60 have no `images.resourceMonitor` block in their stored values, so
`helm upgrade --reuse-values` nil-pointered on `.Values.images.resourceMonitor.digest`.

- Read the digest via nested `default (dict)` so a missing `images` map
  AND a missing `resourceMonitor` entry both fall through to "" safely.
  `dig` would be cleaner but it rejects chartutil.Values.
- Add tests/resource_monitor_test.yaml with a regression case that sets
  `images: null` and asserts the DaemonSet still renders with the tag
  fallback.

Scope limited to resourceMonitor: the other images (jobsManager,
podsMonitor, mysqlClient, busybox) were introduced together in PR #53
(1.0.4), so anyone on 1.0.4+ already has those blocks in stored values.
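
The nil-safe read, roughly (exact expression assumed):

```yaml
{{- /* a missing .Values.images map or a missing resourceMonitor entry both fall through to "" */ -}}
{{- $digest := ((.Values.images | default (dict)).resourceMonitor | default (dict)).digest | default "" -}}
```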

* fix(client): scope clusterCidrs minItems guard to enabled=true only

Bug bot flagged that the unconditional minItems:1 constraint on
networkPolicy.training.clusterCidrs rejects `enabled: false` +
`clusterCidrs: []` — a legitimate minimal config for operators on
non-enforcing CNIs who disable the policy entirely.

Move the constraint behind a JSON Schema draft 7 if/then at the
`training` object level: minItems:1 applies only when enabled=true.
The template-side fail guard was already correctly scoped inside the
`.Values.networkPolicy.training.enabled` check, so no template change
is needed — this aligns the schema with the template.

Add a unittest covering `enabled: false` + `clusterCidrs: []`
(schema must pass, no policy rendered).

---------

Co-authored-by: Lukas Wuttke <lukas@tracebloc.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: lukasWuttke <54042461+LukasWodka@users.noreply.github.com>

* Merge pull request #62 from tracebloc/fix/release-workflow-lint

Enhance CI workflows and fix MySQL resource management issues
* Merge pull request #71 from tracebloc/docs/migrations-correct-option-b

docs(migrations): correct Option B + add hasan-prod case + active-jobs pre-flight

* chore: add default CODEOWNERS for auto-reviewer assignment (#73)

* ci: add kanban closure-routing caller workflow (#75)

* fix(client): release-scope resource-monitor names so multiple releases coexist (v1.2.0) (#72)

Two client releases on the same cluster could not both deploy the
resource-monitor DaemonSet because several resources templated into the
shared tracebloc-node-agents namespace used the literal name
`tracebloc-resource-monitor` rather than a release-scoped name. The
second `helm install` failed with:

  Error: ServiceAccount "tracebloc-resource-monitor" in namespace
  "tracebloc-node-agents" exists and cannot be imported into the current
  release: invalid ownership metadata; ... must equal "hasan-prod":
  current value is "stg".

Surfaced during the 2026-04-27 hasan-prod migration on
tracebloc-templates-prod; worked around at the time by setting
resourceMonitor: false on the second release, which means prod customers
currently lose their per-CLIENT_ID metric stream until this lands.

What changed:

- New helper `tracebloc.resourceMonitorName` -> `<Release.Name>-resource-monitor`,
  centralised in _helpers.tpl alongside the existing per-release name
  helpers (secretName, serviceAccountName, etc.).
- DaemonSet metadata.name, spec.selector.matchLabels.app, pod label
  app=, and spec.template.spec.serviceAccountName all now go through
  the helper. The selector + pod label have to move together because
  DaemonSet selectors are namespace-scoped: two DaemonSets in
  tracebloc-node-agents both selecting `app: tracebloc-resource-monitor`
  would each grab the other's pods, which is worse than the surface bug.
- ServiceAccount metadata.name (resource-monitor-rbac.yaml) goes through
  the helper. ClusterRole / ClusterRoleBinding / Role / RoleBinding
  metadata.name were already release-scoped (`tracebloc-resource-monitor-<release>`)
  and stay as-is to avoid an unnecessary ClusterRole rename for upgrading
  installs. Only the *subject* names in (Cluster)RoleBinding change to
  point at the new SA.
- Mirrored secrets (CLIENT_ID + dockerconfigjson) in tracebloc-node-agents:
  the secret names were already release-scoped via
  tracebloc.secretName / tracebloc.registrySecretName so they did not
  collide. Their `app` label was the literal value, which is harmless on
  uniquely-named resources but inconsistent — updated for consistency.
- Chart bumped 1.1.0 -> 1.2.0. Per-release naming of cluster-singleton
  resources is a behaviour change for existing installs (DaemonSet name,
  ServiceAccount name, and selector label all change), so a minor bump
  signals that operators should review.

Tests: 93 -> 98. New cases cover:
- DaemonSet name + selector + serviceAccountName all release-scoped
- ServiceAccount name release-scoped
- ClusterRoleBinding subject points at the release-scoped SA
- A second `helm template` with a different release name produces
  non-colliding names

Verified end-to-end via `helm template stg ./client` and
`helm template hasan-prod ./client` on the same chart: ServiceAccount,
DaemonSet, and ClusterRoleBinding subject names all diverge per release.

Upgrade path from 1.1.0:

The DaemonSet and ServiceAccount rename triggers a Helm three-way merge
that DELETEs the old `tracebloc-resource-monitor` resource and CREATEs
the new release-scoped one. ~30-60s gap on each node where resource
metrics are not collected. DaemonSet selector is immutable, so the
delete-then-create path is what we want — helm upgrade handles this
automatically because the names diverge in the stored manifest. No
manual orphan cleanup needed.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
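
The helper's shape, per the naming described above (implementation detail assumed):

```yaml
{{- define "tracebloc.resourceMonitorName" -}}
{{- printf "%s-resource-monitor" .Release.Name -}}
{{- end -}}
```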

* fix(client): allow training pods to reach mysql-client (v1.2.1) (#76)

The training-egress NetworkPolicy added in v1.1.0 only permitted DNS and
external TCP/443. Training pods load their dataset from the in-namespace
mysql-client over TCP/3306 (core/utils/database.py::load_dataframe_from_sql_table),
so under any CNI that actually enforces NetworkPolicy the connect failed
with errno 111 and the Job CrashLoopBackOff'd before the first batch:

  Database connection failed: 2003 (HY000): Can't connect to MySQL server
  on 'mysql-client:3306' (111)
  RuntimeError: Database connection is not available for load_dataframe_from_sql_table

Surfaced on a fresh client install (k3d / k3s, which enforces policy via
the built-in kube-router) where jobs-manager could reach mysql but every
training Job spawned with tracebloc.io/workload=training could not.

Add a third egress rule scoped to podSelector {app: mysql-client} on
TCP/3306. Same-namespace by default (no namespaceSelector), so it stays
tight to the chart's own mysql pod and does not open the namespace
generally. The egress[1] /32 ipBlock comment is updated to note that
MySQL is now explicitly re-permitted by egress[2].

Verified on a k3d cluster: pre-fix nc to mysql-client:3306 from a pod
with the training label was refused; post-fix it connects.
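
The added rule, sketched (appended to the policy's egress list):

```yaml
    - to:
        - podSelector:
            matchLabels:
              app: mysql-client   # no namespaceSelector => same namespace only
      ports:
        - protocol: TCP
          port: 3306
```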

* docs(migration-tools): tenant migration runbook for eks-1.0.x → client-1.x (#74)

* docs(migration-tools): tenant migration runbook for eks-1.0.x -> client-1.x

Captures the operational tooling validated during the 2026-04-27 stg and
hasan-prod migrations and generalises it for the remaining tenants
(bmw, cisco, charite) and any future tenant on the legacy chart family.

What's here:

- README.md walks the workflow + recommended ordering for the pending
  set + skip rationale for chart toggles (resourceMonitor: false,
  priorityClass.create: false, etc).
- generate.sh consumes a tenant-config.env (gitignored) and emits, per
  tenant, /tmp/tracebloc-migration-<tenant>/{values,storageclass,pvcs}.yaml.
  Refuses to expand placeholder __FOO__ rows so an operator running
  generate.sh against the unmodified template fails fast.
- migrate-tenant.sh is the parameterised runbook. `phase1` is
  non-destructive (mysqldump-then-chunked-cp, AWS Backup on-demand
  recovery point, dry-run render). `phase2` is one-shot per tenant
  (helm uninstall, claimRef clear, SC re-create, PVC pre-create with
  release-scoped Helm ownership stamp, helm install, verify mysql data
  + keep annotation in stored manifest).
- tenant-config.example.env is the template; populated copy is the
  secret-bearing artifact and must stay local.

No real secrets in any committed file:

- DOCKER_PASSWORD placeholder (__DOCKER_HUB_PERSONAL_ACCESS_TOKEN__)
- per-tenant CLIENT_ID / CLIENT_PASSWORD placeholders
- MYSQL_ROOT_PW placeholder (it's image-baked; required from env at
  runtime, no committed default)
- .gitignore now excludes docs/migration-tools/tenant-config.env
  (only the .example variant is tracked)

Operational notes:

- Every kubectl/helm call passes --context explicitly. The 2026-04-27
  prod run hit a context-drift bug mid-migration; the explicit form
  is a hard requirement.
- values.yaml ships with resourceMonitor: false. Flip true after the
  release-scoped resource-monitor names land in client-1.2.0 (separate
  PR). Until then the shared SA in tracebloc-node-agents collides with
  the stg release.
- Phase 1 is idempotent and re-runnable. Phase 2 is destructive and
  one-shot per tenant. Operators should pause and eyeball Phase 1
  outputs before running Phase 2 — that's deliberately not automated.

Once all four pending tenants are on client-1.x, this directory is
historical. client-1.x -> client-1.y upgrades follow plain `helm upgrade`
because the new chart already templates `helm.sh/resource-policy: keep`
on PVCs, so the migration protocol isn't needed for routine upgrades.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(migration-tools): address bugbot review feedback on PR #74

Three issues flagged by Cursor Bugbot on the migration scripts:

* migrate-tenant.sh used macOS-only `md5 -q` and `stat -f%z` for chunked-cp
  verification (HIGH). Linux operators would abort Phase 1 mid-transfer.
  Add portable `_md5` and `_size` helpers that pick md5sum on Linux,
  fall back to md5(1) on macOS, and use `wc -c` instead of stat for size.

* generate.sh placeholder gate inspected only CLIENT_ID + CLIENT_PASSWORD
  + PV_MYSQL, missing PV_LOGS, PV_DATA, SC_NAME, and DOCKER_PASSWORD
  (MEDIUM). Literal `__FOO__` placeholders silently rendered into
  values.yaml/pvcs.yaml and only blew up at kubectl apply / helm install
  time. Iterate over every per-row field, plus a one-shot global check
  for DOCKER_PASSWORD before the loop. Error messages now name the
  offending field.

* Phase 2.5 readiness loop was an unbounded `while :; do … sleep 5; done`
  (MEDIUM). After the destructive helm uninstall, a non-converging
  install (image-pull error, mysql kill-loop recurrence, missing PVC
  binding) hung the script forever instead of surfacing the failure.
  Add a wall-clock deadline — default 600s, override via READY_TIMEOUT —
  and exit 1 with the last-seen pod state on timeout.

* fix(migration-tools): address bugbot follow-up on PR #74

Two more issues raised on the previous fix commit:

* Readiness wait loop aborted on empty pod list (HIGH). With `set -euo
  pipefail`, the routine post-install window where no pods are visible
  yet caused `grep -c .` to exit 1, killing the script on the very first
  iteration before the wall-clock deadline could ever fire — defeating
  the bounded-wait intent. Guard the empty case explicitly. `wc -l`
  alone is also wrong because `echo ""` prints a newline.

* MYSQL_ROOT_PW skipped the placeholder check that DOCKER_PASSWORD,
  CLIENT_*, and PV_* now have (LOW). An operator who copied the example
  without editing this row passed the non-empty gate, then the literal
  __LEGACY_MYSQL_ROOT_PW__ went into mysqldump and Phase 1 blew up
  partway through with an opaque "Access denied" inside kubectl exec.
  Add the same `*__*__*` case guard right after the non-empty check.

* fix(migration-tools): make EFS_FS_OVERRIDE actually override (PR #74)

The pre-source assignment

    EFS_FS="${EFS_FS_OVERRIDE:-fs-06b3faf51675ff9f9}"

was a no-op: `source "$CONFIG"` runs immediately after and the example
config (and any real tenant-config.env derived from it) unconditionally
sets EFS_FS=fs-06b3faf51675ff9f9, so the env override was clobbered every
time. Operators thinking they were targeting a non-default EFS would
silently start AWS Backup on-demand jobs against the hard-coded prod
filesystem.

Move the override knob to AFTER source where env genuinely wins, drop
the hard-coded fallback, and require EFS_FS to be set somewhere (config
or override) before continuing.

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

* fix(client): release-scope SCC SA refs (v1.2.2) (#78)

Bugbot caught a High-severity miss in v1.2.0's release-scoping work
(PR #72). The OpenShift SCC template was the one resource-monitor file
not updated when the literal `tracebloc-resource-monitor` ServiceAccount
name moved to `<Release.Name>-resource-monitor`. On OpenShift the SCC
granted access to a SA name that no longer existed, so the resource-
monitor DaemonSet pods would fail to launch (no SCC -> can't mount
hostPath /proc and /sys for node metrics).

The SCC's metadata.name + ClusterRole.name + ClusterRoleBinding.name
were ALREADY release-scoped (`tracebloc-resource-monitor-<release>` /
`tracebloc-resource-monitor-scc-<release>`), so this slipped through —
casual reading suggested it was already done.

Touchpoints in resource-monitor-scc.yaml:
- users[0]: now {{ include "tracebloc.resourceMonitorName" . }}
- ClusterRoleBinding subjects[0].name: same helper
- All `app: tracebloc-resource-monitor` labels: same helper, for
  consistency with the rest of the chart's resource-monitor templates
- Updated the kubernetes.io/description SCC annotation prose so the
  literal name doesn't appear there either (cosmetic, but easier to
  audit "no literal references" with a single grep).

Tests:
- platform_test.yaml gains 3 new cases: SCC users[0] points at
  release-scoped SA, ClusterRoleBinding subject does too, and two
  releases (stg + cisco/hasan-prod) produce non-colliding SA references.
- node_agents_namespace_test.yaml had a regression assertion checking
  the OLD literal name in users[0]; updated to the new release-scoped
  form (`RELEASE-NAME-resource-monitor`, helm-unittest's default
  release name when none is set).
- 98 -> 102 passing.

Verified end-to-end with two side-by-side `helm template` runs:
- stg     -> users[0] = system:serviceaccount:tracebloc-node-agents:stg-resource-monitor
- hasan-prod -> users[0] = system:serviceaccount:tracebloc-node-agents:hasan-prod-resource-monitor

Chart bumped 1.2.1 -> 1.2.2 (patch — restores OpenShift parity that
v1.2.0 inadvertently broke).

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

* fix: NOTES.txt rename + generator chart-version drift (v1.2.3) — bugbot follow-up #2 (#80)

* fix(client): release-scope SCC SA refs (v1.2.2)

Bugbot caught a High-severity miss in v1.2.0's release-scoping work
(PR #72). The OpenShift SCC template was the one resource-monitor file
not updated when the literal `tracebloc-resource-monitor` ServiceAccount
name moved to `<Release.Name>-resource-monitor`. On OpenShift the SCC
granted access to a SA name that no longer existed, so the resource-
monitor DaemonSet pods would fail to launch (no SCC -> can't mount
hostPath /proc and /sys for node metrics).

The SCC's metadata.name + ClusterRole.name + ClusterRoleBinding.name
were ALREADY release-scoped (`tracebloc-resource-monitor-<release>` /
`tracebloc-resource-monitor-scc-<release>`), so this slipped through —
casual reading suggested it was already done.

Touchpoints in resource-monitor-scc.yaml:
- users[0]: now {{ include "tracebloc.resourceMonitorName" . }}
- ClusterRoleBinding subjects[0].name: same helper
- All `app: tracebloc-resource-monitor` labels: same helper, for
  consistency with the rest of the chart's resource-monitor templates
- Updated the kubernetes.io/description SCC annotation prose so the
  literal name doesn't appear there either (cosmetic, but easier to
  audit "no literal references" with a single grep).

Tests:
- platform_test.yaml gains 3 new cases: SCC users[0] points at
  release-scoped SA, ClusterRoleBinding subject does too, and two
  releases (stg + cisco/hasan-prod) produce non-colliding SA references.
- node_agents_namespace_test.yaml had a regression assertion checking
  the OLD literal name in users[0]; updated to the new release-scoped
  form (`RELEASE-NAME-resource-monitor`, helm-unittest's default
  release name when none is set).
- 98 -> 102 passing.

Verified end-to-end with two side-by-side `helm template` runs:
- stg     -> users[0] = system:serviceaccount:tracebloc-node-agents:stg-resource-monitor
- hasan-prod -> users[0] = system:serviceaccount:tracebloc-node-agents:hasan-prod-resource-monitor

Chart bumped 1.2.1 -> 1.2.2 (patch — restores OpenShift parity that
v1.2.0 inadvertently broke).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix: NOTES.txt rename + generator chart-version drift (v1.2.3)

Bugbot follow-up to the v1.2.0/1.2.2 rename work. Two fresh issues:

1. (Medium) NOTES.txt:9 still hardcoded the literal
   `tracebloc-resource-monitor` for the resource-monitor DaemonSet
   display, while the actual DaemonSet name has been
   `<release>-resource-monitor` since v1.2.0. Operators see one name
   in the post-install banner and a different name when they
   `kubectl get ds`. Now routes through the same
   tracebloc.resourceMonitorName helper as the rest of the chart.

2. (Low) docs/migration-tools/generate.sh hardcoded
   `app.kubernetes.io/version: "1.1.0"` and `helm.sh/chart: client-1.1.0`
   on every pre-create PVC. The chart has moved through 1.1.0 → 1.2.3,
   and operators running generate.sh today get PVC labels stuck at
   1.1.0 even though the install ahead is 1.2.3. Helm adoption itself
   is unaffected (it keys on meta.helm.sh/release-name, not the chart
   label), but the labels lie until a subsequent upgrade reconciles
   them, and `kubectl get pvc -L helm.sh/chart` is misleading during
   migration debugging. Fixed by reading name + version from
   client/Chart.yaml at generate time.

Plus a few stale prose references caught while auditing the same path
(no functional impact, but the doc was directing operators at "client
fix in 1.2.0" as if it were still pending):

- generate.sh inline comment on `resourceMonitor: false` rephrased
  from "until client-1.2.0 is published" to "until you have verified
  the chart you're installing is 1.2.0+"
- migrate-tenant.sh banner relabelled from "v1.1.0 spec sanity" to
  "mysql spec sanity (v1.1.0+ shape: ...)"
- README.md skip table cell on `resourceMonitor: false` rewritten to
  reflect that 1.2.0+ has shipped — operators on >=1.2.0 can flip it
  to true without colliding with the stg release

Tests: 102 → 105 passing. New `client/tests/notes_test.yaml` covers:
- Release-scoped resource-monitor name appears in NOTES.txt
- A different release renders a different name (proves the helper
  isn't accidentally hardcoded)
- Negative regex guards against the literal `tracebloc-resource-monitor`
  reappearing followed by a non-suffix character (i.e. the bare
  pre-1.2.3 form, while still letting the SCC line `tracebloc-
  resource-monitor-<release>` further down the file pass)
- `resourceMonitor: false` removes the line entirely

End-to-end smoke of generate.sh confirms PVCs ship with the live chart
version (`helm.sh/chart: client-1.2.3` after this commit, verified
against /tmp/tracebloc-migration-<demo>/pvcs.yaml).

Stacked on PR #78 (v1.2.2 SCC fix), so this branch already contains
the SCC SA-ref rename. Once #78 lands the diff against develop will
reduce to just this commit.

Chart bumped 1.2.2 → 1.2.3 (patch — operator-facing string fix +
tooling correctness).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

* docs(claude): require @saadqbal as PR assignee (#79)

Convention captured after a session-end ask. Every PR Claude opens for
this repo must be assigned to saadqbal — orphaned PRs without an
assignee fall through the review queue.

Pass --assignee @me on `gh pr create` (or --assignee saadqbal if running
unauthenticated). No exceptions.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: lukasWuttke <54042461+LukasWodka@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

* chore(client): bump chart to 1.2.3 for release

* …loses #70) (#83) (#84)

The chart unification (4 per-platform charts -> unified client/ chart)
shipped in v1.1.0; the unified chart has now been at v1.2.x in production
across stg + hasan-prod for several releases. Time to retire the legacy
artifacts.

Removed:
- aks/, bm/, eks/, oc/ chart directories — 75 files, ~330KB. Each had a
  DEPRECATED.md pointing at the unified chart for ~6 months.
- 7 stale .tgz tarballs at repo root (aks-1.0.3, aks-1.0.4, bm-1.0.3,
  bm-1.0.4, eks-1.0.3, eks-1.0.4, oc-1.0.4). The release workflow
  publishes via gh-pages; these checked-in builds were dead weight.
- Root index.yaml — stale snapshot listing only 1.0.3/1.0.4 of the
  legacy charts. The live index served at tracebloc.github.io/client
  is on the gh-pages branch and is the source of truth.
- mysql.yaml at repo root — orphaned PVC manifest with hardcoded volume
  UUID and namespace. Audited: zero references anywhere in the repo.

Other:
- Added *.tgz to .gitignore so chart packages don't sneak back in.
- Updated client/MIGRATION.md Rollback section. The old "the legacy
  charts remain in aks/, bm/, eks/, oc/ and can be used at any time"
  was about to become a lie. Replaced with instructions to recover the
  directory from git history if anyone genuinely needs the old chart.

Verification:
- helm lint --strict ./client -f client/ci/eks-values.yaml — clean
  (same invocation the release workflow runs on every tag)
- helm unittest client — 105/105 still passing
- helm package ./client -d /tmp — produces a valid client-1.2.3.tgz

Net diff: 86 files changed, 17 insertions(+), 3447 deletions(-).

Co-authored-by: Lukas Wuttke <lukas@tracebloc.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: lukasWuttke <54042461+LukasWodka@users.noreply.github.com>

* Prod: Implement self-upgrade CronJob for Helm chart automation

* chore: revert default CODEOWNERS — keep narrow security rules only (#92)

* Merge pull request #91 from tracebloc/chore/bump-chart-1.3.1

chore(client): bump chart 1.3.0 -> 1.3.1 (auto-upgrade verification)

---------

Co-authored-by: lukasWuttke <54042461+LukasWodka@users.noreply.github.com>

* docs: fix README Deploy section (Helm not docker), surface in-repo docs

The Deploy section opened with `docker pull tracebloc/client:latest`,
but this repo ships a Helm chart — the actual install is `helm install`.
External walkthrough URLs (`/local-linux`, `/local-macos`, `/aws`,
`/deployment-overview`) didn't match any path in the tracebloc/docs
tree, so they 404. The in-repo documentation (`docs/INSTALL.md`,
`docs/MIGRATIONS.md`, `docs/migration-tools/README.md`,
`client/MIGRATION.md`) was never linked from the README despite being
the operational source of truth.

Surgical change — the rest of the README stays as-is:
- Replace `docker pull` with `helm repo add` + `helm install` (matches
  docs/INSTALL.md)
- Call out chart version (v1.3.1) and platform support (AKS / EKS /
  bare-metal / OpenShift) up front
- Table linking every in-repo operational doc
- Fix external URLs to match actual tracebloc/docs paths
  (local-deployment-guide-linux, local-deployment-guide-macos,
  eks-client-deployment-guide, azure-deployment-guide)
- Pull NetworkPolicy/CNI prerequisite into a callout

Closes #101

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: surface standalone installer in README and INSTALL.md

The standalone installer (bash <(curl -fsSL tracebloc.io/i.sh) /
irm tracebloc.io/i.ps1 | iex) is the one-command path for evaluation,
local dev, and first-time installs — it provisions a cluster, detects
GPU drivers, and deploys the client. Today it isn't documented anywhere
reachable from this repo, so readers see the multi-step helm install
flow as the only option.

README:
- New "Quick install" subsection at the top of Deploy with macOS/Linux
  and Windows commands, brief description of what it does, and a
  pointer to the local helper scripts under scripts/
- Existing helm flow relabeled as "Helm install (production)" — now
  positioned as the option for existing production clusters

docs/INSTALL.md:
- Top-of-doc callout pointing at the standalone installer for
  non-production users
- Production-focused content untouched

Closes #103

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previous wording ("Best for evaluation, local dev, and first-time
installs" / "Just trying it out? For local dev or a quick evaluation")
implied the standalone installer produces a lesser/demo client. It
doesn't — it produces the same full client, just on a cluster the
script provisions for you.

Reframes the differentiator around cluster ownership instead of install
quality:
- README: "Use this when you don't already have a cluster — the result
  is a full client install, not a demo." Helm subsection retitled
  from "Helm install (production)" to just "Helm install" with
  "For existing Kubernetes clusters".
- INSTALL.md: callout opens with "Don't have a Kubernetes cluster
  yet?" and emphasizes "a full tracebloc client".

Refs #103
curl and PowerShell's irm both default to HTTP when no scheme is
specified, so `curl -fsSL tracebloc.io/i.sh` and `irm tracebloc.io/i.ps1`
issue plaintext requests. The downloaded body is piped straight into
bash / iex, so a network-level attacker between the user and tracebloc.io
could MITM the response and inject arbitrary code.

Add explicit `https://` to every installer URL in README.md and
docs/INSTALL.md so the request is encrypted from the first byte.

Refs #103
…main

ci: bootstrap FR-flow callers on main