Context
The current scheduler (forge-core/scheduler/scheduler.go) uses a
single 30s ticker goroutine and persists state to
<WorkDir>/.forge/memory/SCHEDULES.md. Three operational problems
when running as a container:
- No persistence by default — LLM-set schedules (via the
schedule_set builtin tool) and run history disappear on pod
restart unless a PVC is mounted at <WorkDir>/.forge.
- Not horizontally safe — two replicas both ticking on the same
SCHEDULES.md fire every schedule twice and race on file rename.
- Invisible to standard K8s tooling — operators can't
kubectl get cronjobs to inspect what's scheduled.
K8s already solves all three (cluster-singleton CronJob controller,
durable in etcd, native kubectl integration). The right
architecture is a hybrid backend that picks the cluster when running
in-cluster and falls back to the file store otherwise.
Proposal — Two parts
Part 1 — Hybrid scheduler backend (runtime side)
A new ScheduleBackend interface with two implementations chosen by
environment detection at startup:
// forge-core/scheduler/backend.go (new)
type ScheduleBackend interface {
	Sync(ctx context.Context, entries []Schedule) error // declarative; startup + hot-reload
	Add(ctx context.Context, s Schedule) error          // dynamic; from schedule_set tool
	Delete(ctx context.Context, id string) error
	List(ctx context.Context) ([]Schedule, error)
}

type FileBackend struct { ... }       // wraps the existing MemoryScheduleStore + ticker
type KubernetesBackend struct { ... } // uses k8s.io/client-go to CRUD CronJobs
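To make the split between the declarative and dynamic methods concrete, here is a minimal in-memory sketch of the interface, of the kind a unit test might use. It is not part of the proposal: `MemoryBackend`, the `Schedule` fields, and the rule that `Delete` only touches dynamic entries are all assumptions made for illustration.

```go
package main

import (
	"context"
	"fmt"
	"sort"
	"sync"
)

// Schedule is a hypothetical stand-in for the real forge-core type.
type Schedule struct {
	ID   string
	Cron string
}

// ScheduleBackend repeats the interface proposed above.
type ScheduleBackend interface {
	Sync(ctx context.Context, entries []Schedule) error
	Add(ctx context.Context, s Schedule) error
	Delete(ctx context.Context, id string) error
	List(ctx context.Context) ([]Schedule, error)
}

// MemoryBackend is an illustrative in-memory implementation (assumed, not in the proposal).
type MemoryBackend struct {
	mu       sync.Mutex
	declared map[string]Schedule // replaced wholesale by Sync (forge.yaml entries)
	dynamic  map[string]Schedule // managed by Add and Delete (schedule_set entries)
}

func NewMemoryBackend() *MemoryBackend {
	return &MemoryBackend{declared: map[string]Schedule{}, dynamic: map[string]Schedule{}}
}

// Sync replaces the declared set and leaves dynamic entries untouched.
func (m *MemoryBackend) Sync(_ context.Context, entries []Schedule) error {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.declared = make(map[string]Schedule, len(entries))
	for _, s := range entries {
		m.declared[s.ID] = s
	}
	return nil
}

// Add stores a dynamic schedule; it refuses to shadow a declared one.
func (m *MemoryBackend) Add(_ context.Context, s Schedule) error {
	m.mu.Lock()
	defer m.mu.Unlock()
	if _, ok := m.declared[s.ID]; ok {
		return fmt.Errorf("schedule %q is declared in forge.yaml", s.ID)
	}
	m.dynamic[s.ID] = s
	return nil
}

// Delete removes a dynamic schedule only (an assumption of this sketch).
func (m *MemoryBackend) Delete(_ context.Context, id string) error {
	m.mu.Lock()
	defer m.mu.Unlock()
	if _, ok := m.dynamic[id]; !ok {
		return fmt.Errorf("no dynamic schedule %q", id)
	}
	delete(m.dynamic, id)
	return nil
}

// List returns declared and dynamic schedules sorted by ID.
func (m *MemoryBackend) List(_ context.Context) ([]Schedule, error) {
	m.mu.Lock()
	defer m.mu.Unlock()
	out := make([]Schedule, 0, len(m.declared)+len(m.dynamic))
	for _, s := range m.declared {
		out = append(out, s)
	}
	for _, s := range m.dynamic {
		out = append(out, s)
	}
	sort.Slice(out, func(i, j int) bool { return out[i].ID < out[j].ID })
	return out, nil
}

var _ ScheduleBackend = (*MemoryBackend)(nil) // compile-time interface check

// demoIDs syncs two declared schedules, adds a dynamic one, then re-syncs
// with one declared schedule removed, and returns the surviving IDs.
func demoIDs() []string {
	ctx := context.Background()
	b := NewMemoryBackend()
	_ = b.Sync(ctx, []Schedule{{ID: "daily", Cron: "0 9 * * *"}, {ID: "weekly", Cron: "0 9 * * 1"}})
	_ = b.Add(ctx, Schedule{ID: "adhoc", Cron: "*/5 * * * *"})
	_ = b.Sync(ctx, []Schedule{{ID: "daily", Cron: "0 9 * * *"}})
	all, _ := b.List(ctx)
	ids := make([]string, 0, len(all))
	for _, s := range all {
		ids = append(ids, s.ID)
	}
	return ids
}

func main() {
	fmt.Println(demoIDs()) // prints [adhoc daily]
}
```

The second `Sync` drops "weekly" but keeps "adhoc", which is the ownership rule both real backends would need to honor.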
Detection:
func InCluster() bool {
	_, err := os.Stat("/var/run/secrets/kubernetes.io/serviceaccount/token")
	return err == nil
}
forge.yaml escape hatch:
scheduler:
  backend: auto            # auto | file | kubernetes
  kubernetes:
    namespace: ""          # defaults to the pod's own namespace
    service_url: ""        # the agent's in-cluster Service URL (required in k8s mode)
    allow_dynamic: false   # whether schedule_set (LLM-driven) can create CronJobs at runtime
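As a rough sketch of how detection and the escape hatch might combine at startup: the `SchedulerConfig` struct and `ResolveBackend` function below are hypothetical names, and treating an empty `backend` as `auto` is an assumption. The only rules taken from the text are the three allowed values and the requirement that `service_url` be set in Kubernetes mode.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// SchedulerConfig mirrors the forge.yaml block above; the Go field names are assumed.
type SchedulerConfig struct {
	Backend    string // "auto" | "file" | "kubernetes"; empty is treated as "auto" here
	ServiceURL string // scheduler.kubernetes.service_url
}

// InCluster reports whether a ServiceAccount token is mounted in the pod.
func InCluster() bool {
	_, err := os.Stat("/var/run/secrets/kubernetes.io/serviceaccount/token")
	return err == nil
}

// ResolveBackend returns "file" or "kubernetes" for the given config and
// detection result. It takes inCluster as a parameter so it can be tested
// without touching the filesystem.
func ResolveBackend(cfg SchedulerConfig, inCluster bool) (string, error) {
	mode := cfg.Backend
	if mode == "" {
		mode = "auto"
	}
	switch mode {
	case "file":
		return "file", nil
	case "auto":
		if !inCluster {
			return "file", nil
		}
	case "kubernetes":
		// Explicit override: skip detection entirely.
	default:
		return "", fmt.Errorf("unknown scheduler.backend %q", mode)
	}
	if cfg.ServiceURL == "" {
		return "", errors.New("scheduler.kubernetes.service_url is required in kubernetes mode")
	}
	return "kubernetes", nil
}

func main() {
	backend, err := ResolveBackend(SchedulerConfig{Backend: "auto"}, InCluster())
	if err != nil {
		fmt.Println("config error:", err)
		return
	}
	fmt.Println("backend:", backend)
}
```

Failing fast on a missing `service_url` keeps a misconfigured pod from silently creating CronJobs that point nowhere.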
Part 2 — forge package generates CronJob manifests
Today forge package emits a Deployment + Service + ConfigMap for
the agent. It should also emit one CronJob per entry in
forge.yaml schedules[]. Operators then kubectl apply -k ./k8s
once and get the agent + every declarative schedule materialized as
real CronJobs — no runtime CRUD calls needed for static schedules.
The runtime KubernetesBackend's Sync() reconciles in case someone
edits forge.yaml between deployments, but the steady-state expectation
is "declarative schedules are baked into the deploy manifest, dynamic
ones go through the API."
This also covers the case where the operator wants K8s-native
scheduling without granting the agent pod RBAC to create CronJobs —
set scheduler.kubernetes.allow_dynamic: false (the default), let
forge package generate the CronJobs, agent only needs RBAC to
list/get for the schedule_list tool.
Authentication — reuse the existing loopback token
Forge already mints an internal bearer token for channel plugins to
call back into the A2A endpoint:
- Runner.ResolveAuth() generates r.authToken at startup
  (runner.go:201,215).
- Stored via auth.StoreToken(WorkDir, token) to
  <WorkDir>/.forge/runtime.token with 0600 permissions
  (forge-core/auth/token.go:47) so internal callers can read it.
- A static_token auth provider is prepended to the chain keyed on
  this token, identity {UserID: "forge-internal", Source: "internal"}
  (runner.go:2425-2436).
- Channel plugins consume it via Runner.AuthToken().
CronJobs reuse the exact same token. No new auth surface to
design. CronJobs send the token as Authorization: Bearer — the
existing loopback static_token provider validates it, the
auth_verify event lands with Source: "internal" identical to a
channel callback.
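The validation path a CronJob request would go through can be sketched roughly as follows. This is not the code at runner.go:2425-2436; `staticTokenProvider`, `Verify`, and the `Identity` struct are hypothetical names, and the constant-time comparison is a choice made for this sketch. Only the identity values come from the text above.

```go
package main

import (
	"crypto/subtle"
	"fmt"
	"strings"
)

// Identity carries the fields named in the proposal.
type Identity struct {
	UserID string
	Source string
}

// staticTokenProvider is a hypothetical stand-in for the loopback provider.
type staticTokenProvider struct {
	token string
}

// Verify checks an Authorization header value against the stored token.
func (p staticTokenProvider) Verify(authorization string) (Identity, bool) {
	const prefix = "Bearer "
	if p.token == "" || !strings.HasPrefix(authorization, prefix) {
		return Identity{}, false
	}
	presented := strings.TrimPrefix(authorization, prefix)
	// Constant-time comparison avoids leaking how many leading bytes matched.
	if subtle.ConstantTimeCompare([]byte(presented), []byte(p.token)) != 1 {
		return Identity{}, false
	}
	return Identity{UserID: "forge-internal", Source: "internal"}, true
}

func main() {
	p := staticTokenProvider{token: "example-token"}
	id, ok := p.Verify("Bearer example-token")
	fmt.Println(id.UserID, id.Source, ok) // prints forge-internal internal true
}
```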
Token provisioning — manifest is a template, NOT a credential
forge package MUST NOT embed the token value in any generated
file. The build pipeline runs in CI / developer workstations and
its output ends up in git repos, container registries, and operator
laptops — none of which are appropriate places for a long-lived
bearer token. Base64-encoded data inside a Secret manifest is
plaintext as far as version control is concerned.
Instead, forge package emits a Secret template with no data
field, plus inline instructions for the operator:
# k8s/internal-token-secret.yaml — generated by `forge package`
apiVersion: v1
kind: Secret
metadata:
  name: my-agent-internal-token
  namespace: default
  labels:
    forge.agent.id: my-agent
type: Opaque
# data:
#   token: <BASE64-OF-RUNTIME-TOKEN>
#
# This Secret is intentionally generated WITHOUT a `data` field.
# The token is a security credential and must not be checked into
# version control. Populate it once per deployment via one of:
#
# 1. kubectl create secret generic my-agent-internal-token \
#      --from-literal=token="$(cat .forge/runtime.token)" \
#      -n default --dry-run=client -o yaml | kubectl apply -f -
#
# 2. Use your secret-manager operator (ExternalSecrets / Sealed
#    Secrets / SOPS / Vault Agent Injector) and point its
#    external-secret manifest at this Secret name.
#
# 3. For first-deploy bootstrap from a clean checkout where no
#    .forge/runtime.token exists yet, run `forge auth show-token`
#    against the deployed pod (after the Deployment has minted a
#    token on its volume) or pre-mint one with
#    `forge auth mint-token`.
The Deployment manifest references the Secret by name as today;
applying the Deployment before the Secret exists leaves the pod
stuck in ContainerCreating with a FailedMount event reading
MountVolume.SetUp failed for volume "internal-token": secret "my-agent-internal-token" not found.
Operators get a loud "you forgot the token" signal rather than a
silent fallback.
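The out-of-band step (minting a token and turning it into a populated Secret that is piped to kubectl rather than written to disk) might look roughly like this. The function names are hypothetical, and the token format (32 random bytes, hex-encoded) is an assumption; the source does not say how Forge formats its token.

```go
package main

import (
	"crypto/rand"
	"encoding/base64"
	"encoding/hex"
	"fmt"
)

// mintToken returns a fresh random token. The 32-byte hex format is assumed.
func mintToken() (string, error) {
	buf := make([]byte, 32)
	if _, err := rand.Read(buf); err != nil {
		return "", err
	}
	return hex.EncodeToString(buf), nil
}

// secretYAML renders a populated Secret. Its output is a credential and is
// meant for stdout only, never for a file under version control.
func secretYAML(name, namespace, token string) string {
	return fmt.Sprintf(`apiVersion: v1
kind: Secret
metadata:
  name: %s
  namespace: %s
type: Opaque
data:
  token: %s
`, name, namespace, base64.StdEncoding.EncodeToString([]byte(token)))
}

func main() {
	token, err := mintToken()
	if err != nil {
		fmt.Println("mint failed:", err)
		return
	}
	// Intended use: pipe this output straight into `kubectl apply -f -`.
	fmt.Print(secretYAML("my-agent-internal-token", "default", token))
}
```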
A new forge auth subcommand (small follow-on, in scope here):
| Command | Behavior |
|---|---|
| forge auth show-token | Print the token from <WorkDir>/.forge/runtime.token. Exit 1 + clear error if absent. |
| forge auth mint-token | Generate a fresh token, store it via auth.StoreToken, print it to stdout. Useful for first-time deploys. |
| forge auth secret-yaml | Print a ready-to-apply Secret YAML with the token loaded from local store. Pipe straight to kubectl apply -f -. Default behavior matches the option-1 example above but in one command. |
These belong in the same PR as the K8s backend because they're the
operator-facing primitives that close the loop on the
"manifest-without-credential" decision.
Generated CronJob shape
For each forge.yaml schedules[] entry:
apiVersion: batch/v1
kind: CronJob
metadata:
  name: forge-aibuilderdemo-daily-summary
  namespace: default
  labels:
    forge.agent.id: aibuilderdemo
    forge.schedule.id: daily-summary
    forge.schedule.source: yaml   # "yaml" or "llm"
spec:
  schedule: "0 9 * * *"
  concurrencyPolicy: Forbid       # K8s-native overlap prevention
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 3
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: trigger
              image: curlimages/curl:8.10.1
              env:
                - name: FORGE_AUTH_TOKEN
                  valueFrom:
                    secretKeyRef:
                      name: my-agent-internal-token
                      key: token
                # Kubernetes only expands $(VAR) for declared env vars and
                # never runs a shell on args, so $(date +%s) would reach the
                # agent verbatim. The pod name is unique per fire and is used
                # here instead (one possible fix; the variable name is arbitrary).
                - name: JOB_POD_NAME
                  valueFrom:
                    fieldRef:
                      fieldPath: metadata.name
              args:
                - -sX
                - POST
                - http://my-agent.default.svc:8383/
                - -H
                - "Authorization: Bearer $(FORGE_AUTH_TOKEN)"
                - -H
                - "X-Forge-Schedule-Id: daily-summary"
                - -H
                - "Content-Type: application/json"
                - --data
                - '{"jsonrpc":"2.0","id":"1","method":"tasks/send","params":{"id":"sched-daily-summary-$(JOB_POD_NAME)","message":{"role":"user","parts":[{"type":"text","text":"<schedule description from forge.yaml>"}]}}}'
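concurrencyPolicy: Forbid is K8s's native equivalent of the current
scheduler's schedule_skip on overlap: same semantics, at no
implementation cost. The equivalence assumes the trigger's request
stays open until the agent finishes the task, because Forbid only
checks whether the previous Job is still running.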
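A pure-Go generator for this manifest shape, with no client-go dependency, could be sketched with text/template as below. `CronJobParams` and `RenderCronJob` are hypothetical names. The sketch makes three assumptions that are not in the proposal: IDs are already DNS-label-safe, the per-fire task id is derived from the pod name via a `POD_NAME` env var, and `$(` in the prompt is escaped as `$$(` so the kubelet does not substitute container env vars (such as the token) into user-supplied text.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"strings"
	"text/template"
)

// CronJobParams holds the generator inputs; all field names are illustrative.
type CronJobParams struct {
	AgentID, ScheduleID, Namespace, Cron, ServiceURL, SecretName, Prompt string
}

// yamlQuote renders s as a YAML single-quoted scalar.
func yamlQuote(s string) string { return "'" + strings.ReplaceAll(s, "'", "''") + "'" }

var cronJobTmpl = template.Must(template.New("cronjob").
	Funcs(template.FuncMap{"q": yamlQuote}).Parse(`apiVersion: batch/v1
kind: CronJob
metadata:
  name: {{.Name}}
  namespace: {{.P.Namespace}}
  labels:
    forge.agent.id: {{.P.AgentID}}
    forge.schedule.id: {{.P.ScheduleID}}
    forge.schedule.source: yaml
spec:
  schedule: {{q .P.Cron}}
  concurrencyPolicy: Forbid
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: trigger
              image: curlimages/curl:8.10.1
              env:
                - name: FORGE_AUTH_TOKEN
                  valueFrom:
                    secretKeyRef: {name: {{.P.SecretName}}, key: token}
                - name: POD_NAME
                  valueFrom:
                    fieldRef: {fieldPath: metadata.name}
              args:
                - -sX
                - POST
                - {{q .P.ServiceURL}}
                - -H
                - 'Authorization: Bearer $(FORGE_AUTH_TOKEN)'
                - -H
                - {{q (print "X-Forge-Schedule-Id: " .P.ScheduleID)}}
                - -H
                - 'Content-Type: application/json'
                - --data
                - {{q .Body}}
`))

// RenderCronJob returns the CronJob YAML for one schedules[] entry.
func RenderCronJob(p CronJobParams) (string, error) {
	name := "forge-" + p.AgentID + "-" + p.ScheduleID
	if len(name) > 52 { // Kubernetes caps CronJob names at 52 characters
		return "", fmt.Errorf("CronJob name %q is longer than 52 characters", name)
	}
	prompt := strings.ReplaceAll(p.Prompt, "$(", "$$(") // block kubelet $(VAR) expansion
	body, err := json.Marshal(map[string]any{
		"jsonrpc": "2.0", "id": "1", "method": "tasks/send",
		"params": map[string]any{
			"id": "sched-" + p.ScheduleID + "-$(POD_NAME)", // expanded by the kubelet
			"message": map[string]any{
				"role":  "user",
				"parts": []map[string]string{{"type": "text", "text": prompt}},
			},
		},
	})
	if err != nil {
		return "", err
	}
	var buf bytes.Buffer
	err = cronJobTmpl.Execute(&buf, struct {
		P          CronJobParams
		Name, Body string
	}{p, name, string(body)})
	return buf.String(), err
}

func main() {
	out, err := RenderCronJob(CronJobParams{
		AgentID: "aibuilderdemo", ScheduleID: "daily-summary", Namespace: "default",
		Cron: "0 9 * * *", ServiceURL: "http://my-agent.default.svc:8383/",
		SecretName: "my-agent-internal-token", Prompt: "Summarize yesterday's activity",
	})
	if err != nil {
		fmt.Println("render failed:", err)
		return
	}
	fmt.Print(out)
}
```

Building the JSON-RPC body with encoding/json rather than string concatenation keeps quotes in the schedule description from breaking the payload.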
Audit-event linkage
The agent recognizes a scheduled fire by the X-Forge-Schedule-Id
request header. Middleware reads it at the A2A boundary and stashes
it in ctx alongside the existing workflow / tenancy context. The
runner emits schedule_fire itself before dispatching, then the
normal session_start → llm_call → invocation_complete chain runs,
capped with schedule_complete. Same audit shape as today; the
cluster is just the remote ticker.
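The header-to-context step and the resulting event order can be sketched as follows. `WithScheduleID`, `ScheduleIDFrom`, and `auditTrail` are hypothetical names, and the event strings are simplified stand-ins for the real audit records; the inner handler only simulates the runner's dispatch.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"net/http/httptest"
)

const scheduleHeader = "X-Forge-Schedule-Id"

// scheduleIDKey is an unexported context key type, so no other package can collide with it.
type scheduleIDKey struct{}

// WithScheduleID copies the schedule header, if present, into the request context.
func WithScheduleID(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if id := r.Header.Get(scheduleHeader); id != "" {
			r = r.WithContext(context.WithValue(r.Context(), scheduleIDKey{}, id))
		}
		next.ServeHTTP(w, r)
	})
}

// ScheduleIDFrom returns the schedule ID stashed by WithScheduleID.
func ScheduleIDFrom(ctx context.Context) (string, bool) {
	id, ok := ctx.Value(scheduleIDKey{}).(string)
	return id, ok
}

// auditTrail sends one request through the middleware and records the
// events a runner might emit around the dispatch.
func auditTrail(headerValue string) []string {
	var events []string
	h := WithScheduleID(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		id, scheduled := ScheduleIDFrom(r.Context())
		if scheduled {
			events = append(events, "schedule_fire:"+id)
		}
		events = append(events, "session_start", "llm_call", "invocation_complete")
		if scheduled {
			events = append(events, "schedule_complete:"+id)
		}
	}))
	req := httptest.NewRequest(http.MethodPost, "/", nil)
	if headerValue != "" {
		req.Header.Set(scheduleHeader, headerValue)
	}
	h.ServeHTTP(httptest.NewRecorder(), req)
	return events
}

func main() {
	fmt.Println(auditTrail("daily-summary"))
	fmt.Println(auditTrail(""))
}
```

Since any caller can set a header, a real implementation would probably want to honor it only on requests that authenticated with Source: "internal"; that gating is left out of this sketch.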
RBAC
In the KubernetesBackend at runtime the agent's ServiceAccount needs:
- apiGroups: ["batch"]
  resources: ["cronjobs"]
  verbs:
    - get    # always
    - list   # always (powers schedule_list tool)
    - create # only when allow_dynamic: true
    - patch  # only when allow_dynamic: true
    - delete # only when allow_dynamic: true OR a yaml schedule was removed
forge package emits a Role + RoleBinding scoped to the agent's
own namespace with the minimum verbs based on
scheduler.kubernetes.allow_dynamic. Default false → get,
list only.
Granting create/delete is a meaningful privilege escalation:
allow_dynamic: true essentially means "let the LLM schedule
arbitrary HTTP calls back to me". Document loudly.
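The verb selection that forge package would perform is small enough to sketch directly. `roleVerbs` and `roleRuleYAML` are hypothetical names; the sketch follows the stated default (get and list only when allow_dynamic is false) and does not model the "yaml schedule was removed" case for delete.

```go
package main

import (
	"fmt"
	"strings"
)

// roleVerbs returns the minimum CronJob verbs for the given allow_dynamic setting.
func roleVerbs(allowDynamic bool) []string {
	verbs := []string{"get", "list"}
	if allowDynamic {
		verbs = append(verbs, "create", "patch", "delete")
	}
	return verbs
}

// roleRuleYAML renders the single rule entry that goes under a Role's `rules:` key.
func roleRuleYAML(allowDynamic bool) string {
	var b strings.Builder
	b.WriteString("- apiGroups: [\"batch\"]\n  resources: [\"cronjobs\"]\n  verbs:\n")
	for _, v := range roleVerbs(allowDynamic) {
		fmt.Fprintf(&b, "    - %s\n", v)
	}
	return b.String()
}

func main() {
	fmt.Print(roleRuleYAML(false))
}
```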
On restart — the user-described behavior
KubernetesBackend's Sync() is idempotent. On restart:
- List all CronJobs in the namespace with label
forge.agent.id=<self>.
- For each forge.yaml entry: if CronJob exists with matching spec →
leave. If exists with stale spec → patch. If absent → create.
- For each existing CronJob NOT in forge.yaml AND labeled
forge.schedule.source: yaml → delete (handles renamed/removed
schedules).
- CronJobs labeled forge.schedule.source: llm are left alone (the
  LLM owns those; user code shouldn't reap them on restart).
Steady state: the cluster is the source of truth. schedule_list
returns the live CronJob set. No SCHEDULES.md to keep in sync.
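The restart rules above reduce to a pure planning function that can be unit-tested without a cluster. In this sketch `Desired`, `Existing`, `Plan`, and `planSync` are hypothetical names, and the CronJob spec is collapsed to a single comparable string for brevity. It does not handle a forge.yaml ID colliding with an llm-labeled CronJob, which the proposal leaves unspecified.

```go
package main

import "fmt"

// Desired is one forge.yaml schedules[] entry.
type Desired struct{ ID, Spec string }

// Existing is one live CronJob carrying the forge.agent.id=<self> label.
type Existing struct{ ID, Spec, Source string } // Source is "yaml" or "llm"

// Plan lists the schedule IDs each API verb would be applied to.
type Plan struct{ Create, Patch, Delete []string }

// planSync computes the create/patch/delete sets for one reconcile pass.
func planSync(desired []Desired, existing []Existing) Plan {
	live := make(map[string]Existing, len(existing))
	for _, e := range existing {
		live[e.ID] = e
	}
	var p Plan
	wanted := make(map[string]bool, len(desired))
	for _, d := range desired {
		wanted[d.ID] = true
		cur, ok := live[d.ID]
		switch {
		case !ok:
			p.Create = append(p.Create, d.ID)
		case cur.Spec != d.Spec:
			p.Patch = append(p.Patch, d.ID)
		}
	}
	for _, e := range existing {
		// Only yaml-sourced CronJobs are reaped; llm-sourced ones are left alone.
		if !wanted[e.ID] && e.Source == "yaml" {
			p.Delete = append(p.Delete, e.ID)
		}
	}
	return p
}

func main() {
	plan := planSync(
		[]Desired{{"a", "0 9 * * *"}, {"b", "0 10 * * *"}, {"c", "0 11 * * *"}},
		[]Existing{
			{"a", "0 9 * * *", "yaml"},
			{"b", "0 8 * * *", "yaml"},
			{"d", "0 12 * * *", "yaml"},
			{"e", "*/5 * * * *", "llm"},
		},
	)
	fmt.Println(plan.Create, plan.Patch, plan.Delete) // prints [c] [b] [d]
}
```

Running the same inputs twice yields the same plan, and applying the plan makes the next pass empty, which is what makes Sync() idempotent.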
Local fallback unchanged
Outside the cluster — forge run on a laptop, CI, a non-k8s VM —
detection returns false, backend resolves to FileBackend, today's
30s-ticker + SCHEDULES.md behavior is byte-identical to current
main. No regression risk for the dev path.
Implementation footprint
~300-400 lines total:
| File | Change |
|---|---|
| forge-core/scheduler/backend.go (new) | ScheduleBackend interface |
| forge-core/scheduler/file_backend.go (new) | Wraps existing MemoryScheduleStore + ticker behind the interface |
| forge-core/scheduler/k8s_backend.go (new) | client-go based; uses BatchV1().CronJobs(ns) for CRUD |
| forge-core/scheduler/k8s_manifest.go (new) | Pure-Go CronJob YAML generation (no client-go dep; usable from forge package without API access) |
| forge-core/types/config.go | scheduler.backend + scheduler.kubernetes block |
| forge-cli/runtime/runner.go | Pick backend at startup; thread service URL + auth token into KubernetesBackend |
| forge-cli/build/k8s_stage.go (or wherever forge package emits manifests) | Emit one CronJob per schedules[] entry + the credential-less Secret template + the optional Role/RoleBinding |
| forge-cli/cmd/auth.go (new) | forge auth show-token / mint-token / secret-yaml subcommands |
| forge-cli/server/middleware | Read X-Forge-Schedule-Id header, stash in ctx |
| forge-cli/runtime/runner.go schedule dispatch | If X-Forge-Schedule-Id is set on the inbound, emit schedule_fire / schedule_complete around the dispatch |
| docs/deployment/scheduler-kubernetes.md (new) | RBAC table, manifest examples, token-provisioning runbook, security model, comparison with file backend |
| Tests | Unit tests against a fake kubernetes.Interface; manifest-generation golden tests asserting Secret has NO data field; e2e against kind cluster (optional) |
Out of scope
- Schedule history retrieval in k8s mode — could read K8s Job
status, but easier and more uniform: keep history from the audit
stream (schedule_complete events already carry status +
duration). schedule_history tool reads from a small in-memory
ring buffer fed by the audit emitter, regardless of backend.
- Cross-namespace deployments — first cut assumes CronJob and
agent live in the same namespace.
- Multi-cluster — one cluster per agent.
- Refresh of agent service URL on Service IP changes — once the
CronJob is created, it points at the Service's DNS name (stable);
Pod IP changes are irrelevant.
- Real-time interactive replacement: schedule_set from an
  in-flight chat re-creates a CronJob, and the CronJob controller
  picks it up on its next pass. Cron granularity is one minute, so
  controller lag should be negligible by comparison.
- Auto-rotating the internal token — initial implementation
treats the token as a long-lived credential. Operators rotate by
re-deploying with a fresh token in the Secret + the agent pod
picking it up on restart. Auto-rotation with a transitional grace
window is a separate follow-on.
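For the first item in this list, the backend-independent history buffer could be as small as the sketch below. `RunRecord`, `History`, and the method names are hypothetical, and the record fields are limited to the status and duration that schedule_complete events are said to carry.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// RunRecord is one completed fire, as fed by the audit emitter.
type RunRecord struct {
	ScheduleID string
	Status     string
	Duration   time.Duration
}

// History is a fixed-size ring buffer that overwrites its oldest entry when full.
type History struct {
	mu   sync.Mutex
	buf  []RunRecord
	next int  // index the next Record call writes to
	full bool // true once the buffer has wrapped at least once
}

// NewHistory allocates a buffer for n records (at least one).
func NewHistory(n int) *History {
	if n < 1 {
		n = 1
	}
	return &History{buf: make([]RunRecord, n)}
}

// Record appends r, evicting the oldest record if the buffer is full.
func (h *History) Record(r RunRecord) {
	h.mu.Lock()
	defer h.mu.Unlock()
	h.buf[h.next] = r
	h.next = (h.next + 1) % len(h.buf)
	if h.next == 0 {
		h.full = true
	}
}

// Recent returns a copy of the stored records, oldest first.
func (h *History) Recent() []RunRecord {
	h.mu.Lock()
	defer h.mu.Unlock()
	if !h.full {
		return append([]RunRecord(nil), h.buf[:h.next]...)
	}
	out := append([]RunRecord(nil), h.buf[h.next:]...)
	return append(out, h.buf[:h.next]...)
}

func main() {
	h := NewHistory(2)
	h.Record(RunRecord{ScheduleID: "a", Status: "ok", Duration: time.Second})
	h.Record(RunRecord{ScheduleID: "b", Status: "ok", Duration: 2 * time.Second})
	h.Record(RunRecord{ScheduleID: "c", Status: "error", Duration: time.Second})
	for _, r := range h.Recent() {
		fmt.Println(r.ScheduleID, r.Status, r.Duration)
	}
}
```

Being in-memory, the buffer is lost on pod restart; anything longer-lived would have to come from wherever the audit stream is persisted.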
Verification
- forge run on a laptop: confirm FileBackend is selected,
  SCHEDULES.md is still written, and in-process detection pulls no
  client-go into the binary's import graph. Zero behavior change.
- forge package on the same agent: confirm the k8s/ directory now
  contains:
  - cronjob-<sched-id>.yaml files matching the manifest shape
  - internal-token-secret.yaml with no data field (golden
    test asserts this; committing a token-bearing manifest must
    be impossible)
  - role-scheduler.yaml + rolebinding-scheduler.yaml
- forge auth secret-yaml | kubectl apply -f - populates the
  Secret out-of-band.
- kubectl apply -k ./k8s and confirm CronJobs appear; kubectl get
  cronjobs shows them; at the scheduled time, a Job pod fires and
  curls the agent's Service URL.
- Tail the audit socket and confirm schedule_fire →
  session_start → llm_call → invocation_complete →
  schedule_complete lands, with the inbound auth_verify showing
  Source: "internal".
- With scheduler.kubernetes.allow_dynamic: true, call
  schedule_set mid-conversation and confirm a new CronJob is
  created in the namespace.
- Restart the agent pod and confirm CronJobs persist;
  schedule_list returns the same set; no SCHEDULES.md is written
  to disk.
- Apply only the Deployment without the Secret; confirm the pod
  never starts (stuck in ContainerCreating) with a clear
  secret "..." not found event, rather than starting with no
  scheduling.
Related
- See the conversation that led to this issue: hybrid backend reusing
the loopback static_token (runner.go:2425-2436) that channel
plugins already use.
- forge-core/scheduler/scheduler.go: current ticker implementation
- forge-cli/runtime/scheduler_store.go: current file-backed store
- forge-core/auth/token.go: StoreToken / LoadToken against
  <WorkDir>/.forge/runtime.token