diff --git a/docs/assets/threat-model-diagram.svg b/docs/assets/threat-model-diagram.svg
new file mode 100644
index 000000000..9f2682a02
--- /dev/null
+++ b/docs/assets/threat-model-diagram.svg
@@ -0,0 +1 @@
+
\ No newline at end of file
diff --git a/docs/threat-model.md b/docs/threat-model.md
new file mode 100644
index 000000000..1d59bd094
--- /dev/null
+++ b/docs/threat-model.md
@@ -0,0 +1,137 @@
+# Substrate Threat Model
+
+[Michael Taufen](mailto:mtaufen@google.com), [Vikas Kumar](mailto:skvikas@google.com), [Oleg Mitrofanov](mailto:gooleg@google.com)
+
+Last updated: Jun 25, 2026
+
+TODO:
+- [ ] file GitHub issues for each threat
+- [ ] extract review skills/regression tests based on the threat model
+
+# Overview
+
+Substrate is an early, fast-moving product. It is full of debate and subject to change, including major architectural changes. It also has little to no security hardening at this time. Therefore, this threat model focuses on how to achieve security with respect to the general problem Substrate is trying to solve: fast orchestration of stateful AI agents. The threat model uses Substrate's current implementation and roadmap as a reference point, but focuses on threats that emerge from this overall class of system, rather than threats specific to the current implementation.
+
+# Intended Outcome
+
+* The suggestions from this threat model should be added to Substrate's official roadmap after review and agreement with the community.
+* Security review SKILLs for AI-assisted security review should be extracted from this threat model and used for continuous review on the upstream [agent-substrate/substrate](https://github.com/agent-substrate/substrate) repository.
+
+# Goals
+
+* Identify threats relevant to any system trying to achieve the same goal, with roughly the same basic building blocks as Substrate (Kubernetes, containers, sandboxes, snapshots, and lightweight "actors" dynamically scheduled to reusable "worker" Pods).
+* Identify general "mitigating invariants" which, if implemented, eliminate or significantly decrease the severity of the threat.
+* Suggest *possible* implementation approaches, which may change if Substrate changes.
+* Prioritize the threats, so that resources can be spent on reducing the biggest risks first.
+
+# Non-Goals
+
+* This threat model does not attempt to predict Substrate's future architectural decisions.
+* This threat model does not demand any particular approach to mitigation.
+* While shown in diagrams for reference, this threat model does not focus on AI frameworks that may be used on top of Substrate.
+
+# Architecture of Substrate
+
+
+
+# Key Components
+
+* **ate-api-server:** Actor creation and scheduling and credential issuance.
+* **atelet:** Per-node daemon, performs snapshot/resume.
+* **ateom:** Per-worker Pod sidecar, running inside the worker Pod. Ateom sets up "interior" sandboxes in the worker Pod and manages sandbox lifecycle, including image pulls. It currently uses gVisor, but Substrate will support multiple microVM solutions.
+* **Worker:** Preprovisioned Pods that actors get scheduled to.
+* **Actor:** The core compute primitive. Actors are scheduled to and from workers via Run (cold start) and Resume (snapshot resume).
+* **Actor IP:** Actor networking is based on Pod networking. Each actor gets the IP of the worker it's currently scheduled to. The ateom has the opportunity to set up additional rules when it sets up interior sandboxes.
+* **Actor DNS:** Each Actor gets a DNS name like `.actors.resources.substrate.ate.dev`. Substrate runs a custom CoreDNS instance that returns atenet-router's IP address for any A record query matching the actor DNS name pattern. Substrate also includes a built-in controller that both keeps Substrate's CoreDNS configuration up to date with the router's Service IP and updates kube-dns with a stub domain for `actors.resources.substrate.ate.dev` that points to the IP of Substrate's CoreDNS Service. The latter enables traditional Kubernetes Pods to resolve Substrate Actor DNS names.
+* **atenet-router:** Substrate runs an Envoy proxy to handle ingress to actors. When a client sends a request to an actor's DNS name, it is resolved to atenet-router's Envoy sidecar. Envoy then forwards headers to atenet-router (ext\_proc filter), which extracts the actor ID, automatically resumes the actor if it is in a suspended state, queries the Substrate API for the current IP of the actor, and tells Envoy to rewrite the host header to the actor IP before forwarding the request to the actor. atenet-router includes a local xDS server that configures the Envoy sidecar with this behavior.
+* **Object Storage:** Used to store actor snapshots.
+* **Filesystem support:** The container-local filesystem is saved in snapshots; future integrations are likely to include networked storage.
+* **Substrate Database:** Currently Valkey (Redis-compatible API). The choice of backend database/interface is under active debate.
+* **Kubernetes:** The underlying infrastructure that Substrate runs on is expected to be Kubernetes.
+
+# Threats and Mitigations
+
+**Table Schema:**
+
+* **Priority:** Critical, High, Medium, Low
+* **Threat:** Expected risk in a naïve implementation of Substrate.
+* **Mitigating Invariants:** High level properties which, if true, mitigate the threat.
+* **Suggested Concrete Mitigations:** Specific options for implementing the mitigating invariants, based on current understanding of Substrate.
+* **Notes:** Additional relevant information.
+
+## Attacks from External Networks
+
+| GitHub Issue | Priority | Threat | Mitigating Invariants | Suggested Concrete Mitigations | Notes |
+| :---- | :---- | :---- | :---- | :---- | :---- |
+| | Critical | External attacker can access actors over the internet | Access to actors over the external internet is blocked. | Recommend use of infrastructure firewall to limit external ingress/egress in documentation (cloud specific). Use Kubernetes NetworkPolicy to limit external ingress/egress by default. Use additional network policy features depending on CNI. | |
+| | Critical | External attacker can access nodes over the internet | Access to nodes over the external internet is blocked. | Recommend use of infrastructure firewall to limit external ingress/egress in documentation (cloud specific). Use Kubernetes NetworkPolicy to limit external ingress/egress by default. Use additional network policy features depending on CNI. | |
+| | Critical | External attacker can access Substrate API or backend database over the internet | Access to the Substrate API and backend database over the external internet is blocked. | Recommend use of infrastructure firewall to limit external ingress/egress in documentation (cloud specific). Use Kubernetes NetworkPolicy to limit external ingress/egress by default. Use additional network policy features depending on CNI. | |
+
+## Attacks from Internal Network, API Clients
+
+| GitHub Issue | Priority | Threat | Mitigating Invariants | Suggested Concrete Mitigations | Notes |
+| :---- | :---- | :---- | :---- | :---- | :---- |
+| | Critical | Access to the internal network allows arbitrary actions to be performed on ate-apiserver, atelet, Substrate backend database, etc. | All system components must have basic mutual authentication and authorization, and communicate over TLS. All clients (including end users and actors) must be authenticated and authorized. Unauthenticated traffic must be rejected. | Use mTLS or another secure channel (e.g. UDS) between networked system components (ate-apiserver, atelet, ateom, etc.). Each atelet has a unique identity cryptographically tied to the node identity. ate-router should check client permissions before resuming actors or forwarding traffic to actors. The only component authorized to connect directly to the backend database should be ate-apiserver. | |
+| | High | Privilege escalation via access to sensitive labels. | If Substrate offers its own resource labeling mechanism, it must also offer a way to authorize label updates on a per-label basis. | Substrate authorization system requires explicit authorization to update metadata, separate from updating resource body. Substrate authorization system supports per-label authorization rules. | Plenty of attacks in K8s were possible because labels had semantic meaning, but the permission model could implicitly grant access to modify labels, even when it was inappropriate. For example, the /status subresource allows label updates. Substrate shouldn't repeat this mistake. |
+| | High | Attacker gains control of Substrate API server, router, or other ingress/egress proxy. | Isolate the control plane from the data plane, and isolate data plane ingress from sandboxes. | Don't co-locate ate-apiserver or other control-plane components on the same machines as the untrusted sandboxes. Consider running any gateway/router that enables direct interaction with sandboxes on a separate VM from the sandboxes, or using a zero-trust architecture where traffic is encrypted and authenticated end-to-end. | |
+| | High | Attacker who can create ActorTemplates specifies malicious runtime. | Ensure available runtime can only be configured by administrators. | Consider a mechanism like RuntimeClass to decouple configuration of available runtimes from consumption of available runtimes. | |
+| | High | Attacker who can create ActorTemplates can read or write any storage buckets atelet has access to. | Ensure that bucket access is least-privilege. | Use credentials derived from actor identity to read snapshots. Configure permissions to prevent atelet or nodes from having access to sensitive buckets. | For example: Attacker creates an ActorTemplate with the runsc URL or golden snapshot URL pointing to an arbitrary bucket in the same project/resource scope as the cluster. If atelet has project-wide access to buckets, this could cause the state to be pulled into the worker pod or malicious actor. Similarly, an attacker could set the snapshots URL to point to an internal infrastructure bucket, causing data to be written to that bucket. |
+| | Medium | DoS attack against API, router, or available cluster resources. | Minimize exposure and ensure APIs and proxies implement appropriately scoped quotas and rate limiting. | Don't co-locate ate-apiserver or other control-plane components on the same machines as the untrusted sandboxes. Don't expose ate-apiserver directly to the internet. Consider an identity-aware WAF (likely up to providers) if external access is required. Use network policy to block direct interaction between untrusted sandboxes and the ate-apiserver. Implement quotas and rate limiting in the API and in proxies. This includes quotas for the number of actors a user can allocate. Use a zero-trust architecture that prevents identity spoofing or intentional misrouting to get around limits. | Notably, the router checks ate-apiserver for the actor IP on each request. A flood of traffic to actors could potentially result in high read load on ate-apiserver. Caching could be considered, but cache invalidation during actor rescheduling would be important to avoid misrouting traffic. |
+| | Medium | Internal network traffic is intercepted or spoofed | Encrypt all traffic by default | Use mTLS between all system components, and between the router and actors. | It may be desirable to rely on cloud providers to transparently encrypt traffic between VMs on their internal network. Worth discussing what makes the most sense. |
+
+## Misconfiguration Risks
+
+| GitHub Issue | Priority | Threat | Mitigating Invariants | Suggested Concrete Mitigations | Notes |
+| :---- | :---- | :---- | :---- | :---- | :---- |
+| | High | Improper handling of Secrets | Ensure there is an official, secure, recommended way to pass secret data, like API access tokens, to actors. | Support env and filesystem plumbing for Kubernetes Secrets, to provide an official path that avoids secret material being plumbed via nonspecific fields that are difficult to audit. Ensure secrets are encrypted in transit and ideally stored in memory. If exposed via the filesystem, do so via in-memory tmpfs. | If we don't support this, users will inevitably put secrets in plaintext. |
+| | Medium | Complexity of configuring permissions for frameworks on top of Substrate may lead to unintentional privilege escalation. | It must be clear to users what the downstream effects of auth config in Substrate are. | If it's not intuitive, it must be documented in a user guide. | An AI framework has to set up permissions to access ATE, to access K8s, and potentially for actors (based on ATE identity and K8s identity) to access the framework. We need to make this easy. Think about past K8s issues like escalate/bind risk. The Substrate resource model is spread across ate-apiserver and K8s, increasing complexity and the chance for error. |
+| | Medium | Flat namespace of actors encourages broad permission grants or complex graph-oriented policy. | Support a grouping mechanism that can be used in policy controls. | Add namespaces to Substrate, similar to Kubernetes. | |
+| | Medium | DNS misconfiguration | Access to DNS configuration should be limited to authoritative controllers. Routing should use stable configurations and query the API for the current IP before routing each request. | Don't co-locate controllers with access to sensitive system state on the same nodes as actors. Limit permissions to update DNS configuration. Actively query Substrate API to ensure IPs are as up-to-date as possible. Potentially use mTLS based on actor DNS name between ate-router and actors. | As noted above, a flood of requests could create high read load on ate-apiserver. Caching could be considered, but cache invalidation during actor rescheduling would be important to avoid misrouting traffic. Establishing a backend mTLS tunnel between ate-router and each actor based on a serving cert signed for the actor's DNS name could be another approach to avoiding misrouting. |
+
+## Attacks from Actors
+
+| GitHub Issue | Priority | Threat | Mitigating Invariants | Suggested Concrete Mitigations | Notes |
+| :---- | :---- | :---- | :---- | :---- | :---- |
+| | Critical | Malicious actor gains access to the underlying node or other actors via container breakout (Linux local privilege escalation) | Actors are always sandboxed using a hardened sandbox solution like gVisor or a microVM. Traditional containers are not a secure sandbox. | Use gVisor for strong container isolation. Run sentry in a user namespace and pivot\_root to limit broader file system access if the gVisor boundary is broken. Use seccomp for sentry/gofer as well. Limit capabilities granted to directfs/sentry. Any sidecar containers running alongside actors or warmpool pods should be as low privileged as possible. Sandbox lifecycle must be controlled from outside the sandbox. | |
+| | Critical | Malicious actor gains access to the underlying node via node-local endpoints exposed on the network | Network policies must prevent the actor from accessing system services on the node (e.g. instance metadata, host network namespace, host network interface). | Enable network namespacing, and do not provide host network access to any workload. Any host services should have targeted ingress policies so that only the correct clients even have network access to those services | |
+| | Critical | Malicious actor gains access to other actors via network | Network policies must deny ingress and egress by default, and selectively allow access to specific actors only when necessary. Network policies must be synchronized with actor lifecycle. | Implement default-deny network policy that blocks any cross-communication between actors (this could be implemented by ateom instead of Kubernetes, to speed up policy updates when actor groups do need to communicate). | |
+| | Critical | Malicious actor gains access to the underlying node or other actors via filesystem | Filesystem access is limited to the actor's local fs and remote filesystems the actor is directly, explicitly authorized to access. | Ensure that actor access to the filesystem is protected by, for example, mapping each actor to a unique Linux uid and using filesystem permissions, or by using user namespaces to isolate actor filesystem access on the same host. Ensure that actor management APIs and filesystem setup protect against traversal via symlinks, mount trickery, etc. | Possible vectors: Arbitrary file read vuln in fs implementation. Symlink traversal vulns. Vuln in OCI image unpacking. Path traversal attacks in mount configuration. Shared directories only gated via filesystem permissions \+ workers running as root or privileged. |
+| | Critical | Malicious actor gains access to the underlying Worker Pod or node via local Substrate services (atelet, ateom, etc). | Actor access to system services must be denied or explicitly scoped to the actor. Local services must run with the fewest privileges possible to limit the ability for an actor to escalate privilege. | Limit actor access to system services. Ensure system services are aware of actor identity, and authenticate and authorize actors when access is required. | |
+| | Critical | Malicious actor gains access to the underlying node or other actors via Substrate APIs (remote ate-apiserver) | Either actors are completely unable to access Substrate APIs, or actor access is authenticated and authorized such that the actor cannot escalate beyond its intended scope. Especially prevent self-modification, which was a common escalation path in K8s. | Do not allow actors or workers to self-modify, e.g. by reading or writing their own snapshots, self-labeling, or other self-manipulations of KRM or ATE resources. Logic to modify resource definitions, including actors, workers, etc., should live in the control plane rather than the data plane as much as possible. | |
+| | Critical | Attacker escalates to host via excessive Worker Pod permissions | Ensure Worker Pods use the minimum necessary permissions to manage sandbox lifecycle. | Deprivilege the Worker Pods. Run them as non-root or in a user namespace, without privileged mode, and with only required caps/devices. | Possible example, accidentally putting things in the same cgroup: [https://github.com/agent-substrate/substrate/issues/288](https://github.com/agent-substrate/substrate/issues/288) |
+| | Critical | Malicious actor gains access to the underlying node or other actors via Kubernetes APIs | Either actors are completely unable to access the Kubernetes APIs, or actor access is authenticated and authorized such that the actor cannot escalate beyond its intended scope. **Strong preference on blocking actor access to Kubernetes.** | Block network access to the Kubernetes API. Ensure the default Kubernetes service account token in each Pod has zero privileges. | |
+| | Critical | Malicious actor workload gains access to snapshots of other actors and steals data from them. | Worker Pods and actors must not have direct access to snapshots. | If actor identity is used for snapshot access, require the credential issued for snapshots to include additional claims identifying it as atelet, or require it to be used over a channel secured with atelet's mTLS certificate. | |
+| | Critical | Malicious actor workload overwrites snapshots of other actors with malicious snapshots. | Worker Pods and actors must not have direct access to snapshots. | If actor identity is used for snapshot access, require the credential issued for snapshots to include additional claims identifying it as atelet, or require it to be used over a channel secured with atelet's mTLS certificate. | |
+| | Critical | Corrupt snapshot is restored | Cryptographically verify each snapshot before restoration | Snapshots must be checked against a trusted digest or be signed and checked against a trusted key before restore is allowed. The digest or key must be delivered via a trusted channel. | |
+| | Critical | Stale policies propagated out-of-band with actor scheduling result in incorrect access boundaries (including network policies, IAM permissions, etc.) | All network and authorization policies must be fully synced before an actor starts running. | Actor lifecycle APIs could support programming authz and network policy as part of create/resume. | |
+| | Critical | Worker reuse enables malicious actor to escalate to other actors by persisting a threat across reuse, reading leftover state from a previous actor, or taking advantage of stale policy configuration. | All actor-specific worker state, including process state, filesystem, env vars, mounted config, network policy, and security policy, must be completely reset between actors that subsequently run on the same worker. | Ensure local policy updates are tightly synchronized with actor lifecycle to avoid race conditions. Test that the suspend/resume lifecycle for every supported sandbox technology properly cleans up state by exercising it and enumerating system state. Use "honeypots" on each side of the boundary to detect state leaks. Test policy behavior on both sides of suspend/resume to ensure policies are appropriately updated in-sync with actor lifecycle. | |
+| | Critical | Malicious actor tricks Substrate identity broker into returning identity credentials for a different actor. | Add defense-in-depth to ensure that credentials cannot be mis-issued, and that mis-issued credentials are unusable in practice. | Ensure that credential issuance checks actor-to-worker scheduling assignment. Ensure actor identities include claims tying them to the worker and node the actor is scheduled to. Ensure claims are validated against scheduling assignments during authz policy evaluation; when actors are rescheduled old credentials should be immediately invalidated. | |
+| | High | Agent leaks credentials exposed in the sandbox because LLMs are unreliable, whether due to prompt injection or just agent silliness. | Credentials are not exposed in sandboxes by default. | Opt-in to credentials, none by default. Credential-injecting proxy. Delegation to "drivers" rather than relying on sandboxed clients (like CSI). Encrypting credentials exposed in sandboxes, requiring them to be unwrapped by a network proxy to be usable. Consider supporting pluggable per-node or per-worker security sidecars that can interpose network, implement additional policy, add monitoring, etc. | May provide an option to expose credentials in the sandbox if there are other mitigating factors. For example, the process is NOT an AI agent, or the process limits agent access to env/filesystem. |
+| | High | Malicious actor moves laterally to other nodes via suspend/resume, especially if self-suspend is allowed. | Suspend/resume is guaranteed to provide some acceptable locality with respect to lateral movement. TBD. | Maybe some kind of session pinning to specific groups of nodes, to contain the movement? Or statistical bias to resume on the same node? | Particularly useful in combination with a sandbox breakout vuln, or network or filesystem access vulns. |
+| | High | DoS/resource exhaustion via malicious container image | Enforce limits when pulling and extracting container images | Ensure limits on the uncompressed layers are enforced when untarring container images. Ensure a time limit is enforced when unpacking container images. | For example, image can contain a zip bomb. |
+| | Medium | Compromised process inside actor uses actor identity to write its own snapshot, enabling local privilege escalation (still in actor sandbox). | Issue separate credentials for accessing snapshots. | Even if actor identity is used to make snapshot access permissions granular, require an extra claim, audience, or dual-identity authorization to prove that it's atelet accessing the snapshot on behalf of the actor, rather than the actor itself. | Medium since it's still a local exploit within the application layer. Example: nonroot process with access to the actor credential rewrites a snapshot so that it can `su` to root after the next restore. |
+| | Medium | Attacker creates a large number of malicious actors to rapidly spread across the cluster or exhaust resources (via direct create or fork-bomb equivalent). | Limit resource consumption and number of child actors for each actor. | Direct ability to create actors should be considered a privileged permission. Use rate quotas and throttling to limit the speed of execution of this attack. Account for number of child actors and resource consumption by children so quotas can be enforced. Implement a namespace-like concept that quotas can be attached to. | |
+| | Medium | Attacker discovers Kubernetes-internal network topology from inside sandboxed actor | Don't allow actors to explore the cluster network | Don't expose cluster internal DNS to actors. Don't mount worker Pod resolv.conf in actors, instead provision a new file for the actor. | |
+
+## Attacks from Nodes
+
+| GitHub Issue | Priority | Threat | Mitigating Invariants | Suggested Concrete Mitigations | Notes |
+| :---- | :---- | :---- | :---- | :---- | :---- |
+| | High | A compromised node accesses all snapshots for the cluster. | Node access to snapshots must be scoped to actors actively scheduled to the node. | Issue a special per-actor JWT, only for atelet and not available in the sandbox, that has claims available in conditional IAM. Use CredentialAccessBoundary (GCP only). Perform data wipeout on previous actor “tenants” of the node on actor suspension. Snapshots should be write-once per version, to prevent malicious actors from overwriting “golden” snapshots and compromising other actors that use that snapshot. | Detection rules to determine whether a given actor is running on the node and correlate it to a node requesting a snapshot for the actor could be valuable defense in depth. |
+| | High | Compromised node accesses filesystem storage for actors on other nodes. | Node access to network filesystems must be scoped to actors actively scheduled to the node. | TBD \- depends on network filesystem implementation and what it supports. | |
+| | High | Compromised node can escalate to other nodes or control plane via Substrate or Kubernetes. | Node access to Substrate or Kubernetes APIs must be scoped to operations relevant to actors actively scheduled to the node, and to the node itself. | Disable network communications for actors by default. Use tightly scoped Ingress/Egress network policies for every Pod. Default-deny Actor-to-Actor communication. Any allowed Actor-to-Actor communication should be over TLS and have authentication/authorization. Limit total number of child actors under a given parent (and limit depth of actor create calls). Limit network access between data plane components, and ensure control plane components have minimal ingress from actors/workloads. | |
+
+## Insider Attacks
+
+| GitHub Issue | Priority | Threat | Mitigating Invariants | Suggested Concrete Mitigations | Notes |
+| :---- | :---- | :---- | :---- | :---- | :---- |
+| | Medium | Insider access to snapshots | Substrate's storage models for snapshots must enable users to grant granular access control when needed for system administration, rather than only supporting broad access across all actors. | Customer managed encryption keys for snapshots, similar to K8s kmsplugin. Storage layout that maps well to per-actor identity. | |
+| | Medium | Insider access to disks, e.g. for substrate backend DB | Support envelope encryption of sensitive data, or avoid storing sensitive data. | Use FDE by default (most cloud providers already do this, nonspecific to Substrate). If secrets are ever stored in Substrate's DB, support envelope encryption using HSMs, similar to kms providers in K8s. | |
+
+## Detection and Response
+
+| GitHub Issue | Priority | Threat | Mitigating Invariants | Suggested Concrete Mitigations | Notes |
+| :---- | :---- | :---- | :---- | :---- | :---- |
+| | High | Malicious actions are not recorded for forensics. | Enable audit logging for all Substrate components that serve APIs, including node-local components. | Enable audit logging for snapshot/GCS buckets. Set up audit logging for ateapi requests. | |
+| | Medium | Malicious activity proceeds undetected. | Enable integration with threat detection systems. | Support for pluggable threat detection integrations, making telemetry from Substrate API and node components, as well as sandboxes, available for continuous analysis. Consider supporting pluggable per-node or per-worker security sidecars that can gather telemetry, etc. | |
+| | Medium | Detected malicious actor cannot be contained. | Quarantine or suspension option. | Support dynamic policy updates for quick quarantine. Automatically trigger snapshot/suspend on malicious activity, but taint the snapshot so that it can't automatically be resumed in production. | |
diff --git a/hack/util/verify-boilerplate.py b/hack/util/verify-boilerplate.py
index 173f3a6eb..5ce55f1a5 100755
--- a/hack/util/verify-boilerplate.py
+++ b/hack/util/verify-boilerplate.py
@@ -104,7 +104,7 @@ def main():
filename = os.path.basename(filepath)
# Skip non-source-code files
- if ext in ['.md', '.txt', '.png', '.jpg', '.jpeg', '.gif', '.mp4', '.json', '.pdf', '.ico', '.woff', '.woff2', '.ttf', '.otf']:
+ if ext in ['.md', '.txt', '.png', '.jpg', '.jpeg', '.gif', '.mp4', '.json', '.pdf', '.ico', '.woff', '.woff2', '.ttf', '.otf', '.svg']:
continue
if filename in ['LICENSE', 'NOTICE', 'CODEOWNERS', '.gitignore', 'go.mod', 'go.sum']:
continue