# Proposal: Kubernetes Source Type for Registry Server

## Status

**Proposed** - Supersedes [toolhive#2591](https://github.com/stacklok/toolhive/pull/2591)

## Summary

Add a native `kubernetes` source type to the ToolHive Registry Server that directly watches Kubernetes resources (MCPServer, MCPRemoteProxy, VirtualMCPServer) and builds registry entries from annotated resources. This eliminates the need for an intermediate ConfigMap-based approach.
## Motivation

### Problem Statement

We want to automatically populate the MCP registry with servers deployed in Kubernetes. The previous proposal ([toolhive#2591](https://github.com/stacklok/toolhive/pull/2591)) suggested having the ToolHive operator:

1. Watch annotated MCP resources and HTTPRoutes
2. Aggregate discovered servers into per-namespace ConfigMaps
3. Have the registry server read those ConfigMaps

This approach has several drawbacks:

- **ConfigMap size limits**: Kubernetes ConfigMaps are limited to 1MB, constraining scalability
- **Backup complexity**: ConfigMaps as intermediate artifacts complicate backup/restore workflows
- **Two-hop latency**: Changes must propagate through operator → ConfigMap → registry server
- **Split logic**: Registry population logic is split across two components

### Proposed Solution

Add a `kubernetes` source type to the registry server that directly queries Kubernetes resources using the same sync patterns as existing sources (git, api, file). The registry server already uses `controller-runtime` and has a clean provider abstraction that fits this model naturally.

## Design

### Architecture
```
┌────────────────────────────────────────────────────────────────┐
│  Registry API Server                                           │
│  ┌──────────────────────────────────────────────────────────┐  │
│  │ KubernetesRegistryHandler                                │  │
│  │                                                          │  │
│  │ 1. List MCPServer, MCPRemoteProxy, VirtualMCPServer      │  │
│  │ 2. Filter by namespace/labels + require annotations      │  │
│  │ 3. Build UpstreamRegistry entries from annotations       │  │
│  │ 4. Return FetchResult (same as git/api/file handlers)    │  │
│  │                                                          │  │
│  └──────────────────────────────────────────────────────────┘  │
│                               │                                │
│                               ▼                                │
│  ┌──────────────────────────────────────────────────────────┐  │
│  │ SyncManager + StorageManager (existing infrastructure)   │  │
│  └──────────────────────────────────────────────────────────┘  │
└────────────────────────────────────────────────────────────────┘
```
### Configuration

```yaml
registryName: "toolhive-cluster"

registries:
  - name: "cluster-mcp-servers"
    format: upstream
    kubernetes:
      # Namespace filtering (empty = all namespaces)
      namespaces: []

      # Optional label selector (standard k8s selector syntax)
      labelSelector: ""

    syncPolicy:
      interval: "30s"

auth:
  mode: oauth
  # ... standard auth config
```
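For illustration, the corresponding Go config type might look roughly like this; the field and YAML tag names simply mirror the example above and are not a final schema:

```go
// Sketch only: a possible shape for the kubernetes source config.
type KubernetesConfig struct {
    // Namespaces restricts which namespaces are scanned; empty means all.
    Namespaces []string `yaml:"namespaces,omitempty"`

    // LabelSelector is a standard Kubernetes label selector string,
    // e.g. "team=platform,tier in (community, official)".
    LabelSelector string `yaml:"labelSelector,omitempty"`
}
```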
### Annotations

MCP resources use annotations under the `toolhive.stacklok.dev` prefix to control registry export. A resource is only included in the registry if it has the required annotations.

| Annotation | Required | Description |
|------------|----------|-------------|
| `toolhive.stacklok.dev/registry-export` | Yes | Must be `"true"` to include in registry |
| `toolhive.stacklok.dev/registry-url` | Yes | The external endpoint URL for this server |
| `toolhive.stacklok.dev/registry-description` | No | Override the description in registry |
| `toolhive.stacklok.dev/registry-tier` | No | Server tier classification |

Resources without `registry-export: "true"` are ignored. Resources with `registry-export: "true"` but missing `registry-url` are logged as warnings and skipped.
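As a minimal sketch, the inclusion rules above could be encoded along these lines (the constant and function names are illustrative, not part of the proposal):

```go
// Annotation keys from the table above (constant names are illustrative).
const (
    annotationExport      = "toolhive.stacklok.dev/registry-export"
    annotationURL         = "toolhive.stacklok.dev/registry-url"
    annotationDescription = "toolhive.stacklok.dev/registry-description"
    annotationTier        = "toolhive.stacklok.dev/registry-tier"
)

// exportURL reports whether a resource opts into the registry and returns
// its external URL. Callers skip (and warn about) resources that opted in
// but left the URL empty.
func exportURL(annotations map[string]string) (url string, optedIn bool) {
    if annotations[annotationExport] != "true" {
        return "", false
    }
    return annotations[annotationURL], true
}
```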
### Example MCPServer

```yaml
apiVersion: toolhive.stacklok.dev/v1alpha1
kind: MCPServer
metadata:
  name: my-mcp-server
  namespace: production
  annotations:
    toolhive.stacklok.dev/registry-export: "true"
    toolhive.stacklok.dev/registry-url: "https://mcp.example.com/servers/my-mcp-server"
    toolhive.stacklok.dev/registry-description: "Production MCP server for code analysis"
spec:
  # ... MCP server spec
```
### Handler Implementation

The `KubernetesRegistryHandler` implements the existing `RegistryHandler` interface:

```go
type RegistryHandler interface {
    FetchRegistry(ctx context.Context, regCfg *config.RegistryConfig) (*FetchResult, error)
    Validate(regCfg *config.RegistryConfig) error
    CurrentHash(ctx context.Context, regCfg *config.RegistryConfig) (string, error)
}
```

Implementation:

1. **FetchRegistry**: Lists MCP resources, filters to those with `registry-export: "true"`, builds `UpstreamRegistry` entries from annotations
2. **CurrentHash**: Same list/filter, computes hash for change detection
3. **Validate**: Validates kubernetes config (label selector syntax, etc.)
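A rough sketch of the listing-and-filtering step behind `FetchRegistry`, assuming a `controller-runtime` client and reusing the annotation keys sketched earlier. The `entry` type is a stand-in for the real `UpstreamRegistry` entry structure, and the API versions for the proxy and virtual-server kinds are assumptions:

```go
import (
    "context"
    "fmt"

    "k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
    "k8s.io/apimachinery/pkg/runtime/schema"
    "sigs.k8s.io/controller-runtime/pkg/client"
)

// Sketch only: entry stands in for the real UpstreamRegistry entry type.
type entry struct {
    Name, URL, Description, Tier string
}

// listExported lists the MCP resource kinds and builds entries from the
// annotations of resources that opt in.
func listExported(ctx context.Context, c client.Client, opts ...client.ListOption) ([]entry, error) {
    kinds := []schema.GroupVersionKind{
        {Group: "toolhive.stacklok.dev", Version: "v1alpha1", Kind: "MCPServerList"},
        {Group: "toolhive.stacklok.dev", Version: "v1alpha1", Kind: "MCPRemoteProxyList"},
        {Group: "toolhive.stacklok.dev", Version: "v1alpha1", Kind: "VirtualMCPServerList"},
    }

    var entries []entry
    for _, gvk := range kinds {
        list := &unstructured.UnstructuredList{}
        list.SetGroupVersionKind(gvk)
        // opts carries namespace and label-selector filtering from KubernetesConfig.
        if err := c.List(ctx, list, opts...); err != nil {
            return nil, fmt.Errorf("listing %s: %w", gvk.Kind, err)
        }
        for _, item := range list.Items {
            ann := item.GetAnnotations()
            url, optedIn := exportURL(ann)
            if !optedIn {
                continue
            }
            if url == "" {
                continue // opted in without registry-url: warn and skip (logging omitted)
            }
            entries = append(entries, entry{
                Name:        item.GetName(),
                URL:         url,
                Description: ann[annotationDescription],
                Tier:        ann[annotationTier],
            })
        }
    }
    return entries, nil
}
```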
### Sync Behavior

Uses the existing `SyncManager` infrastructure:

1. Every `syncPolicy.interval`, check if sync is needed via `CurrentHash()`
2. If hash changed, call `FetchRegistry()` to get full data
3. Store result via `StorageManager` (file or database)

This is identical to how git, api, and file sources work today.
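For example, `CurrentHash` could derive a deterministic digest from the identity and `resourceVersion` of the exported resources, so that any create, update, or delete changes the hash. A sketch of that idea, not the final scheme:

```go
import (
    "crypto/sha256"
    "encoding/hex"
    "fmt"
    "sort"

    "k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
)

// contentHash derives a deterministic digest from the exported resources.
// Any create, delete, or modification bumps a resourceVersion and therefore
// changes the hash, which is enough for cheap change detection.
func contentHash(items []unstructured.Unstructured) string {
    keys := make([]string, 0, len(items))
    for _, it := range items {
        keys = append(keys, fmt.Sprintf("%s/%s@%s", it.GetNamespace(), it.GetName(), it.GetResourceVersion()))
    }
    sort.Strings(keys) // make the digest independent of list order

    h := sha256.New()
    for _, k := range keys {
        h.Write([]byte(k + "\n"))
    }
    return hex.EncodeToString(h.Sum(nil))
}
```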
### RBAC Requirements

The registry server's ServiceAccount needs read access to MCP resources:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: toolhive-registry-reader
rules:
  - apiGroups: ["toolhive.stacklok.dev"]
    resources: ["mcpservers", "mcpremoteproxies", "virtualmcpservers"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: toolhive-registry-reader
subjects:
  - kind: ServiceAccount
    name: toolhive-registry-api
    namespace: toolhive-system
roleRef:
  kind: ClusterRole
  name: toolhive-registry-reader
  apiGroup: rbac.authorization.k8s.io
```

For namespace-scoped deployments, use Role/RoleBinding instead.
## Alternatives Considered

### ConfigMap-based approach (toolhive#2591)

The original proposal had the operator write to ConfigMaps, with the registry server reading them.

**Rejected due to the following concerns:**

#### ConfigMap size limits

Kubernetes ConfigMaps are hard-limited to 1MB. While individual MCP server entries are small, a cluster with many servers across many namespaces could approach this limit. The per-namespace ConfigMap approach in the original proposal mitigates this somewhat, but introduces its own complexity (multiple ConfigMaps to aggregate) and still imposes a ceiling on servers-per-namespace.

#### Backup and restore complexity

ConfigMaps as intermediate artifacts create operational challenges:

- **Backup ambiguity**: Should ConfigMaps be backed up? They're derived data, but if the operator isn't running during restore, the registry is empty until it regenerates them.
- **Restore ordering**: On cluster restore, the operator must run and regenerate ConfigMaps before the registry server has data. This creates implicit dependencies in disaster recovery procedures.
- **Drift detection**: If a ConfigMap is manually modified or corrupted, there's no single source of truth - the operator will eventually overwrite it, but the intermediate state is inconsistent.

With a direct Kubernetes source, the MCP resources themselves are the source of truth. Standard etcd/Velero backups capture everything needed; on restore, the registry server simply queries the restored resources.

#### Two-component coordination

Splitting registry population across operator and registry server introduces:

- **Deployment coupling**: Both components must be healthy for the registry to be populated
- **Version skew**: Operator and registry server must agree on ConfigMap schema/format
- **Debugging complexity**: "Why isn't my server in the registry?" requires checking both operator logs and ConfigMap contents
- **Race conditions**: Operator writes ConfigMap, registry server reads it, leaving timing windows where data is stale or partially written

#### Additional latency

Changes propagate through two hops:

1. MCP resource changes → Operator detects → Operator writes ConfigMap
2. ConfigMap changes → Registry server detects → Registry updates

Each hop adds its own reconciliation interval. With a direct source, there's a single sync interval from resource to registry.
## Future Work: Watch-based Updates

The initial implementation uses interval-based polling via `syncPolicy.interval`, consistent with other source types. However, Kubernetes resources can change frequently, and polling introduces latency between a resource change and a registry update.

A future enhancement would add watch-based (informer) support for the kubernetes source:

### Proposed Approach

1. **Shared Informer Factory**: Use `controller-runtime`'s cache/informer infrastructure to watch MCP resources
2. **Event-driven sync**: On resource add/update/delete events, trigger a registry rebuild
3. **Debouncing**: Batch rapid changes (e.g., during deployments) with a short debounce window (e.g., 500ms-2s) to avoid excessive rebuilds
4. **Hybrid mode**: Keep `syncPolicy.interval` as a fallback/consistency check, but primarily react to watch events
### Configuration Extension

```yaml
kubernetes:
  namespaces: []
  labelSelector: ""

  # Future: watch-based sync
  watch:
    enabled: true
    debounceInterval: "1s"  # batch changes within this window
```
### Benefits

- **Near real-time updates**: Registry reflects changes within seconds instead of waiting for the next poll interval
- **Reduced API load**: No need for frequent polling; only react to actual changes
- **Consistency**: Informers maintain a local cache of watched resources, kept up to date by the watch stream

### Considerations

- **Complexity**: Informer lifecycle management, reconnection handling, cache synchronization
- **Memory**: Informer cache consumes memory proportional to the number of watched resources
- **Startup**: Initial cache sync must complete before the source can serve accurate data

This can be implemented as a backward-compatible enhancement - existing poll-based configs continue to work, and watch mode is opt-in.
## Implementation Plan

1. Add `KubernetesConfig` to `internal/config/config.go`
2. Add config validation for kubernetes source type
3. Implement `KubernetesRegistryHandler` in `internal/sources/kubernetes.go`
4. Register handler in `internal/sources/factory.go`
5. Add unit tests with mock K8s client
6. Add integration test with envtest
7. Update documentation and examples
8. Add Helm chart RBAC templates

## Open Questions

1. **Feature flag**: Should this be behind a feature flag initially?
2. **CRD availability**: How should the handler behave if ToolHive CRDs aren't installed in the cluster?
3. **Cross-cluster**: Should we support watching resources in remote clusters (via kubeconfig)?
