This document is a comprehensive guide for an AI agent tasked with developing and maintaining the Sysdig MCP Server. It covers everything from project setup and architecture to daily workflows and troubleshooting.
## 1. Project Overview
### 3.1. Repository Layout
```
.github/workflows - CI Workflows
cmd/server/       - CLI entry point, tool registration
internal/
  config/         - Environment variable loading and validation
docs/             - Documentation assets
justfile          - Canonical development tasks (format, lint, test, generate, bump)
flake.nix         - Defines the Nix development environment and its dependencies
package.nix       - Defines how the package is built with Nix
```
### 3.2. Key Components & Flow
- HTTP middleware extracts `Authorization` and `X-Sysdig-Host` headers for remote transports (lines 108-138)
4. **Sysdig Client (`internal/infra/sysdig/`):**
   - `client.gen.go`: Generated OpenAPI client (**DO NOT EDIT**, manually regenerated via oapi-codegen, not with `go generate`)
   - `client.go`: Authentication strategies with fallback support
     - Context-based auth: `WrapContextWithToken()` and `WrapContextWithHost()` for remote transports
     - Fixed auth: `WithFixedHostAndToken()` for stdio mode and remote transports
   - Custom extensions in `client_extension.go` and `client_*.go` files

5. **Tools (`internal/infra/mcp/tools/`):**
   - Use `WithRequiredPermissions()` from `utils.go` to declare Sysdig API permissions
   - Permission filtering happens automatically in the handler
### 3.3. Authentication Flow

1. **stdio transport**: Fixed host/token from env vars (`SYSDIG_MCP_API_HOST`, `SYSDIG_MCP_API_TOKEN`)
2. **Remote transports**: Extract from HTTP headers (`Authorization: Bearer <token>`, `X-Sysdig-Host`)
3. **Fallback chain**: Try context auth first, then fall back to env var auth (see the sketch below)
4. Each request includes Bearer token in Authorization header to Sysdig APIs
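
To make the fallback concrete, here is a minimal, illustrative Go sketch of the pattern (not the project's actual `internal/infra/sysdig` code). The key and helper names (`ctxTokenKey`, `ctxHostKey`, `resolveCredentials`) are hypothetical; only the ordering mirrors the flow above: context values set by the HTTP middleware win, then the `SYSDIG_MCP_API_*` environment variables are tried.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"os"
)

// Hypothetical context keys; the real project wraps these in helpers such as
// WrapContextWithToken()/WrapContextWithHost().
type ctxKey string

const (
	ctxTokenKey ctxKey = "sysdig-token"
	ctxHostKey  ctxKey = "sysdig-host"
)

// resolveCredentials illustrates the fallback chain: context auth first
// (remote transports), then environment variables (stdio mode).
func resolveCredentials(ctx context.Context) (host, token string, err error) {
	if t, ok := ctx.Value(ctxTokenKey).(string); ok && t != "" {
		token = t
	} else {
		token = os.Getenv("SYSDIG_MCP_API_TOKEN")
	}
	if h, ok := ctx.Value(ctxHostKey).(string); ok && h != "" {
		host = h
	} else {
		host = os.Getenv("SYSDIG_MCP_API_HOST")
	}
	if host == "" || token == "" {
		return "", "", errors.New("unable to authenticate with any method")
	}
	return host, token, nil
}

func main() {
	// Simulate a remote transport: the middleware stored header values in the context.
	ctx := context.WithValue(context.Background(), ctxTokenKey, "example-token")
	ctx = context.WithValue(ctx, ctxHostKey, "https://us2.app.sysdig.com")

	host, token, err := resolveCredentials(ctx)
	if err != nil {
		fmt.Println("auth error:", err)
		return
	}
	// Every outgoing request would then set "Authorization: Bearer <token>".
	fmt.Printf("calling %s with a Bearer token (%d chars)\n", host, len(token))
}
```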
### 3.4. Tool Permission System
- Each tool declares its required Sysdig API permissions using `WithRequiredPermissions("permission1", "permission2")`.
- Before exposing tools to the LLM, the handler calls the Sysdig `GetMyPermissions` API.
- The agent will only see tools for which the provided API token has **all** required permissions (see the sketch below).
- Common permissions: `policy-events.read`, `sage.exec`, `risks.read`, `promql.exec`
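
The filtering logic can be pictured with a small, self-contained Go sketch. This is not the actual handler code; `Tool`, `RequiredPermissions`, and `filterTools` are illustrative stand-ins for what `WithRequiredPermissions()` and the `GetMyPermissions` response feed into.

```go
package main

import "fmt"

// Tool is an illustrative stand-in for a registered MCP tool and the
// permissions it declared via WithRequiredPermissions(...).
type Tool struct {
	Name                string
	RequiredPermissions []string
}

// filterTools keeps only the tools whose required permissions are ALL
// present in the set returned by the Sysdig GetMyPermissions API.
func filterTools(tools []Tool, granted map[string]bool) []Tool {
	var visible []Tool
	for _, t := range tools {
		ok := true
		for _, p := range t.RequiredPermissions {
			if !granted[p] {
				ok = false
				break
			}
		}
		if ok {
			visible = append(visible, t)
		}
	}
	return visible
}

func main() {
	tools := []Tool{
		{Name: "list_runtime_events", RequiredPermissions: []string{"policy-events.read"}},
		{Name: "run_sysql", RequiredPermissions: []string{"sage.exec", "risks.read"}},
	}
	// Pretend GetMyPermissions returned only policy-events.read for this token.
	granted := map[string]bool{"policy-events.read": true}

	for _, t := range filterTools(tools, granted) {
		fmt.Println("exposed to the LLM:", t.Name) // only list_runtime_events
	}
}
```

A single missing permission hides the tool entirely, which is why an unexpectedly absent tool usually points at the token's role rather than at the server.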
## 4. Day-to-Day Workflow
1. **Enter the Dev Shell:** Always work inside the Nix shell (`nix develop` or `direnv allow`) to ensure all tools are available. You can assume the developer is already in a Nix shell.
2. **Make Focused Changes:** Implement a new tool, fix a bug, or improve documentation.
3. **Run Quality Gates:** Use `just` to run formatters, linters, and tests.
4. **Commit:** Follow the Conventional Commits specification. Keep commit messages short: just a title, no description. Pre-commit hooks will run the quality gates automatically.
### 4.1. Testing & Quality Gates
`just check` is a convenient alias for `just fmt` + `just lint` + `just test`.
### 4.2. Pre-commit Hooks
This repository uses **pre-commit** to automate quality checks before each commit. The hooks are configured in `.pre-commit-config.yaml` to run `just fmt`, `just lint`, and `just test`.

This means that every time you run `git commit`, your changes are automatically formatted, linted, and tested. If any of these checks fail, the commit is aborted, allowing you to fix the issues.

If the hooks do not run automatically, you may need to install them first:

```bash
# Install the git hooks defined in the configuration
pre-commit install

# After installation, you can run all checks on all files
pre-commit run -a
```
### 4.3. Updating All Dependencies

You need to keep the project dependencies fresh from time to time. This is automated with `just bump`. Keep in mind that for that command to work, `nix` must be installed and on the `$PATH`.
## 5. MCP Tools & Permissions
The handler filters tools dynamically based on the Sysdig user's permissions. Each tool declares mandatory permissions via `WithRequiredPermissions`.

| Tool | File | Description | Required Permissions | Example Prompt |
| --- | --- | --- | --- | --- |
| `list_runtime_events` | `tool_list_runtime_events.go` | Query runtime events with filters, cursor, scope. | `policy-events.read` | “Show high severity runtime events from last 2h.” |
| `get_event_info` | `tool_get_event_info.go` | Pull full payload for a single policy event. | `policy-events.read` | “Fetch event `abc123` details.” |
| `get_event_process_tree` | `tool_get_event_process_tree.go` | Retrieve the process tree for an event when available. | `policy-events.read` | “Show the process tree behind event `abc123`.” |
| `run_sysql` | `tool_run_sysql.go` | Execute caller-supplied Sysdig SysQL queries safely. | `sage.exec`, `risks.read` | “Run the following SysQL…” |
| `generate_sysql` | `tool_generate_sysql.go` | Convert natural language to SysQL via Sysdig Sage. | `sage.exec` (does not work with Service Accounts) | “Create a SysQL to list S3 buckets.” |
| `kubernetes_list_nodes` | `tool_kubernetes_list_nodes.go` | Lists Kubernetes node information. | `promql.exec` | "List all Kubernetes nodes in the cluster 'production-gke'" |
| `kubernetes_list_workloads` | `tool_kubernetes_list_workloads.go` | Lists Kubernetes workload information. | `promql.exec` | "List all desired workloads in the cluster 'production-gke' and namespace 'default'" |
| `kubernetes_list_pod_containers` | `tool_kubernetes_list_pod_containers.go` | Retrieves information from a particular pod and container. | `promql.exec` | "Show me info for pod 'my-pod' in cluster 'production-gke'" |
| `kubernetes_list_cronjobs` | `tool_kubernetes_list_cronjobs.go` | Retrieves information from the cronjobs in the cluster. | `promql.exec` | "List all cronjobs in cluster 'prod' and namespace 'default'" |
| `troubleshoot_kubernetes_list_top_unavailable_pods` | `tool_troubleshoot_kubernetes_list_top_unavailable_pods.go` | Shows the top N pods with the highest number of unavailable or unready replicas. | `promql.exec` | "Show the top 20 unavailable pods in cluster 'production'" |
| `troubleshoot_kubernetes_list_top_restarted_pods` | `tool_troubleshoot_kubernetes_list_top_restarted_pods.go` | Lists the pods with the highest number of container restarts. | `promql.exec` | "Show the top 10 pods with the most container restarts in cluster 'production'" |
| `troubleshoot_kubernetes_list_top_400_500_http_errors_in_pods` | `tool_troubleshoot_kubernetes_list_top_400_500_http_errors_in_pods.go` | Lists the pods with the highest rate of HTTP 4xx and 5xx errors over a specified time interval. | `promql.exec` | "Show the top 20 pods with the most HTTP errors in cluster 'production'" |
| `troubleshoot_kubernetes_list_top_network_errors_in_pods` | `tool_troubleshoot_kubernetes_list_top_network_errors_in_pods.go` | Shows the top network errors by pod over a given interval. | `promql.exec` | "Show the top 10 pods with the most network errors in cluster 'production'" |
| `troubleshoot_kubernetes_list_count_pods_per_cluster` | `tool_troubleshoot_kubernetes_list_count_pods_per_cluster.go` | List the count of running Kubernetes Pods grouped by cluster and namespace. | `promql.exec` | "List the count of running Kubernetes Pods in cluster 'production'" |
| `troubleshoot_kubernetes_list_underutilized_pods_by_cpu_quota` | `tool_troubleshoot_kubernetes_list_underutilized_pods_by_cpu_quota.go` | List Kubernetes pods with CPU usage below 25% of the quota limit. | `promql.exec` | "Show the top 10 underutilized pods by CPU quota in cluster 'production'" |
| `troubleshoot_kubernetes_list_underutilized_pods_by_memory_quota` | `tool_troubleshoot_kubernetes_list_underutilized_pods_by_memory_quota.go` | List Kubernetes pods with memory usage below 25% of the limit. | `promql.exec` | "Show the top 10 underutilized pods by memory quota in cluster 'production'" |
| `troubleshoot_kubernetes_list_top_cpu_consumed_by_workload` | `tool_troubleshoot_kubernetes_list_top_cpu_consumed_by_workload.go` | Identifies the Kubernetes workloads (all containers) consuming the most CPU (in cores). | `promql.exec` | "Show the top 10 workloads consuming the most CPU in cluster 'production'" |
| `troubleshoot_kubernetes_list_top_cpu_consumed_by_container` | `tool_troubleshoot_kubernetes_list_top_cpu_consumed_by_container.go` | Identifies the Kubernetes containers consuming the most CPU (in cores). | `promql.exec` | "Show the top 10 containers consuming the most CPU in cluster 'production'" |
| `troubleshoot_kubernetes_list_top_memory_consumed_by_workload` | `tool_troubleshoot_kubernetes_list_top_memory_consumed_by_workload.go` | Lists memory-intensive workloads (all containers). | `promql.exec` | "Show the top 10 workloads consuming the most memory in cluster 'production'" |
| `troubleshoot_kubernetes_list_top_memory_consumed_by_container` | `tool_troubleshoot_kubernetes_list_top_memory_consumed_by_container.go` | Lists memory-intensive containers. | `promql.exec` | "Show the top 10 containers consuming the most memory in cluster 'production'" |
## 6. Adding a New Tool
1. **Create Files:** Add `tool_<name>.go` and `tool_<name>_test.go` in `internal/infra/mcp/tools/`.

2. **Implement the Tool:**
   * Define a struct that holds the Sysdig client.
   * Implement the `handle` method, which contains the tool's core logic.
   * Implement the `RegisterInServer` method to define the tool's MCP schema, including its name, description, parameters, and required permissions. Use helpers from `utils.go` (a sketch follows these steps).

3. **Write Tests:** Use Ginkgo/Gomega to write BDD-style tests (see the test sketch below). Mock the Sysdig client to cover:
   - Parameter validation
   - Permission metadata
   - Sysdig API client interactions (mocked)
   - Error handling

4. **Register the Tool:** Add the new tool to `setupHandler()` in `cmd/server/main.go` (lines 88-114).

5. **Document:** Add the new tool to the README.md and the table in section 5 (MCP Tools & Permissions).
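
A rough skeleton of what steps 1–2 produce is sketched below. It is deliberately generic: the actual MCP SDK types, the option helpers in `utils.go`, and the generated Sysdig client interface differ, so treat every name here (`mcpServer`, `toolOption`, `sysdigAPI`, `exampleEventsTool`) as a hypothetical placeholder for the real signatures in `internal/infra/mcp/tools/`.

```go
// Hypothetical sketch of internal/infra/mcp/tools/tool_example_events.go.
// The real server, option, and client types come from the MCP SDK and the
// generated Sysdig client; the ones below are placeholders.
package tools

import (
	"context"
	"fmt"
)

// sysdigAPI is a placeholder for the subset of the generated Sysdig client
// that the tool needs; keeping it an interface makes the tool easy to mock.
type sysdigAPI interface {
	ListEvents(ctx context.Context, severity string) ([]string, error)
}

// mcpServer and toolOption stand in for the real registration API.
type mcpServer interface {
	AddTool(name, description string, handler func(ctx context.Context, args map[string]any) (string, error), opts ...toolOption)
}
type toolOption func()

// WithRequiredPermissions would normally come from utils.go.
func WithRequiredPermissions(perms ...string) toolOption { return func() {} }

// exampleEventsTool holds the Sysdig client (step 2, first bullet).
type exampleEventsTool struct {
	client sysdigAPI
}

// handle contains the tool's core logic (step 2, second bullet).
func (t *exampleEventsTool) handle(ctx context.Context, args map[string]any) (string, error) {
	severity, _ := args["severity"].(string)
	events, err := t.client.ListEvents(ctx, severity)
	if err != nil {
		return "", fmt.Errorf("listing events: %w", err)
	}
	return fmt.Sprintf("%d events found", len(events)), nil
}

// RegisterInServer declares the MCP schema and required permissions
// (step 2, third bullet).
func (t *exampleEventsTool) RegisterInServer(s mcpServer) {
	s.AddTool(
		"example_list_events",
		"Lists runtime events filtered by severity.",
		t.handle,
		WithRequiredPermissions("policy-events.read"),
	)
}
```

Keeping the Sysdig client behind a small interface is what makes the mocking in step 3 straightforward.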
- Each tool requires comprehensive test coverage for:
  - Parameter validation
  - Permission metadata
  - Sysdig API client interactions (mocked using go-mock)
  - Error handling
  - Integration tests marked with `_integration_test.go` suffix
  - No focused specs (`FDescribe`, `FIt`) should be committed
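
For the test side, a hedged Ginkgo/Gomega sketch is shown below. It exercises the hypothetical `exampleEventsTool` from the previous sketch with a hand-rolled fake instead of a go-mock generated client, so the mock setup and assertions are illustrative only; the real suites in `tool_<name>_test.go` follow the project's own mock generation and helpers.

```go
// Hypothetical sketch of internal/infra/mcp/tools/tool_example_events_test.go,
// testing the exampleEventsTool placeholder from the previous sketch.
package tools

import (
	"context"
	"errors"
	"testing"

	. "github.com/onsi/ginkgo/v2"
	. "github.com/onsi/gomega"
)

func TestTools(t *testing.T) {
	RegisterFailHandler(Fail)
	RunSpecs(t, "Tools Suite")
}

// fakeSysdigAPI is a hand-rolled stand-in for a go-mock generated client.
type fakeSysdigAPI struct {
	events []string
	err    error
}

func (f *fakeSysdigAPI) ListEvents(ctx context.Context, severity string) ([]string, error) {
	return f.events, f.err
}

var _ = Describe("exampleEventsTool", func() {
	It("summarizes the events returned by the Sysdig client", func() {
		tool := &exampleEventsTool{client: &fakeSysdigAPI{events: []string{"e1", "e2"}}}

		out, err := tool.handle(context.Background(), map[string]any{"severity": "high"})

		Expect(err).NotTo(HaveOccurred())
		Expect(out).To(Equal("2 events found"))
	})

	It("propagates client errors", func() {
		tool := &exampleEventsTool{client: &fakeSysdigAPI{err: errors.New("boom")}}

		_, err := tool.handle(context.Background(), map[string]any{})

		Expect(err).To(HaveOccurred())
	})
})
```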
## 7. Conventional Commits
All commit messages must follow the [Conventional Commits](https://www.conventionalcommits.org/) specification. This is essential for automated versioning and changelog generation.
## 8. Code Generation

- `internal/infra/sysdig/client.gen.go` is auto-generated from the OpenAPI spec via oapi-codegen.
- Run `just generate` to regenerate it after spec changes (regeneration is a manual step, not wired into `go generate ./...`).
- Generated code includes all Sysdig Secure API types and client methods.
- **DO NOT** manually edit `client.gen.go`. Extend functionality in separate files (e.g., `client_extension.go`); see the sketch below.
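
The extension pattern relies on Go's rule that any file in the same package can add code alongside generated files. Below is a minimal, invented example of the kind of helper that belongs in an extension file rather than in `client.gen.go`; the function and its behavior are not taken from the real codebase, only the Bearer-token convention is.

```go
// Hypothetical sketch of an extension file such as
// internal/infra/sysdig/client_extension.go. Because it lives in the same
// package as client.gen.go, it can add behavior without touching generated code.
package sysdig

import (
	"context"
	"fmt"
	"net/http"
)

// doAuthenticatedGET is an invented convenience helper, shown only to
// illustrate where custom request plumbing goes.
func doAuthenticatedGET(ctx context.Context, host, token, path string) (*http.Response, error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, host+path, nil)
	if err != nil {
		return nil, fmt.Errorf("building request: %w", err)
	}
	// Every request to the Sysdig APIs carries the Bearer token.
	req.Header.Set("Authorization", "Bearer "+token)
	return http.DefaultClient.Do(req)
}
```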
## 9. Important Constraints
1. **Generated Code**: Never manually edit `client.gen.go`. Extend functionality in separate files like `client_extension.go`.
2. **Service Account Limitation**: The `generate_sysql` tool does NOT work with Service Account tokens (returns 500). Use regular user API tokens for this tool.
3. **Permission Filtering**: Tools are hidden if the API token lacks required permissions. Check the user's Sysdig role if a tool is unexpectedly missing.
4. **stdio Mode Requirements**: When using stdio transport, `SYSDIG_MCP_API_HOST` and `SYSDIG_MCP_API_TOKEN` MUST be set. Remote transports can receive these via HTTP headers instead.
## 10. Troubleshooting
**Problem**: Tool not appearing in MCP client
- **Solution**: Check that the API token's permissions match the tool's `WithRequiredPermissions()`. Use the Sysdig UI: **Settings > Users & Teams > Roles**. The token must have **all** permissions listed.

**Problem**: "unable to authenticate with any method"
- **Solution**: For `stdio`, verify the `SYSDIG_MCP_API_HOST` and `SYSDIG_MCP_API_TOKEN` env vars are set correctly. For remote transports, check the `Authorization: Bearer <token>` header format.

**Problem**: Tests failing with "command not found"
- **Solution**: Enter the Nix shell with `nix develop` or `direnv allow`. All dev tools are provided by the flake.

**Problem**: `generate_sysql` returning 500 error
- **Solution**: This tool requires a regular user API token, not a Service Account token. Switch to a user-based token.

**Problem**: Pre-commit hooks not running
- **Solution**: Run `pre-commit install` to install the git hooks, then `pre-commit run -a` to test all files.
## 11. Releasing
The workflow in `.github/workflows/publish.yaml` will automatically create a new release when the version in `package.nix` changes on the default git branch.
So, to release a new version, you need to update that version. You should release a new version whenever you make a meaningful change that users can benefit from.
The guidelines to follow are:

* New feature is implemented -> Release new version.
* Bug fixes -> Release new version.
* CI/Refactorings/Internal changes -> No need to release new version.
* Documentation changes -> No need to release new version.

The current version of the project is not stable yet, so you need to follow the [Semver spec](https://semver.org/spec/v2.0.0.html) with the following guidelines:

* Unless specified, do not attempt to stabilize the version. That is, do not try to update the version to >=1.0.0. Versions for now should be <1.0.0.
* For minor changes, update only the Y in 0.X.Y. For example: 0.5.2 -> 0.5.3
* For major/feature changes, update the X in 0.X.Y and set the Y to 0. For example: 0.5.2 -> 0.6.0
* Before deciding whether the changes are minor or major, check all the commits since the last tag.

After the commit is merged into the default branch, the workflow will cross-compile the project and create a GitHub release of that version.
Check the workflow file in case of doubt.
## 12. Reference Links
- `README.md` – Comprehensive product docs, quickstart, and client configuration samples.
- `CLAUDE.md` – Complementary guide with additional examples and command reference.
- [Model Context Protocol](https://modelcontextprotocol.io/) – Protocol reference for tool/transport behavior.
- **Tools & New Tool Creation:** See `internal/infra/mcp/tools/README.md`.
- **Releasing:** See `docs/RELEASING.md`.
- **Troubleshooting:** See `docs/TROUBLESHOOTING.md`.