Merged
202 changes: 17 additions & 185 deletions AGENTS.md
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
# Sysdig MCP Server – Agent Developer Handbook

This document is a comprehensive guide for an AI agent tasked with developing and maintaining the Sysdig MCP Server. It covers everything from project setup and architecture to daily workflows and troubleshooting.
This document is a comprehensive guide for an AI agent tasked with developing and maintaining the Sysdig MCP Server.

## 1. Project Overview

@@ -44,6 +44,7 @@ For a full list of optional variables (e.g., for transport configuration), see t
### 3.1. Repository Layout

```
.github/workflows - CI Workflows
cmd/server/ - CLI entry point, tool registration
internal/
config/ - Environment variable loading and validation
@@ -55,6 +56,7 @@ internal/
docs/ - Documentation assets
justfile - Canonical development tasks (format, lint, test, generate, bump)
flake.nix - Defines the Nix development environment and its dependencies
package.nix - Defines how the package is built with Nix
```

### 3.2. Key Components & Flow
@@ -75,10 +77,10 @@ flake.nix - Defines the Nix development environment and its depen
- HTTP middleware extracts `Authorization` and `X-Sysdig-Host` headers for remote transports (lines 108-138)

4. **Sysdig Client (`internal/infra/sysdig/`):**
- `client.gen.go`: Generated OpenAPI client (**DO NOT EDIT**, regenerated via oapi-codegen)
- `client.gen.go`: Generated OpenAPI client (**DO NOT EDIT**, manually regenerated via oapi-codegen, not with `go generate`)
- `client.go`: Authentication strategies with fallback support
- Context-based auth: `WrapContextWithToken()` and `WrapContextWithHost()` for remote transports
- Fixed auth: `WithFixedHostAndToken()` for stdio mode
- Fixed auth: `WithFixedHostAndToken()` for stdio mode and remote transports
- Custom extensions in `client_extension.go` and `client_*.go` files

5. **Tools (`internal/infra/mcp/tools/`):**
@@ -87,26 +89,12 @@ flake.nix - Defines the Nix development environment and its depen
- Use `WithRequiredPermissions()` from `utils.go` to declare Sysdig API permissions
- Permission filtering happens automatically in handler

### 3.3. Authentication Flow

1. **stdio transport**: Fixed host/token from env vars (`SYSDIG_MCP_API_HOST`, `SYSDIG_MCP_API_TOKEN`)
2. **Remote transports**: Extract from HTTP headers (`Authorization: Bearer <token>`, `X-Sysdig-Host`)
3. Fallback chain: Try context auth first, then fall back to env var auth
4. Each request includes Bearer token in Authorization header to Sysdig APIs

### 3.4. Tool Permission System

- Each tool declares its required Sysdig API permissions using `WithRequiredPermissions("permission1", "permission2")`.
- Before exposing tools to the LLM, the handler calls the Sysdig `GetMyPermissions` API.
- The agent will only see tools for which the provided API token has **all** required permissions.
- Common permissions: `policy-events.read`, `sage.exec`, `risks.read`, `promql.exec`
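
The filtering rule is a simple subset check: a tool is exposed only when the token grants every permission the tool declares. A minimal sketch (names are illustrative):

```go
package main

import "fmt"

// hasAll reports whether the token's granted permissions cover every
// permission a tool requires.
func hasAll(granted map[string]bool, required []string) bool {
	for _, p := range required {
		if !granted[p] {
			return false
		}
	}
	return true
}

func main() {
	granted := map[string]bool{"policy-events.read": true, "promql.exec": true}
	fmt.Println(hasAll(granted, []string{"policy-events.read"}))      // true
	fmt.Println(hasAll(granted, []string{"sage.exec", "risks.read"})) // false
}
```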

## 4. Day-to-Day Workflow

1. **Enter the Dev Shell:** Always work inside the Nix shell (`nix develop` or `direnv allow`) to ensure all tools are available. You can assume the developer is already in a Nix shell.
1. **Enter the Dev Shell:** Always work inside the Nix shell (`nix develop` or `direnv allow`). You can assume the developer already did that.
2. **Make Focused Changes:** Implement a new tool, fix a bug, or improve documentation.
3. **Run Quality Gates:** Use `just` to run formatters, linters, and tests.
4. **Commit:** Follow the Conventional Commits specification. Keep the commit messages short, just title, no description. Pre-commit hooks will run quality gates automatically.
4. **Commit:** Follow the Conventional Commits specification.

### 4.1. Testing & Quality Gates

@@ -121,174 +109,18 @@ just check # A convenient alias for fmt + lint + test.

### 4.2. Pre-commit Hooks

This repository uses **pre-commit** to automate quality checks before each commit. The hooks are configured in `.pre-commit-config.yaml` to run `just fmt`, `just lint`, and `just test`.

This means that every time you run `git commit`, your changes are automatically formatted, linted, and tested. If any of these checks fail, the commit is aborted, allowing you to fix the issues.

If the hooks do not run automatically, you may need to install them first:
```bash
# Install the git hooks defined in the configuration
pre-commit install

# After installation, you can run all checks on all files
pre-commit run -a
```
This repository uses **pre-commit** to automate quality checks before each commit.
The hooks are configured in `.pre-commit-config.yaml` to run `just fmt`, `just lint`, and `just test`.
If any of the hooks fail, the commit will not be created.

### 4.3 Updating all dependencies

You need to keep the project dependencies fresh from time to time. This is automated with `just bump`. Keep in mind that for the command to work, `nix` must be installed and on the `$PATH`.

## 5. MCP Tools & Permissions

The handler filters tools dynamically based on the Sysdig user's permissions. Each tool declares mandatory permissions via `WithRequiredPermissions`.

| Tool | File | Capability | Required Permissions | Useful Prompts |
| --- | --- | --- | --- | --- |
| `list_runtime_events` | `tool_list_runtime_events.go` | Query runtime events with filters, cursor, scope. | `policy-events.read` | “Show high severity runtime events from last 2h.” |
| `get_event_info` | `tool_get_event_info.go` | Pull full payload for a single policy event. | `policy-events.read` | “Fetch event `abc123` details.” |
| `get_event_process_tree` | `tool_get_event_process_tree.go` | Retrieve the process tree for an event when available. | `policy-events.read` | “Show the process tree behind event `abc123`.” |
| `run_sysql` | `tool_run_sysql.go` | Execute caller-supplied Sysdig SysQL queries safely. | `sage.exec`, `risks.read` | “Run the following SysQL…”. |
| `generate_sysql` | `tool_generate_sysql.go` | Convert natural language to SysQL via Sysdig Sage. | `sage.exec` (does not work with Service Accounts) | “Create a SysQL to list S3 buckets.” |
| `kubernetes_list_clusters` | `tool_kubernetes_list_clusters.go` | Lists Kubernetes cluster information. | `promql.exec` | "List all Kubernetes clusters" |
| `kubernetes_list_nodes` | `tool_kubernetes_list_nodes.go` | Lists Kubernetes node information. | `promql.exec` | "List all Kubernetes nodes in the cluster 'production-gke'" |
| `kubernetes_list_workloads` | `tool_kubernetes_list_workloads.go` | Lists Kubernetes workload information. | `promql.exec` | "List all desired workloads in the cluster 'production-gke' and namespace 'default'" |
| `kubernetes_list_pod_containers` | `tool_kubernetes_list_pod_containers.go` | Retrieves information from a particular pod and container. | `promql.exec` | "Show me info for pod 'my-pod' in cluster 'production-gke'" |
| `kubernetes_list_cronjobs` | `tool_kubernetes_list_cronjobs.go` | Retrieves information from the cronjobs in the cluster. | `promql.exec` | "List all cronjobs in cluster 'prod' and namespace 'default'" |
| `troubleshoot_kubernetes_list_top_unavailable_pods` | `tool_troubleshoot_kubernetes_list_top_unavailable_pods.go` | Shows the top N pods with the highest number of unavailable or unready replicas. | `promql.exec` | "Show the top 20 unavailable pods in cluster 'production'" |
| `troubleshoot_kubernetes_list_top_restarted_pods` | `tool_troubleshoot_kubernetes_list_top_restarted_pods.go` | Lists the pods with the highest number of container restarts. | `promql.exec` | "Show the top 10 pods with the most container restarts in cluster 'production'" |
| `troubleshoot_kubernetes_list_top_400_500_http_errors_in_pods` | `tool_troubleshoot_kubernetes_list_top_400_500_http_errors_in_pods.go` | Lists the pods with the highest rate of HTTP 4xx and 5xx errors over a specified time interval. | `promql.exec` | "Show the top 20 pods with the most HTTP errors in cluster 'production'" |
| `troubleshoot_kubernetes_list_top_network_errors_in_pods` | `tool_troubleshoot_kubernetes_list_top_network_errors_in_pods.go` | Shows the top network errors by pod over a given interval. | `promql.exec` | "Show the top 10 pods with the most network errors in cluster 'production'" |
| `troubleshoot_kubernetes_list_count_pods_per_cluster` | `tool_troubleshoot_kubernetes_list_count_pods_per_cluster.go` | List the count of running Kubernetes Pods grouped by cluster and namespace. | `promql.exec` | "List the count of running Kubernetes Pods in cluster 'production'" |
| `troubleshoot_kubernetes_list_underutilized_pods_by_cpu_quota` | `tool_troubleshoot_kubernetes_list_underutilized_pods_by_cpu_quota.go` | List Kubernetes pods with CPU usage below 25% of the quota limit. | `promql.exec` | "Show the top 10 underutilized pods by CPU quota in cluster 'production'" |
| `troubleshoot_kubernetes_list_underutilized_pods_by_memory_quota` | `tool_troubleshoot_kubernetes_list_underutilized_pods_by_memory_quota.go` | List Kubernetes pods with memory usage below 25% of the limit. | `promql.exec` | "Show the top 10 underutilized pods by memory quota in cluster 'production'" |
| `troubleshoot_kubernetes_list_top_cpu_consumed_by_workload` | `tool_troubleshoot_kubernetes_list_top_cpu_consumed_by_workload.go` | Identifies the Kubernetes workloads (all containers) consuming the most CPU (in cores). | `promql.exec` | "Show the top 10 workloads consuming the most CPU in cluster 'production'" |
| `troubleshoot_kubernetes_list_top_cpu_consumed_by_container` | `tool_troubleshoot_kubernetes_list_top_cpu_consumed_by_container.go` | Identifies the Kubernetes containers consuming the most CPU (in cores). | `promql.exec` | "Show the top 10 containers consuming the most CPU in cluster 'production'" |
| `troubleshoot_kubernetes_list_top_memory_consumed_by_workload` | `tool_troubleshoot_kubernetes_list_top_memory_consumed_by_workload.go` | Lists memory-intensive workloads (all containers). | `promql.exec` | "Show the top 10 workloads consuming the most memory in cluster 'production'" |
| `troubleshoot_kubernetes_list_top_memory_consumed_by_container` | `tool_troubleshoot_kubernetes_list_top_memory_consumed_by_container.go` | Lists memory-intensive containers. | `promql.exec` | "Show the top 10 containers consuming the most memory in cluster 'production'" |

## 6. Adding a New Tool

1. **Create Files:** Add `tool_<name>.go` and `tool_<name>_test.go` in `internal/infra/mcp/tools/`.

2. **Implement the Tool:**
* Define a struct that holds the Sysdig client.
* Implement the `handle` method, which contains the tool's core logic.
* Implement the `RegisterInServer` method to define the tool's MCP schema, including its name, description, parameters, and required permissions. Use helpers from `utils.go`.

3. **Write Tests:** Use Ginkgo/Gomega to write BDD-style tests. Mock the Sysdig client to cover:
- Parameter validation
- Permission metadata
- Sysdig API client interactions (mocked)
- Error handling

4. **Register the Tool:** Add the new tool to `setupHandler()` in `cmd/server/main.go` (lines 88-114).

5. **Document:** Add the new tool to the README.md and the table in section 5 (MCP Tools & Permissions).

### 6.1. Example Tool Structure

```go
type ToolMyFeature struct {
	sysdigClient sysdig.ExtendedClientWithResponsesInterface
}

func (h *ToolMyFeature) handle(ctx context.Context, request mcp.CallToolRequest) (*mcp.CallToolResult, error) {
	param := request.GetString("param_name", "")
	response, err := h.sysdigClient.SomeAPICall(ctx, param)
	if err != nil {
		return nil, err
	}
	// Handle response...
	return mcp.NewToolResultJSON(response.JSON200)
}

func (h *ToolMyFeature) RegisterInServer(s *server.MCPServer) {
	tool := mcp.NewTool("my_feature",
		mcp.WithDescription("What this tool does"),
		mcp.WithString("param_name",
			mcp.Required(),
			mcp.Description("Parameter description"),
		),
		mcp.WithReadOnlyHintAnnotation(true),
		mcp.WithDestructiveHintAnnotation(false),
		WithRequiredPermissions("permission.name"),
	)
	s.AddTool(tool, h.handle)
}
```

### 6.2. Testing Philosophy

- Use BDD-style tests with Ginkgo/Gomega
- Each tool requires comprehensive test coverage for:
- Parameter validation
- Permission metadata
- Sysdig API client interactions (mocked using go-mock)
- Error handling
- Integration tests marked with `_integration_test.go` suffix
- No focused specs (`FDescribe`, `FIt`) should be committed

## 7. Conventional Commits

All commit messages must follow the [Conventional Commits](https://www.conventionalcommits.org/) specification. This is essential for automated versioning and changelog generation.

- **Types**: `feat`, `fix`, `docs`, `style`, `refactor`, `test`, `chore`, `build`, `ci`.
- **Format**: `<type>(<optional scope>): <imperative description>`
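
A loose sketch of that format as a lint aid (this regex is an approximation of the shape used here, not the full Conventional Commits grammar):

```go
package main

import (
	"fmt"
	"regexp"
)

// commitRe approximates <type>(<optional scope>): <imperative description>
// for the types listed above. It accepts an optional "!" for breaking changes.
var commitRe = regexp.MustCompile(`^(feat|fix|docs|style|refactor|test|chore|build|ci)(\([a-z0-9-]+\))?!?: .+`)

func main() {
	for _, msg := range []string{
		"feat(tools): add kubernetes_list_clusters",
		"fix: handle empty cursor",
		"added stuff", // non-conforming
	} {
		fmt.Println(msg, commitRe.MatchString(msg))
	}
}
```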

## 8. Code Generation

- `internal/infra/sysdig/client.gen.go` is auto-generated from OpenAPI spec via oapi-codegen.
- Run `go generate ./...` (or `just generate`) to regenerate after spec changes.
- Generated code includes all Sysdig Secure API types and client methods.
- **DO NOT** manually edit `client.gen.go`. Extend functionality in separate files (e.g., `client_extension.go`).

## 9. Important Constraints

1. **Generated Code**: Never manually edit `client.gen.go`. Extend functionality in separate files like `client_extension.go`.

2. **Service Account Limitation**: The `generate_sysql` tool does NOT work with Service Account tokens (returns 500). Use regular user API tokens for this tool.

3. **Permission Filtering**: Tools are hidden if the API token lacks required permissions. Check user's Sysdig role if a tool is unexpectedly missing.

4. **stdio Mode Requirements**: When using stdio transport, `SYSDIG_MCP_API_HOST` and `SYSDIG_MCP_API_TOKEN` MUST be set. Remote transports can receive these via HTTP headers instead.

## 10. Troubleshooting

**Problem**: Tool not appearing in MCP client
- **Solution**: Check API token permissions match tool's `WithRequiredPermissions()`. Use Sysdig UI: **Settings > Users & Teams > Roles**. The token must have **all** permissions listed.

**Problem**: "unable to authenticate with any method"
- **Solution**: For `stdio`, verify `SYSDIG_MCP_API_HOST` and `SYSDIG_MCP_API_TOKEN` env vars are set correctly. For remote transports, check `Authorization: Bearer <token>` header format.

**Problem**: Tests failing with "command not found"
- **Solution**: Enter Nix shell with `nix develop` or `direnv allow`. All dev tools are provided by the flake.

**Problem**: `generate_sysql` returning 500 error
- **Solution**: This tool requires a regular user API token, not a Service Account token. Switch to a user-based token.

**Problem**: Pre-commit hooks not running
- **Solution**: Run `pre-commit install` to install git hooks, then `pre-commit run -a` to test all files.

## 11. Releasing

The workflow in `.github/workflows/publish.yaml` creates a new release automatically when the version of the package changes in `package.nix` on the default git branch.
To release a new version, you therefore need to bump that version. Release a new version whenever you make a meaningful change that users can benefit from.
The guidelines are:

* New feature is implemented -> Release new version.
* Bug fixes -> Release new version.
* CI/Refactorings/Internal changes -> No need to release new version.
* Documentation changes -> No need to release new version.

The current version of the project is not stable yet, so you need to follow the [Semver spec](https://semver.org/spec/v2.0.0.html), with the following guidelines:

* Unless specified, do not attempt to stabilize the version. That is, do not try to update the version to >=1.0.0. Versions for now should be <1.0.0.
* For minor changes, update only the Y in 0.X.Y. For example: 0.5.2 -> 0.5.3
* For major/feature changes, update the X in 0.X.Y and set the Y to 0. For example: 0.5.2 -> 0.6.0
* Before choosing if the changes are minor or major, check all the commits since the last tag.

After the commit is merged into the default branch, the workflow cross-compiles the project and creates a GitHub release for that version.
Check the workflow file in case of doubt.
Automated with `just bump`. Requires `nix` to be installed and on the `$PATH`.

## 12. Reference Links
## 5. Guides & Reference

- `README.md` – Comprehensive product docs, quickstart, and client configuration samples.
- `CLAUDE.md` – Complementary guide with additional examples and command reference.
- [Model Context Protocol](https://modelcontextprotocol.io/) – Protocol reference for tool/transport behavior.
* **Tools & New Tool Creation:** See `internal/infra/mcp/tools/README.md`
* **Releasing:** See `docs/RELEASING.md`
* **Troubleshooting:** See `docs/TROUBLESHOOTING.md`
* **Conventional Commits:** [Specification](https://www.conventionalcommits.org/)
* **Protocol:** [Model Context Protocol](https://modelcontextprotocol.io/)
20 changes: 20 additions & 0 deletions docs/RELEASING.md
@@ -0,0 +1,20 @@
# Releasing

The workflow in `.github/workflows/publish.yaml` creates a new release automatically when the version of the package changes in `package.nix` on the default git branch.
To release a new version, you therefore need to bump that version. Release a new version whenever you make a meaningful change that users can benefit from.
The guidelines are:

* New feature is implemented -> Release new version.
* Bug fixes -> Release new version.
* CI/Refactorings/Internal changes -> No need to release new version.
* Documentation changes -> No need to release new version.

The current version of the project is not stable yet, so you need to follow the [Semver spec](https://semver.org/spec/v2.0.0.html), with the following guidelines:

* Unless specified, do not attempt to stabilize the version. That is, do not try to update the version to >=1.0.0. Versions for now should be <1.0.0.
* For minor changes, update only the Y in 0.X.Y. For example: 0.5.2 -> 0.5.3
* For major/feature changes, update the X in 0.X.Y and set the Y to 0. For example: 0.5.2 -> 0.6.0
* Before choosing if the changes are minor or major, check all the commits since the last tag.
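
The 0.X.Y rules above can be expressed as a small helper. A minimal sketch, assuming a plain `X.Y.Z` version string (this is illustrative; the actual bump is done by `just bump`):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// bump applies the pre-1.0 rules: bug fixes bump Y, feature changes bump X
// and reset Y to 0.
func bump(version string, feature bool) string {
	parts := strings.Split(version, ".")
	major, _ := strconv.Atoi(parts[0])
	x, _ := strconv.Atoi(parts[1])
	y, _ := strconv.Atoi(parts[2])
	if feature {
		x++
		y = 0
	} else {
		y++
	}
	return fmt.Sprintf("%d.%d.%d", major, x, y)
}

func main() {
	fmt.Println(bump("0.5.2", false)) // 0.5.3
	fmt.Println(bump("0.5.2", true))  // 0.6.0
}
```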

After the commit is merged into the default branch, the workflow cross-compiles the project and creates a GitHub release for that version.
Check the workflow file in case of doubt.
16 changes: 16 additions & 0 deletions docs/TROUBLESHOOTING.md
@@ -0,0 +1,16 @@
# Troubleshooting

**Problem**: Tool not appearing in MCP client
- **Solution**: Check API token permissions match tool's `WithRequiredPermissions()`. The token must have **all** permissions listed.

**Problem**: "unable to authenticate with any method"
- **Solution**: For `stdio`, verify `SYSDIG_MCP_API_HOST` and `SYSDIG_MCP_API_TOKEN` env vars are set correctly. For remote transports, check `Authorization: Bearer <token>` header format.

**Problem**: Tests failing with "command not found"
- **Solution**: Enter Nix shell with `nix develop` or `direnv allow`. All dev tools are provided by the flake.

**Problem**: `generate_sysql` returning 500 error
- **Solution**: This tool requires a regular user API token, not a Service Account token. Switch to a user-based token.

**Problem**: Pre-commit hooks not running
- **Solution**: Run `pre-commit install` to install git hooks, then `pre-commit run -a` to test all files.