44 commits
b21a264
Update team fork naming convention in README
KrithiAS10 Apr 15, 2026
3bfbc00
added agents.md
jtuluve Apr 15, 2026
7d1db21
Added frontend
KrithiAS10 Apr 15, 2026
629b8e2
Update dashboard/package.json
jtuluve Apr 15, 2026
788c52a
gemini reviews fixed
jtuluve Apr 15, 2026
825727e
initial observation layer
jtuluve Apr 15, 2026
d3f487c
Add Kubernetes deployment configurations and update README
jtuluve Apr 15, 2026
dbd9c32
Enhance backend configuration and observability features
jtuluve Apr 15, 2026
0c82b8e
dummy public
KrithiAS10 Apr 15, 2026
2ede624
fixes
KrithiAS10 Apr 15, 2026
c21e02b
Merge branch 'feat/obs-layer' of https://github.com/KrithiAS10/hackto…
KrithiAS10 Apr 15, 2026
432456a
initial observation layer
jtuluve Apr 15, 2026
5c092f7
Add Kubernetes deployment configurations and update README
jtuluve Apr 15, 2026
d42a4f9
dummy public
KrithiAS10 Apr 15, 2026
348e718
fixes
KrithiAS10 Apr 15, 2026
3ec069c
Enhance backend configuration and observability features
jtuluve Apr 15, 2026
07ce5be
cpu, memory usage fix
KrithiAS10 Apr 15, 2026
022dad7
Merge branch 'main' into feat/obs-layer
KrithiAS10 Apr 15, 2026
15cfcde
minor fixes
KrithiAS10 Apr 15, 2026
8f57d57
Merge pull request #3 from KrithiAS10/feat/obs-layer
KrithiAS10 Apr 15, 2026
3fec8fb
Add Redis integration and enhance backend features
jtuluve Apr 15, 2026
69b9634
Merge branch 'main' into feat/det
jtuluve Apr 15, 2026
471b06b
minor fixes
jtuluve Apr 15, 2026
99c6c6e
Merge pull request #4 from KrithiAS10/feat/det
jtuluve Apr 15, 2026
c959c26
untested tools and agent testing
jtuluve Apr 16, 2026
b553989
Created detection layer and some other stuff
KrithiAS10 Apr 16, 2026
c252f35
single agent setup worked!
jtuluve Apr 16, 2026
702d6b5
temp commit
KrithiAS10 Apr 16, 2026
32b5d05
Major changes
jtuluve Apr 16, 2026
3ed7498
some more changes
jtuluve Apr 16, 2026
2d2a800
Implement orchestrator agent and enhance service configurations
KrithiAS10 Apr 16, 2026
b182e08
minor fix
KrithiAS10 Apr 16, 2026
72ca662
something
jtuluve Apr 16, 2026
145b7b2
Merge branch 'feat/single-agent' of https://github.com/KrithiAS10/hac…
jtuluve Apr 16, 2026
f68a3e7
feat: add backend Dockerfile, Kustomize deployment manifests, and loc…
KrithiAS10 Apr 16, 2026
9065ce7
model id fix, chat boilerplate removed
jtuluve Apr 16, 2026
ec76ead
Merge branch 'feat/single-agent' of https://github.com/KrithiAS10/hac…
jtuluve Apr 16, 2026
a7e8045
Completed v1 with multi agent Setup!! #5
jtuluve Apr 16, 2026
dbc9fda
readme fixed
KrithiAS10 Apr 16, 2026
8fd8e46
minor fixes
KrithiAS10 Apr 16, 2026
ccffb42
Enhance agent cost management and incident reporting features. Introd…
jtuluve Apr 17, 2026
1acd312
Merge branch 'feat/single-agent' of https://github.com/KrithiAS10/hac…
jtuluve Apr 17, 2026
1f89681
major change
jtuluve Apr 17, 2026
b7c7cc6
Merge pull request #6 from KrithiAS10/feat/single-agent
jtuluve Apr 17, 2026
18 changes: 18 additions & 0 deletions -
@@ -0,0 +1,18 @@
apiVersion: v1
clusters:
- cluster:
certificate-authority-data: LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSURCVENDQWUyZ0F3SUJBZ0lJTXI4QnEvNlBmaTh3RFFZSktvWklodmNOQVFFTEJRQXdGVEVUTUJFR0ExVUUKQXhNS2EzVmlaWEp1WlhSbGN6QWVGdzB5TmpBME1UWXdOVFE1TXpsYUZ3MHpOakEwTVRNd05UVTBNemxhTUJVeApFekFSQmdOVkJBTVRDbXQxWW1WeWJtVjBaWE13Z2dFaU1BMEdDU3FHU0liM0RRRUJBUVVBQTRJQkR3QXdnZ0VLCkFvSUJBUUNjSFMybWdnUjZIcHJuekduenVtUUVNWEJVcFMvQm43c1JWZ3dHMzZzUExTelpBRTBPRlBDdFVYNE8KakMzbGZ3bzltTHU2UW00b1k2OEtEbHRxT3pPZnVXWmJQQU4veHJRcGpXYVZLVnp6YjExbkJtYnpuT2xZTVJ1ZQovOUM4cmtwcDFIWHA3aTRRanQrcHlZMU9xQXl5bDEreDk3RDY0di84REtHZWo0amdNYmN4aU9jMFU0c3BvYWhXCnkxZEMwYiswN1k4dncxcjA2ZmMycEZWSGpkeEk3eVRXN2JZeDZxNFJ6UllNZlg2TUtkVlVzVjdSOWtwUkRkQTUKSWR3OElINU1TYVpPRzdBTHVrVG4rcjEwa2diTUZMbzNuOHRwKytDMWtUT3lFL3hxMTdRQmQxN0JZb0VwMWM4ego4S1hVTzlxenU3UEt2RjlPRGRDcUxmVW5lVHlyQWdNQkFBR2pXVEJYTUE0R0ExVWREd0VCL3dRRUF3SUNwREFQCkJnTlZIUk1CQWY4RUJUQURBUUgvTUIwR0ExVWREZ1FXQkJTclNWUERFMmQ3aDlERjBVRkRzY1UxcmVXMnZ6QVYKQmdOVkhSRUVEakFNZ2dwcmRXSmxjbTVsZEdWek1BMEdDU3FHU0liM0RRRUJDd1VBQTRJQkFRQVBCaGxDR3JwQwpMQzYvQmRkSzlFU3VObE1wNTFXY2xlTWUrY0k4Q01ZR3A0bUJHNU5GZVVJa2taMEtYTzc5Q1Y5T3V5cEZCWjd5ClRmMzV1RWdGb1RMYXJrRGR3SVdhNkJhTk9mWUpEaXNVYkRCcytBT0RjaW5TbW9RQkMzeHpXalg1UFhxNnRtMUUKbmw2T3V3ejN5R3BrOVdyWUQ2cnJCNlZkUjlHQmhqV3lRYk5HV3hqSzJyUWpQYnZPWWlJVzh5NEpMSHR3NWRBbAp0ajFuTkpnZ042MEx1SGNDckdvSFY2cm55RnlGMHl5ZjBzTHEzVlZUS0VHeTdvZm93bmxvRFpHV3VTRjNIZ2xVCkF1all5NlJRRDN0cG5wTlNxY3hNT1pUN3pvV0o1VDBZZEJIUDNLd2Q4d2hBMlhNUmlpZy9VcFJkREdsN1cvbTkKZ3B2amw1WEFEL2R2Ci0tLS0tRU5EIENFUlRJRklDQVRFLS0tLS0K
server: https://127.0.0.1:51723
name: kind-lerna
contexts:
- context:
cluster: kind-lerna
user: kind-lerna
name: kind-lerna
current-context: kind-lerna
kind: Config
users:
- name: kind-lerna
user:
client-certificate-data: LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSURLVENDQWhHZ0F3SUJBZ0lJUUljb056bkdlT2d3RFFZSktvWklodmNOQVFFTEJRQXdGVEVUTUJFR0ExVUUKQXhNS2EzVmlaWEp1WlhSbGN6QWVGdzB5TmpBME1UWXdOVFE1TXpsYUZ3MHlOekEwTVRZd05UVTBNemxhTUR3eApIekFkQmdOVkJBb1RGbXQxWW1WaFpHMDZZMngxYzNSbGNpMWhaRzFwYm5NeEdUQVhCZ05WQkFNVEVHdDFZbVZ5CmJtVjBaWE10WVdSdGFXNHdnZ0VpTUEwR0NTcUdTSWIzRFFFQkFRVUFBNElCRHdBd2dnRUtBb0lCQVFDOS83WlYKTWkvYnNXYmY4VmZES3ZsOHlQWlI3dE1ybUNwbUNoQUlpN3FlUUFyNC9IRk5uZHNUZ1lMWTQ4aGRvMGJ4akdiZApaS1BPYi9hM1VHWHNFanZKcHA4OGtaelorR1FjemhHcVdwbVBxdmR5WnFNN1BzNVVUcEVzMlVCZnVVMm1jRlNjCjZ3Zk1uUmRCUHBkNS9uSnZIS0pPVXgyUWdvbVc0bzJRb0tacnFiS3ovY25EL2l6UlFLTkRzMENsL2VSNGV0VGsKVXQxNloycnVNMHFtekpWeW5ucG85UFh1UnJZanVTRlVOWDE3S3NDYW92bm5YeEJrNkhxYmZsQVgzSlYwSG9DQgptTVVGOVhYRFR2T2hrc2VIVUo2NnZFVDJvcEl4ckRMU1NTVnIxc2txU1hPdC9jZGRTQkJJbWF2eUo0ZlpzZmRoCkFMdGtQZlIvK3JxWlVrYURBZ01CQUFHalZqQlVNQTRHQTFVZER3RUIvd1FFQXdJRm9EQVRCZ05WSFNVRUREQUsKQmdnckJnRUZCUWNEQWpBTUJnTlZIUk1CQWY4RUFqQUFNQjhHQTFVZEl3UVlNQmFBRkt0SlU4TVRaM3VIME1YUgpRVU94eFRXdDViYS9NQTBHQ1NxR1NJYjNEUUVCQ3dVQUE0SUJBUUNHRnY2R0ZOKzNOSXFBcGltb25nSVh4SG9HCkowNFJMalFtZ3lERTU2dlVHWnR0YStpN3lDUDk1Um40cFBzem5zQzMwUzdCc1h6TW1DZGhhZGR0cXhXaXdGMVUKVGRoVUJURmdtOHVPVmZCUXZ5TWZhL0l6TDlITkgwNGgzUEZ1K0NZSkc4bitHaWNFM3dwSlFEQ3ZGUFJQUitMWgpqcE9ualFMMWRvNWEwV2htajdONlRqcmE2cGx5eE9XSmRFemQzb21RZHVsQjk3ak1BWGViMmF2bWd3Smc2NkNFCllrdEZENDloOWRzRyt0cCtCR3BrZUcyUndXbW1sNkUyMzREblpTWjQvWlQ1WkRTN3VYTVU4SjQ3YmZWdWx2QVQKOGplcVJCS2dQTm1MTkxDalViYUlsSUNpV0N5bTZSaEJPVE9pQk03THFLQVUvTDFyTFRGRkt1cnpybjM5Ci0tLS0tRU5EIENFUlRJRklDQVRFLS0tLS0K
client-key-data: LS0tLS1CRUdJTiBSU0EgUFJJVkFURSBLRVktLS0tLQpNSUlFcEFJQkFBS0NBUUVBdmYrMlZUSXYyN0ZtMy9GWHd5cjVmTWoyVWU3VEs1Z3FaZ29RQ0l1Nm5rQUsrUHh4ClRaM2JFNEdDMk9QSVhhTkc4WXhtM1dTanptLzJ0MUJsN0JJN3lhYWZQSkdjMmZoa0hNNFJxbHFaajZyM2NtYWoKT3o3T1ZFNlJMTmxBWDdsTnBuQlVuT3NIekowWFFUNlhlZjV5Ynh5aVRsTWRrSUtKbHVLTmtLQ21hNm15cy8zSgp3LzRzMFVDalE3TkFwZjNrZUhyVTVGTGRlbWRxN2pOS3BzeVZjcDU2YVBUMTdrYTJJN2toVkRWOWV5ckFtcUw1CjUxOFFaT2g2bTM1UUY5eVZkQjZBZ1pqRkJmVjF3MDd6b1pMSGgxQ2V1cnhFOXFLU01hd3kwa2tsYTliSktrbHoKcmYzSFhVZ1FTSm1yOGllSDJiSDNZUUM3WkQzMGYvcTZtVkpHZ3dJREFRQUJBb0lCQUFIMzcwU3NzM0E4UTB1WQpyWWNaSCtLYUZtczg1VFV6YTJVSlA2ZEhBMVQyWnVhemZ0MEdBS29RRW5INjBpMmVMbkw4T0dpY3pWR3JPVXdtCjZoZHJEUEdHNTJseVBNVEpYUWdyWG1WOGNORGJQWnNTMHlnZSszWkdKaHpuMTFIbWtwWmgzWTZPcE5NSzRaM00KYnpkVldvd3FLTWhVOWg1MEs4YkRiQ0lPZUFydmY1ZGhZZ1hsOGU2QVFXajNvTDlLczljRHdFbDBBN044K0FwQQpBOXRaZEg3MEhpQzQwWmlaMHJZd0RDbmJhUFF0dVpGMmw4Mjk0MmpIR245aWovWmlabElxMVBhbkdzK0tCcDNaCllrUGYwUythaEhtOGlhY2FuMC91VitvMGdXM1pRamx3a0xETDY3VDFlTWU0bjJ0Ly9PTXlrRzJZb1FhWmk1V1EKTG9YNGJia0NnWUVBNElIVEw4NURFWDU0ekhwQXMwTGVHd0RkWndvSUxWUlRMUjRKU0czdEpWaHR2OTRmRVdnbgpDVHg2VmU2bXJQY0srNjRuUlZsaEM0c0xrL2NmU283Vk9oUmNpMmNkNUUvS2paRDIreWU1ZWlRa0ZtTUFmZkswCm93QkJXVmZoVUxxVVE5eE9oTEs4ZEc5T2o5WFhYUFIwYndISmc0UVR2MXZJZWU2bEtBdkdDejhDZ1lFQTJLYXQKUnVpTUN2RndCcFhIM2huUHp0b3NOVGpQdUVENDVZRzQzR0haMTBESzVyYjN0SHZKT0FWL2dxcTIyYmhHVWp5NgprVjE0MzBaQXRYSFR4TFVCRHdCUUMrdFRmaEMzR2dMV202Rm55NGc3OWZZVU8wWXVvQytwZm1pRzRvSS8ycUJBClFjcGpjWU9XRU1hdkJ1N1hoYnZXZnh6L21ldVc5dkh3MzhraXg3MENnWUVBeWlYYmhGWVNxYlBaRFRTZkFVb2EKTnZKR2FMcnR0Zk1SbWJSTDQzMm5aRk1GTHhmUG5ack1XMUtyVEtqQVIwbUNDREE5aUFIOGthbzNXSm5SQVE4dgpDMGErTlg4NXVSUG5iQ1MxWGx2Y2RCQUt0bVdhVWMyeHZIdEVYQy8yM3Z2QStJRnI2YXdPYUVDNDJtWlBycEVtCkxiWE1QckUwSHIrRCtkWlp1Mzh1YVgwQ2dZRUEwSFBteXhBYkZyaGhVbVN4RHZrRTRvRTNBZXBzcWxzUllEbmwKZFY1TTdIaHlBWFRRZHY2WGgraDZYRzRIU3dxcjFwcUo1QzNzaTkrYmlUbEJTY1hpZzkySUp6L0FjTTZDYm11RwpzKzJqNGNodDhPVlphQUxKLytSOEQ1MWhFdlhCbklpTjZ2OWhtU25Eck5hT04zeDlNRGFnVm1PL1p3aXZrMkVNCm96VnkybjBDZ1lBSnlldzFFSGt6djNwdWN
EaWE5Yms5ejdDdm1uV0d4WDl3K3BUSDhYcTJ2Y2c3TzArTnhVTmYKVXZjd0ZKM2NZSmtuZHNveW5EL1EyeEY3aUJzMDFNZnJCeHFNZnhiY2g1MVE3bGVQRHIvdXVXWWlyNGozNTJjQwpZR3lrZ2xLOU00TExwOWRBaG0vZ0hNMFM3UXZ6KzczcWxiQ3c5bGZuWjhXYjh6a3ZCcUlvcHc9PQotLS0tLUVORCBSU0EgUFJJVkFURSBLRVktLS0tLQo=
27 changes: 27 additions & 0 deletions .dockerignore
@@ -0,0 +1,27 @@
# Never bake local secrets into images (agents/backend builds use repo root as context).
.env
**/.env

# Git and local editor state
.git
.git/**
.cursor
.cursor/**
.vscode
.vscode/**

# Python caches and local virtual envs
**/__pycache__/
**/*.pyc
**/.pytest_cache/
agents-layer/venv/
backend/venv/
detection-service/venv/

# Frontend build artifacts (not needed for Python image builds from repo root)
dashboard/node_modules/
dashboard/.next/

# Local terminal/transcript artifacts
agent-transcripts/
mcps/
4 changes: 4 additions & 0 deletions .gitignore
@@ -0,0 +1,4 @@
.env
venv
__pycache__
node_modules
25 changes: 25 additions & 0 deletions .vscode/tasks.json
@@ -0,0 +1,25 @@
{
  "version": "2.0.0",
  "tasks": [
    {
      "label": "Deploy Lerna to kind-testapp",
      "type": "shell",
      "command": "powershell",
      "args": [
        "-NoProfile",
        "-File",
        "./scripts/deploy.ps1"
      ]
    },
    {
      "label": "Check Lerna rollout status",
      "type": "shell",
      "command": "powershell",
      "args": [
        "-NoProfile",
        "-Command",
        "kubectl get pods -n lerna -o wide; kubectl get svc -n lerna; kubectl get ingress -n lerna"
      ]
    }
  ]
}
79 changes: 79 additions & 0 deletions AGENTS.md
@@ -0,0 +1,79 @@
# Project Context: Clueless (Project Lerna)

## Overview
Project Lerna is an autonomous SRE system for Kubernetes clusters. It extends basic Kubernetes self-healing by using a multi-agent workflow that can detect incidents, diagnose root causes, plan remediation, execute fixes in a safe sandbox, and validate outcomes.

The goal is to reduce manual incident triage across logs, metrics, and traces while keeping a human operator in control through approval workflows.

## Problem Statement
- Modern Kubernetes microservice systems can fail in cascading ways.
- Native Kubernetes recovery is reactive and limited (restart/reschedule).
- Root cause analysis across observability tooling is manual and slow.
- Need: an intelligent, trace-aware system that can diagnose and restore stability safely.

## Solution Summary
- Multi-agent incident response pipeline from detection to validation.
- Sandbox-first execution model to test fixes away from production.
- Operator dashboard for configuration, monitoring, approvals, and overrides.
- Memory-driven incident matching via semantic retrieval of past incidents.

## Core Capabilities
- Risk-free sandboxing of remediation actions.
- Trace-driven diagnosis (OpenTelemetry-centric correlation).
- Real-time operator visibility and manual approval options.
- Least-privilege agent access to resources.
- Incident memory lookup for faster repeat resolution.

## High-Level Architecture
Lerna is organized as layers:

1) **Observation layer**
- Collects logs, traces, and metrics.
- Uses tools like OpenTelemetry, Loki, Prometheus, and Kubernetes events.

2) **Detection layer**
- Identifies meaningful incidents from telemetry and cluster events.
- Queries logs/metrics (e.g., PromQL/LogQL) to classify failures.

3) **Agents layer**
- Runs specialized agents coordinated by an orchestrator.
- Performs diagnosis, planning, execution, and validation workflows.

4) **Execution safety layer**
- Uses isolated `kind` environments as sandboxes.
- Allows testing fixes without risking production workloads.

5) **Operator interface**
- Dashboard for live cluster/agent status and decision control.
- Supports approve/deny, prompt steering, and optional autonomy.
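
The detection layer's PromQL/LogQL queries could be issued over the standard Prometheus and Loki HTTP APIs. The sketch below only builds the request URLs. The service addresses, the example queries, and the function names are assumptions for illustration; this document does not show how the detection service actually queries telemetry.

```python
from urllib.parse import urlencode

# Assumed in-cluster addresses; adjust to the real service names and ports.
PROMETHEUS_URL = "http://prometheus.observability.svc.cluster.local:9090"
LOKI_URL = "http://loki.observability.svc.cluster.local:3100"


def prometheus_instant_query_url(promql: str, base_url: str = PROMETHEUS_URL) -> str:
    """Build a URL for the Prometheus instant-query endpoint (/api/v1/query)."""
    return f"{base_url}/api/v1/query?{urlencode({'query': promql})}"


def loki_range_query_url(
    logql: str, start_ns: int, end_ns: int, base_url: str = LOKI_URL
) -> str:
    """Build a URL for the Loki range-query endpoint (/loki/api/v1/query_range).

    start_ns and end_ns are Unix timestamps in nanoseconds.
    """
    params = {"query": logql, "start": start_ns, "end": end_ns, "limit": 100}
    return f"{base_url}/loki/api/v1/query_range?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical queries; the metric and label names are illustrative only.
    print(prometheus_instant_query_url('sum(rate(http_requests_total{code=~"5.."}[5m]))'))
    print(loki_range_query_url('{namespace="lerna"} |= "error"', 0, 60_000_000_000))
```

Keeping URL construction separate from the HTTP call makes the query strings easy to unit-test without a running cluster.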

## Agent Roles (Defined in Slides)
- **Filter Agent**: validates whether an event is a real service-impacting incident.
- **Orchestrator Agent**: routes tasks and coordinates agent workflow.
- **Incident Matcher Agent**: queries Qdrant for similar historical incidents/fixes.
- **Diagnosis Agent**: analyzes logs/metrics/cluster state for root cause.
- **Planning Agent**: proposes one or more remediation plans.
- **Executor Agent**: applies candidate fixes (sandbox-first).
- **Validation Agent**: checks whether remediation succeeded.
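
The division of labour between these roles can be sketched as a plain-Python pipeline. This is a stand-in under stated assumptions, not the project's LangGraph implementation: the `Incident` type, the stub functions, and their return strings are all hypothetical and only mirror the role descriptions above.

```python
from dataclasses import dataclass, field


@dataclass
class Incident:
    summary: str
    service_impacting: bool
    notes: list[str] = field(default_factory=list)


def filter_agent(incident: Incident) -> bool:
    """Filter Agent: decide whether the event is a real service-impacting incident."""
    return incident.service_impacting


# Each stub stands in for an agent; a real agent would call an LLM and tools.
def matcher(incident: Incident) -> str:
    return "no similar past incident found"


def diagnosis(incident: Incident) -> str:
    return f"suspected root cause for '{incident.summary}'"


def planning(incident: Incident) -> str:
    return "proposed one remediation plan"


def executor(incident: Incident) -> str:
    return "applied plan in sandbox"


def validation(incident: Incident) -> str:
    return "sandbox checks passed"


PIPELINE = [
    ("matcher", matcher),
    ("diagnosis", diagnosis),
    ("planning", planning),
    ("executor", executor),
    ("validation", validation),
]


def orchestrate(incident: Incident) -> Incident:
    """Orchestrator Agent: route the incident through each role in order."""
    if not filter_agent(incident):
        incident.notes.append("filter: dropped (not service-impacting)")
        return incident
    for name, step in PIPELINE:
        incident.notes.append(f"{name}: {step(incident)}")
    return incident


if __name__ == "__main__":
    for note in orchestrate(Incident("pod CrashLoopBackOff", True)).notes:
        print(note)
```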

## Tech Stack (From Proposal)
- **Observability**: Prometheus, Grafana Loki, OpenTelemetry, Jaeger, Kubernetes API events.
- **Agent orchestration**: LangGraph.
- **LLM reasoning**: GPT-5.4 mini (proposal choice for cost/performance).
- **Cluster control interface**: MCP for standardized `kubectl` access.
- **Sandbox infrastructure**: `kind`.
- **Backend**: FastAPI, MongoDB (agent config), Qdrant (incident history), Redis (live status).
- **Frontend**: React / Next.js.
- **K8s clients**: Python/Node SDKs.

## Planned Implementation Phases
1. **Observability + Detection**: deploy test microservices in local Kubernetes (`kind`), wire telemetry and anomaly detection.
2. **Agents layer**: implement dynamically configurable specialized agents via LangGraph; enforce scoped permissions.
3. **Testing + Validation**: validate detection/remediation against failures such as pod crashes and misconfigurations.
4. **Dashboard**: build operator UX for reasoning visibility, incident history, chat controls, and fix approvals.

## Operating Principles
- Safety first: test remediation in sandbox before production changes.
- Human-in-the-loop by default: operators can review and approve actions.
- Trace correlation as primary debugging backbone.
- Role-based specialization: each agent has a narrow, explicit responsibility.
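
The first two principles combine into a simple gate: a fix must pass in the sandbox, and then either an operator approves it or autonomy has been enabled. The following is a minimal sketch of that rule; the `Decision` enum and `decide` function are hypothetical names, not part of the project's code.

```python
from enum import Enum


class Decision(Enum):
    APPLY = "apply"    # safe to apply to the real cluster
    HOLD = "hold"      # wait for an operator decision
    REJECT = "reject"  # fix failed in the sandbox


def decide(sandbox_passed: bool, operator_approved: bool, autonomous: bool = False) -> Decision:
    """Gate a remediation: sandbox result first, then human approval."""
    if not sandbox_passed:
        return Decision.REJECT
    if operator_approved or autonomous:
        return Decision.APPLY
    return Decision.HOLD


if __name__ == "__main__":
    print(decide(sandbox_passed=True, operator_approved=False))
```

Checking the sandbox result before anything else means that neither operator approval nor autonomy can push through a fix that failed validation.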
181 changes: 181 additions & 0 deletions DEPLOY.md
@@ -0,0 +1,181 @@
# Deploy Lerna on `kind-testapp`

This runbook assumes:

- `kubectl` is pointed at the `kind-testapp` cluster
- Docker Desktop is running
- `kind` is installed

## 1. Verify cluster context

```powershell
kubectl config current-context
kind get clusters
```

Expected:

- context: `kind-testapp`
- cluster: `testapp`

> If Docker is not running, `kind get clusters` will fail with a Docker pipe error. Start Docker Desktop / Docker daemon before continuing.
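
`kind` names the kubeconfig context `kind-<cluster name>`, which is why cluster `testapp` maps to context `kind-testapp`. A small pre-flight check along these lines could guard a deploy script; this is a sketch, and the helper names are illustrative rather than anything shipped in this repository.

```python
import subprocess


def expected_context(cluster_name: str) -> str:
    """kind prefixes the cluster name with 'kind-' for its kubeconfig context."""
    return f"kind-{cluster_name}"


def context_matches(current_context: str, cluster_name: str = "testapp") -> bool:
    """Compare the active context (as printed by kubectl) with the expected one."""
    return current_context.strip() == expected_context(cluster_name)


def current_context() -> str:
    """Read the active context; requires kubectl on PATH."""
    result = subprocess.run(
        ["kubectl", "config", "current-context"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


if __name__ == "__main__":
    if not context_matches(current_context()):
        raise SystemExit("kubectl is not pointed at kind-testapp")
```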

## 2. Build local images

Run these from the repo root:

```powershell
docker build -f backend/Dockerfile -t lerna-backend:latest .
docker build -f agents-layer/Dockerfile -t lerna-agents:latest .
docker build -f detection-service/Dockerfile -t lerna-detection:latest .
docker build -f dashboard/Dockerfile -t lerna-dashboard:latest dashboard
```

## 3. Load images into the `kind` cluster

```powershell
kind load docker-image lerna-backend:latest lerna-agents:latest lerna-detection:latest lerna-dashboard:latest --name testapp
```
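
Steps 2 and 3 can also be driven from a single image table so the build and load lists cannot drift apart. The sketch below only assembles the argument lists shown above; the `IMAGES` mapping and function names are illustrative and are not taken from `scripts/deploy.ps1`, whose contents are not shown here.

```python
# Image name -> (Dockerfile path, build context), as used in step 2.
IMAGES = {
    "lerna-backend": ("backend/Dockerfile", "."),
    "lerna-agents": ("agents-layer/Dockerfile", "."),
    "lerna-detection": ("detection-service/Dockerfile", "."),
    "lerna-dashboard": ("dashboard/Dockerfile", "dashboard"),
}


def build_commands(tag: str = "latest") -> list[list[str]]:
    """One `docker build` argument list per image."""
    return [
        ["docker", "build", "-f", dockerfile, "-t", f"{name}:{tag}", context]
        for name, (dockerfile, context) in IMAGES.items()
    ]


def load_command(cluster: str = "testapp", tag: str = "latest") -> list[str]:
    """A single `kind load docker-image` call covering every image."""
    return ["kind", "load", "docker-image", *(f"{name}:{tag}" for name in IMAGES), "--name", cluster]


if __name__ == "__main__":
    for cmd in [*build_commands(), load_command()]:
        print(" ".join(cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` from the repo root to execute it.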

## 4. Deploy observability layer

```powershell
kubectl apply -f observation-layer/k8s/namespace.yaml
kubectl apply -f observation-layer/k8s/loki-configmap.yaml
kubectl apply -f observation-layer/k8s/loki-deployment.yaml
kubectl apply -f observation-layer/k8s/jaeger-deployment.yaml
kubectl apply -f observation-layer/k8s/prometheus-configmap.yaml
kubectl apply -f observation-layer/k8s/prometheus-deployment.yaml
kubectl apply -f observation-layer/k8s/otel-collector-configmap.yaml
kubectl apply -f observation-layer/k8s/otel-collector-rbac.yaml
kubectl apply -f observation-layer/k8s/otel-collector-deployment.yaml
```

If the collector image tag in the manifest fails, pin it to the known working version:

```powershell
kubectl set image deployment/otel-collector -n observability otel-collector=otel/opentelemetry-collector-contrib:0.113.0
```

## 5. Deploy app namespaces and services

```powershell
kubectl apply -f k8s/namespace-lerna.yaml
kubectl apply -f k8s/redis-deployment.yaml
kubectl apply -f backend/k8s/backend-rbac.yaml
kubectl apply -f backend/k8s/backend-deployment.yaml
kubectl apply -f agents-layer/k8s/agents-deployment.yaml
kubectl apply -f detection-service/k8s/detection-deployment.yaml
kubectl apply -f dashboard/k8s/dashboard-deployment.yaml
kubectl apply -f k8s/lerna-ingress.yaml
```

## 6. Optional: deploy the demo failure microservices

```powershell
kubectl apply -f k8s/detection-demo-errors.yaml
```

These pods are intentionally unhealthy and are meant to exercise detection.

## 6b. Route TestApp telemetry to the observation collector

If TestApp services are running in `default`, patch them so traces/metrics/logs export to the observation-layer OpenTelemetry Collector:

```powershell
.\scripts\patch-testapp-observability.ps1
```

Linux/macOS:

```bash
chmod +x scripts/patch-testapp-observability.sh
./scripts/patch-testapp-observability.sh
```

Optional overrides:

- `TESTAPP_NAMESPACE` (default: `default`)
- `OTEL_COLLECTOR_ENDPOINT` (default: `http://otel-collector.observability.svc.cluster.local:4318`)
- `OTEL_COLLECTOR_PROTOCOL` (default: `http/protobuf`)
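
The override behaviour amounts to "use the environment variable if set, otherwise the default". The patch scripts themselves are not shown here, so the following is a sketch of that lookup using the documented names and defaults; `resolve_settings` is a hypothetical helper, and treating an empty value as unset is an assumption.

```python
import os
from typing import Mapping, Optional

# Defaults as documented in the list above.
DEFAULTS = {
    "TESTAPP_NAMESPACE": "default",
    "OTEL_COLLECTOR_ENDPOINT": "http://otel-collector.observability.svc.cluster.local:4318",
    "OTEL_COLLECTOR_PROTOCOL": "http/protobuf",
}


def resolve_settings(env: Optional[Mapping[str, str]] = None) -> dict[str, str]:
    """Return each setting from `env` (default: os.environ), falling back to DEFAULTS.

    An empty string is treated the same as an unset variable.
    """
    source = os.environ if env is None else env
    return {name: source.get(name) or default for name, default in DEFAULTS.items()}


if __name__ == "__main__":
    for name, value in resolve_settings().items():
        print(f"{name}={value}")
```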

## 7. Check rollout status

```powershell
kubectl rollout status deployment/loki -n observability --timeout=120s
kubectl rollout status deployment/prometheus -n observability --timeout=120s
kubectl rollout status deployment/jaeger -n observability --timeout=120s
kubectl rollout status deployment/otel-collector -n observability --timeout=120s

kubectl rollout status deployment/redis -n lerna --timeout=120s
kubectl rollout status deployment/lerna-backend -n lerna --timeout=120s
kubectl rollout status deployment/lerna-agents -n lerna --timeout=120s
kubectl rollout status deployment/lerna-detection -n lerna --timeout=120s
kubectl rollout status deployment/lerna-dashboard -n lerna --timeout=120s
```
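
The same checks can be generated from a namespace-to-deployment table and summarized in one pass. This is a minimal sketch, assuming `kubectl` is on `PATH`; the `DEPLOYMENTS` table mirrors the commands above, while the function names and the injectable `runner` parameter are illustrative.

```python
import subprocess
from typing import Callable

# Namespace -> deployments, mirroring the rollout commands in this step.
DEPLOYMENTS = {
    "observability": ["loki", "prometheus", "jaeger", "otel-collector"],
    "lerna": ["redis", "lerna-backend", "lerna-agents", "lerna-detection", "lerna-dashboard"],
}


def rollout_status_commands(timeout: str = "120s") -> list[tuple[str, list[str]]]:
    """Return (label, argv) pairs, one per deployment."""
    commands = []
    for namespace, names in DEPLOYMENTS.items():
        for name in names:
            argv = ["kubectl", "rollout", "status", f"deployment/{name}",
                    "-n", namespace, f"--timeout={timeout}"]
            commands.append((f"{namespace}/{name}", argv))
    return commands


def _run(argv: list[str]) -> int:
    """Run a command and return its exit code."""
    return subprocess.run(argv).returncode


def failed_rollouts(runner: Callable[[list[str]], int] = _run) -> list[str]:
    """Labels of deployments whose rollout check exited non-zero."""
    return [label for label, argv in rollout_status_commands() if runner(argv) != 0]


if __name__ == "__main__":
    failures = failed_rollouts()
    print("all rollouts complete" if not failures else f"not ready: {failures}")
```

Passing a fake `runner` lets the logic be tested without a cluster.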

If `kubectl rollout status` fails or crashes, use the safer fallback:

```powershell
kubectl get pods -n observability -o wide
kubectl get pods -n lerna -o wide
kubectl get svc -n lerna
kubectl get ingress -n lerna
```

## 8. Inspect running workloads

```powershell
kubectl get pods -n observability -o wide
kubectl get pods -n lerna -o wide
kubectl get svc -n lerna
kubectl get ingress -n lerna
```

## 9. Restart after rebuilding images

If you rebuild images later, reload them into `kind` and restart the deployments:

```powershell
kind load docker-image lerna-backend:latest lerna-agents:latest lerna-detection:latest lerna-dashboard:latest --name testapp

kubectl rollout restart deployment/lerna-backend -n lerna
kubectl rollout restart deployment/lerna-agents -n lerna
kubectl rollout restart deployment/lerna-detection -n lerna
kubectl rollout restart deployment/lerna-dashboard -n lerna
```

## 10. Useful cleanup commands

Delete only the demo failure workloads:

```powershell
kubectl delete -f k8s/detection-demo-errors.yaml
```

Delete the Lerna app stack:

```powershell
kubectl delete -f k8s/lerna-ingress.yaml
kubectl delete -f dashboard/k8s/dashboard-deployment.yaml
kubectl delete -f detection-service/k8s/detection-deployment.yaml
kubectl delete -f agents-layer/k8s/agents-deployment.yaml
kubectl delete -f backend/k8s/backend-deployment.yaml
kubectl delete -f backend/k8s/backend-rbac.yaml
kubectl delete -f k8s/redis-deployment.yaml
kubectl delete -f k8s/namespace-lerna.yaml
```

Delete the observability stack:

```powershell
kubectl delete -f observation-layer/k8s/otel-collector-deployment.yaml
kubectl delete -f observation-layer/k8s/otel-collector-rbac.yaml
kubectl delete -f observation-layer/k8s/otel-collector-configmap.yaml
kubectl delete -f observation-layer/k8s/prometheus-deployment.yaml
kubectl delete -f observation-layer/k8s/prometheus-configmap.yaml
kubectl delete -f observation-layer/k8s/jaeger-deployment.yaml
kubectl delete -f observation-layer/k8s/loki-deployment.yaml
kubectl delete -f observation-layer/k8s/loki-configmap.yaml
kubectl delete -f observation-layer/k8s/namespace.yaml
```
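
Both teardown lists are the apply lists from steps 4 and 5 in reverse, so dependent resources are removed before the namespace that holds them. A sketch that derives the delete commands from the apply order follows; the constant and function names are illustrative.

```python
# Apply order from step 5 of this runbook.
APP_APPLY_ORDER = [
    "k8s/namespace-lerna.yaml",
    "k8s/redis-deployment.yaml",
    "backend/k8s/backend-rbac.yaml",
    "backend/k8s/backend-deployment.yaml",
    "agents-layer/k8s/agents-deployment.yaml",
    "detection-service/k8s/detection-deployment.yaml",
    "dashboard/k8s/dashboard-deployment.yaml",
    "k8s/lerna-ingress.yaml",
]


def apply_commands(manifests: list[str]) -> list[list[str]]:
    """`kubectl apply` argument lists in the given order."""
    return [["kubectl", "apply", "-f", path] for path in manifests]


def delete_commands(manifests: list[str]) -> list[list[str]]:
    """`kubectl delete` argument lists in reverse apply order."""
    return [["kubectl", "delete", "-f", path] for path in reversed(manifests)]


if __name__ == "__main__":
    for cmd in delete_commands(APP_APPLY_ORDER):
        print(" ".join(cmd))
```

The same two functions work for the observability stack when given the manifest list from step 4.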