16 changes: 16 additions & 0 deletions .gitignore
@@ -0,0 +1,16 @@
# Python
__pycache__/
*.pyc

# Virtual env
.venv/
venv/

# Env files
.env

# OS
.DS_Store

# Zip files
*.zip
9 changes: 9 additions & 0 deletions Dockerfile
@@ -0,0 +1,9 @@
FROM python:3.11-slim

WORKDIR /app

COPY . .

RUN pip install --no-cache-dir -r requirements.txt

CMD ["uvicorn", "backend.main:app", "--host", "0.0.0.0", "--port", "8000"]
112 changes: 66 additions & 46 deletions README.md
@@ -1,86 +1,106 @@
# HackToFuture 4.0 — Decision-Driven Autonomous Recovery for Kubernetes Systems

---

## Problem Statement / Idea

Modern cloud applications run on Kubernetes as a set of interconnected microservices. When something fails, Kubernetes can restart containers, but it does not understand the root cause of the problem.
Because of this:

* Failures can spread across services
* Systems can experience downtime quickly
* Engineers must manually analyze logs and metrics

This manual process is slow and does not scale well for large systems.

This problem mainly affects:

* Site Reliability Engineers (SREs)
* DevOps teams
* Developers managing cloud-native applications

---

## Proposed Solution

We built an Autonomous Recovery System that monitors system signals, detects issues, analyzes them, and suggests recovery actions.

### How it works:
1. Telemetry Collection
The system collects signals such as CPU usage, memory usage, restart count, latency, and error rate.

2. Anomaly Detection
A rule-based detection system checks if the signals cross defined thresholds.

3. AI-Based Analysis
Gemini analyzes the detected anomaly and provides:

* Root Cause
* Recommended Action

4. Recovery Suggestion
The system suggests actions like restarting a pod or scaling a deployment.
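The four steps above can be sketched in a few lines of plain Python. The thresholds mirror `anomaly_engine/rule_detector.py`; the signal values and the `suggest_action` stand-in for the Gemini analysis step are illustrative only, not the real implementation.

```python
# Minimal sketch of the pipeline: telemetry -> rule check -> suggested action.
# Thresholds mirror anomaly_engine/rule_detector.py; the signal values and the
# suggest_action stand-in for the AI analysis step are illustrative only.

def detect_anomaly(signals):
    # Flag an anomaly when any signal crosses its threshold
    return (signals["cpu"] > 85 or signals["memory"] > 85
            or signals["restarts"] > 2 or signals["latency"] > 1000)

def suggest_action(signals):
    if not detect_anomaly(signals):
        return "none"
    # Crude stand-in for the AI-based analysis step
    return "scale" if signals["cpu"] > 85 else "restart"

signals = {"cpu": 92, "memory": 60, "restarts": 1, "latency": 300}
print(suggest_action(signals))  # -> scale
```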

### What makes it different

Most systems only monitor and alert.
Ours also explains the likely root cause and suggests a concrete recovery action, reducing manual effort.

---

## Features

* Real-time telemetry collection
* Rule-based anomaly detection
* AI-based root cause analysis
* Recovery action suggestions
* Monitoring using Prometheus and Grafana
* Docker-based deployment

---

## Tech Stack

* Frontend: Streamlit
* Backend: FastAPI
* Monitoring: Prometheus
* Observability: OpenTelemetry
* Infrastructure: Docker
* Database: Redis
* AI: Gemini API

---

## Project Setup Instructions


```bash
# Clone the repository
git clone https://github.com/NehaRaii029/hacktofuture4-A08

# Go into the project folder
cd hacktofuture4-A08

# Build and start all services
docker-compose up -d --build
```

### Access the services

* Backend API: [http://localhost:8000/docs](http://localhost:8000/docs)
* Grafana Dashboard: [http://localhost:3000](http://localhost:3000)
* Prometheus: [http://localhost:9090](http://localhost:9090)
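Once the stack is up, the pipeline can also be exercised from the command line. The paths below follow the routers defined under `backend/routes` and assume the services are reachable on localhost:

```bash
# Collect current telemetry
curl http://localhost:8000/telemetry/collect

# Run rule-based detection plus AI analysis
curl http://localhost:8000/analyze/

# Ask the system to pick and execute a recovery action
curl -X POST http://localhost:8000/recovery/execute
```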

---

## Final Note

This project improves system reliability by turning monitoring data into clear insights and actionable recovery suggestions.
Empty file added ai_engine/__init__.py
Empty file.
50 changes: 50 additions & 0 deletions ai_engine/gemini_analyzer.py
@@ -0,0 +1,50 @@
import os
from dotenv import load_dotenv

load_dotenv()

api_key = os.getenv("GOOGLE_API_KEY")

# Import and configure Gemini only if an API key is present
if api_key:
    import google.generativeai as genai
    genai.configure(api_key=api_key)
    model = genai.GenerativeModel("models/gemini-1.5-flash-latest")
else:
    model = None


def analyze_incident(signals):
    # No API key -> return a rule-based fallback analysis (VERY IMPORTANT)
    if model is None:
        return f"""
Root Cause: High resource usage detected
Recommended Action: Restart pod or scale deployment

Details:
CPU={signals['cpu']}%
Memory={signals['memory']}%
Restarts={signals['restarts']}
Latency={signals['latency']}ms
Error Rate={signals['error_rate']}
"""

    # API key present -> ask Gemini for a root cause and recommended action
    prompt = f"""
Analyze Kubernetes anomaly:
CPU={signals['cpu']}%
Memory={signals['memory']}%
Restarts={signals['restarts']}
Latency={signals['latency']}ms
Error Rate={signals['error_rate']}

Return:
Root Cause:
Recommended Action:
"""

    try:
        response = model.generate_content(prompt)
        return response.text
    except Exception:
        return "AI analysis failed. Using fallback recovery."
Empty file added anomaly_engine/__init__.py
Empty file.
30 changes: 30 additions & 0 deletions anomaly_engine/rule_detector.py
@@ -0,0 +1,30 @@
def detect_anomaly(signals):
    """Rule-based detection: flag an anomaly when any signal crosses its threshold."""
    cpu = signals.get("cpu", 0)
    memory = signals.get("memory", 0)
    restarts = signals.get("restarts", 0)
    latency = signals.get("latency", 0)

    if cpu > 85:
        return True
    if memory > 85:
        return True
    if restarts > 2:
        return True
    if latency > 1000:
        return True

    return False

# For testing: force an anomaly
# def detect_anomaly(signals):
#     return True
Empty file added backend/__init__.py
Empty file.
12 changes: 12 additions & 0 deletions backend/main.py
@@ -0,0 +1,12 @@
from fastapi import FastAPI
from backend.routes import telemetry, analyze, recovery

app = FastAPI(title="Autonomous Recovery System")

app.include_router(telemetry.router)
app.include_router(analyze.router)
app.include_router(recovery.router)


@app.get("/")
def root():
    return {"message": "Decision-Driven Autonomous Recovery API Running"}
Empty file added backend/routes/__init__.py
Empty file.
20 changes: 20 additions & 0 deletions backend/routes/analyze.py
@@ -0,0 +1,20 @@
from fastapi import APIRouter
from telemetry.aggregator import collect_signals
from anomaly_engine.rule_detector import detect_anomaly
from ai_engine.gemini_analyzer import analyze_incident

router = APIRouter(prefix="/analyze", tags=["Analyze"])


@router.get("/")
def analyze():
    signals = collect_signals()

    if not detect_anomaly(signals):
        return {"status": "Normal"}

    gemini_result = analyze_incident(signals)

    return {
        "status": "Anomaly Detected",
        "gemini_analysis": gemini_result
    }
28 changes: 28 additions & 0 deletions backend/routes/recovery.py
@@ -0,0 +1,28 @@
from fastapi import APIRouter
from telemetry.aggregator import collect_signals
from ai_engine.gemini_analyzer import analyze_incident
from recovery_engine.executor import execute_recovery

router = APIRouter(prefix="/recovery", tags=["Recovery"])


@router.post("/execute")
def recover():
    signals = collect_signals()
    analysis = analyze_incident(signals)

    # Pick a recovery action based on keywords in the analysis text
    if "scale" in analysis.lower():
        action = "scale"
    elif "rollback" in analysis.lower():
        action = "rollback"
    elif "isolate" in analysis.lower():
        action = "isolate"
    else:
        action = "restart"

    result = execute_recovery(action)

    return {
        "analysis": analysis,
        "selected_action": action,
        "execution_result": result
    }
14 changes: 14 additions & 0 deletions backend/routes/telemetry.py
@@ -0,0 +1,14 @@
# What it does:
# Receives telemetry snapshots.

# PPT Module:
# Telemetry Collection
from fastapi import APIRouter
from telemetry.aggregator import collect_signals

router = APIRouter(prefix="/telemetry", tags=["Telemetry"])


@router.get("/collect")
def collect():
    data = collect_signals()
    return {"telemetry": data}
Empty file added configs/__init__.py
Empty file.
5 changes: 5 additions & 0 deletions configs/settings.py
@@ -0,0 +1,5 @@
# Stores shared configuration constants.
PROMETHEUS_URL = "http://localhost:9090"
MODEL_PATH = "ml_engine/models/isolation_forest.pkl"
Empty file added dashboard/__init__.py
Empty file.
29 changes: 29 additions & 0 deletions dashboard/app.py
@@ -0,0 +1,29 @@
# What it does:
# Shows live metrics, anomaly alerts, RCA, and recovery logs.

# PPT Module:
# Real-time System View
import streamlit as st
import requests

API = "http://localhost:8000"

st.title("Autonomous Recovery Dashboard")

# The backend exposes /telemetry/collect and /analyze/ (see backend/routes);
# the /analyze endpoint returns both detection status and the RCA text.
telemetry = requests.get(f"{API}/telemetry/collect").json()
analysis = requests.get(f"{API}/analyze/").json()

st.subheader("Live Metrics")
st.json(telemetry)

st.subheader("Anomaly Detection & Root Cause Analysis")
st.json(analysis)

if st.button("Trigger Recovery"):
    recovery = requests.post(f"{API}/recovery/execute").json()
    st.subheader("Recovery Result")
    st.json(recovery)