16 changes: 16 additions & 0 deletions .gitignore
@@ -0,0 +1,16 @@
# Python
__pycache__/
*.pyc

# Virtual env
.venv/
venv/

# Env files
.env

# OS
.DS_Store

# Zip files
*.zip
9 changes: 9 additions & 0 deletions Dockerfile
@@ -0,0 +1,9 @@
FROM python:3.11-slim

WORKDIR /app

COPY . .

RUN pip install --no-cache-dir -r requirements.txt

CMD ["uvicorn", "backend.main:app", "--host", "0.0.0.0", "--port", "8000"]
112 changes: 66 additions & 46 deletions README.md
@@ -1,86 +1,106 @@
# HackToFuture 4.0 — Decision-Driven Autonomous Recovery for Kubernetes Systems

---

## Problem Statement / Idea

Modern cloud applications run on Kubernetes as a set of interconnected microservices. When something fails, Kubernetes can restart containers, but it does not understand the root cause of the problem.
Because of this:

* Failures can spread across services
* Systems can experience downtime quickly
* Engineers must manually analyze logs and metrics

This manual process is slow and does not scale well for large systems.

This problem mainly affects:

* Site Reliability Engineers (SREs)
* DevOps teams
* Developers managing cloud-native applications

---

## Proposed Solution

We built an Autonomous Recovery System that monitors system signals, detects issues, analyzes them, and suggests recovery actions.

### How it works:
1. Telemetry Collection
The system collects signals such as CPU usage, memory usage, restart count, latency, and error rate.

2. Anomaly Detection
A rule-based detection system checks if the signals cross defined thresholds.

3. AI-Based Analysis
Gemini analyzes the detected anomaly and provides:

* Root Cause
* Recommended Action

4. Recovery Suggestion
The system suggests actions like restarting a pod or scaling a deployment.
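The four steps above can be sketched in a few lines of plain Python. The thresholds mirror `anomaly_engine/rule_detector.py`; the signal values and the `suggest_action` stand-in for the Gemini analysis step are illustrative only, not the real implementation.

```python
# Minimal sketch of the pipeline: telemetry -> rule check -> suggested action.
# Thresholds mirror anomaly_engine/rule_detector.py; the signal values and the
# suggest_action stand-in for the AI analysis step are illustrative only.

def detect_anomaly(signals):
    # Flag an anomaly when any signal crosses its threshold
    return (signals["cpu"] > 85 or signals["memory"] > 85
            or signals["restarts"] > 2 or signals["latency"] > 1000)

def suggest_action(signals):
    if not detect_anomaly(signals):
        return "none"
    # Crude stand-in for the AI-based analysis step
    return "scale" if signals["cpu"] > 85 else "restart"

signals = {"cpu": 92, "memory": 60, "restarts": 1, "latency": 300}
print(suggest_action(signals))  # -> scale
```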

### What makes it different

Most systems only monitor and alert.
Ours also explains the likely root cause and suggests a concrete recovery action, reducing manual effort.

---

## Features

* Real-time telemetry collection
* Rule-based anomaly detection
* AI-based root cause analysis
* Recovery action suggestions
* Monitoring using Prometheus and Grafana
* Docker-based deployment

---

## Tech Stack

* Frontend: Streamlit
* Backend: FastAPI
* Monitoring: Prometheus
* Observability: OpenTelemetry
* Infrastructure: Docker
* Database: Redis
* AI: Gemini API

---

## Project Setup Instructions


```bash
# Clone the repository
git clone https://github.com/NehaRaii029/hacktofuture4-A08

# Go into the project folder
cd hacktofuture4-A08

# Build and start all services
docker-compose up -d --build
```

### Access the services

* Backend API: [http://localhost:8000/docs](http://localhost:8000/docs)
* Grafana Dashboard: [http://localhost:3000](http://localhost:3000)
* Prometheus: [http://localhost:9090](http://localhost:9090)
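Once the stack is up, the pipeline can also be exercised from the command line. The paths below follow the routers defined under `backend/routes` and assume the services are reachable on localhost:

```bash
# Collect current telemetry
curl http://localhost:8000/telemetry/collect

# Run rule-based detection plus AI analysis
curl http://localhost:8000/analyze/

# Ask the system to pick and execute a recovery action
curl -X POST http://localhost:8000/recovery/execute
```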

---

## Final Note

This project improves system reliability by turning monitoring data into clear insights and actionable recovery suggestions.
Empty file added ai_engine/__init__.py
Empty file.
50 changes: 50 additions & 0 deletions ai_engine/gemini_analyzer.py
@@ -0,0 +1,50 @@
import os
from dotenv import load_dotenv

load_dotenv()

api_key = os.getenv("GOOGLE_API_KEY")

# Import and configure Gemini only if an API key is present
if api_key:
    import google.generativeai as genai
    genai.configure(api_key=api_key)
    model = genai.GenerativeModel("models/gemini-1.5-flash-latest")
else:
    model = None


def analyze_incident(signals):
    # No API key -> return a rule-based fallback analysis (VERY IMPORTANT)
    if model is None:
        return f"""
Root Cause: High resource usage detected
Recommended Action: Restart pod or scale deployment

Details:
CPU={signals['cpu']}%
Memory={signals['memory']}%
Restarts={signals['restarts']}
Latency={signals['latency']}ms
Error Rate={signals['error_rate']}
"""

    # API key present -> ask Gemini for a root cause and recommended action
    prompt = f"""
Analyze Kubernetes anomaly:
CPU={signals['cpu']}%
Memory={signals['memory']}%
Restarts={signals['restarts']}
Latency={signals['latency']}ms
Error Rate={signals['error_rate']}

Return:
Root Cause:
Recommended Action:
"""

    try:
        response = model.generate_content(prompt)
        return response.text
    except Exception:
        return "AI analysis failed. Using fallback recovery."
Empty file added anomaly_engine/__init__.py
Empty file.
30 changes: 30 additions & 0 deletions anomaly_engine/rule_detector.py
@@ -0,0 +1,30 @@
def detect_anomaly(signals):
    """Rule-based detection: flag an anomaly when any signal crosses its threshold."""
    cpu = signals.get("cpu", 0)
    memory = signals.get("memory", 0)
    restarts = signals.get("restarts", 0)
    latency = signals.get("latency", 0)

    if cpu > 85:
        return True
    if memory > 85:
        return True
    if restarts > 2:
        return True
    if latency > 1000:
        return True

    return False

# For testing: force an anomaly
# def detect_anomaly(signals):
#     return True
Empty file added backend/__init__.py
Empty file.
12 changes: 12 additions & 0 deletions backend/main.py
@@ -0,0 +1,12 @@
from fastapi import FastAPI
from backend.routes import telemetry, analyze, recovery

app = FastAPI(title="Autonomous Recovery System")

app.include_router(telemetry.router)
app.include_router(analyze.router)
app.include_router(recovery.router)


@app.get("/")
def root():
    return {"message": "Decision-Driven Autonomous Recovery API Running"}
Empty file added backend/routes/__init__.py
Empty file.
20 changes: 20 additions & 0 deletions backend/routes/analyze.py
@@ -0,0 +1,20 @@
from fastapi import APIRouter
from telemetry.aggregator import collect_signals
from anomaly_engine.rule_detector import detect_anomaly
from ai_engine.gemini_analyzer import analyze_incident

router = APIRouter(prefix="/analyze", tags=["Analyze"])


@router.get("/")
def analyze():
    signals = collect_signals()

    if not detect_anomaly(signals):
        return {"status": "Normal"}

    gemini_result = analyze_incident(signals)

    return {
        "status": "Anomaly Detected",
        "gemini_analysis": gemini_result
    }
28 changes: 28 additions & 0 deletions backend/routes/recovery.py
@@ -0,0 +1,28 @@
from fastapi import APIRouter
from telemetry.aggregator import collect_signals
from ai_engine.gemini_analyzer import analyze_incident
from recovery_engine.executor import execute_recovery

router = APIRouter(prefix="/recovery", tags=["Recovery"])


@router.post("/execute")
def recover():
    signals = collect_signals()
    analysis = analyze_incident(signals)

    # Pick a recovery action based on keywords in the analysis text
    if "scale" in analysis.lower():
        action = "scale"
    elif "rollback" in analysis.lower():
        action = "rollback"
    elif "isolate" in analysis.lower():
        action = "isolate"
    else:
        action = "restart"

    result = execute_recovery(action)

    return {
        "analysis": analysis,
        "selected_action": action,
        "execution_result": result
    }
14 changes: 14 additions & 0 deletions backend/routes/telemetry.py
@@ -0,0 +1,14 @@
# What it does:
# Receives telemetry snapshots.

# PPT Module:
# Telemetry Collection
from fastapi import APIRouter
from telemetry.aggregator import collect_signals

router = APIRouter(prefix="/telemetry", tags=["Telemetry"])


@router.get("/collect")
def collect():
    data = collect_signals()
    return {"telemetry": data}
Empty file added configs/__init__.py
Empty file.
5 changes: 5 additions & 0 deletions configs/settings.py
@@ -0,0 +1,5 @@
# Stores shared configuration constants.
PROMETHEUS_URL = "http://localhost:9090"
MODEL_PATH = "ml_engine/models/isolation_forest.pkl"
Empty file added dashboard/__init__.py
Empty file.
29 changes: 29 additions & 0 deletions dashboard/app.py
@@ -0,0 +1,29 @@
# What it does:
# Shows live metrics, anomaly alerts, RCA, and recovery logs.

# PPT Module:
# Real-time System View
import streamlit as st
import requests

API = "http://localhost:8000"

st.title("Autonomous Recovery Dashboard")

# The backend exposes /telemetry/collect and /analyze/ (see backend/routes);
# the /analyze endpoint returns both detection status and the RCA text.
telemetry = requests.get(f"{API}/telemetry/collect").json()
analysis = requests.get(f"{API}/analyze/").json()

st.subheader("Live Metrics")
st.json(telemetry)

st.subheader("Anomaly Detection & Root Cause Analysis")
st.json(analysis)

if st.button("Trigger Recovery"):
    recovery = requests.post(f"{API}/recovery/execute").json()
    st.subheader("Recovery Result")
    st.json(recovery)