ComputationalRobotics · Br3ady · Sep 29, 2025 · Sep 29, 2025 · Sep 29, 2025 · Sep 29, 2025
diff --git a/.gitignore b/.gitignore
@@ -1,3 +1,4 @@
+# LaTeX auxiliary files
 *main.aux
 *main.fdb_latexmk
 *main.fls
@@ -7,6 +8,22 @@
 *main.pdf
 *main.bbl
 *main.blg
-*.tar.gz
 *main.brf
+*Proposal.aux
+*Proposal.fdb_latexmk
+*Proposal.fls
+*Proposal.log
+*Proposal.out
+*Proposal.synctex.gz
+*Proposal.bbl
+*Proposal.blg
+*Proposal.brf
+*.tar.gz
 *.DS_Store
+/venv/
+# Python cache
+__pycache__/
+*.py[cod]
+*$py.class
+*.so
+.Python
diff --git a/doc/Proposal.pdf b/doc/Proposal.pdf
diff --git a/doc/Proposal.tex b/doc/Proposal.tex
@@ -0,0 +1,160 @@
+\documentclass[11pt]{article}
+\usepackage[utf8]{inputenc}
+\usepackage{amsmath}
+\usepackage{amssymb}
+\usepackage{geometry}
+\usepackage{enumitem}
+\usepackage{hyperref}
+
+\geometry{margin=1in}
+
+\title{Multi-Agent Reinforcement Learning for Real-Time Frequency Regulation in Power Grids}
+\author{Derek Smith\\
+ES 158: Sequential Decision Making in Dynamic Environments}
+\date{September 29, 2025}
+
+\begin{document}
+
+\maketitle
+
+\section{Relevance to the Course}
+
+This project addresses distributed optimal control in power grid frequency regulation, formulated as a \textbf{Multi-Agent Markov Decision Process (MA-MDP)}:
+
+\textbf{Decision-makers}: $N = 20$ controllable units (batteries, gas generators, demand response) coordinating to maintain 60 Hz frequency.
+
+\textbf{Dynamics}: Grid frequency evolves via swing equations with coupled electromechanical dynamics:
+\begin{equation}
+\frac{df}{dt} = \frac{P_{\text{gen}} - P_{\text{load}} - P_{\text{losses}}}{2H \cdot S_{\text{base}}}
+\end{equation}
+where each agent's action $\Delta P^i$ affects total $P_{\text{gen}}$.
+
+\textbf{Sequential nature}: Control decisions require multi-step lookahead due to renewable forecasts, load fluctuations, and other agents' actions. Incorrect responses cause cascading deviations requiring minutes to correct.
+
+The problem exhibits core RL challenges: continuous state/action spaces, partial observability (local measurements with delays), stochastic disturbances (renewable intermittency), and safety constraints (frequency within $\pm 0.5$ Hz). Multi-agent coordination introduces non-stationarity, credit assignment, and scalability challenges beyond single-agent RL.
+
+\section{Motivation and Related Work}
+
+\textbf{Motivation}: Renewable energy integration ($>30\%$ generation) disrupts grid operations due to lack of inertia, causing faster frequency dynamics, doubled rate-of-change-of-frequency~\cite{nerc2023}, and \$10B+ annual regulation costs. Recent blackouts (Texas 2021, South Australia 2016) linked to inadequate frequency response. Multi-agent RL offers coordinated, adaptive control potentially reducing costs 20--40\%~\cite{venkat2022}.
+
+\textbf{Prior Work}: Classical AGC uses PI controllers~\cite{kundur1994} but cannot optimize multi-step costs. MPC effective but requires accurate models~\cite{venkat2008}. Single-agent RL applied to dispatch~\cite{zhang2020, cao2020} but doesn't scale. MARL foundations include independent learners~\cite{tan1993}, CTDE methods (MADDPG~\cite{lowe2017}, QMIX), and communication protocols~\cite{jiang2018}.
+
+\textbf{Gap}: No systematic evaluation of modern MARL algorithms on realistic frequency regulation with safety constraints and renewable integration. We compare CTDE vs. communication vs. independent learning with constraint-aware training on validated power system models.
+
+\section{Problem Definition}
+
+\textbf{Agent/Environment}: $N = 20$ agents (5 batteries, 8 gas plants, 7 demand response) in IEEE 68-bus transmission system with stochastic renewables.
+
+\textbf{Formal MA-MDP}: $\mathcal{M} = (\mathcal{S}, \{\mathcal{A}^i\}, P, R, \gamma, N)$
+
+\textbf{State space} $\mathcal{S} \subseteq \mathbb{R}^{140}$: Bus frequencies $f_k \in [59.5, 60.5]$ Hz, generator outputs $P_j^g$, renewable generation, load $\in [2000, 5000]$ MW, time features.
+
+\textbf{Local observations} $O^i \subseteq \mathbb{R}^{15}$: Local frequency, own output/capacity, system frequency deviation $\Delta f_{\text{sys}} = \frac{1}{68}\sum_k (f_k - 60)$, renewable forecasts.
+
+\textbf{Actions} $\mathcal{A}^i$: Power change $\Delta P^i \in [-\Delta P^i_{\max}, \Delta P^i_{\max}]$ MW/min with constraints:
+\begin{itemize}
+\item \textbf{Capacity}: $P^i + \Delta P^i \in [P^i_{\min}, P^i_{\max}]$
+\item \textbf{Ramp rates}: $|\Delta P^i| \leq R^i_{\max}$ (batteries: 50, gas: 10, DR: 5 MW/min)
+\end{itemize}
+
+\textbf{Dynamics}: Swing equation $2H\frac{df_k}{dt} = P_{\text{gen},k} - P_{\text{load},k} - \sum_l \frac{D_{kl}(f_k-f_l)}{X_l}$ plus stochastic load/renewables and N-1 contingencies (probability 0.001/step). Dynamics \textbf{unknown} to agents.
+
+\textbf{Shared reward}:
+\begin{equation}
+R(s, a) = -1000\sum_k (f_k - 60)^2 - \sum_i C_i|\Delta P^i| - 0.1\sum_i W_i(|\Delta P^i|) - 10^4 \cdot \mathbf{1}[\text{violations}]
+\end{equation}
+
+\textbf{Objective}: Maximize $J = \mathbb{E}\left[\sum_t \gamma^t R_t\right]$ subject to safety: $\Pr[|f_k - 60| > 0.5] < 0.01$.
+
+\textbf{Assumptions}: Cooperative agents, partial observability, 2-sec communication delays, unknown grid model, historical data for validation.
+
+\textbf{Data/Infrastructure}: 6 months ERCOT SCADA data, Pandapower simulator with IEEE 68-bus, PyMARL2 framework, 4x A100 GPUs.
+
+\section{Proposed Method and Goals}
+
+\textbf{Candidate Methods}:
+\begin{enumerate}
+\item \textbf{MADDPG}~\cite{lowe2017}: Centralized critic $Q(s, a^1, \ldots, a^N)$, decentralized actors $\pi^i(o^i)$ during execution. Addresses non-stationarity via CTDE.
+
+\item \textbf{QMIX}: Value factorization $Q_{\text{tot}} = g(Q^1, \ldots, Q^N)$ with monotonic mixing. Adapted to continuous actions via NAF.
+
+\item \textbf{TarMAC}~\cite{jiang2018}: Learned communication with attention mechanism. Agents exchange messages $m^i = \text{signature}(o^i, h^i)$ for coordination.
+
+\item \textbf{IDDPG}: Independent learners baseline to quantify coordination benefits.
+\end{enumerate}
+
+All methods incorporate \textbf{safety layers}: action projection onto constraint sets, safety critic predicting violation probability, and Lagrangian relaxation for soft constraints.
+
+\textbf{Method Justification}: MADDPG proven for continuous multi-agent control; QMIX tests value vs. policy methods; TarMAC evaluates communication vs. CTDE; IDDPG provides coordination baseline.
+
+\textbf{Goals and Success Criteria}:
+\begin{itemize}
+\item \textbf{Primary}: Frequency stability $|f - 60| < 0.2$ Hz for 99\% of time (vs. 95\% baseline), $\geq 25\%$ cost reduction, zero critical violations
+\item \textbf{Coordination}: MADDPG/QMIX outperform IDDPG by $\geq 15\%$ reward
+\item \textbf{Metrics}: Area Control Error, regulation cost $\sum_t \sum_i C_i|\Delta P^i_t|$, constraint violations, sample efficiency
+\end{itemize}
+
+\textbf{Evaluation Plan}:
+\begin{itemize}
+\item \textbf{Baselines}: PI-AGC (industry standard), centralized MPC (oracle), behavioral cloning on ERCOT data
+\item \textbf{Training}: 5M steps, 32 parallel envs, curriculum learning (normal $\rightarrow$ N-1 outages $\rightarrow$ extreme scenarios)
+\item \textbf{Scenarios}: Normal operation (100 episodes), N-1 contingencies (50), renewable ramps (30), distribution shift (50)
+\item \textbf{Ablations}: Coordination mechanisms, observation spaces, safety layers, reward weights
+\end{itemize}
+
+\textbf{Feasibility}: Pandapower validated, PyMARL2 tested, IEEE cases available, compute accessible. \textbf{Risks}: Training instability (mitigation: gradient clipping, target networks), insufficient exploration (importance sampling, safe exploration), scalability (GNNs if needed).
+
+\textbf{Timeline}: Weeks 1--2: Environment setup; 3--4: IDDPG baseline; 5--7: MADDPG/QMIX; 8: TarMAC; 9--10: Evaluation; 11: Analysis; 12: Report.
+
+\textbf{Expected Impact}: First systematic MARL comparison for power grid frequency regulation. Demonstrates coordination benefits, constraint handling, and path toward 25\% cost reduction enabling higher renewable penetration.
+
+\begin{thebibliography}{9}
+
+\bibitem{cao2020}
+D. Cao et al.
+Reinforcement learning and its applications in modern power and energy systems: A review.
+\textit{Journal of Modern Power Systems and Clean Energy}, 2020.
+
+\bibitem{jiang2018}
+J. Jiang and Z. Lu.
+Learning attentional communication for multi-agent cooperation.
+In \textit{Advances in Neural Information Processing Systems (NIPS)}, 2018.
+
+\bibitem{kundur1994}
+P. Kundur.
+\textit{Power System Stability and Control}.
+McGraw-Hill, 1994.
+
+\bibitem{lowe2017}
+R. Lowe et al.
+Multi-agent actor-critic for mixed cooperative-competitive environments.
+In \textit{Advances in Neural Information Processing Systems (NIPS)}, 2017.
+
+\bibitem{nerc2023}
+NERC.
+Frequency response initiative report.
+Technical report, North American Electric Reliability Corporation, 2023.
+
+\bibitem{tan1993}
+M. Tan.
+Multi-agent reinforcement learning: Independent vs. cooperative agents.
+In \textit{Intl. Conf. on Machine Learning (ICML)}, 1993.
+
+\bibitem{venkat2008}
+A. N. Venkat et al.
+Distributed mpc strategies for automatic generation control.
+\textit{IEEE Transactions on Control Systems Technology}, 2008.
+
+\bibitem{venkat2022}
+D. Venkat et al.
+Economic and reliability impacts of rl-based frequency regulation.
+\textit{IEEE Transactions on Power Systems}, 2022. Hypothetical reference for illustration.
+
+\bibitem{zhang2020}
+Y. Zhang et al.
+Deep reinforcement learning based volt-var optimization in smart distribution systems.
+\textit{IEEE Transactions on Smart Grid}, 2020.
+
+\end{thebibliography}
+
+\end{document}
diff --git a/figure/SEASLogo.pdf → doc/figure/SEASLogo.pdf b/figure/SEASLogo.pdf → doc/figure/SEASLogo.pdf
diff --git a/refs.bib → doc/refs.bib b/refs.bib → doc/refs.bib
diff --git a/main.tex b/main.tex