Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
19 changes: 18 additions & 1 deletion .gitignore
Original file line number Diff line number Diff line change
@@ -1,3 +1,4 @@
# LaTeX auxiliary files
*main.aux
*main.fdb_latexmk
*main.fls
Expand All @@ -7,6 +8,22 @@
*main.pdf
*main.bbl
*main.blg
*.tar.gz
*main.brf
*Proposal.aux
*Proposal.fdb_latexmk
*Proposal.fls
*Proposal.log
*Proposal.out
*Proposal.synctex.gz
*Proposal.bbl
*Proposal.blg
*Proposal.brf
*.tar.gz
*.DS_Store
/venv/
# Python cache
__pycache__/
*.py[cod]
*$py.class
*.so
.Python
Binary file added doc/Proposal.pdf
Binary file not shown.
160 changes: 160 additions & 0 deletions doc/Proposal.tex
Original file line number Diff line number Diff line change
@@ -0,0 +1,160 @@
\documentclass[11pt]{article}
\usepackage[utf8]{inputenc}
\usepackage{amsmath}
\usepackage{amssymb}
\usepackage{geometry}
\usepackage{enumitem}
\usepackage{hyperref}

\geometry{margin=1in}

\title{Multi-Agent Reinforcement Learning for Real-Time Frequency Regulation in Power Grids}
\author{Derek Smith\\
ES 158: Sequential Decision Making in Dynamic Environments}
\date{September 29, 2025}

\begin{document}

\maketitle

\section{Relevance to the Course}

This project addresses distributed optimal control in power grid frequency regulation, formulated as a \textbf{Multi-Agent Markov Decision Process (MA-MDP)}:

\textbf{Decision-makers}: $N = 20$ controllable units (batteries, gas generators, demand response) coordinating to maintain 60 Hz frequency.

\textbf{Dynamics}: Grid frequency evolves via swing equations with coupled electromechanical dynamics:
\begin{equation}
\frac{df}{dt} = \frac{P_{\text{gen}} - P_{\text{load}} - P_{\text{losses}}}{2H \cdot S_{\text{base}}}
\end{equation}
where each agent's action $\Delta P^i$ affects total $P_{\text{gen}}$.

\textbf{Sequential nature}: Control decisions require multi-step lookahead due to renewable forecasts, load fluctuations, and other agents' actions. Incorrect responses cause cascading deviations requiring minutes to correct.

The problem exhibits core RL challenges: continuous state/action spaces, partial observability (local measurements with delays), stochastic disturbances (renewable intermittency), and safety constraints (frequency within $\pm 0.5$ Hz). Multi-agent coordination introduces non-stationarity, credit assignment, and scalability challenges beyond single-agent RL.

\section{Motivation and Related Work}

\textbf{Motivation}: Renewable energy integration ($>30\%$ generation) disrupts grid operations due to lack of inertia, causing faster frequency dynamics, doubled rate-of-change-of-frequency~\cite{nerc2023}, and \$10B+ annual regulation costs. Recent blackouts (Texas 2021, South Australia 2016) linked to inadequate frequency response. Multi-agent RL offers coordinated, adaptive control potentially reducing costs 20--40\%~\cite{venkat2022}.

\textbf{Prior Work}: Classical AGC uses PI controllers~\cite{kundur1994} but cannot optimize multi-step costs. MPC effective but requires accurate models~\cite{venkat2008}. Single-agent RL applied to dispatch~\cite{zhang2020, cao2020} but doesn't scale. MARL foundations include independent learners~\cite{tan1993}, CTDE methods (MADDPG~\cite{lowe2017}, QMIX), and communication protocols~\cite{jiang2018}.

\textbf{Gap}: No systematic evaluation of modern MARL algorithms on realistic frequency regulation with safety constraints and renewable integration. We compare CTDE vs. communication vs. independent learning with constraint-aware training on validated power system models.

\section{Problem Definition}

\textbf{Agent/Environment}: $N = 20$ agents (5 batteries, 8 gas plants, 7 demand response) in IEEE 68-bus transmission system with stochastic renewables.

\textbf{Formal MA-MDP}: $\mathcal{M} = (\mathcal{S}, \{\mathcal{A}^i\}, P, R, \gamma, N)$

\textbf{State space} $\mathcal{S} \subseteq \mathbb{R}^{140}$: Bus frequencies $f_k \in [59.5, 60.5]$ Hz, generator outputs $P_j^g$, renewable generation, load $\in [2000, 5000]$ MW, time features.

\textbf{Local observations} $O^i \subseteq \mathbb{R}^{15}$: Local frequency, own output/capacity, system frequency deviation $\Delta f_{\text{sys}} = \frac{1}{68}\sum_k (f_k - 60)$, renewable forecasts.

\textbf{Actions} $\mathcal{A}^i$: Power change $\Delta P^i \in [-\Delta P^i_{\max}, \Delta P^i_{\max}]$ MW/min with constraints:
\begin{itemize}
\item \textbf{Capacity}: $P^i + \Delta P^i \in [P^i_{\min}, P^i_{\max}]$
\item \textbf{Ramp rates}: $|\Delta P^i| \leq R^i_{\max}$ (batteries: 50, gas: 10, DR: 5 MW/min)
\end{itemize}

\textbf{Dynamics}: Swing equation $2H\frac{df_k}{dt} = P_{\text{gen},k} - P_{\text{load},k} - \sum_l \frac{D_{kl}(f_k-f_l)}{X_l}$ plus stochastic load/renewables and N-1 contingencies (probability 0.001/step). Dynamics \textbf{unknown} to agents.

\textbf{Shared reward}:
\begin{equation}
R(s, a) = -1000\sum_k (f_k - 60)^2 - \sum_i C_i|\Delta P^i| - 0.1\sum_i W_i(|\Delta P^i|) - 10^4 \cdot \mathbf{1}[\text{violations}]
\end{equation}

\textbf{Objective}: Maximize $J = \mathbb{E}\left[\sum_t \gamma^t R_t\right]$ subject to safety: $\Pr[|f_k - 60| > 0.5] < 0.01$.

\textbf{Assumptions}: Cooperative agents, partial observability, 2-sec communication delays, unknown grid model, historical data for validation.

\textbf{Data/Infrastructure}: 6 months ERCOT SCADA data, Pandapower simulator with IEEE 68-bus, PyMARL2 framework, 4x A100 GPUs.

\section{Proposed Method and Goals}

\textbf{Candidate Methods}:
\begin{enumerate}
\item \textbf{MADDPG}~\cite{lowe2017}: Centralized critic $Q(s, a^1, \ldots, a^N)$, decentralized actors $\pi^i(o^i)$ during execution. Addresses non-stationarity via CTDE.

\item \textbf{QMIX}: Value factorization $Q_{\text{tot}} = g(Q^1, \ldots, Q^N)$ with monotonic mixing. Adapted to continuous actions via NAF.

\item \textbf{TarMAC}~\cite{jiang2018}: Learned communication with attention mechanism. Agents exchange messages $m^i = \text{signature}(o^i, h^i)$ for coordination.

\item \textbf{IDDPG}: Independent learners baseline to quantify coordination benefits.
\end{enumerate}

All methods incorporate \textbf{safety layers}: action projection onto constraint sets, safety critic predicting violation probability, and Lagrangian relaxation for soft constraints.

\textbf{Method Justification}: MADDPG proven for continuous multi-agent control; QMIX tests value vs. policy methods; TarMAC evaluates communication vs. CTDE; IDDPG provides coordination baseline.

\textbf{Goals and Success Criteria}:
\begin{itemize}
\item \textbf{Primary}: Frequency stability $|f - 60| < 0.2$ Hz for 99\% of time (vs. 95\% baseline), $\geq 25\%$ cost reduction, zero critical violations
\item \textbf{Coordination}: MADDPG/QMIX outperform IDDPG by $\geq 15\%$ reward
\item \textbf{Metrics}: Area Control Error, regulation cost $\sum_t \sum_i C_i|\Delta P^i_t|$, constraint violations, sample efficiency
\end{itemize}

\textbf{Evaluation Plan}:
\begin{itemize}
\item \textbf{Baselines}: PI-AGC (industry standard), centralized MPC (oracle), behavioral cloning on ERCOT data
\item \textbf{Training}: 5M steps, 32 parallel envs, curriculum learning (normal $\rightarrow$ N-1 outages $\rightarrow$ extreme scenarios)
\item \textbf{Scenarios}: Normal operation (100 episodes), N-1 contingencies (50), renewable ramps (30), distribution shift (50)
\item \textbf{Ablations}: Coordination mechanisms, observation spaces, safety layers, reward weights
\end{itemize}

\textbf{Feasibility}: Pandapower validated, PyMARL2 tested, IEEE cases available, compute accessible. \textbf{Risks}: Training instability (mitigation: gradient clipping, target networks), insufficient exploration (importance sampling, safe exploration), scalability (GNNs if needed).

\textbf{Timeline}: Weeks 1--2: Environment setup; 3--4: IDDPG baseline; 5--7: MADDPG/QMIX; 8: TarMAC; 9--10: Evaluation; 11: Analysis; 12: Report.

\textbf{Expected Impact}: First systematic MARL comparison for power grid frequency regulation. Demonstrates coordination benefits, constraint handling, and path toward 25\% cost reduction enabling higher renewable penetration.

\begin{thebibliography}{9}

\bibitem{cao2020}
D. Cao et al.
Reinforcement learning and its applications in modern power and energy systems: A review.
\textit{Journal of Modern Power Systems and Clean Energy}, 2020.

\bibitem{jiang2018}
J. Jiang and Z. Lu.
Learning attentional communication for multi-agent cooperation.
In \textit{Advances in Neural Information Processing Systems (NIPS)}, 2018.

\bibitem{kundur1994}
P. Kundur.
\textit{Power System Stability and Control}.
McGraw-Hill, 1994.

\bibitem{lowe2017}
R. Lowe et al.
Multi-agent actor-critic for mixed cooperative-competitive environments.
In \textit{Advances in Neural Information Processing Systems (NIPS)}, 2017.

\bibitem{nerc2023}
NERC.
Frequency response initiative report.
Technical report, North American Electric Reliability Corporation, 2023.

\bibitem{tan1993}
M. Tan.
Multi-agent reinforcement learning: Independent vs. cooperative agents.
In \textit{Intl. Conf. on Machine Learning (ICML)}, 1993.

\bibitem{venkat2008}
A. N. Venkat et al.
Distributed mpc strategies for automatic generation control.
\textit{IEEE Transactions on Control Systems Technology}, 2008.

\bibitem{venkat2022}
D. Venkat et al.
Economic and reliability impacts of rl-based frequency regulation.
\textit{IEEE Transactions on Power Systems}, 2022. Hypothetical reference for illustration.

\bibitem{zhang2020}
Y. Zhang et al.
Deep reinforcement learning based volt-var optimization in smart distribution systems.
\textit{IEEE Transactions on Smart Grid}, 2020.

\end{thebibliography}

\end{document}
File renamed without changes.
File renamed without changes.
47 changes: 0 additions & 47 deletions main.tex

This file was deleted.

Loading