
Commit 788ea92

committed
Update documentation
1 parent bc7bcae commit 788ea92

File tree

7 files changed: +58 -51 lines changed


docs/source/API/spaces/actions.rst

Lines changed: 1 addition & 1 deletion
@@ -18,7 +18,7 @@
     :members: __init__, act

 .. autoclass:: ActionRestore
-    :members:: __init__, act
+    :members: __init__, act

 .. autoclass:: ActionStringGeneralize
     :members: __init__, act, add

docs/source/API/spaces/discrete_state_environment.rst

Lines changed: 1 addition & 1 deletion
@@ -19,7 +19,7 @@

    DiscreteEnvConfig
    DiscreteStateEnvironment
-   MultiprocessEnv
+


docs/source/Examples/a2c_three_columns.rst

Lines changed: 43 additions & 6 deletions
@@ -6,28 +6,65 @@ A2C algorithm
-------------

Both the Q-learning algorithm we used in `Q-learning on a three columns dataset <qlearning_three_columns.html>`_ and the SARSA algorithm in
`Semi-gradient SARSA on a three columns data set <semi_gradient_sarsa_three_columns.html>`_ are value-based methods; that is, they directly estimate value functions, specifically the state-action function
:math:`Q`. By knowing :math:`Q` we can construct a policy to follow, for example the greedy policy that at a given state chooses the action
maximizing the state-action function, i.e. :math:`argmax_{\alpha}Q(s_t, \alpha)`. Q-learning, for instance, is an off-policy method, whereas SARSA is an on-policy method.

However, the true objective of reinforcement learning is to directly learn a policy :math:`\pi`. One class of algorithms in this direction is policy gradient algorithms,
such as REINFORCE and Advantage Actor-Critic (A2C).

Typically, with these methods we approximate the policy directly by a parametrized model.
We then train the model, i.e. learn its parameters, by taking samples from the environment.
The main advantage of learning a parametrized policy is that it can be any learnable function, e.g. a linear model or a deep neural network.

The A2C algorithm is a synchronous version of A3C. Both algorithms fall under the umbrella of actor-critic methods [REF]. In these methods we estimate a parametrized policy, the actor,
and a parametrized value function, the critic. The role of the policy or actor network is to indicate which action to take in a given state. In our implementation below,
the policy network returns a probability distribution over the action space, specifically a tensor of probabilities. The role of the critic model is to evaluate how good
the selected action is.
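As an illustration of this interface, the short snippet below samples an action from such a probability tensor; the probability values are made up and the snippet is not part of the repository code.

.. code-block:: python

    import torch
    from torch.distributions import Categorical

    # A made-up probability tensor over a 5-action space, as the policy
    # (actor) head would return for a single state.
    action_probs = torch.tensor([0.1, 0.4, 0.2, 0.2, 0.1])

    dist = Categorical(probs=action_probs)
    action = dist.sample()            # index of the sampled action
    log_prob = dist.log_prob(action)  # needed later for the policy gradient

    print(action.item(), log_prob.item())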
In A2C there is a single agent that interacts with multiple instances of the environment. In other words, we create a number of workers where each worker loads its own instance of the data set to anonymize. A shared model is then optimized by each worker.

The objective of the agent is to maximize the expected discounted return:

.. math::

    J(\pi_{\theta}) = E_{\tau \sim \rho_{\theta}}\left[\sum_{t=0}^T\gamma^t R(s_t, \alpha_t)\right]

where :math:`\tau` is the trajectory the agent observes, with probability distribution :math:`\rho_{\theta}`, :math:`\gamma` is the
discount factor and :math:`R(s_t, \alpha_t)` is the reward function, which is unknown to the agent.
We can use neural networks to approximate both models.

Specifically, we will use a weight-sharing model. Moreover, the environment is a multi-process class that gathers samples from multiple
environments at once.
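Below is a minimal sketch of such a weight-sharing actor-critic network; the layer sizes, names and overall architecture are illustrative assumptions rather than the repository's implementation.

.. code-block:: python

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SharedActorCritic(nn.Module):
        """A small actor-critic model with a shared body: the policy (actor)
        head returns action probabilities, the value (critic) head a scalar."""

        def __init__(self, state_dim: int, n_actions: int, hidden: int = 64):
            super().__init__()
            self.body = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
            self.policy_head = nn.Linear(hidden, n_actions)
            self.value_head = nn.Linear(hidden, 1)

        def forward(self, state: torch.Tensor):
            features = self.body(state)
            action_probs = F.softmax(self.policy_head(features), dim=-1)
            state_value = self.value_head(features)
            return action_probs, state_value

    # Example: a batch of 4 made-up states with 3 features each.
    model = SharedActorCritic(state_dim=3, n_actions=5)
    probs, values = model(torch.randn(4, 3))
    print(probs.shape, values.shape)  # torch.Size([4, 5]) torch.Size([4, 1])

Sharing the body means both heads are trained from the same features, which is what makes a single combined loss (actor term plus critic term) convenient later on.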
The advantage :math:`A(s_t, \alpha_t)` is defined as [REF]

.. math::

    A(s_t, \alpha_t) = Q_{\pi}(s_t, \alpha_t) - V_{\pi}(s_t)

where :math:`Q_{\pi}` is the state-action value function and :math:`V_{\pi}` is the state value function. The advantage measures how much better
taking action :math:`\alpha_t` at state :math:`s_t` is compared to the average action at that state.

The critic loss function is the mean squared error between the estimated state value :math:`V_{\pi}(s_t)` and the observed discounted return.
With the advantage in place, the policy gradient becomes

.. math::

    \nabla_{\theta}J(\theta) \approx \sum_{t=0}^{T-1} \nabla_{\theta}\log \pi_{\theta}(\alpha_t | s_t)A(s_t, \alpha_t)
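The sketch below shows how these quantities are typically turned into PyTorch losses; the function and variable names are illustrative, and ``returns`` is assumed to hold the discounted returns :math:`R_t` computed from the collected rewards.

.. code-block:: python

    import torch
    import torch.nn.functional as F
    from torch.distributions import Categorical

    def a2c_losses(action_probs: torch.Tensor, state_values: torch.Tensor,
                   actions: torch.Tensor, returns: torch.Tensor):
        """Actor and critic losses for a batch of transitions."""
        state_values = state_values.squeeze(-1)

        # Advantage estimate A(s_t, a_t) ~ R_t - V(s_t); detached so that the
        # actor term does not back-propagate through the critic head.
        advantages = returns - state_values.detach()

        log_probs = Categorical(probs=action_probs).log_prob(actions)
        actor_loss = -(log_probs * advantages).mean()    # policy-gradient term
        critic_loss = F.mse_loss(state_values, returns)  # value regression
        return actor_loss, critic_loss

    # Made-up batch of 4 transitions over a 5-action space.
    probs = torch.softmax(torch.randn(4, 5), dim=-1)
    values = torch.randn(4, 1)
    actions = torch.randint(0, 5, (4,))
    returns = torch.randn(4)
    print(a2c_losses(probs, values, actions, returns))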
Overall, the A2C algorithm is described below; a sketch of this update loop follows the list.

1. Initialize the network parameters :math:`\theta`
2. Play :math:`N` steps in the environment using the current policy :math:`\pi_{\theta}`
3. Loop over the accumulated experience in reversed order :math:`T, T-1, \dots, t_0`

   - Compute the total reward :math:`R = r_t + \gamma R`
   - Compute the actor gradients
   - Compute the critic gradients
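The sketch below reuses the hypothetical ``SharedActorCritic`` and ``a2c_losses`` pieces from above to implement these steps; how the rollout data is collected is assumed rather than taken from the repository.

.. code-block:: python

    import torch

    def train_on_rollout(model, optimizer, states, actions, rewards,
                         last_value: float, gamma: float = 0.99) -> None:
        """One A2C update from N collected steps (step 3 of the algorithm)."""
        # Step 3a: accumulate the discounted return backwards, R = r_t + gamma * R,
        # bootstrapping from the value of the last observed state.
        R = last_value
        returns = []
        for r in reversed(rewards):
            R = r + gamma * R
            returns.insert(0, R)
        returns = torch.tensor(returns, dtype=torch.float32)

        # Forward pass over the collected states.
        probs, values = model(torch.stack(states))

        # Steps 3b-3c: actor and critic gradients from a single combined loss.
        actor_loss, critic_loss = a2c_losses(probs, values,
                                             torch.tensor(actions), returns)
        optimizer.zero_grad()
        (actor_loss + critic_loss).backward()
        optimizer.step()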
Code
----

src/algorithms/pytorch_multi_process_trainer.py

Lines changed: 1 addition & 42 deletions
@@ -8,7 +8,7 @@
 from typing import TypeVar, Any
 from dataclasses import dataclass

-import torch.multiprocessing as mp
+

 from src.utils import INFO
 from src.utils.function_wraps import time_func, time_func_wrapper
@@ -63,47 +63,6 @@ class WorkerResult(object):
     worker_idx: int


-class TorchProcsHandler(object):
-    """The TorchProcsHandler class. Utility
-    class to handle PyTorch processe
-
-    """
-
-    def __init__(self, n_procs: int) -> None:
-        """Constructor
-
-        Parameters
-        ----------
-        n_procs: The number of processes to handle
-
-        """
-        self.n_procs = n_procs
-        self.processes = []
-
-    def create_and_start(self, target: Any, *args) -> None:
-        for i in range(self.n_procs):
-            p = mp.Process(target=target, args=args)
-            p.start()
-            self.processes.append(p)
-
-    def create_process_and_start(self, target: Any, args) -> None:
-        p = mp.Process(target=target, args=args)
-        p.start()
-        self.processes.append(p)
-
-    def join(self) -> None:
-        for p in self.processes:
-            p.join()
-
-    def terminate(self) -> None:
-        for p in self.processes:
-            p.terminate()
-
-    def join_and_terminate(self):
-        self.join()
-        self.terminate()
-
-
 def worker(worker_idx: int, worker_model: nn.Module, params: dir):
     """Executes the process work
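With the helper class removed, worker processes can be spawned with torch.multiprocessing directly. A minimal sketch, assuming the worker function above and a model placed in shared memory (not the repository's actual trainer code):

    import torch.multiprocessing as mp
    import torch.nn as nn

    def spawn_workers(n_workers: int, model: nn.Module, params: dict) -> None:
        """Start one process per worker; all workers optimize the shared model."""
        model.share_memory()  # make the model's tensors visible to all processes
        processes = []
        for worker_idx in range(n_workers):
            p = mp.Process(target=worker, args=(worker_idx, model, params))
            p.start()
            processes.append(p)
        for p in processes:
            p.join()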

src/spaces/__init__.py

Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
+from src.spaces.time_step import TimeStep, StepType, VectorTimeStep
+from src.spaces.multiprocess_env import MultiprocessEnv

src/spaces/time_step.py

Lines changed: 9 additions & 0 deletions
@@ -85,4 +85,13 @@ def copy_time_step(time_step: TimeStep, **copy_options) -> TimeStep:
                          reward=reward, discount=discount)


+class VectorTimeStep(object):
+
+    def __init__(self):
+        self.time_steps = []
+
+    def append(self, time_step: TimeStep) -> None:
+        self.time_steps.append(time_step)
+
+


tests/test_epsilon_greedy_q_estimator.py

Lines changed: 1 addition & 1 deletion
@@ -26,4 +26,4 @@ def test_q_hat_value_raise_InvalidParamValue(self):


 if __name__ == '__main__':
-    unittest.main()
+    unittest.main()
