docs/source/Examples/a2c_three_columns.rst (43 additions & 6 deletions)
@@ -6,28 +6,65 @@ A2C algorithm
 -------------
 
 Both the Q-learning algorithm we used in `Q-learning on a three columns dataset <qlearning_three_columns.html>`_ and the SARSA algorithm in
-`Semi-gradient SARSA on a three columns data set`_ are value-based methods; that is they estimate value functions. Specifically the state-action function
+`Semi-gradient SARSA on a three columns data set <semi_gradient_sarsa_three_columns.html>`_ are value-based methods; that is, they directly estimate value functions. Specifically, the state-action function
 :math:`Q`. By knowing :math:`Q` we can construct a policy to follow, for example to choose the action that, at the given state,
-maximizes the state-action function i.e. :math:`argmax_{\alpha}Q(s_t, \alpha)` i.e. a greedy policy.
-
-However, the true objective of reinforcement learning is to directly learn a policy :math:`\pi`.
+maximizes the state-action function, i.e. :math:`argmax_{\alpha}Q(s_t, \alpha)`; that is, a greedy policy. Q-learning, in particular, is an off-policy method.
 
+However, the true objective of reinforcement learning is to directly learn a policy :math:`\pi`. One class of algorithms in this direction is the family of policy gradient algorithms,
+such as REINFORCE and the Advantage Actor-Critic, or A2C, algorithm.
 
+Typically, with these methods we approximate the policy directly by a parametrized model.
+Thereafter, we train the model, i.e. learn its parameters, by taking samples from the environment.
 The main advantage of learning a parametrized policy is that it can be any learnable function, e.g. a linear model or a deep neural network.
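To make the idea of a parametrized policy concrete, here is a minimal sketch; the sizes and names are illustrative assumptions, not the repository's code. A linear model followed by a softmax already defines a learnable policy over a discrete action space.

.. code-block:: python

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    # A parametrized policy can be as simple as a linear model that maps a
    # state vector to a probability distribution over discrete actions.
    policy = nn.Linear(3, 5)          # 3 state features, 5 actions (illustrative sizes)
    state = torch.randn(1, 3)
    probs = F.softmax(policy(state), dim=-1)
    action = torch.distributions.Categorical(probs).sample()

The parameters of ``policy`` are then learned from samples collected while interacting with the environment.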
 
 The A2C algorithm is a synchronous version of A3C. Both algorithms fall under the umbrella of actor-critic methods [REF]. In these methods, we estimate a parametrized policy (the actor)
 and a parametrized value function (the critic). The role of the policy or actor network is to indicate which action to take in a given state. In our implementation below,
 the policy network returns a probability distribution over the action space, specifically a tensor of probabilities. The role of the critic model is to evaluate how good
 the selected action is.
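The following is a minimal sketch of such an actor-critic model in PyTorch; the class name, layer sizes and shared body are assumptions for illustration, not the actual network used in the implementation.

.. code-block:: python

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ActorCritic(nn.Module):
        """Two-headed network: the actor head returns a probability
        distribution over the actions, the critic head a state value."""

        def __init__(self, state_dim: int, n_actions: int, hidden: int = 64):
            super().__init__()
            self.body = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
            self.actor_head = nn.Linear(hidden, n_actions)
            self.critic_head = nn.Linear(hidden, 1)

        def forward(self, state: torch.Tensor):
            features = self.body(state)
            action_probs = F.softmax(self.actor_head(features), dim=-1)  # tensor of probabilities
            state_value = self.critic_head(features)                     # critic's estimate of the state
            return action_probs, state_value

    # usage: sample an action from the actor and keep the value for the critic loss
    net = ActorCritic(state_dim=3, n_actions=5)
    probs, value = net(torch.randn(1, 3))
    action = torch.distributions.Categorical(probs).sample()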
 
-In A2C there is a single agent that interacts with multiple instances of the environment. In other words, we create a number of workers where each worker loads its own instance
-of the data set to anonymize. A shared model is then optimized by each worker.
+In A2C there is a single agent that interacts with multiple instances of the environment. In other words, we create a number of workers, where each worker loads its own instance of the data set to anonymize. A shared model is then optimized by each worker.
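A rough sketch of this worker/shared-model setup is shown below, assuming a PyTorch-style, Hogwild-like update; the dummy loss stands in for the actual rollout collection and A2C loss.

.. code-block:: python

    import torch
    import torch.nn as nn
    import torch.multiprocessing as mp

    def worker(rank: int, shared_model: nn.Module, n_updates: int) -> None:
        # In the real setup each worker would load its own instance of the
        # data set/environment here and collect rollouts with the shared model.
        optimizer = torch.optim.SGD(shared_model.parameters(), lr=1e-2)
        for _ in range(n_updates):
            # Dummy "rollout": random states and a surrogate loss, only to show
            # that every worker updates the same shared parameters.
            states = torch.randn(8, 4)
            loss = shared_model(states).pow(2).mean()
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

    if __name__ == "__main__":
        model = nn.Linear(4, 2)
        model.share_memory()  # keep the parameters in shared memory so all workers see one model

        processes = []
        for rank in range(4):
            p = mp.Process(target=worker, args=(rank, model, 100))
            p.start()
            processes.append(p)
        for p in processes:
            p.join()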
+
+The objective of the agent is to maximize the expected discounted return:
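In conventional notation, with policy parameters :math:`\theta`, discount factor :math:`\gamma \in [0, 1)` and reward :math:`r_t`, this objective can be written as

.. math::

    J(\theta) = \mathbb{E}_{\pi_{\theta}}\left[ G_t \right],
    \qquad
    G_t = \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1}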