.. docs/source/Examples/semi_gradient_sarsa_three_columns.rst
Semi-gradient SARSA algorithm
=============================

In this example, we continue using a three-column data set as in the `Q-learning on a three columns dataset <qlearning_three_columns.html>`_.
In that example, we used a state aggregation approach to model the overall distortion of the data set in the range :math:`[0, 1]`.
Herein, we take an alternative approach. We will assume that the column distortion is in the range :math:`[0, 1]`, where the edge points mean no distortion
and full distortion of the column respectively. For each column, we will use the same approach to discretize the continuous :math:`[0, 1]` range
into a given number of disjoint bins.
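As a concrete sketch of this discretization (the bin count, the NumPy-based helper, and the sample distortion values are illustrative assumptions, not the example's actual code):

```python
import numpy as np

# Discretize the continuous [0, 1] distortion range into disjoint bins.
# n_bins = 10 is an assumed value for illustration.
n_bins = 10
bin_edges = np.linspace(0.0, 1.0, n_bins + 1)

def distortion_to_bin(distortion: float) -> int:
    """Map a column distortion in [0, 1] to the index of its bin."""
    # np.digitize returns an index in [1, n_bins] for values in [0, 1);
    # clip so that a distortion of exactly 1.0 falls into the last bin.
    return int(np.clip(np.digitize(distortion, bin_edges), 1, n_bins) - 1)

# For the three-column data set, the state is the tuple of per-column bins.
column_distortions = [0.05, 0.42, 0.99]
state = tuple(distortion_to_bin(d) for d in column_distortions)
```

With 10 equal-width bins, the three distortions above map to the state tuple ``(0, 4, 9)``.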

Contrary to representing the state-action function :math:`q_{\pi}` using a table as we did in `Q-learning on a three columns dataset <qlearning_three_columns.html>`_, we will assume a functional form for
it. Specifically, we assume that the state-action function can be approximated by :math:`\hat{q} \approx q_{\pi}` given by

.. math::

   \hat{q}(s, \alpha) = \mathbf{w}^T\mathbf{x}(s, \alpha)

where :math:`\mathbf{w}` is the weights vector and :math:`\mathbf{x}(s, \alpha)` is called the feature vector representing state :math:`s` when taking action :math:`\alpha` [1]. For our case, the components of the feature vector will be the distortions of the three columns when applying action :math:`\alpha` on the data set. Our goal now is to find the components of the weight vector. We can use stochastic gradient descent (SGD)
for this [1]. In this case, the update rule is [1]

.. math::

   \mathbf{w}_{t+1} = \mathbf{w}_t + \eta\left[R_{t+1} + \gamma\hat{q}(S_{t+1}, A_{t+1}, \mathbf{w}_t) - \hat{q}(S_t, A_t, \mathbf{w}_t)\right]\nabla\hat{q}(S_t, A_t, \mathbf{w}_t)

where :math:`\eta` is the learning rate and :math:`\gamma` the discount factor.
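A minimal sketch of this SGD update for the linear case, where the gradient of :math:`\hat{q}` with respect to :math:`\mathbf{w}` is simply the feature vector (the function name and the learning-rate and discount values are illustrative assumptions):

```python
import numpy as np

# Assumed hyperparameters for illustration: learning rate and discount factor.
eta, gamma = 0.1, 0.99

def sarsa_update(w, x_sa, reward, x_next_sa, done):
    """One semi-gradient SARSA step on the weight vector w.

    For the linear approximation q_hat(s, a) = w . x(s, a), the
    gradient of q_hat w.r.t. w is just the feature vector x(s, a).
    """
    q = w @ x_sa
    # The successor value is zero at the end of an episode.
    q_next = 0.0 if done else w @ x_next_sa
    td_error = reward + gamma * q_next - q
    return w + eta * td_error * x_sa
```

Starting from zero weights, a reward of 1 on a terminal transition with feature vector ``[1, 0, 0]`` shifts the first weight by ``eta``.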

We will use bins to discretize the deformation range for each column in the data set.
The state vector will contain these deformations. Hence, for the three column data set, the state vector will have three entries, each indicating the distortion of the respective column.
The semi-gradient SARSA algorithm is shown below.

.. figure:: images/semi_gradient_sarsa.png

   Episodic semi-gradient SARSA algorithm. Image from [1].

Tiling
------

We will use a linear function approximation for :math:`\hat{q}`:

.. figure:: images/tiling_example.png

   Multiple, overlapping grid-tilings on a limited two-dimensional space.
   These tilings are offset from one another by a uniform amount in each dimension. Image from [1].

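The tiling idea can be sketched as follows (the number of tilings, tiles per dimension, and the helper function are hypothetical choices for illustration, not the example's actual implementation):

```python
import numpy as np

# Hypothetical tiling parameters: 8 overlapping grids over [0, 1]^3,
# each with 10 tiles per dimension, offset uniformly from one another.
n_tilings, tiles_per_dim, n_dims = 8, 10, 3

def tile_indices(point):
    """Return, for each tiling, the flat index of the tile containing point."""
    point = np.asarray(point, dtype=float)
    indices = []
    for t in range(n_tilings):
        # Each tiling is shifted by a uniform fraction of one tile width.
        offset = t / n_tilings / tiles_per_dim
        coords = np.floor((point + offset) * tiles_per_dim).astype(int)
        coords = np.clip(coords, 0, tiles_per_dim - 1)
        # Flatten the per-dimension tile coordinates into a single index.
        flat = 0
        for c in coords:
            flat = flat * tiles_per_dim + int(c)
        indices.append(t * tiles_per_dim ** n_dims + flat)
    return indices
```

Each state activates exactly one tile per tiling, so the feature vector for :math:`\hat{q}` has ``n_tilings`` active components out of ``n_tilings * tiles_per_dim ** n_dims``.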
Code
----

References
----------

1. Richard S. Sutton and Andrew G. Barto, *Reinforcement Learning: An Introduction*, 2nd Edition, MIT Press.