Commit dd5f73d

Merge pull request #73 from pockerman/investigate_sarsa_semi_gradient
Investigate sarsa semi gradient
2 parents 8f55d91 + 8ec6f83 commit dd5f73d

12 files changed: +219 -74 lines changed
docs/source/API/epsilon_greedy_policy.rst

Lines changed: 11 additions & 0 deletions
@@ -0,0 +1,11 @@
+epsilon_greedy_policy
+=======================
+
+.. automodule:: epsilon_greedy_policy
+
+.. autoclass:: EpsilonDecayOption
+
+.. autoclass:: EpsilonGreedyConfig
+
+.. autoclass:: EpsilonGreedyPolicy
+   :members: __init__, from_config, __str__, __call__, on_state, actions_after_episode

docs/source/API/time_step.rst

Lines changed: 19 additions & 0 deletions
@@ -0,0 +1,19 @@
+time_step
+==========
+
+.. automodule:: time_step
+   :members: copy_time_step
+
+.. autoclass:: StepType
+.. autoclass:: TimeStep
+   :members: first, mid, last, done
+
+
+
+
+
+
+
+
+
+
Two image files added (97.9 KB and 98.5 KB).

docs/source/Examples/qlearning_three_columns.rst

Lines changed: 11 additions & 0 deletions
@@ -1,9 +1,20 @@
 Q-learning on a three columns dataset
 =====================================
 
+Overview
+--------
+
+In this example, we use a tabular Q-learning algorithm to anonymize a data set with three columns.
+
+
+
+
 In this simple example we show how to apply QLearning on a dataset with three columns.
 
 
+Code
+----
+
 .. code-block::
 
     import numpy as np

docs/source/Examples/semi_gradient_sarsa_three_columns.rst

Lines changed: 52 additions & 15 deletions
@@ -1,25 +1,33 @@
 Semi-gradient SARSA algorithm
 =============================
 
+Overview
+--------
+
+In this example, we use the episodic semi-gradient SARSA algorithm to anonymize a data set with three columns.
+
+
+Semi-gradient SARSA algorithm
+-----------------------------
+
 In this example, we continue using a three-column data set as in the `Q-learning on a three columns dataset <qlearning_three_columns.html>`_.
-In that example, we used a state aggregation approach to model the overall distortion of the data set in the range :math:`[0, 1]`.
-Herein, we take an alternative approach. We will assume that the column distortion is in the range :math:`\[0, 1]` where the edge points mean no distortion
-and full distortion of the column respectively. For each column, we will use the same approach to discretize the continuous :math:`[0, 1]` range
-into a given number of disjoint bins.
+In that example, we used state aggregation to model the overall distortion of the data set in the range :math:`[0, 1]`.
+Herein, we take an alternative approach. We will assume that the column distortion is in the range :math:`[0, 1]`, where the edge points mean no distortion
+and full distortion of the column respectively. For each column, we will use the same methodology as in `Q-learning on a three columns dataset <qlearning_three_columns.html>`_ to discretize the continuous :math:`[0, 1]` range into a given number of disjoint bins.
 
 Contrary to representing the state-action function :math:`q_{\pi}` using a table as we did in `Q-learning on a three columns dataset <qlearning_three_columns.html>`_, we will assume a functional form for
 it. Specifically, we assume that the state-action function can be approximated by :math:`\hat{q} \approx q_{\pi}` given by
 
 .. math::
    \hat{q}(s, \alpha) = \mathbf{w}^T\mathbf{x}(s, \alpha) = \sum_{i}^{d} w_i x_i(s, \alpha)
 
-where :math:`\mathbf{w}` is the weights vector and :math:`\mathbf{x}(s, \alpha)` is called the feature vector representing state :math:`s` when taking action :math:`\alpha` [1]. For our case the components of the feature vector will be distortions of the three columns when applying action :math:`\alpha` on the data set. Our goal now is to find the components of the weight vector. We can the stochastic gradient descent (or SGD )
-for this [1]. In this case, the update rule is [1]
+where :math:`\mathbf{w}` is the weights vector and :math:`\mathbf{x}(s, \alpha)` is called the feature vector representing state :math:`s` when taking action :math:`\alpha` [1]. We will use `Tile coding`_ to construct :math:`\mathbf{x}(s, \alpha)`. Our goal now is to find the components of the weight vector.
+We can use stochastic gradient descent (or SGD) for this [1]. In this case, the update rule is [1]
 
 .. math::
    \mathbf{w}_{t + 1} = \mathbf{w}_t + \eta\left[U_t - \hat{q}(s_t, \alpha_t, \mathbf{w}_t)\right] \nabla_{\mathbf{w}} \hat{q}(s_t, \alpha_t, \mathbf{w}_t)
 
-where :math:`U_t` for one-step SARSA is given by [1]:
+where :math:`\eta` is the learning rate and :math:`U_t`, for one-step SARSA, is given by [1]:
 
 .. math::
    U_t = R_t + \gamma \hat{q}(s_{t + 1}, \alpha_{t + 1}, \mathbf{w}_t)
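
To make the update rule concrete, the following is a minimal NumPy sketch of the one-step semi-gradient SARSA update for the linear :math:`\hat{q}` above; the function and variable names are illustrative only and are not part of the repository code.

.. code-block::

    import numpy as np

    def semi_gradient_sarsa_update(w, x_t, x_next, reward, eta, gamma, done):
        """One-step semi-gradient SARSA update for a linear q_hat(s, a) = w^T x(s, a)."""
        q_t = w.dot(x_t)                           # q_hat(s_t, a_t, w_t)
        q_next = 0.0 if done else w.dot(x_next)    # q_hat(s_{t+1}, a_{t+1}, w_t)
        target = reward + gamma * q_next           # U_t for one-step SARSA
        # since q_hat is linear in w, grad_w q_hat(s, a) = x(s, a)
        return w + eta * (target - q_t) * x_t

    # toy usage with random feature vectors
    rng = np.random.default_rng(0)
    w = np.zeros(8)
    x_t, x_next = rng.random(8), rng.random(8)
    w = semi_gradient_sarsa_update(w, x_t, x_next, reward=-1.0, eta=0.1, gamma=0.99, done=False)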
@@ -29,20 +37,27 @@ Since, :math:`\hat{q}(s, \alpha)` is a linear function with respect to the weigh
 .. math::
    \nabla_{\mathbf{w}} \hat{q}(s, \alpha) = \mathbf{x}(s, \alpha)
 
-We will use bins to discretize the deformation range for each column in the data set.
-The state vector will contain these deformations. Hence, for the three column data set, the state vector will have three entries, each indicating the distortion of the respective column.
-
 The semi-gradient SARSA algorithm is shown below
 
 .. figure:: images/semi_gradient_sarsa.png
 
    Episodic semi-gradient SARSA algorithm. Image from [1].
 
 
-
-
-Tiling
-------
+Tile coding
+-------------
+
+Since we consider the distortions of all the columns in the data set, we deal with a multi-dimensional continuous space. In this case,
+we can use tile coding to construct :math:`\mathbf{x}(s, \alpha)` [1].
+
+Tile coding is a form of coarse coding for multi-dimensional continuous spaces [1]. In this method, the features are grouped into partitions of the state
+space. Each partition is called a tiling, and each element of the partition is called a
+tile [1]. The following figure shows a 2D state space partitioned into a uniform grid (left).
+If we only use this tiling, we would not have coarse coding but just a case of state aggregation.
+
+In order to apply coarse coding, we use overlapping tilings. In this case, each tiling is offset by a fraction of a tile width [1].
+A simple case with four tilings is shown on the right side of the following figure.
+
 
 We will use a linear function approximation for :math:`\hat{q}`:
 
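For a concrete picture, here is a minimal, self-contained sketch of tile coding in one dimension, assuming uniformly spaced tilings over :math:`[0, 1]` that are offset by a fraction of the tile width. The names and parameters are illustrative and do not correspond to the repository's TiledEnv API.

.. code-block::

    import numpy as np

    def active_tiles(x, n_tilings=4, n_bins=10):
        """Index of the active tile in each of the n_tilings offset tilings."""
        tile_width = 1.0 / n_bins
        tiles = []
        for k in range(n_tilings):
            offset = (k / n_tilings) * tile_width    # each tiling is shifted slightly
            idx = int((x + offset) / tile_width)     # bin index within tiling k
            tiles.append(min(idx, n_bins))           # guard the upper edge
        return tiles

    def feature_vector(x, n_tilings=4, n_bins=10):
        """Binary feature vector with exactly one active feature per tiling."""
        phi = np.zeros(n_tilings * (n_bins + 1))
        for k, idx in enumerate(active_tiles(x, n_tilings, n_bins)):
            phi[k * (n_bins + 1) + idx] = 1.0
        return phi

    eta = 1.0 / 4              # learning rate set to 1 / (number of tilings), see below
    print(active_tiles(0.42))  # the active tile index in each of the four tilings

Because exactly one tile is active per tiling, the number of active features always equals the number of tilings; this is what motivates the choice of learning rate discussed next.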
@@ -53,9 +68,22 @@ We will use a linear function approximation for :math:`\hat{q}`:
    These tilings are offset from one another by a uniform amount in each dimension. Image from [1].
 
 
+One practical advantage of tile coding is that the overall number of features that are active
+at a given instance is the same for any state [1]. Exactly one feature is present in each tiling, so the total number of features present is
+always the same as the number of tilings [1]. This allows the learning rate :math:`\eta` to be set according to
+
+.. math::
+   \eta = \frac{1}{n}
+
+
+where :math:`n` is the number of tilings.
+
+
 Code
 ----
 
+The necessary imports
+
 .. code-block::
 
     import random
@@ -77,6 +105,8 @@ Code
     from src.utils.string_distance_calculator import StringDistanceType
     from src.utils.reward_manager import RewardManager
 
+Next we set some constants
+
 .. code-block::
 
     N_LAYERS = 5
@@ -99,6 +129,8 @@ Code
     REWARD_FACTOR = 0.95
     PUNISH_FACTOR = 2.0
 
+We continue by establishing some helper functions
+
 .. code-block::
 
     def get_ethinicity_hierarchy():
@@ -140,7 +172,6 @@ Code
         ethnicity_hierarchy["White"] = "White"
         return ethnicity_hierarchy
 
-.. code-block::
 
     def load_mock_subjects() -> MockSubjectsLoader:
 
@@ -201,6 +232,8 @@ Code
 
         return env
 
+The driver code brings all elements together
+
 .. code-block::
 
     if __name__ == '__main__':
@@ -231,6 +264,10 @@
         trainer.train()
 
 
+.. figure:: images/semi_gradient_sarsa_3_columns_reward.png
+
+
+.. figure:: images/semi_gradient_sarsa_3_columns_distortion.png
 
 References
 ----------

docs/source/conf.py

Lines changed: 1 addition & 0 deletions
@@ -17,6 +17,7 @@
 sys.path.append(os.path.abspath("../../src/algorithms/"))
 sys.path.append(os.path.abspath("../../src/exceptions/"))
 sys.path.append(os.path.abspath("../../src/spaces/"))
+sys.path.append(os.path.abspath("../../src/policies/"))
 print(sys.path)
 

docs/source/modules.rst

Lines changed: 2 additions & 1 deletion
@@ -7,6 +7,8 @@ API
 API/actions
 API/state
 API/epsilon_greedy_q_estimator
+API/epsilon_greedy_policy
+API/time_step
 generated/action_space
 generated/q_estimator
 generated/q_learning
@@ -17,6 +19,5 @@ API
 generated/discrete_state_environment
 generated/observation_space
 generated/state
-generated/time_step
 generated/tiled_environment
 

src/algorithms/epsilon_greedy_q_estimator.py

Lines changed: 4 additions & 31 deletions
@@ -18,6 +18,9 @@
 
 @dataclass(init=True, repr=True)
 class EpsilonGreedyQEstimatorConfig(EpsilonGreedyConfig):
+    """Configuration class for EpsilonGreedyQEstimator
+
+    """
     gamma: float = 1.0
     alpha: float = 1.0
     env: Env = None
@@ -29,7 +32,7 @@ class EpsilonGreedyQEstimator(WithEstimatorMixin):
     """
 
     def __init__(self, config: EpsilonGreedyQEstimatorConfig):
-        """Constructor
+        """Constructor. Initialize the estimator with a given configuration
 
         Parameters
         ----------
@@ -71,43 +74,13 @@ def q_hat_value(self, state_action_vec: StateActionVec) -> float:
         -------
         float
 
-
         """
 
         if self.weights is None:
            raise InvalidParamValue(param_name="weights", param_value="None. Have you called initialize?")
 
        return self.weights.dot(state_action_vec)
 
-    """
-    def update_weights(self, total_reward: float, state_action: Action,
-                       state_action_: Action, t: float) -> None:
-
-        Update the weights
-
-        Parameters
-        ----------
-
-        total_reward: The reward observed
-        state_action: The action that led to the reward
-        state_action_:
-        t: The decay factor for alpha
-
-        Returns
-        -------
-
-        None
-
-
-
-        if self.weights is None:
-            raise InvalidParamValue(param_name="weights", param_value="None. Have you called initialize?")
-
-        v1 = self.q_hat_value(state_action_vec=state_action)
-        v2 = self.q_hat_value(state_action_vec=state_action_)
-        self.weights += self.alpha / t * (total_reward + self.gamma * v2 - v1) * state_action
-    """
-
    def on_state(self, state: State) -> Action:
        """Returns the action on the given state
 

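For context, the estimator above combines the linear :math:`\hat{q}` with :math:`\epsilon`-greedy action selection. A minimal illustrative sketch of that selection rule follows; the names are hypothetical and do not reflect the repository's API.

.. code-block::

    import random
    import numpy as np

    def epsilon_greedy_action(weights, feature_fn, state, n_actions, eps):
        """With probability eps pick a random action, otherwise the greedy one under q_hat = w^T x(s, a)."""
        if random.random() < eps:
            return random.randint(0, n_actions - 1)   # explore
        q_values = [weights.dot(feature_fn(state, a)) for a in range(n_actions)]
        return int(np.argmax(q_values))                # exploit
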
src/examples/semi_gradient_sarsa.py

Lines changed: 41 additions & 3 deletions
@@ -16,6 +16,8 @@
 from src.utils.numeric_distance_type import NumericDistanceType
 from src.utils.string_distance_calculator import StringDistanceType
 from src.utils.reward_manager import RewardManager
+from src.utils.plot_utils import plot_running_avg
+from src.utils import INFO
 
 
 N_LAYERS = 5
@@ -144,24 +146,60 @@ def load_discrete_env() -> DiscreteStateEnvironment:
     # set the seed for random engine
     random.seed(42)
 
+    # load the discrete environment
     discrete_env = load_discrete_env()
+
+    # establish the configuration for the Tiled environment
     tiled_env_config = TiledEnvConfig(n_layers=N_LAYERS, n_bins=N_BINS,
                                       env=discrete_env,
                                       column_ranges={"ethnicity": [0.0, 1.0],
                                                      "salary": [0.0, 1.0],
                                                      "diagnosis": [0.0, 1.0]})
+    # create the Tiled environment
     tiled_env = TiledEnv(tiled_env_config)
     tiled_env.create_tiles()
 
-    configuration = {"n_episodes": N_EPISODES, "output_msg_frequency": OUTPUT_MSG_FREQUENCY}
-
+    # agent configuration
     agent_config = SemiGradSARSAConfig(gamma=GAMMA, alpha=ALPHA, n_itrs_per_episode=N_ITRS_PER_EPISODE,
                                        policy=EpsilonGreedyQEstimator(EpsilonGreedyQEstimatorConfig(eps=EPS, n_actions=tiled_env.n_actions,
                                                                       decay_op=EPSILON_DECAY_OPTION,
                                                                       epsilon_decay_factor=EPSILON_DECAY_FACTOR,
-                                                                      env=tiled_env, gamma=GAMMA, alpha=ALPHA)))
+                                                                      env=tiled_env,
+                                                                      gamma=GAMMA,
+                                                                      alpha=ALPHA)))
+    # create the agent
     agent = SemiGradSARSA(agent_config)
 
     # create a trainer to train the Qlearning agent
+    configuration = {"n_episodes": N_EPISODES, "output_msg_frequency": OUTPUT_MSG_FREQUENCY}
     trainer = Trainer(env=tiled_env, agent=agent, configuration=configuration)
+
+    # train the agent
     trainer.train()
+
+    # avg_rewards = trainer.avg_rewards()
+    avg_rewards = trainer.total_rewards
+    plot_running_avg(avg_rewards, steps=100,
+                     xlabel="Episodes", ylabel="Reward",
+                     title="Running reward average over 100 episodes")
+
+    avg_episode_dist = np.array(trainer.total_distortions)
+    print("{0} Max/Min distortion {1}/{2}".format(INFO, np.max(avg_episode_dist), np.min(avg_episode_dist)))
+
+    plot_running_avg(avg_episode_dist, steps=100,
+                     xlabel="Episodes", ylabel="Distortion",
+                     title="Running distortion average over 100 episodes")
+
+    print("=============================================")
+    print("{0} Generating distorted dataset".format(INFO))
+
+    """
+    # Let's play
+    env.reset()
+
+    stop_criterion = IterationControl(n_itrs=10, min_dist=MIN_DISTORTION, max_dist=MAX_DISTORTION)
+    agent.play(env=env, stop_criterion=stop_criterion)
+    env.save_current_dataset(episode_index=-2, save_index=False)
+    """
+    print("{0} Done....".format(INFO))
+    print("=============================================")
