
Commit c03c8c8

Merge pull request #28 from pockerman/finish_q_learning
Finish q learning
2 parents 24d6bfa + f254597 commit c03c8c8

23 files changed (+484, -185 lines)

README.md

Lines changed: 44 additions & 6 deletions
@@ -1,24 +1,62 @@
 # RL anonymity
 
-An experimental effort to use reinforcement learning techniques for data anonymity.
+An experimental effort to use reinforcement learning techniques for data anonymization.
 
 ## Conceptual overview
 
+The term data anonymization refers to techniques that can be applied to a given dataset, D, such that after
+the latter has been submitted to such techniques, it is difficult for a third party to identify or infer the existence
+of specific individuals in D. Anonymization techniques typically result in some sort of distortion
+of the original dataset. This means that, in order to maintain some utility of the transformed dataset, the transformations
+applied should be constrained in some sense. In the end, it can be argued that data anonymization is an optimization problem,
+namely striking the right balance between data utility and privacy.
+
 Reinforcement learning is a learning framework based on accumulated experience. In this paradigm, an agent is learning by interacting with an environment
-without (to a large extent) any supervision. The following image schematically describes the reinforcement learning framework
+without (to a large extent) any supervision. The following image describes, schematically, the reinforcement learning framework.
 
 ![RL paradigm](images/agent_environment_interface.png "Reinforcement learning paradigm")
 
-The framework has been use successfully to many recent advances in corntol, robotics, games and elsewhere.
+The agent chooses an action, ```a_t```, to perform out of a predefined set of actions ```A```. The chosen action is executed by the environment
+instance, which returns to the agent a reward signal, ```r_t```, as well as the new state, ```s_t```, that the environment is in.
+The framework has been used successfully in many recent advances in control, robotics, games and elsewhere.
+
 
-Given that data anonymity is essentially an optimization problem; between data utility and privacy, in this repository we try
-to use the reinforcement learning paradigm in order to train agents to perform this optimization for us. The following image
-places this into a persepctive
+Let's assume that we have at our disposal two numbers: a minimum distortion, ```MIN_DIST```, that should be applied to the dataset
+in order to achieve privacy, and a maximum distortion, ```MAX_DIST```, that should not be exceeded in order to maintain some utility.
+Let's assume also that any overall dataset distortion in ```[MIN_DIST, MAX_DIST]``` is acceptable in order to cast the dataset as
+both privacy-preserving and utility-preserving. We can then train a reinforcement learning agent to distort the dataset
+such that the aforementioned objective is achieved.
 
+Overall, this is shown in the image below.
 
 ![RL anonymity paradigm](images/general_concept.png "Reinforcement learning anonymity schematics")
 
+The images below show the overall running distortion average and running reward average achieved by using the
+<a href="https://en.wikipedia.org/wiki/Q-learning">Q-learning</a> algorithm and various policies.
+
+**Q-learning with epsilon-greedy policy and constant epsilon**
+![RL anonymity paradigm](images/q_learn_epsilon_greedy_avg_run_distortion.png "Epsilon-greedy constant epsilon")
+![RL anonymity paradigm](images/q_learn_epsilon_greedy_avg_run_reward.png "Reinforcement learning anonymity schematics")
+
+**Q-learning with epsilon-greedy policy and decaying epsilon per episode**
+![RL anonymity paradigm](images/q_learn_epsilon_greedy_decay_avg_run_distortion.png "Reinforcement learning anonymity schematics")
+![RL anonymity paradigm](images/q_learn_epsilon_greedy_decay_avg_run_reward.png "Reinforcement learning anonymity schematics")
+
+
+**Q-learning with epsilon-greedy policy and epsilon decaying at a constant rate**
+![RL anonymity paradigm](images/q_learn_epsilon_greedy_decay_rate_avg_run_distortion.png "Reinforcement learning anonymity schematics")
+![RL anonymity paradigm](images/q_learn_epsilon_greedy_decay_rate_avg_run_reward.png "Reinforcement learning anonymity schematics")
+
+**Q-learning with softmax policy**
+![RL anonymity paradigm](images/q_learn_softmax_avg_run_distortion.png "Reinforcement learning anonymity schematics")
+![RL anonymity paradigm](images/q_learn_softmax_avg_run_reward.png "Reinforcement learning anonymity schematics")
+
+
 ## Dependencies
 
+- NumPy
+
 ## Documentation
 
+## References
+
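To make the ```[MIN_DIST, MAX_DIST]``` idea in the overview concrete, the snippet below sketches one possible reward signal: positive when the overall dataset distortion lies inside the acceptable band, negative otherwise. The bound values, the function name ```distortion_reward``` and the reward/penalty magnitudes are illustrative assumptions, not part of this repository's code.

```python
# Minimal sketch of a distortion-based reward signal.
# MIN_DIST, MAX_DIST and the reward/penalty values are illustrative assumptions.
MIN_DIST = 0.3
MAX_DIST = 0.7


def distortion_reward(total_distortion: float) -> float:
    """Reward the agent when the overall distortion stays in [MIN_DIST, MAX_DIST]."""
    if total_distortion < MIN_DIST:
        # too little distortion: privacy is not yet achieved
        return -1.0
    if total_distortion > MAX_DIST:
        # too much distortion: dataset utility is lost
        return -1.0
    # acceptable trade-off between privacy and utility
    return 1.0
```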
8 binary image files changed (53.7 KB, 53.8 KB, 42.8 KB, 34.1 KB, 44.9 KB, 38.6 KB, 66.8 KB, 71.2 KB)

src/algorithms/q_learning.py

Lines changed: 26 additions & 12 deletions
@@ -6,10 +6,11 @@
 from typing import TypeVar
 
 from src.exceptions.exceptions import InvalidParamValue
-from src.utils.mixins import WithMaxActionMixin
+from src.utils.mixins import WithMaxActionMixin, WithQTableMixinBase
 
 Env = TypeVar('Env')
 Policy = TypeVar('Policy')
+Criterion = TypeVar('Criterion')
 
 
 class QLearnConfig(object):
@@ -39,8 +40,8 @@ def name(self) -> str:
 
     def actions_before_training(self, env: Env, **options):
 
-        if self.config.policy is None:
-            raise InvalidParamValue(param_name="policy", param_value="None")
+        if not isinstance(self.config.policy, WithQTableMixinBase):
+            raise InvalidParamValue(param_name="policy", param_value=str(self.config.policy))
 
         for state in range(1, env.n_states):
             for action in range(env.n_actions):
@@ -56,10 +57,11 @@ def actions_after_episode_ends(self, **options):
 
         self.config.policy.actions_after_episode(options['episode_idx'])
 
-    def play(self, env: Env) -> None:
+    def play(self, env: Env, stop_criterion: Criterion) -> None:
         """
         Play the game on the environment. This should produce
         a distorted dataset
+        :param stop_criterion:
         :param env:
         :return:
         """
@@ -69,7 +71,23 @@ def play(self, env: Env) -> None:
         # the max payout.
         # TODO: This will not work as the distortion is calculated
         # by summing over the columns.
-        raise NotImplementedError("Function not implemented")
+
+        # set the q_table for the policy
+        self.config.policy.q_table = self.q_table
+        total_dist = env.total_average_current_distortion()
+        while stop_criterion.continue_itr(total_dist):
+
+            if stop_criterion.iteration_counter == 12:
+                print("Break...")
+
+            # use the policy to select an action
+            state_idx = env.get_aggregated_state(total_dist)
+            action_idx = self.config.policy.on_state(state_idx)
+            action = env.get_action(action_idx)
+            print("{0} At state={1} with distortion={2} select action={3}".format("INFO: ", state_idx, total_dist,
+                                                                                  action.column_name + "-" + action.action_type.name))
+            env.step(action=action)
+            total_dist = env.total_average_current_distortion()
 
     def train(self, env: Env, **options) -> tuple:
 
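The reworked ```play``` above expects a ```stop_criterion``` object exposing ```continue_itr(total_distortion)``` and an ```iteration_counter``` attribute. Below is a minimal sketch of a compatible criterion that stops once a target distortion or an iteration budget is reached; the class name, its constructor arguments and the commented usage line are hypothetical, not the project's actual criterion.

```python
# Sketch of a stop criterion compatible with play() above.
# Only continue_itr() and iteration_counter are implied by the code;
# the rest is an illustrative assumption.
class IterationStopCriterion:

    def __init__(self, max_itrs: int, target_distortion: float):
        self.max_itrs = max_itrs
        self.target_distortion = target_distortion
        self.iteration_counter = 0

    def continue_itr(self, total_distortion: float) -> bool:
        """Keep iterating until the target distortion or the iteration budget is hit."""
        self.iteration_counter += 1
        if total_distortion >= self.target_distortion:
            return False
        return self.iteration_counter < self.max_itrs


# Hypothetical usage, assuming agent and env are built elsewhere:
# agent.play(env=env, stop_criterion=IterationStopCriterion(max_itrs=30, target_distortion=0.7))
```

Advancing the counter inside ```continue_itr``` keeps ```play``` itself unchanged; it could equally be advanced by the caller.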

@@ -84,15 +102,10 @@ def train(self, env: Env, **options) -> tuple:
 
         for itr in range(self.config.n_itrs_per_episode):
 
             # epsilon-greedy action selection
-            action_idx = self.config.policy(q_func=self.q_table, state=state)
+            action_idx = self.config.policy(q_table=self.q_table, state=state)
 
             action = env.get_action(action_idx)
 
-            #if action.action_type.name == "GENERALIZE" and action.column_name == "salary":
-            #    print("Attempt to generalize salary")
-            #else:
-            #    print(action.action_type.name, " on ", action.column_name)
-
             # take action A, observe R, S'
             next_time_step = env.step(action)
             next_state = next_time_step.observation
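The functional change in this hunk is the keyword the policy is called with: ```q_func``` becomes ```q_table```. A minimal epsilon-greedy policy whose ```__call__``` matches that invocation is sketched below; it assumes, purely for illustration, that the table maps a state index to a NumPy array of per-action values, which may differ from the layout expected by ```WithQTableMixinBase```.

```python
import numpy as np


# Illustrative epsilon-greedy policy matching the call
# policy(q_table=..., state=...) used in train() above.
class EpsilonGreedy:

    def __init__(self, eps: float, n_actions: int):
        self.eps = eps
        self.n_actions = n_actions

    def __call__(self, q_table, state: int) -> int:
        """Pick a random action with probability eps, otherwise the greedy action."""
        if np.random.random() < self.eps:
            return int(np.random.randint(0, self.n_actions))
        return int(np.argmax(q_table[state]))
```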
@@ -111,7 +124,8 @@ def train(self, env: Env, **options) -> tuple:
 
         return episode_score, total_distortion, counter
 
-    def _update_Q_table(self, state: int, action: int, n_actions: int, reward: float, next_state: int = None) -> None:
+    def _update_Q_table(self, state: int, action: int, n_actions: int,
+                        reward: float, next_state: int = None) -> None:
         """
         Update the Q-value for the state
         """
