
Commit c03c8c8

Merge pull request #28 from pockerman/finish_q_learning
Finish q learning
2 parents 24d6bfa + f254597 commit c03c8c8

23 files changed (+484, -185 lines)

README.md

Lines changed: 44 additions & 6 deletions
@@ -1,24 +1,62 @@
 # RL anonymity
 
-An experimental effort to use reinforcement learning techniques for data anonymity.
+An experimental effort to use reinforcement learning techniques for data anonymization.
 
 ## Conceptual overview
 
+The term data anonymization refers to techniques that can be applied to a given dataset, D, such that after
+the latter has been submitted to such techniques, it is difficult for a third party to identify or infer the existence
+of specific individuals in D. Anonymization techniques typically result in some sort of distortion
+of the original dataset. This means that, in order to maintain some utility of the transformed dataset, the transformations
+applied should be constrained in some sense. In the end, it can be argued that data anonymization is an optimization problem,
+namely striking the right balance between data utility and privacy.
+
 Reinforcement learning is a learning framework based on accumulated experience. In this paradigm, an agent is learning by interacting with an environment
-without (to a large extent) any supervision. The following image schematically describes the reinforcement learning framework
+without (to a large extent) any supervision. The following image describes, schematically, the reinforcement learning framework.
 
 ![RL paradigm](images/agent_environment_interface.png "Reinforcement learning paradigm")
 
-The framework has been use successfully to many recent advances in corntol, robotics, games and elsewhere.
+The agent chooses an action, ```a_t```, to perform out of a predefined set of actions ```A```. The chosen action is executed by the environment
+instance, which returns to the agent a reward signal, ```r_t```, as well as the new state, ```s_t```, that the environment is in.
+The framework has been used successfully in many recent advances in control, robotics, games and elsewhere.
+
 
-Given that data anonymity is essentially an optimization problem; between data utility and privacy, in this repository we try
-to use the reinforcement learning paradigm in order to train agents to perform this optimization for us. The following image
-places this into a persepctive
+Let's assume that we have at our disposal two numbers: a minimum distortion, ```MIN_DIST```, that should be applied to the dataset
+in order to achieve privacy, and a maximum distortion, ```MAX_DIST```, that should not be exceeded in order to maintain some utility.
+Let's assume also that any overall dataset distortion in ```[MIN_DIST, MAX_DIST]``` is acceptable in order to cast the dataset as
+both privacy-preserving and utility-preserving. We can then train a reinforcement learning agent to distort the dataset
+such that the aforementioned objective is achieved.
 
+Overall, this is shown in the image below.
 
 ![RL anonymity paradigm](images/general_concept.png "Reinforcement learning anonymity schematics")
 
+The images below show the overall running distortion average and running reward average achieved by using the
+<a href="https://en.wikipedia.org/wiki/Q-learning">Q-learning</a> algorithm and various policies.
+
+**Q-learning with epsilon-greedy policy and constant epsilon**
+![RL anonymity paradigm](images/q_learn_epsilon_greedy_avg_run_distortion.png "Epsilon-greedy constant epsilon")
+![RL anonymity paradigm](images/q_learn_epsilon_greedy_avg_run_reward.png "Reinforcement learning anonymity schematics")
+
+**Q-learning with epsilon-greedy policy and decaying epsilon per episode**
+![RL anonymity paradigm](images/q_learn_epsilon_greedy_decay_avg_run_distortion.png "Reinforcement learning anonymity schematics")
+![RL anonymity paradigm](images/q_learn_epsilon_greedy_decay_avg_run_reward.png "Reinforcement learning anonymity schematics")
+
+
+**Q-learning with epsilon-greedy policy and epsilon decaying at a constant rate**
+![RL anonymity paradigm](images/q_learn_epsilon_greedy_decay_rate_avg_run_distortion.png "Reinforcement learning anonymity schematics")
+![RL anonymity paradigm](images/q_learn_epsilon_greedy_decay_rate_avg_run_reward.png "Reinforcement learning anonymity schematics")
+
+**Q-learning with softmax policy**
+![RL anonymity paradigm](images/q_learn_softmax_avg_run_distortion.png "Reinforcement learning anonymity schematics")
+![RL anonymity paradigm](images/q_learn_softmax_avg_run_reward.png "Reinforcement learning anonymity schematics")
+
+
 ## Dependencies
 
+- NumPy
+
 ## Documentation
 
+## References
+
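To make the ```[MIN_DIST, MAX_DIST]``` idea in the overview concrete, the snippet below sketches one possible reward signal: positive when the overall dataset distortion lies inside the acceptable band, negative otherwise. The bound values, the function name ```distortion_reward``` and the reward/penalty magnitudes are illustrative assumptions, not part of this repository's code.

```python
# Minimal sketch of a distortion-based reward signal.
# MIN_DIST, MAX_DIST and the reward/penalty values are illustrative assumptions.
MIN_DIST = 0.3
MAX_DIST = 0.7


def distortion_reward(total_distortion: float) -> float:
    """Reward the agent when the overall distortion stays in [MIN_DIST, MAX_DIST]."""
    if total_distortion < MIN_DIST:
        # too little distortion: privacy is not yet achieved
        return -1.0
    if total_distortion > MAX_DIST:
        # too much distortion: dataset utility is lost
        return -1.0
    # acceptable trade-off between privacy and utility
    return 1.0
```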
8 binary image files changed (53.7 KB, 53.8 KB, 42.8 KB, 34.1 KB, 44.9 KB, 38.6 KB, 66.8 KB, 71.2 KB)

src/algorithms/q_learning.py

Lines changed: 26 additions & 12 deletions
@@ -6,10 +6,11 @@
 from typing import TypeVar
 
 from src.exceptions.exceptions import InvalidParamValue
-from src.utils.mixins import WithMaxActionMixin
+from src.utils.mixins import WithMaxActionMixin, WithQTableMixinBase
 
 Env = TypeVar('Env')
 Policy = TypeVar('Policy')
+Criterion = TypeVar('Criterion')
 
 
 class QLearnConfig(object):
@@ -39,8 +40,8 @@ def name(self) -> str:
 
     def actions_before_training(self, env: Env, **options):
 
-        if self.config.policy is None:
-            raise InvalidParamValue(param_name="policy", param_value="None")
+        if not isinstance(self.config.policy, WithQTableMixinBase):
+            raise InvalidParamValue(param_name="policy", param_value=str(self.config.policy))
 
         for state in range(1, env.n_states):
             for action in range(env.n_actions):
@@ -56,10 +57,11 @@ def actions_after_episode_ends(self, **options):
 
         self.config.policy.actions_after_episode(options['episode_idx'])
 
-    def play(self, env: Env) -> None:
+    def play(self, env: Env, stop_criterion: Criterion) -> None:
         """
         Play the game on the environment. This should produce
         a distorted dataset
+        :param stop_criterion:
         :param env:
         :return:
         """
@@ -69,7 +71,23 @@ def play(self, env: Env) -> None:
         # the max payout.
         # TODO: This will not work as the distortion is calculated
         # by summing over the columns.
-        raise NotImplementedError("Function not implemented")
+
+        # set the q_table for the policy
+        self.config.policy.q_table = self.q_table
+        total_dist = env.total_average_current_distortion()
+        while stop_criterion.continue_itr(total_dist):
+
+            if stop_criterion.iteration_counter == 12:
+                print("Break...")
+
+            # use the policy to select an action
+            state_idx = env.get_aggregated_state(total_dist)
+            action_idx = self.config.policy.on_state(state_idx)
+            action = env.get_action(action_idx)
+            print("{0} At state={1} with distortion={2} select action={3}".format("INFO: ", state_idx, total_dist,
+                                                                                  action.column_name + "-" + action.action_type.name))
+            env.step(action=action)
+            total_dist = env.total_average_current_distortion()
 
     def train(self, env: Env, **options) -> tuple:
 
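The reworked ```play``` above expects a ```stop_criterion``` object exposing ```continue_itr(total_distortion)``` and an ```iteration_counter``` attribute. Below is a minimal sketch of a compatible criterion that stops once a target distortion or an iteration budget is reached; the class name, its constructor arguments and the commented usage line are hypothetical, not the project's actual criterion.

```python
# Sketch of a stop criterion compatible with play() above.
# Only continue_itr() and iteration_counter are implied by the code;
# the rest is an illustrative assumption.
class IterationStopCriterion:

    def __init__(self, max_itrs: int, target_distortion: float):
        self.max_itrs = max_itrs
        self.target_distortion = target_distortion
        self.iteration_counter = 0

    def continue_itr(self, total_distortion: float) -> bool:
        """Keep iterating until the target distortion or the iteration budget is hit."""
        self.iteration_counter += 1
        if total_distortion >= self.target_distortion:
            return False
        return self.iteration_counter < self.max_itrs


# Hypothetical usage, assuming agent and env are built elsewhere:
# agent.play(env=env, stop_criterion=IterationStopCriterion(max_itrs=30, target_distortion=0.7))
```

Advancing the counter inside ```continue_itr``` keeps ```play``` itself unchanged; it could equally be advanced by the caller.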

@@ -84,15 +102,10 @@ def train(self, env: Env, **options) -> tuple:
 
         for itr in range(self.config.n_itrs_per_episode):
 
             # epsilon-greedy action selection
-            action_idx = self.config.policy(q_func=self.q_table, state=state)
+            action_idx = self.config.policy(q_table=self.q_table, state=state)
 
             action = env.get_action(action_idx)
 
-            #if action.action_type.name == "GENERALIZE" and action.column_name == "salary":
-            #    print("Attempt to generalize salary")
-            #else:
-            #    print(action.action_type.name, " on ", action.column_name)
-
             # take action A, observe R, S'
             next_time_step = env.step(action)
             next_state = next_time_step.observation
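The functional change in this hunk is the keyword the policy is called with: ```q_func``` becomes ```q_table```. A minimal epsilon-greedy policy whose ```__call__``` matches that invocation is sketched below; it assumes, purely for illustration, that the table maps a state index to a NumPy array of per-action values, which may differ from the layout expected by ```WithQTableMixinBase```.

```python
import numpy as np


# Illustrative epsilon-greedy policy matching the call
# policy(q_table=..., state=...) used in train() above.
class EpsilonGreedy:

    def __init__(self, eps: float, n_actions: int):
        self.eps = eps
        self.n_actions = n_actions

    def __call__(self, q_table, state: int) -> int:
        """Pick a random action with probability eps, otherwise the greedy action."""
        if np.random.random() < self.eps:
            return int(np.random.randint(0, self.n_actions))
        return int(np.argmax(q_table[state]))
```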
@@ -111,7 +124,8 @@ def train(self, env: Env, **options) -> tuple:
 
         return episode_score, total_distortion, counter
 
-    def _update_Q_table(self, state: int, action: int, n_actions: int, reward: float, next_state: int = None) -> None:
+    def _update_Q_table(self, state: int, action: int, n_actions: int,
+                        reward: float, next_state: int = None) -> None:
         """
         Update the Q-value for the state
         """
