Commit d6b43a2

Merge pull request #320 from SubstraFoundation/fully-flexible-split
Create a fully flexible splitter
2 parents 3aa6dc0 + 53ebdfe commit d6b43a2

5 files changed: +165 -38 lines changed

mplc/doc/documentation.md

Lines changed: 15 additions & 13 deletions
@@ -238,23 +238,25 @@ There are 2 ways to select a dataset. You can either choose a pre-implemented da
238238
Example: `amounts_per_partner=[0.3, 0.3, 0.1, 0.3]`
239239

240240
<a id="sample_split_option"></a>
241-
- `samples_split_option`: Used to set the strategy of samples data split. You can either instantiate a Splitter before passing it to Scenario, as in the below example, or you can pass it by its string identifier. In the latter case, the default parameters for the Splitter selected will be used.
241+
- `samples_split_option`: Used to set the strategy of samples data split. You can either instantiate a `Splitter` before passing it to `Scenario`, or you can pass it by its string identifier. In the latter case, the default parameters for the `Splitter` selected will be used.
242242
How the original dataset data samples are split among partners:
243-
- `RandomSplitter`: the dataset is shuffled and partners receive data samples selected randomly
244-
String identifier: `'random'`
245-
- `StratifiedSplitter`: the dataset is stratified per class and each partner receives certain classes only (note: depending on the `amounts_per_partner` specified, there might be small overlaps of classes)
246-
String identifier: `'stratified'``[[nb of clusters (int), 'shared' or 'specific']]`
247-
- `'AdvancedSplitter'`: in certain cases it might be interesting to split the dataset among partners in a more elaborate way. For that we consider the data samples from the initial dataset as split in clusters per data labels. The advanced split is configured by indicating, for each partner in sequence, the following 2 elements: `[[nb of clusters (int), 'shared' or 'specific']]`. Practically, you can either instantiate your `AdvancedSplitter` object, and pass this list `[[nb of clusters (int), 'shared' or 'specific']]` to the keyword argument `description`, or use the string identifier and pass the list `[[nb of clusters (int), 'shared' or 'specific']]` to the scenario via the keyword argument `samples_split_configuration`.
248-
String identifier:`'advanced'`.
249-
Configuration:
243+
244+
- `RandomSplitter`: the dataset is shuffled and partners receive data samples selected randomly. String identifier: `'random'`
245+
246+
- `StratifiedSplitter`: the dataset is stratified per class and each partner receives certain classes only (note: depending on the `amounts_per_partner` specified, there might be some overlap of classes). String identifier: `'stratified'`
247+
248+
- `AdvancedSplitter`: in certain cases it might be interesting to split the dataset among partners in a more elaborate way. For that we consider the data samples from the initial dataset as split in clusters per data labels. The advanced split is configured by indicating, for each partner in sequence, the following 2 elements: `[[nb of clusters (int), 'shared' or 'specific']]`. Practically, you can either instantiate your `AdvancedSplitter` object and pass this list to the keyword argument `configuration`, or use the string identifier and pass the list to the `Scenario` via the keyword argument `samples_split_configuration`. String identifier: `'advanced'`. Configuration:
250249
- `nb of clusters (int)`: the given partner will receive data samples from that many different clusters (clusters of data samples per labels/classes)
251250
- `'shared'` or `'specific'`:
252-
- `'shared'`: all partners with option `'shared'` receive data samples picked
251+
- `'shared'`: all partners with option `'shared'` receive data samples picked
253252
from clusters they all share data samples from
254-
- `'specific'`: each partner with option `'specific'` receives data samples picked
255-
from cluster(s) it is the only one to receive from
256-
257-
Example: `samples_split_option='advanced', samples_split_configuration=[[7, 'shared'], [6, 'shared'], [2, 'specific'], [1, 'specific']]]`
253+
- `'specific'`: each partner with option `'specific'` receives data samples picked
254+
from cluster(s) it is the only one to receive from
255+
Example: `samples_split_option='advanced', samples_split_configuration=[[7, 'shared'], [6, 'shared'], [2, 'specific'], [1, 'specific']]`
256+
257+
- `FlexibleSplitter`: in other cases one might want to specify the split among partners in full detail (partner per partner and class per class). For that the `FlexibleSplitter` can be used. It is configured by indicating, for each partner in sequence, a list of the percentage of samples for each class: `[[% for class 1, ..., % for class n]]`. As above, it can be instantiated separately and then passed to the `Scenario` instance, or the string identifier `'flexible'` can be used for the parameter `samples_split_option`, coupled with the split configuration passed to the keyword argument `samples_split_configuration`. String identifier: `'flexible'`.
258+
Example: `samples_split_option='flexible', samples_split_configuration=[[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.5, 0.5, 0.5], [0.5, 0.5, 0.5, 1.0, 1.0, 1.0, 0.5, 0.5, 0.5, 0.0]]` (this corresponds to 50% of the last 3 classes for partner 1, and 50% or 100% of each of the first 9 classes for partner 2).
259+
Note: in the list of percentages per class, the order of the entries should not be interpreted as any human-readable ordering of the labels (e.g. alphabetical or numerical). The implementation uses the order in which the samples appear in the dataset. One can therefore enforce a particular order, if desired, by sorting the dataset beforehand.
258260

259261
![Example of the advanced split option](../../img/advanced_split_example.png)
260262
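To make the `'flexible'` configuration format concrete, here is a minimal standalone sketch of how the per-class percentages map to sample counts. This is illustrative only: `flexible_split_counts` is a hypothetical helper, not part of the mplc API.

```python
import numpy as np

def flexible_split_counts(y, configuration):
    """Return, per partner, the number of samples taken from each class.

    Each row of `configuration` gives, for one partner, the fraction of
    each class's samples that partner receives (mirroring the documented
    'flexible' split behaviour).
    """
    labels = sorted(set(y))
    counts = {label: int(np.sum(np.array(y) == label)) for label in labels}
    per_partner = []
    for fractions in configuration:
        per_partner.append(
            {label: int(counts[label] * frac)
             for label, frac in zip(labels, fractions)})
    return per_partner

# Toy dataset: 10 samples of class 0, 10 samples of class 1
y = [0] * 10 + [1] * 10
config = [[0.5, 0.0],   # partner 1: 50% of class 0, none of class 1
          [0.5, 1.0]]   # partner 2: 50% of class 0, all of class 1
print(flexible_split_counts(y, config))
# [{0: 5, 1: 0}, {0: 5, 1: 10}]
```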

mplc/scenario.py

Lines changed: 9 additions & 2 deletions
@@ -174,6 +174,11 @@ def __init__(
174174
# (% of samples of the dataset for each partner, ...
175175
# ... has to sum to 1, and number of items has to equal partners_count)
176176
self.amounts_per_partner = amounts_per_partner
177+
if np.sum(self.amounts_per_partner) != 1:
178+
raise ValueError("The sum of the amounts_per_partner list you provided isn't equal to 1")
179+
if len(self.amounts_per_partner) != self.partners_count:
180+
raise AttributeError(f"The amounts_per_partner list should have a size ({len(self.amounts_per_partner)}) "
181+
f"equal to partners_count ({self.partners_count})")
177182

178183
# To configure how validation set and test set will be organized.
179184
if test_set in ['local', 'global']:
@@ -341,9 +346,11 @@ def __init__(
341346
self.save_folder = Path(save_path) / self.scenario_name
342347
else:
343348
self.save_folder = None
344-
# ------------------------------------------------------------------
349+
350+
# -------------------------------------------------------------------
345351
# Select in the kwargs the parameters to be transferred to sub object
346-
# ------------------------------------------------------------------
352+
# -------------------------------------------------------------------
353+
347354
self.mpl_kwargs = {}
348355
for key, value in kwargs.items():
349356
if key.startswith('mpl_'):
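The validation this PR adds to `Scenario.__init__` can be sketched in isolation as follows. This is a minimal sketch: `check_amounts` is a hypothetical name, and the exact float equality mirrors the PR's check, which can be brittle (e.g. `0.1 + 0.2 + 0.3 + 0.4 != 1.0` in floating point).

```python
import numpy as np

def check_amounts(amounts_per_partner, partners_count):
    # The amounts must sum to 1 (exact float comparison, as in the PR).
    if np.sum(amounts_per_partner) != 1:
        raise ValueError("The sum of the amounts_per_partner list "
                         "you provided isn't equal to 1")
    # There must be exactly one amount per partner.
    if len(amounts_per_partner) != partners_count:
        raise AttributeError(
            f"The amounts_per_partner list has size "
            f"{len(amounts_per_partner)}, expected partners_count "
            f"({partners_count})")

check_amounts([0.25, 0.25, 0.25, 0.25], 4)  # passes silently

try:
    check_amounts([0.5, 0.6], 2)  # sums to 1.1
except ValueError as e:
    print(e)  # prints the sum error message
```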

mplc/splitter.py

Lines changed: 97 additions & 17 deletions
@@ -23,10 +23,6 @@ def __init__(self, amounts_per_partner, val_set='global', test_set='global', **k
2323
self.dataset = None
2424
self.partners_list = None
2525

26-
# Check the percentages of samples per partner and control its coherence
27-
if np.sum(self.amounts_per_partner) != 1:
28-
raise ValueError("The sum of the amount per partners you provided isn't equal to 1")
29-
3026
@property
3127
def partners_count(self):
3228
return len(self.partners_list)
@@ -37,29 +33,40 @@ def __str__(self):
3733
def split(self, partners_list, dataset):
3834
self.dataset = dataset
3935
self.partners_list = partners_list
40-
if len(self.amounts_per_partner) != self.partners_count:
41-
raise AttributeError(f"The amounts_per_partner list should have a size ({len(self.amounts_per_partner)}) "
42-
f"equals to partners_count ({self.partners_count})")
4336

44-
logger.info("### Splitting data among partners:")
45-
logger.info("Train data split:")
37+
logger.info("Splitting data among partners: starting now.")
38+
self._test_config_coherence()
39+
logger.info("Coherence of config parameters: OK.")
40+
41+
logger.info("Train data split: starting now.")
4642
self._split_train()
4743

4844
if self.val_set == 'local':
49-
logger.info("Validation data split:")
45+
logger.info("Validation data split: starting now.")
5046
self._split_val()
5147

5248
if self.test_set == 'local':
53-
logger.info("Test data split:")
49+
logger.info("Test data split: starting now.")
5450
self._split_test()
5551

5652
for partner in self.partners_list:
5753
logger.info(
58-
f" Partner #{partner.id}: "
59-
f"{partner.final_nb_samples} samples "
60-
f"with labels {partner.labels}"
54+
f"Partner #{partner.id}: {partner.final_nb_samples} samples with labels {partner.labels}"
6155
)
6256

57+
def _test_config_coherence(self):
58+
self._test_amounts_per_partner_total()
59+
self._test_amounts_per_partner_length()
60+
61+
def _test_amounts_per_partner_total(self):
62+
if np.sum(self.amounts_per_partner) != 1:
63+
raise ValueError("The sum of the amounts_per_partner list you provided isn't equal to 1; it has to be.")
64+
65+
def _test_amounts_per_partner_length(self):
66+
if len(self.amounts_per_partner) != self.partners_count:
67+
raise AttributeError(f"The amounts_per_partner list should have a size ({len(self.amounts_per_partner)}) "
68+
f"equal to partners_count ({self.partners_count})")
69+
6370
def _split_train(self):
6471
subsets = self._generate_subset(self.dataset.x_train, self.dataset.y_train)
6572
for idx, p in enumerate(self.partners_list):
@@ -89,8 +96,79 @@ def copy(self):
8996
return self.__copy__()
9097

9198

99+
class FlexibleSplitter(Splitter):
100+
name = 'Fully Flexible Splitter'
101+
102+
def __init__(self, amounts_per_partner, configuration, **kwargs):
103+
104+
logger.info("Proceeding to a flexible split as requested. Please note that the flexible "
105+
"split currently discards the amounts_per_partner (if provided) and infers amounts of samples "
106+
"per partner from the samples_split_configuration provided.")
107+
108+
# First we re-assemble the split configuration per cluster
109+
self.configuration = configuration
110+
self.split_configuration = configuration
111+
self.samples_split_grouped_by_cluster = list(zip(*configuration))
112+
113+
# Init of the superclass to inherit its methods
114+
super().__init__(amounts_per_partner, **kwargs)
115+
116+
def _test_config_coherence(self):
117+
118+
# First, we test if the splitter configuration is coherent with the number of partners
119+
if len(self.split_configuration) != self.partners_count:
120+
raise AttributeError(f"The split configuration should have a size ({len(self.split_configuration)}) "
121+
f"equal to partners_count ({self.partners_count})")
122+
123+
# Second, we test for each class that the amount of samples split across partners is <= 100%
124+
for idx, cluster in enumerate(self.samples_split_grouped_by_cluster):
125+
if np.sum(cluster) > 1:
126+
raise ValueError(f"Amounts of samples of class {idx} split among partners exceed 100%; "
127+
f"the dataset split cannot be performed.")
128+
129+
def _generate_subset(self, x, y):
130+
131+
# Convert raw labels in y to simplify operations on the dataset
132+
lb = LabelEncoder()
133+
y_str = lb.fit_transform([str(label) for label in y])
134+
labels = list(set(y_str))
135+
136+
# Split the datasets (x and y) into subsets of samples of each label (called "clusters")
137+
x_for_cluster, y_for_cluster, nb_samples_per_cluster = {}, {}, {}
138+
for label in labels:
139+
idx_in_full_set = np.where(y_str == label)
140+
x_for_cluster[label] = x[idx_in_full_set]
141+
y_for_cluster[label] = y[idx_in_full_set]
142+
nb_samples_per_cluster[label] = len(y_for_cluster[label])
143+
144+
# Assemble datasets per partner by looping over partners and labels
145+
res = []
146+
nb_samples_split = []
147+
for p_idx, p in enumerate(self.partners_list):
148+
149+
list_arrays_x, list_arrays_y = [], []
150+
151+
for idx, label in enumerate(labels):
152+
nb_samples_to_pick = int(nb_samples_per_cluster[label] * self.samples_split_grouped_by_cluster[idx][
153+
p_idx])
154+
list_arrays_x.append(x_for_cluster[label][:nb_samples_to_pick])
155+
x_for_cluster[label] = x_for_cluster[label][nb_samples_to_pick:]
156+
list_arrays_y.append(y_for_cluster[label][:nb_samples_to_pick])
157+
y_for_cluster[label] = y_for_cluster[label][nb_samples_to_pick:]
158+
159+
res.append((np.concatenate(list_arrays_x), np.concatenate(list_arrays_y)))
160+
nb_samples_split.append(len(np.concatenate(list_arrays_y)))
161+
162+
# Log the relative amounts of samples split among partners
163+
total_nb_samples_split = np.sum(nb_samples_split)
164+
relative_nb_samples = [round(nb / total_nb_samples_split, 2) for nb in nb_samples_split]
165+
logger.info(f"Partners' relative number of samples: {relative_nb_samples}")
166+
167+
return res
168+
169+
92170
class RandomSplitter(Splitter):
93-
name = 'Random samples split'
171+
name = 'Random Splitter'
94172

95173
def _generate_subset(self, x, y):
96174
if self.partners_count == 1:
@@ -107,7 +185,7 @@ def _generate_subset(self, x, y):
107185

108186

109187
class StratifiedSplitter(Splitter):
110-
name = 'Stratified samples split'
188+
name = 'Stratified Splitter'
111189

112190
def _generate_subset(self, x, y):
113191
if self.partners_count == 1:
@@ -124,9 +202,10 @@ def _generate_subset(self, x, y):
124202

125203

126204
class AdvancedSplitter(Splitter):
127-
name = 'Advanced samples split'
205+
name = 'Advanced Splitter'
128206

129207
def __init__(self, amounts_per_partner, configuration, **kwargs):
208+
self.configuration = configuration
130209
self.num_clusters, self.specific_shared = list(zip(*configuration))
131210
super().__init__(amounts_per_partner, **kwargs)
132211

@@ -271,6 +350,7 @@ def _generate_subset(self, x, y):
271350

272351

273352
IMPLEMENTED_SPLITTERS = {
353+
'flexible': FlexibleSplitter,
274354
'random': RandomSplitter,
275355
'stratified': StratifiedSplitter,
276356
'advanced': AdvancedSplitter
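The mechanics of `FlexibleSplitter` — regrouping the per-partner configuration into per-class columns, checking that no class is oversubscribed, then handing out non-overlapping slices — can be sketched on a toy dataset. This is a simplified sketch of the logic shown in the diff above, not the mplc implementation itself.

```python
import numpy as np

# Per-partner rows: fraction of each class's samples that partner gets.
configuration = [[0.5, 0.0],   # partner 1
                 [0.5, 1.0]]   # partner 2

# Regroup into per-class columns, as FlexibleSplitter does with
# list(zip(*configuration)).
per_cluster = list(zip(*configuration))  # [(0.5, 0.5), (0.0, 1.0)]

# Coherence check: no class may be split beyond 100% across partners.
for idx, cluster in enumerate(per_cluster):
    if np.sum(cluster) > 1:
        raise ValueError(f"class {idx} is oversubscribed")

# Toy "samples": class 0 is items 0-9, class 1 is items 10-19.
x_cluster = {0: list(range(10)), 1: list(range(10, 20))}
n_per_cluster = {label: len(s) for label, s in x_cluster.items()}

# Each partner takes its share from the front of what remains of each
# class, so the shares never overlap; the fraction always applies to the
# cluster's *original* size, as in _generate_subset.
partners = []
for p_idx in range(len(configuration)):
    taken = []
    for label, fracs in enumerate(per_cluster):
        n = int(n_per_cluster[label] * fracs[p_idx])
        taken += x_cluster[label][:n]
        x_cluster[label] = x_cluster[label][n:]
    partners.append(taken)

print(partners[0])  # partner 1: the first half of class 0
print(partners[1])  # partner 2: the rest of class 0 plus all of class 1
```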

tests/contrib_end_to_end_test.py

Lines changed: 1 addition & 1 deletion
@@ -27,7 +27,7 @@ def test_titanic_contrib(self):
2727

2828
df = test_utils.get_latest_dataframe("*end_to_end_test*")
2929

30-
# 2 contributivity methods X 2 parters x 2 repeats = 12
30+
# 2 contributivity methods X 2 partners x 2 repeats = 12
3131
assert len(df) == 12
3232

3333
def test_mnist_contrib(self):

tests/unit_tests.py

Lines changed: 43 additions & 5 deletions
@@ -1,6 +1,6 @@
11
# -*- coding: utf-8 -*-
22
"""
3-
This enables to parameterize unit tests - the tests are run by Travis each time you commit to the github repo
3+
This enables parameterizing the unit tests - the tests are run by GitHub Actions each time you commit to the GitHub repo
44
"""
55

66
#########
@@ -55,7 +55,7 @@
5555
from mplc.partner import Partner
5656
from mplc.scenario import Scenario
5757
# create_Mpl uses create_Dataset and create_Contributivity uses create_Scenario
58-
from mplc.splitter import AdvancedSplitter, RandomSplitter, StratifiedSplitter
58+
from mplc.splitter import FlexibleSplitter, AdvancedSplitter, RandomSplitter, StratifiedSplitter
5959

6060

6161
######
@@ -91,7 +91,12 @@ def create_MultiPartnerLearning(create_all_datasets):
9191
@pytest.fixture(scope="class", params=(RandomSplitter([0.1, 0.2, 0.3, 0.4]),
9292
StratifiedSplitter([0.1, 0.2, 0.3, 0.4]),
9393
AdvancedSplitter([0.3, 0.5, 0.2],
94-
[[4, "specific"], [6, "shared"], [4, "shared"]])))
94+
[[4, "specific"], [6, "shared"], [4, "shared"]]),
95+
FlexibleSplitter([1.0, 0.0, 0.0], [
96+
[0.33, 0.33, 0.33, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0],
97+
[0.33, 0.33, 0.33, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0],
98+
[0.33, 0.33, 0.33, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0],
99+
])))
95100
def create_splitter(request):
96101
return request.param()
97102

@@ -113,13 +118,15 @@ def create_Partner(create_all_datasets):
113118
['not-corrupted'] * 3),
114119
(Cifar10, "random", ['not-corrupted'] * 3),
115120
(Cifar10,
116-
AdvancedSplitter([0.3, 0.5, 0.2], [[4, "specific"], [6, "shared"], [4, "shared"]]),
121+
FlexibleSplitter([0.3, 0.5, 0.2], [[0.33, 0.33, 0.33, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0],
122+
[0.33, 0.33, 0.33, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0],
123+
[0.33, 0.33, 0.33, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0]]),
117124
['not-corrupted'] * 3)),
118125
ids=['Mnist - basic',
119126
'Mnist - basic - corrupted',
120127
'Mnist - advanced',
121128
'Cifar10 - basic',
122-
'Cifar10 - advanced'])
129+
'Cifar10 - flex'])
123130
def create_Scenario(request):
124131
dataset = request.param[0]()
125132
samples_split_option = request.param[1]
@@ -366,6 +373,37 @@ def test_advanced_splitter_local(self, create_all_datasets):
366373
with pytest.raises(Exception):
367374
splitter.split(partners_list, dataset)
368375

376+
def test_flexible_splitter_global(self, create_all_datasets):
377+
dataset = create_all_datasets
378+
splitter = FlexibleSplitter([0.3, 0.3, 0.4], configuration=[
379+
[0.33, 0.33, 0.33, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0],
380+
[0.33, 0.33, 0.33, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0],
381+
[0.33, 0.33, 0.33, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0]])
382+
partners_list = [Partner(i) for i in range(len(splitter.amounts_per_partner))]
383+
if dataset.num_classes == 10:
384+
splitter.split(partners_list, dataset)
385+
for p in partners_list:
386+
assert len(p.y_val) == 0, "validation set is not empty in spite of the val_set == 'global'"
387+
assert len(p.y_test) == 0, "test set is not empty in spite of test_set == 'global'"
388+
assert len(p.x_train) == len(p.y_train), 'numbers of samples and labels mismatch'
389+
assert len(p.labels) < dataset.num_classes, f'Partner {p.id} has all labels.'
390+
391+
def test_flexible_splitter_local(self, create_all_datasets):
392+
dataset = create_all_datasets
393+
splitter = FlexibleSplitter([0.3, 0.3, 0.4], configuration=[
394+
[0.33, 0.33, 0.33, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0],
395+
[0.33, 0.33, 0.33, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0],
396+
[0.33, 0.33, 0.33, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0]],
397+
val_set='local', test_set='local')
398+
partners_list = [Partner(i) for i in range(len(splitter.amounts_per_partner))]
399+
if dataset.num_classes == 10:
400+
splitter.split(partners_list, dataset)
401+
for p in partners_list:
402+
assert len(p.y_val) > 0, "validation set is empty in spite of the val_set == 'local'"
403+
assert len(p.y_test) > 0, "test set is empty in spite of test_set == 'local'"
404+
assert len(p.x_train) == len(p.y_train), 'numbers of samples and labels mismatch'
405+
assert len(p.labels) < dataset.num_classes, f'Partner {p.id} has all labels.'
406+
369407

370408
######
371409
#
