Libraries Upgrade et al. by emersodb · Pull Request #150 · VectorInstitute/midst-toolkit

emersodb · 2026-06-29T13:20:33Z

PR Type

Fix

Short Description

Clickup Ticket(s): N/A

Upgrading the libraries to address vulnerabilities. This required some code changes to accommodate the new libraries' functionality and APIs. None of the fixes were major, but they did require some digging and test fixes.

Tests Added

Test fixes associated with the required changes were made. Some small changes to the deterministic tests were also required by the move to single-threaded test runs.

emersodb · 2026-06-29T13:20:49Z

    plt.plot(fpr, tpr, color="darkorange", lw=2, label=f"ROC curve (AUC = {roc_auc:.4f})")
    plt.plot([0, 1], [0, 1], color="navy", lw=2, linestyle="--")
-    plt.xlim([0.0, 1.0])
-    plt.ylim([0.0, 1.05])


Mypy complained about this.

emersodb · 2026-06-29T13:21:04Z

+    }
+    targets: dict[str, np.ndarray] = {
+        DataSplit.TRAIN.value: data[[table_metadata.target_column_name]].values.astype(np.float32)
+    }


Mypy complained about the typing.

emersodb · 2026-06-29T13:22:10Z

+        if (
+            is_string_dtype(dataframe[column_name])
+            or is_object_dtype(dataframe[column_name])
+            or (is_column_type_numerical(dataframe, column_name) and dataframe[column_name].nunique() <= threshold)


You'll see this throughout, but Pandas typing for strings has changed quite a bit. Now they have a separate string type you can use, along with the object type. So this catches that.
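
As a rough illustration of the check above, the sketch below treats a column as text when either pandas predicate matches. The helper name `is_text_like` and the sample columns are assumptions made for illustration; they are not part of the toolkit.

```python
import pandas as pd
from pandas.api.types import is_object_dtype, is_string_dtype


def is_text_like(series: pd.Series) -> bool:
    """Return True for columns stored as the dedicated string dtype or as generic objects."""
    return bool(is_string_dtype(series) or is_object_dtype(series))


# Hypothetical sample columns covering the three cases of interest.
string_column = pd.Series(["cat", "dog"], dtype="string")
object_column = pd.Series(["cat", "dog"], dtype=object)
integer_column = pd.Series([1, 2, 3])

for name, column in [("string", string_column), ("object", object_column), ("integer", integer_column)]:
    print(name, column.dtype, is_text_like(column))
```

Checking both predicates means the result does not depend on which of the two storage types pandas picked for a given text column.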

emersodb · 2026-06-29T13:22:55Z

+        self.validate_dataframe_dtypes(real_data)
+        self.validate_dataframe_dtypes(synthetic_data)
+        if holdout_data is not None:
+            self.validate_dataframe_dtypes(holdout_data)


Added these validations to make sure that string types are not sneaking into our metrics computations. These were added wherever strings would cause issues.
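
For readers who have not opened the diff, a fail-fast validation of this kind could look roughly like the sketch below. It is not the body of `validate_dataframe_dtypes`; the function names, the error type, and the message are assumptions chosen for illustration.

```python
import pandas as pd
from pandas.api.types import is_object_dtype, is_string_dtype


def find_text_columns(dataframe: pd.DataFrame) -> list[str]:
    """List the columns whose dtype is string-like or generic object."""
    return [
        str(name)
        for name in dataframe.columns
        if is_string_dtype(dataframe[name]) or is_object_dtype(dataframe[name])
    ]


def validate_no_text_columns(dataframe: pd.DataFrame) -> None:
    """Raise before any metric computation if text columns are present (hypothetical helper)."""
    offending = find_text_columns(dataframe)
    if offending:
        raise ValueError(f"Columns must be numerically encoded before computing metrics: {offending}")


# Hypothetical frames: one fully numeric, one with a text column left in.
numeric_frame = pd.DataFrame({"a": [1, 2], "b": [0.5, 1.5]})
text_frame = pd.DataFrame({"a": [1, 2], "animal": ["cat", "dog"]})

validate_no_text_columns(numeric_frame)  # passes silently
```

Raising at the entry point keeps the failure close to its cause instead of surfacing later inside a distance or statistics routine.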

emersodb · 2026-06-29T13:23:23Z

            do_preprocessing=False,
            verbose=False,
            nn_dist=self.norm.value,
+            plot_figures=False,


The new SynthEval generates plots by default, which was very annoying. This shuts them off.

emersodb · 2026-06-29T13:23:56Z

        )

-        return self.syntheval_metric.evaluate(self.confidence_level.value)
+        return self.syntheval_metric.evaluate(ci="sem", confidence=self.confidence_level.value)


New upgrade changed this method signature a bit. So we're forcing consistency here.

emersodb · 2026-06-29T13:25:41Z

For some reason, SynthEval changed the way they report the metric output here from nested dictionaries to pandas dataframes and changed a bunch of the key names. Very annoying, but these changes just address that new reporting structure and keep our output structures the same.

emersodb · 2026-06-29T13:29:30Z

+        if pd.api.types.is_string_dtype(data_splits.train_data.data[column_name]) or pd.api.types.is_object_dtype(
+            data_splits.train_data.data[column_name]
+        ):
+            data_splits.train_data.data.loc[column_data == "?", column_name] = "nan"


Pandas now does some strong type checking. If you have a dataframe with an integer type (for example) and you try to set its value to a string, it will blow up. This happens even when there are no "?" strings in the column (as would be the case for an integer column).
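
A minimal sketch of the guard, assuming a small made-up frame: the placeholder replacement only runs on columns that can hold strings, so numeric columns are never touched. The helper `replace_placeholder` and the column names are hypothetical.

```python
import pandas as pd
from pandas.api.types import is_object_dtype, is_string_dtype


def replace_placeholder(
    dataframe: pd.DataFrame, column_name: str, placeholder: str = "?", replacement: str = "nan"
) -> None:
    """Replace placeholder strings in place, skipping columns that cannot hold strings."""
    column = dataframe[column_name]
    if is_string_dtype(column) or is_object_dtype(column):
        dataframe.loc[column == placeholder, column_name] = replacement


# Hypothetical data: one integer column and one text column containing a "?".
frame = pd.DataFrame({"age": [31, 45, 27], "workclass": ["private", "?", "state"]})
for name in frame.columns:
    replace_placeholder(frame, name)

print(frame)
```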

emersodb · 2026-06-29T13:29:54Z

            end = min((i + 1) * batch_size, a_array.shape[0])
            distance, search_indices = index.search(a_array[start:end], k=1)
-            index.remove_ids(search_indices.flatten())
+            index.remove_ids(search_indices.flatten())  # type: ignore


I could not, for the life of me, figure out the right type here. The faiss library is a mess.

emersodb · 2026-06-29T13:30:17Z

    shutil.rmtree(results_dir, ignore_errors=True)


+@pytest.mark.integration_test()


This wasn't marked integration, even though it is 🙂

emersodb · 2026-06-29T13:30:30Z



-def test_train_attack_classifier_mismatched_data(attack_data):
+@pytest.mark.parametrize("classifier_type", [ClassifierType.XGBOOST, ClassifierType.CATBOOST])


Mocking these to make it faster.

emersodb · 2026-06-29T13:31:14Z

-    mapped_holdout_data = HOLDOUT_DATA.replace({"cat": 0, "horse": 1, "dog": 2})
+    mapped_real_data = REAL_DATA.replace({"cat": 0, "horse": 1, "dog": 2}).astype(int)
+    mapped_synthetic_data = SYNTHETIC_DATA.replace({"cat": 0, "horse": 1, "dog": 2}).astype(int)
+    mapped_holdout_data = HOLDOUT_DATA.replace({"cat": 0, "horse": 1, "dog": 2}).astype(int)


This replace, for some reason, doesn't reclassify the column as an integer, even though it maps all categorical values to integers and seems like it should. So forcing it.
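
The pattern can be reproduced on a small stand-in frame. The data below is made up, and the column is stored as object dtype on purpose so that the explicit cast is what determines the final dtype.

```python
import pandas as pd

CATEGORY_CODES = {"cat": 0, "horse": 1, "dog": 2}

# Hypothetical stand-in for the test fixtures.
animals = pd.DataFrame({"animal": pd.Series(["cat", "horse", "dog", "cat"], dtype=object)})

# Map every category to its integer code, then force an integer dtype explicitly
# rather than relying on replace() to infer one.
mapped = animals.replace(CATEGORY_CODES).astype(int)

print(mapped.dtypes)
```

The explicit `astype(int)` makes the resulting dtype independent of whatever inference `replace` does or does not perform.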

emersodb · 2026-06-29T13:31:54Z

    # Now add some NaNs to the train and validation splits
-    dataset.numerical_features["train"][0, 1] = np.NaN
-    dataset.numerical_features["val"][1, 1] = np.NaN
+    dataset.numerical_features["train"][0, 1] = np.nan


np.NaN is no longer supported: NumPy 2.0 removed the alias, and np.nan is the spelling to use.
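
A short sketch of the lowercase spelling in use, on a made-up float array (the array shape and names are assumptions, not the test fixture):

```python
import numpy as np

# Hypothetical feature matrix; it must be a float dtype to hold NaN.
features = np.zeros((3, 2), dtype=np.float32)
features[0, 1] = np.nan

# NaN never compares equal to itself, so use np.isnan to locate missing entries.
missing_mask = np.isnan(features)
print(missing_mask)
```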

emersodb · 2026-06-29T13:32:55Z

+import os
+
+
+os.environ.setdefault("OMP_NUM_THREADS", "1")


With the upgrades, there is a hanging/segfault issue with pytest and xgboost where we get nested thread spawning. So this variable limits OpenMP to a single thread while the tests run.
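
One detail worth spelling out is that `setdefault` only fills in the value when the variable is missing, so a developer who exports their own `OMP_NUM_THREADS` keeps it. The sketch below shows that behaviour on plain dictionaries; the helper name `pin_openmp_threads` is hypothetical.

```python
import os
from collections.abc import MutableMapping


def pin_openmp_threads(environment: MutableMapping[str, str], thread_count: str = "1") -> str:
    """Set OMP_NUM_THREADS only if it is not already defined and return the effective value."""
    return environment.setdefault("OMP_NUM_THREADS", thread_count)


# Plain dicts stand in for os.environ so the behaviour is easy to inspect.
unset_environment: dict[str, str] = {}
preset_environment = {"OMP_NUM_THREADS": "4"}

effective_when_unset = pin_openmp_threads(unset_environment)    # default is applied
effective_when_preset = pin_openmp_threads(preset_environment)  # existing value is kept

# In a conftest.py the same call would target the real process environment.
effective_for_process = pin_openmp_threads(os.environ)
```

As far as I know, the variable needs to be set before the libraries that use OpenMP are first imported, which is why it sits at the very top of the file.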

@lotif: If you have a better idea definitely let me know!

emersodb · 2026-06-29T13:33:21Z

 tests/integration/assets/tabsyn/results

+# Emitted SynthEval analysis config file during metric creation. Unfortunately cannot be turned off...
+SE_analysis_config.json


The new SynthEval emits this JSON file and, from what I can tell after looking at their code, it cannot be disabled...

emersodb · 2026-06-29T13:57:14Z

-            "Label column should not be included in the set of numerical columns provided"
+            "Label column should not be included in the set of categorical columns provided"
        )
+        categorical_columns = categorical_columns + [label_column]


Adding the label column here to make sure it gets preprocessed with the categorical columns, while keeping the label column as a separate entry point so that the interface stays backwards compatible and also matches the corresponding regression metric.
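
In isolation, the idea is roughly the following. This is a sketch rather than the metric's actual code: the function name and the use of `ValueError` are assumptions, while the error message is the one from the diff.

```python
def merge_label_into_categoricals(categorical_columns: list[str], label_column: str) -> list[str]:
    """Return the categorical columns with the label appended, keeping the label a separate argument."""
    if label_column in categorical_columns:
        raise ValueError("Label column should not be included in the set of categorical columns provided")
    # Build a new list so the caller's list is not mutated.
    return categorical_columns + [label_column]


# Hypothetical column names for illustration.
user_columns = ["colour", "animal"]
columns_to_preprocess = merge_label_into_categoricals(user_columns, "label")
print(columns_to_preprocess)
```

Callers keep passing the label separately, and the merge happens internally just before preprocessing.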

…e_libraries

Upgrading libraries to address vulnerabilities and the associated cha…

1739f18

…nges required

emersodb requested review from lotif and sarakodeiri June 29, 2026 13:20

emersodb commented Jun 29, 2026

Reverting the f1 score metric entry points to previous setup

8a1173a

emersodb commented Jun 29, 2026

emersodb added 3 commits June 29, 2026 10:05

Upgrading setup-python

c5eeb50

Upgrading checkout

8bd398e

Merge branch 'marcelo/tabsyn-ensemble' into dbe/tabsyn_example_upgrad…

a9265a1

…e_libraries
