Skip to content

Libraries Upgrade et al.#150

Open
emersodb wants to merge 5 commits into
marcelo/tabsyn-ensemblefrom
dbe/tabsyn_example_upgrade_libraries
Open

Libraries Upgrade et al.#150
emersodb wants to merge 5 commits into
marcelo/tabsyn-ensemblefrom
dbe/tabsyn_example_upgrade_libraries

Conversation

@emersodb

Copy link
Copy Markdown
Collaborator

PR Type

Fix

Short Description

Clickup Ticket(s): N/A

Upgrading the libraries to address vulnerabilities. This required some code changes with the new libraries functionality and new APIs. None of the fixes were major, but did require some digging and test fixes.

Tests Added

Test fixes associated with the required changes were made. Some small changes to the deterministic tests were also required with the move to single threading for the tests.

@emersodb emersodb requested review from lotif and sarakodeiri June 29, 2026 13:20
plt.plot(fpr, tpr, color="darkorange", lw=2, label=f"ROC curve (AUC = {roc_auc:.4f})")
plt.plot([0, 1], [0, 1], color="navy", lw=2, linestyle="--")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mypy complained about this.

}
targets: dict[str, np.ndarray] = {
DataSplit.TRAIN.value: data[[table_metadata.target_column_name]].values.astype(np.float32)
}

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mypy complained about the typing.

if (
is_string_dtype(dataframe[column_name])
or is_object_dtype(dataframe[column_name])
or (is_column_type_numerical(dataframe, column_name) and dataframe[column_name].nunique() <= threshold)

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You'll see this throughout, but Pandas typing for strings has changed quite a bit. Now they have a separate string type you can use, along with the object type. So this catches that.

self.validate_dataframe_dtypes(real_data)
self.validate_dataframe_dtypes(synthetic_data)
if holdout_data is not None:
self.validate_dataframe_dtypes(holdout_data)

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added these validations to make sure that string types are not sneaking into our metrics computations. These were added wherever strings would cause issues.

do_preprocessing=False,
verbose=False,
nn_dist=self.norm.value,
plot_figures=False,

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the new SynthEval, they generated plots by default, which was very annoying. This shuts them off.

)

return self.syntheval_metric.evaluate(self.confidence_level.value)
return self.syntheval_metric.evaluate(ci="sem", confidence=self.confidence_level.value)

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New upgrade changed this method signature a bit. So we're forcing consistency here.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For some reason, SynthEval changed the way they report the metric output here from nested dictionaries to pandas dataframes and changed a bunch of the key names. Very annoying, but these changes just address that new reporting structure and keep our output structures the same.

if pd.api.types.is_string_dtype(data_splits.train_data.data[column_name]) or pd.api.types.is_object_dtype(
data_splits.train_data.data[column_name]
):
data_splits.train_data.data.loc[column_data == "?", column_name] = "nan"

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pandas now does some strong type checking. If you have a dataframe with an integer type (for example) and you try to set it's value to a string, it will blow up. This happens even when there are no "?" strings in the column (as would be the case for an integer column).

end = min((i + 1) * batch_size, a_array.shape[0])
distance, search_indices = index.search(a_array[start:end], k=1)
index.remove_ids(search_indices.flatten())
index.remove_ids(search_indices.flatten()) # type: ignore

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could not, for the life of me, figure out the right type here. The faiss library is a mess.

shutil.rmtree(results_dir, ignore_errors=True)


@pytest.mark.integration_test()

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This wasn't marked integration, even though it is 🙂



def test_train_attack_classifier_mismatched_data(attack_data):
@pytest.mark.parametrize("classifier_type", [ClassifierType.XGBOOST, ClassifierType.CATBOOST])

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mocking these to make it faster.

mapped_holdout_data = HOLDOUT_DATA.replace({"cat": 0, "horse": 1, "dog": 2})
mapped_real_data = REAL_DATA.replace({"cat": 0, "horse": 1, "dog": 2}).astype(int)
mapped_synthetic_data = SYNTHETIC_DATA.replace({"cat": 0, "horse": 1, "dog": 2}).astype(int)
mapped_holdout_data = HOLDOUT_DATA.replace({"cat": 0, "horse": 1, "dog": 2}).astype(int)

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This replace, for some reason, doesn't reclassify the column as an integer, even though it maps all categorical values to integers and seems like it should. So forcing it.

# Now add some NaNs to the train and validation splits
dataset.numerical_features["train"][0, 1] = np.NaN
dataset.numerical_features["val"][1, 1] = np.NaN
dataset.numerical_features["train"][0, 1] = np.nan

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

np.NaN is no longer supported.

Comment thread tests/conftest.py
import os


os.environ.setdefault("OMP_NUM_THREADS", "1")

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the upgrades, there is a hanging/segfault issue with pytest and xgboost where we get nested thread spawning. So this variable forces pytest to use single threads.

@lotif: If you have a better idea definitely let me know!

Comment thread .gitignore
tests/integration/assets/tabsyn/results

# Emitted SynthEval analysis config file during metric creation. Unfortunately cannot be turned off...
SE_analysis_config.json

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New Syntheval emits this json file and it cannot be disabled after looking at their code...

"Label column should not be included in the set of numerical columns provided"
"Label column should not be included in the set of categorical columns provided"
)
categorical_columns = categorical_columns + [label_column]

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding the label column here to make sure it gets preprocessed with the cat columns, but maintain the entry points of having label column separate to make it backwards consistent and also match the corresponding regression metric.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant