Libraries Upgrade et al.#150
Conversation
| plt.plot(fpr, tpr, color="darkorange", lw=2, label=f"ROC curve (AUC = {roc_auc:.4f})") | ||
| plt.plot([0, 1], [0, 1], color="navy", lw=2, linestyle="--") | ||
| plt.xlim([0.0, 1.0]) | ||
| plt.ylim([0.0, 1.05]) |
There was a problem hiding this comment.
Mypy complained about this.
| } | ||
| targets: dict[str, np.ndarray] = { | ||
| DataSplit.TRAIN.value: data[[table_metadata.target_column_name]].values.astype(np.float32) | ||
| } |
There was a problem hiding this comment.
Mypy complained about the typing.
| if ( | ||
| is_string_dtype(dataframe[column_name]) | ||
| or is_object_dtype(dataframe[column_name]) | ||
| or (is_column_type_numerical(dataframe, column_name) and dataframe[column_name].nunique() <= threshold) |
There was a problem hiding this comment.
You'll see this throughout, but Pandas typing for strings has changed quite a bit. Now they have a separate string type you can use, along with the object type. So this catches that.
| self.validate_dataframe_dtypes(real_data) | ||
| self.validate_dataframe_dtypes(synthetic_data) | ||
| if holdout_data is not None: | ||
| self.validate_dataframe_dtypes(holdout_data) |
There was a problem hiding this comment.
Added these validations to make sure that string types are not sneaking into our metrics computations. These were added wherever strings would cause issues.
| do_preprocessing=False, | ||
| verbose=False, | ||
| nn_dist=self.norm.value, | ||
| plot_figures=False, |
There was a problem hiding this comment.
In the new SynthEval, they generated plots by default, which was very annoying. This shuts them off.
| ) | ||
|
|
||
| return self.syntheval_metric.evaluate(self.confidence_level.value) | ||
| return self.syntheval_metric.evaluate(ci="sem", confidence=self.confidence_level.value) |
There was a problem hiding this comment.
New upgrade changed this method signature a bit. So we're forcing consistency here.
There was a problem hiding this comment.
For some reason, SynthEval changed the way they report the metric output here from nested dictionaries to pandas dataframes and changed a bunch of the key names. Very annoying, but these changes just address that new reporting structure and keep our output structures the same.
| if pd.api.types.is_string_dtype(data_splits.train_data.data[column_name]) or pd.api.types.is_object_dtype( | ||
| data_splits.train_data.data[column_name] | ||
| ): | ||
| data_splits.train_data.data.loc[column_data == "?", column_name] = "nan" |
There was a problem hiding this comment.
Pandas now does some strong type checking. If you have a dataframe with an integer type (for example) and you try to set it's value to a string, it will blow up. This happens even when there are no "?" strings in the column (as would be the case for an integer column).
| end = min((i + 1) * batch_size, a_array.shape[0]) | ||
| distance, search_indices = index.search(a_array[start:end], k=1) | ||
| index.remove_ids(search_indices.flatten()) | ||
| index.remove_ids(search_indices.flatten()) # type: ignore |
There was a problem hiding this comment.
I could not, for the life of me, figure out the right type here. The faiss library is a mess.
| shutil.rmtree(results_dir, ignore_errors=True) | ||
|
|
||
|
|
||
| @pytest.mark.integration_test() |
There was a problem hiding this comment.
This wasn't marked integration, even though it is 🙂
|
|
||
|
|
||
| def test_train_attack_classifier_mismatched_data(attack_data): | ||
| @pytest.mark.parametrize("classifier_type", [ClassifierType.XGBOOST, ClassifierType.CATBOOST]) |
There was a problem hiding this comment.
Mocking these to make it faster.
| mapped_holdout_data = HOLDOUT_DATA.replace({"cat": 0, "horse": 1, "dog": 2}) | ||
| mapped_real_data = REAL_DATA.replace({"cat": 0, "horse": 1, "dog": 2}).astype(int) | ||
| mapped_synthetic_data = SYNTHETIC_DATA.replace({"cat": 0, "horse": 1, "dog": 2}).astype(int) | ||
| mapped_holdout_data = HOLDOUT_DATA.replace({"cat": 0, "horse": 1, "dog": 2}).astype(int) |
There was a problem hiding this comment.
This replace, for some reason, doesn't reclassify the column as an integer, even though it maps all categorical values to integers and seems like it should. So forcing it.
| # Now add some NaNs to the train and validation splits | ||
| dataset.numerical_features["train"][0, 1] = np.NaN | ||
| dataset.numerical_features["val"][1, 1] = np.NaN | ||
| dataset.numerical_features["train"][0, 1] = np.nan |
There was a problem hiding this comment.
np.NaN is no longer supported.
| import os | ||
|
|
||
|
|
||
| os.environ.setdefault("OMP_NUM_THREADS", "1") |
There was a problem hiding this comment.
With the upgrades, there is a hanging/segfault issue with pytest and xgboost where we get nested thread spawning. So this variable forces pytest to use single threads.
@lotif: If you have a better idea definitely let me know!
| tests/integration/assets/tabsyn/results | ||
|
|
||
| # Emitted SynthEval analysis config file during metric creation. Unfortunately cannot be turned off... | ||
| SE_analysis_config.json |
There was a problem hiding this comment.
New Syntheval emits this json file and it cannot be disabled after looking at their code...
| "Label column should not be included in the set of numerical columns provided" | ||
| "Label column should not be included in the set of categorical columns provided" | ||
| ) | ||
| categorical_columns = categorical_columns + [label_column] |
There was a problem hiding this comment.
Adding the label column here to make sure it gets preprocessed with the cat columns, but maintain the entry points of having label column separate to make it backwards consistent and also match the corresponding regression metric.
PR Type
Fix
Short Description
Clickup Ticket(s): N/A
Upgrading the libraries to address vulnerabilities. This required some code changes with the new libraries functionality and new APIs. None of the fixes were major, but did require some digging and test fixes.
Tests Added
Test fixes associated with the required changes were made. Some small changes to the deterministic tests were also required with the move to single threading for the tests.