WIP: fix: updates tests for bigframes package#16525
WIP: fix: updates tests for bigframes package#16525chalmerlowe wants to merge 5 commits intomainfrom
Conversation
…s & value fluctuations
…mponents & value fluctuations
There was a problem hiding this comment.
Code Review
This pull request updates system tests for K-Means, PCA, and ARIMA models. Key changes include implementing helper functions to sort categorical lists to ensure test stability, increasing relative tolerance (rtol) for PCA tests to accommodate numerical drift, and adding missing metrics to ARIMA score tests. Feedback highlights a potential logic error in the ARIMA test where the expected Mean Absolute Scaled Error is set to 0.0 despite non-zero RMSE, and identifies code duplication between the K-Means and PCA test files, suggesting that the categorical sorting logic should be centralized in a shared utility module.
| "root_mean_squared_error": [120.675442, 120.675442], | ||
| "mean_absolute_percentage_error": [4.80044, 4.80044], | ||
| "symmetric_mean_absolute_percentage_error": [4.744332, 4.744332], | ||
| "mean_absolute_scaled_error": [0.0, 0.0], |
There was a problem hiding this comment.
The expected value for mean_absolute_scaled_error is set to 0.0, which is inconsistent with the non-zero root_mean_squared_error of 120.675442. Since a non-zero RMSE implies a non-zero Mean Absolute Error (MAE), the Mean Absolute Scaled Error (MASE) should also be non-zero. This suggests the expected value of 0.0 is likely incorrect. This issue is also present at lines 493, 580, and 596.
| def sort_categorical_lists(val): | ||
| if isinstance(val, list) and len(val) > 0: | ||
| return sorted(val, key=lambda x: x["category"]) | ||
| return val |
There was a problem hiding this comment.
This helper function for sorting categorical lists is duplicated in packages/bigframes/tests/system/small/ml/test_decomposition.py (where it is named sort_categorical). To improve maintainability and ensure consistency across the test suite, consider moving this logic to a shared utility module, such as bigframes.testing.utils.
References
- To ensure dictionary keys remain sorted without manual effort, programmatically sort the dictionary before returning it (e.g., using
dict(sorted(metadata.items()))) instead of relying on manual ordering in the code.
| def sort_categorical(val): | ||
| if isinstance(val, list) and len(val) > 0: | ||
| return sorted(val, key=lambda x: x["category"]) | ||
| return val |
There was a problem hiding this comment.
This helper function is duplicated in packages/bigframes/tests/system/small/ml/test_cluster.py (where it is named sort_categorical_lists). Consider centralizing this logic in a shared utility module like bigframes.testing.utils.
References
- To ensure dictionary keys remain sorted without manual effort, programmatically sort the dictionary before returning it (e.g., using
dict(sorted(metadata.items()))) instead of relying on manual ordering in the code.
| expected, | ||
| check_exact=False, | ||
| rtol=0.1, | ||
| rtol=0.2, # FIX: Slightly increased rtol for numerical drift (from 0.1) |
There was a problem hiding this comment.
Increasing the relative tolerance (rtol) from 0.1 to 0.2 is a significant change (100% increase). While numerical drift is common in ML system tests, a 20% relative error tolerance is quite high and might mask regressions. This change is applied throughout this file; consider if a tighter tolerance can be maintained or if the source of the drift can be addressed.
WIP: First crack at resolving a number of the concerns in ISSUE #16489