
refactor(_mf_class.py): improve code quality and fix potential issues#42

Open
DataboyUsen wants to merge 1 commit into softmin:main from DataboyUsen:main

Conversation

@DataboyUsen
Contributor

  • Replace random_state handling with proper rng instance initialization
  • Remove pandas dependency by using numpy-only operations
  • Fix convergence check: apply abs() to obj_diff before comparison
  • Properly move the initialization of model parameters to fit() method
  • Fix spelling errors in comments and docstrings

@codecov

codecov bot commented Apr 3, 2026

Codecov Report

❌ Patch coverage is 0% with 24 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| rehline/_mf_class.py | 0.00% | 24 Missing ⚠️ |


@statmlben
Member

Thank you @DataboyUsen

Let me just copy-paste the suggestions generated by opus 4.6: most of the comments are valid, but please check P4 very carefully to see whether it is correct or not. Thx!

1. BUG: Shared mutable list via [...] * n (Critical)

Location: _mf_class.py:278 and _mf_class.py:285

self.Iu = [np.array([], dtype=int)] * self.n_users  # line 278
self.Ui = [np.array([], dtype=int)] * self.n_items  # line 285

[obj] * n creates a list of n references to the same np.array object. Although the bug is masked by the subsequent for loop which reassigns self.Iu[u] = idxs for every user that has ratings, any user ID not present in the training data will share the same underlying array object. If one cold-start user's entry is later mutated in-place, all cold-start users' entries would be affected.

Fix: Use a list comprehension to create independent arrays:

self.Iu = [np.array([], dtype=int) for _ in range(self.n_users)]
self.Ui = [np.array([], dtype=int) for _ in range(self.n_items)]
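The aliasing is easy to demonstrate in isolation (a standalone numpy demo, not code from this repo):

```python
import numpy as np

# [obj] * 3 repeats the *reference*: all three slots point at one array.
shared = [np.zeros(1, dtype=int)] * 3
shared[0][0] = 7          # in-place mutation through slot 0...
print(shared[1][0])       # 7 -- ...is visible through slot 1 as well

# A list comprehension evaluates the expression once per slot,
# producing independent arrays.
independent = [np.zeros(1, dtype=int) for _ in range(3)]
independent[0][0] = 7
print(independent[1][0])  # 0 -- slot 1 is untouched
```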

2. sklearn Estimator Contract Violations (Medium-High)

2a. Validation logic in __init__ (lines 202-213)

sklearn convention requires that __init__ only assigns parameters — validation should happen in fit(). The current validation in __init__ prevents set_params() from working correctly in grid search scenarios.

Fix: Move the validation checks from __init__ to the beginning of fit().
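A minimal sketch of that split, using hypothetical parameter names (the real class takes more arguments):

```python
class MFSketch:
    """Toy estimator illustrating the __init__/fit split sklearn expects."""

    def __init__(self, n_users=1, n_items=1, C=1.0):
        # __init__ only stores hyperparameters verbatim -- no validation --
        # so set_params()/clone() can freely rewrite them between fits.
        self.n_users = n_users
        self.n_items = n_items
        self.C = C

    def fit(self, X, y):
        # Validation happens here, on the values current at fit time.
        if self.n_users < 1 or self.n_items < 1:
            raise ValueError("n_users and n_items must be positive")
        if self.C <= 0:
            raise ValueError("C must be positive")
        # ... actual fitting would go here ...
        return self
```

With this layout, a grid search can construct the estimator once and repair any parameter before calling fit(), which the current in-constructor checks would forbid.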

2b. Default parameter mutation: None → [] (lines 220-221)

self.constraint_user = constraint_user if constraint_user is not None else []

get_params() returns [] but the constructor default is None. This breaks the round-trip identity required by sklearn.base.clone().

Fix: Store None as-is and handle the None → [] conversion in fit():

self.constraint_user = constraint_user  # store as-is
# In fit():
constraint_user = self.constraint_user if self.constraint_user is not None else []
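A self-contained illustration of why the coercion breaks the round-trip (toy classes, not the real estimator):

```python
class CoercingInit:
    def __init__(self, constraint_user=None):
        # BUG: the stored value no longer equals the constructor argument,
        # so get_params() -> re-construct -> get_params() is not a fixpoint,
        # which is exactly what sklearn.base.clone() checks.
        self.constraint_user = constraint_user if constraint_user is not None else []

class StoringInit:
    def __init__(self, constraint_user=None):
        self.constraint_user = constraint_user  # store verbatim

print(CoercingInit().constraint_user)         # [] -- None was replaced
print(StoringInit().constraint_user is None)  # True -- round-trips cleanly
```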

2c. Derived attribute self.rng in __init__ (lines 231-234)

self.rng is derived from random_state but won't update if random_state is changed via set_params().

Fix: Create the RNG in fit() instead:

# In __init__:
self.random_state = random_state  # just store

# In fit():
rng = np.random.default_rng(self.random_state)
self.P = rng.normal(...)

2d. Pre-initialized fitted attributes (lines 235-238)

self.P, self.Q, self.bu, self.bi are set to None in __init__. These are fitted attributes that should only be created in fit().

Fix: Remove from __init__; they are already assigned in fit() at lines 302-305.


3. Code Duplication: User-side vs Item-side update blocks (Medium)

Location: Lines 311-375 (user update) and 378-442 (item update)

These two blocks are nearly identical — only the roles of P/Q, bu/bi, Iu/Ui, and C_user/C_item are swapped. This ~130 lines of duplicated logic makes maintenance error-prone.

Fix: Extract a helper method _update_one_side(...):

def _update_one_side(self, X, y, factor_matrix, bias_vector, other_factor,
                     other_bias, index_lists, C_side, constraint, is_biased):
    """Update one side (user or item) of the factorization."""
    for idx in range(factor_matrix.shape[0]):
        index_tmp = index_lists[idx]
        # ... shared logic ...

4. obj() recomputes loss parameters unnecessarily (Low-Medium)

Location: Lines 515-516

U, V, Tau, S, T = _make_loss_rehline_param(loss=self.loss, X=X, y=y)

Inside obj(), _make_loss_rehline_param is called with the raw user-item X matrix, but this function expects X to be a feature matrix (n_samples, n_features). For the MF case, the raw (user_id, item_id) pairs don't serve as a proper design matrix. The function only uses X.shape[0] to determine the number of samples for constructing U/V/Tau/S/T, so it works by coincidence — but it's fragile and semantically incorrect.

Fix: Pass only n_samples information or create a minimal placeholder:

n = len(y)
X_dummy = np.ones((n, 1))  # only shape matters for loss param construction
U, V, Tau, S, T = _make_loss_rehline_param(loss=self.loss, X=X_dummy, y=y)

5. Missing input validation in fit() (Low-Medium)

  • No check that X has exactly 2 columns.
  • No check that user/item IDs in X are integers within [0, n_users) / [0, n_items).
  • No check that X and y have consistent lengths.
  • No conversion of X to an integer dtype (float IDs make numpy fancy indexing fail with an opaque IndexError, or silently truncate to the wrong row if cast carelessly).

Fix: Add validation at the top of fit():

X = np.asarray(X)
y = np.asarray(y)
if X.ndim != 2 or X.shape[1] != 2:
    raise ValueError("X must have shape (n_ratings, 2)")
if X.shape[0] != len(y):
    raise ValueError("X and y must have the same number of samples")
user_ids = X[:, 0].astype(int)
item_ids = X[:, 1].astype(int)
if np.any(user_ids < 0) or np.any(user_ids >= self.n_users):
    raise ValueError("User IDs must be in [0, n_users)")
if np.any(item_ids < 0) or np.any(item_ids >= self.n_items):
    raise ValueError("Item IDs must be in [0, n_items)")

6. decision_function lacks fitted check (Low)

Location: Lines 458-481

decision_function doesn't verify the model has been fitted. Calling it before fit() will raise a cryptic TypeError from indexing None arrays.

Fix: Add sklearn's check_is_fitted:

from sklearn.utils.validation import check_is_fitted

def decision_function(self, X):
    check_is_fitted(self, ['P', 'Q'])
    ...

7. Minor Style & Readability Issues (Low)

  • np.nan * np.zeros(...) (line 272): Equivalent to np.full((...), np.nan) which is clearer and avoids unnecessary multiplication.
  • Magic numbers in C_user / C_item formulas (lines 289-290): The /2 factor is unexplained and should have a comment clarifying its mathematical origin.
  • _tmp suffix overuse: Variables like index_tmp, len_tmp, y_tmp, item_tmp, etc. could use more descriptive names (e.g., user_ratings_idx, n_user_ratings).
  • Redundant BaseEstimator in class declaration (line 21): _BaseReHLine already inherits from BaseEstimator, making it redundant (though harmless).
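For the first bullet, a one-line check that the two NaN-fill idioms agree (standalone numpy):

```python
import numpy as np

# np.nan * np.zeros(...) fills with NaN via a pointless multiplication;
# np.full allocates the NaN array directly and states the intent.
a = np.nan * np.zeros((2, 3))
b = np.full((2, 3), np.nan)
print(np.isnan(a).all() and np.isnan(b).all())  # True
```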

Summary of Proposed Changes

| # | Issue | Severity | Lines Affected |
|---|---|---|---|
| 1 | Shared mutable list bug | Critical | 278, 285 |
| 2a | Validation in __init__ | Medium-High | 202-213 |
| 2b | Default param mutation | Medium-High | 220-221 |
| 2c | Derived self.rng in __init__ | Medium-High | 231-234 |
| 2d | Pre-init fitted attributes | Medium-High | 235-238 |
| 3 | User/Item update duplication | Medium | 311-442 |
| 4 | obj() X parameter misuse | Low-Medium | 515-516 |
| 5 | Missing input validation | Low-Medium | fit() top |
| 6 | Missing fitted check | Low | 458 |
| 7 | Minor style issues | Low | Various |

@DataboyUsen
Contributor Author

Sure, I'll check those potential issues. Thank you so much for the assistance.

