
refactor(_mf_class.py): improve code quality and fix potential issues#42

Open
DataboyUsen wants to merge 1 commit into softmin:main from DataboyUsen:main

Conversation

@DataboyUsen
Contributor

  • Replace random_state handling with proper rng instance initialization
  • Remove pandas dependency by using numpy-only operations
  • Fix convergence check: apply abs() to obj_diff before comparison
  • Properly move the initialization of model parameters to fit() method
  • Fix spelling errors in comments and docstrings

@codecov

codecov bot commented Apr 3, 2026

Codecov Report

❌ Patch coverage is 0% with 24 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| rehline/_mf_class.py | 0.00% | 24 Missing ⚠️ |


@statmlben
Member

Thank you @DataboyUsen

Let me just copy-paste the suggestions generated by opus 4.6: most of the comments are valid, but please check P4 very carefully to see whether it is correct or not. Thx!

1. BUG: Shared mutable list via [...] * n (Critical)

Location: _mf_class.py:278 and _mf_class.py:285

self.Iu = [np.array([], dtype=int)] * self.n_users  # line 278
self.Ui = [np.array([], dtype=int)] * self.n_items  # line 285

[obj] * n creates a list of n references to the same np.array object. Although the bug is masked by the subsequent for loop which reassigns self.Iu[u] = idxs for every user that has ratings, any user ID not present in the training data will share the same underlying array object. If one cold-start user's entry is later mutated in-place, all cold-start users' entries would be affected.

Fix: Use a list comprehension to create independent arrays:

self.Iu = [np.array([], dtype=int) for _ in range(self.n_users)]
self.Ui = [np.array([], dtype=int) for _ in range(self.n_items)]
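The aliasing is easy to demonstrate in isolation (a standalone numpy demo, not code from this repo):

```python
import numpy as np

# [obj] * 3 repeats the *reference*: all three slots point at one array.
shared = [np.zeros(1, dtype=int)] * 3
shared[0][0] = 7          # in-place mutation through slot 0...
print(shared[1][0])       # 7 -- ...is visible through slot 1 as well

# A list comprehension evaluates the expression once per slot,
# producing independent arrays.
independent = [np.zeros(1, dtype=int) for _ in range(3)]
independent[0][0] = 7
print(independent[1][0])  # 0 -- slot 1 is untouched
```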

2. sklearn Estimator Contract Violations (Medium-High)

2a. Validation logic in __init__ (lines 202-213)

sklearn convention requires that __init__ only assigns parameters — validation should happen in fit(). The current validation in __init__ prevents set_params() from working correctly in grid search scenarios.

Fix: Move the validation checks from __init__ to the beginning of fit().
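A minimal sketch of that split, using hypothetical parameter names (the real class takes more arguments):

```python
class MFSketch:
    """Toy estimator illustrating the __init__/fit split sklearn expects."""

    def __init__(self, n_users=1, n_items=1, C=1.0):
        # __init__ only stores hyperparameters verbatim -- no validation --
        # so set_params()/clone() can freely rewrite them between fits.
        self.n_users = n_users
        self.n_items = n_items
        self.C = C

    def fit(self, X, y):
        # Validation happens here, on the values current at fit time.
        if self.n_users < 1 or self.n_items < 1:
            raise ValueError("n_users and n_items must be positive")
        if self.C <= 0:
            raise ValueError("C must be positive")
        # ... actual fitting would go here ...
        return self
```

With this layout, a grid search can construct the estimator once and repair any parameter before calling fit(), which the current in-constructor checks would forbid.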

2b. Default parameter mutation: None → [] (lines 220-221)

self.constraint_user = constraint_user if constraint_user is not None else []

get_params() returns [] but the constructor default is None. This breaks the round-trip identity required by sklearn.base.clone().

Fix: Store None as-is and handle the None → [] conversion in fit():

self.constraint_user = constraint_user  # store as-is
# In fit():
constraint_user = self.constraint_user if self.constraint_user is not None else []
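A self-contained illustration of why the coercion breaks the round-trip (toy classes, not the real estimator):

```python
class CoercingInit:
    def __init__(self, constraint_user=None):
        # BUG: the stored value no longer equals the constructor argument,
        # so get_params() -> re-construct -> get_params() is not a fixpoint,
        # which is exactly what sklearn.base.clone() checks.
        self.constraint_user = constraint_user if constraint_user is not None else []

class StoringInit:
    def __init__(self, constraint_user=None):
        self.constraint_user = constraint_user  # store verbatim

print(CoercingInit().constraint_user)         # [] -- None was replaced
print(StoringInit().constraint_user is None)  # True -- round-trips cleanly
```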

2c. Derived attribute self.rng in __init__ (lines 231-234)

self.rng is derived from random_state but won't update if random_state is changed via set_params().

Fix: Create the RNG in fit() instead:

# In __init__:
self.random_state = random_state  # just store

# In fit():
rng = np.random.default_rng(self.random_state)
self.P = rng.normal(...)

2d. Pre-initialized fitted attributes (lines 235-238)

self.P, self.Q, self.bu, self.bi are set to None in __init__. These are fitted attributes that should only be created in fit().

Fix: Remove from __init__; they are already assigned in fit() at lines 302-305.


3. Code Duplication: User-side vs Item-side update blocks (Medium)

Location: Lines 311-375 (user update) and 378-442 (item update)

These two blocks are nearly identical — only the roles of P/Q, bu/bi, Iu/Ui, and C_user/C_item are swapped. This ~130 lines of duplicated logic makes maintenance error-prone.

Fix: Extract a helper method _update_one_side(...):

def _update_one_side(self, X, y, factor_matrix, bias_vector, other_factor,
                     other_bias, index_lists, C_side, constraint, is_biased):
    """Update one side (user or item) of the factorization."""
    for idx in range(factor_matrix.shape[0]):
        index_tmp = index_lists[idx]
        # ... shared logic ...

4. obj() recomputes loss parameters unnecessarily (Low-Medium)

Location: Lines 515-516

U, V, Tau, S, T = _make_loss_rehline_param(loss=self.loss, X=X, y=y)

Inside obj(), _make_loss_rehline_param is called with the raw user-item X matrix, but this function expects X to be a feature matrix (n_samples, n_features). For the MF case, the raw (user_id, item_id) pairs don't serve as a proper design matrix. The function only uses X.shape[0] to determine the number of samples for constructing U/V/Tau/S/T, so it works by coincidence — but it's fragile and semantically incorrect.

Fix: Pass only n_samples information or create a minimal placeholder:

n = len(y)
X_dummy = np.ones((n, 1))  # only shape matters for loss param construction
U, V, Tau, S, T = _make_loss_rehline_param(loss=self.loss, X=X_dummy, y=y)

5. Missing input validation in fit() (Low-Medium)

  • No check that X has exactly 2 columns.
  • No check that user/item IDs in X are integers within [0, n_users) / [0, n_items).
  • No check that X and y have consistent lengths.
  • No conversion of X to an integer dtype (float IDs make numpy fancy indexing fail with an opaque IndexError, or silently truncate to the wrong row if cast carelessly).

Fix: Add validation at the top of fit():

X = np.asarray(X)
y = np.asarray(y)
if X.ndim != 2 or X.shape[1] != 2:
    raise ValueError("X must have shape (n_ratings, 2)")
if X.shape[0] != len(y):
    raise ValueError("X and y must have the same number of samples")
user_ids = X[:, 0].astype(int)
item_ids = X[:, 1].astype(int)
if np.any(user_ids < 0) or np.any(user_ids >= self.n_users):
    raise ValueError("User IDs must be in [0, n_users)")
if np.any(item_ids < 0) or np.any(item_ids >= self.n_items):
    raise ValueError("Item IDs must be in [0, n_items)")

6. decision_function lacks fitted check (Low)

Location: Lines 458-481

decision_function doesn't verify the model has been fitted. Calling it before fit() will raise a cryptic TypeError from indexing None arrays.

Fix: Add sklearn's check_is_fitted:

from sklearn.utils.validation import check_is_fitted

def decision_function(self, X):
    check_is_fitted(self, ['P', 'Q'])
    ...

7. Minor Style & Readability Issues (Low)

  • np.nan * np.zeros(...) (line 272): Equivalent to np.full((...), np.nan) which is clearer and avoids unnecessary multiplication.
  • Magic numbers in C_user / C_item formulas (lines 289-290): The /2 factor is unexplained and should have a comment clarifying its mathematical origin.
  • _tmp suffix overuse: Variables like index_tmp, len_tmp, y_tmp, item_tmp, etc. could use more descriptive names (e.g., user_ratings_idx, n_user_ratings).
  • Redundant BaseEstimator in class declaration (line 21): _BaseReHLine already inherits from BaseEstimator, making it redundant (though harmless).
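For the first bullet, a one-line check that the two NaN-fill idioms agree (standalone numpy):

```python
import numpy as np

# np.nan * np.zeros(...) fills with NaN via a pointless multiplication;
# np.full allocates the NaN array directly and states the intent.
a = np.nan * np.zeros((2, 3))
b = np.full((2, 3), np.nan)
print(np.isnan(a).all() and np.isnan(b).all())  # True
```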

Summary of Proposed Changes

| # | Issue | Severity | Lines Affected |
|---|---|---|---|
| 1 | Shared mutable list bug | Critical | 278, 285 |
| 2a | Validation in __init__ | Medium-High | 202-213 |
| 2b | Default param mutation | Medium-High | 220-221 |
| 2c | Derived self.rng in __init__ | Medium-High | 231-234 |
| 2d | Pre-init fitted attributes | Medium-High | 235-238 |
| 3 | User/Item update duplication | Medium | 311-442 |
| 4 | obj() X parameter misuse | Low-Medium | 515-516 |
| 5 | Missing input validation | Low-Medium | fit() top |
| 6 | Missing fitted check | Low | 458 |
| 7 | Minor style issues | Low | Various |

@DataboyUsen
Contributor Author

Sure, I'll check those potential issues. Thank you so much for the assistance.

