
Add KLUWrapper #296

Open
jd-lara wants to merge 31 commits into main from
claude/klu-sparse-rhs-wrapper-jLAmX
Conversation

@jd-lara
Member

@jd-lara jd-lara commented Apr 28, 2026

This PR is important to reduce issues with KLU, allowing for multi-entry (multi-RHS) solves.

claude and others added 8 commits April 25, 2026 17:54
Adds a small allocation-aware wrapper over libklu (provided via
SuiteSparse_jll) tailored to the access patterns of this package:

- KLULinSolveCache splits symbolic and numeric factorization, supports
  numeric_refactor! that reuses the analysis and the prior numeric struct,
  and is non-allocating after construction.
- solve!/tsolve! call libklu's native dense multi-RHS routines directly.
- solve_sparse!/solve_sparse provide a sparse-RHS path: B's columns are
  scattered into a small dense scratch in chunks instead of densifying the
  full N x M RHS up front. Has a skip_empty option for RHSs where most
  columns are structurally empty (Ward reduction, Woodbury kernel).
- Real (Float64) and complex (ComplexF64) paths share the same struct via
  type-dispatched ccall helpers (klu_l_* and klu_zl_*).

This replaces the use of KLU.jl elsewhere in the package. KLU.jl is removed
from Project.toml; SuiteSparse_jll is added in its place.
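The chunked scatter-solve described above can be sketched with stdlib tools alone. The sketch below uses SparseArrays plus a plain sparse `lu` factorization in place of the package's KLU bindings; the function name and `chunk` keyword are illustrative, not the PR's actual API:

```julia
using SparseArrays, LinearAlgebra

# Solve A * X = B for a sparse RHS B without densifying all of B up front:
# each chunk of columns is scattered into a small dense scratch, then one
# multi-RHS solve is dispatched per chunk.
function solve_sparse_chunked(F, B::SparseMatrixCSC; chunk::Int = 8)
    n, m = size(B)
    X = zeros(n, m)
    scratch = zeros(n, chunk)                 # bounded working set
    for j0 in 1:chunk:m
        j1 = min(j0 + chunk - 1, m)
        w = j1 - j0 + 1
        fill!(scratch, 0.0)
        for (k, j) in enumerate(j0:j1)        # scatter stored entries only
            for p in nzrange(B, j)
                scratch[rowvals(B)[p], k] = nonzeros(B)[p]
            end
        end
        X[:, j0:j1] = F \ scratch[:, 1:w]     # one dense multi-RHS solve
    end
    return X
end

A = sparse([2.0 1.0 0.0; 1.0 3.0 1.0; 0.0 1.0 2.0])
B = sparse([1.0 0.0; 0.0 0.0; 0.0 1.0])
X = solve_sparse_chunked(lu(A), B; chunk = 1)
```

The scratch stays `n × chunk` regardless of how many RHS columns B has, which is the point of the chunked path.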

https://claude.ai/code/session_0128vLG5HzxrMTZYjE9oTGfd
Replaces KLU.KLUFactorization with the new KLULinSolveCache across the
matrix types and uses solve_sparse! for the multi-RHS, structurally
sparse code paths:

- ABA_Matrix.K is now KLULinSolveCache{Float64}; factorize/is_factorized
  updated accordingly. DC_ABA_Matrix_Factorized type alias updated.
- _calculate_PTDF_matrix_KLU uses solve_sparse! on BA[valid_ix, :],
  scattering the structurally-sparse RHS columns instead of densifying
  the full (buscount-nref) x linecount matrix.
- _calculate_LODF_matrix_KLU passes transpose(a)[valid_ix, :] directly
  to solve_sparse!, removing the original zeros(buscount, buscount)
  intermediate.
- VirtualPTDF.K, VirtualLODF.K, VirtualMODF.K all hold a
  KLULinSolveCache{Float64}; per-row solve! calls remain non-allocating.
- DC_vPTDF_Matrix is left unconstrained on K so AppleAccelerate
  factorizations still satisfy the alias.

https://claude.ai/code/session_0128vLG5HzxrMTZYjE9oTGfd
Ward reduction:
- The boundary-bus identity solve becomes a single solve_sparse! call on
  a sparse identity matrix instead of a per-bus loop with one RHS each.
- The y_eq computation switches to solve_sparse(...; skip_empty=true) on
  y_eb. Most external buses are not adjacent to any boundary bus, so the
  RHS columns are largely empty and the skip-empty short-circuit avoids
  redundant solves.
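The skip-empty short-circuit amounts to: a structurally empty RHS column has the zero vector as its solution, so no factorized solve is needed for it. A minimal sketch with a stdlib `lu` stand-in (the function name is illustrative; in the PR this logic lives inside solve_sparse):

```julia
using SparseArrays, LinearAlgebra

# A structurally empty RHS column has the zero vector as its solution, so
# the factorized solve can be skipped for it entirely.
function solve_skip_empty(F, B::SparseMatrixCSC)
    n, m = size(B)
    X = zeros(n, m)
    for j in 1:m
        isempty(nzrange(B, j)) && continue    # empty column: leave zeros
        X[:, j] = F \ Vector(B[:, j])
    end
    return X
end

A = sparse([2.0 1.0 0.0; 1.0 3.0 1.0; 0.0 1.0 2.0])
B = sparse([1, 3], [1, 3], [1.0, 2.0], 3, 3)  # column 2 structurally empty
X = solve_skip_empty(lu(A), B)
```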

Tests:
- Real and complex round-trip and refactor coverage.
- solve_sparse! agrees with the dense-RHS path.
- skip_empty produces zero columns for empty RHS columns.
- solve_sparse! into a view writes only the targeted rows.

https://claude.ai/code/session_0128vLG5HzxrMTZYjE9oTGfd
Simplifications:
- Drop redundant n field from KLULinSolveCache; derive from colptr.
- Drop _decrement!/_increment! helpers in favour of broadcast `.+= 1` /
  `.-= 1`.
- Rename finalize! -> Base.finalize so it composes with stdlib instead of
  shadowing Base.finalize through the module's exports.
- Drop solve_w_refinement (no in-tree caller); add `Base.:\` on the cache
  for symmetry with KLU.KLUFactorization.
- Unify the two tsolve! methods via a _tsolve_call helper that ignores
  the conjugate flag on the real path.
- Drop block= and skip_empty= keywords from solve_sparse!. Always pack
  non-empty columns into a single dense scratch and dispatch one libklu
  multi-RHS solve. Empty RHS columns are zeroed in the output without a
  solve, in all cases.
- Throw stdlib exception types (SingularException, OutOfMemoryError,
  ArgumentError, OverflowError) from klu_throw to match the conventions
  used in SparseArrays.CHOLMOD and elsewhere in PNM.
- Refactor _calculate_PTDF_matrix_KLU: collapse the duplicated dist_slack
  branches into a single solve plus a slack-distribution post-step.
- Strip migration-narrative comments from the consumer call sites.
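The `.+= 1` / `.-= 1` broadcasts refer to index-base conversion at the C boundary: KLU's C API expects 0-based colptr/rowval, while SparseMatrixCSC stores 1-based indices. A minimal illustration (the actual ccall site is elided):

```julia
using SparseArrays

# KLU's C API takes 0-based colptr/rowval; SparseMatrixCSC stores them
# 1-based. In-place broadcasts replace the old _decrement!/_increment!
# helper loops.
A = sparse([1.0 0.0; 2.0 3.0])
colptr = copy(A.colptr)   # 1-based: [1, 3, 4]
rowval = copy(A.rowval)   # 1-based: [1, 2, 2]
colptr .-= 1              # shift to 0-based for the C side
rowval .-= 1
# ... ccall into libklu would receive these buffers here ...
colptr .+= 1              # restore 1-based if the buffers are reused
rowval .+= 1
```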

New: KLULinSolvePool
- A small pool of independent KLULinSolveCache workers, each holding its
  own factorization of the same matrix. KLU's numeric struct and Common
  status field are mutated by klu_solve, so safe parallel use needs one
  factor copy per worker.
- API: KLULinSolvePool(A; nworkers), with_worker(f, pool) -> result,
  acquire!/release!, numeric_refactor!(pool, A), Base.finalize.
- Plumbing into VirtualMODF (per-worker scratch + cache locks) is staged
  for the next commit.
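The pool described above can be sketched with Base primitives: a buffered Channel of worker indices serves as the free list, and with_worker takes the function first so callers get do-block syntax. Everything here (MiniPool, the scratch vectors standing in for per-worker factorizations) is illustrative, not the PR's implementation:

```julia
# A minimal pool in the spirit of KLULinSolvePool: each worker owns its own
# mutable state, and acquire! blocks when every worker is busy.
struct MiniPool{T}
    workers::Vector{T}
    free::Channel{Int}
end

function MiniPool(make_worker, n::Int)
    free = Channel{Int}(n)
    foreach(i -> put!(free, i), 1:n)
    return MiniPool([make_worker() for _ in 1:n], free)
end

acquire!(pool::MiniPool) = take!(pool.free)
release!(pool::MiniPool, i::Int) = put!(pool.free, i)

# Taking `f` first is what enables `with_worker(pool) do ... end`.
function with_worker(f, pool::MiniPool)
    i = acquire!(pool)
    try
        return f(pool.workers[i], i)
    finally
        release!(pool, i)   # worker index goes back even if f throws
    end
end

pool = MiniPool(() -> zeros(4), 2)
result = with_worker(pool) do scratch, i
    scratch .= 1.0          # worker-local: no other task touches this scratch
    sum(scratch)
end
```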

https://claude.ai/code/session_0128vLG5HzxrMTZYjE9oTGfd
VirtualMODF.K is now a KLULinSolvePool{Float64} sized at construction time
to nworkers (defaults to Threads.nthreads()). Per-worker scratch buffers
(work_ba_col, temp_data) are stored as Vector{Vector{Float64}} indexed by
the worker handle returned from with_worker.

Woodbury kernel refactor:
- Split _compute_woodbury_factors / _apply_woodbury_correction into
  pure-data _impl functions that take cache + scratch + data arrays
  explicitly, plus mat-typed outer wrappers.
- VirtualPTDF wrapper uses its single shared scratch (existing single-cache
  behavior).
- VirtualMODF wrapper uses with_worker to acquire a per-worker cache and
  scratch.

Cache locking:
- Add ReentrantLocks for woodbury_cache and row_caches.
- _get_woodbury_factors and _get_or_create_row_cache use double-checked
  locking so concurrent miss-then-fill is correct.
- clear_caches! / clear_all_caches! / Base.getindex acquire the locks.
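The double-checked locking above can be sketched with a ReentrantLock and a Dict. Names are illustrative stand-ins for the PR's row_caches / _get_or_create_row_cache; the key property is that the expensive compute runs outside the lock and a racing insert resolves to a single winner:

```julia
struct RowCache
    lock::ReentrantLock
    rows::Dict{Int, Vector{Float64}}
end
RowCache() = RowCache(ReentrantLock(), Dict{Int, Vector{Float64}}())

# compute_row comes first so callers can use do-block syntax.
function get_or_compute!(compute_row, cache::RowCache, key::Int)
    cached = lock(cache.lock) do
        get(cache.rows, key, nothing)   # first check, under the lock
    end
    cached !== nothing && return cached
    row = compute_row(key)              # expensive solve, lock released
    return lock(cache.lock) do
        get!(cache.rows, key, row)      # second check: insert-race winner sticks
    end
end

cache = RowCache()
calls = Ref(0)
compute(k) = (calls[] += 1; fill(Float64(k), 3))
r1 = get_or_compute!(compute, cache, 7)
r2 = get_or_compute!(compute, cache, 7)  # hit: compute not called again
```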

Tests:
- New test/test_virtual_threaded.jl: uses Threads.@spawn for VirtualPTDF,
  VirtualLODF, and VirtualMODF. The VirtualMODF case spawns one task per
  (monitored, contingency) work item and validates parallel results
  match a serial reference. VirtualPTDF/VirtualLODF use a single-task
  @spawn since their K and scratch are still shared (not yet pool-backed);
  the comments call this out explicitly.
- Tests in test/test_klu_wrapper.jl exercise KLULinSolvePool: basic
  with_worker, concurrent solves via @threads on a 4-worker pool, and
  numeric_refactor!.

https://claude.ai/code/session_0128vLG5HzxrMTZYjE9oTGfd
@jd-lara jd-lara requested review from Copilot and luke-kiernan April 28, 2026 00:15
@github-actions
Contributor

github-actions Bot commented Apr 28, 2026

Performance Results

Precompile Time

| Main | This Branch | Delta |
| --- | --- | --- |
| 2.196 s | 2.196 s | -0.0% |

Execution Time

| Test | Main | This Branch | Delta |
| --- | --- | --- | --- |
| matpower_ACTIVSg2000_sys-Build PTDF First | 2.176 s | 1.85 s | -15.0% |
| matpower_ACTIVSg2000_sys-Build PTDF Second | 116.7 ms | 499.3 ms | +327.7% |
| matpower_ACTIVSg2000_sys-Build Ybus First | 15.3 ms | 13.6 ms | -11.2% |
| matpower_ACTIVSg2000_sys-Build Ybus Second | 14.3 ms | 12.3 ms | -13.8% |
| matpower_ACTIVSg2000_sys-Build LODF First | 168.9 ms | 165.8 ms | -1.8% |
| matpower_ACTIVSg2000_sys-Build LODF Second | 236.0 ms | 163.2 ms | -30.8% |
| matpower_ACTIVSg2000_sys-Build VirtualMODF First | 4.102 s | 4.944 s | +20.5% |
| matpower_ACTIVSg2000_sys-Build VirtualMODF Second | 203.6 ms | 672.7 ms | +230.4% |
| matpower_ACTIVSg2000_sys-VirtualMODF Query 10 rows | 477.1 ms | 505.4 ms | +5.9% |
| matpower_ACTIVSg2000_sys-Radial network reduction First | 449.6 ms | 451.5 ms | +0.4% |
| matpower_ACTIVSg2000_sys-Radial network reduction Second | 0.6 ms | 0.8 ms | +17.6% |
| matpower_ACTIVSg2000_sys-Degree two network reduction First | 1.665 s | 1.674 s | +0.5% |
| matpower_ACTIVSg2000_sys-Degree two network reduction Second | 1.1 ms | 1.1 ms | -1.9% |
| Base_Eastern_Interconnect_515GW-Build Ybus First | 3.373 s | 3.298 s | -2.2% |
| Base_Eastern_Interconnect_515GW-Build Ybus Second | 3.417 s | 3.175 s | -7.1% |
| Base_Eastern_Interconnect_515GW-Radial network reduction First | 1.29 s | 40.8 ms | -96.8% |
| Base_Eastern_Interconnect_515GW-Radial network reduction Second | 33.6 ms | 40.3 ms | +19.9% |
| Base_Eastern_Interconnect_515GW-Degree two network reduction First | 361.3 ms | 346.6 ms | -4.1% |
| Base_Eastern_Interconnect_515GW-Degree two network reduction Second | 45.4 ms | 37.6 ms | -17.3% |

Contributor

Copilot AI left a comment

Pull request overview

This PR replaces the dependency on KLU.jl with an in-repo KLUWrapper built on SuiteSparse_jll and wires it through PTDF/LODF/MODF/Ward workflows, adding a worker-pool abstraction to support thread-safe parallel solves and new tests to validate correctness under concurrency.

Changes:

  • Introduce src/KLUWrapper/ with KLULinSolveCache (cached factorization) and KLULinSolvePool (thread-safe parallel solves) plus dense/sparse RHS solve helpers.
  • Refactor PTDF/LODF/Virtual* and Ward reduction codepaths to use klu_factorize, solve!, and solve_sparse!/solve_sparse instead of KLU.jl.
  • Add tests covering wrapper correctness, sparse RHS behavior, pool behavior, and threaded Virtual* access; bump package version and switch dependency to SuiteSparse_jll.

Reviewed changes

Copilot reviewed 18 out of 19 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| test/test_virtual_threaded.jl | Adds multithreaded regression tests for VirtualPTDF/VirtualLODF (single-task) and VirtualMODF (pool-backed parallelism). |
| test/test_klu_wrapper.jl | Adds comprehensive unit tests for cache/pool factorization, dense + sparse RHS solves, refactor/reset paths, and allocation expectations. |
| src/woodbury_kernel.jl | Refactors Woodbury computation into pure-data impl functions and adds dispatchers for VirtualPTDF vs pool-backed VirtualMODF. |
| src/ward_reduction.jl | Switches Ward reduction solves to KLUWrapper and uses sparse-RHS solve for boundary columns. |
| src/virtual_ptdf_calculations.jl | Updates VirtualPTDF to accept KLULinSolveCache and routes solves through the wrapper abstraction. |
| src/virtual_modf_calculations.jl | Makes VirtualMODF pool-backed for parallel getindex, adds locks around caches, and threads Woodbury calls through with_worker. |
| src/virtual_lodf_calculations.jl | Migrates VirtualLODF factorization/solves to KLULinSolveCache and uses in-place solve!. |
| src/ptdf_calculations.jl | Replaces dense RHS KLU.solve! usage with solve_sparse! over selected BA rows. |
| src/lodf_calculations.jl | Replaces dense RHS KLU.solve! usage with solve_sparse! for incidence-transpose RHS and updates denom solve. |
| src/PowerflowMatrixTypes.jl | Updates type aliases to use KLULinSolveCache and relaxes VirtualPTDF's factorization type parameter. |
| src/PowerNetworkMatrices.jl | Includes/imports KLUWrapper APIs into the main module. |
| src/KLUWrapper/solve_sparse_rhs.jl | Implements sparse RHS packing + block-chunked solve to bound working set. |
| src/KLUWrapper/solve_dense.jl | Implements in-place dense solve and transpose solve; adds `\` for allocating solve. |
| src/KLUWrapper/pool.jl | Adds KLULinSolvePool with Channel-based worker acquisition and refactor/reset logic. |
| src/KLUWrapper/klu_jll_bindings.jl | Adds low-level ccall bindings for SuiteSparse_long KLU entry points and error mapping. |
| src/KLUWrapper/klu_cache.jl | Implements KLULinSolveCache lifecycle, symbolic/numeric refactor, pattern checks, and scratch management. |
| src/KLUWrapper/KLUWrapper.jl | Defines the KLUWrapper module and exports wrapper APIs. |
| src/BA_ABA_matrices.jl | Updates ABA factorization storage and constructors to use klu_factorize. |
| Project.toml | Bumps version and replaces KLU dependency with SuiteSparse_jll. |


Comment thread src/KLUWrapper/pool.jl Outdated
Comment thread src/KLUWrapper/pool.jl Outdated
Comment thread src/KLUWrapper/pool.jl Outdated
Collaborator

@luke-kiernan luke-kiernan left a comment

One comment, will look more later.

Comment thread src/KLUWrapper/klu_cache.jl Outdated
@jd-lara
Member Author

jd-lara commented May 4, 2026

There is still something odd with the Windows version, which fails in a non-deterministic way.

Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 26 out of 27 changed files in this pull request and generated 4 comments.



Comment thread src/virtual_ptdf_calculations.jl
Comment thread src/virtual_ptdf_calculations.jl Outdated
Comment thread src/solver_dispatch.jl Outdated
Comment thread src/virtual_lodf_calculations.jl Outdated
Comment on lines +232 to +235
Number of parallel workers in the underlying KLU pool. Defaults to
`_default_pool_workers()` — `max(1, Threads.nthreads() - 1)` on
Mac/Linux and `1` on Windows (where the KLU pool path is serialized
through `solver_lock` to work around a libklu thread-safety issue).
@jd-lara
Member Author

jd-lara commented May 4, 2026

I've looked at Copilot's comments and addressed the ones that look real.

Collaborator

@luke-kiernan luke-kiernan left a comment

I took a look at the backend multi-threading stuff; the matrix math itself I'm less familiar with.

Comment thread src/KLUWrapper/solve_sparse_rhs.jl
Comment thread src/KLUWrapper/solve_sparse_rhs.jl Outdated
Comment thread src/KLUWrapper/klu_cache.jl Outdated
Comment thread src/KLUWrapper/pool.jl Outdated
Comment thread src/solver_dispatch.jl
# outside the Virtual{PTDF, LODF, MODF} files so all three matrices share
# the same `with_solver` seam and the same KLU/AppleAccelerate factory.
#
# Windows libklu thread-safety: the MinGW build of `SuiteSparse_jll`'s
Collaborator

It can also error in non-multi-threaded situations on Windows, for seemingly inexplicable reasons. Do you have a minimum reproducible example of that? The only two examples I know of are both quite complex.

Member Author

We are still investigating what's going on in Windows. The latest exploration showed that the issue only happens on Windows when both the PTDF and the LODF are used. We still don't have a good explanation of what's going on.

Comment thread src/row_cache.jl Outdated
to test for a hit, runs `compute_row` outside the lock on a miss
(KLU solves dominate the cost), then takes the lock again to insert. A
concurrent producer that wins the insert race wins; the other side
returns the winner's row. The signature places `compute_row` first so
Collaborator
The reason will be displayed to describe this comment to others. Learn more.

The signature places compute_row first so callers can use do … end block syntax.

explain?

Member Author

explain what?

Collaborator

Why does placing compute_row first matter for do...end block syntax?
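For context on the question: Julia lowers `f(args...) do x ... end` to `f(x -> ..., args...)`, i.e. the trailing block becomes the *first* positional argument, so only a function whose first parameter is the callback can be called this way. A minimal illustration (`get_row` is hypothetical, not the PR's function):

```julia
# `f(args...) do x ... end` lowers to f(x -> ..., args...): the trailing
# block is passed as the FIRST positional argument of the call.
get_row(compute_row, key) = compute_row(key)   # callback parameter first

row = get_row(42) do key
    key * 2                                    # this block is compute_row
end
```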

Comment thread src/solver_dispatch.jl
Comment thread src/KLUWrapper/pool.jl
Comment thread src/KLUWrapper/pool.jl Outdated
# snapshot is taken under `state_lock` for consistency; the `take!` loop
# blocks outside the lock so concurrent `release!` calls can land.
function _drain_available!(pool::KLULinSolvePool)
n_to_drain = @lock pool.state_lock count(pool.valid)
Collaborator

Nitpick: there's a function for this, n_valid(pool).

I'm also struggling to convince myself that there's no race condition here. Can a worker go from valid to invalid between the count call and the take! call? And if it can happen, does it cause issues?

Member Author

@jd-lara jd-lara May 5, 2026

If you'd like to verify empirically, flip const KLU_POOL_DEBUG = true in KLUWrapper.jl and re-run the threaded testsets in test_virtual_threaded.jl; the per-worker atomic collision counter in with_worker
(pool.jl:168-180) will log an @error if any task ever observes a worker held by another task concurrently.

Collaborator

Hmm. Relying on empirical tests feels weird, because compiler optimizations and undefined behavior. But if you're confident this is ok, I'll take your word for it.

Comment thread src/KLUWrapper/pool.jl
# channel. Drain them so future `acquire!` calls consume valid
# indices instead.
_drain_sentinels!(pool)

Collaborator

After _drain_sentinels! and drain_sentinels!, should the pool be empty? I suspect so (in which case I'd add something to check said invariant)

Member Author

I added more tests
