Add Schur complement and its mat-free mode #35

Open: zitongzhan wants to merge 55 commits into release from memory-issue-swp

@zitongzhan
Collaborator

This pull request introduces significant improvements to the optimizer infrastructure, focusing on enhanced memory profiling, a new Schur complement optimizer, and better support for matrix-free operations.

Optimizer Enhancements

  • Added a new Schur optimizer class in bae.optim.optimizer, implementing the Schur complement method with support for both standard and matrix-free normal equations, block Jacobi preconditioning, and efficient memory usage.

  • Updated the LM optimizer to support a matrix_free_normal mode, allowing for more efficient computation and memory usage in large-scale problems.

  • Added a custom TrustRegion class that supports Warp, intended mainly for use with the Schur optimizer.
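
To make the matrix-free idea concrete, here is a minimal NumPy sketch, not the code in bae.optim.optimizer. It assumes a dense Jacobian J and Marquardt-style damping of the form lam * diag(JᵀJ); the function name normal_matvec and both of those choices are illustrative assumptions. The point is only that the damped normal matrix can be applied to a vector without ever forming JᵀJ.

```python
import numpy as np

def normal_matvec(J, lam, x):
    """Apply (J^T J + lam * diag(J^T J)) to x without forming J^T J.

    Hypothetical helper for illustration; the damping form is an assumption.
    """
    diag = np.einsum("ij,ij->j", J, J)   # diag(J^T J) = squared column norms of J
    return J.T @ (J @ x) + lam * diag * x

rng = np.random.default_rng(0)
J = rng.standard_normal((50, 8))         # toy Jacobian: 50 residuals, 8 parameters
x = rng.standard_normal(8)
lam = 1e-2

# Reference: build the damped normal matrix explicitly and compare.
JtJ = J.T @ J
explicit = JtJ + lam * np.diag(np.diag(JtJ))
matches_explicit = np.allclose(normal_matvec(J, lam, x), explicit @ x)
```

In a sparse setting the two products J @ x and J.T @ (...) would be sparse matvecs, which is why the memory saving comes from never storing JᵀJ.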

Sparse Matrix and PyOps Improvements

  • Improved sparse matrix operations, including fixes to inv_op for correct tensor creation and a new test block in py_ops.py for diagonal operations on CUDA.

Comment thread bae/sparse/warp_wrappers.py Fixed
Comment thread bae/optim/optimizer.py Fixed
Comment thread bae/sparse/py_ops.py Fixed
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces high-performance Triton kernels for sparse BSR operations, including matrix-vector multiplication, matrix-matrix multiplication, and transposition. It also implements a matrix-free NormalMatVec operator and a new Schur complement-based optimizer to improve the efficiency of bundle adjustment tasks. The bundle adjustment example was updated with CUDA memory snapshotting and Warp mempool reporting. Review feedback highlights a critical issue where in-place diagonal modifications in the LM and Schur optimizers cause damping factors to accumulate incorrectly during step rejections. Additionally, the reviewer recommends removing performance-hindering torch.cuda.empty_cache() calls, addressing potential divisions by zero in the Conjugate Gradient solver, and cleaning up redundant or commented-out code.
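
The damping issue raised in the review can be reproduced on a toy matrix. The sketch below is an assumption about the general pattern, not the optimizer's actual code: damp_in_place and damp_from_saved_diag are hypothetical names, and the damping rule diag * (1 + lam) is chosen only for illustration. The second loop pass stands in for a retry after a rejected step with a larger damping factor.

```python
import numpy as np

def damp_in_place(A, lam):
    # Problematic pattern: mutates A, so each retry stacks on the previous damping.
    idx = np.arange(A.shape[0])
    A[idx, idx] += lam * A[idx, idx]
    return A

def damp_from_saved_diag(A, diag0, lam):
    # Safer pattern: rebuild the diagonal from the saved, undamped values.
    idx = np.arange(A.shape[0])
    A[idx, idx] = diag0 * (1.0 + lam)
    return A

A_bug = np.diag([2.0, 4.0])
A_fix = A_bug.copy()
diag0 = np.diag(A_fix).copy()            # undamped diagonal, saved once

for lam in (0.1, 1.0):                   # second value mimics a rejected-step retry
    damp_in_place(A_bug, lam)
    damp_from_saved_diag(A_fix, diag0, lam)

bug_diag = np.diag(A_bug)                # damping from both passes has compounded
fix_diag = np.diag(A_fix)                # reflects only the latest lam
```

After the retry, the in-place version holds diag0 * 1.1 * 2.0 while the saved-diagonal version holds diag0 * 2.0, which is the accumulation the review describes.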

Comment thread bae/optim/optimizer.py
Comment thread bae/optim/optimizer.py Outdated
Comment thread bae/optim/optimizer.py Outdated
Comment thread bae/optim/triton_kernel.py Outdated
Comment thread bae/optim/triton_kernel.py Outdated
Comment thread bae/sparse/warp_wrappers.py Outdated
Comment thread bae/utils/pysolvers.py Outdated
zitongzhan and others added 3 commits May 23, 2026 20:35
Co-authored-by: Copilot Autofix powered by AI <223894421+github-code-quality[bot]@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <223894421+github-code-quality[bot]@users.noreply.github.com>
Comment thread ba_example.py Fixed
Comment thread ba_example.py Fixed
Comment thread ba_example.py Fixed
Comment thread ba_example.py Fixed
Comment thread ba_example.py Fixed
Comment thread ba_example.py Fixed
Comment thread ba_example.py Fixed
Comment thread ba_example.py Fixed
Comment thread ba_example.py Fixed
Co-authored-by: Copilot Autofix powered by AI <223894421+github-code-quality[bot]@users.noreply.github.com>
Comment thread ba_example.py Outdated
@zitongzhan
Collaborator Author

Profile Summary
Profiled the current ba_example.py on Venice (problem-1778-993923-pre): 5,001,946 observations, 1,778 cameras, 993,923 points. I ran it with matrix_free_normal=True and with matrix_free_normal=False; ba_example.py currently defaults to the disabled setting.

| Mode | Steady wall time | Main slow operators |
| --- | --- | --- |
| matrix_free_normal=True | 1.53 s | Warp BSR MV kernels inside linear.cg: ~1.19 s CUDA, ~84% |
| matrix_free_normal=False | 1.83 s | Split between explicit Schur warp_bsr_mm (~0.66 s) and CG BSR MV (~0.68 s) |

Enabled
With matrix-free enabled, the bottleneck is still BSR matvec, now inside Warp CG. The hottest kernels were:

| Kernel / scope | CUDA time |
| --- | --- |
| bsr_mv_transpose_kernel... | 801 ms |
| bsr_mv_kernel_acf84b96... | 230 ms |
| bsr_mv_kernel_0d4f3dc9... | 163 ms |
| jacobian | 123 ms |

This corresponds to the repeated matrix-free Schur matvec in optimizer.py, especially the sparse.bsr_mv chain at lines 145-152 and the CG calls at lines 180 and 200.
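
For readers who want to see the shape of that matvec chain, here is a small dense NumPy sketch. It is an illustration under assumptions, not the code in optimizer.py: schur_matvec and cg are hypothetical helpers, U, W and V stand for the camera block, the camera-point block and the point block of a toy Hessian, and V is inverted densely (in bundle adjustment it is block-diagonal, so its inverse is cheap). The CG loop also guards its step-size denominator, in the spirit of the review comment about divisions by zero.

```python
import numpy as np

def schur_matvec(U, W, V_inv, x):
    """Compute (U - W V^{-1} W^T) x as a chain of matvecs, never forming S."""
    t = W.T @ x                # transpose matvec
    t = V_inv @ t              # apply the inverse of the point block
    return U @ x - W @ t

def cg(matvec, b, tol=1e-10, max_iter=200, eps=1e-30):
    """Plain conjugate gradient for an SPD operator given only as a matvec."""
    x = np.zeros_like(b)
    r = b.copy()
    p = r.copy()
    rs = r @ r
    for _ in range(max_iter):
        if np.sqrt(rs) <= tol:  # converged; also keeps rs > 0 for the division below
            break
        Ap = matvec(p)
        pAp = p @ Ap
        if abs(pAp) < eps:      # guard the step-size denominator
            break
        alpha = rs / pAp
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

rng = np.random.default_rng(0)
n_cam, n_pt = 4, 6
A = rng.standard_normal((20, n_cam + n_pt))
H = A.T @ A + np.eye(n_cam + n_pt)       # toy SPD Hessian
U, W, V = H[:n_cam, :n_cam], H[:n_cam, n_cam:], H[n_cam:, n_cam:]
V_inv = np.linalg.inv(V)
b = rng.standard_normal(n_cam)

x_cg = cg(lambda v: schur_matvec(U, W, V_inv, v), b)
S = U - W @ V_inv @ W.T                  # dense reference, formed only for checking
x_ref = np.linalg.solve(S, b)
```

Each CG iteration calls schur_matvec once, so one transpose matvec and two forward matvecs are repeated per iteration, which is consistent with matvec kernels dominating the profile.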

Disabled
With matrix-free disabled, the cost shifts: explicit Schur construction becomes about as expensive as CG matvecs.

| Kernel / scope | CUDA time |
| --- | --- |
| warp_bsr_mm scope | 658 ms |
| _bsr_mm_compute_values... | 611 ms |
| bsr_mv_tiled_kernel... | 645 ms |
| jacobian | 122 ms |

That maps to explicit Schur construction at optimizer.py: WV_i = sparse.bsr_mm(W, V_i) and WVi_Wt = sparse.bsr_mm(WV_i, Wt).
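
The two products named above can be mirrored in a dense NumPy sketch. This is an illustrative assumption about the surrounding algebra, not the optimizer's code: dense @ products stand in for sparse.bsr_mm, the toy Hessian and gradient are random, and the right-hand side and back-substitution steps follow the standard Schur elimination rather than anything quoted from optimizer.py.

```python
import numpy as np

rng = np.random.default_rng(1)
n_cam, n_pt = 4, 6
A = rng.standard_normal((30, n_cam + n_pt))
H = A.T @ A + np.eye(n_cam + n_pt)       # toy SPD Hessian [[U, W], [W^T, V]]
g = rng.standard_normal(n_cam + n_pt)    # toy right-hand side

U, W, V = H[:n_cam, :n_cam], H[:n_cam, n_cam:], H[n_cam:, n_cam:]
g_c, g_p = g[:n_cam], g[n_cam:]

V_i = np.linalg.inv(V)                   # block-diagonal in real BA, dense here
Wt = W.T
WV_i = W @ V_i                           # first product (dense stand-in for bsr_mm)
WVi_Wt = WV_i @ Wt                       # second product
S = U - WVi_Wt                           # explicit reduced camera system

x_c = np.linalg.solve(S, g_c - WV_i @ g_p)   # solve for the camera update
x_p = V_i @ (g_p - Wt @ x_c)                 # back-substitute for the point update

x_full = np.linalg.solve(H, g)           # reference: solve the full system directly
```

The reduced solve plus back-substitution reproduces the full solution, so the explicit and matrix-free modes differ in cost and memory, not in the system being solved.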

SEOKWOOPARK commented May 29, 2026

Runtime with 20 iterations based on Schur

| Dataset | Problem | Images | Points | Observations | Runtime screenshots |
| --- | --- | --- | --- | --- | --- |
| trafalgar | problem-257-65132-pre | 257 | 65132 | 225911 | trafalgar_without_mat_free, trafalgar_with_mat_free |
| ladybug | problem-1723-156502-pre | 1723 | 156502 | 678718 | ladybug_without_mat_free, ladybug_with_mat_free |
| dubrovnik | problem-356-226730-pre | 356 | 226730 | 1255268 | dubrovnik_without_mat_free, dubrovnik_with_mat_free |
| venice | problem-1778-993923-pre | 1778 | 993923 | 5001946 | venice_without_mat_free, venice_with_mat_free |
| final | problem-13682-4456117-pre | 13682 | 4456117 | 28987644 | final_with_mat_free |

Comment thread ba_example.py Fixed
Comment thread ba_example.py Fixed
Comment thread ba_example.py Dismissed
Comment thread ba_example.py Fixed
Co-authored-by: Copilot Autofix powered by AI <223894421+github-code-quality[bot]@users.noreply.github.com>