Reduce QP overheads #1140
Conversation
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the CodeRabbit settings.
📝 Walkthrough
Adds native free-variable support for QP barrier solves, adjusts related barrier math and control flow, records free-variable indices in presolve metadata, conditions dense-column detection on Q presence/diagonality, times cuDSS initialization, removes one pre-barrier diagnostic, and introduces a ratio-based GMRES stop/update rule.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 1
🧹 Nitpick comments (1)
cpp/src/barrier/iterative_refinement.hpp (1)
181-182: Make the early-stop ratio configurable. Line 181 hard-codes a solver-wide accuracy/performance tradeoff into a shared iterative-refinement path. A fixed `5.0` may be fine for the profiling case that motivated this PR, but it is likely too aggressive for some matrices and too loose for others. Please plumb this through existing solver settings or a caller-supplied parameter instead of baking it into the header.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/barrier/iterative_refinement.hpp` around lines 181 - 182, Replace the hard-coded local constant f_t stop_ratio = 5.0 in iterative_refinement.hpp with a configurable parameter sourced from the solver settings or a caller-provided argument: add a new float-typed setting (or function parameter) named stop_ratio (or iterative_refinement_stop_ratio) to the solver config or to the iterative refinement function/method, use that variable in place of the literal when computing bnorm and the early-stop decision, keep the default value at 5.0 if not supplied, and update any callers or constructor that invoke the iterative refinement path to thread the new setting through.
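A minimal host-side sketch of what the review is asking for: thread the ratio through a settings object with `5.0` as the default instead of a literal in the header. The type and field names (`ir_settings_t`, `stop_ratio`, `should_stop_early`) are illustrative assumptions, not the actual cuOpt API, and the stop condition shown is only one plausible use of the ratio.

```cpp
#include <cassert>

// Hypothetical settings struct; a real integration would extend the existing
// solver settings type instead of introducing a new one.
struct ir_settings_t {
  double stop_ratio = 5.0;  // default preserves the current hard-coded behavior
};

// Illustrative early-stop test: stop once the residual norm is smaller than
// the right-hand-side norm by the configured factor.
bool should_stop_early(double residual_norm, double bnorm, const ir_settings_t& s)
{
  return residual_norm * s.stop_ratio < bnorm;
}
```

Callers that construct the iterative-refinement path would then pass the settings object through rather than relying on the header-local constant.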
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@cpp/src/barrier/iterative_refinement.hpp`:
- Around line 365-380: The branch treating improvement_ratio == stop_ratio as
non-improvement is wrong; update the comparison logic in the iterative
refinement block (symbols: improvement_ratio, stop_ratio, best_residual, x_sav,
x, show_info, CUOPT_LOG_INFO) so that hitting the threshold counts as
improvement (e.g., use >= instead of >) or use an epsilon-based comparison for
floating-point stability (|improvement_ratio - stop_ratio| <= eps) to decide
improvement, and ensure the residual update and raft::copy for
best_residual/x_sav->x remain inside the "improved" branch while keeping the
logging/break behavior only for true stagnation/worsening cases.
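The comparison fix above can be sketched in isolation. This stand-alone predicate (names borrowed from the review comment, not the actual cuOpt code) treats hitting the threshold exactly as improvement, with a small epsilon band for floating-point stability:

```cpp
#include <cassert>
#include <limits>

// Returns true when the iterate should be treated as "improved": the ratio
// meets or exceeds the threshold, with an epsilon margin so that a value that
// lands exactly on stop_ratio (up to rounding noise) still counts.
bool is_improvement(double improvement_ratio, double stop_ratio)
{
  const double eps = 8.0 * std::numeric_limits<double>::epsilon() * stop_ratio;
  return improvement_ratio >= stop_ratio - eps;
}
```

In the real refinement loop, the residual update and the `raft::copy` of `x` into `x_sav` would sit inside the branch where this predicate is true, and the logging/break path would only fire when it is false.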
---
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: a7b6ab20-ce6f-49c8-9326-46dd180e9846
📒 Files selected for processing (1)
cpp/src/barrier/iterative_refinement.hpp
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@cpp/src/barrier/barrier.cu`:
- Around line 2310-2340: The free-variable branch currently sets diag to 0
whenever q_jj > 0 which allows tiny positive Q(j,j) to produce near-zero diag;
change the lambda used in the cub::DeviceTransform::Transform (the one capturing
free_var_reg in the data.Q.n > 0 && data.Q_diagonal case) to floor the returned
value at free_var_reg (e.g., return max(q_jj, free_var_reg) or return (q_jj >
free_var_reg) ? q_jj : free_var_reg) so diag for free variables is never below
free_var_reg; also ensure the other free-variable-only lambda (the data.Q.n == 0
path) similarly returns at least free_var_reg for is_free entries.
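The flooring suggested above reduces to a one-line scalar rule, sketched here as a plain host function standing in for the device lambda (the name `free_var_diag` is illustrative):

```cpp
#include <algorithm>
#include <cassert>

// For a free variable, never let the KKT diagonal entry fall below the
// regularization free_var_reg, even when Q(j,j) is a tiny positive value.
double free_var_diag(double q_jj, double free_var_reg)
{
  return std::max(q_jj, free_var_reg);
}
```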
- Around line 3596-3606: The initial mu computation (the one that divides by n +
num_upper_bounds before the iteration loop) doesn't account for native free
variables set via presolve_info.free_variable_indices and data.n_free_vars;
update that initial mu calculation to use the same adjusted divisor as
compute_mu()/compute_target_mu() by subtracting data.n_free_vars (i.e., use n +
num_upper_bounds - data.n_free_vars) when presolve_info.free_variable_indices is
non-empty, and ensure data.n_free_vars and d_is_free_ are initialized before
this initial mu is computed so the first iteration uses the same barrier measure
as later compute_mu()/compute_target_mu() calls.
- Around line 3081-3083: The calculation of mu_aff and subsequent mu division
must guard against mu_denom == 0 (mu_denom = x.size() + n_upper_bounds -
n_free_vars) to avoid division-by-zero when all variables are free; add an
explicit fast-path: compute mu_denom as now, then if mu_denom <= eps (use a
small f_t epsilon like 0 or machine epsilon), set mu_aff = static_cast<f_t>(0)
and mu = static_cast<f_t>(0), set any mu ratio (e.g., mu_aff_over_mu) to a safe
default (1 or 0 depending on downstream expectations) and skip the
mu-aff/mu-based updates, otherwise perform the existing divisions. Apply the
identical guard and handling for the other occurrence around the mu computation
block (the second spot noted in the comment).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 94fd7d1f-a999-4dab-a9dc-d5d097ed0af6
📒 Files selected for processing (3)
cpp/src/barrier/barrier.cu
cpp/src/dual_simplex/presolve.cpp
cpp/src/dual_simplex/presolve.hpp
/ok to test ab709dc
#define CUOPT_RANDOM_SEED "random_seed"
#define CUOPT_PDLP_PRECISION "pdlp_precision"
#define CUOPT_MIP_SEMICONTINUOUS_BIG_M "mip_semi_continuous_big_m"
#define CUOPT_BARRIER_ITERATIVE_REFINEMENT "barrier_iterative_refinement"
Nit: could you put these next to CUOPT_BARRIER_DUAL_INITIAL_POINT above, so we keep all the barrier parameters together?
A_dense.from_sparse(lp.A, j, k++);
  }
}
original_A_values = AD.x;
Is this line no longer necessary? It isn't included below. Just checking.
rmm::cuda_stream_view stream) const
{
  static_assert(std::is_signed_v<i_t>);
  i_t const mm = m;
Nit: why can't you just use m and n?
Because you want these to be const?
};

template <typename i_t, typename f_t>
void device_csc_matrix_t<i_t, f_t>::to_compressed_row(device_csr_matrix_t<i_t, f_t>& Arow,
raft::copy(rows.data(), i.data(), nz, stream);
raft::copy(vals.data(), x.data(), nz, stream);

thrust::tabulate(exec,
This is using a binary search to find out which column the nonzero stored at index p goes in?
Oh, you are expanding back to triplet?
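That reading of the code can be sketched on the host: binary-search the compressed column pointers to recover an explicit column index for each nonzero, i.e. expand CSC back to COO/triplet form. Function and variable names here are illustrative stand-ins for the `thrust::tabulate` call, not the cuOpt code.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Column c owns the nonzeros in [col_start[c], col_start[c + 1]), so the
// column of nonzero p is the last pointer entry <= p, found with upper_bound.
std::vector<int> expand_col_indices(const std::vector<int>& col_start, int nz)
{
  std::vector<int> cols(nz);
  for (int p = 0; p < nz; ++p) {
    auto it = std::upper_bound(col_start.begin(), col_start.end(), p);
    cols[p] = static_cast<int>(it - col_start.begin()) - 1;
  }
  return cols;
}
```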
device_A_x_values.resize(device_AD.x.size(), handle_ptr->get_stream());
raft::copy(
  device_A_x_values.data(), device_AD.x.data(), device_AD.x.size(), handle_ptr->get_stream());
device_AD.to_compressed_row(device_A, handle_ptr->get_stream());
vector_norm2<i_t, f_t>(data.dual_residual));
#endif
// Make sure (w, x, v, z) > 0
// Save free variable x values before forcing positive (they can be negative)
I will update this in my PR
data.w.ensure_positive(epsilon_adjust);
data.x.ensure_positive(epsilon_adjust);

// For native free variables (QP): restore x values and set z = 0
data.d_complementarity_xz_rhs_.size(),
[new_mu] HD(f_t dx_aff, f_t dz_aff) { return -(dx_aff * dz_aff) + new_mu; },
stream_view_.value());
auto fill_linear_cc_rhs = [&](raft::device_span<f_t> out,
I will update this in my PR
iteration_data_t<i_t, f_t> data(lp, num_upper_bounds, Q, settings, start_time);

// Set up native free variable tracking for QPs
if (!presolve_info.free_variable_indices.empty()) {
I will update this in my PR
auto row_iter = thrust::device_pointer_cast(rows.data());
auto col_iter = thrust::device_pointer_cast(cols.data());
thrust::sort_by_key(exec,
No issue with the code, but I'd love to understand how it works.
You want to sort the nonzeros by row. I'm not sure I understand why the second iterator has row_iter + nz and col_iter + nz.
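One plausible reading of the snippet: in `thrust::sort_by_key` the first two arguments are the begin and end of the key range, so `row_iter + nz` and `col_iter + nz` simply mark where the zipped (row, col) key sequence ends; `vals` is permuted alongside as the value array. A host-side stand-in for the same operation, with illustrative names and `std::stable_sort` in place of the device sort:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <tuple>
#include <vector>

// Sort COO triplets by (row, col), carrying vals along, mirroring
// sort_by_key over a zipped (row, col) key range with vals as values.
void sort_triplets_by_row(std::vector<int>& rows, std::vector<int>& cols,
                          std::vector<double>& vals)
{
  std::vector<std::size_t> order(rows.size());
  for (std::size_t p = 0; p < order.size(); ++p) { order[p] = p; }
  std::stable_sort(order.begin(), order.end(), [&](std::size_t a, std::size_t b) {
    return std::tie(rows[a], cols[a]) < std::tie(rows[b], cols[b]);
  });
  std::vector<int> r(rows.size()), c(cols.size());
  std::vector<double> v(vals.size());
  for (std::size_t p = 0; p < order.size(); ++p) {
    r[p] = rows[order[p]];
    c[p] = cols[order[p]];
    v[p] = vals[order[p]];
  }
  rows = r;
  cols = c;
  vals = v;
}
```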
rmm::device_uvector<i_t> rows(nz, stream);
rmm::device_uvector<i_t> cols(nz, stream);
rmm::device_uvector<f_t> vals(nz, stream);
Is vals needed? Why not directly use Arow.x?
cudaMemcpyAsync(Arow.row_start.data() + mm, &nz, sizeof(i_t), cudaMemcpyHostToDevice, stream));

rmm::device_uvector<i_t> rows(nz, stream);
rmm::device_uvector<i_t> cols(nz, stream);
Is cols needed? Why not directly use Arow.j? You could always make cols a reference to Arow.j if you wanted to keep the name cols. But it seems like you need an extra copy at the end.
thrust::make_zip_iterator(thrust::make_tuple(row_iter + nz, col_iter + nz)),
thrust::device_pointer_cast(vals.data()));

raft::copy(Arow.j.data(), cols.data(), nz, stream);
I think you can get rid of these two copies and extra allocations by using Arow.j and Arow.x directly.
}

if (settings.barrier_presolve && free_variables > 0) {
  // Try to remove free variables
Please add the comment back.
{CUOPT_DUAL_INFEASIBLE_TOLERANCE, &pdlp_settings.tolerances.dual_infeasible_tolerance, f_t(0.0), f_t(1e-1), std::max(f_t(1e-10), std::numeric_limits<f_t>::epsilon())},
{CUOPT_MIP_CUT_CHANGE_THRESHOLD, &mip_settings.cut_change_threshold, f_t(-1.0), std::numeric_limits<f_t>::infinity(), f_t(-1.0)},
{CUOPT_MIP_CUT_MIN_ORTHOGONALITY, &mip_settings.cut_min_orthogonality, f_t(0.0), f_t(1.0), f_t(0.5)},
{CUOPT_BARRIER_STEP_SCALE, &pdlp_settings.barrier_step_scale, f_t(0.5), f_t(1.0), f_t(0.9)},
We can't ever go to 1.0. The max needs to be something like 0.9999
case dual_simplex::lp_status_t::OPTIMAL: return pdlp_termination_status_t::Optimal;
case dual_simplex::lp_status_t::INFEASIBLE:
  return pdlp_termination_status_t::PrimalInfeasible;
case dual_simplex::lp_status_t::UNBOUNDED: return pdlp_termination_status_t::DualInfeasible;
If the dual is infeasible, I don't think we have a way to tell if the primal is unbounded or infeasible. So this should probably map to the UNBOUNDED_OR_INFEASIBLE status.
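The suggested remapping can be sketched with local stand-in enums (the combined status name is an assumption based on the comment; the real cuOpt enums live elsewhere):

```cpp
#include <cassert>

// Stand-in enums mirroring the snippet above.
enum class lp_status_t { OPTIMAL, INFEASIBLE, UNBOUNDED };
enum class pdlp_status_t { Optimal, PrimalInfeasible, UnboundedOrInfeasible };

// When dual simplex reports UNBOUNDED (i.e. the dual is infeasible), the
// primal may be unbounded or infeasible, so map to the combined status.
pdlp_status_t map_status(lp_status_t s)
{
  switch (s) {
    case lp_status_t::OPTIMAL: return pdlp_status_t::Optimal;
    case lp_status_t::INFEASIBLE: return pdlp_status_t::PrimalInfeasible;
    case lp_status_t::UNBOUNDED: return pdlp_status_t::UnboundedOrInfeasible;
  }
  return pdlp_status_t::UnboundedOrInfeasible;  // unreachable
}
```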
barrier_settings.log.log = false;
}

barrier_settings.log.printf("Barrier settings created at %.2f seconds, toc time: %.2f seconds\n",
Remove this before merging
}

template <typename i_t, typename f_t>
static dual_simplex::user_problem_t<i_t, f_t> cuopt_optimization_problem_to_simplex_problem(
Nit: this should probably be called cuopt_optimization_problem_to_user_problem
cudss_mt_lib_file = CUDSS_MT_LIB_FILE_NAME;
}

if (cudss_mt_lib_file != nullptr) {
Do you want to provide an option for the user to disable the CUDSS_THREADING_LIB?
chris-maes left a comment:
Awesome.
I will make changes to the free variable PR. I think you will either need to rebase off that PR or pull them in.
Most of the comments are very minor or just questions.
The only one that definitely requires changing is the maximum barrier_step_scale.
Description
This PR improves QP performance, particularly on easy problems that run in under a few hundred milliseconds.
Algorithmic improvements
Tunable parameters
Overhead reductions
Converting optimization_problem_t to problem_t takes about 70 ms. For QP, problem_t is not used, as we again convert it to user_problem_t. Added a direct conversion from optimization_problem_t to user_problem_t.
Performance improvements
Issue
Checklist