
Conversation

@PierreMartinon
Member

@PierreMartinon PierreMartinon commented Jan 31, 2025

Providing the block structure to ADNLPModels leads to less sparse matrices, since the problem-specific functions are not analyzed, but building the pattern is faster and requires less memory. This method scales better than the forward/reverse optimized backend, and appears to be faster for large enough problems.

See test/docs/AD_backend.md

@PierreMartinon
Member Author

PierreMartinon commented Jan 31, 2025

The relevant code in solve.jl:

    # objective and constraints functions
    f = x -> DOCP_objective(x, docp)
    c! = (c, x) -> DOCP_constraints!(c, x, docp)

    # sparsity pattern
    J_backend = ADNLPModels.SparseADJacobian(docp.dim_NLP_variables, f, docp.dim_NLP_constraints, c!, DOCP_Jacobian_pattern(docp))
    H_backend = ADNLPModels.SparseReverseADHessian(docp.dim_NLP_variables, f, docp.dim_NLP_constraints, c!, DOCP_Hessian_pattern(docp))

    # call NLP problem constructor
    if adnlp_backend == :manual
        nlp = ADNLPModel!(
            f, x0, docp.var_l, docp.var_u,
            c!, docp.con_l, docp.con_u,
            gradient_backend = ADNLPModels.ReverseDiffADGradient,
            hprod_backend = ADNLPModels.ReverseDiffADHvprod,
            jtprod_backend = ADNLPModels.ReverseDiffADJtprod,
            jacobian_backend = J_backend,
            hessian_backend = H_backend,
            show_time = show_time
        )
    else
        nlp = ADNLPModel!(
            x -> DOCP_objective(x, docp), x0, docp.var_l, docp.var_u,
            (c, x) -> DOCP_constraints!(c, x, docp), docp.con_l, docp.con_u,
            backend = adnlp_backend, show_time = show_time
        )
    end

with the sparse structure for the Jacobian/Hessian defined in disc/trapeze.jl, disc/midpoint.jl and disc/irk.jl.
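For reference, selecting this backend from user code would look something like the sketch below (illustrative only: the exact entry point and its other keyword arguments are not shown in this thread).

```julia
# Illustrative sketch: pass the manual sparsity backend through the solve call.
sol = solve(ocp; adnlp_backend = :manual, show_time = true)
```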

# Benchmark for different AD backends
The backend for ADNLPModels can be set in transcription / solve calls with the option `adnlp_backend=`. Possible values include the predefined(*) backends for ADNLPModels:
- `:optimized`* Default for CTDirect. Forward mode for Jacobian, reverse for Gradient and Hessian.

Collaborator

Suggested change
- `:optimized`* Default for CTDirect. Forward mode for Jacobian, reverse for Gradient and Hessian.
- `:optimized`* Default for CTDirect. Forward mode for Jacobian, reverse for Gradient and forward over reverse for Hessian.


Takeaways:
- the `:optimized` backend (with reverse mode for Hessian) is much better than full forward mode.

Collaborator

Suggested change
- the `:optimized` backend (with reverse mode for Hessian) is much better than full forward mode.
- the `:optimized` backend (with forward over reverse mode for Hessian) is much better than full forward mode.


Takeaways:
- the `:optimized` backend (with reverse mode for Hessian) is much better than full forward mode.
- manual sparse pattern seems to give even better performance for larger problems. This is likely due to the increasing cost of computing the Hessian sparsity in terms of allocations and time. This observation is consistent with the comparison with Jump that seems to use a different, less sparse but faster method for the Hessian.

Collaborator

Suggested change
- manual sparse pattern seems to give even better performance for larger problems. This is likely due to the increasing cost of computing the Hessian sparsity in terms of allocations and time. This observation is consistent with the comparison with Jump that seems to use a different, less sparse but faster method for the Hessian.
- manual sparse pattern seems to give even better performance for larger problems. This is likely due to the increasing cost of computing the Hessian sparsity with SparseConnectivityTracer.jl in terms of allocations and time.
This observation is consistent with the comparison with JuMP that seems to use a different, less sparse but faster method for the Hessian.
The sparsity pattern detection in JuMP relies on the expression tree of the objective and constraints built from its DSL.

- redo tests on algal_bacterial problem, including JuMP
- add some tests for different backends in test_misc
- try to disable some unused (?) parts such as hprod? (according to show_time info the impact may be small)
- reuse ADNLPModels functions to get block sparsity patterns then rebuild full patterns?

Member Author
@PierreMartinon PierreMartinon Feb 6, 2025

Edit: yes, thanks, these were quite useful to spot missing nonzeros in the manual pattern!

function DOCP_Hessian_pattern(docp::DOCP{Trapeze})

# NB. need to provide full pattern for coloring, not just upper/lower part
H = zeros(Bool, docp.dim_NLP_variables, docp.dim_NLP_variables)

Collaborator

Suggested change
H = zeros(Bool, docp.dim_NLP_variables, docp.dim_NLP_variables)
H = BitMatrix(undef, docp.dim_NLP_variables, docp.dim_NLP_variables)
fill!(H, false)

Collaborator

It should reduce the allocations by a factor of 8 😃
-> a Bool takes a full byte (8 bits), whereas a BitMatrix packs one bit per entry.
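A quick standalone check of that factor of 8 (not code from the PR):

```julia
# Matrix{Bool} stores one byte per entry, BitMatrix packs one bit per entry.
n = 5_000
dense_bool = zeros(Bool, n, n)   # ~25 MB
bit_matrix = falses(n, n)        # ~3.1 MB
Base.summarysize(dense_bool) / Base.summarysize(bit_matrix)   # ≈ 8
```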

Member Author

@amontoison Nice, I did not know a Bool would take a full byte! In the meantime I switched to the vector format (Is, Js, Vs) so there is no more matrix allocation. Do you think computing the number of nonzeros and allocating the vectors directly at full size would be noticeably better than allocating at size 0 and using push! to add elements?

Collaborator

Using push! is slower because it involves dynamic allocations. I expect allocating the BitMatrix at the beginning to speed up the operations.
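For the vector (Is, Js) format, a possible middle ground is sketched below (not from the PR; the capacity bound is a rough over-estimate built from the per-step dense blocks, using the docp fields that appear in the pattern code later in this thread):

```julia
# Keep the push!-based construction but reserve capacity up front with sizehint!,
# so that push! does not trigger repeated reallocations while filling the pattern.
Is, Js = Int[], Int[]
step_block = docp.discretization._step_variables_block
nnz_bound = docp.dim_NLP_steps * (2 * step_block)^2 +      # per-step dense blocks
            2 * docp.dim_NLP_variables * docp.dim_NLP_v    # coupling with v (both triangles)
sizehint!(Is, nnz_bound)
sizehint!(Js, nnz_bound)
```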


function DOCP_Jacobian_pattern(docp::DOCP{Trapeze})

J = zeros(Bool, docp.dim_NLP_constraints, docp.dim_NLP_variables)

Collaborator

Suggested change
J = zeros(Bool, docp.dim_NLP_constraints, docp.dim_NLP_variables)
J = BitMatrix(undef, docp.dim_NLP_constraints, docp.dim_NLP_variables)
fill!(J, false)

src/solve.jl Outdated
jtprod_backend = ADNLPModels.ReverseDiffADJtprod,
jacobian_backend = J_backend,
hessian_backend = H_backend,
show_time = show_time

Collaborator

If you don't use the following backends:

  • hprod
  • jprod
  • jtprod
  • ghjvprod

I recommend setting them to an EmptyBackend.

I should add an easy way for the user to provide the list of backends that he needs in ADNLPModels.jl (JuliaSmoothOptimizers/ADNLPModels.jl#324).
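A sketch of what that could look like in the :manual branch (assuming ADNLPModels.EmptyADbackend is the type meant by "EmptyBackend"; the keyword names follow the constructor call quoted earlier, so treat this as an illustration rather than the PR code):

```julia
nlp = ADNLPModel!(
    f, x0, docp.var_l, docp.var_u,
    c!, docp.con_l, docp.con_u,
    gradient_backend = ADNLPModels.ReverseDiffADGradient,
    jacobian_backend = J_backend,
    hessian_backend = H_backend,
    # backends not used by the solver: disable them instead of building them
    hprod_backend = ADNLPModels.EmptyADbackend,
    jprod_backend = ADNLPModels.EmptyADbackend,
    jtprod_backend = ADNLPModels.EmptyADbackend,
    ghjvprod_backend = ADNLPModels.EmptyADbackend,
    show_time = show_time
)
```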

| 1000 | 926.0 | 7.1 |
| 2500 | | 31.8 |
| 5000 | | |
*** building the hessian is one third of the total solve time...

Collaborator

I will pass show_time to the constructor of the sparse Jacobians and Hessians so that you can know how much time is spent in the sparsity detection (JuliaSmoothOptimizers/ADNLPModels.jl#325).

@amontoison
Collaborator

@PierreMartinon How do you determine the sparsity pattern on the objective?
Is it not a function provided by the user?
What we need in MadNLP / IPOPT / KNITRO is the sparsity pattern of the Hessian of the Lagrangian L(x,y) = f(x) + y'c(x), which is the sparsity pattern of the Hessian of the objective combined with the sparsity patterns of the Hessians of all constraints.
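A toy illustration of that combination (hypothetical 3-variable problem, unrelated to CTDirect):

```julia
using SparseArrays

# The pattern of the Lagrangian Hessian ∇²L = ∇²f + Σᵢ yᵢ ∇²cᵢ is the
# element-wise OR (union) of the objective and constraint Hessian patterns.
H_f  = sparse(Bool[1 0 0; 0 0 0; 0 0 0])   # pattern of ∇²f
H_c1 = sparse(Bool[0 0 0; 0 1 1; 0 1 1])   # pattern of ∇²c₁
H_L  = H_f .| H_c1                          # pattern of the Lagrangian Hessian
```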

Comment on lines 6 to 7
- `:enzyme`* Enzyme (not working).
- `:zygote`* Zygote (not working).

Collaborator

@amontoison amontoison Feb 1, 2025

For Enzyme.jl, Michel is working on it:
JuliaSmoothOptimizers/ADNLPModels.jl#322

For Zygote.jl, I will probably remove the support in the next major release because we only have dense AD backends for this package and the Hessian is only for the objective...
We can't easily support the Hessian of the Lagrangian, which is what you need here.

Member Author

Ok, noted. I did not push the tests far for these two, I basically just passed the option to see what would happen.

@amontoison
Collaborator

cc @jbcaillau

@jbcaillau
Member

@PierreMartinon very nice 👍🏽👍🏽👍🏽

Regarding the tests, do we agree that apart from the current issue with artifacts, there are only problems with tests involving other RK schemes than trapezoidal (:midpoint, e.g.)?

The case being, since the default trapezoidal scheme is able to treat 99% of use cases, is it OK to set the default as trapezoidal + manual sparsity, and switch to automatic for other schemes?

@jbcaillau
Member

jbcaillau commented Feb 1, 2025

@amontoison the gradient of the objective is expected to be very sparse. A typical Mayer program (into which, internally, every problem is transformed AFAIR - @PierreMartinon ?) has a cost depending only on the value of the state at the last grid point; for (n + m) * N unknowns, that's only n nonzeros.
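(To fix orders of magnitude with made-up sizes: for, say, n = 2 states, m = 1 control and N = 100 grid points, that is (2 + 1) * 100 = 300 unknowns but only 2 potentially nonzero entries in the objective gradient.)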

> @PierreMartinon How do you determine the sparsity pattern on the objective? Is it not a function provided by the user? What we need in MadNLP / IPOPT / KNITRO is the sparsity pattern of the Hessian of the Lagrangian L(x,y) = f(x) + y'c(x), which is the sparsity pattern of the Hessian of the objective combined with the sparsity patterns of the Hessians of all constraints.

Yes mostly, see below.

@PierreMartinon
Member Author

PierreMartinon commented Feb 6, 2025

> @PierreMartinon How do you determine the sparsity pattern on the objective? Is it not a function provided by the user? What we need in MadNLP / IPOPT / KNITRO is the sparsity pattern of the Hessian of the Lagrangian L(x,y) = f(x) + y'c(x), which is the sparsity pattern of the Hessian of the objective combined with the sparsity patterns of the Hessians of all constraints.

Indeed, the 'manual' mode for the sparsity pattern is a crude one that makes no assumption about the problem-specific functions, i.e. it assumes that all functions involved have 'full' (nonzero) derivatives. What it does is eliminate the blocks that we know for sure are zero. For the objective, we have 2 possible cases:

  • Mayer cost f(x), which is more precisely of the form g0(x0,xf,v). In the Hessian this leads to nnz blocks at locations (x0,x0), (xf,xf), (v,v) and the 6 cross-derivative terms. Of course in practice a significant part of the values will actually be zeros, so we waste some sparsity!
  • Lagrange cost: this one is simply of the form f(x) = x_i, so second derivatives are zero.

The resulting code is rather ugly, although I tried to improve the readability.

"""
$(TYPEDSIGNATURES)

Build sparsity pattern for Hessian of Lagrangian
"""
function DOCP_Hessian_pattern(docp::DOCP{Trapeze})

    # NB. need to provide full pattern for coloring, not just upper/lower part
    Is = Vector{Int}(undef, 0)
    Js = Vector{Int}(undef, 0)

    # index alias for v
    v_start = docp.dim_NLP_variables - docp.dim_NLP_v + 1
    v_end = docp.dim_NLP_variables

    # 0. objective
    # 0.1 mayer cost (x0, xf, v) 
    # -> grouped with term 3. for boundary conditions
    # 0.2 lagrange case (lf)
    # -> 2nd order term is zero
   
    # 1. main loop over steps
    # 1.0 v / v term
    add_nonzero_block!(Is, Js, v_start, v_end, v_start, v_end)

    for i = 1:docp.dim_NLP_steps

        # contiguous variables blocks will be used when possible
        # x_i (l_i) u_i x_i+1 (l_i+1) u_i+1
        var_block = docp.discretization._step_variables_block * 2
        var_offset = (i-1)*docp.discretization._step_variables_block

        # 1.1 state eq 0 = x_i+1 - (x_i + h_i/2 (f(t_i,x_i,u_i,v) + f(t_i+1,x_i+1,u_i+1,v)))
        # depends on x_i, u_i, x_i+1, u_i+1, and v -> included in 1.2
        # 1.2 lagrange part 0 = l_i+1 - (l_i + h_i/2 (l(t_i,x_i,u_i,v) + l(t_i+1,x_i+1,u_i+1,v)))
        # depends on x_i, l_i, u_i, x_i+1, l_i+1, u_i+1, and v
        # -> use single block for all step variables
        add_nonzero_block!(Is, Js, var_offset+1, var_offset+var_block, var_offset+1, var_offset+var_block)
        add_nonzero_block!(Is, Js, var_offset+1, var_offset+var_block, v_start, v_end; sym=true)

        # 1.3 path constraint g(t_i, x_i, u_i, v)
        # -> included in 1.2
    end

    # 2. final path constraints (xf, uf, v)
    # -> included in last loop iteration

    # 3. boundary constraints (x0, xf, v)
    # -> (x0, v) terms included in first loop iteration
    # -> (xf, v) terms included in last loop iteration
    if docp.is_mayer || docp.dim_boundary_cons > 0
        var_offset = docp.dim_NLP_steps*docp.discretization._step_variables_block
        x0_start = 1
        x0_end = docp.dim_OCP_x
        xf_start = var_offset + 1
        xf_end = var_offset + docp.dim_OCP_x
        add_nonzero_block!(Is, Js, x0_start, x0_end, xf_start, xf_end; sym=true)
    end
    # 3.1 null initial condition for lagrangian cost state l0
    # -> 2nd order term is zero

    # build and return sparse matrix
    nnzj = length(Is)
    Vs = ones(Bool, nnzj)
    return sparse(Is, Js, Vs, docp.dim_NLP_variables, docp.dim_NLP_variables)

end

with

# helper: push all (i, j) index pairs of the dense block [i_start:i_end, j_start:j_end]
# (and its symmetric counterpart when sym=true) into the Is / Js coordinate lists
function add_nonzero_block!(Is, Js, i_start, i_end, j_start, j_end; sym=false)
    for i=i_start:i_end
        for j=j_start:j_end
            push!(Is, i)
            push!(Js, j)
            sym && push!(Is, j)
            sym && push!(Js, i)
        end
    end
    return
end

We can see the limitations of this approach, as the J/H patterns are less sparse than the ones computed with the usual backend. On the other hand the startup cost is much smaller. All in all this seems to be a tradeoff between the time taken to build the derivatives vs the time spent in the optimization itself (number of iterations and also derivative evaluations). I expect for instance the 'manual' derivatives to be more expensive to evaluate since they are less sparse.

A potential improvement would be to determine the sparsity patterns of the problem's functions, and use them to fill the Jacobian / Hessian patterns, instead of putting full dense blocks.
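A possible building block for this, sketched below (not part of the PR; it assumes the jacobian_sparsity / TracerSparsityDetector interface of SparseConnectivityTracer.jl mentioned above, applied to a toy dynamics with 2 states and 1 control):

```julia
using SparseConnectivityTracer

# Detect the sparsity of one OCP function in isolation; its pattern could then be
# reinjected into the global Jacobian/Hessian pattern instead of a full dense block.
dynamics(x, u) = [x[2] + u[1], -x[1]]
detector = TracerSparsityDetector()
xu = rand(3)                                   # stacked sample point (x1, x2, u1)
sub_pattern = jacobian_sparsity(z -> dynamics(z[1:2], z[3:3]), xu, detector)
# sub_pattern is a 2×3 sparse Bool matrix with nonzeros at (1,2), (1,3), (2,1)
```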

Oh, I should try a BitVector for Vs :D
edit: meh, using trues(nnz) gives a very small reduction in allocations and is not faster (apparently some operations may perform better on Vector{Bool} than on BitArray). I'll keep it simple for now.

@PierreMartinon PierreMartinon marked this pull request as ready for review February 6, 2025 10:05
@PierreMartinon
Member Author

> @PierreMartinon some errors in CI and doc (not taking into account breakages, due to artefact update, I suppose), changed to draft.

Seems to be fine now :-)

@PierreMartinon
Member Author

PierreMartinon commented Feb 6, 2025

> @PierreMartinon very nice 👍🏽👍🏽👍🏽
>
> Regarding the tests, do we agree that apart from the current issue with artifacts, there are only problems with tests involving other RK schemes than trapezoidal (:midpoint, e.g.)?
>
> The case being, since the default trapezoidal scheme is able to treat 99% of use cases, is it OK to set the default as trapezoidal + manual sparsity, and switch to automatic for other schemes?

I think we can do a bit more testing but basically yes, something like that for the moment. Doing midpoint after trapeze and IRK should not take long, I'll get started.
Edit: midpoint done.

The harder part will be to try to improve the manual mode, which currently has quite a bit of excess nonzero elements.

A first direction is purely to use the finest possible granularity for the 'full' nnz blocks: for instance, in a Lagrange problem we add a final component in the state for the integral of the running cost, which is never passed to the original OCP functions. So we know that the derivatives wrt these particular variables are zero, however it can be a bit involved to 'skip' them. This is more involved in the IRK schemes where the stage variables and equations may also include this additional component. Another example: the time steps h_i depend on v in the variable time case, but not in the fixed time case... Ideally we would add nnz blocks down to each individual variable present in each constraint / function.

A second direction is to exploit the sparsity of the OCP-specific functions instead of assuming full nnz blocks. Here we would need to apply the sparsity detection to these functions and reinject the sub-patterns into the Jacobian / Hessian somehow. Probably the most effective option, but it will require some work since I never tried that ;-)
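A sketch of what the reinjection could look like, as a hypothetical variant of add_nonzero_block! (the helper name and signature are made up for illustration):

```julia
using SparseArrays

# Push only the nonzeros of a detected sub-pattern, shifted to its location in the
# global matrix, instead of filling the whole block.
function add_sparse_block!(Is, Js, sub_pattern::SparseMatrixCSC{Bool}, i_offset, j_offset; sym=false)
    rows, cols, _ = findnz(sub_pattern)
    for (i, j) in zip(rows, cols)
        push!(Is, i_offset + i)
        push!(Js, j_offset + j)
        sym && push!(Is, j_offset + j)
        sym && push!(Js, i_offset + i)
    end
    return
end
```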

@jbcaillau
Member

> @PierreMartinon some errors in CI and doc (not taking into account breakages, due to artefact update, I suppose), changed to draft.
>
> Seems to be fine now :-)

did anyone do something? @ocots aware of that?

@jbcaillau
Member

@PierreMartinon @ocots some breakage tests seem to pass, some not. looks like there is a new error there:

[screenshot IMG_2136 showing the new error]

any clue?

> @PierreMartinon some errors in CI and doc (not taking into account breakages, due to artefact update, I suppose), changed to draft.
>
> Seems to be fine now :-)
>
> did anyone do something? @ocots aware of that?

@ocots
Member

ocots commented Feb 12, 2025

Don't know, but it probably won't work with v4 without changing the code of the action.

@jbcaillau
Member

@PierreMartinon ready to merge in spite of the (broken 😬) breakages above? if yes, should close #183

@PierreMartinon
Member Author

PierreMartinon commented Feb 13, 2025

> @PierreMartinon ready to merge in spite of the (broken 😬) breakages above? if yes, should close #183

Breakage task only checks if OptimalControl can work with the current code, and is useful mostly when preparing releases. Also, the Breakage task is kind of broken now ;-)

The actual tests for CTDirect are the CI ones (currently failing, I'm looking into it. edit: looks like some overzealous type qualification for the ocp description :D).

After this one is done I'll start preparing release 0.13.1, as well as the update for the tutorials / docs in OptimalControl (discretizations in particular, and probably a few words on the AD backends)

@PierreMartinon PierreMartinon merged commit f7cd141 into main Feb 13, 2025
6 of 7 checks passed
@PierreMartinon PierreMartinon deleted the sparse_pattern branch February 13, 2025 11:37