Manual sparse pattern for Jacobian and Hessian #348
Conversation
…omputed one due to lagrange cost state
The relevant code with the sparse structure for the Jacobian/Hessian is in the files quoted below.
test/docs/AD_backend.md
Outdated
> - :zygote : Zygote
> # Benchmark for different AD backends
> The backend for ADNLPModels can be set in transcription / solve calls with the option `adnlp_backend=`. Possible values include the predefined(*) backends for ADNLPModels:
> - `:optimized`* Default for CTDirect. Forward mode for Jacobian, reverse for Gradient and Hessian.
Suggested change:
old: - `:optimized`* Default for CTDirect. Forward mode for Jacobian, reverse for Gradient and Hessian.
new: - `:optimized`* Default for CTDirect. Forward mode for Jacobian, reverse for Gradient and forward over reverse for Hessian.
test/docs/AD_backend.md
Outdated
> Takeaways:
> - the `:optimized` backend (with reverse mode for Hessian) is much better than full forward mode.
Suggested change:
old: - the `:optimized` backend (with reverse mode for Hessian) is much better than full forward mode.
new: - the `:optimized` backend (with forward over reverse mode for Hessian) is much better than full forward mode.
test/docs/AD_backend.md
Outdated
> Takeaways:
> - the `:optimized` backend (with reverse mode for Hessian) is much better than full forward mode.
> - manual sparse pattern seems to give even better performance for larger problems. This is likely due to the increasing cost of computing the Hessian sparsity in terms of allocations and time. This observation is consistent with the comparison with Jump that seems to use a different, less sparse but faster method for the Hessian.
Suggested change:
old: - manual sparse pattern seems to give even better performance for larger problems. This is likely due to the increasing cost of computing the Hessian sparsity in terms of allocations and time. This observation is consistent with the comparison with Jump that seems to use a different, less sparse but faster method for the Hessian.
new: - manual sparse pattern seems to give even better performance for larger problems. This is likely due to the increasing cost of computing the Hessian sparsity with SparseConnectivityTracer.jl in terms of allocations and time.
     This observation is consistent with the comparison with JuMP that seems to use a different, less sparse but faster method for the Hessian.
     The sparsity pattern detection in JuMP relies on the expression tree of the objective and constraints built from its DSL.
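For context, a minimal sketch of what the automatic detection mentioned above does, using SparseConnectivityTracer.jl through the ADTypes detector interface. The objective `f` below is just a stand-in example, not a CTDirect function:

```julia
using SparseConnectivityTracer, ADTypes

# stand-in objective with a banded structure plus one corner coupling term
f(x) = sum(abs2, diff(x)) + x[1] * x[end]

detector = TracerSparsityDetector()
H = ADTypes.hessian_sparsity(f, rand(10), detector)  # sparse Bool matrix: the Hessian pattern
```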
test/docs/AD_backend.md
Outdated
> - redo tests on algal_bacterial problem, including Jump
> - add some tests for different backends in test_misc
> - try to disable some unused (?) parts such as hprod ? (according to show_time info the impact may be small)
> - reuse ADNLPModels functions to get block sparsity patterns then rebuild full patterns ?
I recently added these functions for you:
https://jso.dev/ADNLPModels.jl/dev/sparse/#Extracting-sparsity-patterns
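For reference, a small sketch of how these helpers can be called, assuming the `get_sparsity_pattern(nlp, :jacobian)` / `get_sparsity_pattern(nlp, :hessian)` form described on the linked docs page; the toy model below is illustrative, not a CTDirect problem:

```julia
using ADNLPModels

# toy constrained model, just to have something to query
f(x) = (x[1] - 1)^2 + 100 * (x[2] - x[1]^2)^2
c(x) = [x[1]^2 + x[2]^2]
nlp = ADNLPModel(f, [-1.2, 1.0], c, [-Inf], [1.0])

# sparsity patterns detected by the AD backends (sparse Bool matrices)
J = get_sparsity_pattern(nlp, :jacobian)
H = get_sparsity_pattern(nlp, :hessian)
```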
Edit: yes, thanks, these were quite useful to spot missing nonzeros in the manual pattern!
src/disc/trapeze.jl
Outdated
> function DOCP_Hessian_pattern(docp::DOCP{Trapeze})
>     # NB. need to provide full pattern for coloring, not just upper/lower part
>     H = zeros(Bool, docp.dim_NLP_variables, docp.dim_NLP_variables)
Suggested change:
old: H = zeros(Bool, docp.dim_NLP_variables, docp.dim_NLP_variables)
new: H = BitMatrix(undef, docp.dim_NLP_variables, docp.dim_NLP_variables)
     fill!(H, false)
It should reduce the allocations by a factor of 8 😃
-> a Bool takes 1 byte (octet) = 8 bits, whereas a BitMatrix packs 1 bit per entry.
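A quick way to check the factor-of-8 claim (the size `n` below is just an example):

```julia
n = 1000
dense_bool = zeros(Bool, n, n)   # Array{Bool,2}: one byte per entry
bit_mat    = falses(n, n)        # BitMatrix: one bit per entry, already filled with false

Base.summarysize(dense_bool)     # ≈ 1_000_000 bytes
Base.summarysize(bit_mat)        # ≈   125_000 bytes, i.e. 8x smaller
```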
@amontoison Nice, I did not know a Bool would take a full byte! In the meantime I switched to the vector format (Is, Js, Vs), so there is no more matrix allocation. Do you think computing the number of nonzeros and allocating the vectors directly at full size would be noticeably better than allocating at size 0 and using push! to add elements?
Using push! is slower because it involves dynamic allocations. I expect allocating the BitMatrix at the beginning to speed up the operations.
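A sketch of the two assembly strategies being compared here; `pattern_indices` and `nnz_estimate` are hypothetical placeholders, not CTDirect code:

```julia
# hypothetical list of (row, col) positions of the structural nonzeros
pattern_indices = [(1, 1), (1, 2), (2, 2)]
nnz_estimate = length(pattern_indices)

# 1) grow-as-you-go: simple, but push! triggers dynamic reallocations
Is, Js, Vs = Int[], Int[], Bool[]
for (i, j) in pattern_indices
    push!(Is, i); push!(Js, j); push!(Vs, true)
end

# 2) allocate at full size once, then fill by index
Is2 = Vector{Int}(undef, nnz_estimate)
Js2 = Vector{Int}(undef, nnz_estimate)
Vs2 = trues(nnz_estimate)                 # BitVector, 1 bit per entry
for (k, (i, j)) in enumerate(pattern_indices)
    Is2[k] = i; Js2[k] = j
end

# middle ground: sizehint!(Is, nnz_estimate) before the push! loop reserves
# capacity while keeping the simpler code
```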
src/disc/trapeze.jl
Outdated
> function DOCP_Jacobian_pattern(docp::DOCP{Trapeze})
>     J = zeros(Bool, docp.dim_NLP_constraints, docp.dim_NLP_variables)
Suggested change:
old: J = zeros(Bool, docp.dim_NLP_constraints, docp.dim_NLP_variables)
new: J = BitMatrix(undef, docp.dim_NLP_constraints, docp.dim_NLP_variables)
     fill!(J, false)
src/solve.jl
Outdated
> jtprod_backend = ADNLPModels.ReverseDiffADJtprod,
> jacobian_backend = J_backend,
> hessian_backend = H_backend,
> show_time = show_time
If you don't use the following backends:
- hprod
- jprod
- jtprod
- ghjvprod
I recommend setting them to an EmptyBackend.
I should add an easy way for the user to list the backends they need in ADNLPModels.jl (JuliaSmoothOptimizers/ADNLPModels.jl#324).
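A hedged sketch of what that could look like in the model construction, assuming the `ADNLPModels.EmptyADbackend` type (the exact name of the "EmptyBackend" mentioned above may differ); `f`, `c!`, `x0`, `J_backend`, `H_backend` below are toy placeholders, not the CTDirect objects:

```julia
using ADNLPModels

# toy placeholders standing in for the CTDirect objects
f(x) = sum(abs2, x)
c!(cx, x) = (cx[1] = x[1] + x[2]; cx)
x0, lcon, ucon = zeros(2), [0.0], [0.0]
J_backend = ADNLPModels.SparseADJacobian
H_backend = ADNLPModels.SparseADHessian

nlp = ADNLPModels.ADNLPModel!(
    f, x0, c!, lcon, ucon;
    jacobian_backend = J_backend,
    hessian_backend  = H_backend,
    # products not used by the solver: disable them with the empty backend
    hprod_backend    = ADNLPModels.EmptyADbackend,
    jprod_backend    = ADNLPModels.EmptyADbackend,
    jtprod_backend   = ADNLPModels.EmptyADbackend,
    ghjvprod_backend = ADNLPModels.EmptyADbackend,
)
```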
> | 1000 | 926.0 | 7.1  |
> | 2500 |       | 31.8 |
> | 5000 |       |      |
> *** building the hessian is one third of the total solve time...
I will pass show_time to the constructor of the sparse Jacobians and Hessians so that you can know how much time is spent in the sparsity detection (JuliaSmoothOptimizers/ADNLPModels.jl#325).
@PierreMartinon How do you determine the sparsity pattern on the objective?
test/docs/AD_backend.md
Outdated
> - `:enzyme`* Enzyme (not working).
> - `:zygote`* Zygote (not working).
For Enzyme.jl, Michel is working on it:
JuliaSmoothOptimizers/ADNLPModels.jl#322
For Zygote.jl, I will probably remove the support in the next major release because we only have dense AD backends for this package and the Hessian is only for the objective...
We can't easily support the Hessian of the Lagrangian, which is what you need here.
Ok, noted. I did not push the tests far for these two, I basically just passed the option to see what would happen.
cc @jbcaillau
@PierreMartinon very nice 👍🏽👍🏽👍🏽 Regarding the tests, do we agree that, apart from the current issue with artifacts, the only failing tests are those involving RK schemes other than trapezoidal? The case being, since the default trapezoidal scheme can handle 99% of use cases, is it OK to set the default to trapezoidal + manual sparsity, and switch to automatic sparsity for the other schemes?
@amontoison The gradient of the objective is expected to be very sparse. A typical Mayer problem (into which, internally, every problem is transformed AFAIR - @PierreMartinon ?) has a cost depending only on the value of the state at the last grid point; for (n + m) * N unknowns, that's only n nonzeros.
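A tiny illustration of that count, with a hypothetical variable layout [x_1; u_1; …; x_N; u_N] (not necessarily the actual CTDirect ordering):

```julia
n, m, N = 3, 2, 100                       # example sizes: states, controls, grid points
nvar = (n + m) * N

grad_pattern = falses(nvar)               # gradient sparsity pattern of a pure Mayer cost
final_state = (N - 1) * (n + m) .+ (1:n)  # indices of the state at the last grid point
grad_pattern[final_state] .= true

count(grad_pattern)                       # == n nonzeros out of (n + m) * N entries
```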
Yes, mostly - see below.
Indeed, the 'manual' mode for the sparsity pattern is a crude one that makes no assumption about the problem's specific functions, i.e. it assumes that all functions involved have 'full' (nonzero) derivatives. What it does is eliminate the blocks that we know for sure are zero. For the objective, we have 2 possible cases (a Mayer cost or a Lagrange cost, the latter handled through the additional state component for the running-cost integral).
The resulting code is rather ugly, although I tried to improve the readability. We can see the limitations of this approach, as the J/H patterns are less sparse than the ones computed with the usual backend. On the other hand, the startup cost is much smaller. All in all, this seems to be a tradeoff between the time taken to build the derivatives and the time spent in the optimization itself (number of iterations and also derivative evaluations). I expect, for instance, the 'manual' derivatives to be more expensive to evaluate since they are less sparse. A potential improvement would be to determine the sparsity patterns of the problem's functions and use them to fill the Jacobian / Hessian patterns, instead of putting full dense blocks. Oh, I should try a BitVector for Vs :D
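A simplified sketch of this block approach in the (Is, Js, Vs) format mentioned above; the names, dimensions and index ranges are illustrative, not the actual `DOCP_Jacobian_pattern` code:

```julia
using SparseArrays

# mark a dense rectangular block of the pattern as structurally nonzero
function add_block!(Is, Js, Vs, rows, cols)
    for j in cols, i in rows
        push!(Is, i); push!(Js, j); push!(Vs, true)
    end
    return nothing
end

# toy dimensions: 4 constraints, 6 variables, with two coupled blocks
ncon, nvar = 4, 6
Is, Js, Vs = Int[], Int[], Bool[]
add_block!(Is, Js, Vs, 1:2, 1:3)   # e.g. state equation block at one step
add_block!(Is, Js, Vs, 3:4, 3:6)   # e.g. block coupling to the next step

# Bool pattern passed on for coloring; `|` merges any duplicate entries
J_pattern = sparse(Is, Js, Vs, ncon, nvar, |)
```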
Seems to be fine now :-)
I think we can do a bit more testing, but basically yes, something like that for the moment. Doing midpoint after trapeze and IRK should not take long, I'll get started. The harder part will be to try to improve the manual mode, which currently has quite a few excess nonzero elements. A first direction is simply to use the finest possible granularity for the 'full' nnz blocks: for instance, in a Lagrange problem we add a final component to the state for the integral of the running cost, which is never passed to the original OCP functions. So we know that the derivatives wrt these particular variables are zero, however it can be a bit involved to 'skip' these. This is more involved in the IRK schemes, where the stage variables and equations may also include this additional component. Another example: the time steps h_i depend on v in the variable time case, but not in the fixed time case... Ideally we would add nnz blocks down to each individual variable present in each constraint / function. A second direction is to exploit the sparsity of the OCP-specific functions instead of assuming full nnz blocks. Here we would need to apply the AD to these functions and reinject the sub-patterns into the Jacobian / Hessian somehow. Probably the most effective, but it will require some work since I never tried that ;-)
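A sketch of the 'finer granularity' point about the Lagrange cost component, with a hypothetical per-step layout (not the actual CTDirect indexing):

```julia
# hypothetical layout at grid point i: [x_i (n components); running-cost integral; u_i (m components)]
n, m = 3, 2
n_aug = n + 1                                  # state augmented with the running-cost integral

step_offset(i)  = (i - 1) * (n_aug + m)
x_cols(i)       = step_offset(i) .+ (1:n_aug)  # full augmented state block
x_cols_ocp(i)   = step_offset(i) .+ (1:n)      # what the original OCP functions actually see

# a dense block for an OCP constraint at step i should use x_cols_ocp(i):
# the column of the cost component can stay structurally zero.
```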
Did anyone do something? @ocots, are you aware of that?
@PierreMartinon @ocots Some breakage tests seem to pass, some do not. Looks like there is a new error there: any clue?
Don't know, but it probably does not work with v4 without changing the code of the action.
@PierreMartinon Ready to merge in spite of the (broken 😬) breakages above? If yes, this should close #183.
The Breakage task only checks if […]. The actual tests for […]. After this one is done I'll start preparing release 0.13.1, as well as the update for the tutorials / docs in […].

Providing the block structure to ADNLPModels leads to less sparse matrices, since there is no analysis of the problem-specific functions, but it is faster and requires less memory. This method scales better than the forward/reverse optimized backend, and appears to be faster for large enough problems.
See test/docs/AD_backend.md