Skip to content

List.sort done in in-place array sort#1349

Open
talsewell wants to merge 37 commits intoCakeML:masterfrom
talsewell:array-heap-sort-cleanup
Open

List.sort done in in-place array sort#1349
talsewell wants to merge 37 commits intoCakeML:masterfrom
talsewell:array-heap-sort-cleanup

Conversation

@talsewell
Copy link
Copy Markdown
Contributor

This is an implementation of the remaining idea of issue #979 .

It introduces a list-of-heaps sort in basis/pure/heap_list_sortScript.sml, and a monadic variant of it that suffix-encodes the heaps into a single array in basis/monadic/heap_list_sort_monadicScript.sml. The latter theory includes a proof of equivalence, that the monadic computation produces the same results as the pure one.

The monadic definitions are then translated (by the monadic translator) in ListProg, and the translation of List.sort uses them. However the constant being translated is just the pure heap-list-sort, so it should EVAL OK.

Various monadic translator fixes, mostly related to type polymorphism, are included in these commits.

Yet to be discovered: Does anything else break in the build? Does this sort function actually perform OK?

The monadic translator already includes code to substitute types if the
supplied definition has a type variable for the state or exception type, e.g.
if the relevant function didn't make use of the state or exception value so it
was left fully polymorphic. Update this to also substitute type variables and
print a message if the relevant types match but are not identical, e.g. if the
state type has a polymorphic parameter that is not instantiated to the same
thing.
The existing monadic-translator interface code does a bit of guessing of
various constant names, and needs to set the translator use_full_type_names
setting to false to make the naming predictable enough. This makes the monadic
translator incompatible with various translation contexts.

Some of this code can be replaced by use of get_type_inv and other query
functions to fetch the relevant constants. This seems to allow the monadic
translator to run with use_full_type_names present, and doesn't seem to break
anything.
Handle a case where the updator for a polymorphic state field is too general.
This is a bit of a HOL4 specialty, where a record has a type variable (say 'a)
that appears in only one field, and an updator for that field is allowed to
change the type of the overall datatype by replacing the field.

There was already code for handling this in the monad translator, it just
seemed to miss a case. I found it by adding some more error diagnostic code
along the way, and I might leave it in.
The idea is to write a sort function with a O(n * log(n)) complexity,
which exposes a pure functional interface but which uses a mutable
array to actually do the rearrangement.

This pure file presents a heap-sort algorithm, which the mutable
array modelled as a function. Correctness is verified. The next step
is to translate or present some CakeML code that implements this
algorithm with a mutable array.
Convert the pure functional heap-sort representation to a monadic
representation compatible with the monadic translator (based on some
examples). Attempt to translate to CakeML AST, but blocked on various
errors.
It seems that the cause of recent issues is that the monadic translator is
incredibly fragile when it comes to type parameters if the state type needs to
be polymorphic.

Tweak things to check that the type variable is consistent. This code seems to
work with "tvar = ``: 'state``" and "tvar = ``: 'a``" but it does not seem to
work with "tvar = ``: 'el``", which is all a bit mysterious.
Switch out the standard heap-sort, in which the heap points in the opposite
direction, for a list-of-heaps sort in which power-of-two-minus-one sized heaps
point in the correct direction within the array. This has the advantage that an
already sorted array is confirmed sorted without moving anything and in linear
time. It also "agrees" with the functional model.

Problem though, the mapping of the multiple heaps into the array gets painful
to reason about. Saving before trying a different approach.
I gave up on setting up a separation-style approach and just verified
everything using TAKE/DROP. Some of the proofs are a bit of a mess.
Attempt to port the proof of simulation of the 979 heap-list-sort from a
"direct" version on the monadic semantics to a postcondition/hoare style
approach.

The big upgrade would be if we could do something separation-ish about the
array, since each worker works on a "focal chunk" of the array leaving the rest
unchanged, and it would be a huge upgrade if the disjointness of these could be
made more implicit.
After figuring out some issues with polymorphism, present a mechanism that
sort of works involving disjoint array chunk membership. It works, and the
most involved proof is now much faster and much more step-by-step. The other
proofs are getting clunkier though. Sigh. Time to try something else.
Finally, an approach I'm happy with. Committing then will clean this up.
After a lot of false steps, the process seems to work. The monadic constants
are defined and verified in the monadic sub-theory. The real monadic
translation is set up in ListProgScript, together with a bit of a hack for
rewriting the accessor constants from the sub-theory to the ones the translator
setup builds. The final "sort" function is derived from that, and can be
trivially proven to be the pure version (and thus a sorting function) in
ListProof.
Along the way, we seem to have made the monadic translator resistant to type
variables with names other than 'a or 'state.
Change around the translation. Prove the "monadic-sort = functional-sort"
theorem, and use its symmetric version as the definition to translate the
functional algorithm.

The point is that the translation and EVAL calculations will now both do
the correct thing with the one constant.
In the monadic translation, the state type represents all the stateful stuff
for the purposes of the monadic model. It doesn't directly exist as any type in
the CakeML translation, instead its fields are placed into reference and array
objects states and the FFI oracle and whatnot. A previous patch was confused
about this.

This patch restores the previous behaviour, including the pretty awful
re-parsing of strings to fetch constants and overloads that were defined
before, though puts a little defensive programming around some of it. Hopefully
fixes the breakage of the IO-based monadic translation examples.
Switch out the "sort" symbol of mllist to call the new heap-list-sort with
all the same key logical theorems exported.

This should cause the Candle kernel etc to call on the new sort and translate
correctly.
Oops. The linter really doesn't like that.
Follow-up to the previous adjustment, we need to make sure that mllist$sort
is translated in ListProg, not just something equal to it, so everything is
pointing at the one sort instance.
The bignum proof needs to know that "sort R [x] = [x]" for some reason. This
was done by supplying its definitional theorems. It's now derived from PERM
instead, which is likely to be more stable. It also occurs to me that EVAL
might be fairly safe.
This includes a variant of one of the sort workers that keeps a "fuel"
parameter to make termination obvious, which in turn means that translation to
CV can be done by auto-methods. I also tried a more direct approach, and it was
horrible, so I think the silly fuel parameter is being kept.
@talsewell talsewell force-pushed the array-heap-sort-cleanup branch from 77517e0 to 309e4ee Compare March 16, 2026 04:35
The Candle proofs include a pass across the translated CakeML AST to check for
problematic effects. The array-based sort complicates this, so instead, restore
the previous mergesort translation, call it from Candle, and tweak some proofs
to refer to the correct one.
Mostly duplicating the change made to the "standard" theories.
A merge-sort implementation that first scans for increasing/decreasing
runs, and merges them trying to merge even-size runs along roughly the
design of "timsort". May replace the standard mergesort.
This proof is still a bit involved, but less so than the heap version.

It probably helps that the data representation, in which the list chunks are
all placed in order, has less moving parts than the flattened-heap variant.
Exercise the same routine as for heap-list-sort, translating the sort
function to CakeML AST in a separate theory to the monadic development,
including a hack that re-defines the monadic elements.
The heap-list-sort development was done on the assumption that it would become
the basis List.sort function, which turns out to be premature. For now, move
both in-array sorts (heap-list and merge-run) to be examples. If the merge-run
sort turns out to be useful it may be moved back in.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant