Implementation: prefer traits approach or fixed structures approach #10

douweschulte · 2023-11-03T11:42:00Z

douweschulte
Nov 3, 2023
Maintainer

For reference, here is my post from the discussion in [mzcore#1]:

So, if I may summarise, the consensus above is to use a standard library memory layout (u8, &str) for the basic building blocks (AminoAcid, Peptide) and to provide all basic properties through shared traits (AminoAcid, Peptide). Other authors could easily extend these by implementing the traits for their own types, which makes it really easy to build custom versions of these basic types.

My implementation
I organised my own code around a fixed memory representation for the basic types (an enum for AminoAcid and quite an intense struct for Peptide), and I wrote all my functions to work on these types. This means it is impossible to define custom amino acids (except as modifications of existing ones), but I have never needed them in my work. For peptides I would be quite cautious about using a standard library memory layout, especially &str, because it adds a lot of complexity around handling unusual Unicode characters; something like [u8] is a bit more sensible. Either way, such a layout makes it really easy to write incorrect amino acid codes, and preventing that kind of mistake is a major reason I use Rust. I saw somewhere in the feature list that ProForma support is a goal, which means we will have to build support for a lot of very 'weird' things in spectra (global isotope modifications, ambiguous modifications, chimeric spectra, glycans, cross-linking, other charge carriers), and handling those nicely steered me towards creating my own memory representation.
To consolidate a bit: my goal for rustyms was to represent anything that can be written in ProForma, as this is an exhaustive specification for my use cases. So I have never thought about the case for user-defined amino acids.

Generic traits
If, for extensibility, memory layouts other than the standard representation used in mzcore are to be supported through traits, I would prefer a set of traits that minimises the number of tasks any single trait has to do. That would give you something like:

HasMass: mass + average weight
HasPKa
LinearSequence: any linear sequence of elements which have a defined mass; can contain all fragmentation code for peptides, RNA, DNA, or whatever you can dream of that fragments (essentially impl Iterator<Item = impl HasMass>)
ComplexSequence: a more complex sequence like branched / cyclic peptides; could also contain glycan branches for glycan fragmentation (essentially a graph type where the element implements HasMass)

If extensibility is a primary design goal, such a set of composable traits is preferable to a design with fewer traits where each trait has to support so many separate methods that implementing it on your own becomes pretty much impossible.
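To make the idea of small, composable traits concrete, here is a minimal sketch. It assumes a trimmed-down HasMass with a single method; the Residue type, the sequence_mass function, and the handful of residue masses are illustrative only and not part of mzcore or rustyms.

```rust
/// Anything that has a defined monoisotopic mass (trimmed-down, hypothetical version of `HasMass`).
trait HasMass {
    fn monoisotopic_mass(&self) -> f64;
}

/// Hypothetical user-defined element: a residue stored as a single ASCII byte.
struct Residue(u8);

impl HasMass for Residue {
    fn monoisotopic_mass(&self) -> f64 {
        match self.0 {
            b'G' => 57.02146,
            b'A' => 71.03711,
            b'S' => 87.03203,
            // Unknown code: a real implementation would report an error instead.
            _ => 0.0,
        }
    }
}

/// Works on any linear sequence whose items have a mass,
/// regardless of how the sequence itself is stored.
fn sequence_mass<I>(sequence: I) -> f64
where
    I: IntoIterator,
    I::Item: HasMass,
{
    sequence
        .into_iter()
        .map(|element| element.monoisotopic_mass())
        .sum()
}
```

Someone who only needs masses implements the one trait and gets sequence_mass for free, without having to supply pKa values or formulas.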

Differences
The code for any design that uses traits will, however, be quite a bit harder to develop because there are many more 'moving parts'. To a casual user both designs (defined memory layout, or extensive use of traits) will look identical, as they would only use the defined functions on the predefined standard types. The downside of fixed memory layouts is mainly memory usage: if a Peptide struct always takes 64 bytes, for example, to support features that you do not need in your case, while you have a layout that does the job in 2 bytes (&[u8]), you could end up with higher memory usage than strictly needed. I have not seen this be a problem, though. For performance there should be no difference between a fully featured struct and a trimmed-down one, apart from the time it takes to copy the extra bytes, as long as we are not talking about SIMD, where specific memory layouts can be very beneficial.
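The memory trade-off can be checked directly with std::mem::size_of. The sketch below uses made-up types (AminoAcid with three variants, RichElement with two optional fields) purely to illustrate the point; the real structs in rustyms or mzcore will have different fields and sizes.

```rust
use std::mem::size_of;

/// A fixed-layout amino acid: a fieldless enum with few variants fits in one byte.
#[allow(dead_code)]
#[derive(Clone, Copy)]
enum AminoAcid {
    Alanine,
    Glycine,
    Serine,
}

/// Hypothetical "fully featured" sequence element that carries data
/// many users would never fill in.
#[allow(dead_code)]
struct RichElement {
    amino_acid: AminoAcid,
    modification_mass: Option<f64>,
    ambiguous_group: Option<u32>,
}

/// Returns the sizes in bytes of (enum, raw byte, rich element).
fn sizes() -> (usize, usize, usize) {
    (
        size_of::<AminoAcid>(),
        size_of::<u8>(),
        size_of::<RichElement>(),
    )
}
```

The enum costs no more than a raw u8, while the richer element is several times larger per residue; whether that matters depends on how many of them are alive at once.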

Wrap up
To me the question boils down to: is the proposed extensibility needed? My answer would be no, as long as the provided implementation works for any peptide that is valid in ProForma. But I would say if extensibility is deemed to be necessary, then go all in and make it as generic as possible (and reflect this in the used names).

Very generic traits example
As I had not tried a potential traits setup before, I tried it out now. Below is my attempt at a set of traits that I think can encapsulate all the features I had to build to support ProForma entirely. I tried to define the traits as generally as possible, which allows us to write the (hard!) fragmentation for branched and internal fragments once and apply it to everything we want (like branched peptides, glycans, and whatever people tend to throw on mass specs nowadays). Simpler traits could potentially be defined that do not have branched structures, or multiple 'masses' per element, and these could then use speedier code for fragment generation. But that would be 'just' a slimmed-down version of the big traits, and where possible I would vote against making multiple traits and instead focus our efforts on making the generic case as fast as possible.


type Mass = f64;
type PKa = f64;
type Formula = usize; // Stub

trait HasMass {
    fn monoisotopic_mass(&self) -> Mass;
    fn average_weight(&self) -> Mass;
}

trait HasPKa {
    fn pka(&self) -> PKa;
}

trait HasFormula {
    fn formula(&self) -> Formula;
}

trait TryFormula {
    fn try_formula(&self) -> Option<Formula>;
}

impl<T: HasFormula> TryFormula for T {
    fn try_formula(&self) -> Option<Formula> {
        Some(self.formula())
    }
}

/// The position of a link that can be broken by [`HasFragments`]
struct Position {
    /// The inner depth, so the number of elements from the root element
    pub inner_depth: usize,
    /// The outer depth, the maximal number of elements down from this element
    pub outer_depth: usize,
    /// Which branches were taken on the path from the root element to this element
    pub branches: Vec<usize>,
}

/// This gives you any tree shaped structure of the given element type.
/// - [`Element`] - The basic element of this structure, this could be amino acids, monosaccharides, or any other type you want
/// - [`Size`] - The size kind you want to accumulate for your structure, this could be mass, average weight, molecular formula, or any other type you want
/// - [`Label`] - The label you want to provide for different possible sizes your element can have, this could be used to track where ambiguous modifications are placed for the final result, or to track the usage of ambiguous amino acids
trait HasFragments<Element, Size, Label>
where
    Element: HasFragments<Element, Size, Label>,
    Size: std::ops::Add<Output = Size> + Clone,
    Label: std::ops::Add<Output = Label> + Clone,
{
    /// The [`Size`] for this element
    fn size(&self) -> &[(Size, Label)];

    /// Get the next elements, they have to be sorted on your own metric
    fn next_elements(&self) -> &[Element];

    /// Gets all the fragments which can be made by a single link breaking.
    /// The result is for each broken bond the [`Size`] and [`Label`] from both sides of the broken link.
    fn single_fragments(&self) -> Vec<(Position, Size, Size, Label)> {
        todo!(); // Should be implemented on the trait
    }

    /// Gets all internal fragments, with their [`Size`] and [`Label`]
    fn internal_fragments(&self) -> Vec<(Vec<Position>, Size, Label)> {
        todo!(); // Should be implemented on the trait
    }
}

/// Here is how to apply neutral losses, just do a [`flat_map`] over the generated fragments
fn neutral_losses<Size: std::ops::Add<Output = Size> + Clone>(
    size: Size,
    losses: &[Size],
) -> Vec<Size> {
    losses
        .iter()
        .map(|loss| size.clone() + loss.clone())
        .collect()
}

/// Here is how to apply a global isotope modification, of course the internals are stubbed as is the actual modification, but I believe this would be quite usable
fn global_modification(formula: Formula, modification: ()) -> Formula {
    todo!();
}

trait HasPeaks<Size> {
    /// Give the peaks of this 'spectrum'
    fn peaks_list(&self) -> &[Size];

    /// Do the annotation of the peaks, give the pairs back (index into peaks list, index into fragments list)
    fn annotation(
        &self,
        fragments: &[Size],
        in_bounds: impl Fn(Size, Size) -> bool,
    ) -> Vec<(usize, usize)> {
        todo!(); // Should be implemented on the trait
    }
}

Ping @lazear, you will likely be interested and this is maybe easier to find than the slack.

What implementation should we aim for

Predefined memory layout for all basic structures (enum AA, struct Peptide, ...)

0%

Basic traits to decouple memory layout from the calculations (trait AA, trait Peptide, ...)

50%

Generic traits to decouple memory layout and the general concepts (trait HasFragments)

50%

2 votes

david-bouyssie · 2023-11-03T13:18:44Z

david-bouyssie
Nov 3, 2023
Maintainer

Ideally I would like to check the three proposed solutions:

fine grained traits serving a given purpose (so with composability in mind)
traits that aggregates several composable traits in order to propose fully defined types
reference Structs with predefined memory layout that could serve as built-in types for Rusteomics but could eventually be extended/replaced in Rusteomics libraries dependent projects

2 replies

douweschulte Nov 3, 2023
Maintainer Author

You mean to have all of the above in the final implementation? That to me is the way forward if we go for the most generic option. I feel that choosing any level on the scale of increasing genericness will require us to also provide the genericness levels below it (so picking the most generic traits as the implementation means that we have to provide more overview traits and a predefined memory layout in rusteomics). So if you mean it in that way I fully agree.

david-bouyssie Nov 3, 2023
Maintainer

Well, I hope we understand the same thing: having a mix of generic types and corresponding concrete types.
I will let other people vote before a final decision.

mobiusklein · 2023-11-04T17:08:05Z

mobiusklein
Nov 4, 2023
Collaborator

Keep in mind the trade-offs of generics in Rust. This also applies to traits when used as generic bounds, which is often preferable to dyn Trait or Box<dyn Trait> to avoid vtable pointers and enable inlining.

Increased compilation times. Rust takes longer to compile than C++ during development, and plastering generics all over the codebase increases that cost. You can only duel with swords for so long before your arms get tired or you fall off a chair and sprain something https://xkcd.com/303/.
No specialization. Rust is still pretty young as programming languages go, and its type system lacks features like template specialization that help you escape from unexpected conflicts in your API. You might find it is impossible to represent an intersection between two traits without violating a third trait unless you then factor that third trait apart. dtolnay specialization doesn't work with the type system so much as abuse a convenience feature in it, and other tricks either work through fragile narrow gaps or hard-to-reason-about quirks rather than real features of the type system. See also: https://www.johndcook.com/blog/2009/07/27/baklav-code/.
Multiplexing generics makes development harder as you need to keep repeating your combinations of generics on every impl block, and only once you go to use your generic code will the compiler let you know that your intended template parameters don't satisfy certain requirements. It's also harder to read, which means it will be trickier to fix that kind of problem.
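As a small illustration of the generic-bound versus dyn Trait trade-off mentioned above, consider the following sketch. The HasMass trait and the Glycine / Alanine unit structs are stand-ins invented for this example.

```rust
trait HasMass {
    fn monoisotopic_mass(&self) -> f64;
}

struct Glycine;
struct Alanine;

impl HasMass for Glycine {
    fn monoisotopic_mass(&self) -> f64 {
        57.02146
    }
}

impl HasMass for Alanine {
    fn monoisotopic_mass(&self) -> f64 {
        71.03711
    }
}

/// Static dispatch: one copy of this function is compiled per concrete `T`,
/// so calls can be inlined, but every item in the slice must have the same type.
fn total_static<T: HasMass>(items: &[T]) -> f64 {
    items.iter().map(|item| item.monoisotopic_mass()).sum()
}

/// Dynamic dispatch: compiled once, accepts mixed types,
/// but every call goes through a vtable pointer.
fn total_dynamic(items: &[Box<dyn HasMass>]) -> f64 {
    items.iter().map(|item| item.monoisotopic_mass()).sum()
}
```

The generic version costs compile time (one instantiation per type used) in exchange for inlining; the boxed version compiles once but pays for the indirection and the allocation on every element.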

I like the ultra-generic designs though, because they are high art.

1 reply

douweschulte Nov 5, 2023
Maintainer Author

I agree with your points. Let's see if we can avoid the use of Box<...>. As for compilation times, I agree that leaning more on the type system increases them, but I do not think we will get anywhere near actually being bitten by this if we keep our traits comparatively simple and limited, which also touches on your point 3. As for specialization, I have hit this before as well, but mostly in cases like impl X for impl Iterator<Item = impl Trait>, which never proved a roadblock for any project I have worked on.

douweschulte · 2023-11-05T12:30:12Z

douweschulte
Nov 5, 2023
Maintainer Author

Over the last few days I have been thinking about this a bit more, and I have warmed up to the use of traits. I would propose to use something similar to the traits I listed in the original discussion, but otherwise to keep the use of generics reasonable, as @mobiusklein commented above.

My reasoning for the quite intense generics in the traits I posted above is as follows:

Generic over the output type, because letting the type system express whether you want a single f64 as the mass of each fragment or a full MolecularFormula allows each of us to make our own trade-off between speed and memory usage on one side and data richness on the other. Most importantly, having this as a generic type allows the compiler to generate optimal code for both use cases while we only have to write the logic once.
Generic over the label, because that is something I want to use to keep track of ambiguous masses for amino acids and modifications, while providing a zero-sized type allows anyone to use the code without any overhead from the labelling.
Generic over the element, because that way we can use the exact same logic on peptides and glycans (both of which we need for full ProForma support).
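The zero-overhead claim for a zero-sized label can be sketched as follows. The accumulate function and the NoLabel type are hypothetical names for this example; the bounds mirror the Add + Clone bounds used on Size and Label in the traits posted earlier.

```rust
use std::ops::Add;

/// Sum the sizes and labels of a run of elements, generic over both types.
/// Returns `None` for an empty run.
fn accumulate<Size, Label>(parts: &[(Size, Label)]) -> Option<(Size, Label)>
where
    Size: Add<Output = Size> + Clone,
    Label: Add<Output = Label> + Clone,
{
    let mut iter = parts.iter().cloned();
    let first = iter.next()?;
    Some(iter.fold(first, |(size, label), (next_size, next_label)| {
        (size + next_size, label + next_label)
    }))
}

/// Zero-sized label for callers that do not track ambiguity.
#[derive(Clone, Copy, Debug, PartialEq)]
struct NoLabel;

impl Add for NoLabel {
    type Output = NoLabel;
    fn add(self, _other: NoLabel) -> NoLabel {
        NoLabel
    }
}
```

With NoLabel the tuple (f64, NoLabel) occupies the same space as a bare f64, so the labelling machinery disappears at compile time; swapping in a real label type (here any type implementing Add, such as u32) reuses the same logic.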

Note: for ion series I envisioned to use a post processor on the output from the written trait that generates all feasible ions given the method based on offsets from the given possible links broken. The same post processing method can be applied for neutral losses and charge states as well.
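A post-processing step of this kind could look roughly like the sketch below, which expands plain f64 fragment masses with neutral losses using flat_map. The function name with_neutral_losses and the convention of passing losses as negative offsets are assumptions made for this example.

```rust
/// Expand every fragment mass with each neutral loss, keeping the intact fragment as well.
/// Losses are given as (negative) mass offsets, e.g. -18.010565 for water.
fn with_neutral_losses(fragments: &[f64], losses: &[f64]) -> Vec<f64> {
    fragments
        .iter()
        .flat_map(|&fragment| {
            // First the unmodified fragment, then one entry per loss.
            std::iter::once(fragment).chain(losses.iter().map(move |&loss| fragment + loss))
        })
        .collect()
}
```

The same shape works for ion series offsets or charge states: each is a function from one fragment to several derived values, applied after the structural fragmentation has been done once.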

Additional traits that are needed:

AminoAcid
LinearFragments (naming??), which is a version of the complex fragmentation for linear peptides where no ambiguity is allowed. This is to provide a faster implementation if necessary, but also to provide a type system bound on the use of the very complex features of ProForma, which are not always sensible to use and would be tiring to check for all the time.
MonoSaccharide
Element?
MolecularFormula?

PS: Implementation note
I propose to use Cow as the return type for many of the trait methods that are now written as returning &[...], because this also allows an implementation to do an allocation on the spot without too much hassle.
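A minimal sketch of the Cow idea, assuming a hypothetical HasSizes trait with two implementors: one that already stores its masses and can lend them out, and one that stores only byte codes and has to compute the masses on demand.

```rust
use std::borrow::Cow;

/// Hypothetical trait whose method may either borrow or allocate.
trait HasSizes {
    fn sizes(&self) -> Cow<'_, [f64]>;
}

/// Stores the masses directly, so it can hand out a borrowed slice.
struct Stored {
    masses: Vec<f64>,
}

impl HasSizes for Stored {
    fn sizes(&self) -> Cow<'_, [f64]> {
        Cow::Borrowed(self.masses.as_slice())
    }
}

/// Stores only one-letter codes, so it has to allocate the masses on the spot.
struct Compact {
    codes: Vec<u8>,
}

impl HasSizes for Compact {
    fn sizes(&self) -> Cow<'_, [f64]> {
        Cow::Owned(
            self.codes
                .iter()
                .map(|code| match *code {
                    b'G' => 57.02146,
                    b'A' => 71.03711,
                    // Unknown code: a real implementation would report an error instead.
                    _ => 0.0,
                })
                .collect(),
        )
    }
}
```

Callers treat both results as a slice through deref, so the choice between borrowing and allocating stays an implementation detail of each type.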

0 replies

lazear · 2023-11-05T17:29:05Z

lazear
Nov 5, 2023
Collaborator

I think there are merits to both approaches (concrete-only vs traits + providing concrete types). In any other language, I would suggest having only the concrete types and being done with it, but since this is Rust, there might be a substantial portion of users that care deeply about performance/memory use. Joshua's concerns are completely merited, and tbh I tend to avoid the use of traits, unless absolutely necessary, because of their complexity. I suggested traits because I wasn't a huge fan of the initial proposed API with 64-byte AminoAcids and AminoAcidTables everywhere. If there is no need to support extensibility on the concrete type level (e.g. defining a new AA), then there is less need for traits.

That being said, I see traits as a nice way to handle interop between different Rust programs and the rusteomics ecosystem. Obviously, there are not many public MS/Rust projects yet, but I expect we will see more soon. If we can make it so that rusty-ms, mzdata and Sage can all plug-in by just defining a couple traits on top of their existing concrete types, that seems like an easy win. Of course, the same thing could be achieved by converting to rusteomics types as needed...

I like the high-art composability of traits - very Haskell-ish. Two considerations:

A composable set of traits like Mass, Fragments etc seems very nice in terms of expanding support to non-peptide based MS, e.g. metabolomics or small molecules. Is this a goal for the project, or do we just want to focus on proteomics? If we just want to focus on proteomics, and we don't care about supporting user-defined AAs, do we need all of the trait machinery or can we just write some reasonably-efficient concrete types?
A larger set of composable traits = a higher barrier to interop. If I have to define impl AminoAcid for T that's one thing, but having to impl HasMass, HasPKa, etc. is another. I am just suggesting that we don't go too overboard :)

5 replies

douweschulte Nov 6, 2023
Maintainer Author

Nice points!

I would advise against making small molecules a goal we should strive for, but if the traits we end up with also work for these fields then that is just perfect. On the point about needing the traits: using traits we can be generic over memory layout, and for SIMD operations it can be very beneficial to have a 'weird' layout. As you said, this way we can also have a very small memory footprint if necessary, or a highly complex design, depending on the problem. Also, I had to write the fragmentation code multiple times, for both peptides and glycans, and I would love to only have to write that once: it is conceptually quite complex, and having it twice just makes it harder to keep both in sync, while conceptually the problem is exactly the same.
Yes, I agree. That is something we need to balance every time we propose a new trait. The other side of the balance is that having to define all properties of an amino acid is a higher barrier to interop if only one property is needed for the calculations the user actually wants.

Also yes I like Haskell so that is maybe where that comes from ;-). I moved to Rust after having used and loved Haskell for a while but not daring to use it in 'actual' products for many different reasons.

For the interop between different crates in the ecosystem I really like your point. Of course implementing type conversions would be fine, but I think traits are conceptually a nicer solution as you pointed out.

david-bouyssie Nov 6, 2023
Maintainer

Regarding the large data structures, I don't think it's a big deal if they are used in a "static" way. I mean it totally depends on the number of times they are instantiated. If they are static-like data (used as references to structures instantiated only once), then I guess we are safe on the performance side, whatever the size of the underlying struct. This is the case for full amino acid definitions, for instance.
I tried to add, on top of the struct, another data structure (AminoAcidTable) that holds a distinct list of AminoAcid definitions (mapped by their single-letter code, with an assertion that aa_by_code1.len() == aa_vec.len()). This AminoAcidTable "guarantees" that amino acid structs are instantiated only once within the library. If there are other instances in memory, that is the end user's responsibility, but it should not impact the functions/methods defined in rusteomics libraries.
@lazear has some reasonable concerns regarding the way AminoAcidTable is currently being used, and this can definitely be improved.

I'll try to come up with a modified version of the current mzcore PR (rusteomics/mzcore#1), that incorporates new traits on top of existing data structures.
I'm sure we can develop an API that is both efficient and flexible, without adding too much complexity.

douweschulte Nov 6, 2023
Maintainer Author

An additional comment about data structures being instantiated once is that OnceCell and OnceLock have just been stabilised in Rust. These provide a nice interface for having a data structure instantiated once easily (replaces the widely used lazy_static! crate). I used them extensively for these kinds of global data in rustyms, for ontologies, but also for predefined lists of monosaccharides and other related stuff.
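For readers who have not used it yet, a minimal sketch of the pattern with std::sync::OnceLock (stable since Rust 1.70) is shown below. The residue_masses function and its three entries are illustrative only and are not taken from rustyms.

```rust
use std::collections::HashMap;
use std::sync::OnceLock;

/// Returns a table of residue masses that is built on first use
/// and shared by every later caller.
fn residue_masses() -> &'static HashMap<u8, f64> {
    static TABLE: OnceLock<HashMap<u8, f64>> = OnceLock::new();
    TABLE.get_or_init(|| {
        // Runs at most once, even if several threads race to initialise it.
        HashMap::from([(b'G', 57.02146), (b'A', 71.03711), (b'S', 87.03203)])
    })
}
```

Every call returns a reference to the same table, so large definitions are instantiated once without any external crate or macro.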

david-bouyssie Nov 7, 2023
Maintainer

Good to know, thanks!
Found this online article on this topic:
https://betterprogramming.pub/introducing-oncecell-and-oncelock-the-new-buddies-in-rust-1-70-0-229cd94e4ae2

david-bouyssie Nov 8, 2023
Maintainer

I made some progress on this point but need more time to polish the proposed solution.
At the moment I think we could mix different patterns.
Some types may need extensibility/composability and some others may not.
The most important point, I guess, is to be able to anticipate that for the whole API.
I would go for extensible amino acids and amino acid sequences, for instance. This would allow different representations of them while providing similar computations. But I would stick to concrete types for atoms/elements and other things like Peptide and Spectrum, although multiple data structures could be provided there to cover different flavors (linear vs complex peptide, spectrum data precision specialization, etc.).

I also tried to abstract what I called the atom and AA tables (called CELL in rustyms). I introduced the concept of an AminoAcidFactory, which can be implemented in different ways.
Implementations of those factories have to be extended with the computations related to atoms / amino acid entities. This means the tables are now accessed through &self instead of being passed as an additional parameter to static function calls.
We could eventually move away from the current customisable definition of amino acids if we want to provide an even simpler API. But in my opinion it's a good trade-off.

david-bouyssie · 2023-11-09T23:04:14Z

david-bouyssie
Nov 9, 2023
Maintainer

Just committed my changes:
david-bouyssie/mzcore@4b79f58

Will do a PR tomorrow

0 replies

david-bouyssie · 2023-11-10T12:53:49Z

david-bouyssie
Nov 10, 2023
Maintainer

The PR #2 is now flying.

In this PR I tried to provide new layers of abstraction using traits, but I also tried to keep a certain degree of flexibility for defining custom amino acids.
This allows loading definitions of amino acids from external sources (no IO support is provided here; I think it would be better to add some kind of utility to mzio).
This can also be useful for some specific use cases. For instance, one may want to change the amino acid masses to match nitrogen-15 heavy labeling. This could be done in different ways of course, but a straightforward way using the current API is to define a custom AminoAcidTable, where the masses of amino acids are recomputed based on an N14->N15 replacement at the monoisotopic level.
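The N14->N15 recomputation can be sketched independently of the PR's actual types. The example below uses a plain HashMap in place of AminoAcidTable, and the names light_table, heavy_table and N15_SHIFT are invented for illustration; the mass shift is the approximate difference between the 15N and 14N isotope masses.

```rust
use std::collections::HashMap;

/// Approximate mass difference between 15N and 14N in Da (15.000109 - 14.003074).
const N15_SHIFT: f64 = 0.997035;

/// Hypothetical light table: code -> (monoisotopic residue mass, number of nitrogen atoms).
fn light_table() -> HashMap<u8, (f64, u32)> {
    HashMap::from([
        (b'G', (57.02146, 1)),
        (b'A', (71.03711, 1)),
        (b'K', (128.09496, 2)),
    ])
}

/// Build a heavy table by shifting every residue by its nitrogen count.
fn heavy_table(light: &HashMap<u8, (f64, u32)>) -> HashMap<u8, (f64, u32)> {
    light
        .iter()
        .map(|(&code, &(mass, nitrogens))| {
            (code, (mass + f64::from(nitrogens) * N15_SHIFT, nitrogens))
        })
        .collect()
}
```

Because the heavy table has the same shape as the light one, any mass computation written against the table works unchanged for labeled samples.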

I also tried to provide an API which can hide all this complexity; for an example see https://github.com/david-bouyssie/mzcore/blob/b63e27993ee7276ac48928580f6029701c111050/mzcore-rs/src/ms/mass_calc.rs#L112

Thus, this should work (with the proper imports):

let mono_mass_from_seq = "INTERSTELLAR".amino_acids_as_bytes().fold(0.0, |mass_sum, aa_byte| mass_sum + aa_byte.mono_mass()) + WATER_MONO_MASS;

I hope this new API is a good baseline for a consensus.

Regarding ProForma, since this is a large topic I would prefer to treat it in a specific branch (and maybe a specific crate feature).
Also I think we should open a "Feature request poll".
I think we need to decide whether Rusteomics should support ProForma and, if yes, whether we support only writing to ProForma or both reading from and writing to ProForma. I think these are two different use cases.
@douweschulte since you have more experience than me about ProForma, could you please open this discussion?

1 reply

douweschulte Nov 14, 2023
Maintainer Author

I started the ProForma discussion in #13. I will look into the PR in #2.

david-bouyssie · 2023-11-11T14:57:46Z

david-bouyssie
Nov 11, 2023
Maintainer

@mobiusklein I just discovered that you developed this other related crate:
https://github.com/mobiusklein/chemical_element

There is a strong overlap between this crate, rustyms and now rusteomics.
I hope we can converge to a common implementation.
Your input will be very valuable for the new PR #2.

This PR is heavily inspired by rustyms, but I could also have a deeper look at your own implementation to see if we are missing anything.

Regarding deisotoping, I was thinking about putting this within the ms module of mzcore, and eventually making it an optional mzcore feature.
I think mzcore should maximize modularity by trying to split its distinct functionalities.

I'm looking forward to it; I think we can make quick progress if we manage to work together :)

0 replies
