Enabling of `MDAnalysis.analysis.align.AverageStructure` parallelization by talagayev · Pull Request #4738 · MDAnalysis/mdanalysis

talagayev · 2024-10-18T18:16:26Z

Fixes #4659 attempt

Changes made in this Pull Request:

added backends and aggregators to AlignTraj and AverageStructure in analysis.align.
added the client_AlignTraj and client_AverageStructure in conftest.py
added client_AlignTraj and client_AverageStructure in run() in test_align.py

Currently AlignTraj only accepts the serial and dask backends; with multiprocessing the pytest runs take forever. An additional error that appears is the following:

OSError: File opened in mode: self.mode. Reading only allow in mode "r"

For AverageStructure the failure that appears is the following:

AttributeError: 'numpy.ndarray' object has no attribute 'load_new'

This leads me to believe that AverageStructure cannot be parallelized, but I would need additional opinions on it and on AlignTraj :)

PR Checklist

Tests?
Docs?
CHANGELOG updated?
Issue raised/referenced?

Developers certificate of origin

I certify that this contribution is covered by the LGPLv2.1+ license as defined in our LICENSE and adheres to the Developer Certificate of Origin.

📚 Documentation preview 📚: https://mdanalysis--4738.org.readthedocs.build/en/4738/

pep8speaks · 2024-10-18T18:16:33Z

Hello @talagayev! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

In the file testsuite/MDAnalysisTests/analysis/test_align.py:

Line 310:80: E501 line too long (86 > 79 characters)
Line 327:80: E501 line too long (87 > 79 characters)
Line 333:80: E501 line too long (97 > 79 characters)
Line 357:80: E501 line too long (85 > 79 characters)
Line 377:80: E501 line too long (91 > 79 characters)

Comment last updated at 2025-01-11 21:40:18 UTC

codecov · 2025-01-11T20:06:13Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 92.74%. Comparing base (24548e6) to head (ee8d845).

Additional details and impacted files

@@           Coverage Diff            @@
##           develop    #4738   +/-   ##
========================================
  Coverage    92.73%   92.74%           
========================================
  Files          180      180           
  Lines        22475    22491   +16     
  Branches      3190     3191    +1     
========================================
+ Hits         20842    20859   +17     
  Misses        1176     1176           
+ Partials       457      456    -1

talagayev · 2025-11-26T22:12:44Z

@marinegor I would ping you in this PR as well.

Here I tried different ways to see whether it is possible to parallelize the AverageStructure and AlignTraj classes.
I was able to implement the parallelization for AverageStructure. The first attempt did not work because a universe could not be read, so this error appeared:

AttributeError: 'numpy.ndarray' object has no attribute 'load_new'

I added _first to make the class parallelizable. This works well with the tests, except for the case in_memory=True, where the process takes very long, which I assume is connected to memory issues. For now I therefore revert to serial when in_memory=True.
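A minimal sketch of why an average fits the split-apply-combine scheme. This is not the code in this PR: `partial_sum`, `combine`, and the toy frames are hypothetical names used only for illustration, and the alignment step of the real class is left out. Each worker returns a coordinate sum and a frame count for its slice of frames, and the final step adds the partial results before dividing once.

```python
from typing import List, Sequence, Tuple

Frame = List[float]  # toy stand-in for one frame of flattened coordinates


def partial_sum(frames: Sequence[Frame]) -> Tuple[Frame, int]:
    """What one worker would compute: per-coordinate sum and frame count."""
    total = [0.0] * len(frames[0])
    for frame in frames:
        for i, value in enumerate(frame):
            total[i] += value
    return total, len(frames)


def combine(parts: Sequence[Tuple[Frame, int]]) -> Frame:
    """Aggregate the partial sums, then divide once by the total frame count."""
    n_frames = sum(count for _, count in parts)
    width = len(parts[0][0])
    summed = [sum(part[0][i] for part in parts) for i in range(width)]
    return [value / n_frames for value in summed]


frames = [[0.0, 2.0], [2.0, 4.0], [4.0, 6.0], [6.0, 8.0]]
# Split the frames into two uneven chunks, as two workers might receive them.
chunks = [frames[:1], frames[1:]]
average = combine([partial_sum(chunk) for chunk in chunks])
```

Carrying the frame count alongside the sum matters: averaging the per-chunk averages would give the wrong result whenever the chunks have different sizes.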

As for AlignTraj, because it transforms coordinates and writes out structures, more parts of the code would need to be rewritten to make it parallelizable, and I didn't find an easy solution that avoids bigger modifications of the class. So the question is whether we keep it as non-parallelizable or whether I should try to modify the code to make it parallelizable, which would require bigger modifications.

marinegor · 2025-11-26T23:30:22Z

@talagayev

except the case, when in_memory=True, then the process takes very long, which I assume is connected to memory issues

I think parallelization should not actually work with in_memory cases (@yuxuanzhuang please correct me if I'm wrong, afaik you've been working on this). Hence I'd explicitly raise an exception if the user asks for parallel execution and provides in_memory as well.

So there the question would be if we keep it as non parallelizable or should I try to modify the code to make it parallelizable, which would require bigger modifications?

If you're running out of ideas, I'd suggest making this PR for AverageStructure only and creating an appropriate issue for AlignTraj that describes your attempts so far.

Also, I imagine there are problems with serialization of self._writer, no? Perhaps we can chat on discord about it (I'm @marinegor there)?
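To illustrate the serialization concern in general terms, here is a small sketch that uses a plain binary file handle as a stand-in for a trajectory writer. It is an assumption that a writer holding an open handle behaves similarly; whether a given MDAnalysis writer defines its own pickling behavior is not covered here.

```python
import os
import pickle
import tempfile

# A plain binary file handle stands in for a trajectory writer here.
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "aligned.bin")
    with open(path, "wb") as handle:
        try:
            pickle.dumps(handle)
            picklable = True
        except TypeError as exc:
            # CPython refuses to pickle open file objects.
            picklable = False
            error_message = str(exc)
```

Backends that send the analysis object to worker processes have to serialize it first, so any attribute that holds an open handle needs special treatment before that can work.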

talagayev · 2025-11-27T01:03:46Z

Yes, makes sense. I think the only two analysis classes that currently use in_memory are AverageStructure and AlignTraj.

Yes, that would be good. I can rename the PR to cover only AverageStructure for now, add the missing parts (documentation and changelog), create an issue, and write to you on Discord so that we can brainstorm how to adjust the code to make it parallelizable. Yes, self._writer is one of the difficulties. I guess for AlignTraj you can adjust the code to give it the reference, but writing during parallelization is the tricky part. Maybe it could work with temporary information that is merged afterwards, or by doing only the calculations in parallel and moving the writing to conclude, basically keeping that part serial.

marinegor · 2025-11-27T21:44:40Z

package/MDAnalysis/analysis/align.py

+            if requested_backend not in (None, "serial"):
+                warnings.warn(
+                    "The in-memory parallel trajectory usage is inefficient "
+                    "and not supported. Falling back to serial.",
+                    RuntimeWarning,
+                )


I'm not in favor of a warning and would rather explicitly raise a ValueError because, well, how often do you switch off or ignore warnings? :)

talagayev · 2025-11-28

True, I adjusted it to raise a ValueError for that case and adjusted the test to cover the ValueError.
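The difference between the two options can be sketched with the standard library alone. `run_with_warning` and `run_with_error` are hypothetical stand-ins for the two variants discussed in this thread, not functions from MDAnalysis.

```python
import warnings


def run_with_warning(in_memory: bool, backend: str) -> str:
    """Hypothetical variant: warn and silently fall back to serial."""
    if in_memory and backend != "serial":
        warnings.warn(
            "in-memory parallel run not supported; using serial",
            RuntimeWarning,
        )
        return "serial"
    return backend


def run_with_error(in_memory: bool, backend: str) -> str:
    """Hypothetical variant: refuse the unsupported combination."""
    if in_memory and backend != "serial":
        raise ValueError(
            "in-memory parallel run not supported; use the serial backend"
        )
    return backend


with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # a common user setting
    # The fallback happens, but the user never sees the message.
    used_backend = run_with_warning(in_memory=True, backend="multiprocessing")

try:
    run_with_error(in_memory=True, backend="multiprocessing")
    raised = False
except ValueError:
    raised = True
```

With warnings filtered out, the first variant quietly runs on a different backend than the one requested, while the second cannot be overlooked.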

marinegor · 2025-11-27T21:45:42Z

@talagayev ok, will be waiting for your message.
I also assigned myself as a reviewer here, so just re-request review when you think you're done!

talagayev · 2025-11-28T13:03:43Z

Added the documentation and CHANGELOG entry and adjusted the code to raise a ValueError. The PR is ready to be re-reviewed :)

marinegor

@talagayev all looking good! I initially commented on one extra line but just realized I can remove it myself.

Also, may I ask you to create a separate issue for AlignTraj parallelization? Perhaps you could describe the issues you encountered there, and suggest the direction in which one should move to actually enable it.

testsuite/MDAnalysisTests/analysis/conftest.py

talagayev · 2025-12-15T17:28:03Z

Perfect :)

Yes, I can create an issue and write some ideas in there.

marinegor · 2025-12-15T18:49:53Z

/azp run

azure-pipelines · 2025-12-15T18:50:06Z

Azure Pipelines successfully started running 1 pipeline(s).

marinegor

@talagayev sorry for postponing that but I think these comments will make the code better :)

marinegor · 2025-12-15T20:10:12Z

package/MDAnalysis/analysis/align.py

+        if getattr(self, "_in_memory", False):
+            # We are in the in_memory case: always run serial.
+            if requested_backend not in (None, "serial"):
+                raise ValueError(
+                    "The in-memory parallel trajectory usage is not supported. Use serial backend instead.",
+                )
+            return super().run(
+                start=start, stop=stop, step=step, verbose=verbose
+            )
+        else:
+            if requested_backend is not None:
+                kwargs["backend"] = requested_backend
+            return super().run(
+                start=start, stop=stop, step=step, verbose=verbose, **kwargs
+            )


@talagayev sorry I didn't think about it earlier, but this actually feels a bit hacky to me, and I think I know why: run() isn't supposed to do any validation, it's performed by _configure_backend() method instead (docs). In your case, I'd patch it to be something like:

def _configure_backend(
    self,
    backend: Union[str, BackendBase],
    n_workers: int,
    unsupported_backend: bool = False,
) -> BackendBase:
    configured_backend = super()._configure_backend(
        backend=backend,
        n_workers=n_workers,
        unsupported_backend=unsupported_backend,
    )
    if (
        not isinstance(
            configured_backend, MDAnalysis.analysis.backends.BackendSerial
        )
        and self._in_memory
    ):
        raise ValueError('...')
    # Hand the validated backend back to the caller.
    return configured_backend

this way you don't have to patch run() with double-nested ifs, and generally write less code.

talagayev · 2025-12-15

Yes, it was a hacky way to work around the serial case. I agree that your approach looks cleaner and will try to adjust it in the upcoming days :)

talagayev · 2026-02-02

Adjusted now to use _configure_backend as suggested :)
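A self-contained sketch of the validation pattern suggested above. All classes here are hypothetical stand-ins so that the example runs without MDAnalysis installed; only the `_configure_backend(backend, n_workers, unsupported_backend)` signature is taken from the suggestion in this thread.

```python
class BackendSerial:
    """Stand-in for a serial backend (hypothetical, not the MDAnalysis class)."""


class BackendMultiprocessing:
    """Stand-in for a parallel backend (hypothetical)."""


class AnalysisBaseSketch:
    """Stand-in base class that maps a backend name to a backend object."""

    _backends = {
        "serial": BackendSerial,
        "multiprocessing": BackendMultiprocessing,
    }

    def _configure_backend(self, backend, n_workers, unsupported_backend=False):
        return self._backends[backend]()


class AverageStructureSketch(AnalysisBaseSketch):
    def __init__(self, in_memory=False):
        self._in_memory = in_memory

    def _configure_backend(self, backend, n_workers, unsupported_backend=False):
        configured = super()._configure_backend(
            backend=backend,
            n_workers=n_workers,
            unsupported_backend=unsupported_backend,
        )
        # Validate here so that run() needs no special-casing.
        if self._in_memory and not isinstance(configured, BackendSerial):
            raise ValueError("in_memory=True requires the serial backend")
        return configured


ok = AverageStructureSketch(in_memory=True)._configure_backend(
    "serial", n_workers=1
)
try:
    AverageStructureSketch(in_memory=True)._configure_backend(
        "multiprocessing", n_workers=2
    )
    rejected = False
except ValueError:
    rejected = True
```

Because the check lives in the backend configuration step, `run()` can stay untouched and every entry point that configures a backend gets the same validation.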

marinegor · 2025-12-16T11:16:59Z

@talagayev and regarding AlignTraj -- I'd just be bold and say that it's impossible to parallelize that with current split-and-combine technique, since that would require an ability to write an arbitrary frame with self._writer, and not only sequential writing. I don't think any writers allow that.

So even in this PR, set available_backends to only serial explicitly, and note that it's impossible to parallelize. And as for the next issue, I'd target adding an option to save aligned trajectory to self.results, and writing it in _conclude.
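The direction proposed for the follow-up issue can be sketched as follows, under loud assumptions: threads stand in for worker processes, a simple coordinate shift stands in for the actual alignment, and `shift_chunk` and `conclude` are hypothetical names. Workers return transformed frames keyed by frame index, and the serial final step puts them back into trajectory order so that a sequential writer could consume them.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Tuple

Frame = List[float]


def shift_chunk(chunk: List[Tuple[int, Frame]], offset: float) -> Dict[int, Frame]:
    """Worker step: transform frames and key each result by frame index."""
    return {index: [x - offset for x in frame] for index, frame in chunk}


def conclude(parts: List[Dict[int, Frame]]) -> List[Frame]:
    """Serial step: merge worker results and emit frames in trajectory order."""
    merged: Dict[int, Frame] = {}
    for part in parts:
        merged.update(part)
    # A sequential writer could consume this list frame by frame.
    return [merged[index] for index in sorted(merged)]


trajectory = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
indexed = list(enumerate(trajectory))
# Interleaved chunks, so the results arrive out of trajectory order.
chunks = [indexed[1::2], indexed[0::2]]

with ThreadPoolExecutor(max_workers=2) as pool:
    parts = list(pool.map(lambda chunk: shift_chunk(chunk, 1.0), chunks))

written = conclude(parts)
```

The trade-off of this design is memory: every transformed frame is held in the results until the final step runs, so it suits trajectories that fit in memory.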

orbeckst added Component-Analysis parallelization labels Mar 14, 2025

talagayev closed this Oct 15, 2025

talagayev force-pushed the align_parllel branch from c37add2 to 03eef45 Compare October 15, 2025 23:15

talagayev and others added 5 commits November 16, 2025 14:58

addition of parallelization to align.py

9152b99

addition of client_AverageStructure to conftest.py

ed792cb

Merge branch 'MDAnalysis:develop' into align_parllel

e4d622d

added parallelization to align.py

20420d2

added tests

75a6c11

talagayev reopened this Nov 23, 2025

talagayev added 3 commits November 23, 2025 20:15

added documentation

489703d

black formatting

a9b73e5

black format

90389bd

talagayev marked this pull request as ready for review November 26, 2025 22:03

talagayev and others added 3 commits November 27, 2025 21:22

Merge branch 'develop' into align_parllel

37df243

adjusted versionchanged and added Changelog entry

6afa3e1

black formatting

ec1cf4e

marinegor self-requested a review November 27, 2025 21:42

marinegor requested changes Nov 27, 2025

adjusted to ValueError and black formatting

5009b02

talagayev changed the title ~~'MDAnalysis.analysis.align' parallelization~~ Enabling of MDAnalysis.analysis.align.AverageStructure parallelization Nov 28, 2025

talagayev requested a review from marinegor November 28, 2025 13:02

marinegor requested changes Dec 14, 2025

Update testsuite/MDAnalysisTests/analysis/conftest.py

522b5f6

marinegor approved these changes Dec 14, 2025

marinegor reviewed Dec 14, 2025

apply formatting

fbb85ac

marinegor requested changes Dec 15, 2025

talagayev and others added 5 commits February 2, 2026 00:15

adjusted to configured_backend

5c93849

Merge branch 'align_parllel' of https://github.com/talagayev/mdanalysis…

66826bf

… into align_parllel

adde conftest comment

b2bb425

black formatting

e697ac4

Merge branch 'MDAnalysis:develop' into align_parllel

ee8d845

talagayev requested a review from marinegor February 2, 2026 01:28
