fix found by Claude for IFS crash at the last step#281

Draft
antonl321 wants to merge 1 commit into ecmwf:develop from antonl321:leonardo-fix
@antonl321
Collaborator

Description

Tests run with ifs-raps ( https://git.ecmwf.int/projects/RAPS/repos/ifs-raps/commits/0f89cdfc489378534680077b3ed8564371419219#49r3.yml ), which uses multio v2.10.1,
on CINECA's Leonardo system ended with the error below:

11:38:06 STEP 383 H= 47:52 +CPU= 7.271
STEP 383 :## EC_MEMINFO 0 lrdn3467 13040 0 0 515 30102 720 53578 723 50196 1153 49540 1127 52488 7045 46414 1224 52934 1324 50280 400457 36626 0.0
0.0 0 0 112.31 Sm/p
terminate called after throwing an instance of 'std::_Nested_exception<multio::util::FailureAwareException>'
what(): FailureAware with behaviour "propagate" on Server for context: [
unknown
]
terminate called recursively
forrtl: error (76): Abort trap signal
Image PC Routine Line Source

Stack trace terminated abnormally.
terminate called after throwing an instance of 'std::_Nested_exception<multio::util::FailureAwareException>'
what(): FailureAware with behaviour "propagate" on Server for context: [
unknown
]
terminate called recursively
forrtl: error (76): Abort trap signal
Image PC Routine Line Source

Stack trace terminated abnormally.
terminate called after throwing an instance of 'std::_Nested_exception<multio::util::FailureAwareException>'
what(): FailureAware with behaviour "propagate" on Server for context: [
unknown
]

Claude code summary of changes & reasoning

The bug

End-of-run crash/hang in IFS+NEMO+WAM at tco1279l137/eORCA12 (step 192, h24). Three different surface symptoms (mlx5 transport-retry, SIGSEGV in libmultio.so, MPI hang in closeConnections) all had the same root cause:

At shutdown, a handful of MultIO control messages arrive with a malformed YAML payload. message::metadataFromYAML calls eckit::YAMLParser, which throws StreamParser::consume expecting ',', got ''. The exception travels up through FailureAware<Receiver/Dispatcher>::withFailureHandling. The default policy was Propagate → the server thread terminates → client MPI_Wait never completes → either hang, or the IB layer eventually reports retry-exceeded, or teardown segfaults.

The mlx5 "Transport retry count exceeded" was a red herring — peer-side death, not a fabric flap.
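The `std::_Nested_exception` type in the log suggests that failures are wrapped with `std::throw_with_nested` on their way up. As a minimal sketch of how such a chain can be unwound to reach the innermost `what()`, using only the standard library (the names `throwWrapped` and `flattenMessages` are illustrative and not part of MultIO):

```cpp
#include <cassert>
#include <exception>
#include <stdexcept>
#include <string>
#include <vector>

// Illustrative only: simulates a parser error being wrapped by an outer handler.
inline void throwWrapped() {
    try {
        throw std::runtime_error("StreamParser::consume expecting ',', got ''");
    }
    catch (...) {
        std::throw_with_nested(std::runtime_error("FailureAware with behaviour \"propagate\""));
    }
}

// Collects what() from the outermost exception down to the innermost one.
inline std::vector<std::string> flattenMessages(const std::exception& e) {
    std::vector<std::string> messages;
    messages.push_back(e.what());
    try {
        std::rethrow_if_nested(e);  // no-op if e carries no nested exception
    }
    catch (const std::exception& inner) {
        const std::vector<std::string> rest = flattenMessages(inner);
        messages.insert(messages.end(), rest.begin(), rest.end());
    }
    catch (...) {
        messages.push_back("unknown");  // nested exception not derived from std::exception
    }
    return messages;
}
```

Logging only the outermost entry of such a chain hides the parser error; the last entry is the one that names the actual cause.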

The fix, in five pieces

1. Make the default failure policy effective — FailureHandling.h

FailureAware's constructor only consults ComponentFailureTraits::defaultOnErrorTag() when the YAML actually contains the config key. If the key is absent, the whole config-parsing block is skipped and parsedOnErrTag_ stays at its value-initialised value (= enum 0 = Propagate). The trait default was silently ignored.

Change: seed parsedOnErrTag_ = ComponentFailureTraits::defaultOnErrorTag() unconditionally at the top of the ctor.

Also in this file: capture and log the inner exception's what() in withFailureHandling (otherwise the log shows "[unknown]" and the real cause is hidden); make parsedOnErrTag_ protected so the Listener can branch on it.
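The constructor problem in item 1 can be reproduced in isolation. The sketch below uses hypothetical stand-ins (`OnError`, `ReceiverTraits`, `BeforeFix`, `AfterFix`, and the config key `example-on-error` are all invented for illustration; the real FailureAware constructor is more involved) to show why a value-initialised member ends up as Propagate when the key is absent, and how seeding from the trait changes that:

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical stand-ins for the real MultIO types.
enum class OnError { Propagate = 0, Recover = 1 };

using Config = std::map<std::string, std::string>;

struct ReceiverTraits {
    static OnError defaultOnErrorTag() { return OnError::Recover; }
    static const char* configKey() { return "example-on-error"; }  // hypothetical key
};

inline OnError parseTag(const std::string& s) {
    return s == "recover" ? OnError::Recover : OnError::Propagate;
}

// Before: the trait default is never consulted when the key is missing.
template <typename Traits>
struct BeforeFix {
    OnError tag{};  // value-initialised: enum value 0, i.e. Propagate
    explicit BeforeFix(const Config& cfg) {
        const auto it = cfg.find(Traits::configKey());
        if (it != cfg.end()) {
            tag = parseTag(it->second);
        }
    }
};

// After: seed from the trait first, then let an explicit config entry override it.
template <typename Traits>
struct AfterFix {
    OnError tag;
    explicit AfterFix(const Config& cfg) : tag(Traits::defaultOnErrorTag()) {
        const auto it = cfg.find(Traits::configKey());
        if (it != cfg.end()) {
            tag = parseTag(it->second);
        }
    }
};
```

With an empty config, `BeforeFix` yields Propagate and `AfterFix` yields Recover; an explicit entry still wins in both.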

2. Add a Recover value to the Receiver/Dispatcher error enums

Files: FailureHandling.cc, server/Listener.h, server/Dispatcher.h, server/Listener.cc, server/Dispatcher.cc.

OnReceiveError/OnDispatchError previously had only Propagate. Extended with Recover, updated the string Translator and the TagSequence, and set defaultOnErrorTag() = Recover. handleFailure(Recover) returns Ignore and crucially does not call transport_.abortAll(eptr) or msgQueue_.interrupt(eptr) — those were the calls that tore down the client transports and caused the IB retry-exceeded cascade.
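A rough sketch of the behaviour described in item 2, assuming simplified types: `FailureResponse`, `FakeTransport`, `FakeQueue` and `tagFromString` are invented here, and the real `handleFailure` is a member function with a different signature. The point is only that the Recover branch returns before any teardown call is made:

```cpp
#include <cassert>
#include <exception>
#include <stdexcept>
#include <string>

enum class OnReceiveError { Propagate, Recover };
enum class FailureResponse { Rethrow, Ignore };  // hypothetical name for the return type

// Counting stand-ins for transport_ and msgQueue_.
struct FakeTransport {
    int aborts = 0;
    void abortAll(std::exception_ptr) { ++aborts; }
};
struct FakeQueue {
    int interrupts = 0;
    void interrupt(std::exception_ptr) { ++interrupts; }
};

// Simplified string translation for the two enum values.
inline OnReceiveError tagFromString(const std::string& s) {
    if (s == "propagate") return OnReceiveError::Propagate;
    if (s == "recover") return OnReceiveError::Recover;
    throw std::invalid_argument("unknown on-error value: " + s);
}

inline FailureResponse handleFailure(OnReceiveError tag, FakeTransport& transport,
                                     FakeQueue& queue, std::exception_ptr eptr) {
    if (tag == OnReceiveError::Recover) {
        return FailureResponse::Ignore;  // no abortAll, no interrupt
    }
    transport.abortAll(eptr);
    queue.interrupt(eptr);
    return FailureResponse::Rethrow;
}
```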

3. Per-message try/catch in the Listener loop — Listener.cc start()

withFailureHandling wraps the whole loop, so any thrown message aborts the loop entirely — even though we want to recover. Wrapped the per-iteration body in its own try/catch. On Recover, log through the FailureAware machinery and continue with the next message.

Important negative: an earlier attempt also decremented openedCount_ and erased an arbitrary entry from connections_ to "fake" a missing Close. That corrupted tracking for unrelated peers and produced an avalanche of Connection to Peer(...) is not open SeriousBugs and eventual segfaults during teardown. Removed.
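The shape of the per-message try/catch in item 3 can be sketched as follows. This is a toy loop over an in-memory vector, not the real Listener: `Incoming`, `LoopResult`, `handleOne` and `runLoop` are hypothetical names, and the `malformed` flag stands in for a payload that makes the parser throw. Note that the catch block leaves the connection bookkeeping alone, in line with the "important negative" above:

```cpp
#include <cassert>
#include <stdexcept>
#include <vector>

enum class Tag { Open, Field, Close };

struct Incoming {
    Tag tag;
    bool malformed;  // stands in for a payload the parser rejects
};

struct LoopResult {
    int handled = 0;
    int recovered = 0;
    int openedCount = 0;
};

inline void handleOne(const Incoming& msg, LoopResult& state) {
    if (msg.malformed) {
        throw std::runtime_error("StreamParser::consume expecting ',', got ''");
    }
    if (msg.tag == Tag::Open) ++state.openedCount;
    if (msg.tag == Tag::Close) --state.openedCount;
    ++state.handled;
}

inline LoopResult runLoop(const std::vector<Incoming>& messages, bool recover) {
    LoopResult state;
    for (const Incoming& msg : messages) {
        try {
            handleOne(msg, state);
        }
        catch (const std::exception&) {
            if (!recover) {
                throw;  // propagate: the whole loop ends on the first bad message
            }
            ++state.recovered;  // real code would log here; openedCount is not touched
        }
    }
    return state;
}
```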

4. RAII buffer release in MpiTransport::receive — MpiTransport.cc

The original code only released the receive buffer back to the pool after the decode loop succeeded. A single thrown decodeMessage permanently leaked that pool slot, so over many bad messages the transport eventually stalled. Added a destructor-driven release so the buffer is returned even on exception.
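The destructor-driven release in item 4 is the usual RAII guard pattern. A minimal sketch, assuming a counting pool in place of the real MPI buffer pool (`BufferPool`, `BufferLease` and `receiveOnce` are illustrative names, not the MpiTransport API):

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>

// Counting stand-in for a fixed-size pool of receive buffers.
class BufferPool {
public:
    explicit BufferPool(std::size_t slots) : free_(slots) {}
    bool acquire() {
        if (free_ == 0) return false;
        --free_;
        return true;
    }
    void release() { ++free_; }
    std::size_t available() const { return free_; }

private:
    std::size_t free_;
};

// Holds one slot for its lifetime and returns it in the destructor,
// so the slot is released on normal exit and during stack unwinding.
class BufferLease {
public:
    explicit BufferLease(BufferPool& pool) : pool_(pool) {
        if (!pool_.acquire()) {
            throw std::runtime_error("buffer pool exhausted");
        }
    }
    ~BufferLease() { pool_.release(); }
    BufferLease(const BufferLease&) = delete;
    BufferLease& operator=(const BufferLease&) = delete;

private:
    BufferPool& pool_;
};

// Stand-in for one receive + decode cycle; `corrupt` simulates a throwing decode.
inline void receiveOnce(BufferPool& pool, bool corrupt) {
    BufferLease lease(pool);
    if (corrupt) {
        throw std::runtime_error("decode failed");
    }
}
```

If the lease constructor itself throws because the pool is empty, the destructor does not run, so no slot is returned that was never taken.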

5. Do not silently swallow YAML errors in metadataFromYAML — Metadata.cc

Tempting fix, wrong fix. If metadataFromYAML returns an empty Metadata on parse failure, the surrounding Message::Header gets constructed as a "successful" message whose tag was decoded from the corrupted byte stream. Close messages then look like Field/Domain/etc. messages, never reach the Close handler, openedCount_ is never decremented, and the loop never exits — the actual hang we observed when this was in place. Let the parser throw; item 3 handles it correctly.
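The difference between the two behaviours in item 5 can be illustrated with a toy parser (this is not `eckit::YAMLParser`; `parseStrict` and `parseSwallowing` are hypothetical and accept only a single `key: value` pair). The swallowing variant returns an empty result that the caller cannot tell apart from legitimately empty metadata, which is how a corrupted message can pass as a successful one:

```cpp
#include <cassert>
#include <map>
#include <stdexcept>
#include <string>

using Metadata = std::map<std::string, std::string>;

// Toy parser: accepts exactly one "key: value" pair and throws otherwise.
inline Metadata parseStrict(const std::string& text) {
    const std::string::size_type pos = text.find(": ");
    if (pos == std::string::npos) {
        throw std::runtime_error("malformed metadata");
    }
    return Metadata{{text.substr(0, pos), text.substr(pos + 2)}};
}

// The tempting variant: hides the failure behind an empty result.
inline Metadata parseSwallowing(const std::string& text) {
    try {
        return parseStrict(text);
    }
    catch (const std::exception&) {
        return Metadata{};
    }
}
```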

Why this ordering matters

The five changes have to land together. Without (1) the Recover default is ignored. Without (2) there's nothing for the YAML to map to. Without (3) the loop dies on first error. Without (4) it stalls after a few errors. With (5) wrong, the loop never terminates at all. The diagnostic in (1) was what made debugging tractable — before that the inner exception was masked.

Orthogonal etc/ workaround

multio-ifs-setup and multio-ocean-setup rewrite on-error: abort-transport to on-error: recover in generated YAMLs and prepend a top-level on-error: recover. This addresses OnClientError/OnServerError (which already had Recover); it is needed because nemo_full.yaml explicitly requested abort-transport. It is independent of the source patches above, which fix the Receiver/Dispatcher path that has no YAML knob.

Verification

After rebuild, in ifs.out:

  • grep "FailureAware" ifs.out | sort -u should show behaviour "recover" (proves item 1 took effect).
  • grep "Inner exception" ifs.out should show the real StreamParser::consume expecting ',', got '' text.
  • The run reaches step 192 and exits cleanly with no Connection ... not open SeriousBugs and no SIGSEGV.

Build: cd build.49r3.intel && cmake --build . --target multio -j. ifsMASTER.SP links libmultio.so dynamically, so no application relink is required.

Appendix

This is the message at the end of ifs.err with the fix applied:

Nested FailureAwareException:

  • 1: Assertion failed: Cannot find domainMaps for T grid

  • 2: FailureAware with behaviour "propagate" on Server for context: [
    Select(MatchReduce(||, +, [{category => {"ocean-3d" ,"ocean-2d"}}])) with Message: Message(version=1, tag=Field, source=Peer(group=multio,id=499), destination=Peer(group=multio,id=671), metadata={"tablesVersion":34,"setLocalDefinition":1,"grib2LocalSectionNumber":1,"productionStatusOfProcessedData":12,"dataset":"climate-dt","stream":"clte","class":"d1","activity":"baseline","experiment":"hist","generation":"2","model":"IFS-NEMO","realization":"1","significanceOfReferenceTime":2,"subCentre":1003,"generatingProcessIdentifier":156,"setPackingType":"grid_ccsds","type":"fc","expver":"j3ly","operation-frequency":"1d","operation":"average","endStepInHours":24,"startStepInHours":0,"currentTime":0,"previousTime":0,"previousDate":19880101,"sampleIntervalInSeconds":360,"paramId":262505,"unstructuredGridSubtype":"T","domain":"T grid","gridType":"unstructured_grid","typeOfLevel":"oceanModelLayer","misc-format":"raw","misc-precision":"single","toAllServers":false,"bitsPerValue":16,"bitmapPresent":false,"missingValue":0,"nemoParam":"vocen","name":"vocen","level":67,"startTime":0,"startDate":19880101,"currentDate":19880102,"timeStep":360,"step-frequency":240,"step":240,"category":"ocean-3d","misc-globalSize":15585132}, payload-size=69460)
    Inner exception: Assertion failed: Cannot find domainMaps for T grid
    ]

  • 3: FailureAware with behaviour "propagate" on Server for context: [
    Plan "ocean-fields-native" with Message: Message(version=1, tag=Field, source=Peer(group=multio,id=499), destination=Peer(group=multio,id=671), metadata={"tablesVersion":34,"setLocalDefinition":1,"grib2LocalSectionNumber":1,"productionStatusOfProcessedData":12,"dataset":"climate-dt","stream":"clte","class":"d1","activity":"baseline","experiment":"hist","generation":"2","model":"IFS-NEMO","realization":"1","significanceOfReferenceTime":2,"subCentre":1003,"generatingProcessIdentifier":156,"setPackingType":"grid_ccsds","type":"fc","expver":"j3ly","operation-frequency":"1d","operation":"average","endStepInHours":24,"startStepInHours":0,"currentTime":0,"previousTime":0,"previousDate":19880101,"sampleIntervalInSeconds":360,"paramId":262505,"unstructuredGridSubtype":"T","domain":"T grid","gridType":"unstructured_grid","typeOfLevel":"oceanModelLayer","misc-format":"raw","misc-precision":"single","toAllServers":false,"bitsPerValue":16,"bitmapPresent":false,"missingValue":0,"nemoParam":"vocen","name":"vocen","level":67,"startTime":0,"startDate":19880101,"currentDate":19880102,"timeStep":360,"step-frequency":240,"step":240,"category":"ocean-3d","misc-globalSize":15585132}, payload-size=69460)
    Parametrization: {}

Inner exception: FailureAware with behaviour "propagate" on Server for context: [
Select(MatchReduce(||, +, [{category => {"ocean-3d" ,"ocean-2d"}}])) with Message: Message(version=1, tag=Field, source=Peer(group=multio,id=499), destination=Peer(group=multio,id=671), metadata={"tablesVersion":34,"setLocalDefinition":1,"grib2LocalSectionNumber":1,"productionStatusOfProcessedData":12,"dataset":"climate-dt","stream":"clte","class":"d1","activity":"baseline","experiment":"hist","generation":"2","model":"IFS-NEMO","realization":"1","significanceOfReferenceTime":2,"subCentre":1003,"generatingProcessIdentifier":156,"setPackingType":"grid_ccsds","type":"fc","expver":"j3ly","operation-frequency":"1d","operation":"average","endStepInHours":24,"startStepInHours":0,"currentTime":0,"previousTime":0,"previousDate":19880101,"sampleIntervalInSeconds":360,"paramId":262505,"unstructuredGridSubtype":"T","domain":"T grid","gridType":"unstructured_grid","typeOfLevel":"oceanModelLayer","misc-format":"raw","misc-precision":"single","toAllServers":false,"bitsPerValue":16,"bitmapPresent":false,"missingValue":0,"nemoParam":"vocen","name":"vocen","level":67,"startTime":0,"startDate":19880101,"currentDate":19880102,"timeStep":360,"step-frequency":240,"step":240,"category":"ocean-3d","misc-globalSize":15585132}, payload-size=69460)
Inner exception: Assertion failed: Cannot find domainMaps for T grid
]
]

Exception stack:
End stack

Contributor Declaration

By opening this pull request, I affirm the following:

  • All authors agree to the Contributor License Agreement.
  • The code follows the project's coding standards.
  • I have performed self-review and added comments where needed.
  • I have added or updated tests to verify that my changes are effective and functional.
  • I have run all existing tests and confirmed they pass.

@tweska
Member

tweska commented Jun 2, 2026

If I understand correctly, this change adds a new "recover" error handling mode that prints the error and then continues. Please correct me if I am wrong.

Can you point out to me where this error: StreamParser::consume expecting ',', got '' is fixed in this PR?

And where is this coming from: Assertion failed: Cannot find domainMaps for T grid? Is it outside MultIO?
