fix found by Claude for IFS crash at the last step#281
Draft
antonl321 wants to merge 1 commit into
Draft
Conversation
Member
|
If I understand correctly this change will add a new "recover" error handling mode that will print the error and then continue. Please, correct me if I am wrong. Can you point out to me where this error: And where is this comming from: |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Description
Tests done ifs-raps ( https://git.ecmwf.int/projects/RAPS/repos/ifs-raps/commits/0f89cdfc489378534680077b3ed8564371419219#49r3.yml ) that uses multio v 2.10.1
on CINECA's Leonardo system ended up in the error from below:
11:38:06 STEP 383 H= 47:52 +CPU= 7.271
STEP 383 :## EC_MEMINFO 0 lrdn3467 13040 0 0 515 30102 720 53578 723 50196 1153 49540 1127 52488 7045 46414 1224 52934 1324 50280 400457 36626 0.0
0.0 0 0 112.31 Sm/p
terminate called after throwing an instance of 'std::_Nested_exceptionmultio::util::FailureAwareException'
what(): FailureAware with behaviour "propagate" on Server for context: [
unknown
]
terminate called recursively
forrtl: error (76): Abort trap signal
Image PC Routine Line Source
Stack trace terminated abnormally.
terminate called after throwing an instance of 'std::_Nested_exceptionmultio::util::FailureAwareException'
what(): FailureAware with behaviour "propagate" on Server for context: [
unknown
]
terminate called recursively
forrtl: error (76): Abort trap signal
Image PC Routine Line Source
Stack trace terminated abnormally.
terminate called after throwing an instance of 'std::_Nested_exceptionmultio::util::FailureAwareException'
what(): FailureAware with behaviour "propagate" on Server for context: [
unknown
]
Claude code summary of changes & reasoning
The bug
End-of-run crash/hang in IFS+NEMO+WAM at tco1279l137/eORCA12 (step 192, h24). Three different surface symptoms (mlx5 transport-retry, SIGSEGV in
libmultio.so, MPI hang incloseConnections) all had the same root cause:At shutdown, a handful of MultIO control messages arrive with a malformed YAML payload.
message::metadataFromYAMLcallseckit::YAMLParser, which throwsStreamParser::consume expecting ',', got ''. The exception travels up throughFailureAware<Receiver/Dispatcher>::withFailureHandling. The default policy wasPropagate→ the server thread terminates → clientMPI_Waitnever completes → either hang, or the IB layer eventually reports retry-exceeded, or teardown segfaults.The mlx5 "Transport retry count exceeded" was a red herring — peer-side death, not a fabric flap.
The fix, in five pieces
1. Make the default failure policy effective — FailureHandling.h
FailureAware's constructor only consultsComponentFailureTraits::defaultOnErrorTag()when the YAML actually contains the config key. If the key is absent, the whole config-parsing block is skipped andparsedOnErrTag_stays at its value-initialised value (= enum0=Propagate). The trait default was silently ignored.Change: seed
parsedOnErrTag_ = ComponentFailureTraits::defaultOnErrorTag()unconditionally at the top of the ctor.Also in this file: capture and log the inner exception's
what()inwithFailureHandling(otherwise the log shows "[unknown]" and the real cause is hidden); makeparsedOnErrTag_protectedso the Listener can branch on it.2. Add a
Recovervalue to the Receiver/Dispatcher error enumsFiles: FailureHandling.cc,
server/Listener.h,server/Dispatcher.h,server/Listener.cc,server/Dispatcher.cc.OnReceiveError/OnDispatchErrorpreviously had onlyPropagate. Extended withRecover, updated the string Translator and theTagSequence, and setdefaultOnErrorTag() = Recover.handleFailure(Recover)returnsIgnoreand crucially does not calltransport_.abortAll(eptr)ormsgQueue_.interrupt(eptr)— those were the calls that tore down the client transports and caused the IB retry-exceeded cascade.3. Per-message try/catch in the Listener loop — Listener.cc
start()withFailureHandlingwraps the whole loop, so any thrown message aborts the loop entirely — even though we want to recover. Wrapped the per-iteration body in its own try/catch. OnRecover, log through the FailureAware machinery and continue with the next message.Important negative: an earlier attempt also decremented
openedCount_and erased an arbitrary entry fromconnections_to "fake" a missing Close. That corrupted tracking for unrelated peers and produced an avalanche ofConnection to Peer(...) is not openSeriousBugs and eventual segfaults during teardown. Removed.4. RAII buffer release in
MpiTransport::receive— MpiTransport.ccThe original code only released the receive buffer back to the pool after the decode loop succeeded. A single thrown
decodeMessagepermanently leaked that pool slot, so over many bad messages the transport eventually stalled. Added a destructor-driven release so the buffer is returned even on exception.5. Do not silently swallow YAML errors in
metadataFromYAML— Metadata.ccTempting fix, wrong fix. If
metadataFromYAMLreturns an emptyMetadataon parse failure, the surroundingMessage::Headergets constructed as a "successful" message whose tag was decoded from the corrupted byte stream. Close messages then look like Field/Domain/etc. messages, never reach the Close handler,openedCount_is never decremented, and the loop never exits — the actual hang we observed when this was in place. Let the parser throw; item 3 handles it correctly.Why this ordering matters
The five changes have to land together. Without (1) the
Recoverdefault is ignored. Without (2) there's nothing for the YAML to map to. Without (3) the loop dies on first error. Without (4) it stalls after a few errors. With (5) wrong, the loop never terminates at all. The diagnostic in (1) was what made debugging tractable — before that the inner exception was masked.Orthogonal etc/ workaround
multio-ifs-setup and multio-ocean-setup rewrite
on-error: abort-transport→on-error: recoverin generated YAMLs and prepend a top-levelon-error: recover. This addressesOnClientError/OnServerError(which already hadRecover) — needed becausenemo_full.yamlexplicitly requestedabort-transport. Independent of the source patches above, which fix the Receiver/Dispatcher path that has no YAML knob.Verification
After rebuild, in
ifs.out:grep "FailureAware" ifs.out | sort -ushould showbehaviour "recover"(proves item 1 took effect).grep "Inner exception" ifs.outshould show the realStreamParser::consume expecting ',', got ''text.Connection ... not openSeriousBugs and no SIGSEGV.Build:
cd build.49r3.intel && cmake --build . --target multio -j.ifsMASTER.SPlinkslibmultio.sodynamically, so no application relink is required.Appendix
This is the message at the end of ifs.err with the fix applied
Nested FailureAwareException:
1: Assertion failed: Cannot find domainMaps for T grid
2: FailureAware with behaviour "propagate" on Server for context: [
Select(MatchReduce(||, +, [{category => {"ocean-3d" ,"ocean-2d"}}])) with Message: Message(version=1, tag=Field, source=Peer(group=multio,id=499), destination=Peer(group=multio,id=671), metadata={"tablesVersion":34,"setLocalDefinition":1,"grib2LocalSectionNumber":1,"productionStatusOfProcessedData":12,"dataset":"climate-dt","stream":"clte","class":"d1","activity":"baseline","experiment":"hist","generation":"2","model":"IFS-NEMO","realization":"1","significanceOfReferenceTime":2,"subCentre":1003,"generatingProcessIdentifier":156,"setPackingType":"grid_ccsds","type":"fc","expver":"j3ly","operation-frequency":"1d","operation":"average","endStepInHours":24,"startStepInHours":0,"currentTime":0,"previousTime":0,"previousDate":19880101,"sampleIntervalInSeconds":360,"paramId":262505,"unstructuredGridSubtype":"T","domain":"T grid","gridType":"unstructured_grid","typeOfLevel":"oceanModelLayer","misc-format":"raw","misc-precision":"single","toAllServers":false,"bitsPerValue":16,"bitmapPresent":false,"missingValue":0,"nemoParam":"vocen","name":"vocen","level":67,"startTime":0,"startDate":19880101,"currentDate":19880102,"timeStep":360,"step-frequency":240,"step":240,"category":"ocean-3d","misc-globalSize":15585132}, payload-size=69460)
Inner exception: Assertion failed: Cannot find domainMaps for T grid
]
3: FailureAware with behaviour "propagate" on Server for context: [
Plan "ocean-fields-native" with Message: Message(version=1, tag=Field, source=Peer(group=multio,id=499), destination=Peer(group=multio,id=671), metadata={"tablesVersion":34,"setLocalDefinition":1,"grib2LocalSectionNumber":1,"productionStatusOfProcessedData":12,"dataset":"climate-dt","stream":"clte","class":"d1","activity":"baseline","experiment":"hist","generation":"2","model":"IFS-NEMO","realization":"1","significanceOfReferenceTime":2,"subCentre":1003,"generatingProcessIdentifier":156,"setPackingType":"grid_ccsds","type":"fc","expver":"j3ly","operation-frequency":"1d","operation":"average","endStepInHours":24,"startStepInHours":0,"currentTime":0,"previousTime":0,"previousDate":19880101,"sampleIntervalInSeconds":360,"paramId":262505,"unstructuredGridSubtype":"T","domain":"T grid","gridType":"unstructured_grid","typeOfLevel":"oceanModelLayer","misc-format":"raw","misc-precision":"single","toAllServers":false,"bitsPerValue":16,"bitmapPresent":false,"missingValue":0,"nemoParam":"vocen","name":"vocen","level":67,"startTime":0,"startDate":19880101,"currentDate":19880102,"timeStep":360,"step-frequency":240,"step":240,"category":"ocean-3d","misc-globalSize":15585132}, payload-size=69460)
Parametrization: {}
Inner exception: FailureAware with behaviour "propagate" on Server for context: [
Select(MatchReduce(||, +, [{category => {"ocean-3d" ,"ocean-2d"}}])) with Message: Message(version=1, tag=Field, source=Peer(group=multio,id=499), destination=Peer(group=multio,id=671), metadata={"tablesVersion":34,"setLocalDefinition":1,"grib2LocalSectionNumber":1,"productionStatusOfProcessedData":12,"dataset":"climate-dt","stream":"clte","class":"d1","activity":"baseline","experiment":"hist","generation":"2","model":"IFS-NEMO","realization":"1","significanceOfReferenceTime":2,"subCentre":1003,"generatingProcessIdentifier":156,"setPackingType":"grid_ccsds","type":"fc","expver":"j3ly","operation-frequency":"1d","operation":"average","endStepInHours":24,"startStepInHours":0,"currentTime":0,"previousTime":0,"previousDate":19880101,"sampleIntervalInSeconds":360,"paramId":262505,"unstructuredGridSubtype":"T","domain":"T grid","gridType":"unstructured_grid","typeOfLevel":"oceanModelLayer","misc-format":"raw","misc-precision":"single","toAllServers":false,"bitsPerValue":16,"bitmapPresent":false,"missingValue":0,"nemoParam":"vocen","name":"vocen","level":67,"startTime":0,"startDate":19880101,"currentDate":19880102,"timeStep":360,"step-frequency":240,"step":240,"category":"ocean-3d","misc-globalSize":15585132}, payload-size=69460)
Inner exception: Assertion failed: Cannot find domainMaps for T grid
]
]
Exception stack:
End stack
Contributor Declaration
By opening this pull request, I affirm the following: