hiffy: improve task synchronization after reset#598
Conversation
Cool! That's really cool! Wow! |
hawkw
left a comment
There was a problem hiding this comment.
I do wonder if, in the long term, it is worth just giving the kernel some bit that it can set in the task control block to say "yes i have definitely started this task" a bit more authoritatively, but this seems reasonable for the near term. I had some nits about the comments and warnings, but otherwise...sure whatever, this is a reasonablish solution to an unfortunate situation.
| // try again -- giving hiffy time to finish initializing, or to go back | ||
| // to sleep and wait for instructions, respectively. | ||
| // | ||
| // However -- if HIFFY_READY==1, things are more complicated. |
There was a problem hiding this comment.
This is one of those sentences where you're just like "Oh, this is gonna be good".
humility-hiffy/src/lib.rs
Outdated
| if hiffy_task.is_none() { | ||
| log::warn!( | ||
| "can't find task named 'hiffy' in image, \ | ||
| synchronization may be wrong." |
There was a problem hiding this comment.
nit: bonus whitespace here:
| synchronization may be wrong." | |
| synchronization may be wrong." |
also, i feel like the meaning of "synchronization may be wrong" is not very obvious to a potential person who encounters this message?
There was a problem hiding this comment.
Yeah, I will revisit that method. Would you like more or less Bonus Whitespace™ in the remix? I have some spare.
humility-hiffy/src/lib.rs
Outdated
| // This really shouldn't happen, but it's permitted by the API | ||
| // types, sooooooo | ||
| HubrisTask::Kernel => None, |
There was a problem hiding this comment.
Hm, is this case so absurd that it ought to have its own warning? Just to disambiguate between "this image doesn't contain a task named hiffy" which is somewhat normalish, and "somehow hiffy was found to be inside the kernel, which is completely absurd and should never happen"?
There was a problem hiding this comment.
I mean, personally, I think this operation returns the wrong type -- from a brief survey of uses of this API (lookup_task) they fall into two categories:
- Cases where
HubrisTask::Kernelcauses an immediatebail!and - Cases where it probably ought to but isn't being checked for.
Fixing that feels like a bigger issue, so I will add a warning here as you suggest, and file a bug to follow up on the API.
humility-hiffy/src/lib.rs
Outdated
| SchedState::InSend(_) | ||
| | SchedState::InReply(_) | ||
| | SchedState::InRecv(_) => Ok(true), | ||
| _ => Ok(false), |
There was a problem hiding this comment.
i think that, since this is the place where we actually determine whether we consider Hiffy to have "started", it might be worth having a little commentary here explaining why we consider these states to mean "yes it has started" and we don't do that if it is Runnable? Even if the comment is a bit duplicative of the other bigger block comment.
I think having an in-kernel flag that specifically says "this task has made a syscall that isn't associated with startup" (which, currently, means any syscalls, since pre-main init makes no syscalls, but that might change in the future?) would be fantastic. I also think that the task accessors in Humility ought to take a task generation number and fail if the generation changes unexpectedly, but that's a way bigger fix. Because of Humility's back-compat requirement, we need a solution that works for pre-kernel-flag firmware, and I think this change is probably the best bet there (among variously iffy bets). But I'd totally consider the kernel flag thing, and tbh I bet there are a couple other similar things we could use in the kernel that would be low cost. |
Totally agree that we have to do something like this to be able to support older images regardless of whether we end up adding an easier way to expose this information. I just wanted to bring up "maybe we should also, separately, go and figure out how to have a nicer way to communicate this information". |
hawkw
left a comment
There was a problem hiding this comment.
i like the new comments/warnings!
humility-hiffy/src/lib.rs
Outdated
| SchedState::Stopped | ||
| | SchedState::Runnable => Ok(false), |
There was a problem hiding this comment.
rustfmt seems to dislike this bit
| SchedState::Stopped | |
| | SchedState::Runnable => Ok(false), | |
| SchedState::Stopped | SchedState::Runnable => Ok(false), |
| // types, sooooooo | ||
| HubrisTask::Kernel => { | ||
| log::warn!( | ||
| "This application appears to have named its kernel \ |
There was a problem hiding this comment.
i like the use of "appears" here, because i kind of imagine that there are basically three ways we could end up here:
- someone made some image for testing hiffy things and named the kernel "hiffy" without thinking of the consequences of doing so, which seems possible but unlikely,
- the image and/or its DWARF have been somehow garbled/corrupted in what would have to be a shockingly pernicious way to make it be basically parseable but this messed up.
- the mechanism humility uses to determine if a region is the kernel or not (which i don't remember off the top of my head) has been broken for some reason.
in at least one or perhaps two of those cases, it may not "actually" be the kernel from the perspective of when the archive was built, so saying it "looks like" the kernel is probably the most accurate way to represent our confidence in this.
| "Can't find task named 'hiffy' in image. We'll still try to \ | ||
| use it, but if this is just after a reboot, we may not be \ | ||
| able to synchronize with its startup correctly, which may \ | ||
| produce intermittent flaky behavior. This message is not \ |
There was a problem hiding this comment.
it might be nice to be able to explain how it would be flaky, but this message is already quite long and i think it's good enough.
There was a problem hiding this comment.
I'm open to suggestions, I'm not sure how to explain the flake -- since I can't tell what hiffy is being used for at this level in the library. It will probably usually be timeouts and message decode errors, but we're in UB territory so ...
Manufacturing test revealed an occasional flake in programming auxflash on Minibar. It turns out to be a multi-part problem. We program auxflash by driving Hiffy through Humility to generate IPCs. This means the flash process is 1. Write firmware to the SP's internal flash. 2. Reboot into it. 3. Enlist the hiffy task to check and program auxflash. 4. Reboot again. Humility synchronizes with Hiffy by watching for a `HIFFY_READY` variable to go from 0->1. Unfortunately, if the hiffy task faults or the system reboots, nothing switches that back to 0 until hiffy's bss initialization code runs. It turns out that Minibar has several high priority tasks that use CPU at boot, delaying hiffy's task initialization, including bss. This means that, after a reboot, Humility can observe arbitrary contents in the `HIFFY_READY` variable for quite a long period of time (about 42ms in my tests). In this specific case, it's usually a `1` left there by the previous incarnation of the system... but it could also be random data, if the last image we were running had Hiffy at a different physical address. (Yes, this is an uninitialized variable bug despite being all safe Rust. Debuggers: Always Unsafe(tm)) Fixing this has turned out to be a little subtle. The approach taken by this commit is to "detect" the completion of Hiffy's task initialization by seeing it enter into a system call. Since Hiffy spends 99+% of its time parked in sleeps, this is a reliable indicator that control has gotten to main and the contents of `HIFFY_READY` are now initialized.
d779399 to
5eff861
Compare
Manufacturing test revealed an occasional flake in programming auxflash on Minibar. It turns out to be a multi-part problem.
We program auxflash by driving Hiffy through Humility to generate IPCs. This means the flash process is
Humility synchronizes with Hiffy by watching for a
HIFFY_READYvariable to go from 0->1. Unfortunately, if the hiffy task faults or the system reboots, nothing switches that back to 0 until hiffy's bss initialization code runs.It turns out that Minibar has several high priority tasks that use CPU at boot, delaying hiffy's task initialization, including bss. This means that, after a reboot, Humility can observe arbitrary contents in the
HIFFY_READYvariable for quite a long period of time (about 42ms in my tests). In this specific case, it's usually a1left there by the previous incarnation of the system... but it could also be random data, if the last image we were running had Hiffy at a different physical address.(Yes, this is an uninitialized variable bug despite being all safe Rust. Debuggers: Always Unsafe(tm))
Fixing this has turned out to be a little subtle. The approach taken by this commit is to "detect" the completion of Hiffy's task initialization by seeing it enter into a system call. Since Hiffy spends 99+% of its time parked in sleeps, this is a reliable indicator that control has gotten to main and the contents of
HIFFY_READYare now initialized.