improve: log diagnostic for remote deployment junit extension #3306
csviri wants to merge 4 commits into operator-framework:main
Conversation
Force-pushed ddc23d3 to 5bb595d
Pull request overview
Adds richer diagnostics when a cluster-deployed operator times out during deployment, to make remote E2E failures easier to debug.
Changes:
- Catch `KubernetesClientTimeoutException` during `waitUntilReady` and emit diagnostics before rethrowing.
- Add `logDiagnosticInfo(...)` to report deployments, pods, container statuses, related events, and recent pod logs on timeout.
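The overall shape of the change can be sketched as follows. This is a minimal, self-contained illustration of the catch-and-diagnose pattern, not the extension's real code: `DeployDiagnostics`, `DeploymentTimeoutException`, and the stand-in `waitUntilReady`/`logDiagnosticInfo` bodies are hypothetical; the real extension uses the fabric8 client's `KubernetesClientTimeoutException` and queries the cluster for the listed resources.

```java
// Sketch only: class and method bodies are stand-ins for the real
// extension code, which talks to a Kubernetes cluster via fabric8.
public class DeployDiagnostics {

  static class DeploymentTimeoutException extends RuntimeException {
    DeploymentTimeoutException(String msg) { super(msg); }
  }

  static final StringBuilder LOG = new StringBuilder();

  // Stand-in for the real waitUntilReady: always times out here.
  static void waitUntilReady() {
    throw new DeploymentTimeoutException("operator not ready after 60s");
  }

  // Stand-in for logDiagnosticInfo: the real method reports deployments,
  // pods, container statuses, related events, and recent pod logs.
  static void logDiagnosticInfo() {
    LOG.append("diagnostics: deployments, pods, events, logs\n");
  }

  static void deploy() {
    try {
      waitUntilReady();
    } catch (DeploymentTimeoutException e) {
      logDiagnosticInfo(); // emit diagnostics first...
      throw e;             // ...then rethrow so the deployment still fails loudly
    }
  }

  public static void main(String[] args) {
    try {
      deploy();
    } catch (DeploymentTimeoutException e) {
      System.out.println(LOG.toString().trim());
      System.out.println("rethrown: " + e.getMessage());
      // prints:
      // diagnostics: deployments, pods, events, logs
      // rethrown: operator not ready after 60s
    }
  }
}
```

The key point is the ordering: diagnostics are logged before the exception is rethrown, so a remote E2E failure report contains the cluster state at the moment of the timeout.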
```java
" Could not retrieve logs for pod '{}': {}",
pod.getMetadata().getName(),
logEx.getMessage());
```
When pod log retrieval fails, this only logs the exception message and drops the stack trace, which makes diagnosing client/auth/network issues harder. Consider logging the exception itself as the last argument (similar to the diagEx handling below) so the full cause is available when needed.
```diff
-" Could not retrieve logs for pod '{}': {}",
-pod.getMetadata().getName(),
-logEx.getMessage());
+" Could not retrieve logs for pod '{}'",
+pod.getMetadata().getName(),
+logEx);
```
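The reviewer's point rests on how loggers treat a `Throwable` argument: passing only `logEx.getMessage()` records one line, while passing the exception object preserves the full stack trace, including the cause chain. The difference can be shown with a small self-contained example (`LogDetail`, `messageOnly`, and `fullTrace` are illustrative names, not project code):

```java
import java.io.PrintWriter;
import java.io.StringWriter;

public class LogDetail {

  // What logging only getMessage() captures: the top-level message, no cause.
  static String messageOnly(Throwable t) {
    return String.valueOf(t.getMessage());
  }

  // What passing the Throwable to the logger preserves: the full stack
  // trace, including any "Caused by:" chain.
  static String fullTrace(Throwable t) {
    StringWriter sw = new StringWriter();
    t.printStackTrace(new PrintWriter(sw));
    return sw.toString();
  }

  public static void main(String[] args) {
    Exception cause = new IllegalStateException("connection reset");
    Exception logEx = new RuntimeException("failed to fetch pod logs", cause);
    System.out.println(messageOnly(logEx).contains("connection reset")); // prints: false
    System.out.println(fullTrace(logEx).contains("connection reset"));   // prints: true
  }
}
```

With SLF4J specifically, a `Throwable` passed as the last argument after the placeholders is logged with its stack trace rather than substituted into a `{}`, which is why the suggested change also drops the second placeholder.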
Now it fails; this seems to be an issue I also observed with the operations sample PR: it seems the CRD is not applied in some cases, will investigate further. This might actually be a junit extension issue - maybe with a newer version of junit? - since we never observed this before. Or maybe related to your recent changes @xstefank?

@csviri there is only one test in TomcatOperatorE2ETest, so I don't think deletion of the CRD is the issue.
Will merge.

No, not the deletion; I just thought it might ring a bell, nvm.
I will do a nicer opt-out from that feature for E2Es |
xstefank left a comment
@csviri all tests passed with the CRD being deleted after the test class is executed in my PR, so I don't understand why you think it caused failures in this PR. Local and cluster tests run separately, so they should both restart the operator and redeploy the CRD. I don't think leaving some CRDs behind, as proposed in this PR, is a good idea.
@xstefank they are running against the same cluster; there is probably an issue with the cluster test runner regarding the CRD, I will take a look at that later - for now this fixes the issue. Also, this API to turn off CRD deletion might help others too. Will continue on this in this PR.
@xstefank btw, it would be great to handle the deletion of CRDs uniformly, also for ClusterDeployedOperatorExtension, by implementing it in AbstractOperatorExtension - would you care to create a PR for that?
@xstefank I addressed the CRD issue we discussed in a separate PR. Removed the deletion opt-out, but left the option there for the future, in case someone needs it.
Signed-off-by: Attila Mészáros <a_meszaros@apple.com>
The delete-CRD flag I will move to a separate PR.
also adds an option to not delete CRDs
Signed-off-by: Attila Mészáros <a_meszaros@apple.com>