feat: cdk diagnose displays the root cause of a past stack deployment failure #1378

Open
rix0rrr wants to merge 30 commits into main from huijbers/cdk-diagnose

Contributor

@rix0rrr rix0rrr commented Apr 16, 2026

Useful when you use a CI/CD system to deploy stacks, but the failure reason is hard to parse.

This PR unifies the error analysis and printing paths of `cdk deploy` and makes them accessible as a separate subcommand, `cdk diagnose`. For a CI/CD deployment, this gives you the same interpretation of CloudFormation error reporting strategies, and the same source localization, that you get today for a deployment performed directly via the CLI.

In the future, we will further extend the diagnosis that the CLI performs to get you more information.

This is what errors look like now (both for cdk deploy and cdk diagnose):

    ❌ Stack cdktest-0648snacfpxl-diagnose-deploy-fail:
    Early validation failed for change set cdk-deploy-change-set:
     └─ cdktest-0648snacfpxl-diagnose-deploy-fail
         └─ BadPolicy  (AWS::IAM::Policy BadPolicy)
            🛑 Required property [PolicyName] not found (at /Resources/BadPolicy/Properties)
            Source Location: new DeployFailStack (/private/var/folders/w3/3m2f73xn1rq69wbbx9zw70s00000gq/T/cdk-integ-0648snacfpxl/app.js:29:5)
                             Object.<anonymous> (/private/var/folders/w3/3m2f73xn1rq69wbbx9zw70s00000gq/T/cdk-integ-0648snacfpxl/app.js:69:1)
                             Module._compile (node:internal/modules/cjs/loader:1761:14)
                             Module._extensions..js (node:internal/modules/cjs/loader:1893:10)
                             Module.load (node:internal/modules/cjs/loader:1481:32)
                             Module._load (node:internal/modules/cjs/loader:1300:12)
                             TracingChannel.traceSync (node:diagnostics_channel:328:14)
                             wrapModuleLoad (node:internal/modules/cjs/loader:245:24)
                             Module.executeUserEntryPoint [as runMain] (node:internal/modules/run_main:154:5)
                             node:internal/main/run_main_module:33:47

Broad design of this code:

[image: diagram of the broad design, not reproduced here]

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache-2.0 license

@rix0rrr rix0rrr requested a review from a team April 16, 2026 14:57
@rix0rrr rix0rrr marked this pull request as draft April 16, 2026 14:57
auto-merge was automatically disabled April 16, 2026 14:57

Pull request was converted to draft

@github-actions github-actions Bot added the p2 label Apr 16, 2026
@aws-cdk-automation aws-cdk-automation requested a review from a team April 16, 2026 14:59
…nt failure

Useful when you use a CI/CD system to deploy stacks, but the failure
reason is hard to parse.

This PR unifies the error analysis and printing paths of `cdk deploy`
and makes them accessible as a separate subcommand, `cdk diagnose`. For
a CI/CD deployment, this gives you the same interpretation of
CloudFormation error reporting strategies, and the same source
localization, that you get today for a deployment performed directly
via the CLI.

In the future, we will further extend the diagnosis that the CLI
performs to get you more information.

This PR also extends the bootstrap template with `List` permissions
for change sets, turning the read permissions into `List*` and
`Describe*` wildcards so that future read-only operations don't have
to be added one by one.

It also turns on `IncludeNestedStacks: true` for CreateChangeSet so
that early validation is run for nested stacks.
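
To make the nested-stack change concrete, here is a minimal sketch of a CreateChangeSet request with the flag switched on. The interface is a hypothetical, locally declared subset of the request shape (so the snippet has no dependencies), the helper name and stack name are made up, and the actual call site in the PR may look different. The CloudFormation API spells the parameter `IncludeNestedStacks`.

```typescript
// Hypothetical subset of the CreateChangeSet request shape, declared locally
// so the sketch has no dependencies. Field names follow the CloudFormation API.
interface CreateChangeSetInputSketch {
  StackName: string;
  ChangeSetName: string;
  ChangeSetType?: 'CREATE' | 'UPDATE' | 'IMPORT';
  TemplateBody?: string;
  IncludeNestedStacks?: boolean;
}

// Hypothetical helper: builds the request for an update-style change set.
function buildChangeSetInput(stackName: string, templateBody: string): CreateChangeSetInputSketch {
  return {
    StackName: stackName,
    ChangeSetName: 'cdk-deploy-change-set',
    ChangeSetType: 'UPDATE',
    TemplateBody: templateBody,
    // Ask CloudFormation to also create change sets for nested stacks.
    IncludeNestedStacks: true,
  };
}

const changeSetInput = buildChangeSetInput('my-stack', '{"Resources":{}}');
```

With a real SDK client, an object of this shape would be passed to the CreateChangeSet operation; that wiring is left out here on purpose.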
rix0rrr and others added 2 commits April 22, 2026 15:55
Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
@codecov-commenter

codecov-commenter commented Apr 22, 2026

Codecov Report

❌ Patch coverage is 49.05660% with 27 lines in your changes missing coverage. Please review.
✅ Project coverage is 87.93%. Comparing base (2ee00f1) to head (53ff33b).
⚠️ Report is 5 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| packages/aws-cdk/lib/cli/cli.ts | 10.52% | 17 Missing ⚠️ |
| packages/aws-cdk/lib/cli/cdk-toolkit.ts | 61.11% | 7 Missing ⚠️ |
| packages/aws-cdk/lib/cxapp/exec.ts | 25.00% | 3 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1378      +/-   ##
==========================================
- Coverage   88.14%   87.93%   -0.21%     
==========================================
  Files          74       74              
  Lines       10481    10530      +49     
  Branches     1432     1435       +3     
==========================================
+ Hits         9238     9260      +22     
- Misses       1216     1243      +27     
  Partials       27       27              
| Flag | Coverage Δ |
|---|---|
| suite.unit | 87.93% <49.05%> (-0.21%) ⬇️ |

@github-actions
Contributor

Total lines changed 2870 is greater than 1000. Please consider breaking this PR down.

rix0rrr and others added 4 commits April 24, 2026 13:49
Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
});

expect(stdErr).toContain('Import of existing resources failed');
expect(stdErr).toContain('needs a DeletionPolicy');
Contributor

Why not keep both?

Contributor Author

shrug. The message changed slightly. This is enough to confirm that the message is in the output.

Contributor

@mrgrain mrgrain left a comment

Mostly minor stuff.

Bigger points for me are:

  • Should the new command be --unstable for now?
  • The naming and jsdocs around OldestEvent don't click with me at all. Would love to see an iteration on the API to make the intent more clear.

Comment thread packages/@aws-cdk/toolkit-lib/lib/actions/diagnose/index.ts
/**
* Optionally a source trace
*
* (Not optional on purpose so we are not allowed to forget to call the code that should fill it)
Contributor

Nice! Although this feels like a linter rule/TS config trying to come out 😱

Contributor Author

I used to have branded types here but I also didn't want to expose those publicly, so this is the middle ground.

Comment thread packages/@aws-cdk/toolkit-lib/lib/toolkit/toolkit.ts
Comment thread packages/@aws-cdk/toolkit-lib/lib/toolkit/toolkit.ts
/* c8 ignore stop */

- type TypeUnderlyingBrand<A> = A extends Branded<infer T, any> ? T : never;
+ type TypeUnderlyingBrand<A> = Omit<A, keyof Brand<any>>;
Contributor

Seems risky to go from never to any... Can you help explain this?

Contributor Author

@rix0rrr rix0rrr Apr 28, 2026

Sure.

The goal is, given `Branded<A>`, to find `A`. For whatever reason, the `A extends Branded<infer T>` approach never worked. I forget what it resolved to, but it was either `any` or `unknown`.

Since `Branded<A>` does `A & Something`, we can also try to remove the intersection. Googling told me that the way to de-intersect a type is to do `Omit<A, keyof Something>`. In our case the `Something` is a `Brand<B>`, but we don't care about the `B` because the keys are always the same, so `B=any`.

TL;DR: the `never` and `any` have nothing to do with each other; they just both happened to occur in two type manipulation signatures trying to do the same thing in a different way.
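
For readers following along, a minimal sketch of the de-intersection trick described above. The `Brand` and `Branded` definitions, the `__brand` key, and the `StackInfo` example type are assumptions made for illustration; the definitions in the PR may differ.

```typescript
// Hypothetical brand definitions; the real ones in the codebase may differ.
type Brand<B> = { readonly __brand: B };
type Branded<T, B> = T & Brand<B>;

// De-intersect: drop the keys that Brand<...> contributes. The brand
// parameter is irrelevant here because the keys are the same for every B.
type TypeUnderlyingBrand<A> = Omit<A, keyof Brand<any>>;

// Example object type, made up for this sketch.
interface StackInfo {
  name: string;
  region: string;
}

type ValidatedStack = Branded<StackInfo, 'validated'>;

// Structurally { name: string; region: string }: the brand key is gone.
type PlainStack = TypeUnderlyingBrand<ValidatedStack>;

// Compiles without supplying __brand, which shows the brand was removed.
const plainStack: PlainStack = { name: 'my-stack', region: 'eu-west-1' };
```

Note that `Omit` produces a mapped object type, so this approach suits branded object types; it would not give back a bare primitive such as `string`.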

Comment on lines +151 to +152
// Some custom resource types that the CDK standard library creates that we
// would like to see if they fail.
Contributor

thought (non-blocking): What a shame we don't have a consistent naming pattern for these 🤦🏻

shouldStop(event: ResourceEvent): 'stop-include' | 'stop-exclude' | 'continue';
}

export abstract class OldestEvent {
Contributor

issue: The naming of this doesn't click with me.

Comment on lines +42 to +48
public static timestamp(startTime: number): IOldestEvent {
  return {
    shouldStop(event) {
      return event.event.Timestamp!.valueOf() < startTime ? 'stop-exclude' : 'continue';
    },
  };
}
Contributor

fromTimestamp or withStartingPoint ?

OldestEvent.timestamp(...) just doesn't speak to me at all.
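
To help with the naming discussion, here is a minimal sketch of how a `shouldStop` boundary like the one above might be consumed. It assumes events are walked newest-first (the order in which CloudFormation returns stack events). The simplified `ResourceEvent` shape, the standalone `timestamp` function, and the `collectEvents` helper are hypothetical stand-ins, not the PR's actual code.

```typescript
// Hypothetical simplified shapes; the PR's ResourceEvent carries more fields.
interface ResourceEvent {
  event: { Timestamp?: Date; LogicalResourceId?: string };
}
type StopDecision = 'stop-include' | 'stop-exclude' | 'continue';
interface IOldestEvent {
  shouldStop(event: ResourceEvent): StopDecision;
}

// Same logic as the factory under review, as a free function.
function timestamp(startTime: number): IOldestEvent {
  return {
    shouldStop(event) {
      return event.event.Timestamp!.valueOf() < startTime ? 'stop-exclude' : 'continue';
    },
  };
}

// Hypothetical consumer: walk events newest-first and collect them until
// the boundary says to stop, including or excluding the boundary event.
function collectEvents(eventsNewestFirst: ResourceEvent[], oldest: IOldestEvent): ResourceEvent[] {
  const collected: ResourceEvent[] = [];
  for (const ev of eventsNewestFirst) {
    const decision = oldest.shouldStop(ev);
    if (decision === 'stop-exclude') {
      break;
    }
    collected.push(ev);
    if (decision === 'stop-include') {
      break;
    }
  }
  return collected;
}

const sampleEvents: ResourceEvent[] = [
  { event: { Timestamp: new Date(3000), LogicalResourceId: 'C' } },
  { event: { Timestamp: new Date(2000), LogicalResourceId: 'B' } },
  { event: { Timestamp: new Date(1000), LogicalResourceId: 'A' } },
];

// Keeps C and B; A predates the start time and is excluded.
const recentEvents = collectEvents(sampleEvents, timestamp(2000));
```

Seen this way, the object describes where the backwards scan ends, which may be useful context for picking a name.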

Comment thread packages/aws-cdk/README.md Outdated
after-the-fact. This can be useful to refresh your memory, ask a colleague to
help diagnose a deployment problem, or try to diagnose a deployment problem if
your way of working dictates that you perform stack deployment via a CI/CD
system (instead of directly using the CLI).
Contributor

Suggested change
- system (instead of directly using the CLI).
+ system instead of directly using the CDK CLI.

Co-authored-by: Momo Kornher <kornherm@amazon.co.uk>
4 participants