feat: add observability-as-a-service baseline for CDK platform stack #3

Merged: tukue merged 7 commits into main from codex/implement-observability-as-a-service on Apr 1, 2026

@tukue (Owner) commented Apr 1, 2026

Motivation

  • Provide a platform-managed observability baseline so workloads emit golden-signal metrics, structured logs, and alerts by default.
  • Improve discoverability and integration with platform runbooks/portal by exporting observability resource identifiers from the stack.

Description

  • Added CloudWatch and SNS primitives in lib/cdk-app-stack.ts: imported aws-cloudwatch, aws-cloudwatch-actions, and aws-sns, and provisioned an encrypted SNS topic ObservabilityAlarmTopic.
  • Created CloudWatch alarms for Lambda errors (LambdaErrorsAlarm), Lambda p95 duration (LambdaDurationP95Alarm), and API Gateway 5xx (Api5xxAlarm) and wired them to the SNS topic via SnsAction.
  • Added a CloudWatch dashboard PlatformObservabilityDashboard with widgets for Lambda invocations/errors, Lambda duration (p50/p95), API requests/5xx, and API latency (p50/p95).
  • Introduced a dedicated application log group LambdaApplicationLogs and exported stack outputs ObservabilityDashboardName, ObservabilityAlarmTopicArn, and LambdaApplicationLogGroupName for discoverability.
  • Updated Lambda environment variables in the CDK stack to include APP_LOG_LEVEL and SERVICE_NAME, and refactored the handler in lib/function.ts to emit structured JSON logs, add a log helper, and propagate a correlation ID in responses (x-correlation-id).
  • Added docs/observability-as-a-service.md describing assessment, implemented baseline, and recommended next steps, and updated docs/platform-product-progress.md to reflect the maturity increment.
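
The structured-logging change described in the list above could look roughly like the following. This is a minimal sketch under stated assumptions, not the code in lib/function.ts: the `createLogger` factory, the level ordering, the 'info' default, and the JSON field names are hypothetical. Only the APP_LOG_LEVEL and SERVICE_NAME variable names come from the description.

```typescript
// Hypothetical sketch of a structured JSON log helper; not the PR's implementation.
type LogLevel = 'debug' | 'info' | 'warn' | 'error';

const LEVEL_ORDER: Record<LogLevel, number> = { debug: 10, info: 20, warn: 30, error: 40 };

// Falls back to 'info' (an assumed default) when the value is missing or unknown.
function parseLevel(raw: string | undefined): LogLevel {
  const value = (raw ?? '').toLowerCase();
  return Object.prototype.hasOwnProperty.call(LEVEL_ORDER, value)
    ? (value as LogLevel)
    : 'info';
}

// Returns a log function that writes one JSON object per line to `write`.
function createLogger(
  env: Record<string, string | undefined>,
  write: (line: string) => void,
) {
  const threshold = parseLevel(env['APP_LOG_LEVEL']);
  const service = env['SERVICE_NAME'] ?? 'unknown-service';
  return (level: LogLevel, message: string, fields: Record<string, unknown> = {}): void => {
    // Drop entries below the configured threshold.
    if (LEVEL_ORDER[level] < LEVEL_ORDER[threshold]) return;
    // Core fields are spread last so caller-supplied fields cannot overwrite them.
    write(JSON.stringify({
      ...fields,
      timestamp: new Date().toISOString(),
      level,
      service,
      message,
    }));
  };
}
```

In a Lambda handler this would presumably be wired as `createLogger(process.env, console.log)`, with the correlation ID passed in `fields` on each call so every log line for a request carries it.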

Testing

  • Ran npm run build, which completed successfully (TypeScript compile succeeded).
  • Ran npm run synth, which completed successfully and produced a CloudFormation template (synth succeeded with non-blocking tooling/runtime warnings).

Codex Task

@amazon-q-developer amazon-q-developer bot left a comment

Summary

This PR adds a solid observability baseline with CloudWatch dashboards, alarms, and structured logging. However, two logic errors block the merge because they will prevent the observability features from functioning correctly.

Critical Issues (Must Fix)

  1. Lambda log group mismatch: Custom log group won't receive logs due to incorrect naming pattern
  2. Missing correlation ID in error paths: Breaks request tracing for failed requests

Overall Assessment

Once the logic errors are fixed, this PR delivers the intended observability-as-a-service baseline. The structured logging, alarms, and dashboard implementation are well-designed.


You can now have the agent implement changes and create commits directly on your pull request's source branch. Simply comment with /q followed by your request in natural language to ask the agent to make changes.

Comment on lines 79 to 84
  return {
    statusCode: 500,
    headers: {
      'Content-Type': 'application/json',
-     'Access-Control-Allow-Origin': '*'
+     'Access-Control-Allow-Origin': '*',
    },

🛑 Logic Error: Missing correlation ID in error response breaks observability contract. Error responses should include the same headers as success responses to enable request tracing.

Suggested change
- return {
-   statusCode: 500,
-   headers: {
-     'Content-Type': 'application/json',
-     'Access-Control-Allow-Origin': '*'
-     'Access-Control-Allow-Origin': '*',
-   },
+ const errorCorrelationId =
+   event.headers?.['x-correlation-id'] ||
+   event.headers?.['X-Correlation-Id'] ||
+   context.awsRequestId;
+ return {
+   statusCode: 500,
+   headers: {
+     'Content-Type': 'application/json',
+     'Access-Control-Allow-Origin': '*',
+     'x-correlation-id': errorCorrelationId,
+   },
+   body: JSON.stringify({ error: 'Internal Server Error' }),
+ };
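
The suggestion above checks two specific header casings. HTTP header names are case-insensitive, so a client could send another casing such as X-CORRELATION-ID. One possible way to share the lookup between the success and error paths is a small helper like the sketch below. The names `resolveCorrelationId`, `HeaderMap`, `fallbackId`, and `errorResponse` are hypothetical and not part of the PR; in the handler the fallback would presumably be context.awsRequestId, as in the suggestion.

```typescript
// Hypothetical helpers illustrating a case-insensitive correlation ID lookup.
type HeaderMap = Record<string, string | undefined> | null | undefined;

// Returns the first non-empty x-correlation-id header (any casing), else the fallback.
function resolveCorrelationId(headers: HeaderMap, fallbackId: string): string {
  const source = headers ?? {};
  for (const name of Object.keys(source)) {
    const value = source[name];
    if (name.toLowerCase() === 'x-correlation-id' && value) return value;
  }
  return fallbackId;
}

// Builds a 500 response that carries the same tracing header as success responses.
function errorResponse(correlationId: string) {
  return {
    statusCode: 500,
    headers: {
      'Content-Type': 'application/json',
      'Access-Control-Allow-Origin': '*',
      'x-correlation-id': correlationId,
    },
    body: JSON.stringify({ error: 'Internal Server Error' }),
  };
}
```

Resolving the ID once at the top of the handler and passing it to both response builders would keep the two paths from drifting apart.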

Comment on lines +115 to +120
  const lambdaAppLogs = new logs.LogGroup(this, 'LambdaApplicationLogs', {
    logGroupName: `/aws/lambda/${lambda_backend.functionName}-application`,
    encryptionKey,
    retention: logs.RetentionDays.ONE_MONTH,
    removalPolicy: RemovalPolicy.RETAIN,
  });

🛑 Logic Error: The custom log group is created, but Lambda won't write to it. By default, Lambda writes to /aws/lambda/${lambda_backend.functionName}, not to a group with the -application suffix. To use this log group, either set the function's logGroup property or apply metric filters to the default log group.

Suggested change
- const lambdaAppLogs = new logs.LogGroup(this, 'LambdaApplicationLogs', {
-   logGroupName: `/aws/lambda/${lambda_backend.functionName}-application`,
-   encryptionKey,
-   retention: logs.RetentionDays.ONE_MONTH,
-   removalPolicy: RemovalPolicy.RETAIN,
- });
+ const lambdaAppLogs = new logs.LogGroup(this, 'LambdaApplicationLogs', {
+   logGroupName: `/aws/lambda/${lambda_backend.functionName}`,
+   encryptionKey,
+   retention: logs.RetentionDays.ONE_MONTH,
+   removalPolicy: RemovalPolicy.RETAIN,
+ });
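
The naming mismatch flagged above can be stated as a small check. This is only an illustrative sketch: the helper names and the example function name `my-fn` are hypothetical, and the comparison applies to resolved (deployed) names, since inside a CDK stack functionName is typically an unresolved token at synth time. It also assumes the function has no custom log group configured.

```typescript
// Default log group Lambda writes to when no custom log group is configured.
function defaultLambdaLogGroupName(functionName: string): string {
  return `/aws/lambda/${functionName}`;
}

// True when a log group with this name would receive the function's default log output.
function receivesDefaultLambdaLogs(logGroupName: string, functionName: string): boolean {
  return logGroupName === defaultLambdaLogGroupName(functionName);
}

// 'my-fn' is a made-up function name used only for illustration.
const withSuffix = receivesDefaultLambdaLogs('/aws/lambda/my-fn-application', 'my-fn');
const withoutSuffix = receivesDefaultLambdaLogs('/aws/lambda/my-fn', 'my-fn');
```

Here `withSuffix` is false and `withoutSuffix` is true, which is the reviewer's point: a group named with the -application suffix exists but stays empty unless the function is explicitly pointed at it.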

@tukue tukue self-assigned this Apr 1, 2026
@tukue tukue added this to the mvp milestone Apr 1, 2026
@tukue tukue merged commit 3510a9e into main Apr 1, 2026
2 checks passed