From 18d6cb5cf0f6df11da292c3f488f0b9fc35a97c4 Mon Sep 17 00:00:00 2001
From: Tukue Gebregergis
Date: Wed, 1 Apr 2026 11:09:56 +0200
Subject: [PATCH 1/7] feat: implement observability as a service baseline

---
 docs/observability-as-a-service.md |  65 +++++++++++++++++
 docs/platform-product-progress.md  |  12 +++-
 lib/cdk-app-stack.ts               | 108 +++++++++++++++++++++++++++++
 lib/function.ts                    |  66 ++++++++++++++++--
 4 files changed, 243 insertions(+), 8 deletions(-)
 create mode 100644 docs/observability-as-a-service.md

diff --git a/docs/observability-as-a-service.md b/docs/observability-as-a-service.md
new file mode 100644
index 0000000..92865f4
--- /dev/null
+++ b/docs/observability-as-a-service.md
@@ -0,0 +1,65 @@
+# Observability as a Service (OaaS) - Assessment and Implementation
+
+_Last updated: 2026-04-01_
+
+## Objective
+
+Provide a platform-managed observability baseline for workloads so teams get actionable telemetry by default, without creating custom dashboards, alarms, or logging pipelines per service.
+
+## Current-state assessment (this repo)
+
+### Strengths already present
+
+- API Gateway execution/access logging is enabled with CloudWatch log delivery.
+- Lambda tracing is enabled (`Tracing.ACTIVE`) and API Gateway tracing is enabled.
+- Shared KMS encryption is already used for API logs and other platform resources.
+
+### Gaps identified
+
+- No pre-provisioned dashboard that aggregates API and Lambda golden signals.
+- No default alerting path for operational failures (e.g., Lambda errors, API 5xx).
+- Lambda application logs were not standardized for correlation and service-level context.
+- No explicit service-level observability outputs for integration into runbooks/portal metadata.
+
+## Implemented OaaS baseline in this change
+
+### 1) Platform alerting channel
+
+- Added an encrypted SNS topic (`ObservabilityAlarmTopic`) for alarm fan-out.
+- Wired core alarms to this topic to establish a shared notification mechanism.
+
+### 2) Golden signal alarms
+
+- Added Lambda error alarm (sum errors over 5 minutes).
+- Added Lambda latency alarm (p95 duration over 2 seconds).
+- Added API Gateway 5xx alarm (sum server errors over 5 minutes).
+
+### 3) Shared dashboard
+
+- Added CloudWatch dashboard with:
+  - Lambda invocations/errors
+  - Lambda duration p50/p95
+  - API requests/5xx
+  - API latency p50/p95
+
+### 4) Structured application logging contract
+
+- Updated Lambda handler to emit JSON logs with:
+  - timestamp
+  - level
+  - service name
+  - correlation ID
+  - request metadata (method/path/request ID)
+- Added correlation ID propagation in API responses (`x-correlation-id`).
+
+### 5) Discoverability outputs
+
+- Added stack outputs for dashboard name, alarm topic ARN, and application log group name to simplify integration with platform docs and runbooks.
+
+## Recommended next steps (Phase 2+)
+
+1. Subscribe Slack/Email/PagerDuty endpoints to the SNS alarm topic via environment-specific config.
+2. Add metric filters and alarms for error-rate SLO burn thresholds.
+3. Standardize alarm severity labels and route high/medium/low priorities separately.
+4. Introduce OpenTelemetry collector path for vendor-neutral traces/metrics export.
+5. Export dashboard links into Backstage component annotations for developer self-service.
diff --git a/docs/platform-product-progress.md b/docs/platform-product-progress.md
index 5e3ea34..900aaec 100644
--- a/docs/platform-product-progress.md
+++ b/docs/platform-product-progress.md
@@ -1,6 +1,6 @@
 # Platform as a Product Progress Tracker
 
-_Last updated: 2026-03-26_
+_Last updated: 2026-04-01_
 
 ## Delivery status snapshot
 
@@ -13,7 +13,7 @@ _Last updated: 2026-03-26_
 | Secure-by-default CDK sample hardening | ✅ Complete | 100% | KMS, VPC, DLQ, IAM auth, caching, encrypted logs implemented. |
 | Environment overlays (dev/stage/prod) | 🟡 In Progress | 40% | Structure exists; env-specific manifests and policy sets pending. |
 | Policy-as-code enforcement (OPA/Kyverno) | 🟡 In Progress | 60% | Conftest policy bundle and CI enforcement added for deployment security/image/resource guardrails. |
-| Observability productization | 🟡 In Progress | 35% | Architecture defined; Prometheus/Grafana/Loki/OTel deployments pending. |
+| Observability productization | 🟡 In Progress | 60% | CloudWatch dashboard, alerts, and structured logging baseline implemented; Prometheus/Grafana/Loki/OTel deployments pending. |
 | EKS + Argo CD platform runtime | ⏳ Planned | 20% | Target model documented; implementation modules still to be added. |
 | Backstage portal deployment | ⏳ Planned | 15% | Template exists; portal deployment and catalog automation pending. |
 
@@ -33,6 +33,14 @@ _Last updated: 2026-03-26_
 4. Add observability baseline (Prometheus, Grafana, Loki, OpenTelemetry Collector).
 5. Expand service repo structure with CI, Dockerfile, Helm chart, and SLO/runbook assets.
 
+## Latest implementation increment
+
+- Added a concrete "observability as a service" baseline to the CDK sample stack:
+  - CloudWatch dashboard for API + Lambda golden signals.
+  - Encrypted SNS-backed alarm fan-out for Lambda and API failures/latency.
+  - Structured Lambda JSON logging and correlation-ID propagation.
+- See `docs/observability-as-a-service.md` for assessment details and rollout recommendations.
+
 ## Definition of done for next milestone
 
 - [ ] `platform/environments/{dev,stage,prod}` contain concrete compositions.
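
The alarm fan-out listed above depends on how each alarm decides that it is breaching. The following is a minimal sketch of an "M out of N datapoints" evaluation. It assumes a greater-than-or-equal comparison (understood to be the CDK `Alarm` default when `comparisonOperator` is omitted) and treats missing datapoints as not breaching. `Datapoint`, `AlarmConfig`, `percentile`, and `isInAlarm` are hypothetical names used for illustration only, and the real CloudWatch evaluation of missing data is more involved than this model.

```typescript
// Simplified model of "M out of N" alarm evaluation.
// All names are hypothetical; nothing here calls an AWS API.

type Datapoint = number | undefined; // undefined = no data for that period

interface AlarmConfig {
  threshold: number;
  evaluationPeriods: number; // N: how many of the most recent periods to inspect
  datapointsToAlarm: number; // M: how many of those periods must breach
}

// Nearest-rank percentile over raw samples (an assumption; the percentile
// statistic computed by CloudWatch may differ in detail).
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) {
    throw new Error('percentile requires at least one sample');
  }
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(rank, 1) - 1];
}

// True when at least M of the last N datapoints are >= threshold.
// Missing datapoints are counted as not breaching.
function isInAlarm(datapoints: Datapoint[], config: AlarmConfig): boolean {
  const recent = datapoints.slice(-config.evaluationPeriods);
  const breaching = recent.filter(
    (value) => value !== undefined && value >= config.threshold
  ).length;
  return breaching >= config.datapointsToAlarm;
}

// Settings mirroring the error and latency alarms described in this patch.
const errorAlarm: AlarmConfig = { threshold: 1, evaluationPeriods: 1, datapointsToAlarm: 1 };
const latencyAlarm: AlarmConfig = { threshold: 2000, evaluationPeriods: 2, datapointsToAlarm: 2 };
```

With these settings a single period containing one error trips `errorAlarm`, while `latencyAlarm` needs two consecutive periods at or above 2000 ms, so one slow period followed by a quiet one does not alert.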
diff --git a/lib/cdk-app-stack.ts b/lib/cdk-app-stack.ts
index 0f75a7b..8dcaa85 100644
--- a/lib/cdk-app-stack.ts
+++ b/lib/cdk-app-stack.ts
@@ -10,6 +10,9 @@ import * as kms from 'aws-cdk-lib/aws-kms';
 import * as ec2 from 'aws-cdk-lib/aws-ec2';
 import * as sqs from 'aws-cdk-lib/aws-sqs';
 import * as logs from 'aws-cdk-lib/aws-logs';
+import * as cloudwatch from 'aws-cdk-lib/aws-cloudwatch';
+import * as cwActions from 'aws-cdk-lib/aws-cloudwatch-actions';
+import * as sns from 'aws-cdk-lib/aws-sns';
 
 export class CdkAppStack extends cdk.Stack {
   constructor(scope: Construct, id: string, props?: cdk.StackProps) {
@@ -75,6 +78,8 @@ export class CdkAppStack extends cdk.Stack {
       environment: {
         DYNAMODB: dynamodb_table.tableName,
         NODE_OPTIONS: '--enable-source-maps',
+        APP_LOG_LEVEL: 'INFO',
+        SERVICE_NAME: 'demo-api',
       },
       environmentEncryption: encryptionKey,
       // Bundling options for esbuild
@@ -107,6 +112,13 @@ export class CdkAppStack extends cdk.Stack {
       removalPolicy: RemovalPolicy.RETAIN,
     });
 
+    const lambdaAppLogs = new logs.LogGroup(this, 'LambdaApplicationLogs', {
+      logGroupName: `/aws/lambda/${lambda_backend.functionName}-application`,
+      encryptionKey,
+      retention: logs.RetentionDays.ONE_MONTH,
+      removalPolicy: RemovalPolicy.RETAIN,
+    });
+
     // 4. API Gateway Definition
     const api = new gateway.RestApi(this, 'RestAPI', {
       restApiName: 'Demo API',
@@ -151,6 +163,84 @@ export class CdkAppStack extends cdk.Stack {
     items.addMethod('GET', rootIntegration);
     items.addMethod('POST', rootIntegration);
 
+    const alarmTopic = new sns.Topic(this, 'ObservabilityAlarmTopic', {
+      displayName: 'Platform Observability Alerts',
+      masterKey: encryptionKey,
+    });
+
+    const lambdaErrorsAlarm = new cloudwatch.Alarm(this, 'LambdaErrorsAlarm', {
+      metric: lambda_backend.metricErrors({
+        period: cdk.Duration.minutes(5),
+        statistic: 'sum',
+      }),
+      threshold: 1,
+      evaluationPeriods: 1,
+      datapointsToAlarm: 1,
+      treatMissingData: cloudwatch.TreatMissingData.NOT_BREACHING,
+      alarmDescription: 'Lambda function has errors in the last 5 minutes',
+    });
+
+    const lambdaDurationAlarm = new cloudwatch.Alarm(this, 'LambdaDurationP95Alarm', {
+      metric: lambda_backend.metricDuration({
+        period: cdk.Duration.minutes(5),
+        statistic: 'p95',
+      }),
+      threshold: 2000,
+      evaluationPeriods: 2,
+      datapointsToAlarm: 2,
+      treatMissingData: cloudwatch.TreatMissingData.NOT_BREACHING,
+      alarmDescription: 'Lambda p95 duration is above 2 seconds',
+    });
+
+    const api5xxAlarm = new cloudwatch.Alarm(this, 'Api5xxAlarm', {
+      metric: api.metricServerError({
+        period: cdk.Duration.minutes(5),
+        statistic: 'sum',
+      }),
+      threshold: 1,
+      evaluationPeriods: 1,
+      datapointsToAlarm: 1,
+      treatMissingData: cloudwatch.TreatMissingData.NOT_BREACHING,
+      alarmDescription: 'API Gateway has 5xx responses in the last 5 minutes',
+    });
+
+    lambdaErrorsAlarm.addAlarmAction(new cwActions.SnsAction(alarmTopic));
+    lambdaDurationAlarm.addAlarmAction(new cwActions.SnsAction(alarmTopic));
+    api5xxAlarm.addAlarmAction(new cwActions.SnsAction(alarmTopic));
+
+    const observabilityDashboard = new cloudwatch.Dashboard(this, 'PlatformObservabilityDashboard', {
+      dashboardName: `${cdk.Stack.of(this).stackName}-platform-observability`,
+    });
+
+    observabilityDashboard.addWidgets(
+      new cloudwatch.GraphWidget({
+        title: 'Lambda Invocations / Errors',
+        left: [lambda_backend.metricInvocations(), lambda_backend.metricErrors()],
+        width: 12,
+      }),
+      new cloudwatch.GraphWidget({
+        title: 'Lambda Duration (p50/p95)',
+        left: [
+          lambda_backend.metricDuration({ statistic: 'p50' }),
+          lambda_backend.metricDuration({ statistic: 'p95' }),
+        ],
+        width: 12,
+      }),
+      new cloudwatch.GraphWidget({
+        title: 'API Gateway Requests / 5XX',
+        left: [api.metricCount(), api.metricServerError()],
+        width: 12,
+      }),
+      new cloudwatch.GraphWidget({
+        title: 'API Gateway Latency (p50/p95)',
+        left: [
+          api.metricLatency({ statistic: 'p50' }),
+          api.metricLatency({ statistic: 'p95' }),
+        ],
+        width: 12,
+      }),
+    );
+
     // Stack Outputs
     // Export important information
     new cdk.CfnOutput(this, 'ApiUrl', {
@@ -165,6 +255,24 @@ export class CdkAppStack extends cdk.Stack {
       exportName: 'tableName',
     });
 
+    new cdk.CfnOutput(this, 'ObservabilityDashboardName', {
+      value: observabilityDashboard.dashboardName,
+      description: 'CloudWatch dashboard for platform observability',
+      exportName: 'observabilityDashboardName',
+    });
+
+    new cdk.CfnOutput(this, 'ObservabilityAlarmTopicArn', {
+      value: alarmTopic.topicArn,
+      description: 'SNS topic ARN for observability alarms',
+      exportName: 'observabilityAlarmTopicArn',
+    });
+
+    new cdk.CfnOutput(this, 'LambdaApplicationLogGroupName', {
+      value: lambdaAppLogs.logGroupName,
+      description: 'Application log group name used by Lambda structured logs',
+      exportName: 'lambdaApplicationLogGroupName',
+    });
+
     // Optional: Add Tags to Resources
     cdk.Tags.of(this).add('Environment', 'Development');
     cdk.Tags.of(this).add('Project', 'DemoAPI');
diff --git a/lib/function.ts b/lib/function.ts
index 7680278..f81c420 100644
--- a/lib/function.ts
+++ b/lib/function.ts
@@ -3,32 +3,86 @@ import * as AWS from 'aws-sdk';
 
 const dynamodb = new AWS.DynamoDB.DocumentClient();
 const TABLE_NAME = process.env.DYNAMODB || '';
+const SERVICE_NAME = process.env.SERVICE_NAME || 'unknown-service';
+const LOG_LEVEL = process.env.APP_LOG_LEVEL || 'INFO';
+
+type LogLevel = 'DEBUG' | 'INFO' | 'WARN' | 'ERROR';
+
+const log = (
+  level: LogLevel,
+  message: string,
+  details: Record<string, unknown> = {}
+): void => {
+  const payload = {
+    timestamp: new Date().toISOString(),
+    level,
+    service: SERVICE_NAME,
+    message,
+    ...details,
+  };
+
+  if (level === 'ERROR') {
+    console.error(JSON.stringify(payload));
+    return;
+  }
+
+  if (level === 'WARN') {
+    console.warn(JSON.stringify(payload));
+    return;
+  }
+
+  if (level === 'DEBUG' && LOG_LEVEL !== 'DEBUG') {
+    return;
+  }
+
+  console.log(JSON.stringify(payload));
+};
 
 export const handler = async (
   event: APIGatewayEvent,
   context: Context
 ): Promise<APIGatewayProxyResult> => {
   try {
+    const correlationId =
+      event.headers?.['x-correlation-id'] ||
+      event.headers?.['X-Correlation-Id'] ||
+      context.awsRequestId;
+
+    log('INFO', 'request-received', {
+      awsRequestId: context.awsRequestId,
+      correlationId,
+      path: event.path,
+      method: event.httpMethod,
+      tableNameConfigured: TABLE_NAME.length > 0,
+    });
+
     return {
       statusCode: 200,
       headers: {
         'Content-Type': 'application/json',
-        'Access-Control-Allow-Origin': '*'
+        'Access-Control-Allow-Origin': '*',
+        'x-correlation-id': correlationId,
       },
       body: JSON.stringify({
         message: 'Testing from Lambda!',
-        event,
-        context
-      })
+        service: SERVICE_NAME,
+        correlationId,
+        requestId: context.awsRequestId,
+      }),
     };
   } catch (error) {
+    log('ERROR', 'request-failed', {
+      awsRequestId: context.awsRequestId,
+      errorMessage: error instanceof Error ? error.message : 'Unknown error',
+    });
+
     return {
       statusCode: 500,
       headers: {
         'Content-Type': 'application/json',
-        'Access-Control-Allow-Origin': '*'
+        'Access-Control-Allow-Origin': '*',
       },
-      body: JSON.stringify({ error: 'Internal Server Error' })
+      body: JSON.stringify({ error: 'Internal Server Error' }),
     };
   }
 };
From 27d1f24660a5356cb657e744cadb478abb74e4bf Mon Sep 17 00:00:00 2001
From: Tukue Gebregergis
Date: Wed, 1 Apr 2026 11:32:35 +0200
Subject: [PATCH 2/7] fix: align lambda log group and propagate correlation ids on errors

---
 lib/cdk-app-stack.ts |  2 +-
 lib/function.ts      | 11 ++++++-----
 2 files changed, 7 insertions(+), 6 deletions(-)

diff --git a/lib/cdk-app-stack.ts b/lib/cdk-app-stack.ts
index 8dcaa85..4bb9580 100644
--- a/lib/cdk-app-stack.ts
+++ b/lib/cdk-app-stack.ts
@@ -113,7 +113,7 @@ export class CdkAppStack extends cdk.Stack {
     });
 
     const lambdaAppLogs = new logs.LogGroup(this, 'LambdaApplicationLogs', {
-      logGroupName: `/aws/lambda/${lambda_backend.functionName}-application`,
+      logGroupName: `/aws/lambda/${lambda_backend.functionName}`,
       encryptionKey,
       retention: logs.RetentionDays.ONE_MONTH,
       removalPolicy: RemovalPolicy.RETAIN,
diff --git a/lib/function.ts b/lib/function.ts
index f81c420..0b7bd62 100644
--- a/lib/function.ts
+++ b/lib/function.ts
@@ -42,12 +42,12 @@ export const handler = async (
   event: APIGatewayEvent,
   context: Context
 ): Promise<APIGatewayProxyResult> => {
-  try {
-    const correlationId =
-      event.headers?.['x-correlation-id'] ||
-      event.headers?.['X-Correlation-Id'] ||
-      context.awsRequestId;
+  const correlationId =
+    event.headers?.['x-correlation-id'] ||
+    event.headers?.['X-Correlation-Id'] ||
+    context.awsRequestId;
 
+  try {
     log('INFO', 'request-received', {
       awsRequestId: context.awsRequestId,
       correlationId,
@@ -81,6 +81,7 @@ export const handler = async (
       headers: {
         'Content-Type': 'application/json',
         'Access-Control-Allow-Origin': '*',
+        'x-correlation-id': correlationId,
       },
       body: JSON.stringify({ error: 'Internal Server Error' }),
     };
From 285eecc3a6aae30f98327d9c9707d11eea3a614f Mon Sep 17 00:00:00 2001
From: Tukue Gebregergis
Date: Wed, 1 Apr 2026 11:32:40 +0200
Subject: [PATCH 3/7] docs: rename golden signal alarm section to recommended baseline

---
 docs/observability-as-a-service.md | 2 +-
 docs/platform-product-progress.md  | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/observability-as-a-service.md b/docs/observability-as-a-service.md
index 92865f4..0fb3313 100644
--- a/docs/observability-as-a-service.md
+++ b/docs/observability-as-a-service.md
@@ -28,7 +28,7 @@ Provide a platform-managed observability baseline for workloads so teams get act
 - Added an encrypted SNS topic (`ObservabilityAlarmTopic`) for alarm fan-out.
 - Wired core alarms to this topic to establish a shared notification mechanism.
 
-### 2) Golden signal alarms
+### 2) Recommended baseline alarms
 
 - Added Lambda error alarm (sum errors over 5 minutes).
 - Added Lambda latency alarm (p95 duration over 2 seconds).
diff --git a/docs/platform-product-progress.md b/docs/platform-product-progress.md
index 900aaec..26a9018 100644
--- a/docs/platform-product-progress.md
+++ b/docs/platform-product-progress.md
@@ -36,7 +36,7 @@ _Last updated: 2026-04-01_
 ## Latest implementation increment
 
 - Added a concrete "observability as a service" baseline to the CDK sample stack:
-  - CloudWatch dashboard for API + Lambda golden signals.
+  - CloudWatch dashboard for API + Lambda recommended baseline telemetry signals.
   - Encrypted SNS-backed alarm fan-out for Lambda and API failures/latency.
   - Structured Lambda JSON logging and correlation-ID propagation.
 - See `docs/observability-as-a-service.md` for assessment details and rollout recommendations.
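
The structured JSON logging referenced above emits WARN and ERROR unconditionally and suppresses only DEBUG, so an `APP_LOG_LEVEL` of `WARN` would still let INFO lines through. One possible refinement is a severity-ordered threshold, sketched below under stated assumptions: `LEVEL_ORDER`, `parseLevel`, `shouldEmit`, and `formatLogLine` are hypothetical helper names, and the numeric weights are arbitrary. This is an illustration, not the behavior of the patches in this series.

```typescript
// Sketch of severity-threshold filtering for structured JSON logs.
// All helper names below are hypothetical.

type LogLevel = 'DEBUG' | 'INFO' | 'WARN' | 'ERROR';

// Arbitrary weights; only their relative order matters.
const LEVEL_ORDER: Record<LogLevel, number> = {
  DEBUG: 10,
  INFO: 20,
  WARN: 30,
  ERROR: 40,
};

// Normalizes a configured value and falls back to INFO when the value is
// missing or not a known level.
function parseLevel(raw: string | undefined): LogLevel {
  const candidate = (raw ?? '').toUpperCase();
  const known = Object.prototype.hasOwnProperty.call(LEVEL_ORDER, candidate);
  return known ? (candidate as LogLevel) : 'INFO';
}

// A line is emitted only when its level is at or above the configured level.
function shouldEmit(level: LogLevel, configured: LogLevel): boolean {
  return LEVEL_ORDER[level] >= LEVEL_ORDER[configured];
}

// Builds one JSON log line with the fields named in the logging contract.
// The clock is injectable so the output can be checked deterministically.
function formatLogLine(
  level: LogLevel,
  service: string,
  message: string,
  details: Record<string, unknown> = {},
  now: Date = new Date()
): string {
  return JSON.stringify({
    timestamp: now.toISOString(),
    level,
    service,
    message,
    ...details,
  });
}
```

A caller would check `shouldEmit(level, parseLevel(process.env.APP_LOG_LEVEL))` before writing the result of `formatLogLine` to the console, which keeps the filtering decision separate from the formatting.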
From f865d2a397682a3c009294f271a6006a17900ecb Mon Sep 17 00:00:00 2001
From: Tukue Gebregergis
Date: Wed, 1 Apr 2026 11:32:47 +0200
Subject: [PATCH 4/7] docs: position repository as platform engineering consulting profile

---
 README.md                                     |  3 +
 ...platform-engineering-consulting-profile.md | 62 +++++++++++++++++++
 docs/platform-product-progress.md             |  1 +
 3 files changed, 66 insertions(+)
 create mode 100644 docs/platform-engineering-consulting-profile.md

diff --git a/README.md b/README.md
index ab2fe6d..3772c98 100644
--- a/README.md
+++ b/README.md
@@ -7,6 +7,8 @@ This repository has been transformed from a single-stack IaC project into a **Pl
 
 It now provides opinionated architecture, repository layout, templates, and delivery workflows to support a scalable **Internal Developer Platform (IDP)**.
 
+It is also curated as a **Platform Engineering consulting profile project** that demonstrates strategy, architecture, implementation, and measurable outcomes for platform transformations.
+
 ## What is included
 
 - A target platform architecture with:
@@ -81,6 +83,7 @@ Track implementation maturity and next milestones in:
 
 - `docs/platform-product-progress.md`
 - `docs/platform-product-operating-model.md`
+- `docs/platform-engineering-consulting-profile.md`
 
 ## Quick commands
 
diff --git a/docs/platform-engineering-consulting-profile.md b/docs/platform-engineering-consulting-profile.md
new file mode 100644
index 0000000..f05015a
--- /dev/null
+++ b/docs/platform-engineering-consulting-profile.md
@@ -0,0 +1,62 @@
+# Platform Engineering Consulting Profile
+
+_Last updated: 2026-04-01_
+
+## Purpose
+
+This repository is not only an implementation sandbox; it is a **consulting profile project** that demonstrates how to design, deliver, and operationalize a platform-as-a-product model for internal engineering teams.
+
+## Consulting narrative
+
+Use this project to show end-to-end consulting capability across four dimensions:
+
+1. **Strategy and operating model**
+   - Translate business/product constraints into a platform operating model.
+   - Define ownership boundaries (platform team vs application teams).
+   - Establish adoption metrics and maturity milestones.
+
+2. **Architecture and controls**
+   - Design the target architecture for runtime, delivery, governance, and observability.
+   - Implement secure-by-default guardrails and policy-as-code checks.
+   - Standardize golden paths for developer onboarding and service delivery.
+
+3. **Implementation and enablement**
+   - Deliver reusable IaC modules and environment composition patterns.
+   - Implement self-service templates and GitOps workflows.
+   - Provide practical observability defaults and incident-response hooks.
+
+4. **Adoption and measurable outcomes**
+   - Reduce lead time to first deployment.
+   - Improve deployment reliability and policy compliance.
+   - Improve mean time to detect through platform-managed telemetry.
+
+## Portfolio-ready capability map
+
+### Capability demonstrated today
+
+- Platform architecture blueprint and phased rollout guidance.
+- Backstage self-service template scaffolding.
+- CI quality gates for platform IaC and GitOps manifest checks.
+- Secure-by-default CDK reference stack.
+- Observability baseline implementation (dashboard, alarms, structured logging).
+
+### Capability in active roadmap
+
+- EKS runtime modules and environment compositions.
+- Argo CD app-of-apps implementation.
+- Expanded policy controls beyond Deployment resources.
+- Production-grade observability stack services (Prometheus/Grafana/Loki/OpenTelemetry).
+
+## How to present this in consulting engagements
+
+Use this repository as an engagement artifact to communicate:
+
+- **Current-state assessment**: what exists and where the risks/gaps are.
+- **Target-state design**: architecture and operating model with clear ownership.
+- **Delivery plan**: phased implementation with visible checkpoints.
+- **Evidence of execution**: code, policies, templates, and telemetry defaults.
+- **Value realization**: KPIs tied to developer productivity and reliability outcomes.
+
+## Suggested profile statement
+
+> Platform Engineer (Consulting): I design and implement platform-as-a-product capabilities that improve developer experience, enforce secure defaults, and scale delivery through self-service, GitOps, policy-as-code, and observability.
diff --git a/docs/platform-product-progress.md b/docs/platform-product-progress.md
index 26a9018..fcb3def 100644
--- a/docs/platform-product-progress.md
+++ b/docs/platform-product-progress.md
@@ -14,6 +14,7 @@ _Last updated: 2026-04-01_
 | Environment overlays (dev/stage/prod) | 🟡 In Progress | 40% | Structure exists; env-specific manifests and policy sets pending. |
 | Policy-as-code enforcement (OPA/Kyverno) | 🟡 In Progress | 60% | Conftest policy bundle and CI enforcement added for deployment security/image/resource guardrails. |
 | Observability productization | 🟡 In Progress | 60% | CloudWatch dashboard, alerts, and structured logging baseline implemented; Prometheus/Grafana/Loki/OTel deployments pending. |
+| Consulting profile packaging | ✅ Complete | 100% | Consulting narrative, capability map, and portfolio positioning documentation added. |
 | EKS + Argo CD platform runtime | ⏳ Planned | 20% | Target model documented; implementation modules still to be added. |
 | Backstage portal deployment | ⏳ Planned | 15% | Template exists; portal deployment and catalog automation pending. |
 
From 335bc83a49c6d41ee8ad7107d538a0edc8307a6c Mon Sep 17 00:00:00 2001
From: Tukue Gebregergis
Date: Wed, 1 Apr 2026 11:44:07 +0200
Subject: [PATCH 5/7] docs: add OaaS implementation flow guide for clarity

---
 README.md                        |   1 +
 docs/oaas-implementation-flow.md | 103 +++++++++++++++++++++++++++++++
 2 files changed, 104 insertions(+)
 create mode 100644 docs/oaas-implementation-flow.md

diff --git a/README.md b/README.md
index 3772c98..7101a25 100644
--- a/README.md
+++ b/README.md
@@ -68,6 +68,7 @@ Template ID: `recommended-path-k8s-service`
 See detailed architecture and workflows in:
 
 - `docs/platform-product-architecture.md`
+- `docs/oaas-implementation-flow.md`
 - `templates/service-catalog/template.yaml`
 
 
diff --git a/docs/oaas-implementation-flow.md b/docs/oaas-implementation-flow.md
new file mode 100644
index 0000000..b2df232
--- /dev/null
+++ b/docs/oaas-implementation-flow.md
@@ -0,0 +1,103 @@
+# OaaS Implementation Flow (Observability as a Service)
+
+_Last updated: 2026-04-01_
+
+## Why this flow exists
+
+This document clarifies the practical implementation flow used in this repository to deliver Observability as a Service (OaaS) as a platform capability.
+
+It is intended for:
+- platform engineers implementing shared observability controls,
+- application teams consuming the paved road,
+- consulting stakeholders reviewing delivery maturity and implementation evidence.
+
+## End-to-end implementation flow
+
+```mermaid
+flowchart TD
+  A[Assess current state] --> B[Define OaaS baseline contract]
+  B --> C[Implement platform controls in CDK]
+  C --> D[Instrument service logging + correlation IDs]
+  D --> E[Expose discoverability outputs]
+  E --> F[Validate build + synth]
+  F --> G[Operationalize alert routing + runbooks]
+  G --> H[Scale to OSS runtime stack Prometheus/Grafana/Loki/OTel]
+```
+
+## Step-by-step breakdown
+
+### 1) Assess current state
+
+- Confirm what telemetry already exists (API/Lambda logs, tracing, encryption).
+- Identify operational gaps: missing dashboards, missing default alarms, and weak log correlation patterns.
+
+**Output:** clear gap list and baseline scope.
+
+### 2) Define OaaS baseline contract
+
+Define what every service should get by default:
+- shared alerting channel,
+- baseline alarms for failures/latency,
+- standard dashboard views,
+- structured logging + correlation IDs,
+- exported observability resource references.
+
+**Output:** platform-managed observability contract.
+
+### 3) Implement platform controls in CDK
+
+Provision shared controls in infrastructure code:
+- SNS alarm topic for centralized fan-out.
+- CloudWatch alarms for core API/Lambda health indicators.
+- CloudWatch dashboard widgets for key operational views.
+
+**Output:** deployable observability control plane primitives.
+
+### 4) Instrument service logging and request correlation
+
+At the service handler level:
+- emit structured JSON logs (`timestamp`, `level`, `service`, `message`, context fields),
+- propagate `x-correlation-id` from inbound request to outbound response,
+- guarantee correlation ID availability in both success and error paths.
+
+**Output:** logs and traces that can be stitched during incident response.
+
+### 5) Expose discoverability outputs
+
+Export runtime identifiers to make observability assets easy to consume by tools and docs:
+- dashboard name,
+- alarm topic ARN,
+- log group name.
+
+**Output:** runbooks/portals can link to observability assets programmatically.
+
+### 6) Validate and operationalize
+
+- Validate build and CDK synthesis.
+- Attach real notification endpoints (Slack/Email/PagerDuty) to alarm topic.
+- Document routing and escalation policies.
+
+**Output:** alerts become actionable in real operations.
+
+### 7) Scale to open-source target stack
+
+Progress from baseline to full open-source observability stack:
+- Prometheus (metrics),
+- Grafana (dashboards),
+- Loki (logs),
+- OpenTelemetry Collector + Tempo/Jaeger (traces).
+
+**Output:** environment-wide, vendor-neutral observability platform.
+
+## Ownership model
+
+- **Platform team owns:** baseline architecture, alarms, shared dashboards, alert routing, policy and defaults.
+- **Application teams own:** service SLOs, runbooks, business metrics, and on-call response for service-level incidents.
+
+## Definition of done for OaaS flow
+
+- [ ] Shared alarms deployed and routed to owned notification channels.
+- [ ] Shared dashboard published and referenced in platform docs/runbooks.
+- [ ] Correlation ID visible in both success and error API responses.
+- [ ] Structured logs adopted in default service template.
+- [ ] OSS observability stack rollout plan mapped for dev/stage/prod.
From 0e66b7376326f0985170c257a6c3e23642e18807 Mon Sep 17 00:00:00 2001
From: Tukue Gebregergis
Date: Wed, 1 Apr 2026 11:44:12 +0200
Subject: [PATCH 6/7] docs: remove profile statement section from consulting doc

---
 docs/platform-engineering-consulting-profile.md | 4 ----
 1 file changed, 4 deletions(-)

diff --git a/docs/platform-engineering-consulting-profile.md b/docs/platform-engineering-consulting-profile.md
index f05015a..5b36b5e 100644
--- a/docs/platform-engineering-consulting-profile.md
+++ b/docs/platform-engineering-consulting-profile.md
@@ -56,7 +56,3 @@ Use this repository as an engagement artifact to communicate:
 - **Delivery plan**: phased implementation with visible checkpoints.
 - **Evidence of execution**: code, policies, templates, and telemetry defaults.
 - **Value realization**: KPIs tied to developer productivity and reliability outcomes.
-
-## Suggested profile statement
-
-> Platform Engineer (Consulting): I design and implement platform-as-a-product capabilities that improve developer experience, enforce secure defaults, and scale delivery through self-service, GitOps, policy-as-code, and observability.
From 176a3ed2c6178bd3f566bc2553c5d2dd3b6b22bc Mon Sep 17 00:00:00 2001
From: Tukue Gebregergis
Date: Wed, 1 Apr 2026 11:56:01 +0200
Subject: [PATCH 7/7] docs: update OaaS definition-of-done with implemented status

---
 docs/oaas-implementation-flow.md | 15 ++++++++++-----
 1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/docs/oaas-implementation-flow.md b/docs/oaas-implementation-flow.md
index b2df232..512a703 100644
--- a/docs/oaas-implementation-flow.md
+++ b/docs/oaas-implementation-flow.md
@@ -96,8 +96,13 @@ Progress from baseline to full open-source observability stack:
 
 ## Definition of done for OaaS flow
 
-- [ ] Shared alarms deployed and routed to owned notification channels.
-- [ ] Shared dashboard published and referenced in platform docs/runbooks.
-- [ ] Correlation ID visible in both success and error API responses.
-- [ ] Structured logs adopted in default service template.
-- [ ] OSS observability stack rollout plan mapped for dev/stage/prod.
+- [ ] Shared alarms deployed and routed to owned notification channels.
+  _Status:_ Alarms and SNS topic are implemented; endpoint subscriptions/routing ownership still pending.
+- [x] Shared dashboard published and referenced in platform docs/runbooks.
+  _Status:_ Dashboard is implemented and referenced through platform documentation/output exports.
+- [x] Correlation ID visible in both success and error API responses.
+  _Status:_ `x-correlation-id` is returned in both 200 and 500 responses in the Lambda handler.
+- [ ] Structured logs adopted in default service template.
+  _Status:_ Structured logs are implemented in the sample Lambda, but template-level adoption is still pending.
+- [ ] OSS observability stack rollout plan mapped for dev/stage/prod.
+  _Status:_ Target stack and phased direction are documented; environment-specific implementation mapping remains pending.
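
The checked correlation-ID item above can be exercised without deploying anything. The handler in this series looks up only two spellings of the header name, and depending on the client and API type the name may arrive in another casing. The sketch below shows a case-insensitive lookup plus a shared response builder, so the 200 and 500 paths cannot drift apart. `HeaderMap`, `resolveCorrelationId`, `JsonResponse`, and `jsonResponse` are hypothetical names and are not part of the patches.

```typescript
// Sketch of case-insensitive correlation ID resolution and a shared
// response builder. All names are hypothetical.

type HeaderMap = Record<string, string | undefined> | null | undefined;

// Returns the first non-empty x-correlation-id header regardless of casing,
// or the fallback (for example the Lambda request ID) when none is present.
function resolveCorrelationId(headers: HeaderMap, fallbackId: string): string {
  const source: Record<string, string | undefined> = headers ?? {};
  for (const name of Object.keys(source)) {
    const value = source[name];
    if (name.toLowerCase() === 'x-correlation-id' && value) {
      return value;
    }
  }
  return fallbackId;
}

interface JsonResponse {
  statusCode: number;
  headers: Record<string, string>;
  body: string;
}

// Builds success and error responses through one code path so that every
// response carries the correlation header.
function jsonResponse(
  statusCode: number,
  correlationId: string,
  payload: Record<string, unknown>
): JsonResponse {
  return {
    statusCode,
    headers: {
      'Content-Type': 'application/json',
      'Access-Control-Allow-Origin': '*',
      'x-correlation-id': correlationId,
    },
    body: JSON.stringify(payload),
  };
}
```

Routing both the success and the error response through `jsonResponse` turns the "visible in both 200 and 500 responses" criterion into a property of the helper rather than something each return statement has to remember.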