tukue · tukue · Apr 1, 2026 · Apr 1, 2026 · Apr 1, 2026 · Apr 1, 2026
diff --git a/README.md b/README.md
@@ -7,6 +7,8 @@ This repository has been transformed from a single-stack IaC project into a **Pl
 
 It now provides opinionated architecture, repository layout, templates, and delivery workflows to support a scalable **Internal Developer Platform (IDP)**.
 
+It is also curated as a **Platform Engineering consulting profile project** that demonstrates strategy, architecture, implementation, and measurable outcomes for platform transformations.
+
 ## What is included
 
 - A target platform architecture with:
@@ -66,6 +68,7 @@ Template ID: `recommended-path-k8s-service`
 See detailed architecture and workflows in:
 
 - `docs/platform-product-architecture.md`
+- `docs/oaas-implementation-flow.md`
 - `templates/service-catalog/template.yaml`
 
 
@@ -81,6 +84,7 @@ Track implementation maturity and next milestones in:
 
 - `docs/platform-product-progress.md`
 - `docs/platform-product-operating-model.md`
+- `docs/platform-engineering-consulting-profile.md`
 
 ## Quick commands
 

diff --git a/docs/oaas-implementation-flow.md b/docs/oaas-implementation-flow.md
@@ -0,0 +1,108 @@
+# OaaS Implementation Flow (Observability as a Service)
+
+_Last updated: 2026-04-01_
+
+## Why this flow exists
+
+This document clarifies the practical implementation flow used in this repository to deliver Observability as a Service (OaaS) as a platform capability.
+
+It is intended for:
+- platform engineers implementing shared observability controls,
+- application teams consuming the paved road,
+- consulting stakeholders reviewing delivery maturity and implementation evidence.
+
+## End-to-end implementation flow
+
+```mermaid
+flowchart TD
+    A[Assess current state] --> B[Define OaaS baseline contract]
+    B --> C[Implement platform controls in CDK]
+    C --> D[Instrument service logging + correlation IDs]
+    D --> E[Expose discoverability outputs]
+    E --> F[Validate build + synth]
+    F --> G[Operationalize alert routing + runbooks]
+    G --> H[Scale to OSS runtime stack Prometheus/Grafana/Loki/OTel]
+```
+
+## Step-by-step breakdown
+
+### 1) Assess current state
+
+- Confirm what telemetry already exists (API/Lambda logs, tracing, encryption).
+- Identify operational gaps: missing dashboards, missing default alarms, and weak log correlation patterns.
+
+**Output:** clear gap list and baseline scope.
+
+## 2) Define OaaS baseline contract
+
+Define what every service should get by default:
+- shared alerting channel,
+- baseline alarms for failures/latency,
+- standard dashboard views,
+- structured logging + correlation IDs,
+- exported observability resource references.
+
+**Output:** platform-managed observability contract.
+
+## 3) Implement platform controls in CDK
+
+Provision shared controls in infrastructure code:
+- SNS alarm topic for centralized fan-out.
+- CloudWatch alarms for core API/Lambda health indicators.
+- CloudWatch dashboard widgets for key operational views.
+
+**Output:** deployable observability control plane primitives.
+
+## 4) Instrument service logging and request correlation
+
+At the service handler level:
+- emit structured JSON logs (`timestamp`, `level`, `service`, `message`, context fields),
+- propagate `x-correlation-id` from inbound request to outbound response,
+- guarantee correlation ID availability in both success and error paths.
+
+**Output:** logs and traces that can be stitched during incident response.
+
+## 5) Expose discoverability outputs
+
+Export runtime identifiers to make observability assets easy to consume by tools and docs:
+- dashboard name,
+- alarm topic ARN,
+- log group name.
+
+**Output:** runbooks/portals can link to observability assets programmatically.
+
+## 6) Validate and operationalize
+
+- Validate build and CDK synthesis.
+- Attach real notification endpoints (Slack/Email/PagerDuty) to alarm topic.
+- Document routing and escalation policies.
+
+**Output:** alerts become actionable in real operations.
+
+## 7) Scale to open-source target stack
+
+Progress from baseline to full open-source observability stack:
+- Prometheus (metrics),
+- Grafana (dashboards),
+- Loki (logs),
+- OpenTelemetry Collector + Tempo/Jaeger (traces).
+
+**Output:** environment-wide, vendor-neutral observability platform.
+
+## Ownership model
+
+- **Platform team owns:** baseline architecture, alarms, shared dashboards, alert routing, policy and defaults.
+- **Application teams own:** service SLOs, runbooks, business metrics, and on-call response for service-level incidents.
+
+## Definition of done for OaaS flow
+
+- [ ] Shared alarms deployed and routed to owned notification channels.  
+  _Status:_ Alarms and SNS topic are implemented; endpoint subscriptions/routing ownership still pending.
+- [x] Shared dashboard published and referenced in platform docs/runbooks.  
+  _Status:_ Dashboard is implemented and referenced through platform documentation/output exports.
+- [x] Correlation ID visible in both success and error API responses.  
+  _Status:_ `x-correlation-id` is returned in both 200 and 500 responses in the Lambda handler.
+- [ ] Structured logs adopted in default service template.  
+  _Status:_ Structured logs are implemented in the sample Lambda, but template-level adoption is still pending.
+- [ ] OSS observability stack rollout plan mapped for dev/stage/prod.  
+  _Status:_ Target stack and phased direction are documented; environment-specific implementation mapping remains pending.
diff --git a/docs/observability-as-a-service.md b/docs/observability-as-a-service.md
@@ -0,0 +1,65 @@
+# Observability as a Service (OaaS) - Assessment and Implementation
+
+_Last updated: 2026-04-01_
+
+## Objective
+
+Provide a platform-managed observability baseline for workloads so teams get actionable telemetry by default, without creating custom dashboards, alarms, or logging pipelines per service.
+
+## Current-state assessment (this repo)
+
+### Strengths already present
+
+- API Gateway execution/access logging is enabled with CloudWatch log delivery.
+- Lambda tracing is enabled (`Tracing.ACTIVE`) and API Gateway tracing is enabled.
+- Shared KMS encryption is already used for API logs and other platform resources.
+
+### Gaps identified
+
+- No pre-provisioned dashboard that aggregates API and Lambda golden signals.
+- No default alerting path for operational failures (e.g., Lambda errors, API 5xx).
+- Lambda application logs were not standardized for correlation and service-level context.
+- No explicit service-level observability outputs for integration into runbooks/portal metadata.
+
+## Implemented OaaS baseline in this change
+
+### 1) Platform alerting channel
+
+- Added an encrypted SNS topic (`ObservabilityAlarmTopic`) for alarm fan-out.
+- Wired core alarms to this topic to establish a shared notification mechanism.
+
+### 2) Recommended baseline recommendations for alarms
+
+- Added Lambda error alarm (sum errors over 5 minutes).
+- Added Lambda latency alarm (p95 duration over 2 seconds).
+- Added API Gateway 5xx alarm (sum server errors over 5 minutes).
+
+### 3) Shared dashboard
+
+- Added CloudWatch dashboard with:
+  - Lambda invocations/errors
+  - Lambda duration p50/p95
+  - API requests/5xx
+  - API latency p50/p95
+
+### 4) Structured application logging contract
+
+- Updated Lambda handler to emit JSON logs with:
+  - timestamp
+  - level
+  - service name
+  - correlation ID
+  - request metadata (method/path/request ID)
+- Added correlation ID propagation in API responses (`x-correlation-id`).
+
+### 5) Discoverability outputs
+
+- Added stack outputs for dashboard name, alarm topic ARN, and application log group name to simplify integration with platform docs and runbooks.
+
+## Recommended next steps (Phase 2+)
+
+1. Subscribe Slack/Email/PagerDuty endpoints to the SNS alarm topic via environment-specific config.
+2. Add metric filters and alarms for error-rate SLO burn thresholds.
+3. Standardize alarm severity labels and route high/medium/low priorities separately.
+4. Introduce OpenTelemetry collector path for vendor-neutral traces/metrics export.
+5. Export dashboard links into Backstage component annotations for developer self-service.
diff --git a/docs/platform-engineering-consulting-profile.md b/docs/platform-engineering-consulting-profile.md
@@ -0,0 +1,58 @@
+# Platform Engineering Consulting Profile
+
+_Last updated: 2026-04-01_
+
+## Purpose
+
+This repository is not only an implementation sandbox; it is a **consulting profile project** that demonstrates how to design, deliver, and operationalize a platform-as-a-product model for internal engineering teams.
+
+## Consulting narrative
+
+Use this project to show end-to-end consulting capability across four dimensions:
+
+1. **Strategy and operating model**
+   - Translate business/product constraints into a platform operating model.
+   - Define ownership boundaries (platform team vs application teams).
+   - Establish adoption metrics and maturity milestones.
+
+2. **Architecture and controls**
+   - Design the target architecture for runtime, delivery, governance, and observability.
+   - Implement secure-by-default guardrails and policy-as-code checks.
+   - Standardize golden paths for developer onboarding and service delivery.
+
+3. **Implementation and enablement**
+   - Deliver reusable IaC modules and environment composition patterns.
+   - Implement self-service templates and GitOps workflows.
+   - Provide practical observability defaults and incident-response hooks.
+
+4. **Adoption and measurable outcomes**
+   - Reduce lead time to first deployment.
+   - Improve deployment reliability and policy compliance.
+   - Improve mean time to detect through platform-managed telemetry.
+
+## Portfolio-ready capability map
+
+### Capability demonstrated today
+
+- Platform architecture blueprint and phased rollout guidance.
+- Backstage self-service template scaffolding.
+- CI quality gates for platform IaC and GitOps manifest checks.
+- Secure-by-default CDK reference stack.
+- Observability baseline implementation (dashboard, alarms, structured logging).
+
+### Capability in active roadmap
+
+- EKS runtime modules and environment compositions.
+- Argo CD app-of-apps implementation.
+- Expanded policy controls beyond Deployment resources.
+- Production-grade observability stack services (Prometheus/Grafana/Loki/OpenTelemetry).
+
+## How to present this in consulting engagements
+
+Use this repository as an engagement artifact to communicate:
+
+- **Current-state assessment**: what exists and where the risks/gaps are.
+- **Target-state design**: architecture and operating model with clear ownership.
+- **Delivery plan**: phased implementation with visible checkpoints.
+- **Evidence of execution**: code, policies, templates, and telemetry defaults.
+- **Value realization**: KPIs tied to developer productivity and reliability outcomes.
diff --git a/docs/platform-product-progress.md b/docs/platform-product-progress.md
@@ -1,6 +1,6 @@
 # Platform as a Product Progress Tracker
 
-_Last updated: 2026-03-26_
+_Last updated: 2026-04-01_
 
 ## Delivery status snapshot
 
@@ -13,7 +13,8 @@ _Last updated: 2026-03-26_
 | Secure-by-default CDK sample hardening | ✅ Complete | 100% | KMS, VPC, DLQ, IAM auth, caching, encrypted logs implemented. |
 | Environment overlays (dev/stage/prod) | 🟡 In Progress | 40% | Structure exists; env-specific manifests and policy sets pending. |
 | Policy-as-code enforcement (OPA/Kyverno) | 🟡 In Progress | 60% | Conftest policy bundle and CI enforcement added for deployment security/image/resource guardrails. |
-| Observability productization | 🟡 In Progress | 35% | Architecture defined; Prometheus/Grafana/Loki/OTel deployments pending. |
+| Observability productization | 🟡 In Progress | 60% | CloudWatch dashboard, alerts, and structured logging baseline implemented; Prometheus/Grafana/Loki/OTel deployments pending. |
+| Consulting profile packaging | ✅ Complete | 100% | Consulting narrative, capability map, and portfolio positioning documentation added. |
 | EKS + Argo CD platform runtime | ⏳ Planned | 20% | Target model documented; implementation modules still to be added. |
 | Backstage portal deployment | ⏳ Planned | 15% | Template exists; portal deployment and catalog automation pending. |
 
@@ -33,6 +34,14 @@ _Last updated: 2026-03-26_
 4. Add observability baseline (Prometheus, Grafana, Loki, OpenTelemetry Collector).
 5. Expand service repo structure with CI, Dockerfile, Helm chart, and SLO/runbook assets.
 
+## Latest implementation increment
+
+- Added a concrete "observability as a service" baseline to the CDK sample stack:
+  - CloudWatch dashboard for API + Lambda recommended baseline telemetry signals.
+  - Encrypted SNS-backed alarm fan-out for Lambda and API failures/latency.
+  - Structured Lambda JSON logging and correlation-ID propagation.
+- See `docs/observability-as-a-service.md` for assessment details and rollout recommendations.
+
 ## Definition of done for next milestone
 
 - [ ] `platform/environments/{dev,stage,prod}` contain concrete compositions.