4 changes: 4 additions & 0 deletions README.md
@@ -7,6 +7,8 @@ This repository has been transformed from a single-stack IaC project into a **Pl

It now provides opinionated architecture, repository layout, templates, and delivery workflows to support a scalable **Internal Developer Platform (IDP)**.

It is also curated as a **Platform Engineering consulting profile project** that demonstrates strategy, architecture, implementation, and measurable outcomes for platform transformations.

## What is included

- A target platform architecture with:
@@ -66,6 +68,7 @@ Template ID: `recommended-path-k8s-service`
See detailed architecture and workflows in:

- `docs/platform-product-architecture.md`
- `docs/oaas-implementation-flow.md`
- `templates/service-catalog/template.yaml`


@@ -81,6 +84,7 @@ Track implementation maturity and next milestones in:

- `docs/platform-product-progress.md`
- `docs/platform-product-operating-model.md`
- `docs/platform-engineering-consulting-profile.md`

## Quick commands

108 changes: 108 additions & 0 deletions docs/oaas-implementation-flow.md
@@ -0,0 +1,108 @@
# OaaS Implementation Flow (Observability as a Service)

_Last updated: 2026-04-01_

## Why this flow exists

This document clarifies the practical implementation flow used in this repository to deliver Observability as a Service (OaaS) as a platform capability.

It is intended for:
- platform engineers implementing shared observability controls,
- application teams consuming the paved road,
- consulting stakeholders reviewing delivery maturity and implementation evidence.

## End-to-end implementation flow

```mermaid
flowchart TD
A[Assess current state] --> B[Define OaaS baseline contract]
B --> C[Implement platform controls in CDK]
C --> D[Instrument service logging + correlation IDs]
D --> E[Expose discoverability outputs]
E --> F[Validate build + synth]
F --> G[Operationalize alert routing + runbooks]
G --> H[Scale to OSS runtime stack Prometheus/Grafana/Loki/OTel]
```

## Step-by-step breakdown

## 1) Assess current state

- Confirm what telemetry already exists (API/Lambda logs, tracing, encryption).
- Identify operational gaps: missing dashboards, missing default alarms, and weak log correlation patterns.

**Output:** clear gap list and baseline scope.

## 2) Define OaaS baseline contract

Define what every service should get by default:
- shared alerting channel,
- baseline alarms for failures/latency,
- standard dashboard views,
- structured logging + correlation IDs,
- exported observability resource references.

**Output:** platform-managed observability contract.
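
One way to make this contract checkable is to represent it as data. The sketch below is illustrative only: the `ObservabilityContract` class, its field names, and the `missing_defaults` helper are hypothetical and are not taken from this repository.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class ObservabilityContract:
    """Hypothetical record of the defaults a service receives from the platform."""

    shared_alerting_channel: bool = False
    baseline_alarms: bool = False
    standard_dashboard: bool = False
    structured_logging: bool = False  # includes correlation IDs
    exported_resource_refs: bool = False


def missing_defaults(contract: ObservabilityContract) -> list[str]:
    """Return the names of contract items the service does not yet satisfy."""
    return [f.name for f in fields(contract) if not getattr(contract, f.name)]


# Example: a service that has alerting, alarms, and a dashboard, but nothing else yet.
sample = ObservabilityContract(
    shared_alerting_channel=True,
    baseline_alarms=True,
    standard_dashboard=True,
)
print(missing_defaults(sample))
```

A check like this could run in CI against a service descriptor, so gaps in the contract surface before deployment instead of during an incident.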

## 3) Implement platform controls in CDK

Provision shared controls in infrastructure code:
- SNS alarm topic for centralized fan-out.
- CloudWatch alarms for core API/Lambda health indicators.
- CloudWatch dashboard widgets for key operational views.

**Output:** deployable observability control plane primitives.

## 4) Instrument service logging and request correlation

At the service handler level:
- emit structured JSON logs (`timestamp`, `level`, `service`, `message`, context fields),
- propagate `x-correlation-id` from inbound request to outbound response,
- guarantee correlation ID availability in both success and error paths.

**Output:** logs and traces that can be stitched during incident response.
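
The three points above can be sketched as a small handler. This is a minimal illustration, not the repository's Lambda: the handler language is not shown in this change, so Python is an assumption, and `SERVICE_NAME`, `log`, `do_work`, and the `correlationId` field name are hypothetical. The event shape assumes an API Gateway proxy-style `headers` mapping.

```python
import json
import uuid
from datetime import datetime, timezone

SERVICE_NAME = "sample-service"  # hypothetical service name


def log(level, message, **context):
    """Build one structured log record, print it as a JSON line, and return it."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "level": level,
        "service": SERVICE_NAME,
        "message": message,
        **context,
    }
    print(json.dumps(record))
    return record


def do_work(event):
    """Placeholder for business logic; fails when the event asks it to."""
    if event.get("fail"):
        raise RuntimeError("simulated failure")
    return {"ok": True}


def handler(event, context=None):
    """Reuse the inbound x-correlation-id, or mint one, and return it on every path."""
    # Header names are case-insensitive, so normalize before the lookup.
    headers = {k.lower(): v for k, v in (event.get("headers") or {}).items()}
    correlation_id = headers.get("x-correlation-id") or str(uuid.uuid4())
    response_headers = {"x-correlation-id": correlation_id}
    try:
        body = do_work(event)
        log("INFO", "request handled", correlationId=correlation_id)
        return {"statusCode": 200, "headers": response_headers, "body": json.dumps(body)}
    except Exception as exc:
        log("ERROR", "request failed", correlationId=correlation_id, error=str(exc))
        return {
            "statusCode": 500,
            "headers": response_headers,
            "body": json.dumps({"message": "internal error"}),
        }
```

Building the response headers before the `try` block is what guarantees the ID is present on the error path as well as the success path.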

## 5) Expose discoverability outputs

Export runtime identifiers to make observability assets easy to consume by tools and docs:
- dashboard name,
- alarm topic ARN,
- log group name.

**Output:** runbooks/portals can link to observability assets programmatically.
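
As a rough sketch of what "programmatically" could look like, the snippet below turns a mapping of stack outputs into a runbook section. The output keys (`DashboardName`, `AlarmTopicArn`, `LogGroupName`) and all example values are assumptions made for illustration; the actual output names in the stack are not shown in this change. The only fixed fact relied on is the general ARN layout `arn:partition:service:region:account-id:resource`.

```python
def parse_arn(arn):
    """Split an ARN of the form arn:partition:service:region:account-id:resource."""
    parts = arn.split(":", 5)
    if len(parts) != 6 or parts[0] != "arn":
        raise ValueError(f"not an ARN: {arn!r}")
    keys = ("partition", "service", "region", "account", "resource")
    return dict(zip(keys, parts[1:]))


def runbook_section(outputs):
    """Render a Markdown snippet from stack outputs (output keys are hypothetical)."""
    topic = parse_arn(outputs["AlarmTopicArn"])
    return "\n".join(
        [
            "## Observability assets",
            f"- Dashboard: `{outputs['DashboardName']}`",
            f"- Alarm topic: `{topic['resource']}` ({topic['region']})",
            f"- Log group: `{outputs['LogGroupName']}`",
        ]
    )


# Example values only; none of these identifiers come from the repository.
example_outputs = {
    "DashboardName": "example-dashboard",
    "AlarmTopicArn": "arn:aws:sns:eu-west-1:123456789012:example-alarm-topic",
    "LogGroupName": "/aws/lambda/example-function",
}
print(runbook_section(example_outputs))
```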

## 6) Validate and operationalize

- Validate build and CDK synthesis.
- Attach real notification endpoints (Slack/Email/PagerDuty) to alarm topic.
- Document routing and escalation policies.

**Output:** alerts become actionable in real operations.

## 7) Scale to open-source target stack

Progress from baseline to full open-source observability stack:
- Prometheus (metrics),
- Grafana (dashboards),
- Loki (logs),
- OpenTelemetry Collector + Tempo/Jaeger (traces).

**Output:** environment-wide, vendor-neutral observability platform.

## Ownership model

- **Platform team owns:** baseline architecture, alarms, shared dashboards, alert routing, policy and defaults.
- **Application teams own:** service SLOs, runbooks, business metrics, and on-call response for service-level incidents.

## Definition of done for OaaS flow

- [ ] Shared alarms deployed and routed to owned notification channels.
_Status:_ Alarms and SNS topic are implemented; endpoint subscriptions/routing ownership still pending.
- [x] Shared dashboard published and referenced in platform docs/runbooks.
_Status:_ Dashboard is implemented and referenced through platform documentation/output exports.
- [x] Correlation ID visible in both success and error API responses.
_Status:_ `x-correlation-id` is returned in both 200 and 500 responses in the Lambda handler.
- [ ] Structured logs adopted in default service template.
_Status:_ Structured logs are implemented in the sample Lambda, but template-level adoption is still pending.
- [ ] OSS observability stack rollout plan mapped for dev/stage/prod.
_Status:_ Target stack and phased direction are documented; environment-specific implementation mapping remains pending.
65 changes: 65 additions & 0 deletions docs/observability-as-a-service.md
@@ -0,0 +1,65 @@
# Observability as a Service (OaaS) - Assessment and Implementation

_Last updated: 2026-04-01_

## Objective

Provide a platform-managed observability baseline for workloads so teams get actionable telemetry by default, without creating custom dashboards, alarms, or logging pipelines per service.

## Current-state assessment (this repo)

### Strengths already present

- API Gateway execution/access logging is enabled with CloudWatch log delivery.
- Lambda tracing is enabled (`Tracing.ACTIVE`) and API Gateway tracing is enabled.
- Shared KMS encryption is already used for API logs and other platform resources.

### Gaps identified

- No pre-provisioned dashboard that aggregates API and Lambda golden signals.
- No default alerting path for operational failures (e.g., Lambda errors, API 5xx).
- Lambda application logs were not standardized for correlation and service-level context.
- No explicit service-level observability outputs for integration into runbooks/portal metadata.

## Implemented OaaS baseline in this change

### 1) Platform alerting channel

- Added an encrypted SNS topic (`ObservabilityAlarmTopic`) for alarm fan-out.
- Wired core alarms to this topic to establish a shared notification mechanism.

### 2) Recommended baseline alarms

- Added Lambda error alarm (sum errors over 5 minutes).
- Added Lambda latency alarm (p95 duration over 2 seconds).
- Added API Gateway 5xx alarm (sum server errors over 5 minutes).
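
To make the alarm conditions concrete, the sketch below evaluates one 5-minute window in plain Python. It is illustrative only: CloudWatch computes these statistics itself, the nearest-rank percentile used here is a simplification, and `error_threshold=1` is an assumed value because the actual error-count thresholds are not stated in this document. Only the 2-second p95 limit comes from the text above.

```python
import math


def p95(samples):
    """Nearest-rank 95th percentile of a non-empty list of numbers."""
    ordered = sorted(samples)
    rank = math.ceil(95 * len(ordered) / 100)
    return ordered[rank - 1]


def evaluate_baseline(errors, durations_ms, api_5xx, error_threshold=1, p95_limit_ms=2000):
    """Return which baseline alarms would fire for one 5-minute window.

    errors and api_5xx are per-minute counts; error_threshold is an assumed value.
    """
    return {
        "lambda_errors": sum(errors) >= error_threshold,
        "lambda_latency_p95": p95(durations_ms) > p95_limit_ms,
        "api_5xx": sum(api_5xx) >= error_threshold,
    }


# Example window: one Lambda error, two slow invocations out of twenty, no API 5xx.
window = evaluate_baseline(
    errors=[0, 0, 1, 0, 0],
    durations_ms=[300] * 18 + [2500] * 2,
    api_5xx=[0, 0, 0, 0, 0],
)
print(window)
```

Note that a single slow invocation out of twenty would not trip the latency alarm here, because the p95 ignores the slowest 5% of samples; that tolerance is the reason for alarming on a percentile instead of the maximum.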

### 3) Shared dashboard

- Added CloudWatch dashboard with:
  - Lambda invocations/errors
  - Lambda duration p50/p95
  - API requests/5xx
  - API latency p50/p95

### 4) Structured application logging contract

- Updated Lambda handler to emit JSON logs with:
  - timestamp
  - level
  - service name
  - correlation ID
  - request metadata (method/path/request ID)
- Added correlation ID propagation in API responses (`x-correlation-id`).

### 5) Discoverability outputs

- Added stack outputs for dashboard name, alarm topic ARN, and application log group name to simplify integration with platform docs and runbooks.

## Recommended next steps (Phase 2+)

1. Subscribe Slack/Email/PagerDuty endpoints to the SNS alarm topic via environment-specific config.
2. Add metric filters and alarms for error-rate SLO burn thresholds.
3. Standardize alarm severity labels and route high/medium/low priorities separately.
4. Introduce OpenTelemetry collector path for vendor-neutral traces/metrics export.
5. Export dashboard links into Backstage component annotations for developer self-service.
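
For step 2, an error-rate SLO burn threshold is usually expressed as a burn rate: the observed error ratio divided by the error budget (1 minus the SLO target). The sketch below shows that arithmetic under assumed numbers; `burn_rate`, `should_alert`, the 99.9% target, and the threshold of 2.0 are all hypothetical and are not values defined by this repository.

```python
def burn_rate(errors, requests, slo_target):
    """Error-budget burn rate: observed error ratio divided by the allowed ratio."""
    if requests == 0:
        return 0.0
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return (errors / requests) / budget


def should_alert(errors, requests, slo_target, threshold):
    """True when the window consumes budget faster than the chosen threshold."""
    return burn_rate(errors, requests, slo_target) >= threshold


# Example: 12 errors in 2000 requests against an assumed 99.9% SLO.
# The error ratio is 0.6% and the budget is 0.1%, so the burn rate is about 6.
rate = burn_rate(errors=12, requests=2000, slo_target=0.999)
print(round(rate, 2), should_alert(12, 2000, 0.999, threshold=2.0))
```

A burn rate of 1 means the budget would be spent exactly at the end of the SLO window; values above 1 mean it runs out early, which is what the alarm threshold is meant to catch.
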
58 changes: 58 additions & 0 deletions docs/platform-engineering-consulting-profile.md
@@ -0,0 +1,58 @@
# Platform Engineering Consulting Profile

_Last updated: 2026-04-01_

## Purpose

This repository is not only an implementation sandbox; it is a **consulting profile project** that demonstrates how to design, deliver, and operationalize a platform-as-a-product model for internal engineering teams.

## Consulting narrative

Use this project to show end-to-end consulting capability across four dimensions:

1. **Strategy and operating model**
   - Translate business/product constraints into a platform operating model.
   - Define ownership boundaries (platform team vs application teams).
   - Establish adoption metrics and maturity milestones.

2. **Architecture and controls**
   - Design the target architecture for runtime, delivery, governance, and observability.
   - Implement secure-by-default guardrails and policy-as-code checks.
   - Standardize golden paths for developer onboarding and service delivery.

3. **Implementation and enablement**
   - Deliver reusable IaC modules and environment composition patterns.
   - Implement self-service templates and GitOps workflows.
   - Provide practical observability defaults and incident-response hooks.

4. **Adoption and measurable outcomes**
   - Reduce lead time to first deployment.
   - Improve deployment reliability and policy compliance.
   - Improve mean time to detect through platform-managed telemetry.

## Portfolio-ready capability map

### Capability demonstrated today

- Platform architecture blueprint and phased rollout guidance.
- Backstage self-service template scaffolding.
- CI quality gates for platform IaC and GitOps manifest checks.
- Secure-by-default CDK reference stack.
- Observability baseline implementation (dashboard, alarms, structured logging).

### Capability in active roadmap

- EKS runtime modules and environment compositions.
- Argo CD app-of-apps implementation.
- Expanded policy controls beyond Deployment resources.
- Production-grade observability stack services (Prometheus/Grafana/Loki/OpenTelemetry).

## How to present this in consulting engagements

Use this repository as an engagement artifact to communicate:

- **Current-state assessment**: what exists and where the risks/gaps are.
- **Target-state design**: architecture and operating model with clear ownership.
- **Delivery plan**: phased implementation with visible checkpoints.
- **Evidence of execution**: code, policies, templates, and telemetry defaults.
- **Value realization**: KPIs tied to developer productivity and reliability outcomes.
13 changes: 11 additions & 2 deletions docs/platform-product-progress.md
@@ -1,6 +1,6 @@
# Platform as a Product Progress Tracker

_Last updated: 2026-03-26_
_Last updated: 2026-04-01_

## Delivery status snapshot

@@ -13,7 +13,8 @@ _Last updated: 2026-03-26_
| Secure-by-default CDK sample hardening | ✅ Complete | 100% | KMS, VPC, DLQ, IAM auth, caching, encrypted logs implemented. |
| Environment overlays (dev/stage/prod) | 🟡 In Progress | 40% | Structure exists; env-specific manifests and policy sets pending. |
| Policy-as-code enforcement (OPA/Kyverno) | 🟡 In Progress | 60% | Conftest policy bundle and CI enforcement added for deployment security/image/resource guardrails. |
| Observability productization | 🟡 In Progress | 35% | Architecture defined; Prometheus/Grafana/Loki/OTel deployments pending. |
| Observability productization | 🟡 In Progress | 60% | CloudWatch dashboard, alerts, and structured logging baseline implemented; Prometheus/Grafana/Loki/OTel deployments pending. |
| Consulting profile packaging | ✅ Complete | 100% | Consulting narrative, capability map, and portfolio positioning documentation added. |
| EKS + Argo CD platform runtime | ⏳ Planned | 20% | Target model documented; implementation modules still to be added. |
| Backstage portal deployment | ⏳ Planned | 15% | Template exists; portal deployment and catalog automation pending. |

@@ -33,6 +34,14 @@ _Last updated: 2026-03-26_
4. Add observability baseline (Prometheus, Grafana, Loki, OpenTelemetry Collector).
5. Expand service repo structure with CI, Dockerfile, Helm chart, and SLO/runbook assets.

## Latest implementation increment

- Added a concrete "observability as a service" baseline to the CDK sample stack:
  - CloudWatch dashboard for baseline API and Lambda telemetry signals.
  - Encrypted SNS-backed alarm fan-out for Lambda and API failures/latency.
  - Structured Lambda JSON logging and correlation-ID propagation.
- See `docs/observability-as-a-service.md` for assessment details and rollout recommendations.

## Definition of done for next milestone

- [ ] `platform/environments/{dev,stage,prod}` contain concrete compositions.