CCM-16073 - Enhanced callbacks #145
Pull request overview
Introduces enhanced callback delivery security and operational controls by adding mTLS + certificate pinning configuration to callback targets, and shifting delivery from direct EventBridge API Destinations to an SQS + per-client HTTPS delivery Lambda model (with Redis-backed gating for rate limiting/circuit breaking).
Changes:
- Add `mtls`, `certPinning`, and `delivery` fields to callback target model + schema validation, and update fixtures/tests accordingly.
- Add CLI commands to manage mTLS, certificate pinning enable/disable, and SPKI hash extraction/storage for targets.
- Add new `https-client-lambda` (delivery, signing, retries, DLQ handling, Redis/Lua gate) and new shared `config-cache` package; update Terraform to provision per-client delivery infra (SQS/Lambda/ElastiCache) and mock mTLS ALB.
Reviewed changes
Copilot reviewed 104 out of 107 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| tools/client-subscriptions-management/src/entrypoint/cli/targets-set-pinning.ts | New CLI command to enable/disable certificate pinning for a target. |
| tools/client-subscriptions-management/src/entrypoint/cli/targets-set-mtls.ts | New CLI command to enable/disable mTLS for a target. |
| tools/client-subscriptions-management/src/entrypoint/cli/targets-set-certificate.ts | New CLI command to extract/store SPKI hash from PEM for a target. |
| tools/client-subscriptions-management/src/entrypoint/cli/clients-put.ts | Minor cleanup in CLI file handling comment. |
| tools/client-subscriptions-management/src/domain/client-subscription-builder.ts | Adds default mtls/certPinning and emits security warnings when building targets. |
| tools/client-subscriptions-management/src/tests/helpers/client-subscription-fixtures.ts | Updates test target fixtures with required mtls/certPinning fields. |
| tools/client-subscriptions-management/src/tests/entrypoint/cli/targets-set-pinning.test.ts | Unit tests for targets-set-pinning CLI behavior. |
| tools/client-subscriptions-management/src/tests/entrypoint/cli/targets-set-mtls.test.ts | Unit tests for targets-set-mtls CLI behavior. |
| tools/client-subscriptions-management/src/tests/entrypoint/cli/targets-set-certificate.test.ts | Unit tests for targets-set-certificate CLI behavior. |
| tools/client-subscriptions-management/src/tests/domain/client-subscription-builder.test.ts | Adds tests around security warning emission in target builder. |
| tools/client-subscriptions-management/package.json | Adds picocolors dependency for warning output. |
| tests/integration/helpers/mock-client-config.ts | Adds a new integration fixture key for an mTLS-enabled mock client. |
| tests/integration/helpers/event-factories.ts | Adds factory for delivery messages compatible with the new delivery flow. |
| tests/integration/fixtures/subscriptions/mock-client-rate-limit.json | New integration fixture including required mtls/certPinning. |
| tests/integration/fixtures/subscriptions/mock-client-mtls.json | New integration fixture for mTLS + pinning enabled target. |
| tests/integration/fixtures/subscriptions/mock-client-circuit-breaker.json | New integration fixture including delivery circuit breaker configuration. |
| tests/integration/fixtures/subscriptions/mock-client-2.json | Updates existing integration fixture targets to include mtls/certPinning. |
| tests/integration/fixtures/subscriptions/mock-client-1.json | Updates existing integration fixture targets to include mtls/certPinning. |
| src/models/src/client-config.ts | Extends callback target type with mtls, certPinning, and optional delivery. |
| src/models/src/client-config-schema.ts | Adds Zod validation for mtls, certPinning (incl. SPKI hash constraints), and delivery. |
| src/models/src/tests/client-config-schema.test.ts | Adds/updates schema tests for new target fields and constraints. |
| src/config-cache/tsconfig.json | New package tsconfig for config-cache workspace. |
| src/config-cache/src/index.ts | Exports ConfigCache from new workspace package. |
| src/config-cache/src/config-cache.ts | Implements TTL-based in-memory config cache. |
| src/config-cache/src/tests/config-cache.test.ts | Unit tests for ConfigCache TTL and behaviors. |
| src/config-cache/package.json | New workspace package definition for config-cache. |
| src/config-cache/jest.config.ts | Jest config for config-cache package. |
| scripts/config/pre-commit.yaml | Adjusts pre-commit detect-private-key exclusions for a test file. |
| pnpm-workspace.yaml | Adds workspace catalog entries for @redis/client, picocolors, and Secrets Manager SDK. |
| lambdas/mock-webhook-lambda/src/index.ts | Adds ALB mTLS client-cert header validation for mock webhook endpoint. |
| lambdas/mock-webhook-lambda/src/tests/index.test.ts | Adds unit tests for ALB mTLS certificate verification flow. |
| lambdas/https-client-lambda/tsconfig.json | New lambda workspace tsconfig. |
| lambdas/https-client-lambda/src/services/ssm-applications-map.ts | Loads and caches clientId→applicationId map from SSM. |
| lambdas/https-client-lambda/src/services/sqs-visibility.ts | SQS ChangeMessageVisibility helper. |
| lambdas/https-client-lambda/src/services/record-result.lua | Redis Lua script to update circuit breaker state. |
| lambdas/https-client-lambda/src/services/payload-signer.ts | Adjusts payload signer function signature (parameter order). |
| lambdas/https-client-lambda/src/services/logger.ts | Re-exports shared logger for lambda local imports. |
| lambdas/https-client-lambda/src/services/endpoint-gate.ts | Redis admission + circuit-breaker integration (EVALSHA/EVAL Lua execution). |
| lambdas/https-client-lambda/src/services/dlq-sender.ts | DLQ send helper. |
| lambdas/https-client-lambda/src/services/delivery/tls-agent-factory.ts | Builds TLS agent with optional mTLS material and SPKI pinning. |
| lambdas/https-client-lambda/src/services/delivery/retry-policy.ts | Retry/backoff and Retry-After parsing helpers. |
| lambdas/https-client-lambda/src/services/delivery/https-client.ts | HTTPS delivery client + result classification. |
| lambdas/https-client-lambda/src/services/delivery-metrics.ts | Embedded metrics emission for delivery and circuit-breaker events. |
| lambdas/https-client-lambda/src/services/config-loader.ts | Loads client config/targets from S3 with TTL caching and schema validation. |
| lambdas/https-client-lambda/src/services/admit.lua | Redis Lua script for token-bucket rate limiting + CB admission. |
| lambdas/https-client-lambda/src/lua.d.ts | Type declaration for importing .lua as text. |
| lambdas/https-client-lambda/src/index.ts | Lambda entrypoint delegating to record processor. |
| lambdas/https-client-lambda/src/handler.ts | Main SQS batch handler: load config, sign, gate, deliver, retry/DLQ, metrics. |
| lambdas/https-client-lambda/src/tests/tls-agent-factory.test.ts | Unit tests for TLS agent factory behavior (S3/SecretsManager/pinning). |
| lambdas/https-client-lambda/src/tests/ssm-applications-map.test.ts | Unit tests for SSM applications map loading/caching/errors. |
| lambdas/https-client-lambda/src/tests/sqs-visibility.test.ts | Unit tests for SQS visibility changes. |
| lambdas/https-client-lambda/src/tests/retry-policy.test.ts | Unit tests for retry policy helpers. |
| lambdas/https-client-lambda/src/tests/payload-signer.test.ts | Unit tests for payload signing. |
| lambdas/https-client-lambda/src/tests/index.test.ts | Unit test for lambda entrypoint wiring. |
| lambdas/https-client-lambda/src/tests/https-client.test.ts | Unit tests for delivery HTTP behavior classification. |
| lambdas/https-client-lambda/src/tests/handler.test.ts | Unit tests for SQS record processing paths (DLQ/retry/gate/CB). |
| lambdas/https-client-lambda/src/tests/endpoint-gate.test.ts | Unit tests for Redis Lua invocation paths and client creation. |
| lambdas/https-client-lambda/src/tests/dlq-sender.test.ts | Unit tests for DLQ sender. |
| lambdas/https-client-lambda/src/tests/delivery-metrics.test.ts | Unit tests for embedded metrics behavior. |
| lambdas/https-client-lambda/src/tests/config-loader.test.ts | Unit tests for S3 config loading + TTL cache. |
| lambdas/https-client-lambda/package.json | New lambda workspace package definition. |
| lambdas/https-client-lambda/lua-transform.js | Jest transform to load .lua scripts as strings. |
| lambdas/https-client-lambda/jest.config.ts | Jest config enabling .lua transform. |
| lambdas/client-transform-filter-lambda/src/services/observability.ts | Removes callback signing observability (signing moved downstream). |
| lambdas/client-transform-filter-lambda/src/services/config-loader.ts | Switches to shared config-cache package import. |
| lambdas/client-transform-filter-lambda/src/services/config-loader-service.ts | Switches to shared config-cache package import. |
| lambdas/client-transform-filter-lambda/src/index.ts | Removes ApplicationsMapService wiring from handler creation. |
| lambdas/client-transform-filter-lambda/src/handler.ts | Removes per-target signatures from output; outputs deliverable payload + subscriptions only. |
| lambdas/client-transform-filter-lambda/src/tests/services/payload-signer.test.ts | Removes tests for payload signing in transform-filter lambda. |
| lambdas/client-transform-filter-lambda/src/tests/services/config-update.component.test.ts | Updates imports to new config-cache package. |
| lambdas/client-transform-filter-lambda/src/tests/services/config-loader.test.ts | Updates imports to new config-cache package. |
| lambdas/client-transform-filter-lambda/src/tests/services/config-cache.test.ts | Updates imports to new config-cache package. |
| lambdas/client-transform-filter-lambda/src/tests/index.test.ts | Updates tests to reflect removal of signatures and apps map dependency. |
| lambdas/client-transform-filter-lambda/src/tests/index.component.test.ts | Updates component tests to reflect new payload shape and removed SSM dependency. |
| lambdas/client-transform-filter-lambda/src/tests/helpers/client-subscription-fixtures.ts | Updates target fixtures to include required mtls/certPinning. |
| lambdas/client-transform-filter-lambda/package.json | Adds config-cache workspace dependency. |
| knip.ts | Updates Knip workspace config (new workspaces and integration entrypoints). |
| infrastructure/terraform/modules/client-destination/variables.tf | Removes legacy API Destination-based module. |
| infrastructure/terraform/modules/client-destination/module_target_dlq.tf | Removes legacy per-target DLQ module. |
| infrastructure/terraform/modules/client-destination/locals.tf | Removes legacy locals for old module. |
| infrastructure/terraform/modules/client-destination/iam_role_api_target_role.tf | Removes legacy IAM role/policy for API destinations. |
| infrastructure/terraform/modules/client-destination/cloudwatch_event_rule_main.tf | Removes legacy EventBridge rule/target to API destination setup. |
| infrastructure/terraform/modules/client-destination/cloudwatch_event_connection_main.tf | Removes legacy EventBridge Connection resources. |
| infrastructure/terraform/modules/client-destination/cloudwatch_event_api_destination_this.tf | Removes legacy API Destination resources. |
| infrastructure/terraform/modules/client-destination/README.md | Removes docs for legacy module. |
| infrastructure/terraform/modules/client-delivery/variables.tf | Adds new per-client delivery module variables. |
| infrastructure/terraform/modules/client-delivery/outputs.tf | Adds outputs for per-client queues and lambda details. |
| infrastructure/terraform/modules/client-delivery/module_sqs_per_client.tf | Provisions per-client delivery SQS queue. |
| infrastructure/terraform/modules/client-delivery/module_https_client_lambda.tf | Provisions per-client https-client lambda + SQS event source mapping. |
| infrastructure/terraform/modules/client-delivery/module_dlq_per_client.tf | Provisions per-client DLQ + DLQ depth alarm. |
| infrastructure/terraform/modules/client-delivery/locals.tf | Defines naming/tagging locals for per-client module. |
| infrastructure/terraform/modules/client-delivery/iam_role_sqs_target.tf | IAM policy for https-client lambda (SQS/S3/SSM/KMS and optional SecretsManager/S3 cert). |
| infrastructure/terraform/modules/client-delivery/cloudwatch_event_rule_per_subscription.tf | Defines EventBridge rules/targets routing to per-client SQS delivery queue. |
| infrastructure/terraform/modules/client-delivery/README.md | Adds docs for new per-client delivery module. |
| infrastructure/terraform/components/callbacks/variables.tf | Enables X-Ray by default and adds mTLS-related component variables. |
| infrastructure/terraform/components/callbacks/pipes_pipe_main.tf | Removes signatures from Pipe output template. |
| infrastructure/terraform/components/callbacks/module_mock_webhook_alb_mtls.tf | Adds internal ALB + ACM import + passthrough mTLS wiring for mock webhook. |
| infrastructure/terraform/components/callbacks/module_client_destination.tf | Removes legacy client_destination module usage. |
| infrastructure/terraform/components/callbacks/module_client_delivery.tf | Adds per-client delivery module instantiation. |
| infrastructure/terraform/components/callbacks/locals.tf | Reworks locals for per-client subscriptions/targets and mTLS mock endpoint selection. |
| infrastructure/terraform/components/callbacks/elasticache_delivery_state.tf | Adds ElastiCache Serverless Redis for delivery state + security groups/alarms. |
| infrastructure/terraform/components/callbacks/cloudwatch_metric_alarm_dlq_depth.tf | Removes legacy per-target DLQ alarm (replaced by per-client alarms). |
| infrastructure/terraform/components/callbacks/cloudwatch_eventbus_main.tf | Adds EventBridge archive for 7-day retention. |
| infrastructure/terraform/components/callbacks/README.md | Updates docs to reflect new modules, variables, and defaults. |
| eslint.config.mjs | Ignores lua-transform.js and expands rule scope to include src workspaces. |
| .gitleaksignore | Adds ignore for test private key fixture in tls-agent-factory tests. |
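
The file summary above mentions a token-bucket admission script (`admit.lua`). The PR runs that logic in Redis via Lua; purely as an illustration of the idea, a minimal in-memory sketch in TypeScript might look like the following. The names `TokenBucket`, `createBucket`, `tryAdmit`, `capacity`, and `refillPerSecond` are assumptions made for this sketch and are not taken from the PR.

```typescript
// Hypothetical in-memory token bucket. The PR implements admission in
// Redis/Lua; this only illustrates the general technique.
interface TokenBucket {
  capacity: number; // maximum burst size
  refillPerSecond: number; // steady-state admission rate
  tokens: number; // tokens currently available
  lastRefillMs: number; // timestamp of the last refill
}

function createBucket(
  capacity: number,
  refillPerSecond: number,
  nowMs: number,
): TokenBucket {
  // Start full so an idle target can absorb an initial burst.
  return { capacity, refillPerSecond, tokens: capacity, lastRefillMs: nowMs };
}

// Refill according to elapsed time, then take one token if one is available.
function tryAdmit(bucket: TokenBucket, nowMs: number): boolean {
  const elapsedSec = Math.max(0, nowMs - bucket.lastRefillMs) / 1000;
  bucket.tokens = Math.min(
    bucket.capacity,
    bucket.tokens + elapsedSec * bucket.refillPerSecond,
  );
  bucket.lastRefillMs = nowMs;
  if (bucket.tokens >= 1) {
    bucket.tokens -= 1;
    return true;
  }
  return false;
}
```

A shared store such as Redis is needed in practice because several Lambda invocations deliver to the same target concurrently; the in-memory version above would only limit a single process.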
Files not reviewed (1)
- pnpm-lock.yaml: Language not supported
aidenvaines-cgi left a comment
The TF stuff all looks cool from a rough check
mjewildnhs left a comment
As far as reviewing the terraform
mjewildnhs left a comment
Done transform filter lambda changes - in middle of http lambda

    async function processRecord(record: SQSRecord): Promise<void> {
      const { CLIENT_ID } = process.env;
      if (!CLIENT_ID) {
        throw new Error("CLIENT_ID is required");

If this is misconfigured then the behaviour will be undesirable - SQS will just keep retrying it 100 times.
However, it will never happen, so I don't think it's worth explicitly handling it and treating the events as permanent failures (by sending them to the DLQ).
There was a problem hiding this comment.
Re-opening, as I had a config issue in test and couldn't stop it!
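
One possible way to avoid the retry loop described above is to classify configuration errors as permanent, so the record is routed to the DLQ instead of being re-thrown and redelivered. The sketch below is only an illustration: `ConfigurationError`, `requireEnv`, `classifyFailure`, and the `FailureAction` values are hypothetical names, not code from this PR.

```typescript
// Hypothetical error type for misconfiguration such as a missing CLIENT_ID.
class ConfigurationError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "ConfigurationError";
  }
}

type FailureAction = "send_to_dlq" | "retry";

// Read a required variable, failing with a recognisable error type.
function requireEnv(
  env: Record<string, string | undefined>,
  key: string,
): string {
  const value = env[key];
  if (!value) {
    throw new ConfigurationError(`${key} is required`);
  }
  return value;
}

// Configuration errors cannot succeed on retry, so treat them as permanent.
// Everything else keeps the existing retry behaviour.
function classifyFailure(error: unknown): FailureAction {
  const isConfigError =
    error instanceof Error && error.name === "ConfigurationError";
  return isConfigError ? "send_to_dlq" : "retry";
}
```

A handler could call `requireEnv(process.env, "CLIENT_ID")` inside its existing try/catch and pass the caught error to `classifyFailure` to decide between the DLQ path and a normal SQS retry.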
mjewildnhs left a comment
Middle of http client lambda review - done with the handler, client, retry policy

        ELASTICACHE_IAM_USERNAME = var.elasticache_iam_username
      }

      vpc_config = var.lambda_security_group_id != "" ? {

Where is the security group defined?
Am I missing something?
We need to consider what the lambda can talk to and on what ports - the lambda supports a port being in the webhook URL.
There was a problem hiding this comment.
The security group is defined in the component (`components/callbacks/elasticache_delivery_state.tf`), not in this module. It's created there as `aws_security_group.https_client_lambda` alongside the ElastiCache security group, and its ID is passed into this module via `var.lambda_security_group_id` (line 58), which is then attached through the `vpc_config` block.
Fixed port mapping to allow any port since this is configurable per-client.
There was a problem hiding this comment.
Ahh, I wasn't looking in `elasticache_delivery_state` for lambda -> webhook permissions - but it makes sense why it's the way it is.
I've unresolved this for a @cgitim opinion on which ports we open up.
I think it's safe to allow any port because it ultimately comes from tightly restricted config.
It would be better if we could limit it to 443 - but I don't suppose we have any kind of requirements on it.
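
If the team did decide to restrict ports, one option that could sit alongside the security group rule is to check the webhook URL's port when the client config is validated. This is a sketch only: `isAllowedWebhookUrl`, `effectivePort`, and the default allow-list of `[443]` are assumptions, not part of the PR.

```typescript
// Hypothetical validation helper; only called for https URLs, so the
// default port is taken to be 443.
function effectivePort(url: URL): number {
  // URL.port is the empty string when the URL uses the scheme's default port.
  return url.port === "" ? 443 : Number(url.port);
}

function isAllowedWebhookUrl(
  raw: string,
  allowedPorts: readonly number[] = [443],
): boolean {
  let url: URL;
  try {
    url = new URL(raw);
  } catch {
    return false; // not a parseable absolute URL
  }
  return url.protocol === "https:" && allowedPorts.includes(effectivePort(url));
}
```

For example, `isAllowedWebhookUrl("https://example.com:8443/hook")` is rejected with the default list, while passing `[443, 8443]` as the second argument accepts it.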

    }

    async function loadCertMaterial(): Promise<CertMaterial> {
      const isProduction = Boolean(MTLS_CERT_SECRET_ARN);

I think we should revisit the decision to handle certs differently between environments.
It adds a lot of complexity, and unless we always test for it in non-prod there's a big danger we won't realize it's broken until we've released it.
I seem to remember it was because we were concerned about costs for ephemeral environments.
It's priced per secret per month (at $0.40) - doesn't that mean we'll only be billed for a fraction of that, based on how long the env was deployed?
https://aws.amazon.com/secrets-manager/pricing/
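
To put a rough number on the question above: if the charge is prorated by the hour (this should be confirmed against the linked pricing page), the storage cost for a short-lived environment can be estimated as below. The function name, the 730-hour month, and the omission of per-API-call charges are all assumptions of this sketch.

```typescript
// Hypothetical estimate of the per-secret storage charge for an environment
// that only exists for part of a month. API call charges are not included.
function proratedSecretCostUsd(
  hoursDeployed: number,
  monthlyPriceUsd = 0.4, // figure quoted in the comment above
  hoursInMonth = 730, // assumed average month length
): number {
  if (hoursDeployed < 0) {
    throw new RangeError("hoursDeployed must not be negative");
  }
  // Never charge more than one full month for a single month's usage.
  const billedHours = Math.min(hoursDeployed, hoursInMonth);
  return (monthlyPriceUsd * billedHours) / hoursInMonth;
}
```

Under these assumptions an environment that lives for a working day costs a fraction of a cent per secret, which supports using Secrets Manager in every environment if the proration holds.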
There was a problem hiding this comment.
At least `loadFromSecretsManager` is trivial, but the `loadFromS3` function is currently awful.
There's a tonne of env vars backing the S3 solution as well.
mjewildnhs left a comment
Reviewed the http-lambda (minus unit tests)
There was a problem hiding this comment.
All looks very promising - but where are the tests?!
Happy to leave them out till we get something reasonable looking deployed.
71c3f08 to 5f9b348
5f9b348 to 816b60f

    variable "sqs_max_receive_count" {
      type        = number
      description = "Maximum receive count before message moves to DLQ"
      default     = 100

Did we have a max retry attempts in the spec?
There was a problem hiding this comment.
Doesn't look like it, should we omit this?
There was a problem hiding this comment.
We need something set for when things go badly wrong - any unexpected errors (e.g. Redis connection failures) currently retry up to this limit.
I guess we just need the 100 to cover a reasonable number of retry attempts for transient errors.
I guess the circuit should break and prevent us burning through the attempts too quickly.
Copying in @cgitim in case he has an opinion - given his research paper on rate limiting 😉
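
To sanity-check whether 100 receives covers "a reasonable amount" of retries, it can help to count how many attempts fit in a retry window under exponential backoff. The sketch below assumes a doubling delay with a cap and no jitter; `attemptsWithinWindow` and all of the example values are hypothetical, apart from 7200 seconds, which appears elsewhere in this PR as a `maxRetryDurationSeconds` value.

```typescript
// Hypothetical helper: number of delivery attempts (including the first)
// that fit inside a retry window with capped exponential backoff.
function attemptsWithinWindow(
  windowSeconds: number,
  baseDelaySeconds: number,
  maxDelaySeconds: number,
): number {
  if (baseDelaySeconds <= 0) {
    throw new RangeError("baseDelaySeconds must be positive");
  }
  let elapsed = 0;
  let attempts = 1; // the first attempt happens at t = 0
  let delay = baseDelaySeconds;
  while (elapsed + delay <= windowSeconds) {
    elapsed += delay;
    attempts += 1;
    delay = Math.min(delay * 2, maxDelaySeconds); // double, up to the cap
  }
  return attempts;
}
```

With an assumed 5 second base delay and a 900 second cap, `attemptsWithinWindow(7200, 5, 900)` stays far below 100, so under those assumptions the receive limit would mainly act as a backstop for unexpected errors rather than for ordinary backoff.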
* ALB/webhook uses https for mTLS and non mTLS - remove http endpoint
* Log thrown errors in http-client-lambda
* Update int test debug script
* Update SQS to webhook int tests to use correct queues
* Update debug int script README
* Fix redis script error and better logging for redis errors
* Log status code of perm failures
* fix: load test CA for server trust when mtls is disabled
  In test environments, the mock webhook ALB uses a server certificate signed by a locally-generated test CA. Previously, the CA was only loaded into the TLS agent when mtls.enabled was true (needed for client certificate auth). Targets with mtls.enabled: false used Node's default trust store, which does not include the test CA, causing every delivery attempt to fail with SELF_SIGNED_CERT_IN_CHAIN.
  Fix by loading the CA whenever MTLS_TEST_CA_S3_KEY is set, regardless of mtls.enabled. The client key and cert are still only applied when mtls.enabled is true. MTLS_TEST_CA_S3_KEY is not set in production, so non-mTLS targets in production are unaffected.
* fix: set ERROR_CODE and ERROR_MESSAGE on DLQ messages for permanent delivery failures
  AWS API Destinations previously set these SQS message attributes automatically. After the migration to the https-client lambda, they were no longer being set.
  - Read the response body for 4xx responses (previously discarded with res.resume()) so the error message can be included in the DLQ message attributes
  - Set ERROR_CODE=HTTP_CLIENT_ERROR for 4xx webhook rejections, or the TLS error code (e.g. CERT_HAS_EXPIRED) for connection-level failures
  - Set ERROR_MESSAGE to the message field from the JSON response body, falling back to the raw body if not valid JSON
  - Extended sendToDlq to accept and forward MessageAttributes to SQS
* Fix redrive IT dlq names
* Fix metric IT dlq names
* Fix alarm test
* Bump test coverage
* Fix redis client IAM / connectivity issues + logging improvements
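
The ERROR_MESSAGE rule described in the commit message above (use the `message` field of a JSON body, otherwise the raw body) could look roughly like this. `extractErrorMessage` is a hypothetical name, and falling back to the raw body when the JSON is valid but has no string `message` is an assumption, since the commit message only covers the invalid-JSON case.

```typescript
// Hypothetical helper mirroring the rule in the commit message above.
function extractErrorMessage(rawBody: string): string {
  try {
    const parsed: unknown = JSON.parse(rawBody);
    if (typeof parsed === "object" && parsed !== null && "message" in parsed) {
      const message = (parsed as { message: unknown }).message;
      if (typeof message === "string") {
        return message;
      }
    }
  } catch {
    // Body is not valid JSON; fall through to the raw body.
  }
  // Assumption: also use the raw body when there is no string `message`.
  return rawBody;
}
```

Keeping this as a pure function makes it easy to unit test separately from the code that reads the 4xx response stream and builds the SQS message attributes.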
f7c6d27 to cb0b020
* CCM-16002 - Revised performance test implementation

    metricsInstance = createMetricsLogger();
    metricsInstance.setNamespace(namespace);
    metricsInstance.setDimensions({ Environment: environment });

Do we need `ClientId` as an additional dimension so we can filter metrics by client? I don't suppose we have anything defined in the spec or anything from any observability conversations? @cgitim
If we do add it then it will have an impact on cost, as it's $0.30 per metric: we currently have 10 metrics, so with 100 clients that's $300 vs $3.
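
The arithmetic in the comment above follows from each unique combination of metric name and dimension values being billed as a separate custom metric. A small sketch of that calculation, using the $0.30 figure quoted in the comment (actual pricing may be tiered or vary by region), with `monthlyMetricCostUsd` as a hypothetical name:

```typescript
// Hypothetical estimate: every metric is emitted once per distinct
// dimension value (for example once per ClientId).
function monthlyMetricCostUsd(
  metricCount: number,
  dimensionValueCount: number,
  pricePerMetricUsd = 0.3, // figure quoted in the comment above
): number {
  return metricCount * dimensionValueCount * pricePerMetricUsd;
}

const withoutClientId = monthlyMetricCostUsd(10, 1); // Environment only
const withClientId = monthlyMetricCostUsd(10, 100); // one series per client
console.log({ withoutClientId, withClientId });
```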

    it("accepts maxRetryDurationSeconds below 60", () => {
      const config = createValidConfig();
      config.targets[0].delivery = { maxRetryDurationSeconds: 10 };

      expect(parseClientSubscriptionConfiguration(config).success).toBe(true);
    });

I think we should remove this test as it's a bit redundant (arbitrarily testing 10 seconds).
There was a problem hiding this comment.
I think this was purely to not reduce coverage, since Sonar can be a bit strict about that

      metrics.putMetric("DeliveryPermanentFailure", 1, Unit.Count);
    }

    export function emitRateLimited(targetId: string): void {

The metrics DeliveryRateLimited and AdmissionDenied are a bit confusing.
DeliveryRateLimited is when the webhook responds with a 429 and AdmissionDenied could be us rate limiting or circuit breaking.
I think we should have something like:
- DeliveryRateLimited -> DeliveryServerRateLimited
- AdmissionDenied (reason "rate_limited") -> DeliveryRateLimited
- AdmissionDenied (reason "circuit_open") -> DeliveryCircuitBlocked
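
The renaming proposed above could be captured in a small mapping so that each cause has exactly one metric name. This is a sketch of the proposal only; the type names and helper functions are hypothetical, and the reason strings are the ones quoted in the comment.

```typescript
// Reasons for which our own gate refuses a delivery attempt.
type AdmissionDenialReason = "rate_limited" | "circuit_open";

type DeliveryMetricName =
  | "DeliveryServerRateLimited" // the webhook itself answered 429
  | "DeliveryRateLimited" // our rate limiter refused admission
  | "DeliveryCircuitBlocked"; // our circuit breaker is open

const ADMISSION_DENIAL_METRICS: Record<AdmissionDenialReason, DeliveryMetricName> = {
  rate_limited: "DeliveryRateLimited",
  circuit_open: "DeliveryCircuitBlocked",
};

function metricForAdmissionDenial(reason: AdmissionDenialReason): DeliveryMetricName {
  return ADMISSION_DENIAL_METRICS[reason];
}

// Only a 429 from the webhook maps to the server-side rate limit metric.
function metricForResponseStatus(status: number): DeliveryMetricName | undefined {
  return status === 429 ? "DeliveryServerRateLimited" : undefined;
}
```

Typing the map as a `Record` over the reason union means the compiler flags any new denial reason that has no metric name assigned.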
There was a problem hiding this comment.
When the IT and final perf test stuff has merged in we should get AI to assess the commonality in helpers between the 2 and put common code in test-support.

    "@nhs-notify-client-callbacks/models": "workspace:*",
    "@nhs-notify-client-callbacks/test-support": "workspace:*",
    "async-wait-until": "catalog:app",
    "esbuild": "catalog:tools"

Is this not a dev dependency?

    const startSec = Math.floor(testStartMs / 1000);
    const endSec = Math.floor(Date.now() / 1000);

    const snap = await queryMetricsSnapshot(

Can we refactor some of this to cut down on the duplication in query code within the loop and the final snapshot?

      }),
    );

    if (!queryId) return null;

Can we refactor away some of the duplication between lines 60-73 and 96-109 (the polling query result code)?
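
One way to remove duplicated polling code is to extract a generic helper that both call sites share. The sketch below is an assumption about what such a helper could look like; `pollUntil`, its options, and the convention that `null` means "not ready yet" are hypothetical and not taken from the perf test code.

```typescript
// Hypothetical shared helper: call fetchResult until it returns a non-null
// value or the attempt budget runs out.
async function pollUntil<T>(
  fetchResult: () => Promise<T | null>,
  options: { maxAttempts: number; intervalMs: number },
): Promise<T | null> {
  for (let attempt = 0; attempt < options.maxAttempts; attempt += 1) {
    const result = await fetchResult();
    if (result !== null) {
      return result;
    }
    // Wait between attempts, but not after the final one.
    if (attempt < options.maxAttempts - 1) {
      await new Promise<void>((resolve) => setTimeout(resolve, options.intervalMs));
    }
  }
  return null;
}
```

Each call site would then pass only the query-specific part, for example a function that fetches the query results and returns `null` while the query is still running.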
* Rename mock-client-1/2 and add mtls test
* Retry window and exhaustion test
* DLQ redrive delivery queue test
* Metric test coverage
* Rate limit tests
* Fix: reset metrics per invocation
* Fix new tests to use UUIDs as message ids rather than timestamps
* Add correlation id to logging
* Batch count logging
* Circuit breaker test
* Integration test tidy up refactor
* Log receive count and sqsMessageId
* Fix batch success count on DLQ
* Log receive count for all messages

        "enabled": true
    },
    "maxRetryDurationSeconds": 7200,
    "mtls": {

I think we should turn on mtls and pinning for all the perf clients as this will be the slowest config.