v0.7.16: security hardening, db o11y and profiling, settings UI #5245

Merged
waleedlatif1 merged 17 commits into main from staging
Jun 28, 2026

Conversation

@waleedlatif1 waleedlatif1 commented Jun 27, 2026

Collaborator

TheodoreSpeaks and others added 11 commits June 27, 2026 14:08
* perf(trigger): cap concurrency on background DB tasks

* test(trigger): update schedule concurrency assertion to 30
…cryption (#5236)

- Add dispatch-latency / trigger-age instrumentation: capture webhook receipt
  time + Slack x-slack-request-timestamp at the route and log structured
  dispatchLatencyMs + triggerAgeMs before execution, surfacing the pre-execution
  latency that per-block timings cannot see (Slack trigger_id expires at 3s).
- Guard the effective-env fetch in verifyProviderAuth: only fetch+decrypt when
  the handler verifies auth AND the providerConfig references env vars ({{VAR}}),
  avoiding a needless DB read/decrypt on the synchronous pre-ack path. The guard
  scope exactly matches resolveProviderConfigEnvVars, so resolution is identical.
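
The guard in the second bullet can be sketched roughly as follows. This is a minimal illustration, not the repository's code: `referencesEnvVars` and `shouldFetchEffectiveEnv` are hypothetical names, and the `{{VAR}}` pattern is assumed from the commit message.

```typescript
// Assumed reference syntax, taken from the commit message: {{VAR}}.
const ENV_REF = /\{\{[^{}]+\}\}/;

// Hypothetical helper: true if any string inside a (possibly nested)
// provider config contains a {{VAR}} style reference.
function referencesEnvVars(value: unknown): boolean {
  if (typeof value === "string") return ENV_REF.test(value);
  if (Array.isArray(value)) return value.some((item) => referencesEnvVars(item));
  if (value !== null && typeof value === "object") {
    return Object.values(value as Record<string, unknown>).some((item) =>
      referencesEnvVars(item),
    );
  }
  return false;
}

// Fetch and decrypt the effective env only when both conditions hold:
// the handler verifies auth AND the config actually references env vars.
function shouldFetchEffectiveEnv(
  handlerVerifiesAuth: boolean,
  providerConfig: unknown,
): boolean {
  return handlerVerifiesAuth && referencesEnvVars(providerConfig);
}
```

With a guard of this shape, a config made only of literal values never triggers the DB read on the pre-ack path.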
* fix(connectors): harden Zendesk connector against SSRF

Route the Zendesk connector through the SSRF-safe secureFetchWithRetry (DNS-resolve + IP-pin + per-redirect revalidation) instead of the plain fetchWithRetry, and validate the user-supplied subdomain against a strict DNS-label pattern before building the base URL. Matches the GitLab/Sentry/Obsidian/S3 precedent.
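
A strict DNS-label check of the kind described above might look like the sketch below. The regex, the function name `buildZendeskBaseUrl`, and the `/api/v2` suffix are assumptions for illustration; the connector's actual pattern is not shown in this PR.

```typescript
// One DNS label: 1 to 63 characters, letters/digits/hyphens,
// with no leading or trailing hyphen and no dots.
const DNS_LABEL = /^[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?$/i;

// Hypothetical helper: reject anything that is not a single label before
// it is interpolated into the host part of the URL.
function buildZendeskBaseUrl(subdomain: string): string {
  const trimmed = subdomain.trim();
  if (!DNS_LABEL.test(trimmed)) {
    throw new Error("Invalid Zendesk subdomain");
  }
  return `https://${trimmed.toLowerCase()}.zendesk.com/api/v2`;
}
```

Because dots, slashes, `#`, `@`, and `:` all fail the pattern, inputs such as `evil.example.com/#` or an IP literal cannot redirect the request to another host.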

* fix(connectors): retry transient DNS failures in secureFetchWithRetry

secureFetchWithValidation throws a validation error before the request when a hostname temporarily fails to resolve. Classify that transient DNS failure as retryable so secureFetchWithRetry mirrors the old fetchWithRetry network-retry behavior, while keeping the deterministic blocked-IP SSRF rejection non-retryable.
* perf(db): drive Postgres pool size + application_name from per-role profiles

Replace ad-hoc DB_APP_NAME sizing with a per-role profile map keyed by
SIM_DB_ROLE (web/trigger/realtime), defaulting to web. Trigger machines
open a small pool instead of 15 to avoid PgBouncer connection exhaustion.
Also size realtime's separate socketDb pool down to 10.

* fix(db): throw on invalid SIM_DB_ROLE instead of silently using web pools

* fix(db): use Object.hasOwn for SIM_DB_ROLE validation to avoid prototype keys
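
The three commits above combine into roughly the following shape. This is a sketch under assumptions: the pool sizes and application names are placeholders, and `resolvePoolProfile` is a hypothetical name. Only the role names, the default to web, and the throw on invalid values come from the commit messages.

```typescript
type DbRole = "web" | "trigger" | "realtime";

interface PoolProfile {
  max: number;
  applicationName: string;
}

// Placeholder numbers for illustration only; not the repository's values.
const POOL_PROFILES: Record<DbRole, PoolProfile> = {
  web: { max: 15, applicationName: "sim-web" },
  trigger: { max: 5, applicationName: "sim-trigger" },
  realtime: { max: 10, applicationName: "sim-realtime" },
};

function resolvePoolProfile(rawRole: string | undefined): PoolProfile {
  const role = rawRole ?? "web"; // an unset role defaults to web
  // Own-property check, so inherited keys such as "toString" or
  // "constructor" are rejected. The commit uses Object.hasOwn (ES2022);
  // this call is the older equivalent and compiles on any target.
  if (!Object.prototype.hasOwnProperty.call(POOL_PROFILES, role)) {
    throw new Error(`Invalid SIM_DB_ROLE: ${role}`);
  }
  return POOL_PROFILES[role as DbRole];
}
```

A plain `role in POOL_PROFILES` test would accept `toString`, which is the prototype-key problem the last commit addresses.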
…tion DoS (#5240)

Knowledge-base ingestion downloaded an attacker-controlled external fileUrl
with no byte cap: downloadFileFromUrl defaults maxBytes to MAX_SAFE_INTEGER,
so the streaming reader buffered the entire response into memory uncapped.
An authenticated user could OOM the processing worker by pointing fileUrl at
a server that streams an unbounded body.

Wire the documented 100MB file-size limit (MAX_FILE_SIZE) into the ingestion
download helper. The existing stream limiter aborts the read once the cap is
exceeded and rejects up front on an oversized Content-Length, so the body is
never fully buffered.
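
The two checks described above (reject on an oversized declared length, abort once the running total passes the cap) can be sketched like this. It is a simplified synchronous illustration with a hypothetical `readWithByteCap` name; the real helper reads an HTTP response stream asynchronously.

```typescript
// Reads chunks until done, refusing to buffer more than maxBytes.
function readWithByteCap(
  chunks: Iterable<Uint8Array>,
  maxBytes: number,
  declaredLength?: number,
): Uint8Array {
  // Up-front rejection when the server declares an oversized body.
  if (declaredLength !== undefined && declaredLength > maxBytes) {
    throw new Error("Declared Content-Length exceeds the size limit");
  }
  const parts: Uint8Array[] = [];
  let total = 0;
  for (const chunk of chunks) {
    total += chunk.byteLength;
    // Abort before keeping the chunk, so memory stays bounded by maxBytes.
    if (total > maxBytes) {
      throw new Error("Response body exceeds the size limit");
    }
    parts.push(chunk);
  }
  const out = new Uint8Array(total);
  let offset = 0;
  for (const part of parts) {
    out.set(part, offset);
    offset += part.byteLength;
  }
  return out;
}
```

The second check matters because a hostile server can omit or understate Content-Length and keep streaming; without it the loop would buffer until the worker runs out of memory.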
…ory exhaustion (#5239)

* fix(file-parsers): guard OOXML parsers against decompression-bomb memory exhaustion

Pre-inspect the ZIP central directory of xlsx/docx/pptx buffers and reject
archives whose declared expanded size (>1 GiB) or compression ratio (>150x)
exceeds safe bounds, before SheetJS/mammoth/officeparser inflate them. The
existing pipeline only capped the compressed input (100 MB), which does not
bound decompressed size, so a crafted zip bomb could expand to many GB and OOM
the worker.

* fix(file-parsers): fail closed on unverifiable ZIP-shaped OOXML archives

Address review: the guard previously no-opped (fell through to the
decompression library) whenever the central directory could not be parsed,
and findEocdOffset accepted the first backward EOCD signature without checking
it sat at the buffer tail. A crafted archive with a decoy EOCD or an
unsupported directory layout could bypass the size limits.

- findEocdOffset now requires the EOCD comment length to place the record
  exactly at the buffer tail, defeating decoy signatures planted in the trailing
  region.
- assertOoxmlArchiveWithinLimits now fails closed: a ZIP-shaped buffer (local
  file header / EOCD magic) whose central directory cannot be parsed is rejected
  rather than passed through. Genuine non-ZIP inputs (legacy OLE .xls/.doc,
  plaintext) still no-op and defer to the downstream parser.
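
For orientation, a central-directory pre-check along these lines could look like the sketch below. It assumes the standard ZIP record layouts, uses a hypothetical `assertArchiveWithinLimits` name, and deliberately omits ZIP64 and the non-ZIP pass-through described above, so it is not the repository's `zip-guard.ts`.

```typescript
const EOCD_SIG = 0x06054b50; // end-of-central-directory record signature
const CDFH_SIG = 0x02014b50; // central-directory file header signature
const EOCD_MIN = 22; // fixed EOCD size, excluding the trailing comment
const CDFH_MIN = 46; // fixed header size, excluding name/extra/comment
const MAX_EXPANDED_BYTES = 1024 * 1024 * 1024; // 1 GiB
const MAX_RATIO = 150;

// Scan backward for an EOCD whose comment length places the record exactly
// at the buffer tail; a decoy signature elsewhere fails that check.
function findEocdOffset(view: DataView): number {
  const lowest = Math.max(0, view.byteLength - EOCD_MIN - 0xffff);
  for (let i = view.byteLength - EOCD_MIN; i >= lowest; i--) {
    if (view.getUint32(i, true) !== EOCD_SIG) continue;
    const commentLength = view.getUint16(i + 20, true);
    if (i + EOCD_MIN + commentLength === view.byteLength) return i;
  }
  return -1;
}

// Fail closed: anything that cannot be parsed is rejected, not passed on.
function assertArchiveWithinLimits(buf: Uint8Array): void {
  const view = new DataView(buf.buffer, buf.byteOffset, buf.byteLength);
  const eocd = findEocdOffset(view);
  if (eocd < 0) throw new Error("ZIP central directory not found");

  const entryCount = view.getUint16(eocd + 10, true);
  let pos = view.getUint32(eocd + 16, true); // central directory offset
  let compressed = 0;
  let expanded = 0;
  for (let n = 0; n < entryCount; n++) {
    if (pos + CDFH_MIN > eocd || view.getUint32(pos, true) !== CDFH_SIG) {
      throw new Error("Unparseable ZIP central directory");
    }
    compressed += view.getUint32(pos + 20, true);
    expanded += view.getUint32(pos + 24, true);
    // Skip the variable-length file name, extra field, and comment.
    pos +=
      CDFH_MIN +
      view.getUint16(pos + 28, true) +
      view.getUint16(pos + 30, true) +
      view.getUint16(pos + 32, true);
  }
  if (expanded > MAX_EXPANDED_BYTES) {
    throw new Error("Archive expands beyond the size limit");
  }
  if (expanded > compressed * MAX_RATIO) {
    throw new Error("Archive compression ratio is too high");
  }
}
```

The declared sizes come from the archive itself, so this check bounds what a conforming inflater will attempt; it complements the compressed-input cap rather than replacing it.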
…nent (#5235)

* chore(data-drains): remove settings callout and unused InfoNote component

* improvement(data-retention): convert policy editor from modal to full-surface page

Mirror the access-control group-detail pattern: clicking a retention policy
now drills into a full-surface PolicyDetail page (back chip + dirty-gated
header Save/Discard + Remove) instead of an xl modal. Form fields use
SettingsSection (Workspaces / Retention / PII redaction) and the page chrome
matches group-detail exactly. The unsaved-changes confirm stays a small modal.

* improvement(settings): shared save/discard + unsaved-changes guard

Consolidate every editable settings surface onto one stack instead of
per-page custom logic:

- SaveDiscardActions: the canonical dirty-gated Discard+Save chip pair
- useSettingsUnsavedGuard: syncs local dirty into useSettingsDirtyStore (so the
  sidebar section-switch confirm applies) + provides guardBack/UnsavedChangesModal
  for detail sub-views' back chip
- useSettingsBeforeUnload: a single beforeunload in the settings shell

Migrate whitelabeling, sso, access-control group-detail, data-retention, and
secrets-manager (drops its duplicate beforeunload). Deletes the hand-rolled
'Unsaved changes' modals; the leave-confirm standardizes on Keep editing /
Discard. Documents the pattern in the settings rule + add-settings-page skill.

* fix(sso): reset originalFormData on discard/save so dirty state clears

handleDiscard and the post-save cleanup reset formData to DEFAULT but left
originalFormData on the edit snapshot, so hasChanges stayed true after leaving
the form — leaking a stuck-dirty state into the shared settings guard. Reset
originalFormData alongside formData in both leave paths (handleEdit re-seeds
both on re-entry).

* fix(settings): auto-dismiss unsaved-changes modal when page goes clean

useSettingsUnsavedGuard stashed a deferred leave + opened UnsavedChangesModal
when back was pressed while dirty, but never cleared them if isDirty later
became false. Confirming Discard could then run a stale leave with no unsaved
edits. Clear the pending leave and close the modal in the dirty-sync effect
whenever isDirty is false.
…5241)

* fix(copilot): gate post-tool output writes behind write permission

The Copilot/Mothership executor runs three post-tool output-redirection
sinks (maybeWriteOutputToFile, maybeWriteOutputToTable,
maybeWriteReadCsvToTable) that persist a tool's result into the
workspace. They were gated only on identity (workspaceId + userId), not
on permission. Because function_execute/user_table/read are read-allowed
for execution (absent from WRITE_ACTIONS in tools/server/router.ts), a
read-only collaborator could drive the agent to durably create/overwrite
workspace files and insert/overwrite table rows via output declarations —
a function-level authorization bypass (CWE-862) that the dedicated write
tools correctly reject.

Add a shared denyOutputWriteWithoutWritePermission guard built on the
canonical permissionSatisfies predicate and apply it to all three sinks,
once a write is actually intended, so read-only principals get the same
Permission denied outcome as the dedicated mutation tools.

* fix(copilot): move file output write-permission gate after no-op skip branches

Address Cursor review: in maybeWriteOutputToFile the gate ran before the
sandbox-export skip branch (which returns the result unchanged without
writing), so a read-only caller with a sandbox files payload was denied
even though no workspace write would occur. Move the check to immediately
before writeWorkspaceFileByPath so it only fires when a write is actually
performed.
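
Taken together, the two commits place the permission check after the no-op branches and immediately before the write. The sketch below illustrates that ordering. The function names are borrowed from the commit messages, but their signatures, the `Permission` levels, and the `ToolResult` shape are assumptions.

```typescript
type Permission = "read" | "write" | "admin";

const RANK: Record<Permission, number> = { read: 0, write: 1, admin: 2 };

// Stand-in for the canonical predicate; its real signature is not shown here.
function permissionSatisfies(
  actual: Permission | null,
  required: Permission,
): boolean {
  return actual !== null && RANK[actual] >= RANK[required];
}

interface ToolResult {
  success: boolean;
  error?: string;
  output?: unknown;
}

// Returns a denial result for read-only principals, or null to proceed.
function denyOutputWriteWithoutWritePermission(
  actual: Permission | null,
): ToolResult | null {
  return permissionSatisfies(actual, "write")
    ? null
    : { success: false, error: "Permission denied" };
}

function maybeWriteOutput(
  result: ToolResult,
  outputPath: string | undefined,
  permission: Permission | null,
  write: (path: string, data: unknown) => void,
): ToolResult {
  // No-op branch first: nothing would be written, so no gate applies.
  if (!outputPath) return result;
  // Gate only once a write is actually about to happen.
  const denied = denyOutputWriteWithoutWritePermission(permission);
  if (denied) return denied;
  write(outputPath, result.output);
  return result;
}
```

Ordering the gate this way keeps read-only callers working on paths that never touch the workspace while still blocking every durable write.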
…rop token (#5243)

GET /api/credential-sets/[id]/invite listed every invitation row — including
the bearer token — to any org member, matching neither its sibling methods
(POST/DELETE enforce admin/owner) nor the self-scoped /invitations endpoint.
A non-privileged member could harvest a null-email invite token and self-join
the credential set via POST /api/credential-sets/invite/[token].

- Add the admin/owner role gate to GET, matching POST/DELETE on the same route
- Project explicit columns (drop token) so the secret is never returned to the
  management list; the creating admin still receives it via the create response
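
Both fixes can be illustrated in a few lines. This is a sketch with assumed names and row shape (`listInvitations`, `InvitationRow`); the actual route uses a database query with an explicit column projection.

```typescript
type OrgRole = "owner" | "admin" | "member";

// Assumed row shape; only the presence of a secret token is from the PR.
interface InvitationRow {
  id: string;
  email: string | null;
  status: string;
  token: string;
}

type InvitationListItem = Omit<InvitationRow, "token">;

function listInvitations(
  role: OrgRole,
  rows: InvitationRow[],
): InvitationListItem[] {
  // Same admin/owner gate as the POST and DELETE handlers on the route.
  if (role !== "owner" && role !== "admin") {
    throw new Error("Forbidden");
  }
  // Name each returned column explicitly, so a secret column added later
  // is not exposed by default.
  return rows.map(({ id, email, status }) => ({ id, email, status }));
}
```

Listing columns explicitly is the safer default here: a spread followed by a delete would start leaking as soon as another sensitive column is added to the table.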
… skip redundant actor lookup (#5242)

* improvement(execution): stop rewriting execution snapshots on reuse + skip redundant actor lookup

- SnapshotService.createSnapshotWithDeduplication: switch the per-execution dedup
  write from onConflictDoUpdate(set state_data) to onConflictDoNothing + a
  conditional select. A (workflowId, stateHash) row is byte-identical by hash, so
  rewriting the full state jsonb every run only churned a dead tuple + TOAST/WAL
  under Postgres MVCC. The reuse path (the common case) now performs no write.
- preprocessExecution: add an optional resolvedActorUserId so a caller that already
  resolved the billing actor upstream can skip the redundant workspace billed-account
  lookup. The ban/usage/rate/archived gates still run against the actor — only the
  resolution is reused, never a gate. The webhook background job passes the
  route-resolved payload.userId.

* fix(webhooks): scope actor reuse to inline execution only

Addresses review: a queued/Trigger.dev webhook can outlive a workspace
billed-account change, so reusing the route-resolved actor there could gate
against a stale account. Set resolvedActorUserId only on the in-process inline
payload (sub-second after resolution); queued and persisted payloads omit it,
so the background pass re-resolves the current billed account. Gates unchanged.

* docs(webhooks): convert inline comments on actor-reuse to TSDoc

* fix(logs): keep snapshot dedup a single atomic upsert (no select race)

Addresses review: the DO NOTHING + follow-up select could fail if cleanup
deletes the conflicting (orphaned, aged) snapshot between the no-op insert and
the select. Revert to one atomic upsert but SET only state_hash, so RETURNING
always yields the row (no race) while the unchanged TOASTed state_data jsonb is
still not rewritten under MVCC — keeping the per-execution write tiny.
@vercel

vercel Bot commented Jun 27, 2026


The latest updates on your projects. Learn more about Vercel for GitHub.

1 Skipped Deployment
Project Deployment Actions Updated (UTC)
docs Skipped Skipped Jun 28, 2026 1:11am


@cursor

cursor Bot commented Jun 27, 2026


PR Summary

Medium Risk
Touches webhook execution, billing actor resolution, DB pool sizing, and auth on credential APIs plus copilot write paths—meaningful for production reliability and security, but changes are targeted with tests on the highest-risk pieces (Zendesk, copilot permissions). Block registry split is mostly structural with accessor API preserved via registry.ts imports.

Overview
Security and auth: The Zendesk connector now validates subdomains and uses secure fetch instead of plain retry fetch, with tests for SSRF payloads. Credential-set invitation listing requires admin/owner and returns invitations without the invite bearer token. Copilot output-to-file/table redirection is blocked for read-only workspace members via a shared permission helper.

Execution and infra: Trigger.dev tasks for workflow, webhook, and resume runs get env-configurable queue concurrency caps; default schedule concurrency drops 50 → 30 and the realtime Postgres pool 15 → 10, with SIM_DB_ROLE on web/trigger/realtime. Webhooks record receipt time and Slack x-slack-request-timestamp for dispatch/trigger-age logging; background webhook jobs can reuse a route-resolved billing actor only for inline runs.

Blocks/registry: Block data moves to registry-maps.ts (BLOCK_REGISTRY / BLOCK_META_REGISTRY); registry.ts keeps accessors only, plus an optional minimal dev alias (SIM_DEV_MINIMAL_REGISTRY) and updated integration docs.

Settings UX: Shared SaveDiscardActions, useSettingsUnsavedGuard, and a single beforeunload in the settings shell replace per-page save/unsaved patterns on SSO, whitelabel, access-control group detail, and data retention (policies move from modals to detail sub-views). Data drains loses an info callout; unused InfoNote is removed.

Reviewed by Cursor Bugbot for commit 2686f04. Bugbot is set up for automated code reviews on this repo. Configure here.

@greptile-apps

greptile-apps Bot commented Jun 27, 2026

Contributor

Greptile Summary

This PR hardens several security-sensitive paths and updates settings and execution infrastructure. The main changes are:

  • Added SSRF and download-size protections for connectors, MCP servers, and knowledge documents.
  • Added OOXML ZIP archive checks before document parsing.
  • Added copilot output-write permission checks.
  • Added DB pool profiles and background task concurrency caps.
  • Moved settings pages onto shared save, discard, and unsaved-change guards.
  • Tightened credential-set invitation listing and snapshot reuse behavior.

Confidence Score: 4/5

The whitelabeling discard path can keep and later save a discarded uploaded image.

  • Most backend and security hardening paths look consistent with their changed contracts.
  • The shared settings migration leaves one concrete stale-state path in whitelabeling.
  • Fixing the upload preview reset should make the changed UI behavior match the new discard action.

apps/sim/ee/whitelabeling/components/whitelabeling-settings.tsx

Important Files Changed

| Filename | Overview |
| --- | --- |
| apps/sim/ee/whitelabeling/components/whitelabeling-settings.tsx | Adds shared save/discard actions, but discard does not reset uploaded logo preview state. |
| apps/sim/lib/file-parsers/zip-guard.ts | Adds central-directory checks to reject oversized or highly compressed OOXML archives. |
| apps/sim/connectors/zendesk/zendesk.ts | Validates Zendesk subdomains and switches connector requests to the secure fetch path. |
| apps/sim/lib/mcp/domain-check.ts | Pins public IP-literal MCP URLs so redirect handling stays inside SSRF controls. |
| apps/sim/lib/copilot/request/tools/files.ts | Blocks workspace file output writes for callers without write access. |
| apps/sim/lib/copilot/request/tools/tables.ts | Blocks table output writes for callers without write access. |
| packages/db/db.ts | Adds per-role Postgres pool profiles selected by SIM_DB_ROLE. |

Reviews (1): Last reviewed commit: "improvement(execution): stop rewriting e..."

Comment thread apps/sim/ee/whitelabeling/components/whitelabeling-settings.tsx
#5223)

* perf(dev): SIM_DEV_MINIMAL_REGISTRY mode to slash local dev-server RAM

Adds a dev-only escape hatch (`bun run dev:minimal`, or `dev:full:minimal` with
the realtime server): when SIM_DEV_MINIMAL_REGISTRY=1, a Turbopack/webpack
resolve-alias swaps the two heavy registries for tiny curated variants —
`@/tools/registry` → 2 tools, `@/blocks/registry-maps` → ~20 core blocks. The
shared workspace layout drags the full ~247-tool registry (~2,074 modules) into
every route via providers/utils → tools/params, and the editor/executor pull all
~268 block configs; aliasing both stops Turbopack from compiling those graphs at
all.

To make the blocks alias clean, the heavy block import maps move out of
registry.ts into registry-maps.ts (registry.ts keeps only its accessors,
importing the maps); its public API is unchanged and full builds/tests use the
full maps. The alias is gated on isDev + the flag and is never applied in
production.

Measured (Turbopack dev, authenticated, /logs): peak next-server RSS ~16 GB →
~4.7 GB, compile 4.9 min → ~18 s; the workflow editor route similarly drops to
~5 GB / ~17 s. Only http_request + function_execute and the curated core blocks
work in minimal mode; unset the flag for the full set.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* perf(dev): add table block to minimal registry; doc block registration in registry-maps

Adds the Table block to the SIM_DEV_MINIMAL_REGISTRY curated set so the
tables surface works under dev:minimal. Updates the integration skills/rules
and CLAUDE.md to point block registration at blocks/registry-maps.ts (the
BLOCK_REGISTRY / BLOCK_META_REGISTRY maps), reflecting that registry.ts now
holds only the accessor functions.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore(dev): rename dev:full:minimal → dev:full:minimal-registry

Matches the dev:full:* formatting and makes the suffix self-explanatory —
it is the registry that's minimal, not the dev stack.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@TheodoreSpeaks TheodoreSpeaks requested a review from a team as a code owner June 27, 2026 20:41

@cursor cursor Bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.


Reviewed by Cursor Bugbot for commit 2686f04. Configure here.

Comment thread apps/sim/ee/sso/components/sso-settings.tsx Outdated
… tool versions (#5246)

* improvement(clickhouse): expand block templates and skills, normalize tool versions

- Expand ClickHouseBlockMeta templates from 3 to 9 (schema docs, table maintenance, partition retention, long-running-query alerts, table provisioning, storage growth report)
- Add document-schema and maintain-tables skills (now 5, all grounded in tools.access)
- Normalize tool version '1.0' to '1.0.0' across all 26 tools for repo consistency

* improvement(clickhouse): enforce explicit safeguards in destructive-op guidance

Address Greptile P1 review: tighten the partition-retention and kill-query
templates/skill so agent guidance requires an explicit retention cutoff and
elapsed-time threshold, lists/verifies targets first, defaults to alert-only,
and never drops a partition or kills a query without confirmation.

* improvement(clickhouse): make long-running-query template alert-only

Address Greptile P1: the kill-query path is a low-level primitive (like
drop_table/delete) and shouldn't carry tool-specific kill policy. Remove the
autonomous-kill suggestion from the scheduled template so no shipped template
steers an agent toward killing a query unattended; alert a human instead. The
kill tool stays available for explicit manual use.
After clicking Edit on an existing SSO provider, the only header action was
SaveDiscardActions, which renders nothing when the form is clean — so there was
no way to leave edit mode back to the read-only summary without first changing a
field or navigating away. Render a Cancel chip when isEditing && !hasChanges
(the dirty-gated Discard already exits when there are changes).
…ite amplification (#5248)

* improvement(logs): move per-block progress markers to Redis to cut write amplification

Per-block lastStartedBlock/lastCompletedBlock markers were persisted via a
jsonb_set UPDATE on workflow_execution_logs on every block start and complete
(~2N UPDATEs per run) — the heaviest write query in the DB. These are live
progress breadcrumbs with no DB-polling consumer (live progress comes from the
executor over WebSocket); their only durable value is a breadcrumb folded into
the final record.

Behind the redis-progress-markers flag, markers now live in Redis during the run
and are folded into the single terminal UPDATE at completion, dropping per-run
row UPDATEs from ~2N+1 to 1.

- New progress-markers module: HASH execution:progress:{id}, atomic Lua
  monotonic-guard writes preserving the existing <= ordering, reservation-aligned
  TTL backstop, graceful no-op when Redis is unavailable
- Deterministic GC: cleared at every terminal/pause boundary; TTL covers crashes
- Flag resolved once per logging session so a run never mixes write paths
- Fold markers into the completion record (Redis wins, falls back to row markers)
- Merge live markers for in-flight detail reads
- Extract shared getExecutionReservationTtlMs so marker and admission-slot TTLs
  share one source of truth

* fix(logs): SQL fallback when Redis marker write fails, fold markers on force-fail, validate marker shape

Addresses review feedback on the redis-progress-markers PR:
- persistLast* now falls back to the jsonb_set UPDATE when Redis is unavailable or the write fails (setLast* returns whether it persisted), so a marker is never dropped when the flag is on without a healthy Redis.
- markExecutionAsFailed folds live Redis markers into execution_data before clearing, so the last-started/last-completed breadcrumb survives the force-fail path.
- getProgressMarkers validates marker shape (rebuilds from typed fields), so a stale or wrong-shaped Redis value can never reach API consumers.

* chore(logs): convert inline marker comments to TSDoc

* fix(logs): preserve markers when the completion read fails

getProgressMarkers now returns null on a Redis read error (vs {} for genuinely empty). completeWorkflowExecution and markExecutionAsFailed skip clearProgressMarkers when the read returns null, so a transient read error at completion no longer wipes markers that are still durably in Redis — the TTL reclaims them instead.

* fix(logs): resolve marker store split-brain by latest-timestamp-wins + drain on force-fail

- When a Redis marker write falls back to SQL, Redis and the row can each hold a marker for a different block; reads/folds previously preferred Redis unconditionally and could pick a stale value. Now the completion fold, the in-flight detail read, and the force-fail fold all pick the marker with the later timestamp (pickLatestStartedMarker/pickLatestCompletedMarker; markExecutionAsFailed uses a monotonic SQL guard).
- markAsFailed now drains pending per-block marker writes (not just the completion promise) before folding, so a force-fail racing onBlockStart/onBlockComplete still captures the latest breadcrumb.
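
The latest-timestamp-wins rule amounts to a small pure function. The sketch below uses a hypothetical `pickLatestMarker` and an assumed marker shape with an ISO-8601 timestamp; the actual helpers are pickLatestStartedMarker and pickLatestCompletedMarker, whose signatures are not shown in this PR.

```typescript
// Assumed marker shape: a block id plus an ISO-8601 timestamp.
interface ProgressMarker {
  blockId: string;
  at: string;
}

// Picks whichever marker is newer. On a tie, or when a timestamp cannot be
// parsed, the first argument is kept (here: the Redis copy).
function pickLatestMarker(
  fromRedis: ProgressMarker | undefined,
  fromRow: ProgressMarker | undefined,
): ProgressMarker | undefined {
  if (!fromRedis) return fromRow;
  if (!fromRow) return fromRedis;
  return Date.parse(fromRow.at) > Date.parse(fromRedis.at) ? fromRow : fromRedis;
}
```

For example, if an early marker landed in Redis and a later write fell back to SQL, the row's marker has the later timestamp and is the one folded into the final record.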

* fix(logs): harden Lua marker guard against non-table decoded values

Guard the monotonic-check index with type(decoded) == 'table' so a corrupted Redis field that decodes to a non-table (e.g. a number) can't error the eval; our write path only ever stores JSON objects, so this is defense-in-depth.

* perf(logs): skip completion Redis read/clear when markers went to SQL

completeWorkflowExecution now takes readProgressMarkers (the session's resolved marker mode); when the flag is off it skips the per-completion HGETALL+DEL entirely instead of probing a key that was never written. Sticky to the session so it stays flip-safe (an execution that wrote to Redis always folds+clears Redis). Non-session callers default to true (safe read-and-fold). Also hardened the Lua guard with type(decoded)=='table'.
updateWebhookProviderConfig built a DB-side merge with jsonb operators
(COALESCE(provider_config, '{}'::jsonb) || $1::jsonb), but the
provider_config column is json, not jsonb. Postgres cannot apply jsonb
merge operators to a json column, so every polling state write failed
with "could not convert type jsonb to json" — silently breaking
historyId/lastCheckedTimestamp/pageToken/lastSeenGuids persistence for
all polling webhooks (Gmail, RSS, Google Sheets/Drive, Outlook, IMAP)
since the atomic-merge change landed.

Cast the column to jsonb for the || / - merge and cast the result back
to json for storage, matching the existing pattern in subscription.ts.
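
The cast pattern can be sketched as a parameterized statement. The table name `webhook` and the `id` filter are assumptions, and `mergeProviderConfig` is a hypothetical helper that only mirrors the shallow, right-side-wins semantics of the jsonb `||` operator for illustration.

```typescript
// Cast the json column to jsonb for the merge, then cast the merged value
// back to json so it matches the column type.
const MERGE_PROVIDER_CONFIG_SQL = `
  UPDATE webhook
  SET provider_config = (
    COALESCE(provider_config::jsonb, '{}'::jsonb) || $1::jsonb
  )::json
  WHERE id = $2
`;

type JsonObject = Record<string, unknown>;

// In-memory equivalent of the object merge above: top-level keys from the
// patch overwrite existing keys, and a NULL column is treated as {}.
function mergeProviderConfig(
  existing: JsonObject | null,
  patch: JsonObject,
): JsonObject {
  return { ...(existing ?? {}), ...patch };
}
```

The merge is shallow, so a polling provider that stores nested state replaces the whole nested value for any key present in the patch.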
…oy (#5250)

When a deploy activates a new version, superseded versions' webhooks are
removed by a separate, best-effort CLEANUP_INACTIVE outbox event. When that
event is lost/dead-letters, old-version webhooks linger as is_active orphans
that fetchActiveWebhooks skips (version mismatch), so they silently stop
polling (~515 webhooks across ~130 workflows in prod).

Run the existing cleanupInactiveDeploymentVersions synchronously in the
SYNC_ACTIVE handler, right after the active version's webhooks/schedules are
registered, falling back to the deferred outbox event only if the inline pass
throws. This reuses the existing guarded cleanup, which re-checks each version
is still inactive before tearing anything down (so it never touches the active
version) and runs strictly after registration (so a teardown failure can't
block it).
@waleedlatif1 waleedlatif1 merged commit 38c088a into main Jun 28, 2026
35 checks passed