
fix(swarm): production deploy bugs (12 fixes across auth, docker, terminal) #20

Open
MarcelocardosoLeal wants to merge 7 commits into EvolutionAPI:main from MarcelocardosoLeal:fix/swarm-deploy-bugs


@MarcelocardosoLeal

Summary

Fixes 12 bugs found during a production deploy on Docker Swarm + Portainer + Traefik, grouped into 4 categories:

Authentication persistence (3 bugs)

  • Add evonexus_claude_auth:/root/.claude volume to all Swarm services so OAuth tokens survive redeploys
  • Restore /root/.claude.json from /root/.claude/backups/ on container start (the file is a sibling of /root/.claude/, so it lived in the container's writable layer and was wiped on every deploy)
  • Apply same volume fix to official evonexus.stack.yml template

Docker image assets (2 bugs)

  • Copy .claude/ and docs/ into dashboard image (backend reads them for /api/agents, /api/skills, /api/commands, /api/templates). Without this, clean deploys showed "No agents found"
  • Exclude dashboard/data/ and workspace/ from build context (SQLite DB with hashed passwords was being baked into the image)

Terminal server / sessions (4 bugs)

  • Allow ANTHROPIC_API_KEY env var in claude-bridge.js (was silently filtered, forcing OAuth fallback every session)
  • Clean up orphaned inactive sessions before restart (process died without firing onExit, causing "Session already exists")
  • Make startSession idempotent on WebSocket reconnect via Traefik (returns existing entry instead of throwing)
  • Send claude_started instead of error on duplicate active session (was showing as failure even when working)

Config / first-run UX (3 bugs)

  • Remove :ro from config mount so UI can write providers.json
  • Expose terminal port 32352 in docker-compose.yml
  • Pre-seed /root/.claude/settings.json with theme + onboarding flags (each agent has its own cwd, so Claude Code treated each one as a separate project and prompted for a theme every time)

Files touched

  • Dockerfile.dashboard — COPY .claude/ and docs/
  • start-dashboard.sh — seed config + restore .claude.json from backup (+53 lines)
  • dashboard/terminal-server/src/claude-bridge.js — env vars + idempotent sessions
  • dashboard/terminal-server/src/server.js — correct success message on duplicate start
  • docker-compose.yml — volume + port + rw config
  • evonexus.stack.yml + new evonexus.portainer.stack.yml: claude_auth volume
  • .dockerignore — exclude dashboard/data/, workspace/

Test plan

  • Rebuild image locally via docker compose build
  • Deploy to Docker Swarm via Portainer stack
  • Verify volumes persist .claude/ across docker service update --force
  • Verify /api/agents, /api/skills, /api/commands return non-empty after fresh deploy
  • Verify terminal WebSocket reconnects through Traefik without "Session already exists"
  • Verify theme picker does NOT appear on new agent terminals after first deploy
  • Verify Claude Code uses ANTHROPIC_API_KEY from Providers UI instead of OAuth login

All validated on production host evonexus.advancedbot.com.br (Swarm + Portainer + Traefik with letsencryptresolver).

🤖 Generated with Claude Code

MarcelocardosoLeal and others added 7 commits April 18, 2026 18:49
1. Add ANTHROPIC_API_KEY to ALLOWED_VARS in claude-bridge.js
   The env var was silently filtered out, causing Claude Code to fall
   back to OAuth login on every session start instead of using the
   API key configured in the Providers page.

2. Fix orphaned session crash ("Session already exists")
   When a Claude process died without firing the PTY onExit event,
   the session remained in the bridge's in-memory Map as inactive.
   The next start attempt threw "already exists". Now detects dead
   sessions, cleans them up, and restarts normally.

3. Exclude dashboard/data/ and workspace/ from Docker build context
   Without these entries in .dockerignore, the local SQLite database
   (with hashed passwords) and workspace files were baked into the
   image. On first Swarm deploy, the volume was seeded from the image,
   making login impossible with any other credentials.
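
Items 1 and 2 above can be sketched roughly as follows. This is only an illustration, not the code in claude-bridge.js: the contents of `ALLOWED_VARS` other than `ANTHROPIC_API_KEY`, the helper names `filterEnv` and `pruneOrphan`, and the `{ active, pty }` shape of a session entry are all assumptions made for the example.

```javascript
// Hypothetical allowlist; the real ALLOWED_VARS in claude-bridge.js may list other names.
const ALLOWED_VARS = ['PATH', 'HOME', 'TERM', 'ANTHROPIC_API_KEY'];

// Return a copy of `env` holding only allowlisted variables that are actually set.
function filterEnv(env, allowed = ALLOWED_VARS) {
  const out = {};
  for (const name of allowed) {
    if (env[name] !== undefined) out[name] = env[name];
  }
  return out;
}

// Remove an inactive (orphaned) entry so the next start does not hit
// "Session already exists". Returns true when an orphan was removed.
function pruneOrphan(sessions, id) {
  const entry = sessions.get(id);
  if (entry && !entry.active) {
    // Best-effort kill in case the process still lingers; failures are ignored.
    try { entry.pty?.kill(); } catch (_) {}
    sessions.delete(id);
    return true;
  }
  return false;
}

const childEnv = filterEnv({
  PATH: '/usr/bin',
  ANTHROPIC_API_KEY: 'sk-example',
  DATABASE_URL: 'sqlite:///secret.db', // not allowlisted, so it is dropped
});
console.log(Object.keys(childEnv)); // [ 'PATH', 'ANTHROPIC_API_KEY' ]

const sessions = new Map([
  ['agent-1', { active: false, pty: null }], // orphan left behind by a dead process
  ['agent-2', { active: true, pty: null }],
]);
console.log(pruneOrphan(sessions, 'agent-1')); // true
console.log(pruneOrphan(sessions, 'agent-2')); // false
```

An allowlist keeps unrelated secrets out of the child process, which is why a missing entry fails silently: the variable simply never reaches Claude Code.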

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add evonexus_claude_auth:/root/.claude to all three Swarm services
  (dashboard, telegram, scheduler) so Claude Code OAuth tokens persist
  across redeploys — avoids re-authentication on every deploy
- docker-compose.yml: use Dockerfile.swarm.dashboard, expose terminal
  port 32352, add claude-auth volume, fix config mount (remove :ro so
  providers.json can be written by the UI)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add evonexus_claude_auth:/root/.claude to all three services in
evonexus.stack.yml so Claude Code OAuth tokens persist across redeploys.
Same fix applied to evonexus.portainer.stack.yml in the previous commit.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Bug 1 — Theme picker on every agent
Each agent runs in its own working directory, which Claude Code treats
as a separate project. Without a global theme set, the user is asked to
choose a theme on every single agent terminal. Pre-seed
/root/.claude/settings.json with theme + onboarding flags during
container startup so the first-run prompts are skipped. Only writes the
file if it doesn't exist (preserves user-chosen overrides).
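
The "only write if missing" behaviour can be illustrated with a small Node sketch. The PR does this in start-dashboard.sh, so the JavaScript below is a stand-in, and both the helper name `seedSettings` and the `{ theme: 'dark' }` payload are placeholders rather than the actual seeded keys.

```javascript
const fs = require('node:fs');
const os = require('node:os');
const path = require('node:path');

// Write settings.json with `defaults` only when no file exists yet.
// Returns true if the file was created, false if an existing file was kept.
function seedSettings(claudeDir, defaults) {
  fs.mkdirSync(claudeDir, { recursive: true });
  const file = path.join(claudeDir, 'settings.json');
  try {
    // The 'wx' flag fails with EEXIST when the file is present,
    // so user-chosen settings are never overwritten.
    fs.writeFileSync(file, JSON.stringify(defaults, null, 2), { flag: 'wx' });
    return true;
  } catch (err) {
    if (err.code === 'EEXIST') return false;
    throw err;
  }
}

// Demo against a throwaway directory instead of /root/.claude.
const demoDir = fs.mkdtempSync(path.join(os.tmpdir(), 'claude-'));
console.log(seedSettings(demoDir, { theme: 'dark' }));  // true: file created
console.log(seedSettings(demoDir, { theme: 'light' })); // false: existing file kept
```

Using an exclusive-create write avoids the gap between a separate existence check and the write.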

Bug 2 — "Session already exists" error toast
The previous fix only cleaned up *inactive* orphans. The actual production
trigger is different: when a WebSocket reconnects through Traefik, the
frontend can re-send start_claude before learning the session is still
alive. The bridge's startSession then threw on a duplicate active session.
Make startSession idempotent: if the session is already active, return
the existing entry instead of throwing.
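
A minimal sketch of an idempotent start, assuming the bridge keeps sessions in a Map. `createBridge`, the injected `spawn` function, and the `{ entry, created }` return shape are invented for the example; the PR only states that the existing entry is returned instead of throwing.

```javascript
// Hypothetical bridge factory; `spawn` stands in for whatever launches the Claude PTY.
function createBridge(spawn) {
  const sessions = new Map();
  return {
    sessions,
    startSession(id) {
      const existing = sessions.get(id);
      // Idempotent path: a reconnecting client gets the live session back.
      if (existing && existing.active) return { entry: existing, created: false };
      // No entry, or an inactive leftover: start fresh and replace it.
      const entry = { id, active: true, pty: spawn(id) };
      sessions.set(id, entry);
      return { entry, created: true };
    },
  };
}

let spawned = 0;
const bridge = createBridge(() => ({ pid: ++spawned }));
const first = bridge.startSession('agent-1');
const again = bridge.startSession('agent-1'); // e.g. start_claude re-sent after a reconnect
console.log(first.created, again.created, spawned); // true false 1
```

Returning a `created` flag lets the caller tell a fresh start from a resumed one without relying on exceptions.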

Bug 3 — Misleading error on duplicate start
Server.startClaude() responded with type:'error' "An agent is already
running" when the session was active. From the user's perspective this
looked like a failure even though everything was working. Send
type:'claude_started' instead so the frontend updates UI to "running"
and replays the buffer.
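
The change in reply type might look like the sketch below. It is not the code in server.js: `handleStartClaude`, the `bridge.isActive` method, the `send` callback, and the `sessionId` field on the message are assumptions; only the `claude_started` and `error` message types come from the description above.

```javascript
// Reply to a start_claude request. `bridge` and `send` are hypothetical stand-ins
// for the session bridge and the WebSocket send function.
function handleStartClaude(bridge, sessionId, send) {
  if (bridge.isActive(sessionId)) {
    // Previously an error reply ("An agent is already running") was sent here.
    // A live session is a success from the user's point of view.
    send({ type: 'claude_started', sessionId });
    return;
  }
  try {
    bridge.startSession(sessionId);
    send({ type: 'claude_started', sessionId });
  } catch (err) {
    // Genuine failures are still reported as errors.
    send({ type: 'error', message: err.message });
  }
}

const sent = [];
const fakeBridge = {
  isActive: () => true,
  startSession: () => { throw new Error('not reached for an active session'); },
};
handleStartClaude(fakeBridge, 'agent-1', (msg) => sent.push(msg));
console.log(sent[0].type); // claude_started
```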

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Claude Code stores its main config at /root/.claude.json — a SIBLING
of the /root/.claude/ directory, not inside it. The Swarm volume
mounts /root/.claude/ only, so .claude.json sits in the container's
writable layer and is wiped on every redeploy. Result: theme picker
and onboarding reappear on every release, even though the OAuth
tokens (in /root/.claude/) survive.

Claude Code itself writes timestamped backups into
/root/.claude/backups/ (which IS in the volume), so we just need to
restore the latest one on startup when the main file is missing. If
no backup exists either, seed a minimal config so first-run prompts
are skipped.
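
The restore-or-seed logic can be sketched as below. The PR implements it in start-dashboard.sh, so this Node version is only an illustration, and it rests on assumptions: that backup file names start with `.claude.json`, that the newest backup can be picked by modification time, and that the helper name `restoreClaudeConfig` and its `seed` argument are hypothetical.

```javascript
const fs = require('node:fs');
const os = require('node:os');
const path = require('node:path');

// Ensure <home>/.claude.json exists. Returns 'kept', 'restored' or 'seeded'.
function restoreClaudeConfig(home, seed) {
  const target = path.join(home, '.claude.json');
  if (fs.existsSync(target)) return 'kept';

  const backupDir = path.join(home, '.claude', 'backups');
  let backups = [];
  if (fs.existsSync(backupDir)) {
    backups = fs.readdirSync(backupDir)
      .filter((name) => name.startsWith('.claude.json')) // assumed naming scheme
      .map((name) => path.join(backupDir, name))
      .sort((a, b) => fs.statSync(b).mtimeMs - fs.statSync(a).mtimeMs); // newest first
  }
  if (backups.length > 0) {
    fs.copyFileSync(backups[0], target);
    return 'restored';
  }
  // No backup available: write a minimal config so first-run prompts are skipped.
  fs.writeFileSync(target, JSON.stringify(seed, null, 2));
  return 'seeded';
}

// Demo against a throwaway home directory instead of /root.
const demoHome = fs.mkdtempSync(path.join(os.tmpdir(), 'home-'));
fs.mkdirSync(path.join(demoHome, '.claude', 'backups'), { recursive: true });
fs.writeFileSync(path.join(demoHome, '.claude', 'backups', '.claude.json.backup.1'), '{}');
console.log(restoreClaudeConfig(demoHome, {})); // restored
console.log(restoreClaudeConfig(demoHome, {})); // kept
```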

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The Dockerfile only copied dashboard/backend/, social-auth/, scheduler.py
and the built frontend. .claude/ (agents, skills, commands, templates,
rules) and docs/ were never copied, so on a fresh deploy the backend's
WORKSPACE / ".claude" / "agents" path was empty. Result: /api/agents,
/api/skills, /api/commands and /api/templates all returned empty lists,
and the UI showed "No agents found — Add agent files to .claude/agents/
to get started" on every clean Swarm deploy.

Local development worked because uv runs the backend with cwd at the
repo root, where .claude/ and docs/ exist.

.claude/agent-memory and .claude/.env stay excluded by .dockerignore so
user data and secrets remain out of the image.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- tsconfig.app.json: multi-line lib array for readability
- evonexus.portainer.stack.yml: remove stray blank line in traefik labels

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

