feat(e2e): add Gherkin behaviour tests with dummy runtime by ifireball · Pull Request #1982 · fullsend-ai/fullsend

ifireball · 2026-06-07T10:49:19Z

Summary

Add defaults.runtime org config and --runtime install flag with shared runtime.ResolveFromConfig() selection in fullsend run.
Implement a dummy runtime that executes scripted sandbox operations and emits behaviour-results.json for deterministic assertions.
Introduce godog behaviour tests under e2e/behaviour/ with pluggable GitHub/GitHub Actions drivers, triage scenarios, CI job, and ADR/docs.

Reopened from upstream branch (replaces #1981) so CI jobs can access repository secrets.

Test plan

go test ./...
go test -tags behaviour -c ./e2e/behaviour/...
go test -tags e2e -c ./e2e/admin/...
make behaviour-test against halfsend org pool with GITHUB_TOKEN and --runtime dummy orgs
CI behaviour job after adding E2E_BEHAVIOUR_GITHUB_TOKEN secret

Made with Cursor

Introduce defaults.runtime config, a scripted dummy runtime for deterministic E2E validation, and godog-based behaviour tests against live GitHub/GHA.

Signed-off-by: Barak Korren <bkorren@redhat.com>
Co-authored-by: Cursor <cursoragent@cursor.com>

github-actions · 2026-06-07T10:50:44Z

Site preview

Preview: https://e705e5ee-site.fullsend-ai.workers.dev

Commit: ad356f80a1c004948fa0adfe8ca4be8c7dde87c7

Add required title field and ## Status section so ADR hooks pass in CI.

Signed-off-by: Barak Korren <bkorren@redhat.com>
Co-authored-by: Cursor <cursoragent@cursor.com>

fullsend-ai-review · 2026-06-07T11:19:41Z

Review

Findings

High

[protected-path] .github/workflows/e2e.yml — This PR modifies files under .github/, a protected path requiring human approval. No linked issue exists to provide authorization context for this change. The PR description explains the CI job addition, but protected-path changes require an explicit issue link justifying the modification.
Remediation: File an issue describing the need for the behaviour test CI job, link it to this PR, and obtain human approval for the .github/ change.

Medium

[command-injection] internal/runtime/dummy.go:159 — The run_command op passes op.Args directly to sandbox.Exec without sanitization or allowlisting. Other ops (read_file, url_get) use shellQuote for their arguments. While execution occurs inside a sandbox and the behaviour script requires config repo write access to plant, this creates an unsanitized command execution path that other ops deliberately avoid.
Remediation: Either remove run_command (other ops cover the test scenarios), restrict to an explicit allowlist of safe commands, or at minimum apply shellQuote consistently.
[misconfiguration-guard] internal/runtime/registry.go:16 — The dummy runtime is registered unconditionally in the production binary. Any org whose config.yaml has defaults.runtime: dummy will bypass LLM inference and execute behaviour script ops instead. There is no guardrail preventing a production org from being (accidentally or maliciously) configured with runtime: dummy. The env.go validator only runs during the e2e test suite, not in the fullsend run production path.
Remediation: Add a runtime environment guard (e.g., require FULLSEND_ALLOW_DUMMY_RUNTIME=1) so the dummy runtime cannot be activated in production CI runs without explicit opt-in.
[exit-code-contract] internal/runtime/dummy.go:115 — DummyRuntime.Run returns exitCode=1 when an operation fails, but executeBehaviourScript always returns nil error. The caller in run.go only aborts on a non-nil Go error — a non-zero exit code is just a warning. This means a failed operation is treated as non-fatal, diverging from how ClaudeRuntime treats non-zero exits.
Remediation: Return a Go error from Run when any operation fails (matching ClaudeRuntime behavior), or document that dummy runtime exit code 1 is intentionally non-fatal.
[missing-authorization] — No linked issue for this 1800+ line feature PR that adds significant testing infrastructure (Gherkin behaviour tests, dummy runtime, pluggable drivers, CI job, ADR, 6 doc files). While ADR 0043 (included in the PR) provides design authorization, non-trivial changes to protected paths require an explicit issue link per project governance.
Remediation: File an issue proposing this testing infrastructure addition and link it to the PR.
[stale-doc] docs/ADRs/0003-org-config-repo-convention.md:216 — The config schema example shows a nested runtime: section with harness and model fields. This PR introduces defaults.runtime as a flat string field (claude or dummy). The example config is now misleading for users configuring runtime selection.
Remediation: Update the example config to show defaults.runtime: claude under the defaults: section.
[missing-doc] docs/guides/getting-started/installation.md:240 — The admin install flags table does not document the new --runtime flag. Users installing behaviour test orgs need fullsend admin install --runtime dummy, but this flag is absent from the primary installation reference.
Remediation: Add a row to the flags table: | --runtime | claude | Agent runtime backend (claude or dummy); dummy is for behaviour test orgs only |

Low

[path-traversal] internal/runtime/dummy.go:195 — resolveSandboxPath joins a base path with user-controlled rel from the behaviour script without verifying the resolved path stays within the base directory. While execution is sandboxed, applying the same prefix check used in the zip extraction guard would be more robust.
[last-match-wins] e2e/behaviour/drivers/ci/githubactions/githubactions.go:222 — FindBehaviourResults and FindOutputFile walk the entire artifact tree and keep overwriting found with each successive match. If multiple artifacts contain the same filename, the last one in lexicographic walk order wins silently.
[test-adequacy] internal/runtime/dummy_test.go — Tests cover LoadBehaviourScript and resolveWriteFixture but do not test executeBehaviourOp for any op type, the exit code logic in Run, or the unknown-op error path.
[build-tag-consistency] e2e/admin/lock.go:1, e2e/admin/testutil.go:1 — Build tags changed from e2e to e2e || behaviour, diverging from other e2e/admin files that use the single e2e tag.
[function-placement] e2e/behaviour/steps/triage.go:297 — CleanupScenario performs cross-cutting cleanup (issues, artifacts, dummy script) not specific to triage, but lives in triage.go. Consider moving to registry.go or a dedicated cleanup.go.

fullsend-ai-review

See the review comment for full details.

fullsend-ai-review · 2026-06-07T11:19:41Z

+		if url == "" {
+			return fmt.Errorf("url_get requires a URL")
+		}
+		cmd := fmt.Sprintf("curl -sf %s -o /dev/null", shellQuote(url))


[medium] command-injection

The run_command op passes op.Args directly to sandbox.Exec without sanitization or allowlisting. Other ops (read_file, url_get) use shellQuote for their arguments but run_command does not.

Suggested fix: Either remove run_command, restrict to an explicit allowlist of safe commands, or apply shellQuote consistently.
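The allowlist-plus-quoting option could look roughly like the sketch below. It is not the PR's code: `allowedCommands`, `buildRunCommand`, and the assumption that the op's arguments arrive as a string slice are all hypothetical, and the `shellQuote` shown here is a generic POSIX single-quote implementation that may differ from the one in `dummy.go`.

```go
package main

import (
	"fmt"
	"strings"
)

// shellQuote wraps s in single quotes for a POSIX shell and escapes embedded
// single quotes. Assumption: the PR's own shellQuote may be implemented differently.
func shellQuote(s string) string {
	return "'" + strings.ReplaceAll(s, "'", `'\''`) + "'"
}

// allowedCommands is a hypothetical allowlist; the real set would be whatever
// the behaviour scenarios actually need.
var allowedCommands = map[string]bool{"echo": true, "cat": true, "ls": true}

// buildRunCommand rejects commands outside the allowlist and quotes every
// argument before the string is handed to a shell.
func buildRunCommand(args []string) (string, error) {
	if len(args) == 0 {
		return "", fmt.Errorf("run_command requires a command")
	}
	if !allowedCommands[args[0]] {
		return "", fmt.Errorf("run_command: %q is not in the allowlist", args[0])
	}
	quoted := make([]string, len(args))
	for i, a := range args {
		quoted[i] = shellQuote(a)
	}
	return strings.Join(quoted, " "), nil
}

func main() {
	cmd, err := buildRunCommand([]string{"echo", "hello; rm -rf /"})
	fmt.Println(cmd, err)

	_, err = buildRunCommand([]string{"rm", "-rf", "/"})
	fmt.Println(err)
}
```

With this shape, shell metacharacters inside an argument reach the command as literal text, and anything outside the allowlist fails before it gets to the sandbox.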

fullsend-ai-review · 2026-06-07T11:19:41Z

+	case "", "claude":
+		r := ClaudeRuntime{}
+		return Backend{Runtime: r, Transcripts: r}, nil
+	case "dummy":


[medium] misconfiguration-guard

The dummy runtime is registered unconditionally in the production binary. Any org with defaults.runtime: dummy in config.yaml will bypass LLM inference with no guardrail preventing accidental or malicious activation on production orgs.

Suggested fix: Add a runtime environment guard (e.g., require FULLSEND_ALLOW_DUMMY_RUNTIME=1) so the dummy runtime cannot activate in production CI runs without explicit opt-in.

You need to actually be able to commit to the org configuration to activate this! We really don't need extra protection layers beyond that. Someone who is able to commit there can already override most things.

fullsend-ai-review · 2026-06-07T11:19:41Z

+	var script BehaviourScript
+	if err := yaml.Unmarshal(data, &script); err != nil {
+		return nil, fmt.Errorf("parsing behaviour script %s: %w", path, err)
+	}


[medium] exit-code-contract

DummyRuntime.Run returns exitCode=1 when ops fail but executeBehaviourScript always returns nil error. The caller only aborts on non-nil Go error, making failed ops non-fatal — diverging from ClaudeRuntime behavior.

Suggested fix: Return a Go error from Run when any operation fails, or document that dummy runtime exit code 1 is intentionally non-fatal.
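A minimal sketch of the first option, assuming a simplified op type: `op`, `runOps`, and `errBoom` are hypothetical names, not the PR's `DummyRuntime` API. The idea is that a non-zero exit code is always accompanied by a non-nil error, so a caller that only checks the error still aborts.

```go
package main

import (
	"errors"
	"fmt"
)

// op is a hypothetical stand-in for one behaviour-script operation.
type op struct {
	Name string
	Run  func() error
}

// errBoom is a placeholder failure used only by the demo below.
var errBoom = errors.New("simulated op failure")

// runOps executes every op, collects the failures, and returns exit code 1
// together with a joined error when at least one op failed.
func runOps(ops []op) (int, error) {
	var errs []error
	for _, o := range ops {
		if err := o.Run(); err != nil {
			errs = append(errs, fmt.Errorf("op %q: %w", o.Name, err))
		}
	}
	if len(errs) > 0 {
		return 1, errors.Join(errs...) // errors.Join requires Go 1.20+
	}
	return 0, nil
}

func main() {
	code, err := runOps([]op{
		{Name: "read_file", Run: func() error { return nil }},
		{Name: "url_get", Run: func() error { return errBoom }},
	})
	fmt.Println(code, err)
}
```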

ifireball · 2026-06-07T11:32:37Z

About the @fullsend-ai-review "[misconfiguration-guard]" comment: I reject adding extra layers. Requiring users to explicitly commit the "dummy" value into their configuration should be enough to convince us that they know what they are doing.

ralphbean

I think this needs a few changes before we can merge. See inline comments.

ralphbean · 2026-06-07T14:05:13Z

@@ -0,0 +1,40 @@
+---
+title: "43. Behaviour tests with Gherkin and pluggable drivers"


[note] Heads up — PR #1816 also uses ADR number 0043 (for GitLab support). Whichever merges second will need renumbering. We have a renumber-adr skill that can help with that.

ralphbean · 2026-06-07T14:05:13Z

+
+	"github.com/cucumber/godog"
+
+	gaci "github.com/fullsend-ai/fullsend/e2e/behaviour/drivers/ci/githubactions"


[important] The guide in behaviour-drivers.md says steps should only go through the driver interfaces, not concrete implementations. This import pulls in the githubactions package directly — FindBehaviourResults and FindOutputFile are artifact-parsing utilities that aren't GitHub Actions-specific. If we moved them to a shared package (maybe e2e/behaviour/results/ or onto the ci.Driver interface), a future Tekton driver wouldn't need to import the GitHub Actions package to parse results.

Does that seem right to you, or is there a reason to keep them coupled to the GHA driver?

ralphbean · 2026-06-07T14:05:13Z

+		_ = os.RemoveAll(w.ArtifactDir)
+	}
+	empty := []byte("ops: []\n")
+	_ = w.SCM.CommitFile(ctx, w.Org, ".fullsend", world.BehaviourScriptRepoPath, "behaviour: clear dummy agent script", empty)


[important] If this CommitFile fails (GitHub API flake), the next scenario inherits the previous scenario's dummy script and produces wrong results — with no visible error. The World struct doesn't carry a logger, so there's no way to surface the failure right now. Could we at least add a Logf func(string, ...any) to World and log the error here? Silent cleanup failures in CI are rough to debug.
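One way the suggested hook might look, as a sketch only: the `World` below is a trimmed-down hypothetical, `cleanupScenario` takes a `commit` callback standing in for the `w.SCM.CommitFile` call, and `demoCleanup` exists purely to exercise it. In a test, `t.Logf` has a compatible signature and could be assigned to the field directly.

```go
package main

import (
	"errors"
	"fmt"
)

// World is a hypothetical, trimmed-down version of the PR's World struct with
// the suggested Logf hook added.
type World struct {
	Logf func(format string, args ...any)
}

// logf forwards to Logf when it is set, so a zero-value World stays usable.
func (w *World) logf(format string, args ...any) {
	if w.Logf != nil {
		w.Logf(format, args...)
	}
}

// cleanupScenario performs the best-effort reset and logs a failure instead
// of discarding it. commit stands in for the CommitFile call.
func cleanupScenario(w *World, commit func() error) {
	if err := commit(); err != nil {
		w.logf("cleanup: clearing dummy agent script failed: %v", err)
	}
}

// demoCleanup returns the messages logged during one cleanup run.
func demoCleanup(fail bool) []string {
	var logged []string
	w := &World{Logf: func(format string, args ...any) {
		logged = append(logged, fmt.Sprintf(format, args...))
	}}
	cleanupScenario(w, func() error {
		if fail {
			return errors.New("simulated GitHub API failure")
		}
		return nil
	})
	return logged
}

func main() {
	fmt.Println(demoCleanup(true))
}
```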

ralphbean · 2026-06-07T14:05:14Z

+			if readErr != nil {
+				return readErr
+			}
+			found = data


[important] This overwrites found on every match without stopping the walk. If an artifact directory has multiple behaviour-results.json files, this returns whichever was walked last — which depends on filesystem ordering. Returning on first match (filepath.SkipAll) would make the behavior deterministic. Same issue in FindOutputFile below at line 240.
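A rough sketch of the first-match variant, assuming Go 1.20 or later for `filepath.SkipAll`. `findFirst` and `demoFindFirst` are hypothetical names, not the PR's `FindBehaviourResults`; the demo builds a temporary tree with two same-named files to show which one is returned.

```go
package main

import (
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// findFirst walks root in lexical order and returns the contents of the first
// file called name, stopping the walk as soon as it is found.
func findFirst(root, name string) ([]byte, error) {
	var (
		found   []byte
		matched bool
	)
	err := filepath.WalkDir(root, func(p string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		if d.IsDir() || d.Name() != name {
			return nil
		}
		data, readErr := os.ReadFile(p)
		if readErr != nil {
			return readErr
		}
		found, matched = data, true
		return filepath.SkipAll // first match wins; WalkDir then returns nil
	})
	if err != nil {
		return nil, err
	}
	if !matched {
		return nil, fmt.Errorf("%s not found under %s", name, root)
	}
	return found, nil
}

// demoFindFirst creates a/ and b/ with a same-named file in each and reports
// which contents findFirst returns.
func demoFindFirst() (string, error) {
	root, err := os.MkdirTemp("", "artifacts")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(root)
	for dir, body := range map[string]string{"a": "first", "b": "second"} {
		if err := os.MkdirAll(filepath.Join(root, dir), 0o755); err != nil {
			return "", err
		}
		path := filepath.Join(root, dir, "behaviour-results.json")
		if err := os.WriteFile(path, []byte(body), 0o644); err != nil {
			return "", err
		}
	}
	data, err := findFirst(root, "behaviour-results.json")
	return string(data), err
}

func main() {
	fmt.Println(demoFindFirst())
}
```

Because `WalkDir` visits entries in lexical order, the file under `a/` is the one returned regardless of how the directories were created.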

ralphbean · 2026-06-07T14:05:14Z

+}
+
+func resolveSandboxPath(base, rel string) string {
+	if filepath.IsAbs(rel) || strings.HasPrefix(rel, sandbox.SandboxWorkspace) {


[moderate] This returns absolute paths as-is (/etc/passwd) and doesn't check that relative paths stay within base after filepath.Join canonicalizes ../ components. The sandbox is the outer containment boundary, so this isn't exploitable from outside — but the function's intent seems to be scoping to a directory. Adding a "resolved path must have base as prefix" check would be consistent with how downloadArtifact validates zip paths a few files over.
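The prefix check could be sketched as follows. `resolveWithin` is a hypothetical replacement, the base `/sandbox/workspace` is a made-up value (the real `sandbox.SandboxWorkspace` constant is not shown in this thread), and the sketch uses package `path` on the assumption that sandbox paths are POSIX-style.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// resolveWithin resolves rel against base and rejects any result that falls
// outside base, whether rel is relative (with ../ components) or absolute.
func resolveWithin(base, rel string) (string, error) {
	base = path.Clean(base)
	resolved := path.Join(base, rel) // Join also cleans ../ components
	if path.IsAbs(rel) {
		resolved = path.Clean(rel)
	}
	// Compare against base plus a separator so that /sandbox/workspace-evil
	// is not mistaken for a child of /sandbox/workspace.
	if resolved != base && !strings.HasPrefix(resolved, base+"/") {
		return "", fmt.Errorf("path %q escapes %q", rel, base)
	}
	return resolved, nil
}

func main() {
	const base = "/sandbox/workspace" // hypothetical workspace root
	for _, rel := range []string{"out/a.json", "../../etc/passwd", "/etc/passwd"} {
		p, err := resolveWithin(base, rel)
		fmt.Printf("%q -> %q, err=%v\n", rel, p, err)
	}
}
```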

ralphbean · 2026-06-07T14:05:14Z

+		_ = os.MkdirAll(filepath.Dir(outPath), 0o755)
+		rc, err := f.Open()
+		if err != nil {
+			continue


[moderate] Errors from f.Open() and io.ReadAll are silently continued past. If behaviour-results.json fails to extract, the test later fails with "not found" instead of the actual I/O error. Propagating the error here would make CI failures much easier to debug.
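A sketch of what propagating those errors might look like. `extractEntry` and `demoExtract` are hypothetical helpers, the JSON payload is a placeholder, and the zip-path validation the PR already does in `downloadArtifact` is deliberately left out to keep the example short.

```go
package main

import (
	"archive/zip"
	"bytes"
	"fmt"
	"io"
	"os"
	"path/filepath"
)

// extractEntry writes one zip entry to outPath and returns the first I/O
// error with context, instead of skipping the entry silently.
func extractEntry(f *zip.File, outPath string) error {
	if err := os.MkdirAll(filepath.Dir(outPath), 0o755); err != nil {
		return fmt.Errorf("creating directory for %s: %w", f.Name, err)
	}
	rc, err := f.Open()
	if err != nil {
		return fmt.Errorf("opening zip entry %s: %w", f.Name, err)
	}
	defer rc.Close()
	data, err := io.ReadAll(rc)
	if err != nil {
		return fmt.Errorf("reading zip entry %s: %w", f.Name, err)
	}
	if err := os.WriteFile(outPath, data, 0o644); err != nil {
		return fmt.Errorf("writing %s: %w", outPath, err)
	}
	return nil
}

// demoExtract builds a one-entry zip in memory, extracts it into a temp
// directory, and returns the extracted contents.
func demoExtract() (string, error) {
	var buf bytes.Buffer
	zw := zip.NewWriter(&buf)
	w, err := zw.Create("results/behaviour-results.json")
	if err != nil {
		return "", err
	}
	if _, err := w.Write([]byte(`{"ops":[]}`)); err != nil { // placeholder payload
		return "", err
	}
	if err := zw.Close(); err != nil {
		return "", err
	}
	zr, err := zip.NewReader(bytes.NewReader(buf.Bytes()), int64(buf.Len()))
	if err != nil {
		return "", err
	}
	dir, err := os.MkdirTemp("", "extract")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)
	for _, f := range zr.File {
		if err := extractEntry(f, filepath.Join(dir, filepath.FromSlash(f.Name))); err != nil {
			return "", err
		}
	}
	data, err := os.ReadFile(filepath.Join(dir, "results", "behaviour-results.json"))
	return string(data), err
}

func main() {
	fmt.Println(demoExtract())
}
```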

ralphbean · 2026-06-07T14:05:14Z

+	}
+}
+
+func resolveWriteFixture(op BehaviourOperation) (dest string, content string, err error) {


[note] The error message says args should be dest_path, fixture_path, but parts[1] (the fixture path) is parsed and then discarded — only op.Content is used. If the fixture path is always ignored, the args format could just be the destination path. If there's an intent to load from disk later, a TODO comment would help a future reader.

ralphbean · 2026-06-07T14:05:14Z

+	}
+
+	message := fmt.Sprintf("behaviour: set dummy agent script (%s)", time.Now().UTC().Format(time.RFC3339))
+	if err := w.SCM.CommitFile(context.Background(), w.Org, ".fullsend", world.BehaviourScriptRepoPath, message, data); err != nil {


[note] Hardcoded ".fullsend" here and in CleanupScenario — forge.ConfigRepoName would be the safer reference.

ralphbean · 2026-06-07T14:05:14Z

          if-no-files-found: ignore
          retention-days: 5
+
+  behaviour:


[note] The existing e2e job uploads screenshots on failure (if: always()). This job doesn't have an equivalent artifact upload step. If a behaviour test fails in CI, there's no way to retrieve downloaded artifacts or logs for debugging. Might be worth adding one when you have the chance.

ralphbean · 2026-06-07T14:05:14Z

+		t.Skipf("org %s not ready for behaviour tests: %v", org, err)
+	}
+
+	w := &world.World{


[note] This single World is shared across all scenarios, which works for serial execution. godog supports --concurrency though — if someone ever passes that, scenarios would race on shared fields. Might be worth a comment noting the single-threaded assumption, or creating a new World per scenario in the Before hook.

waynesun09

Review squad findings (6 agents, deduplicated against 13 existing review threads). 1 original CRITICAL finding dropped as false positive (claude.go not modified in this PR). 5 new findings posted — 1 HIGH, 4 MEDIUM.

waynesun09 · 2026-06-08T15:08:36Z

+		return err
+	}
+
+	if err := verifyDummyExpectations(w, artifactDir); err != nil {


HIGH — Dummy/output assertions are no-ops due to step ordering

thenTriageWorkflowCompletes calls verifyDummyExpectations and verifyOutputExpectations here, but w.DummyExpectations and w.OutputExpectations are still empty at this point — they get populated by the later godog steps (the agent will succeed to ..., the agent will output ...) which run after this step returns.

In the feature file:

    Then the triage workflow completes successfully   ← runs verification here (empty lists)
    And the agent will succeed to Emit triage JSON    ← appends to DummyExpectations (too late)

The for _, exp := range w.DummyExpectations loop iterates zero times and returns nil — every scenario passes regardless of actual results.

Suggestion: Move verification into its own step (e.g., Then the dummy agent expectations are met) placed after the assertion steps, or accumulate expectations from the table in the Given step and verify them here.
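The ordering problem and the "separate verification step" fix can be illustrated with a small simulation. Everything here is hypothetical: `world`, `expectation`, and the two step functions are simplified stand-ins, and no godog wiring is shown.

```go
package main

import "fmt"

// expectation records that a named dummy-runtime op should have succeeded.
type expectation struct{ Op string }

// world is a hypothetical, trimmed-down stand-in for the PR's World struct.
type world struct {
	Results           map[string]bool // op name -> succeeded
	DummyExpectations []expectation
}

// agentWillSucceedTo is the assertion step: it only records the expectation.
func (w *world) agentWillSucceedTo(op string) error {
	w.DummyExpectations = append(w.DummyExpectations, expectation{Op: op})
	return nil
}

// dummyExpectationsAreMet is the dedicated verification step. It is only
// meaningful when it runs after every expectation has been recorded.
func (w *world) dummyExpectationsAreMet() error {
	for _, exp := range w.DummyExpectations {
		ok, seen := w.Results[exp.Op]
		if !seen {
			return fmt.Errorf("no result recorded for op %q", exp.Op)
		}
		if !ok {
			return fmt.Errorf("op %q was expected to succeed but failed", exp.Op)
		}
	}
	return nil
}

func main() {
	// The op actually failed, so a correct scenario must not pass.
	w := &world{Results: map[string]bool{"Emit triage JSON": false}}

	// Verifying first reproduces the reported bug: the list is still empty.
	fmt.Println("verify before expectations:", w.dummyExpectationsAreMet())

	// Recording the expectation and verifying afterwards surfaces the failure.
	_ = w.agentWillSucceedTo("Emit triage JSON")
	fmt.Println("verify after expectations:", w.dummyExpectationsAreMet())
}
```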

waynesun09 · 2026-06-08T15:08:37Z

+			return err
+		}
+		remoteDest := resolveSandboxPath(sandbox.SandboxWorkspace, dest)
+		mkdirCmd := fmt.Sprintf("mkdir -p $(dirname %s)", shellQuote(remoteDest))


MEDIUM — Command substitution in shell command

mkdir -p $(dirname ...) uses shell command substitution. The argument passed to dirname is shellQuoted, but the output of $(...) is itself unquoted, so the shell applies word splitting and pathname expansion to it. If remoteDest contains spaces or glob characters (or if a future caller passes an unquoted path), mkdir could receive the wrong arguments.

Safer to compute the directory in Go:

	dir := filepath.Dir(remoteDest)
	mkdirCmd := fmt.Sprintf("mkdir -p %s", shellQuote(dir))

This also avoids the dirname dependency and is clearer about intent.

waynesun09 · 2026-06-08T15:08:39Z

+	return nil, fmt.Errorf("workflow %s run %d did not complete within deadline", workflowFile, triageRun.ID)
+}
+
+func (d *Driver) AssertNoWorkflow(ctx context.Context, owner, repo, workflowFile string, after time.Time) error {


MEDIUM — AssertNoWorkflow checks once with no settle delay

This function queries the workflow runs list once and returns immediately. If the workflow dispatch is still propagating through GitHub's API (eventual consistency), this will falsely pass. A short polling window (e.g., 3 checks over 15-30s) would guard against the race.

	for i := 0; i < 3; i++ {
		// check for unexpected runs
		time.Sleep(10 * time.Second)
	}
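Fleshed out a little, the polling window might look like the sketch below. `assertNoRuns` and its `countRuns` callback are hypothetical; the callback stands in for whatever query the driver uses to list workflow runs created after a given time. The demo helper uses a millisecond interval only so that it finishes quickly.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// assertNoRuns checks up to attempts times, interval apart, and fails as soon
// as any check reports a workflow run. It honours context cancellation while
// waiting between checks.
func assertNoRuns(ctx context.Context, attempts int, interval time.Duration,
	countRuns func(context.Context) (int, error)) error {
	for i := 0; i < attempts; i++ {
		n, err := countRuns(ctx)
		if err != nil {
			return err
		}
		if n > 0 {
			return fmt.Errorf("found %d unexpected workflow run(s) on check %d", n, i+1)
		}
		if i == attempts-1 {
			break // no need to wait after the final check
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(interval):
		}
	}
	return nil
}

// demoAssertNoRuns feeds a fixed sequence of run counts to assertNoRuns, one
// per check.
func demoAssertNoRuns(counts []int) error {
	i := 0
	return assertNoRuns(context.Background(), len(counts), time.Millisecond,
		func(context.Context) (int, error) {
			n := counts[i]
			i++
			return n, nil
		})
}

func main() {
	fmt.Println(demoAssertNoRuns([]int{0, 0, 0})) // nothing ever shows up
	fmt.Println(demoAssertNoRuns([]int{0, 1, 0})) // a run appears on the second check
}
```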

waynesun09 · 2026-06-08T15:08:40Z

          if-no-files-found: ignore
          retention-days: 5
+
+  behaviour:


MEDIUM — Behaviour job inherits workflow-level id-token: write permission

The workflow has permissions: { id-token: write, contents: read } at the top level. The behaviour job doesn't need OIDC tokens (it runs make behaviour-test with a PAT), but it inherits id-token: write anyway because it doesn't declare its own permissions: block.

Add an explicit override to follow least-privilege:

  behaviour:
    runs-on: ubuntu-latest
    permissions:
      contents: read
    timeout-minutes: 30

waynesun09 · 2026-06-08T15:08:42Z

+	pollInterval   = 15 * time.Second
+	dispatchWait   = 12 * time.Minute
+	dispatchPoll   = 5 * time.Second
+	dispatchMaxTry = 12


MEDIUM — 60s dispatch detection window may be too short

dispatchMaxTry (12) × dispatchPoll (5s) gives a maximum wait of 60s for a workflow run to appear in the API after it is dispatched. GitHub Actions dispatch-to-run visibility can take longer than 60s under load, especially for workflow_dispatch or repository_dispatch events.

Consider increasing to dispatchMaxTry = 24 (120s) or 36 (180s) to reduce flakiness in CI. The dispatchWait = 12min for completion is generous, but the detection window is the bottleneck.
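The arithmetic behind those numbers, as a small sketch. It assumes the detection loop waits one `dispatchPoll` per try, which is how the comment above describes it; `detectionWindow` is a hypothetical helper, not a function in the PR.

```go
package main

import (
	"fmt"
	"time"
)

// dispatchPoll mirrors the constant quoted in the diff above.
const dispatchPoll = 5 * time.Second

// detectionWindow returns the longest time the detection loop would wait,
// assuming one dispatchPoll sleep per try.
func detectionWindow(maxTry int) time.Duration {
	return time.Duration(maxTry) * dispatchPoll
}

func main() {
	for _, n := range []int{12, 24, 36} {
		fmt.Printf("dispatchMaxTry=%d -> %s\n", n, detectionWindow(n))
	}
}
```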

github-actions Bot deployed to site-preview June 7, 2026 10:50

fullsend-ai-retro Bot mentioned this pull request Jun 7, 2026

Add Gherkin behaviour tests with dummy runtime #1981

Closed

5 tasks

ifireball self-assigned this Jun 7, 2026

fix(docs): align ADR 0043 frontmatter with repo lint rules

ad356f8


ifireball changed the title ~~Add Gherkin behaviour tests with dummy runtime~~ feat(e2e): add Gherkin behaviour tests with dummy runtime Jun 7, 2026

github-actions Bot deployed to site-preview June 7, 2026 11:00

ifireball marked this pull request as ready for review June 7, 2026 11:07

fullsend-ai-review Bot suggested changes Jun 7, 2026


ifireball requested review from ralphbean and waynesun09 and removed request for ralphbean June 7, 2026 11:35

ralphbean requested changes Jun 7, 2026


waynesun09 reviewed Jun 8, 2026

