feat(e2e): add Gherkin behaviour tests with dummy runtime #1982

Open
ifireball wants to merge 2 commits into main from cursor/5525b289

@ifireball

Contributor

Summary

  • Add defaults.runtime org config and --runtime install flag with shared runtime.ResolveFromConfig() selection in fullsend run.
  • Implement a dummy runtime that executes scripted sandbox operations and emits behaviour-results.json for deterministic assertions.
  • Introduce godog behaviour tests under e2e/behaviour/ with pluggable GitHub/GitHub Actions drivers, triage scenarios, CI job, and ADR/docs.

Reopened from upstream branch (replaces #1981) so CI jobs can access repository secrets.

Test plan

  • go test ./...
  • go test -tags behaviour -c ./e2e/behaviour/...
  • go test -tags e2e -c ./e2e/admin/...
  • make behaviour-test against halfsend org pool with GITHUB_TOKEN and --runtime dummy orgs
  • CI behaviour job after adding E2E_BEHAVIOUR_GITHUB_TOKEN secret

Made with Cursor

Introduce defaults.runtime config, a scripted dummy runtime for deterministic
E2E validation, and godog-based behaviour tests against live GitHub/GHA.

Signed-off-by: Barak Korren <bkorren@redhat.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
@github-actions

github-actions Bot commented Jun 7, 2026

Site preview

Preview: https://e705e5ee-site.fullsend-ai.workers.dev

Commit: ad356f80a1c004948fa0adfe8ca4be8c7dde87c7

Add required title field and ## Status section so ADR hooks pass in CI.

Signed-off-by: Barak Korren <bkorren@redhat.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
@ifireball ifireball changed the title from "Add Gherkin behaviour tests with dummy runtime" to "feat(e2e): add Gherkin behaviour tests with dummy runtime" Jun 7, 2026
@ifireball ifireball marked this pull request as ready for review June 7, 2026 11:07
@fullsend-ai-review

Review

Findings

High

  • [protected-path] .github/workflows/e2e.yml — This PR modifies files under .github/, a protected path requiring human approval. No linked issue exists to provide authorization context for this change. The PR description explains the CI job addition, but protected-path changes require an explicit issue link justifying the modification.
    Remediation: File an issue describing the need for the behaviour test CI job, link it to this PR, and obtain human approval for the .github/ change.

Medium

  • [command-injection] internal/runtime/dummy.go:159 — The run_command op passes op.Args directly to sandbox.Exec without sanitization or allowlisting. Other ops (read_file, url_get) use shellQuote for their arguments. While execution occurs inside a sandbox and the behaviour script requires config repo write access to plant, this creates an unsanitized command execution path that other ops deliberately avoid.
    Remediation: Either remove run_command (other ops cover the test scenarios), restrict to an explicit allowlist of safe commands, or at minimum apply shellQuote consistently.

  • [misconfiguration-guard] internal/runtime/registry.go:16 — The dummy runtime is registered unconditionally in the production binary. Any org whose config.yaml has defaults.runtime: dummy will bypass LLM inference and execute behaviour script ops instead. There is no guardrail preventing a production org from being (accidentally or maliciously) configured with runtime: dummy. The env.go validator only runs during the e2e test suite, not in the fullsend run production path.
    Remediation: Add a runtime environment guard (e.g., require FULLSEND_ALLOW_DUMMY_RUNTIME=1) so the dummy runtime cannot be activated in production CI runs without explicit opt-in.

  • [exit-code-contract] internal/runtime/dummy.go:115 — DummyRuntime.Run returns exitCode=1 when an operation fails, but executeBehaviourScript always returns nil error. The caller in run.go only aborts on a non-nil Go error — a non-zero exit code is just a warning. This means a failed operation is treated as non-fatal, diverging from how ClaudeRuntime treats non-zero exits.
    Remediation: Return a Go error from Run when any operation fails (matching ClaudeRuntime behavior), or document that dummy runtime exit code 1 is intentionally non-fatal.

  • [missing-authorization] — No linked issue for this 1800+ line feature PR that adds significant testing infrastructure (Gherkin behaviour tests, dummy runtime, pluggable drivers, CI job, ADR, 6 doc files). While ADR 0043 (included in the PR) provides design authorization, non-trivial changes to protected paths require an explicit issue link per project governance.
    Remediation: File an issue proposing this testing infrastructure addition and link it to the PR.

  • [stale-doc] docs/ADRs/0003-org-config-repo-convention.md:216 — The config schema example shows a nested runtime: section with harness and model fields. This PR introduces defaults.runtime as a flat string field (claude or dummy). The example config is now misleading for users configuring runtime selection.
    Remediation: Update the example config to show defaults.runtime: claude under the defaults: section.

  • [missing-doc] docs/guides/getting-started/installation.md:240 — The admin install flags table does not document the new --runtime flag. Users installing behaviour test orgs need fullsend admin install --runtime dummy, but this flag is absent from the primary installation reference.
    Remediation: Add a row to the flags table: | --runtime | claude | Agent runtime backend (claude or dummy); dummy is for behaviour test orgs only |

Low

  • [path-traversal] internal/runtime/dummy.go:195 — resolveSandboxPath joins a base path with user-controlled rel from the behaviour script without verifying the resolved path stays within the base directory. While execution is sandboxed, applying the same prefix check used in the zip extraction guard would be more robust.

  • [last-match-wins] e2e/behaviour/drivers/ci/githubactions/githubactions.go:222 — FindBehaviourResults and FindOutputFile walk the entire artifact tree and keep overwriting found with each successive match. If multiple artifacts contain the same filename, the last one in lexicographic walk order wins silently.

  • [test-adequacy] internal/runtime/dummy_test.go — Tests cover LoadBehaviourScript and resolveWriteFixture but do not test executeBehaviourOp for any op type, the exit code logic in Run, or the unknown-op error path.

  • [build-tag-consistency] e2e/admin/lock.go:1, e2e/admin/testutil.go:1 — Build tags changed from e2e to e2e || behaviour, diverging from other e2e/admin files that use the single e2e tag.

  • [function-placement] e2e/behaviour/steps/triage.go:297 — CleanupScenario performs cross-cutting cleanup (issues, artifacts, dummy script) not specific to triage, but lives in triage.go. Consider moving to registry.go or a dedicated cleanup.go.

@fullsend-ai-review fullsend-ai-review Bot left a comment

See the review comment for full details.

Comment thread internal/runtime/dummy.go
if url == "" {
    return fmt.Errorf("url_get requires a URL")
}
cmd := fmt.Sprintf("curl -sf %s -o /dev/null", shellQuote(url))

[medium] command-injection

The run_command op passes op.Args directly to sandbox.Exec without sanitization or allowlisting. Other ops (read_file, url_get) use shellQuote for their arguments but run_command does not.

Suggested fix: Either remove run_command, restrict to an explicit allowlist of safe commands, or apply shellQuote consistently.
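
As a rough illustration of the "apply shellQuote consistently" option, the sketch below quotes every argument before joining them into a command line. The real shellQuote in dummy.go is not shown in this thread, so the version here is a hypothetical stand-in using POSIX single-quote escaping, and buildCommand plus the assumption that the arguments arrive as a string slice are illustrative only.

```go
package main

import (
	"fmt"
	"strings"
)

// shellQuote is a hypothetical stand-in for the PR's helper of the same name.
// It wraps s in single quotes and escapes embedded single quotes POSIX-style.
func shellQuote(s string) string {
	return "'" + strings.ReplaceAll(s, "'", `'\''`) + "'"
}

// buildCommand (hypothetical) quotes every argument before joining, so the
// sandbox shell sees each argument as one literal word.
func buildCommand(args []string) string {
	quoted := make([]string, len(args))
	for i, a := range args {
		quoted[i] = shellQuote(a)
	}
	return strings.Join(quoted, " ")
}

func main() {
	// Shell metacharacters in the arguments stay inert once quoted.
	fmt.Println(buildCommand([]string{"echo", "hello; rm -rf /", "it's"}))
}
```

Quoting keeps metacharacters from being interpreted, but it does not restrict which program runs; an allowlist would still be needed for that.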

case "", "claude":
    r := ClaudeRuntime{}
    return Backend{Runtime: r, Transcripts: r}, nil
case "dummy":

[medium] misconfiguration-guard

The dummy runtime is registered unconditionally in the production binary. Any org with defaults.runtime: dummy in config.yaml will bypass LLM inference with no guardrail preventing accidental or malicious activation on production orgs.

Suggested fix: Add a runtime environment guard (e.g., require FULLSEND_ALLOW_DUMMY_RUNTIME=1) so the dummy runtime cannot activate in production CI runs without explicit opt-in.
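
For reference, the suggested opt-in could look roughly like the sketch below. Only the variable name FULLSEND_ALLOW_DUMMY_RUNTIME comes from the comment above; selectRuntime, its signature, and the injected lookup function are assumptions made for illustration and do not reflect the actual registry code.

```go
package main

import (
	"fmt"
	"os"
)

// allowDummyEnv is the opt-in variable named in the review comment.
const allowDummyEnv = "FULLSEND_ALLOW_DUMMY_RUNTIME"

// selectRuntime (hypothetical) returns the runtime name to use and refuses
// "dummy" unless the opt-in variable is set to "1". The lookup function is
// injected so the guard can be tested without touching the process environment.
func selectRuntime(name string, lookup func(string) string) (string, error) {
	switch name {
	case "", "claude":
		return "claude", nil
	case "dummy":
		if lookup(allowDummyEnv) != "1" {
			return "", fmt.Errorf("runtime %q requires %s=1", name, allowDummyEnv)
		}
		return "dummy", nil
	default:
		return "", fmt.Errorf("unknown runtime %q", name)
	}
}

func main() {
	rt, err := selectRuntime("dummy", os.Getenv)
	fmt.Println(rt, err)
}
```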

Contributor Author

You need to actually be able to commit to the org configuration to activate this! We really don't need extra protection layers beyond that. Someone who is able to commit there is able to override most stuff already.

Comment thread internal/runtime/dummy.go
var script BehaviourScript
if err := yaml.Unmarshal(data, &script); err != nil {
    return nil, fmt.Errorf("parsing behaviour script %s: %w", path, err)
}

[medium] exit-code-contract

DummyRuntime.Run returns exitCode=1 when ops fail but executeBehaviourScript always returns nil error. The caller only aborts on non-nil Go error, making failed ops non-fatal — diverging from ClaudeRuntime behavior.

Suggested fix: Return a Go error from Run when any operation fails, or document that dummy runtime exit code 1 is intentionally non-fatal.

@ifireball

Contributor Author

About the @fullsend-ai-review "[misconfiguration-guard]" comment: I reject adding extra layers. Users having to explicitly commit the "dummy" value into their configuration should be enough to convince us they know what they are doing.

@ifireball ifireball requested review from ralphbean and waynesun09 and removed request for ralphbean June 7, 2026 11:35

@ralphbean ralphbean left a comment

Contributor

I think this needs a few changes before we can merge. See inline comments.

@@ -0,0 +1,40 @@
---
title: "43. Behaviour tests with Gherkin and pluggable drivers"

Contributor

[note] Heads up — PR #1816 also uses ADR number 0043 (for GitLab support). Whichever merges second will need renumbering. We have a renumber-adr skill that can help with that.


"github.com/cucumber/godog"

gaci "github.com/fullsend-ai/fullsend/e2e/behaviour/drivers/ci/githubactions"

Contributor

[important] The guide in behaviour-drivers.md says steps should only go through the driver interfaces, not concrete implementations. This import pulls in the githubactions package directly — FindBehaviourResults and FindOutputFile are artifact-parsing utilities that aren't GitHub Actions-specific. If we moved them to a shared package (maybe e2e/behaviour/results/ or onto the ci.Driver interface), a future Tekton driver wouldn't need to import the GitHub Actions package to parse results.

Does that seem right to you, or is there a reason to keep them coupled to the GHA driver?

    _ = os.RemoveAll(w.ArtifactDir)
}
empty := []byte("ops: []\n")
_ = w.SCM.CommitFile(ctx, w.Org, ".fullsend", world.BehaviourScriptRepoPath, "behaviour: clear dummy agent script", empty)

Contributor

[important] If this CommitFile fails (GitHub API flake), the next scenario inherits the previous scenario's dummy script and produces wrong results — with no visible error. The World struct doesn't carry a logger, so there's no way to surface the failure right now. Could we at least add a Logf func(string, ...any) to World and log the error here? Silent cleanup failures in CI are rough to debug.

if readErr != nil {
    return readErr
}
found = data

Contributor

[important] This overwrites found on every match without stopping the walk. If an artifact directory has multiple behaviour-results.json files, this returns whichever was walked last — which depends on filesystem ordering. Returning on first match (filepath.SkipAll) would make the behavior deterministic. Same issue in FindOutputFile below at line 240.

Comment thread internal/runtime/dummy.go
}

func resolveSandboxPath(base, rel string) string {
if filepath.IsAbs(rel) || strings.HasPrefix(rel, sandbox.SandboxWorkspace) {

Contributor

[moderate] This returns absolute paths as-is (/etc/passwd) and doesn't check that relative paths stay within base after filepath.Join canonicalizes ../ components. The sandbox is the outer containment boundary, so this isn't exploitable from outside — but the function's intent seems to be scoping to a directory. Adding a "resolved path must have base as prefix" check would be consistent with how downloadArtifact validates zip paths a few files over.
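
A containment check along those lines could be sketched as follows. resolveWithin is a hypothetical variant: the (string, error) return signature is an assumption, and it uses the slash-based path package on the assumption that sandbox paths are POSIX-style regardless of the host OS.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// resolveWithin (hypothetical) resolves rel against base and rejects any
// result that does not stay inside base. Absolute inputs are accepted only
// when they already point inside base.
func resolveWithin(base, rel string) (string, error) {
	base = path.Clean(base)
	var resolved string
	if path.IsAbs(rel) {
		resolved = path.Clean(rel)
	} else {
		resolved = path.Join(base, rel) // Join also cleans "../" components
	}
	// Comparing against base+"/" avoids accepting siblings such as "/work-evil".
	if resolved != base && !strings.HasPrefix(resolved, base+"/") {
		return "", fmt.Errorf("path %q escapes %q", rel, base)
	}
	return resolved, nil
}

func main() {
	for _, rel := range []string{"out/a.json", "../etc/passwd", "/etc/passwd"} {
		p, err := resolveWithin("/sandbox/workspace", rel)
		fmt.Printf("%q -> %q, %v\n", rel, p, err)
	}
}
```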

_ = os.MkdirAll(filepath.Dir(outPath), 0o755)
rc, err := f.Open()
if err != nil {
    continue

Contributor

[moderate] Errors from f.Open() and io.ReadAll are silently continued past. If behaviour-results.json fails to extract, the test later fails with "not found" instead of the actual I/O error. Propagating the error here would make CI failures much easier to debug.

Comment thread internal/runtime/dummy.go
}
}

func resolveWriteFixture(op BehaviourOperation) (dest string, content string, err error) {

Contributor

[note] The error message says args should be dest_path, fixture_path, but parts[1] (the fixture path) is parsed and then discarded — only op.Content is used. If the fixture path is always ignored, the args format could just be the destination path. If there's an intent to load from disk later, a TODO comment would help a future reader.

}

message := fmt.Sprintf("behaviour: set dummy agent script (%s)", time.Now().UTC().Format(time.RFC3339))
if err := w.SCM.CommitFile(context.Background(), w.Org, ".fullsend", world.BehaviourScriptRepoPath, message, data); err != nil {

Contributor

[note] Hardcoded ".fullsend" here and in CleanupScenario; forge.ConfigRepoName would be the safer reference.

Comment thread .github/workflows/e2e.yml
if-no-files-found: ignore
retention-days: 5

behaviour:

Contributor

[note] The existing e2e job uploads screenshots on failure (if: always()). This job doesn't have an equivalent artifact upload step. If a behaviour test fails in CI, there's no way to retrieve downloaded artifacts or logs for debugging. Might be worth adding one when you have the chance.

t.Skipf("org %s not ready for behaviour tests: %v", org, err)
}

w := &world.World{

Contributor

[note] This single World is shared across all scenarios, which works for serial execution. godog supports --concurrency though — if someone ever passes that, scenarios would race on shared fields. Might be worth a comment noting the single-threaded assumption, or creating a new World per scenario in the Before hook.

@waynesun09 waynesun09 left a comment

Contributor

Review squad findings (6 agents, deduplicated against 13 existing review threads). 1 original CRITICAL finding dropped as false positive (claude.go not modified in this PR). 5 new findings posted — 1 HIGH, 4 MEDIUM.

return err
}

if err := verifyDummyExpectations(w, artifactDir); err != nil {

Contributor

HIGH — Dummy/output assertions are no-ops due to step ordering

thenTriageWorkflowCompletes calls verifyDummyExpectations and verifyOutputExpectations here, but w.DummyExpectations and w.OutputExpectations are still empty at this point — they get populated by the later godog steps (the agent will succeed to ..., the agent will output ...) which run after this step returns.

In the feature file:

Then the triage workflow completes successfully   ← runs verification here (empty lists)
And the agent will succeed to Emit triage JSON    ← appends to DummyExpectations (too late)

The for _, exp := range w.DummyExpectations loop iterates zero times and returns nil — every scenario passes regardless of actual results.

Suggestion: Move verification into its own step (e.g., Then the dummy agent expectations are met) placed after the assertion steps, or accumulate expectations from the table in the Given step and verify them here.
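
The ordering problem and the first suggested fix can be modelled without godog, as in the sketch below. The world type, expectOp, and verifyExpectations are hypothetical simplifications; the point is that verification runs as its own final step and refuses to pass on an empty list.

```go
package main

import (
	"errors"
	"fmt"
)

// world is a hypothetical, trimmed-down scenario state.
type world struct {
	expectations []string        // names of ops expected to succeed
	results      map[string]bool // op name -> succeeded, as parsed from results
}

// expectOp models a "the agent will succeed to ..." step: it only records.
func (w *world) expectOp(name string) error {
	w.expectations = append(w.expectations, name)
	return nil
}

// verifyExpectations models a dedicated final step. With nothing recorded it
// returns an error instead of passing vacuously.
func (w *world) verifyExpectations() error {
	if len(w.expectations) == 0 {
		return errors.New("no expectations recorded; verification would be a no-op")
	}
	for _, name := range w.expectations {
		if !w.results[name] {
			return fmt.Errorf("expected op %q to succeed", name)
		}
	}
	return nil
}

func main() {
	w := &world{results: map[string]bool{"Emit triage JSON": false}}

	// Verifying before the expectation steps ran is caught explicitly.
	fmt.Println("early verify:", w.verifyExpectations())

	// Correct order: record expectations first, verify last.
	_ = w.expectOp("Emit triage JSON")
	fmt.Println("late verify:", w.verifyExpectations())
}
```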

Comment thread internal/runtime/dummy.go
return err
}
remoteDest := resolveSandboxPath(sandbox.SandboxWorkspace, dest)
mkdirCmd := fmt.Sprintf("mkdir -p $(dirname %s)", shellQuote(remoteDest))

Contributor

MEDIUM — Command substitution in shell command

mkdir -p $(dirname ...) uses shell command substitution. The argument to dirname is shellQuoted, but the output of the $(...) substitution is itself unquoted, so the shell applies word splitting and glob expansion to it before mkdir runs. If remoteDest contains spaces or glob characters (or if a future caller passes an unquoted path), mkdir receives the wrong arguments.

Safer to compute the directory in Go:

dir := filepath.Dir(remoteDest)
mkdirCmd := fmt.Sprintf("mkdir -p %s", shellQuote(dir))

This also avoids the dirname dependency and is clearer about intent.

return nil, fmt.Errorf("workflow %s run %d did not complete within deadline", workflowFile, triageRun.ID)
}

func (d *Driver) AssertNoWorkflow(ctx context.Context, owner, repo, workflowFile string, after time.Time) error {

Contributor

MEDIUM — AssertNoWorkflow checks once with no settle delay

This function queries the workflow runs list once and returns immediately. If the workflow dispatch is still propagating through GitHub's API (eventual consistency), this will falsely pass. A short polling window (e.g., 3 checks over 15-30s) would guard against the race.

for i := 0; i < 3; i++ {
    if i > 0 {
        time.Sleep(10 * time.Second) // settle delay between checks only
    }
    // check for unexpected runs; return an error if any is found
}

Comment thread .github/workflows/e2e.yml
if-no-files-found: ignore
retention-days: 5

behaviour:

Contributor

MEDIUM — Behaviour job inherits workflow-level id-token: write permission

The workflow has permissions: { id-token: write, contents: read } at the top level. The behaviour job doesn't need OIDC tokens (it runs make behaviour-test with a PAT), but it inherits id-token: write anyway because it doesn't declare its own permissions: block.

Add an explicit override to follow least-privilege:

behaviour:
  runs-on: ubuntu-latest
  permissions:
    contents: read
  timeout-minutes: 30

pollInterval = 15 * time.Second
dispatchWait = 12 * time.Minute
dispatchPoll = 5 * time.Second
dispatchMaxTry = 12

Contributor

MEDIUM — 60s dispatch detection window may be too short

With dispatchMaxTry = 12 and dispatchPoll = 5s, the maximum wait for a workflow run to appear in the API after dispatching it is 12 × 5s = 60s. GitHub Actions dispatch-to-run visibility can take longer than 60s under load, especially for workflow_dispatch or repository_dispatch events.

Consider increasing to dispatchMaxTry = 24 (120s) or 36 (180s) to reduce flakiness in CI. The dispatchWait = 12min for completion is generous, but the detection window is the bottleneck.

3 participants