feat(e2e): add Gherkin behaviour tests with dummy runtime#1982
feat(e2e): add Gherkin behaviour tests with dummy runtime#1982ifireball wants to merge 2 commits into
Conversation
Introduce defaults.runtime config, a scripted dummy runtime for deterministic E2E validation, and godog-based behaviour tests against live GitHub/GHA. Signed-off-by: Barak Korren <bkorren@redhat.com> Co-authored-by: Cursor <cursoragent@cursor.com>
Site previewPreview: https://e705e5ee-site.fullsend-ai.workers.dev Commit: |
Add required title field and ## Status section so ADR hooks pass in CI. Signed-off-by: Barak Korren <bkorren@redhat.com> Co-authored-by: Cursor <cursoragent@cursor.com>
ReviewFindingsHigh
Medium
Low
|
| if url == "" { | ||
| return fmt.Errorf("url_get requires a URL") | ||
| } | ||
| cmd := fmt.Sprintf("curl -sf %s -o /dev/null", shellQuote(url)) |
There was a problem hiding this comment.
[medium] command-injection
The run_command op passes op.Args directly to sandbox.Exec without sanitization or allowlisting. Other ops (read_file, url_get) use shellQuote for their arguments but run_command does not.
Suggested fix: Either remove run_command, restrict to an explicit allowlist of safe commands, or apply shellQuote consistently.
| case "", "claude": | ||
| r := ClaudeRuntime{} | ||
| return Backend{Runtime: r, Transcripts: r}, nil | ||
| case "dummy": |
There was a problem hiding this comment.
[medium] misconfiguration-guard
The dummy runtime is registered unconditionally in the production binary. Any org with defaults.runtime: dummy in config.yaml will bypass LLM inference with no guardrail preventing accidental or malicious activation on production orgs.
Suggested fix: Add a runtime environment guard (e.g., require FULLSEND_ALLOW_DUMMY_RUNTIME=1) so the dummy runtime cannot activate in production CI runs without explicit opt-in.
There was a problem hiding this comment.
you need to actually be able to commit to the org configuration to activate this! we really don't need extra protection layers beyond that. Someone that is able to commit there is able to override most stuff already.
| var script BehaviourScript | ||
| if err := yaml.Unmarshal(data, &script); err != nil { | ||
| return nil, fmt.Errorf("parsing behaviour script %s: %w", path, err) | ||
| } |
There was a problem hiding this comment.
[medium] exit-code-contract
DummyRuntime.Run returns exitCode=1 when ops fail but executeBehaviourScript always returns nil error. The caller only aborts on non-nil Go error, making failed ops non-fatal — diverging from ClaudeRuntime behavior.
Suggested fix: Return a Go error from Run when any operation fails, or document that dummy runtime exit code 1 is intentionally non-fatal.
|
About the @fullsend-ai-review "[misconfiguration-guard]" comment - I reject adding extra layers, users having to explicitly commit the "dummy" value into their configuration should be enough to convince us they know what they are doing if they do it. |
ralphbean
left a comment
There was a problem hiding this comment.
I think this needs a few changes before we can merge. See inline comments.
| @@ -0,0 +1,40 @@ | |||
| --- | |||
| title: "43. Behaviour tests with Gherkin and pluggable drivers" | |||
There was a problem hiding this comment.
[note] Heads up — PR #1816 also uses ADR number 0043 (for GitLab support). Whichever merges second will need renumbering. We have a renumber-adr skill that can help with that.
|
|
||
| "github.com/cucumber/godog" | ||
|
|
||
| gaci "github.com/fullsend-ai/fullsend/e2e/behaviour/drivers/ci/githubactions" |
There was a problem hiding this comment.
[important] The guide in behaviour-drivers.md says steps should only go through the driver interfaces, not concrete implementations. This import pulls in the githubactions package directly — FindBehaviourResults and FindOutputFile are artifact-parsing utilities that aren't GitHub Actions-specific. If we moved them to a shared package (maybe e2e/behaviour/results/ or onto the ci.Driver interface), a future Tekton driver wouldn't need to import the GitHub Actions package to parse results.
Does that seem right to you, or is there a reason to keep them coupled to the GHA driver?
| _ = os.RemoveAll(w.ArtifactDir) | ||
| } | ||
| empty := []byte("ops: []\n") | ||
| _ = w.SCM.CommitFile(ctx, w.Org, ".fullsend", world.BehaviourScriptRepoPath, "behaviour: clear dummy agent script", empty) |
There was a problem hiding this comment.
[important] If this CommitFile fails (GitHub API flake), the next scenario inherits the previous scenario's dummy script and produces wrong results — with no visible error. The World struct doesn't carry a logger, so there's no way to surface the failure right now. Could we at least add a Logf func(string, ...any) to World and log the error here? Silent cleanup failures in CI are rough to debug.
| if readErr != nil { | ||
| return readErr | ||
| } | ||
| found = data |
There was a problem hiding this comment.
[important] This overwrites found on every match without stopping the walk. If an artifact directory has multiple behaviour-results.json files, this returns whichever was walked last — which depends on filesystem ordering. Returning on first match (filepath.SkipAll) would make the behavior deterministic. Same issue in FindOutputFile below at line 240.
| } | ||
|
|
||
| func resolveSandboxPath(base, rel string) string { | ||
| if filepath.IsAbs(rel) || strings.HasPrefix(rel, sandbox.SandboxWorkspace) { |
There was a problem hiding this comment.
[moderate] This returns absolute paths as-is (/etc/passwd) and doesn't check that relative paths stay within base after filepath.Join canonicalizes ../ components. The sandbox is the outer containment boundary, so this isn't exploitable from outside — but the function's intent seems to be scoping to a directory. Adding a "resolved path must have base as prefix" check would be consistent with how downloadArtifact validates zip paths a few files over.
| _ = os.MkdirAll(filepath.Dir(outPath), 0o755) | ||
| rc, err := f.Open() | ||
| if err != nil { | ||
| continue |
There was a problem hiding this comment.
[moderate] Errors from f.Open() and io.ReadAll are silently continued past. If behaviour-results.json fails to extract, the test later fails with "not found" instead of the actual I/O error. Propagating the error here would make CI failures much easier to debug.
| } | ||
| } | ||
|
|
||
| func resolveWriteFixture(op BehaviourOperation) (dest string, content string, err error) { |
There was a problem hiding this comment.
[note] The error message says args should be dest_path, fixture_path, but parts[1] (the fixture path) is parsed and then discarded — only op.Content is used. If the fixture path is always ignored, the args format could just be the destination path. If there's an intent to load from disk later, a TODO comment would help a future reader.
| } | ||
|
|
||
| message := fmt.Sprintf("behaviour: set dummy agent script (%s)", time.Now().UTC().Format(time.RFC3339)) | ||
| if err := w.SCM.CommitFile(context.Background(), w.Org, ".fullsend", world.BehaviourScriptRepoPath, message, data); err != nil { |
There was a problem hiding this comment.
[note] Hardcoded ".fullsend" here and in CleanupScenario — forge.ConfigRepoName would be the safer reference.
| if-no-files-found: ignore | ||
| retention-days: 5 | ||
|
|
||
| behaviour: |
There was a problem hiding this comment.
[note] The existing e2e job uploads screenshots on failure (if: always()). This job doesn't have an equivalent artifact upload step. If a behaviour test fails in CI, there's no way to retrieve downloaded artifacts or logs for debugging. Might be worth adding one when you have the chance.
| t.Skipf("org %s not ready for behaviour tests: %v", org, err) | ||
| } | ||
|
|
||
| w := &world.World{ |
There was a problem hiding this comment.
[note] This single World is shared across all scenarios, which works for serial execution. godog supports --concurrency though — if someone ever passes that, scenarios would race on shared fields. Might be worth a comment noting the single-threaded assumption, or creating a new World per scenario in the Before hook.
waynesun09
left a comment
There was a problem hiding this comment.
Review squad findings (6 agents, deduplicated against 13 existing review threads). 1 original CRITICAL finding dropped as false positive (claude.go not modified in this PR). 5 new findings posted — 1 HIGH, 4 MEDIUM.
| return err | ||
| } | ||
|
|
||
| if err := verifyDummyExpectations(w, artifactDir); err != nil { |
There was a problem hiding this comment.
HIGH — Dummy/output assertions are no-ops due to step ordering
thenTriageWorkflowCompletes calls verifyDummyExpectations and verifyOutputExpectations here, but w.DummyExpectations and w.OutputExpectations are still empty at this point — they get populated by the later godog steps (the agent will succeed to ..., the agent will output ...) which run after this step returns.
In the feature file:
Then the triage workflow completes successfully ← runs verification here (empty lists)
And the agent will succeed to Emit triage JSON ← appends to DummyExpectations (too late)The for _, exp := range w.DummyExpectations loop iterates zero times and returns nil — every scenario passes regardless of actual results.
Suggestion: Move verification into its own step (e.g., Then the dummy agent expectations are met) placed after the assertion steps, or accumulate expectations from the table in the Given step and verify them here.
| return err | ||
| } | ||
| remoteDest := resolveSandboxPath(sandbox.SandboxWorkspace, dest) | ||
| mkdirCmd := fmt.Sprintf("mkdir -p $(dirname %s)", shellQuote(remoteDest)) |
There was a problem hiding this comment.
MEDIUM — Command substitution in shell command
mkdir -p $(dirname ...) uses shell command substitution. While the path is shellQuoted, the $(...) evaluates before the shell interprets the quoted argument. If remoteDest contains characters that survive shellQuote but interact with the subshell (or if a future caller passes an unquoted path), this could behave unexpectedly.
Safer to compute the directory in Go:
dir := filepath.Dir(remoteDest)
mkdirCmd := fmt.Sprintf("mkdir -p %s", shellQuote(dir))This also avoids the dirname dependency and is clearer about intent.
| return nil, fmt.Errorf("workflow %s run %d did not complete within deadline", workflowFile, triageRun.ID) | ||
| } | ||
|
|
||
| func (d *Driver) AssertNoWorkflow(ctx context.Context, owner, repo, workflowFile string, after time.Time) error { |
There was a problem hiding this comment.
MEDIUM — AssertNoWorkflow checks once with no settle delay
This function queries the workflow runs list once and returns immediately. If the workflow dispatch is still propagating through GitHub's API (eventual consistency), this will falsely pass. A short polling window (e.g., 3 checks over 15-30s) would guard against the race.
for i := 0; i < 3; i++ {
// check for unexpected runs
time.Sleep(10 * time.Second)
}| if-no-files-found: ignore | ||
| retention-days: 5 | ||
|
|
||
| behaviour: |
There was a problem hiding this comment.
MEDIUM — Behaviour job inherits workflow-level id-token: write permission
The workflow has permissions: { id-token: write, contents: read } at the top level. The behaviour job doesn't need OIDC tokens (it runs make behaviour-test with a PAT), but it inherits id-token: write anyway because it doesn't declare its own permissions: block.
Add an explicit override to follow least-privilege:
behaviour:
runs-on: ubuntu-latest
permissions:
contents: read
timeout-minutes: 30| pollInterval = 15 * time.Second | ||
| dispatchWait = 12 * time.Minute | ||
| dispatchPoll = 5 * time.Second | ||
| dispatchMaxTry = 12 |
There was a problem hiding this comment.
MEDIUM — 60s dispatch detection window may be too short
dispatchMaxTry = 12 × dispatchPoll = 5s = 60s maximum wait for a workflow run to appear in the API after dispatching it. GitHub Actions dispatch-to-run visibility can take longer than 60s under load, especially for workflow_dispatch or repository_dispatch events.
Consider increasing to dispatchMaxTry = 24 (120s) or 36 (180s) to reduce flakiness in CI. The dispatchWait = 12min for completion is generous, but the detection window is the bottleneck.
Summary
defaults.runtimeorg config and--runtimeinstall flag with sharedruntime.ResolveFromConfig()selection infullsend run.behaviour-results.jsonfor deterministic assertions.e2e/behaviour/with pluggable GitHub/GitHub Actions drivers, triage scenarios, CI job, and ADR/docs.Reopened from upstream branch (replaces #1981) so CI jobs can access repository secrets.
Test plan
go test ./...go test -tags behaviour -c ./e2e/behaviour/...go test -tags e2e -c ./e2e/admin/...make behaviour-testagainst halfsend org pool withGITHUB_TOKENand--runtime dummyorgsE2E_BEHAVIOUR_GITHUB_TOKENsecretMade with Cursor