Merged
@@ -0,0 +1,132 @@
id: ca-152
title: '`actions/setup-go` Cache Restore Fails With "Cannot open: File exists" When `go.mod` Has Mismatched `go` and `toolchain` Directives'
category: caching-artifacts
severity: warning
tags:
- setup-go
- cache
- tar
- go-mod
- toolchain-directive
- go-1-21
- file-exists
- go-toolchain
patterns:
- regex: 'golang\.org/toolchain@.*Cannot open: File exists'
flags: 'i'
- regex: 'Cannot open: File exists.*toolchain@v0\.0\.1-go'
flags: 'i'
- regex: 'Warning: Failed to restore.*tar.*exit code 2'
flags: 'i'
- regex: 'go/pkg/mod/golang\.org/toolchain@.*Cannot open'
flags: 'i'
error_messages:
- "/usr/bin/tar: ../../../go/pkg/mod/golang.org/toolchain@v0.0.1-go1.21.1.linux-amd64/...: Cannot open: File exists"
- "/usr/bin/tar: Exiting with failure status due to previous errors"
- "Warning: Failed to restore: \"/usr/bin/tar\" failed with error: The process '/usr/bin/tar' failed with exit code 2"
root_cause: |
When a `go.mod` file specifies both a `go` version and a **different**
`toolchain` version (a Go 1.21+ feature), the Go toolchain automatically
downloads and installs the requested toolchain during the first `go`
invocation in the workflow:

```
go 1.21.0
toolchain go1.21.1
```

On the **first** `actions/setup-go` cache-enabled run:
1. `setup-go` installs Go 1.21.0 (the `go` directive version).
2. Go detects the `toolchain go1.21.1` directive and auto-downloads toolchain
1.21.1 from the module proxy, writing files to:
`$GOPATH/pkg/mod/golang.org/toolchain@v0.0.1-go1.21.1.linux-amd64/`
3. `setup-go`'s post-step saves the module cache (which now includes the
downloaded toolchain files) to the Actions cache.

On **subsequent** runs:
1. `setup-go` again installs Go 1.21.0. A `go` invocation during setup then
auto-downloads toolchain 1.21.1 again **before** the cache restore completes,
writing the toolchain files to disk.
2. `setup-go`'s cache restore step then tries to extract the saved cache
archive, which also contains the toolchain files. `tar` finds them
already on disk and fails with "Cannot open: File exists" (exit code 2).
3. A warning is emitted; the job continues but the cache restore is partial.

This is distinct from the Go 1.23 telemetry issue (`golang.org/x/telemetry`)
and the double-invocation issue — the trigger here is the `toolchain`
directive in `go.mod` causing automatic toolchain downloads that race with
cache restore.
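
To check whether a repository is exposed to this condition, the two
directives can be compared in a workflow step. This is a minimal sketch,
assuming `go.mod` sits at the repository root and that the step runs in bash
on a Linux runner; the step name is illustrative.

```yaml
- name: Report go and toolchain directives
  run: |
    # Read the first `go` and `toolchain` lines from go.mod.
    go_line="$(awk '$1 == "go" { print $2; exit }' go.mod)"
    toolchain_line="$(awk '$1 == "toolchain" { print $2; exit }' go.mod)"
    echo "go directive: ${go_line:-<none>}"
    echo "toolchain directive: ${toolchain_line:-<none>}"
    # The toolchain directive carries a "go" prefix (go1.21.1), the go directive does not.
    if [ -n "$toolchain_line" ] && [ "$toolchain_line" != "go${go_line}" ]; then
      echo "::warning::go and toolchain directives differ; Go may auto-download ${toolchain_line}"
    fi
```

The step only reads `go.mod` with `awk`, so it does not invoke `go` and
cannot itself trigger a toolchain download.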
fix: |
**Option 1**: Upgrade to `actions/setup-go@v5` or later, which is reported to
handle the `toolchain` directive better and to avoid the race.

**Option 2**: Align the `go` and `toolchain` directives in `go.mod` so they
reference the same version, preventing the auto-download entirely.

**Option 3 (recommended)**: Disable Go's toolchain auto-download by setting
`GOTOOLCHAIN=local` in the workflow environment. This prevents Go from
downloading a different toolchain version and avoids the file-exists conflict.
Note that the build then runs with the Go version installed by `setup-go`,
not the one named in the `toolchain` directive.

**Option 4**: Pre-delete the conflicting toolchain directory before
`setup-go` restores the cache.
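
Whichever option is chosen, a short verification step after `setup-go` can
confirm which toolchain is active and whether any downloaded toolchain is
present in the module cache. The following is a sketch under the assumption
that the step runs in bash after `actions/setup-go`; the step name is
illustrative.

```yaml
- name: Confirm active Go toolchain
  env:
    GOTOOLCHAIN: local   # assumption: Option 3 is in use; drop this to observe the default behavior
  run: |
    go version           # version of the toolchain that actually runs
    go env GOTOOLCHAIN   # effective GOTOOLCHAIN setting
    # List any toolchains that were downloaded into the module cache.
    ls -d "$(go env GOMODCACHE)"/golang.org/toolchain@* 2>/dev/null \
      || echo "no downloaded toolchains in the module cache"
```

If the listing still shows a `toolchain@...` directory with
`GOTOOLCHAIN=local` set, it most likely came from a restored cache that was
saved by an earlier run.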
fix_code:
- language: yaml
label: 'Recommended: set GOTOOLCHAIN=local to prevent auto-download'
code: |
env:
GOTOOLCHAIN: local # Prevents Go from auto-downloading toolchain per go.mod directive

jobs:
build:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4

- uses: actions/setup-go@v5
with:
go-version-file: go.mod
cache: true

- run: go build ./...
- language: yaml
label: 'Alternative: align go and toolchain versions in go.mod'
code: |
# In go.mod — use the same version for both directives to avoid auto-download
# BEFORE (causes toolchain auto-download):
# go 1.21.0
# toolchain go1.21.1
#
# AFTER (no auto-download needed, no tar conflict):
# go 1.21.1
# toolchain go1.21.1

# Run locally to update:
# go get go@1.21.1
- language: yaml
label: 'Workaround: clean toolchain cache dir before setup-go'
code: |
- name: Remove pre-existing toolchain cache to prevent tar conflict
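run: |
# Module cache directories are read-only by default, so restore write
# permission before deleting. GOTOOLCHAIN=local keeps `go env` itself from
# triggering a toolchain download; fall back to ~/go if Go is not in PATH.
gopath="$(GOTOOLCHAIN=local go env GOPATH 2>/dev/null || echo "$HOME/go")"
chmod -R u+w "$gopath"/pkg/mod/golang.org/toolchain@* 2>/dev/null || true
rm -rf "$gopath"/pkg/mod/golang.org/toolchain@* 2>/dev/null || true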

- uses: actions/setup-go@v5
with:
go-version-file: go.mod
cache: true
prevention:
- 'Set `GOTOOLCHAIN=local` in workflow-level `env` when using `go.mod` with a `toolchain` directive; this prevents auto-download races with cache restore'
- 'Prefer `actions/setup-go@v5` or later which has improved handling of the `toolchain` directive'
- 'If you intentionally use different `go` and `toolchain` versions in `go.mod`, expect cache restore warnings from the second run onward; setting `GOTOOLCHAIN=local` avoids them but also stops Go from switching to the `toolchain` version'
- 'Run `go env GOTOOLCHAIN` locally to verify toolchain behavior before committing a `go.mod` with mismatched directives'
docs:
- url: 'https://github.com/actions/setup-go/issues/424'
label: 'actions/setup-go#424 — Tar errors on cache restore after toolchain installation (13 reactions)'
- url: 'https://go.dev/doc/toolchain'
label: 'Go Toolchain documentation — go and toolchain directives in go.mod'
- url: 'https://pkg.go.dev/cmd/go#hdr-Go_toolchain_selection'
label: 'Go toolchain selection — GOTOOLCHAIN env var behavior'
- url: 'https://github.com/actions/setup-go'
label: 'actions/setup-go README'
100 changes: 100 additions & 0 deletions errors/caching-artifacts/stale-state-cache-pagination-miss.yml
@@ -0,0 +1,100 @@
id: ca-151
title: '`actions/stale` State Cache Missed Due to Pagination — Processing Restarts From First Issue'
category: caching-artifacts
severity: warning
tags:
- actions-stale
- cache
- pagination
- state
- operations-per-run
- first-page
- checkpoint
patterns:
- regex: 'The saved state was not found.*process starts from the first issue'
flags: 'i'
- regex: 'Unable to reserve cache with key _state.*another job may be creating this cache'
flags: 'i'
- regex: 'Failed to save.*cache entry with the same key.*scope already exists'
flags: 'i'
- regex: 'Cache already exists.*Scope.*Key: _state'
flags: 'i'
error_messages:
- "The saved state was not found, the process starts from the first issue."
- "Failed to save: Unable to reserve cache with key _state, another job may be creating this cache. More details: Cache already exists. Scope: refs/heads/master, Key: _state, Version: ..."
- "Received non-retryable error: Failed request: (409) Conflict: cache entry with the same key, version, and scope already exists"
root_cause: |
`actions/stale` uses the GitHub Actions cache API to persist its processing
checkpoint between runs (stored under the key `_state`). To check whether
this cache entry already exists before a run, the action calls
`checkIfCacheExists`, which lists the repository's caches using the default
page size of 30 entries — **without filtering by key or ref**.

If a repository accumulates more than 30 cache entries (common on active
repos with many matrix jobs, dependency caches, or build caches), the
`_state` entry may be pushed to page 2 or later of the API results. The
action only inspects the first page, so `_state` is treated as absent.

As a result, on each run:
1. The state is reported missing → stale processing restarts from issue #1
instead of continuing from the saved checkpoint.
2. At the end of the run, when the action tries to save the new state,
a 409 Conflict is returned because the `_state` cache key already exists
from the previous run (it was not found, so it was never deleted first).

This bug was reported in actions/stale#1136. The underlying fix requires
passing `key` and `ref` query parameters to the list-caches API call so
only the relevant cache entry is checked, regardless of pagination.
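
The difference between the unfiltered and the filtered lookup can be
reproduced from a workflow step with the GitHub CLI. This is a sketch, not
the action's own code: it assumes the `gh` CLI is available on the runner,
that the token may read Actions caches, and that `github.ref` matches the
scope under which `_state` was saved.

```yaml
- name: Look up the _state cache entry by key
  env:
    GH_TOKEN: ${{ github.token }}
  run: |
    # key and ref are query parameters of the list-caches endpoint.
    # key matches by prefix, so select the exact key afterwards.
    gh api "repos/${{ github.repository }}/actions/caches?key=_state&ref=${{ github.ref }}" \
      --jq '.actions_caches[] | select(.key == "_state") | "\(.id) \(.key) \(.ref)"'
```

With the filter applied, the result no longer depends on how many other
cache entries the repository holds.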
fix: |
**Short-term workaround** — manually delete the `_state` cache entry from
the repository's Actions cache list before the next run:
1. Open the repository's **Actions** tab, select **Caches** under Management, and delete any `_state` cache entry.
2. The next `actions/stale` run will start fresh without the 409 conflict.

**Medium-term** — pin to the community fork that has the pagination fix:
`itchyny/actions-stale` includes a correct `checkIfCacheExists` implementation.

**Long-term** — upgrade `actions/stale` when a release incorporating the
PR #1152 pagination fix is published.
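
Listing and deleting cache entries with the workflow token requires the
`actions: write` permission. A minimal sketch of the permissions block
follows; the job name and the `actions/stale` inputs are illustrative and
taken from the examples in this entry.

```yaml
permissions:
  actions: write        # list and delete cache entries such as _state
  issues: write         # label, comment on, and close issues
  pull-requests: write  # label, comment on, and close pull requests

jobs:
  stale:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/stale@v10
        with:
          repo-token: ${{ github.token }}
          days-before-stale: 60
          days-before-close: 7
```

Without `actions: write`, a cache-deletion step is likely to fail with a
permission error, leaving the 409 conflict in place.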
fix_code:
- language: yaml
label: 'Workaround: delete _state cache via GitHub CLI before stale runs'
code: |
# Add this as a step before actions/stale in your workflow
- name: Clear stale action state cache
env:
GH_TOKEN: ${{ github.token }}
run: |
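# Filter by key server-side so _state is found even when it is not on the first page of results
gh api "repos/${{ github.repository }}/actions/caches?key=_state" \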
--jq '.actions_caches[] | select(.key == "_state") | .id' \
| xargs -I{} gh api --method DELETE \
repos/${{ github.repository }}/actions/caches/{}
- language: yaml
label: 'Workaround: pin to community fork with pagination fix'
code: |
- uses: itchyny/actions-stale@0980a21d84c23bd4d8c62b0958f47f25822286f2
with:
repo-token: ${{ github.token }}
days-before-stale: 60
days-before-close: 7
operations-per-run: 30
- language: yaml
label: 'Long-term: watch for actions/stale release with PR #1152 fix'
code: |
# When actions/stale releases a version including PR #1152, upgrade:
- uses: actions/stale@v10 # bump to the version that includes PR #1152
with:
repo-token: ${{ github.token }}
days-before-stale: 60
days-before-close: 7
prevention:
- 'If a repository has many cache entries (matrix jobs, large mono-repos), periodically clean up unused caches to keep the total count under 30 to avoid stale state pagination issues'
- 'Monitor `actions/stale` runs for the "saved state was not found" message — it indicates the checkpoint is being ignored and processing is restarting unnecessarily'
- 'Set `operations-per-run` high enough that stale can complete in a single run, eliminating the need for cross-run state persistence'
docs:
- url: 'https://github.com/actions/stale/issues/1136'
label: 'actions/stale#1136 — State restoration fails if a repo has many caches'
- url: 'https://github.com/actions/stale/pull/1152'
label: 'actions/stale#1152 — Fix checkIfCacheExists to use key and ref filters'
- url: 'https://docs.github.com/en/rest/actions/cache#list-github-actions-caches-for-a-repository'
label: 'GitHub REST API — list caches for a repository'
@@ -0,0 +1,121 @@
id: sf-230
title: '`actions/download-artifact` v4 Azure Blob Request Timeout — Step Exits 0 Despite Download Failure'
category: silent-failures
severity: silent-failure
tags:
- download-artifact
- azure-blob-storage
- timeout
- silent-failure
- intermittent
- matrix
- v4
patterns:
- regex: 'Unable to download and extract artifact: Request timeout'
flags: 'i'
- regex: 'Unable to download artifact\(s\):.*Request timeout'
flags: 'i'
- regex: 'Unexpected HTTP response from blob storage: 503'
flags: 'i'
- regex: 'Unable to download and extract artifact: Unexpected HTTP response from blob storage'
flags: 'i'
error_messages:
- "Error: Unable to download artifact(s): Unable to download and extract artifact: Request timeout: /actions-results/..."
- "Error: Unable to download artifact(s): Unable to download and extract artifact: Unexpected HTTP response from blob storage: 503 The server is busy."
root_cause: |
`actions/download-artifact@v4` changed its storage backend from Azure Blob
Storage legacy API to the new "actions-results" service. During the v4 early
rollout (late 2023–early 2024), the new backend exhibited intermittent
connection timeouts and 503 responses under load, particularly in matrix
jobs where multiple jobs simultaneously download the same artifact.

The silent-failure aspect: when the download times out, the step may report
**exit code 0** (success) instead of failing the job. At least one confirmed
case shows the step always succeeds regardless of whether the artifact was
actually downloaded, leaving downstream steps operating on a missing or
empty artifact directory without any workflow failure signal.

Additional HTTP errors observed from the same backend instability:
- `503 The server is busy` — Azure Blob Storage overloaded
- `409 Public access is not permitted on this storage account` — wrong
endpoint or storage account configuration for the new v4 backend

These are transient infrastructure errors on GitHub's side, not caused by
workflow misconfiguration. The issue was most prevalent in December 2023
when v4 was first released, but intermittent timeouts continue to occur
under high concurrency.
fix: |
1. **Retry on failure**: The most reliable fix is to re-run the failed job.
The issue is transient and typically resolves on retry.

2. **Explicit failure check**: After downloading, verify that the expected
files actually exist before proceeding, so that a silent failure fails the job.

3. **Reduce concurrent downloads**: If many matrix jobs download the same
artifact simultaneously, stagger them with `max-parallel` to reduce
pressure on the storage backend.

4. **Version consistency**: Ensure upload and download actions use the
same major version (v4 with v4, v3 with v3) — mismatched versions can
also cause download failures that manifest as timeouts.
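
Items 1 and 2 can be combined so that a second download attempt runs
automatically when the first one leaves nothing behind. The following is a
sketch, assuming a bash shell on a Linux runner and the `my-artifact` and
`dist/` names used elsewhere in this entry; the step names and the `check`
step id are illustrative.

```yaml
- name: Download artifact
  uses: actions/download-artifact@v4
  continue-on-error: true   # let the check below decide whether to retry
  with:
    name: my-artifact
    path: dist/

- name: Check that the download produced files
  id: check
  run: |
    # Treat a missing or empty directory as a failed download.
    if [ -n "$(ls -A dist/ 2>/dev/null)" ]; then
      echo "ok=true" >> "$GITHUB_OUTPUT"
    else
      echo "ok=false" >> "$GITHUB_OUTPUT"
    fi

- name: Download artifact (second attempt)
  if: steps.check.outputs.ok != 'true'
  uses: actions/download-artifact@v4
  with:
    name: my-artifact
    path: dist/

- name: Fail if the artifact is still missing
  run: |
    if [ -z "$(ls -A dist/ 2>/dev/null)" ]; then
      echo "::error::Artifact directory dist/ is missing or empty after retry"
      exit 1
    fi
```

Checking the directory contents rather than the step outcome covers both the
reported error case and the case where the step exits 0 without downloading
anything.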
fix_code:
- language: yaml
label: 'Add file existence verification after download to catch silent failures'
code: |
- name: Download artifact
uses: actions/download-artifact@v4
with:
name: my-artifact
path: dist/

# Explicitly verify download succeeded — surfaces silent timeout failures
- name: Verify artifact download
run: |
if [ ! -f dist/expected-file.txt ]; then
echo "::error::Artifact download failed silently — expected file missing"
exit 1
fi
- language: yaml
label: 'Limit parallel matrix downloads to reduce backend pressure'
code: |
jobs:
test:
strategy:
matrix:
os: [ubuntu-latest, windows-latest, macos-latest]
max-parallel: 2 # Stagger artifact downloads across matrix legs
steps:
- uses: actions/download-artifact@v4
with:
name: build-output
path: dist/
- language: yaml
label: 'Ensure version consistency between upload and download'
code: |
jobs:
build:
steps:
- uses: actions/upload-artifact@v4 # ✅ v4
with:
name: my-artifact
path: dist/

test:
needs: build
steps:
- uses: actions/download-artifact@v4 # ✅ must match upload version
with:
name: my-artifact
path: dist/
prevention:
- 'Always verify artifact file existence after download in critical workflows — do not assume exit 0 means the file is present'
- 'Pin upload and download artifact action versions to the same major version to prevent backend API mismatches'
- 'Use `max-parallel` on matrix jobs that download artifacts to avoid thundering-herd pressure on blob storage'
- 'Add `continue-on-error: false` explicitly and a post-download verification step to detect silent timeouts early'
docs:
- url: 'https://github.com/actions/download-artifact/issues/249'
label: 'actions/download-artifact#249 — Unable to download and extract artifact: Request timeout (115 reactions)'
- url: 'https://github.com/actions/download-artifact/blob/main/docs/MIGRATION.md'
label: 'actions/download-artifact migration guide — v3 to v4 breaking changes'
- url: 'https://docs.github.com/en/actions/writing-workflows/choosing-what-your-workflow-does/storing-workflow-data-as-artifacts'
label: 'GitHub Docs — Storing workflow data as artifacts'