The S3 bucket behind https://cache.nixos.org contains more than 1 billion files requiring more than 600 TB of storage. This project is yet another attempt at garbage collecting that behemoth.
We started out in the summer of 2025 building a write-through proxy that would sit between Hydra and the S3 bucket during upload, parsing Narinfo files and storing the metadata in a Postgres db.
Combined with an historical import process based on the S3 Inventory Service, this would have allowed a real-time view of every store path within the cache and how they related to each other. From there, we could develop GC strategies.
We got pretty far along this path before pausing due to other commitments. When we returned to finish it, we quickly realised that a rewrite of the Hydra Queue Runner would introduce architectural changes under which a write-through proxy was no longer appropriate.
So we shifted gears and adapted the approach to work with S3 Notification Events instead to track changes to the bucket.
This lasted little more than a week before Simon Hauser pointed out in the bi-weekly queue runner meeting that "Hydra should have all this state".
We are currently investigating Simon's assertion. So far it seems that Hydra does indeed have a record of 99.5% of the store paths ever uploaded to the cache.
What it does not have (to the best of our understanding) is knowledge of how those paths relate to each other. We are now looking into what it would take to import that history and maintain it going forward.
In parallel, we have begun interrogating the inventory data and downloaded Narinfos we already have to see if there are any quick wins.
A proper write-up of those findings will be published in the near future, along with the underlying datasets, so that others can verify them and perhaps identify other opportunities.
> [!NOTE]
> This repository still retains some of the server functionality we developed, but is now mostly focused on inventory analysis and export.
The Simple GC command removes store paths from an S3-based binary cache. It reads a list of GC targets from a Parquet file and deletes the corresponding `.narinfo` and `.nar` files from S3.
```
narwal gc simple <input_file> <output_file> [flags]
```

Arguments:

- `input_file` - Parquet file containing `NarInfoRecord` entries identifying store paths to delete
- `output_file` - Parquet file where removal results will be written
The command requires AWS credentials to be configured in the usual ways. We have been using the profile defined in `./aws-config` for development and testing:
```
export AWS_CONFIG_FILE=$PWD/aws-config
export AWS_PROFILE=nixos-archeologist
```

After running `aws sso login` you can use the `narwal gc simple` command as normal.
The input file must be a Parquet file with the `NarInfoRecord` schema found in `pkgs/inventory/types.go`.
From each record, we construct two S3 keys to delete:
- `<hash>.narinfo` - The narinfo itself
- `nar/<file_hash>.nar<compression>` - The compressed NAR archive it refers to
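As a rough illustration, the key construction could look like the Go sketch below. The struct and field names (`narInfoRecord`, `StorePathHash`, `FileHash`, `Compression`) are stand-ins; the real `NarInfoRecord` schema lives in `pkgs/inventory/types.go` and may differ.

```go
package main

import (
	"fmt"
	"path"
)

// narInfoRecord is an illustrative stand-in for the NarInfoRecord schema;
// the real field names and types in pkgs/inventory/types.go may differ.
type narInfoRecord struct {
	StorePathHash string // the store path hash, e.g. "abc123..."
	FileHash      string // content hash of the compressed NAR, as used in the nar/ key
	Compression   string // e.g. "xz" or "zstd"; empty means uncompressed
}

// keysFor derives the two S3 keys a record points at:
// the narinfo itself and the compressed NAR it references.
func keysFor(r narInfoRecord) (narinfoKey, narKey string) {
	narinfoKey = r.StorePathHash + ".narinfo"

	narKey = path.Join("nar", r.FileHash+".nar")
	if r.Compression != "" {
		narKey += "." + r.Compression
	}
	return narinfoKey, narKey
}

func main() {
	r := narInfoRecord{StorePathHash: "abc123", FileHash: "def456", Compression: "xz"}
	narinfo, nar := keysFor(r)
	fmt.Println(narinfo) // abc123.narinfo
	fmt.Println(nar)     // nar/def456.nar.xz
}
```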
The output Parquet file contains `RemovalRecord` entries:

| Field | Description |
|---|---|
| `key` | S3 object key that was targeted |
| `store_path` | Full Nix store path (e.g., `/nix/store/abc...-hello-2.12.1`) |
| `not_found` | `true` if the object didn't exist in S3 |
| `error` | Error message if deletion failed |
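For reference, a `RemovalRecord` might map onto a Go struct along these lines. The parquet tags and field types shown here are illustrative; the actual definition in this repository may differ.

```go
package example

// removalRecord is an illustrative mirror of the RemovalRecord entries written
// to the output file; the real definition and its parquet tags may differ.
type removalRecord struct {
	Key       string `parquet:"key"`        // S3 object key that was targeted
	StorePath string `parquet:"store_path"` // full /nix/store/... path
	NotFound  bool   `parquet:"not_found"`  // true if the object was already gone
	Error     string `parquet:"error"`      // non-empty if the deletion failed
}
```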
> [!NOTE]
> A `NoSuchKey` error is not considered a failure. Multiple narinfos may refer to the same NAR archive, so the NAR can be removed early in the process and later attempts to remove it will fail with `NoSuchKey`. This also means it is safe to retry the removal several times with the same or an evolving input file.
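A minimal sketch of how such errors could be classified, assuming the AWS SDK for Go v2 is in use. The helper name and the exact set of error codes checked are our own choices; `NoSuchKey` is the one called out above.

```go
package example

import (
	"errors"

	"github.com/aws/smithy-go"
)

// isNotFound reports whether an S3 error just means "the object is already
// gone", which the GC treats as a non-failure rather than an error.
func isNotFound(err error) bool {
	var apiErr smithy.APIError
	if errors.As(err, &apiErr) {
		switch apiErr.ErrorCode() {
		case "NoSuchKey", "NotFound":
			return true
		}
	}
	return false
}
```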
On completion, the command outputs JSON stats to stdout:
```json
{
  "targets": {
    "nar_infos": 5000,
    "missing_in_s3": {
      "nars": 12,
      "nar_infos": 5
    }
  },
  "removals": {
    "nars": 5000,
    "nar_infos": 5000,
    "errors": 0
  }
}
```

The command exits with a non-zero status if:
- Any objects were missing in S3 (possible prior deletion or data inconsistency)
- Any removal errors occurred
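The conditions above boil down to a simple check over the stats. The sketch below assumes a `gcStats` struct that mirrors the JSON layout, which is only our guess at the internal representation.

```go
package main

import "os"

// gcStats mirrors the JSON stats printed on completion; the real types in the
// repository may differ.
type gcStats struct {
	Targets struct {
		NarInfos    int `json:"nar_infos"`
		MissingInS3 struct {
			Nars     int `json:"nars"`
			NarInfos int `json:"nar_infos"`
		} `json:"missing_in_s3"`
	} `json:"targets"`
	Removals struct {
		Nars     int `json:"nars"`
		NarInfos int `json:"nar_infos"`
		Errors   int `json:"errors"`
	} `json:"removals"`
}

// exitCode is non-zero when anything was missing in S3 or any removal failed.
func exitCode(s gcStats) int {
	if s.Targets.MissingInS3.Nars > 0 || s.Targets.MissingInS3.NarInfos > 0 || s.Removals.Errors > 0 {
		return 1
	}
	return 0
}

func main() {
	os.Exit(exitCode(gcStats{}))
}
```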
Use `--dry-run` to verify which files exist without deleting them:

```
narwal gc simple targets.parquet results.parquet --dry-run
```

In dry-run mode, the command checks for the presence of each target file using HEAD requests. Missing files are reported in the output and stats, allowing you to identify data inconsistencies before performing actual deletions.
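A HEAD-based existence check of this kind could look like the following sketch with the AWS SDK for Go v2. The bucket and key are placeholders, and the command's actual implementation may differ.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/s3"
	"github.com/aws/aws-sdk-go-v2/service/s3/types"
)

// objectExists issues a HEAD request and reports whether the object is
// present, treating a NotFound response as "missing" rather than an error.
func objectExists(ctx context.Context, client *s3.Client, bucket, key string) (bool, error) {
	_, err := client.HeadObject(ctx, &s3.HeadObjectInput{
		Bucket: aws.String(bucket),
		Key:    aws.String(key),
	})
	if err != nil {
		var nf *types.NotFound
		if errors.As(err, &nf) {
			return false, nil
		}
		return false, err
	}
	return true, nil
}

func main() {
	ctx := context.Background()
	cfg, err := config.LoadDefaultConfig(ctx) // honours AWS_PROFILE / AWS_CONFIG_FILE
	if err != nil {
		log.Fatal(err)
	}
	client := s3.NewFromConfig(cfg)

	// Bucket and key here are placeholders for illustration only.
	exists, err := objectExists(ctx, client, "example-binary-cache", "abc123.narinfo")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("exists:", exists)
}
```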