C/C++ has no single package registry, so this ecosystem is assembled: it
unifies Debian and Homebrew C/C++ packages — joined at the
Repology canonical-project level — and uses
OSS-Fuzz as a security signal. There is no
sources/cpp.md; this component page is the cpp pipeline's home, and the
underlying sources are documented separately:
debian, homebrew,
repology, ossfuzz.
| Source | Data collected | Raw location |
|---|---|---|
| Debian popcon (via Wayback) | install-base counts (downloads proxy) | data/sources/debian/raw/downloads.csv |
| Debian Packages.xz | Depends+Pre-Depends (runtime) edges, homepage, vcs_browser, section | data/sources/debian/raw/{dependencies,package-metadata}.csv |
| Debian UDD | C/C++ classification via debtags | data/sources/debian/raw/cpp-packages.csv |
| Homebrew formula API | formula deps (runtime+build), homepage, source_url, license, language | data/sources/homebrew/raw/{formulas,dependencies}.csv |
| Homebrew analytics (via Wayback) | 365-day install counts (downloads proxy) | data/sources/homebrew/raw/downloads.csv |
| Repology | canonical project name + upstream repo URLs | data/sources/repology/packages.csv |
| OSS-Fuzz | fuzz-tested project list + main_repo | data/sources/ossfuzz/projects.csv |
Both download proxies come from sparse Wayback snapshots — see the per-source docs for the snapshot-coverage caveats. No authentication required.
src/sources/cpp/process_data.py joins Debian + Homebrew on the Repology canonical
name, then:
- Downloads — MAX within each ecosystem (avoid double-counting variants like boost1.74/boost1.81), SUM across ecosystems (Debian and Homebrew are disjoint user populations).
- Dependencies — union of runtime-only project→project edges (see below).
- is_cpp — true if any constituent binary/formula is flagged C/C++.
- Top selection — within the 95% cumulative download mass of either Debian or Homebrew.
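
The download aggregation and top selection above can be sketched roughly as follows. This is a minimal illustration, not the code in process_data.py: the function names, the row layout, and the handling of the package that crosses the 95% boundary are all assumptions.

```python
from collections import defaultdict


def aggregate_downloads(rows):
    """rows: iterable of (project, ecosystem, package, downloads) tuples.

    The tuple layout is hypothetical; the real CSV columns may differ.
    """
    # MAX within each (project, ecosystem): versioned variants of the same
    # project (e.g. boost1.74 / boost1.81) overlap, so keep the largest.
    per_ecosystem = defaultdict(float)
    for project, ecosystem, _package, downloads in rows:
        key = (project, ecosystem)
        per_ecosystem[key] = max(per_ecosystem[key], downloads)

    # SUM across ecosystems: Debian and Homebrew users are treated as disjoint.
    totals = defaultdict(float)
    for (project, _ecosystem), downloads in per_ecosystem.items():
        totals[project] += downloads
    return dict(totals)


def top_by_share(downloads, share=0.95):
    """Names that cover the first `share` of total downloads, largest first.

    The name that crosses the threshold is included (an assumption).
    """
    total = sum(downloads.values())
    selected, running = [], 0.0
    ranked = sorted(downloads.items(), key=lambda kv: kv[1], reverse=True)
    for name, count in ranked:
        if running >= share * total:
            break
        selected.append(name)
        running += count
    return selected
```

Under this sketch, "either Debian or Homebrew" would be the union of the two per-ecosystem selections, for example `set(top_by_share(debian)) | set(top_by_share(homebrew))`, where `debian` and `homebrew` are hypothetical name-to-downloads dicts.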
The cpp dep tree contains runtime project→project edges only. Build-time tooling
(cmake, pkgconf, autoconf, gettext, …) and Debian Recommends/Suggests do not
propagate PageRank. Two filters combine:
| Source | Collected at fetch | Filter applied by cpp |
|---|---|---|
| Debian (fetch_debian_data.py) | Depends + Pre-Depends only — already runtime-only; Build-Depends/Recommends/Suggests not collected | none (all of it is used) |
| Homebrew (fetch_homebrew_data.py) | both runtime and build types stored in raw deps | cpp/process_data.py:277 — if dep_type != "runtime": continue |
The type column in data/sources/cpp/dependency-tree.csv is uniformly
"declared" — the cpp pipeline's own term for "runtime dep declared by either
ecosystem", not a faithful copy of the source-side type. Consequence: PageRank
reflects who runs with whom, not who builds whom, so build infrastructure
(cmake, pkgconf) is undervalued relative to its real load-bearing role.
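
How the two filters combine can be sketched as below. Only the `dep_type != "runtime"` check and the uniform "declared" type come from the text above; the function name, the input shapes, the `to_project` mapping, and the dropping of self-edges are assumptions made for illustration.

```python
def runtime_edges(debian_deps, homebrew_deps, to_project):
    """Build project -> project runtime edges from both ecosystems.

    debian_deps:   iterable of (package, dependency) pairs (hypothetical shape)
    homebrew_deps: iterable of (formula, dependency, dep_type) triples
    to_project:    dict mapping (ecosystem, name) -> canonical project name
    """
    edges = set()

    # Debian rows are already runtime-only (Depends + Pre-Depends), so no filter.
    for package, dependency in debian_deps:
        src = to_project.get(("debian", package))
        dst = to_project.get(("debian", dependency))
        if src and dst and src != dst:  # skipping self-edges is an assumption
            edges.add((src, dst))

    # Homebrew stores both types in raw deps; keep runtime edges only.
    for formula, dependency, dep_type in homebrew_deps:
        if dep_type != "runtime":
            continue
        src = to_project.get(("homebrew", formula))
        dst = to_project.get(("homebrew", dependency))
        if src and dst and src != dst:
            edges.add((src, dst))

    # Every surviving edge gets the pipeline's own uniform type label.
    return sorted((src, dst, "declared") for src, dst in edges)
```

Because the edges are unioned at project level, the same dependency declared by both Debian and Homebrew yields a single edge.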
After unification, cpp uses the shared scoring mechanics (download-weighted
PageRank α = 0.85, then A/B/C/D cumulative-share cutoffs — see
value.md). Orchestrated by src.value.cpp_pipeline, which runs
the Debian → Homebrew → Repology sub-pipelines, then the cpp aggregation.
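
For orientation, here is one way a download-weighted PageRank with α = 0.85 could look. It assumes the weighting enters through the teleport vector (a personalized PageRank) and that edges point from a package to its dependency; the shared implementation described in value.md may weight differently, so treat this as an illustrative sketch only. The function name and the fixed iteration count are hypothetical.

```python
def weighted_pagerank(edges, downloads, alpha=0.85, iterations=100):
    """Power-iteration PageRank with a download-proportional teleport vector.

    edges:     iterable of (package, dependency) pairs; rank flows to dependencies
    downloads: dict of name -> download count (missing names count as 0)
    """
    edges = list(edges)
    nodes = sorted({n for edge in edges for n in edge} | set(downloads))
    total = sum(downloads.get(n, 0.0) for n in nodes)
    if total > 0:
        teleport = {n: downloads.get(n, 0.0) / total for n in nodes}
    else:
        # No download data at all: fall back to a uniform teleport vector.
        teleport = {n: 1.0 / len(nodes) for n in nodes}

    out = {n: [] for n in nodes}
    for src, dst in edges:
        out[src].append(dst)

    rank = dict(teleport)
    for _ in range(iterations):
        # Rank held by nodes without outgoing edges is redistributed
        # along the teleport vector so the total stays at 1.
        dangling = sum(rank[n] for n in nodes if not out[n])
        nxt = {n: (1 - alpha + alpha * dangling) * teleport[n] for n in nodes}
        for src in nodes:
            if out[src]:
                share = alpha * rank[src] / len(out[src])
                for dst in out[src]:
                    nxt[dst] += share
        rank = nxt
    return rank
```

In this formulation a library with zero recorded downloads still earns rank from the packages that depend on it, which is the property the runtime-only dep tree relies on.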
C / C++ (Debian + Homebrew + Repology)
├── debian_avg_downloads ← Debian popcon (Wayback snapshots) [2021–2025]
├── homebrew_avg_downloads ← Homebrew analytics (Wayback) [2021–2025]
├── downloads_score ← derived (debian+homebrew composite) [2021–2025]
├── dep edges (package→dep)← Debian Packages.xz (Depends/Pre-) [most recent]
│ + Homebrew formula.json (runtime) [most recent]
├── pagerank ← derived [2021–2025]
├── value_class ← derived [2021–2025]
└── package→repo ← Repology project URLs [most recent]
- Value — value_class feeds the class_cpp column of data/value/value.csv.
- Risk / Eligibility — these stages key off github_repo, and cpp identity is GitHub-only. Many flagship cpp upstreams live off GitHub — glibc (sourceware.org), gcc (Savannah), glib (gitlab.gnome.org), mpfr (gitlab.inria.fr), curl (curl.se) — so they carry a git_url in value.csv but github_repo="", and slip out of Risk and Eligibility. Coverage jumps once non-GitHub Git hosts are counted: 26% → 41% of results overall, and A+B 32% → 95% (most non-GitHub upstreams are the load-bearing A/B libraries).
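
The gap between "has a GitHub repo" and "has any Git URL" can be illustrated with a small helper. This is a sketch under assumptions: the function names are hypothetical, and the pipeline's actual URL normalization may accept more host variants or URL forms than shown here.

```python
from urllib.parse import urlparse


def github_repo(git_url):
    """Return 'owner/repo' for github.com URLs, or '' for any other host."""
    parsed = urlparse(git_url)
    host = (parsed.hostname or "").lower()
    if host not in ("github.com", "www.github.com"):
        return ""
    parts = [p for p in parsed.path.split("/") if p]
    if len(parts) < 2:
        return ""
    repo = parts[1][:-4] if parts[1].endswith(".git") else parts[1]
    return f"{parts[0]}/{repo}"


def coverage(git_urls):
    """Return (share with a GitHub repo, share with any Git URL)."""
    urls = list(git_urls)
    if not urls:
        return 0.0, 0.0
    with_github = sum(1 for url in urls if github_repo(url))
    with_any = sum(1 for url in urls if url)
    return with_github / len(urls), with_any / len(urls)
```

A project hosted on gitlab.gnome.org or sourceware.org counts toward the second share but not the first, which is exactly the population that slips out of Risk and Eligibility.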
In data/sources/cpp/:
- raw/packages.csv — per-project join with aggregated signals
- top-packages.csv — top C/C++ projects by download mass
- dependency-tree.csv — runtime project→project edges (type="declared")
- github-repos.csv — project→GitHub-repo mappings
- results.csv — all dep-tree projects with pagerank + value_class
uv run python -m src.sources.cpp.process_data [--top-share F] [--include-non-cpp]

Carried from the cross-ecosystem tables in value.md:
| Stage | Count |
|---|---|
| Top packages (95% downloads) | 1,643 |
| After dep tree | 2,648 |
| Results | 1,882 |
| With GitHub repo | 482 (26%) |
| With any Git URL | 770 (41%) |
Results (1,882) < After dep tree (2,648) because the is_cpp filter drops
language-agnostic distro packages that rode in as dependencies.
| Class | A | B | C | D | Total |
|---|---|---|---|---|---|
| Packages | 10 | 82 | 291 | 1,499 | 1,882 |
| Repos (value.csv) | 10 | 81 | 291 | 1,491 | — |
A+B repos: 32% have a GitHub repo, 95% have some Git URL.
- Runtime-only dep tree — build infrastructure (cmake, pkgconf) is undervalued; PageRank reflects runtime coupling, not build coupling.
- GitHub-only identity downstream — Risk/Eligibility miss non-GitHub upstreams even though value.csv now exposes their git_url. Fully fixing this needs per-host adapters (GitLab API, Savannah, sourceware) for license/EOL/contributor checks.
- is_cpp drops — language-agnostic distro packages are filtered out of results.csv, so the cpp result set is smaller than its raw dep tree.
- Wayback-derived installs — both download proxies have sparse/truncated snapshots (see debian / homebrew).