Settle the mean-vs-median prediction-centering convention (PEtab v2 median vs legacy mean; reconcile the inconsistent per-family defaults) #424

@wshlavacek

Problem

PyBNF's prediction-centering convention — whether the deterministic model prediction is interpreted as a noise distribution's mean or median (CONTEXT.md "Prediction Centering"; ADR-0011 location axis, ADR-0024 native surface) — is currently inconsistent across families and ambiguous by default. There is a genuine tension to settle as policy before more capability lands (#419):

  • PEtab v2 hardcodes the median for every noise model. The exporter-first / importer dogfood (ADR-0023/0025/0026) needs a clear, defensible median story.
  • Backward compatibility pulls toward the legacy interpretations (the original chi_sq is mean-on-linear; lognormal_var was median-on-log; etc.).
  • The current state defaults some families to mean and others to median — a confusing mess that we should resolve deliberately rather than per-PR.

This issue is for discussion of the go-forward convention (and how to keep legacy configs working). It deliberately does not prescribe the answer; it gathers the evidence and the options. #419 (implement mean and median for every family) is the capability; this issue is the policy that decides the defaults and the config surface — #419's default/surface choices should gate on the outcome here.

Current state (the inconsistency, precisely)

Legacy objfuncs (each hardcodes its centering):

| objfunc | noise model | centering |
|---|---|---|
| chi_sq, chi_sq_dynamic | Gaussian() = (LINEAR) | mean |
| lognormal | Gaussian(LOG10, MEDIAN) | median |
| laplace | Laplace() = (LINEAR) | median |
| neg_bin, neg_bin_dynamic | NegBinomial() | mean (prediction is the mean) |

Native noise_model family tokens (_NOISE_FAMILIES) and class defaults:

| token / class | default centering |
|---|---|
| normal / gaussian, Gaussian.__init__ | mean |
| lognormal (Gaussian on LOG10) | median |
| laplace, Laplace.__init__ | median |
| neg_bin, NegBinomial | mean |

Global noise_location key (ADR-0024): optional, default unset → falls through to each family's class default (i.e. inherits the inconsistency).

Observations:

  • Two location-scale families ship opposite defaults: Gaussian → mean, Laplace → median.
  • The code comments (location.py, _NOISE_LOCATIONS in objective.py) state median is the default "consistent with PEtab v2", but Gaussian/chi_sq actually default to mean — docs and code disagree on what "the default" is.
  • On a linear scale the choice is invisible (mean = median for these symmetric families), so the inconsistency is latent today and only surfaces on log scales (lognormal, a future log-Laplace) and for neg_bin (asymmetric count family).
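To make the log-scale point concrete, here is a minimal sketch of how the two centerings diverge for a Gaussian on LOG10. It is standalone arithmetic, not PyBNF code: the function names are hypothetical, and it assumes log10(Y) ~ Normal(m, s²), for which the median of Y is 10^m and the mean is 10^(m + ln(10)·s²/2).

```python
import math

LN10 = math.log(10.0)


def log10_normal_median(m: float, s: float) -> float:
    """Median of Y when log10(Y) ~ Normal(m, s**2)."""
    return 10.0 ** m


def log10_normal_mean(m: float, s: float) -> float:
    """Mean of Y when log10(Y) ~ Normal(m, s**2)."""
    return 10.0 ** (m + 0.5 * LN10 * s * s)


def location_param(prediction: float, s: float, centering: str) -> float:
    """Return the log10-scale location m such that the chosen statistic
    (mean or median) of Y equals the model prediction.

    Hypothetical helper for illustration; not a PyBNF function.
    """
    if centering == "median":
        return math.log10(prediction)
    if centering == "mean":
        # Shift down so that the mean, not the median, lands on the prediction.
        return math.log10(prediction) - 0.5 * LN10 * s * s
    raise ValueError(f"unknown centering: {centering!r}")


if __name__ == "__main__":
    pred, s = 50.0, 0.3
    for centering in ("median", "mean"):
        m = location_param(pred, s, centering)
        print(centering, m, log10_normal_median(m, s), log10_normal_mean(m, s))
```

The two location parameters differ by ln(10)·s²/2 on the log10 scale, so the same data and the same prediction yield different likelihood values unless s = 0. On a linear scale with a symmetric family no such shift exists, which is why the inconsistency stays latent there.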

Why it matters

  • PEtab v2 interop: a round-trip export/import (the dogfood goal) must agree with PEtab's median convention, or silently shift the likelihood.
  • Legacy reproducibility: users porting old .conf files expect their fits unchanged; flipping a default changes results on log/count models.
  • Clarity: a single, documented convention (plus an explicit escape hatch) replaces "which family am I, and what does it happen to default to?"

Options to weigh (for discussion)

  • A. Median-everywhere default (PEtab v2-aligned) + explicit location = mean opt-in. Cleanest forward story; changes legacy behavior wherever mean ≠ median (log-scale Gaussian if anyone used a mean-centered log model; neg_bin).
  • B. Freeze the current per-family defaults, require nothing, just document. Zero behavior change; preserves the inconsistency permanently.
  • C. A config-level convention switch (e.g. centering_convention = petab | legacy, or piggyback on a broader petab_compat mode) that selects the default family-by-family: new PEtab v2 configs get median-everywhere, legacy configs are byte-identical to today. This is the "easily tell new-vs-old behavior" idea — it isolates the breaking change behind an explicit opt-in and lets a config self-declare its era.
  • D. No implicit default where it matters: make location mandatory-explicit for any (family × scale) where mean ≠ median (all log scales, neg_bin), so ambiguity can never resolve silently; keep the implicit default only where it's a provable no-op (linear symmetric families).

(These are not mutually exclusive — e.g. C + D: a convention switch and an explicit-required rule for the genuinely ambiguous cases.)
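As a discussion aid, the options above can be sketched as one resolution function. Everything here is hypothetical: resolve_location, the convention values "petab" and "legacy", and the require_explicit flag are illustrative names, not existing PyBNF API. The legacy table simply mirrors the per-token defaults listed under "Current state".

```python
from typing import Optional

# Per-token defaults as listed in "Current state" (assumed complete for this sketch).
LEGACY_DEFAULTS = {
    "normal": "mean",
    "gaussian": "mean",
    "lognormal": "median",
    "laplace": "median",
    "neg_bin": "mean",
}

VALID_LOCATIONS = ("mean", "median")


def is_ambiguous(token: str, scale: str) -> bool:
    """True where mean != median: any non-linear scale, or the count family."""
    return token == "neg_bin" or scale != "linear"


def resolve_location(
    token: str,
    scale: str,
    explicit: Optional[str] = None,
    convention: str = "legacy",
    require_explicit: bool = False,
) -> str:
    """Pick the prediction centering for one (family token, scale) pair.

    explicit          the user's location setting, if any (always wins)
    convention        "petab" -> median everywhere (option A behind a switch, i.e. C)
                      "legacy" -> today's per-family defaults (option B)
    require_explicit  option D: refuse to guess where mean != median
    """
    if explicit is not None:
        if explicit not in VALID_LOCATIONS:
            raise ValueError(f"location must be one of {VALID_LOCATIONS}")
        return explicit
    if require_explicit and is_ambiguous(token, scale):
        raise ValueError(
            f"location must be set explicitly for {token!r} on scale {scale!r}"
        )
    if convention == "petab":
        return "median"
    if convention == "legacy":
        return LEGACY_DEFAULTS[token]
    raise ValueError(f"unknown convention: {convention!r}")
```

Under this sketch, resolve_location("neg_bin", "linear") returns "mean" while the same call with convention="petab" returns "median", which is the one place where the switch changes an existing default. Combining convention with require_explicit=True corresponds to the C + D combination.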

Backward-compat analysis (what actually changes)

  • Linear-scale Gaussian/Laplace (chi_sq, laplace, sos-adjacent): nothing changes under any option (symmetric).
  • Log-scale Gaussian (lognormal): already median; stays median under A/C. Only a (currently non-existent) mean-centered log model would move.
  • neg_bin: defaults to mean today; option A would flip it to median (a real change to the count likelihood). This is the concrete decision flagged in #419 ("Mean/median prediction-centering for every noise model (complete the location axis; neg_bin median)").
  • A .conf era switch (C) makes all of the above opt-in, so no existing file changes unless it declares PEtab-v2 mode.
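To show why the neg_bin case is a real change rather than a relabeling, the sketch below computes the median of a negative binomial whose mean is fixed. It assumes the mean/dispersion parameterization with variance = mean + mean²/r; the parameterization PyBNF's NegBinomial uses is not stated in this issue, so treat that choice and the function names as assumptions.

```python
import math


def neg_bin_pmf(k: int, mean: float, r: float) -> float:
    """P(K = k) for a negative binomial with the given mean (> 0) and
    dispersion r (> 0), i.e. variance = mean + mean**2 / r."""
    p = r / (r + mean)  # success probability in the (r, p) form
    log_pmf = (
        math.lgamma(k + r)
        - math.lgamma(r)
        - math.lgamma(k + 1)
        + r * math.log(p)
        + k * math.log1p(-p)
    )
    return math.exp(log_pmf)


def neg_bin_median(mean: float, r: float) -> int:
    """Smallest k whose cumulative probability reaches 0.5."""
    cdf = 0.0
    k = 0
    while True:
        cdf += neg_bin_pmf(k, mean, r)
        if cdf >= 0.5:
            return k
        k += 1


if __name__ == "__main__":
    mean, r = 10.0, 2.0
    print("mean:", mean, "median:", neg_bin_median(mean, r))
```

For a right-skewed count distribution like this one the median sits below the mean, so a prediction of 10 interpreted as the median implies a different distribution than the same prediction interpreted as the mean. Median-centering a count family also needs a defined inverse (which mean produces a given median), which is part of the capability work tracked in #419.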

Acceptance / outcome

A written decision (ADR) that fixes: (1) the go-forward default per (family × scale), (2) the backward-compat mechanism (and whether .conf should carry an explicit era/convention marker), (3) the doc/code reconciliation so "the default" means one thing. #419 then implements the capability under that convention.

Relevant ADRs: 0011 (location axis), 0024 (native location surface + global noise_location + the "median default" intent), 0021 (per-observable noise), 0023/0025/0026 (PEtab v2 interop). Related: #419 (capability), and the per-observable noise work.
