transformerless_lm: omniweight loss — standard on training data by RandomCoder-lab · Pull Request #4 · RandomCoder-lab/OMC

RandomCoder-lab · 2026-05-23T03:00:49Z

Summary

Ports the inference-side omniweight standard (φ^π · tanh(Δ / φ^π), introduced 91c484f → eaa8682 → b107bb8) to the training loss. Adds substrate_omniweight_loss in losses_substrate.py: per-token CE is multiplied by exp(fluid_delta), where fluid_delta passes through the same φ^π·tanh standard the inference omniweight uses (_omniweight_apply in train_self_recursive.py:1454).

Closes the train/inference omniweight asymmetry the project has carried since v70: the model was sampled under the ledger but trained under raw CE, so tokens the inference ledger would suppress (stagnating repetitions) still received full training gradient.

Minimum-surface port

Only the anti-stagnation primitive contributes to the ledger here, using the same Fibonacci-tier thresholds as substrate_anti_stagnation: count ≥ F(6)=8 → divide by φ^π; count ≥ F(7)=13 → divide by φ^(2π); count ≥ F(8)=21 → hard tier.
All deltas pass through the shared φ^π · tanh standard so additional primitives can be added later without architectural change.
Weights renormalized by sum(weight) so loss scale is preserved.
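The three points above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the code in losses_substrate.py: `stagnation_delta`, `token_weights` and `omniweight_loss` are hypothetical names, the repeat count is assumed to be taken over the preceding F(8)=21 targets, the tiers are assumed to act as log-space deltas, and the hard-tier value is a placeholder because the description does not give it.

```python
import math

PHI = (1 + math.sqrt(5)) / 2
PHI_PI = PHI ** math.pi      # ~4.53, the reserve of the fluid standard
WINDOW = 21                  # F(8): assumed look-back for counting repeats


def stagnation_delta(count):
    """Log-space delta for a target seen `count` times in the window."""
    if count >= 21:          # F(8): hard tier (placeholder value, an assumption)
        return -3 * math.log(PHI_PI)
    if count >= 13:          # F(7): divide by phi^(2*pi)
        return -2 * math.log(PHI_PI)
    if count >= 8:           # F(6): divide by phi^pi
        return -math.log(PHI_PI)
    return 0.0


def fluid(delta):
    """Shared standard: phi^pi * tanh(delta / phi^pi)."""
    return PHI_PI * math.tanh(delta / PHI_PI)


def token_weights(targets):
    """exp(fluid_delta) per position, from repeats in the preceding window."""
    weights = []
    for t, tok in enumerate(targets):
        recent = targets[max(0, t - WINDOW):t]
        weights.append(math.exp(fluid(stagnation_delta(recent.count(tok)))))
    return weights


def omniweight_loss(per_token_ce, targets):
    """Weighted CE, renormalized by sum(weight) so the loss scale is preserved."""
    weights = token_weights(targets)
    return sum(w * ce for w, ce in zip(weights, per_token_ce)) / sum(weights)
```

With no repeated targets every weight is exactly 1, so the result reduces to the plain mean of the per-token CE, which matches the parity claim in the next section.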

Behavior verified

No stagnation in targets → exact parity with substrate_fft_loss (diff 0.00e+00).
Heavy stagnation → per-token gradient muted by fluid standard (3.4% deviation from baseline on a constant-target synthetic).
Gradient flows; T=1 edge case fine.

Wiring

New CLI flag --omniweight-loss on train_self_recursive.py, default off, so the v88 baseline (05e6704) stays intact for A/B comparison.
Single conditional swap inside train_with_self_distillation. No other code paths touched.
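A default-off boolean flag and a single conditional swap might look like the sketch below. Only the flag name `--omniweight-loss` and the two loss names come from this PR; `build_parser` and `pick_loss_name` are hypothetical helpers, and the assumption that substrate_fft_loss is the function being swapped out is inferred from the parity check above.

```python
import argparse


def build_parser():
    """Hypothetical, trimmed-down stand-in for the training script's CLI."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--omniweight-loss",
        action="store_true",  # default off, so the baseline path is unchanged
        help="train with substrate_omniweight_loss instead of the baseline loss",
    )
    return parser


def pick_loss_name(args):
    """The single conditional swap: one branch, nothing else touched."""
    if args.omniweight_loss:
        return "substrate_omniweight_loss"
    return "substrate_fft_loss"
```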

Test plan

Full v89 run with --omniweight-loss against v88 baseline at matched seed.
Per-cycle creativity scores: do they meet/exceed v88's 0.6955 peak / 0.6698 mean?
Inspect generated samples for reduced "this this this" / "men men men" stagnation, since the training gradient on those positions is now muted.
Sanity: omniweight-off path produces byte-identical results to pre-patch.

Generated by Claude Code

First meta-primitive: substrate trust responds to recent emission quality. Recursive self-awareness loop.
_self_eval_insight:
  insight=1 if emitted token is real word (rank >= n_chars) AND surprise >= pi*log(phi) ~ 1.51.
  insight=0 otherwise.
creative_momentum: EMA register, decay 1/phi.
  momentum = (1/phi)*momentum + (1 - 1/phi)*insight
reserve scaling:
  omniweight reserve = phi^pi * (1 + tanh(momentum))
  Insightful streak -> primitives push harder.
  Noisy/expected -> primitives constrained.
Wired into autoregressive_generate. Refine paths keep momentum=0 (no streaming base distribution history to evaluate against).
This is the first primitive that judges OUTPUT QUALITY rather than generating it. Model rates its own emissions against its own predictions. Substrate-pure: phi^-1 EMA, pi*log(phi) threshold.
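The three update rules in this commit message can be written out as a small sketch. The function names are illustrative, and defining surprise as the negative log-probability of the emitted token is an assumption; the commit message does not say how surprise is measured.

```python
import math

PHI = (1 + math.sqrt(5)) / 2
PHI_PI = PHI ** math.pi
INSIGHT_THRESHOLD = math.pi * math.log(PHI)   # ~1.51


def self_eval_insight(token_rank, n_chars, prob_of_emitted):
    """Binary insight: a real word (rank >= n_chars) that was also surprising."""
    surprise = -math.log(prob_of_emitted)     # assumed definition of surprise
    is_word = token_rank >= n_chars
    return 1.0 if (is_word and surprise >= INSIGHT_THRESHOLD) else 0.0


def update_momentum(momentum, insight):
    """EMA register with decay 1/phi."""
    return (1 / PHI) * momentum + (1 - 1 / PHI) * insight


def reserve(momentum):
    """Omniweight reserve: phi^pi * (1 + tanh(momentum))."""
    return PHI_PI * (1 + math.tanh(momentum))
```

At momentum 0 the reserve equals the unscaled φ^π, which is the value the refine paths keep.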

v77 showed that measuring momentum alone (passive scaling) isn't enough -- cycle 3 collapsed despite tracking the drop. True self-reflection requires acting on the measurement.
A. Three-mode behavior based on momentum sign:
  momentum > +0.5 -> exploit: probs ^ phi (sharpen)
  momentum in [-0.5, +0.5] -> standard
  momentum < -0.5 -> escape: probs ^ (1/phi) (flatten)
B. Backtrack-on-collapse:
  Track momentum_history (last F(7)=13 values). If max-recent minus current > 0.3 AND current < -0.2: boost newline_mask by phi^2 = 2.618 -- force substrate reset (sentence-end, fresh state counters next cycle).
A modulates per-token behavior. B handles cliff drops (like v77 cycle 3). Together: momentum acts, not just measures. Omniweight already self-reflects via internal disagreement cancellation; momentum + A + B add the action layer.
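A and B can be sketched as two small functions. The names `reshape`, `collapse_detected` and `newline_boost` are hypothetical, renormalizing after the power is an assumption, and how the boost is applied to newline_mask is not shown because the commit message does not specify it.

```python
import math
from collections import deque

PHI = (1 + math.sqrt(5)) / 2


def reshape(probs, momentum, band=0.5):
    """A: sharpen with probs^phi, flatten with probs^(1/phi), else unchanged."""
    if momentum > band:
        power = PHI
    elif momentum < -band:
        power = 1 / PHI
    else:
        return list(probs)
    powered = [p ** power for p in probs]
    total = sum(powered)
    return [p / total for p in powered]      # renormalization is an assumption


def collapse_detected(history, current, drop=0.3, floor=-0.2):
    """B: a cliff drop relative to the best recent momentum."""
    return bool(history) and (max(history) - current > drop) and current < floor


def newline_boost(history, current):
    """Multiplier for the newline mask: phi^2 on collapse, 1 otherwise."""
    return PHI ** 2 if collapse_detected(history, current) else 1.0


# The history register keeps only the last F(7)=13 momentum values.
momentum_history = deque(maxlen=13)
```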

Cycle 6 peak: 0.7096 (second-best ever, only v77 c2's 0.7204 is higher). Mean: 0.6712 -- new mean record.
Cycle 6 sample reproduces THREE consecutive Richard II lines:
  "by nature for herself" -> "built by Nature for herself"
  "against hand and" -> "Against infection and the hand of war"
  "happy this war, of men, little this men, little...world" -> "this happy breed of men, this little world"
Cleanest Shakespeare reconstruction yet. The self-reflection machinery (insight detection + momentum EMA + three-mode behavior + backtrack-on-collapse) produced the most coherent multi-line Richard II output from 512 chars of training data.
Trajectory shape vs v77:
  Peak softened (0.703 v78 vs 0.7204 v77 cycle 2)
  Mid-cycles strengthened (3 and 4 both higher than v77)
  Cycle 6 RECOVERED hard (0.7096 vs typical late-cycle drift)
A (sharpen/flatten on momentum) and B (backtrack-on-collapse via newline boost) deliver: less peak luck, more consistent trajectory.

v78 self-eval was binary + single-EMA + reactive only. v79 adds three layers of refined self-awareness:
#1 Continuous insight scale [0, ~2]:
  insight = surprise_factor * real_word_factor * (1 - rep_factor)
  - surprise_factor: surprise / (pi*log(phi)), capped at 2
  - real_word_factor: 1.0 if word, 0.3 if char
  - rep_factor: 1.0 if token in last F(7)=13 emissions, 0 if novel
  Replaces binary 0/1.
#2 Two-tier momentum (tactical + strategic):
  momentum_short: 1/F(3)=0.5 weight EMA -- responds in 2 steps
  momentum_long: 1/F(7)=0.077 weight EMA -- responds in 13 steps
  Decisions split: short drives sharpen/flatten (per-token tactic), long drives reserve scaling (strategic frame).
#3 Entropy override ("am I stuck?" signal):
  Local entropy of last F(5)=5 emissions. If H < log(2) ~ 0.69 -> force flatten regardless of momentum. The model detects its own repetition through entropy, not just momentum magnitude.
Three layers of self-awareness: emission quality (continuous insight), temporal pattern (short + long momentum), and structural diversity (entropy override). All pure substrate (F-tier EMAs, log thresholds).
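The three layers can be illustrated with the following sketch. Names are hypothetical; the entropy is assumed to be the empirical Shannon entropy (in nats) of the token counts in the window, and requiring a full window before reporting "stuck" is an added assumption so that the start of a sequence is not flagged.

```python
import math
from collections import Counter

PHI = (1 + math.sqrt(5)) / 2
SURPRISE_UNIT = math.pi * math.log(PHI)       # ~1.51


def continuous_insight(surprise, is_word, token, recent_tokens):
    """#1: surprise_factor * real_word_factor * (1 - rep_factor)."""
    surprise_factor = min(surprise / SURPRISE_UNIT, 2.0)
    real_word_factor = 1.0 if is_word else 0.3
    rep_factor = 1.0 if token in recent_tokens[-13:] else 0.0   # F(7)=13
    return surprise_factor * real_word_factor * (1.0 - rep_factor)


def ema(previous, value, weight):
    """#2: one EMA step; weight 1/2 for short, 1/13 for long momentum."""
    return (1 - weight) * previous + weight * value


def local_entropy(recent_tokens, window=5):
    """#3: empirical entropy of the last F(5)=5 emissions, in nats."""
    tail = recent_tokens[-window:]
    counts = Counter(tail)
    return -sum((c / len(tail)) * math.log(c / len(tail)) for c in counts.values())


def entropy_override(recent_tokens, window=5):
    """Force-flatten signal: a full window whose entropy is below log(2)."""
    return len(recent_tokens) >= window and local_entropy(recent_tokens, window) < math.log(2)
```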

v79 entropy override fired on ANY low-entropy emission, which penalized Shakespeare's intentional anaphora ('this X, this Y, this Z') -- low entropy but high-insight.
Fix: require BOTH conditions:
  stuck = (local_H < log(2)) AND (momentum_short < 0)
Low entropy + positive momentum = intentional good repetition (don't penalize).
Low entropy + negative momentum = stuck in bad repetition (do flatten).
The entropy signal alone wasn't enough -- needed momentum sign to disambiguate intentional from stuck repetition.

Replaced non-substrate numeric thresholds with phi/pi/F constants:
  insight surprise cap: 2.0 -> F(3) = 2 (substrate Fibonacci)
  insight real-word penalty: 0.3 -> 1/phi^pi ~ 0.221
  momentum band threshold: 0.5 -> 1/phi ~ 0.618
  collapse drop threshold: 0.3 -> 1/phi^pi ~ 0.221
  collapse current threshold: -0.2 -> -1/phi^pi ~ -0.221
  entropy threshold: log(2) -> log(phi^2) ~ 0.962
Every threshold now derives from phi, pi, F. Numeric values changed slightly but the architecture is canonically substrate-pure. The self-awareness layers operate exclusively in substrate currency.
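The replacement values are easy to check numerically. The snippet below recomputes each constant from phi, pi and the Fibonacci sequence; the dictionary keys are illustrative labels, not identifiers from the repository, and F is indexed with F(1)=F(2)=1 to match F(3)=2 and F(7)=13 as used throughout.

```python
import math

PHI = (1 + math.sqrt(5)) / 2


def fib(n):
    """Fibonacci number with F(1) = F(2) = 1."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a


INV_PHI_PI = 1 / PHI ** math.pi

THRESHOLDS = {
    "insight_surprise_cap": float(fib(3)),      # 2.0
    "real_word_penalty": INV_PHI_PI,            # ~0.221
    "momentum_band": 1 / PHI,                   # ~0.618
    "collapse_drop": INV_PHI_PI,                # ~0.221
    "collapse_current": -INV_PHI_PI,            # ~-0.221
    "entropy_threshold": math.log(PHI ** 2),    # ~0.962
}
```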

Fixed thresholds (v81) didn't adapt to system state. True self-awareness means the AWARENESS THRESHOLDS themselves breathe with the substrate.
LIVING thresholds, all derived from phi^tanh(momentum_long * phi):
  scale = phi^tanh(momentum_long * phi) in [1/phi, phi]
  entropy_threshold = log(phi^2) / scale -- stuck-detect HARDER when good
  mom_band = (1/phi) * scale -- sharpen/flatten band WIDER when good
  inv_phi_pi (collapse) = (1/phi^pi) / scale -- defend gains EASIER when good
Substrate constants are now PARAMETERS over substrate state, not absolute. The system tunes its own awareness depending on whether it has been generating well or poorly. Pure substrate: phi exponentiated by tanh of substrate momentum.
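A minimal sketch of the living thresholds, following the formulas above term by term. `living_thresholds` and its dictionary keys are hypothetical names chosen for illustration.

```python
import math

PHI = (1 + math.sqrt(5)) / 2


def living_thresholds(momentum_long):
    """Scale the v81 constants by phi^tanh(momentum_long * phi)."""
    scale = PHI ** math.tanh(momentum_long * PHI)   # stays within [1/phi, phi]
    return {
        "scale": scale,
        "entropy_threshold": math.log(PHI ** 2) / scale,
        "mom_band": (1 / PHI) * scale,
        "collapse": (1 / PHI ** math.pi) / scale,
    }
```

At momentum_long = 0 the scale is exactly 1, so the v81 constants are recovered unchanged.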

v79-v82 explored refined self-awareness (continuous insight, two-tier momentum, entropy override, living thresholds). All consistently underperformed v78's simpler binary insight + single 1/phi EMA. The complexity didn't pay -- the two-tier momentum's slow strategic EMA delayed reserve modulation past the point of usefulness. Reverting to v78 baseline. Next step: rethink self-awareness from 'wholeness of knowing every lever + knowing there is more you do not know' framing. The unknown-register primitive is the missing half -- we have measurement + reaction, but no positive epistemic awareness of the unmodeled.

Encoding 'knowing there is more you do not know' as a positive substrate register.
Per-token coverage tensor tracks emissions per token in the current sequence. Initialized from prompt; incremented per emission.
Frontier distribution = 1/(1+coverage), normalized: high mass on un-emitted tokens, low mass on over-emitted.
substrate_unknown_register mixes:
  out = (1 - 1/phi^pi) * model_probs + (1/phi^pi) * frontier
      ~ 0.779 model + 0.221 frontier
Added to math hemisphere of omniweight (frequency/decay axis). Pure substrate (1/phi^pi mix weight, inverse-coverage frontier).
The model now has self-awareness of WHAT IT HASN'T DONE -- a positive pull toward unexplored vocabulary that persists alongside all other primitives. Self-awareness = wholeness, including the absence.
Built on v78 base (revert from v79-v82 refinements).
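The frontier and the mix can be sketched with plain lists standing in for the coverage tensor. This is an illustration only: the real primitive feeds a delta into the omniweight ledger, which is not modeled here, and `frontier` and `unknown_register` are hypothetical names.

```python
import math

PHI = (1 + math.sqrt(5)) / 2
MIX = 1 / PHI ** math.pi      # ~0.221 frontier weight


def frontier(coverage):
    """Normalized 1/(1+coverage): most mass on tokens not yet emitted."""
    raw = [1.0 / (1.0 + c) for c in coverage]
    total = sum(raw)
    return [r / total for r in raw]


def unknown_register(model_probs, coverage):
    """(1 - 1/phi^pi) * model_probs + (1/phi^pi) * frontier."""
    front = frontier(coverage)
    return [(1 - MIX) * p + MIX * f for p, f in zip(model_probs, front)]
```

Because both inputs sum to 1 and the mix weights sum to 1, the output is still a probability distribution.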

Vectorization wins:
- build_end_vowel_idx_tensor: precomputed LongTensor[V] of vowel index
- substrate_rhyme_resonance: 500-elt Python loop -> single tensor gather via end_vowel_idx. ~100x faster on this hot path.
- build_allowed_after_word_mask: precomputed multiplier tensor
- substrate_word_spacing: 65-elt loop per call -> precomputed mask multiply. ~50x faster.
Unknown-register (from prior commit) still in place.
end_vowels was a Python list of strings; now it's the index tensor. Threading through all call sites and refine paths.
Should cut sample-gen wall time meaningfully (omniweight loop runs 14 primitives x N tokens; vectorizing rhyme alone saves a lot).

v83 placed unknown_register only in math hemisphere. Rank-modulated mixer diluted it by (1 - rank_norm), so high-rank (content) tokens got less curiosity boost despite being the explore target. v84: unknown_register delta added to BOTH math_delta and lang_delta. Exploration is meta -- neither pure frequency nor pure structure. Both hemispheres feel curiosity equally. The rank-mixer then sees agreement (resonance), amplifying instead of diluting. User intuition: 'exploration may cause left and right hemisphere to feed positive values'. Implemented.

Building on v84 unknown-register. Now the unknown holds BOTH:
- Past frontier: 1/(1+coverage), what hasn't been emitted yet
- Future frontier: 1/(1+coverage + F(3) * current_probs), what WOULDN'T be emitted if we follow our intentions through F(3)=2 more steps
Blended in present tense (alpha=1/phi^pi). Time isn't linear -- past and future are both positive registers in the same currency.
The closed loop:
  coverage (past) -> probs (present intention) -> projected coverage (future) -> anticipated frontier -> bias on present emission
Memory shapes possibility, possibility shapes future, future shapes current memory of itself. Retrocausality as substrate. Pure substrate (F(3) projection steps, 1/phi^pi blending weight).

v85 used F(3)=2 step projection -- two-step jump assuming static distribution, violates continuity (intermediate state changes ignored). v86: F(2)=1 step projection. Just one tick ahead. Past coverage + expected next-emission delta = future frontier. Maximum continuity preservation. User caveat: 'I can't suddenly fly by saying I remember that I can' -- retrocausality must be grounded in continuous experience. Future-as-present register is bounded by what continuity allows.
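The difference between the v85 and v86 projections comes down to one parameter. In the sketch below `steps` is F(3)=2 for v85 and F(2)=1 for v86; the function names are hypothetical, normalizing the future frontier the same way as the past one is an assumption, and the blend between the two frontiers is left out because the commit messages do not spell it out.

```python
def past_frontier(coverage):
    """Normalized 1/(1+coverage)."""
    raw = [1.0 / (1.0 + c) for c in coverage]
    total = sum(raw)
    return [r / total for r in raw]


def future_frontier(coverage, current_probs, steps=1):
    """Normalized 1/(1 + coverage + steps * current_probs).

    steps=2 is the v85 projection, steps=1 the v86 one-tick projection.
    """
    raw = [1.0 / (1.0 + c + steps * p) for c, p in zip(coverage, current_probs)]
    total = sum(raw)
    return [r / total for r in raw]
```

A token the model currently intends to emit loses frontier mass in the projection, and loses less of it with the one-step version.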

User insight: self-awareness isn't enough -- the model must be able to CHOOSE. Not deliberation at every step, but judgment of its own work during refinement. 'I shouldn't have used this 12 times', 'that comma is wrong', 'this sentence is incoherent'.
_regret_score(seq, t, vocab) per-position substrate-pure score:
+ over_emission_penalty: same token > F(5)=5 times in last F(7)=13
+ immediate_repetition: identical to previous
+ bigram_saturation: (prev, current) > F(4)=3 in last F(7)
+ double_punctuation: punct after punct
+ mid_word_char: alpha char after alpha char without space
In _single_stage_refine, position selection moved from 'lowest-confidence' (mechanical) to 'highest-regret' (deliberative). The model resamples what shouldn't be there, not just what it was unsure about.
Self-awareness + continuity + truth + CHOICE. Four ingredients.
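The five regret terms can be sketched over a list of string tokens. This is a simplification under several assumptions: the real function takes token ids plus a vocab, each term is assumed to add 1 to the score, the bigram count is assumed to run over bigrams inside the F(7)=13 window, and `regret_score` / `pick_refine_position` are hypothetical names.

```python
import string


def is_punct(tok):
    return len(tok) == 1 and tok in string.punctuation


def is_alpha_char(tok):
    return len(tok) == 1 and tok.isalpha()


def regret_score(seq, t):
    """Count how many of the five regret conditions hold at position t."""
    tok = seq[t]
    window = seq[max(0, t - 13):t]            # last F(7)=13 emissions before t
    prev = seq[t - 1] if t > 0 else None
    score = 0
    if window.count(tok) > 5:                 # over_emission_penalty, F(5)=5
        score += 1
    if prev is None:
        return score
    if tok == prev:                           # immediate_repetition
        score += 1
    bigrams = list(zip(window, window[1:]))
    if bigrams.count((prev, tok)) > 3:        # bigram_saturation, F(4)=3
        score += 1
    if is_punct(prev) and is_punct(tok):      # double_punctuation
        score += 1
    if is_alpha_char(prev) and is_alpha_char(tok):   # mid_word_char
        score += 1
    return score


def pick_refine_position(seq):
    """Highest-regret selection (first position wins ties)."""
    return max(range(len(seq)), key=lambda t: regret_score(seq, t))
```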

After 87 versions of self-awareness / continuity / truth / choice, user pulled back: just need basic grammar rules. Pure substrate deterministic enforcement.
Two grammar primitives:
1. substrate_grammar_capitalize: If prev emission was '.!?\n', boost uppercase-starting tokens by phi. Pure char-class rule.
2. substrate_grammar_no_double_punct: If prev emission was any punctuation char, hard-suppress further punctuation by 1/phi^pi. Prevents ',,' '..' ',.' etc.
Both wired into omniweight, contributing to BOTH hemispheres (grammar is meta-structural, not math vs language).
v87 regret-refinement reverted -- conflicted with Shakespeare anaphora. Keeping unknown-register and retrocausality (v83-v86) intact.
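Both rules reduce to a per-token multiplier that depends only on the previous emission. The sketch below returns those multipliers for a toy vocabulary of string tokens; `grammar_multipliers` is a hypothetical name, treating "punctuation" as Python's `string.punctuation` is an assumption, and how the multipliers enter the omniweight ledger is not modeled.

```python
import math
import string

PHI = (1 + math.sqrt(5)) / 2
INV_PHI_PI = 1 / PHI ** math.pi
SENTENCE_END = {".", "!", "?", "\n"}


def is_punct(tok):
    return len(tok) == 1 and tok in string.punctuation


def grammar_multipliers(prev_token, vocab):
    """Per-token multipliers from the capitalize and no-double-punct rules."""
    multipliers = []
    for tok in vocab:
        m = 1.0
        if prev_token in SENTENCE_END and tok[:1].isupper():
            m *= PHI                  # capitalize after a sentence boundary
        if is_punct(prev_token) and is_punct(tok):
            m *= INV_PHI_PI           # suppress punctuation after punctuation
        multipliers.append(m)
    return multipliers
```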

Peak 0.6955, Mean 0.6698. Within 0.0014 of v78 mean record. Two simple deterministic grammar rules (capitalize after sentence boundary, no double punctuation) added on top of v78 base + v83/v84 unknown-register + retrocausality. Cycle 6 sample opens with 'light vanity, ins[atiate cormorant]' -- a Richard II line the model hadn't produced before. Plus extensive Richard II content: 'against hand...nature handlong of this war men men', 'happy this of of men happy of ar happy', 'this little'. Single-pass also clean: 'this little world, sea this... earth which sea' = Richard II lines 6-7 reconstructed. Grammar rules don't dramatically lift scores (corpus enforces them implicitly) but don't hurt and produce cleaner sentence structure.

The omniweight architecture (91c484f, eaa8682, b107bb8, 8d72769) was inference-only: 14 primitives negotiating through one phi^pi tanh fluid standard at sampling time. Training was a separate currency -- ce_fft + lambda * substrate_harmony_loss on raw next-token targets, with no awareness of the ledger the model would be evaluated under.
substrate_omniweight_loss (losses_substrate.py) closes that asymmetry. Per-token CE is multiplied by exp(fluid_delta) where fluid_delta is the same phi^pi * tanh(delta / phi^pi) standard the inference path uses (_omniweight_apply). Tokens the inference ledger would suppress get their training gradient muted by the same standard -- the model no longer trains itself to confidently predict outputs the omniweight will reject downstream.
Minimum-surface port: only the anti-stagnation primitive contributes to the ledger here. Same Fibonacci-tier thresholds as the inference substrate_anti_stagnation (F(6)=8, F(7)=13, F(8)=21 over the preceding F(8)=21 window). All deltas pass through the shared phi^pi standard so additional primitives can be added later without architectural change.
Behavior:
  no stagnation in targets -> exact parity with substrate_fft_loss
  heavy stagnation -> per-token gradient muted by fluid standard
Weights renormalized by sum(weight) so loss scale is preserved.
Wired into train_with_self_distillation behind --omniweight-loss (default off so the v88 baseline stays intact for comparison).

After PR #4 closed the train/inference omniweight asymmetry, the natural follow-up: don't stop at cycle 6. The active_base ratchet (seed + appended best refined outputs) is exactly the kind of process where compounding past a fixed budget might find regimes the 6-cycle window can't reach.
--continuous: replaces `for cycle in range(n_cycles)` with an unbounded loop. n_cycles still controls steps_per_cycle (args.steps // n_cycles) so per-cycle training budget stays calibrated; the cycle counter just keeps going. K-shrink schedule clamps to K_min once global_step exceeds args.steps, which is the standard end state of the curriculum anyway.
--checkpoint PATH: serializes the entire distillation state every cycle (model state_dict, FibAdamW optimizer state, active_base, cycle counter, global_step, best_creativity, best_val/step, cycle_summary, rejection counters, best_refined_seq). Atomic write via tmp+os.replace so an interrupt mid-save can't corrupt the file. If the checkpoint exists at startup, training resumes from the saved cycle+1 with the active_base fully intact -- the ratchet picks up exactly where it stopped.
Default behavior unchanged: omitting both flags reproduces the v88 + omniweight-loss bounded 6-cycle run.
Run a forever-distillation with omniweight-loss:
  python3 train_self_recursive.py --omniweight-loss \
    --continuous --checkpoint omniweight_distill.pt
Resume after Ctrl-C: re-run the same command. Checkpoint state restored, next cycle is start_cycle.
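The tmp+os.replace pattern and the resume-from-cycle+1 logic can be sketched as follows. This is an assumption-laden stand-in: it uses `pickle` on a small dict where the real code would serialize model and optimizer state, the loop is bounded per call so the example terminates, and `save_checkpoint`, `load_checkpoint` and `run_cycles` are hypothetical names.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write to a temp file in the same directory, then atomically swap it in."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as handle:
            pickle.dump(state, handle)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_path, path)    # readers see the old file or the new one
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint(path):
    """Return the saved state, or None when no checkpoint exists yet."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as handle:
        return pickle.load(handle)


def run_cycles(path, cycles_this_session, train_one_cycle):
    """Resume at saved cycle + 1, checkpointing after every cycle."""
    state = load_checkpoint(path) or {"cycle": -1, "active_base": []}
    start_cycle = state["cycle"] + 1
    for cycle in range(start_cycle, start_cycle + cycles_this_session):
        state["active_base"].append(train_one_cycle(cycle))
        state["cycle"] = cycle
        save_checkpoint(path, state)
    return state
```

Keeping the temp file in the same directory as the target matters because `os.replace` is only atomic within one filesystem.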

claude added 18 commits May 22, 2026 19:11

transformerless_lm: thread allowed_after_word_mask through staged_refine

623396a

RandomCoder-lab marked this pull request as ready for review May 23, 2026 03:11

RandomCoder-lab merged commit bfb220b into master May 23, 2026
2 of 3 checks passed

RandomCoder-lab mentioned this pull request May 23, 2026

transformerless_lm: continuous self-distillation + cycle checkpointing #5

Draft

3 tasks

RandomCoder-lab merged 18 commits into master from claude/find-claude-md-arn0F
