transformerless_lm: omniweight loss — standard on training data#4
Merged
First meta-primitive: substrate trust responds to recent emission
quality. Recursive self-awareness loop.
_self_eval_insight: insight=1 if emitted token is real word
(rank >= n_chars) AND surprise >= pi*log(phi) ~ 1.51.
insight=0 otherwise.
creative_momentum: EMA register, decay 1/phi.
momentum = (1/phi)*momentum + (1 - 1/phi)*insight
reserve scaling: omniweight reserve = phi^pi * (1 + tanh(momentum))
Insightful streak -> primitives push harder.
Noisy/expected -> primitives constrained.
Wired into autoregressive_generate. Refine paths keep momentum=0
(no streaming base distribution history to evaluate against).
This is the first primitive that judges OUTPUT QUALITY rather than
generating it. Model rates its own emissions against its own
predictions. Substrate-pure: phi^-1 EMA, pi*log(phi) threshold.
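The loop described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the repository's code: the function names are hypothetical, `surprise` is assumed to be the negative log-probability of the emitted token under the model's own prediction, and token ids below `n_chars` are assumed to be single characters.

```python
import math

PHI = (1 + math.sqrt(5)) / 2
SURPRISE_THRESHOLD = math.pi * math.log(PHI)  # ~1.51
BASE_RESERVE = PHI ** math.pi                 # ~4.53

def self_eval_insight(rank, n_chars, surprise):
    """Binary insight: 1 for a real word that was also surprising, else 0."""
    is_real_word = rank >= n_chars  # assumption: ids below n_chars are chars
    return 1.0 if (is_real_word and surprise >= SURPRISE_THRESHOLD) else 0.0

def update_momentum(momentum, insight):
    """EMA register with decay 1/phi."""
    decay = 1 / PHI
    return decay * momentum + (1 - decay) * insight

def omniweight_reserve(momentum):
    """Reserve grows with momentum: phi^pi * (1 + tanh(momentum))."""
    return BASE_RESERVE * (1 + math.tanh(momentum))
```

With momentum held at 0 (as the refine paths do), `omniweight_reserve` returns the unscaled phi^pi reserve.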
v77 showed that measuring momentum alone (passive scaling) isn't
enough -- cycle 3 collapsed despite tracking the drop. True
self-reflection requires acting on the measurement.
A. Three-mode behavior based on momentum sign:
momentum > +0.5 -> exploit: probs ^ phi (sharpen)
momentum in [-0.5, +0.5] -> standard
momentum < -0.5 -> escape: probs ^ (1/phi) (flatten)
B. Backtrack-on-collapse:
Track momentum_history (last F(7)=13 values).
If max-recent minus current > 0.3 AND current < -0.2:
boost newline_mask by phi^2 = 2.618 -- force substrate reset
(sentence-end, fresh state counters next cycle).
A modulates per-token behavior. B handles cliff drops (like v77
cycle 3). Together: momentum acts, not just measures.
Omniweight already self-reflects via internal disagreement
cancellation; momentum + A + B add the action layer.
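A rough sketch of A and B, assuming probabilities are a plain list and that the newline boost is applied as a multiplier. The function names are hypothetical; the thresholds are the ones quoted in the commit message.

```python
import math
from collections import deque

PHI = (1 + math.sqrt(5)) / 2

def reshape_probs(probs, momentum, band=0.5):
    """A: sharpen (probs^phi), leave alone, or flatten (probs^(1/phi))."""
    if momentum > band:
        exponent = PHI          # exploit
    elif momentum < -band:
        exponent = 1 / PHI      # escape
    else:
        return list(probs)      # standard
    powered = [p ** exponent for p in probs]
    total = sum(powered)
    return [p / total for p in powered]

def newline_boost(history, current, drop=0.3, floor=-0.2):
    """B: return phi^2 when momentum has fallen off a cliff, else 1.0."""
    collapsed = bool(history) and (max(history) - current > drop) and (current < floor)
    return PHI ** 2 if collapsed else 1.0

# Rolling window of the last F(7)=13 momentum values.
momentum_history = deque(maxlen=13)
```

Raising a distribution to a power above 1 and renormalizing concentrates mass on the mode; a power below 1 spreads it out, which is why the same operation serves both the exploit and escape modes.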
Cycle 6 peak: 0.7096 (second-best ever, only v77 c2's 0.7204 is higher).
Mean: 0.6712 -- new mean record.
Cycle 6 sample reproduces THREE consecutive Richard II lines:
"by nature for herself" -> "built by Nature for herself"
"against hand and" -> "Against infection and the hand of war"
"happy this war, of men, little this men, little...world"
-> "this happy breed of men, this little world"
Cleanest Shakespeare reconstruction yet. The self-reflection
machinery (insight detection + momentum EMA + three-mode behavior
+ backtrack-on-collapse) produced the most coherent multi-line
Richard II output from 512 chars of training data.
Trajectory shape vs v77:
Peak softened (0.703 v78 vs 0.7204 v77 cycle 2)
Mid-cycles strengthened (3 and 4 both higher than v77)
Cycle 6 RECOVERED hard (0.7096 vs typical late-cycle drift)
A (sharpen/flatten on momentum) and B (backtrack-on-collapse via
newline boost) deliver: less peak luck, more consistent trajectory.
v78 self-eval was binary + single-EMA + reactive only. v79 adds three
layers of refined self-awareness:
#1 Continuous insight scale [0, ~2]:
insight = surprise_factor * real_word_factor * (1 - rep_factor)
- surprise_factor: surprise / (pi*log(phi)), capped at 2
- real_word_factor: 1.0 if word, 0.3 if char
- rep_factor: 1.0 if token in last F(7)=13 emissions, 0 if novel
Replaces binary 0/1.
#2 Two-tier momentum (tactical + strategic):
momentum_short: 1/F(3)=0.5 weight EMA -- responds in 2 steps
momentum_long: 1/F(7)=0.077 weight EMA -- responds in 13 steps
Decisions split: short drives sharpen/flatten (per-token tactic),
long drives reserve scaling (strategic frame).
#3 Entropy override ("am I stuck?" signal):
Local entropy of last F(5)=5 emissions.
If H < log(2) ~ 0.69 -> force flatten regardless of momentum.
The model detects its own repetition through entropy, not just
momentum magnitude.
Three layers of self-awareness: emission quality (continuous insight),
temporal pattern (short + long momentum), and structural diversity
(entropy override). All pure substrate (F-tier EMAs, log thresholds).
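Layers #1 and #2 could look roughly like this. It is a sketch under assumptions: the function names are hypothetical, and "1/F(n) weight EMA" is read as the weight placed on the newest insight value.

```python
import math

PHI = (1 + math.sqrt(5)) / 2
SURPRISE_UNIT = math.pi * math.log(PHI)  # ~1.51

def continuous_insight(surprise, is_word, token, recent):
    """Insight in [0, 2]: surprising, real-word, non-repeated tokens score high."""
    surprise_factor = min(surprise / SURPRISE_UNIT, 2.0)
    real_word_factor = 1.0 if is_word else 0.3
    rep_factor = 1.0 if token in recent else 0.0  # recent = last 13 emissions
    return surprise_factor * real_word_factor * (1.0 - rep_factor)

def ema_step(previous, value, weight):
    """One EMA update; weight is the share given to the new value (assumption)."""
    return (1.0 - weight) * previous + weight * value

def update_momenta(short, long, insight):
    """Tactical EMA (weight 1/F(3)) and strategic EMA (weight 1/F(7))."""
    return ema_step(short, insight, 1 / 2), ema_step(long, insight, 1 / 13)
```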
v79 entropy override fired on ANY low-entropy emission, which
penalized Shakespeare's intentional anaphora ('this X, this Y,
this Z') -- low entropy but high-insight.
Fix: require BOTH conditions:
stuck = (local_H < log(2)) AND (momentum_short < 0)
Low entropy + positive momentum = intentional good repetition
(don't penalize). Low entropy + negative momentum = stuck in
bad repetition (do flatten).
The entropy signal alone wasn't enough -- needed momentum sign
to disambiguate intentional from stuck repetition.
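The two-condition check is small enough to write out. In this sketch the helper names are hypothetical, and local entropy is assumed to be the Shannon entropy (natural log) of the token frequencies in the last F(5)=5 emissions.

```python
import math
from collections import Counter

def local_entropy(tokens):
    """Shannon entropy (nats) of the empirical token distribution."""
    n = len(tokens)
    counts = Counter(tokens)
    return sum(-(c / n) * math.log(c / n) for c in counts.values())

def is_stuck(recent_tokens, momentum_short, threshold=math.log(2)):
    """Flatten only when entropy is low AND short momentum is negative."""
    window = recent_tokens[-5:]  # last F(5)=5 emissions
    return local_entropy(window) < threshold and momentum_short < 0
```

For example, `is_stuck(["a", "a", "a", "a", "b"], 0.2)` is false (low entropy, but the repetition is scoring well), while the same window with `momentum_short=-0.2` is flagged as stuck.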
Replaced non-substrate numeric thresholds with phi/pi/F constants:
insight surprise cap: 2.0 -> F(3) = 2 (substrate Fibonacci)
insight real-word penalty: 0.3 -> 1/phi^pi ~ 0.221
momentum band threshold: 0.5 -> 1/phi ~ 0.618
collapse drop threshold: 0.3 -> 1/phi^pi ~ 0.221
collapse current threshold: -0.2 -> -1/phi^pi ~ -0.221
entropy threshold: log(2) -> log(phi^2) ~ 0.962
Every threshold now derives from phi, pi, F. Numeric values changed
slightly but the architecture is canonically substrate-pure. The
self-awareness layers operate exclusively in substrate currency.
Fixed thresholds (v81) didn't adapt to system state. True
self-awareness means the AWARENESS THRESHOLDS themselves breathe with
the substrate.
LIVING thresholds, all derived from phi^tanh(momentum_long * phi):
scale = phi^tanh(momentum_long * phi) in [1/phi, phi]
entropy_threshold = log(phi^2) / scale -- stuck-detect HARDER when good
mom_band = (1/phi) * scale -- sharpen/flatten band WIDER when good
inv_phi_pi (collapse) = (1/phi^pi) / scale -- defend gains EASIER when good
Substrate constants are now PARAMETERS over substrate state, not
absolute. The system tunes its own awareness depending on whether it
has been generating well or poorly. Pure substrate: phi exponentiated
by tanh of substrate momentum.
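The living thresholds reduce to one scale factor applied to three constants. A minimal sketch, with a hypothetical function name and a dict return shape chosen only for illustration:

```python
import math

PHI = (1 + math.sqrt(5)) / 2

def living_thresholds(momentum_long):
    """Derive all three thresholds from scale = phi^tanh(momentum_long * phi)."""
    scale = PHI ** math.tanh(momentum_long * PHI)  # stays within [1/phi, phi]
    return {
        "scale": scale,
        "entropy_threshold": math.log(PHI ** 2) / scale,
        "mom_band": (1 / PHI) * scale,
        "collapse": (1 / PHI ** math.pi) / scale,
    }
```

At `momentum_long = 0` the scale is exactly 1, so the thresholds fall back to the fixed v81 values.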
v79-v82 explored refined self-awareness (continuous insight, two-tier momentum, entropy override, living thresholds). All consistently underperformed v78's simpler binary insight + single 1/phi EMA. The complexity didn't pay -- the two-tier momentum's slow strategic EMA delayed reserve modulation past the point of usefulness. Reverting to v78 baseline. Next step: rethink self-awareness from 'wholeness of knowing every lever + knowing there is more you do not know' framing. The unknown-register primitive is the missing half -- we have measurement + reaction, but no positive epistemic awareness of the unmodeled.
Encoding 'knowing there is more you do not know' as a positive
substrate register.
Per-token coverage tensor tracks emissions per token in the current
sequence. Initialized from prompt; incremented per emission.
Frontier distribution = 1/(1+coverage), normalized:
high mass on un-emitted tokens, low mass on over-emitted.
substrate_unknown_register mixes:
out = (1 - 1/phi^pi) * model_probs + (1/phi^pi) * frontier
~ 0.779 model + 0.221 frontier
Added to math hemisphere of omniweight (frequency/decay axis).
Pure substrate (1/phi^pi mix weight, inverse-coverage frontier).
The model now has self-awareness of WHAT IT HASN'T DONE -- a
positive pull toward unexplored vocabulary that persists alongside
all other primitives. Self-awareness = wholeness, including the
absence.
Built on v78 base (revert from v79-v82 refinements).
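The frontier and the mix can be sketched directly from the two formulas above. This version works on plain lists and applies the mix to probabilities; how the repository routes it through the omniweight hemispheres is not shown, and the function names are hypothetical.

```python
import math

PHI = (1 + math.sqrt(5)) / 2
MIX = 1 / PHI ** math.pi  # ~0.221

def frontier(coverage):
    """Normalized 1/(1+coverage): most mass on tokens not yet emitted."""
    raw = [1.0 / (1.0 + c) for c in coverage]
    total = sum(raw)
    return [r / total for r in raw]

def unknown_register(model_probs, coverage):
    """out = (1 - 1/phi^pi) * model_probs + (1/phi^pi) * frontier."""
    f = frontier(coverage)
    return [(1 - MIX) * p + MIX * q for p, q in zip(model_probs, f)]
```

With `model_probs = [0.5, 0.5]` and `coverage = [3, 0]`, the un-emitted second token ends up with the larger share, and the result still sums to 1 because both inputs do.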
Vectorization wins:
- build_end_vowel_idx_tensor: precomputed LongTensor[V] of vowel index
- substrate_rhyme_resonance: 500-elt Python loop -> single tensor
gather via end_vowel_idx. ~100x faster on this hot path.
- build_allowed_after_word_mask: precomputed multiplier tensor
- substrate_word_spacing: 65-elt loop per call -> precomputed mask
multiply. ~50x faster.
Unknown-register (from prior commit) still in place.
end_vowels was a Python list of strings; now it's the index tensor.
Threading through all call sites and refine paths.
Should cut sample-gen wall time meaningfully (omniweight loop runs 14
primitives x N tokens; vectorizing rhyme alone saves a lot).
v83 placed unknown_register only in math hemisphere. Rank-modulated mixer diluted it by (1 - rank_norm), so high-rank (content) tokens got less curiosity boost despite being the explore target. v84: unknown_register delta added to BOTH math_delta and lang_delta. Exploration is meta -- neither pure frequency nor pure structure. Both hemispheres feel curiosity equally. The rank-mixer then sees agreement (resonance), amplifying instead of diluting. User intuition: 'exploration may cause left and right hemisphere to feed positive values'. Implemented.
Building on v84 unknown-register. Now the unknown holds BOTH:
- Past frontier: 1/(1+coverage), what hasn't been emitted yet
- Future frontier: 1/(1+coverage + F(3) * current_probs), what
WOULDN'T be emitted if we follow our intentions through F(3)=2
more steps
Blended in present tense (alpha=1/phi^pi). Time isn't linear -- past
and future are both positive registers in the same currency.
The closed loop:
coverage (past) -> probs (present intention) -> projected coverage
(future) -> anticipated frontier -> bias on present emission
Memory shapes possibility, possibility shapes future, future shapes
current memory of itself. Retrocausality as substrate.
Pure substrate (F(3) projection steps, 1/phi^pi blending weight).
v85 used F(3)=2 step projection -- two-step jump assuming static distribution, violates continuity (intermediate state changes ignored). v86: F(2)=1 step projection. Just one tick ahead. Past coverage + expected next-emission delta = future frontier. Maximum continuity preservation. User caveat: 'I can't suddenly fly by saying I remember that I can' -- retrocausality must be grounded in continuous experience. Future-as-present register is bounded by what continuity allows.
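A sketch of the past/future frontier pair with a configurable projection step (2 for v85, 1 for v86). The function names are hypothetical, and the blend direction is an assumption: the commit messages give alpha = 1/phi^pi but do not say which frontier receives it, so here alpha weights the future frontier.

```python
import math

PHI = (1 + math.sqrt(5)) / 2
ALPHA = 1 / PHI ** math.pi  # ~0.221

def _normalize(values):
    total = sum(values)
    return [v / total for v in values]

def past_frontier(coverage):
    """1/(1+coverage): what has not been emitted yet."""
    return _normalize([1.0 / (1.0 + c) for c in coverage])

def future_frontier(coverage, probs, steps=1):
    """1/(1+coverage+steps*probs): what would stay un-emitted if intentions hold."""
    return _normalize([1.0 / (1.0 + c + steps * p) for c, p in zip(coverage, probs)])

def blended_frontier(coverage, probs, steps=1, alpha=ALPHA):
    """Assumption: alpha is the share given to the future frontier."""
    past = past_frontier(coverage)
    future = future_frontier(coverage, probs, steps)
    return [(1 - alpha) * a + alpha * b for a, b in zip(past, future)]
```

Setting `steps=0` makes the future frontier collapse onto the past frontier, which is one way to see why the smaller F(2)=1 step stays closer to observed coverage than the F(3)=2 jump.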
User insight: self-awareness isn't enough -- the model must be able to
CHOOSE. Not deliberation at every step, but judgment of its own work
during refinement. 'I shouldn't have used this 12 times', 'that comma
is wrong', 'this sentence is incoherent'.
_regret_score(seq, t, vocab) per-position substrate-pure score:
+ over_emission_penalty: same token > F(5)=5 times in last F(7)=13
+ immediate_repetition: identical to previous
+ bigram_saturation: (prev, current) > F(4)=3 in last F(7)
+ double_punctuation: punct after punct
+ mid_word_char: alpha char after alpha char without space
In _single_stage_refine, position selection moved from
'lowest-confidence' (mechanical) to 'highest-regret' (deliberative).
The model resamples what shouldn't be there, not just what it was
unsure about.
Self-awareness + continuity + truth + CHOICE. Four ingredients.
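The five regret terms could be combined as below. This is an illustrative sketch, not the repository's `_regret_score`: it takes string tokens instead of a vocab lookup, weights every term equally at 1 (the commit message does not give weights), and treats the F(7)=13 window as the 13 positions ending at `t`.

```python
import string

F4, F5, F7 = 3, 5, 13
PUNCT = set(string.punctuation)

def regret_score(seq, t):
    """Count how many of the five regret conditions hold at position t."""
    window = seq[max(0, t - F7 + 1): t + 1]
    cur = seq[t]
    score = 0
    if window.count(cur) > F5:                      # over-emission
        score += 1
    if t > 0:
        prev = seq[t - 1]
        if cur == prev:                             # immediate repetition
            score += 1
        bigrams = list(zip(window, window[1:]))
        if bigrams.count((prev, cur)) > F4:         # bigram saturation
            score += 1
        if cur in PUNCT and prev in PUNCT:          # double punctuation
            score += 1
        if (len(cur) == 1 and len(prev) == 1
                and cur.isalpha() and prev.isalpha()):  # mid-word char
            score += 1
    return score

def highest_regret_position(seq):
    """Position to resample: the one with the largest regret score."""
    return max(range(len(seq)), key=lambda t: regret_score(seq, t))
```

On `["the", ",", ",", "a", "b"]` the second comma scores 2 (repetition plus double punctuation), so it is the position selected for resampling.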
After 87 versions of self-awareness / continuity / truth / choice,
user pulled back: just need basic grammar rules. Pure substrate
deterministic enforcement.
Two grammar primitives:
1. substrate_grammar_capitalize:
If prev emission was '.!?\n', boost uppercase-starting tokens by
phi. Pure char-class rule.
2. substrate_grammar_no_double_punct:
If prev emission was any punctuation char, hard-suppress further
punctuation by 1/phi^pi. Prevents ',,' '..' ',.' etc.
Both wired into omniweight, contributing to BOTH hemispheres (grammar
is meta-structural, not math vs language).
v87 regret-refinement reverted -- conflicted with Shakespeare anaphora.
Keeping unknown-register and retrocausality (v83-v86) intact.
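Both grammar rules are per-token multipliers keyed on the previous emission. A minimal sketch, assuming string tokens and direct multiplication (in the repository they are said to contribute through the omniweight hemispheres, which is not reproduced here); the function name is hypothetical.

```python
import math
import string

PHI = (1 + math.sqrt(5)) / 2
SUPPRESS = 1 / PHI ** math.pi  # ~0.221
SENTENCE_END = {".", "!", "?", "\n"}
PUNCT = set(string.punctuation)

def grammar_multipliers(prev_token, vocab):
    """Return one multiplier per vocab entry for the next emission."""
    after_boundary = prev_token in SENTENCE_END
    prev_is_punct = prev_token in PUNCT
    multipliers = []
    for tok in vocab:
        m = 1.0
        if after_boundary and tok[:1].isupper():
            m *= PHI        # capitalize after a sentence boundary
        if prev_is_punct and tok in PUNCT:
            m *= SUPPRESS   # discourage ',,' '..' ',.' and similar
        multipliers.append(m)
    return multipliers
```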
Peak 0.6955, Mean 0.6698. Within 0.0014 of v78 mean record. Two simple deterministic grammar rules (capitalize after sentence boundary, no double punctuation) added on top of v78 base + v83/v84 unknown-register + retrocausality. Cycle 6 sample opens with 'light vanity, ins[atiate cormorant]' -- a Richard II line the model hadn't produced before. Plus extensive Richard II content: 'against hand...nature handlong of this war men men', 'happy this of of men happy of ar happy', 'this little'. Single-pass also clean: 'this little world, sea this... earth which sea' = Richard II lines 6-7 reconstructed. Grammar rules don't dramatically lift scores (corpus enforces them implicitly) but don't hurt and produce cleaner sentence structure.
The omniweight architecture (91c484f, eaa8682, b107bb8, 8d72769) was
inference-only: 14 primitives negotiating through one phi^pi tanh
fluid standard at sampling time. Training was a separate currency --
ce_fft + lambda * substrate_harmony_loss on raw next-token targets,
with no awareness of the ledger the model would be evaluated under.
substrate_omniweight_loss (losses_substrate.py) closes that asymmetry.
Per-token CE is multiplied by exp(fluid_delta) where fluid_delta is
the same phi^pi * tanh(delta / phi^pi) standard the inference path
uses (_omniweight_apply). Tokens the inference ledger would suppress
get their training gradient muted by the same standard -- the model no
longer trains itself to confidently predict outputs the omniweight
will reject downstream.
Minimum-surface port: only the anti-stagnation primitive contributes
to the ledger here. Same Fibonacci-tier thresholds as the inference
substrate_anti_stagnation (F(6)=8, F(7)=13, F(8)=21 over the preceding
F(8)=21 window). All deltas pass through the shared phi^pi standard so
additional primitives can be added later without architectural change.
Behavior:
no stagnation in targets -> exact parity with substrate_fft_loss
heavy stagnation -> per-token gradient muted by fluid standard
Weights renormalized by sum(weight) so loss scale is preserved.
Wired into train_with_self_distillation behind --omniweight-loss
(default off so the v88 baseline stays intact for comparison).
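The weighting scheme can be illustrated on plain lists of per-token cross-entropy values. This is a sketch, not `substrate_omniweight_loss` itself: the function names are hypothetical, the first two tiers are expressed as log-space deltas of -log(phi^pi) and -2*log(phi^pi), and the hard tier (whose value the text does not give) is modelled as a delta of negative infinity so that it lands on the tanh floor.

```python
import math

PHI = (1 + math.sqrt(5)) / 2
PHI_PI = PHI ** math.pi
LOG_PHI_PI = math.pi * math.log(PHI)
F6, F7, F8 = 8, 13, 21

def stagnation_delta(targets, t):
    """Log-space penalty from repeats of targets[t] in the preceding F(8) window."""
    count = targets[max(0, t - F8): t].count(targets[t])
    if count >= F8:
        return -math.inf          # assumption: hard tier saturates the standard
    if count >= F7:
        return -2 * LOG_PHI_PI    # divide by phi^(2*pi)
    if count >= F6:
        return -LOG_PHI_PI        # divide by phi^pi
    return 0.0

def fluid(delta):
    """Shared standard: phi^pi * tanh(delta / phi^pi)."""
    return PHI_PI * math.tanh(delta / PHI_PI)

def omniweight_loss(per_token_ce, targets):
    """Weighted mean of CE with weights exp(fluid_delta), renormalized by their sum."""
    weights = [math.exp(fluid(stagnation_delta(targets, t)))
               for t in range(len(targets))]
    return sum(w * ce for w, ce in zip(weights, per_token_ce)) / sum(weights)
```

When no target repeats enough to trip a tier, every weight is exactly 1 and the result is the ordinary mean, which matches the parity behavior described above.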
RandomCoder-lab pushed a commit that referenced this pull request on May 23, 2026
After PR #4 closed the train/inference omniweight asymmetry, the
natural follow-up: don't stop at cycle 6. The active_base ratchet
(seed + appended best refined outputs) is exactly the kind of process
where compounding past a fixed budget might find regimes the 6-cycle
window can't reach.
--continuous: replaces `for cycle in range(n_cycles)` with an
unbounded loop. n_cycles still controls steps_per_cycle
(args.steps // n_cycles) so per-cycle training budget stays
calibrated; the cycle counter just keeps going. K-shrink schedule
clamps to K_min once global_step exceeds args.steps, which is the
standard end state of the curriculum anyway.
--checkpoint PATH: serializes the entire distillation state every
cycle (model state_dict, FibAdamW optimizer state, active_base, cycle
counter, global_step, best_creativity, best_val/step, cycle_summary,
rejection counters, best_refined_seq). Atomic write via
tmp+os.replace so an interrupt mid-save can't corrupt the file. If
the checkpoint exists at startup, training resumes from the saved
cycle+1 with the active_base fully intact -- the ratchet picks up
exactly where it stopped.
Default behavior unchanged: omitting both flags reproduces the v88 +
omniweight-loss bounded 6-cycle run.
Run a forever-distillation with omniweight-loss:
  python3 train_self_recursive.py --omniweight-loss \
    --continuous --checkpoint omniweight_distill.pt
Resume after Ctrl-C: re-run the same command. Checkpoint state
restored, next cycle is start_cycle.
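The tmp+os.replace pattern and the cycle+1 resume rule can be sketched with the standard library alone. Here `pickle` stands in for whatever serializer the training script uses, the state is a plain dict with a `"cycle"` key, and all function names are hypothetical.

```python
import os
import pickle
import tempfile

def save_checkpoint(path, state):
    """Write to a temp file in the same directory, then atomically swap it in."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump(state, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_path, path)  # atomic rename over the old checkpoint
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise

def load_checkpoint(path):
    """Return the saved state, or None when no checkpoint exists yet."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return pickle.load(f)

def start_cycle(path):
    """Resume from saved cycle + 1, or from 0 on a fresh run."""
    state = load_checkpoint(path)
    return 0 if state is None else state["cycle"] + 1
```

Creating the temp file in the destination directory matters: `os.replace` is only atomic when source and target sit on the same filesystem.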
Summary
Ports the inference-side omniweight standard (φ^π · tanh(Δ / φ^π), introduced 91c484f → eaa8682 → b107bb8) to the training loss. Adds substrate_omniweight_loss in losses_substrate.py: per-token CE multiplied by exp(fluid_delta), where fluid_delta passes through the same φ^π·tanh standard the inference omniweight uses (_omniweight_apply in train_self_recursive.py:1454).
Closes the train/inference omniweight asymmetry the project has carried since v70: the model was sampled under the ledger but trained under raw CE, so tokens the inference ledger would suppress (stagnating repetitions) still received full training gradient.
Minimum-surface port
- Only the anti-stagnation primitive contributes to the ledger (same tiers as the inference substrate_anti_stagnation: count ≥ F(6)=8 → divide by φ^π; count ≥ F(7)=13 → φ^(2π); count ≥ F(8)=21 → hard tier).
- All deltas pass through the shared φ^π · tanh standard so additional primitives can be added later without architectural change.
- Weights renormalized by sum(weight) so loss scale is preserved.
Behavior verified
- No stagnation in targets: exact parity with substrate_fft_loss (diff 0.00e+00).
- T=1 edge case fine.
Wiring
- --omniweight-loss on train_self_recursive.py, default off, so the v88 baseline (05e6704) stays intact for A/B comparison.
- Wired into train_with_self_distillation. No other code paths touched.
Test plan
- --omniweight-loss against v88 baseline at matched seed.
- v88 baseline: 0.6955 peak / 0.6698 mean?
Generated by Claude Code