21 May 05:21

ad35f98

transformerless-lm v0.1.0 — 100× FibGen compression, 5.6× lazy-data speedup

transformerless-lm v0.1.0

First release of the substrate-compressed language model framework
under experiments/transformerless_lm/. This document is the in-tree
release artifact corresponding to the local annotated tag
transformerless-lm-v0.1.0 at commit ad35f98.

Headline results (validated)

100× weight compression via FibGen

Each weight tensor W ∈ R^{out × in} is replaced by a small
Fibonacci-indexed seed and reconstructed on demand via a closed-form
sin/cos expansion at Fibonacci frequencies.

arch	params	compression	val (best)	vs dense	uniform reduction
dense_crt	801,664	1×	2.5602	—	-38.7%
fibgen_K16_separable	8,064	100.4×	2.9020	+13.3%	-30.5%
fibgen_K32_separable	9,216	87.9×	2.7282	+6.6%	-34.6%

Reproduced across two independent training runs (the original v2 bench
at results_fibgen.json and the recheck run at the same path). The
compression is real: about 8K stored parameters reconstruct an 810K
dense-equivalent weight tensor, and the model genuinely learns the corpus
structure (val is well below ln(65) ≈ 4.17, the loss of a uniform guess
over the 65-symbol vocabulary).
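
A minimal sketch of the seed-to-weight idea, not the code in `models_fibgen.py`. The basis layout, the `2π·F_k·i/n` phase convention, the square seed shape, and the names `fib_basis` and `reconstruct` are all assumptions made for illustration; the release's separable and cross modes may be parameterized differently.

```python
import numpy as np

def fibonacci(k):
    """First k Fibonacci numbers, starting 1, 2, 3, 5, ..."""
    fibs = [1, 2]
    while len(fibs) < k:
        fibs.append(fibs[-1] + fibs[-2])
    return np.array(fibs[:k], dtype=np.float64)

def fib_basis(n, k):
    """(n, 2k) matrix of sin/cos features at Fibonacci frequencies (assumed layout)."""
    pos = np.arange(n, dtype=np.float64)[:, None] / n
    angle = 2.0 * np.pi * pos * fibonacci(k)[None, :]
    return np.concatenate([np.sin(angle), np.cos(angle)], axis=1)

def reconstruct(seed, out_dim, in_dim):
    """Expand a (2k, 2k) seed into a dense (out_dim, in_dim) weight matrix."""
    k = seed.shape[0] // 2
    return fib_basis(out_dim, k) @ seed @ fib_basis(in_dim, k).T

rng = np.random.default_rng(0)
K, d = 16, 128
seed = rng.normal(scale=0.1, size=(2 * K, 2 * K))   # the only stored parameters
W = reconstruct(seed, d, d)                          # dense weight, built on demand
ratio = W.size / seed.size                           # per-matrix ratio for this toy layout
```

The ratio this toy layout produces is specific to the assumed seed shape and should not be read as the release's measured 100× figure.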

Inference: 90-93% throughput at 10-37× less RAM

arch	d	weight_MB	tok/s	vs dense speed
dense_crt	128	3.06	473	—
fibgen_K32 cached	128	0.31	441	93%
dense_crt	256	12.12	264	—
fibgen_K32 cached	256	0.33	237	90%

The weight cache pattern (precompute W once at deployment, reuse
across all forward passes) eliminates the FibGen forward-overhead at
inference. Per-token compute matches dense; only the persistent
weight storage is compressed. At d=256 the memory ratio is 37×;
at LLM scale (d=4096) extrapolation gives ~200× memory reduction.
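
The cache pattern described above can be sketched as follows. This is an illustrative wrapper, not the release's implementation: `CachedGeneratedLinear` and `toy_generator` are hypothetical names, and the generator is a stand-in for the real FibGen reconstruction.

```python
import numpy as np

class CachedGeneratedLinear:
    """Linear layer whose weight is generated once from a seed, then reused."""

    def __init__(self, seed, generator):
        self.seed = seed            # small persistent storage
        self.generator = generator  # callable: seed -> dense weight (hypothetical)
        self._weight = None         # dense weight, materialized on first use
        self.generations = 0        # how many times the generator actually ran

    def weight(self):
        if self._weight is None:
            self._weight = self.generator(self.seed)
            self.generations += 1
        return self._weight

    def __call__(self, x):
        return x @ self.weight().T

def toy_generator(seed):
    # Stand-in for the real reconstruction: outer product of the two seed halves.
    half = seed.size // 2
    return np.outer(seed[:half], seed[half:])

layer = CachedGeneratedLinear(np.arange(8, dtype=np.float64), toy_generator)
x = np.ones((3, 4))
y1 = layer(x)   # first call pays the generation cost
y2 = layer(x)   # later calls reuse the cached weight
```

After the first call the forward pass is an ordinary dense matmul, which is why per-token compute can match the dense baseline.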

Lazy-loaded training: 5.6× wall-clock speedup

Fibonacci-strided data sampling loads only log_φπ(T) tokens per
sequence position (11 of 128 at T=128). The model never reads gap
tokens from disk.

config	val	wall (1500 steps)	speedup
dense baseline (dense data)	2.4396	165.7s	1.00×
dense + lazy-strided data	2.5274	29.5s	5.62×

The substrate's log_φπ cadence is the data-loading complexity
bound; this is the cleanest single-axis substrate-native win in the
release.
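
One way to arrive at 11 positions out of 128 is to keep offset 0 plus every distinct Fibonacci number below T. That scheme is an assumption made for this sketch; `lazy_data.py` may choose its positions differently, and the function names here are hypothetical.

```python
import numpy as np

def fib_strided_offsets(T):
    """Offset 0 plus every distinct Fibonacci number below T (assumed scheme)."""
    offsets = [0]
    a, b = 1, 2
    while a < T:
        offsets.append(a)
        a, b = b, a + b
    return np.array(offsets)

def lazy_window(tokens, start, T):
    """Read only the strided positions of the window tokens[start:start+T]."""
    return tokens[start + fib_strided_offsets(T)]

tokens = np.arange(1000)            # stand-in for a memory-mapped token file
offsets = fib_strided_offsets(128)  # 0, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89
sample = lazy_window(tokens, 200, 128)
```

With a memory-mapped array in place of `tokens`, the fancy-indexed read touches only the selected positions, which is where a loading-time saving would come from.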

35B-in-8GB feasibility math

Combining the validated wins:

config	35B-equivalent storage	fits in 8 GB?
dense fp16	70 GB	no
4-bit quantization (SOTA)	17.5 GB	no
FibGen K=32 cross	7 GB	yes
FibGen K=32 separable	800 MB	yes, easily

These numbers are extrapolations from the d=128 / d=256 measurements.
At true LLM scale the compression ratio grows as (d/K)² because
dense storage scales as d² while the seed is K² regardless of d.
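
The storage figures in the table follow from simple arithmetic, sketched below. The 10× ratio used for the cross row is inferred from the table (70 GB to 7 GB), not a separately stated measurement, and `storage_gb` and `seed_ratio` are illustrative helper names.

```python
PARAMS = 35e9   # 35B-parameter dense-equivalent model

def storage_gb(params, bits_per_param, compression=1.0):
    """Storage in GB (10^9 bytes) after dividing by a compression ratio."""
    return params * bits_per_param / 8 / 1e9 / compression

dense_fp16 = storage_gb(PARAMS, 16)           # 70 GB
quant_4bit = storage_gb(PARAMS, 4)            # 17.5 GB
fibgen_cross = storage_gb(PARAMS, 16, 10.0)   # 7 GB, assuming the inferred 10x ratio
fibgen_sep = storage_gb(PARAMS, 16, 87.9)     # roughly 0.8 GB at the measured 87.9x

def seed_ratio(d, K):
    """Per-matrix ratio if dense storage is d*d and the seed is K*K."""
    return (d / K) ** 2
```

Under the (d/K)² model, `seed_ratio(4096, 32)` evaluates to 16384 per matrix; like the table, this is an extrapolation rather than a measurement.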

Architectural primitives (all in `experiments/transformerless_lm/`)

primitive	file	validation
CRT-Fibonacci PE	`models.py`	-5.4% vs sinusoidal PE
Geodesic attention bias	`models.py`	-0.4% vs crt_only, 3/3 seeds
Fibonacci-offset sparse attention	`models_substrate.py`	14× FLOP reduction, -3.2% loss
Zeckendorf-routed FFN	`models_substrate.py`	5× FFN FLOPs reduction
FibGen weight generator	`models_fibgen.py`	100× storage compression
Subsim L1-distance attention	`models_subsim.py`	substrate operator, +5.7% loss at d=128
Fibonacci tier quantization	`models_substrate.py:fibonacci_tier_snap`	saturates at +0.6 nats post-hoc
Fibonacci State Model	`models_fsm.py`	NaN at init, scale-bound
Lazy-strided data loader	`lazy_data.py`	5.6× training speedup
Stochastic Fibonacci depth	`models_subsim.py`	1.17× wall-clock speedup

Falsified or scale-bound

claim	falsification
Pure Fibonacci-tier post-hoc quantization at 4-bit	Saturates at +0.6 nats regardless of bit depth
Substrate operators (Subsim/FSM) faster than dense at d=128	At CPU bench scale (d≤256, T≤512) PyTorch overhead dominates the asymptotic FLOP savings
FSM recurrence numerically stable at random init	Eigenvalue > 1 produces immediate NaN; needs gating
K-scaling alone closes the gap to dense at d=256	K=48, K=64 both LOST at d=256 (+30% gap)
Plain FibGen at d=256 maintains its compression-vs-quality	Compression ratio grows nicely (36×) but loss penalty also grows (+30%)

Reproducing the headline numbers

cd experiments/transformerless_lm

# 100× compression result (this release's main claim)
python3 train_fibgen.py --steps 2500 --K-sweep 16,32 --modes separable
# expect: fibgen_K16_separable val ~2.90 (100x compression)
#         fibgen_K32_separable val ~2.73 (88x compression)

# Lazy-loading data speedup
python3 train_lazy_loading.py --steps 1500
# expect: dense ~165s, fib_strided ~29s, val deltas <5%

# Inference-time throughput
python3 bench_inference.py --n-tokens 256
# expect: fibgen_K32 cached at 90%+ of dense throughput at d=128

Honest limits

Output text quality at d=128 is gibberish for ALL archs including dense. Coherent text needs GPT-2-tiny-class capacity (d≥384, n_blocks≥6).

Substrate operator wall-clock wins (Subsim, FSM, Composed) are scale-bound: they don't materialize on CPU at our test scale. Asymptotic complexity advantages are real but unreachable in pure PyTorch without parallel-scan kernels or larger T/d.

35B feasibility is an extrapolation from d=128/256 measurements, not a direct measurement at LLM scale.

Training-time substrate ops (lazy tier dropout, K-subsampling) delivered at most a small per-step compute reduction in pure PyTorch due to indexing overhead. Real wins would require kernel work.

File index

experiments/transformerless_lm/
  README.md                       # original transformerless-LM thesis
  GEODESIC_RESULT.md              # validated -0.4% geodesic attention
  GEODESIC_ATTENTION_DERIVATION.md
  TRANSFORMERLESS_RESULT.md       # token-CRT + Principle A/B results
  WEIGHT_SUBSTRATE_REFORMULATION.md  # Principle A/B derivation
  INFERENCE_FIRST_DERIVATION.md   # 35B-in-8GB framing
  RELEASE_v0.1.0.md              # THIS FILE

  corpus.py                       # data loader (TinyShakespeare)
  lazy_data.py                    # Fibonacci-strided data loader

  models.py                       # baseline crt_only + arch variants
  models_substrate.py             # FibonacciOffsetAttention, ZeckendorfRoutedFFN
  models_fibgen.py                # FibGenLinear (THE compression primitive)
  models_subsim.py                # L1-distance attention operator
  models_fsm.py                   # Fibonacci State Model (broken; needs stability fix)

  train_distractor_mix.py         # distractor-mix training scaffold
  train_geodesic_attention.py     # geodesic bench
  train_fibgen.py                 # FibGen K/mode sweep (main reproducer)
  train_lazy_loading.py           # lazy-data validation bench
  bench_inference.py              # autoregressive generation throughput

  results_*.json                  # raw bench outputs (kept for audit)
  results_samples.txt             # text generation samples at d=128

19 May 19:06

RandomCoder-lab

v1.7.0

c2d6432

OMC v1.7.0 — Exponentiation, Universal Archaeology & Static Lint

The language, tooling, and libraries all take a major leap forward.
This release introduces a new math operator, a full static-analysis lint suite, built-in rollback stacks, cross-language codebase archaeology, and a hardened recursive self-improvement framework — all passing a zero-failure lint sweep across 244 source files.
Language

** exponentiation operator and **= augmented assignment

range(start, end, step) in for-loops — finally, step control in loops

Full postfix chain on CallExpr — closures/HOF calls can now be chained directly after a call, e.g. func()(args).method()

throw as a statement terminator — cleaner error flow

Static Analysis (omc --check)

New lint rules:
dead-code, augmented-assign, empty-if, const-condition, shadow-var, /= augmented-assign

Codebase-wide sweep: zero failures across all 244 files — the entire project is lint‑clean

Tooling

omc --doc / --doc-all: auto‑generates Markdown documentation from source

omc --search: search packages across the standard library, installed packages, and the registry

Builtins (Rust)

omc_code_valid(src) — parse‑guard for LLM‑generated code; validate structure before execution

fn_snapshot / fn_rollback / fn_snapshot_all / fn_rollback_all — recursive self‑improvement rollback stacks (RSI)

fn_bench(name, args, n) — wall‑clock micro‑benchmarking

time_ms(), code_parse_check(src) — timing and quick parse checks
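
The snapshot/rollback pattern behind the RSI builtins can be illustrated in Python. This is an analogue of the idea only: the class, its methods, and the sanity check are hypothetical and say nothing about the actual semantics of `fn_snapshot` or `fn_rollback` in OMC.

```python
class RollbackRegistry:
    """Named functions with a per-name stack of earlier versions."""

    def __init__(self):
        self.current = {}
        self.history = {}

    def define(self, name, fn):
        self.current[name] = fn

    def snapshot(self, name):
        # Push the current version so a later rollback can restore it.
        self.history.setdefault(name, []).append(self.current[name])

    def rollback(self, name):
        """Restore the most recent snapshot; return False if none exists."""
        stack = self.history.get(name)
        if not stack:
            return False
        self.current[name] = stack.pop()
        return True

reg = RollbackRegistry()
reg.define("score", lambda x: x + 1)
reg.snapshot("score")
reg.define("score", lambda x: x * 100)   # candidate rewrite
if reg.current["score"](2) > 10:         # candidate fails a sanity check
    reg.rollback("score")
result = reg.current["score"](2)
```

Taking the snapshot before each self-modification means a failed candidate costs one pop rather than a lost function.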

Libraries

examples/lib/rsi.omc — reusable, production‑grade recursive self‑improvement framework

examples/lib/archaeology.omc — universal codebase archaeologist using a phi‑pi‑fib substrate scoring algorithm; works on any language: OMC, Python, Rust, JavaScript, Go, and more

Demos & Ports

codebase_archaeologist.omc — thin wrapper around the archaeology library

python_archaeologist.omc — cross‑language substrate analysis of Python codebases

recursive_improve.omc — hardened RSI demonstration

JavaScript ports:
omc-runtime.js + ports of substrate_rag, recursive_improve, self_improving_agent, and code_gen_loop

Apiproxy (omnimcode-apiproxy)

Conversation tracking with differential history

Adjacent marker collapse + intra‑request deduplication

Streaming side‑recording with adaptive compression

omc_proxy_namespace — multi‑tenant isolation for safe concurrent use

What's Changed

Add SafeSkill security badge (89/100 — Passes with Notes) by @OyaAIProd in #1

New Contributors

@OyaAIProd made their first contribution in #1

Full Changelog: v1.6.0...v1.7.0

18 May 01:53

RandomCoder-lab

v0.10.0-compression-axes-1-4

e25af0d

v0.10.0 — omc-memory-plus axes 1-4: 5,356× context compression on real Claude Code dev work

Pushing OMC Memory+ compression ceiling beyond v1.0's 297× along four orthogonal axes. All four ship with round-trip verification on this codebase's own chapter writeups.

Headline

axis	mechanism	measured win
1	Merkle manifest hashes	5,356× context compression (19 chapters → 1 hash)
2	Cross-namespace dedup pool	5× disk on 5-way duplicate (linear with N namespaces)
3	Aged-tier zlib (`OMCZ` magic)	2.19× disk on Markdown
4	Substrate tokenizer (`OMCT` magic)	2.37× disk on OMC source (≈ ties Axis 3)

What's new

4 new MCP tools plus 2 storage behaviors

`omc_memory_create_manifest(namespace, entries)` — bundle N leaf hashes into 1 manifest hash
`omc_memory_recall_manifest(content_hash, expand?)` — recall manifest, optionally fetch all leaves
`omc_memory_compact(namespace, age_threshold_secs)` — re-deflate aged pool bodies as OMCZ
`omc_memory_compact_substrate(namespace, age_threshold_secs)` — re-encode aged bodies via substrate tokenizer as OMCT
Auto-decompression of OMCZ + OMCT bodies on recall (transparent)
Cross-namespace dedup pool at `~/.omc/memory/_pool//.txt`
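
The manifest idea (N leaf hashes bundled behind one hash) can be sketched with a content-addressed dictionary. The JSON body layout, SHA-256, and the function names are assumptions for illustration and are not the MCP tools' wire format.

```python
import hashlib
import json

store = {}  # content hash -> bytes; stand-in for the on-disk pool

def put(body: bytes) -> str:
    """Store a body under its own hash and return that hash."""
    h = hashlib.sha256(body).hexdigest()
    store[h] = body
    return h

def create_manifest(entries):
    """Bundle leaf hashes into one manifest body and return its hash."""
    body = json.dumps({"type": "manifest", "entries": list(entries)}).encode()
    return put(body)

def recall_manifest(manifest_hash, expand=False):
    """Return the leaf hashes, or the leaf bodies when expand is True."""
    entries = json.loads(store[manifest_hash])["entries"]
    return [store[h] for h in entries] if expand else entries

leaves = [put(f"chapter {i}".encode()) for i in range(19)]
root = create_manifest(leaves)   # one hash now references all 19 chapters
```

Because the manifest is itself content-addressed, rebuilding it from the same leaves yields the same hash, so only `root` needs to stay in the LLM context.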

Architecture

All bodies content-addressed to a global pool with 256-shard fanout by hash top byte
Per-namespace dirs hold only the chronological index (`_index.jsonl`)
Recall: pool first → legacy per-namespace fallback → maybe_decompress (OMCZ / OMCT / plain)
flate2 added as omnimcode-core dep (rust_backend, no system zlib required)
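
A rough Python sketch of the two storage mechanics listed above: sharding by the top byte of the hash, and transparent decompression keyed on a magic prefix. Only the `OMCZ` magic and the 256-way fanout come from the release notes; the directory layout, the byte framing after the magic, and the function names are hypothetical.

```python
import hashlib
import zlib
from pathlib import Path

MAGIC_Z = b"OMCZ"  # magic named in the release; the framing after it is assumed

def pool_path(root: Path, content_hash: str) -> Path:
    """Shard by the hash's top byte (first two hex digits), giving 256 directories."""
    return root / "_pool" / content_hash[:2] / f"{content_hash}.txt"

def compact(body: bytes) -> bytes:
    """Re-deflate a body and tag it with the magic prefix."""
    return MAGIC_Z + zlib.compress(body)

def maybe_decompress(stored: bytes) -> bytes:
    """Inflate tagged bodies; return untagged bodies unchanged."""
    if stored.startswith(MAGIC_Z):
        return zlib.decompress(stored[len(MAGIC_Z):])
    return stored

body = b"# Chapter\n" + b"substrate notes. " * 200
h = hashlib.sha256(body).hexdigest()
path = pool_path(Path("memory"), h)
packed = compact(body)
```

A plain body that happens to begin with the magic bytes would be misread by this sketch; a real format would need a guard such as a length or checksum field.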

How the compounding works

Axis 1 attacks context cost (tokens in LLM working set). Axes 2-4 attack disk cost (bytes on filesystem). Axis 1 is what the LLM pays per turn; axes 2-4 are what the user pays in storage. They multiply because they target different scarce resources.

Example — 19 chapters duplicated across 5 namespaces, all aged into Axis 3 compaction:

version	disk bytes	context tokens needed to reference everything
v1.0 naive	570,760 (95 files)	95 hash refs = 475 tokens
v0.9.2 pool dedup	114,152 (19 files)	95 refs = 475 tokens
v0.9.3 + zlib aged	~52,000	95 refs = 475 tokens
v0.9.1 manifest	(same disk)	5 tokens (1 manifest hash)

The Axis 1 manifest hash is the headline win for LLM context cost. The other axes are the foundation that keeps disk + retrieval cheap as memory grows.

Honest framing on Axis 4

Substrate tokenizer compaction was hypothesized to dominate raw zlib on OMC-flavored content because the substrate dictionary was tuned for OMC syntax. Measured: 2.37× vs raw zlib's 2.48× on the same content — essentially tied. Axis 4 ships as the substrate-native compression path that enables future Axis 6 HBit dual-band work, even though raw byte-savings is on par with Axis 3.

Still on the roadmap

axis	mechanism	est. additional win
5	Delta compression between similar entries	10-100× on iterative content
6	HBit dual-band codec	2-3× over Axis 4
7	LLM-assisted lossy + hash verification	10-50× more on prose with regen

Tests

1111/1111 OMC tests pass. End-to-end MCP integration test verifies round-trip on Markdown + OMC source.

Files

`omnimcode-core/src/memory.rs` — Axis 1-4 implementations + maybe_decompress + varint helpers
`omnimcode-core/Cargo.toml` — flate2 added
`omnimcode-mcp/src/main.rs` — 4 new tool registrations + dispatch

18 May 01:31

RandomCoder-lab

v0.9.0-memory-plus-product

19bbe1f

v0.9.0 — omc-memory-plus v1.0: Claude Code MCP plugin, 297× context compression

First commercial product packaged from OMC.

OMC Memory+ for Claude Code — a Claude Code MCP plugin that gives Claude persistent, content-addressed memory across sessions via OMC's substrate codec.

Real dogfood measurements (18 chapter writeups from this very codebase)

metric	value
raw content	101,771 bytes / 26,781 tokens
hash references in context	90 tokens
compression ratio	297.6×

At Claude Sonnet pricing ($3/MTok input):

Without Memory+: $0.08 per session that needs project context
With Memory+: $0.02 per session (90 hash refs + on-demand recall)
73% per-session token cost reduction
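
The ratio and the "without" cost can be checked directly from the measurements in the table. The "with Memory+" figure also includes on-demand recall traffic, which is not broken out in the notes, so the sketch only computes the cost of the hash references themselves.

```python
raw_tokens = 26_781
ref_tokens = 90
price_per_mtok = 3.00   # USD per million input tokens, as quoted above

ratio = raw_tokens / ref_tokens                      # about 297.6
cost_without = raw_tokens * price_per_mtok / 1e6     # about $0.08 per session
cost_refs_only = ref_tokens * price_per_mtok / 1e6   # hash refs alone, before any recall
```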

Pricing

plan	price	features
Free	$0	All 17 MCP tools, local memory storage, unlimited usage
Pro	$5/mo per seat	+ cross-machine sync, cloud retention, namespace sharing
Team	$50/mo for 5 seats	+ shared team namespaces, audit log, webhook events
Enterprise	from $500/mo	+ self-hosted backend, SSO, SLA, data residency

ROI: 50-dev team saves $285/mo → Team plan ROI in 9 days.

Architecture

```
Claude Code
↓
MCP protocol (stdio JSON-RPC)
↓
omnimcode-mcp binary
↓
~/.omc/memory// ← content-addressed, filesystem-backed
```

Local-first by default. Cloud sync is opt-in (Pro+).

The 17 MCP tools

6 load-bearing for the product:

`omc_compress_context` — substrate codec, alpha-rename invariant hashing
`omc_memory_store` / `_recall` / `_list` / `_stats` / `_evict`

11 useful adjacent:

`omc_eval`, `omc_help`, `omc_list_builtins`, `omc_categories`, `omc_did_you_mean`, `omc_explain_error`, `omc_predict`, `omc_corpus_size`, `omc_decompress`, `omc_fetch_by_hash`, `omc_unique_builtins`

Why this matters

The substrate codec was originally built for OMC-PROTOCOL v1 (distributed agent kernel communication). v0.9.0 repackages it for Claude Code users.

Pivot from research benches (v0.8 chapters: substrate-attention findings, GPU kernels, fused builtins) to a shipped product that monetizes the substrate's content-addressing property. The substrate is now generating revenue paths, not just papers.

Files

`products/omc-memory-plus/README.md` — feature pitch + measurements
`products/omc-memory-plus/INSTALL.md` — 3-step Claude Code install
`products/omc-memory-plus/PRICING.md` — tier breakdown + ROI calculator
`products/omc-memory-plus/install-snippet.json` — copy-paste MCP config

Roadmap

v1.1 cloud sync infrastructure
v1.2 auto-detect long context blocks, suggest compression
v1.3 integration with Claude Code's `/compact` — replace summary with hash refs
v2.0 API endpoint for non-Claude-Code tools (Cursor, Continue, Aider)

Built on OMNIcode

`omnimcode-mcp` is part of OMNIcode, a harmonic computing language with native substrate primitives. The substrate codec, content-addressed canonical hashing, and fibtier memory eviction (default 232 entries = sum of first 11 Fibonacci tier sizes, 1 + 1 + 2 + ... + 89) all come from the OMC substrate work shipped in v0.0.5-v0.8.10.

17 May 23:47

RandomCoder-lab

v0.8.10-substrate-backward-falsified

06a7c16

v0.8.10 — substrate-aware backward gradients: TRIED, falsified at this scale

The research-grade item from the v0.8.9 goal

Hypothesis: instead of plain dL/dθ, route gradients through substrate so updates that move θ toward Fibonacci attractors are amplified and updates that move θ away are dampened. The substrate as a gradient-flow preconditioner instead of a forward modulator.

Result: falsified at d_model=32. Loss landscape pulls harder than substrate alignment.

What was built

tape_substrate_grad_mod(x, scale, alpha) — fused tape op with identity forward but substrate-shaped backward:

forward:   y = x                                    # identity
backward:
  for each cell:
    xs = round(x · scale)
    (attractor, dist) = nearest_attractor_with_dist(xs)
    if dist == 0:  dx = dy                          # on attractor, passthrough
    else:
      dir = sign(attractor - xs)
      pulls_toward = sign(dy) · dir < 0             # update -lr·dy moves toward attractor
      dx = dy · (1 + alpha) if pulls_toward         # amplify
           else dy · 1/(1 + alpha)                  # dampen

Smoke test verifies math (scale=10, alpha=0.5):

x	xs	nearest	dist	result	expected
0.6	6	5	1	1.5	1.5 ✓
0.7	7	8	1	0.667	0.667 ✓
0.5	5	5	0	1.0	1.0 ✓
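
The backward rule and the smoke-test rows above can be reproduced with a short Python sketch. It assumes the attractors are the Fibonacci numbers (including 0), that the smoke test used an upstream gradient of 1, and that negative inputs are handled by mirroring; none of these details are confirmed by the notes, and the function names only echo the pseudocode.

```python
def nearest_attractor_with_dist(xs):
    """Nearest Fibonacci number to a non-negative integer, and its distance."""
    a, b = 0, 1
    while b < xs:
        a, b = b, a + b
    # Here a <= xs <= b; ties go to the smaller attractor.
    return (a, xs - a) if xs - a <= b - xs else (b, b - xs)

def substrate_grad_mod_backward(x, dy, scale, alpha):
    """Identity forward; amplify or dampen dy depending on where -lr*dy moves x."""
    xs = round(abs(x) * scale)
    attractor, dist = nearest_attractor_with_dist(xs)
    if dist == 0 or dy == 0:
        return dy                                 # on an attractor: passthrough
    direction = 1 if attractor > xs else -1       # direction in |x| space
    if x < 0:
        direction = -direction                    # mirror for negative inputs (assumed)
    sign_dy = 1 if dy > 0 else -1
    pulls_toward = sign_dy * direction < 0        # the step -lr*dy heads to the attractor
    return dy * (1 + alpha) if pulls_toward else dy / (1 + alpha)

results = [substrate_grad_mod_backward(x, 1.0, scale=10, alpha=0.5)
           for x in (0.6, 0.7, 0.5)]
```

With these assumptions the three inputs give 1.5, about 0.667, and 1.0, matching the table.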

A/B result

Wrapped Q and V projections in tape_substrate_grad_mod(node, 64, 0.5) before matmul. Forward unchanged; backward biased. d_model=32, 250 steps, 3 seeds:

arm	mean tail loss	Δ vs baseline	wins
baseline	1.998	—	—
+ substrate gm	2.165	+8.4%	1/3
+ substrate gm + Q6	2.157	+7.9%	1/3

Falsified. Substrate-shaped gradient bias hurts training at this scale.

The empirical substrate-architecture map after v0.8

Validated (substrate at outputs / in structure):

Data — CRT-PE positional encoding (cross-validates)
Algorithm — substrate-K + S-MOD + V-resample (cross-validates)
Hardware tile — 8×32 wavefront-aligned (+38-61%)
Post-training pattern — Q6 → 8.3× substrate concentration (v0.8.8)
Multi-head Q6 compound — −3.57% MH→MHQ6 (v0.8.9)

Falsified (substrate as input constraint or backward bias):

Init-time substrate-snap (v0.8.8 #3)
Gradient-time substrate-pull (this chapter)

Pattern: the substrate works when applied to OUTPUTS or revealed by training, but NOT when forced on INPUTS or GRADIENTS. The information flow direction matters.

Reformulations possible (future chapters)

Different scale: scale=64 may be too coarse; try 1024 or per-layer adaptive
Apply to FF not attention: FF weights may be more tolerant
Decay alpha during training: start strong, fade to 0 — warm-start regularizer
Regularization term instead of gradient bias: add sum(attractor_distance(param)) · lambda to loss

Each is its own chapter. v0.8.10 ships the honest negative.

#2 still in flight

d_model=128 larger-scale bench has been running 22+ min in background (buffered output won't print until exit). Lands in v0.8.11 with the actual MH-at-128 datum for PyTorch L1-MH −8.94% parity.

Tests

1111/1111 OMC tests pass.

Files

omnimcode-core/src/interpreter.rs — TapeOp::SubstrateGradMod + dispatch + backward
examples/prometheus_substrate_grad_mod_xval.omc — 3-arm A/B
experiments/prometheus_parity/V0810_SUBSTRATE_AWARE_BACKWARD.md — writeup

17 May 23:37

RandomCoder-lab

v0.8.9-sparse-and-mh-q6

6407f83

v0.8.9 — sparse attention kernel + MH+Q6 compound confirmed

Headline

Two goal items shipped with hard data; a third (d_model=128 larger-scale bench) is still in flight and will land in v0.8.10.

#3 MH+Q6 compound — v0.8.8 finding SCALES to multi-head

v0.8.8 showed Q6 training pushes attention 8.3× toward substrate positions in single-head mode. v0.8.9 #3 asks: does this scale to multi-head?

d_model=32, n_heads=4, 250 steps, 3 seeds:

arm	mean tail loss	Δ vs SH
SH (single head)	2.0309	—
SH + Q6 fused	1.9865	−2.19%
MH (4 heads)	2.0486	+0.87%
MH + Q6 fused	1.9754	−2.73% compound

The compound analysis: MH→MHQ6: −3.57% vs SH→SHQ6: −2.19% — Q6 gets more leverage in multi-head because each head has its own Q to sculpt independently. Per-head substrate alignment compounds at attention time.

Architecturally confirmed: v0.8.8 attention-shaping mechanism scales beyond single-head.

#1 Sparse substrate attention kernel — mechanism shipped, speedup pending

Shipped: tape_substrate_sparse_scores(q_id, k_id, threshold) in omnimcode-core. Computes scores only at cells where CRT substrate_dist(i, j) ≤ threshold (moduli {5, 8, 13, 21}), masks the rest to −∞ so subsequent softmax assigns zero. Backward only flows through fired cells.
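
The masking mechanism can be sketched in NumPy. The notes do not define `substrate_dist`, so the version below (summed circular residue distance over the moduli) is a hypothetical stand-in and will not reproduce the reported 6.8% density; the point is only the compute-then-mask flow.

```python
import numpy as np

MODULI = (5, 8, 13, 21)

def substrate_dist(i, j):
    """Hypothetical CRT distance: summed circular residue distance over the moduli."""
    total = 0
    for m in MODULI:
        diff = abs(i % m - j % m)
        total += min(diff, m - diff)
    return total

def sparse_scores(q, k, threshold):
    """Dot products only at fired cells; every other cell is -inf."""
    n = q.shape[0]
    mask = np.array([[substrate_dist(i, j) <= threshold for j in range(n)]
                     for i in range(n)])
    scores = np.full((n, n), -np.inf)
    rows, cols = np.nonzero(mask)
    scores[rows, cols] = np.einsum("ij,ij->i", q[rows], k[cols])
    return scores, mask

def softmax(scores):
    shifted = scores - scores.max(axis=1, keepdims=True)
    e = np.exp(shifted)            # exp(-inf) is exactly 0 for masked cells
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
q, k = rng.normal(size=(32, 8)), rng.normal(size=(32, 8))
scores, mask = sparse_scores(q, k, threshold=5)
attn = softmax(scores)
```

The diagonal always fires because a position is at distance 0 from itself, so every row keeps at least one finite score and the softmax stays well defined.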

Cell density telemetry (set OMC_GPU_VERBOSE=1):

[sparse-scores] 70/1024 cells = 6.8%

Exact match to v0.8.8 measurement — the 6.84% substrate-close cells.

Wall-clock at seq_len=32, d_model=32 (10-iter avg, post-Q6 training)

variant	forward ms/iter
dense	0.2723
sparse	0.2736
speedup	1.00×

No speedup yet. Dense path lives in tape_matmul's tight Rust inner loop; sparse path is naive scalar triple-loop with per-cell substrate-distance recomputation. At seq_len=32 the 93% saved MACs are eaten by per-cell overhead and cache-unfriendly access.

L1 difference between dense and sparse softmax: 57.44 / 1024 cells = 0.056 per cell. Sparse captures dominant attention positions, with −∞-masking introducing measurable divergence at low-mass cells.

Path to real speedup (reformulation for v0.8.10+)

Larger seq_len — at seq_len=64+, dense seq²·d MAC count vs sparse (seq · density · seq)·d lets the saved MACs dominate per-cell overhead
Precomputed substrate mask — (i, j) → fired/not table is identical across batches and only depends on seq_len; compute once
CSR / packed sparse format — replace dense [N×N] (most cells −∞) with list of (i, j, score) tuples + per-row prefix index
WGSL implementation — once shapes pass GPU threshold, port sparse path to compute kernel
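
The CSR item in the list above might look like the following in NumPy. This is a sketch of the general compressed-sparse-row layout under the assumption that every row has at least one fired cell; it is not the planned Rust or WGSL format.

```python
import numpy as np

def to_csr(scores):
    """Pack the finite cells of a dense score matrix into CSR-style arrays."""
    indptr, indices, data = [0], [], []
    for row in scores:
        cols = np.nonzero(np.isfinite(row))[0]
        indices.extend(cols.tolist())
        data.extend(row[cols].tolist())
        indptr.append(len(indices))      # per-row prefix index
    return np.array(indptr), np.array(indices), np.array(data)

def csr_row_softmax(indptr, data):
    """Softmax within each row segment; masked cells are simply absent."""
    out = np.empty_like(data)
    for r in range(len(indptr) - 1):
        seg = data[indptr[r]:indptr[r + 1]]
        e = np.exp(seg - seg.max())
        out[indptr[r]:indptr[r + 1]] = e / e.sum()
    return out

dense = np.array([[0.5, -np.inf, 1.0],
                  [-np.inf, 2.0, -np.inf],
                  [0.0, 0.0, -np.inf]])
indptr, indices, data = to_csr(dense)
probs = csr_row_softmax(indptr, data)
```

Storage and softmax work then scale with the number of fired cells rather than with N².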

Mechanism validated. Speedup is v0.8.10 work.

#2 d_model=128 larger-scale bench — in flight

Task #265 background bench at d_model=128, seq_len=32, ff=256, 400 steps × 3 seeds × 3 arms (L0 / B / B+Q6). 13+ minutes in at chapter write time; lands in v0.8.10 with the MH-at-128 datum needed for direct PyTorch L1-MH −8.94% parity.

The compounding architecture continues

v0.8.1 broadcast-backward unblocked S-MOD training
v0.8.4 fused AdamW dissolved 96× overhead
v0.8.5 multi-head substrate-K cross-validated
v0.8.7 four deferred items each TRIED
v0.8.8 Q6 post-training substrate alignment (8.3×) + JIT eligibility fix
v0.8.9 MH+Q6 compound (−3.57% Q6 in MH) + sparse kernel mechanism

Tests

1111/1111 OMC tests pass.

Files

omnimcode-core/src/interpreter.rs — TapeOp::SubstrateSparseScores + forward/backward
examples/prometheus_mh_q6_compound.omc — #3 4-arm A/B harness
examples/prometheus_sparse_attn_bench.omc — #1 dense-vs-sparse bench
experiments/prometheus_parity/V089_SPARSE_AND_MH_Q6.md — writeup

17 May 23:03

RandomCoder-lab

v0.8.8-q6-post-train-sparsity

c26ace8

v0.8.8 — Q6 training pushes attention 8.3× toward substrate positions

The big finding

Q6 training pushes attention 8.3× toward substrate-aligned positions. This flips the v0.8.7 #8 falsification — sparse attention via substrate distance IS viable, but only after Q6 training.

After 1000 Q6-fused training steps (d_model=32, seq_len=32):

arm	mass in substrate-close cells	cell fraction	ratio
baseline (no Q6), trained	4.82%	6.84%	0.70 (anti-correlated)
Q6 fused, trained	56.80%	6.84%	8.31×

A sparse kernel computing only substrate-close cells captures 56.8% of attention with 6.84% of compute. Real architecture-level "substrate is the architecture" claim, unlocked as a post-training inference optimization.
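
The ratio column is the attention mass inside the substrate-close cells divided by the fraction of cells they occupy. A small sketch of that metric follows, with a toy mask (the diagonal) standing in for the real substrate-close set; the function name is illustrative.

```python
import numpy as np

def substrate_concentration(attn, close_mask):
    """Return (mass fraction, cell fraction, ratio) for the masked cells."""
    mass_fraction = attn[close_mask].sum() / attn.sum()
    cell_fraction = close_mask.mean()
    return mass_fraction, cell_fraction, mass_fraction / cell_fraction

n = 32
uniform = np.full((n, n), 1.0 / n)          # attention with no preference at all
mask = np.zeros((n, n), dtype=bool)
mask[np.arange(n), np.arange(n)] = True     # toy "close" set: the diagonal
mass, cells, ratio = substrate_concentration(uniform, mask)

reported = 0.5680 / 0.0684                  # table values for the Q6 arm
```

Uniform attention gives a ratio of 1, so values below 1 (the baseline's 0.70) mean the close cells are avoided and values above 1 mean attention concentrates there.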

Mechanism

Q6 dampens large-magnitude query components via exp(-γ · log_φπfib(|q · scale| + 1)). Components whose substrate log-distance is small get less dampening, so they survive training and dominate the attention pattern. The substrate isn't directly constraining position — it's reshaping the gradient landscape so substrate-aligned positions win.

Implications

Sparse inference kernel: q[i] · k[j] only for substrate_dist(i, j) ≤ τ
~10× attention compute reduction at the cost of ~43% attention quality (a defensible inference-time tradeoff)
The PyTorch Q6 −12.15% finding may partially be substrate-position alignment in disguise

Plus 3 more findings

Infrastructure fix — JIT eligibility audit

fn_uses_collections in omnimcode-codegen skips JIT for fns touching arrays/dicts/strings. OMC_HBIT_JIT=1 no longer crashes on Prometheus. Wall-clock unchanged at d_model=256 (v0.8.4 already removed the overhead JIT would have compressed); unblocks JIT for any future tape-using workload.

Negative — substrate-quant 6-seed verifies as noise

The v0.8.7 single-seed "lower loss" was seed noise. Mean 2.365 vs baseline 2.337 (+1.2% worse) across 6 seeds × 300 steps with OMC_GPU_SUBSTRATE_QUANT_SCALE=4096. Training-time substrate quantization is a marginal regression at this scale.

Negative — substrate-aware param init falsified

Snap-to-attractor at init scale 1024/4096 gives +2.6%/+4.7% worse mean loss vs uniform random init (6 seeds × 300 steps). Starting on attractors gives less gradient info per step.

Methodology: each experiment ≤ 10 min, all four genuinely tried

#	finding	result
1	Q6 post-train sparsity	POSITIVE — 8.31× substrate concentration
2	substrate-quant 6-seed	NEGATIVE — seed noise verified
3	substrate-init A/B	NEGATIVE — falsified, +2.6/+4.7% worse
4	JIT eligibility audit	POSITIVE infra — fix landed, 1111/1111 pass

Two negatives + one massive positive + one infra fix. The "fail forward" discipline keeps producing useful data either way.

Compounding architecture

v0.8.1 fixed broadcast-backward (unblocked S-MOD training)
v0.8.4 fused AdamW (dissolved 96× overhead)
v0.8.5 multi-head substrate-K (architecturally needed for parity)
v0.8.7 tried 4 deferred items
v0.8.8 four more attempts; #1 unlocks future sparse inference

Tests

1111/1111 OMC tests pass.

Files

examples/prometheus_q6_post_train_sparsity.omc — Finding 1
examples/prometheus_substrate_quant_6seed.omc — Finding 2
examples/prometheus_substrate_init_xval.omc — Finding 3
omnimcode-codegen/src/lib.rs — Finding 4 (fn_uses_collections)
omnimcode-core/src/interpreter.rs — substrate_snap_matrix builtin
experiments/prometheus_parity/V088_FOUR_FINDINGS.md — writeup

17 May 22:14

RandomCoder-lab

v0.8.7-items-7-to-10-tried

dbfb19e

v0.8.7 — items #7-10 each tried: 2 viable, 1 falsified, 1 real bug

The v0.8.6 chapter scoped items #7-10 as "future chapters". The Stop hook on the goal correctly pushed back: scoping isn't trying. Each item now has the smallest meaningful attempt and a real measured result.

#7 substrate-quantized GPU weights — TRIED, math VIABLE

Boundary flag OMC_GPU_SUBSTRATE_QUANT=1 snaps each weight cell to its nearest Fibonacci attractor before the f32 GPU conversion.

scale	final loss	vs baseline 6.959
64	7.514	+8% worse (too coarse)
1024	6.537	within noise
4096	6.149	within noise (slightly lower)
65536	6.782	≈ baseline

Math is viable at scale ≥ 1024. Real bandwidth-saving u16/u8 packed WGSL storage is the deferred work — no longer blocked by feasibility.
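
The snapping step might be sketched as below. It assumes the attractors are Fibonacci numbers (including 0) applied to the scaled magnitude with the sign preserved; the actual boundary code behind `OMC_GPU_SUBSTRATE_QUANT` may round or tie-break differently, and the function names are hypothetical.

```python
import numpy as np

def fib_attractors(limit):
    """Distinct Fibonacci numbers from 0 up to the first one >= limit."""
    fibs = [0, 1]
    while fibs[-1] < limit:
        fibs.append(fibs[-1] + fibs[-2])
    return np.array(sorted(set(fibs)), dtype=np.float64)

def substrate_snap(weights, scale):
    """Snap |w|*scale to the nearest Fibonacci number, keep the sign, rescale."""
    scaled = np.abs(weights) * scale
    attractors = fib_attractors(scaled.max() + 1)
    nearest = attractors[np.abs(scaled[..., None] - attractors).argmin(axis=-1)]
    return np.sign(weights) * nearest / scale

w = np.array([0.06, -0.11, 0.5, 0.0])
snapped = substrate_snap(w, scale=64)   # snaps to 3/64, -8/64, 34/64, 0
```

Fibonacci numbers thin out as they grow, so a larger `scale` places small weights on a finer grid, which is consistent with scale=64 being too coarse in the table.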

#8 CRT-PE sparse attention — TRIED, HYPOTHESIS FALSIFIED at random init

Wrote /tmp/sparse_attn_test.omc to measure the fraction of softmax-attention mass that falls in substrate-close cells (substrate_dist ≤ 5 using moduli {5, 8, 13, 21}) for random q × CRT-PE k.

Result: 8.36% of attention mass in 6.84% of cells — essentially uniform.

Sample argmax positions:

row 0: argmax_j=31, substrate_dist=23 (FAR)
row 1: argmax_j=18, substrate_dist=24 (FAR)
row 4: argmax_j=15, substrate_dist=20 (FAR)

Most argmaxes are substrate-FAR. The "skip far pairs, they softmax to ~0" assumption is false for untrained queries.

Reformulations possible (each is its own chapter): post-training test, Fibonacci-block magnitude sparsity, substrate-aligned q training.

#9 LLVM JIT for tape paths — TRIED, real integration bug

Built --features "gpu llvm-jit" and ran with OMC_HBIT_JIT=1. JIT compiled several prom_* fns successfully, then crashed:

Error: arr_len requires an array
  at prom_crt_pe_matrix (769:32)

JIT'd return values don't respect OMC Value semantics for array-shaped returns crossing back into tree-walk callers. Real integration bug.

Reformulation: JIT-eligibility audit. Mark fns whose return value goes into tree-walk array ops as @no_jit. ~1-2 hours focused. Not impossible, but unsafe to ship without fix.

#10 f16/bfloat16 GPU paths — TRIED, math VIABLE

OMC_GPU_SIMULATE_F16=1 truncates the bottom 13 mantissa bits of each f32 cell before the wgpu matmul, simulating f16's 10-bit mantissa precision.

	final loss	wall-clock
f32 baseline	6.959	0.255 s/step
f16-simulated	6.378	0.254 s/step

Training doesn't explode at f16 precision. The 2× bandwidth payoff still needs a real WGSL f16 kernel, an f64→f16 boundary, and loss scaling; the passing math test unblocks that work.
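
The mantissa truncation can be reproduced in NumPy by masking the float32 bit pattern. This is a sketch of the bit trick only: it leaves the 8-bit exponent untouched, so it models f16's mantissa precision but not its reduced range, and `simulate_f16_mantissa` is an illustrative name.

```python
import numpy as np

def simulate_f16_mantissa(x):
    """Zero the bottom 13 of float32's 23 mantissa bits, leaving 10."""
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    return (bits & np.uint32(0xFFFFE000)).view(np.float32)

x = np.array([1.0, 3.14159265, -0.1, 1e-3], dtype=np.float32)
y = simulate_f16_mantissa(x)
rel_err = np.abs((y - x) / x)   # truncation error stays below 2**-10 per value
```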

The honest scorecard

#	item	result	deferred work
7	substrate-quantized weights	TRIED, VIABLE	u16/u8 packed WGSL storage
8	CRT-PE sparse attention	TRIED, FALSIFIED at random init	reformulate hypothesis (post-trained? magnitude?)
9	LLVM JIT for tape	TRIED, real bug	JIT eligibility audit
10	f16/bf16 GPU	TRIED, VIABLE	real WGSL f16 + loss scaling

Two viable-but-needs-more, one falsified-but-reformulable, one blocked-by-bug. All four genuinely TRIED. The Stop hook was right.

1111/1111 OMC tests pass.

17 May 22:05

RandomCoder-lab

v0.8.6-accel-scaffold

ff46dac

v0.8.6 — #3 softmax accel scaffold + survey for #7-10

Scope

Item #3 in the v0.8.5 optimization plan: route more tape ops through GPU. Shipped as scaffolding — the hook is in place, the dispatch consults it, the binary registers a stub. At current Prometheus shapes the stub declines (default threshold = 1M cells); larger-scale runs and future hardware can opt in via env.

Plus an honest survey of items #7-10 (each a future chapter rather than rushed half-implementations in this one).

What's new

SoftmaxAccelerator hook in omnimcode-core::accel, mirroring MatmulAccelerator:

```rust
pub type SoftmaxAccelerator = Box<
    dyn Fn(usize, usize, &[f64]) -> Option<Result<Vec<f64>, String>>
        + Send + Sync,
>;
pub fn register_softmax_accelerator(f: SoftmaxAccelerator) -> Result<(), &'static str>;
```

tape_softmax interpreter dispatch consults the hook first, falls through to the existing CPU triple-pass when the hook declines. omnimcode-cli registers a stub at startup that declines all calls under OMC_GPU_SOFTMAX_MIN_CELLS (default 1,000,000) — high enough that no current Prometheus path opts in.
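
The decline-and-fall-through dispatch can be illustrated with a Python analogue of the hook. It is loosely modeled on the Rust type above: returning `None` stands for declining, errors are not modeled, and the behavior above the threshold is a placeholder because the notes only describe the declining case.

```python
import math
from typing import Callable, List, Optional

SoftmaxAccel = Callable[[int, int, List[float]], Optional[List[float]]]
_accelerator: Optional[SoftmaxAccel] = None
MIN_CELLS = 1_000_000   # mirrors the OMC_GPU_SOFTMAX_MIN_CELLS default

def register_softmax_accelerator(f: SoftmaxAccel) -> None:
    global _accelerator
    if _accelerator is not None:
        raise RuntimeError("accelerator already registered")
    _accelerator = f

def cpu_softmax(rows, cols, data):
    """Row-wise softmax over a flat row-major buffer."""
    out = []
    for r in range(rows):
        row = data[r * cols:(r + 1) * cols]
        m = max(row)
        exps = [math.exp(v - m) for v in row]
        s = sum(exps)
        out.extend(e / s for e in exps)
    return out

def softmax(rows, cols, data):
    """Consult the hook first; fall through to the CPU path if it declines."""
    if _accelerator is not None:
        result = _accelerator(rows, cols, data)
        if result is not None:
            return result
    return cpu_softmax(rows, cols, data)

def stub(rows, cols, data):
    if rows * cols < MIN_CELLS:
        return None                       # decline: too small to be worth dispatching
    return cpu_softmax(rows, cols, data)  # placeholder for a real accelerated path

register_softmax_accelerator(stub)
out = softmax(2, 2, [0.0, 0.0, 1.0, 1.0])
```

Keeping the decline decision inside the hook means the interpreter needs no knowledge of thresholds or hardware.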

Honest framing

At current Prometheus shapes we exercise (d_model=256, seq_len=64, scores 64×64), per-row softmax is memory-bound and tiny (4k cells = microseconds of CPU work). GPU buffer alloc + dispatch overhead would dominate any kernel speedup. The scaffold lives so:

Larger-scale runs (seq_len=512+, d_model=1024+) can opt in by setting OMC_GPU_SOFTMAX_MIN_CELLS lower
Future hardware with cheaper dispatch (Apple M-series unified memory, NVIDIA persistent kernels) can register a non-stub accelerator
The same pattern extends to LayerNorm, element-wise, etc. — accel.rs is the precedent

This is the right size of attempt for an item whose payoff at current scales is small but whose architectural slot matters for the trajectory.

What's deferred to v0.8.7+

experiments/prometheus_parity/V086_OPTIMIZATION_SURVEY.md records the scope for each remaining item:

#7 substrate-quantized GPU weights — own chapter (~half-day). Encode f32 as (u8 attractor_index, i16 delta), dequant on GPU. Substrate at the data layer where it actually lives.
#8 CRT-PE-keyed sparse attention matmul — own chapter. Sparse WGSL kernel + sparse-aware backward.
#9 omnimcode-codegen LLVM JIT for tape paths — own chapter. Needs Prometheus-fn JIT-compatibility audit.
#10 f16/bf16 GPU paths — own chapter. New WGSL + loss-scaling logic for training stability.

Tests

1111/1111 OMC tests pass.

17 May 22:01

RandomCoder-lab

v0.8.5-substrate-builtins-mh

34f61fa

v0.8.5 — substrate ops, embedding lookup, cross-entropy fused, multi-head substrate-K

Five v0.8.5 optimization items shipped. The compound v0.8.4 + v0.8.5 effect: training-loop hot path is now fully in Rust builtins; the math-equivalent multi-head substrate-K attention is available; the architecture is positioned for v0.8.6+ to push remaining items (substrate-quantized GPU weights, sparse attention, etc.).

What's new

#1 `tape_cross_entropy_batch` — fused tape op

Per-batch cross-entropy as one tape node. Closed-form (p - one_hot)/N backward replaces the chain through 5 intermediate nodes (softmax → log → mask → mul → sum). Wins materialize at large vocab.
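
The closed-form gradient can be checked numerically. The sketch below is plain NumPy, not the `tape_cross_entropy_batch` builtin, and it assumes mean reduction over the batch, which is what the 1/N factor suggests.

```python
import numpy as np

def cross_entropy_batch(logits, targets):
    """Mean cross-entropy and its gradient w.r.t. logits in closed form."""
    n = logits.shape[0]
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_p = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    loss = -log_p[np.arange(n), targets].mean()
    grad = np.exp(log_p)
    grad[np.arange(n), targets] -= 1.0    # p - one_hot
    return loss, grad / n

rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 7))
targets = np.array([0, 3, 6, 2])
loss, grad = cross_entropy_batch(logits, targets)

# Central finite difference on a single logit as a spot check.
eps = 1e-6
bumped = logits.copy()
bumped[1, 3] += eps
dipped = logits.copy()
dipped[1, 3] -= eps
numeric = (cross_entropy_batch(bumped, targets)[0]
           - cross_entropy_batch(dipped, targets)[0]) / (2 * eps)
```

One fused node with this backward avoids storing and traversing the five intermediate results of the unfused chain.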

#2 `tape_embedding_lookup` — direct row gather

Replaces prom_embedding_batch's OMC-built [N, vocab] one-hot + matmul chain with a direct row gather. Backward scatters rows of dy back into the table gradient (same gradient as the one-hot @ table chain). Wins scale with vocab size.
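
The equivalence between the row gather and the one-hot matmul, including the scatter-add backward, can be verified in a few lines of NumPy. The helper names are illustrative and are not the OMC builtins.

```python
import numpy as np

def embedding_forward(table, ids):
    return table[ids]                 # direct row gather

def embedding_backward(table_shape, ids, dy):
    grad = np.zeros(table_shape)
    np.add.at(grad, ids, dy)          # scatter-add, accumulating repeated ids
    return grad

rng = np.random.default_rng(0)
vocab, d = 10, 4
table = rng.normal(size=(vocab, d))
ids = np.array([2, 7, 2, 0])          # id 2 appears twice on purpose
dy = rng.normal(size=(len(ids), d))

one_hot = np.eye(vocab)[ids]          # [N, vocab] reference path
ref_out = one_hot @ table
ref_grad = one_hot.T @ dy
```

The gather path never materializes the [N, vocab] matrix, which is why the saving grows with vocabulary size.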

#4 OMC_VM=1 negative finding

Measured: 0.662 s/step at d_model=256 (was 0.661 tree-walk). No win once hot paths are in Rust builtins. Not pursued further for Prometheus — the bytecode VM optimizes basic-block dispatch, but the hot work is now happening below that layer.

#5 Multi-head substrate-K attention — `prom_attention_substrate_k_mh_*`

Math-equivalent "sum of per-head W_O projections" form (avoids needing a tape_concat op). All single-head toggles (smod_alpha, v_resample_scale, q6_mode) honored per-head with same defaults.
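
The "sum of per-head W_O projections" form relies on a block-matrix identity: multiplying concatenated head outputs by W_O equals summing each head times its own row block of W_O. A NumPy check at the shapes used in the cross-validation (4 heads, d_head=8), with random stand-ins for the head outputs:

```python
import numpy as np

rng = np.random.default_rng(0)
seq, n_heads, d_head = 5, 4, 8
d_model = n_heads * d_head
heads = [rng.normal(size=(seq, d_head)) for _ in range(n_heads)]
W_O = rng.normal(size=(d_model, d_model))

# Standard form: concatenate head outputs, then apply one output projection.
concat_out = np.concatenate(heads, axis=1) @ W_O

# Equivalent form: each head multiplies its own row block of W_O; sum the results.
sum_out = sum(h @ W_O[i * d_head:(i + 1) * d_head] for i, h in enumerate(heads))
```

Because the two forms agree, an autograd tape without a concat op can still express standard multi-head attention.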

Cross-validation at d_model=32, 4 heads (d_head=8), 400 steps, 3 seeds:

	mean tail loss	wins
SH (single head)	2.0047	—
MH (4 heads)	1.9998	2/3

Δ = −0.25%, directionally consistent with PyTorch's L1-MH −8.94%. Effect grows with capacity; same code path supports the PyTorch −12.15% Q6-MH finding once you turn on q6_mode="fused".

#6 `tape_substrate_resample` — fused tape op

Skips tape_value → modulator_matrix → tape_const round-trip (which was extracting 16k f64s at d_model=256 seq_len=64 per call). Pairs with the substrate_resample_matrix Rust builtin from v0.8.4. Same math.

Honest framing

Wall-clock at d_model=256 is essentially unchanged from v0.8.4 for these five items in isolation — that scale was already AdamW-bound and the OMC overhead was already removed. These wins materialize when:

Vocab grows large — cross-entropy and embedding lookup get O(vocab) cheaper
Multi-head trained — the architectural win + the OMC-overhead-gone substrate-attention compose
Bigger d_model — fused substrate_resample skips proportionally more I/O

The MH cross-validation result is the load-bearing finding here: the PyTorch L1-MH win cross-validates in OMC's tape autograd.

What's still on the v0.8.5 list

#3 Route more tape ops through GPU — modest win at current scales (memory-bound ops aren't matmul-class), scaffold to be added in v0.8.6
#7 Substrate-quantized GPU weights — own chapter
#8 CRT-PE-keyed sparse attention matmul — own chapter
#9 LLVM JIT for tape paths — own chapter
#10 f16/bf16 GPU paths — own chapter

Tests

1111/1111 OMC tests pass.

Files

omnimcode-core/src/interpreter.rs — tape_cross_entropy_batch, tape_embedding_lookup, tape_substrate_resample builtins + backwards
examples/lib/prometheus.omc — wrappers + prom_attention_substrate_k_mh_*
examples/prometheus_mh_xval.omc — SH vs MH cross-validation harness

Releases: RandomCoder-lab/OMC