Merged
109 commits
0264f7a
transformerless_lm: progressive Fibonacci-K growth — substrate curric…
claude May 21, 2026
dd20b3c
transformerless_lm: progressive K-growth results — 4% faster, +0.11 v…
claude May 21, 2026
dcabfd0
transformerless_lm: recursive substrate primitives (Idea 1 + Idea 4)
claude May 21, 2026
719f486
transformerless_lm: substrate-recursive wins — golden-ratio momentum …
claude May 21, 2026
f8281db
transformerless_lm: d-scale ablation — does the substrate composition…
claude May 21, 2026
e45d27e
transformerless_lm: d-scale ablation partial results — gap non-monotonic
claude May 21, 2026
d674366
transformerless_lm: add OMC codebase corpus for diversity test
claude May 21, 2026
071a6d6
transformerless_lm: d-scaling on OMC corpus — gap halves on richer data
claude May 21, 2026
52cc0e2
transformerless_lm: 20K-step OMC bench + text sampling — the capacity…
claude May 21, 2026
b49e465
transformerless_lm: 20K-step OMC bench falsifies "free parity" hypoth…
claude May 21, 2026
d867ce9
transformerless_lm: K-sweep at d=128 OMC — does K growth close the gap?
claude May 21, 2026
7ae56ec
transformerless_lm: K must be Fibonacci — substrate-canonical K sweep
claude May 21, 2026
48134e9
transformerless_lm: K-shrink schedule — hierarchical substrate compre…
claude May 21, 2026
d3d3bbd
transformerless_lm: K-shrink supports tinyshakespeare via --corpus flag
claude May 21, 2026
aa49f43
transformerless_lm: K-shrink TinyShakespeare results — Pareto win, no…
claude May 21, 2026
bde1b76
transformerless_lm: sample_K_shrink — does the shrink model speak Sha…
claude May 21, 2026
960730a
transformerless_lm: --skip-dense flag for sample_K_shrink — shrink-on…
claude May 21, 2026
0f4b549
transformerless_lm: K-shrink text sample on TS — gibberish at d=128 (…
claude May 21, 2026
31c50ce
transformerless_lm: larger K-shrink — tier-walk schedule, K=144 to K=3
claude May 21, 2026
809dae7
transformerless_lm: larger K-shrink (144→3) sample — terminal K shape…
claude May 21, 2026
7cb6f41
transformerless_lm: substrate-aware loss — the missing formulation
claude May 21, 2026
c3111cd
transformerless_lm: word-level tokenizer for atomic semantic units
claude May 21, 2026
a4203b8
transformerless_lm: ce_fft loss WINS — -6.09% val vs standard CE
claude May 21, 2026
26e8a96
transformerless_lm: THE MYTHOS — three sibling models, three poetic t…
claude May 21, 2026
80e519a
transformerless_lm: substrate-aware activation function — every layer…
claude May 21, 2026
8a5e5be
transformerless_lm: reverse the activation math — reciprocal Fibonacc…
claude May 21, 2026
bc609ca
transformerless_lm: skip gelu_baseline rerun, use cached reference 2.…
claude May 21, 2026
caa2de7
transformerless_lm: PhiPiFibActivation — smooth substrate via F(k)/ph…
claude May 21, 2026
7021ce4
transformerless_lm: PhiPiFib activation falsified — +2.3% worse than …
claude May 21, 2026
cbcd804
transformerless_lm: BinetFibActivation — pure substrate replacement f…
claude May 21, 2026
a98bc29
transformerless_lm: Binet activation falsified — +15.7% worse than GELU
claude May 21, 2026
05ee03a
transformerless_lm: substrate-asymmetric activation — peak-decay nega…
claude May 21, 2026
1b6feb0
transformerless_lm: NaN-safe substrate activation (log-domain phi^x)
claude May 21, 2026
dfd23f9
transformerless_lm: substrate-asymmetric activation lands within 1% o…
claude May 21, 2026
80eab0e
transformerless_lm: drop the activation clamp — log-domain handles al…
claude May 21, 2026
46aa622
transformerless_lm: multi-tier substrate activation — F(k)/phi^(pi*k)…
claude May 21, 2026
6eaf45c
transformerless_lm: multi-tier substrate activation — +0.80% vs GELU …
claude May 21, 2026
fabd863
transformerless_lm: refined multi-tier — learnable weights + tanh sat…
claude May 21, 2026
0b2c515
transformerless_lm: SUBSTRATE ACTIVATION BEATS GELU — substrate_neg_m…
claude May 21, 2026
e0a2616
transformerless_lm: SubstrateNegMultiAdvanced — R2+R4+R5+R6 merged
claude May 21, 2026
4cad87e
transformerless_lm: AdvancedV2 — R4+R5 reformulated after first-pass …
claude May 21, 2026
c48e055
transformerless_lm: AdvancedV2 result + LN/softmax A/B bench
claude May 21, 2026
ad867b9
transformerless_lm: pivot LN/softmax bench to V2 activation
claude May 21, 2026
51e9001
transformerless_lm: refined LN + softmax — full L1 / tier mixture
claude May 21, 2026
ff06ea1
transformerless_lm: fix LN sparse-grad + drop softmax exp
claude May 22, 2026
0bf26b1
transformerless_lm: calibrated L1LN remix — match LN scale at init
claude May 22, 2026
beda9ad
transformerless_lm: skip baseline + failed v1 in bench
claude May 22, 2026
3eebf02
transformerless_lm: L1 LN dead, isolate softmax test
claude May 22, 2026
19fd3a2
transformerless_lm: SubstrateBlendedSoftmax — learnable blend at init=0
claude May 22, 2026
3d211e5
transformerless_lm: substrate-similarity attention bench
claude May 22, 2026
bb84009
transformerless_lm: SUBSIM ATTENTION BEATS GELU BY 2.02%
claude May 22, 2026
a00c353
transformerless_lm: phase 2 — substrate harmony loss + self-recursion
claude May 22, 2026
180a73f
transformerless_lm: phase 2 bench — lighter config for fast iteration
claude May 22, 2026
38127d7
transformerless_lm: parametric substrate mutation + K-harmony sync
claude May 22, 2026
b073705
transformerless_lm: data-guided substrate mutation — 3.4010 tiny-data…
claude May 22, 2026
160231b
transformerless_lm: inference-time refinement loop (substrate recursion)
claude May 22, 2026
030cbfe
transformerless_lm: creativity score reveals refinement is working
claude May 22, 2026
16611ef
transformerless_lm: creativity-gated refinement loop
claude May 22, 2026
99efc88
transformerless_lm: staged refinement reveals harmony/creativity tension
claude May 22, 2026
3f0b22d
transformerless_lm: self-distillation lifts creativity ceiling 0.67 -…
claude May 22, 2026
4ddb289
transformerless_lm: growing active_base — self-distillation ratchet w…
claude May 22, 2026
7e8b6de
transformerless_lm: anti-gibberish creativity + self-distillation col…
claude May 22, 2026
3a1414f
transformerless_lm: three-gate self-distillation against collapse
claude May 22, 2026
bcd2e08
transformerless_lm: SubstrateEmbedding -- substrate-canonical char ma…
claude May 22, 2026
58b4b79
transformerless_lm: SubstrateEmbedding helps val, not creativity ceiling
claude May 22, 2026
52183df
transformerless_lm: substrate word tokenizer + embedding -> real words
claude May 22, 2026
ba57e2c
transformerless_lm: substrate sampling + lambda=1/phi^pi -> Shakespea…
claude May 22, 2026
e4a5bca
transformerless_lm: substrate sampling + recency + bigram blend -> 0.599
claude May 22, 2026
2fe7406
transformerless_lm: pure substrate-derived bigram - data still needed
claude May 22, 2026
3d3cc07
transformerless_lm: pure-substrate framework -> 10+ Shakespeare chara…
claude May 22, 2026
b2f7146
transformerless_lm: trigram + syntax gate -> 0.616 peak, "the silver …
claude May 22, 2026
1c81947
transformerless_lm: refined + graduated multi-back -> 0.640 peak + Ri…
claude May 22, 2026
e0d08b8
transformerless_lm: sentence boundary + Fib seq_len -> 0.6444 peak
claude May 22, 2026
c24bcf3
transformerless_lm: substrate anti-stagnation -> 0.666 peak
claude May 22, 2026
9ea77c0
transformerless_lm: nested recency F(k)/phi^(pi*k) -> 0.661
claude May 22, 2026
92afe8c
transformerless_lm: cross-sentence subject threading
claude May 22, 2026
fb96769
transformerless_lm: golden-phase primitive (2π/φ² rhythm)
claude May 22, 2026
208624f
transformerless_lm: v55 results (subject threading)
claude May 22, 2026
a2944ff
transformerless_lm: theme momentum + v56 results
claude May 22, 2026
2663c7f
transformerless_lm: vocab curriculum (F-tier expansion)
claude May 22, 2026
1aeee21
transformerless_lm: iambic stress primitive (period 2 = F(3))
claude May 22, 2026
0dfa843
transformerless_lm: disable theme + vocab curriculum for v59
claude May 22, 2026
cf10678
transformerless_lm: v59 results (iambic + threading)
claude May 22, 2026
ef72160
transformerless_lm: symbolic primitives (substitution + reference)
claude May 22, 2026
4e90138
transformerless_lm: disable symbolic substitution for v61
claude May 22, 2026
ec1c229
transformerless_lm: v61 results (reference chain)
claude May 22, 2026
7695674
transformerless_lm: 3 new language primitives
claude May 22, 2026
b81287b
transformerless_lm: v62 results (full language stack)
claude May 22, 2026
48fa059
transformerless_lm: refine 4 language primitives
claude May 22, 2026
36b411b
transformerless_lm: damp rhyme boost by F(3)=2
claude May 22, 2026
4596e06
transformerless_lm: structural refinement of 3 primitives
claude May 22, 2026
c2894bc
transformerless_lm: thread punct_mask + newline_mask through refine p…
claude May 22, 2026
dcaf6a6
transformerless_lm: v65 results -- second-highest peak ever
claude May 22, 2026
e607e33
transformerless_lm: phonics layer -- word_spacing + pronounceability
claude May 22, 2026
3f331c4
transformerless_lm: relax unpronounceable mask + fix refine sigs
claude May 22, 2026
5a1fe73
transformerless_lm: anti-char-cascade primitive
claude May 22, 2026
da02eae
transformerless_lm: add unpronounceable_mask to _single_stage_refine sig
claude May 22, 2026
ec7d59e
transformerless_lm: v68 results -- NEW ALL-TIME RECORD
claude May 22, 2026
b78c4aa
transformerless_lm: ABCD revisions on top of v68 record
claude May 22, 2026
610cedc
transformerless_lm: v69 results -- NEW MEAN RECORD
claude May 22, 2026
5aae99b
transformerless_lm: 3 refinements on v69 ABCD stack
claude May 22, 2026
91c484f
transformerless_lm: omniweight -- shared log-pressure ledger
claude May 22, 2026
eaa8682
transformerless_lm: omniweight fluid form (tanh-backed standard)
claude May 22, 2026
b107bb8
transformerless_lm: v72 results -- fluid omniweight NEW RECORDS
claude May 22, 2026
8d72769
transformerless_lm: split-brain omniweight (math + lang hemispheres)
claude May 22, 2026
e1269d7
transformerless_lm: split-brain mixer -> golden-weighted arithmetic
claude May 22, 2026
c2bca6b
transformerless_lm: split-brain resonance-aware mixer
claude May 22, 2026
f696bb3
transformerless_lm: split-brain rank-modulated mixer
claude May 22, 2026
6709e74
transformerless_lm: v76 results -- NEW PEAK RECORD via rank-modulated…
claude May 22, 2026
453 changes: 453 additions & 0 deletions experiments/transformerless_lm/activations_substrate.py

Large diffs are not rendered by default.

6 changes: 6 additions & 0 deletions experiments/transformerless_lm/corpus.py
@@ -46,13 +46,19 @@ def make_dataset(seq_len: int = 64, source: str = "embedded"):
        fast smoke tests and the original tiny-bench)
      - "tinyshakespeare": load tinyshakespeare.txt (1.1 MB), used
        by the scale experiment
      - "omc": load omc_codebase.txt (~4 MB of OMC source: .py/.rs/.md/.toml).
        More diverse than English prose; 210 unique chars.
    """
    import os
    import torch
    if source == "tinyshakespeare":
        path = os.path.join(os.path.dirname(__file__), "tinyshakespeare.txt")
        with open(path, "r") as f:
            text = f.read()
    elif source == "omc":
        path = os.path.join(os.path.dirname(__file__), "omc_codebase.txt")
        with open(path, "r") as f:
            text = f.read()
    else:
        text = CORPUS
    chars = sorted(set(text))
88 changes: 88 additions & 0 deletions experiments/transformerless_lm/corpus_word.py
@@ -0,0 +1,88 @@
"""Word-level tokenizer for TinyShakespeare.

The char-level vocab (65 chars) requires the model to learn that
letters form words before it can learn word structure. Word-level
tokenization gives the model atomic semantic units directly: each
per-step prediction is a whole WORD, not a letter.

Text is split into runs of letters, runs of digits, single punctuation
characters, and runs of whitespace. Punctuation and whitespace are kept
as separate tokens (so 'ROMEO:' becomes ['romeo', ':']). Alphabetic
tokens are lowercased to keep the vocab small.

For TinyShakespeare (1.1 MB) the word vocab is roughly 25K unique
tokens, much larger than 65 chars, but each token carries more
semantic weight per step.
"""

import os
import re

import torch


_TOKEN_PATTERN = re.compile(r"[A-Za-z]+|[0-9]+|[^A-Za-z0-9\s]|\n+|\s+")


def tokenize_text(text: str) -> list[str]:
    """Split text into word-like tokens. Runs of newlines become their
    own tokens so the model can learn line structure; a newline that
    directly follows other whitespace is absorbed into that whitespace
    token instead."""
    tokens = _TOKEN_PATTERN.findall(text)
    # Lowercase alphabetic tokens to shrink vocab. Keep punctuation as-is.
    return [t.lower() if t.isalpha() else t for t in tokens]


def make_word_dataset(source: str = "tinyshakespeare"):
    """Returns (vocab, stoi, itos, encoded) for word-level tokenization.

    vocab:   list of unique tokens, sorted
    stoi:    token -> int
    itos:    int -> token
    encoded: 1-D int tensor of token ids
    """
    base = os.path.dirname(__file__)
    if source == "tinyshakespeare":
        path = os.path.join(base, "tinyshakespeare.txt")
    elif source == "omc":
        path = os.path.join(base, "omc_codebase.txt")
    else:
        raise ValueError(f"unknown source: {source}")
    # Explicit encoding so the vocab does not depend on the platform
    # default (assumes the corpus files are UTF-8).
    with open(path, encoding="utf-8") as f:
        text = f.read()
    tokens = tokenize_text(text)
    vocab = sorted(set(tokens))
    stoi = {t: i for i, t in enumerate(vocab)}
    itos = {i: t for t, i in stoi.items()}
    encoded = torch.tensor([stoi[t] for t in tokens], dtype=torch.long)
    return vocab, stoi, itos, encoded


def detokenize(token_ids, itos) -> str:
    """Approximate inverse of tokenize_text. Reconstructs text by
    joining tokens; newline/whitespace tokens are emitted as-is so the
    line structure is preserved in the output. Lowercasing is not
    undone, and a space is inserted between directly adjacent
    alphanumeric tokens (so 'abc123' comes back as 'abc 123')."""
    out = []
    prev_alnum = False
    for tid in token_ids:
        t = itos[int(tid)]
        # Add a space between alphanumeric runs; whitespace/newline
        # and punctuation tokens are emitted directly.
        if t.isalnum():
            if prev_alnum:
                out.append(" ")
            out.append(t)
            prev_alnum = True
        else:
            out.append(t)
            prev_alnum = False
    return "".join(out)


if __name__ == "__main__":
    for src in ("tinyshakespeare", "omc"):
        vocab, stoi, itos, enc = make_word_dataset(src)
        print(f"{src}:")
        print(f"  total tokens: {enc.numel():,}")
        print(f"  unique vocab: {len(vocab):,}")
        sample = detokenize(enc[:30].tolist(), itos)
        print(f"  first 30 detok: {sample!r}")
        print()
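The tokenizer and detokenizer above can be exercised without the corpus files or torch. The following is a minimal sketch, assuming the same regex as `_TOKEN_PATTERN`; `detokenize_tokens` is a hypothetical helper that works on token strings rather than ids, and the sample strings are made up for illustration. It suggests that the round trip reproduces the lowercased input for ordinary prose, but not when letters and digits are directly adjacent.

```python
import re

# Same pattern as _TOKEN_PATTERN in corpus_word.py.
_TOKEN_PATTERN = re.compile(r"[A-Za-z]+|[0-9]+|[^A-Za-z0-9\s]|\n+|\s+")


def tokenize_text(text):
    """Split into letter runs, digit runs, punctuation, and whitespace."""
    tokens = _TOKEN_PATTERN.findall(text)
    return [t.lower() if t.isalpha() else t for t in tokens]


def detokenize_tokens(tokens):
    """Hypothetical string-level variant of detokenize (no itos lookup)."""
    out = []
    prev_alnum = False
    for t in tokens:
        if t.isalnum():
            if prev_alnum:
                out.append(" ")  # separate directly adjacent alnum tokens
            out.append(t)
            prev_alnum = True
        else:
            out.append(t)
            prev_alnum = False
    return "".join(out)


# Illustrative input, not taken from the corpus file.
sample = "ROMEO:\nBut soft, what light?"
tokens = tokenize_text(sample)
round_trip = detokenize_tokens(tokens)

# Letters directly followed by digits are split into two alnum tokens,
# so detokenizing inserts a space that was not in the input.
mixed_tokens = tokenize_text("abc123")
mixed_round_trip = detokenize_tokens(mixed_tokens)

print(tokens)
print(repr(round_trip))
print(mixed_tokens, repr(mixed_round_trip))
```

Because whitespace is kept as explicit tokens, the space-insertion branch only fires for adjacent alphanumeric tokens, which is why ordinary text survives the round trip apart from case.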
247 changes: 247 additions & 0 deletions experiments/transformerless_lm/creativity_score.py
@@ -0,0 +1,247 @@
"""Shakespeare-aware creativity scoring.

Replaces val=CE-on-next-token (which only rewards exact reproduction)
with metrics that measure whether GENERATED text is Shakespeare-LIKE
without being identical:

  - n-gram overlap: fraction of n-char windows in generated text that
    appear ANYWHERE in the corpus. Measures Shakespearean character
    patterns without an exact-word requirement.
  - vocab overlap: fraction of generated tokens (whitespace-separated)
    that match corpus vocabulary. Rewards real English/Shakespeare
    words even if they are not in the same sentence.
  - line structure: line count, mean and std of line length, plus the
    L1 distance between generated and corpus line-length histograms.
    Captures stanza/line-break patterns.
  - vowel-consonant transition rate: English alternates v/c; random
    text doesn't. Scores the alternation pattern.

Use these to evaluate the creative output of a substrate-aligned model.
A model that produces statistically-Shakespearean GIBBERISH gets ~0;
a model that produces creative but recognizable English scores high.
"""

import string
from collections import Counter




VOWELS = set("aeiouAEIOU")
LETTERS = set(string.ascii_letters)
WHITESPACE = set(" \n\t")


def char_ngram_overlap(generated: str, corpus_text: str, n: int) -> float:
    """Fraction of n-char windows in generated that appear in corpus.
    Higher = more Shakespearean char-pattern overlap."""
    if len(generated) < n:
        return 0.0
    corpus_ngrams = set(corpus_text[i:i+n] for i in range(len(corpus_text) - n + 1))
    gen_ngrams = [generated[i:i+n] for i in range(len(generated) - n + 1)]
    if not gen_ngrams:
        return 0.0
    matches = sum(1 for g in gen_ngrams if g in corpus_ngrams)
    return matches / len(gen_ngrams)


def vocab_overlap(generated: str, corpus_text: str) -> float:
    """Fraction of generated 'words' (whitespace-split) that appear in
    the corpus vocabulary. Punctuation stripped for comparison."""
    def clean(s):
        return s.lower().strip(string.punctuation)
    corpus_vocab = set(clean(w) for w in corpus_text.split() if clean(w))
    gen_words = [clean(w) for w in generated.split() if clean(w)]
    if not gen_words:
        return 0.0
    matches = sum(1 for w in gen_words if w in corpus_vocab)
    return matches / len(gen_words)


def line_structure_stats(generated: str) -> dict:
    """Line-level statistics: line count, mean line length, std line
    length. Compare to corpus to see if the model matches Shakespeare's
    typical line structure."""
    lines = [ln for ln in generated.split("\n") if ln.strip()]
    if not lines:
        return {"n_lines": 0, "mean_line_len": 0.0, "std_line_len": 0.0}
    lengths = [len(ln) for ln in lines]
    mean = sum(lengths) / len(lengths)
    var = sum((L - mean) ** 2 for L in lengths) / len(lengths)
    return {"n_lines": len(lines),
            "mean_line_len": mean,
            "std_line_len": var ** 0.5}


def vc_alternation_rate(generated: str) -> float:
    """Vowel-consonant alternation rate. English alternates v/c more
    often than random text. Returns the fraction of adjacent letter
    pairs that are (v,c) or (c,v) -- alternating, not same class."""
    letters = [c for c in generated if c in LETTERS]
    if len(letters) < 2:
        return 0.0
    alts = 0
    for i in range(len(letters) - 1):
        a, b = letters[i] in VOWELS, letters[i+1] in VOWELS
        if a != b:
            alts += 1
    return alts / (len(letters) - 1)


def line_length_match(generated: str, corpus_text: str) -> float:
    """How close is the generated line-length distribution to the
    corpus's? L1 distance over normalized histograms (lower = closer
    to Shakespeare's line structure)."""
    def hist(text, max_len=80):
        lines = [ln for ln in text.split("\n") if ln.strip()]
        h = [0] * (max_len + 1)
        for ln in lines:
            L = min(len(ln), max_len)
            h[L] += 1
        total = sum(h) or 1
        return [x / total for x in h]
    gen_h = hist(generated)
    corp_h = hist(corpus_text)
    return sum(abs(g - c) for g, c in zip(gen_h, corp_h))


def real_word_fraction(generated: str, corpus_text: str,
                       min_word_len: int = 3) -> float:
    """Fraction of generated 'words' of length >= min_word_len that
    appear in the corpus vocabulary. The strict gate against gibberish:
    'fan' counts as real if it occurs anywhere in the corpus, 'xqrt'
    does not. Short tokens (1-2 chars) are excluded because they're
    noise-prone.
    """
    def clean(s):
        return s.lower().strip(string.punctuation)
    corpus_vocab = set(clean(w) for w in corpus_text.split() if clean(w))
    gen_words = [clean(w) for w in generated.split() if clean(w)]
    long_words = [w for w in gen_words if len(w) >= min_word_len]
    if not long_words:
        return 0.0
    real = sum(1 for w in long_words if w in corpus_vocab)
    return real / len(long_words)


def common_word_presence(generated: str, corpus_text: str,
                         top_k: int = 50) -> float:
    """Fraction of the corpus's top-K most-common words that appear in
    the generated text (compared after lowercasing and stripping
    punctuation). This is the strongest anti-gibberish signal:
    Shakespeare uses 'the', 'and', 'of', 'my', 'I' frequently;
    gibberish doesn't.
    """
    def clean(s):
        return s.lower().strip(string.punctuation)
    corpus_words = [clean(w) for w in corpus_text.split() if clean(w)]
    corpus_freq = Counter(corpus_words)
    top_words = set(w for w, _ in corpus_freq.most_common(top_k))
    gen_words = set(clean(w) for w in generated.split() if clean(w))
    if not top_words:
        return 0.0
    overlap = len(gen_words & top_words)
    return overlap / len(top_words)


def avg_word_length_match(generated: str, corpus_text: str) -> float:
    """How close is generated avg word length to corpus avg?
    Returns 1.0 - normalized_distance, clamped to [0, 1]."""
    def clean(s):
        return s.lower().strip(string.punctuation)
    def avg(text):
        words = [clean(w) for w in text.split() if clean(w)]
        return (sum(len(w) for w in words) / len(words)) if words else 0.0
    g = avg(generated)
    c = avg(corpus_text)
    if c == 0:
        return 0.0
    return max(0.0, 1.0 - abs(g - c) / c)


def ngram_diversity(generated: str, n: int = 3) -> float:
    """Fraction of n-grams in the generated text that are UNIQUE.
    1.0 = every n-gram appears once (max diversity).
    Approaches 0.0 as the output degenerates to one repeated n-gram
    (max repetition); text shorter than n also returns 0.0.
    Counter-Goodhart against the model gaming overlap by repetition."""
    if len(generated) < n:
        return 0.0
    ngrams = [generated[i:i+n] for i in range(len(generated) - n + 1)]
    if not ngrams:
        return 0.0
    return len(set(ngrams)) / len(ngrams)


def repetition_penalty(generated: str, n: int = 4,
                       max_freq_threshold: int = 3) -> float:
    """Penalty in [0, 1] for excessive n-gram repetition. 0 = no penalty.

    For each n-gram appearing more than max_freq_threshold times, add a
    penalty proportional to the excess. Strong signal against the
    'fan fan, fan, fan' failure mode.
    """
    if len(generated) < n:
        return 0.0
    ngrams = [generated[i:i+n] for i in range(len(generated) - n + 1)]
    counts = Counter(ngrams)
    excess = sum(max(0, c - max_freq_threshold) for c in counts.values())
    # Normalize by total ngrams; cap penalty at 1.0
    return min(1.0, excess / max(1, len(ngrams)))


def lexical_diversity(generated: str) -> float:
    """Type-token ratio over 'words' (whitespace-split). Higher = more
    varied vocabulary, lower = repetitive word use."""
    # `string` is already imported at module level.
    words = [w.lower().strip(string.punctuation) for w in generated.split()]
    words = [w for w in words if w]
    if not words:
        return 0.0
    return len(set(words)) / len(words)


def creativity_score(generated: str, corpus_text: str) -> dict:
    """Comprehensive Shakespeare-creativity score with anti-gibberish terms.

    The composite is a weighted sum of seven overlap/structure terms
    (weights sum to 1.0) minus a repetition penalty, clamped to [0, 1].
    The penalty was added in v2 to counter Goodhart-style failure (the
    model gaming overlap metrics by repetition):
      - repetition_penalty (subtractive; n-gram appears too many times)

    ngram_diversity and lexical_diversity are defined in this module
    but are not applied in the composite below; ngram_2 and
    vc_alternation are reported but carry no weight.
    """
    n2 = char_ngram_overlap(generated, corpus_text, 2)
    n3 = char_ngram_overlap(generated, corpus_text, 3)
    n4 = char_ngram_overlap(generated, corpus_text, 4)
    vocab = vocab_overlap(generated, corpus_text)
    vc = vc_alternation_rate(generated)
    line_dist = line_length_match(generated, corpus_text)
    line_stats = line_structure_stats(generated)
    # Strong anti-gibberish: common-word, real-word, and word-length.
    cw = common_word_presence(generated, corpus_text, top_k=50)
    rw = real_word_fraction(generated, corpus_text, min_word_len=3)
    awl = avg_word_length_match(generated, corpus_text)
    # Repetition penalty: only severe excess counts now (threshold scales
    # with text length so real text's natural repetition doesn't penalize).
    threshold = max(2, len(generated) // 50)
    rep_pen = repetition_penalty(generated, n=4, max_freq_threshold=threshold)

    composite = (
        0.25 * rw +      # real-word fraction (HARDEST anti-gibberish)
        0.15 * cw +      # common-word presence
        0.15 * vocab +   # any vocab overlap (short tokens count)
        0.10 * awl +     # word-length sanity
        0.15 * n3 +      # 3-gram match (corpus patterns)
        0.10 * n4 +      # 4-gram match (longer patterns)
        0.10 * max(0.0, 1.0 - line_dist)  # line structure
    ) - 0.3 * rep_pen
    composite = max(0.0, min(1.0, composite))
    return {
        "ngram_2": n2,
        "ngram_3": n3,
        "ngram_4": n4,
        "vocab_overlap": vocab,
        "common_word_presence": cw,
        "real_word_fraction": rw,
        "avg_word_len_match": awl,
        "vc_alternation": vc,
        "line_dist": line_dist,
        "line_stats": line_stats,
        "repetition_penalty": rep_pen,
        "creativity_score": composite,
    }
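To make the scoring terms concrete, here is a small sketch that re-implements two of the functions above in standalone form and runs them on toy strings. The toy corpus and the three generated strings are assumptions made up for illustration, not project data, and `WEIGHTS` simply restates the coefficients from the composite so their sum can be checked.

```python
from collections import Counter


def char_ngram_overlap(generated: str, corpus_text: str, n: int) -> float:
    """Fraction of n-char windows of `generated` found in `corpus_text`."""
    if len(generated) < n:
        return 0.0
    corpus_ngrams = {corpus_text[i:i + n] for i in range(len(corpus_text) - n + 1)}
    gen_ngrams = [generated[i:i + n] for i in range(len(generated) - n + 1)]
    return sum(g in corpus_ngrams for g in gen_ngrams) / len(gen_ngrams)


def repetition_penalty(generated: str, n: int = 4,
                       max_freq_threshold: int = 3) -> float:
    """Excess n-gram repetitions, normalized by n-gram count, capped at 1."""
    if len(generated) < n:
        return 0.0
    ngrams = [generated[i:i + n] for i in range(len(generated) - n + 1)]
    counts = Counter(ngrams)
    excess = sum(max(0, c - max_freq_threshold) for c in counts.values())
    return min(1.0, excess / max(1, len(ngrams)))


# Coefficients restated from the composite in creativity_score().
WEIGHTS = {
    "real_word_fraction": 0.25,
    "common_word_presence": 0.15,
    "vocab_overlap": 0.15,
    "avg_word_len_match": 0.10,
    "ngram_3": 0.15,
    "ngram_4": 0.10,
    "line_structure": 0.10,
}

# Toy stand-ins (assumptions for illustration only).
corpus = "to be or not to be"
copied = "not to be"        # a substring of the corpus
gibberish = "xqrt zzkv"     # shares no 3-gram with the corpus
looped = "fan " * 10        # the 'fan fan fan' failure mode

overlap_copied = char_ngram_overlap(copied, corpus, 3)
overlap_gibberish = char_ngram_overlap(gibberish, corpus, 3)

# Same length-scaled threshold that creativity_score() uses.
threshold = max(2, len(looped) // 50)
penalty_looped = repetition_penalty(looped, n=4, max_freq_threshold=threshold)

print(overlap_copied, overlap_gibberish)
print(threshold, penalty_looped)
print(sum(WEIGHTS.values()))
```

Since the penalty enters the composite with a coefficient of 0.3, even a fully saturated penalty removes at most 0.3 from the weighted sum before clamping.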