Refactor: single data_hex word identity (drop TERM<hex> class, hash the token) #237

@HugoFara

Status: Proposed — deferred until after the next release (a major feature lands first; this is a broad cross-cutting change we don't want to collide with it).

Proposal doc in-repo: docs-src/developer/word-identity-data-hex.md.

Problem

In the reading view, every occurrence of the same term shares an identity token so a status change can restyle all occurrences client-side at once. That identity is currently carried two ways:

  • a CSS class TERM<hex> on each word span, and
  • a data_hex attribute on JS-rendered spans (the same value, duplicated).

The <hex> comes from StringUtils::toClassName() (original-LWT 2011 ¤/hex encoder). Issues:

  1. Dual identity — same value as both a class and an attribute; lookups split across .TERM<hex> selectors and data_hex reads.
  2. Hacky, subtly-broken encoding — iterates per character (mb_substr) but tests per byte (ord), so the ¤-sentinel/165-threshold unambiguity scheme was never actually realized. PHP 8.5 surfaced it by deprecating ord() on a multi-byte string.
  3. Fragile extraction — JS TERM([a-f0-9]+) extractors mis-handle tokens containing g-z/G-Z/¤ and silently fall back to data_hex.
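To make issue 3 concrete, here is a minimal TypeScript sketch of how a class-based extractor can go wrong. `extractTermToken` and the sample class strings are hypothetical stand-ins, not the project's code; only the `TERM([a-f0-9]+)` pattern is taken from the description above.

```typescript
// Hypothetical extractor built around the TERM([a-f0-9]+) pattern quoted above.
function extractTermToken(className: string): string | null {
  const match = /TERM([a-f0-9]+)/.exec(className);
  return match?.[1] ?? null;
}

// Pure-hex token: extracted in full.
const fullMatch = extractTermToken("word TERM68656c6c6f status1");

// Token containing a letter outside a-f (made-up sample): the capture stops
// early and a truncated token is returned without any error.
const truncated = extractTermToken("word TERM6ag7 status1");

// Token starting with a character outside [a-f0-9] (made-up sample): no match
// at all, so a caller has to fall back to another source such as data_hex.
const noMatch = extractTermToken("word TERM¤c3a9 status1");

console.log({ fullMatch, truncated, noMatch });
```

The truncated case is the riskier one: the caller receives a plausible-looking token that matches nothing, or matches the wrong term.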

Proposal

Make data_hex the single identity:

  • Select via [data_hex="…"].
  • Drop the TERM class entirely (zero CSS dependencies — purely an index).
  • toClassName → substr(hash('sha256', $s), 0, 16) (pure [0-9a-f]).

Token stays opaque/recomputable/contained → no API wire-format change, no CSS.escape needed, and the extractor regexes become correct by construction. The ¤/165/mb_ord-vs-ord question disappears.
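A rough TypeScript sketch of the proposed token and selector, for illustration only. `wordToken` and `termSelector` are hypothetical names, and the sketch assumes the PHP side hashes UTF-8 bytes, so that Node's `createHash` yields the same prefix as `substr(hash('sha256', $s), 0, 16)`.

```typescript
import { createHash } from "node:crypto";

// Sketch of the proposed token: the first 16 hex characters of SHA-256 over
// the UTF-8 bytes of the term (assumed to mirror the PHP expression above).
function wordToken(term: string): string {
  return createHash("sha256").update(term, "utf8").digest("hex").slice(0, 16);
}

// Attribute selector matching every occurrence of a term. The token is pure
// [0-9a-f], so it can be interpolated without CSS.escape.
function termSelector(token: string): string {
  return `[data_hex="${token}"]`;
}

const token = wordToken("été");
console.log(token.length, /^[0-9a-f]{16}$/.test(token), termSelector(token));
```

Because the output alphabet is fixed regardless of the input script, the same selector shape works for any term.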

Why it's safe

  • The token is never reversed back to text (backend re-derives it from WoTextLC).
  • .TERM has no CSS rules.
  • Tokens are computed per render, never stored — no desync risk.

Trade-off: the token is opaque in devtools (accepted).

Scope (post-release)

  • PHP token: StringUtils::toClassName() → hash (keep toHex()).
  • PHP emit (5 spans): TextReadingService ×3, ExpressionService ×2 — drop TERM from class, add data_hex.
  • JS emit: remove TERM${word.hex} push in text_renderer.ts (data_hex already emitted).
  • JS selectors (~9): .TERM${hex} → [data_hex="${hex}"] (word_dom_updates.ts, word_result_init.ts, text_renderer.ts).
  • JS extractors (4): read data_hex (text_reader.ts, text_keyboard.ts, word_actions.ts, text_events.ts).
  • Tests: PHP toClassName assertions (IntegrationTest, TextProcessingTest) + frontend fixtures (tests/frontend/reading/*, tests/frontend/words/*, texts/text_reader.test.ts).

Out of scope: toHex(); table_review_row.php's id="TERM<woId>" (different mechanism).
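For the extractor items in the scope list, the change could look roughly like the sketch below. `readWordToken`, `HasAttributes`, and the fake span are hypothetical and exist only so the example runs without a DOM; the actual extractors in the listed files may be shaped differently.

```typescript
// Minimal structural type so the sketch runs without a DOM; a real
// HTMLElement satisfies it.
interface HasAttributes {
  getAttribute(name: string): string | null;
}

// Hypothetical replacement for a class-based extractor: read the identity
// straight from the attribute. The attribute is data_hex (underscore, not
// data-hex), so element.dataset does not expose it and getAttribute is needed.
function readWordToken(el: HasAttributes): string | null {
  return el.getAttribute("data_hex");
}

// Stand-in element for demonstration, carrying a made-up token.
const fakeSpan: HasAttributes = {
  getAttribute: (name) => (name === "data_hex" ? "0123456789abcdef" : null),
};

console.log(readWordToken(fakeSpan));
```

With no regex involved, the extractor does not depend on the token's alphabet at all.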
