Refactor: single `data_hex` word identity (drop `TERM<hex>` class, hash the token) #237

**Status:** Proposed — deferred until after the next release (a major feature lands first; this is a broad cross-cutting change we don't want to collide with it).

Proposal doc in-repo: `docs-src/developer/word-identity-data-hex.md`.

## Problem

In the reading view, every occurrence of the same term shares an identity token so a status change can restyle **all** occurrences client-side at once. That identity is currently carried two ways:

- a CSS class `TERM<hex>` on each word span, and
- a `data_hex` attribute on JS-rendered spans (the same value, duplicated).

The `<hex>` value comes from `StringUtils::toClassName()`, the `¤`/hex encoder inherited from the original LWT (2011). Issues:

1. **Dual identity** — same value as both a class and an attribute; lookups split across `.TERM<hex>` selectors and `data_hex` reads.
2. **Hacky, subtly-broken encoding** — iterates per character (`mb_substr`) but tests per byte (`ord`), so the `¤`-sentinel/165-threshold unambiguity scheme was never actually realized. PHP 8.5 surfaced it by deprecating `ord()` on a multi-byte string.
3. **Fragile extraction** — JS `TERM([a-f0-9]+)` extractors mis-handle tokens containing `g-z`/`G-Z`/`¤` and silently fall back to `data_hex`.
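
Issue 3 can be reproduced in isolation. The sketch below assumes an extractor shaped like the `TERM([a-f0-9]+)` pattern quoted above. The function name, the surrounding class names (`wsty`, `status1`), and the sample tokens are made up for illustration and are not real `toClassName()` output.

```typescript
// Hypothetical extractor mirroring the `TERM([a-f0-9]+)` pattern.
function extractHexFromClass(className: string): string | null {
  const match = /TERM([a-f0-9]+)/.exec(className);
  return match?.[1] ?? null;
}

// A token made only of [0-9a-f] is captured whole.
const clean = extractHexFromClass("wsty TERM68656c6c6f status1");

// A token containing `¤` is cut at the first character outside [a-f0-9],
// so the caller gets a plausible-looking but wrong token.
const truncated = extractHexFromClass("wsty TERM68¤65 status1");

// A token that starts with a character outside [a-f0-9] does not match at all,
// which is the case where callers would have to fall back to `data_hex`.
const missed = extractHexFromClass("wsty TERMg12 status1");
```

The truncated case is the more dangerous one, since nothing signals that the token is incomplete.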

## Proposal

Make `data_hex` the single identity:

- Select via `[data_hex="…"]`.
- Drop the `TERM` class entirely (zero CSS dependencies — purely an index).
- `toClassName` → `substr(hash('sha256', $s), 0, 16)` (pure `[0-9a-f]`).
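
To make the shape of the new token concrete, here is a rough TypeScript mirror of the PHP expression above, using Node's built-in `crypto` module. It is only an illustration: `wordToken` is a hypothetical name, the real change lives in PHP, and the sketch assumes the PHP input string is UTF-8 so that both sides hash the same bytes.

```typescript
import { createHash } from "node:crypto";

// Hypothetical mirror of substr(hash('sha256', $s), 0, 16); not repository code.
function wordToken(text: string): string {
  // Hash the UTF-8 bytes, hex-encode the digest, keep the first 16 hex characters.
  return createHash("sha256").update(text, "utf8").digest("hex").slice(0, 16);
}

const tokenAscii = wordToken("abc"); // "ba7816bf8f01cfea"
const tokenFirst = wordToken("日本語");
const tokenAgain = wordToken("日本語"); // same input, same token
```

Truncating to 16 hex characters keeps 64 bits of the digest, and every character falls in `[0-9a-f]` regardless of the input script.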

Because the token stays opaque, recomputable, and contained, there is **no API wire-format change and no `CSS.escape` needed**, and the extractor regexes become correct by construction. The `¤`/165/`mb_ord`-vs-`ord` question disappears.
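
The "no `CSS.escape` needed" point follows from the token alphabet: a value made only of `[0-9a-f]` can be placed inside a quoted attribute selector as-is. A minimal sketch, assuming the 16-character hashed token from the proposal (`dataHexSelector` and `PROPOSED_TOKEN` are hypothetical names, not existing helpers):

```typescript
// Proposed token shape: exactly 16 lowercase hex characters.
const PROPOSED_TOKEN = /^[0-9a-f]{16}$/;

// Hypothetical helper that builds the selector replacing `.TERM<hex>`.
function dataHexSelector(hex: string): string {
  // Guard against legacy tokens (for example ones containing `¤`),
  // which would not be safe to interpolate unescaped.
  if (!PROPOSED_TOKEN.test(hex)) {
    throw new Error(`unexpected data_hex token: ${hex}`);
  }
  return `[data_hex="${hex}"]`;
}

const exampleSelector = dataHexSelector("ba7816bf8f01cfea");
```

In the browser, the resulting string would be passed to `querySelectorAll` to restyle every occurrence of a term at once.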

## Why it's safe

- The token is never reversed back to text (backend re-derives it from `WoTextLC`).
- `.TERM` has no CSS rules.
- Tokens are computed per render, never stored — no desync risk.

Trade-off: the token is opaque in devtools (accepted).

## Scope (post-release)

- PHP token: `StringUtils::toClassName()` → hash (keep `toHex()`).
- PHP emit (5 spans): `TextReadingService` ×3, `ExpressionService` ×2 — drop `TERM` from class, add `data_hex`.
- JS emit: remove `TERM${word.hex}` push in `text_renderer.ts` (`data_hex` already emitted).
- JS selectors (~9): `.TERM${hex}` → `[data_hex="${hex}"]` (`word_dom_updates.ts`, `word_result_init.ts`, `text_renderer.ts`).
- JS extractors (4): read `data_hex` (`text_reader.ts`, `text_keyboard.ts`, `word_actions.ts`, `text_events.ts`).
- Tests: PHP `toClassName` assertions (`IntegrationTest`, `TextProcessingTest`) + frontend fixtures (`tests/frontend/reading/*`, `tests/frontend/words/*`, `texts/text_reader.test.ts`).
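
For the four JS extractors, the change amounts to reading the attribute instead of parsing the class string. The sketch below is one possible shape, not the existing code: `readWordHex`, `HasAttributes`, and `fakeSpan` are hypothetical names, and it assumes the attribute is literally named `data_hex` as written in this issue.

```typescript
// Minimal structural type so the sketch runs without a DOM;
// a real `Element` satisfies it.
interface HasAttributes {
  getAttribute(name: string): string | null;
}

// Hypothetical replacement for a class-name regex extractor.
function readWordHex(span: HasAttributes): string | null {
  // `data_hex` uses an underscore, so it is not a `data-*` attribute and
  // is not exposed through `dataset`; read it with getAttribute instead.
  const hex = span.getAttribute("data_hex");
  return hex !== null && hex !== "" ? hex : null;
}

// Stand-in for a rendered word span, used only to exercise the function.
function fakeSpan(attrs: Record<string, string>): HasAttributes {
  return {
    getAttribute: (name) => {
      const value = attrs[name];
      return value === undefined ? null : value;
    },
  };
}

const fromSpan = readWordHex(fakeSpan({ class: "word", data_hex: "ba7816bf8f01cfea" }));
const fromBareSpan = readWordHex(fakeSpan({ class: "word" }));
```

Returning `null` for a missing attribute keeps the "no token" case explicit at each call site rather than hiding it behind a fallback.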

**Out of scope:** `toHex()`; `table_review_row.php`'s `id="TERM<woId>"` (different mechanism).
