HTML API: Add CSS selector support by sirreal · Pull Request #7857 · WordPress/wordpress-develop

sirreal · 2024-11-21T10:56:52Z

This is not ready for final review but is ready for early feedback, especially around the open questions listed below.

Introduce CSS selector-based traversal of HTML documents in the HTML API. Add new select_all and select methods to the Tag Processor and HTML Processor.

// With select_all to traverse a document stopping on matching selectors
$processor = WP_HTML_Processor::create_full_parser( '<p match><div att match><em><i match></i><a match>' );
foreach ( $processor->select_all( 'p, [att], em > *' ) as $_ ) {
	assert( $processor->get_attribute( 'match' ) );
}

// With select to move to a matching selector
$processor = WP_HTML_Processor::create_full_parser( '<p match><div att match><em><i match></i><a match>' );
assert( $processor->select( 'p, [att], em > *' ) );
assert( $processor->get_attribute( 'match' ) );
assert( 'P' === $processor->get_tag() );

A subset of the CSS selector grammar is available as described here:

wordpress-develop/src/wp-includes/html-api/class-wp-css-compound-selector-list.php

Lines 24 to 39 in 355c9a2

    
 * This class is analogous to <compound-selector-list> in the grammar. The supported grammar is:
 *
 *     <selector-list> = <complex-selector-list>
 *     <complex-selector-list> = <complex-selector>#
 *     <compound-selector-list> = <compound-selector>#
 *     <complex-selector> = [ <type-selector> <combinator>? ]* <compound-selector>
 *     <compound-selector> = [ <type-selector>? <subclass-selector>* ]!
 *     <combinator> = '>' | [ '|' '|' ]
 *     <type-selector> = <ident-token> | '*'
 *     <subclass-selector> = <id-selector> | <class-selector> | <attribute-selector>
 *     <id-selector> = <hash-token>
 *     <class-selector> = '.' <ident-token>
 *     <attribute-selector> = '[' <ident-token> ']' |
 *                            '[' <ident-token> <attr-matcher> [ <string-token> | <ident-token> ] <attr-modifier>? ']'
 *     <attr-matcher> = [ '~' | '|' | '^' | '$' | '*' ]? '='
 *     <attr-modifier> = i | s
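To make the `<attribute-selector>` production concrete, here is a minimal sketch of a recognizer for it. It is not the parser from this PR: the function name `sketch_parse_attribute_selector` is hypothetical, and it assumes ASCII identifiers with no escapes and quoted strings with no backslashes, which is far narrower than real `<ident-token>` and `<string-token>` parsing.

```php
<?php
/**
 * Hypothetical sketch: parses a single attribute selector from the grammar
 * above, restricted to ASCII identifiers and strings without escapes.
 *
 * @return array|null Parsed parts, or null if the input is not recognized.
 */
function sketch_parse_attribute_selector( string $input ): ?array {
	$ident   = '-?[a-z_][a-z0-9_-]*';
	$pattern = '/^\[\s*(?<name>' . $ident . ')\s*'
		. '(?:(?<matcher>[~|^$*]?=)\s*'
		. '(?:"(?<dq>[^"\\\\]*)"|\'(?<sq>[^\'\\\\]*)\'|(?<id>' . $ident . '))'
		. '\s*(?<modifier>[is])?\s*)?\]$/i';

	if ( 1 !== preg_match( $pattern, $input, $m ) ) {
		return null;
	}

	$matcher = $m['matcher'] ?? '';
	if ( '' === $matcher ) {
		// Presence selector such as [download].
		return array( 'name' => $m['name'], 'matcher' => null, 'value' => null, 'modifier' => null );
	}

	// Exactly one of the three value alternatives can be non-empty.
	$value    = ( $m['dq'] ?? '' ) . ( $m['sq'] ?? '' ) . ( $m['id'] ?? '' );
	$modifier = $m['modifier'] ?? '';

	return array(
		'name'     => $m['name'],
		'matcher'  => $matcher,
		'value'    => $value,
		'modifier' => '' === $modifier ? null : strtolower( $modifier ),
	);
}
```

Under these assumptions `[a=b]` and `[lang|="en" i]` are recognized, while `[a=]`, `[a~=]`, `[a==b]`, and `[a=1]` are rejected because the value is missing or is not an identifier.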

Notable variations from selectors specification:

Pseudo-element selectors are not supported. Pseudo elements will not exist in the HTML and it's unclear what benefit they would bring.

Pseudo-class selectors are not supported. Pseudo classes could be useful, but the logic to parse and match pseudo-class selectors would add significant complexity. There's also a lot of variety in pseudo-selectors, and rather than supporting simpler selectors (e.g. :empty) and not supporting more complex selectors (e.g. :nth-child()), pseudo-class selectors are completely unsupported. This is a clear and simple rule that greatly simplifies the implementation.

Complex selectors have limited support. Complex selectors are combined selectors using one of the combinators: (whitespace), >, +, or ~. The Tag Processor does not support complex selectors at all: it has no concept of document structure, and all complex selectors are structural. The HTML Processor does not support the sibling combinators + or ~; it only supports parent/child and ancestor/descendant relationships. These selectors can be handled without tracking additional state in the document by analyzing breadcrumbs. Finally, only type selectors are allowed in non-final position, again because this allows matching against breadcrumbs without tracking additional state:

Supported: body heading > h1.page-title[attribute]
Unsupported (class / id selector in non final position): #page > main, .page main
Unsupported (sibling selectors not supported): ul li ~ li, ul li + li.
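The breadcrumb-based matching described above can be sketched roughly as follows. This is an illustration, not the PR's implementation: the function name `sketch_ancestors_match` and the array shapes are hypothetical. It assumes the ancestor part of the selector is a list of type selectors, each paired with the combinator that follows it, and that breadcrumbs are the tag names of the candidate element's ancestors, root first.

```php
<?php
/**
 * Hypothetical sketch: checks whether the ancestor part of a complex selector
 * matches a breadcrumb trail.
 *
 * @param array $ancestors   List of [ type, combinator ] pairs, outermost first.
 *                           The combinator ('>' or ' ') links the type to the
 *                           selector part that follows it.
 * @param array $breadcrumbs Tag names of the candidate element's ancestors,
 *                           root first, e.g. [ 'HTML', 'BODY', 'MAIN' ].
 */
function sketch_ancestors_match( array $ancestors, array $breadcrumbs ): bool {
	if ( empty( $ancestors ) ) {
		return true; // Nothing left to satisfy.
	}

	list( $type, $combinator ) = array_pop( $ancestors );

	// Walk up from the nearest ancestor toward the root.
	for ( $i = count( $breadcrumbs ) - 1; $i >= 0; $i-- ) {
		$type_matches = '*' === $type || 0 === strcasecmp( $type, $breadcrumbs[ $i ] );

		if ( $type_matches && sketch_ancestors_match( $ancestors, array_slice( $breadcrumbs, 0, $i ) ) ) {
			return true;
		}

		// A child combinator may only look at the immediate parent.
		if ( '>' === $combinator ) {
			return false;
		}
	}

	return false;
}
```

For `body heading > h1.page-title[attribute]`, the ancestor part would be `[ [ 'body', ' ' ], [ 'heading', '>' ] ]`. The final compound selector is checked against the current tag separately. Only tag names are consulted, which is why a class or ID selector in non-final position cannot be answered from breadcrumbs alone.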

Importantly, the selectors supported by the HTML Processor should be sufficient to support all core block attribute selectors according to this PR:

wordpress-develop/src/wp-includes/html-api/class-wp-html-attribute-sourcer.php

Lines 18 to 49 in 88e7d30

    
/**
 * Existing Core Selectors
 *
 *     .blocks-gallery-caption
 *     .blocks-gallery-item
 *     .blocks-gallery-item__caption
 *     .book-author
 *     .message
 *     a
 *     a[download]
 *     audio
 *     blockquote
 *     cite
 *     code
 *     div
 *     figcaption
 *     figure > a
 *     figure a
 *     figure img
 *     figure video,figure img
 *     h1,h2,h3,h4,h5,h6
 *     img
 *     li
 *     ol,ul
 *     p
 *     pre
 *     tbody tr
 *     td,th
 *     tfoot tr
 *     thead tr
 *     video
 */

Most of the listed selectors are also supported by the Tag Processor with the exception of the complex selectors:

figure > a
figure a
figure img
figure video,figure img
tbody tr
tfoot tr
thead tr
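The split between the two processors can be sketched with a small helper that flags selector lists containing a combinator. This is a deliberately naive illustration with a hypothetical function name: it assumes no whitespace or `>` appears inside attribute values, which holds for the core selectors listed above but not for selectors in general.

```php
<?php
/**
 * Hypothetical helper: reports whether a selector list contains a combinator.
 * Naive on purpose; see the assumption stated in the text above.
 */
function sketch_has_combinator( string $selector_list ): bool {
	foreach ( explode( ',', $selector_list ) as $selector ) {
		// Whitespace or ">" inside a trimmed selector separates compound selectors.
		if ( 1 === preg_match( '/[\s>]/', trim( $selector ) ) ) {
			return true;
		}
	}
	return false;
}

// A sample of the core selectors from the list above.
$core_selectors = array( 'a[download]', 'figure > a', 'figure video,figure img', 'h1,h2,h3,h4,h5,h6', 'tbody tr', 'td,th' );

// Selectors that would need the HTML Processor rather than the Tag Processor.
$needs_html_processor = array_values( array_filter( $core_selectors, 'sketch_has_combinator' ) );
```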

Open questions

Implementation

The implementation introduces a number of classes. The classes roughly correspond to different parts of the selector grammar. Parsing is handled in the _list classes that represent the top of the grammar. Matching logic is handled by each selector class. The selector classes implement a matches interface.

~~I'm not happy with how the implementation is distributed. I'd like to move in one of two directions:~~

Move the parsing logic for selectors into the selector classes, so each selector class is responsible for parsing itself.

I've implemented this change.

OR

~~Move matching logic into the top level *_list classes so that the lower level selectors are only data containers.~~

~~My preference at this time (and the original implementation) would be the former.~~

Selector traversal APIs

select_all is implemented as a generator and, apart from calling _doing_it_wrong, has no way to differentiate between "nothing matched" and "the selector was invalid or unsupported". It simply yields nothing in both cases.
select uses select_all internally and has the same limitation.
select_all expects the document position to remain the same in order to visit all matching tags. Because none of the supported selectors rely on stateful logic, this is not an issue at this time.
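The first limitation above can be reproduced with a plain PHP generator. The sketch below uses a hypothetical stand-in, `sketch_select_all`, that works on a flat list of tag names and only accepts bare type selectors. It is not the method from this PR.

```php
<?php
/**
 * Hypothetical stand-in for select_all(): yields every tag matching a selector.
 * Only bare type selectors are "supported" here to keep the sketch short.
 */
function sketch_select_all( array $tags, string $selector ): Generator {
	if ( 1 !== preg_match( '/^[a-z][a-z0-9]*$/i', $selector ) ) {
		return; // Invalid or unsupported: the generator ends without yielding.
	}

	foreach ( $tags as $tag ) {
		if ( 0 === strcasecmp( $tag, $selector ) ) {
			yield $tag;
		}
	}
}

$tags = array( 'P', 'DIV', 'EM' );

// Both calls produce an empty iteration; the caller cannot tell them apart.
$no_match = iterator_to_array( sketch_select_all( $tags, 'span' ), false );
$invalid  = iterator_to_array( sketch_select_all( $tags, 'p:empty' ), false );
```

A caller that needs to distinguish the two cases has to get that signal some other way, for example from a separate parse step that returns null for an unsupported selector.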

Todo:

Get and address feedback on open questions.
Split into smaller PRs for review. Specifically, the selector lists can be split into separate PRs starting with compound and following with complex selectors.

sirreal/html-api-debugger#5 can be used for testing the parsing and matching behavior.

Trac ticket: https://core.trac.wordpress.org/ticket/62653

This Pull Request is for code review only. Please keep all other discussion in the Trac ticket. Do not merge this Pull Request. See GitHub Pull Requests for Code Review in the Core Handbook for more details.

github-actions · 2024-11-21T11:18:36Z

Test using WordPress Playground

The changes in this pull request can be previewed and tested using a WordPress Playground instance.

WordPress Playground is an experimental project that creates a full WordPress instance entirely within the browser.

Some things to be aware of

The Plugin and Theme Directories cannot be accessed within Playground.
All changes will be lost when closing a tab with a Playground instance.
All changes will be lost when refreshing the page.
A fresh instance is created each time the link below is clicked.
Every time this pull request is updated, a new ZIP file containing all changes is created. If changes are not reflected in the Playground instance, it's possible that the most recent build failed or has not completed. Check the list of workflow runs to be sure.

For more details about these limitations and more, check out the Limitations page in the WordPress Playground documentation.

Test this pull request with WordPress Playground.

sirreal · 2024-11-26T18:41:48Z

+ * - The following combinators:
+ *   - descendant (e.g. `.parent .descendant`)
+ *   - child (`.parent > .child`)


I don't think this will be supported initially. Maybe no combinators at all, only simple selectors.

These two combinators can probably be supported by the HTML API as long as the non-final selector is an element selector:

div > .className ✅ supported

.className > div ⛔️ not supported

Tags can be "seen" via bookmarks, while things like IDs, classes, attributes etc. would require seeking or more advanced internal tracking.

Is the problem that we'd need to keep track of all the attributes requested by the selector in the entire breadcrumbs trail?

Exactly. Either breadcrumbs would need to store all of their attributes or we'd need to seek around the document to check them which would be very costly.

All of the supported selectors are things that can be checked from a given tag in the document without any seeking as the HTML API is implemented now.

If we knew the selector upfront, we could always store the attributes needed to match it, e.g. store the class attribute if we know we'll only ever look for div > .className. Only that wouldn't be practical.

However, keeping track of classnames in all the breadcrumbs elements seems okay-ish resource-wise.

Perhaps we could even have a "CSS-enabled" mode where we'd keep track of all the breadcrumbs attributes shorter than, say, 100 bytes? It wouldn't be perfect but should suffice for most attribute-based matches.

There will be ways to support more selectors. One good idea is to set some state to track data for certain selectors as the document is traversed.

One tricky thing with that implementation is that it may be necessary to return to the start of the document depending on the selectors used and scan again from there so that the necessary data is collected during traversal. This probably involves analyzing the selectors, deciding what data needs to be stored, seeking to the beginning if necessary, seeking back to the current location, and then trying to find the selector.

The complexity here is significant. It can be overcome, but I think the set of selectors supported in this PR as it is now is a good starting place with very limited matching complexity. Iteration to improve support can be added in future PRs.

Oh! Everything I said I meant for the future, no need to hold up this work :)

On seeking - sounds reasonable, although it won't combine well with streaming. What would be an example of a selector that requires backtracking?

> What would be an example of a selector that requires backtracking?

This is mostly necessary to support selecting from arbitrary positions in the document. If we knew we were at the start of the document, we could determine what additional state needed to be tracked and start collecting it.

The sibling combinators + and ~ select based on preceding elements, so it would be necessary to back up to check for matches.

Support for subclass selectors (ID, class, and attribute) in non-final position, like #my-id div, would require backing up to parse some attributes along with breadcrumbs.

If pseudo-class selectors were added, I'd expect many of them to require more understanding of the document structure.

in earlier work I did I actually proposed a list-interface for selecting elements and that list took all selectors of interest. that’s not a great solution to mandate, but it does allow for up-front analysis of the set of selectors, and then a crawl to find each one in succession.

I was going to make a comment about select_all() and the generator it yields and how I worry it miscommunicates the ability to chain selection at that point. if we start selecting a different selector from within the foreach, we move the underlying HTML Processor and now it will select a different element on the next iteration than it would have (potentially) if we hadn’t called one from inside.

to that end, if the underlying HTML Processor is moving, can we not get select_all() behavior simply by calling select() in succession in while ( $processor->select() )?

then, one thing is more explicit, that if we call select() from inside that loop, we kind of acknowledge we might change the outer loop because the $processor is shared and stateful.

it doesn’t solve the problem of the unknown necessary context from above.

thinking that the FirTree representation might be able to help us by tracking attribute spans, making it faster to re-scan parent nodes, but also we could say someone can register selectors before traversal, and we maintain the relevant attributes as we go, regardless of which select() is running…
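The concern about calling select() inside a loop over the same processor can be illustrated with a toy cursor. The class below is hypothetical and only models a forward-only, stateful traversal over a flat list of tag names. It is not the HTML Processor.

```php
<?php
/**
 * Hypothetical toy processor: a forward-only cursor over a flat list of tag names.
 */
class Sketch_Cursor {
	private $tags;
	private $position = -1;

	public function __construct( array $tags ) {
		$this->tags = $tags;
	}

	/** Advances to the next tag with the given name, or returns false. */
	public function select( string $tag_name ): bool {
		$count = count( $this->tags );
		while ( ++$this->position < $count ) {
			if ( $this->tags[ $this->position ] === $tag_name ) {
				return true;
			}
		}
		return false;
	}

	public function get_position(): int {
		return $this->position;
	}
}

$cursor  = new Sketch_Cursor( array( 'LI', 'A', 'LI', 'LI' ) );
$visited = array();

while ( $cursor->select( 'LI' ) ) {
	$visited[] = $cursor->get_position();

	// The nested call moves the same cursor, so the outer loop resumes after it.
	$cursor->select( 'A' );
}
// $visited is [ 0, 2 ]: the last LI is skipped because the second nested
// select( 'A' ) ran to the end of the list.
```

Written as a `while` loop, the shared state is at least visible at the call site, which is the point made above.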

This is a more appropriate name for the type of match.

> `[att|=val]` Represents an element with the att attribute, its value either being exactly "val" or beginning with "val" immediately followed by "-" (U+002D). This is primarily intended to allow language subcode matches (e.g., the hreflang attribute on the a element in HTML) as described in BCP 47 ([BCP47]) or its successor.
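The quoted definition of `[att|=val]` translates to a short comparison. A minimal sketch with a hypothetical function name, ignoring the case modifier:

```php
<?php
/**
 * Hypothetical sketch of the `[att|=val]` comparison: the attribute value is
 * either exactly $expected or starts with $expected followed by "-".
 */
function sketch_matches_exact_or_hyphen( string $actual, string $expected ): bool {
	if ( $actual === $expected ) {
		return true;
	}

	$prefix = $expected . '-';
	return 0 === strncmp( $actual, $prefix, strlen( $prefix ) );
}
```

With this, `en` and `en-US` both match `[lang|=en]`, while `english` does not.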

The length guard before the attribute matcher required 4 remaining bytes where the minimum valid tail `=x]` is 3, so a valid exact-match attribute selector with a single-character unquoted value at the end of the selector string (e.g. `[a=b]`) was wrongly rejected as unparseable. Relax the guard from `>=` to `>`. All reads after the guard are bounded: the operator reads touch at most offset+1, and every later read re-checks the length itself. Adds the exact-fit valid case and invalid cases at the same boundary (`[a=]`, `[a~=]`, `[a==b]`, `[a=1]`) to the parse tests, plus an assertNotNull so parse failures report cleanly instead of erroring on a null property read. Found by the CSS selector fuzzer (tools/css-selector-fuzz, Bug 3 in FINDINGS.md). (cherry picked from commit 16d03e2)

Per Selectors level 4, the substring attribute matchers with an empty value — [x^=""], [x$=""], [x*=""] — represent nothing and must never match. The matcher instead matched any element carrying the attribute (prefix and contains) or an element whose attribute value was exactly empty (suffix). Add an early return for the empty operand on those three matchers, before the case modifier and boolean-attribute normalization so [x^="" i] and valueless attributes are covered too. [x=""], [x|=""], and [x~=""] are unaffected and remain spec-correct: exact and hyphen matchers may match an empty value, and the one-of matcher already matched nothing because a whitespace-delimited list never yields an empty item. Tests pin all of these, including |= against a hyphen-prefixed value. https://www.w3.org/TR/selectors-4/#attribute-substrings Found by the CSS selector fuzzer (tools/css-selector-fuzz, Bug 2 in FINDINGS.md). (cherry picked from commit 0cefeb2)
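The early return for an empty operand might look like the sketch below. The function name and signature are hypothetical, and the case modifier and boolean-attribute normalization mentioned in the commit message are left out.

```php
<?php
/**
 * Hypothetical sketch of the substring attribute matchers (^=, $=, *=).
 * An empty operand represents nothing, so it never matches.
 */
function sketch_matches_substring( string $operator, string $actual, string $operand ): bool {
	if ( '' === $operand ) {
		return false;
	}

	switch ( $operator ) {
		case '^=':
			return 0 === strncmp( $actual, $operand, strlen( $operand ) );
		case '$=':
			return strlen( $actual ) >= strlen( $operand )
				&& substr( $actual, -strlen( $operand ) ) === $operand;
		case '*=':
			return false !== strpos( $actual, $operand );
	}

	return false; // Unknown operator.
}
```

Without the guard, the prefix and contains checks would succeed for every value, because the empty string is a prefix and a substring of any string.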

consume_escaped_codepoint() read the escaped codepoint for non-hex (identity) escapes with mb_substr( $input, $offset, 1 ), but $offset is a byte offset while mb_substr()'s second argument is a character index. Any multibyte content earlier in the selector string shifts the read one character right per continuation byte, decoding the wrong codepoint: 'Ü\sup' parsed as ident 'Üuup', and the corruption threads across an entire selector list ('#ÜÜÜ,\sup #x' parsed the second selector's type as ' up'). Depending on the mis-decoded codepoint this also caused spurious parse failures of valid selectors. Hex escapes were already byte-correct and are unaffected. Read the codepoint from the byte offset instead. ASCII-only inputs are byte-for-byte unchanged, and the returned codepoint's byte length keeps the offset advancing exactly past it. Adds parse_ident and parse_string cases pinning identity escapes after multibyte characters, plus a hex-escape control. Found by the CSS selector fuzzer (tools/css-selector-fuzz, Bug 1 in FINDINGS.md). (cherry picked from commit 7419a9f)
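The byte-offset versus character-index mismatch can be reproduced with the `Ü\sup` input from the commit message. The helper below is a hypothetical sketch, not the code from the fix. It assumes valid UTF-8 and an offset that points at a lead byte.

```php
<?php
/**
 * Hypothetical sketch: reads one UTF-8 code point starting at a byte offset.
 * Assumes $input is valid UTF-8 and $offset points at a lead byte.
 */
function sketch_codepoint_at_byte_offset( string $input, int $offset ): string {
	$lead = ord( $input[ $offset ] );

	if ( $lead < 0x80 ) {
		$length = 1;
	} elseif ( $lead < 0xE0 ) {
		$length = 2;
	} elseif ( $lead < 0xF0 ) {
		$length = 3;
	} else {
		$length = 4;
	}

	return substr( $input, $offset, $length );
}

$input  = "\xC3\x9C\\sup";            // "Ü\sup": "Ü" occupies two bytes, so "\" is at byte 2.
$offset = strpos( $input, '\\' ) + 1; // Byte offset of the escaped code point: 3.

$correct = sketch_codepoint_at_byte_offset( $input, $offset ); // "s"

// mb_substr() counts characters, so the same number lands one character too far.
// Yields "u" when the mbstring extension is available.
$shifted = function_exists( 'mb_substr' ) ? mb_substr( $input, $offset, 1, 'UTF-8' ) : null;
```

Decoding `u` instead of `s` and then continuing from the next byte is how `Ü\sup` ended up as the ident `Üuup`.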

Per CSS Syntax 3, a backslash followed by EOF is a valid escape in ident context -- §4.3.8 rejects only a newline as the second code point, and EOF is not a newline -- and consuming it returns U+FFFD REPLACEMENT CHARACTER (§4.3.7). WP rejected the whole selector: next_two_are_valid_escape() required a code point after the backslash, so '.foo\' parsed to null instead of the class "foo\u{FFFD}". Fix: consume_escaped_codepoint() returns U+FFFD at EOF without advancing, and next_two_are_valid_escape() accepts a backslash as the final byte. String context is unaffected: parse_string() guards EOF itself before consuming an escape, preserving the §4.3.5 'do nothing' EOF rule ('foo\ still parses to foo). Review of the fix surfaced a second bug in the same family: normalize_selector_input() trimmed *trailing* whitespace before tokenizing, so '.foo\ ' (escaped space: the valid, unmatchable class 'foo ') and ".foo\\n" (invalid escape: must be rejected) both collapsed to '.foo\' and matched elements with class "foo\u{FFFD}" -- a wrong-match-set bug, where before the EOF-escape fix the collapse was a harmless fail-safe rejection. Now only leading whitespace is stripped; the grammar already consumes insignificant trailing whitespace via parse_whitespace() in both selector-list parsers. Verified against lexbor: '.foo\' matches class "foo\u{FFFD}", lone '\' parses as type U+FFFD and matches nothing, '.foo\ ' is valid and matches nothing, and the LF/CR/FF escape variants are rejected -- exact agreement on all probes. (NEXT-STEPS.md 'candidate finding 4', now confirmed and closed.) (cherry picked from commit 203858b)

Per CSS Syntax 3 §5.4.8/§4.3.5, tokenization auto-closes unterminated simple blocks and unterminated strings at EOF (a parse error, but the block/string is returned), and the selector grammar then applies to the block contents. So '[att=val' is the same selector as '[att=val]', and '[att="a b' carries the string value 'a b'. WP rejected all of these with null. The attribute parser now treats the end of input like a closing ']' at the two positions where the grammar is complete (after the name, and after the value/modifier), and the early length guards that required room for a closing bracket are relaxed accordingly. Truncation inside the grammar itself is still invalid: '[', '[a=', '[a~', '[a=b x', and a comma inside the open block ('[a=b, div') all stay null. Escape interplay (verified per spec and in Chromium): '[a=b\' carries the value "b\u{FFFD}" (escape at EOF in ident context), while '[a="b\' carries 'b' (backslash-then-EOF in a string 'does nothing'). '[a\]' parses as a presence selector for the attribute 'a]' (the escaped ']' joins the ident and EOF closes the block). Chromium agrees with every accepted and rejected form above. lexbor rejects all EOF-truncated forms (it does not implement the auto-close rule) and diverges from browsers and the spec here; the fuzzer's lexbor differential is unaffected because it compares canonical re-renders, which always include the closing bracket. (cherry picked from commit 5eea359)

HTML defines 46 attributes (type, rel, lang, dir, media, hreflang, http-equiv, ...) whose values must match ASCII case-insensitively in attribute selectors on an HTML element when the selector carries no i/s modifier: https://html.spec.whatwg.org/multipage/semantics-other.html#case-sensitivity-of-selectors WP honored only the explicit modifiers, so [type=TEXT] silently failed to match <input type="text"> — a wrong match set rather than a refusal, invisible to callers. The matcher now folds case when all three hold: no modifier on the selector, the element is in the html namespace (per the processor's get_namespace()), and the lowercased attribute name is in the list. An explicit s modifier still forces case-sensitive matching, per Selectors 4 §6.3: 'the UA must match the value case-sensitively ... regardless of document language rules.' All six matchers and |='s hyphen check honor the rule via the existing case-insensitive comparison branches. Namespace scoping follows the spec's 'on an HTML element' wording: SVG/MathML elements keep case-sensitive matching, while elements at HTML integration points (e.g. inside <svg><foreignObject>) fold, since they are html-namespace. Verified in Chromium, which agrees on the integration point but also folds plain SVG-namespace elements, diverging from the spec's scoping; WP follows the spec. The standalone Tag Processor tracks no namespaces and folds everywhere — the same class of approximation as its ancestor-blind matching. The review panel machine-diffed both list constants against the live spec (exact, in spec order). (cherry picked from commit 40640d1)
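The three-part condition for folding case can be sketched as below for the exact-match operator. The function name is hypothetical, and the attribute list contains only the names quoted in the commit message, not the full set of 46.

```php
<?php
/**
 * Hypothetical sketch of the case-folding decision for `[name=value]`.
 * The list holds only the attribute names quoted in the commit message.
 */
function sketch_attribute_equals( string $name, string $actual, string $expected, ?string $modifier, string $namespace ): bool {
	$folding_attributes = array( 'type', 'rel', 'lang', 'dir', 'media', 'hreflang', 'http-equiv' );

	if ( 'i' === $modifier ) {
		$fold_case = true;  // Explicit case-insensitive modifier.
	} elseif ( 's' === $modifier ) {
		$fold_case = false; // Explicit case-sensitive modifier always wins.
	} else {
		// No modifier: fold only for listed attributes on html-namespace elements.
		$fold_case = 'html' === $namespace
			&& in_array( strtolower( $name ), $folding_attributes, true );
	}

	return $fold_case
		? 0 === strcasecmp( $actual, $expected )
		: $actual === $expected;
}
```

Under these assumptions `[type=TEXT]` matches `type="text"` on an HTML element, but not on an SVG element and not when written as `[type=TEXT s]`.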

phpcbf reports 40 WordPress.Arrays.MultipleStatementAlignment warnings in this file's data providers, and the coding-standards workflow runs phpcs over the test suite without -n, so warnings fail CI. Pure whitespace; no test changes. (cherry picked from commit 3db43ea)

The identity arm of consume_escaped_codepoint() read one character via mb_substr( substr( $input, $offset ), 0, 1 ), copying the entire remaining input per escape: O(n^2) over selectors composed of escapes, plus an O(n) temporary allocation each time. Size the code point in place instead with the bounded scanner _wp_scan_utf8( $input, $at, $invalid_length, 4, 1 ) from compat-utf8.php (WP 6.9, loaded unconditionally before the HTML API), then copy at most 4 bytes. Escapes of invalid UTF-8 fall through to the literal previous mb_substr() line, so behavior is preserved by construction under every mb_substitute_character setting; that fallback remains O(tail) per call, accepted for developer-supplied selectors. _wp_utf8_codepoint_span() is deliberately not used: it leaves the scanner's ASCII fast-path unbounded, which is quadratic again (noted in-code). 200KB of repeated \g through parse_ident: 180 ms before, 45 ms after, with linear scaling after (47/90/180 ms at 200/400/800KB; previously ~4x per doubling) and half the peak memory. Escape pin coverage grows to 14 cases: 2/3/4-byte characters including at end of input, NUL, and each invalid-byte class (lone continuation, overlong lead, invalid lead, truncated 3/4-byte, encoded surrogate, above U+10FFFF), with expectations probe-verified against the pre-change implementation. Adversarial review: equivalence reviewer ran ~74M differential old-vs-new cases (exhaustive byte-class boundaries at every offset, random fuzz, non-default mb_substitute_character) with 0 mismatches; perf reviewer independently reproduced the quadratic-before / linear-after curves; integration reviewer verified load order (including SHORTINIT), private-function precedent, and phpcs. All approved. Gates: full html-api PHPUnit group green (1654 tests), fuzzer 5000 seeds 0 failures. (cherry picked from commit 9d82c1c)

Decoding an identity escape of invalid UTF-8 leaks the process-global mb_substitute_character() setting into parse results: the substitute character is returned and the offset advances by the byte length of the substitute, not of the invalid sequence. Under the default '?' this is nearly invisible; under a multibyte substitute it swallows following characters and can push the offset past the end of the input. Pin the setting to a distinctive canary -- U+2603 SNOWMAN -- in set_up()/tear_down() and rewrite the seven invalid-byte pins to the canary expectations, making the dependence unmistakable: five cases show the trailing 'z' being eaten, and a dedicated test asserts the offset overrun that the rest-of-input assertion cannot see (substr() returns '' both at and past the end). A differential run of all provider cases under canary/default/'none' confirms exactly these seven react to the setting; everything else is independent of it. These pins document the leak, not endorse it. They are the ready-made red suite for the planned fix: decoding invalid bytes to U+FFFD per maximal subpart (CSS Syntax 3 section 3.2 via the WHATWG Encoding Standard) makes the outputs setting-independent and flips every one of these expectations. Adversarial review approved; full html-api group green (1654 tests) with the substitute character verified restored after the run. (cherry picked from commit 9b0b1df)

Selector strings are UTF-8 text. from_selectors() now decodes the input byte stream before parsing: normalize_selector_input() replaces each maximal subpart of an ill-formed byte sequence with U+FFFD via wp_scrub_utf8() (WP 6.9), per the byte-decoding step CSS Syntax 3 section 3.2 defines through the WHATWG Encoding Standard's UTF-8 decoder. A replaced selector is almost always a developer mistake (mojibake, double encoding) that would otherwise yield a silently empty match set, so the replacement also reports _doing_it_wrong(), named "<called class>::from_selectors" via late static binding. The mb_substitute_character() leak in consume_escaped_codepoint() dies structurally: with all public input scrubbed, the identity arm's mb_substr() fallback became unreachable through from_selectors() and is replaced by a deterministic decode for direct parse() callers — consume the maximal subpart the existing _wp_scan_utf8() call already reported and return one U+FFFD, consistent with the scrub. This also removes the remaining O(tail)-per-escape copy for invalid bytes. Design decision: reject (wp_is_valid_utf8() -> null) and raw byte passthrough were both considered and discarded by a three-persona adversarial design review; scrub is the option that stays stable under both the current raw value getters and a future where the getters scrub their return values. An escape-arm-only U+FFFD decode was ruled out unanimously: it would break the identity property that escaping a non-special code point is equivalent to writing it unescaped. The known divergence is pinned in a test: a scrubbed selector cannot match raw invalid bytes in a document (the Tag Processor reports raw bytes); if the HTML API value getters are ever changed to scrub, that pin flips to a match and must be updated in the same change. The compound-list class docblock gains a "Text Encoding" section recording the contract. 
Tests: the seven U+2603-canary escape pins flip to maximal-subpart U+FFFD expectations, and the canary is retained permanently — its job inverted from documenting the leak to proving setting-independence (a reintroduced mb_substitute_character dependence fails eight tests). New coverage: scrub + notice through from_selectors() on both list classes (the complex-list test pins the late-static-binding notice name), a lone invalid byte parsing as a U+FFFD type selector, the notice firing even when the scrubbed selector is rejected by the grammar, string-token invalid-byte decode, identity-escape equivalence and U+FFFD matching through select(), and the raw-document-bytes no-match pin (deliberately unique 0xC1 byte: select() memoizes the last parsed selector string, so a unique selector guarantees the parse-time notice under any test order). Adversarial review: three hostile reviewers. The equivalence reviewer verified the decode against an independent reference WHATWG UTF-8 decoder (exhaustive 1-2-byte tails, ~204k boundary-alphabet 3-4-byte tails, 100k random; zero mismatches, all under a U+2603 canary), scrub idempotence and ordering-neutrality (200k cases), and the notice-name propagation. The test reviewer killed four core mutations against the suite and demonstrated two test defects (select()-cache coupling, an unpinned complex-list notice name), both fixed before commit. The integration reviewer verified worker-model equivalence and determinism (10000 fuzz seeds clean). All approved. Gates: full html-api group green (1665 tests), fuzzer 5000 seeds 0 failures, self-check OK, phpcs clean. (cherry picked from commit 598ed6f)

sirreal force-pushed the html-api/add-css-selector-parser branch 2 times, most recently from 184ad41 to 8d9d9e0 Compare November 25, 2024 17:33

sirreal force-pushed the html-api/add-css-selector-parser branch 3 times, most recently from 370539b to 9ecaab5 Compare November 29, 2024 15:25

sirreal force-pushed the html-api/add-css-selector-parser branch from 8659898 to e5e4c7e Compare December 4, 2024 20:57

sirreal changed the title ~~HTML API: Add CSS selector parsing~~ HTML API: Add CSS selector support Dec 5, 2024

sirreal force-pushed the html-api/add-css-selector-parser branch from e5e4c7e to b7e032e Compare December 5, 2024 13:52

sirreal added 20 commits December 5, 2024 22:52

WIP class skeleton

0e8c4fb

Document class

2d3d283

Do not support namespaced selectors

40222d3

Flesh out stuff

6092642

Starting to actually parse

3e3b2b2

Add ident tests

967557f

Fix ident non-ascii bug

2ec1db3

Use class after defined

ee2c7ce

Fix some char stuff

0f708ba

Improve tests

3cb455d

Housekeeping

5609e50

Require new file in WP

4f25bc2

Fix offset type

943293f

Add more tests and invalid tests

24c9744

Fix wrong offset var usage

a7c10b9

comment tweak

dd718b7

Implement codepoint escape with strspn

5884aca

Test with UPPER HEX

a9a077f

Add ID tests

5f53e0a

Improve tests

effbbbe

sirreal and others added 16 commits July 10, 2025 19:36

Use "attr" instead of "att" as short "attributes"

c0fa8d5

Simplify exact or hyphen suffixed implementation

cfa2bc2

Only expose select() for a while ( $->select() ) loop

dc789e4

Matches the calling interface of the other HTML API classes and avoids creating the `Generator`. A static variable is used instead to avoid re-parsing the selector string.

Fix expectedIncorrectUsage method name

2d931ef

Improve and fix complex selector list documentation

53ee08d

Add unsupports sibling selector tests

a627471

Do not support + and ~ selectors

0058e15

Update @since tags to WP_VERSION placeholder

a99207d

Add unsupported complex selector test

0fb0c20

Fix spelling and grammar in documentation

2f47c32

Improve documentation

c572391

More documentation improvements

ac2fb56

Avoid redundant get_attribute call

57b5128

Reformat some documentation

5f06f84

Rename files and remove duplicate character

e559f6a

sirreal mentioned this pull request Sep 19, 2025

Fonts: Normalize font face font-family #9951

Closed

sirreal mentioned this pull request Mar 3, 2026

Global styles: Simplify CSS processing WordPress/gutenberg#76078

Draft

sirreal added 12 commits June 10, 2026 08:35

Merge branch 'trunk' into html-api/add-css-selector-parser

a32ffbe

Merge branch 'trunk' into html-api/add-css-selector-parser

9b194d3


HTML API: Add CSS selector support #7857
sirreal wants to merge 163 commits into WordPress:trunk from sirreal:html-api/add-css-selector-parser


sirreal Nov 26, 2024
adamziel Dec 10, 2024
sirreal Dec 10, 2024
adamziel Dec 10, 2024
sirreal Dec 10, 2024
adamziel Dec 10, 2024
sirreal Dec 11, 2024
dmsnell Jun 3, 2025


3 participants
