-
-
Notifications
You must be signed in to change notification settings - Fork 56
Advice on how to use libfsm for generating performant pattern matchers #515
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
e5f78fb to
a5bc8d4
Compare
doc/GUIDE.md
Outdated
|
|
||
| ### 2. Anchor When You Require Full Matches | ||
|
|
||
| FSMs only do what’s specified. Explicitly anchor when matching entire strings: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this isn't about doing only what's specified, it's about writing what we mean. without the anchors, you'd allow a prefix and a suffix. if you mean that, great. if you don't mean that, we're advising to not write them
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I updated by adding "When the intention is to match an entire string, use anchors."
Let me know if you still think it is better to not write them.
doc/GUIDE.md
Outdated
|
|
||
| ### Compilation Takes Too Long | ||
|
|
||
| Likely caused by unrestricted wildcards (`.*`, `.+`). Fix with: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's worth explaining why this happens
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added some explanation and link back to the Writing Effective Regex guide. Let me know if you think it is too repetitive or too long.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks great, thank you
doc/GUIDE.md
Outdated
| @@ -0,0 +1,186 @@ | |||
| # Using libfsm for High-Performance Pattern Matching | |||
|
|
|||
| libfsm compiles regular expressions to deterministic finite state machines (FSMs) and generates executable code. FSM-based matching runs in **linear time O(n)** with **no backtracking**. | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is such a significant feature that it may be worth saying a little more here?
Do you think the target audience will know what backtracking is, or the implications of that? Maybe saying that it doesn't need to make any complex decisions at runtime? In contrast to PCRE, where the same regex can have wildly different runtime costs depending on the input.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added some information, what do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think if we have anything else to say about that, it should go in the top-level readme. we're underselling there
doc/GUIDE.md
Outdated
|
|
||
| * Word boundaries (`\b`) | ||
| * Non-greedy quantifiers (`*?`, `+?`, `??`) | ||
| * Group capture and backreferences |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Group capture is coming soon :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added "coming soon" in the places I mention "group capture".
Let me know if I should retract those.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that's fine!
doc/GUIDE.md
Outdated
| Likely caused by unrestricted wildcards (`.*`, `.+`). Fix with: | ||
|
|
||
| * Negated classes (`[^)]*`) | ||
| * Bounded repeats (`{0,50}`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Depending on what's being repeated, bounded repeats can be substantially more expensive, because instead of representing and then adding a loop around it, the FSM needs to explicitly represent 50 instances of the pattern (which, in this case, are all optional) in order to have the fixed upper limit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
definitely worth explaining
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps a more informative example: If you have the pattern ^x{3,5}$, libfsm's resulting DFA will be structured like "match an x, then match an x, then match an x, then match an x or skip it, then match an x or skip it, then report an overall match if at the end of input". It has to repeat the pattern, noting each time whether it's required or optional (beyond the lower count in {min,max}), because DFA execution doesn't have a counter, just the current state within the overall DFA.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the explanation!
Now that I know that the suggestion is not correct, I have removed it from the docs.
Do you think this information is good to be included in the document?
If so, in which section should it be added into? Is it recommended to not use this? What is the alternative instead?
Or is it just a "good to know" section? What other "good to know" do you think is worth mentioning in the document?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm going to add back in a section about bounded repetition
8b22b55 to
7306570
Compare
|
Thank you so much for the reviews! |
doc/GUIDE.md
Outdated
| # Risky: $ may allow an extra newline | ||
| /foo$ | ||
| ``` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's not "risky", it's either incorrect, or it's unnecessary complexity
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I changed it to:
# Incorrect: allows a trailing newline,
# which is usually unintended and adds unnecessary complexity
Let me know if this is a bit wordy?
(Contributed by Scott)
This PR introduces a new documentation file (doc/advice.md) that provides practical advice on using libfsm for generating reasonably performant pattern matching code.