Adding edges() and iteredges() Functions for DAWGs #1
EliFinkelshteyn wants to merge 19 commits into pytries:master
…tionDawgs; adding tests for all
…acing dev data for those
…d for all new edges methods
These latest additions add edges() and iteredges() functionality for all applicable DAWGs and clean up the code since the original pull request. They complete all the work I planned to implement that we originally discussed. Would love to hear your thoughts @kmike.
dawg_python/dawgs.py
This is backwards incompatible: .items should return an empty list here, not None.
Good call. Will fix (and add a test for future).
👍 for separating Completer and EdgeFollower
…iate comments to doc strings
… always be used-- not utf-8
dawg_python/dawgs.py
I think that the .edges method should return the same kind of data regardless of the DAWG class. If it returns a list of strings in a base class, it should return a list of strings in all subclasses.
For BytesDAWG it could make sense to filter out edges leading to the values.
It's similar data for all. It never returns a list of strings. It always returns a list of 2-tuples. For dawgs with no data, the tuples are (str, True) for terminal edges and (str, False) for non-terminals.
For dawgs with data, they're (str, data) for terminal edges, and (str, False) for non-terminals. Since data evaluates to true in a boolean situation, this seems most logical to me. If you want the data in an edge, you have it. If you want to just use the edges and know whether they're terminals or not, you can do that the same way across dawgs.
If we really want them to be the same, we could make them return (str, True) for terminal edges always, and just add an extra edges_with_data() method for dawgs that provide any kind of data storage. That actually seems most consistent to me. If you agree, I'll make that addition.
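To make the two return shapes in this thread concrete, here is a minimal sketch that uses a plain set or dict as a stand-in for a DAWG. It is not the dawg_python implementation: the function names `edges` and `edges_data`, and the choice to return full keys including the prefix, are assumptions made only for illustration.

```python
# Hypothetical stand-in: a set/dict of keys plays the role of the DAWG.
def edges(keys, prefix=""):
    """Return sorted (key, is_terminal) 2-tuples one character past prefix."""
    found = {}
    for key in keys:
        if key.startswith(prefix) and len(key) > len(prefix):
            step = key[:len(prefix) + 1]
            # The edge is terminal when the extended prefix is itself a key.
            found[step] = step in keys
    return sorted(found.items())


def edges_data(records, prefix=""):
    """Same traversal, but terminal edges carry the stored data, not True."""
    return [(step, records[step] if terminal else False)
            for step, terminal in edges(set(records), prefix)]
```

With keys `{"f", "fa", "foo"}`, `edges(keys, "f")` gives `[("fa", True), ("fo", False)]` in this sketch. One design consequence worth noting: if stored data can be falsy (for example `0` or an empty bytes value), a data-carrying tuple no longer doubles as a terminal flag, which is an argument for keeping the boolean and data variants separate.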
…a and iteredges_data for appropriate dawgs; adding tests for new methods
@kmike, with the latest change everything looks done as far as I can tell. I'd like to start using this in prod soon. I can use my fork, but if you plan to merge soon, that would be even better.
Hi @EliFinkelshteyn,
It seems the main complexity is that some characters are represented by multiple transitions, right? You solved it by trying to decode the data until it succeeds, which is reasonable for UTF-8.
Regarding the API: so .edges is like .keys, but it only traverses the graph to a depth of one Unicode character, and also returns whether the result is terminal or not? I think that is reasonable. One question is whether it should return full keys or partial keys, without the prefix. You've implemented it the same way as Completer, which looks fine.
Could you please add more tests? For example, based on https://coveralls.io/builds/2376072/source?filename=dawg_python%2Fwrapper.py, the code which handles UnicodeDecodeErrors is untested; some conditions are also missing in dawgs.py (see https://coveralls.io/builds/2376072/source?filename=dawg_python%2Fdawgs.py).
Thanks for your PR! It is well-written 👍 But I need a bit more time to review it. I'm not sure I'll be able to finish the review during this work week; the weekend is more likely.
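The "decode until it succeeds" idea can be sketched on a flat byte string. This is only an illustration of the technique, not the PR's code: the real implementation follows DAWG transitions rather than slicing bytes, and the function name `first_char_by_retry` is hypothetical.

```python
def first_char_by_retry(data):
    """Grow a byte prefix one byte at a time until it decodes as UTF-8."""
    # A UTF-8 character occupies at most 4 bytes.
    for end in range(1, min(len(data), 4) + 1):
        try:
            return data[:end].decode("utf-8")
        except UnicodeDecodeError:
            continue  # incomplete or invalid so far; try one more byte
    raise ValueError("no valid UTF-8 character at start of data")
```

For example, `first_char_by_retry("\u00e9!".encode("utf-8"))` fails on the first byte alone and succeeds once the second byte is included.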
So, there's actually an issue here. When Unicode characters share the same first bytes, this will only return one of them. I am working on fixing that now. I realized you can tell exactly how many bytes are in a UTF-8 encoded character by how many leading one bits the first byte has, so I can use this to speed up the whole thing a bit as well.
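The leading-ones rule follows from how UTF-8 lays out lead bytes. Below is a minimal sketch, assuming Python 3 (where indexing `bytes` yields an int); the helper name `utf8_char_len` is hypothetical and is not taken from the PR. It does not reject overlong or out-of-range lead bytes such as 0xC0 or 0xF5.

```python
def utf8_char_len(first_byte):
    """Byte length of the UTF-8 sequence starting with first_byte (an int)."""
    if first_byte < 0x80:            # 0xxxxxxx: single byte (ASCII)
        return 1
    if first_byte >> 5 == 0b110:     # 110xxxxx: two bytes
        return 2
    if first_byte >> 4 == 0b1110:    # 1110xxxx: three bytes
        return 3
    if first_byte >> 3 == 0b11110:   # 11110xxx: four bytes
        return 4
    # 10xxxxxx is a continuation byte; anything else is not a lead byte.
    raise ValueError("not a valid UTF-8 lead byte")


# U+00E9 (C3 A9) and U+00FC (C3 BC) share the lead byte 0xC3, so every
# branch after that byte has to be followed to find both characters.
shared_lead = "\u00e9".encode("utf-8")[0] == "\u00fc".encode("utf-8")[0]
```

Knowing the length up front means a traversal can read exactly that many transitions per character instead of retrying a decode after each byte.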
Good catch. For some reason I thought that UTF-8 self-synchronization was enough to make repeated decoding work, but it is not.
That was my bad. I didn't know Python 3.2 has an issue with the 'u' string prefix. I'm also just tired, so this took way longer than it should have. It should all be fixed and working now though.
Ping here. Anything else you want done for this to be merged in? |
As discussed at pytries/marisa-trie#20, this adds the edges() and iteredges() methods to CompletionDAWG. If this looks good, I'll add similar support for RecordDAWGs and BytesDAWGs. The code isn't as optimized as it could be, but it works, it's clean (IMO), and it's fast enough for me.