-
-
Notifications
You must be signed in to change notification settings - Fork 9
perf: Replace unordered_map with bitset for vocabulary lookups #2040
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
perf: Replace unordered_map with bitset for vocabulary lookups #2040
Conversation
f8ad20c to
223613e
Compare
benchmark_vocabulary.cc
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We haves benchmarks setup on benchmark using GoogleBenchmark. Let's not add random files
benchmark_vocabulary_detailed.cc
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We haves benchmarks setup on benchmark using GoogleBenchmark. Let's not add random files
| #ifndef SOURCEMETA_CORE_JSONSCHEMA_VOCABULARIES_H_ | ||
| #define SOURCEMETA_CORE_JSONSCHEMA_VOCABULARIES_H_ | ||
|
|
||
| #include <sourcemeta/core/json.h> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you put this include AFTER the export #endif + a blank line? For consistency
| Draft07 = 1u << 12, | ||
| Draft07Hyper = 1u << 13, | ||
| // 2019-09 vocabularies | ||
| Draft201909Core = 1u << 14, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The current name is not bad, but it's a bit hard to read. Maybe we can adopt something like Draft_2019_09_Applicator? And the same for others? I think in this case that casing its much easier to understand
| Draft07 = 1u << 12, | ||
| Draft07Hyper = 1u << 13, | ||
| // 2019-09 vocabularies | ||
| Draft201909Core = 1u << 14, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also keep in mind that 2019-09 and 2020-12 vocabularies are not called "drafts". That's only a legacy moniker for Draft 7 and older. Though I think you can't make these identifiers start with a number...
Might be worth saying i.e. JSON_Schema_2019_09_Core? And for older, JSON_Schema_Draft_7
| } | ||
| } | ||
|
|
||
| auto sourcemeta::core::Vocabularies::all_vocabularies() const |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method looks super expensive. Can we NOT use it? If there are any places that require this, can we do it in some other way?
| return std::ranges::any_of(container, [&values](const auto &element) { | ||
| return values.contains(element.first); | ||
| }); | ||
| for (const auto &element : container.all_vocabularies()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you can do this without all_vocabularies. i.e. loop over the values set and check if we define any in the container using .contains()?
| // https://learn.microsoft.com/en-us/cpp/c-runtime-library/reference/beginthread-beginthreadex?view=msvc-170 | ||
| inline unsigned __stdcall parallel_for_each_windows_thread_start( | ||
| void *argument) { | ||
| // clang-format off |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we disable this? Also it doesn't seem to have anything to do with this code change?
| std::copy(VOCABULARIES_2019_09_VALIDATION.cbegin(), | ||
| VOCABULARIES_2019_09_VALIDATION.cend(), | ||
| std::inserter(vocabularies, vocabularies.end())); | ||
| vocabularies.merge(VOCABULARIES_2019_09_APPLICATOR); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, I see why you needed .merge. Fine then!
| -> sourcemeta::core::SchemaWalkerResult { | ||
| if (vocabularies.find("https://sourcemeta.com/vocab/test-1") != | ||
| vocabularies.end()) { | ||
| if (vocabularies.find("https://sourcemeta.com/vocab/test-1").has_value()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general, in C++, .find has a common convention of returning an iterator. I think having .find but returning an optional can be confusing. Maybe let's just call it .at() or even .get()
|
I left various comments but the approach is awesome. My big overall comment is that you have a lot of interesting capabilities here, but you are not using them very much yet:
My suggestion: Once you add getters that work on the |
|
Hm, though there are a LOT of places to go through, and many are tricky, like in Maybe just add a TODO in the class definition saying that we should start using getters on enum values as much as possible? |
This commit addresses all maintainer feedback on PR sourcemeta#2040: API improvements: - Use initializer_list constructor instead of chained insert() calls - Use brace initialization {} for construction - Change insert() signature: insert(uri, required) instead of pair - Add enum-based overloads (contains/insert/get with Known enum) - Rename method from find() to get() for clarity Implementation improvements: - Rename jsonschema_vocabularies.cc to vocabularies.cc - Reorder vocabulary checks from newest to oldest (2020-12 first) - Rename short variable names (it -> iterator) - Move inline comments above their respective lines - Remove merge() method and update all usages Test updates: - Replace all merge() calls with combined vocabulary constants - Add VOCABULARIES_2019_09_APPLICATOR_AND_VALIDATION - Add VOCABULARIES_2020_12_APPLICATOR_AND_VALIDATION - Add VOCABULARIES_2020_12_UNEVALUATED_AND_APPLICATOR - Add VOCABULARIES_2020_12_APPLICATOR_UNEVALUATED_AND_VALIDATION - Update test macros to use get() instead of at() - Update walker_test to use get() instead of find() Documentation: - Add TODO comment documenting future optimization opportunities - Note priority areas: alterschema rules, bundle.cc, frame.cc, official_walker.cc - Document potential for orders of magnitude performance improvement Cleanup: - Remove benchmark files per maintainer request All changes maintain backward compatibility while providing optimized enum-based API for known vocabularies.
223613e to
7704ff3
Compare
This commit addresses all maintainer feedback on PR sourcemeta#2040: API improvements: - Use initializer_list constructor instead of chained insert() calls - Use brace initialization {} for construction - Change insert() signature: insert(uri, required) instead of pair - Add enum-based overloads (contains/insert/get with Known enum) - Rename method from find() to get() for clarity Implementation improvements: - Rename jsonschema_vocabularies.cc to vocabularies.cc - Reorder vocabulary checks from newest to oldest (2020-12 first) - Rename short variable names (it -> iterator) - Move inline comments above their respective lines - Remove merge() method and update all usages Test updates: - Replace all merge() calls with combined vocabulary constants - Add VOCABULARIES_2019_09_APPLICATOR_AND_VALIDATION - Add VOCABULARIES_2020_12_APPLICATOR_AND_VALIDATION - Add VOCABULARIES_2020_12_UNEVALUATED_AND_APPLICATOR - Add VOCABULARIES_2020_12_APPLICATOR_UNEVALUATED_AND_VALIDATION - Update test macros to use get() instead of at() - Update walker_test to use get() instead of find() Documentation: - Add TODO comment documenting future optimization opportunities - Note priority areas: alterschema rules, bundle.cc, frame.cc, official_walker.cc - Document potential for orders of magnitude performance improvement Cleanup: - Remove benchmark files per maintainer request All changes maintain backward compatibility while providing optimized enum-based API for known vocabularies. Signed-off-by: Syed Azeez <syedazeez337@gmail.com>
7bcdcfd to
25ba9c3
Compare
| /// | ||
| /// TODO: To maximize performance gains, convert string-based vocabulary checks | ||
| /// throughout the codebase to use enum-based methods. Priority areas: | ||
| /// - src/extension/alterschema (linter and canonicalizer rules) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the first sentence in the TODO is good, but the rest is more for your internal notes!
| /// This will eliminate expensive string comparisons in hot paths and could | ||
| /// improve linter performance by orders of magnitude. | ||
| struct SOURCEMETA_CORE_JSONSCHEMA_EXPORT Vocabularies { | ||
| /// @ingroup jsonschema |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| /// @ingroup jsonschema |
You only need this tag for the struct. Any members will by default inherit the scope
| /// Construct from initializer list (for backward compatibility) | ||
| Vocabularies(std::initializer_list<std::pair<JSON::String, bool>> init); | ||
|
|
||
| private: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you put the private members at the end of the struct? The convention is usually to have them at the back of it, after all public stuff
| #pragma warning(push) | ||
| #pragma warning(disable : 4251) | ||
| #endif | ||
| std::bitset<29> required_known{}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you statically derive the number 29 from the amount of members in the enum? I don't know from the top of my head, but I bet there is some templating utility or trick to do it?
| if (base_dialect == "https://json-schema.org/draft/2020-12/schema" || | ||
| base_dialect == "https://json-schema.org/draft/2020-12/hyper-schema") { | ||
| return "https://json-schema.org/draft/2020-12/vocab/core"; | ||
| return sourcemeta::core::Vocabularies::Known::JSON_Schema_2020_12_Core; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I like this a lot!
src/core/jsonschema/vocabularies.cc
Outdated
| return this->get(maybe_known.value()); | ||
| } | ||
| // Fallback to custom vocabularies | ||
| const auto iterator = this->custom.find(std::string{uri}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you will need to convert to std::string anyway, then maybe make this method just take const std::string & as argument?
src/core/jsonschema/vocabularies.cc
Outdated
| // Verify invariant: vocabulary cannot be both required and optional | ||
| // Use [] operator instead of test() to avoid exceptions in noexcept function | ||
| assert(!this->required_known[index] || !this->optional_known[index]); | ||
| if (this->required_known[index]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we always use curly braces on conditionals? This is a convention we have throughout the repo
src/core/jsonschema/walker.cc
Outdated
| if (base_dialect.has_value() && dialect.has_value()) { | ||
| vocabularies.merge(sourcemeta::core::vocabularies( | ||
| resolver, base_dialect.value(), dialect.value())); | ||
| vocabularies = sourcemeta::core::vocabularies( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, but given the new assignment operator, you are first constructing vocabularies and then doing pretty much a reset. Can you construct it as expected in one shot? i.e. maybe something like this:
Vocabularies vocabularies{
base_dialect.has_value() && dialect.has_value()
? sourcemeta::core::vocabularies(resolver, base_dialect.value(), dialect.value())
: {}};| return std::ranges::any_of(container, [&values](const auto &element) { | ||
| return values.contains(element.first); | ||
| }); | ||
| static auto contains_any(const Vocabularies &container, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| static auto contains_any(const Vocabularies &container, | |
| // TODO: Move this to `Vocabularies::contains_any` or something like that | |
| static auto contains_any(const Vocabularies &container, |
Not for now. Just a note to self. I think this will be a nice refactoring later on
|
|
||
| static const sourcemeta::core::Vocabularies | ||
| VOCABULARIES_2019_09_APPLICATOR_AND_VALIDATION{ | ||
| {"https://json-schema.org/draft/2019-09/vocab/core", true}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are all cases where we can use the enum values. But I guess we can do it on a separate PR
25ba9c3 to
8136266
Compare
Replace unordered_map with hybrid bitset approach for known vocabularies, falling back to unordered_map for custom vocabularies. This eliminates expensive string comparisons and hash lookups in hot paths. Key changes: - Introduce Vocabularies struct with Known enum for official vocabularies - Use std::bitset for O(1) lookup of 29 known JSON Schema vocabularies - Maintain custom vocabulary support via unordered_map fallback - Add both string and enum-based constructors for flexibility - Update all vocabulary construction sites to use enum values - Derive bitset size from Known::COUNT enum sentinel - Move private members to end of struct per convention - Use const std::string& instead of string_view to avoid conversions - Add curly braces to all conditionals per repo convention - Add TODO for future contains_any refactoring Performance impact: - Eliminates hash computation and string comparisons for known vocabularies - Reduces memory overhead with compact bitset representation - Maintains backward compatibility with existing API Signed-off-by: Syed Azeez <syedazeez337@gmail.com>
8136266 to
02f7c10
Compare
Uh oh!
There was an error while loading. Please reload this page.