perf: Replace unordered_map with bitset for vocabulary lookups #2040

syedazeez337 · 2025-11-13T14:19:02Z

move KnownVocabulary + Vocabularies into their own translation unit
add compatibility alias + fixes for callers
ensure disabled vocabularies still report presence
update alter schema helper and walker tests

jviotti · 2025-11-17T12:50:07Z

benchmark_vocabulary.cc

We haves benchmarks setup on benchmark using GoogleBenchmark. Let's not add random files

jviotti · 2025-11-17T12:50:16Z

benchmark_vocabulary_detailed.cc

We haves benchmarks setup on benchmark using GoogleBenchmark. Let's not add random files

jviotti · 2025-11-17T12:53:04Z

src/core/jsonschema/include/sourcemeta/core/jsonschema_vocabularies.h

+#ifndef SOURCEMETA_CORE_JSONSCHEMA_VOCABULARIES_H_
+#define SOURCEMETA_CORE_JSONSCHEMA_VOCABULARIES_H_
+
+#include <sourcemeta/core/json.h>


Can you put this include AFTER the export #endif + a blank line? For consistency

jviotti · 2025-11-17T12:54:38Z

src/core/jsonschema/include/sourcemeta/core/jsonschema_vocabularies.h

+  Draft07 = 1u << 12,
+  Draft07Hyper = 1u << 13,
+  // 2019-09 vocabularies
+  Draft201909Core = 1u << 14,


The current name is not bad, but it's a bit hard to read. Maybe we can adopt something like Draft_2019_09_Applicator? And the same for others? I think in this case that casing its much easier to understand

jviotti · 2025-11-17T12:55:53Z

src/core/jsonschema/include/sourcemeta/core/jsonschema_vocabularies.h

+  Draft07 = 1u << 12,
+  Draft07Hyper = 1u << 13,
+  // 2019-09 vocabularies
+  Draft201909Core = 1u << 14,


Also keep in mind that 2019-09 and 2020-12 vocabularies are not called "drafts". That's only a legacy moniker for Draft 7 and older. Though I think you can't make these identifiers start with a number...

Might be worth saying i.e. JSON_Schema_2019_09_Core? And for older, JSON_Schema_Draft_7

jviotti · 2025-11-17T13:19:34Z

src/core/jsonschema/jsonschema_vocabularies.cc

+  }
+}
+
+auto sourcemeta::core::Vocabularies::all_vocabularies() const


This method looks super expensive. Can we NOT use it? If there are any places that require this, can we do it in some other way?

jviotti · 2025-11-17T13:20:21Z

src/extension/alterschema/alterschema.cc

-  return std::ranges::any_of(container, [&values](const auto &element) {
-    return values.contains(element.first);
-  });
+  for (const auto &element : container.all_vocabularies()) {


I think you can do this without all_vocabularies. i.e. loop over the values set and check if we define any in the container using .contains()?

jviotti · 2025-11-17T13:20:50Z

src/lang/parallel/include/sourcemeta/core/parallel_for_each.h

 // https://learn.microsoft.com/en-us/cpp/c-runtime-library/reference/beginthread-beginthreadex?view=msvc-170
-inline unsigned __stdcall parallel_for_each_windows_thread_start(
-    void *argument) {
+// clang-format off


Why do we disable this? Also it doesn't seem to have anything to do with this code change?

jviotti · 2025-11-17T13:22:46Z

test/jsonschema/jsonschema_official_walker_2019_09_test.cc

-  std::copy(VOCABULARIES_2019_09_VALIDATION.cbegin(),
-            VOCABULARIES_2019_09_VALIDATION.cend(),
-            std::inserter(vocabularies, vocabularies.end()));
+  vocabularies.merge(VOCABULARIES_2019_09_APPLICATOR);


Hm, I see why you needed .merge. Fine then!

jviotti · 2025-11-17T13:23:45Z

test/jsonschema/jsonschema_walker_test.cc

    -> sourcemeta::core::SchemaWalkerResult {
-  if (vocabularies.find("https://sourcemeta.com/vocab/test-1") !=
-      vocabularies.end()) {
+  if (vocabularies.find("https://sourcemeta.com/vocab/test-1").has_value()) {


In general, in C++, .find has a common convention of returning an iterator. I think having .find but returning an optional can be confusing. Maybe let's just call it .at() or even .get()

jviotti · 2025-11-17T13:25:55Z

I left various comments but the approach is awesome. My big overall comment is that you have a lot of interesting capabilities here, but you are not using them very much yet:

Internally, you store all official vocabulary URIs using the enum
But then almost every other part of the codebase will still access them using strings

My suggestion: Once you add getters that work on the enum values, can you go through the codebase and try to replace any containment check, etc for an official vocabulary URI with the actual enum value? Then we avoid expensive string comparisons. Mainly on src/extension/alterschema rules. I think that will speed up the linter by orders of magnitude

jviotti · 2025-11-17T13:31:04Z

Hm, though there are a LOT of places to go through, and many are tricky, like in frame.cc. Maybe let's just focus on the foundations merging this in this PR without a lot of extra changes and then we do the other ones in separate PRs? Otherwise this PR will become a monster...

Maybe just add a TODO in the class definition saying that we should start using getters on enum values as much as possible?

This commit addresses all maintainer feedback on PR sourcemeta#2040: API improvements: - Use initializer_list constructor instead of chained insert() calls - Use brace initialization {} for construction - Change insert() signature: insert(uri, required) instead of pair - Add enum-based overloads (contains/insert/get with Known enum) - Rename method from find() to get() for clarity Implementation improvements: - Rename jsonschema_vocabularies.cc to vocabularies.cc - Reorder vocabulary checks from newest to oldest (2020-12 first) - Rename short variable names (it -> iterator) - Move inline comments above their respective lines - Remove merge() method and update all usages Test updates: - Replace all merge() calls with combined vocabulary constants - Add VOCABULARIES_2019_09_APPLICATOR_AND_VALIDATION - Add VOCABULARIES_2020_12_APPLICATOR_AND_VALIDATION - Add VOCABULARIES_2020_12_UNEVALUATED_AND_APPLICATOR - Add VOCABULARIES_2020_12_APPLICATOR_UNEVALUATED_AND_VALIDATION - Update test macros to use get() instead of at() - Update walker_test to use get() instead of find() Documentation: - Add TODO comment documenting future optimization opportunities - Note priority areas: alterschema rules, bundle.cc, frame.cc, official_walker.cc - Document potential for orders of magnitude performance improvement Cleanup: - Remove benchmark files per maintainer request All changes maintain backward compatibility while providing optimized enum-based API for known vocabularies.

This commit addresses all maintainer feedback on PR sourcemeta#2040: API improvements: - Use initializer_list constructor instead of chained insert() calls - Use brace initialization {} for construction - Change insert() signature: insert(uri, required) instead of pair - Add enum-based overloads (contains/insert/get with Known enum) - Rename method from find() to get() for clarity Implementation improvements: - Rename jsonschema_vocabularies.cc to vocabularies.cc - Reorder vocabulary checks from newest to oldest (2020-12 first) - Rename short variable names (it -> iterator) - Move inline comments above their respective lines - Remove merge() method and update all usages Test updates: - Replace all merge() calls with combined vocabulary constants - Add VOCABULARIES_2019_09_APPLICATOR_AND_VALIDATION - Add VOCABULARIES_2020_12_APPLICATOR_AND_VALIDATION - Add VOCABULARIES_2020_12_UNEVALUATED_AND_APPLICATOR - Add VOCABULARIES_2020_12_APPLICATOR_UNEVALUATED_AND_VALIDATION - Update test macros to use get() instead of at() - Update walker_test to use get() instead of find() Documentation: - Add TODO comment documenting future optimization opportunities - Note priority areas: alterschema rules, bundle.cc, frame.cc, official_walker.cc - Document potential for orders of magnitude performance improvement Cleanup: - Remove benchmark files per maintainer request All changes maintain backward compatibility while providing optimized enum-based API for known vocabularies. Signed-off-by: Syed Azeez <syedazeez337@gmail.com>

jviotti · 2025-11-18T13:09:54Z

src/core/jsonschema/include/sourcemeta/core/jsonschema_vocabularies.h

+///
+/// TODO: To maximize performance gains, convert string-based vocabulary checks
+/// throughout the codebase to use enum-based methods. Priority areas:
+/// - src/extension/alterschema (linter and canonicalizer rules)


I think the first sentence in the TODO is good, but the rest is more for your internal notes!

jviotti · 2025-11-18T13:10:24Z

src/core/jsonschema/include/sourcemeta/core/jsonschema_vocabularies.h

+/// This will eliminate expensive string comparisons in hot paths and could
+/// improve linter performance by orders of magnitude.
+struct SOURCEMETA_CORE_JSONSCHEMA_EXPORT Vocabularies {
+  /// @ingroup jsonschema


Suggested change

/// @ingroup jsonschema

You only need this tag for the struct. Any members will by default inherit the scope

jviotti · 2025-11-18T13:11:22Z

src/core/jsonschema/include/sourcemeta/core/jsonschema_vocabularies.h

+  /// Construct from initializer list (for backward compatibility)
+  Vocabularies(std::initializer_list<std::pair<JSON::String, bool>> init);
+
+private:


Can you put the private members at the end of the struct? The convention is usually to have them at the back of it, after all public stuff

jviotti · 2025-11-18T13:12:04Z

src/core/jsonschema/include/sourcemeta/core/jsonschema_vocabularies.h

+#pragma warning(push)
+#pragma warning(disable : 4251)
+#endif
+  std::bitset<29> required_known{};


Can you statically derive the number 29 from the amount of members in the enum? I don't know from the top of my head, but I bet there is some templating utility or trick to do it?

jviotti · 2025-11-18T13:12:51Z

src/core/jsonschema/jsonschema.cc

  if (base_dialect == "https://json-schema.org/draft/2020-12/schema" ||
      base_dialect == "https://json-schema.org/draft/2020-12/hyper-schema") {
-    return "https://json-schema.org/draft/2020-12/vocab/core";
+    return sourcemeta::core::Vocabularies::Known::JSON_Schema_2020_12_Core;


Yes, I like this a lot!

jviotti · 2025-11-18T13:18:04Z

src/core/jsonschema/vocabularies.cc

+    return this->get(maybe_known.value());
+  }
+  // Fallback to custom vocabularies
+  const auto iterator = this->custom.find(std::string{uri});


If you will need to convert to std::string anyway, then maybe make this method just take const std::string & as argument?

jviotti · 2025-11-18T13:18:46Z

src/core/jsonschema/vocabularies.cc

+  // Verify invariant: vocabulary cannot be both required and optional
+  // Use [] operator instead of test() to avoid exceptions in noexcept function
+  assert(!this->required_known[index] || !this->optional_known[index]);
+  if (this->required_known[index])


Can we always use curly braces on conditionals? This is a convention we have throughout the repo

jviotti · 2025-11-18T13:20:53Z

src/core/jsonschema/walker.cc

  if (base_dialect.has_value() && dialect.has_value()) {
-    vocabularies.merge(sourcemeta::core::vocabularies(
-        resolver, base_dialect.value(), dialect.value()));
+    vocabularies = sourcemeta::core::vocabularies(


Looks good, but given the new assignment operator, you are first constructing vocabularies and then doing pretty much a reset. Can you construct it as expected in one shot? i.e. maybe something like this:

Vocabularies vocabularies{ base_dialect.has_value() && dialect.has_value() ? sourcemeta::core::vocabularies(resolver, base_dialect.value(), dialect.value()) : {}};

jviotti · 2025-11-18T13:21:59Z

src/extension/alterschema/alterschema.cc

-  return std::ranges::any_of(container, [&values](const auto &element) {
-    return values.contains(element.first);
-  });
+static auto contains_any(const Vocabularies &container,


Suggested change

static auto contains_any(const Vocabularies &container,

// TODO: Move this to `Vocabularies::contains_any` or something like that

static auto contains_any(const Vocabularies &container,

Not for now. Just a note to self. I think this will be a nice refactoring later on

jviotti · 2025-11-18T13:22:57Z

test/jsonschema/jsonschema_official_walker_2019_09_test.cc


+static const sourcemeta::core::Vocabularies
+    VOCABULARIES_2019_09_APPLICATOR_AND_VALIDATION{
+        {"https://json-schema.org/draft/2019-09/vocab/core", true},


These are all cases where we can use the enum values. But I guess we can do it on a separate PR

Replace unordered_map with hybrid bitset approach for known vocabularies, falling back to unordered_map for custom vocabularies. This eliminates expensive string comparisons and hash lookups in hot paths. Key changes: - Introduce Vocabularies struct with Known enum for official vocabularies - Use std::bitset for O(1) lookup of 29 known JSON Schema vocabularies - Maintain custom vocabulary support via unordered_map fallback - Add both string and enum-based constructors for flexibility - Update all vocabulary construction sites to use enum values - Derive bitset size from Known::COUNT enum sentinel - Move private members to end of struct per convention - Use const std::string& instead of string_view to avoid conversions - Add curly braces to all conditionals per repo convention - Add TODO for future contains_any refactoring Performance impact: - Eliminates hash computation and string comparisons for known vocabularies - Reduces memory overhead with compact bitset representation - Maintains backward compatibility with existing API Signed-off-by: Syed Azeez <syedazeez337@gmail.com>

syedazeez337 force-pushed the feature/bitset-vocabulary-optimization branch 5 times, most recently from f8ad20c to 223613e Compare November 13, 2025 17:48

jviotti requested changes Nov 17, 2025

View reviewed changes

syedazeez337 force-pushed the feature/bitset-vocabulary-optimization branch from 223613e to 7704ff3 Compare November 17, 2025 17:49

syedazeez337 force-pushed the feature/bitset-vocabulary-optimization branch 8 times, most recently from 7bcdcfd to 25ba9c3 Compare November 18, 2025 06:20

jviotti requested changes Nov 18, 2025

View reviewed changes

syedazeez337 force-pushed the feature/bitset-vocabulary-optimization branch from 25ba9c3 to 8136266 Compare November 18, 2025 15:29

syedazeez337 force-pushed the feature/bitset-vocabulary-optimization branch from 8136266 to 02f7c10 Compare November 18, 2025 16:19

	static auto contains_any(const Vocabularies &container,
	// TODO: Move this to `Vocabularies::contains_any` or something like that
	static auto contains_any(const Vocabularies &container,

Uh oh!

perf: Replace unordered_map with bitset for vocabulary lookups #2040

Are you sure you want to change the base?

perf: Replace unordered_map with bitset for vocabulary lookups #2040

Uh oh!

Conversation

syedazeez337 commented Nov 13, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jviotti commented Nov 17, 2025

Uh oh!

jviotti commented Nov 17, 2025

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

syedazeez337 commented Nov 13, 2025 •

edited

Loading