Skip to content

Conversation

@syedazeez337
Copy link

@syedazeez337 syedazeez337 commented Nov 13, 2025

  • move KnownVocabulary + Vocabularies into their own translation unit
  • add compatibility alias + fixes for callers
  • ensure disabled vocabularies still report presence
  • update alter schema helper and walker tests

@syedazeez337 syedazeez337 force-pushed the feature/bitset-vocabulary-optimization branch 5 times, most recently from f8ad20c to 223613e Compare November 13, 2025 17:48
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We haves benchmarks setup on benchmark using GoogleBenchmark. Let's not add random files

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We haves benchmarks setup on benchmark using GoogleBenchmark. Let's not add random files

#ifndef SOURCEMETA_CORE_JSONSCHEMA_VOCABULARIES_H_
#define SOURCEMETA_CORE_JSONSCHEMA_VOCABULARIES_H_

#include <sourcemeta/core/json.h>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you put this include AFTER the export #endif + a blank line? For consistency

Draft07 = 1u << 12,
Draft07Hyper = 1u << 13,
// 2019-09 vocabularies
Draft201909Core = 1u << 14,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current name is not bad, but it's a bit hard to read. Maybe we can adopt something like Draft_2019_09_Applicator? And the same for others? I think in this case that casing its much easier to understand

Draft07 = 1u << 12,
Draft07Hyper = 1u << 13,
// 2019-09 vocabularies
Draft201909Core = 1u << 14,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also keep in mind that 2019-09 and 2020-12 vocabularies are not called "drafts". That's only a legacy moniker for Draft 7 and older. Though I think you can't make these identifiers start with a number...

Might be worth saying i.e. JSON_Schema_2019_09_Core? And for older, JSON_Schema_Draft_7

}
}

auto sourcemeta::core::Vocabularies::all_vocabularies() const
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method looks super expensive. Can we NOT use it? If there are any places that require this, can we do it in some other way?

return std::ranges::any_of(container, [&values](const auto &element) {
return values.contains(element.first);
});
for (const auto &element : container.all_vocabularies()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can do this without all_vocabularies. i.e. loop over the values set and check if we define any in the container using .contains()?

// https://learn.microsoft.com/en-us/cpp/c-runtime-library/reference/beginthread-beginthreadex?view=msvc-170
inline unsigned __stdcall parallel_for_each_windows_thread_start(
void *argument) {
// clang-format off
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we disable this? Also it doesn't seem to have anything to do with this code change?

std::copy(VOCABULARIES_2019_09_VALIDATION.cbegin(),
VOCABULARIES_2019_09_VALIDATION.cend(),
std::inserter(vocabularies, vocabularies.end()));
vocabularies.merge(VOCABULARIES_2019_09_APPLICATOR);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, I see why you needed .merge. Fine then!

-> sourcemeta::core::SchemaWalkerResult {
if (vocabularies.find("https://sourcemeta.com/vocab/test-1") !=
vocabularies.end()) {
if (vocabularies.find("https://sourcemeta.com/vocab/test-1").has_value()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general, in C++, .find has a common convention of returning an iterator. I think having .find but returning an optional can be confusing. Maybe let's just call it .at() or even .get()

@jviotti
Copy link
Member

jviotti commented Nov 17, 2025

I left various comments but the approach is awesome. My big overall comment is that you have a lot of interesting capabilities here, but you are not using them very much yet:

  • Internally, you store all official vocabulary URIs using the enum
  • But then almost every other part of the codebase will still access them using strings

My suggestion: Once you add getters that work on the enum values, can you go through the codebase and try to replace any containment check, etc for an official vocabulary URI with the actual enum value? Then we avoid expensive string comparisons. Mainly on src/extension/alterschema rules. I think that will speed up the linter by orders of magnitude

@jviotti
Copy link
Member

jviotti commented Nov 17, 2025

Hm, though there are a LOT of places to go through, and many are tricky, like in frame.cc. Maybe let's just focus on the foundations merging this in this PR without a lot of extra changes and then we do the other ones in separate PRs? Otherwise this PR will become a monster...

Maybe just add a TODO in the class definition saying that we should start using getters on enum values as much as possible?

syedazeez337 added a commit to syedazeez337/core that referenced this pull request Nov 17, 2025
This commit addresses all maintainer feedback on PR sourcemeta#2040:

API improvements:
- Use initializer_list constructor instead of chained insert() calls
- Use brace initialization {} for construction
- Change insert() signature: insert(uri, required) instead of pair
- Add enum-based overloads (contains/insert/get with Known enum)
- Rename method from find() to get() for clarity

Implementation improvements:
- Rename jsonschema_vocabularies.cc to vocabularies.cc
- Reorder vocabulary checks from newest to oldest (2020-12 first)
- Rename short variable names (it -> iterator)
- Move inline comments above their respective lines
- Remove merge() method and update all usages

Test updates:
- Replace all merge() calls with combined vocabulary constants
- Add VOCABULARIES_2019_09_APPLICATOR_AND_VALIDATION
- Add VOCABULARIES_2020_12_APPLICATOR_AND_VALIDATION
- Add VOCABULARIES_2020_12_UNEVALUATED_AND_APPLICATOR
- Add VOCABULARIES_2020_12_APPLICATOR_UNEVALUATED_AND_VALIDATION
- Update test macros to use get() instead of at()
- Update walker_test to use get() instead of find()

Documentation:
- Add TODO comment documenting future optimization opportunities
- Note priority areas: alterschema rules, bundle.cc, frame.cc, official_walker.cc
- Document potential for orders of magnitude performance improvement

Cleanup:
- Remove benchmark files per maintainer request

All changes maintain backward compatibility while providing optimized
enum-based API for known vocabularies.
@syedazeez337 syedazeez337 force-pushed the feature/bitset-vocabulary-optimization branch from 223613e to 7704ff3 Compare November 17, 2025 17:49
syedazeez337 added a commit to syedazeez337/core that referenced this pull request Nov 17, 2025
This commit addresses all maintainer feedback on PR sourcemeta#2040:

API improvements:
- Use initializer_list constructor instead of chained insert() calls
- Use brace initialization {} for construction
- Change insert() signature: insert(uri, required) instead of pair
- Add enum-based overloads (contains/insert/get with Known enum)
- Rename method from find() to get() for clarity

Implementation improvements:
- Rename jsonschema_vocabularies.cc to vocabularies.cc
- Reorder vocabulary checks from newest to oldest (2020-12 first)
- Rename short variable names (it -> iterator)
- Move inline comments above their respective lines
- Remove merge() method and update all usages

Test updates:
- Replace all merge() calls with combined vocabulary constants
- Add VOCABULARIES_2019_09_APPLICATOR_AND_VALIDATION
- Add VOCABULARIES_2020_12_APPLICATOR_AND_VALIDATION
- Add VOCABULARIES_2020_12_UNEVALUATED_AND_APPLICATOR
- Add VOCABULARIES_2020_12_APPLICATOR_UNEVALUATED_AND_VALIDATION
- Update test macros to use get() instead of at()
- Update walker_test to use get() instead of find()

Documentation:
- Add TODO comment documenting future optimization opportunities
- Note priority areas: alterschema rules, bundle.cc, frame.cc, official_walker.cc
- Document potential for orders of magnitude performance improvement

Cleanup:
- Remove benchmark files per maintainer request

All changes maintain backward compatibility while providing optimized
enum-based API for known vocabularies.

Signed-off-by: Syed Azeez <syedazeez337@gmail.com>
@syedazeez337 syedazeez337 force-pushed the feature/bitset-vocabulary-optimization branch 8 times, most recently from 7bcdcfd to 25ba9c3 Compare November 18, 2025 06:20
///
/// TODO: To maximize performance gains, convert string-based vocabulary checks
/// throughout the codebase to use enum-based methods. Priority areas:
/// - src/extension/alterschema (linter and canonicalizer rules)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the first sentence in the TODO is good, but the rest is more for your internal notes!

/// This will eliminate expensive string comparisons in hot paths and could
/// improve linter performance by orders of magnitude.
struct SOURCEMETA_CORE_JSONSCHEMA_EXPORT Vocabularies {
/// @ingroup jsonschema
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// @ingroup jsonschema

You only need this tag for the struct. Any members will by default inherit the scope

/// Construct from initializer list (for backward compatibility)
Vocabularies(std::initializer_list<std::pair<JSON::String, bool>> init);

private:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you put the private members at the end of the struct? The convention is usually to have them at the back of it, after all public stuff

#pragma warning(push)
#pragma warning(disable : 4251)
#endif
std::bitset<29> required_known{};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you statically derive the number 29 from the amount of members in the enum? I don't know from the top of my head, but I bet there is some templating utility or trick to do it?

if (base_dialect == "https://json-schema.org/draft/2020-12/schema" ||
base_dialect == "https://json-schema.org/draft/2020-12/hyper-schema") {
return "https://json-schema.org/draft/2020-12/vocab/core";
return sourcemeta::core::Vocabularies::Known::JSON_Schema_2020_12_Core;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I like this a lot!

return this->get(maybe_known.value());
}
// Fallback to custom vocabularies
const auto iterator = this->custom.find(std::string{uri});
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you will need to convert to std::string anyway, then maybe make this method just take const std::string & as argument?

// Verify invariant: vocabulary cannot be both required and optional
// Use [] operator instead of test() to avoid exceptions in noexcept function
assert(!this->required_known[index] || !this->optional_known[index]);
if (this->required_known[index])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we always use curly braces on conditionals? This is a convention we have throughout the repo

if (base_dialect.has_value() && dialect.has_value()) {
vocabularies.merge(sourcemeta::core::vocabularies(
resolver, base_dialect.value(), dialect.value()));
vocabularies = sourcemeta::core::vocabularies(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, but given the new assignment operator, you are first constructing vocabularies and then doing pretty much a reset. Can you construct it as expected in one shot? i.e. maybe something like this:

Vocabularies vocabularies{
  base_dialect.has_value() && dialect.has_value() 
  ? sourcemeta::core::vocabularies(resolver, base_dialect.value(), dialect.value()) 
  : {}};

return std::ranges::any_of(container, [&values](const auto &element) {
return values.contains(element.first);
});
static auto contains_any(const Vocabularies &container,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
static auto contains_any(const Vocabularies &container,
// TODO: Move this to `Vocabularies::contains_any` or something like that
static auto contains_any(const Vocabularies &container,

Not for now. Just a note to self. I think this will be a nice refactoring later on


static const sourcemeta::core::Vocabularies
VOCABULARIES_2019_09_APPLICATOR_AND_VALIDATION{
{"https://json-schema.org/draft/2019-09/vocab/core", true},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are all cases where we can use the enum values. But I guess we can do it on a separate PR

@syedazeez337 syedazeez337 force-pushed the feature/bitset-vocabulary-optimization branch from 25ba9c3 to 8136266 Compare November 18, 2025 15:29
Replace unordered_map with hybrid bitset approach for known vocabularies,
falling back to unordered_map for custom vocabularies. This eliminates
expensive string comparisons and hash lookups in hot paths.

Key changes:
- Introduce Vocabularies struct with Known enum for official vocabularies
- Use std::bitset for O(1) lookup of 29 known JSON Schema vocabularies
- Maintain custom vocabulary support via unordered_map fallback
- Add both string and enum-based constructors for flexibility
- Update all vocabulary construction sites to use enum values
- Derive bitset size from Known::COUNT enum sentinel
- Move private members to end of struct per convention
- Use const std::string& instead of string_view to avoid conversions
- Add curly braces to all conditionals per repo convention
- Add TODO for future contains_any refactoring

Performance impact:
- Eliminates hash computation and string comparisons for known vocabularies
- Reduces memory overhead with compact bitset representation
- Maintains backward compatibility with existing API

Signed-off-by: Syed Azeez <syedazeez337@gmail.com>
@syedazeez337 syedazeez337 force-pushed the feature/bitset-vocabulary-optimization branch from 8136266 to 02f7c10 Compare November 18, 2025 16:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants