Skip to content

Tokenize endpoint empty array crash fix #4208

Open
przepeck wants to merge 7 commits into
mainfrom
przepeck/tokenize_psirt
Open

Tokenize endpoint empty array crash fix #4208
przepeck wants to merge 7 commits into
mainfrom
przepeck/tokenize_psirt

Conversation

@przepeck
Copy link
Copy Markdown
Collaborator

🛠 Summary

CVS-186166
Adding checks for empty nested array, which was making OVMS crash

🧪 Checklist

  • Unit tests added.
  • The documentation updated.
  • Change follows security best practices.
    ``

@przepeck przepeck added this to the 2026.2_rc milestone May 14, 2026
@przepeck przepeck marked this pull request as ready for review May 14, 2026 07:48
Copilot AI review requested due to automatic review settings May 14, 2026 07:48
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes a crash in the tokenize request parser when the text field contains an empty nested array (e.g. {"text":[[]]}), by explicitly rejecting empty inner arrays and adding regression tests across tokenize endpoint test suites.

Changes:

  • Add an explicit guard in TokenizeParser::parseInput() to reject empty inner arrays before indexing array[0].
  • Add negative HTTP tests to ensure text with empty nested arrays returns an error (and does not crash) for embeddings, rerank, and LLM tokenize endpoints.
  • Remove a previously existing GTEST_SKIP() in the LLM tokenize “add_special_tokens” test for VLM models.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.

File Description
src/tokenize/tokenize_parser.cpp Adds empty-inner-array validation to prevent out-of-bounds access when inspecting nested arrays.
src/test/reranknode_test.cpp Adds HTTP regression tests covering empty nested arrays for the rerank tokenize endpoint.
src/test/llm/tokenize_endpoint_test.cpp Adds HTTP regression tests for empty nested arrays across all LLM tokenize test parameters; also removes a VLM-specific skip for add_special_tokens.
src/test/embeddingsnode_test.cpp Adds HTTP regression tests covering empty nested arrays for the embeddings tokenize endpoint.

Comment thread src/tokenize/tokenize_parser.cpp Outdated
Comment on lines 418 to 421
TEST_P(LLMTokenizeTests, tokenizeStringWithAddSpecialTokens) {
auto params = GetParam();
if (params.modelName == "vlm_cb_regular" || params.modelName == "vlm_legacy_regular") {
GTEST_SKIP() << "Skipping test for " << params.modelName;
}

std::string requestBody = R"(
Comment thread src/test/embeddingsnode_test.cpp Outdated
Comment on lines +790 to +798
TEST_F(EmbeddingsTokenizeHttpTest, tokenizeEmptyNestedArray) {
std::string requestBody = R"(
{
"model": "embeddings_ov",
"text": [[]]
}
)";
Status status = handler->dispatchToProcessor(endpointTokenize, requestBody, &response, comp, responseComponents, writer, multiPartParser);
ASSERT_EQ(status, ovms::StatusCode::MEDIAPIPE_EXECUTION_ERROR) << status.string();
Comment thread src/test/reranknode_test.cpp Outdated
Comment on lines +609 to +617
TEST_F(RerankTokenizeHttpTest, tokenizeEmptyNestedArray) {
std::string requestBody = R"(
{
"model": "rerank_ov",
"text": [[]]
}
)";
Status status = handler->dispatchToProcessor(endpointTokenize, requestBody, &response, comp, responseComponents, writer, multiPartParser);
ASSERT_EQ(status, ovms::StatusCode::MEDIAPIPE_EXECUTION_ERROR) << status.string();
Comment on lines +443 to +455
TEST_P(LLMTokenizeTests, tokenizeEmptyNestedArray) {
auto params = GetParam();

std::string requestBody = R"(
{
"model": ")" + params.modelName +
R"(",
"text": [[]]
}
)";
Status status = handler->dispatchToProcessor(endpointTokenize, requestBody, &response, comp, responseComponents, writer, multiPartParser);
ASSERT_EQ(status, ovms::StatusCode::MEDIAPIPE_EXECUTION_ERROR) << status.string();
}
TEST_P(LLMTokenizeTests, tokenizeStringWithAddSpecialTokens) {
auto params = GetParam();
if (params.modelName == "vlm_cb_regular" || params.modelName == "vlm_legacy_regular") {
GTEST_SKIP() << "Skipping test for " << params.modelName;
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it working for all paths, or only special token path? Do other tests have it unskipped?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants