1 change: 1 addition & 0 deletions .clang-format-ignore
@@ -9,4 +9,5 @@ be/src/util/sse2neon.h
be/src/util/mustache/mustache.h
be/src/util/mustache/mustache.cc
be/src/util/utf8_check.cpp
be/src/storage/index/inverted/analyzer/kuromoji/dict/darts.h
cloud/src/common/defer.h
3 changes: 3 additions & 0 deletions .gitignore
@@ -153,3 +153,6 @@ compile_commands.json
.github

.worktrees/

# generated kuromoji dictionary binaries
/be/dict/kuromoji/*.bin
1 change: 1 addition & 0 deletions .licenserc.yaml
@@ -80,6 +80,7 @@ header:
- "be/src/util/sse2neo.h"
- "be/src/util/sse2neon.h"
- "be/src/util/utf8_check.cpp"
- "be/src/storage/index/inverted/analyzer/kuromoji/dict/darts.h"
- "be/src/pch/*"
- "be/test/data"
- "be/test/expected_result"
3 changes: 3 additions & 0 deletions NOTICE.txt
@@ -73,6 +73,9 @@ This software includes third party software subject to the following copyrights:
- Netty Reactive Streams - https://github.com/playframework/netty-reactive-streams
- Jackson-core - https://github.com/FasterXML/jackson-core
- Jackson-dataformat-cbor - https://github.com/FasterXML/jackson-dataformats-binary
- Darts-clone (double-array trie) - Copyright 2008-2014 Susumu Yata - https://github.com/s-yata/darts-clone (BSD 2-clause; see dist/licenses/LICENSE-darts-clone.txt)
- mecab-ipadic (IPADIC) Japanese morphological dictionary - Copyright 2000-2003 Nara Institute of Science and Technology (NAIST) - licensed under NAIST-2003 (BSD-style); the kuromoji analyzer bundles the UTF-8 form from https://github.com/lindera/mecab-ipadic (content of mecab-ipadic-2.7.0-20070801). See dist/licenses/LICENSE-ipadic.txt.
- Apache Lucene - https://github.com/apache/lucene (Apache-2.0): the kuromoji Japanese analyzer under be/src/storage/index/inverted/analyzer/kuromoji is an independent C++ implementation modeled on Lucene's kuromoji analyzer (JapaneseTokenizer), including its search-mode compound-decomposition cost model.

The licenses for these third party components are included in LICENSE.txt

6 changes: 6 additions & 0 deletions be/CMakeLists.txt
@@ -318,6 +318,12 @@ install(DIRECTORY
${BASE_DIR}/dict/pinyin
DESTINATION ${OUTPUT_DIR}/dict)

# Japanese kuromoji dictionary
install(DIRECTORY
${BASE_DIR}/dict/kuromoji
DESTINATION ${OUTPUT_DIR}/dict
OPTIONAL)

Review comment (Member):

[Blocking] This install rule does not guarantee that the runtime Kuromoji dictionary is present. The generated dict/kuromoji/*.bin files are ignored/not committed, kuromoji_build_dict is EXCLUDE_FROM_ALL, and kuromoji_dict is only a manual target. A default package can therefore ship only dict/kuromoji/README.md, while the BE later loads ${inverted_index_dict_path}/kuromoji.

Please make package/install depend on dictionary generation and fail if system.bin, matrix.bin, chardef.bin, and unkdict.bin are missing. OPTIONAL should not hide a missing required analyzer artifact.
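
One way to make the missing-artifact case loud rather than silent is a presence check over the four files named above. The sketch below is only an illustration: `missing_kuromoji_dict_files` is a hypothetical helper that does not exist in this PR, and it assumes C++17 `std::filesystem` is available.

```cpp
#include <array>
#include <cassert>
#include <filesystem>
#include <string>
#include <vector>

// Hypothetical helper (not part of this PR): returns the names of the required
// Kuromoji dictionary artifacts that are absent from `dict_dir`.
inline std::vector<std::string> missing_kuromoji_dict_files(
        const std::filesystem::path& dict_dir) {
    static constexpr std::array<const char*, 4> kRequired = {
            "system.bin", "matrix.bin", "chardef.bin", "unkdict.bin"};
    std::vector<std::string> missing;
    for (const char* name : kRequired) {
        std::error_code ec; // non-throwing overload: any error counts as "missing"
        if (!std::filesystem::is_regular_file(dict_dir / name, ec)) {
            missing.emplace_back(name);
        }
    }
    return missing;
}
```

A caller could refuse to proceed whenever the returned list is non-empty and report the missing names. The same list of file names could equally back a packaging-time check.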


# Check if functions are supported in this platform. All flags will be generated
# in gensrc/build/common/env_config.h.
# You can check functions here that depend on the platform. Don't forget to add this
55 changes: 55 additions & 0 deletions be/dict/kuromoji/README.md
@@ -0,0 +1,55 @@
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->

# Kuromoji (Japanese) dictionary

This directory holds the compiled IPADIC dictionary consumed at runtime by the
`kuromoji` inverted-index analyzer (`KuromojiAnalyzer` → `KuromojiDictionary`):

- `system.bin` — surface→word Darts trie + word entries + feature blob
- `matrix.bin` — connection-cost matrix (1316×1316)
- `chardef.bin` — character-category map + per-category flags
- `unkdict.bin` — unknown-word entries per category

These `*.bin` files are **generated** (not committed; see `.gitignore`). The
runtime resolves them at `${inverted_index_dict_path}/kuromoji`
(default `${DORIS_HOME}/dict/kuromoji`); `be/CMakeLists.txt` installs this
directory into the BE package.

## How it's (re)generated

Source: the UTF-8 IPADIC from <https://github.com/lindera/mecab-ipadic>
(tag `2.7.0-20250920`) — the original `mecab-ipadic-2.7.0-20070801` lexicon
converted to UTF-8 (license: NAIST-2003, see `dist/licenses/LICENSE-ipadic.txt`).

Automated, two steps:

```bash
# 1. thirdparty fetches + stages the UTF-8 IPADIC source into
# ${DORIS_THIRDPARTY}/installed/share/mecab-ipadic-2.7.0-20250920
sh thirdparty/build-thirdparty.sh mecab_ipadic

# 2. the CMake target builds the offline compiler and produces the *.bin here
ninja -C be/ut_build_RELEASE kuromoji_dict
```

CI/release should run `ninja kuromoji_dict` before packaging; the BE `install`
rule then ships this directory. Override the source dir with
`-DKUROMOJI_IPADIC_SRC=<path>` at CMake configure time. (The tool can also be
run directly: `kuromoji_build_dict <utf8_ipadic_src_dir> be/dict/kuromoji`.)
2 changes: 2 additions & 0 deletions be/src/common/config.cpp
@@ -1287,6 +1287,8 @@ DEFINE_mDouble(inverted_index_ram_buffer_size, "512");
DEFINE_mInt32(inverted_index_max_buffered_docs, "-1");
// dict path for chinese analyzer
DEFINE_String(inverted_index_dict_path, "${DORIS_HOME}/dict");
// The kuromoji (Japanese) analyzer
DEFINE_mBool(enable_kuromoji_analyzer, "false");
DEFINE_Int32(inverted_index_read_buffer_size, "4096");
// tree depth for bkd index
DEFINE_Int32(max_depth_in_bkd_tree, "32");
2 changes: 2 additions & 0 deletions be/src/common/config.h
@@ -1329,6 +1329,8 @@ DECLARE_mDouble(inverted_index_ram_buffer_size);
DECLARE_mInt32(inverted_index_max_buffered_docs);
// dict path for chinese analyzer
DECLARE_String(inverted_index_dict_path);
// The kuromoji (Japanese) analyzer
DECLARE_mBool(enable_kuromoji_analyzer);
DECLARE_Int32(inverted_index_read_buffer_size);
// tree depth for bkd index
DECLARE_Int32(max_depth_in_bkd_tree);
15 changes: 14 additions & 1 deletion be/src/storage/index/inverted/analyzer/analyzer.cpp
@@ -39,6 +39,7 @@
#include "storage/index/inverted/analyzer/basic/basic_analyzer.h"
#include "storage/index/inverted/analyzer/icu/icu_analyzer.h"
#include "storage/index/inverted/analyzer/ik/IKAnalyzer.h"
#include "storage/index/inverted/analyzer/kuromoji/KuromojiAnalyzer.h"
#include "storage/index/inverted/char_filter/char_replace_char_filter_factory.h"

namespace doris::segment_v2::inverted_index {
@@ -69,7 +70,8 @@ bool InvertedIndexAnalyzer::is_builtin_analyzer(const std::string& analyzer_name
analyzer_name == INVERTED_INDEX_PARSER_CHINESE ||
analyzer_name == INVERTED_INDEX_PARSER_ICU ||
analyzer_name == INVERTED_INDEX_PARSER_BASIC ||
analyzer_name == INVERTED_INDEX_PARSER_IK;
analyzer_name == INVERTED_INDEX_PARSER_IK ||
analyzer_name == INVERTED_INDEX_PARSER_KUROMOJI;
}

AnalyzerPtr InvertedIndexAnalyzer::create_builtin_analyzer(InvertedIndexParserType parser_type,
@@ -107,6 +109,17 @@ AnalyzerPtr InvertedIndexAnalyzer::create_builtin_analyzer(InvertedIndexParserTy
ik_analyzer->setMode(false);
}
analyzer = std::move(ik_analyzer);
} else if (parser_type == InvertedIndexParserType::PARSER_KUROMOJI) {
if (!config::enable_kuromoji_analyzer) {
throw Exception(ErrorCode::INVERTED_INDEX_ANALYZER_ERROR,
"kuromoji analyzer is disabled by default. Set "
"enable_kuromoji_analyzer=true in "
"be.conf (or via the BE config HTTP API) to enable it.");
}
auto kuromoji_analyzer = std::make_shared<KuromojiAnalyzer>();
kuromoji_analyzer->initDict(config::inverted_index_dict_path + "/kuromoji");
kuromoji_analyzer->setMode(kuromoji_mode_from_string(parser_mode));
analyzer = std::move(kuromoji_analyzer);
} else {
// default
analyzer = std::make_shared<lucene::analysis::SimpleAnalyzer<char>>();
73 changes: 73 additions & 0 deletions be/src/storage/index/inverted/analyzer/kuromoji/KuromojiAnalyzer.h
@@ -0,0 +1,73 @@
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.

#pragma once

#include <memory>
#include <string>

#include "common/logging.h"
#include "storage/index/inverted/analyzer/kuromoji/KuromojiTokenizer.h"
#include "storage/index/inverted/analyzer/kuromoji/dict/kuromoji_dictionary.h"

namespace doris::segment_v2 {

class KuromojiAnalyzer : public Analyzer {
public:
KuromojiAnalyzer() {
_lowercase = true;
_ownReader = false;
}
~KuromojiAnalyzer() override = default;

bool isSDocOpt() override { return true; }

// Loads (once, process-wide) the IPADIC dictionary from `dictPath`. If it is
// unavailable the tokenizer degrades to a per-codepoint split (logged), rather
// than failing index/query.
void initDict(const std::string& dictPath) override {

Review comment (Member):

[Blocking] Missing or corrupt dictionary should not silently fall back for the production kuromoji parser. If indexing runs with dict_ == nullptr, segments are written with per-codepoint tokens; after the dictionary is installed/reloaded, query analyzers can produce real Kuromoji tokens, so old and new segments have different tokenization semantics.

Please fail analyzer creation/index/query with a clear error when the Kuromoji dictionary cannot be loaded. If fallback tokenization is desired, expose it as an explicit parser/mode persisted in index metadata.
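
A minimal sketch of the fail-fast behavior requested here, written against stand-in types so it compiles on its own. `FakeKuromojiDictionary`, `fake_get_or_load`, and `load_dict_or_throw` are all hypothetical names. In the real code the check would presumably sit in `initDict` and throw the same `Exception(ErrorCode::INVERTED_INDEX_ANALYZER_ERROR, ...)` that `analyzer.cpp` already uses for the disabled-analyzer case; `std::runtime_error` is used below only to keep the example self-contained.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Stand-in for inverted_index::kuromoji::KuromojiDictionary (illustration only).
struct FakeKuromojiDictionary {};

// Stand-in for KuromojiDictionary::get_or_load: here an empty path simulates
// a missing or corrupt dictionary and yields nullptr.
inline const FakeKuromojiDictionary* fake_get_or_load(const std::string& dict_path) {
    static const FakeKuromojiDictionary dict;
    return dict_path.empty() ? nullptr : &dict;
}

// Fail fast: never hand a null dictionary back to the caller, so the
// tokenizer cannot quietly switch to per-codepoint splitting.
inline const FakeKuromojiDictionary* load_dict_or_throw(const std::string& dict_path) {
    const FakeKuromojiDictionary* dict = fake_get_or_load(dict_path);
    if (dict == nullptr) {
        throw std::runtime_error("kuromoji: dictionary unavailable at '" + dict_path + "'");
    }
    return dict;
}
```

With this shape, a load failure surfaces at analyzer creation, before any segment is written with fallback tokens.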

dict_ = inverted_index::kuromoji::KuromojiDictionary::get_or_load(dictPath);
if (dict_ == nullptr) {
LOG(WARNING) << "kuromoji: dictionary unavailable at " << dictPath
<< "; falling back to per-codepoint tokenization";
}
}

void setMode(KuromojiMode mode) { mode_ = mode; }

TokenStream* tokenStream(const TCHAR* fieldName, lucene::util::Reader* reader) override {
auto* tokenizer = _CLNEW KuromojiTokenizer(mode_, _lowercase, _ownReader, dict_);
tokenizer->reset(reader);
return (TokenStream*)tokenizer;
}

TokenStream* reusableTokenStream(const TCHAR* fieldName,
lucene::util::Reader* reader) override {
if (tokenizer_ == nullptr) {
tokenizer_ = std::make_unique<KuromojiTokenizer>(mode_, _lowercase, _ownReader, dict_);
}
tokenizer_->reset(reader);
return (TokenStream*)tokenizer_.get();
}

private:
const inverted_index::kuromoji::KuromojiDictionary* dict_ {nullptr};
KuromojiMode mode_ {KuromojiMode::Search};
std::unique_ptr<KuromojiTokenizer> tokenizer_;
};

} // namespace doris::segment_v2
41 changes: 41 additions & 0 deletions be/src/storage/index/inverted/analyzer/kuromoji/KuromojiMode.h
@@ -0,0 +1,41 @@
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.

#pragma once

#include <string>

namespace doris::segment_v2 {

// Segmentation mode, mirroring Lucene's JapaneseTokenizer.Mode. Normal returns
// the minimum-cost segmentation. Search additionally decomposes long compounds
// into their shorter parts (via a length-based cost penalty) for better search
// recall. Extended applies the Search penalty and also splits unknown
// (out-of-vocabulary) words into per-character unigrams.
enum class KuromojiMode { Normal, Search, Extended };

inline KuromojiMode kuromoji_mode_from_string(const std::string& mode) {
if (mode == "normal") {
return KuromojiMode::Normal;
}
if (mode == "extended") {
return KuromojiMode::Extended;
}
return KuromojiMode::Search; // default (matches OpenSearch/Lucene)

Review comment (Member):

[Major] This silently maps any unknown Kuromoji mode to Search. That makes TOKENIZE(..., '"parser"="kuromoji","parser_mode"="bogus"') accepted and executed as search, while index DDL rejects the same value.

Please make mode parsing return an error/status for unknown values, and apply the same Kuromoji property validation to TOKENIZE as DDL/index creation.
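
As a rough sketch of strict mode parsing, the variant below returns `std::nullopt` for unrecognized values instead of defaulting to Search. `parse_kuromoji_mode` is a hypothetical name, the enum is copied here only so the snippet is self-contained, and two choices are assumptions rather than anything this PR specifies: that an empty string means "unset" and keeps the Search default, and that `"search"` is the explicit spelling of that mode.

```cpp
#include <cassert>
#include <optional>
#include <string>

// Copy of the PR's enum so the example is self-contained.
enum class KuromojiMode { Normal, Search, Extended };

// Hypothetical strict parser: unknown values yield std::nullopt so the caller
// can reject them instead of silently running in Search mode.
inline std::optional<KuromojiMode> parse_kuromoji_mode(const std::string& mode) {
    // Assumption: an empty string means the property was not set.
    if (mode.empty() || mode == "search") {
        return KuromojiMode::Search;
    }
    if (mode == "normal") {
        return KuromojiMode::Normal;
    }
    if (mode == "extended") {
        return KuromojiMode::Extended;
    }
    return std::nullopt;
}
```

Both the DDL path and `TOKENIZE` could then call the same function and turn an empty result into a user-facing error, which would keep the two entry points consistent.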

}

} // namespace doris::segment_v2