[feature](inverted-index) Add Japanese (Kuromoji) morphological analyzer #64667
base: master
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -153,3 +153,6 @@ compile_commands.json | |
| .github | ||
|
|
||
| .worktrees/ | ||
|
|
||
| # generated kuromoji dictionary binaries | ||
| /be/dict/kuromoji/*.bin | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,55 @@ | ||
| <!-- | ||
| Licensed to the Apache Software Foundation (ASF) under one | ||
| or more contributor license agreements. See the NOTICE file | ||
| distributed with this work for additional information | ||
| regarding copyright ownership. The ASF licenses this file | ||
| to you under the Apache License, Version 2.0 (the | ||
| "License"); you may not use this file except in compliance | ||
| with the License. You may obtain a copy of the License at | ||
|
|
||
| http://www.apache.org/licenses/LICENSE-2.0 | ||
|
|
||
| Unless required by applicable law or agreed to in writing, | ||
| software distributed under the License is distributed on an | ||
| "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY | ||
| KIND, either express or implied. See the License for the | ||
| specific language governing permissions and limitations | ||
| under the License. | ||
| --> | ||
|
|
||
| # Kuromoji (Japanese) dictionary | ||
|
|
||
|
|
||
| This directory holds the compiled IPADIC dictionary consumed at runtime by the | ||
| `kuromoji` inverted-index analyzer (`KuromojiAnalyzer` → `KuromojiDictionary`): | ||
|
|
||
| - `system.bin` — surface→word Darts trie + word entries + feature blob | ||
| - `matrix.bin` — connection-cost matrix (1316×1316) | ||
| - `chardef.bin` — character-category map + per-category flags | ||
| - `unkdict.bin` — unknown-word entries per category | ||
|
|
||
| These `*.bin` files are **generated** (not committed; see `.gitignore`). The | ||
| runtime resolves them at `${inverted_index_dict_path}/kuromoji` | ||
| (default `${DORIS_HOME}/dict/kuromoji`); `be/CMakeLists.txt` installs this | ||
| directory into the BE package. | ||
|
|
||
| ## How it's (re)generated | ||
|
|
||
| Source: the UTF-8 IPADIC from <https://github.com/lindera/mecab-ipadic> | ||
| (tag `2.7.0-20250920`) — the original `mecab-ipadic-2.7.0-20070801` lexicon | ||
| converted to UTF-8 (license: NAIST-2003, see `dist/licenses/LICENSE-ipadic.txt`). | ||
|
|
||
| Automated, two steps: | ||
|
|
||
| ```bash | ||
| # 1. thirdparty fetches + stages the UTF-8 IPADIC source into | ||
| # ${DORIS_THIRDPARTY}/installed/share/mecab-ipadic-2.7.0-20250920 | ||
| sh thirdparty/build-thirdparty.sh mecab_ipadic | ||
|
|
||
| # 2. the CMake target builds the offline compiler and produces the *.bin here | ||
| ninja -C be/ut_build_RELEASE kuromoji_dict | ||
| ``` | ||
|
|
||
| CI/release should run `ninja kuromoji_dict` before packaging; the BE `install` | ||
| rule then ships this directory. Override the source dir with | ||
| `-DKUROMOJI_IPADIC_SRC=<path>` at CMake configure time. (The tool can also be | ||
| run directly: `kuromoji_build_dict <utf8_ipadic_src_dir> be/dict/kuromoji`.) | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,73 @@ | ||
| // Licensed to the Apache Software Foundation (ASF) under one | ||
| // or more contributor license agreements. See the NOTICE file | ||
| // distributed with this work for additional information | ||
| // regarding copyright ownership. The ASF licenses this file | ||
| // to you under the Apache License, Version 2.0 (the | ||
| // "License"); you may not use this file except in compliance | ||
| // with the License. You may obtain a copy of the License at | ||
| // | ||
| // http://www.apache.org/licenses/LICENSE-2.0 | ||
| // | ||
| // Unless required by applicable law or agreed to in writing, | ||
| // software distributed under the License is distributed on an | ||
| // "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY | ||
| // KIND, either express or implied. See the License for the | ||
| // specific language governing permissions and limitations | ||
| // under the License. | ||
|
|
||
| #pragma once | ||
|
|
||
| #include <memory> | ||
| #include <string> | ||
|
|
||
| #include "common/logging.h" | ||
| #include "storage/index/inverted/analyzer/kuromoji/KuromojiTokenizer.h" | ||
| #include "storage/index/inverted/analyzer/kuromoji/dict/kuromoji_dictionary.h" | ||
|
|
||
| namespace doris::segment_v2 { | ||
|
|
||
| class KuromojiAnalyzer : public Analyzer { | ||
| public: | ||
| KuromojiAnalyzer() { | ||
| _lowercase = true; | ||
| _ownReader = false; | ||
| } | ||
| ~KuromojiAnalyzer() override = default; | ||
|
|
||
| bool isSDocOpt() override { return true; } | ||
|
|
||
| // Loads (once, process-wide) the IPADIC dictionary from `dictPath`. If it is | ||
| // unavailable the tokenizer degrades to a per-codepoint split (logged), rather | ||
| // than failing index/query. | ||
| void initDict(const std::string& dictPath) override { | ||
|
Review comment (Member):
[Blocking] Missing or corrupt dictionary should not silently fall back for the production … Please fail analyzer creation/index/query with a clear error when the Kuromoji dictionary cannot be loaded. If fallback tokenization is desired, expose it as an explicit parser/mode persisted in index metadata.
||
| dict_ = inverted_index::kuromoji::KuromojiDictionary::get_or_load(dictPath); | ||
| if (dict_ == nullptr) { | ||
| LOG(WARNING) << "kuromoji: dictionary unavailable at " << dictPath | ||
| << "; falling back to per-codepoint tokenization"; | ||
| } | ||
| } | ||
|
|
||
| void setMode(KuromojiMode mode) { mode_ = mode; } | ||
|
|
||
| TokenStream* tokenStream(const TCHAR* fieldName, lucene::util::Reader* reader) override { | ||
| auto* tokenizer = _CLNEW KuromojiTokenizer(mode_, _lowercase, _ownReader, dict_); | ||
| tokenizer->reset(reader); | ||
| return (TokenStream*)tokenizer; | ||
| } | ||
|
|
||
| TokenStream* reusableTokenStream(const TCHAR* fieldName, | ||
| lucene::util::Reader* reader) override { | ||
| if (tokenizer_ == nullptr) { | ||
| tokenizer_ = std::make_unique<KuromojiTokenizer>(mode_, _lowercase, _ownReader, dict_); | ||
| } | ||
| tokenizer_->reset(reader); | ||
| return (TokenStream*)tokenizer_.get(); | ||
| } | ||
|
|
||
| private: | ||
| const inverted_index::kuromoji::KuromojiDictionary* dict_ {nullptr}; | ||
| KuromojiMode mode_ {KuromojiMode::Search}; | ||
| std::unique_ptr<KuromojiTokenizer> tokenizer_; | ||
| }; | ||
|
|
||
| } // namespace doris::segment_v2 | ||
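The blocking review comment on `initDict` asks for a hard failure instead of a logged fallback. A minimal sketch of that shape follows, under stated assumptions: `FakeDictionary`, `fake_get_or_load`, `InitResult`, and `init_dict_strict` are hypothetical names invented for illustration, and the only thing borrowed from the diff is the contract that the loader returns `nullptr` when the dictionary is unavailable. The real code would presumably use the project's own status type rather than this struct.

```cpp
#include <string>

// Hypothetical stand-in for KuromojiDictionary; the real class is not shown here.
struct FakeDictionary {
    std::string path;
};

// Hypothetical loader that mirrors the nullptr-on-failure contract of
// get_or_load in the diff above. Only one path is treated as loadable.
inline const FakeDictionary* fake_get_or_load(const std::string& dict_path) {
    static const FakeDictionary dict {"/opt/dict/kuromoji"};
    return dict_path == dict.path ? &dict : nullptr;
}

// Result of initialization: a usable dictionary, or an error message.
struct InitResult {
    const FakeDictionary* dict;
    std::string error; // empty on success
    bool ok() const { return dict != nullptr; }
};

// Fail fast: report the problem to the caller instead of degrading to
// per-codepoint tokenization.
inline InitResult init_dict_strict(const std::string& dict_path) {
    const FakeDictionary* dict = fake_get_or_load(dict_path);
    if (dict == nullptr) {
        return {nullptr, "kuromoji: dictionary unavailable at " + dict_path};
    }
    return {dict, ""};
}
```

With this shape the caller decides whether to abort analyzer creation, which keeps index-time and query-time tokenization from diverging silently.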
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,41 @@ | ||
| // Licensed to the Apache Software Foundation (ASF) under one | ||
| // or more contributor license agreements. See the NOTICE file | ||
| // distributed with this work for additional information | ||
| // regarding copyright ownership. The ASF licenses this file | ||
| // to you under the Apache License, Version 2.0 (the | ||
| // "License"); you may not use this file except in compliance | ||
| // with the License. You may obtain a copy of the License at | ||
| // | ||
| // http://www.apache.org/licenses/LICENSE-2.0 | ||
| // | ||
| // Unless required by applicable law or agreed to in writing, | ||
| // software distributed under the License is distributed on an | ||
| // "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY | ||
| // KIND, either express or implied. See the License for the | ||
| // specific language governing permissions and limitations | ||
| // under the License. | ||
|
|
||
| #pragma once | ||
|
|
||
| #include <string> | ||
|
|
||
| namespace doris::segment_v2 { | ||
|
|
||
| // Segmentation mode, mirroring Lucene's JapaneseTokenizer.Mode. Normal returns | ||
| // the minimum-cost segmentation. Search additionally decomposes long compounds | ||
| // into their shorter parts (via a length-based cost penalty) for better search | ||
| // recall. Extended applies the Search penalty and also splits unknown | ||
| // (out-of-vocabulary) words into per-character unigrams. | ||
| enum class KuromojiMode { Normal, Search, Extended }; | ||
|
|
||
| inline KuromojiMode kuromoji_mode_from_string(const std::string& mode) { | ||
| if (mode == "normal") { | ||
| return KuromojiMode::Normal; | ||
| } | ||
| if (mode == "extended") { | ||
| return KuromojiMode::Extended; | ||
| } | ||
| return KuromojiMode::Search; // default (matches OpenSearch/Lucene) | ||
|
Review comment (Member):
[Major] This silently maps any unknown Kuromoji mode to … Please make mode parsing return an error/status for unknown values, and apply the same Kuromoji property validation to …
||
| } | ||
|
|
||
| } // namespace doris::segment_v2 | ||
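The [Major] review comment asks for mode parsing that rejects unknown values. One possible shape is sketched below, assuming `std::optional` as the error channel. The function name `kuromoji_mode_from_string_strict` is hypothetical, and treating an empty string as "use the default" is also an assumption, not something the PR states. The enum is repeated only so the sketch compiles on its own.

```cpp
#include <optional>
#include <string>

// Local copy of the enum from the diff above so the sketch is self-contained.
enum class KuromojiMode { Normal, Search, Extended };

// Hypothetical strict variant of kuromoji_mode_from_string.
// - empty string: the property was not set, so use the default (Search)
// - "normal" / "search" / "extended": the matching mode
// - anything else: std::nullopt, so the caller can raise an error
inline std::optional<KuromojiMode> kuromoji_mode_from_string_strict(const std::string& mode) {
    if (mode.empty() || mode == "search") {
        return KuromojiMode::Search;
    }
    if (mode == "normal") {
        return KuromojiMode::Normal;
    }
    if (mode == "extended") {
        return KuromojiMode::Extended;
    }
    return std::nullopt; // unknown value, e.g. a typo such as "serch"
}
```

A caller would turn `std::nullopt` into whatever error type property validation already uses, so a typo in the mode surfaces when the index is defined rather than being silently mapped to the default.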
Review comment:
[Blocking] This install rule does not guarantee that the runtime Kuromoji dictionary is present. The generated `dict/kuromoji/*.bin` files are ignored/not committed, `kuromoji_build_dict` is `EXCLUDE_FROM_ALL`, and `kuromoji_dict` is only a manual target. A default package can therefore ship only `dict/kuromoji/README.md`, while the BE later loads `${inverted_index_dict_path}/kuromoji`.

Please make package/install depend on dictionary generation and fail if `system.bin`, `matrix.bin`, `chardef.bin`, and `unkdict.bin` are missing. `OPTIONAL` should not hide a missing required analyzer artifact.