[feature](inverted-index) Add Japanese (Kuromoji) morphological analyzer#64667
[feature](inverted-index) Add Japanese (Kuromoji) morphological analyzer#64667nishant94 wants to merge 10 commits into
Conversation
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
|
run buildall |
|
@nishant94 have you tried icu analyzer? because I think icu could handle many different languages. |
@yiguolei The ICU Analyzer is not good as the Kuromoji. There is huge difference between icu and kuromoji when it comes to morphology of the Japanese words. So I think it worth it adding this new parser. |
|
Is the code under |
This is original code but it is modeled on Apache Lucene's kuromoji. |
FE UT Coverage ReportIncrement line coverage |
389fcfb to
b79db3c
Compare
|
run buildall |
BE UT Coverage ReportIncrement line coverage Increment coverage report
|
BE UT Coverage ReportIncrement line coverage Increment coverage report
|
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
FE Regression Coverage ReportIncrement line coverage |
|
run buildall |
FE UT Coverage ReportIncrement line coverage |
FE Regression Coverage ReportIncrement line coverage |
BE UT Coverage ReportIncrement line coverage Increment coverage report
|
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
db0ee69 to
06b4ef6
Compare
|
run buildall |
FE UT Coverage ReportIncrement line coverage |
BE UT Coverage ReportIncrement line coverage Increment coverage report
|
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
| // keys in the given set of keys. Note that the `msg' of <Exception> must be a | ||
| // constant or static string because an <Exception> keeps only a pointer to | ||
| // that string. | ||
| class Exception : public std::exception { |
There was a problem hiding this comment.
Doris also has an exception definition in common/exception.h, I think you should use that one.
There was a problem hiding this comment.
@yiguolei Acutally it's third-party unmodified vendored code, and that our builder already catches the darts error as std::exception and converts to Status, so no custom exception leaks into Doris 🤔
I think let's keep it it as it is?
There was a problem hiding this comment.
No, I think you should modify it and using Doris's exception. "our builder already catches the darts error as std::exception and converts to Status" we add this convertor because clucene is a seperate lib for Doris BE. We have a plan to move it to doris's code and modify it. In this case we will unify the error status and exception logic to doris's. And also in doris's exception and status, we have some logic for example the call stack.
Add a built-in `kuromoji` inverted-index parser that segments Japanese text into morphemes, mirroring the existing Chinese IK analyzer.
- Added `darts.h` to `.clang-format-ignore` and `.licenserc.yaml`. - Improved code formatting in various Kuromoji source files for better readability. - Updated tests files to include necessary headers.
…mposition - Added support for search mode in the Kuromoji Viterbi segmenter, applying penalties for long all-kanji and other tokens to enhance search recall. - Updated the KuromojiMode enumeration to reflect the new search and extended modes. - Modified the KuromojiTokenizer to utilize the new mode functionality. - Added unit tests to validate the behavior of the search mode, ensuring correct segmentation of compounds. - Updated NOTICE.txt to include Apache Lucene as a dependency for the kuromoji analyzer.
…wn words - Implemented functionality in the Kuromoji Viterbi segmenter to decompose unknown (out-of-vocabulary) words into per-character unigrams when in extended mode, aligning with Lucene's JapaneseTokenizer behavior. - Added unit tests to validate the correct segmentation of unknown words in both normal and extended modes, ensuring expected outputs for various input scenarios.
- Modified error messages to include 'kuromoji' parser in the parser mode validation. - Enhanced tests for the Japanese analyzer to assert expected tokenization results.
- Introduced a new configuration option `enable_kuromoji_analyzer` to toggle the Kuromoji analyzer functionality. - Updated unit tests to validate the behavior of the Kuromoji analyzer when enabled and disabled. - Modified tests to enable the Kuromoji analyzer for specific test cases.
- Updated the namespace for Kuromoji components from `doris::segment_v2::kuromoji` to `doris::segment_v2::inverted_index::kuromoji` across multiple files for better organization and clarity.
6216d3f to
f3d5f92
Compare
|
run buildall |
FE UT Coverage ReportIncrement line coverage |
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
FE Regression Coverage ReportIncrement line coverage |
|
Since this is modeled after Lucene Kuromoji, have you checked how the Doris implementation performs in practice? A small benchmark for indexing throughput would be helpful. |
There was a problem hiding this comment.
Requesting changes. The main issue is that parser=kuromoji is exposed as a production analyzer, but the default package path does not guarantee the runtime dictionary exists, and the runtime silently falls back to per-codepoint tokenization. That can build/query inverted indexes with semantics different from the documented Kuromoji morphology. I also found dictionary loader validation gaps and FE/BE/TOKENIZE default/validation inconsistencies.
| install(DIRECTORY | ||
| ${BASE_DIR}/dict/kuromoji | ||
| DESTINATION ${OUTPUT_DIR}/dict | ||
| OPTIONAL) |
There was a problem hiding this comment.
[Blocking] This install rule does not guarantee that the runtime Kuromoji dictionary is present. The generated dict/kuromoji/*.bin files are ignored/not committed, kuromoji_build_dict is EXCLUDE_FROM_ALL, and kuromoji_dict is only a manual target. A default package can therefore ship only dict/kuromoji/README.md, while the BE later loads ${inverted_index_dict_path}/kuromoji.
Please make package/install depend on dictionary generation and fail if system.bin, matrix.bin, chardef.bin, and unkdict.bin are missing. OPTIONAL should not hide a missing required analyzer artifact.
| // Loads (once, process-wide) the IPADIC dictionary from `dictPath`. If it is | ||
| // unavailable the tokenizer degrades to a per-codepoint split (logged), rather | ||
| // than failing index/query. | ||
| void initDict(const std::string& dictPath) override { |
There was a problem hiding this comment.
[Blocking] Missing or corrupt dictionary should not silently fall back for the production kuromoji parser. If indexing runs with dict_ == nullptr, segments are written with per-codepoint tokens; after the dictionary is installed/reloaded, query analyzers can produce real Kuromoji tokens, so old and new segments have different tokenization semantics.
Please fail analyzer creation/index/query with a clear error when the Kuromoji dictionary cannot be loaded. If fallback tokenization is desired, expose it as an explicit parser/mode persisted in index metadata.
| const uint8_t* p = _system_map.data(); | ||
| RETURN_IF_ERROR(check_header(p, _system_map.size(), KMJ_KIND_SYSTEM)); | ||
| KmjSystemHeader s {}; | ||
| std::memcpy(&s, p + sizeof(KmjFileHeader), sizeof(s)); |
There was a problem hiding this comment.
[Blocking] check_header() only validates the common header, but the loader then trusts all sub-header offsets/counts and installs mmap pointers from them. A header-valid but truncated/corrupt *.bin can make these memcpy/pointer assignments and later accessors read past the mmap.
Please validate each sub-header before exposing pointers: sizeof(header + subheader), every offset + count * sizeof(T) range, integer overflow, trie byte alignment, class_count == CAT_CLASS_COUNT, matrix dimensions/cell count, run ranges into entry arrays, feature offsets, and word left/right IDs against matrix bounds. Return Status::Corruption for invalid artifacts.
| return INVERTED_INDEX_PARSER_SMART; | ||
| } | ||
| if (parser_it->second == INVERTED_INDEX_PARSER_KUROMOJI) { | ||
| return INVERTED_INDEX_PARSER_KUROMOJI_SEARCH; |
There was a problem hiding this comment.
[Major] BE now defaults omitted parser_mode for parser=kuromoji to search, but FE InvertedIndexProperties.getInvertedIndexParserMode() still only special-cases IK and otherwise returns coarse_grained. Match predicate thrift serialization uses the FE helper, so FE and BE disagree on the effective default.
Please update the FE default helper to return search for Kuromoji and add FE coverage for a Kuromoji index/query without an explicit parser_mode.
| if (mode == "extended") { | ||
| return KuromojiMode::Extended; | ||
| } | ||
| return KuromojiMode::Search; // default (matches OpenSearch/Lucene) |
There was a problem hiding this comment.
[Major] This silently maps any unknown Kuromoji mode to Search. That makes TOKENIZE(..., '"parser"="kuromoji","parser_mode"="bogus"') accepted and executed as search, while index DDL rejects the same value.
Please make mode parsing return an error/status for unknown values, and apply the same Kuromoji property validation to TOKENIZE as DDL/index creation.
| sql """ INSERT INTO ${tableName} VALUES (3, "Apache Doris は高速です"); """ | ||
| sql "sync" | ||
|
|
||
| // The kuromoji dictionary is not shipped in the p0 package, so the |
There was a problem hiding this comment.
[Blocking] This regression is proving the fallback path, not the feature added by this PR. Because the p0 package does not ship the dictionary, CI can pass while the real Kuromoji/IPADIC morphology path is never exercised.
Please add a required CI/regression path that generates/ships the dictionary and asserts real morphology behavior, e.g. search-mode decompound (東京都 matching 東京), base-form/POS behavior, and extended-mode unknown-word splitting. Keep fallback coverage separate if fallback remains explicit.
What problem does this PR solve?
Issue Number: #64646
Related PR: None
Problem Summary:
Doris has no Japanese-aware tokenizer for the inverted index. Japanese text has no spaces between words, so the existing parsers can't segment it and
MATCH/MATCH_PHRASEon Japanese columns end up with poor recall and precision.This PR adds a built-in
kuromojiparser for Japanese, in the same style as the existing Chinese IK analyzer. It's opt-in per column:After indexing, MATCH, MATCH_PHRASE and TOKENIZE() run against the segmented Japanese terms.
How it works:
be/src/storage/index/inverted/analyzer/kuromoji/, so there's no JVM on the indexing path. KuromojiAnalyzer / KuromojiTokenizer mirror the IK analyzer/tokenizer, with a Viterbi cost-model segmenter over the IPADIC connection-cost matrix.Dictionary source is mecab-ipadic-2.7.0-20070801 (NAIST-2003 license, the same lexicon Lucene kuromoji uses).
Release note
Support Japanese text tokenization in the inverted index via a new kuromoji parser (
PROPERTIES("parser"="kuromoji")), withsearch/normal/extendedmodes.Check List (For Author)
parser="kuromoji".Check List (For Reviewer who merge this PR)