Description
When we build a blingfire model based on the settings of Hugging Face's BertTokenizer, it produces the wrong output in certain cases.
For comparison, (HF) BertTokenizerFast and (TF) tf_text.FastBertTokenizer both give more than 99% correct answers when run on the same vocab.txt, but blingfire gives only 93%.
(vocab.txt has about 30,000 entries.)
In the example below, the actual vocab.txt contains ##ㅋ but not ㅋ:
--vocab.txt--
##ㅋ
Because ##ㅋ is a continuation piece, it can only attach to a preceding character. For 아ㅋ, all four tokenizers therefore agree, as shown below.
| Tokenizer Framework | text | ids | decode |
|---|---|---|---|
| (HF) BertTokenizer | 아ㅋ | [31998, 21, 29981, 31999] | [CLS] 아ㅋ [SEP] |
| (HF) BertTokenizerFast | 아ㅋ | [31998, 21, 29981, 31999] | [CLS] 아ㅋ [SEP] |
| (BF) bert_custom.bin | 아ㅋ | [31998, 21, 29981, 31999] | [CLS] 아ㅋ [SEP] |
| (TF) FastBertTokenizer | 아ㅋ | [31998, 21, 29981, 31999] | [CLS] 아ㅋ [SEP] |
On the other hand, when there is a space in the middle, as in 아 ㅋ, only blingfire produces a different result: it still emits ##ㅋ (id 29981), while the other three emit [UNK] (id 31997).
| Tokenizer Framework | text | ids | decode |
|---|---|---|---|
| (HF) BertTokenizer | 아 ㅋ | [31998, 21, 31997, 31999] | [CLS] 아 [UNK] [SEP] |
| (HF) BertTokenizerFast | 아 ㅋ | [31998, 21, 31997, 31999] | [CLS] 아 [UNK] [SEP] |
| (BF) bert_custom.bin | 아 ㅋ | [31998, 21, 29981, 31999] | [CLS] 아ㅋ [SEP] |
| (TF) FastBertTokenizer | 아 ㅋ | [31998, 21, 31997, 31999] | [CLS] 아 [UNK] [SEP] |
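To make the expected behavior in the two tables concrete, here is a minimal sketch of greedy longest-match WordPiece applied after whitespace splitting. It is not the Hugging Face, TensorFlow Text, or blingfire implementation; the function names `wordpiece` and `tokenize` and the three-entry toy vocabulary are assumptions for illustration, and the sketch ignores normalization, lowercasing, and punctuation splitting.

```python
def wordpiece(word, vocab, unk="[UNK]", prefix="##"):
    """Greedy longest-match-first WordPiece for one whitespace-free word."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                # Only non-initial pieces may use the continuation prefix.
                candidate = prefix + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits at this position: the whole word becomes [UNK].
            return [unk]
        pieces.append(match)
        start = end
    return pieces


def tokenize(text, vocab):
    """Split on whitespace, then apply WordPiece to each word independently."""
    tokens = []
    for word in text.split():
        tokens.extend(wordpiece(word, vocab))
    return tokens


# Toy vocabulary (assumption): ##ㅋ exists, the standalone ㅋ does not.
vocab = {"[UNK]", "아", "##ㅋ"}

print(tokenize("아ㅋ", vocab))   # ['아', '##ㅋ']
print(tokenize("아 ㅋ", vocab))  # ['아', '[UNK]']
```

Under this model, ㅋ after a space starts a new word, so the lookup is for ㅋ rather than ##ㅋ, which is why the other three tokenizers return [UNK]. The blingfire output looks as if the continuation piece were matched at the start of a word.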
The blingfire settings are shown below.
ldb.conf.small
[wbd]
max-depth 4
xword 2
seg 3
ignore 4
fsm 1
multi-map-mode triv-dump
multi-map 2
options.small
OUTPUT = bert_custom.bin
opt_build_wbd = --dict-root=. --full-unicode
opt_pack_wbd_fsa = --alg=triv --type=moore-dfa --remap-iws --use-iwia
opt_pack_wbd_mmap = --alg=triv --type=mmap
#opt_pack_charmap = --alg=fixed --type=mmap --imp-mmap
resources = \
$(tmpdir)/wbd.fsa.$(mode).dump \
$(tmpdir)/wbd.mmap.$(mode).dump \
wdb.lex.utf8
_include common/bert.common.def.txt
_define LetterFromVocab [\x0030-\x0039\x0041-\x005a\x0061-...]
< (ChineseChars)|(BertPunctuation) > --> WORD _call FnTokWord
< (AllLettersWithoutToLower|LetterFromVocab)+ > --> WORD _call FnTokWord
#
# BERT specific
#
< [\[] UNK [\]] > --> WORD _call FnTokWord
< [\[] CLS [\]] > --> WORD _call FnTokWord
< [\[] SEP [\]] > --> WORD _call FnTokWord
< [\[] MASK [\]] > --> WORD _call FnTokWord
_function FnTokWord
_include bert_custom/vocab.falex
_end
Other than that, we set up vocab.falex, wdb.target.txt, ldb.conf.i2w, and options.small exactly as the guide describes.
How can we tell which part is causing the problem?
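One way to narrow this down might be to list every vocabulary entry that exists only as a continuation piece (like ##ㅋ without ㅋ), since those are the inputs where a word-initial match on a ## piece would show up as a mismatch. The helper below is a rough sketch; `continuation_only` is a hypothetical name, and it assumes the vocabulary has already been read into a list with one token per line of vocab.txt.

```python
def continuation_only(vocab_tokens, prefix="##"):
    """Return the bare forms of tokens that appear only with the ## prefix."""
    vocab = set(vocab_tokens)
    bare_forms = []
    for token in vocab:
        if token.startswith(prefix) and len(token) > len(prefix):
            bare = token[len(prefix):]
            # Keep it only if the word-initial form is missing from the vocab.
            if bare not in vocab:
                bare_forms.append(bare)
    return sorted(bare_forms)


# Toy vocabulary (assumption): ㅋ is continuation-only, 아 has both forms.
toy_vocab = ["[UNK]", "[CLS]", "[SEP]", "아", "##아", "##ㅋ"]
print(continuation_only(toy_vocab))  # ['ㅋ']
```

Feeding each returned string to every tokenizer as a standalone word (for example, after a space) and comparing the ids should show whether the 7% of mismatches are concentrated in this group.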