In some cases, blingfire models created with the new vocab.txt produce different results. #181

@springkim

When I build a blingfire model from the same settings as Hugging Face's BertTokenizer, it produces the wrong output in certain cases.

Run on the same vocab.txt, (HF) BertTokenizerFast and (TF) tf_text.FastBertTokenizer both give more than 99% correct answers, but blingfire gives only 93%.

(vocab.txt has about 30,000 entries.)
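A comparison like this can be scripted so that the disagreeing inputs are collected along with the percentage. The following is only a minimal sketch: `compare` is a hypothetical helper, and the two "tokenizers" are toy lookup tables built from the ids reported in this issue, standing in for real encoders.

```python
def compare(reference, candidate, texts):
    """Return (agreement rate, mismatches) for two text -> ids callables.

    Each mismatch is a (text, reference_ids, candidate_ids) tuple.
    """
    mismatches = []
    for text in texts:
        ref_ids = list(reference(text))
        cand_ids = list(candidate(text))
        if ref_ids != cand_ids:
            mismatches.append((text, ref_ids, cand_ids))
    rate = 1.0 - len(mismatches) / len(texts) if texts else 1.0
    return rate, mismatches


# Toy stand-ins (assumption: real code would wrap the actual tokenizers here).
ref_table = {
    "아ㅋ": [31998, 21, 29981, 31999],
    "아 ㅋ": [31998, 21, 31997, 31999],
}
cand_table = {
    "아ㅋ": [31998, 21, 29981, 31999],
    "아 ㅋ": [31998, 21, 29981, 31999],
}

rate, diffs = compare(ref_table.__getitem__, cand_table.__getitem__, list(ref_table))
print(rate)
for text, ref_ids, cand_ids in diffs:
    print(repr(text), ref_ids, cand_ids)
```

In practice `reference` and `candidate` would wrap the real encoders. Special tokens and padding have to be handled the same way on both sides before the id lists are compared, otherwise every input counts as a mismatch.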

In the example below, the actual vocab.txt has ##ㅋ but no ㅋ, as shown below.

--vocab.txt--
##ㅋ
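To see how many vocabulary entries are in the same situation, one option is to list every continuation piece whose bare form is missing. This is a minimal sketch that assumes vocab.txt holds one token per line; `continuation_only` is a hypothetical helper, not part of any of the libraries mentioned here, and the toy list stands in for the real file.

```python
def continuation_only(vocab_lines):
    """Return '##x' pieces whose bare form 'x' is absent from the vocabulary."""
    vocab = {line.rstrip("\n") for line in vocab_lines}
    vocab.discard("")
    return sorted(
        tok for tok in vocab
        if tok.startswith("##") and len(tok) > 2 and tok[2:] not in vocab
    )


# Toy vocabulary standing in for the real file.
toy_vocab = ["[UNK]", "[CLS]", "[SEP]", "아", "##ㅋ", "a", "##a"]
print(continuation_only(toy_vocab))
```

For the real file, the same function can be given an open file handle, for example `continuation_only(open("vocab.txt", encoding="utf-8"))`. Inputs that contain one of the listed pieces as a standalone word are candidates for the mismatch described in this issue.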

Since ##ㅋ is a continuation piece that must attach to the preceding character, all four tokenizers agree on 아ㅋ, as shown below.

| Tokenizer Framework | text | ids | decode |
| --- | --- | --- | --- |
| (HF) BertTokenizer | 아ㅋ | [31998, 21, 29981, 31999] | [CLS] 아ㅋ [SEP] |
| (HF) BertTokenizerFast | 아ㅋ | [31998, 21, 29981, 31999] | [CLS] 아ㅋ [SEP] |
| (BF) bert_custom.bin | 아ㅋ | [31998, 21, 29981, 31999] | [CLS] 아ㅋ [SEP] |
| (TF) FastBertTokenizer | 아ㅋ | [31998, 21, 29981, 31999] | [CLS] 아ㅋ [SEP] |

On the other hand, when there is a space in the middle, as in 아 ㅋ, only blingfire produces a different result.

| Tokenizer Framework | text | ids | decode |
| --- | --- | --- | --- |
| (HF) BertTokenizer | 아 ㅋ | [31998, 21, 31997, 31999] | [CLS] 아 [UNK] [SEP] |
| (HF) BertTokenizerFast | 아 ㅋ | [31998, 21, 31997, 31999] | [CLS] 아 [UNK] [SEP] |
| (BF) bert_custom.bin | 아 ㅋ | [31998, 21, 29981, 31999] | [CLS] 아ㅋ [SEP] |
| (TF) FastBertTokenizer | 아 ㅋ | [31998, 21, 31997, 31999] | [CLS] 아 [UNK] [SEP] |
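The behavior the other three tokenizers show can be reproduced with a small greedy longest-match WordPiece routine. This is a simplified sketch, not the code of any of these libraries: it splits on whitespace only and skips normalization and punctuation handling, and the toy vocabulary is an assumption that mirrors the relevant entries. The key point is that the `##` prefix is only tried for pieces that do not start a word.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first WordPiece for one whitespace-delimited word."""
    pieces, start = [], 0
    while start < len(word):
        end, match = len(word), None
        while start < end:
            cand = word[start:end]
            if start > 0:
                cand = "##" + cand  # continuation pieces are only tried mid-word
            if cand in vocab:
                match = cand
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word becomes [UNK]
        pieces.append(match)
        start = end
    return pieces


def tokenize(text, vocab):
    """Wrap the per-word pieces in [CLS] ... [SEP]."""
    out = ["[CLS]"]
    for word in text.split():
        out.extend(wordpiece(word, vocab))
    out.append("[SEP]")
    return out


toy_vocab = {"[UNK]", "[CLS]", "[SEP]", "아", "##ㅋ"}
print(tokenize("아ㅋ", toy_vocab))   # ['[CLS]', '아', '##ㅋ', '[SEP]']
print(tokenize("아 ㅋ", toy_vocab))  # ['[CLS]', '아', '[UNK]', '[SEP]']
```

With the space, ㅋ starts its own word, so only the bare form is looked up and it falls back to [UNK]. The blingfire output above looks as if ##ㅋ were matched at the start of a word.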

The blingfire settings are shown below.

ldb.conf.small

[wbd]
max-depth 4
xword 2
seg 3
ignore 4
fsm 1
multi-map-mode triv-dump
multi-map 2

options.small

OUTPUT = bert_custom.bin

opt_build_wbd = --dict-root=. --full-unicode

opt_pack_wbd_fsa = --alg=triv --type=moore-dfa --remap-iws --use-iwia
opt_pack_wbd_mmap = --alg=triv --type=mmap
#opt_pack_charmap = --alg=fixed --type=mmap --imp-mmap

resources = \
	$(tmpdir)/wbd.fsa.$(mode).dump \
	$(tmpdir)/wbd.mmap.$(mode).dump \

wbd.lex.utf8

_include common/bert.common.def.txt

_define LetterFromVocab [\x0030-\x0039\x0041-\x005a\x0061-...]

< (ChineseChars)|(BertPunctuation) > --> WORD _call FnTokWord
< (AllLettersWithoutToLower|LetterFromVocab)+ > --> WORD _call FnTokWord

#
# BERT specific
#

< [\[] UNK [\]] > --> WORD _call FnTokWord
< [\[] CLS [\]] > --> WORD _call FnTokWord
< [\[] SEP [\]] > --> WORD _call FnTokWord
< [\[] MASK [\]] > --> WORD _call FnTokWord

_function FnTokWord
_include bert_custom/vocab.falex
_end

Other than that, I set up vocab.falex, wbd.tagset.txt, ldb.conf.i2w, and options.small exactly as the guide describes.

How can I find out which part is causing the problem?
