Lack of digit splitting in the slow version of the tokenizer

#11
by Forence - opened

Like the issue reported in https://huggingface.co/stabilityai/stablelm-2-12b/discussions/1#66151ad2d8148d0b668985f0, the slow GPT2Tokenizer doesn't apply the digit-splitting pre-tokenization rule defined in tokenizer.json.
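
A minimal sketch of how to reproduce the discrepancy, assuming the affected checkpoint is loaded via AutoTokenizer (the repo id below is a placeholder for illustration; substitute the actual model):

```python
from transformers import AutoTokenizer

# Placeholder repo id; replace with the model this discussion concerns.
repo = "stabilityai/stablelm-2-12b"

fast = AutoTokenizer.from_pretrained(repo, use_fast=True)   # reads tokenizer.json, applies its pre-tokenizer
slow = AutoTokenizer.from_pretrained(repo, use_fast=False)  # GPT2Tokenizer-based slow path

text = "The year 2024 has 366 days."

# Expected (fast): digit runs are split into individual digits per the pre-tokenizer rule.
print(fast.tokenize(text))

# Observed (slow): digit runs are kept together, so the token ids differ from the fast tokenizer.
print(slow.tokenize(text))
```

If the two outputs diverge on the numeric spans, the slow tokenizer is ignoring the split rule and producing different token sequences than the model was trained on, which would explain the quality drop.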

It would be great if this bug could be fixed: I hit a severe performance drop because of it, and it was quite hard to debug. Thanks!
