Unable to convert BioGpt slow tokenizer to fast: token out of vocabulary
I would like to build a fast tokenizer class based on `BioGptTokenizer` so that I can use the `offset_mapping` to know which words the tokens originate from. Unfortunately, the conversion fails.
System Info
I am trying to fine-tune the BioGpt model in my code. I would like to build a fast tokenizer class based on `BioGptTokenizer` so that I can use the `offset_mapping` to know which words the tokens originate from. Unfortunately, when creating a `BioGptTokenizerFast` from `PreTrainedTokenizerFast` via `convert_slow_tokenizer`, the following error occurs: `Error while initializing BPE: Token -@</w> out of vocabulary`.
Reproduction
I have copied the relevant code to a Colab notebook: https://colab.research.google.com/drive/1IMhiDz45GiarBLgXG9B2rA_u0ZOmmjJS?usp=sharing
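For reference, here is a minimal sketch of the failing step outside of Colab. It assumes the slow tokenizer's files are named `vocab.json` and `merges.txt` on the `microsoft/biogpt` checkpoint and that the merges parse as standard two-column BPE rules; building the `tokenizers` BPE model from them directly is where the error above is raised:

```python
from huggingface_hub import hf_hub_download
from tokenizers import Tokenizer
from tokenizers.models import BPE

# Download the slow tokenizer's files from the Hub
# (assumed checkpoint and file names; adjust if yours differ).
vocab_file = hf_hub_download("microsoft/biogpt", "vocab.json")
merges_file = hf_hub_download("microsoft/biogpt", "merges.txt")

# Constructing the BPE model from these files raises:
#   Exception: Error while initializing BPE: Token -@</w> out of vocabulary
bpe_tokenizer = Tokenizer(BPE.from_file(vocab_file, merges_file))
```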
Expected behavior
According to https://github.com/huggingface/transformers/issues/9290, this problem might be caused by tokens that are referenced in merges.txt but missing from vocab.json. Could you please check? Thank you very much! A diagnostic sketch for checking this follows below.
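To help narrow this down, here is a hedged diagnostic sketch (checkpoint and file names are assumptions, as above) that scans `merges.txt` for merge rules whose parts or merged result are absent from `vocab.json`, which is the condition behind the "out of vocabulary" error:

```python
import json

from huggingface_hub import hf_hub_download

# Assumed checkpoint and file names; adjust if your copy differs.
vocab_file = hf_hub_download("microsoft/biogpt", "vocab.json")
merges_file = hf_hub_download("microsoft/biogpt", "merges.txt")

with open(vocab_file, encoding="utf-8") as f:
    vocab = set(json.load(f))

with open(merges_file, encoding="utf-8") as f:
    for line_no, line in enumerate(f, start=1):
        if line.startswith("#version") or not line.strip():
            continue
        # Keep only the first two columns, mirroring what the slow
        # BioGptTokenizer does when it reads the merges file.
        left, right = line.split()[:2]
        # A BPE merge of (left, right) produces their concatenation;
        # all three tokens must exist in the vocabulary.
        for token in (left, right, left + right):
            if token not in vocab:
                print(f"merges.txt line {line_no}: token {token!r} not in vocab.json")
```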
Any update on this?