Nonspacing marks by themselves causing problems for the tokenizer
#2 opened by AngledLuffa
I ran into what I believe is a minor problem with the tokenizer for indic-bert. I was looking at the L3Cube NER dataset:
https://github.com/l3cube-pune/MarathiNLP
The train split of the NER dataset contains the following sentence (the third column is the sentence number):
या O 17197.0
मंत्राची O 17197.0
देवता O 17197.0
गणपती O 17197.0
ँ O 17197.0
हा O 17197.0
तो O 17197.0
मंत्र O 17197.0
The fifth "word" appears to be a nonspacing Candrabindu mark by itself. If I feed the words to the indic-bert tokenizer word by word, I would expect it to produce an unknown token (something like `<unk>`) or similar for an untokenizable word such as that. Instead, it produces nothing. Is that expected behavior I should compensate for, or is it something that can be fixed in the tokenizer?
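For reference, a minimal sketch of the word-by-word call I have in mind, assuming the HuggingFace `AutoTokenizer` and the `ai4bharat/indic-bert` checkpoint name (the exact loop is an illustration, not my full pipeline):

```python
from transformers import AutoTokenizer

# Sketch only: the checkpoint name "ai4bharat/indic-bert" is assumed.
tokenizer = AutoTokenizer.from_pretrained("ai4bharat/indic-bert")

words = ["या", "मंत्राची", "देवता", "गणपती", "ँ", "हा", "तो", "मंत्र"]
for word in words:
    pieces = tokenizer.tokenize(word)
    # The lone candrabindu comes back as an empty list rather than an unknown token.
    print(repr(word), "->", pieces)
```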
Thanks again!