Is it possible to decrease vocabulary size?

#50
by omers66

Is there a way to access the tokenizer's token distribution? I would like to decrease the vocabulary size to speed things up. Ideally I would keep the tokens that appear most often in my requests.
Thanks

You need to modify the embedding layer and the language modeling head as well, so that their number of rows matches the new vocabulary size.
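A minimal sketch of what that involves, assuming a Transformers causal LM whose input embeddings and LM head are plain `nn.Embedding` / `nn.Linear` modules; the checkpoint name and the `kept_ids` list below are placeholders, not something from this thread:

```python
# Minimal sketch: shrink the vocabulary by keeping only selected token ids.
# Assumes get_input_embeddings()/get_output_embeddings() return plain
# nn.Embedding / nn.Linear modules; the checkpoint and kept_ids are placeholders.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B", trust_remote_code=True)

kept_ids = sorted(range(1000))          # placeholder: the token ids you keep
idx = torch.tensor(kept_ids)

# Slice the input embedding matrix down to the kept rows.
old_emb = model.get_input_embeddings()
new_emb = torch.nn.Embedding(len(kept_ids), old_emb.embedding_dim)
new_emb.weight.data = old_emb.weight.data[idx].clone()
model.set_input_embeddings(new_emb)

# Slice the language modeling head the same way (one row per vocabulary entry).
old_head = model.get_output_embeddings()
new_head = torch.nn.Linear(old_head.in_features, len(kept_ids),
                           bias=old_head.bias is not None)
new_head.weight.data = old_head.weight.data[idx].clone()
if old_head.bias is not None:
    new_head.bias.data = old_head.bias.data[idx].clone()
model.set_output_embeddings(new_head)

model.config.vocab_size = len(kept_ids)
# The tokenizer also has to be rebuilt so it only emits ids in the new range.
```

Whether this works out of the box depends on how the specific checkpoint wires its modules, so treat it as a starting point rather than a drop-in script.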

Yes, but what about the histogram of tokens? I would like to remove the uncommon ones...
Thanks
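
The tokenizer itself does not store usage statistics, so the histogram has to be computed from your own data. A rough sketch, where the corpus file, the checkpoint name, and the 99.9% coverage threshold are all assumptions:

```python
# Rough sketch: count token-id frequencies over a sample of your own requests
# and keep the ids that cover most of the observed tokens. The corpus file,
# checkpoint name, and coverage threshold below are assumptions.
from collections import Counter
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B", trust_remote_code=True)

counts = Counter()
with open("sample_requests.txt", encoding="utf-8") as f:   # hypothetical corpus
    for line in f:
        counts.update(tokenizer.encode(line))

total = sum(counts.values())
kept_ids = set(tokenizer.all_special_ids)   # never drop special tokens
covered = 0
for tok_id, c in counts.most_common():
    if total and covered / total >= 0.999:  # stop once 99.9% is covered
        break
    kept_ids.add(tok_id)
    covered += c

print(f"{len(kept_ids)} ids cover {covered / max(total, 1):.1%} of the sample")
```

The resulting `kept_ids` set is what you would then use to slice the embedding matrix and LM head as in the earlier sketch.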
