xu-song committed
Commit 1706767
1 Parent(s): ec34f57
Files changed (1)
  1. compression_app.py +6 -7
compression_app.py CHANGED
@@ -36,17 +36,16 @@ The encoding and decoding process can be formulated as
 ```
 
 - **Lossless** <br>
-Lossless tokenization preserves the exact original text, i.e. `decoded_text = input_text`.
+Lossless tokenization preserves the exact original text, i.e. `decoded_text = input_text`. There are mainly two causes of compression loss.
 
-- Most lossy tokenizers get many out-of-vocabulary(OOV) words. 👉 Check the
-OOV of [bert](https://huggingface.co/spaces/eson/tokenizer-arena/blob/main/stats/compression_rate/google-bert.bert-base-cased%20%40%20cc100.zh-Hans.diff.json) and
+1. `OOV`: Most lossy tokenizers get many out-of-vocabulary(OOV) words. 👉 Check the OOV and
+tokenization loss of [bert](https://huggingface.co/spaces/eson/tokenizer-arena/blob/main/stats/compression_rate/google-bert.bert-base-cased%20%40%20cc100.zh-Hans.diff.json) and
 [t5](https://huggingface.co/spaces/eson/tokenizer-arena/blob/main/stats/compression_rate/google-t5.t5-large%20%40%20cc100.es.diff.json).
-- Even if a tokenizer has no OOV, it can be lossy due to text normalization. For example, qwen performs [unicode normalization](https://github.com/huggingface/transformers/blob/v4.42.3/src/transformers/models/qwen2/tokenization_qwen2.py#L338) in encoding process,
+2. `Normalization`: Even if a tokenizer has no OOV, it can be lossy due to text normalization. For example, qwen performs [unicode normalization](https://github.com/huggingface/transformers/blob/v4.42.3/src/transformers/models/qwen2/tokenization_qwen2.py#L338) in encoding process,
 llama performs [clean_up_tokenization_spaces](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B/blob/main/tokenizer_config.json#L2053) in decoding process,
-which may bring some slight differences to the reconstructed text. 👉 Check the diff of
+which may bring some slight differences to the reconstructed text. 👉 Check the tokenization loss of
 [qwen](https://huggingface.co/spaces/eson/tokenizer-arena/raw/main/stats/compression_rate/Qwen.Qwen1.5-1.8B%20@%20cc100.ja.diff.json) and
 [llama](https://huggingface.co/spaces/eson/tokenizer-arena/raw/main/stats/compression_rate/meta-llama.Meta-Llama-3.1-405B%20@%20cc100.en.diff.json).
-
 
 
 
@@ -146,7 +145,7 @@ with gr.Blocks(theme=theme) as demo:
 # "- `g_bytes/b_tokens` measures how many gigabytes corpus per billion tokens.\n"
 # "- `t_bytes/t_tokens` measures how many terabytes corpus per trillion tokens.\n"
 " - `char/token` measures how many chars per token on the tokenized corpus.\n"
-" - `oov_ratio`: out-of-vocabulary ratio on the selected corpus, 👉 get [OOV charset](https://huggingface.co/spaces/eson/tokenizer-arena/raw/main/stats/compression_rate.json)\n\n"
+" - `oov_ratio`: out-of-vocabulary ratio on the selected corpus, 👉 check [OOV charset](https://huggingface.co/spaces/eson/tokenizer-arena/raw/main/stats/compression_rate.json)\n\n"
 "You can reproduce this procedure with [compression_util.py](https://huggingface.co/spaces/eson/tokenizer-arena/blob/main/compression_util.py)."
 )
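The two causes of compression loss named in the first hunk (OOV and normalization) can be sketched in a few lines of Python. The toy vocabulary and the `toy_encode`/`toy_decode` helpers below are illustrative assumptions, not code from this Space; only the NFC call mirrors what qwen-style normalization actually does.

```python
import unicodedata

# Cause 1: OOV. A toy character tokenizer with a tiny vocabulary maps
# unknown characters to [UNK]; decoding cannot recover the original text.
vocab = {"h", "e", "l", "o"}

def toy_encode(text):
    return [c if c in vocab else "[UNK]" for c in text]

def toy_decode(tokens):
    return "".join(t if t != "[UNK]" else "?" for t in tokens)

text = "hello world"
decoded = toy_decode(toy_encode(text))
print(decoded)  # every OOV char (space, w, r, d) is irrecoverably lost

# Cause 2: normalization. A tokenizer that applies Unicode NFC during
# encoding rewrites decomposed characters, so the reconstructed text
# differs byte-for-byte from the input even with zero OOV.
decomposed = "Cafe\u0301"  # 'e' followed by a combining acute accent
normalized = unicodedata.normalize("NFC", decomposed)
print(decomposed == normalized)  # False: the round trip is lossy
```

Both failures break the `decoded_text = input_text` invariant, which is exactly what the per-tokenizer `*.diff.json` files linked above record.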
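The `char/token` and `oov_ratio` metrics described in the second hunk's UI text could be computed roughly as below. These are hypothetical reimplementations for illustration; `compression_util.py` in the Space is the authoritative version, and its `oov_ratio` may be defined over characters rather than tokens.

```python
def char_per_token(text, tokens):
    # `char/token`: how many chars per token on the tokenized corpus.
    # Higher values mean better compression.
    return len(text) / len(tokens)

def oov_ratio(tokens, unk_token="[UNK]"):
    # `oov_ratio`: fraction of tokens mapped to the unknown token
    # (a token-level stand-in for the corpus-level OOV ratio).
    return tokens.count(unk_token) / len(tokens)

tokens = ["hel", "lo", " ", "[UNK]", "orld"]
print(char_per_token("hello world", tokens))  # 11 chars / 5 tokens = 2.2
print(oov_ratio(tokens))                      # 1 of 5 tokens is OOV = 0.2
```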