---
inference: false
language: ja
license: apache-2.0
mask_token: "[MASK]"
widget:
- text: "LINE株式会社で[MASK]の研究・開発をしている。"
---

# LINE DistilBERT Japanese

This is a DistilBERT model pre-trained on 131 GB of Japanese web text. The teacher model is a BERT-base model built in-house at LINE. The model was trained by [LINE Corporation](https://linecorp.com/).

## For Japanese

A Japanese version of this document is available at https://github.com/line/LINE-DistilBERT-Japanese/blob/main/README_ja.md.

## How to use

```python
from transformers import AutoTokenizer, AutoModel

# trust_remote_code=True is required to load the custom Japanese tokenizer.
tokenizer = AutoTokenizer.from_pretrained("line-corporation/line-distilbert-base-japanese", trust_remote_code=True)
model = AutoModel.from_pretrained("line-corporation/line-distilbert-base-japanese")

sentence = "LINE株式会社で[MASK]の研究・開発をしている。"
print(model(**tokenizer(sentence, return_tensors="pt")))
```

For predicting the `[MASK]` token rather than extracting hidden states, see the sketch at the end of this card.

## Model architecture

The model architecture is the DistilBERT base model: 6 layers, 768 hidden dimensions, 12 attention heads, and 66M parameters.

## Evaluation

The evaluation on [JGLUE](https://github.com/yahoojapan/JGLUE) is as follows:

| model name             | #Params | MARC-ja (acc) | JNLI (acc) | JSTS (Pearson/Spearman) | JSQuAD (EM/F1) | JCommonSenseQA (acc) |
|------------------------|:-------:|:-------------:|:----------:|:-----------------------:|:--------------:|:--------------------:|
| LINE-DistilBERT        | 68M     | 95.6          | 88.9       | 89.2/85.1               | 87.3/93.3      | 76.1                 |
| Laboro-DistilBERT      | 68M     | 94.7          | 82.0       | 87.4/82.7               | 70.2/87.3      | 73.2                 |
| BandaiNamco-DistilBERT | 68M     | 94.6          | 81.6       | 86.8/82.1               | 80.0/88.0      | 66.5                 |

## Tokenization

The texts are first tokenized by MeCab with the Unidic dictionary and then split into subwords by the SentencePiece algorithm. The vocabulary size is 32768.

## Licenses

The pretrained models are distributed under the terms of the [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0).

## To cite this work

We haven't published any paper on this work. Please cite [this GitHub repository](http://github.com/line/LINE-DistilBERT-Japanese):

```
@misc{LINE-DistilBERT-Japanese,
  title = {LINE DistilBERT Japanese},
  author = {Koga, Kobayashi and Li, Shengzhe and Nakamachi, Akifumi and Sato, Toshinori},
  year = {2023},
  howpublished = {\url{http://github.com/line/LINE-DistilBERT-Japanese}}
}
```
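
## Example: masked-token prediction

The snippet below is a minimal sketch, not part of the upstream repository, showing one way to fill the `[MASK]` position in the widget sentence with `AutoModelForMaskedLM`. It reuses the model ID and `trust_remote_code=True` flag from the how-to example above; the top-5 decoding logic is purely illustrative, and it assumes `torch`, `fugashi`, `unidic-lite`, and `sentencepiece` are installed and that the checkpoint ships a masked-LM head (the fill-mask widget suggests it does).

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Assumed prerequisites: fugashi, unidic-lite and sentencepiece for the Japanese tokenizer.
tokenizer = AutoTokenizer.from_pretrained(
    "line-corporation/line-distilbert-base-japanese", trust_remote_code=True
)
model = AutoModelForMaskedLM.from_pretrained("line-corporation/line-distilbert-base-japanese")

sentence = "LINE株式会社で[MASK]の研究・開発をしている。"
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Locate the [MASK] position and print the 5 most likely replacement tokens.
mask_positions = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
top5_ids = logits[0, mask_positions[0]].topk(5).indices.tolist()
print(tokenizer.convert_ids_to_tokens(top5_ids))
```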