---
license: cc-by-sa-4.0
language:
- hr
- sl
- bs
- sr
datasets:
- classla/xlm-r-bertic-data
---
# XLM-R-SloBERTić

This model was produced by additionally pre-training [XLM-Roberta-large](https://huggingface.co/xlm-roberta-large) for 48k steps on South Slavic languages, using the [XLM-R-BERTić dataset](https://huggingface.co/datasets/classla/xlm-r-bertic-data).
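Because the model is a masked language model derived from XLM-Roberta-large, it can be queried directly for mask filling before any fine-tuning. The following is a minimal sketch, not an official usage recipe: it assumes the `transformers` library is installed and that the checkpoint lives under the repository name `classla/xlm-r-slobertic` (taken from the links in this card). The helper names `mask_word` and `top_predictions` are hypothetical and exist only for illustration.

```python
MODEL_ID = "classla/xlm-r-slobertic"  # assumed repository name, taken from the links in this card
MASK_TOKEN = "<mask>"                 # mask token used by XLM-RoBERTa tokenizers


def mask_word(sentence: str, word: str) -> str:
    """Replace the first occurrence of `word` in `sentence` with the mask token."""
    if word not in sentence:
        raise ValueError(f"{word!r} not found in sentence")
    return sentence.replace(word, MASK_TOKEN, 1)


def top_predictions(masked_sentence: str, k: int = 5):
    """Return the k most likely fillers as (token, score) pairs.

    The import is local because building the pipeline downloads the
    checkpoint on first use. The pipeline is rebuilt on every call,
    which is fine for a sketch but wasteful in real code.
    """
    from transformers import pipeline

    fill_mask = pipeline("fill-mask", model=MODEL_ID)
    return [(p["token_str"], p["score"]) for p in fill_mask(masked_sentence, top_k=k)]
```

A call such as `top_predictions(mask_word("Zagreb je glavni grad Hrvatske.", "Hrvatske"))` would then list the model's candidates for the masked position. For the benchmark results below, the model was fine-tuned rather than used in this raw form.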

# Benchmarking
Three tasks were chosen for model evaluation:
* Named Entity Recognition (NER)
* Sentiment regression
* COPA (Choice of Plausible Alternatives)

  
In all cases, this model was finetuned for specific downstream tasks.

## NER

Mean F1 scores were used to evaluate performance. Datasets used: [hr500k](https://huggingface.co/datasets/classla/hr500k), [ReLDI-sr](https://huggingface.co/datasets/classla/reldi_sr), [ReLDI-hr](https://huggingface.co/datasets/classla/reldi_hr), and [SETimes.SR](https://huggingface.co/datasets/classla/setimes_sr). 

| system                                                                 | dataset | F1 score |
|:-----------------------------------------------------------------------|:--------|---------:|
| [XLM-R-BERTić](https://huggingface.co/classla/xlm-r-bertic)            | hr500k  |    0.927 |
| [BERTić](https://huggingface.co/classla/bcms-bertic)                   | hr500k  |    0.925 |
| [**XLM-R-SloBERTić**](https://huggingface.co/classla/xlm-r-slobertic)      | hr500k  |    0.923 |
| [XLM-Roberta-Large](https://huggingface.co/xlm-roberta-large)          | hr500k  |    0.919 |
| [crosloengual-bert](https://huggingface.co/EMBEDDIA/crosloengual-bert) | hr500k  |    0.918 |
| [XLM-Roberta-Base](https://huggingface.co/xlm-roberta-base)            | hr500k  |    0.903 |

| system                                                                 | dataset  | F1 score |
|:-----------------------------------------------------------------------|:---------|---------:|
| [**XLM-R-SloBERTić**](https://huggingface.co/classla/xlm-r-slobertic)      | ReLDI-hr |    0.812 |
| [XLM-R-BERTić](https://huggingface.co/classla/xlm-r-bertic)            | ReLDI-hr |    0.809 |
| [crosloengual-bert](https://huggingface.co/EMBEDDIA/crosloengual-bert) | ReLDI-hr |    0.794 |
| [BERTić](https://huggingface.co/classla/bcms-bertic)                   | ReLDI-hr |    0.792 |
| [XLM-Roberta-Large](https://huggingface.co/xlm-roberta-large)          | ReLDI-hr |    0.791 |
| [XLM-Roberta-Base](https://huggingface.co/xlm-roberta-base)            | ReLDI-hr |    0.763 |

| system                                                                 | dataset    | F1 score |
|:-----------------------------------------------------------------------|:-----------|---------:|
| [**XLM-R-SloBERTić**](https://huggingface.co/classla/xlm-r-slobertic)      | SETimes.SR |    0.949 |
| [XLM-R-BERTić](https://huggingface.co/classla/xlm-r-bertic)            | SETimes.SR |    0.940 |
| [BERTić](https://huggingface.co/classla/bcms-bertic)                   | SETimes.SR |    0.936 |
| [XLM-Roberta-Large](https://huggingface.co/xlm-roberta-large)          | SETimes.SR |    0.933 |
| [crosloengual-bert](https://huggingface.co/EMBEDDIA/crosloengual-bert) | SETimes.SR |    0.922 |
| [XLM-Roberta-Base](https://huggingface.co/xlm-roberta-base)            | SETimes.SR |    0.914 |

| system                                                                 | dataset  | F1 score |
|:-----------------------------------------------------------------------|:---------|---------:|
| [XLM-R-BERTić](https://huggingface.co/classla/xlm-r-bertic)            | ReLDI-sr |    0.841 |
| [**XLM-R-SloBERTić**](https://huggingface.co/classla/xlm-r-slobertic)      | ReLDI-sr |    0.824 |
| [BERTić](https://huggingface.co/classla/bcms-bertic)                   | ReLDI-sr |    0.798 |
| [XLM-Roberta-Large](https://huggingface.co/xlm-roberta-large)          | ReLDI-sr |    0.774 |
| [crosloengual-bert](https://huggingface.co/EMBEDDIA/crosloengual-bert) | ReLDI-sr |    0.751 |
| [XLM-Roberta-Base](https://huggingface.co/xlm-roberta-base)            | ReLDI-sr |    0.734 |
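The card does not specify which F1 variant was averaged or over what. As an illustration only, the sketch below assumes a macro-averaged, token-level F1 that is then averaged over several fine-tuning runs. Both choices are assumptions, as are the function names and the toy labels.

```python
from statistics import mean


def macro_f1(gold, pred):
    """Token-level macro F1: unweighted mean of the per-label F1 scores."""
    labels = sorted(set(gold) | set(pred))
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return mean(scores)


def mean_f1_over_runs(gold, runs):
    """Average the macro F1 of several prediction runs against the same gold labels."""
    return mean(macro_f1(gold, pred) for pred in runs)


# Made-up toy data: one gold sequence and the predictions of two hypothetical runs.
gold = ["O", "B-PER", "O", "B-LOC"]
run_a = ["O", "B-PER", "O", "B-LOC"]  # perfect run
run_b = ["O", "B-PER", "O", "O"]      # misses the location
mean_f1 = mean_f1_over_runs(gold, [run_a, run_b])
```

Averaging over runs smooths out the variance that different random seeds introduce during fine-tuning, which matters when systems differ by only a few thousandths, as in the tables above.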

## Sentiment regression

The [ParlaSent dataset](https://huggingface.co/datasets/classla/ParlaSent) was used to evaluate sentiment regression for Bosnian, Croatian, and Serbian.
The procedure is explained in greater detail in the dedicated [benchmarking repository](https://github.com/clarinsi/benchich/tree/main/sentiment).

| system                                                                 | train               | test                     |    R² |
|:-----------------------------------------------------------------------|:--------------------|:-------------------------|------:|
| [xlm-r-parlasent](https://huggingface.co/classla/xlm-r-parlasent)      | ParlaSent_BCS.jsonl | ParlaSent_BCS_test.jsonl | 0.615 |
| [BERTić](https://huggingface.co/classla/bcms-bertic)                   | ParlaSent_BCS.jsonl | ParlaSent_BCS_test.jsonl | 0.612 |
| [**XLM-R-SloBERTić**](https://huggingface.co/classla/xlm-r-slobertic)      | ParlaSent_BCS.jsonl | ParlaSent_BCS_test.jsonl | 0.607 |
| [XLM-Roberta-Large](https://huggingface.co/xlm-roberta-large)          | ParlaSent_BCS.jsonl | ParlaSent_BCS_test.jsonl | 0.605 |
| [XLM-R-BERTić](https://huggingface.co/classla/xlm-r-bertic)            | ParlaSent_BCS.jsonl | ParlaSent_BCS_test.jsonl | 0.601 |
| [crosloengual-bert](https://huggingface.co/EMBEDDIA/crosloengual-bert) | ParlaSent_BCS.jsonl | ParlaSent_BCS_test.jsonl | 0.537 |
| [XLM-Roberta-Base](https://huggingface.co/xlm-roberta-base)            | ParlaSent_BCS.jsonl | ParlaSent_BCS_test.jsonl | 0.500 |
| dummy (mean)                                                           | ParlaSent_BCS.jsonl | ParlaSent_BCS_test.jsonl | -0.12 |
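The negative score of the dummy baseline can look surprising. A minimal sketch of the coefficient of determination shows how it arises, assuming the dummy predicts the mean of the training labels for every test instance (the card does not state this explicitly). The label values below are made up.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up labels: the training mean (2.0) differs from the test mean (3.0).
train_labels = [0.0, 1.0, 2.0, 5.0]
test_labels = [1.0, 2.0, 3.0, 4.0, 5.0]

# Dummy baseline: always predict the training mean.
train_mean = sum(train_labels) / len(train_labels)
dummy_pred = [train_mean] * len(test_labels)
dummy_r2 = r2_score(test_labels, dummy_pred)
```

A constant prediction reaches exactly 0 only when it equals the test mean. Any other constant has a larger residual sum of squares than the total sum of squares, so the score drops below zero.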


## COPA


| system                                                                 | dataset | Accuracy score |
|:-----------------------------------------------------------------------|:--------|---------------:|
| [BERTić](https://huggingface.co/classla/bcms-bertic)                   | Copa-SR |          0.689 |
| [**XLM-R-SloBERTić**](https://huggingface.co/classla/xlm-r-slobertic)      | Copa-SR |          0.665 |
| [XLM-R-BERTić](https://huggingface.co/classla/xlm-r-bertic)            | Copa-SR |          0.637 |
| [crosloengual-bert](https://huggingface.co/EMBEDDIA/crosloengual-bert) | Copa-SR |          0.607 |
| [XLM-Roberta-Base](https://huggingface.co/xlm-roberta-base)            | Copa-SR |          0.573 |
| [XLM-Roberta-Large](https://huggingface.co/xlm-roberta-large)          | Copa-SR |          0.570 |


| system                                                                 | dataset | Accuracy score |
|:-----------------------------------------------------------------------|:--------|---------------:|
| [BERTić](https://huggingface.co/classla/bcms-bertic)                   | Copa-HR |          0.669 |
| [crosloengual-bert](https://huggingface.co/EMBEDDIA/crosloengual-bert) | Copa-HR |          0.669 |
| [XLM-R-BERTić](https://huggingface.co/classla/xlm-r-bertic)            | Copa-HR |          0.635 |
| [**XLM-R-SloBERTić**](https://huggingface.co/classla/xlm-r-slobertic)  | Copa-HR |          0.628 |
| [XLM-Roberta-Base](https://huggingface.co/xlm-roberta-base)            | Copa-HR |          0.585 |
| [XLM-Roberta-Large](https://huggingface.co/xlm-roberta-large)          | Copa-HR |          0.571 |



# Citation

<!---
The following paper has been submitted for review:

```
@misc{ljubesic2024language,
  author       = "Ljube\v{s}i\'{c}, Nikola and Suchomel, Vit and Rupnik, Peter and Kuzman, Taja and van Noord, Rik",
  title        = "Language Models on a Diet: Cost-Efficient Development of Encoders for Closely-Related Languages via Additional Pretraining",
  howpublished = "Submitted for review",
  year         = "2024",
}
```
--->


Please cite the following paper:
```
@article{ljubesic2024language,
  title={Language Models on a Diet: Cost-Efficient Development of Encoders for Closely-Related Languages via Additional Pretraining},
  url={http://arxiv.org/abs/2404.05428},
  DOI={10.48550/arXiv.2404.05428},
  abstractNote={The world of language models is going through turbulent times, better and ever larger models are coming out at an unprecedented speed. However, we argue that, especially for the scientific community, encoder models of up to 1 billion parameters are still very much needed, their primary usage being in enriching large collections of data with metadata necessary for downstream research. We investigate the best way to ensure the existence of such encoder models on the set of very closely related languages - Croatian, Serbian, Bosnian and Montenegrin, by setting up a diverse benchmark for these languages, and comparing the trained-from-scratch models with the new models constructed via additional pretraining of existing multilingual models. We show that comparable performance to dedicated from-scratch models can be obtained by additionally pretraining available multilingual models even with a limited amount of computation. We also show that neighboring languages, in our case Slovenian, can be included in the additional pretraining with little to no loss in the performance of the final model.},
  note={arXiv:2404.05428 [cs]},
  number={arXiv:2404.05428},
  publisher={arXiv},
  author={Ljubešić, Nikola and Suchomel, Vít and Rupnik, Peter and Kuzman, Taja and van Noord, Rik},
  year={2024},
  month=apr
}
```