Update README.md
README.md
CHANGED
@@ -9,7 +9,8 @@ widget:
   - text: Fui a la librería a comprar un <mask>.
 ---
 
-- [Version
+- [Version v2](https://huggingface.co/bertin-project/bertin-roberta-base-spanish/tree/v2) (default): April 28th, 2022
+- [Version v1](https://huggingface.co/bertin-project/bertin-roberta-base-spanish/tree/v1): July 26th, 2021
 - [Version v1-512](https://huggingface.co/bertin-project/bertin-roberta-base-spanish/tree/v1-512): July 26th, 2021
 - [Version beta](https://huggingface.co/bertin-project/bertin-roberta-base-spanish/tree/beta): July 15th, 2021
 
@@ -26,6 +27,62 @@ This is part of the
 
 The aim of this project was to pre-train a RoBERTa-base model from scratch during the Flax/JAX Community Event, in which Google Cloud provided free TPUv3-8 machines to run the training using the Flax implementations in Hugging Face's Transformers library.
 
+
+## Team members
+
+- Javier de la Rosa ([versae](https://huggingface.co/versae))
+- Eduardo González ([edugp](https://huggingface.co/edugp))
+- Paulo Villegas ([paulo](https://huggingface.co/paulo))
+- Pablo González de Prado ([Pablogps](https://huggingface.co/Pablogps))
+- Manu Romero ([mrm8488](https://huggingface.co/))
+- María Grandury ([mariagrandury](https://huggingface.co/))
+
+
+## Citation and Related Information
+
+To cite this model:
+```bibtex
+@article{BERTIN,
+    author = {Javier De la Rosa y Eduardo G. Ponferrada y Manu Romero y Paulo Villegas y Pablo González de Prado Salas y María Grandury},
+    title = {BERTIN: Efficient Pre-Training of a Spanish Language Model using Perplexity Sampling},
+    journal = {Procesamiento del Lenguaje Natural},
+    volume = {68},
+    number = {0},
+    year = {2022},
+    keywords = {},
+    abstract = {The pre-training of large language models usually requires massive amounts of resources, both in terms of computation and data. Frequently used web sources such as Common Crawl might contain enough noise to make this pretraining sub-optimal. In this work, we experiment with different sampling methods from the Spanish version of mC4, and present a novel data-centric technique which we name perplexity sampling that enables the pre-training of language models in roughly half the amount of steps and using one fifth of the data. The resulting models are comparable to the current state-of-the-art, and even achieve better results for certain tasks. Our work is proof of the versatility of Transformers, and paves the way for small teams to train their models on a limited budget.},
+    issn = {1989-7553},
+    url = {http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/view/6403},
+    pages = {13--23}
+}
+```
+
+If you use this model, we would love to hear about it! Reach out on Twitter, GitHub, Discord, or shoot us an email.
+
+## Team
+
+- Javier de la Rosa ([versae](https://huggingface.co/versae))
+- Eduardo González ([edugp](https://huggingface.co/edugp))
+- Paulo Villegas ([paulo](https://huggingface.co/paulo))
+- Pablo González de Prado ([Pablogps](https://huggingface.co/Pablogps))
+- Manu Romero ([mrm8488](https://huggingface.co/))
+- María Grandury ([mariagrandury](https://huggingface.co/))
+
+## Acknowledgements
+
+This project would not have been possible without the compute generously provided by the National Library of Norway and Google through the
+[TPU Research Cloud](https://sites.research.google/trc/), as well as by the Cloud TPU team, who gave us early access to the [Cloud TPU VM](https://cloud.google.com/blog/products/compute/introducing-cloud-tpu-vms) Alpha. Special thanks to [Stella Biderman](https://www.stellabiderman.com) for her general openness and to [Ben Wang](https://github.com/kingoflolz/mesh-transformer-jax) for the main codebase.
+
+## Disclaimer
+
+The models published in this repository are intended for a generalist purpose and are made available to third parties. These models may have biases and/or other undesirable distortions. When third parties deploy or provide systems and/or services to other parties using any of these models (or systems based on these models), or become users of the models themselves, they should note that it is their responsibility to mitigate the risks arising from such use and, in any event, to comply with applicable regulations, including regulations concerning the use of artificial intelligence. In no event shall the owner of the models be liable for any results arising from the use made by third parties of these models.
+
+<hr>
+
+
+<details>
+<summary>Full report</summary>
+
 # Motivation
 
 According to [Wikipedia](https://en.wikipedia.org/wiki/List_of_languages_by_total_number_of_speakers), Spanish is the second most-spoken language in the world by number of native speakers (more than 470 million), behind only Chinese, and the fourth most-spoken when second-language speakers are included. However, most NLP research is still carried out primarily in English. Relevant contributions like BERT, XLNet or GPT-2 sometimes take years to become available in Spanish and, when they do, it is often via multilingual versions that are not as performant as their English counterparts.
@@ -450,15 +507,6 @@ At a personal level, the experience has been incredible for all of us. We believ
 
 Given our good results, on par with those of large corporations, we hope our work will inspire and set the basis for more small teams to play and experiment with language models on smaller subsets of huge datasets.
 
-## Team members
-
-- Javier de la Rosa ([versae](https://huggingface.co/versae))
-- Eduardo González ([edugp](https://huggingface.co/edugp))
-- Paulo Villegas ([paulo](https://huggingface.co/paulo))
-- Pablo González de Prado ([Pablogps](https://huggingface.co/Pablogps))
-- Manu Romero ([mrm8488](https://huggingface.co/))
-- María Grandury ([mariagrandury](https://huggingface.co/))
-
 ## Useful links
 
 - [Community Week timeline](https://discuss.huggingface.co/t/open-to-the-community-community-week-using-jax-flax-for-nlp-cv/7104#summary-timeline-calendar-6)
@@ -468,14 +516,4 @@ Given our good results, on par with those of large corporations, we hope our wor
 - [Masked Language Modelling example scripts](https://github.com/huggingface/transformers/tree/master/examples/flax/language-modeling)
 - [Model Repository](https://huggingface.co/flax-community/bertin-roberta-large-spanish/)
 
-
-
-- Heafield, K. (2011). KenLM: faster and smaller language model queries. Proceedings of the EMNLP 2011 Sixth Workshop on Statistical Machine Translation.
-
-- Lee, K., Ippolito, D., Nystrom, A., Zhang, C., Eck, D., Callison-Burch, C., & Carlini, N. (2021). Deduplicating Training Data Makes Language Models Better. arXiv preprint arXiv:2107.06499.
-
-- Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., ... & Stoyanov, V. (2019). RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
-
-- Ney, H., Essen, U., & Kneser, R. (1994). On structuring probabilistic dependences in stochastic language modelling. Computer Speech & Language, 8(1), 1-38.
-
-- Wenzek, G., Lachaux, M. A., Conneau, A., Chaudhary, V., Guzmán, F., Joulin, A., & Grave, E. (2019). CCNet: Extracting high quality monolingual datasets from web crawl data. arXiv preprint arXiv:1911.00359.
+</details>
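For readers who want to try the checkpoint referenced in this model card, here is a minimal usage sketch with the `transformers` fill-mask pipeline, using the widget sentence from the card. The model id and branch names come from the version list above; the printed top predictions are illustrative and depend on the checkpoint revision you load.

```python
from transformers import pipeline

# The default branch is v2; pass revision="v1", "v1-512", or "beta" to pin an older checkpoint.
fill_mask = pipeline(
    "fill-mask",
    model="bertin-project/bertin-roberta-base-spanish",
)

# Widget sentence from the model card.
for prediction in fill_mask("Fui a la librería a comprar un <mask>."):
    print(f"{prediction['token_str']!r:>12}  score={prediction['score']:.3f}")
```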
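The abstract in the citation above describes perplexity sampling: documents from the Spanish mC4 are scored with an n-gram language model (KenLM, per the Heafield reference) and subsampled according to their perplexity, which is what allows pre-training with roughly one fifth of the data. The sketch below only illustrates that general idea under assumed details: `es.arpa.bin` is a placeholder path to a Spanish KenLM model, and the quartile-based acceptance rule is a simplification, not the exact weighting scheme used by the authors.

```python
import random
import statistics

import kenlm  # Python bindings for the KenLM n-gram language model


def perplexity_sample(docs, lm_path="es.arpa.bin", keep_fraction=0.2, seed=0):
    """Subsample documents by their perplexity under a KenLM model.

    Illustrative only: mid-perplexity documents (neither boilerplate-like nor
    very noisy) are kept with a higher probability than the distribution tails.
    """
    lm = kenlm.Model(lm_path)  # placeholder path to a Spanish n-gram model
    rng = random.Random(seed)

    ppls = [lm.perplexity(doc) for doc in docs]
    q1, _, q3 = statistics.quantiles(ppls, n=4)  # corpus quartiles

    kept = []
    for doc, ppl in zip(docs, ppls):
        in_middle = q1 <= ppl <= q3
        # Accept mid-perplexity documents more often than the tails.
        accept_prob = min(1.0, keep_fraction * (2.0 if in_middle else 0.5))
        if rng.random() < accept_prob:
            kept.append(doc)
    return kept
```

Sampling by perplexity rather than uniformly is what lets the corpus shrink while keeping training-relevant text, as the cited abstract claims.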