versae committed
Commit 725c101
1 parent: 376bcea

Update README.md

Files changed (1):
1. README.md (+59 -21)

README.md CHANGED
@@ -9,7 +9,8 @@ widget:
  - text: Fui a la librería a comprar un <mask>.
  ---
 
- - [Version v1](https://huggingface.co/bertin-project/bertin-roberta-base-spanish/tree/v1) (default): July 26th, 2021
+ - [Version v2](https://huggingface.co/bertin-project/bertin-roberta-base-spanish/tree/v2) (default): April 28th, 2022
+ - [Version v1](https://huggingface.co/bertin-project/bertin-roberta-base-spanish/tree/v1): July 26th, 2021
  - [Version v1-512](https://huggingface.co/bertin-project/bertin-roberta-base-spanish/tree/v1-512): July 26th, 2021
  - [Version beta](https://huggingface.co/bertin-project/bertin-roberta-base-spanish/tree/beta): July 15th, 2021
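
As a quick sanity check on the version tags listed above, the sketch below (not part of the original card) pins a tagged revision with the `transformers` fill-mask pipeline and runs the widget sentence. The choice of revision and an installed PyTorch backend are assumptions; swap in any of the other tags as needed.

```python
# Minimal usage sketch: pin a tagged revision and fill the widget's <mask>.
# Assumes `transformers` and a PyTorch backend are installed; "v2" is one of
# the tags listed above -- "v1", "v1-512" or "beta" work the same way.
from transformers import pipeline

fill_mask = pipeline(
    "fill-mask",
    model="bertin-project/bertin-roberta-base-spanish",
    revision="v2",  # one of the tags listed above
)

for prediction in fill_mask("Fui a la librería a comprar un <mask>."):
    print(f"{prediction['token_str']!r}: {prediction['score']:.3f}")
```

Omitting the `revision` argument loads the default branch, which this commit points at v2.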
 
@@ -26,6 +27,62 @@ This is part of the
 
 The aim of this project was to pre-train a RoBERTa-base model from scratch during the Flax/JAX Community Event, in which Google Cloud provided a free TPUv3-8 to do the training using Hugging Face's Flax implementation of their library.
 
+
+ ## Team members
+
+ - Javier de la Rosa ([versae](https://huggingface.co/versae))
+ - Eduardo González ([edugp](https://huggingface.co/edugp))
+ - Paulo Villegas ([paulo](https://huggingface.co/paulo))
+ - Pablo González de Prado ([Pablogps](https://huggingface.co/Pablogps))
+ - Manu Romero ([mrm8488](https://huggingface.co/mrm8488))
+ - María Grandury ([mariagrandury](https://huggingface.co/mariagrandury))
+
+
+ ## Citation and Related Information
+
+ To cite this model:
+ ```bibtex
+ @article{BERTIN,
+   author = {Javier De la Rosa and Eduardo G. Ponferrada and Manu Romero and Paulo Villegas and Pablo González de Prado Salas and María Grandury},
+   title = {BERTIN: Efficient Pre-Training of a Spanish Language Model using Perplexity Sampling},
+   journal = {Procesamiento del Lenguaje Natural},
+   volume = {68},
+   number = {0},
+   year = {2022},
+   keywords = {},
+   abstract = {The pre-training of large language models usually requires massive amounts of resources, both in terms of computation and data. Frequently used web sources such as Common Crawl might contain enough noise to make this pretraining sub-optimal. In this work, we experiment with different sampling methods from the Spanish version of mC4, and present a novel data-centric technique which we name perplexity sampling that enables the pre-training of language models in roughly half the amount of steps and using one fifth of the data. The resulting models are comparable to the current state-of-the-art, and even achieve better results for certain tasks. Our work is proof of the versatility of Transformers, and paves the way for small teams to train their models on a limited budget.},
+   issn = {1989-7553},
+   url = {http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/view/6403},
+   pages = {13--23}
+ }
+ ```
+
+ If you use this model, we would love to hear about it! Reach out on Twitter, GitHub, Discord, or send us an email.
+
+ ## Team
+
+ - Javier de la Rosa ([versae](https://huggingface.co/versae))
+ - Eduardo González ([edugp](https://huggingface.co/edugp))
+ - Paulo Villegas ([paulo](https://huggingface.co/paulo))
+ - Pablo González de Prado ([Pablogps](https://huggingface.co/Pablogps))
+ - Manu Romero ([mrm8488](https://huggingface.co/mrm8488))
+ - María Grandury ([mariagrandury](https://huggingface.co/mariagrandury))
+
+ ## Acknowledgements
+
+ This project would not have been possible without compute generously provided by the National Library of Norway and Google through the
+ [TPU Research Cloud](https://sites.research.google/trc/), as well as the Cloud TPU team for providing early access to the [Cloud TPU VM](https://cloud.google.com/blog/products/compute/introducing-cloud-tpu-vms) Alpha. And especially to [Stella Biderman](https://www.stellabiderman.com) for her general openness, and [Ben Wang](https://github.com/kingoflolz/mesh-transformer-jax) for the main codebase.
+
+ ## Disclaimer
+
+ The models published in this repository are intended for a generalist purpose and are available to third parties. These models may have bias and/or other undesirable distortions. When third parties deploy or provide systems and/or services to other parties using any of these models (or using systems based on these models), or become users of the models themselves, they should note that it is their responsibility to mitigate the risks arising from their use and, in any event, to comply with applicable regulations, including regulations regarding the use of artificial intelligence. In no event shall the owner of the models be liable for any results arising from the use made by third parties of these models.
+
+ <hr>
+
+
+ <details>
+ <summary>Full report</summary>
+
 # Motivation
 
 According to [Wikipedia](https://en.wikipedia.org/wiki/List_of_languages_by_total_number_of_speakers), Spanish is the second most-spoken language in the world by native speakers (>470 million), behind only Chinese, and the fourth when including those who speak it as a second language. However, most NLP research is still published mainly in English. Relevant contributions like BERT, XLNet or GPT-2 sometimes take years to become available in Spanish and, when they do, it is often via multilingual versions which are not as performant as their English counterparts.

@@ -450,15 +507,6 @@ At a personal level, the experience has been incredible for all of us. We believ
 
 Given our good results, on par with those of large corporations, we hope our work will inspire and set the basis for more small teams to play and experiment with language models on smaller subsets of huge datasets.
 
- ## Team members
-
- - Javier de la Rosa ([versae](https://huggingface.co/versae))
- - Eduardo González ([edugp](https://huggingface.co/edugp))
- - Paulo Villegas ([paulo](https://huggingface.co/paulo))
- - Pablo González de Prado ([Pablogps](https://huggingface.co/Pablogps))
- - Manu Romero ([mrm8488](https://huggingface.co/mrm8488))
- - María Grandury ([mariagrandury](https://huggingface.co/mariagrandury))
-
 ## Useful links
 
 - [Community Week timeline](https://discuss.huggingface.co/t/open-to-the-community-community-week-using-jax-flax-for-nlp-cv/7104#summary-timeline-calendar-6)

@@ -468,14 +516,4 @@ Given our good results, on par with those of large corporations, we hope our wor
 - [Masked Language Modelling example scripts](https://github.com/huggingface/transformers/tree/master/examples/flax/language-modeling)
 - [Model Repository](https://huggingface.co/flax-community/bertin-roberta-large-spanish/)
 
- ## References
-
- - Heafield, K. (2011). KenLM: Faster and smaller language model queries. Proceedings of the EMNLP 2011 Sixth Workshop on Statistical Machine Translation.
-
- - Lee, K., Ippolito, D., Nystrom, A., Zhang, C., Eck, D., Callison-Burch, C., & Carlini, N. (2021). Deduplicating training data makes language models better. arXiv preprint arXiv:2107.06499.
-
- - Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., ... & Stoyanov, V. (2019). RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
-
- - Ney, H., Essen, U., & Kneser, R. (1994). On structuring probabilistic dependences in stochastic language modelling. Computer Speech & Language, 8(1), 1-38.
-
- - Wenzek, G., Lachaux, M. A., Conneau, A., Chaudhary, V., Guzmán, F., Joulin, A., & Grave, E. (2019). CCNet: Extracting high quality monolingual datasets from web crawl data. arXiv preprint arXiv:1911.00359.
+ </details>
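
The abstract quoted in the citation hunk above introduces perplexity sampling: scoring mC4 documents with a KenLM language model (Heafield, 2011, one of the references removed in the last hunk) and keeping documents according to their perplexity. The sketch below is only an illustration of that idea, not the project's actual pipeline: the model file `es.arpa.bin`, the Gaussian-shaped weighting and its constants are placeholder assumptions.

```python
# Illustrative sketch of perplexity sampling (not the project's exact code).
# Assumes the `kenlm` Python bindings and a pre-trained KenLM model file;
# "es.arpa.bin" and the weighting constants below are placeholder choices.
import math
import random

import kenlm

lm = kenlm.Model("es.arpa.bin")  # placeholder path to a KenLM language model


def perplexity(text: str) -> float:
    """Word-level perplexity of `text` under the KenLM model."""
    log10_prob = lm.score(text, bos=True, eos=True)  # total log10 probability
    num_tokens = len(text.split()) + 1               # +1 for the end-of-sentence token
    return 10.0 ** (-log10_prob / num_tokens)


def keep_probability(ppl: float, mid: float = 1500.0, width: float = 1000.0) -> float:
    """Gaussian-shaped weight favouring mid-perplexity documents over
    very low (boilerplate-like) and very high (noisy) ones."""
    return math.exp(-((ppl - mid) ** 2) / (2 * width ** 2))


def sample(documents):
    """Yield the subset of `documents` kept by perplexity sampling."""
    for doc in documents:
        if random.random() < keep_probability(perplexity(doc)):
            yield doc
```

In practice the weighting would need to be calibrated against the corpus' actual perplexity distribution rather than fixed constants.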