---
license: mit
datasets:
- DDSC/reddit-da
- uonlp/CulturaX
language:
- da
---
|
|
|
# Model Card for the Danoliterate Baseline 7B Model
|
|
|
A base model with the same architecture as Llama 2 7B, but trained from scratch on a combination of Danish datasets for 20K updates (655M tokens).
|
## Model Details
|
|
|
### Model Description
|
|
|
This model was developed as a test model for the thesis [Are GLLMs Danoliterate? Benchmarking Generative NLP in Danish](https://sorenmulli.github.io/thesis/thesis.pdf); relevant details can be found in Sections 4.1, 5.1, and 6.1.
|
|
|
- **Developed by:** Søren Vejlgaard Holm under supervision from Lars Kai Hansen and Martin Carsten Nielsen.

- **Model type:** Base, autoregressive LLM with the Llama 2 7B architecture; a configuration sketch follows this list.

- **Language(s) (NLP):** Danish

- **License:** MIT
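
To make the architecture concrete, the sketch below instantiates the standard Llama 2 7B configuration with 🤗 Transformers. The hyperparameters are the publicly known Llama 2 7B defaults, not values taken from the thesis; in particular, `vocab_size=32000` is the Llama 2 default and is an assumption here, since the Danish tokenizer actually used for this model is not specified in this card.

```python
import torch
from transformers import LlamaConfig, LlamaForCausalLM

# Standard Llama 2 7B hyperparameters; this model reuses the architecture
# but is trained from scratch, so weights and tokenizer differ.
config = LlamaConfig(
    hidden_size=4096,
    intermediate_size=11008,
    num_hidden_layers=32,
    num_attention_heads=32,
    num_key_value_heads=32,
    max_position_embeddings=4096,
    vocab_size=32000,  # Llama 2 default; the actual Danish vocabulary may differ
)

# Build on the meta device so no 7B-parameter tensors are actually allocated.
with torch.device("meta"):
    model = LlamaForCausalLM(config)

print(f"{model.num_parameters() / 1e9:.2f}B parameters")
```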
|
|
|
## Uses
|
|
|
This model is strictly a research artifact for investigating the effect of pretraining a model from scratch and is not intended to be applied directly.
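
For research inspection, the model can be loaded like any other causal LM from the Hugging Face Hub. The repository id below is a placeholder (the Hub id is not stated in this card), and the Danish prompt is only illustrative; as a base model, it continues text rather than follow instructions.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder: replace with the actual Hub repository id of this model.
repo_id = "<org-or-user>/danoliterate-baseline-llm-7b"

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id, torch_dtype="auto")

# Base (non-instruct) model: prompt with text to be continued.
prompt = "Danmark er et land i Skandinavien, som"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```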
|
|
|
|
|
## Bias, Risks, and Limitations
|
|
|
The model has been trained on a large corpus of uncurated internet content and can thus possibly generate problematic content.
|
|
|
## Training Details
|
|
|
### Training Data
|
|
|
The pretraining mix contained the Danish Gigaword and Danish Reddit corpora, as compiled by the Danish Data Science Community, as well as the Danish subset of CulturaX.

For more details, see Section 4.1 in [the thesis](https://sorenmulli.github.io/thesis/thesis.pdf).
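
The two Hub datasets listed in this card's metadata can be previewed as in the sketch below. The `"train"` split names are assumptions, the `"da"` configuration follows CulturaX's per-language layout on the Hub, and accessing CulturaX may require accepting its terms of use. The Danish Gigaword corpus is not listed with a Hub id here and is therefore omitted.

```python
from datasets import load_dataset

# Danish Reddit corpus compiled by the Danish Data Science Community.
reddit_da = load_dataset("DDSC/reddit-da", split="train", streaming=True)

# Danish subset of CulturaX; streamed because the full dataset is large.
culturax_da = load_dataset("uonlp/CulturaX", "da", split="train", streaming=True)

# Peek at one document from each source.
print(next(iter(reddit_da)))
print(next(iter(culturax_da)))
```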
|
|
|
### Training Procedure
|
|
|
See Sections 5.1 and 6.1 in [the thesis](https://sorenmulli.github.io/thesis/thesis.pdf).
|
|
|
|
|
## Evaluation
|
|
|
On the [Danoliterate LLM Benchmark](https://danoliterate.compute.dtu.dk/), this model achieves an index score of 13 as of June 2024.
|
|
|
|
|
## Model Card Contact
|
|
|
Contact Søren Vejlgaard Holm at [email protected] or [email protected]. |