---
license: mit
datasets:
- DDSC/reddit-da
- uonlp/CulturaX
language:
- da
---

# Model Card for the Danoliterate Baseline 7B Model

A base model with the same architecture as Llama 2 7B, trained from scratch on a combination of Danish datasets for 20K updates (655M tokens).

## Model Details

### Model Description

A test model developed as part of the thesis [Are GLLMs Danoliterate? Benchmarking Generative NLP in Danish](https://sorenmulli.github.io/thesis/thesis.pdf), with relevant details in Sections 4.1, 5.1 and 6.1.

- **Developed by:** Søren Vejlgaard Holm under supervision from Lars Kai Hansen and Martin Carsten Nielsen.
- **Model type:** Base, autoregressive LLM with the Llama 2 7B architecture.
- **Language(s) (NLP):** Danish
- **License:** MIT

## Uses

This model is strictly a research artifact for investigating the effect of pre-training a model from scratch and is not intended to be applied directly.

## Bias, Risks, and Limitations

The model has been trained on a large corpus of uncurated internet content and can thus possibly generate problematic content.

## Training Details

### Training Data

The pretraining mix contained the Danish Gigaword and Danish Reddit corpora as compiled by the Danish Data Science Community, as well as the Danish subset of CulturaX. For more details, see Section 4.1 in [the thesis](https://sorenmulli.github.io/thesis/thesis.pdf).

### Training Procedure

See Sections 5.1 and 6.1 in [the thesis](https://sorenmulli.github.io/thesis/thesis.pdf).

## Evaluation

On the [Danoliterate LLM Benchmark](https://danoliterate.compute.dtu.dk/), this model achieves an index score of 13 as of June 2024.

## Model Card Contact

Contact Søren Vejlgaard Holm at swiho@dtu.dk or swh@alvenir.ai.
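
## How to Get Started with the Model

Since the model is a standard Llama 2 7B architecture base model, it can in principle be loaded with the Hugging Face `transformers` library. Below is a minimal, untested sketch; the repository id shown is a placeholder and should be replaced with the actual Hugging Face repo id of this model.

```python
# Minimal sketch of loading the model for research experiments.
# NOTE: the repo id below is a placeholder, not a confirmed identifier.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "sorenmulli/danoliterate-baseline-7b"  # placeholder

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id)

# Generate a short Danish continuation from a prompt.
inputs = tokenizer("Danmark er et land, hvor", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

As noted under Uses, any such loading should be for research purposes only; the model is not intended for direct application.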