---
license: mit
datasets:
- DDSC/reddit-da
- uonlp/CulturaX
language:
- da
---
|
|
|
# Model Card for the Danoliterate Baseline 7B Model
|
|
|
A base model with the same architecture as Llama 2 7B, but trained from scratch on a combination of Danish datasets for 20K updates (655M tokens).
|
## Model Details
|
|
|
### Model Description
|
|
|
This model was developed as a test model for the thesis [Are GLLMs Danoliterate? Benchmarking Generative NLP in Danish](https://sorenmulli.github.io/thesis/thesis.pdf); relevant details can be found in Sections 4.1, 5.1, and 6.1.
|
|
|
- **Developed by:** Søren Vejlgaard Holm under supervision from Lars Kai Hansen and Martin Carsten Nielsen.

- **Model type:** Base, autoregressive LLM with the Llama 2 7B architecture; a configuration sketch follows this list.

- **Language(s) (NLP):** Danish

- **License:** MIT
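
To make the architecture concrete, the sketch below instantiates the standard Llama 2 7B configuration with 🤗 Transformers. The hyperparameters are the publicly known Llama 2 7B defaults, not values taken from the thesis; in particular, `vocab_size=32000` is the Llama 2 default and is an assumption here, since the Danish tokenizer actually used for this model is not specified in this card.

```python
import torch
from transformers import LlamaConfig, LlamaForCausalLM

# Standard Llama 2 7B hyperparameters; this model reuses the architecture
# but is trained from scratch, so weights and tokenizer differ.
config = LlamaConfig(
    hidden_size=4096,
    intermediate_size=11008,
    num_hidden_layers=32,
    num_attention_heads=32,
    num_key_value_heads=32,
    max_position_embeddings=4096,
    vocab_size=32000,  # Llama 2 default; the actual Danish vocabulary may differ
)

# Build on the meta device so no 7B-parameter tensors are actually allocated.
with torch.device("meta"):
    model = LlamaForCausalLM(config)

print(f"{model.num_parameters() / 1e9:.2f}B parameters")
```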
|
|
|
## Uses
|
|
|
This model is strictly a research artifact for investigating the effect of pretraining a model from scratch and is not intended to be applied directly.
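
For research inspection, the model can be loaded like any other causal LM from the Hugging Face Hub. The repository id below is a placeholder (the Hub id is not stated in this card), and the Danish prompt is only illustrative; as a base model, it continues text rather than follow instructions.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder: replace with the actual Hub repository id of this model.
repo_id = "<org-or-user>/danoliterate-baseline-llm-7b"

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id, torch_dtype="auto")

# Base (non-instruct) model: prompt with text to be continued.
prompt = "Danmark er et land i Skandinavien, som"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```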
|
|
|
|
|
## Bias, Risks, and Limitations
|
|
|
The model has been trained on a large corpus of uncurated internet content and can thus possibly generate problematic content.
|
|
|
## Training Details
|
|
|
### Training Data
|
|
|
The pretraining mix contained the Danish Gigaword and Danish Reddit corpora, as compiled by the Danish Data Science Community, as well as the Danish subset of CulturaX.

For more details, see Section 4.1 in [the thesis](https://sorenmulli.github.io/thesis/thesis.pdf).
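
The two Hub datasets listed in this card's metadata can be previewed as in the sketch below. The `"train"` split names are assumptions, the `"da"` configuration follows CulturaX's per-language layout on the Hub, and accessing CulturaX may require accepting its terms of use. The Danish Gigaword corpus is not listed with a Hub id here and is therefore omitted.

```python
from datasets import load_dataset

# Danish Reddit corpus compiled by the Danish Data Science Community.
reddit_da = load_dataset("DDSC/reddit-da", split="train", streaming=True)

# Danish subset of CulturaX; streamed because the full dataset is large.
culturax_da = load_dataset("uonlp/CulturaX", "da", split="train", streaming=True)

# Peek at one document from each source.
print(next(iter(reddit_da)))
print(next(iter(culturax_da)))
```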
|
|
|
### Training Procedure
|
|
|
See Sections 5.1 and 6.1 in [the thesis](https://sorenmulli.github.io/thesis/thesis.pdf).
|
|
|
|
|
## Evaluation
|
|
|
On the [Danoliterate LLM Benchmark](https://danoliterate.compute.dtu.dk/), this model achieves an index score of 13 as of June 2024.
|
|
|
|
|
## Model Card Contact
|
|
|
Contact Søren Vejlgaard Holm at [email protected] or [email protected]. |