nold committed on
Commit
2db720d
1 Parent(s): ae02e63

Upload folder using huggingface_hub (#1)

- 6f60057da271f497f5026040cbbf4996af03d9464f881724793ba85066b2f55d (2963bd068d16f9b574fd6b1fcc643ca486637416)
- e2bf98d3140918064b0d90044812da08d54628f1045af93a932e3ccb6dcc5d15 (e09b3ba264e1ba2ad85bcf06caf8928ff49a9037)
- cda9d07aa6ac79d1dc7948478a1174fcd1401472a0597a3ef03c7a6273aa46b8 (2762f1ec42713e2724e3a67983103dcae66d90ea)
- 288f1cac1d8403bc127bc405219bc99949ab10e8c2c358ee3a36dab0502ca267 (5f9f4b57dfd36e2a151e9485b9fdfc5315ae9a86)

.gitattributes CHANGED
@@ -33,3 +33,8 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ CroissantLLMChat-v0.1_Q2_K.gguf filter=lfs diff=lfs merge=lfs -text
+ CroissantLLMChat-v0.1_Q4_K_M.gguf filter=lfs diff=lfs merge=lfs -text
+ CroissantLLMChat-v0.1_Q5_K_M.gguf filter=lfs diff=lfs merge=lfs -text
+ CroissantLLMChat-v0.1_Q6_K.gguf filter=lfs diff=lfs merge=lfs -text
+ CroissantLLMChat-v0.1_Q8_0.gguf filter=lfs diff=lfs merge=lfs -text
CroissantLLMChat-v0.1_Q2_K.gguf ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:10c1f179142563f970fb6e5b053991dc60fe75f70443f6e55e8a3ee6a52a07f1
+ size 575612096
CroissantLLMChat-v0.1_Q4_K_M.gguf ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:31503df79d7edb19137fcda7d052c7f59d30b436ee113de3de8c49824ee4ce16
+ size 872319104
CroissantLLMChat-v0.1_Q5_K_M.gguf ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:18b02dbcc576168bdf85d91d84e1425ad206c782290b5d970a1209d98fcf919c
+ size 1000639104
CroissantLLMChat-v0.1_Q6_K.gguf ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:90fd5b906d47a10c1152ccb5afc64ddf566110000fc05f7bbb8d445c649d093a
+ size 1170267296
CroissantLLMChat-v0.1_Q8_0.gguf ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:bd72e85e6b084823859e83cfe9c4b59906180745743a0321e441505d28602835
+ size 1430570080
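
The `.gguf` entries above are Git LFS pointer files rather than the model weights themselves: each records a spec version, a SHA-256 object ID, and the size in bytes. The following is a minimal sketch of reading one, assuming the three-line `key value` layout shown in this commit. `LfsPointer` and `parse_lfs_pointer` are hypothetical helper names introduced for illustration and are not part of any Git LFS tooling.

```python
from typing import NamedTuple


class LfsPointer(NamedTuple):
    """Fields of a Git LFS pointer file (hypothetical helper type)."""
    version: str
    algorithm: str
    digest: str
    size: int


def parse_lfs_pointer(text: str) -> LfsPointer:
    """Parse the 'key value' lines of a Git LFS pointer file."""
    fields = {}
    for line in text.strip().splitlines():
        # Each line is a key, one space, then the value.
        key, _, value = line.partition(" ")
        fields[key] = value.strip()
    # The oid has the form '<hash algorithm>:<hex digest>'.
    algorithm, _, digest = fields["oid"].partition(":")
    return LfsPointer(fields["version"], algorithm, digest, int(fields["size"]))


# Pointer contents copied from the Q8_0 entry in this commit.
pointer = parse_lfs_pointer(
    "version https://git-lfs.github.com/spec/v1\n"
    "oid sha256:bd72e85e6b084823859e83cfe9c4b59906180745743a0321e441505d28602835\n"
    "size 1430570080\n"
)
print(pointer.algorithm, pointer.size)
print(f"{pointer.size / 2**30:.2f} GiB")
```

The parsed `size` could be compared against a downloaded file's byte length, and `digest` against a locally computed SHA-256, as a basic integrity check.
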
README.md ADDED
@@ -0,0 +1,110 @@
+ ---
+ license: mit
+ datasets:
+ - croissantllm/croissant_dataset
+ - croissantllm/CroissantLLM-2201-sft
+ - cerebras/SlimPajama-627B
+ - uonlp/CulturaX
+ - pg19
+ - bigcode/starcoderdata
+ language:
+ - fr
+ - en
+ pipeline_tag: text2text-generation
+ tags:
+ - legal
+ - code
+ - text-generation-inference
+ - art
+ ---
+
+ # CroissantLLMChat (190k steps + Chat)
+
+ This model is part of the CroissantLLM initiative and corresponds to the checkpoint after 190k steps (2.99T tokens), followed by a final Chat finetuning phase.
+
+ https://arxiv.org/abs/2402.00786
+
+ For best performance, it should be used with a temperature of 0.3 or more, and with the exact template described below:
+
+ ```python
+ chat = [
+     {"role": "user", "content": "Que puis-je faire à Marseille en hiver?"},
+ ]
+
+ chat_input = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
+ ```
+
+ corresponding to:
+
+ ```python
+ chat_input = """<|im_start|>user
+ {USER QUERY}<|im_end|>
+ <|im_start|>assistant\n"""
+ ```
+
+
+ ## Abstract
+ We introduce CroissantLLM, a 1.3B language model pretrained on a set of 3T English and French tokens, to bring to the research and industrial community a high-performance, fully open-sourced bilingual model that runs swiftly on consumer-grade local hardware.
+ To that end, we pioneer the approach of training an intrinsically bilingual model with a 1:1 English-to-French pretraining data ratio, a custom tokenizer, and bilingual finetuning datasets. We release the training dataset, notably containing a French split with manually curated, high-quality, and varied data sources.
+ To assess performance outside of English, we craft a novel benchmark, FrenchBench, consisting of an array of classification and generation tasks, covering various orthogonal aspects of model performance in the French Language. Additionally, rooted in transparency and to foster further Large Language Model research, we release codebases, and dozens of checkpoints across various model sizes, training data distributions, and training steps, as well as fine-tuned Chat models, and strong translation models. We evaluate our model through the FMTI framework, and validate 81% of the transparency criteria, far beyond the scores of even most open initiatives.
+ This work enriches the NLP landscape, breaking away from previous English-centric work in order to strengthen our understanding of multilinguality in language models.
+
+ ## Citation
+
+ Our work can be cited as:
+
+ ```bibtex
+ @misc{faysse2024croissantllm,
+       title={CroissantLLM: A Truly Bilingual French-English Language Model},
+       author={Manuel Faysse and Patrick Fernandes and Nuno Guerreiro and António Loison and Duarte Alves and Caio Corro and Nicolas Boizard and João Alves and Ricardo Rei and Pedro Martins and Antoni Bigata Casademunt and François Yvon and André Martins and Gautier Viaud and Céline Hudelot and Pierre Colombo},
+       year={2024},
+       eprint={2402.00786},
+       archivePrefix={arXiv},
+       primaryClass={cs.CL}
+ }
+ ```
+
+ ## Usage
+
+ This is a Chat model: it is finetuned for conversational use and works best with the provided template.
+
+ ```python
+
+ import torch
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ # Load the tokenizer and the model weights in half precision.
+ model_name = "croissantllm/CroissantLLMChat-v0.1"
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
+ model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map="auto")
+
+ chat = [
+     {"role": "user", "content": "Que puis-je faire à Marseille en hiver?"},
+ ]
+ # Render the conversation with the tokenizer's chat template.
+ chat_input = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
+ # Tokenize the prompt, then sample a reply at temperature 0.3.
+ inputs = tokenizer(chat_input, return_tensors="pt", add_special_tokens=True).to(model.device)
+ tokens = model.generate(**inputs, max_new_tokens=150, do_sample=True, top_p=0.95, top_k=60, temperature=0.3)
+ print(tokenizer.decode(tokens[0]))
+ ```
+
+
+ ## Model limitations
+
+ Evaluation results indicate the model is strong in its size category: it offers decent performance on writing-based tasks and internal knowledge, and very strong performance on translation tasks. However, the small size of the CroissantLLM model hinders its capacity to perform more complex reasoning-based tasks, at least in a zero- or few-shot manner in its generalist base or chat-model versions. This is in line with other models of similar size and underlines the importance of scale for more abstract tasks.
+
+ #### Knowledge Cutoff
+ The model training dataset has a data cutoff date corresponding to the November 2023 Wikipedia dump. This is the de facto knowledge cutoff date for our base model, although much of the information dates back further. Updated versions can be trained through continued pre-training or subsequent fine-tuning.
+
+ #### Multilingual performance
+ CroissantLLM is primarily a French and English model. Code performance is relatively limited. Some data from other languages is included within the SlimPajama training set, but out-of-the-box performance in other languages should not be expected, even though some European languages do work quite well.
+
+ #### Hallucinations
+ CroissantLLM can hallucinate and output factually incorrect data, especially regarding complex topics. This is to be expected given the small model size. Hallucination rates appear lower than those of most models in the same size category, although no quantitative assessments have been conducted outside of MT-Bench experiments.
+
+
+
+ ***
+
+ Quantization of Model [croissantllm/CroissantLLMChat-v0.1](https://huggingface.co/croissantllm/CroissantLLMChat-v0.1). Created using [llm-quantizer](https://github.com/Nold360/llm-quantizer) Pipeline [8668cbd2081063e33a128251312e6de9744d0a64]
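
The README in this commit asks for the exact `<|im_start|>` / `<|im_end|>` prompt layout. When running the quantized files outside of `transformers`, that string may need to be assembled by hand. The following is a minimal sketch that mirrors only the layout shown in the README. `build_chatml_prompt` is a hypothetical helper, and the tokenizer's real chat template may differ in details such as leading special tokens.

```python
def build_chatml_prompt(messages, add_generation_prompt=True):
    """Render messages in the <|im_start|>/<|im_end|> layout shown in the README.

    Hypothetical helper: it reproduces the documented layout only and is not
    the tokenizer's own chat template.
    """
    parts = []
    for message in messages:
        # One block per turn: role header, content, end marker, newline.
        parts.append(
            f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        )
    if add_generation_prompt:
        # Leave the assistant turn open so the model continues from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


chat = [{"role": "user", "content": "Que puis-je faire à Marseille en hiver?"}]
prompt = build_chatml_prompt(chat)
print(prompt)
```

Comparing this string against the output of `tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)` is a reasonable way to confirm the layout before relying on it.
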