DimensionSTP committed on
Commit 9ead53b • 1 Parent(s): 4bc8899

Upload folder using huggingface_hub

Files changed (6):
1. README.md +20 -3
2. README_original.md +125 -0
3. corpus/AI_HUB +50 -0
4. corpus/MODU_CORPUS +6 -0
5. generation_config.json +7 -0
6. gitattributes +35 -0
README.md CHANGED
@@ -1,3 +1,20 @@
- ---
- license: apache-2.0
- ---
+ ---
+ language:
+ - ko
+ - en
+ pipeline_tag: text-generation
+ inference: false
+ tags:
+ - solar
+ - mistral
+ - pytorch
+ - solar-ko
+ library_name: transformers
+ license: apache-2.0
+ ---
+
+ ## Model Details
+
+ **This model is fine-tuned from beomi/OPEN-SOLAR-KO-10.7B.**
+
+ **Fine-tuning dataset: a scientific QA dataset.**
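For reference, a minimal inference sketch for this checkpoint follows. The repo id is a placeholder (this commit does not state the final Hub path), the prompt is an illustrative scientific question, and the dtype/device settings are assumptions for a 10.7B model, not part of this commit:

```python
# Minimal inference sketch. "<this-repo-id>" is a placeholder for this
# repository's Hugging Face Hub path; replace it before running.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "<this-repo-id>"  # placeholder, not a real model id
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    torch_dtype=torch.float16,  # assumption: fp16 to halve memory for 10.7B weights
    device_map="auto",          # assumption: requires the `accelerate` package
)

prompt = "광합성은 세포의 어느 소기관에서 일어나는가?"  # illustrative scientific QA prompt
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```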
README_original.md ADDED
@@ -0,0 +1,125 @@
+ ---
+ language:
+ - ko
+ - en
+ pipeline_tag: text-generation
+ inference: false
+ tags:
+ - solar
+ - mistral
+ - pytorch
+ - solar-ko
+ library_name: transformers
+ license: apache-2.0
+ ---
+
+ **Update Log**
+
+ - 2024.01.08: Initial test version release of Solar-Ko
+
+ # **Open-Solar-Ko** ⭐🇰🇷
+
+ Solar-Ko is an advanced iteration of the upstage/SOLAR-10.7B-v1.0 model, featuring an expanded vocabulary and additional pretraining on a Korean corpus.
+
+ Open-Solar-Ko exclusively uses publicly accessible Korean corpora, including [AI Hub](https://www.aihub.or.kr), [Modu Corpus, 모두의 말뭉치](https://corpus.korean.go.kr/), and [Korean Wikipedia](https://dumps.wikimedia.org/kowiki/).
+
+ As training was conducted solely on publicly available corpora, the model is open for unrestricted use by everyone, under the Apache 2.0 open-source license.
+
+ ## Model Details
+
+ **Model Developers:** Junbum Lee (Beomi)
+
+ **Variations:** Solar-Ko is available in a single parameter size: 10.7B, as a continually pretrained version.
+
+ **Input:** The model accepts only text input.
+
+ **Output:** The model produces text output exclusively.
+
+ **Model Architecture:**
+
+ SOLAR-KO-10.7B is an auto-regressive language model that leverages an optimized transformer architecture derived from Llama-2.
+
+ | |Training Data|Parameters|Content Length|GQA|Tokens|Learning Rate|
+ |---|---|---|---|---|---|---|
+ |SOLAR-KO-10.7B|*A curated mix of publicly accessible Korean corpora*|10.7B|4k|O|>15B*|5e-5|
+
+ **Training Corpus**
+
+ The model was trained on selected datasets from AI Hub and Modu Corpus. Detailed information about the training datasets is available below:
+
+ - AI Hub: [corpus/AI_HUB](./corpus/AI_HUB)
+   - Only the `Training` segment of the data was used.
+   - The `Validation` and `Test` segments were deliberately excluded.
+ - Modu Corpus: [corpus/MODU_CORPUS](./corpus/MODU_CORPUS)
+
+ The final JSONL dataset used to train this model is approximately 61GB in size.
+
+ Total token count: approximately 15 billion tokens (*using the expanded tokenizer; with the original SOLAR tokenizer, over 60 billion tokens).
+
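Token counts like these can be reproduced by streaming the JSONL corpus through a tokenizer. A sketch, assuming each JSONL record stores its document under a `text` key (the commit does not specify the actual field names):

```python
# Token-counting sketch for a JSONL corpus. Assumption: each record has a
# "text" field; the real schema of the corpus files is not given here.
import json
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("beomi/OPEN-SOLAR-KO-10.7B")

total_tokens = 0
with open("corpus.jsonl", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        total_tokens += len(tokenizer(record["text"])["input_ids"])

print(f"{total_tokens:,} tokens")
```

Running the same pass with the original `upstage/SOLAR-10.7B-v1.0` tokenizer is how the roughly 15B vs. over-60B contrast above is obtained: the expanded Korean vocabulary covers the same text with about a quarter of the tokens.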
+ **Vocab Expansion**
+
+ | Model Name | Vocabulary Size | Description |
+ | --- | --- | --- |
+ | Original Solar | 32000 | SentencePiece BPE |
+ | **Expanded SOLAR-KO-10.7B** | 46592 | SentencePiece BPE; added Korean vocab and merges |
+
+ **Tokenizing "안녕하세요, 오늘은 날씨가 좋네요."**
+
+ - SOLAR-10.7B: 26 tokens
+ - SOLAR-KO-10.7B: 8 tokens
+
+ | Model | Tokens |
+ | --- | --- |
+ | SOLAR-10.7B | `['▁', '안', '<0xEB>', '<0x85>', '<0x95>', '하', '세', '요', ',', '▁', '오', '<0xEB>', '<0x8A>', '<0x98>', '은', '▁', '날', '<0xEC>', '<0x94>', '<0xA8>', '가', '▁', '좋', '네', '요', '.']` |
+ | SOLAR-KO-10.7B | `['▁안녕', '하세요', ',', '▁오늘은', '▁날', '씨가', '▁좋네요', '.']` |
+
+ **Tokenizing "Meet 10.7B Solar: Elevating Performance with Upstage Depth UP Scaling!"**
+
+ - SOLAR-10.7B: 22 tokens
+ - SOLAR-KO-10.7B: 22 tokens
+
+ | Model | Tokens |
+ | --- | --- |
+ | SOLAR-10.7B | `['▁Meet', '▁', '1', '0', '.', '7', 'B', '▁Solar', ':', '▁E', 'lev', 'ating', '▁Performance', '▁with', '▁Up', 'stage', '▁Dep', 'th', '▁UP', '▁Scal', 'ing', '!']` |
+ | SOLAR-KO-10.7B | `['▁Meet', '▁', '1', '0', '.', '7', 'B', '▁Solar', ':', '▁E', 'lev', 'ating', '▁Performance', '▁with', '▁Up', 'stage', '▁Dep', 'th', '▁UP', '▁Scal', 'ing', '!']` |
+
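Both comparisons above can be reproduced with the stock `transformers` tokenizers; a short sketch using the two public Hub repos:

```python
# Reproduce the tokenization comparison between the original SOLAR
# tokenizer and the Korean-expanded SOLAR-KO tokenizer.
from transformers import AutoTokenizer

texts = [
    "안녕하세요, 오늘은 날씨가 좋네요.",
    "Meet 10.7B Solar: Elevating Performance with Upstage Depth UP Scaling!",
]
for repo_id in ("upstage/SOLAR-10.7B-v1.0", "beomi/OPEN-SOLAR-KO-10.7B"):
    tokenizer = AutoTokenizer.from_pretrained(repo_id)
    for text in texts:
        tokens = tokenizer.tokenize(text)
        print(f"{repo_id}: {len(tokens):2d} tokens {tokens}")
```

The English sentence tokenizes identically under both models because the expansion only adds Korean pieces and merges; the existing vocabulary is untouched.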
+ # LICENSE
+
+ Apache 2.0
+
+ # **Model Benchmark**
+
+ ## LM Eval Harness - Korean (polyglot branch)
+
+ - Evaluated with EleutherAI's lm-evaluation-harness: https://github.com/EleutherAI/lm-evaluation-harness/tree/polyglot
+
+ | Task (metric) | 0-shot | 5-shot | 10-shot | 50-shot |
+ |:---------------------------------|---------:|---------:|---------:|---------:|
+ | kobest_boolq (macro_f1) | 0.853949 | 0.88098 | 0.898139 | 0.902354 |
+ | kobest_copa (macro_f1) | 0.804531 | 0.826736 | 0.837656 | 0.860899 |
+ | kobest_hellaswag (macro_f1) | 0.507174 | 0.500983 | 0.487287 | 0.512182 |
+ | kobest_sentineg (macro_f1) | 0.3517 | 0.972291 | 0.977321 | 0.984884 |
+ | kohatespeech (macro_f1) | 0.258111 | 0.403957 | 0.386808 | 0.462393 |
+ | kohatespeech_apeach (macro_f1) | 0.337667 | 0.651697 | 0.705337 | 0.827757 |
+ | kohatespeech_gen_bias (macro_f1) | 0.124535 | 0.503464 | 0.498501 | 0.443218 |
+ | korunsmile (f1) | 0.3814 | 0.356939 | 0.369989 | 0.296193 |
+ | nsmc (acc) | 0.5356 | 0.87162 | 0.88654 | 0.89632 |
+ | pawsx_ko (acc) | 0.5435 | 0.5245 | 0.5315 | 0.5385 |
+
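As a rough guide, numbers like these come from the harness's `simple_evaluate` entry point. A hedged sketch (argument names follow the v0.x-era API; details may differ on the polyglot branch):

```python
# Evaluation sketch for the Korean tasks above using the polyglot branch
# of lm-evaluation-harness (v0.x-style API; exact arguments may differ).
from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="hf-causal",
    model_args="pretrained=beomi/OPEN-SOLAR-KO-10.7B",
    tasks=["kobest_boolq", "kobest_copa", "nsmc"],
    num_fewshot=5,  # the table reports 0-, 5-, 10-, and 50-shot settings
)
print(results["results"])
```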
+ ## Citation
+
+ ```
+ @misc{solar_ko_junbum_2023,
+   author = {{L. Junbum}},
+   title = {Solar-Ko-10.7b},
+   year = 2024,
+   url = {https://huggingface.co/beomi/SOLAR-KO-10.7B},
+   publisher = {Hugging Face}
+ }
+ ```
+
+ ## Acknowledgements
+
+ - Training support was provided by the [TPU Research Cloud](https://sites.research.google/trc/) program.
+ - The training corpus includes data from [AI Hub](https://www.aihub.or.kr/), [Modu Corpus](https://corpus.korean.go.kr/), and [Korean Wikipedia](https://dumps.wikimedia.org/kowiki/).
corpus/AI_HUB ADDED
@@ -0,0 +1,50 @@
+ 754M ./001.문서요약.jsonl
+ 397M ./006.전문분야한영.jsonl
+ 486M ./016.행정_문서_대상_기계독해_데이터.jsonl
+ 563M ./017.뉴스_기사_기계독해_데이터.jsonl
+ 1.2G ./018.논문자료_요약_데이터.jsonl
+ 88M ./019.법률,_규정_(판결서,_약관_등)_텍스트_분석_데이터.jsonl
+ 75M ./020.주제별_텍스트_일상_대화_데이터.jsonl
+ 265M ./021.도서자료_기계독해.jsonl
+ 30M ./021.용도별_목적대화_데이터.jsonl
+ 566M ./022.요약문_및_레포트_생성_데이터.jsonl
+ 19G ./023.전문분야_말뭉치_데이터(분야별_개체명_인식_포함).jsonl
+ 253M ./023.방송_콘텐츠_대본_요약_데이터.jsonl
+ 918M ./025.일상생활_및_구어체_한-영_번역_병렬_말뭉치_데이터.jsonl
+ 307M ./026.한국어-영어_번역_말뭉치_1.jsonl
+ 1.3G ./026.기술과학_분야_한-영_번역_병렬_말뭉치_데이터.jsonl
+ 309M ./027.한국어-중국어_번역_말뭉치_1.jsonl
+ 347M ./027.한국어-영어_번역_말뭉치_2.jsonl
+ 538M ./027.일상생활_및_구어체_한-중,_한-일_번역_병렬_말뭉치_데이터.jsonl
+ 276M ./028.한국어-중국어_번역_말뭉치_2.jsonl
+ 300M ./028.다국어_구어체_번역_병렬_말뭉치_데이터.jsonl
+ 410M ./029.한국어-일본어_번역_말뭉치.jsonl
+ 542K ./029.대규모_구매도서_기반_한국어_말뭉치_데이터.jsonl
+ 9.9G ./030.웹데이터_기반_한국어_말뭉치_데이터.jsonl
+ 1.4G ./031.온라인_구어체_말뭉치_데이터.jsonl
+ 258M ./032.방송콘텐츠_한국어-영어_번역_말뭉치.jsonl
+ 84M ./032.특허_분야_자동분류_데이터.jsonl
+ 239M ./034.방송콘텐츠_한국어-유럽어_번역_말뭉치.jsonl
+ 65M ./044.페르소나_대화.jsonl
+ 56M ./045.지식검색_대화.jsonl
+ 67M ./046.공감형_대화.jsonl
+ 85M ./049.일반상식_문장_생성_평가_데이터.jsonl
+ 13M ./050.발화유형(문어,구어,채팅)별_기계번역_병렬_말뭉치.jsonl
+ 193K ./052.기계번역_품질_검증_데이터.jsonl
+ 118M ./053.한국어-다국어(영어_제외)_번역_말뭉치(기술과학).jsonl
+ 127M ./054.한국어-다국어_번역_말뭉치(기초과학).jsonl
+ 67M ./055.한국어-다국어_번역_말뭉치(인문학).jsonl
+ 205M ./11.기계독해.jsonl
+ 259M ./141.한국어_멀티세션_대화.jsonl
+ 248M ./142.한국어_지식기반_관계_데이터.jsonl
+ 108M ./143.민원_업무_효율,_자동화를_위한_언어_AI_학습데이터.jsonl
+ 2.4G ./146.낚시성_기사_탐지_데이터.jsonl
+ 23M ./147.텍스트_윤리검증_데이터.jsonl
+ 632M ./153.기술과학_요약_데이터.jsonl
+ 962M ./155.산업정보_연계_주요국_특허_영-한_데이터.jsonl
+ 1.1G ./156.전문분야_영-한,_중-한_번역_말뭉치(식품).jsonl
+ 236M ./157.방송_콘텐츠_한-중,_한-일_번역_병렬_말뭉치_데이터.jsonl
+ 418M ./157.추상_요약_사실성_검증_데이터.jsonl
+ 12M ./158.시간_표현_탐지_데이터.jsonl
+ 17M ./159.문장_유형(추론,_예측_등)_판단_데이터.jsonl
+ 1.4G ./297.SNS_데이터_고도화.jsonl
corpus/MODU_CORPUS ADDED
@@ -0,0 +1,6 @@
+ 일상대화말뭉치 2020, 2021
+ 신문 말뭉치 2020, 2021, 2022
+ 유사 문장 말뭉치
+ 문서 요약 말뭉치
+ 문어 말뭉치
+ 의미역 분석 말뭉치
generation_config.json ADDED
@@ -0,0 +1,7 @@
+ {
+   "_from_model_config": true,
+   "bos_token_id": 1,
+   "eos_token_id": 2,
+   "pad_token_id": 2,
+   "transformers_version": "4.36.2"
+ }
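`transformers` reads these defaults whenever `generate()` is called without explicit overrides. A sketch for inspecting them (`<this-repo-id>` is again a placeholder for this repository's Hub path):

```python
# Inspect the generation defaults added in this commit.
from transformers import GenerationConfig

gen_config = GenerationConfig.from_pretrained("<this-repo-id>")  # placeholder id
print(gen_config.bos_token_id, gen_config.eos_token_id, gen_config.pad_token_id)
# expected: 1 2 2 (note that pad_token_id reuses the eos token id, 2)
```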
gitattributes ADDED
@@ -0,0 +1,35 @@
+ *.7z filter=lfs diff=lfs merge=lfs -text
+ *.arrow filter=lfs diff=lfs merge=lfs -text
+ *.bin filter=lfs diff=lfs merge=lfs -text
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
+ *.ckpt filter=lfs diff=lfs merge=lfs -text
+ *.ftz filter=lfs diff=lfs merge=lfs -text
+ *.gz filter=lfs diff=lfs merge=lfs -text
+ *.h5 filter=lfs diff=lfs merge=lfs -text
+ *.joblib filter=lfs diff=lfs merge=lfs -text
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
+ *.mlmodel filter=lfs diff=lfs merge=lfs -text
+ *.model filter=lfs diff=lfs merge=lfs -text
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
+ *.npy filter=lfs diff=lfs merge=lfs -text
+ *.npz filter=lfs diff=lfs merge=lfs -text
+ *.onnx filter=lfs diff=lfs merge=lfs -text
+ *.ot filter=lfs diff=lfs merge=lfs -text
+ *.parquet filter=lfs diff=lfs merge=lfs -text
+ *.pb filter=lfs diff=lfs merge=lfs -text
+ *.pickle filter=lfs diff=lfs merge=lfs -text
+ *.pkl filter=lfs diff=lfs merge=lfs -text
+ *.pt filter=lfs diff=lfs merge=lfs -text
+ *.pth filter=lfs diff=lfs merge=lfs -text
+ *.rar filter=lfs diff=lfs merge=lfs -text
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
+ *.tar filter=lfs diff=lfs merge=lfs -text
+ *.tflite filter=lfs diff=lfs merge=lfs -text
+ *.tgz filter=lfs diff=lfs merge=lfs -text
+ *.wasm filter=lfs diff=lfs merge=lfs -text
+ *.xz filter=lfs diff=lfs merge=lfs -text
+ *.zip filter=lfs diff=lfs merge=lfs -text
+ *.zst filter=lfs diff=lfs merge=lfs -text
+ *tfevents* filter=lfs diff=lfs merge=lfs -text