DimensionSTP committed on
Commit 9ead53b • 1 Parent(s): 4bc8899

Upload folder using huggingface_hub

Files changed (6):
1. README.md +20 -3
2. README_original.md +125 -0
3. corpus/AI_HUB +50 -0
4. corpus/MODU_CORPUS +6 -0
5. generation_config.json +7 -0
6. gitattributes +35 -0
README.md CHANGED
@@ -1,3 +1,20 @@
- ---
- license: apache-2.0
- ---
+ ---
+ language:
+ - ko
+ - en
+ pipeline_tag: text-generation
+ inference: false
+ tags:
+ - solar
+ - mistral
+ - pytorch
+ - solar-ko
+ library_name: transformers
+ license: apache-2.0
+ ---
+
+ ## Model Details
+
+ **This model is fine-tuned from beomi/OPEN-SOLAR-KO-10.7B.**
+
+ **Fine-tuning dataset: a scientific QA dataset.**
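For reference, a minimal inference sketch for this checkpoint follows. The repo id is a placeholder (this commit does not state the final Hub path), the prompt is an illustrative scientific question, and the dtype/device settings are assumptions for a 10.7B model, not part of this commit:

```python
# Minimal inference sketch. "<this-repo-id>" is a placeholder for this
# repository's Hugging Face Hub path; replace it before running.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "<this-repo-id>"  # placeholder, not a real model id
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    torch_dtype=torch.float16,  # assumption: fp16 to halve memory for 10.7B weights
    device_map="auto",          # assumption: requires the `accelerate` package
)

prompt = "광합성은 세포의 어느 소기관에서 일어나는가?"  # illustrative scientific QA prompt
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```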
README_original.md ADDED
@@ -0,0 +1,125 @@
+ ---
+ language:
+ - ko
+ - en
+ pipeline_tag: text-generation
+ inference: false
+ tags:
+ - solar
+ - mistral
+ - pytorch
+ - solar-ko
+ library_name: transformers
+ license: apache-2.0
+ ---
+
+ **Update Log**
+
+ - 2024.01.08: Initial test version release of Solar-Ko
+
+ # **Open-Solar-Ko** ⭐🇰🇷
+
+ Solar-Ko is an advanced iteration of the upstage/SOLAR-10.7B-v1.0 model, featuring an expanded vocabulary and additional pretraining on a Korean corpus.
+
+ Open-Solar-Ko exclusively uses publicly accessible Korean corpora, including [AI Hub](https://www.aihub.or.kr), [Modu Corpus, 모두의 말뭉치](https://corpus.korean.go.kr/), and [Korean Wikipedia](https://dumps.wikimedia.org/kowiki/).
+
+ As training was conducted solely on publicly available corpora, the model is open for unrestricted use by everyone, under the Apache 2.0 open-source license.
+
+ ## Model Details
+
+ **Model Developers:** Junbum Lee (Beomi)
+
+ **Variations:** Solar-Ko is available in a single parameter size: 10.7B, as a continually pretrained version.
+
+ **Input:** The model accepts only text input.
+
+ **Output:** The model produces text output exclusively.
+
+ **Model Architecture:**
+
+ SOLAR-KO-10.7B is an auto-regressive language model that leverages an optimized transformer architecture derived from Llama-2.
+
+ | |Training Data|Parameters|Content Length|GQA|Tokens|Learning Rate|
+ |---|---|---|---|---|---|---|
+ |SOLAR-KO-10.7B|*A curated mix of publicly accessible Korean corpora*|10.7B|4k|O|>15B*|5e-5|
+
+ **Training Corpus**
+
+ The model was trained on selected datasets from AI Hub and Modu Corpus. Detailed information about the training datasets is available below:
+
+ - AI Hub: [corpus/AI_HUB](./corpus/AI_HUB)
+   - Only the `Training` segment of the data was used.
+   - The `Validation` and `Test` segments were deliberately excluded.
+ - Modu Corpus: [corpus/MODU_CORPUS](./corpus/MODU_CORPUS)
+
+ The final JSONL dataset used to train this model is approximately 61GB in size.
+
+ Total token count: approximately 15 billion tokens (*using the expanded tokenizer; with the original SOLAR tokenizer, over 60 billion tokens).
+
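Token counts like these can be reproduced by streaming the JSONL corpus through a tokenizer. A sketch, assuming each JSONL record stores its document under a `text` key (the commit does not specify the actual field names):

```python
# Token-counting sketch for a JSONL corpus. Assumption: each record has a
# "text" field; the real schema of the corpus files is not given here.
import json
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("beomi/OPEN-SOLAR-KO-10.7B")

total_tokens = 0
with open("corpus.jsonl", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        total_tokens += len(tokenizer(record["text"])["input_ids"])

print(f"{total_tokens:,} tokens")
```

Running the same pass with the original `upstage/SOLAR-10.7B-v1.0` tokenizer is how the roughly 15B vs. over-60B contrast above is obtained: the expanded Korean vocabulary covers the same text with about a quarter of the tokens.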
+ **Vocab Expansion**
+
+ | Model Name | Vocabulary Size | Description |
+ | --- | --- | --- |
+ | Original Solar | 32000 | SentencePiece BPE |
+ | **Expanded SOLAR-KO-10.7B** | 46592 | SentencePiece BPE; added Korean vocab and merges |
+
+ **Tokenizing "안녕하세요, 오늘은 날씨가 좋네요."**
+
+ - SOLAR-10.7B: 26 tokens
+ - SOLAR-KO-10.7B: 8 tokens
+
+ | Model | Tokens |
+ | --- | --- |
+ | SOLAR-10.7B | `['▁', '안', '<0xEB>', '<0x85>', '<0x95>', '하', '세', '요', ',', '▁', '오', '<0xEB>', '<0x8A>', '<0x98>', '은', '▁', '날', '<0xEC>', '<0x94>', '<0xA8>', '가', '▁', '좋', '네', '요', '.']` |
+ | SOLAR-KO-10.7B | `['▁안녕', '하세요', ',', '▁오늘은', '▁날', '씨가', '▁좋네요', '.']` |
+
+ **Tokenizing "Meet 10.7B Solar: Elevating Performance with Upstage Depth UP Scaling!"**
+
+ - SOLAR-10.7B: 22 tokens
+ - SOLAR-KO-10.7B: 22 tokens
+
+ | Model | Tokens |
+ | --- | --- |
+ | SOLAR-10.7B | `['▁Meet', '▁', '1', '0', '.', '7', 'B', '▁Solar', ':', '▁E', 'lev', 'ating', '▁Performance', '▁with', '▁Up', 'stage', '▁Dep', 'th', '▁UP', '▁Scal', 'ing', '!']` |
+ | SOLAR-KO-10.7B | `['▁Meet', '▁', '1', '0', '.', '7', 'B', '▁Solar', ':', '▁E', 'lev', 'ating', '▁Performance', '▁with', '▁Up', 'stage', '▁Dep', 'th', '▁UP', '▁Scal', 'ing', '!']` |
+
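Both comparisons above can be reproduced with the stock `transformers` tokenizers; a short sketch using the two public Hub repos:

```python
# Reproduce the tokenization comparison between the original SOLAR
# tokenizer and the Korean-expanded SOLAR-KO tokenizer.
from transformers import AutoTokenizer

texts = [
    "안녕하세요, 오늘은 날씨가 좋네요.",
    "Meet 10.7B Solar: Elevating Performance with Upstage Depth UP Scaling!",
]
for repo_id in ("upstage/SOLAR-10.7B-v1.0", "beomi/OPEN-SOLAR-KO-10.7B"):
    tokenizer = AutoTokenizer.from_pretrained(repo_id)
    for text in texts:
        tokens = tokenizer.tokenize(text)
        print(f"{repo_id}: {len(tokens):2d} tokens {tokens}")
```

The English sentence tokenizes identically under both models because the expansion only adds Korean pieces and merges; the existing vocabulary is untouched.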
+ # LICENSE
+
+ Apache 2.0
+
+ # **Model Benchmark**
+
+ ## LM Eval Harness - Korean (polyglot branch)
+
+ - Evaluated with EleutherAI's lm-evaluation-harness: https://github.com/EleutherAI/lm-evaluation-harness/tree/polyglot
+
+ | Task (metric) | 0-shot | 5-shot | 10-shot | 50-shot |
+ |:---------------------------------|---------:|---------:|---------:|---------:|
+ | kobest_boolq (macro_f1) | 0.853949 | 0.88098 | 0.898139 | 0.902354 |
+ | kobest_copa (macro_f1) | 0.804531 | 0.826736 | 0.837656 | 0.860899 |
+ | kobest_hellaswag (macro_f1) | 0.507174 | 0.500983 | 0.487287 | 0.512182 |
+ | kobest_sentineg (macro_f1) | 0.3517 | 0.972291 | 0.977321 | 0.984884 |
+ | kohatespeech (macro_f1) | 0.258111 | 0.403957 | 0.386808 | 0.462393 |
+ | kohatespeech_apeach (macro_f1) | 0.337667 | 0.651697 | 0.705337 | 0.827757 |
+ | kohatespeech_gen_bias (macro_f1) | 0.124535 | 0.503464 | 0.498501 | 0.443218 |
+ | korunsmile (f1) | 0.3814 | 0.356939 | 0.369989 | 0.296193 |
+ | nsmc (acc) | 0.5356 | 0.87162 | 0.88654 | 0.89632 |
+ | pawsx_ko (acc) | 0.5435 | 0.5245 | 0.5315 | 0.5385 |
+
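As a rough guide, numbers like these come from the harness's `simple_evaluate` entry point. A hedged sketch (argument names follow the v0.x-era API; details may differ on the polyglot branch):

```python
# Evaluation sketch for the Korean tasks above using the polyglot branch
# of lm-evaluation-harness (v0.x-style API; exact arguments may differ).
from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="hf-causal",
    model_args="pretrained=beomi/OPEN-SOLAR-KO-10.7B",
    tasks=["kobest_boolq", "kobest_copa", "nsmc"],
    num_fewshot=5,  # the table reports 0-, 5-, 10-, and 50-shot settings
)
print(results["results"])
```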
+ ## Citation
+
+ ```
+ @misc{solar_ko_junbum_2023,
+   author = {{L. Junbum}},
+   title = {Solar-Ko-10.7b},
+   year = 2024,
+   url = {https://huggingface.co/beomi/SOLAR-KO-10.7B},
+   publisher = {Hugging Face}
+ }
+ ```
+
+ ## Acknowledgements
+
+ - Training support was provided by the [TPU Research Cloud](https://sites.research.google/trc/) program.
+ - The training corpus includes data from [AI Hub](https://www.aihub.or.kr/), [Modu Corpus](https://corpus.korean.go.kr/), and [Korean Wikipedia](https://dumps.wikimedia.org/kowiki/).
corpus/AI_HUB ADDED
@@ -0,0 +1,50 @@
+ 754M ./001.문서요약.jsonl
+ 397M ./006.전문분야한영.jsonl
+ 486M ./016.행정_문서_대상_기계독해_데이터.jsonl
+ 563M ./017.뉴스_기사_기계독해_데이터.jsonl
+ 1.2G ./018.논문자료_요약_데이터.jsonl
+ 88M ./019.법률,_규정_(판결서,_약관_등)_텍스트_분석_데이터.jsonl
+ 75M ./020.주제별_텍스트_일상_대화_데이터.jsonl
+ 265M ./021.도서자료_기계독해.jsonl
+ 30M ./021.용도별_목적대화_데이터.jsonl
+ 566M ./022.요약문_및_레포트_생성_데이터.jsonl
+ 19G ./023.전문분야_말뭉치_데이터(분야별_개체명_인식_포함).jsonl
+ 253M ./023.방송_콘텐츠_대본_요약_데이터.jsonl
+ 918M ./025.일상생활_및_구어체_한-영_번역_병렬_말뭉치_데이터.jsonl
+ 307M ./026.한국어-영어_번역_말뭉치_1.jsonl
+ 1.3G ./026.기술과학_분야_한-영_번역_병렬_말뭉치_데이터.jsonl
+ 309M ./027.한국어-중국어_번역_말뭉치_1.jsonl
+ 347M ./027.한국어-영어_번역_말뭉치_2.jsonl
+ 538M ./027.일상생활_및_구어체_한-중,_한-일_번역_병렬_말뭉치_데이터.jsonl
+ 276M ./028.한국어-중국어_번역_말뭉치_2.jsonl
+ 300M ./028.다국어_구어체_번역_병렬_말뭉치_데이터.jsonl
+ 410M ./029.한국어-일본어_번역_말뭉치.jsonl
+ 542K ./029.대규모_구매도서_기반_한국어_말뭉치_데이터.jsonl
+ 9.9G ./030.웹데이터_기반_한국어_말뭉치_데이터.jsonl
+ 1.4G ./031.온라인_구어체_말뭉치_데이터.jsonl
+ 258M ./032.방송콘텐츠_한국어-영어_번역_말뭉치.jsonl
+ 84M ./032.특허_분야_자동분류_데이터.jsonl
+ 239M ./034.방송콘텐츠_한국어-유럽어_번역_말뭉치.jsonl
+ 65M ./044.페르소나_대화.jsonl
+ 56M ./045.지식검색_대화.jsonl
+ 67M ./046.공감형_대화.jsonl
+ 85M ./049.일반상식_문장_생성_평가_데이터.jsonl
+ 13M ./050.발화유형(문어,구어,채팅)별_기계번역_병렬_말뭉치.jsonl
+ 193K ./052.기계번역_품질_검증_데이터.jsonl
+ 118M ./053.한국어-다국어(영어_제외)_번역_말뭉치(기술과학).jsonl
+ 127M ./054.한국어-다국어_번역_말뭉치(기초과학).jsonl
+ 67M ./055.한국어-다국어_번역_말뭉치(인문학).jsonl
+ 205M ./11.기계독해.jsonl
+ 259M ./141.한국어_멀티세션_대화.jsonl
+ 248M ./142.한국어_지식기반_관계_데이터.jsonl
+ 108M ./143.민원_업무_효율,_자동화를_위한_언어_AI_학습데이터.jsonl
+ 2.4G ./146.낚시성_기사_탐지_데이터.jsonl
+ 23M ./147.텍스트_윤리검증_데이터.jsonl
+ 632M ./153.기술과학_요약_데이터.jsonl
+ 962M ./155.산업정보_연계_주요국_특허_영-한_데이터.jsonl
+ 1.1G ./156.전문분야_영-한,_중-한_번역_말뭉치(식품).jsonl
+ 236M ./157.방송_콘텐츠_한-중,_한-일_번역_병렬_말뭉치_데이터.jsonl
+ 418M ./157.추상_요약_사실성_검증_데이터.jsonl
+ 12M ./158.시간_표현_탐지_데이터.jsonl
+ 17M ./159.문장_유형(추론,_예측_등)_판단_데이터.jsonl
+ 1.4G ./297.SNS_데이터_고도화.jsonl
corpus/MODU_CORPUS ADDED
@@ -0,0 +1,6 @@
+ 일상대화말뭉치 2020, 2021
+ 신문 말뭉치 2020, 2021, 2022
+ 유사 문장 말뭉치
+ 문서 요약 말뭉치
+ 문어 말뭉치
+ 의미역 분석 말뭉치
generation_config.json ADDED
@@ -0,0 +1,7 @@
+ {
+   "_from_model_config": true,
+   "bos_token_id": 1,
+   "eos_token_id": 2,
+   "pad_token_id": 2,
+   "transformers_version": "4.36.2"
+ }
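`transformers` reads these defaults whenever `generate()` is called without explicit overrides. A sketch for inspecting them (`<this-repo-id>` is again a placeholder for this repository's Hub path):

```python
# Inspect the generation defaults added in this commit.
from transformers import GenerationConfig

gen_config = GenerationConfig.from_pretrained("<this-repo-id>")  # placeholder id
print(gen_config.bos_token_id, gen_config.eos_token_id, gen_config.pad_token_id)
# expected: 1 2 2 (note that pad_token_id reuses the eos token id, 2)
```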
gitattributes ADDED
@@ -0,0 +1,35 @@
+ *.7z filter=lfs diff=lfs merge=lfs -text
+ *.arrow filter=lfs diff=lfs merge=lfs -text
+ *.bin filter=lfs diff=lfs merge=lfs -text
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
+ *.ckpt filter=lfs diff=lfs merge=lfs -text
+ *.ftz filter=lfs diff=lfs merge=lfs -text
+ *.gz filter=lfs diff=lfs merge=lfs -text
+ *.h5 filter=lfs diff=lfs merge=lfs -text
+ *.joblib filter=lfs diff=lfs merge=lfs -text
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
+ *.mlmodel filter=lfs diff=lfs merge=lfs -text
+ *.model filter=lfs diff=lfs merge=lfs -text
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
+ *.npy filter=lfs diff=lfs merge=lfs -text
+ *.npz filter=lfs diff=lfs merge=lfs -text
+ *.onnx filter=lfs diff=lfs merge=lfs -text
+ *.ot filter=lfs diff=lfs merge=lfs -text
+ *.parquet filter=lfs diff=lfs merge=lfs -text
+ *.pb filter=lfs diff=lfs merge=lfs -text
+ *.pickle filter=lfs diff=lfs merge=lfs -text
+ *.pkl filter=lfs diff=lfs merge=lfs -text
+ *.pt filter=lfs diff=lfs merge=lfs -text
+ *.pth filter=lfs diff=lfs merge=lfs -text
+ *.rar filter=lfs diff=lfs merge=lfs -text
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
+ *.tar filter=lfs diff=lfs merge=lfs -text
+ *.tflite filter=lfs diff=lfs merge=lfs -text
+ *.tgz filter=lfs diff=lfs merge=lfs -text
+ *.wasm filter=lfs diff=lfs merge=lfs -text
+ *.xz filter=lfs diff=lfs merge=lfs -text
+ *.zip filter=lfs diff=lfs merge=lfs -text
+ *.zst filter=lfs diff=lfs merge=lfs -text
+ *tfevents* filter=lfs diff=lfs merge=lfs -text