Commit 7f7ba02 by munish0838 (parent: 93966e3)

Upload README.md with huggingface_hub

Files changed (1): README.md ADDED (+85, -0)

---
license: mit
language:
- ja
- en
---

![](https://lh7-rt.googleusercontent.com/docsz/AD_4nXeiuCm7c8lEwEJuRey9kiVZsRn2W-b4pWlu3-X534V3YmVuVc2ZL-NXg2RkzSOOS2JXGHutDuyyNAUtdJI65jGTo8jT9Y99tMi4H4MqL44Uc5QKG77B0d6-JfIkZHFaUA71-RtjyYZWVIhqsNZcx8-OMaA?key=xt3VSDoCbmTY7o-cwwOFwQ)

# QuantFactory/sarashina2-7b-GGUF
This is a quantized version of [sbintuitions/sarashina2-7b](https://huggingface.co/sbintuitions/sarashina2-7b), created using llama.cpp.
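
A GGUF file like the one in this repo can be run without PyTorch through llama.cpp or its Python bindings. The sketch below uses the llama-cpp-python package; the quantized filename is an assumption (this repo's actual .gguf filenames may differ), so substitute the file you downloaded.

```python
# Minimal sketch: running a GGUF quant with llama-cpp-python.
# "sarashina2-7b.Q4_K_M.gguf" is a hypothetical filename; use the
# actual .gguf file downloaded from this repo.
from llama_cpp import Llama

llm = Llama(
    model_path="sarashina2-7b.Q4_K_M.gguf",
    n_ctx=2048,       # context window size
    n_gpu_layers=-1,  # offload all layers to GPU when available
)

out = llm("おはようございます、今日の天気は", max_tokens=30)
print(out["choices"][0]["text"])
```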

# Original Model Card

# Sarashina2-7B

This repository provides large language models trained by [SB Intuitions](https://www.sbintuitions.co.jp/).

## How to use

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline, set_seed

model = AutoModelForCausalLM.from_pretrained("sbintuitions/sarashina2-7b", torch_dtype=torch.bfloat16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("sbintuitions/sarashina2-7b")
# If you want to use the slow tokenizer:
# tokenizer = AutoTokenizer.from_pretrained("sbintuitions/sarashina2-7b", use_fast=False)
generator = pipeline("text-generation", model=model, tokenizer=tokenizer)
set_seed(123)

# Prompt: "Good morning, today's weather is"
text = generator(
    "おはようございます、今日の天気は",
    max_length=30,
    do_sample=True,
    pad_token_id=tokenizer.pad_token_id,
    num_return_sequences=3,
)

for t in text:
    print(t)

# Example outputs from the sarashina2-7b model (Japanese continuations of the prompt):
# {'generated_text': 'おはようございます、今日の天気は晴れです。ちょっと風が強い。\n昨日は、久しぶりにゆっくりとしていました。\n2週間位間があいてしまったかも、でもその間に'}
# {'generated_text': 'おはようございます、今日の天気は曇。朝は曇っていてどんよりしていましたね。昼からは晴れそうですが。気温は徐々に上昇しています。昨日は春らしい陽気でした。'}
# {'generated_text': 'おはようございます、今日の天気はくもり、少し寒気がします。 この土日に、家族で一泊二日で旅行に行ってきました。といっても、100キロ'}
```

## Configuration

| Parameters | Vocab size | Training tokens | Architecture | Position type | Layers | Hidden dim | Attention heads |
| :-----: | :-----------: | :-------------: | :------------ | :-----------: | :----: | :--------: | :-------------: |
| [7B](https://huggingface.co/sbintuitions/sarashina2-7b) | 102400 | 2.1T | Llama2 | RoPE | 32 | 4096 | 32 |
| [13B](https://huggingface.co/sbintuitions/sarashina2-13b) | 102400 | 2.1T | Llama2 | RoPE | 40 | 5120 | 40 |
| [70B](https://huggingface.co/sbintuitions/sarashina2-70b) | 102400 | 2.1T | Llama2 | RoPE | 80 | 8192 | 64 |
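
These dimensions can be checked against the published config without downloading any weights. A quick sketch, assuming the checkpoint exposes the standard Llama2-style config field names:

```python
from transformers import AutoConfig

# Fetches only the model config (no weights are downloaded).
config = AutoConfig.from_pretrained("sbintuitions/sarashina2-7b")

# Standard Llama config fields; values should match the 7B row above.
print(config.num_hidden_layers)    # 32 (Layers)
print(config.hidden_size)          # 4096 (Hidden dim)
print(config.num_attention_heads)  # 32 (Attention heads)
print(config.vocab_size)           # 102400 (Vocab size)
```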

## Training Corpus

For our Japanese training data, we used the Japanese portion of the [Common Crawl corpus](https://commoncrawl.org/), the largest publicly available web corpus.
To clean the training corpus, we used [CCNet](https://github.com/facebookresearch/cc_net) and [HojiChar](https://github.com/HojiChar/HojiChar).
After cleaning, our Japanese training data contains about 1T tokens.

For our English training data, we extracted English documents from [SlimPajama](https://huggingface.co/datasets/cerebras/SlimPajama-627B), but removed the Books3 subset because of copyright infringement concerns.
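
HojiChar composes individual document filters into a single cleaning pipeline. The snippet below is a generic illustration built from HojiChar's documented Compose/document_filters API; the actual filter set and thresholds used for Sarashina2's corpus are not published, so none of these choices reflect the real pipeline.

```python
from hojichar import Compose, document_filters

# Illustrative pipeline only; the filters and thresholds used for
# Sarashina2 are not published.
cleaner = Compose([
    document_filters.JSONLoader(key="text"),  # read one JSONL record
    document_filters.AcceptJapanese(),        # keep Japanese documents
    document_filters.DocumentLengthFilter(min_doc_len=10, max_doc_len=50000),
    document_filters.JSONDumper(),            # serialize the result back to JSON
])

print(cleaner('{"text": "おはようございます、今日の天気は晴れです。"}'))
```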

## Tokenization

We use a [sentencepiece](https://github.com/google/sentencepiece) tokenizer with a unigram language model and byte fallback.
We do not apply pre-tokenization with a Japanese tokenizer, so users can feed raw sentences directly into the tokenizer.
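
Since no pre-tokenization is required, raw Japanese text can be passed straight to the Hugging Face tokenizer. A small sketch (the exact subword splits depend on the released vocabulary, so the output here is illustrative):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sbintuitions/sarashina2-7b")

# Raw text goes in directly; no external morphological analyzer
# (e.g. MeCab) is needed before tokenization.
text = "おはようございます、今日の天気は"  # "Good morning, today's weather is"
print(tokenizer.tokenize(text))      # unigram-model subword pieces
print(tokenizer(text)["input_ids"])  # corresponding token IDs

# Byte fallback: characters missing from the vocabulary decompose into
# byte-level tokens rather than collapsing to a single unknown token.
```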

## Ethical Considerations and Limitations

Sarashina2 has not yet been tuned to follow instructions.
As a result, it may generate meaningless sequences, inaccurate statements, or biased and objectionable outputs.
Before deploying Sarashina2, we ask developers to tune the model according to human preferences and safety considerations.

## License

[MIT License](https://huggingface.co/sbintuitions/sarashina2-7b/blob/main/LICENSE)