Update README.md
README.md
```
pip install --upgrade pip
pip install transformers==4.30 sentencepiece accelerate
```

Loading model

```
import torch
from transformers import LlamaTokenizer, AutoModelForCausalLM
# ...
```
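
The loading call itself is not shown above. A minimal sketch of what it could look like, assuming the checkpoint name from the drop-in example at the end of this README and the trust_remote_code / torch_dtype options that appear further down (the omitted lines may differ):

```
import torch
from transformers import LlamaTokenizer, AutoModelForCausalLM

# Hedged sketch of the omitted loading step; the checkpoint name is borrowed
# from the drop-in example later in this README and may not be the one used here.
tokenizer = LlamaTokenizer.from_pretrained("monuirctc/llama-7b-instruct-indo")
model = AutoModelForCausalLM.from_pretrained(
    "monuirctc/llama-7b-instruct-indo",
    torch_dtype=torch.float32,
    trust_remote_code=True,  # mirrors the configuration example below
)
```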

LongLLaMA uses the Hugging Face interface; the long input given to the model will …

```
prompt = "My name is Julien and I like to"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
outputs = model(input_ids=input_ids)
```

During the model call, one can provide the parameter last_context_length (default 1024), which specifies the number of tokens left in the last context window. Tuning this parameter can improve generation, as the first layers do not have access to memory. See details in How LongLLaMA handles long inputs.
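
A minimal sketch of passing this parameter explicitly, assuming the model and inputs from the example above (1024 simply restates the default):

```
# Hedged sketch: last_context_length passed directly in the model call;
# 1024 restates the stated default, so other values are a tuning choice.
outputs = model(input_ids=input_ids, last_context_length=1024)
```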

```
generation_output = model.generate(
    input_ids=input_ids,
    max_new_tokens=256,
    # ...
    temperature=1.0,
)
print(tokenizer.decode(generation_output[0]))
```

Additional configuration

LongLLaMA has several other parameters:

- mem_layers specifies layers endowed with memory (should be either an empty list or a list of all memory layers specified in the description of the checkpoint).
- mem_dtype allows changing the type of the memory cache.
- mem_attention_grouping can trade off speed for reduced memory usage. When equal to (4, 2048), the memory layers will process at most 4*2048 queries at once (4 heads and 2048 queries for each head).

```
import torch
from transformers import LlamaTokenizer, AutoModelForCausalLM

# ...
model = AutoModelForCausalLM.from_pretrained(
    # ...
    trust_remote_code=True,
    mem_attention_grouping=(4, 2048),
)
```
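
The other parameters listed above can be supplied in the same call. A minimal sketch with illustrative values only ("checkpoint-name" is a placeholder, the valid mem_layers entries come from the checkpoint description, and the mem_dtype value is an assumption):

```
import torch
from transformers import AutoModelForCausalLM

# Hedged sketch: the values below are illustrative, not taken from any
# particular checkpoint description.
model = AutoModelForCausalLM.from_pretrained(
    "checkpoint-name",      # placeholder checkpoint identifier
    torch_dtype=torch.float32,
    trust_remote_code=True,
    mem_layers=[],          # an empty list, one of the two allowed forms described above
    mem_dtype="bfloat16",   # assumed form of the memory-cache dtype
)
```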
Drop-in use with LLaMA code

LongLLaMA checkpoints can also be used as a drop-in replacement for LLaMA checkpoints in the Hugging Face implementation of LLaMA, but in this case, they will be limited to the original context length of 2048.

```
from transformers import LlamaTokenizer, LlamaForCausalLM
import torch

tokenizer = LlamaTokenizer.from_pretrained("monuirctc/llama-7b-instruct-indo")
model = LlamaForCausalLM.from_pretrained("monuirctc/llama-7b-instruct-indo", torch_dtype=torch.float32)
```
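
As a quick check of the drop-in path, the earlier prompt and generation settings can be reused; a minimal sketch (the 2048-token limit mentioned above still applies):

```
# Hedged usage sketch reusing the prompt and settings from the earlier example.
prompt = "My name is Julien and I like to"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
generation_output = model.generate(input_ids=input_ids, max_new_tokens=256)
print(tokenizer.decode(generation_output[0]))
```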