monuirctc committed
Commit d184c0a
1 Parent(s): f862bf5

Update README.md

Files changed (1)
  1. README.md +9 -5
README.md CHANGED
pip install --upgrade pip
pip install transformers==4.30 sentencepiece accelerate

Loading model
```
import torch
from transformers import LlamaTokenizer, AutoModelForCausalLM
```
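The checkpoint itself is then loaded with `from_pretrained`. A minimal sketch, assuming the same repo id and arguments that appear in the examples further down in this README (`trust_remote_code=True` for the custom LongLLaMA code, `torch_dtype=torch.float32` as in the drop-in example):

```
# Sketch only: argument choices mirror the later examples in this README.
tokenizer = LlamaTokenizer.from_pretrained("monuirctc/llama-7b-instruct-indo")
model = AutoModelForCausalLM.from_pretrained(
    "monuirctc/llama-7b-instruct-indo",
    torch_dtype=torch.float32,   # dtype used in the drop-in example below
    trust_remote_code=True,      # required to load the custom LongLLaMA modeling code
)
```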
 
LongLLaMA uses the Hugging Face interface; the long input given to the model will be split into context windows and loaded into the memory cache.
```
prompt = "My name is Julien and I like to"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
outputs = model(input_ids=input_ids)
```
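Because this goes through the standard Hugging Face forward pass, the returned object exposes the usual causal-LM fields; a small sketch, assuming the model returns the standard `logits` tensor:

```
# Assumes standard Hugging Face causal-LM outputs.
next_token_logits = outputs.logits[:, -1, :]  # logits for the token following the prompt
print(next_token_logits.shape)                # (batch_size, vocab_size)
```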
During the model call, one can provide the parameter `last_context_length` (default 1024), which specifies the number of tokens left in the last context window. Tuning this parameter can improve generation, as the first layers do not have access to memory. See details in How LongLLaMA handles long inputs.

```
generation_output = model.generate(
    input_ids=input_ids,
    max_new_tokens=256,
    temperature=1.0,
)
print(tokenizer.decode(generation_output[0]))
```
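For example, `last_context_length` can be provided explicitly in the call; a hedged sketch (1024 is simply the documented default, not a tuned value):

```
# Sketch: explicitly providing last_context_length during a model call.
outputs = model(input_ids=input_ids, last_context_length=1024)
```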
 
Additional configuration
LongLLaMA has several other parameters:

- `mem_layers` specifies layers endowed with memory (should be either an empty list or a list of all memory layers specified in the description of the checkpoint).
- `mem_dtype` allows changing the type of the memory cache.
- `mem_attention_grouping` can trade off speed for reduced memory usage. When equal to (4, 2048), the memory layers will process at most 4*2048 queries at once (4 heads and 2048 queries for each head).

```
import torch
from transformers import LlamaTokenizer, AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "monuirctc/llama-7b-instruct-indo",
    trust_remote_code=True,
    mem_attention_grouping=(4, 2048),
)
```
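The other two parameters from the list are passed the same way. A hedged sketch with illustrative values only (the empty `mem_layers` list and the `"bfloat16"` string are assumptions about accepted values, not recommendations):

```
# Sketch only: mem_layers / mem_dtype values here are illustrative assumptions.
model = AutoModelForCausalLM.from_pretrained(
    "monuirctc/llama-7b-instruct-indo",
    trust_remote_code=True,
    mem_layers=[],           # empty list, i.e. no layers endowed with memory
    mem_dtype="bfloat16",    # type used for the memory cache
    mem_attention_grouping=(4, 2048),
)
```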

Drop-in use with LLaMA code
LongLLaMA checkpoints can also be used as a drop-in replacement for LLaMA checkpoints in the Hugging Face implementation of LLaMA, but in this case they will be limited to the original context length of 2048 tokens.
```
from transformers import LlamaTokenizer, LlamaForCausalLM
import torch

tokenizer = LlamaTokenizer.from_pretrained("monuirctc/llama-7b-instruct-indo")
model = LlamaForCausalLM.from_pretrained("monuirctc/llama-7b-instruct-indo", torch_dtype=torch.float32)
```
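Loaded this way, the checkpoint behaves like any other LLaMA model in `transformers`; a minimal generation sketch under that assumption (the prompt and decoding settings are illustrative, and inputs must stay within the 2048-token limit noted above):

```
# Sketch: standard LLaMA-style generation; keep the input within 2048 tokens.
input_ids = tokenizer("My name is Julien and I like to", return_tensors="pt").input_ids
generation_output = model.generate(input_ids=input_ids, max_new_tokens=64)
print(tokenizer.decode(generation_output[0]))
```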
 
 