How to fine-tune Jamba on Google Colab?
On 1 GPU
Done: https://exnrt.com/blog/ai/finetune-jamba-v01/
Thanks to @alvations for the help.
I tried an A100 on Colab, but it looks like there are still some bugs in the accelerate
auto device mapping: https://colab.research.google.com/drive/1T0fhyP963DHJDjUNrPMScD0L9uDfOj-w?usp=sharing
When initializing the SFTTrainer, it prints these warnings and then throws the error below:
/usr/local/lib/python3.10/dist-packages/trl/trainer/sft_trainer.py:245: UserWarning: You didn't pass a `max_seq_length` argument to the SFTTrainer, this will default to 1024
warnings.warn(
/usr/local/lib/python3.10/dist-packages/trl/trainer/sft_trainer.py:317: UserWarning: You passed a tokenizer with `padding_side` not equal to `right` to the SFTTrainer. This might lead to some unexpected behaviour due to overflow issues when training a model in half-precision. You might consider adding `tokenizer.padding_side = 'right'` to your code.
warnings.warn(
/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py:432: FutureWarning: Passing the following arguments to `Accelerator` is deprecated and will be removed in version 1.0 of Accelerate: dict_keys(['dispatch_batches', 'split_batches', 'even_batches', 'use_seedable_sampler']). Please pass an `accelerate.DataLoaderConfiguration` instead:
dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False, even_batches=True, use_seedable_sampler=True)
warnings.warn(
---------------------------------------------------------------------------
NotImplementedError Traceback (most recent call last)
<ipython-input-7-b027ad8b6132> in <cell line: 1>()
----> 1 trainer = SFTTrainer(
2 model=model,
3 tokenizer=tokenizer,
4 args=training_args,
5 peft_config=lora_config,
12 frames
/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py in convert(t)
1148 return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None,
1149 non_blocking, memory_format=convert_to_format)
-> 1150 return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
1151
1152 return self._apply(convert)
NotImplementedError: Cannot copy out of meta tensor; no data!
I think it's also complaining about moving the model after accelerate has offloaded some of its parameters:
/usr/local/lib/python3.10/dist-packages/trl/trainer/sft_trainer.py:317: UserWarning: You passed a tokenizer with `padding_side` not equal to `right` to the SFTTrainer. This might lead to some unexpected behaviour due to overflow issues when training a model in half-precision. You might consider adding `tokenizer.padding_side = 'right'` to your code.
warnings.warn(
/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py:432: FutureWarning: Passing the following arguments to `Accelerator` is deprecated and will be removed in version 1.0 of Accelerate: dict_keys(['dispatch_batches', 'split_batches', 'even_batches', 'use_seedable_sampler']). Please pass an `accelerate.DataLoaderConfiguration` instead:
dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False, even_batches=True, use_seedable_sampler=True)
warnings.warn(
WARNING:accelerate.big_modeling:You shouldn't move a model that is dispatched using accelerate hooks.
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-19-5df240c1e5f7> in <cell line: 3>()
1 import torch
2
----> 3 trainer = SFTTrainer(
4 model=model,
5 train_dataset=valid_dataset,
3 frames
/usr/local/lib/python3.10/dist-packages/accelerate/big_modeling.py in wrapper(*args, **kwargs)
451 for param in model.parameters():
452 if param.device == torch.device("meta"):
--> 453 raise RuntimeError("You can't move a model that has some modules offloaded to cpu or disk.")
454 return fn(*args, **kwargs)
455
RuntimeError: You can't move a model that has some modules offloaded to cpu or disk.
E.g. https://colab.research.google.com/drive/1T0fhyP963DHJDjUNrPMScD0L9uDfOj-w?usp=sharing
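Both errors above point at parameters that never made it onto the GPU. A quick way to confirm that is to look at the device map before building the trainer. Here is a minimal sketch: `offloaded_modules` is a hypothetical helper, and `example_map` is made-up data shaped like the dict that transformers exposes as `model.hf_device_map` when a model is loaded with `device_map='auto'`.

```python
def offloaded_modules(device_map):
    """Return the names of modules placed on CPU, disk, or meta instead of a GPU."""
    return sorted(
        name
        for name, device in device_map.items()
        if str(device) in {"cpu", "disk", "meta"}
    )


# Hypothetical device map for illustration only; on a real model you would
# pass `model.hf_device_map` instead.
example_map = {
    "model.embed_tokens": 0,
    "model.layers.0": 0,
    "model.layers.31": "cpu",
    "lm_head": "disk",
}

print(offloaded_modules(example_map))
```

If the list is non-empty, the model did not fit on the GPU as loaded, which is consistent with the "offloaded to cpu or disk" message in the traceback.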
After some tinkering and switching to 4-bit quantization as per https://github.com/Pleias/Various-Finetuning/blob/main/finetuning_jamba.py , it runs!
Example: https://colab.research.google.com/drive/1EK-PeLXfO1oOxSY5zlRmVvOzBPrYnp-d?usp=sharing
Installs
! pip install -U pip
! pip install -U transformers==4.39.2
! pip install causal-conv1d mamba-ssm
! pip install accelerate peft bitsandbytes trl
! pip install -U datasets sacrebleu evaluate
! pip install -U flash_attn
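Since several of these packages compile CUDA kernels and can fail quietly, it may help to check what actually got installed before loading the model. A small sketch using only the standard library; the distribution names in `PACKAGES` are my assumption of how pip records them, so adjust if one shows up as missing when it is not.

```python
from importlib.metadata import PackageNotFoundError, version

# Assumed distribution names for the packages installed above.
PACKAGES = [
    "transformers", "accelerate", "peft", "bitsandbytes", "trl",
    "datasets", "mamba-ssm", "causal-conv1d", "flash-attn",
]


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, ver in installed_versions(PACKAGES).items():
    print(f"{name:15s} {ver or 'NOT INSTALLED'}")
```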
Code
from datasets import load_dataset
from trl import SFTTrainer
from peft import LoraConfig
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments, BitsAndBytesConfig
import mamba_ssm
# Load the weights in 4-bit via bitsandbytes.
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    llm_int4_skip_modules=["mamba"],  # Maybe not necessary (per axolotl), but to test.
)
tokenizer = AutoTokenizer.from_pretrained("ai21labs/Jamba-v0.1")
model = AutoModelForCausalLM.from_pretrained(
    "ai21labs/Jamba-v0.1",
    trust_remote_code=True,
    device_map='auto',
    attn_implementation="flash_attention_2",
    quantization_config=quantization_config,
    use_mamba_kernels=True
)
valid_data = load_dataset("facebook/flores", "eng_Latn-deu_Latn", streaming=False, split="dev")
# From https://stackoverflow.com/q/78156752/610569
def preprocess_func(row):
    # Build a single 'text' field holding the prompt and the reference translation.
    return {'text': "Translate from English to German: <s>[INST] " + row['sentence_eng_Latn'] + " [INST] " + row['sentence_deu_Latn'] + " </s>"}
valid_dataset = valid_data.map(preprocess_func)
# Peek at the last few formatted examples.
valid_dataset['text'][-5:]
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=1,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    optim="adamw_8bit",
    max_grad_norm=0.3,
    weight_decay=0.001,
    warmup_ratio=0.03,
    gradient_checkpointing=True,
    logging_dir='./logs',
    logging_steps=1,
    max_steps=50,
    group_by_length=True,
    lr_scheduler_type="linear",
    learning_rate=2e-3
)
lora_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.05,
    init_lora_weights=False,
    r=8,
    target_modules=["embed_tokens", "x_proj", "in_proj", "out_proj"],
    task_type="CAUSAL_LM",
    bias="none"
)
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    args=training_args,
    peft_config=lora_config,
    train_dataset=valid_dataset,
    max_seq_length=256,
    dataset_text_field="text",
)
trainer.train()
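For anyone adapting the settings above, it is worth working out what they amount to. Note that a positive `max_steps` overrides `num_train_epochs` in `TrainingArguments`, so this run stops after 50 optimizer steps. The sketch below just does the arithmetic; `schedule_summary` is a hypothetical helper, and the warmup rounding (ceiling of steps times ratio) is my understanding of how transformers derives warmup steps from `warmup_ratio`.

```python
import math


def schedule_summary(per_device_batch_size, grad_accum_steps, max_steps,
                     warmup_ratio, num_devices=1):
    """Summarize how many examples and warmup steps a run configuration implies."""
    effective_batch = per_device_batch_size * grad_accum_steps * num_devices
    return {
        "effective_batch_size": effective_batch,
        "examples_seen": effective_batch * max_steps,
        # Assumption: warmup steps are rounded up from max_steps * warmup_ratio.
        "warmup_steps": math.ceil(max_steps * warmup_ratio),
    }


# Values taken from the TrainingArguments above, on a single GPU.
print(schedule_summary(per_device_batch_size=1, grad_accum_steps=4,
                       max_steps=50, warmup_ratio=0.03))
```

With these numbers the run sees 200 examples at an effective batch size of 4, with 2 warmup steps, so it is more of a smoke test than a full fine-tune.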
Can you please share the specification of the GPU you used to run the fine-tuning script above? I am having problems loading the model into memory, even on an AWS SageMaker g5.16xlarge.
It's an A100 instance on Colab, so you'll need a p4/p5 instance on AWS.
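A rough back-of-the-envelope estimate shows why the GPU size matters here. The sketch below only counts memory for the weights and assumes roughly 52B total parameters for Jamba v0.1 (the figure I recall from the model card, so treat it as an assumption). It ignores activations, LoRA parameters, optimizer state, and the fact that modules skipped via `llm_int4_skip_modules` stay in higher precision, so real usage will be higher.

```python
def weight_memory_gib(num_params, bits_per_param):
    """Approximate memory needed to hold the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 2**30


# Assumption: about 52B total parameters for Jamba v0.1.
TOTAL_PARAMS = 52e9

for label, bits in [("bf16", 16), ("int8", 8), ("4-bit", 4)]:
    print(f"{label:>5}: {weight_memory_gib(TOTAL_PARAMS, bits):6.1f} GiB")
```

Under these assumptions even the 4-bit weights come to around 24 GiB, which does not fit on the single 24 GB A10G in a g5.16xlarge but leaves headroom on a 40 GB A100.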
@alvations any specific reason you set max_grad_norm to 0.3?
I've followed https://github.com/Pleias/Various-Finetuning/blob/main/finetuning_jamba.py
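For context on what that setting does: `max_grad_norm` caps the global L2 norm of the gradients before each optimizer step. The pure-Python sketch below illustrates the idea on a toy gradient vector. `clip_by_global_norm` is a hypothetical helper that mirrors the general technique behind `torch.nn.utils.clip_grad_norm_`, not the Trainer's actual code path.

```python
import math


def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Rescale gradients so their global L2 norm is at most max_norm.

    Returns the (possibly rescaled) gradients and the norm before clipping.
    """
    total_norm = math.sqrt(sum(g * g for g in grads))
    # Only ever scale down; gradients already within the limit are untouched.
    scale = min(1.0, max_norm / (total_norm + eps))
    return [g * scale for g in grads], total_norm


# Toy gradient with norm 5.0, clipped to the 0.3 used in the script above.
clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=0.3)
norm_after = math.sqrt(sum(g * g for g in clipped))
print(norm_before, round(norm_after, 4), clipped)
```

A small cap like 0.3 limits the size of each update, which matters more than usual here given the fairly high learning rate of 2e-3.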
But I'm seeing the loss drop to zero really fast after 200+ steps, so there's definitely a lot of room for "student gradient descent", i.e. hyperparameter search.