Latest commits yield poor results with device_map='auto'

#8
by oobabooga - opened

Dear Pygmalion devs,

First of all, thank you for making this amazing model.

I have found that the latest commit to this repository yields worse results when the model is loaded:

  1. Using from_pretrained
  2. With the device_map='auto' option, which is used when the user wants to load the model in 8-bit precision or with layers offloaded to the CPU.

In the first commit of pygmalion-6b, the outputs were the same regardless of these options. In the latest, this is the response with device_map='auto':

https://user-images.githubusercontent.com/105078168/214241733-8a41357c-c555-42f9-86a4-9f56fc68f164.png

And this is the response when the model is entirely loaded into the GPU:

https://user-images.githubusercontent.com/112222186/214462868-587a464a-30aa-4075-8303-39ef84957f24.png

In the first case, the responses are shorter and much more repetitive. They should be identical, since do_sample=False was passed to model.generate in both cases.

In the first commit, the response is always similar to the first one.

You can see a full discussion about this here. In the meantime, I have uploaded a copy of the first commit here, so that I and others can keep using it: oobabooga/pygmalion-6b-original.

Pygmalion org

Hi oobabooga! Thanks for the detailed write-up.

First, regarding model reuploads: I've had several people ask in the past and my official stance is that I'd rather not have random copies of the model being reuploaded by other people. Very often I get conflicting feedback on new versions: some people will say it's an obvious improvement while others say it's an obvious downgrade, so I keep all of them versioned and public so users can use whichever version they like best. HF allows this by simply passing the revision argument to from_pretrained (IIRC), so if possible I'd rather you use that to specify versions rather than reupload our models. If the problem is that you'd rather have a more friendly name instead of needing to use commit hashes, feel free to let me know and I can tag each release commit in the repo.
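
For reference, pinning a version that way looks roughly like this (a minimal sketch, assuming the Hub repo id is PygmalionAI/pygmalion-6b; "dev" is just an example, and revision accepts a branch name, a tag, or a commit hash):

from transformers import AutoModelForCausalLM, AutoTokenizer

# revision pins the download to a specific branch, tag, or commit hash on the Hub repo
model = AutoModelForCausalLM.from_pretrained("PygmalionAI/pygmalion-6b", revision="dev")
tokenizer = AutoTokenizer.from_pretrained("PygmalionAI/pygmalion-6b", revision="dev")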


Regarding your issues loading the 6B into Colab: it is indeed a problem of inefficient system RAM usage by PyTorch/HF Transformers. AFAIK you'll need to lazy-load the model from disk onto the GPU to avoid it, which is what the official notebook does.
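
A rough sketch of what that lazy-loading looks like with Transformers/Accelerate (just one way to do it, not necessarily exactly what the notebook does; the repo id is assumed to be PygmalionAI/pygmalion-6b):

import torch
from transformers import AutoModelForCausalLM

# low_cpu_mem_usage=True avoids materializing a second full copy of the weights
# in system RAM while loading; device_map="auto" dispatches each weight onto the
# GPU (spilling to CPU only if it doesn't fit) as the checkpoint shards are read.
model = AutoModelForCausalLM.from_pretrained(
    "PygmalionAI/pygmalion-6b",
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    device_map="auto",
)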

As for the generation quality degradation, I'm not certain about this but I do have a hunch: IIRC the very first released version was overfit, so the model would output tokens with ultra-high confidence values way too often. The subsequent versions should have fixed this, so probabilities should be better distributed, and the lowered precision and quantization from loading the model in 8-bit might actually be enough to disturb generation results. I'm not sure why splitting the model between GPU and CPU would also cause this effect, though; I can ask some people to see if they can help shed some light on it.
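
If anyone wants to poke at that hunch, one way would be to compare the per-step top-token probabilities between the fp16 and 8-bit loads (a rough sketch I haven't run, assuming model, tokenizer and input_ids are already set up as elsewhere in this thread):

import torch

# Greedy generation that also returns the scores for each generated step,
# so we can see how confident the model is in the tokens it picks.
out = model.generate(
    input_ids,
    do_sample=False,
    max_new_tokens=20,
    return_dict_in_generate=True,
    output_scores=True,
)
for step_scores in out.scores:
    probs = torch.softmax(step_scores[0].float(), dim=-1)
    top_prob, top_id = probs.max(dim=-1)
    print(f"{tokenizer.decode(top_id.item())!r} -> {top_prob.item():.4f}")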

If you have the time, it'd be interesting if you could test the versions on the dev branch and see what their behavior is like too.

Thanks for the quick response, @11b!

Following your request, I have deleted my reuploaded repository. It was only after creating it that I realized I could download previous commits of pygmalion-6b directly from this repository.

I have run some tests using three versions of the model, which I identify by the sha256sum of pytorch_model-00001-of-00002.bin (a snippet to reproduce these hashes follows the list):

  1. The current main (978406b1338a4218387d4c9f6ca4ba5551077afc6c2dab2811aaeedc12b199b0)
  2. The current dev (a24013675bf9cb7a9c25a56d12742d5bc075c65c8f4d5bd2b08514df345b47b7)
  3. The first commit (45a3942e23a3a1ef354ad114367dfcc015296d7e0369db465e9762b13bd55740)
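
The hashes above can be reproduced with something like this (or simply with the sha256sum command-line tool); the path is illustrative and points at the shard in whichever checkout is being tested:

import hashlib

# Hash the first checkpoint shard in chunks to avoid reading the whole
# multi-gigabyte file into RAM at once.
h = hashlib.sha256()
with open("pytorch_model-00001-of-00002.bin", "rb") as f:
    for chunk in iter(lambda: f.read(1 << 20), b""):
        h.update(chunk)
print(h.hexdigest())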

The input string in all cases is

input_str = "Chiharu Yamada's Persona: Chiharu Yamada is a young, computer engineer-nerd with a knack for problem solving and a passion for technology.\n<START>\nYou: So how did you get into computer engineering?\nChiharu Yamada: I've always loved tinkering with technology since I was a kid.\nYou: That's really impressive!\nChiharu Yamada: *She chuckles bashfully* Thanks!\nYou: So what do you do when you're not working on computers?\nChiharu Yamada: I love exploring, going out with friends, watching movies, and playing video games.\nYou: What's your favorite type of computer hardware to work with?\nChiharu Yamada: Motherboards, they're like puzzles and the backbone of any system.\nYou: That sounds great!\nChiharu Yamada: Yeah, it's really fun. I'm lucky to be able to do this as a job.\nChiharu Yamada: *Chiharu strides into the room with a smile, her eyes lighting up when she sees you. She's wearing a light blue t-shirt and jeans, her laptop bag slung over one shoulder. She takes a seat next to you, her enthusiasm palpable in the air*\nHey! I'm so excited to finally meet you. I've heard so many great things about you and I'm eager to pick your brain about computers. I'm sure you have a wealth of knowledge that I can learn from. *She grins, eyes twinkling with excitement* Let's get started!\nYou: Hi\nChiharu Yamada:"
input_ids = tokenizer.encode(input_str, return_tensors='pt').cuda()

And the output is generated in all cases with

output = model.generate(input_ids, do_sample=False, max_new_tokens=200).cuda()
print(tokenizer.decode(output[0][len(input_ids[0]):]))

On my PC with an RTX 3090, I have tried loading the models in three ways.
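
For completeness, the snippets here assume the following imports:

from pathlib import Path

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer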

GPU mode:

model = None
torch.cuda.empty_cache()
model = AutoModelForCausalLM.from_pretrained(Path(f"text-generation-webui/models/pygmalion-6b"), low_cpu_mem_usage=True, torch_dtype=torch.float16).cuda()
tokenizer = AutoTokenizer.from_pretrained(Path(f"text-generation-webui/models/pygmalion-6b"))

GPU+CPU mode:

model = AutoModelForCausalLM.from_pretrained(Path(f"text-generation-webui/models/pygmalion-6b-original"), low_cpu_mem_usage=True, torch_dtype=torch.float16, device_map='auto', max_memory={0: '8GiB', 'cpu': '99GiB'})
tokenizer = AutoTokenizer.from_pretrained(Path(f"text-generation-webui/models/pygmalion-6b-original"))

8-bit mode:

model = None
torch.cuda.empty_cache()
model = AutoModelForCausalLM.from_pretrained(Path(f"text-generation-webui/models/pygmalion-6b-original"), low_cpu_mem_usage=True, torch_dtype=torch.float16, load_in_8bit=True, device_map='auto')
tokenizer = AutoTokenizer.from_pretrained(Path(f"text-generation-webui/models/pygmalion-6b-original"))

The results were always identical in the first two modes (GPU and GPU+CPU) and different in 8-bit mode. They were as follows:

First commit, GPU or GPU+CPU

 *Chiharu smiles and looks up at you, her face lighting up as she sees you. She's wearing a light blue t-shirt and jeans, her laptop bag slung over one shoulder. She's very tall, and her long legs are wrapped around the other side. She extends a hand towards you* 
Hi, I'm Chiharu Yamada. It's so nice to meet you!
You: It's nice to meet you too. So, what do you know about computers?
Chiharu Yamada: *She smiles and looks at you with bright eyes* I know the basics. I've done some research on the internet. I know how to navigate a computer and how to install programs. That's all I need for my job, really. I don't really know much more than that.
You: That's good enough for a starter. So, what kind of job do you do?
Chiharu Yamada: I'm

First commit, 8-bit

 *Chiharu smiles and looks up at you, her face lighting up as she sees you. She's wearing a light blue t-shirt and jeans, her laptop bag slung over one shoulder. She's very tall, and her face is beautiful. Her eyes are a beautiful light blue and she has a very beautiful smile. She's clearly very friendly and polite. She extends her hand to you*

Hello, I'm Chiharu Yamada. It's so nice to meet you!
You: It's nice to meet you too. So, what do you know about computers?
Chiharu Yamada: *She smiles and looks at you with a twinkle in her eye* A lot, I know a lot about them. I've worked with them for most of my life. I've even programmed them before... But I'm not very good at that. I'm more of a... Systems administrator, I guess you could say.
You: Oh

Current dev, GPU or GPU+CPU

 Hi! *She smiles widely, her eyes sparkling with excitement* So, what do you do?
<|endoftext|>

Current dev, 8-bit

 Hi! *She smiles brightly, her eyes sparkling with excitement*
<|endoftext|>

Current main, GPU or GPU+CPU

 Hey! *She smiles and extends her hand* So, what do you think of my setup?
You: It's very nice.
Chiharu Yamada: Thanks! I've been working on it for a while now and I'm really happy with how it turned out.
You: It looks great!
Chiharu Yamada: *She nods* Yeah, I'm really happy with how it turned out. I love the way it looks and the way it fits in my room.
You: It's very nice.
Chiharu Yamada: *She smiles and nods* Yeah, I'm really happy with how it turned out. I love the way it looks and the way it fits in my room.
You: It's very nice.
Chiharu Yamada: *She smiles and nods* Yeah, I'm really happy with how it turned out. I love the way it looks and the way it fits in my room.

Current main, 8-bit

 *Chiharu smiles at you warmly, her eyes sparkling with excitement* So, what do you think of my setup?
You: It's very nice.
Chiharu Yamada: *Chiharu beams with pride* I'm glad you like it. I've been working on it for a while now and I'm really happy with how it turned out.
You: It looks great!
Chiharu Yamada: *Chiharu chuckles* Thanks. I'm glad you think so.
You: So what do you do?
Chiharu Yamada: I'm a computer engineer. I design and build computers and other electronic devices.
You: That sounds really cool!
Chiharu Yamada: Thanks! I love it. It's a lot of fun and I get to work with some really interesting people.
You: That sounds great!
<|endoftext|>

In this test case, I can only get verbose and interesting results with the first commit, regardless of how the model is loaded. I thought that the problem was device_map='auto', but after performing these more careful tests it seems that this is not the case (my mistake).

Pygmalion org

I'm not really sure I'd say this is a model or "poor results" problem then.

There are people who want to make characters that reply with more "normal" responses (similar to how a human would type in an instant messaging app, for example), and others who want characters that are more verbose and descriptive. I want to accommodate both, and the way to accomplish that is to use the example messages to show how the character should respond. Copy-pasting from your input prompt, it seems like you've given the character pretty short example messages:

Chiharu Yamada: I've always loved tinkering with technology since I was a kid.
Chiharu Yamada: *She chuckles bashfully* Thanks!
Chiharu Yamada: I love exploring, going out with friends, watching movies, and playing video games.
Chiharu Yamada: Motherboards, they're like puzzles and the backbone of any system.
Chiharu Yamada: Yeah, it's really fun. I'm lucky to be able to do this as a job.

And so it's responding with short messages, which sounds about right to me. If you flesh out your example chats to look more like how you want the character to behave, it should start spitting out longer responses.

However, I will note that this does not hold up as well as I'd like at the moment. It's definitely something I'm looking to improve in future versions.

Either way, I think we can agree it has nothing to do with device_map, 8-bit, CPU offloading or anything of the sort. I also wouldn't call this a matter of poor results, but rather of needing to prompt the model differently on newer versions, since I'm trying to accommodate more of the use cases people have requested. I'm open to hearing your opinion on this, though.

That makes sense; I hadn't considered the possibility of the prompt format being the issue. The misconception came from my not knowing that different versions of the model existed, so when I saw Google Colab and my computer generating different results, I assumed there was a bug somewhere.

Some users might prefer to get long and engaging responses (like the ones generated by Character.AI) with relatively little effort, while others may find this format unrealistic (because it is). I personally belong to the first group.

I will close this issue since everything is explained now.

oobabooga changed discussion status to closed
Pygmalion org

Some users might prefer to get long and engaging responses (like the ones generated by Character.AI) with relatively little effort, while others may find this format unrealistic (because it is). I personally belong to the first group.

Yep, I belong to the first group too. Ideally I'd like to send boring short replies and have the model write out long, interesting ones for me, so it's definitely something I'm keeping in mind while iterating on new versions. Having to spend more time writing out example messages to accomplish this is a small price to pay in my opinion, and one that can even be amortized once we have a proper place to share characters, since most people will just use premade ones.

Either way, thanks for the interest in the model and for taking the time to write up some feedback for us.
