Abstract
We present TinyLlama, a compact 1.1B language model pretrained on around 1 trillion tokens for approximately 3 epochs. Building on the architecture and tokenizer of Llama 2, TinyLlama leverages various advances contributed by the open-source community (e.g., FlashAttention), achieving better computational efficiency. Despite its relatively small size, TinyLlama demonstrates remarkable performance in a series of downstream tasks. It significantly outperforms existing open-source language models with comparable sizes. Our model checkpoints and code are publicly available on GitHub at https://github.com/jzhang38/TinyLlama.
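For reference, a minimal sketch of loading one of the released checkpoints with the Hugging Face transformers library (the chat variant named in the discussion below is used here as an example; other checkpoints are listed in the GitHub repository):

```python
# Minimal sketch: load a released TinyLlama checkpoint with transformers.
# The chat checkpoint below is the one mentioned in the community thread.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)

inputs = tokenizer("TinyLlama is a 1.1B parameter model that", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```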
Community
Followed the journey. Very nice work guys! I understand everything was done on a sparse budget, but can't help but wonder -- what if... you guys used an embedding-based approach to heavily de-duplicate all that data first? I feel like those benchmarks would be A LOT better. Also, this as a PoC proves SO MUCH. To me, it represents a properly trained model in terms of parameter-to-token count. Imagine the same size dataset, but of phi textbook quality? I suspect that variant of TinyLlama would be as good as gpt-3.5-turbo.
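Just to illustrate what I mean by embedding-based de-duplication, here's a rough sketch on a toy corpus (the sentence-transformers encoder and the 0.95 cosine threshold are just illustrative choices; at ~1T-token scale you'd want an approximate-nearest-neighbour index like FAISS instead of this brute-force loop):

```python
# Rough sketch of embedding-based near-duplicate removal on a toy corpus.
import numpy as np
from sentence_transformers import SentenceTransformer

docs = [
    "TinyLlama is a compact 1.1B language model.",
    "TinyLlama is a compact 1.1-billion-parameter language model.",
    "FlashAttention improves training throughput.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")
emb = model.encode(docs, normalize_embeddings=True)  # unit-norm vectors

keep = []
for i, e in enumerate(emb):
    # Keep a document only if it is not too similar to anything already kept.
    if all(np.dot(e, emb[j]) < 0.95 for j in keep):
        keep.append(i)

deduped = [docs[i] for i in keep]
print(deduped)
```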
I find it perplexing how data quality has been overlooked in general. I wish I had the compute to do something about it. It seems so blatantly obvious to me that data quality has the highest potential to create earth-shattering advances. I fully expect that in the next few years, tiny models will make GPT-4 obsolete. Like dinosaur obsolete. Also, we call them tiny, but 1 billion parameters is frigging gigantic. Quality will beat quantity every time.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- DeepSeek LLM: Scaling Open-Source Language Models with Longtermism (2024)
- LLM360: Towards Fully Transparent Open-Source LLMs (2023)
- LLaMA Pro: Progressive LLaMA with Block Expansion (2024)
- LLaMA Beyond English: An Empirical Study on Language Capability Transfer (2024)
- SparQ Attention: Bandwidth-Efficient LLM Inference (2023)
I wonder if we could shard the dataset somehow and have the community do federated data cleansing? Kind of like how LMSYS does their evaluations...
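Something along these lines, maybe: assign each document to a shard by content hash, so any volunteer can independently reproduce and verify the split (a toy sketch; the 16-shard count is arbitrary):

```python
# Toy sketch: deterministic sharding by content hash, so shard assignments
# are reproducible and verifiable by anyone re-running the split.
import hashlib

NUM_SHARDS = 16  # arbitrary, for illustration

def shard_of(doc: str) -> int:
    digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS

docs = ["example document one", "example document two"]
for d in docs:
    print(shard_of(d), d)
```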
There would be a lot of trust required to do that. I don't think I would.
If only there were some sort of distributed ledger of trust, hmm...
It still implies trusting the community in general, and that there are no hidden bad actors.
Right, and by aligning incentives financially you can programmatically reduce that risk.
Interesting idea, but the overhead would be huge and it would be incredibly slow. We'd be better off using git or a plain old database. As for trust, that's kind of the whole point of open source. We never built the LLMs, yet we use them anyway. So long as the data is out in the open, I don't see a problem.
Agreed on the speed issues; using a blockchain or something would be dreadful.
As far as trust goes, it's easier to trust (or not trust and move on from) a single commercial entity that creates base models, and then find a person you feel you can trust to further refine them. Sure, there is still trust involved, but I find it easier to trust that layout than 'random people in the community'. Yes, that is also true in other cases (the Linux kernel, for example), but there you do have 'trusted entities' reviewing things. Not saying it's not possible here too, but I'm not really sure how you'd set up a 'trusted review' governing body/committee or something, and I do think that would be needed. It would not be hard for one or two malicious people to really hose things for everyone (intentional bad info, inserting commercial data into an OSS model, etc.).
But if you all can pull it off, good luck with the idea; the concept of the shared resource isn't bad, so I'll shut up now.
hi
Hello
Can anyone help with the following information for the model TinyLlama/TinyLlama-1.1B-Chat-v1.0?
- File size
- RAM
- CPU requirements
- Hardware requirements
- Context size
As a rule of thumb: 1B parameters @ 8-bit = 1 GB of (V)RAM.
Currently the sweet spot for quantization is around 5-6 bits per weight, so 1 GB is plenty. The file size should be around 800 MB.
Requirements: let's say you have a GPU with 256 GB/s of memory bandwidth. Your model is 1 GB, so you should get roughly 256 tokens per second, since it reads the entire model for each generated token. The same applies to a CPU, but its memory bandwidth is much lower.
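Here's the back-of-the-envelope version of those numbers (assuming decoding is memory-bandwidth bound, which is the usual single-batch case; the 256 GB/s figure is just the example above):

```python
# Back-of-the-envelope sizing for a 1.1B-parameter model, assuming decoding
# is memory-bandwidth bound (each generated token reads all weights once).
params = 1.1e9            # TinyLlama parameter count
bits_per_weight = 6       # ~5-6 bit quantization, as discussed above
bandwidth_gb_s = 256      # example GPU memory bandwidth from this thread

model_gb = params * bits_per_weight / 8 / 1e9
tokens_per_s = bandwidth_gb_s / model_gb

print(f"model size  ~ {model_gb:.2f} GB")       # ~0.83 GB at 6 bits/weight
print(f"throughput  ~ {tokens_per_s:.0f} tok/s")  # ~310 tok/s at 256 GB/s
```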