Abstract
We present TinyLlama, a compact 1.1B language model pretrained on around 1 trillion tokens for approximately 3 epochs. Building on the architecture and tokenizer of Llama 2, TinyLlama leverages various advances contributed by the open-source community (e.g., FlashAttention), achieving better computational efficiency. Despite its relatively small size, TinyLlama demonstrates remarkable performance in a series of downstream tasks. It significantly outperforms existing open-source language models with comparable sizes. Our model checkpoints and code are publicly available on GitHub at https://github.com/jzhang38/TinyLlama.
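For reference, a minimal sketch of loading one of the released checkpoints with the Hugging Face transformers library (the chat variant named in the discussion below is used here as an example; other checkpoints are listed in the GitHub repository):

```python
# Minimal sketch: load a released TinyLlama checkpoint with transformers.
# The chat checkpoint below is the one mentioned in the community thread.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)

inputs = tokenizer("TinyLlama is a 1.1B parameter model that", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```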
Community
Followed the journey. Very nice work guys! I understand everything was done on a sparse budget, but can't help but wonder -- what if... you guys used an embedding-based approach to heavily de-duplicate all that data first? I feel like those benchmarks would be A LOT better. Also, this as a PoC proves SO MUCH. To me, it represents a properly trained model in terms of parameter-to-token count. Imagine the same size dataset, but of phi textbook quality? I suspect that variant of TinyLlama would be as good as gpt-3.5-turbo.
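Just to illustrate what I mean by embedding-based de-duplication, here's a rough sketch on a toy corpus (the sentence-transformers encoder and the 0.95 cosine threshold are just illustrative choices; at ~1T-token scale you'd want an approximate-nearest-neighbour index like FAISS instead of this brute-force loop):

```python
# Rough sketch of embedding-based near-duplicate removal on a toy corpus.
import numpy as np
from sentence_transformers import SentenceTransformer

docs = [
    "TinyLlama is a compact 1.1B language model.",
    "TinyLlama is a compact 1.1-billion-parameter language model.",
    "FlashAttention improves training throughput.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")
emb = model.encode(docs, normalize_embeddings=True)  # unit-norm vectors

keep = []
for i, e in enumerate(emb):
    # Keep a document only if it is not too similar to anything already kept.
    if all(np.dot(e, emb[j]) < 0.95 for j in keep):
        keep.append(i)

deduped = [docs[i] for i in keep]
print(deduped)
```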
I find it perplexing how data quality has been overlooked in general. I wish I had the compute to do something about it. It seems so blatantly obvious to me that data quality has the highest potential to create earth-shattering advances. I fully expect that in the next few years, tiny models will make GPT-4 obsolete. Like dinosaur obsolete. Also, we call them tiny, but 1 billion parameters is frigging gigantic. Quality will beat quantity every time.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- DeepSeek LLM: Scaling Open-Source Language Models with Longtermism (2024)
- LLM360: Towards Fully Transparent Open-Source LLMs (2023)
- LLaMA Pro: Progressive LLaMA with Block Expansion (2024)
- LLaMA Beyond English: An Empirical Study on Language Capability Transfer (2024)
- SparQ Attention: Bandwidth-Efficient LLM Inference (2023)
I wonder if we could shard the dataset somehow and have the community do federated data cleansing? Kind of like how LMSYS does their evaluations...
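Something along these lines, maybe: assign each document to a shard by content hash, so any volunteer can independently reproduce and verify the split (a toy sketch; the 16-shard count is arbitrary):

```python
# Toy sketch: deterministic sharding by content hash, so shard assignments
# are reproducible and verifiable by anyone re-running the split.
import hashlib

NUM_SHARDS = 16  # arbitrary, for illustration

def shard_of(doc: str) -> int:
    digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS

docs = ["example document one", "example document two"]
for d in docs:
    print(shard_of(d), d)
```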
There would be a lot of trust required to do that. I don't think I would.
If only there were some sort of distributed ledger of trust, hmm...
It still implies trusting the community in general, and that there are no hidden bad actors.
Right, and by aligning incentives financially you can programmatically reduce that risk.
Interesting idea, but the overhead would be huge and it would be incredibly slow. We'd be better off using git or a plain old database. As for trust, that's kind of the whole point of open source. We never built the LLMs, yet we use them anyway. So long as the data is out in the open, I don't see a problem.
Agreed on the speed issues; using a blockchain or something would be dreadful.
As far as trust goes, it's easier to trust (or not trust and move on from) a single commercial entity that creates base models, and then find a person you feel you can trust to further refine them. Sure, there is still trust involved, but I find it easier to trust that layout than 'random people in the community'. Yes, that is also true in other cases (the Linux kernel, for example), but there you do have 'trusted entities' reviewing things. Not saying it's not possible here too, but I'm not really sure how you'd set up a 'trusted review' governing body/committee or something, and I do think that would be needed. It would not be hard for one or two malicious people to really hose things for everyone (intentional bad info, inserting commercial data into an OSS model, etc.).
But if you all can pull it off, good luck with the idea; the concept of the shared resource isn't bad, so I'll shut up now.
hi
Hello
Can anyone help with the following information for the model TinyLlama/TinyLlama-1.1B-Chat-v1.0?
- File size
- RAM
- CPU requirements
- Hardware requirements
- Context size
As a rule of thumb: 1B parameters @ 8-bit = 1 GB of (V)RAM.
Currently the sweet spot for quantization is around 5-6 bits per weight, so 1 GB is plenty. The file size should be around 800 MB.
Requirements: let's say you have a GPU with 256 GB/s of memory bandwidth. Your model is 1 GB, so you should get roughly 256 tokens per second, since it reads the entire model for each generated token. The same applies to a CPU, but its memory bandwidth is much lower.
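Here's the back-of-the-envelope version of those numbers (assuming decoding is memory-bandwidth bound, which is the usual single-batch case; the 256 GB/s figure is just the example above):

```python
# Back-of-the-envelope sizing for a 1.1B-parameter model, assuming decoding
# is memory-bandwidth bound (each generated token reads all weights once).
params = 1.1e9            # TinyLlama parameter count
bits_per_weight = 6       # ~5-6 bit quantization, as discussed above
bandwidth_gb_s = 256      # example GPU memory bandwidth from this thread

model_gb = params * bits_per_weight / 8 / 1e9
tokens_per_s = bandwidth_gb_s / model_gb

print(f"model size  ~ {model_gb:.2f} GB")       # ~0.83 GB at 6 bits/weight
print(f"throughput  ~ {tokens_per_s:.0f} tok/s")  # ~310 tok/s at 256 GB/s
```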