mayank-mishra (Mayank Mishra)

posted an update 6 months ago

Post

1721

New preprint out with colleagues from MIT and IBM Research

Reducing Transformer Key-Value Cache Size with Cross-Layer Attention (2405.12981)

We introduce a simple mechanism of sharing keys and values across layers, reducing the memory needed for KV cache during inference!!

1 reply

·

reacted to their post with ❤️ 6 months ago

Post

2483

Thrilled to unveil DS-MoE: a dense training and sparse inference scheme for enhanced computational and memory efficiency in your MoE models! 🚀🚀🚀

Discover more in our blog: https://huggingface.co/blog/bpan/ds-moe and dive into the details with our paper: Dense Training, Sparse Inference: Rethinking Training of Mixture-of-Experts Language Models (2404.05567)

1 reply

·

posted an update 7 months ago

Post

2483

Thrilled to unveil DS-MoE: a dense training and sparse inference scheme for enhanced computational and memory efficiency in your MoE models! 🚀🚀🚀

Discover more in our blog: https://huggingface.co/blog/bpan/ds-moe and dive into the details with our paper: Dense Training, Sparse Inference: Rethinking Training of Mixture-of-Experts Language Models (2404.05567)

1 reply

·

reacted to akhaliq's post with 👀❤️ 7 months ago

Post

2191

Aurora-M

The First Open Source Multilingual Language Model Red-teamed according to the U.S. Executive Order

Aurora-M: The First Open Source Multilingual Language Model Red-teamed according to the U.S. Executive Order (2404.00399)

Pretrained language models underpin several AI applications, but their high computational cost for training limits accessibility. Initiatives such as BLOOM and StarCoder aim to democratize access to pretrained models for collaborative community development. However, such existing models face challenges: limited multilingual capabilities, continual pretraining causing catastrophic forgetting, whereas pretraining from scratch is computationally expensive, and compliance with AI safety and development laws. This paper presents Aurora-M, a 15B parameter multilingual open-source model trained on English, Finnish, Hindi, Japanese, Vietnamese, and code. Continually pretrained from StarCoderPlus on 435 billion additional tokens, Aurora-M surpasses 2 trillion tokens in total training token count. It is the first open-source multilingual model fine-tuned on human-reviewed safety instructions, thus aligning its development not only with conventional red-teaming considerations, but also with the specific concerns articulated in the Biden-Harris Executive Order on the Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence. Aurora-M is rigorously evaluated across various tasks and languages, demonstrating robustness against catastrophic forgetting and outperforming alternatives in multilingual settings, particularly in safety evaluations.

reacted to huu-ontocord's post with 🔥❤️ 7 months ago

Post

1601

We would like to announce our Aurora-M multilingual models which is based on Starcoderplus.
Twitter: https://twitter.com/ontocord/status/1772778544051155029
LinkedIn: https://www.linkedin.com/feed/update/urn:li:activity:7178521998845759488/
Blog post: https://huggingface.co/blog/mayank-mishra/aurora
Arxiv: Aurora-M: The First Open Source Multilingual Language Model Red-teamed according to the U.S. Executive Order (2404.00399)

Current LLMs are very susceptible to generating toxic, harmful and even dangerous content. They can also generate outputs with gender or racial biases. The Biden-Harris Executive Order https://www.federalregister.gov/documents/2023/11/01/2023-24283/safe-secure-and-trustworthy-development-and-use-of-artificial-intelligence) sets forth guidelines on what is considered a safe AI system.
Following up on these guidelines, we present the world's first open source Biden-Harris Executive Order Red teamed Multilingual Language Model: Aurora-M. Inspired by BigScience, the model is trained on 5 languages: English, Hindi, Japanese, Vietnamese and Finnish.

* Red teamed model: aurora-m/aurora-m-biden-harris-redteamed tuned according to the order mentioned above)
* Base model: aurora-m/aurora-m-base (not safety tuned)
* Instruct model: aurora-m/aurora-m-instruct (not safety tuned)

@mayank-mishra @cabbage972 @sted97 @Xa9aX @Taishi-N324 @Muennighoff @vumichien @prateeky2806 @felfri @spyysalo and many many others!

replied to ordagan's post 8 months ago

thanks

reacted to their post with 🔥 8 months ago

Post

1895

Current LLMs are very susceptible to generating toxic, harmful and even dangerous content. They can also generate outputs with gender or racial biases.

The Biden-Harris Executive Order (https://www.federalregister.gov/documents/2023/11/01/2023-24283/safe-secure-and-trustworthy-development-and-use-of-artificial-intelligence) sets forth guidelines on what is considered a safe AI system.

Following up on these guidelines, we present the world's first open source Biden-Harris Executive Order Red teamed Multilingual Language Model: Aurora-M.

The model is trained on 5 languages: English, Hindi, Japanese, Vietnamese and Finnish.

Blog: https://huggingface.co/blog/mayank-mishra/aurora
Paper coming out soon.

Base model: aurora-m/aurora-m-base (not safety tuned)
Instruct model: aurora-m/aurora-m-instruct (not safety tuned)
Red teamed model: aurora-m/aurora-m-biden-harris-redteamed (safety tuned according to the order mentioned above)

posted an update 8 months ago

Post

1895

Current LLMs are very susceptible to generating toxic, harmful and even dangerous content. They can also generate outputs with gender or racial biases.

The Biden-Harris Executive Order (https://www.federalregister.gov/documents/2023/11/01/2023-24283/safe-secure-and-trustworthy-development-and-use-of-artificial-intelligence) sets forth guidelines on what is considered a safe AI system.

Following up on these guidelines, we present the world's first open source Biden-Harris Executive Order Red teamed Multilingual Language Model: Aurora-M.

The model is trained on 5 languages: English, Hindi, Japanese, Vietnamese and Finnish.

Blog: https://huggingface.co/blog/mayank-mishra/aurora
Paper coming out soon.

Base model: aurora-m/aurora-m-base (not safety tuned)
Instruct model: aurora-m/aurora-m-instruct (not safety tuned)
Red teamed model: aurora-m/aurora-m-biden-harris-redteamed (safety tuned according to the order mentioned above)

reacted to their post with 🤝 8 months ago

Post

I have just published my first blog post.

While FlashAttention has been readily integrated into HuggingFace transformers, there are much higher gains to be had (at least theoretically) for finetuning models with examples of variable sequence lengths in a batch.

For a deeper dive, please go through my blog at https://huggingface.co/blog/mayank-mishra/padding-free-transformer.

10 replies

·

replied to their post 8 months ago

Thanks a lot @julien-c
means a lot coming from you :)

replied to their post 8 months ago

This comment has been hidden

replied to their post 8 months ago

@joaogante I am adding a new architecture for this: https://github.com/huggingface/transformers/pull/29578
It supports both padding free and normal transformers

reacted to their post with ❤️ 8 months ago

Post

I have just published my first blog post.

While FlashAttention has been readily integrated into HuggingFace transformers, there are much higher gains to be had (at least theoretically) for finetuning models with examples of variable sequence lengths in a batch.

For a deeper dive, please go through my blog at https://huggingface.co/blog/mayank-mishra/padding-free-transformer.

10 replies

·

replied to their post 8 months ago

yeah, its just that people have not been using this for finetuning where it can give considerable memory savings. I guess the issue is the core design of HF transformers.

I am planning to release the code for this sometime soon :)

posted an update 8 months ago

Post

I have just published my first blog post.

While FlashAttention has been readily integrated into HuggingFace transformers, there are much higher gains to be had (at least theoretically) for finetuning models with examples of variable sequence lengths in a batch.

For a deeper dive, please go through my blog at https://huggingface.co/blog/mayank-mishra/padding-free-transformer.

10 replies

·

reacted to ybelkada's post with ❤️ 8 months ago

Post

Check out quantized weights from ISTA-DAS Lab directly in their organisation page: https://huggingface.co/ISTA-DASLab ! With official weights of AQLM (for 2bit quantization) & QMoE (1-bit MoE quantization)

Read more about these techniques below:

AQLM paper: Extreme Compression of Large Language Models via Additive Quantization (2401.06118)
QMoE: QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models (2310.16795)

Some useful links below:

AQLM repo: https://github.com/Vahe1994/AQLM
How to use AQLM & transformers: https://huggingface.co/docs/transformers/quantization#aqlm
How to use AQLM & PEFT: https://huggingface.co/docs/peft/developer_guides/quantization#aqlm-quantizaion

Great work from @BlackSamorez and team !

reacted to loubnabnl's post with 🤯❤️ 9 months ago

Post

⭐ Today we’re releasing The Stack v2 & StarCoder2: a series of 3B, 7B & 15B code generation models trained on 3.3 to 4.5 trillion tokens of code:

- StarCoder2-15B matches or outperforms CodeLlama 34B, and approaches DeepSeek-33B on multiple benchmarks.
- StarCoder2-3B outperforms StarCoderBase-15B and similar sized models.
- The Stack v2 a 4x larger dataset than the Stack v1, resulting in 900B unique code tokens 🚀
As always, we released everything from models and datasets to curation code. Enjoy!

🔗 StarCoder2 collection: bigcode/starcoder2-65de6da6e87db3383572be1a
🔗 Paper: https://drive.google.com/file/d/17iGn3c-sYNiLyRSY-A85QOzgzGnGiVI3/view
🔗 BlogPost: https://huggingface.co/blog/starcoder2
🔗 Code Leaderboard: bigcode/bigcode-models-leaderboard

Mayank Mishra

AI & ML interests

Articles

Improving Hugging Face Training Efficiency Through Packing with Flash Attention

Saving Memory Using Padding-Free Transformer Layers during Finetuning

Aurora-M: The First Open Source Biden-Harris Executive Order Red teamed Multilingual Language Model

Organizations

mayank-mishra's activity