rwightman (Ross Wightman)

reacted to merve's post with 🚀 20 days ago

Post

3889

I ported the hottest new shape-optimized SigLIP 🔥 merve/siglip-so400m-patch16-256-i18n

if you don't want to wait for the next transformers release install transformers from my PR https://github.com/huggingface/transformers/pull/32938 and initialize SigLIP from there

posted an update 23 days ago

Post

557

A new timm release (1.0.11) is out now. A also wrote an article on one of the included models: https://huggingface.co/blog/rwightman/mambaout

Featured in the release are:
* The MambaOut model, a cheeky arch inspired by SSM but without the SSM part, a ConvNeXt with gating.
* Several timm trained MambaOut variations with arch tweaks and ImageNet-12k pretrain to verify scaling, supplement ported weights.
* The smallest MobileNetV4, a 0.5x width scaled Conv-Small.
* Two impressive MobileNetV3 Large models outperforming all previous, using MNV4 Small recipe.
* 'Zepto,' a new compact ConvNeXt variant even smaller than the previous Atto, 2.2M params, RMSNorm, and solid results for its size.
* Newly ported SigLIP SO400M/16 ViT multi-lingual weights, the largest i18n weights, prevous was B/16.
* Two ImageNet-1k fine-tuned SigLIP SO400M models at 378x378
* InternViT 300M weight port. A really solid ViT encoder distilled from OpenGVLab 6B VL model encoder.
* An assortment of very small, sub 1M param pretrained test models to improve library unit tests and serve low-resource applications.

reacted to merve's post with 🔥 about 1 month ago

Post

3700

Meta AI vision has been cooking @facebook
They shipped multiple models and demos for their papers at @ECCV 🤗

Here's a compilation of my top picks:
- Sapiens is family of foundation models for human-centric depth estimation, segmentation and more, all models have open weights and demos 👏

All models have their demos and even torchscript checkpoints!
A collection of models and demos: facebook/sapiens-66d22047daa6402d565cb2fc
- VFusion3D is state-of-the-art consistent 3D generation model from images

Model: facebook/vfusion3d
Demo: facebook/VFusion3D

- CoTracker is the state-of-the-art point (pixel) tracking model

Demo: facebook/cotracker
Model: facebook/cotracker

reacted to Wauplin's post with 🔥 about 1 month ago

Post

2681

What a great milestone to celebrate! The huggingface_hub library is slowly becoming a cornerstone of the Python ML ecosystem when it comes to interacting with the @huggingface Hub. It wouldn't be there without the hundreds of community contributions and feedback! No matter if you are loading a model, sharing a dataset, running remote inference or starting jobs on our infra, you are for sure using it! And this is only the beginning so give a star if you wanna follow the project 👉 https://github.com/huggingface/huggingface_hub

1 reply

·

reacted to abidlabs's post with 🚀🔥 about 1 month ago

Post

4022

👋 Hi Gradio community,

I'm excited to share that Gradio 5 will launch in October with improvements across security, performance, SEO, design (see the screenshot for Gradio 4 vs. Gradio 5), and user experience, making Gradio a mature framework for web-based ML applications.

Gradio 5 is currently in beta, so if you'd like to try it out early, please refer to the instructions below:

---------- Installation -------------

Gradio 5 depends on Python 3.10 or higher, so if you are running Gradio locally, please ensure that you have Python 3.10 or higher, or download it here: https://www.python.org/downloads/

* Locally: If you are running gradio locally, simply install the release candidate with pip install gradio --pre
* Spaces: If you would like to update an existing gradio Space to use Gradio 5, you can simply update the sdk_version to be 5.0.0b3 in the README.md file on Spaces.

In most cases, that’s all you have to do to run Gradio 5.0. If you start your Gradio application, you should see your Gradio app running, with a fresh new UI.

-----------------------------

Fore more information, please see: https://github.com/gradio-app/gradio/issues/9463

2 replies

·

posted an update about 2 months ago

Post

2415

A 'small' MobileNet-V4 update, I just pushed weights for the smallest model I've trained in the series, a 0.5 width multiplier version of the MobileNet-V4 Conv Small.

Now you may look at this and say hey, why is this impressive? 64.8% top-1 and 2.2M params? MobileNetV3-Small 0.75, and MobileNet-V2 0.5 are both fewer params (at ~2M) and over 65% top-1, what gives? Well this is where MobileNet-V4 differs from the previous versions of the model family, it trades off (gives up) a little parameter efficiency for some computational efficiency.

So, let's look at the speed. On a 4090 w/ torchcompile
* 98K img/sec - timm/mobilenetv4_conv_small_050.e3000_r224_in1k
* 58K img/sec - timm/mobilenetv3_small_075.lamb_in1k
* 37K img/sec - timm/mobilenetv2_050.lamb_in1k

And there you go, if you have a need for speed, MNV4 is the better option.

reacted to cbensimon's post with 🚀 about 2 months ago

Post

4219

Hello everybody,

We've rolled out a major update to ZeroGPU! All the Spaces are now running on it.

Major improvements:

1. GPU cold starts about twice as fast!
2. RAM usage reduced by two-thirds, allowing more effective resource usage, meaning more GPUs for the community!
3. ZeroGPU initializations (coldstarts) can now be tracked and displayed (use progress=gr.Progress(track_tqdm=True))
4. Improved compatibility and PyTorch integration, increasing ZeroGPU compatible spaces without requiring any modifications!

Feel free to answer in the post if you have any questions

🤗 Best regards,
Charles

reacted to merve's post with 👍 2 months ago

Post

3777

If you have documents that do not only have text and you're doing retrieval or RAG (using OCR and LLMs), give it up and give ColPali and vision language models a try 🤗

Why? Documents consist of multiple modalities: layout, table, text, chart, images. Document processing pipelines often consist of multiple models and they're immensely brittle and slow. 🥲

How? ColPali is a ColBERT-like document retrieval model built on PaliGemma, it operates over image patches directly, and indexing takes far less time with more accuracy. You can use it for retrieval, and if you want to do retrieval augmented generation, find the closest document, and do not process it, give it directly to a VLM like Qwen2-VL (as image input) and give your text query. 🤝

This is much faster + you do not lose out on any information + much easier to maintain too! 🥳

Multimodal RAG merve/multimodal-rag-66d97602e781122aae0a5139 💬
Document AI (made it way before, for folks who want structured input/output and can fine-tune a model) merve/awesome-document-ai-65ef1cdc2e97ef9cc85c898e 📖

2 replies

·

reacted to maxiw's post with 👀 2 months ago

Post

2226

The new Qwen-2 VL models seem to perform quite well in object detection. You can prompt them to respond with bounding boxes in a reference frame of 1k x 1k pixels and scale those boxes to the original image size.

You can try it out with my space maxiw/Qwen2-VL-Detection

4 replies

·

posted an update 2 months ago

Post

1262

The timm leaderboard timm/leaderboard has been updated with the ability to select different hardware benchmark sets: RTX4090, RTX3090, two different CPUs along with some NCHW / NHWC layout and torch.compile (dynamo) variations.

Also worth pointing out, there are three rather newish 'test' models that you'll see at the top of any samples/sec comparison:
* test_vit ( timm/test_vit.r160_in1k)
* test_efficientnet ( timm/test_efficientnet.r160_in1k)
* test_byobnet ( timm/test_byobnet.r160_in1k, a mix of resnet, darknet, effnet/regnet like blocks)

They are < 0.5M params, insanely fast and originally intended for unit testing w/ real weights. They have awful ImageNet top-1, it's rare to have anyone bother to train a model this small on ImageNet (the classifier is roughly 30-70% of the param count!). However, they are FAST on very limited hadware and you can fine-tune them well on small data. Could be the model you're looking for?

reacted to merve's post with 🔥 2 months ago

Post

5475

I have put together a notebook on Multimodal RAG, where we do not process the documents with hefty pipelines but natively use:
- vidore/colpali for retrieval 📖 it doesn't need indexing with image-text pairs but just images!
- Qwen/Qwen2-VL-2B-Instruct for generation 💬 directly feed images as is to a vision language model with no processing to text!
I used ColPali implementation of the new 🐭 Byaldi library by @bclavie 🤗
https://github.com/answerdotai/byaldi
Link to notebook: https://github.com/merveenoyan/smol-vision/blob/main/ColPali_%2B_Qwen2_VL.ipynb

reacted to merve's post with 🤗 2 months ago

Post

2253

amazing leaderboard by @rwightman , compare all the image backbones on various metrics against model performance

below is an example for top-k against inferred samples per second
timm/leaderboard

posted an update 3 months ago

Post

2051

The latest timm validation & test set results are now viewable by a leaderboard space: timm/leaderboard

As of yesterday, I updated all of the results for ImageNet , ImageNet-ReaL, ImageNet-V2, ImageNet-R, ImageNet-A, and Sketch sets. The csv files can be found in the GH repo https://github.com/huggingface/pytorch-image-models/tree/main/results

Unfortunately the latest benchmark csv files are not yet up to date, there are some gaps in dataset results vs throughput/flop numbers impact the plots.

h/t to @MohamedRashad for making the first timm leaderboard.

1 reply

·

posted an update 4 months ago

Post

1982

I can't resist an opportunity to update an old baseline. Read a new article on my latest look at improving MobileNet-V1 and EfficientNet-B0 baselines.

https://huggingface.co/blog/rwightman/mobilenet-baselines
timm/mobilenetv1_100.ra4_e3600_r224_in1k
timm/efficientnet_b0.ra4_e3600_r224_in1k

posted an update 5 months ago

Post

2446

MobileNetV4 weights are now in timm! So far these are the only weights for these models as the offiicial Tensorflow impl remains weightless.

Guided by paper hparams with a few tweaks, I've managed to match or beat the paper results training the medium models. I'm still working on large and improving the small result. They appear to be solid models for on-device use.

timm/mobilenetv4-pretrained-weights-6669c22cda4db4244def9637

MobileNetV4 -- Universal Models for the Mobile Ecosystem (2404.10518)

1 reply

·

reacted to abidlabs's post with 🚀 6 months ago

Post

3555

𝗣𝗿𝗼𝘁𝗼𝘁𝘆𝗽𝗶𝗻𝗴 holds an important place in machine learning. But it has traditionally been quite difficult to go from prototype code to production-ready APIs

We're working on making that a lot easier with 𝗚𝗿𝗮𝗱𝗶𝗼 and will unveil something new on June 6th: https://www.youtube.com/watch?v=44vi31hehw4&ab_channel=HuggingFace

2 replies

·

reacted to MohamedRashad's post with ❤️ 6 months ago

Post

1531

Timm Leaderboard space here:
MohamedRashad/timm-leaderboard

Thanks goes to @rwightman for building timm 🤗

posted an update 6 months ago

Post

1481

Aiming to keep timm (https://github.com/huggingface/pytorch-image-models) the go-to library for efficient image encoders for your mobile and edge devices, I've started working on an implementation of the new MobileNet-V4 model. Take a look at a short article I wrote about the model: https://huggingface.co/blog/rwightman/mobilenetv4

Paper: MobileNetV4 -- Universal Models for the Mobile Ecosystem (2404.10518)

reacted to danaaubakirova's post with 😎 6 months ago

Post

1278

The Document AI team ( @Molbap , @rwightman , @danaaubakirova ) at Hugging Face is developing a new multimodal data augmentation pipeline utilising both visual and textual aspects of document images.

Check out my latest blog post for more details:
https://huggingface.co/blog/danaaubakirova/doc-augmentation

Please, share your thoughts and suggestions with us.
And stay tuned for the updates!

Ross Wightman

AI & ML interests

Articles

Trick or ResNet Treat

Mamba Out

Tiny Test Models

Searching for better (Full) ImageNet ViT Baselines

MobileNet Baselines

MobileNet-V4 (now in timm)

Organizations

rwightman's activity