hlky

AI & ML interests

DATA

hlky's activity

posted an update 3 months ago
replied to their post 3 months ago

Thanks, that's helpful. Currently the title and description are not ideal for this kind of filtering. We'd need all the images captioned and classes/categories extracted from the captions. Captioning the set is planned; I'm building https://github.com/bigdata-pw/florence-tool for that purpose.

Another (very exciting) project is my priority right now, but I will aim to get an initial version of this UI out soon. It will focus on image datasets like Flickr, with a gallery-type view for quick review and selection, plus filtering options. As mentioned, text-based filtering will be of limited use until captions/classes are available. Still, it will be useful to filter on available image resolutions, view count (popularity), etc.

For reference, the image sizes (url_n, url_w, url_m, etc.) are documented here: https://www.flickr.com/services/api/misc.urls.html
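
To give a rough idea of the resolution and popularity filtering described above, here is a minimal sketch in plain Python. Only the `url_n`, `url_w` and `url_m` field names come from this reply; the `views` field, the `filter_rows` helper and the example rows are assumptions for illustration and may not match the actual dataset schema.

```python
def filter_rows(rows, required_sizes=("url_w",), min_views=0, views_field="views"):
    """Keep rows that have every required size URL and at least `min_views` views.

    `views_field` is a hypothetical column name; adjust it to the real schema.
    """
    kept = []
    for row in rows:
        # A size counts as available only if the field exists and is non-empty.
        has_sizes = all(row.get(field) for field in required_sizes)
        # Missing or null view counts are treated as zero.
        views = int(row.get(views_field) or 0)
        if has_sizes and views >= min_views:
            kept.append(row)
    return kept


# Made-up example rows (placeholder URLs, not real Flickr links).
rows = [
    {"id": "1", "url_n": "https://example.invalid/1_n.jpg",
     "url_w": "https://example.invalid/1_w.jpg", "views": 120},
    {"id": "2", "url_n": "https://example.invalid/2_n.jpg",
     "url_w": None, "views": 900},
    {"id": "3", "url_w": "https://example.invalid/3_w.jpg", "views": 5},
]

popular = filter_rows(rows, required_sizes=("url_w",), min_views=100)
print([row["id"] for row in popular])
```

Something along these lines could sit behind the filtering options of a gallery view. For the full dataset the same check would more likely run as a query over the Parquet files than over Python dictionaries.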

replied to their post 3 months ago

Sure, it's a fun and useful project. I've made a start already with some of the basic features. If you could tell me more about how you're expecting it to work and what the user interface should be like, that would help refine it.

replied to their post 3 months ago

This article should get you started: https://huggingface.co/blog/hlky/processing-parquets-102

We'll cover more advanced topics like downloading into WebDatasets, which is recommended if you want millions of images, in later articles.

If there's any specific kind of filtering you'd like to see covered or anything else just let me know, always happy to help!

replied to their post 3 months ago

Please refrain from advertising your service on my post, thanks!

posted an update 3 months ago
BIG update dropped for bigdata-pw/Flickr - now ~515M images! Target for the next update: 1B

In case you missed them, other recent drops include bigdata-pw/Dinosaurs, a small set of BIG creatures 🦕🦖, and the first in a series of articles about the art of web scraping! https://huggingface.co/blog/hlky/web-scraping-101 https://huggingface.co/blog/hlky/web-scraping-102

Stay tuned for exciting datasets and models coming soon:
- PC and Console game screenshots
- TV/Film actors biographies and photos (think facial recognition and automatic captioning!)
- bigdata-pw/lyrics-gpt v2
- and more!
  • 11 replies
posted an update 3 months ago
Announcing another BIG data drop! This time it's ~275M images from Flickr bigdata-pw/Flickr

Data acquisition for this project is still in progress, so get ready for an update soon™

In case you missed them, other BIG data drops include Diffusion1B (bigdata-pw/Diffusion1B), ~1.23B images and generation parameters from a variety of diffusion models. And if you fancy practicing diffusion model training, check out Dataception (bigdata-pw/Dataception), a dataset of over 5000 datasets in WebDataset format!

Requests are always welcome so reach out if there's a dataset you'd like to see!
  • 1 reply