fondant-ai (Fondant)

Large-scale data processing made easy and reusable

🍫 Fondant is an open-source framework that aims to simplify and speed up large-scale data processing by making containerized components reusable across pipelines and execution environments and shareable within the community.

It offers:

🔧 Plug ‘n’ play composable pipelines for creating datasets for
- AI image generation model fine-tuning (Stable Diffusion, ControlNet)
- Large language model fine-tuning (LLaMA, Falcon)
- Code generation model fine-tuning (StarCoder)
🧱 Library of off-the-shelf reusable components for
- Extracting data from public sources such as Common Crawl, LAION, ...
- Filtering on
  - Content, e.g. language, visual style, topic, format, aesthetics, etc.
  - Context, e.g. copyright license, origin
  - Metadata
- Removal of unwanted data such as toxic, NSFW or generated content
- Removal of unwanted data patterns such as societal bias
- Transforming data (resizing, cropping, reformatting, …)
- Tuning the data for model performance (normalization, deduplication, …)
- Enriching data (captioning, metadata generation, synthetics, …)
- Transparency, auditability, compliance
📖 🖼️ 🎞️ ♾️ Out-of-the-box multimodal capabilities: text, images, video, etc.
🐍 Standardized, Python/Pandas-based way of creating custom components
🏭 Production-ready, scalable deployment
☁️ Multi-cloud integrations
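
To make the idea of reusable, Pandas-based components more concrete, here is a minimal sketch in plain pandas. It is not Fondant's actual API: the `Component` alias, the helper names (`filter_min_width`, `deduplicate_urls`, `add_caption_length`, `run_pipeline`) and the column names are all assumptions made up for illustration. It only shows the general pattern of treating each step as a DataFrame-in, DataFrame-out function that can be composed into a pipeline.

```python
from typing import Callable, Iterable

import pandas as pd

# Hypothetical alias: a "component" here is any DataFrame -> DataFrame function.
# This is an illustration only, not a type defined by Fondant.
Component = Callable[[pd.DataFrame], pd.DataFrame]


def filter_min_width(min_width: int) -> Component:
    """Build a filtering component that keeps rows with 'width' >= min_width."""
    def component(df: pd.DataFrame) -> pd.DataFrame:
        return df[df["width"] >= min_width].reset_index(drop=True)
    return component


def deduplicate_urls(df: pd.DataFrame) -> pd.DataFrame:
    """Drop rows with a repeated 'url', keeping the first occurrence."""
    return df.drop_duplicates(subset="url").reset_index(drop=True)


def add_caption_length(df: pd.DataFrame) -> pd.DataFrame:
    """Enrich the data with a 'caption_length' column."""
    return df.assign(caption_length=df["caption"].str.len())


def run_pipeline(df: pd.DataFrame, components: Iterable[Component]) -> pd.DataFrame:
    """Apply each component in order, feeding one step's output into the next."""
    for component in components:
        df = component(df)
    return df


# Toy input data (made up for this example).
images = pd.DataFrame({
    "url": ["a.png", "b.png", "a.png", "c.png"],
    "width": [512, 128, 512, 1024],
    "caption": ["a cat", "a dog", "a cat", "a red bridge"],
})

result = run_pipeline(
    images,
    [filter_min_width(256), deduplicate_urls, add_caption_length],
)
print(result)
```

Because every step shares the same signature, the same function can be dropped into a different pipeline or reordered without changes, which is the kind of reuse the list above describes.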

🪤 Why Fondant?

In the age of foundation models, control over your data is key, yet building pipelines for large-scale data processing is costly, especially when they require advanced machine learning-based operations. This need not be the case if processing components are reusable and exchangeable and pipelines are easily composable. Realizing this is the main vision behind Fondant.

Datasets (2)
- fondant-ai/datacomp-small-clip
- fondant-ai/fondant-cc-25m