Large-scale data processing made easy and reusable
Explore the docs »
🍫 Fondant is an open-source framework that aims to simplify and speed up large-scale data processing by making
containerized components reusable across pipelines and execution environments and shareable within the community.
It offers:
- 🧩 Plug ‘n’ play composable pipelines for creating datasets for
    - AI image generation model fine-tuning (Stable Diffusion, ControlNet)
    - Large language model fine-tuning (LLaMA, Falcon)
    - Code generation model fine-tuning (StarCoder)
- 🧱 Library of off-the-shelf reusable components for
    - Extracting data from public sources such as Common Crawl, LAION, ...
    - Filtering on
        - Content, e.g. language, visual style, topic, format, aesthetics, etc.
        - Context, e.g. copyright license, origin
        - Metadata
    - Removal of unwanted data such as toxic, NSFW or generated content
    - Removal of unwanted data patterns such as societal bias
    - Transforming data (resizing, cropping, reformatting, …)
    - Tuning the data for model performance (normalization, deduplication, …)
    - Enriching data (captioning, metadata generation, synthetics, …)
    - Transparency, auditability, compliance
- 📖 🖼️ 🎞️ ♾️ Out of the box multimodal capabilities: text, images, video, etc.
- 🐍 Standardized, Python/Pandas-based way of creating custom components
- 🏭 Production-ready, scalable deployment
- ☁️ Multi-cloud integrations
🪤 Why Fondant?
In the age of Foundation Models, control over your data is key, and building pipelines
for large-scale data processing is costly, especially when they require advanced
machine learning-based operations. This need not be the case, however, if processing
components are reusable and exchangeable and pipelines are easily composable.
Realizing this is the main vision behind Fondant.
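
To make the idea of reusable components and composable pipelines more concrete, here is a minimal plain-Python sketch. It does not use Fondant's API, and it runs in-process rather than in containers; the names `filter_language`, `deduplicate`, and `compose` are hypothetical and exist only for illustration.

```python
from typing import Callable

# Illustrative types only; these are assumptions, not Fondant's data model.
Record = dict
Component = Callable[[list], list]


def filter_language(language: str) -> Component:
    """Build a component that keeps only records in the given language."""
    def component(records: list) -> list:
        return [r for r in records if r.get("language") == language]
    return component


def deduplicate(records: list) -> list:
    """Drop records whose text was already seen, keeping the first occurrence."""
    seen = set()
    unique = []
    for record in records:
        if record["text"] not in seen:
            seen.add(record["text"])
            unique.append(record)
    return unique


def compose(*components: Component) -> Component:
    """Chain components into a pipeline that applies them in order."""
    def pipeline(records: list) -> list:
        for component in components:
            records = component(records)
        return records
    return pipeline


# Two pipelines reuse the same `deduplicate` component unchanged.
english_pipeline = compose(filter_language("en"), deduplicate)
dutch_pipeline = compose(filter_language("nl"), deduplicate)

records = [
    {"text": "hello", "language": "en"},
    {"text": "hello", "language": "en"},
    {"text": "hallo", "language": "nl"},
]
print(english_pipeline(records))  # [{'text': 'hello', 'language': 'en'}]
```

Because every component shares the same input and output shape, a component written once can be dropped into any pipeline. That is the property the paragraph above describes, shown here at toy scale.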