its5Q 
posted an update 22 days ago
Continuing my streak by releasing the Wikireading dataset: a large collection of scraped non-fiction books, predominantly in Russian.
its5Q/wikireading

Here are the highlights:
- ~7B tokens, or ~28B characters, making it a great candidate for use in pretraining
- Contains non-fiction works from many knowledge domains
- Includes both the original HTML and extracted text of book chapters
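
As a quick sanity check on the size figures above, the two rough numbers imply about 4 characters per token. The sketch below just does that arithmetic; the tokenizer behind the token count isn't stated here, and `estimate_tokens` is a hypothetical helper for illustration, not part of the dataset or any library.

```python
# Rough figures quoted in the post (both are approximate).
TOTAL_TOKENS = 7e9
TOTAL_CHARS = 28e9

# Implied average characters per token. The tokenizer is not stated,
# so treat this as a ballpark ratio only.
CHARS_PER_TOKEN = TOTAL_CHARS / TOTAL_TOKENS


def estimate_tokens(text: str, chars_per_token: float = CHARS_PER_TOKEN) -> int:
    """Hypothetical helper: rough token estimate for a piece of text."""
    return round(len(text) / chars_per_token)


if __name__ == "__main__":
    sample = "Пример текста главы. " * 100
    print(f"~{CHARS_PER_TOKEN:.1f} chars/token, ~{estimate_tokens(sample)} tokens in sample")
```

The actual ratio will vary with the tokenizer and the text, so this is only useful for back-of-the-envelope budgeting.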