M8than / tiny_giant_filtered_pretrain

Dataset Card for TinyGiant

Dataset Summary

This dataset aims to provide a small but viable pretraining corpus for base models. The goal is a dataset large and diverse enough to train a model that learns a meaningful amount about nearly every token in the vocabulary.

Languages

English (100%)

More soon…

Vocab Coverage (and other stats)

RWKV World Tokenizer

| file | documents | max context length (tokens) | total tokens | vocab coverage | file size |
| --- | ---: | ---: | ---: | ---: | ---: |
| enwiki.jsonl | 46,180 | 54,110 | 35,413,961 | 80.41% | 159.56 MB |
| stack_exchange.jsonl | 71,160 | 20,671 | 38,983,876 | 79.48% | 148.36 MB |
| webtext.jsonl | 154,557 | 448 | 25,027,551 | 76.54% | 109.57 MB |
| code_documents.jsonl | 23,298 | 263,776 | 52,397,777 | 84.61% | 187.14 MB |
| stories.jsonl | 25,385 | 1,053 | 5,552,189 | 18.97% | 23.57 MB |
| text.jsonl | 181,030 | 146,988 | 350,672,227 | 95.67% | 1,329.66 MB |
| vn.jsonl | 190 | 2,217,608 | 57,891,290 | 63.14% | 209.89 MB |
| jupyter_to_text.jsonl | 9,701 | 45,295 | 30,927,312 | 78.75% | 112.12 MB |
| stories_smart.jsonl | 100,676 | 1,137 | 23,692,169 | 23.55% | 98.75 MB |
| totals | 612,177 | 2,217,608 | 620,558,352 | 99.24% | 2,378.63 MB |
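The per-file numbers above (documents, max context length, total tokens, vocab coverage) can all be derived in one pass over a `.jsonl` file. The sketch below shows one way to do that; it is not the script used to produce the card. The `tokenize` callable is pluggable (the card's stats used the RWKV World tokenizer), and the `"text"` field name is an assumption about the jsonl schema.

```python
import json


def file_stats(path, tokenize, vocab_size):
    """One-pass stats for a .jsonl file of documents.

    tokenize: maps a document string to a list of token ids
              (the card used the RWKV World tokenizer; any tokenizer works here).
    vocab_size: total vocabulary size, used for the coverage percentage.
    """
    documents = 0
    total_tokens = 0
    max_context = 0
    seen_ids = set()  # distinct token ids observed across all documents

    with open(path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            ids = tokenize(record["text"])  # "text" field name is an assumption
            documents += 1
            total_tokens += len(ids)
            max_context = max(max_context, len(ids))
            seen_ids.update(ids)

    coverage = 100.0 * len(seen_ids) / vocab_size
    return {
        "documents": documents,
        "max context length": max_context,
        "total tokens": total_tokens,
        "vocab coverage": round(coverage, 2),
    }
```

Vocab coverage here is the percentage of distinct vocabulary ids that appear at least once in the file, which is why the per-file percentages can be low (e.g. stories.jsonl) while the combined total reaches 99.24%.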