Dataset Card for Oasis-Corpus
Dataset Description
Oasis-Corpus is a 783 GB high-quality bilingual (Chinese and English) corpus.
All data in Oasis-Corpus is built by Oasis and sourced from Common Crawl. It consists of 374 GB of Chinese text from 17 recent dumps and 409 GB of English text from 5 dumps.
Languages
English (409 GB, 70,121,125 lines) and Chinese (374 GB, 110,580,964 lines)
Data Splits
Language | Dump | Docs | Size |
---|---|---|---|
Chinese | cc-may-jun-2023-zh | 5,627,020 | 19.31 GB |
Chinese | cc-mar-apr-2023-zh | 5,548,376 | 19.22 GB |
Chinese | cc-jan-feb-2023-zh | 5,369,296 | 18.55 GB |
Chinese | cc-sep-oct-2022-zh | 6,156,501 | 20.86 GB |
Chinese | cc-aug-2022-zh | 4,971,629 | 17.14 GB |
Chinese | cc-jun-jul-2022-zh | 5,566,643 | 18.85 GB |
Chinese | cc-may-2022-zh | 6,408,203 | 21.53 GB |
Chinese | cc-jan-2022-zh | 6,853,895 | 22.70 GB |
Chinese | cc-oct-2021-zh | 7,975,739 | 26.35 GB |
Chinese | cc-sep-2021-zh | 7,371,460 | 24.69 GB |
Chinese | cc-jul-aug-2021-zh | 6,643,794 | 22.17 GB |
Chinese | cc-jun-2021-zh | 6,509,108 | 22.25 GB |
Chinese | cc-may-2021-zh | 5,142,078 | 17.63 GB |
Chinese | cc-apr-2021-zh | 7,284,775 | 24.32 GB |
Chinese | cc-jan-2021-zh | 8,133,760 | 27.19 GB |
Chinese | cc-nov-dec-2020-zh | 6,834,254 | 23.49 GB |
Chinese | cc-oct-2020-zh | 8,184,433 | 27.40 GB |
English | cc-may-jun-2023-en | 15,712,655 | 90.74 GB |
English | cc-may-2022-en | 14,728,252 | 81.81 GB |
English | cc-jun-jul-2022-en | 14,124,173 | 81.66 GB |
English | cc-jan-2022-en | 12,686,195 | 78.67 GB |
English | cc-oct-2021-en | 12,869,850 | 75.24 GB |
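As a quick consistency check, the per-dump document counts in the table can be summed and compared with the line counts listed under Languages. The sketch below copies the figures from the table; the variable names are illustrative and not part of the dataset.

```python
# Document counts copied from the Data Splits table, in table order.
ZH_DOCS = [
    5_627_020, 5_548_376, 5_369_296, 6_156_501, 4_971_629, 5_566_643,
    6_408_203, 6_853_895, 7_975_739, 7_371_460, 6_643_794, 6_509_108,
    5_142_078, 7_284_775, 8_133_760, 6_834_254, 8_184_433,
]
EN_DOCS = [15_712_655, 14_728_252, 14_124_173, 12_686_195, 12_869_850]

# Totals per language.
zh_total = sum(ZH_DOCS)
en_total = sum(EN_DOCS)

print(f"Chinese: {len(ZH_DOCS)} dumps, {zh_total:,} docs")
print(f"English: {len(EN_DOCS)} dumps, {en_total:,} docs")
```

Both sums equal the line counts given under Languages (110,580,964 and 70,121,125), which suggests that each line corresponds to one document.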
Dataset Structure
Data Fields
- text: the processed and cleaned text contained in the page
- timestamp: timestamp of when the webpage was crawled by Common Crawl
- url: the URL of the webpage crawled to produce the sample
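The three fields above can be modeled as a small typed record. The following is a minimal sketch that assumes each sample is stored as one JSON object per line; this card does not state the storage format, so that is an assumption. The type `OasisRecord`, the helper `parse_record`, the sample values, and the ISO 8601 timestamp format are all hypothetical.

```python
import json
from typing import TypedDict


class OasisRecord(TypedDict):
    """Hypothetical typed view of one sample (field names from this card)."""
    text: str
    timestamp: str
    url: str


def parse_record(line: str) -> OasisRecord:
    """Parse one JSON line and check that the three documented fields exist."""
    obj = json.loads(line)
    missing = [key for key in ("text", "timestamp", "url") if key not in obj]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    return OasisRecord(text=obj["text"], timestamp=obj["timestamp"], url=obj["url"])


# Placeholder sample, not taken from the corpus; the timestamp format is assumed.
sample_line = json.dumps({
    "text": "Example page content.",
    "timestamp": "2023-06-01T00:00:00Z",
    "url": "https://example.com/page",
})
record = parse_record(sample_line)
print(record["url"])
```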
Dataset Creation
- (1) Ungoliant Content Extraction
- (2) Rule Filter
- (3) Neural Filter
- (4) Document Deduplication
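The card lists only the names of the four stages, so the sketch below is an illustration of how stages (2) to (4) could be chained, not the pipeline actually used. It assumes stage (1) has already produced dictionaries with the documented fields. The minimum-length rule, the score threshold, the `score` callable standing in for a neural quality model, and exact-match hashing for deduplication are all assumptions.

```python
import hashlib
from typing import Callable, Iterable, Iterator

Doc = dict  # expected keys: "text", "timestamp", "url"


def rule_filter(doc: Doc, min_chars: int = 20) -> bool:
    """Hypothetical rule: keep documents with at least min_chars of text."""
    return len(doc["text"].strip()) >= min_chars


def neural_filter(doc: Doc, score: Callable[[str], float],
                  threshold: float = 0.5) -> bool:
    """Keep documents whose quality score reaches the (assumed) threshold."""
    return score(doc["text"]) >= threshold


def deduplicate(docs: Iterable[Doc]) -> Iterator[Doc]:
    """Exact deduplication on a SHA-256 hash of the text (an assumption;
    the card does not say which deduplication method is used)."""
    seen = set()
    for doc in docs:
        digest = hashlib.sha256(doc["text"].encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            yield doc


def run_pipeline(docs: Iterable[Doc], score: Callable[[str], float]) -> list:
    """Apply the rule filter, the neural filter, then deduplication."""
    kept = (d for d in docs if rule_filter(d) and neural_filter(d, score))
    return list(deduplicate(kept))
```

A usage example with a toy scoring function in place of a real model:

```python
docs = [
    {"text": "A reasonably long and clean paragraph.", "timestamp": "", "url": "a"},
    {"text": "A reasonably long and clean paragraph.", "timestamp": "", "url": "b"},
    {"text": "too short", "timestamp": "", "url": "c"},
    {"text": "spam spam spam spam spam spam spam", "timestamp": "", "url": "d"},
]
toy_score = lambda text: 0.0 if "spam" in text else 1.0
result = run_pipeline(docs, toy_score)
print([d["url"] for d in result])
```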
Contact
The Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences