HuggingFaceM4/OBELICS

GeneralMouton · August 23, 2023, 7:31am

Dataset Card for OBELICS

OBELICS is an open, massive, and curated collection of interleaved image-text web documents, containing 141M English documents, 115B text tokens, and 353M images, extracted from Common Crawl dumps between February 2020 and February 2023. The collection and filtering steps are described in our paper.
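To get a feel for the scale, the headline figures above can be turned into rough per-document averages. This is a back-of-the-envelope sketch only: the inputs are the rounded totals quoted in this card, so the results are approximate (on the order of 2.5 images and about 800 text tokens per document), and averages say nothing about how the counts are distributed.

```python
# Rounded totals as quoted in the dataset card.
documents = 141_000_000
text_tokens = 115_000_000_000
images = 353_000_000

# Simple means; the real per-document distribution is not given here.
images_per_document = images / documents
tokens_per_document = text_tokens / documents

print(f"{images_per_document:.1f} images per document on average")
print(f"{tokens_per_document:.0f} text tokens per document on average")
```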

Interleaved image-text web documents are sequences of text paragraphs interleaved with images, such as web pages that contain images. Models trained on these web documents outperform vision and language models trained solely on image-text pairs on various benchmarks. They can also generate long, coherent text about a set of multiple images. As an example, we trained IDEFICS, a visual language model that accepts arbitrary sequences of image and text inputs and produces text outputs.
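As a minimal sketch of what "interleaved" means in practice, the snippet below represents one document as two parallel lists in which each position holds either a text paragraph or an image reference, so the original order on the page is preserved. The class name `InterleavedDocument`, its fields, the `<image>` placeholder token, and the example URLs are all hypothetical and chosen for illustration; this is not presented as the dataset's actual schema or as the input format IDEFICS uses.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class InterleavedDocument:
    # Hypothetical layout: two parallel lists of equal length. At each
    # position exactly one of the two entries is set.
    texts: List[Optional[str]]
    images: List[Optional[str]]  # image URLs (placeholder values below)

    def __post_init__(self) -> None:
        if len(self.texts) != len(self.images):
            raise ValueError("texts and images must have the same length")
        for text, image in zip(self.texts, self.images):
            if (text is None) == (image is None):
                raise ValueError("each position must hold exactly one of text or image")

    def num_images(self) -> int:
        """Count the positions that hold an image."""
        return sum(image is not None for image in self.images)

    def to_prompt(self, image_token: str = "<image>") -> str:
        """Flatten the document into one string, marking image positions."""
        parts = [image_token if text is None else text for text in self.texts]
        return "\n".join(parts)


doc = InterleavedDocument(
    texts=["A paragraph introducing the topic.", None,
           "A paragraph discussing the image above.", None],
    images=[None, "https://example.com/a.jpg",
            None, "https://example.com/b.jpg"],
)
print(doc.num_images())
print(doc.to_prompt())
```

Keeping text and images in aligned lists, rather than as detached image-caption pairs, is what lets a model see each image in the context of the paragraphs around it.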

We provide an interactive visualization of OBELICS for exploring its content. The map shows a subset of 11M of the 141M documents.