Dataset Card for CC OpenBooks

Dataset Description

CC OpenBooks is a curated collection of high-quality non-fiction books. All texts come from CC BY 4.0 sources, so there is no license ambiguity. The documents are normalized to markdown, retaining as much of the original formatting as possible to improve output presentation in downstream tasks.

Source Data

The following OpenStax collections were used in creating this dataset:

  • Introduction to Anthropology
  • College Success Concise
  • College Success
  • Preparing for College Success
  • Microbiology
  • Chemistry 2e
  • Chemistry: Atoms First 2e
  • Física universitaria volumen 1
  • Física universitaria volumen 2
  • Física universitaria volumen 3
  • Introduction to Business
  • Astronomy 2e
  • Principles of Marketing
  • Psychologia
  • Contemporary Mathematics
  • Statistics
  • World History Volume 1, to 1500
  • World History Volume 2, from 1400
  • Physics
  • Introduction to Political Science
  • Introducción a la estadística empresarial
  • Introducción a la estadística
  • Entrepreneurship
  • Fizyka dla szkół wyższych. Tom 1
  • Fizyka dla szkół wyższych. Tom 2
  • Fizyka dla szkół wyższych. Tom 3
  • Writing Guide with Handbook
  • Biology 2e
  • Biology for AP® Courses
  • Concepts of Biology
  • Introduction to Sociology 3e
  • Life, Liberty, and the Pursuit of Happiness
  • Precálculo 2ed
  • Psychology 2e
  • Playground
  • University Physics Volume 1
  • University Physics Volume 2
  • University Physics Volume 3
  • Principles of Finance
  • U.S. History
  • American Government 3e
  • Anatomy and Physiology 2e
  • Química 2ed
  • Química: Comenzando con los átomos 2ed
  • Elementary Algebra 2e
  • Intermediate Algebra 2e
  • Prealgebra 2e
  • Business Ethics
  • Organizational Behavior
  • Principles of Management
  • Introduction to Intellectual Property
  • Principles of Economics 3e
  • Principles of Macroeconomics 3e
  • Principles of Macroeconomics for AP® Courses 2e
  • Algebra and Trigonometry 2e
  • College Algebra 2e
  • College Algebra with Corequisite Support 2e
  • Precalculus 2e
  • Introduction to Philosophy
  • College Physics 2e
  • College Physics for AP® Courses 2e
  • Mikroekonomia – Podstawy

Initial Data Collection and Normalization

Wherever possible, the books are converted to markdown, and this formatting is kept intact with downstream tasks in mind (e.g. conversational QA). The source of the text is prepended to each document to add context; this may also improve the source attribution and guidance capabilities of models.
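As a rough illustration of the source-prepending step, the sketch below places a one-line source header in front of a markdown document. The function name `prepend_source`, the header wording, and the `publisher` default are assumptions made for this example; the exact format used in the dataset is not specified here.

```python
def prepend_source(title: str, body: str, publisher: str = "OpenStax") -> str:
    """Return the markdown body with a one-line source header in front.

    The header wording is a hypothetical format chosen for illustration,
    not necessarily the one used in the dataset.
    """
    header = f"Source: {publisher}, {title} (CC BY 4.0)"
    # Drop leading newlines so exactly one blank line separates header and body.
    return f"{header}\n\n{body.lstrip(chr(10))}"


# Placeholder body text; a real document would hold the converted book content.
doc = prepend_source("Chemistry 2e", "# Heading\n\nBody text.")
print(doc)
```

Keeping the header as plain text on its own line leaves the markdown structure of the document body untouched.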

Licensing Information

All books in this collection were previously released by their original authors under an unambiguous CC BY 4.0 license.