https://huggingface.co/datasets/roemmele/ablit

Dataset Card for AbLit

Dataset Summary

The AbLit dataset contains abridged versions of 10 classic English literature books, aligned with their original versions at several passage levels. The abridgements were written and made publicly available by Emma Laybourn. This is the first known dataset for NLP research that focuses on the abridgement task.

See the paper for a detailed description of the dataset, as well as the results of several modeling experiments. The GitHub repo also provides more extensive ways to interact with the data beyond what is provided here.

Languages

English

Dataset Structure

Each passage in the original version of a book chapter is aligned with its corresponding passage in the abridged version. These aligned pairs are available for various passage sizes: chapters, sentences, paragraphs, and multi-sentence “chunks” that may span more than one paragraph. The passage size is specified when loading the dataset. There are train/dev/test splits for items of each size.

| Passage Size | Description | # Train | # Dev | # Test |
|---|---|---|---|---|
| chapters | Each passage is a single chapter | 808 | 10 | 50 |
| sentences | Each passage is a sentence delimited by the NLTK sentence tokenizer | 122,219 | 1,143 | 10,431 |
| paragraphs | Each passage is a paragraph delimited by a line break | 37,227 | 313 | 3,125 |
| chunks-10-sentences | Each passage consists of up to X=10 sentences, which may span more than one paragraph. To derive chunks with other lengths X, see the GitHub repo above | 14,857 | 141 | 1,264 |
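The idea behind the chunk size X can be illustrated with a minimal sketch that merges consecutive aligned sentence pairs into passages of up to X sentences. The function name `chunk_pairs`, the toy sentences, and the convention that a fully removed sentence has an empty abridged string are all assumptions made for illustration; the GitHub repo's own procedure may differ.

```python
from typing import List, Tuple

# (original sentence, abridged sentence); an empty abridged string is an
# assumed convention here for a sentence that was removed entirely.
Pair = Tuple[str, str]


def chunk_pairs(pairs: List[Pair], x: int = 10) -> List[Pair]:
    """Merge consecutive aligned sentence pairs into chunks of up to x pairs."""
    if x < 1:
        raise ValueError("x must be at least 1")
    chunks: List[Pair] = []
    for start in range(0, len(pairs), x):
        group = pairs[start:start + x]
        original = " ".join(o for o, _ in group)
        # Skip empty abridged sentences so no stray spaces are introduced.
        abridged = " ".join(a for _, a in group if a)
        chunks.append((original, abridged))
    return chunks


# Toy data written for this example, not taken from AbLit.
toy_pairs = [
    ("It was late.", "It was late."),
    ("The rain, which had been falling, stopped.", "The rain stopped."),
    ("Nobody spoke.", ""),
]

for original, abridged in chunk_pairs(toy_pairs, x=2):
    print(repr(original), "->", repr(abridged))
```

With `x=2` the three toy pairs yield two chunks, the second holding the single leftover sentence, which is why the table says "up to" X sentences.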

Example Usage

To load aligned paragraphs:

from datasets import load_dataset

# The second argument selects the passage size, e.g. "sentences" or "chunks-10-sentences"
data = load_dataset("roemmele/ablit", "paragraphs")

Data Fields

  • original: passage text in the original version
  • abridged: passage text in the abridged version
  • book: title of book containing passage
  • chapter: title of chapter containing passage
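As a rough sketch of how these fields might be used, the snippet below models one item as a typed dictionary and computes how much shorter the abridged passage is. It assumes all four fields are strings; the class name `AblitItem`, the helper `length_ratio`, and the toy item are hypothetical and not part of the dataset or its tooling.

```python
from typing import TypedDict


class AblitItem(TypedDict):
    # Field names follow the list above; the string types are an assumption.
    original: str
    abridged: str
    book: str
    chapter: str


def length_ratio(item: AblitItem) -> float:
    """Whitespace-delimited word count of the abridged text over the original."""
    n_original = len(item["original"].split())
    if n_original == 0:
        return 0.0
    return len(item["abridged"].split()) / n_original


# Toy item written for this example, not taken from AbLit.
toy_item: AblitItem = {
    "original": "She walked slowly and thoughtfully down the long lane.",
    "abridged": "She walked slowly down the lane.",
    "book": "Hypothetical Book",
    "chapter": "Chapter 1",
}

print(round(length_ratio(toy_item), 2))
```

Counting whitespace-delimited words is a deliberately crude choice; a tokenizer-based count would give slightly different ratios.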

Dataset Creation

Curation Rationale

Abridgement is the task of making a text easier to understand while preserving its linguistic qualities. Abridgements are different from typical summaries: whereas summaries abstractively describe the original text, abridgements simplify the original primarily through a process of extraction. We present this dataset to promote further research on modeling the abridgement process.
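One simple way to see the difference between extraction and abstraction is to measure how many words of the shortened text already occur in the original. The sketch below is only an illustration of that idea, not the analysis used in the paper; the function names, the regex tokenizer, and the toy sentences are assumptions.

```python
import re
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase the text and keep runs of letters, digits, and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


def extractive_fraction(original: str, shortened: str) -> float:
    """Fraction of tokens in the shortened text that also occur in the original.

    Each original token can be matched at most as many times as it occurs.
    """
    shortened_tokens = tokenize(shortened)
    if not shortened_tokens:
        return 0.0
    original_counts = Counter(tokenize(original))
    shortened_counts = Counter(shortened_tokens)
    overlap = sum(min(n, original_counts[tok]) for tok, n in shortened_counts.items())
    return overlap / len(shortened_tokens)


# Toy texts written for this example, not taken from AbLit.
original = "The weather, being exceedingly fine, tempted them out of doors."
abridgement_like = "The fine weather tempted them outside."
summary_like = "They decided to go for a walk."

print(extractive_fraction(original, abridgement_like))
print(extractive_fraction(original, summary_like))
```

In this toy case the abridgement-like text reuses five of its six words from the original, while the summary-like text reuses none. Word overlap ignores word order, so it is at best a coarse proxy for extraction.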

Source Data

The author Emma Laybourn wrote abridged versions of classic English literature books available through Project Gutenberg. She has also provided her abridgements for free on her website. This is how she describes her work: “This is a collection of famous novels which have been shortened and slightly simplified for the general reader. These are not summaries; each is half to two-thirds of the original length. I’ve selected works that people often find daunting because of their density or complexity: the aim is to make them easier to read, while keeping the style intact.”

Initial Data Collection and Normalization

We obtained the original and abridged versions of the books from the respective websites.

Who are the source language producers?

Emma Laybourn

Annotations

Annotation process

We designed a procedure for automatically aligning passages between the original and abridged version of each chapter. We conducted a human evaluation to verify these alignments had high accuracy. The training split of the dataset has ~99% accuracy. The dev and test splits of the dataset were fully human-validated to ensure 100% accuracy. See the paper for further explanation.

Who are the annotators?

The alignment accuracy evaluation was conducted by the authors of the paper, who have expertise in linguistics and NLP.

Personal and Sensitive Information

None

Considerations for Using the Data

Social Impact of Dataset

We hope this dataset will promote more research on the authoring process for producing abridgements, including models for automatically generating abridgements. Because it is a labor-intensive writing task, there are relatively few abridged versions of books. Systems that automatically produce abridgements could vastly expand the number of abridged versions of books and thus increase their readership.

Discussion of Biases

We present this dataset to introduce abridgement as an NLP task, but these abridgements are scoped to one small set of texts associated with a specific domain and author. There are significant practical reasons for this limited scope. In particular, in contrast to the books in AbLit, most recently published books are not included in publicly accessible datasets due to copyright restrictions, and the same restrictions typically apply to any abridgements of these books. For this reason, AbLit consists of British English literature from the 18th and 19th centuries. Some of the linguistic properties of these original books do not generalize to other types of English texts that would be beneficial to abridge. Moreover, the narrow cultural perspective reflected in these books is certainly not representative of the diverse modern population. Readers may find some content offensive.

Dataset Curators

The curators are the authors of the paper.

Licensing Information

cc-by-sa-4.0

Citation Information

Roemmele, Melissa, Kyle Shaffer, Katrina Olsen, Yiyi Wang, and Steve DeNeefe. “AbLit: A Resource for Analyzing and Generating Abridged Versions of English Literature.” Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume (2023).