Datasets: aquamuse

Dataset Card for AQuaMuSe

Dataset Summary

AQuaMuSe is a novel, scalable approach to automatically mining dual query-based multi-document summarization datasets, for both extractive and abstractive summaries, using a question answering dataset (Google Natural Questions) and large document corpora (Common Crawl).

This dataset contains versions of the automatically generated datasets for abstractive and extractive query-based multi-document summarization, as described in the AQuaMuSe paper.

Supported Tasks and Leaderboards

  • Abstractive and Extractive query-based multi-document summarization
  • Question Answering


Languages

en: English

Dataset Structure

Data Instances

  • input_urls: a list of string features.
  • query: a string feature.
  • target: a string feature.


    {
        'input_urls': [''],
        'query': 'who is the actor that plays marcel on the originals',
        'target': "In February 2013, it was announced that Davis was cast in a lead role on The CW's new show The Originals, a spinoff of The Vampire Diaries, centered on the Original Family as they move to New Orleans, where Davis' character (a vampire named Marcel) currently rules."
    }

Data Fields

  • input_urls: a list of string features.
      • URLs of the input documents to be summarized, pointing to Common Crawl.
      • Dependencies: the document URLs reference the Common Crawl June 2017 archive.
  • query: a string feature.
      • Input query used as the summarization context, derived from Natural Questions user queries.
  • target: a string feature.
      • Summarization target, derived from Natural Questions long answers.
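As a rough illustration of the field layout described above, the following sketch checks that a record carries the three documented fields with the documented types. The names `AquamuseRecord` and `is_valid_record` are hypothetical helpers introduced here for illustration; they are not part of the dataset or of any library.

```python
from typing import Any, List, TypedDict


class AquamuseRecord(TypedDict):
    """Hypothetical type mirroring the three fields listed in Data Fields."""

    input_urls: List[str]  # URLs of the Common Crawl documents to summarize
    query: str             # query derived from Natural Questions user queries
    target: str            # summary derived from a Natural Questions long answer


def is_valid_record(record: Any) -> bool:
    """Return True if `record` has the documented fields with the documented types."""
    if not isinstance(record, dict):
        return False
    urls = record.get("input_urls")
    return (
        isinstance(urls, list)
        and all(isinstance(url, str) for url in urls)
        and isinstance(record.get("query"), str)
        and isinstance(record.get("target"), str)
    )
```

A check like this can be run over a few parsed examples before training, to catch records where a field is missing or has the wrong type.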

Data Splits

  • This dataset has two high-level configurations: abstractive and extractive.
  • Each configuration has train, dev and test splits.
  • The original format of the data was TFRecords, which have been parsed into the format specified in Data Instances.
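The configurations and splits above could be enumerated and loaded along the following lines. This is a minimal sketch under stated assumptions: the hub identifier `"aquamuse"` is taken from this page's title, the split names follow this card (the hosted copy may expose `dev` under a different name such as `validation`), and `config_split_pairs` and `load_aquamuse` are hypothetical helper names.

```python
from itertools import product
from typing import Iterator, Tuple

# Configuration and split names as listed in this card.
CONFIGS = ("abstractive", "extractive")
SPLITS = ("train", "dev", "test")


def config_split_pairs() -> Iterator[Tuple[str, str]]:
    """Yield every (configuration, split) combination described above."""
    return product(CONFIGS, SPLITS)


def load_aquamuse(config: str, split: str):
    """Load one split of one configuration with the Hugging Face `datasets` library.

    Assumption: the dataset is published under the id "aquamuse" and the
    split name passed in matches the name used by the hosted copy.
    """
    if config not in CONFIGS:
        raise ValueError(f"unknown configuration: {config!r}")
    # Imported lazily so the helpers above work without `datasets` installed.
    from datasets import load_dataset

    return load_dataset("aquamuse", config, split=split)
```

For example, `load_aquamuse("abstractive", "train")` would request the training split of the abstractive configuration, provided the assumptions above hold.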

Dataset Creation

Curation Rationale

This is an automatically generated dataset for abstractive and extractive query-based multi-document summarization, as described in the AQuaMuSe paper.

Source Data

Initial Data Collection and Normalization

[More Information Needed]

Who are the source language producers?

[More Information Needed]


Annotation process

[More Information Needed]

Who are the annotators?

[More Information Needed]

Personal and Sensitive Information

[More Information Needed]

Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

[More Information Needed]

Additional Information

Dataset Curators

The dataset curator is sayalikulkarni, a contributor to the official GitHub repository for this dataset and one of the authors of the dataset's paper. The account handles of the other authors who took part in curating the dataset are not currently available, so the authors of the paper are listed here instead: Sayali Kulkarni, Sheide Chammas, Wan Zhu, Fei Sha, and Eugene Ie.

Licensing Information

[More Information Needed]

Citation Information

@misc{kulkarni2020aquamuse,
    title={AQuaMuSe: Automatically Generating Datasets for Query-Based Multi-Document Summarization},
    author={Sayali Kulkarni and Sheide Chammas and Wan Zhu and Fei Sha and Eugene Ie},
    year={2020},
    eprint={2010.12694},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}


Thanks to @Karthik-Bhaskar for adding this dataset.