Dataset Card for “librispeech_lm”
Dataset Summary
Language modeling resources to be used in conjunction with the LibriSpeech ASR corpus.
Supported Tasks and Leaderboards
Languages
Dataset Structure
Data Instances
default
- Size of downloaded dataset files: 1.51 GB
- Size of the generated dataset: 4.42 GB
- Total amount of disk used: 5.93 GB
An example of ‘train’ looks as follows.
{
"text": "This is a test file"
}
Data Fields
The data fields are the same among all splits.
default
- text: a string feature.
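Because every record carries a single text field, a consumer can treat the corpus as a stream of plain strings. The following is a minimal sketch, not part of the dataset tooling: the helper name count_words and the whitespace tokenization are assumptions made for illustration, and the sample record is the one shown under Data Instances.

```python
from collections import Counter
from typing import Dict, Iterable


def count_words(records: Iterable[Dict[str, str]]) -> Counter:
    """Count whitespace-separated tokens across records with a "text" field.

    Hypothetical helper: whitespace splitting is an assumption, not the
    tokenization used by any particular language model toolkit.
    """
    counts: Counter = Counter()
    for record in records:
        text = record["text"]
        if not isinstance(text, str):
            raise TypeError("expected the 'text' field to be a string")
        counts.update(text.split())
    return counts


# Sample record copied from the Data Instances section of this card.
sample = [{"text": "This is a test file"}]
word_counts = count_words(sample)
```

The same function accepts any iterable of dictionaries, so it can be applied to a full split without first loading it into memory.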
Data Splits
name | train |
---|---|
default | 40418260 |
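With roughly 40 million training rows and a 1.51 GB download, it may be preferable to stream the split rather than materialize it. The sketch below assumes the Hugging Face datasets library is installed and that the dataset is reachable under the identifier "librispeech_lm" taken from this card's title; the identifier on the Hub may differ. The names first_n and stream_train_split are hypothetical helpers.

```python
from itertools import islice
from typing import Any, Dict, Iterable, List


def first_n(examples: Iterable[Dict[str, Any]], n: int) -> List[Dict[str, Any]]:
    """Return at most the first n examples from any iterable of records."""
    return list(islice(examples, n))


def stream_train_split():
    """Open the train split lazily (requires network access).

    Assumption: the dataset identifier matches the card title; adjust it
    if the dataset is hosted under a different name.
    """
    from datasets import load_dataset  # imported lazily; pip install datasets

    return load_dataset("librispeech_lm", split="train", streaming=True)


# Example usage (not executed here because it downloads data):
# for example in first_n(stream_train_split(), 3):
#     print(example["text"])
```

Streaming avoids writing the 4.42 GB generated dataset to disk, at the cost of sequential access only.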
Dataset Creation
Curation Rationale
Source Data
Initial Data Collection and Normalization
Who are the source language producers?
Annotations
Annotation process
Who are the annotators?
Personal and Sensitive Information
Considerations for Using the Data
Social Impact of Dataset
Discussion of Biases
Other Known Limitations
Additional Information
Dataset Curators
Licensing Information
Citation Information
@inproceedings{panayotov2015librispeech,
title={Librispeech: an ASR corpus based on public domain audio books},
author={Panayotov, Vassil and Chen, Guoguo and Povey, Daniel and Khudanpur, Sanjeev},
booktitle={Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on},
pages={5206--5210},
year={2015},
organization={IEEE}
}
Contributions
Thanks to @lewtun, @jplu, @thomwolf for adding this dataset.