Copenlu / answerable_tydiqa

Dataset Card for “answerable-tydiqa”

Dataset Summary

TyDi QA is a question answering dataset covering 11 typologically diverse languages. Answerable TyDi QA is an extension of the GoldP subtask of the original TyDi QA dataset that also includes unanswerable questions.

Dataset Structure

The dataset contains a train set and a validation set, with 116067 and 13325 examples, respectively. Access them with:

from datasets import load_dataset
dataset = load_dataset("copenlu/answerable_tydiqa")
train_set = dataset["train"]
validation_set = dataset["validation"]

Data Instances

Here is an example of an instance of the dataset:

{'question_text': 'dimanakah  Dr. Ernest François Eugène Douwes Dekker meninggal?',
 'document_title': 'Ernest Douwes Dekker',
 'language': 'indonesian',
 'annotations': {'answer_start': [45],
                 'answer_text': ['28 Agustus 1950']},
 'document_plaintext': 'Ernest Douwes Dekker wafat dini hari tanggal 28 Agustus 1950 (tertulis di batu nisannya; 29 Agustus 1950 versi van der Veur, 2006) dan dimakamkan di TMP Cikutra, Bandung.',
 'document_url': ''}
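
As a minimal sketch of how the character offset relates to the context, the snippet below slices the answer out of the paragraph. It assumes the offsets live under an `annotations` key, as the column descriptions indicate; `extract_span` is a hypothetical helper name, not part of the dataset or the `datasets` library.

```python
# The instance shown above, reduced to the fields needed for span extraction.
instance = {
    "document_plaintext": (
        "Ernest Douwes Dekker wafat dini hari tanggal 28 Agustus 1950 "
        "(tertulis di batu nisannya; 29 Agustus 1950 versi van der Veur, 2006) "
        "dan dimakamkan di TMP Cikutra, Bandung."
    ),
    "annotations": {"answer_start": [45], "answer_text": ["28 Agustus 1950"]},
}


def extract_span(example):
    """Slice the answer out of the context using the character offset."""
    start = example["annotations"]["answer_start"][0]
    answer = example["annotations"]["answer_text"][0]
    return example["document_plaintext"][start:start + len(answer)]


print(extract_span(instance))  # 28 Agustus 1950
```

The slice matching `answer_text` is a quick sanity check that an offset really is a character index into `document_plaintext`.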

Description of the dataset columns:

| Column name | Type | Description |
| --- | --- | --- |
| document_title | str | The title of the Wikipedia article from which the data instance was generated |
| document_url | str | The URL of said article |
| language | str | The language of the data instance |
| question_text | str | The question to answer |
| document_plaintext | str | The context, a Wikipedia paragraph that might or might not contain the answer to the question |
| annotations["answer_start"] | list[int] | The char index in 'document_plaintext' where the answer starts. If the question is unanswerable: [-1] |
| annotations["answer_text"] | list[str] | The answer, a span of text from 'document_plaintext'. If the question is unanswerable: [''] |

Note: If the question is answerable, annotations["answer_start"] and annotations["answer_text"] each contain a list of length 1. (In some variations of the dataset the lists might be longer, e.g. if more than one person annotated the instance, but not in our case.) If the question is unanswerable, annotations["answer_start"] is [-1] and annotations["answer_text"] is a list containing an empty string.
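
The convention above can be checked with a small predicate. This is only a sketch: `is_answerable` is a hypothetical helper, and the two example dicts are hand-written to mirror the described format rather than taken from the dataset.

```python
def is_answerable(example):
    """Return True when the first answer offset is a real index, i.e. not -1."""
    return example["annotations"]["answer_start"][0] != -1


# Hand-written examples following the documented conventions.
answerable = {"annotations": {"answer_start": [45], "answer_text": ["28 Agustus 1950"]}}
unanswerable = {"annotations": {"answer_start": [-1], "answer_text": [""]}}

print(is_answerable(answerable), is_answerable(unanswerable))  # True False
```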

Useful stuff

Check out the datasets documentation to learn how to manipulate and use the dataset. Specifically, you might find the following functions useful:

dataset.filter, for filtering out data (useful for keeping instances of specific languages, for example).

dataset.map, for manipulating the dataset.

dataset.to_pandas, to convert the dataset into a pandas.DataFrame format.
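
As a rough illustration of the kind of callables these functions accept, the sketch below defines a language predicate and a per-example transform and applies them to a few toy rows in plain Python. The rows are made-up stand-ins, and `keep_indonesian` and `add_answerable_flag` are hypothetical names; the assumption is that the same functions could be passed to `filter` and `map` on the loaded splits.

```python
# Toy stand-ins for dataset rows (hypothetical values, not real data).
rows = [
    {"language": "indonesian",
     "annotations": {"answer_start": [45], "answer_text": ["28 Agustus 1950"]}},
    {"language": "english",
     "annotations": {"answer_start": [-1], "answer_text": [""]}},
    {"language": "indonesian",
     "annotations": {"answer_start": [-1], "answer_text": [""]}},
]


def keep_indonesian(example):
    """Predicate: takes one example dict and returns a bool."""
    return example["language"] == "indonesian"


def add_answerable_flag(example):
    """Transform: takes one example dict and returns the new column(s)."""
    return {"is_answerable": example["annotations"]["answer_start"][0] != -1}


# Plain-Python equivalent of filtering, then adding a column to each row.
indonesian_rows = [row for row in rows if keep_indonesian(row)]
flagged_rows = [{**row, **add_answerable_flag(row)} for row in indonesian_rows]

print(len(indonesian_rows), [row["is_answerable"] for row in flagged_rows])  # 2 [True, False]
```

With the real data, the equivalent call would presumably look like `train_set.filter(keep_indonesian).map(add_answerable_flag)`.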

title   = {TyDi QA: A Benchmark for Information-Seeking Question Answering in Typologically Diverse Languages},
author  = {Jonathan H. Clark and Eunsol Choi and Michael Collins and Dan Garrette and Tom Kwiatkowski and Vitaly Nikolaev and Jennimaria Palomaki},
year    = {2020},
journal = {Transactions of the Association for Computational Linguistics}


Thanks to @thomwolf, @albertvillanova, @lewtun, @patrickvonplaten for adding this dataset.
