Dataset Description

Original Repository:

This dataset contains the training data used to train the Baize family of models. It is intended for instruction fine-tuning of LLMs, particularly in “chat” format. Human and AI messages are marked by [|Human|] and [|AI|] tags, respectively. The original repository provides four datasets (alpaca, medical, quora, stackoverflow); this dataset combines all four into one, totaling about 210K rows.
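
To illustrate the tag format, here is a minimal sketch of how a tagged transcript could be split into speaker turns. It assumes each row stores the whole conversation as one string; the function name `split_turns` and the sample text are hypothetical and are not taken from the original repository.

```python
import re

# Matches the speaker tags described above: [|Human|] and [|AI|].
TURN_TAG = re.compile(r"\[\|(Human|AI)\|\]")


def split_turns(transcript):
    """Split a tagged transcript into (speaker, message) pairs.

    Hypothetical helper: text before the first tag is ignored, and
    tags followed by no text (e.g. a trailing prompt tag) are dropped.
    """
    matches = list(TURN_TAG.finditer(transcript))
    turns = []
    for i, match in enumerate(matches):
        # A message runs from the end of its tag to the start of the next tag.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(transcript)
        message = transcript[match.end():end].strip()
        if message:
            turns.append((match.group(1), message))
    return turns


# Made-up sample row, for illustration only.
sample = "[|Human|] What is Python?\n[|AI|] A programming language.\n[|Human|] "
for speaker, message in split_turns(sample):
    print(f"{speaker}: {message}")
```

Returning `(speaker, message)` pairs keeps the sketch independent of any particular chat template; the pairs can then be mapped to whatever role names a fine-tuning framework expects.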