OpenAssistant/oasst1

GeneralMouton · August 23, 2023, 7:22am

OpenAssistant Conversations Dataset (OASST1)

Dataset Summary

In an effort to democratize research on large-scale alignment, we release OpenAssistant Conversations (OASST1), a human-generated, human-annotated assistant-style conversation corpus consisting of 161,443 messages in 35 different languages, annotated with 461,292 quality ratings, resulting in over 10,000 fully annotated conversation trees. The corpus is a product of a worldwide crowd-sourcing effort involving over 13,500 volunteers.

Please refer to our paper for further details.

Dataset Structure

This dataset contains message trees. Each message tree has an initial prompt message as the root node, which can have multiple child messages as replies, and these child messages can have multiple replies.

All messages have a role property: this can either be “assistant” or “prompter”. The roles in conversation threads from prompt to leaf node strictly alternate between “prompter” and “assistant”.

This version of the dataset contains data collected on the open-assistant.io website until April 12 2023.

JSON Example: Message

For readability, the following JSON examples are shown formatted with indentation on multiple lines. Objects are stored without indentation (on single lines) in the actual jsonl files.