Datasets/OpenOrca

Abner · August 23, 2023, 7:07am

We are thrilled to announce the release of the OpenOrca dataset! This rich collection of augmented FLAN data aligns, as closely as possible, with the distributions outlined in the Orca paper. It has been instrumental in generating high-performing model checkpoints and serves as a valuable resource for all NLP researchers and developers!

Official Models

OpenOrca-Platypus2-13B

Our latest release, the first 13B model to score higher than LLaMA1-65B on the HuggingFace Leaderboard! Released in partnership with Platypus.

OpenOrcaxOpenChat-Preview2-13B

Our second model, highlighting that we’ve surpassed the performance reported in the Orca paper. It was #1 at release time and has since been surpassed by our own OpenOrca-Platypus2-13B. Released in partnership with OpenChat.

OpenOrca-Preview1-13B

This model was trained in less than a day, for <$200, with <10% of our data. At release, it beat the then state-of-the-art models on BigBench-Hard and AGIEval. It achieves ~60% of the improvements reported in the Orca paper.

Dataset Summary

The OpenOrca dataset is a collection of augmented FLAN Collection data. It currently contains ~1M GPT-4 completions and ~3.2M GPT-3.5 completions. It is tabularized in alignment with the distributions presented in the Orca paper and currently represents a partial completion of the full intended dataset, with ongoing generation to expand its scope. The data is primarily used for training and evaluation in the field of natural language processing.
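To put the completion counts above in proportion, here is a minimal Python sketch that works out the rough share each model contributes. The numbers are the rounded figures quoted in the summary, not exact row counts, and the names `APPROX_COMPLETIONS` and `composition` are made up for this illustration.

```python
# Rounded completion counts from the summary above (approximate, not exact row counts).
APPROX_COMPLETIONS = {
    "GPT-4": 1_000_000,
    "GPT-3.5": 3_200_000,
}


def composition(counts):
    """Return each source's share of the total as a fraction between 0 and 1."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


total = sum(APPROX_COMPLETIONS.values())
shares = composition(APPROX_COMPLETIONS)

print(f"total completions: ~{total / 1e6:.1f}M")
for name, share in shares.items():
    print(f"{name}: ~{share:.0%}")
```

On these rounded figures, GPT-4 completions make up roughly a quarter of the current data, with GPT-3.5 completions accounting for the rest. Since generation is ongoing, the split may shift.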