We are thrilled to announce the release of the OpenOrca dataset! This rich collection of augmented FLAN data aligns, as closely as possible, with the distributions outlined in the Orca paper. It has been instrumental in generating high-performing model checkpoints and serves as a valuable resource for all NLP researchers and developers!
Official Models
OpenOrca-Platypus2-13B
Our latest release, the first 13B model to score higher than LLaMA1-65B on the HuggingFace Leaderboard! Released in partnership with Platypus.
OpenOrcaxOpenChat-Preview2-13B
Our second model, which surpasses the performance reported in the Orca paper. It was #1 at release time and has since been surpassed by our own OpenOrca-Platypus2-13B. Released in partnership with OpenChat.
OpenOrca-Preview1-13B
This model was trained in less than a day, for <$200, with <10% of our data. At release, it beat the then-current state-of-the-art models on BigBench-Hard and AGIEval. It achieves ~60% of the improvements reported in the Orca paper.
Dataset Summary
The OpenOrca dataset is a collection of augmented FLAN Collection data. It currently contains ~1M GPT-4 completions and ~3.2M GPT-3.5 completions. It is tabularized in alignment with the distributions presented in the Orca paper and currently represents a partial completion of the full intended dataset, with ongoing generation to expand its scope. The data is primarily used for training and evaluation in the field of natural language processing.
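As a rough illustration of the numbers above, the following sketch computes the share of GPT-4 and GPT-3.5 completions from the approximate counts in the summary, and shows one possible way to peek at a few records. The helper names (`completion_shares`, `peek_openorca`) are hypothetical, and the loader rests on assumptions not stated in this card: that the `datasets` package is installed, and that the data is hosted on the Hugging Face Hub under the repository id `Open-Orca/OpenOrca` with a `train` split.

```python
from itertools import islice

# Approximate counts taken from the summary above (~1M and ~3.2M).
APPROX_COMPLETIONS = {"gpt-4": 1_000_000, "gpt-3.5": 3_200_000}


def completion_shares(counts):
    """Return each source's fraction of the total number of completions."""
    total = sum(counts.values())
    return {source: n / total for source, n in counts.items()}


def peek_openorca(n=3, repo_id="Open-Orca/OpenOrca"):
    """Stream the first n records without downloading the full dataset.

    Assumptions: the `datasets` package is installed, and the dataset is
    hosted on the Hugging Face Hub under `repo_id` with a "train" split.
    """
    from datasets import load_dataset  # imported lazily; needs network access

    stream = load_dataset(repo_id, split="train", streaming=True)
    return list(islice(stream, n))


if __name__ == "__main__":
    for source, share in completion_shares(APPROX_COMPLETIONS).items():
        print(f"{source}: {share:.1%}")
```

With the approximate counts above, GPT-4 completions make up roughly a quarter of the current data. Streaming is used in the loader so that a quick inspection does not require downloading several million rows.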