Meet Stable Beluga 1 and Stable Beluga 2, Our Large and Mighty Instruction Fine-Tuned Language Models

Updated 28 Jul 2023

Stability AI and its CarperAI lab proudly announce Stable Beluga 1 and its successor Stable Beluga 2 (formerly codenamed FreeWilly), two powerful new open-access Large Language Models (LLMs). Both models demonstrate exceptional reasoning ability across varied benchmarks. Stable Beluga 1 builds on the original LLaMA 65B foundation model and was carefully fine-tuned on a new synthetically generated dataset using Supervised Fine-Tuning (SFT) in standard Alpaca format. Similarly, Stable Beluga 2 builds on the LLaMA 2 70B foundation model to achieve industry-leading performance.
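For readers unfamiliar with the Alpaca format mentioned above, here is a minimal sketch of how one supervised fine-tuning example might be assembled. The two templates follow the publicly known Stanford Alpaca prompt wording; the helper name `build_sft_example` and the `prompt`/`completion` keys are illustrative assumptions, not part of the Stable Beluga training code, and the prompt template recommended for the released models may differ (check the model cards).

```python
# Stanford Alpaca prompt templates (with and without an extra input field).
ALPACA_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes "
    "the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n"
)
ALPACA_NO_INPUT = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Response:\n"
)


def build_sft_example(instruction, response, context=""):
    """Return one training pair (hypothetical helper, for illustration only)."""
    if context:
        prompt = ALPACA_WITH_INPUT.format(instruction=instruction, input=context)
    else:
        prompt = ALPACA_NO_INPUT.format(instruction=instruction)
    # During SFT the model learns to produce `completion` given `prompt`.
    return {"prompt": prompt, "completion": response}


example = build_sft_example("Name one species of whale.", "The beluga whale.")
print(example["prompt"] + example["completion"])
```

In a typical SFT setup the loss would be computed on the completion tokens, though the post does not say how the Stable Beluga training handled this.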

Both models are research experiments and are released to foster open research under a non-commercial license.* While we have conducted internal red-teaming to ensure the models remain polite and harmless, we welcome the community's feedback and help in further red-teaming.

Data Generation and Collection

The training for the Stable Beluga models was directly inspired by the methodology pioneered by Microsoft in its paper: “Orca: Progressive Learning from Complex Explanation Traces of GPT-4.” While our data generation process is similar, we differ in our data sources.

Our variant of the dataset, containing 600,000 data points (roughly 10% of the dataset size the original Orca paper used), was created synthetically using high-quality instructions from the following datasets created by Enrico Shippole:

To ensure fair comparisons, we carefully filtered these datasets and removed examples that originated from evaluation benchmarks.** Despite training on one-tenth the sample size of the original Orca paper (significantly reducing the cost and carbon footprint of training the model compared to the original paper), the resulting Stable Beluga models demonstrate exceptional performance across various benchmarks – validating our approach to synthetically generated datasets.
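The post does not describe how benchmark examples were detected, so the following is only a sketch of one common approach: dropping any training instruction whose normalized text exactly matches a benchmark question. The function names, the `instruction` key, and the matching rule are all assumptions made for illustration; a real pipeline might use n-gram or fuzzy overlap instead.

```python
import re


def normalize(text):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    return " ".join(text.split())


def remove_benchmark_overlap(examples, benchmark_questions):
    """Keep only examples whose instruction is not a known benchmark question.

    `examples` is a list of dicts with an "instruction" key (assumed schema).
    """
    blocked = {normalize(q) for q in benchmark_questions}
    return [ex for ex in examples if normalize(ex["instruction"]) not in blocked]


train = [
    {"instruction": "What is 2+2?"},
    {"instruction": "Describe a beluga."},
]
benchmark = ["what is 2 + 2"]
clean = remove_benchmark_overlap(train, benchmark)
print(len(train) - len(clean), "example(s) removed")
```

Exact matching after normalization is cheap and has few false positives, but it misses paraphrased benchmark items, which is why it is only a first pass in this sketch.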

Performance Evaluation

To internally evaluate these models, we used EleutherAI’s lm-eval-harness, to which we added AGIEval.

Both Stable Beluga models excel in many areas, including intricate reasoning, understanding linguistic subtleties, and answering complex questions in specialized domains such as law and mathematical problem-solving.

Open LLM Leaderboard benchmarks:

These Stable Beluga results were evaluated by Stability AI researchers and independently reproduced by Hugging Face on July 21st, 2023, and published in their leaderboard.

As of July 27th, 2023, Stable Beluga 2 ranks #1 on the leaderboard, and Stable Beluga 1 ranks #4:

Further comparisons:***

GPT4ALL benchmarks (all 0-shot):

AGI Eval (all 0-shot):

Contributing to an open future

Stable Beluga 1 and Stable Beluga 2 set a new standard in the field of open access Large Language Models. They both significantly advance research, enhance natural language understanding and enable complex tasks. We are excited about the endless possibilities these models will bring to the AI community and the new applications they will inspire.

We would like to express our sincere gratitude to our passionate team of researchers, engineers, and collaborators, whose remarkable efforts and dedication have enabled us to reach this significant milestone.

Stay tuned for more exciting developments, and begin exploring the incredible potential of Stable Beluga today!

Why did we change the names?

These models were renamed from their internal codename, FreeWilly, which was both a homage to the movies that some of us remember fondly and a reference to the Orca paper. There were multiple reasons for the name change, the most notable being that belugas are gentler animals than the fierce orca (commonly known as the killer whale). Stable Beluga models are optimized for “harmlessness,” so the new names are a better fit.

*The weights for Stable Beluga 2 are released as-is, while Stable Beluga 1’s are released as deltas over the original model. Both models are released under the Stable Beluga Research License.

**These include the ARC-Challenge and others on the Open LLM Leaderboard and GPT4ALL’s Performance Benchmarks.

***As reported in the “GPT-4 Technical Report” from OpenAI (March 27th, 2023).

****As reported in the paper “Orca: Progressive Learning from Complex Explanation Traces of GPT-4” from Microsoft Research (June 5th, 2023).

[Source: stability.ai]