In today's digital era, large language models (LLMs) have undergone a significant transformation. They've progressed from struggling with human speech intricacies to generating text that closely resembles human writing. These LLMs now excel not only in contextual conversations but also in programming tasks.
The beginnings of LLMs are closely tied to the open-source movement. Pioneering minds and scholars recognized the potential within these models, while understanding the substantial computing resources needed to train them.
This led to the emergence of open-source alternatives, providing practical options for researchers and developers. In this article, we'll explore the top 11 open-source LLMs of 2023, comparing their capabilities. We'll also delve into LLM leaderboards and offer guidance on choosing the right LLM for your needs.
While several proprietary LLMs have carved their niche, the open-source arena is bustling with innovation, presenting models that are not only powerful but also accessible to a broader audience.
Let’s take a look.
Llama 2 is a collection of pre-trained and fine-tuned generative text models, ranging from 7 billion to 70 billion parameters. The fine-tuned versions, called Llama-2-Chat, are designed specifically for dialogue applications and are optimized to outperform other open-source chat models. In human evaluations, they received high marks for both helpfulness and safety, putting them on par with popular closed-source models like ChatGPT and PaLM.
Here are the details of this model:
Parameters: 7B, 13B, and 70B
License: Custom commercial license available at Meta's website.
Release Date: July 18, 2023
Training Database: Llama 2 was pre-trained on 2 trillion tokens from public data, then fine-tuned with over a million human-annotated instances and public instruction datasets. Meta claims that no Meta user data was used in either phase.
Variants: Llama 2 is available in multiple parameter sizes, including 7B, 13B, and 70B. Both pre-trained and fine-tuned variations are available.
Fine-tuning Techniques: The model employs supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) to better align with human preferences, ensuring helpfulness and safety.
OpenLLaMA is an open-source reproduction of Meta AI's LLaMA model. Its creators have made this permissively licensed model available to the general public. OpenLLaMA is released in 3B, 7B, and 13B parameter sizes (the original LLaMA ranges from 7 billion to 65 billion parameters), and its first public preview was trained on 200 billion tokens.
Here are the details of OpenLLaMA:
Parameters: 3B, 7B and 13B
License: Apache 2.0
Release Date: May 5, 2023
HuggingFace: OpenLLaMA: An Open Reproduction of LLaMA
Training Database: OpenLLaMA was trained using the RedPajama dataset, which has over 1.2 trillion tokens. The developers followed the same preprocessing and training hyperparameters as the original LLaMA paper.
Fine-tuning Techniques: OpenLLaMA uses the same model architecture, context length, training steps, learning rate schedule, and optimizer as the original LLaMA paper. The main difference between OpenLLaMA and the original LLaMA is the dataset used for training.
Falcon models were developed by the Technology Innovation Institute in Abu Dhabi. Falcon-40B is the most notable model in the family and can compete with several closed-source LLMs.
Here are the details of the Falcon models:
Parameters: 7B and 40B
License: Apache 2.0
Release Date: June 5, 2023
Training Database: Falcon-7B and Falcon-40B were trained on 1.5 trillion and 1 trillion tokens, respectively. The primary training data is the RefinedWeb dataset, which makes up over 80% of their training material. This dataset is a massive web collection based on CommonCrawl, emphasizing quality and scale.
Techniques Used for Fine-Tuning: Falcon models use multiquery attention to share keys and values for improved inference scalability.
System Requirements: Falcon-40B: Requires ~90GB of GPU memory, and Falcon-7B: Requires ~15GB of GPU memory.
Package Version Requirements: For optimal performance, it's recommended to use the bfloat16 datatype, which requires a recent version of CUDA and is best suited for modern graphics cards.
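To see why sharing keys and values helps inference scale, the sketch below compares the size of the key/value cache under standard multi-head attention with the size under multi-query attention. The layer count, head count, head dimension, and sequence length are illustrative assumptions, not Falcon's published configuration, and `kv_cache_bytes` is a hypothetical helper.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch_size=1, bytes_per_value=2):
    """Bytes needed to cache keys and values for one batch of sequences.

    The leading factor of 2 accounts for storing both keys and values.
    bytes_per_value=2 corresponds to a 16-bit type such as bfloat16.
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size * bytes_per_value


# Hypothetical model shape, chosen only for illustration.
N_LAYERS = 32
N_HEADS = 64
HEAD_DIM = 128
SEQ_LEN = 2048

# Multi-head attention: every query head has its own key/value head.
mha_bytes = kv_cache_bytes(N_LAYERS, N_HEADS, HEAD_DIM, SEQ_LEN)
# Multi-query attention: all query heads share a single key/value head.
mqa_bytes = kv_cache_bytes(N_LAYERS, 1, HEAD_DIM, SEQ_LEN)

print(f"multi-head cache:  {mha_bytes / 2**20:.0f} MiB")
print(f"multi-query cache: {mqa_bytes / 2**20:.0f} MiB")
print(f"reduction factor:  {mha_bytes // mqa_bytes}x")
```

Under these assumptions the cache shrinks by a factor equal to the number of query heads, which is what frees memory for longer sequences or larger batches at inference time.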
Dolly, officially known as dolly-v2-12b, is an instruction-following large language model developed by Databricks. It is based on the pythia-12b model and was fine-tuned on the Databricks machine learning platform using about 15,000 instruction/response records written by Databricks employees.
It covers a range of capability domains from the InstructGPT paper, such as brainstorming, classification, closed QA, generation, information extraction, open QA, and summarization. Although Dolly is not considered a state-of-the-art model, especially after Databricks acquired MosaicML, it displays exceptional instruction-following behaviour that is not typical of the foundation model it is built upon.
Here are the details of this model:
License: MIT for the model; the databricks-dolly-15k training dataset is released under CC-BY-SA.
Release Date: Apr 12, 2023
Variants: Besides dolly-v2-12b, there are two smaller versions of Dolly: dolly-v2-7b, which has 6.9 billion parameters and is based on pythia-6.9b, and dolly-v2-3b, which has 2.8 billion parameters and is based on pythia-2.8b.
Database Used for Training: The dataset used for training the model is databricks-dolly-15k. This dataset contains fine-tuning records created by Databricks employees.
Techniques Used for Fine-Tuning: The model was fine-tuned using data from various domains per the InstructGPT paper.
The MosaicML company has developed MPT-30B, a decoder-based transformer pre-trained on 1T tokens of both English text and code. It's part of the Mosaic Pretrained Transformer (MPT) series, designed for efficient fine-tuning and LLM deployment. MPT-30B boasts features like an 8k token context window, context-length extrapolation through ALiBi, and the FlashAttention mechanism for fast training and inference.
The model is compatible with both HuggingFace and NVIDIA's FasterTransformer, and its size was chosen to allow deployment on a single GPU. The MosaicML NLP team developed MPT-30B on their platform using the LLM codebase found in the llm-foundry repository, which is recommended for fine-tuning and inference.
Here are the details of this model:
Release Date: June 22, 2023
Variants: There are two models available: MPT-7B and MPT-30B. Each model comes with an instruction and a chat version.
Database Used for Training: 1T tokens of English text and code
System Requirements: The model can be deployed on a single GPU: either 1xA100-80GB in 16-bit precision or 1xA100-40GB in 8-bit precision.
Package Version Requirements for Training: MosaicML recommends utilizing the MosaicML llm-foundry repository to train and fine-tune the model for optimal results. It's worth noting that the MPT-30B tokenizer used in the training process is identical to the EleutherAI/gpt-neox-20b tokenizer.
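The single-GPU figures above follow from simple arithmetic on the weights alone. Here is a minimal sketch, assuming a round 30 billion parameters and ignoring activations, the key/value cache, and framework overhead, all of which need additional headroom. The helper name `weight_memory_gb` is hypothetical.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Approximate memory for model weights, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


N_PARAMS = 30e9  # assumed round figure for MPT-30B

for bits, gpu_gb in [(16, 80), (8, 40)]:
    needed = weight_memory_gb(N_PARAMS, bits)
    headroom = gpu_gb - needed
    print(f"{bits}-bit weights: ~{needed:.0f} GB, leaving ~{headroom:.0f} GB on a {gpu_gb} GB GPU")
```

The same arithmetic gives about 80 GB for a 40B model and 14 GB for a 7B model at 16 bits, which is consistent with the ~90GB and ~15GB Falcon figures once overhead is included.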
The Guanaco models are LLMs fine-tuned with QLoRA, a LoRA-based fine-tuning technique developed by Tim Dettmers and the UW NLP team. With QLoRA, it's possible to fine-tune a 65B parameter model on a single 48GB GPU without sacrificing performance compared to 16-bit fine-tuning.
According to its authors, the Guanaco family outperformed all previously released open models on the Vicuna benchmark. Because these models are derived from the original LLaMA series, they inherit LLaMA's license terms, which do not permit commercial use. Although Guanaco is not the most advanced model available, it introduced the QLoRA method, which makes fine-tuning efficient enough for individuals and smaller businesses to fine-tune models with up to 65 billion parameters.
Here are the details of this model:
License: MIT License
Release Date: May 24, 2023
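To illustrate why the LoRA-style fine-tuning behind Guanaco is cheap, the sketch below counts the trainable parameters for a single weight matrix when the update is factored into two low-rank matrices. The matrix size and rank are illustrative assumptions, not Guanaco's actual settings. It also estimates the memory needed for 65B weights stored at 4 bits per parameter, the precision QLoRA uses for the frozen base model.

```python
def lora_trainable_params(d_out, d_in, rank):
    """Parameters in the low-rank factors B (d_out x rank) and A (rank x d_in)."""
    return d_out * rank + rank * d_in


# Hypothetical projection matrix and rank, for illustration only.
D_OUT, D_IN, RANK = 8192, 8192, 64

full = D_OUT * D_IN
lora = lora_trainable_params(D_OUT, D_IN, RANK)
print(f"full matrix:  {full:,} parameters")
print(f"LoRA factors: {lora:,} parameters ({100 * lora / full:.2f}% of full)")

# Frozen base weights at 4 bits per parameter (0.5 bytes each).
base_gb = 65e9 * 4 / 8 / 1e9
print(f"65B weights at 4-bit: ~{base_gb:.1f} GB")
```

Under these assumptions the 4-bit weights take about 32.5 GB, which leaves room on a 48GB GPU for the adapters, optimizer state, and activations.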
BLOOM, which stands for BigScience Large Open-science Open-access Multilingual Language Model, is a language model trained with large-scale computational resources to continue text from a given prompt. It is the biggest model on this list, with around 176 billion parameters.
It can produce coherent text in 46 natural languages and 13 programming languages that is almost indistinguishable from human-written content. Given input text, BLOOM generates a relevant continuation by conditioning on the preceding words.
While the direct application of BLOOM is primarily for text generation, the model can be adapted for tasks such as Information Extraction, Question Answering, and text summarization by framing them as text generation tasks.
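A minimal sketch of that reframing: each task becomes a prompt whose natural continuation is the answer. The template wording and the `build_prompt` helper are hypothetical; in practice the resulting string would be passed to the model's text-generation call.

```python
# Hypothetical prompt templates; the wording is illustrative and not taken
# from BLOOM's documentation.
TEMPLATES = {
    "summarization": "Article: {text}\n\nSummary:",
    "question_answering": "Context: {text}\n\nQuestion: {question}\n\nAnswer:",
    "information_extraction": "Text: {text}\n\nList the {field} mentioned in the text:",
}


def build_prompt(task, **fields):
    """Fill in the template for `task`; raises KeyError for unknown tasks or missing fields."""
    return TEMPLATES[task].format(**fields)


prompt = build_prompt(
    "question_answering",
    text="BLOOM was released on July 11, 2022.",
    question="When was BLOOM released?",
)
print(prompt)
```

Ending each template with a cue such as "Answer:" or "Summary:" is the design choice that matters here: the model only continues text, so the prompt has to make the desired output the most natural continuation.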
Here are the details of this massive model.
License: RAIL License v1.0
Release Date: July 11, 2022
Compute Infrastructure: This model was trained on the Jean Zay public supercomputer using 384 A100 80GB GPUs across 48 nodes, with a further 32 GPUs (4 nodes) held in reserve, for 416 in total. Each node has 8 GPUs connected through NVLink 4 inter-GPU connections and 4 OmniPath links, 512GB of CPU RAM, and 640GB of GPU memory. Megatron-DeepSpeed, DeepSpeed, PyTorch, and apex were used to train the model.
Alpaca is an instruction-following language model fine-tuned from a 7B LLaMA model on 52K instruction-following demonstrations. In preliminary human evaluations, Alpaca behaved similarly to the text-davinci-003 model on the Self-Instruct instruction-following evaluation suite.
Here are the details:
Parameters: Fine-tuned from a 7B LLaMA model.
Release Date: Mar 13, 2023
Training Database: The model was fine-tuned on 52K instruction data using modified techniques from the Self-Instruct paper. Data generation leveraged text-davinci-003, a simplified pipeline, and produced one instance per instruction. Fine-tuning employed the Hugging Face training code.
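For illustration, an instruction-following record can be thought of as an instruction, an optional input, and a target output that are rendered into a single training string. The field names and the prompt wording below are assumptions modelled on common instruction-tuning formats, not a verbatim copy of Alpaca's template, and the record itself is made up.

```python
def format_example(record):
    """Render one instruction record into (prompt, target) strings for supervised fine-tuning."""
    parts = [f"### Instruction:\n{record['instruction']}"]
    # The input field is optional; include it only when it is non-empty.
    if record.get("input"):
        parts.append(f"### Input:\n{record['input']}")
    parts.append("### Response:\n")
    return "\n\n".join(parts), record["output"]


# Hypothetical record, made up for this example.
record = {
    "instruction": "Classify the sentiment of the sentence.",
    "input": "I really enjoyed this article.",
    "output": "Positive",
}

prompt, target = format_example(record)
print(prompt + target)
```

During fine-tuning, the model would be trained to produce `target` given `prompt`; at inference time the same prompt format is used with the response left empty.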
OpenChatKit is an open-source toolset that empowers users to create general and specialized chatbot applications. One of the models developed in this platform is the GPT-NeoXT-Chat-Base-20B-v0.16, an LLM with 20B parameters.
This model is fine-tuned from EleutherAI's GPT-NeoX and focuses on dialogue-style interactions. Its primary function is to perform tasks like answering questions, classification, extraction, and summarization. The model has undergone extensive training with over 40 million instructions on 100% carbon-negative computing.
Here are the details:
Training Database: The model has been enhanced with a set of 43 million top-notch instructions. The exact datasets utilized can be found in the togethercomputer/OpenDataHub repository.
Fine-tuning Techniques: This model has been enhanced and fine-tuned using EleutherAI's GPT-NeoX and feedback data, resulting in better adaptation for human conversation.
System Requirements: Running the GPT-NeoXT-Chat-Base-20B model requires a minimum of 41GB of free VRAM, with each prompt consuming an additional 100-200 MB. The project's guide also covers running inference on consumer hardware with less than 48GB of VRAM.
GPT4All is an ecosystem for training and deploying large language models. These models can run locally on CPUs that are designed for consumer use. This system is an assistant-style language model that is instruction-tuned and can be used, distributed, and built upon by anyone, whether they are an individual or belong to an enterprise.
This ecosystem enables users to create and use language models specific to their requirements. These models can operate efficiently on standard CPUs without requiring an internet connection or GPU. Direct installer links are available for macOS, Windows, and Ubuntu.
Here are more details about their models:
Parameters: Model files range from 3GB to 8GB. Given typical file sizes, this corresponds to roughly 7B to 13B parameters.
Release date: Apr 24, 2023
Fine-tuning Techniques: The GPT4All software ecosystem supports multiple Transformer architectures, including Falcon, LLaMA (including OpenLLaMA), MPT (including Replit), and GPT-J.
FLAN-T5 is an improved version of T5 that is specifically designed for zero-shot and few-shot NLP tasks. With over 1000 additional tasks and multiple languages covered, it is a powerful language model optimized for research purposes, including reasoning and question answering.
Google has released several variants of the model, from flan-t5-small with 80 million parameters to flan-t5-xxl with 11 billion parameters. The largest model, flan-t5-xxl, supports only English, German, and French, while smaller models like flan-t5-xl support 50+ languages.
Here are the details of these models:
Parameters: 80M to 11B
License: Apache 2.0
Variants: Google's FLAN-T5 has been released in 5 variants: flan-t5-small with 80M parameters, flan-t5-base with 250M parameters, flan-t5-large with 780M parameters, flan-t5-xl with 3B parameters, and the largest, flan-t5-xxl, with 11B parameters.
Fine-tuning Techniques: Based on pretrained T5 and fine-tuned with instructions for enhanced zero-shot and few-shot performance.
System Requirements: The model was trained on Google Cloud TPU Pods (TPU v3 or TPU v4, with a minimum of four chips) using the t5x codebase in conjunction with jax. This is the training setup reported for the model rather than a requirement for running it.
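The difference between zero-shot and few-shot use of a model like FLAN-T5 comes down to how the prompt is built. The sketch below is a generic prompt builder; the `Input:`/`Output:` layout, the helper name, and the translation examples are assumptions chosen for illustration, not a format prescribed by FLAN-T5.

```python
def build_few_shot_prompt(instruction, examples, query):
    """Concatenate an instruction, k solved examples, and the unsolved query.

    With an empty `examples` list this degenerates to a zero-shot prompt.
    """
    blocks = [instruction]
    for source, target in examples:
        blocks.append(f"Input: {source}\nOutput: {target}")
    blocks.append(f"Input: {query}\nOutput:")
    return "\n\n".join(blocks)


# Hypothetical translation examples.
examples = [("cheese", "fromage"), ("bread", "pain")]

zero_shot = build_few_shot_prompt("Translate English to French.", [], "water")
few_shot = build_few_shot_prompt("Translate English to French.", examples, "water")
print(few_shot)
```

In the zero-shot case the model sees only the instruction and the query; in the few-shot case the solved examples show it the expected input/output pattern before the query.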
Each of these 11 LLMs comes with distinctive features and specifications that cater to a range of users. Whether your focus is on portability, performance, or budget-friendliness, you will find a model designed to match your requirements.
However, while open-source options offer great advantages, the process of developing and selecting the right model can pose its challenges. Let’s explore them.
LLMs are always changing, as new models keep appearing. While having lots of options is exciting, it can also be a bit overwhelming for developers, researchers, and tech enthusiasts. To help with this changing landscape, LLM leaderboards give us a clear picture of how different language models perform.
Let's take a look at some of them.
The HuggingFace Open LLM Leaderboard is a platform designed to track, rank, and assess LLMs and chatbots as they are released. It is open to the community: anyone can submit a model for automatic evaluation on the HuggingFace GPU cluster. The only requirement is that the model is a HuggingFace Transformers model with weights available on the Hub. Models distributed as delta-weights under non-commercial licenses, like those based on the original LLaMA release, can also be evaluated. Users can easily filter models by type: pre-trained, fine-tuned, instruction-tuned, or RL-tuned.
The Chatbot Arena Leaderboard evaluates models on three benchmarks: Chatbot Arena, MT-Bench, and MMLU (5-shot). Models compete against each other on Chatbot Arena in randomized settings, answer multi-turn questions on MT-Bench, and take a multitask accuracy test across 57 tasks on MMLU (5-shot). Ratings and scores are then calculated from these results.
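Arena-style ratings are commonly computed with the Elo system, in which each pairwise battle nudges the winner's rating up and the loser's down. The sketch below is a generic Elo update, assuming a K-factor of 32 and starting ratings of 1000. It is not the leaderboard's actual implementation, and the model names and battle outcomes are placeholders.

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under the Elo model."""
    return 1 / (1 + 10 ** ((rating_b - rating_a) / 400))


def update_elo(rating_a, rating_b, score_a, k=32):
    """Return new ratings after one battle; score_a is 1 (A wins), 0 (B wins), or 0.5 (tie)."""
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1 - score_a) - (1 - exp_a))
    return new_a, new_b


# Placeholder models and made-up battle outcomes.
ratings = {"model-x": 1000.0, "model-y": 1000.0}
battles = [("model-x", "model-y", 1), ("model-x", "model-y", 1), ("model-x", "model-y", 0)]

for a, b, score_a in battles:
    ratings[a], ratings[b] = update_elo(ratings[a], ratings[b], score_a)

print({name: round(rating, 1) for name, rating in ratings.items()})
```

Because each update adds to one rating exactly what it subtracts from the other, the total stays constant, and an upset (a low-rated model beating a high-rated one) moves the ratings more than an expected result.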
The most recent version can be found here.
The AlpacaEval Leaderboard is designed to evaluate how well LLMs follow instructions. Models are ranked by their win rate against a reference model, and the average length of their outputs is reported alongside it.
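As a rough illustration of these two metrics, the sketch below computes a win rate (the share of comparisons won against a reference model) and the average output length from a handful of made-up comparison records. The record layout and values are hypothetical, and length is measured in characters here for simplicity.

```python
# Hypothetical comparison results: whether the judged model's output was
# preferred over the reference output, plus the judged model's output text.
results = [
    {"preferred": True, "output": "Paris is the capital of France."},
    {"preferred": False, "output": "42"},
    {"preferred": True, "output": "Use a list comprehension."},
    {"preferred": True, "output": "Yes."},
]


def win_rate(records):
    """Fraction of comparisons in which the judged model's output was preferred."""
    return sum(r["preferred"] for r in records) / len(records)


def mean_output_length(records):
    """Average output length in characters."""
    return sum(len(r["output"]) for r in records) / len(records)


print(f"win rate: {win_rate(results):.0%}")
print(f"mean output length: {mean_output_length(results):.1f} characters")
```

Reporting length next to the win rate is useful because automatic judges can favor longer answers, so a high win rate with unusually long outputs deserves a closer look.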
Open-source development for Large Language Models (LLMs) brings numerous advantages, like collaboration, transparency, and innovation. However, building and maintaining these models presents its share of challenges, including:
Companies like MosaicML and Databricks are trying to make fine-tuning more accessible through their platforms. Others, like Lambda, are working on reducing GPU costs. Still, cost-related issues persist.
Selecting the appropriate Large Language Model (LLM) for your business use case requires a systematic approach. Here's a short step-by-step guide to help you make the right choice:
Looking ahead to the rest of 2023 and beyond, we can expect the open-source LLM landscape to flourish with the regular introduction of new models.
The 11 models that we’ve listed have made powerful language processing accessible, overcoming cost and proprietary hurdles.
However, they also face a multitude of challenges like cost, privacy, bias, and scalability. Users must consider these against benefits like customization, cost savings, and security, compared to proprietary LLMs that offer support but with fees and less flexibility.
Yet, the open-source community is committed to ethical, user-centric models. As technology evolves, these LLMs will progress, driving innovative, collaborative, and responsible AI-driven language processing.
We are excited to see what lies ahead for the AI community and hope you are, too!