Introduction to Large Language Models: Everything You Need to Know in 2023 (+ Resources)

Single-handedly, large language models (LLMs) have propped up the current wave of the AI boom. There’s a lot of hype, and for good reason. But what, exactly, is going on under the hood? What are some examples of LLMs and the different ways they can be implemented? Find answers in our LLM guide.

Avi Bewtra
December 4, 2023
September 17, 2023
Single-handedly, large language models (LLMs) have propped up the current wave of the AI boom. There’s a lot of hype, and some of it for good reason. Bigger and better LLMs are released each month, and each is more applicable in virtually every industry that relies on language: social media, healthcare, education, law, finance, scientific research…

When ChatGPT was released, the public saw an AI agent approaching human-level performance. And while tech like this had existed before—in GPT and GPT-2, for example—it had never been so accessible.

Since then, there’s been a slew of new LLMs that have pushed the state of the art forward. You’ve almost definitely tried some of them, but if you haven’t, I encourage you to ask ChatGPT a few questions. You’ll likely be shocked.

We know LLMs work. We can see that.

But what, exactly, is going on under the hood? What are some examples of LLMs and the different ways they can be implemented? Where are they most valuable? And where might they fail?

You can find the answers to these and many more questions in this article. Here’s what we’ll cover:

  1. What is a Language Model?
  2. How do Large Language Models work?
  3. How are Large Language Models trained?
  4. Applications of Large Language Models
  5. Limitations of Large Language Models

And psst… don’t forget that it’s imperative to think about safety, security, and privacy as these huge models become more widely available. It’s critical that dev-teams and end-users can trust their model’s outputs.

Lakera is building industry-leading LLM security tools. To implement LLMs without worrying about data leakage, prompt injections, and hallucinations—among other forms of attack—check out Lakera Guard.

Now, let’s dive in!

What is a Language Model?

Natural language processing (NLP)—the field of research pursuing ways for machines to master language—has had a long history. One of the most significant research areas of NLP has been language modeling. So before getting to large language models, we’ll consider language models more generally.

Fundamentally, a language model is one that predicts the next element in a sequence of text.

Typically, we’ll provide the model with some sort of context to make its prediction. Consider the following phrase as the model’s context: “The cat is...”

Let’s say we wanted to build a language model to predict the next word in the sentence. And, let’s say we are given the vocabulary—the set of possible words that could come next. A common way to make a prediction would be to calculate the probability of each word in the vocabulary coming next, given the set of previous words in the sequence—the context.

If our vocabulary was limited to “blue”, “black”, and “green”, we might build a model that calculated the following.

We, humans, can intuit that the most likely word to come next is black. (I haven’t seen too many blue or green cats.) The model would have to represent this intuitive pattern in the probability distribution it returns. In our example, the word with the highest probability is ‘black,’ giving us our predicted sentence: “The cat is black.”
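As a minimal sketch of this idea, the snippet below hard-codes a probability distribution for the context “The cat is” and picks the most likely word. The probability values are made up for illustration and are not taken from the original figure.

```python
# Illustrative (made-up) probabilities for the word following "The cat is".
next_word_probs = {"blue": 0.05, "black": 0.85, "green": 0.10}

# A valid probability distribution sums to 1.
assert abs(sum(next_word_probs.values()) - 1.0) < 1e-9

# Pick the word with the highest probability.
predicted = max(next_word_probs, key=next_word_probs.get)
print(f"The cat is {predicted}.")
```

A real language model would compute these probabilities from the context rather than read them from a fixed table.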

Ok. So we know our model has to return a probability distribution of all the words. Let’s consider this model as a function, ƒ, which takes the context, and returns the probability distribution over our vocabulary. In our example, that looks like:
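One way to write this down is sketched here. The notation is an assumption chosen to match the surrounding text, with V standing for the vocabulary {blue, black, green}:

```latex
f(\text{``The cat is''}) =
\begin{bmatrix}
P(\text{blue} \mid \text{The cat is}) \\
P(\text{black} \mid \text{The cat is}) \\
P(\text{green} \mid \text{The cat is})
\end{bmatrix},
\qquad
\sum_{w \in V} P(w \mid \text{The cat is}) = 1
```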

Here, ƒ is our language model. How it generates this probability distribution has been a research topic for more than fifty years. It’s been an iterative development: from statistical Markov methods, through neural network architectures including RNNs, LSTMs, and eventually, transformer models.

But for all of these architectures, our model, ƒ, can be used to generate whole phrases by adding each predicted word to our original context for the subsequent prediction. Let’s look at another example.

Consider a similar simple vocabulary: {“<start>”, “The”, “cat”, “is”, “blue”, “black”, “green”, “<end>”}

Here, “<start>” denotes an input to the model with no context, and “<end>” denotes the prediction that the sequence is complete.

If we’re trying to build a model, ƒ, to make sentences, we’d predict the first token with the following. (Note: I’ve made up the probability values for illustration)

In other words, by feeding the “<start>” token to the model, we generate the probability distribution over the vocabulary given no context. Because “The” has the highest probability, we’ll select it as the next word. It is then added to our context, and we can generate the next probability distribution. This process repeats as follows until our model predicts the sequence is over with the “<end>” token.

This gives us the completed sentence, “The cat is black”.
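The generation loop described above can be sketched in a few lines. Here the “model” is just a hypothetical lookup table with made-up probabilities, standing in for a real ƒ; only the loop structure is the point.

```python
# Toy "language model": maps a context (tuple of tokens) to a made-up
# probability distribution over the next token.
TOY_MODEL = {
    ("<start>",): {"The": 0.90, "cat": 0.05, "is": 0.05},
    ("<start>", "The"): {"cat": 0.80, "is": 0.10, "black": 0.10},
    ("<start>", "The", "cat"): {"is": 0.90, "black": 0.05, "<end>": 0.05},
    ("<start>", "The", "cat", "is"): {"black": 0.85, "green": 0.10, "blue": 0.05},
    ("<start>", "The", "cat", "is", "black"): {"<end>": 0.95, "is": 0.05},
}


def f(context):
    """Return the next-token distribution for a context (hypothetical lookup)."""
    return TOY_MODEL[tuple(context)]


def generate(max_tokens=10):
    """Greedily extend the context until the model predicts <end>."""
    context = ["<start>"]
    for _ in range(max_tokens):
        probs = f(context)
        next_token = max(probs, key=probs.get)  # greedy: highest probability
        if next_token == "<end>":
            break
        context.append(next_token)
    return " ".join(context[1:])  # drop the <start> marker


sentence = generate()
print(sentence)  # The cat is black
```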

Note: In this illustration, we can see how language modeling could be viewed as a classification problem, with the vocabulary as the set of available classes.

How do Large Language Models work?

LLMs take a deep learning approach to language modeling.

An LLM is a language model that uses deep neural networks to generate the probability distribution of all possible next words (or tokens) in the vocabulary. Generally, that happens in three steps, which we’ll describe in more detail:

  1. The context gets tokenized, then vectorized or embedded
  2. The tokenized context passes through a transformer to generate the probability distribution of next words
  3. A token is selected by either:
     1. Taking the token with the largest probability, or
     2. Sampling from the output distribution

Tokenization

As we’ve discussed it so far, the context that’s fed into our model—in our example, the phrase, “The cat is…”—is a sequence of words.

However, because LLMs are built using neural networks, which are most commonly implemented using a sequence of matrix operations, they cannot accept words or characters as strings; rather, neural networks best handle vectors as input data. But, before text can be processed by a model as a vector, it has to be broken down into more manageable pieces. This process of breaking down language into bite-size chunks is called tokenization.

Generally, there are three ways to break language into tokens: by character, word, or sub-word.

Character tokenization is when text is broken down by character: “T”, “h”, “e”, “<space>”, “c” … The model predicts individual characters, and the vocabulary is composed of every possible character.

In our previous example, we were chunking our context down by word—“The”, “cat”, “is”... These can be viewed as tokens which can be converted back and forth between strings and vectors.

However, most commonly with LLMs, a vocabulary of sub-word tokens is created. In our example, we might have tokens like “The”, “<space>”, “c”, “at”, “is”, “bl”, “ack”, “<period>”. Instead of necessarily predicting whole words, our model might predict sub-word tokens chained one after another to make up a sentence.
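To make the three approaches concrete, here is a small sketch. The sub-word vocabulary and the greedy longest-match rule are assumptions for illustration only; production tokenizers learn their vocabularies from data and use more involved algorithms.

```python
text = "The cat is black."

# Character tokenization: every character is its own token.
char_tokens = list(text)

# Word tokenization: split on whitespace (a deliberately naive rule).
word_tokens = text.split()

# Sub-word tokenization: greedy longest-match against a toy vocabulary.
# This vocabulary is hypothetical; real LLM vocabularies are learned.
SUBWORD_VOCAB = ["The", " ", "c", "at", "is", "bl", "ack", "."]


def subword_tokenize(s, vocab):
    tokens, i = [], 0
    while i < len(s):
        # Take the longest vocabulary entry that matches at position i.
        match = max((v for v in vocab if s.startswith(v, i)), key=len, default=None)
        if match is None:
            raise ValueError(f"no token covers {s[i]!r}")
        tokens.append(match)
        i += len(match)
    return tokens


subword_tokens = subword_tokenize(text, SUBWORD_VOCAB)
print(subword_tokens)
```

Note that joining the sub-word tokens back together reproduces the original string, which is what lets tokens be converted back and forth between text and model inputs.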

Once tokenized, the input text can be either vectorized—where simple rules generate vectorizations—or embedded—where vector-representations are generated from an algorithm that has learned feature embeddings.
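The difference between the two can be sketched as follows. One-hot encoding is used here as an example of a simple rule-based vectorization, and the embedding table is filled with random numbers as a stand-in for weights that would normally be learned; both choices are assumptions for illustration.

```python
import random

vocab = ["The", "cat", "is", "black"]
token_to_id = {tok: i for i, tok in enumerate(vocab)}


def one_hot(token):
    """Rule-based vectorization: a 1 at the token's index, 0 elsewhere."""
    vec = [0.0] * len(vocab)
    vec[token_to_id[token]] = 1.0
    return vec


# Embedding table: one dense vector per token. Random values stand in for
# weights that would normally be learned during training.
rng = random.Random(0)
EMBED_DIM = 3
embedding_table = [[rng.uniform(-1, 1) for _ in range(EMBED_DIM)] for _ in vocab]


def embed(token):
    """Embedding lookup: return the token's row in the table."""
    return embedding_table[token_to_id[token]]


print(one_hot("cat"))  # [0.0, 1.0, 0.0, 0.0]
print(embed("cat"))    # a dense 3-dimensional vector
```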

Check out resources covering some of the most popular tokenization, vectorization, and embedding methods below:

Attention and the Transformer w.r.t. LLMs

Once our context (“The cat is…”) is transformed into a model-ready format, what happens to it?

We discussed how LLMs are built using neural nets. Historically, many types of language models have relied on recurrences in the network architecture. The biggest enabler of large language models, however, has been removing any recurrences from the network. The application of self-attention improves upon the already very successful architecture of recurrent neural networks, and enables the effective use of fully feed-forward networks for language modeling.

Self-attention operates under the following intuition: for each token (the query), score how relevant every token in the sequence (the keys) is to it; then use those scores as weights to combine the information carried by each token (the values).

The transformer model, proposed by Vaswani et al. in the paper “Attention Is All You Need”, is the general architecture that uses self-attention as its fundamental building block. For LLMs, we typically use decoder-only transformers, which means that query tokens can only interact with tokens earlier in the context, not with future tokens.
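A minimal sketch of single-head, causally masked self-attention is shown below in plain Python. It omits the learned projection matrices that produce queries, keys, and values in a real transformer, and the input vectors are made up; it is meant only to show the scoring, masking, and weighted-average steps.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def causal_self_attention(queries, keys, values):
    """Single-head scaled dot-product attention with a causal mask.

    queries, keys, values: lists of equal-length vectors, one per token.
    Returns (outputs, weights).
    """
    d = len(keys[0])
    outputs, all_weights = [], []
    for i, q in enumerate(queries):
        # Causal mask: token i only sees tokens 0..i, never later ones.
        scores = [
            sum(a * b for a, b in zip(q, keys[j])) / math.sqrt(d)
            for j in range(i + 1)
        ]
        weights = softmax(scores)
        # Output is the weighted average of the visible value vectors.
        out = [
            sum(w * values[j][k] for j, w in enumerate(weights))
            for k in range(len(values[0]))
        ]
        outputs.append(out)
        # Pad with zeros so each row covers the full sequence.
        all_weights.append(weights + [0.0] * (len(queries) - i - 1))
    return outputs, all_weights


# Three tokens with made-up 2-dimensional vectors; for simplicity the same
# vectors serve as queries, keys, and values.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = causal_self_attention(x, x, x)
for row in weights:
    print([round(w, 3) for w in row])
```

The first token can only attend to itself, so its attention row is entirely concentrated on position 0, while later tokens spread their weight over everything that came before them.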

A decoder-only transformer is a stack of decoder blocks applied one after another.

Note: Encoder-only and encoder-decoder variants of the transformer architecture are quite popular for NLP tasks other than language modeling—tasks like sentiment analysis and machine translation.

The inner workings of transformers and attention mechanisms are out of the scope of this article. However, the unreasonable effectiveness of transformers is due to three properties, each of which addresses a key issue with their RNN predecessors:

  1. Transformers have long-term memory: Attention allows information in every token in the context window to interact with every other token. Tokens fed one by one into RNNs were not guaranteed to have such interactions.
  2. They are extremely parallelizable: The matrix operations of self-attention were much simpler to parallelize than the recurrences in RNNs.
  3. They’re less prone to vanishing and exploding gradients: RNNs had to propagate data across time. Because transformers process all the tokens in the context at the same time, they are not as sensitive to vanishing or exploding gradients.

For more detail on how and why self-attention and the transformer have enabled the scalability of LLMs, I highly recommend the following three articles:

Selecting the Next Token

Let’s return to our toy example: “The cat is...”

We are given the sequence “The cat is” and are trying to select the next word.

Our model, ƒ, outputs a probability distribution over the vocabulary. Previously we selected our next token by taking the one with the highest probability.

However, in practice, it’s more common to sample from the probability distribution. Multinomial sampling can be done using the output distribution to select the next token. This adds a bit of stochasticity into a model’s performance. Given a prompt, it will not necessarily generate the same response twice.
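The two selection strategies can be compared with a short sketch. The probabilities are made up, and the random seed is fixed only so that the example is reproducible.

```python
import random

probs = {"blue": 0.05, "black": 0.85, "green": 0.10}  # made-up values


def greedy(probs):
    """Always return the single most likely token."""
    return max(probs, key=probs.get)


def sample(probs, rng):
    """Draw one token at random, weighted by its probability."""
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]


rng = random.Random(42)  # seeded only so the sketch is reproducible
draws = [sample(probs, rng) for _ in range(1000)]
counts = {t: draws.count(t) for t in probs}

print(greedy(probs))  # black
print(counts)         # mostly "black", with occasional "green" and "blue"
```

Greedy selection always gives the same answer for the same context, while sampling occasionally picks a less likely token, which is where the variation between responses comes from.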

How are Large Language Models trained?

Let’s take a look.

Dataset

Data comes first in most machine learning problems. In a traditional classification problem, we would need ground truth targets for every example. This is still true for language models; however, it requires very little human intervention to create a dataset.

Let’s say we scraped a whole trove of raw text straight from the internet, and let’s say the first sentence from our newly-collected, un-annotated, unfiltered dataset was: “The cat is black.” This single sentence provides our model with five training examples, each containing a context and a corresponding target:

  1. <start> → The
  2. The → cat
  3. The cat → is
  4. The cat is → black
  5. The cat is black → <end>

In this way, we can derive targets from completely unlabeled data. This enables the use of enormous datasets which are relatively cheap to collect, because they require very little human-performed preprocessing or annotation.
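Deriving these context and target pairs from raw text can be sketched as below. The function name is hypothetical, words are split on whitespace for simplicity, and, unlike the list above, the “<start>” marker is kept at the front of every context.

```python
def make_training_examples(sentence):
    """Turn one sentence into (context, target) pairs for next-token prediction."""
    tokens = ["<start>"] + sentence.split() + ["<end>"]
    # Every prefix of the sequence is a context; the token after it is the target.
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


examples = make_training_examples("The cat is black")
for context, target in examples:
    print(" ".join(context), "->", target)
```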

Problems arise, however, from relying on completely unfiltered, unlabeled data from the internet. We’ll discuss the effect this has on bias, safety, and falsity with respect to training LLMs.

Optimization

Transformer-based LLMs are feedforward neural networks. They have parameters that can be optimized so that the model can make predictions with less error.

Note: There was an earlier note where I mentioned that language modeling could be viewed as a classification problem, where the set of classes are the tokens in the vocabulary. LLMs can largely be optimized in this paradigm.

Consider a neural network, ƒ, with weights, θ. Given a dataset of contexts (x) and targets (y), the goal is to select θ such that ƒ(x) is (on average) the best model of y. In other words, we want to select θ to minimize the average measured error between ƒ(x) and y. Given some cost function, ℓ, to measure error, we might consider the following minimization problem to select an optimal θ:
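A standard way to write this objective is sketched here, assuming N training pairs (x_i, y_i) and writing ƒ(x; θ) to make the dependence on the weights explicit. The notation is an assumption, since the original formula is not reproduced in this text:

```latex
\theta^{*} = \arg\min_{\theta} \; \frac{1}{N} \sum_{i=1}^{N} \ell\big(f(x_i; \theta),\, y_i\big)
```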

The most popular and successful optimization techniques for deep neural networks utilize some form of gradient descent. Generally, approximating an optimal θ* using gradient descent can be accomplished with a relatively simple training process:

  1. Tokenize and embed the context.
  2. Run a forward pass of the context through the model to get its predicted probability distribution over the next token.
  3. Measure the language model's performance: calculate the error, or loss, between the prediction and a ground truth. (Note: Holding to our classification analogy, LLMs can use a form of cross-entropy as a cost function.)
  4. Back-propagate the loss with respect to model weights, θ, to find the gradient.
  5. Update the weights.
  6. Repeat.
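The loop above can be sketched end to end on a toy problem. To stay short and dependency-free, the “model” below is a bigram table of logits (it only looks at the last token of the context, unlike a real LLM), the gradient of the cross-entropy loss is written out by hand instead of using automatic back-propagation, and the learning rate and step count are arbitrary choices.

```python
import math

vocab = ["<start>", "The", "cat", "is", "black", "<end>"]
idx = {tok: i for i, tok in enumerate(vocab)}
V = len(vocab)

# (previous token, target) pairs from the sentence "The cat is black".
tokens = ["<start>", "The", "cat", "is", "black", "<end>"]
pairs = [(idx[a], idx[b]) for a, b in zip(tokens, tokens[1:])]

# Parameters theta: one row of logits per previous token, initialised to 0.
theta = [[0.0] * V for _ in range(V)]


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]


def average_loss():
    """Mean cross-entropy between predicted distributions and targets."""
    return sum(-math.log(softmax(theta[x])[y]) for x, y in pairs) / len(pairs)


learning_rate = 0.5
initial_loss = average_loss()
for step in range(200):
    for x, y in pairs:
        probs = softmax(theta[x])  # forward pass
        # Gradient of cross-entropy w.r.t. the logits: probs - one_hot(target).
        for k in range(V):
            grad = probs[k] - (1.0 if k == y else 0.0)
            theta[x][k] -= learning_rate * grad  # gradient descent update
final_loss = average_loss()


def predict_next(token):
    """Greedy next-token prediction from the trained table."""
    probs = softmax(theta[idx[token]])
    return vocab[probs.index(max(probs))]


print(f"loss: {initial_loss:.3f} -> {final_loss:.3f}")
print(predict_next("The"))
```

In a real LLM the same steps apply, but the forward pass runs through a transformer and the gradients are computed by a deep learning framework.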

The above is a fairly generic training process not specific to LLMs, so if you’re unfamiliar with how feed-forward neural networks are traditionally trained, I encourage you to check out the resources below:

Specifically, for an intro to building LLMs from scratch in code, Andrej Karpathy’s videos are an amazing place to start:

Challenges in Training LLMs because of Scale

The biggest challenge in pre-training LLMs from scratch is the sheer scale of resources necessary. The models themselves require enormous distributed systems, and can have anywhere from billions, to hundreds of billions, to trillions of parameters. This is only complicated by the fact that a training dataset can contain whole percentages of the entire internet’s text.

For reference, one of Meta’s newest LLMs (Llama 2, released in July 2023) was trained on their enormous Research SuperCluster and with a dataset of over two trillion tokens. In a section covering pre-training in the publication for Llama 2, the team wrote: “A cumulative of 3.3M GPU hours of computation was performed on hardware of type A100-80GB (TDP of 400W or 350W). We estimate the total emissions for training to be 539 tCO2eq, of which 100% were directly offset by Meta’s sustainability program.”
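To get a feel for these numbers, the quoted GPU hours and TDP figures can be turned into a rough energy estimate. This is back-of-the-envelope arithmetic under a strong assumption: every GPU draws exactly its TDP for the whole run, and nothing else (CPUs, networking, cooling) is counted. It is not how the paper derived its emissions figure.

```python
def gpu_energy_mwh(gpu_hours, watts_per_gpu):
    """Energy in megawatt-hours if every GPU ran at the given power draw."""
    return gpu_hours * watts_per_gpu / 1_000_000  # watt-hours -> MWh


# Figures quoted above: 3.3M GPU hours, TDP of 400W or 350W.
upper = gpu_energy_mwh(3_300_000, 400)
lower = gpu_energy_mwh(3_300_000, 350)
print(f"{lower:.0f} to {upper:.0f} MWh")  # 1155 to 1320 MWh
```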

Note: Meta published a training logbook for a model with 175B parameters: OPT-175B. While it might be too technical to read top-to-bottom, seeing the team show their work is an impressive point of reference.

Not every dev team can afford the resources for training an LLM from scratch, especially at that scale. Only highly equipped teams, normally at elite companies, have the necessary data and equipment available. As a result, most dev teams find it more practical to leverage existing, pre-trained LLMs already published as open source by fine-tuning them on the data they have available.

Pre-training vs. Fine-tuning

The primary objective of LLMs is to grasp human language. To that end, huge unlabeled portions of the internet (an enormous breadth of content) have enabled LLMs to gain general proficiency in generating language. This stage, which makes up the large majority of LLM training and uses a very broad training dataset, is called pre-training.

Large, more general, pre-trained models are often fine-tuned by retraining portions of the model in order to make a large language model adopt specific behavior.

For example, ChatGPT has been shown to perform well on the SAT, but what if we wanted it to perform better? We might construct a more supervised training methodology, with a specific dataset of SAT questions, and further train—or fine-tune—ChatGPT to perform better when answering SAT questions.

Pre-trained models are also fine-tuned to be more factual, to follow privacy guidelines, and to generate more fluent language—often using human-generated feedback. I encourage you to check out the following resources to learn more about pre-training vs. fine-tuning:

**💡 Pro tip: Learn more about LLM Fine Tuning.**

LLMs as Foundation Models

This perspective (utilizing LLMs trained for very general performance, then fine-tuning them for a specific purpose) comes from the more general development of foundation models.

A Stanford survey paper provides a definition for foundation models: “A foundation model is any model that is trained on broad data (generally using self-supervision at scale) that can be adapted (e.g., fine-tuned) to a wide range of downstream tasks.”

Emergent Abilities of Large Language Models

We’ve discussed how LLMs work. But not why. So, why do large language models work?

This is an entire area of research outside the scope of this introduction, but, miraculously, deep learning models seem to defy (to a degree) the statistician’s notion of over-parameterization. Given sufficient data, larger LLMs don’t overfit. Instead, as they grow in size (number of parameters), certain properties emerge—fluency, nascent reasoning abilities, etc.

This phenomenon has been well-documented over the last few years. A paper by Wei et al. studies the threshold that seems to occur as model parameters increase. When this threshold is crossed—once a model reaches a certain size—model performance improves significantly and spontaneously. Check out the paper, “Emergent Abilities of Large Language Models.”

Applications of LLMs

Below is a (not-at-all-exhaustive) list of proven LLM-based applications. They are each paired with some resources on popular large language models in industry.

Text Generation: These are the most generic LLMs. They can create human-like text based on a given prompt or context, making them useful for all sorts of applications: content creation, technical writing, and more.

Chatbots, Virtual Assistants, and Conversational AI: These models have been fine-tuned to have interactive and human-like conversations with users. They can answer a series of questions, provide information, assistance, and entertainment.

Text Summarization: LLMs can automatically condense longer texts into shorter, coherent summaries, retaining the main ideas and key points of the original content. This can be done with general models, or ones fine-tuned for the task.

Code Generation: LLMs can speed up development processes for programmers. They can be fine-tuned to generate code snippets, functions, or even entire programs.

Content Creation: LLMs can help automate the creation of diverse content, such as blog posts, social media updates, technical documentation, and marketing materials. They can even write stories and tell jokes.

Search and Information Retrieval: LLMs are being combined with search methods to summarize search results so they are easier to digest.

**💡 Pro tip: Evaluating the performance and reliability of LLMs is paramount. Explore Lakera's insights on Large Language Model Evaluation to ensure your models deliver accurate and consistent results.**

Limitations of LLMs

Rationality and Hallucinations

LLMs don’t understand natural language; they are only fluent in using it. As a result, they are prone to logical mistakes and to hallucinations: fluent, confident statements that are false or unsupported. This has proven a challenge for the development of state-of-the-art LLMs. Check out OpenAI’s work to mitigate logical mistakes. Behavior that is not just rational, but consistently rational, is important for artificial intelligence safety in the real world.

Bias and Ethical Considerations

LLMs are trained on troves of data from the internet, and most of this data is not filtered for quality. There can be hateful, biased, and unsafe content in some of these enormous, naively collected datasets. As a result, AI models can learn to generate text that’s unfair and unethical. There is a lot of work to be done on methods of dataset engineering and model fine-tuning to create safe, ethical, and equitable models. One example is an OpenAI article about an enormous multi-modal deep learning algorithm generating images from textual context.

Privacy and Security

Can LLMs learn to keep our most sensitive information private? There are many security concerns about the risks of LLMs giving away secret information found in training data. Many of these concerns were raised as developers re-train models on human-written text that users have provided during interactions with LLMs, much of which could contain private, sensitive, or confidential information.

The UK’s National Cyber Security Centre (NCSC) released an article in which they maintain that large language models, despite their abilities, pose risks to the security and privacy of users.

Lakera Guard

Lakera Guard is a tool to protect LLM users and developers from prompt injections, leakage of sensitive information, hallucinations, unethical content, and other LLM vulnerabilities.

It serves as a barrier between LLMs and users, not only protecting users from receiving sensitive or unethical responses to prompts, but also protecting the model from being attacked by malicious agents.

Check out Lakera Guard for super fast integration into your training and deployment platforms.

Key Takeaways

Here’s a quick summary of everything we covered:

  • LLMs are large neural networks used to generate the next element in a sequence of text.
  • How do LLMs work?
      • Tokenization and embedding of the context.
      • The context passes through the model to generate the probability distribution over the vocabulary.
      • A token is sampled from the output distribution and added to the context to select the next token.
  • LLMs are trained on a huge dataset of unlabeled text.
  • They are optimized like any other machine learning model, using a form of gradient descent to optimize the prediction of tokens.
  • Training LLMs from scratch requires an immense amount of resources, and most developers rely on LLMs as foundation models, which can be fine-tuned for domain-specific purposes.
  • LLMs have emergent abilities that appear as they grow in size.
  • LLMs are capable text generators, chatbots, text summarizers, code generators, and content creators. These abilities can be applied across almost any industry relying on language.
  • LLMs are limited in their ability to: reason logically, be ethical and fair, and maintain privacy and security.