Introduction to Large Language Models: Everything You Need to Know for 2025 [+Resources]

Large language models (LLMs) are driving many of the recent advancements in AI. But what makes them so impactful, and how do they actually work? This guide explains what LLMs are, how they’re used, and the different ways they can be implemented—along with practical examples.

Avi Bewtra

October 20, 2023

Last updated:

May 21, 2025

Large language models (LLMs) have single-handedly propped up the current wave of the AI boom. There’s a lot of hype, and some of it for good reason. Bigger and better LLMs are released each month, and each is more applicable in virtually every industry that relies on language: social media, healthcare, education, law, finance, scientific research…

When ChatGPT was released, the public saw an AI agent approaching human-level performance. And while tech like this had existed before—in GPT and GPT-2, for example—it had never been so accessible.

Since then, a slew of new LLMs has pushed the state of the art forward. You’ve almost definitely tried some of them, but if you haven’t, I encourage you to ask ChatGPT a few questions. You’ll likely be shocked.

We know LLMs work. We can see that.

But what, exactly, is going on under the hood? What are some examples of LLMs and the different ways they can be implemented? Where are they most valuable? And where might they fail?

You can find the answers to these and many more questions in this article.

Here’s what we’ll cover:

What is a Language Model?
How do Large Language Models work?
How are Large Language Models trained?
Applications of Large Language Models
Limitations of Large Language Models

If you’re getting up to speed on LLMs, these articles go beyond the basics—covering the risks, attack methods, and defense mechanisms every team should understand:

Start with one of the most common attack types—this prompt injection guide shows how attackers manipulate outputs with language alone.
Understand the difference between prompt injection and direct prompt injection, and why both matter.
Learn how LLM jailbreaking works—and why guardrails don’t always hold up.
Explore the hidden dangers in your dataset with this training data poisoning overview.
For broader perspective, this AI security fundamentals post covers how GenAI systems reshape security best practices.
When you’re ready to ship, this post on how to deploy an LLM walks through the technical and security layers to consider.
And if you’re testing or benchmarking different models, AI red teaming helps uncover risks before users find them.

What is a Language Model?

Natural language processing (NLP)—the field of research pursuing ways for machines to master language—has had a long history. One of the most significant research areas of NLP has been language modeling. So before getting to large language models, we’ll consider language models more generally.

Fundamentally, a language model is one that predicts the next element in a sequence of text.

Typically, we’ll provide the model with some sort of context to make its prediction. Consider the following phrase as the model’s context: “The cat is...”

Let’s say we wanted to build a language model to predict the next word in the sentence. And, let’s say we are given the vocabulary—the set of possible words that could come next. A common way to make a prediction would be to calculate the probability of each word in the vocabulary coming next, given the set of previous words in the sequence—the context.

If our vocabulary were limited to “blue”, “black”, and “green”, we might build a model that calculated the following.

We humans can intuit that the most likely word to come next is black. (I haven’t seen too many blue or green cats.) The model would have to represent this intuitive pattern in the probability distribution it returns. In our example, the word with the highest probability is “black”, giving us our predicted sentence: “The cat is black.”

Ok. So we know our model has to return a probability distribution of all the words. Let’s consider this model as a function, ƒ, which takes the context, and returns the probability distribution over our vocabulary. In our example, that looks like:
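As a minimal sketch, the function ƒ can be written as ordinary code that maps a context to a probability for each word in the vocabulary. The probability values here are invented for illustration, and the hard-coded lookup stands in for whatever a real model would compute.

```python
def f(context):
    """Toy language model: context in, probability distribution out.

    The probabilities are made up for illustration; a real model
    would compute them from the context.
    """
    if context == "The cat is":
        return {"blue": 0.05, "black": 0.85, "green": 0.10}
    # Fall back to a uniform distribution for any other context.
    return {"blue": 1 / 3, "black": 1 / 3, "green": 1 / 3}


distribution = f("The cat is")

# Predict the word with the highest probability.
prediction = max(distribution, key=distribution.get)
print(f"The cat is {prediction}.")
```

The only requirement on the output is that it is a valid distribution: one non-negative value per vocabulary word, summing to one.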

Here, ƒ is our language model. How it generates this probability distribution has been a research topic for more than fifty years. It’s been an iterative development: from statistical Markov methods, through neural network architectures including RNNs, LSTMs, and eventually, transformer models.

But for all of these architectures, our model, ƒ, can be used to generate whole phrases by adding each predicted word to our original context for the subsequent prediction. Let’s look at another example.

Consider a similar simple vocabulary: {“<start>”, “The”, “cat”, “is”, “blue”, “black”, “green”, “<end>”}

Here, “<start>” denotes an input to the model with no context, and “<end>” denotes the prediction that the sequence is complete.

If we’re trying to build a model, ƒ, to make sentences, we’d predict the first token with the following. (Note: I’ve made up the probability values for illustration.)

In other words, by feeding the “<start>” token to the model, we generate the probability distribution over the vocabulary given no context. Because “The” has the highest probability, we’ll select it as the next word. It is then added to our context, and we can generate the next probability distribution. This process repeats as follows until our model predicts the sequence is over with the “<end>” token.

This gives us the completed sentence, “The cat is black”.
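The generation loop described above can be sketched in a few lines. The lookup table `TABLE` is a hypothetical stand-in for a trained model, and every probability in it is invented for illustration; only the loop structure (predict, append, repeat until “<end>”) reflects the process in the text.

```python
# Hypothetical next-token distributions, keyed by the context so far.
# All probabilities are invented for illustration.
TABLE = {
    ("<start>",): {"The": 0.90, "cat": 0.05, "is": 0.05},
    ("<start>", "The"): {"cat": 0.80, "is": 0.10, "<end>": 0.10},
    ("<start>", "The", "cat"): {"is": 0.90, "<end>": 0.10},
    ("<start>", "The", "cat", "is"): {"black": 0.70, "green": 0.20, "blue": 0.10},
    ("<start>", "The", "cat", "is", "black"): {"<end>": 0.95, "is": 0.05},
}


def f(context):
    """Stand-in language model: look up the distribution for a context."""
    return TABLE[tuple(context)]


def generate(max_tokens=10):
    context = ["<start>"]
    for _ in range(max_tokens):
        distribution = f(context)
        # Select the token with the highest probability.
        next_token = max(distribution, key=distribution.get)
        if next_token == "<end>":
            break  # the model predicts the sequence is complete
        context.append(next_token)  # the prediction becomes part of the context
    return " ".join(context[1:])  # drop the "<start>" marker


sentence = generate()
print(sentence)
```

The `max_tokens` cap is a practical safeguard: nothing guarantees that a model will ever predict “<end>”.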

Note: In this illustration, we can see how language modeling could be viewed as a classification problem, with the vocabulary as the set of available classes.

How do Large Language Models work?

LLMs take a deep learning approach to language modeling.

An LLM is a language model that uses deep neural networks to generate the probability distribution of all possible next words (or tokens) in the vocabulary. Generally, that happens in three steps, which we’ll describe in more detail:

The context gets tokenized, then vectorized or embedded
The embedded context passes through a transformer to generate the probability distribution over next tokens
A token is selected by either:

Taking the token with the largest probability, or
Sampling from the output distribution

Tokenization

As we’ve discussed it so far, the context that’s fed into our model—in our example, the phrase, “The cat is…”—is a sequence of words.

However, because LLMs are built using neural networks, which are most commonly implemented using a sequence of matrix operations, they cannot accept words or characters as strings; rather, neural networks best handle vectors as input data. But, before text can be processed by a model as a vector, it has to be broken down into more manageable pieces. This process of breaking down language into bite-size chunks is called tokenization.

Generally, there are three ways to break language into tokens: by character, word, or sub-word.

Character tokenization is when text is broken down by character: “T”, “h”, “e”, “<space>”, “c” … The model predicts individual characters, and the vocabulary is composed of every possible character.

In our previous example, we were chunking our context down by word—“The”, “cat”, “is”... These can be viewed as tokens which can be converted back and forth between strings and vectors.

However, most commonly with LLMs, a vocabulary of sub-word tokens is created. In our example, we might have tokens like “The”, “<space>”, “c”, “at”, “is”, “bl”, “ack”, “<period>”. Instead of necessarily predicting whole words, our model might predict sub-word tokens chained one after another to make up a sentence.
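To make the three options concrete, here is a rough sketch of each applied to the same sentence. The sub-word vocabulary `SUBWORD_VOCAB` and the greedy longest-match rule are assumptions chosen for simplicity; production tokenizers typically learn their vocabulary from data (for example with byte-pair encoding) rather than using a hand-written set.

```python
text = "The cat is black."

# Character tokenization: every character is its own token.
char_tokens = list(text)

# Word tokenization (naive): split on whitespace.
word_tokens = text.split()

# Sub-word tokenization: greedy longest match against a small,
# hypothetical vocabulary.
SUBWORD_VOCAB = {"The", " ", "c", "at", "is", "bl", "ack", "."}


def subword_tokenize(text, vocab):
    tokens, i = [], 0
    while i < len(text):
        # Try the longest possible piece first, then shrink.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            raise ValueError(f"no token matches the text at position {i}")
    return tokens


subword_tokens = subword_tokenize(text, SUBWORD_VOCAB)
print(subword_tokens)
```

Joining the sub-word tokens back together reproduces the original text, which is what lets a model emit them one after another to form a sentence.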

Once tokenized, the input text can be either vectorized—where simple rules generate vectorizations—or embedded—where vector-representations are generated from an algorithm that has learned feature embeddings.
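The difference between the two can be sketched with NumPy. One-hot encoding is used here as an example of a simple vectorization rule, and the embedding matrix is filled with random numbers purely as a placeholder; in a real model those values would be learned. The vocabulary and the dimension of 8 are arbitrary choices for illustration.

```python
import numpy as np

vocab = ["The", "cat", "is", "black"]
token_to_id = {token: i for i, token in enumerate(vocab)}

# Map each token in the context to its integer id.
ids = [token_to_id[t] for t in ["The", "cat", "is"]]

# Vectorization by a simple rule: one-hot vectors, one row per token.
one_hot = np.eye(len(vocab))[ids]            # shape (3, 4)

# Embedding: look up one row per token in a matrix of vectors.
# Random placeholder values here; learned during training in practice.
rng = np.random.default_rng(0)
embedding_dim = 8
embedding_matrix = rng.normal(size=(len(vocab), embedding_dim))
embedded = embedding_matrix[ids]             # shape (3, 8)

print(one_hot.shape, embedded.shape)
```

An embedding lookup is equivalent to multiplying the one-hot vectors by the embedding matrix, which is why the two views are often used interchangeably.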

Check out resources covering some of the most popular tokenization, vectorization, and embedding methods below:

Attention and the Transformer w.r.t. LLMs

Once our context (“The cat is…”) is transformed into a model-ready format, what happens to it?

We discussed how LLMs are built using neural nets. Historically, many types of language models have relied on recurrences in the network architecture. The biggest enabler of large language models, however, has been removing any recurrences from the network. The application of self-attention improves upon the already very successful architecture of recurrent neural networks, and enables the effective use of fully feed-forward networks for language modeling.

Self-attention operates under the following intuition: for a given token (the query), calculate how important every token in the sequence (the keys) is to it; then use those importances as weights to combine the information each token carries (the values) into a new representation of the query token.

The transformer model, proposed by Vaswani et al. in the paper “Attention Is All You Need”, is the general architecture that uses self-attention as its fundamental building block. For LLMs, we use decoder-only transformers, which means that each query token can only interact with tokens at or before its own position in the context, not with future tokens.

*In a decoder-only transformer, decoder blocks are stacked one after another.*
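For readers who want to see the mechanics, below is a minimal single-head sketch of masked (causal) self-attention in NumPy. It follows the scaled dot-product form from the Vaswani et al. paper, but the random weight matrices, the tiny dimensions, and the function names are all assumptions for illustration. A real decoder block adds multiple heads, residual connections, normalization, and a feed-forward layer.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def causal_self_attention(X, W_q, W_k, W_v):
    """Single-head self-attention in which tokens cannot see the future."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    # Importance of every key to every query, scaled by sqrt(key dimension).
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    # Mask out positions after the query (strictly above the diagonal).
    T = X.shape[0]
    future = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ V, weights              # weighted sum of the values


rng = np.random.default_rng(0)
T, d = 3, 4                                  # e.g. "The", "cat", "is"; 4-dim embeddings
X = rng.normal(size=(T, d))                  # placeholder embedded context
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))

output, weights = causal_self_attention(X, W_q, W_k, W_v)
print(np.round(weights, 2))
```

The printed weight matrix is lower-triangular: the first token attends only to itself, while the last token attends to the whole context.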

Note: Encoder-only and encoder-decoder variants of the transformer architecture are quite popular for NLP tasks other than language modeling—tasks like sentiment analysis and machine translation.

The inner workings of transformers and attention mechanisms are out of the scope of this article. However, the unreasonable effectiveness of transformers is due to three properties, each of which addresses a key issue with their RNN predecessors:

Transformers have long-term memory: Attention allows information in every token in the context window to interact with every other token. Tokens fed one by one into RNNs were not guaranteed to have such interactions.
They are extremely parallelizable: The matrix operations of self-attention are much simpler to parallelize than the recurrences in RNNs.
They’re less prone to vanishing and exploding gradients: RNNs had to propagate data across time. Because transformers process all the tokens in the context at the same time, they are not as sensitive to vanishing or exploding gradients.

For more detail on how and why self-attention and the transformer have enabled the scalability of LLMs, I highly recommend the following three articles:

Selecting the Next Token

Let’s return to our toy example: “The cat is...”

Suppose we are given the sequence “The cat is” and we’re trying to select the next word.

Our model, ƒ, outputs a probability distribution over the vocabulary. Previously we selected our next token by taking the one with the highest probability.

However, in practice, it’s more common to sample from the probability distribution. Multinomial sampling can be done using the output distribution to select the next token. This adds a bit of stochasticity to a model’s output: given a prompt, it will not necessarily generate the same response twice.
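The two selection strategies can be compared side by side. The distribution below reuses invented probabilities for the toy example, and the fixed seed is only there to make the sketch repeatable.

```python
import random

# Invented output distribution for the context "The cat is".
distribution = {"blue": 0.05, "black": 0.85, "green": 0.10}


def greedy(dist):
    """Always pick the most probable token."""
    return max(dist, key=dist.get)


def sample(dist, rng):
    """Draw one token at random, in proportion to its probability."""
    tokens = list(dist)
    weights = [dist[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


rng = random.Random(0)  # seeded so the sketch is repeatable
samples = [sample(distribution, rng) for _ in range(1000)]

print(greedy(distribution))
print({token: samples.count(token) for token in distribution})
```

Greedy selection returns the same token every time, whereas sampling mostly returns the most probable token but occasionally picks one of the others.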

How are Large Language Models trained?

Let’s take a look.

Dataset

Data comes first in most machine learning problems. In a traditional classification problem, we would need ground truth targets for every example. This is still true for language models; however, creating a dataset requires very little human intervention.

Let’s say we scraped a whole trove of raw text straight from the internet, and let’s say the first sentence from our newly-collected, un-annotated, unfiltered dataset was: “The cat is black.” This single sentence provides our model with five training examples, each containing a context and a corresponding target:

<start> → The
The → cat
The cat → is
The cat is → black
The cat is black → <end>

In this way, we can derive targets from completely unlabeled data. This enables the use of enormous datasets which are relatively cheap to collect, because they require very little human-performed preprocessing or annotation.
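Deriving those context and target pairs is mechanical, as this small sketch shows. Splitting on whitespace and stripping the final period are simplifications; a real pipeline would run a tokenizer over the raw text instead.

```python
def make_training_examples(sentence):
    """Turn one raw sentence into (context, target) training pairs."""
    words = sentence.rstrip(".").split()
    sequence = ["<start>"] + words + ["<end>"]
    examples = []
    for i in range(1, len(sequence)):
        # The context is everything before position i; an empty
        # context is represented by the "<start>" marker.
        context = " ".join(sequence[1:i]) or "<start>"
        target = sequence[i]
        examples.append((context, target))
    return examples


examples = make_training_examples("The cat is black.")
for context, target in examples:
    print(f"{context} → {target}")
```

No annotation step is involved: the targets come directly from the text itself.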

Problems arise, however, from relying on completely unfiltered, unlabeled data from the internet. We’ll discuss the effect this has on bias, safety, and falsity with respect to training LLMs.

Optimization

Transformer-based LLMs are feed-forward neural networks. They have parameters that can be optimized so that the model can make predictions with less error.

Note: As mentioned earlier, language modeling can be viewed as a classification problem, where the set of classes is the set of tokens in the vocabulary. LLMs can largely be optimized in this paradigm.

Consider a neural network, ƒ, with weights, θ. Given a dataset of contexts (x) and targets (y), the goal is to select θ such that ƒ(x) is (on average) the best model of y. In other words, we want to select θ to minimize the average measured error between ƒ(x) and y. Given some cost function, ℓ, to measure error, we might consider the following minimization problem to select an optimal θ:
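One standard way to write this minimization is shown below. The notation is an assumption made here for concreteness: the dataset has N pairs (x_i, y_i), and ƒ(x; θ) makes the dependence on the weights explicit.

```latex
\theta^{*} = \arg\min_{\theta} \; \frac{1}{N} \sum_{i=1}^{N} \ell\big( f(x_i; \theta),\, y_i \big)
```

Here θ* denotes the weights that minimize the average error over the dataset.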

The most popular and successful optimization techniques for deep neural networks utilize some form of gradient descent. Generally, approximating an optimal θ* using gradient descent can be accomplished with a relatively simple training process:

Tokenize and embed the context.
Run a forward pass of the context through the model to get the predicted probability distribution over the next token.
Measure the language model's performance: calculate the error, or loss, between the prediction and a ground truth. (Note: Holding to our classification analogy, LLMs can use a form of cross-entropy as a cost function.)
Back-propagate the loss to find its gradient with respect to the model weights, θ.
Update the weights.
Repeat.
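The loop above can be sketched end to end on the toy sentence. To keep it short, the “model” here is an assumption far simpler than a transformer: a single weight matrix `theta` that maps the last context token to next-token logits (a bigram model). The cross-entropy loss is computed from the predicted distribution, and because the model is so simple its gradient is written out by hand rather than obtained by automatic back-propagation. The learning rate and step count are arbitrary.

```python
import numpy as np

vocab = ["<start>", "The", "cat", "is", "black", "<end>"]
token_id = {t: i for i, t in enumerate(vocab)}

# (last context token, target) pairs derived from "The cat is black".
pairs = [("<start>", "The"), ("The", "cat"), ("cat", "is"),
         ("is", "black"), ("black", "<end>")]
x = np.array([token_id[c] for c, _ in pairs])   # tokenized contexts
y = np.array([token_id[t] for _, t in pairs])   # ground-truth targets

rng = np.random.default_rng(0)
theta = rng.normal(scale=0.1, size=(len(vocab), len(vocab)))  # weights


def forward(theta, x):
    """Forward pass: row lookup gives logits, softmax gives probabilities."""
    logits = theta[x]
    logits = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)


def cross_entropy(probs, y):
    """Average negative log-probability assigned to the true next token."""
    return -np.mean(np.log(probs[np.arange(len(y)), y]))


learning_rate = 1.0
losses = []
for step in range(200):
    probs = forward(theta, x)                    # forward pass
    losses.append(cross_entropy(probs, y))       # measure the loss
    # Gradient of the loss with respect to the logits: (probs - one_hot) / N.
    grad_logits = probs.copy()
    grad_logits[np.arange(len(y)), y] -= 1.0
    grad_logits /= len(y)
    # Route each example's gradient to the row of theta it used.
    grad_theta = np.zeros_like(theta)
    np.add.at(grad_theta, x, grad_logits)
    theta -= learning_rate * grad_theta          # update the weights

print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

After training, taking the most probable token for each context recovers the sentence the model was trained on.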

The above is a fairly generic training process not specific to LLMs, so if you’re unfamiliar with how feed-forward neural networks are traditionally trained, I encourage you to check out the resources below:

Specifically, for an intro to building LLMs from scratch in code, Andrej Karpathy’s videos are an amazing place to start:

Challenges in Training LLMs because of Scale

The biggest challenge in pre-training LLMs from scratch is the sheer scale of resources necessary. The models themselves require enormous distributed systems, and can have anywhere from billions, to hundreds of billions, to trillions of parameters. This is further complicated by the fact that a training dataset can contain whole percentages of the entire internet’s text.

For reference, one of Meta’s LLMs, Llama 2 (released in July 2023), was trained on their enormous Research SuperCluster with a dataset of roughly two trillion tokens. In a section covering pre-training in the publication for Llama 2, the team wrote: “A cumulative of 3.3M GPU hours of computation was performed on hardware of type A100-80GB (TDP of 400W or 350W). We estimate the total emissions for training to be 539 tCO2eq, of which 100% were directly offset by Meta’s sustainability program.”

Note: Meta published a training logbook for a model with 175B parameters: OPT-175B. While it might be too technical to read top-to-bottom, seeing the team show their work is an impressive point of reference.

Not every dev team can afford the resources for training an LLM from scratch, especially at that scale. Only highly equipped teams, normally at elite companies, have the necessary data and equipment available. As a result, most dev teams find it more practical to take existing, pre-trained LLMs already published as open source and fine-tune them on the data they have available.

Pre-training vs. Fine-tuning

The primary objective of LLMs is to grasp human language. To that end, huge unlabeled portions of the internet, spanning an enormous breadth of content, have enabled LLMs to gain general proficiency in generating language. This stage, which accounts for the large majority of LLM training and uses a very broad dataset, is called pre-training.

Large, more general, pre-trained models are often fine-tuned by retraining portions of the model in order to make a large language model adopt specific behavior.

For example, ChatGPT has been shown to perform well on the SAT, but what if we wanted it to perform better? We might construct a more supervised training methodology, with a specific dataset of SAT questions, and further train—or fine-tune—ChatGPT to perform better when answering SAT questions.
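As a rough illustration of retraining only a portion of a model, the sketch below freezes a “pre-trained” base layer and updates only a small head on a new dataset. Everything in it is hypothetical: the two-layer network, the random stand-in data, and the hyperparameters. Real fine-tuning is done with a deep learning framework and can follow several strategies, including updating all of the weights.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a pre-trained model: a frozen base and a trainable head.
W_base = rng.normal(size=(4, 8))               # "pre-trained" weights, frozen
W_head = rng.normal(scale=0.1, size=(8, 3))    # the portion we retrain

# Hypothetical task-specific dataset: 32 examples, 3 classes.
X = rng.normal(size=(32, 4))
y = rng.integers(0, 3, size=32)


def loss_and_head_grad(W_head):
    """Cross-entropy loss and its gradient with respect to the head only."""
    H = np.tanh(X @ W_base)                    # features from the frozen base
    logits = H @ W_head
    logits -= logits.max(axis=1, keepdims=True)
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    loss = -np.mean(np.log(probs[np.arange(len(y)), y]))
    d_logits = probs
    d_logits[np.arange(len(y)), y] -= 1.0
    d_logits /= len(y)
    return loss, H.T @ d_logits


W_base_before = W_base.copy()
initial_loss, _ = loss_and_head_grad(W_head)
for _ in range(200):
    _, grad = loss_and_head_grad(W_head)
    W_head -= 0.1 * grad                       # only the head is updated
final_loss, _ = loss_and_head_grad(W_head)

print(f"loss: {initial_loss:.3f} -> {final_loss:.3f}")
```

The base weights are untouched by the loop, so whatever general ability they encode is kept while the head adapts to the new task.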

Pre-trained models are also fine-tuned to be more factual, to follow privacy guidelines, and to generate more fluent language—often using human-generated feedback. I encourage you to check out the following resources to learn more about pre-training vs. fine-tuning:

**💡 Pro tip: Learn more about LLM Fine Tuning.**

LLMs as Foundation Models

This perspective, of taking LLMs trained for very general performance and then fine-tuning them for a specific purpose, comes from the more general idea of foundation models.

A Stanford survey paper provides a definition for foundation models: “A foundation model is any model that is trained on broad data (generally using self-supervision at scale) that can be adapted (e.g., fine-tuned) to a wide range of downstream tasks.”

Emergent Abilities of Large Language Models

We’ve discussed how LLMs work. But not why. So, why do large language models work?

This is an entire area of research outside the scope of this introduction, but, miraculously, deep learning models seem to defy (to a degree) the statistician’s notion of over-parameterization. Given sufficient data, larger LLMs don’t overfit. Instead, as they grow in size (number of parameters), certain properties emerge—fluency, nascent reasoning abilities, etc.

This phenomenon has been well-documented over the last few years. A paper by Wei et al. studies the threshold that seems to occur as model parameters increase. When this threshold is crossed—once a model reaches a certain size—model performance improves significantly and spontaneously. Check out the paper, “Emergent Abilities of Large Language Models.”

Applications of LLMs

Below is a (not-at-all-exhaustive) list of proven LLM-based applications. They are each paired with some resources on popular large language models in industry.

Text Generation: These are the most generic LLMs. They can create human-like text based on a given prompt or context, making them useful for all sorts of applications: content creation, technical writing, and more.

Chatbots, Virtual Assistants, and Conversational AI: These models have been fine-tuned to have interactive and human-like conversations with users. They can answer a series of questions, provide information, assistance, and entertainment.

Text Summarization: LLMs can automatically condense longer texts into shorter, coherent summaries, retaining the main ideas and key points of the original content. This can be done with general models, or ones fine-tuned for the task.

Summarization with Hugging Face

Code Generation: LLMs can speed up development processes for programmers. They can be fine-tuned to generate code snippets, functions, or even entire programs.

GitHub Copilot (w/ OpenAI): Paper and Demo
Meta’s Code Llama

Content Creation: LLMs can help automate the creation of diverse content, such as blog posts, social media updates, technical documentation, and marketing materials. They can even write stories and tell jokes.

Jasper

Search and Information Retrieval: LLMs are being combined with search methods to summarize search results so they are easier to digest.

**💡 Pro tip: Evaluating the performance and reliability of LLMs is paramount. Explore Lakera's insights on Large Language Model Evaluation to ensure your models deliver accurate and consistent results.**

Limitations of LLMs

Rationality and Hallucinations

LLMs don’t understand natural language; they are only fluent in using it. As a result, they are prone to logical mistakes and to hallucinations, in which the model produces fluent, confident text that is false or unsupported. This has proven a challenge for the development of state-of-the-art LLMs. Check out OpenAI’s work to mitigate logical mistakes. Not only rational, but consistently rational behavior is important for artificial intelligence safety in the real world.

Bias and Ethical Considerations

LLMs are trained on troves of data from the internet, and most of this data is not filtered for quality. There can be hateful, biased, and unsafe content in some of these enormous, naively collected datasets. As a result, AI models can learn to generate text that’s unfair and unethical. There is a lot of work to be done on methods of dataset engineering and model fine-tuning to create safe, ethical, and equitable models. One example is an OpenAI article about an enormous multi-modal deep learning algorithm generating images from textual context.

Privacy and Security

Can LLMs learn to keep our most sensitive information private? There are many security concerns about the risks of LLMs giving away secret information found in training data. Many of these concerns were raised as developers re-train models on human-written text that users have provided during interactions with LLMs, much of which could contain private, sensitive, or confidential information.

The UK’s National Cyber Security Centre (NCSC) published an article in which it maintains that large language models, despite their abilities, pose risks to the security and privacy of users.

Key Takeaways

Here’s a quick summary of everything we covered:

LLMs are large neural networks used to generate the next element in a sequence of text.
How do LLMs work?
Tokenization and embedding of the context.
The context passes through the model to generate the probability distribution over the vocabulary.
A token is sampled from the output distribution and added to the context to select the next token.
LLMs are trained on a huge dataset of unlabeled text.
They are optimized like any other machine learning model, using a form of gradient descent to optimize the prediction of tokens.
Training LLMs from scratch requires an immense amount of resources, and most developers rely on LLMs as foundation models, which can be fine-tuned for domain-specific purposes.
LLMs have emergent abilities that appear as they grow in size.
LLMs are capable text generators, chatbots, text summarizers, code generators, and content creators. These abilities can be applied across almost any industry relying on language.
LLMs are limited in their ability to: reason logically, be ethical and fair, and maintain privacy and security.