
Andrej Karpathy Deep Dive on LLM (Part 2): Understanding Training and Inference through GPT-2 and Llama 3.1

Continuing from Part 1, today we will look at two examples: GPT-2 and Llama 3.1.




GPT-2



GPT-2: Training and Inference

GPT-2, released by OpenAI in 2019, was a landmark model. Its core technology is still in use today, just with a significant increase in scale and computational power.

Basic parameters of GPT-2

  • With about 1.6 billion parameters, GPT-2 is relatively small by today's standards.
  • Its maximum context length was 1,024 tokens, while the context window of the latest models has been expanded to hundreds of thousands or even millions of tokens.
  • It was trained on roughly 100 billion tokens; compared to the trillions of tokens used to train modern models, it is much smaller.
  • Andrej himself replicated GPT-2: https://github.com/karpathy/llm.c/discussions/677

GPT-2 Training Process

The essence of GPT-2 training is next-token prediction: repeatedly predicting the next token in the training data and learning from the errors.

  1. At the start of training, the model's weights are random, so the initial output is completely random.
  2. At each step, the model predicts the next token for a batch of about 1 million training tokens and calculates the error (Loss) of the current prediction.
  3. The weights are then updated to reduce the loss, improving the accuracy of the next-token prediction.
  4. This process repeats for many thousands of steps, each time updating the model's weights to gradually enhance its predictive ability. Each step takes about 7 seconds.

  5. As training progresses, the generated text evolves from random characters to coherent and readable content.
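The loss in step 2 can be illustrated with a toy next-token example (a minimal sketch in plain Python, not GPT-2's actual code): the model turns raw scores into a probability distribution with softmax, and the loss is the negative log-probability it assigned to the correct next token. The scores below are made up for illustration.

```python
import math

def softmax(logits):
    """Convert raw scores into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, true_token):
    """Loss = -log(probability the model gave to the correct token)."""
    return -math.log(probs[true_token])

# Toy vocabulary of 5 tokens; the correct next token is index 2.
random_logits = [0.0, 0.0, 0.0, 0.0, 0.0]   # untrained: uniform guess
trained_logits = [0.1, 0.2, 3.0, 0.1, 0.3]  # after training: confident guess

loss_before = cross_entropy(softmax(random_logits), 2)
loss_after = cross_entropy(softmax(trained_logits), 2)

print(f"loss with random weights:  {loss_before:.3f}")  # log(5) ≈ 1.609
print(f"loss after some training: {loss_after:.3f}")    # smaller
```

Training drives the loss down by nudging the weights so that the correct token gets more probability, which is exactly what "improving next-token prediction" means.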

Training Cost

  • The estimated cost of training GPT-2 in 2019 was about $40,000.
  • In contrast, the cost of training a model of the same scale today has decreased to a few hundred dollars.

The main reasons for the decrease in training costs:

  1. Better data: higher-quality dataset filtering, reducing useless data and improving training efficiency.
  2. Faster hardware, such as a significant increase in GPU computing power, speeding up training.
  3. Better software, such as more efficient training frameworks, allowing the same computing resources to accomplish more tasks.

Training Progress

  • In the early stages of training, the generated text is random gibberish.

  • Midway through training, the model starts producing locally plausible fragments, but still lacks overall logic.

  • By the end of training, the output becomes coherent and readable, with significantly improved accuracy in predicting the next token.

Computing Resources and GPUs

  • Training requires massive amounts of computation, which is difficult for personal computers to achieve.
  • Modern AI training relies on cloud-based GPU clusters; for example, Andrej's own replication of GPT-2 used a single node of 8 NVIDIA H100 GPUs.

  • ".






Llama 3.1



Llama 3.1 Base Model Inference

Base models are not assistants, because they do not engage in dialogue or execute instructions.

How Base Models Work

  • A base model is essentially a token simulator that generates token sequences based on the statistical patterns of training data, rather than an interactive assistant.
  • It does not truly understand the intent of a question; it merely selects tokens based on probability, similar to an advanced autocomplete.
  • A base model can already generate fluent text, but typically requires further fine-tuning for practical applications to become a useful assistant model (Assistant Model).
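The token-by-token selection described above can be sketched as sampling from a softmax distribution over the vocabulary (illustrative pure-Python code; the tiny vocabulary and scores are made up, not Llama's):

```python
import math
import random

def sample_next_token(logits, temperature=1.0, rng=random):
    """Sample a token index from softmax(logits / temperature)."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    r = rng.random()
    cumulative = 0.0
    for i, p in enumerate(probs):
        cumulative += p
        if r < cumulative:
            return i
    return len(probs) - 1

vocab = ["4", "?", "What", "is", "math"]
logits = [2.0, 0.5, 1.0, 1.0, 0.3]  # "4" is likely, but NOT guaranteed

counts = {tok: 0 for tok in vocab}
rng = random.Random(42)
for _ in range(1000):
    counts[vocab[sample_next_token(logits, rng=rng)]] += 1
print(counts)  # "4" dominates, yet other tokens still appear
```

This is why the base model's output varies from run to run: it is rolling weighted dice over tokens, not deciding on an answer.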

Llama 3.1 Base Model

  • Released by Meta in 2024, it is one of the most advanced open-source base models currently available.
  • With 405 billion parameters, the scale has increased enormously over GPT-2.
  • It was trained on 15 trillion tokens, far exceeding GPT-2's roughly 100 billion.
  • Meta also released an instruction-tuned version, which can be used as an assistant model.
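As a back-of-the-envelope comparison, the widely reported figures (assumed here: about 1.6B parameters and ~100B training tokens for GPT-2, versus 405B parameters and 15T tokens for Llama 3.1) give:

```python
# Rough scale comparison; the figures are widely reported, not measured here.
gpt2 = {"params": 1.6e9, "train_tokens": 100e9}
llama31 = {"params": 405e9, "train_tokens": 15e12}

param_ratio = llama31["params"] / gpt2["params"]
token_ratio = llama31["train_tokens"] / gpt2["train_tokens"]

print(f"parameter ratio:      {param_ratio:.0f}x")  # ~253x
print(f"training-token ratio: {token_ratio:.0f}x")  # 150x
```

Both the model and its training data grew by more than two orders of magnitude in about five years.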

Base Model Inference Examples

1. Direct Use of Base Model

You can use https://app.hyperbolic.xyz/ to try the Base Model. When using the Base Model, input:

"What is 2 + 2?"
  • The model will not necessarily answer "4", because it is not an assistant model.
  • It only predicts the next most likely token based on the statistical patterns in the training data, possibly outputting seemingly random continuations, rather than truly understanding the question.



Although it does not answer your questions like an assistant, this Base Model is still very valuable. Because it has learned a lot of information about the world and stores the knowledge of the web in its parameters, it is a distillation of web information.

2. Generating Knowledge-Based Text

If we input:

"Here is a list of the top 10 landmarks in Paris:"
  • The Base Model will automatically complete the list and generate possible landmark information.
  • However, this is only a statistical recollection of the training data, not reliable facts.

3. Memory and Generalization

  • Sometimes the model recites memorized training text verbatim; this is called regurgitation of training data. The reason for this phenomenon is that the quality of information on Wikipedia is high, so the model may have seen the article 10 or even 100 times during training, thus memorizing it.

  • For inputs it has not memorized, the model will still confidently generate plausible-sounding text, but it may be incorrect. This phenomenon is called hallucination.

How to Turn a Base Model into an Assistant?

Although the Base Model is not an assistant, it can be prompted to behave like one:

  1. Using few-shot learning, which is providing a few examples in the prompt so the model continues the pattern.

  • For example, providing multiple English words and their Korean translations, allowing the model to learn the pattern automatically:
  • The Base Model may then correctly complete the Korean translation of a new word.
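A few-shot prompt like this is just text; here is a minimal sketch of building one (the word pairs are illustrative examples chosen for this article):

```python
# Few-shot prompting: put a few example pairs in the prompt so the base
# model's next-token prediction continues the pattern.
examples = [
    ("apple", "사과"),
    ("book", "책"),
    ("water", "물"),
]
query = "house"

prompt = "\n".join(f"English: {en} -> Korean: {ko}" for en, ko in examples)
prompt += f"\nEnglish: {query} -> Korean:"

print(prompt)
```

Completing this prompt with the Korean word for "house" is simply the statistically likely continuation, which is why few-shot prompting works on a base model with no fine-tuning at all.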

  2. Constructing a prompt that pretends to be a conversation

    • By writing the prompt as a dialogue transcript between a human and an assistant, guide the Base Model to act as an assistant:
    • In this way, the Base Model will continue in this format, appearing to provide answers.

    • However, sometimes in addition to answering, the Base Model will also hallucinate the next human question.
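A common mitigation, sketched below with hypothetical helper functions (a real deployment would pass the stop string to the inference API instead), is to use a stop sequence such as "Human:" so generation is cut off before the hallucinated next question:

```python
def build_prompt(question):
    """Format a fake conversation transcript ending at the assistant's turn."""
    return (
        "Human: What is the capital of France?\n"
        "Assistant: The capital of France is Paris.\n"
        f"Human: {question}\n"
        "Assistant:"
    )

def truncate_at_stop(completion, stop="Human:"):
    """Keep only the text before the model starts a new 'Human:' turn."""
    return completion.split(stop)[0].strip()

# Simulated base-model output that keeps going past the answer:
raw_completion = " 2 + 2 equals 4.\nHuman: What is 3 + 3?\nAssistant:"
answer = truncate_at_stop(raw_completion)
print(answer)  # 2 + 2 equals 4.
```

This conversation-format trick is essentially what instruction fine-tuning later bakes into the model itself.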