Spelling Challenges
When exploring the capabilities of Large Language Models (LLMs), we often marvel at their impressive performance in mathematics, logical reasoning, and even writing. However, in spelling-related tasks, LLMs tend to perform less well. This is because the model's world is built on "tokens" rather than individual characters, which limits its ability to handle spelling tasks.
1. Why are LLMs not good at spelling?
LLMs do not see characters directly the way humans do. Their input is tokens: the text is broken down into larger text chunks. During training, the model learns token-level language structure rather than learning word spellings character by character.
For example, in the eyes of an LLM, the word "ubiquitous" may be split into multiple tokens instead of being read character by character:

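As a rough illustration, the sketch below (assuming the open-source tiktoken library is installed; the exact split depends on the tokenizer) shows the word being broken into multi-character chunks rather than individual letters:

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")       # a BPE tokenizer
token_ids = enc.encode("ubiquitous")             # a short list of integer IDs
pieces = [enc.decode([t]) for t in token_ids]    # decode each ID back to text
print(pieces)                                    # multi-character chunks, not single letters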
Therefore, when we ask the model to extract specific characters (such as selecting every third character), it performs poorly, because it does not directly "see" the individual letters of a word; it sees concatenated tokens.
2. Specific case: Failure in spelling tasks
Suppose we give the LLM a task like this:
Please print every third character of the word "ubiquitous" starting from the first letter.
We expect to get:
U Q T S
However, the LLM's answer is usually wrong because it doesn't process tasks at the character level but rather operates based on tokens.

Why does it fail?
✅ Humans see characters and can easily extract the specified letters.
❌ LLMs see tokens rather than individual characters, so they cannot accurately perform character-level operations.
3. Classic error case: Can LLMs not count "R"?
A well-known LLM spelling issue is:
How many R's are in "strawberry"?
✅ The correct answer is 3.
❌ Many early LLMs (including GPT-3) would incorrectly answer 2.
This has triggered extensive discussion: why can an AI that solves Olympiad math problems not correctly count letters? There are two main reasons:
LLMs see tokens, not characters, so they cannot process spelling tasks letter by letter.
LLMs are inherently not good at counting, which makes counting characters even more difficult.
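A minimal Python sketch shows how trivially code handles this, since it counts characters directly instead of tokens:

word = "strawberry"
print(word.count("r"))   # character-level count -> 3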
4. How to make LLMs perform spelling tasks correctly?
Since LLMs cannot reliably handle character-level tasks, we can leverage a code interpreter to solve the problem. For example, we can have the model invoke Python:
Example: Using Python to solve problems

We can ask the LLM:
Please use Python code to extract every third character from the string "ubiquitous".
Python code generated by the LLM:
word = "ubiquitous"
result = word[::3]  # slice with step 3: takes the characters at indices 0, 3, 6, 9
print(result)
Output result:
uqts
Why is the code more reliable than the LLM itself?
✅ Python code can operate on a per-character basis, while LLMs operate on tokens and cannot directly access individual characters.
✅ With the help of code tools, AI can accurately perform character-level tasks without making mistakes caused by tokenization.
Imbalanced intelligence
1. AI's "unreasonable" mistakes on simple problems
Let's look at a confusing case:
Is 9.11 greater than 9.9?
✅ The correct answer should be No (9.9 > 9.11), but an LLM might give the following incorrect answer:
9.11 is greater than 9.9.
It may even attempt to use mathematical logic to justify the wrong conclusion, or give inconsistent answers across repeated queries, sometimes correct and sometimes wrong.


2. Why is AI better at handling complex problems but prone to errors on simple questions?
In some cases, AI can even solve Olympiad-level math problems, yet it fails at simple numerical comparisons. Making mistakes like this seems completely unreasonable, but after digging deeper, researchers discovered some surprising phenomena.
(1) The model may be disturbed by "non-mathematical factors"
The research team found that when the model was comparing 9.11 and 9.9, the activated neurons were highly similar to those associated with text patterns related to the Bible.
In the Bible, chapter and verse numbers are usually written as 9:11 (such as John 9:11). In that format, 9:11 does indeed come after 9:9. Therefore, the model's memory might mislead it into thinking that 9.11 is greater than 9.9, instead of performing a numerical comparison according to mathematical logic.
(2) Statistical pattern matching vs. logical reasoning
LLMs mainly rely on statistical pattern matching rather than true logical reasoning:
When the model sees "9.11 vs 9.9", it might incorrectly associate it with text patterns related to Bible verse numbers rather than performing a mathematical comparison. The model cannot distinguish a mathematical context from these text patterns, leading it to make a non-mathematical decision.
(3) LLM is not a perfect mathematical tool
The nature of a language model is to predict the next most likely token, not to strictly enforce mathematical rules. Complex math problems usually rely on more systematic reasoning steps, while simple calculations are more easily affected by noise in the training data.
3. Solution: How to improve the computational reliability of LLMs?
✅ Do not fully trust AI's mathematical reasoning ability
Because an LLM may be affected by text-pattern interference, its calculation results require additional verification. Suggestion: use external calculation tools (such as a Python code interpreter) for validation instead of relying directly on the AI's "mental calculation," as in the sketch below.
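For example, a minimal check in Python settles the 9.11 vs 9.9 question immediately:

from decimal import Decimal

print(9.11 > 9.9)                          # False -- 9.9 is the larger number
print(Decimal("9.11") > Decimal("9.9"))    # False, exact decimal arithmetic agrees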
✅ Treat AI as a tool, not the final answer
AI is a "probabilistic system," not an absolutely correct reasoning machine.。 When using AI to solve problems,we should not blindly copy the answers generated by AIbut shouldverify in combination with context and tools.。