Introduction: A New Way for AI to "Think"?
The world of AI is currently all about Large Language Models (LLMs). They've shown some incredible skills, but their ability to handle complex, multi-step problems relies almost entirely on a trick called Chain-of-Thought (CoT) prompting. This approach forces the model to "think out loud" by writing down its reasoning step-by-step as a sequence of text. But this isn't a real solution; it's more of a clever workaround. The standard Transformer architecture that powers these LLMs is, as the paper "Hierarchical Reasoning Model" puts it, "paradoxically shallow". In computer science terms, a fixed number of layers confines what the model can compute in a single pass to a limited complexity class (TC0), which in theory rules out tasks that need deep, sequential computation. CoT gets around this, but it has major downsides: it's brittle, meaning one mistake can ruin the whole process; it often needs a ton of data to work well; and it's slow because it has to generate so many intermediate words.
This is where the Hierarchical Reasoning Model (HRM) comes in, proposing a totally different path. Inspired by how the human brain processes information on multiple levels and at different speeds, HRM is a new type of recurrent model designed for serious "computational depth." Its big idea is to enable "latent reasoning," where the heavy lifting happens inside the model's hidden states, not out in the open as text. The paper makes some bold claims, stating its relatively small 27-million-parameter model can master complex puzzles like Sudoku and mazes with just 1,000 training examples. It also claims to beat much larger LLMs on the Abstraction and Reasoning Corpus (ARC-AGI), a key test for general intelligence.
We'll weigh its performance claims against what's happening in the broader AI field, and discuss whether it's truly a viable new direction for artificial intelligence.
A Quick Primer: What is a Transformer?
Before we dive deeper into HRM, it helps to understand the architecture it's being compared to: the Transformer. First introduced in a 2017 Google paper titled "Attention Is All You Need," the Transformer has become the go-to architecture for today's most advanced LLMs, like the GPT series.7
At its core, a Transformer is a neural network designed to transform an input sequence (like a sentence) into an output sequence (like a translation or an answer).10 What made it a game-changer was its departure from older models like Recurrent Neural Networks (RNNs) that had to process text word-by-word. Transformers can look at an entire sentence at once, which makes them much faster and better at understanding the broader context.8
Here’s a simplified look at how it works for a developer audience:
- Tokenization and Embedding: First, the input text is broken down into smaller pieces called "tokens," which can be words or parts of words. Each token is then converted into a numerical vector called an "embedding".10 This vector represents the token's meaning in a way the machine can understand. You can think of this as giving each word a set of coordinates in a high-dimensional space.8
- Positional Encoding: Because the model processes all tokens in parallel, it loses the original word order. To fix this, "positional encoding" adds information to each embedding that signals the token's position in the sequence. This lets the model know that "the cat sat on the mat" is different from "the mat sat on the cat".10
- The Attention Mechanism: This is the Transformer's secret sauce. Attention allows the model to weigh the importance of every other word in the input when processing a specific word. For example, in the sentence "The tired rabbit took a nap," the attention mechanism helps the model understand that "nap" is strongly related to "tired" and "rabbit".15 It does this by creating three vectors for each token: a Query (what I'm looking for), a Key (what I have), and a Value (what I'll give you). By comparing the Query of one word to the Keys of all other words, the model calculates "attention scores" that determine how much focus to place on each part of the input when generating an output.16 Many models use "Multi-Head Attention," which is like having several teams of detectives looking at the input from different angles to build a richer understanding.15
- Encoder-Decoder Stacks: Many classic Transformers consist of two main parts. The Encoder reads and understands the input sequence. The Decoder takes that understanding and generates the output sequence, one token at a time.11 In each step, the decoder also pays attention to the words it has already generated, making the output coherent.14
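The Query/Key/Value comparison described in the list above can be sketched in a few lines of plain Python. This is a minimal, single-head illustration of scaled dot-product attention, not production code: the three tokens and their 2-dimensional vectors are made-up numbers, whereas a real Transformer derives Q, K, and V by multiplying learned weight matrices with the token embeddings.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention for one head, using plain lists."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Compare this token's Query with every token's Key.
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
            for k in keys
        ]
        weights = softmax(scores)
        # Blend the Value vectors according to the attention weights.
        mixed = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Toy vectors for three tokens (illustrative values, not learned ones).
tokens = ["tired", "rabbit", "nap"]
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

outputs, weights = attention(Q, K, V)
for token, row in zip(tokens, weights):
    print(token, [round(w, 2) for w in row])
```

Each printed row shows how one token spreads its attention across all three tokens. Multi-head attention runs several copies of this routine with different learned projections and concatenates the results.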
This architecture, especially its ability to handle long-range connections in text, is what allows LLMs to perform so well on a wide range of language tasks, from translation to powering conversational chatbots.12
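To make the positional-encoding step from the list above concrete, here is a small sketch of the fixed sinusoidal scheme used in "Attention Is All You Need." The 4-dimensional embeddings for "cat" and "mat" are hypothetical values chosen for illustration; real embeddings are learned, and other models use different position schemes.

```python
import math

def positional_encoding(position, d_model):
    """Sinusoidal encoding of one position (sin on even dims, cos on odd dims)."""
    encoding = []
    for i in range(d_model):
        # Each pair of dimensions (2k, 2k+1) shares one frequency.
        angle = position / (10000 ** ((i - i % 2) / d_model))
        encoding.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return encoding

def add_positions(embeddings):
    """Add the position signal to each token embedding, element by element."""
    d_model = len(embeddings[0])
    return [
        [e + p for e, p in zip(emb, positional_encoding(pos, d_model))]
        for pos, emb in enumerate(embeddings)
    ]

# Hypothetical embeddings, not taken from any trained model.
toy_embeddings = {"cat": [0.2, 0.1, 0.0, 0.3], "mat": [0.1, 0.4, 0.2, 0.0]}

cat_then_mat = add_positions([toy_embeddings["cat"], toy_embeddings["mat"]])
mat_then_cat = add_positions([toy_embeddings["mat"], toy_embeddings["cat"]])

# The same word at a different position ends up with a different vector.
print(cat_then_mat[0] != mat_then_cat[1])
```

Because "cat" gets a different vector in first position than in second position, the model can tell the two word orders apart even though it processes every token in parallel.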
How the Hierarchical Reasoning Model Works
The HRM is built from two main parts that work together, operating on different timescales, much like different parts of the brain.
The High-Level (H) Module is the "slow, abstract planner." It updates its internal state only once per "high-level cycle." Its job is to take in the results from the low-level computations, form a big-picture view of the problem, and provide guidance for the next round of detailed work. Think of it as the brain's executive function, managing the overall strategy.2
The Low-Level (L) Module is the "fast, detailed computation" engine. It runs multiple updates within a single high-level cycle, handling the nitty-gritty work like local searches and checking constraints, similar to how the brain's lower-level circuits handle sensory details.2
A classic problem with standard Recurrent Neural Networks (RNNs) is that they converge too early: the hidden state settles toward a fixed point, each update gets smaller, and the later steps do almost no useful work. HRM is designed to beat this problem with a process the authors call hierarchical convergence. The L-module works on a sub-task until it settles, then the H-module takes that result, updates its own strategy, and gives the L-module a new task. This prevents the whole system from getting stuck and allows for a long, sustained chain of computation, which is key for solving complex problems.2
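The two-timescale loop can be sketched as a pair of nested loops. This is only a toy under loud assumptions: the states are single numbers rather than vectors, and `l_step` and `h_step` are hypothetical update rules invented here for illustration, whereas the real model uses learned Transformer blocks for both modules.

```python
import math

def l_step(z_l, z_h, x):
    """Hypothetical fast update: L moves toward a target set by H and the input."""
    return 0.5 * z_l + 0.5 * math.tanh(z_h + x)

def h_step(z_h, z_l):
    """Hypothetical slow update: H absorbs the result L has settled on."""
    return math.tanh(0.9 * z_h + z_l)

def run_hrm(x, n_cycles=3, t_steps=8):
    """Run n_cycles high-level cycles, each containing t_steps low-level updates."""
    z_h, z_l = 0.0, 0.0
    l_updates = h_updates = 0
    trace = []
    for cycle in range(n_cycles):
        # Fast loop: L updates many times while H stays fixed.
        for _ in range(t_steps):
            z_l = l_step(z_l, z_h, x)
            l_updates += 1
        # Slow loop: H updates once, which changes L's target for the next cycle.
        z_h = h_step(z_h, z_l)
        h_updates += 1
        trace.append((cycle, round(z_l, 4), round(z_h, 4)))
    return {"z_h": z_h, "l_updates": l_updates, "h_updates": h_updates, "trace": trace}

result = run_hrm(x=0.5)
for cycle, z_l, z_h in result["trace"]:
    print(f"cycle {cycle}: z_l={z_l} z_h={z_h}")
```

Within each cycle the L state closes in on a fixed point, which is exactly where a plain RNN would stall. The H update then shifts that fixed point, so the next cycle has fresh work to do.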
HRM vs. LLMs: A New Kind of Thinking?
The HRM paper frames its model not just as an improvement, but as a fundamentally different way of doing AI. The core difference is how they "reason." HRM performs latent reasoning, where all the work happens inside the model's hidden states. The model "thinks" by refining its internal understanding of the problem until it finds a solution.2
LLMs using CoT, on the other hand, perform externalized reasoning. The reasoning process is the text they generate. This makes the process fragile (one bad word can throw off the whole chain of logic) and very inefficient.2 This difference in approach stems from a deeper architectural divide. As a recurrent model, HRM is theoretically Turing-complete given enough memory and time, meaning it can in principle solve any problem that is solvable by an algorithm. Standard Transformers, with their fixed depth, are in a much weaker computational class, which is why they need the CoT workaround for complex sequential tasks.2
| Feature | Hierarchical Reasoning Model (HRM) | Large Language Model (LLM) with CoT |
|---|---|---|
| Core Architecture | Coupled Recurrent Modules (Transformer Blocks) | Feed-forward Transformer |
| Reasoning Process | Latent State-Space Iteration | Externalized Token-Sequence Generation |
| Computational Power | Turing-Complete (in theory) | TC0 (computationally shallow) |
| Training Method | Gradient Approximation + Deep Supervision + ACT | Backpropagation on Token Prediction Loss |
| Data Requirement | Small, task-specific datasets (~1k examples) | Massive, general web text (trillions of tokens) |
| Key Strength | Computational Depth & Data Efficiency for Algorithms | Breadth of Knowledge & Language Fluency |
| Key Weakness | Narrow Specialization & Lack of World Knowledge | Computational Shallowness & Brittle Reasoning Chains |
The Big Question: Is the Comparison Fair?
Here’s the biggest catch in the paper's argument: the comparisons aren't exactly fair. It puts HRM, a small model trained specifically on about 1,000 examples for a single task (like Sudoku), up against huge, general-purpose LLMs that are tested with generic, out-of-the-box prompting.2 This is a classic apples-to-oranges comparison that mixes up architectural skill with task-specific training.20
The most obvious thing missing is a crucial baseline: what happens when you fine-tune a state-of-the-art LLM on the same 1,000 Sudoku or maze examples? This is a standard way to make generalist models into specialists. Independent research has shown that when LLMs are properly adapted this way, they can get very good at solving complex mazes, with scores over 90%.22 This directly challenges the paper's suggestion that LLM architectures are incapable of solving these problems. The 0% scores in the paper show a failure of the prompting method (zero-shot CoT), not a fundamental failure of the architecture.
So, what the results really show is not that HRM is better than all LLMs, but that the HRM architecture is better than a standard Transformer when both are trained from scratch as specialists on a small dataset. That's still an interesting result, but it's a much smaller claim than the one the paper makes.
Putting HRM to the Test: Puzzles, Brainwaves, and Bold Claims
Even with the unfair comparison, HRM's performance on tough algorithmic puzzles is stunning. On complex Sudoku and maze-solving tasks where top LLMs scored 0%, HRM achieved near-perfect accuracy after being trained on just 1,000 examples.2 On the ARC-AGI benchmark, a key test of fluid intelligence, the small HRM model also significantly outperformed much larger, pre-trained LLMs.2
| Benchmark | HRM (27M, ~1k samples) | Direct pred (Transformer, same setup) | Claude 3.7 8K (CoT, Pre-trained) | o3-mini-high (CoT, Pre-trained) |
|---|---|---|---|---|
| ARC-AGI-1 (%) | 40.3 | 15.8 | 21.2 | 34.5 |
| ARC-AGI-2 (%) | 3.0 | 0.9 | 1.0 | 0.3 |
| Sudoku-Extreme (9x9) (%) | 96.0 | 0.0 | 0.0 | 0.0 |
| Maze-Hard (30x30) (%) | 100.0 | 0.0 | 0.0 | 0.0 |

Data sourced from Figure 1 of the HRM paper.2
So how does it do it? The authors point to a fascinating parallel with the human brain. They measured the "effective dimensionality" of the model's internal states—a way to quantify how complex and flexible its thinking is. They found that the high-level "planner" module learned to operate in a very high-dimensional space, allowing for flexible, abstract thought. Meanwhile, the low-level "worker" module operated in a much lower-dimensional space, perfect for focused, detailed computation.2
This is exciting because it mirrors a similar hierarchy found in the brain's cortex and suggests the model independently discovered a fundamental principle of biological intelligence. This isn't just a design choice; it's an emergent property the model learns on its own.2
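One common way to put a number on "effective dimensionality" is the participation ratio of the covariance eigenvalues, PR = (Σλ)² / Σλ². The sketch below assumes a measure of this kind and runs it on synthetic data; the two sets of "hidden states" are randomly generated stand-ins, not activations from HRM.

```python
import random

def participation_ratio(states):
    """PR = (sum of eigenvalues)^2 / (sum of squared eigenvalues) of the covariance.

    For a symmetric matrix C, the eigenvalues sum to trace(C) and their squares
    sum to the sum of all C[i][j]**2, so no eigendecomposition is needed.
    """
    n, d = len(states), len(states[0])
    means = [sum(row[j] for row in states) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in states]
    cov = [
        [sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]
    trace = sum(cov[i][i] for i in range(d))
    squared_sum = sum(cov[i][j] ** 2 for i in range(d) for j in range(d))
    return trace ** 2 / squared_sum

rng = random.Random(0)

# Synthetic "planner-like" states: variance spread evenly over 8 dimensions.
spread = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(2000)]

# Synthetic "worker-like" states: almost all variance along a single direction.
narrow = []
for _ in range(2000):
    s = rng.gauss(0, 1)
    narrow.append([s + 0.01 * rng.gauss(0, 1) for _ in range(8)])

print("spread:", round(participation_ratio(spread), 2))
print("narrow:", round(participation_ratio(narrow), 2))
```

The ratio approaches the number of dimensions when variance is spread evenly and drops toward 1 when a single direction dominates, which is the kind of gap described between the high-level and low-level modules.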
The Bottom Line: A Hybrid Future for AI?
The Hierarchical Reasoning Model is a significant step forward. It suggests that architectural innovation, not just throwing more data and compute at a problem, is a critical path toward more capable AI. It's a brilliant specialist, excelling at the kind of deep, algorithmic reasoning that LLMs struggle with.
However, HRM is not an "LLM killer." It lacks the vast world knowledge and linguistic fluency that makes LLMs so versatile. Its true potential may lie not in replacing LLMs, but in working alongside them.
This points to a hybrid future for AI, one that leverages the best of both worlds. Imagine a system where an LLM acts as the friendly front door. It would understand a user's complex request in natural language, like "Plan the most efficient delivery route for these 10 packages, considering real-time traffic." The LLM would then formulate the problem into a structured format and hand it off to an HRM-like "reasoning co-processor." This specialist module would efficiently solve the complex pathfinding problem and return the structured solution. Finally, the LLM would translate that technical answer back into a clear, human-friendly explanation.
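The hand-off described above can be sketched as three stages passing structured data. Everything here is hypothetical: `RouteProblem`, `solve_route`, and `explain` are names invented for this illustration, the exhaustive search merely stands in for an HRM-like solver, and the travel times are made up. In the imagined system an LLM would build the `RouteProblem` from the user's request and write the final explanation.

```python
from dataclasses import dataclass
from itertools import permutations

@dataclass
class RouteProblem:
    """Structured form of the request, as the LLM front end might produce it."""
    depot: str
    stops: list
    travel_minutes: dict  # (from, to) -> minutes

def solve_route(problem):
    """Stand-in for the reasoning co-processor: try every order of stops."""
    best_order, best_cost = None, float("inf")
    for order in permutations(problem.stops):
        path = [problem.depot, *order, problem.depot]
        cost = sum(problem.travel_minutes[(a, b)] for a, b in zip(path, path[1:]))
        if cost < best_cost:
            best_order, best_cost = list(order), cost
    return best_order, best_cost

def explain(order, cost):
    """Stand-in for the LLM turning the structured answer back into prose."""
    return f"Visit {', then '.join(order)} and return to the depot: about {cost} minutes."

# Made-up symmetric travel times between a depot D and three stops.
pairs = {("D", "A"): 10, ("D", "B"): 15, ("D", "C"): 20,
         ("A", "B"): 35, ("A", "C"): 25, ("B", "C"): 30}
travel = {**pairs, **{(b, a): t for (a, b), t in pairs.items()}}

problem = RouteProblem(depot="D", stops=["A", "B", "C"], travel_minutes=travel)
order, cost = solve_route(problem)
print(explain(order, cost))
```

With these numbers the solver picks the order A, C, B at a total of 80 minutes. The point of the structure is the interface: the solver never sees natural language, and the language model never has to do the search itself.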
This approach combines the LLM's broad knowledge with the HRM's deep, reliable computation. It's a plausible and powerful path toward the next generation of AI—one that doesn't just talk, but truly thinks.
You can find the full research paper here: Hierarchical Reasoning Model
Works cited
- What is chain of thought (CoT) prompting? - IBM
- [D] The Parallelism Tradeoff: Understanding Transformer Expressivity Through Circuit Complexity : r/MachineLearning - Reddit
- On Limitations of the Transformer Architecture - arXiv
- Question about chain of thought (CoT) in LLMs : r/ArtificialInteligence - Reddit
- Short-Path Prompting in LLMs: Analyzing Reasoning Instability and Solutions for Robust Performance - arXiv
- Transformer (deep learning architecture) - Wikipedia
- What is a Transformer Model? Components, Innovations & Use Cases - AI21 Labs
- Generative pre-trained transformer - Wikipedia
- What are Transformers in Artificial Intelligence? - AWS
- How Transformers Work: A Detailed Exploration of Transformer Architecture - DataCamp
- What is a Transformer Model? - Moveworks
- LLM Transformer Model Visually Explained - Polo Club of Data Science
- Transformer Attention Mechanism in NLP - GeeksforGeeks
- Introduction to Transformers and Attention Mechanisms | by Rakshit Kalra | Medium
- The Transformer Attention Mechanism - MachineLearningMastery.com
- What is an attention mechanism? | IBM
- 6 Types of Useful Transformer Models and their Use Cases - Data Science Dojo
- What is GPT (generative pre-trained transformer)? - IBM
- LLMs vs. SLMs: The Differences in Large & Small Language Models | Splunk
- General-purpose LLMs fall short in fair, accurate hiring — here's what to use instead - Eightfold AI
- AlphaMaze: Enhancing Large Language Models' Spatial Intelligence via GRPO - arXiv
- This AI Paper from Menlo Research Introduces AlphaMaze: A Two-Stage Training Framework for Enhancing Spatial Reasoning in Large Language Models - MarkTechPost
- AlphaMaze: Teaching LLMs to think visually - Menlo Research