2022: Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

Unlocking Hidden Reasoning Power in Large Language Models Through Step-by-Step Thinking

Introduction

In January 2022, researchers at Google Brain led by Jason Wei released a paper titled "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models", later published in the proceedings of NeurIPS 2022. This research revealed a remarkably simple yet powerful technique for improving how large language models tackle complex reasoning problems. The core insight was elegant: instead of asking models to jump directly to answers, showing them how to think step-by-step through intermediate reasoning steps could unlock latent reasoning capabilities that were already present but hidden.

Core Ideas

The paper's central breakthrough was the discovery that generating a chain of thought - a series of intermediate reasoning steps - significantly improves the ability of large language models to perform complex reasoning.

Traditional prompting methods would present a problem and expect the model to provide a direct answer. For instance, asking "Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?" would typically get a direct response like "11 tennis balls" without explanation.

Chain-of-thought prompting changes this approach fundamentally. Instead of just showing the question and answer, you provide examples that demonstrate the reasoning process: "Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. 5 + 6 = 11. The answer is 11." This intermediate reasoning serves as a guide for the model to follow similar thinking patterns.
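The contrast between the two prompting styles can be sketched in a few lines of Python. The exemplar text below is the paper's running tennis-ball example; the exact prompt layout (the `Q:`/`A:` framing and the `{new_question}` slot) is an illustrative convention here, not the paper's verbatim formatting.

```python
# Minimal sketch contrasting standard and chain-of-thought few-shot prompts.
QUESTION = (
    "Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?"
)

# Standard prompting: the exemplar maps the question straight to an answer.
standard_prompt = (
    f"Q: {QUESTION}\nA: The answer is 11.\n\nQ: {{new_question}}\nA:"
)

# Chain-of-thought prompting: the exemplar also shows the reasoning steps.
cot_answer = (
    "Roger started with 5 balls. 2 cans of 3 tennis balls each is "
    "6 tennis balls. 5 + 6 = 11. The answer is 11."
)
cot_prompt = f"Q: {QUESTION}\nA: {cot_answer}\n\nQ: {{new_question}}\nA:"

def build_prompt(template: str, new_question: str) -> str:
    """Fill the {new_question} slot to produce the final prompt string."""
    return template.format(new_question=new_question)
```

Either prompt would then be sent to the model as-is; the only difference is that the chain-of-thought version demonstrates the intermediate reasoning the model is expected to imitate.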

The method requires minimal setup - just a few well-crafted examples showing the step-by-step reasoning process. The researchers tested this approach on three different language model families: GPT-3, LaMDA, and PaLM, with model sizes ranging from 422 million to 540 billion parameters.

What made this discovery particularly significant was that chain-of-thought prompting didn't require any model retraining or fine-tuning. It was purely a prompting technique that could be applied to existing large language models immediately.

"Complex reasoning emerges naturally when we teach language models to think step-by-step through problems."

Breaking Down the Key Concepts

To understand why this works, consider how we humans approach complex problems. When faced with a difficult maths problem, we don't usually arrive at the answer instantly. Instead, we break it down into smaller, manageable steps, solving each part systematically before combining the results.

Large language models, it turns out, have this same capability lying dormant. They can perform step-by-step reasoning, but they need to be explicitly encouraged to do so. Chain-of-thought prompting acts like a gentle nudge, showing the model that it's acceptable and beneficial to think out loud rather than rushing to a conclusion.

The technique works because language models are trained to continue patterns they see in text. When you show them examples of step-by-step reasoning, they naturally continue that pattern when faced with new problems. It's similar to how a student learns to solve algebra problems by first watching a teacher work through several examples on the blackboard.

The researchers found that this reasoning ability is an emergent property, meaning it only appears when models reach a certain size threshold. Smaller models don't benefit significantly from chain-of-thought prompting, but once models reach around 100 billion parameters, the technique becomes remarkably effective.

Results and Significance

The empirical results were striking. Prompting PaLM 540B with just eight chain-of-thought exemplars achieved state-of-the-art accuracy on the GSM8K benchmark of maths word problems, surpassing even a fine-tuned GPT-3 with a verifier. Specifically, the method reached 58% accuracy on GSM8K, beating the previous state of the art of 55%, which had required extensive fine-tuning and an additional verifier model.

The technique democratises access to sophisticated reasoning capabilities without requiring massive computational resources for model training. You can apply chain-of-thought prompting to existing models through API calls, making advanced reasoning accessible to startups and individual developers who couldn't afford to train large models from scratch.

The method showed improvements across multiple reasoning domains. On the GSM8K dataset of maths word problems, PaLM showed remarkable gains when scaled to 540B parameters, but the benefits extended beyond mathematics. The researchers also demonstrated improvements on commonsense reasoning tasks such as sports understanding, where PaLM 540B with chain-of-thought prompting surpassed an unaided human sports enthusiast (95% vs 84%).

The research also revealed that chain-of-thought prompting provides valuable interpretability. Unlike black-box predictions, the intermediate reasoning steps help developers understand how the model arrived at its conclusions, making it easier to debug errors and build trust in AI systems.

The original paper is available at https://arxiv.org/abs/2201.11903