1986: Learning Representations by Back-propagating Errors

The Algorithm That Taught Neural Networks How to Learn

Introduction

In 1986, David Rumelhart from UC San Diego, Geoffrey Hinton from Carnegie Mellon, and Ronald Williams published "Learning Representations by Back-propagating Errors" in Nature. This paper introduced the backpropagation algorithm, solving the fundamental challenge of training multi-layer neural networks. Before this breakthrough, researchers could only train single-layer networks effectively, severely limiting what artificial neural networks could achieve. The paper revolutionised machine learning by showing how networks could automatically discover useful internal representations through supervised learning.

"Teaching machines to learn by systematically spreading blame backwards through layers of artificial neurons revolutionised artificial intelligence forever."

Core Ideas

The backpropagation algorithm addresses a critical problem that had stumped researchers for decades: how do you train the hidden layers of a neural network when you only know what the final output should be? The answer lies in systematically propagating error information backwards through the network layers.

The algorithm works in two distinct phases. During the forward pass, input data flows through the network layer by layer, with each neuron applying weights to its inputs, adding a bias, and passing the result through an activation function. The network produces an output, which is compared against the desired target using a loss function to measure the error.
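
As a concrete illustration (not code from the paper; the layer sizes, variable names, and sigmoid activation are assumptions of this sketch), a forward pass through a small two-layer network might look like this in NumPy:

```python
# A minimal sketch of the forward pass for a two-layer network.
# Sizes, names, and the random initialisation are illustrative only.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.normal(size=(4,))                        # input vector
W1, b1 = rng.normal(size=(3, 4)), np.zeros(3)    # hidden-layer weights and biases
W2, b2 = rng.normal(size=(1, 3)), np.zeros(1)    # output-layer weights and biases
target = np.array([1.0])                         # desired output

# Forward pass: weighted sum plus bias, then the activation, layer by layer.
h = sigmoid(W1 @ x + b1)                         # hidden activations
y = sigmoid(W2 @ h + b2)                         # network output

# Loss: squared error between the output and the target.
loss = 0.5 * np.sum((y - target) ** 2)
```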

The backward pass is where the magic happens. The algorithm calculates how much each weight contributed to the total error by using the chain rule from calculus. Starting from the output layer, it computes the gradient of the error with respect to each weight, then propagates these gradients backwards to earlier layers. Each layer receives information about how its outputs affected the final error, allowing it to adjust its weights accordingly.
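
Continuing the NumPy sketch above, the backward pass applies the chain rule layer by layer. The sigmoid derivative y * (1 - y) and the "delta" naming are conventions of this example rather than notation taken from the paper:

```python
# Backward pass for the sketch above: apply the chain rule from the output
# layer back towards the input, reusing the activations saved on the way forward.
delta_out = (y - target) * y * (1 - y)           # dLoss/d(pre-activation) at the output
grad_W2 = np.outer(delta_out, h)                 # gradient for output-layer weights
grad_b2 = delta_out                              # gradient for output-layer biases

delta_hidden = (W2.T @ delta_out) * h * (1 - h)  # error propagated to the hidden layer
grad_W1 = np.outer(delta_hidden, x)              # gradient for hidden-layer weights
grad_b1 = delta_hidden                           # gradient for hidden-layer biases
```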

The mathematical foundation relies on gradient descent optimisation. The algorithm computes partial derivatives of the loss function with respect to every weight in the network. These gradients indicate the direction and magnitude of weight adjustments needed to reduce the error. By repeatedly performing forward passes, calculating errors, backpropagating gradients, and updating weights, the network gradually improves its performance.
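
Completing the sketch, a single gradient descent step subtracts each gradient scaled by a learning rate (the value 0.1 here is arbitrary):

```python
# Gradient descent update for the sketch above.
learning_rate = 0.1
W2 -= learning_rate * grad_W2
b2 -= learning_rate * grad_b2
W1 -= learning_rate * grad_W1
b1 -= learning_rate * grad_b1
# Repeating forward pass -> loss -> backward pass -> update gradually reduces the error.
```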

What made this approach revolutionary was its ability to train hidden layers automatically. Previously, researchers had to manually engineer features or rely on unsupervised pre-training. Backpropagation showed that hidden units could learn to represent important features of the task domain through supervised learning alone.

Breaking Down the Key Concepts

Think of backpropagation like teaching a cricket team. When the team loses a match, you don't just tell them they played poorly. Instead, you analyse each player's contribution to the loss, working backwards from the final result. The wicket-keeper's missed catch led to extra runs, which happened because the bowler's line was slightly off, which occurred because the captain's field placement wasn't optimal.

Similarly, backpropagation examines how each neuron's "decision" contributed to the network's mistake. It traces the error backwards, assigning blame proportionally. A neuron that made a big mistake gets a large correction, while one that barely affected the outcome receives a tiny adjustment.

The chain rule acts like a relay race of responsibility. Just as each relay runner affects the team's final time, each layer's output influences the final prediction. The algorithm calculates how much each runner's performance contributed to the team's overall result.

The learning process resembles how students improve through practice tests. Initially, the network makes random guesses. After seeing the correct answers, it adjusts its approach. With each iteration, it gets better at recognising patterns and making accurate predictions.

Results and Significance

The paper demonstrated that backpropagation could solve problems that had previously stumped artificial intelligence researchers. The authors showed successful training on various tasks, proving that multi-layer networks could learn complex mappings between inputs and outputs. Hidden units automatically developed representations that captured important features of the problem domain.

For developers working with modern AI frameworks like TensorFlow, PyTorch, or JAX, backpropagation remains the fundamental training algorithm. Every time you call model.fit() or compute gradients, you're using principles directly descended from this 1986 paper. The algorithm became so successful that it's now considered the standard approach for training neural networks.
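
For instance, in PyTorch (one of the frameworks mentioned above) the entire backward pass reduces to a single call. The snippet below is a generic sketch that assumes PyTorch is installed; it is not tied to any particular model or to code from the paper:

```python
import torch

# A tiny two-layer network; the sizes are arbitrary.
model = torch.nn.Sequential(
    torch.nn.Linear(4, 3),
    torch.nn.Sigmoid(),
    torch.nn.Linear(3, 1),
)
x = torch.randn(1, 4)
target = torch.randn(1, 1)

loss = torch.nn.functional.mse_loss(model(x), target)
loss.backward()   # backpropagation: gradients now sit in p.grad for every parameter p

with torch.no_grad():
    for p in model.parameters():
        p -= 0.1 * p.grad   # a plain gradient descent update
```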

The paper's impact on computer science cannot be overstated. It triggered the revival of neural networks in the late 1980s and early 1990s, leading to advances in computer vision, natural language processing, and speech recognition. Modern deep learning, which powers everything from recommendation systems to language models, builds directly on backpropagation principles.

The algorithm's efficiency was crucial to its adoption. Computing gradients for all weights in a single backward pass is roughly equivalent to just two forward passes in computational cost. This efficiency made training large networks feasible, unlike previous approaches that might require thousands of forward passes to estimate gradients.
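
To make the contrast concrete, here is an illustrative finite-difference estimate of a single weight's gradient. Estimating every gradient this way needs one extra forward pass per weight (19 for the toy network below, millions for a modern one), whereas backpropagation recovers them all in one backward sweep; the network mirrors the earlier NumPy sketch and is an assumption of this example:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_loss(W1, b1, W2, b2, x, target):
    # One full forward pass followed by the squared-error loss.
    h = sigmoid(W1 @ x + b1)
    y = sigmoid(W2 @ h + b2)
    return 0.5 * np.sum((y - target) ** 2)

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 4)), np.zeros(3)
W2, b2 = rng.normal(size=(1, 3)), np.zeros(1)
x, target = rng.normal(size=4), np.array([1.0])

# Perturb one weight, rerun the whole forward pass, and compare losses.
eps = 1e-5
base = forward_loss(W1, b1, W2, b2, x, target)
W1_bumped = W1.copy()
W1_bumped[0, 0] += eps
grad_estimate = (forward_loss(W1_bumped, b1, W2, b2, x, target) - base) / eps
```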

The ability to automatically discover useful features distinguished backpropagation from earlier methods like the perceptron convergence procedure. Networks could now learn hierarchical representations, with early layers detecting simple features and deeper layers combining them into complex concepts.

The original paper can be found here: https://www.nature.com/articles/323533a0