🤓 A Deep Dive into Deep Learning: Part 2
Part 2 of 3 posts on the history of Deep Learning and the foundational developments that led to today’s AI innovations.
Thank you for subscribing to my newsletter. As promised, this is Part 2 of my deep dive into the origins of deep learning. If you missed Part 1, you can read it here.
The field of deep learning is filled with lots of jargon. When you see the 🤓 emoji, that’s where I go a layer deeper into foundational concepts and try to decipher the jargon.
P.S. Don’t forget to hit subscribe if you want to receive more AI content like this!
Part 2 - The Hidden Layers of Deep Learning
We ended Part 1 at the close of the 1970s, in the middle of the AI winter. Research had slowed down because of the limitations of single-layer neural networks, and computers weren’t yet powerful enough to train large networks.
Let’s pick things up in the 1980s…
As we learned in Part 1, single-layer neural networks can only learn linearly separable data, which means they can’t learn even simple functions like XOR. In theory, adding just one additional layer (a hidden layer between the input and the output) allows a network to approximate virtually any continuous function. In practice, however, two-layer networks end up being too big and too slow to be useful.
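To make the XOR limitation concrete, here is a minimal Python sketch. The weights are hand-picked by me for illustration (they are not taken from any historical network), and the helper names `step`, `single_layer`, `two_layer_xor` and `single_layer_can_fit_xor` are hypothetical. It shows that one extra layer is enough to compute XOR, while a brute-force search over a small grid of weights finds no single neuron that can.

```python
from itertools import product

def step(z):
    """Threshold activation: output 1 if the weighted sum is positive, else 0."""
    return 1 if z > 0 else 0

def single_layer(x1, x2, w1, w2, b):
    """A single neuron: one straight-line boundary over the two inputs."""
    return step(w1 * x1 + w2 * x2 + b)

def two_layer_xor(x1, x2):
    """Hand-picked weights (illustrative assumption): the hidden units compute
    OR and NAND, and the output unit ANDs them together, which gives XOR."""
    h_or = step(x1 + x2 - 0.5)
    h_nand = step(-x1 - x2 + 1.5)
    return step(h_or + h_nand - 1.5)

# The XOR truth table: inputs -> expected output.
XOR = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}

def single_layer_can_fit_xor():
    """Try every weight combination on a coarse grid (-2.0 to 2.0 in steps of 0.5)."""
    grid = [v / 2 for v in range(-4, 5)]
    for w1, w2, b in product(grid, repeat=3):
        if all(single_layer(x1, x2, w1, w2, b) == y for (x1, x2), y in XOR.items()):
            return True
    return False

print([two_layer_xor(a, b) for (a, b) in XOR])  # [0, 1, 1, 0]
print(single_layer_can_fit_xor())               # False
```

The grid search is only a demonstration, not a proof, but the result holds for any choice of weights: no single straight line separates the XOR outputs.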
To approximate a complex function, a two-layer neural network needs a very large number of neurons, and therefore a very large number of parameters to adjust during training. This requires a lot of computational resources, making the training process very slow, especially when working with large datasets. Two-layer neural networks also have a limited capacity to learn complex patterns and features from data. Because they have only a single hidden layer between the input and the output, they cannot build up high-level features in stages, with each layer refining the features found by the one before it. This makes it difficult to learn complex relationships and patterns.
A great example of the limitations of a two-layer network is the task of image recognition. A two-layer neural network has to get from the raw image data to the final prediction or decision through a single hidden layer. The lack of additional layers between the input and output limits the capacity of the network to extract high-level features such as edges, textures, and shapes from the image data.
In order to create neural networks that are more capable of learning, researchers would need to go beyond two layers. These multi-layer neural networks are also known as Deep Neural Networks (DNNs).
🤓 What is a Deep Neural Network?
A deep neural network (DNN) is a neural network with many layers, typically composed of an input layer, multiple hidden layers, and an output layer. Each layer contains a set of interconnected "neurons" that perform computations on the input data and pass the results to the next layer. The neurons in a deep neural network are connected by weights which can be learned through training. The goal of training is to adjust these weights so that the DNN produces a desired output for a given input.
What are hidden layers?
The input layer takes in the input data, which is then processed by the hidden layers. The hidden layers are called "hidden" because their internal workings are not directly observable and are not part of the network's input or output. These layers use a set of weights and biases to transform the input data, passing it through multiple non-linear processing stages known as activation functions. The output of the last hidden layer is then passed on to the output layer, which produces the final output of the network.
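Here is a rough sketch of that flow in plain Python: an input passes through one hidden layer and then an output layer. All the weights and biases below are made-up numbers chosen only for illustration, and the function names (`dense`, `relu`, `softmax`, `forward`) are my own, not from any particular library.

```python
import math

def relu(z):
    """A common non-linear activation function: negative values become 0."""
    return max(0.0, z)

def dense(inputs, weights, biases, activation):
    """One layer: each neuron takes a weighted sum of all inputs,
    adds its bias, and applies the activation function."""
    return [activation(sum(w * x for w, x in zip(neuron_w, inputs)) + b)
            for neuron_w, b in zip(weights, biases)]

def softmax(scores):
    """Turn raw output scores into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up parameters (illustrative assumption): 3 inputs -> 2 hidden neurons -> 2 outputs.
hidden_w = [[0.5, -0.2, 0.1], [-0.3, 0.8, 0.4]]
hidden_b = [0.0, 0.1]
output_w = [[1.0, -1.0], [-0.5, 0.5]]
output_b = [0.0, 0.0]

def forward(x):
    hidden = dense(x, hidden_w, hidden_b, relu)               # hidden layer
    scores = dense(hidden, output_w, output_b, lambda z: z)   # output layer (no activation)
    return softmax(scores)

probs = forward([1.0, 2.0, 3.0])
print(probs)
```

In a real network these numbers would be learned during training rather than typed in by hand, and there would be many more neurons per layer.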
The purpose of the hidden layers is to extract and abstract features from the input data and pass them on to the next layer. Generally, the more hidden layers a DNN has, the more complex the patterns it can learn and represent. The number of hidden layers and the number of neurons in each hidden layer define the network's architecture. These are hyperparameters: they are chosen before training and tuned by comparing how well different configurations perform, rather than being learned during training itself.
To simplify, think of a DNN as a function that maps input data to output data by passing it through the layers. During training, each layer's weights are adjusted so that the function better fits the input-output pairs it sees, with the final output being the output of the last layer.
For example, let's say that the DNN is trained on a dataset of images of handwritten digits. During the training process, the DNN would learn the statistical patterns and relationships between the pixel values of the images and the labels indicating which digit the image represents.
Once the DNN is trained, it can be used to classify new images of handwritten digits. To classify a new image, the DNN would take the pixel values of the image as input and pass them through the layers of the network, using the patterns and relationships it learned during training to classify the image as a specific digit.
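A handwritten digit dataset is too big to fit in a newsletter, so here is a toy stand-in: a tiny network with one hidden layer trained on the four XOR examples using gradient descent and backpropagation. This is only a sketch under my own assumptions (4 hidden neurons, sigmoid activations, squared-error loss, a learning rate of 0.5, and a fixed random seed); the names `init`, `forward`, `train_step` and `mean_loss` are hypothetical.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def init(n_in, n_hidden, rng):
    """Start with small random weights and zero biases."""
    return {
        "w1": [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_hidden)],
        "b1": [0.0] * n_hidden,
        "w2": [rng.uniform(-1, 1) for _ in range(n_hidden)],
        "b2": 0.0,
    }

def forward(net, x):
    """Return the hidden activations and the network's single output."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(ws, x)) + b)
              for ws, b in zip(net["w1"], net["b1"])]
    out = sigmoid(sum(w * h for w, h in zip(net["w2"], hidden)) + net["b2"])
    return hidden, out

def train_step(net, x, target, lr):
    """One gradient-descent update on a single example."""
    hidden, out = forward(net, x)
    # Error signal at the output for the loss 0.5 * (out - target)^2.
    d_out = (out - target) * out * (1.0 - out)
    for j, h in enumerate(hidden):
        # Pass the error back to hidden neuron j (uses w2[j] before it is updated).
        d_hidden = d_out * net["w2"][j] * h * (1.0 - h)
        net["w2"][j] -= lr * d_out * h
        for i, xi in enumerate(x):
            net["w1"][j][i] -= lr * d_hidden * xi
        net["b1"][j] -= lr * d_hidden
    net["b2"] -= lr * d_out

def mean_loss(net, data):
    return sum((forward(net, x)[1] - t) ** 2 for x, t in data) / len(data)

data = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0), ([1.0, 0.0], 1.0), ([1.0, 1.0], 0.0)]
net = init(2, 4, random.Random(0))

loss_before = mean_loss(net, data)
for _ in range(5000):
    for x, t in data:
        train_step(net, x, t, lr=0.5)
loss_after = mean_loss(net, data)

print(f"loss before training: {loss_before:.4f}, after: {loss_after:.4f}")
print([round(forward(net, x)[1], 2) for x, _ in data])
```

The same loop of "predict, measure the error, nudge the weights" is what happens when training on digit images, just with far more inputs, neurons and examples.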
In 1986, a pivotal book was published: Parallel Distributed Processing (PDP) by David Rumelhart, James McClelland and the PDP Research Group [1]. The book explores the use of artificial neural networks for computational modeling. The PDP series includes several influential papers on the development of neural networks and their applications.
One of the main contributions of the PDP series was the use of Paul Werbos’ backpropagation algorithm for training neural networks. The PDP series also introduced the concept of distributed representation, which is the idea that the meaning of a concept can be represented by the pattern of activity across multiple neurons in the network. This idea is important because it allows neural networks to learn more complex relationships between the input and output data, and to generalize better to new data.
The term "deep learning" also enters the vocabulary around this period: it is usually credited to Rina Dechter (1986) in the machine learning community and to Igor Aizenberg and colleagues (2000) for artificial neural networks. The biggest breakthrough of the 1990s, however, comes when Yann LeCun and his colleagues, including Yoshua Bengio, demonstrate one of the first successful real-world applications of deep learning: handwritten digit recognition using a convolutional neural network (CNN) [2].
This work is seminal because it demonstrates the effectiveness of deep learning for real-world applications. Prior to this work, there had been limited success in using neural networks for practical tasks, and many researchers were skeptical of their potential. However, the results show that deep learning can achieve high accuracy on a challenging real-world task, paving the way for further research and development in the field.
Here’s LeCun demo-ing his CNN, LeNet 1 in 1993:
🤓 What is a Convolutional Neural Network?
A convolutional neural network (CNN) is a type of deep neural network that is commonly used in image and video recognition tasks. It is designed to automatically and adaptively learn spatial hierarchies of features from input data, making it particularly well-suited for image analysis.
A CNN consists of multiple layers of interconnected neurons, which process and analyze the input data. The layers of a CNN are organized in such a way that they learn increasingly complex features of the input data as the data passes through the network.
Why is it “convolutional”?
One key feature of CNNs is the use of "convolutional" layers, which are designed to automatically learn and extract features from the input data. These layers apply a set of filters to the input data and use the resulting output to detect patterns or features in the data. This process is repeated multiple times, allowing the CNN to learn increasingly complex features of the input data as it passes through the network.
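To give a feel for what "applying a filter" means, here is a minimal sketch in plain Python. The tiny image and the vertical-edge filter are hand-made for illustration; in a real CNN the filter values are learned during training. The function name `convolve2d` is my own.

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image (no padding, stride 1) and take a
    weighted sum at each position. Like most deep learning libraries, this
    does not flip the kernel, so strictly speaking it is a cross-correlation."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]

# A 4x4 "image": dark on the left (0), bright on the right (1).
image = [[0, 0, 1, 1] for _ in range(4)]

# A hand-picked filter that responds where brightness increases from left to right.
kernel = [[-1, 1],
          [-1, 1]]

feature_map = convolve2d(image, kernel)
for row in feature_map:
    print(row)
# [0, 2, 0]
# [0, 2, 0]
# [0, 2, 0]
```

The output, called a feature map, is zero in the flat regions and lights up only in the middle column, exactly where the dark-to-bright edge sits.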
Here’s a great video by Google that visualizes how a CNN works:
In 1995, Jürgen Schmidhuber and his student Sepp Hochreiter describe the concept of "long short-term memory" (LSTM) units [3] in a technical report, later published as a journal paper in 1997. LSTMs are a type of Recurrent Neural Network (RNN) that is able to capture long-term dependencies in sequential data.
Traditional RNNs have difficulty capturing long-term dependencies (e.g. the relationship between words that are far apart in a text), which can limit their effectiveness for certain tasks. To address this issue, Schmidhuber and Hochreiter introduce the concept of LSTM units, which are able to "remember" information for long periods of time and use it to make predictions or decisions later. They demonstrate that LSTM units can solve tasks with long time lags between relevant inputs, outperforming traditional RNNs on these tasks.
LSTMs go on to have a significant impact on the development of deep learning models for tasks such as language modeling, machine translation, and speech recognition.
🤓 What is a Recurrent Neural Network?
A recurrent neural network (RNN) is a type of artificial neural network that is designed to process sequential data, such as text or time series data. It is called "recurrent" because it makes use of sequential information, passing the output from one step of the processing back into the network as input for the next step.
How does an RNN work?
In a recurrent neural network (RNN), the neurons are connected in a directed cycle, meaning that the output from one step in the processing is passed as input to the next step in the cycle. This allows information to be passed from one step to the next and for the network to use information from the past to inform its current and future processing. For example, an RNN can predict where an object is going based on its past coordinates, or predict the next word in a text based on the previous words for autocompletion.
For example, in a language modeling task, an RNN might take a sequence of words as input and use the output from processing the previous word to inform its processing of the current word. This allows the RNN to capture the context and dependencies between words, which is important for understanding the meaning of the text.
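The recurrence can be sketched in a few lines of Python. This is a deliberately tiny version with a single number as the hidden state, and the weights (`w_x`, `w_h`, `b`) are made-up values for illustration rather than learned ones. The names `rnn_step` and `run_rnn` are hypothetical.

```python
import math

def rnn_step(x, h_prev, w_x, w_h, b):
    """Combine the current input with the previous hidden state."""
    return math.tanh(w_x * x + w_h * h_prev + b)

def run_rnn(sequence, w_x=0.5, w_h=0.9, b=0.0):
    """Process a sequence one item at a time, carrying the hidden state forward."""
    h = 0.0
    states = []
    for x in sequence:
        h = rnn_step(x, h, w_x, w_h, b)
        states.append(h)
    return states

# Two sequences that differ only in their first item.
states_a = run_rnn([1.0, 0.0, 0.0])
states_b = run_rnn([0.0, 0.0, 0.0])

print(states_a)
print(states_b)
```

The last two inputs are identical in both sequences, yet the final hidden states differ, because the network is still carrying a trace of the first input. Notice that the trace gets weaker at each step, which hints at why plain RNNs struggle with long sequences.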
What about LSTMs? How do they fit in?
A long short-term memory (LSTM) unit is a type of recurrent neural network (RNN) component that is able to capture long-term dependencies in sequential data. It is called "long short-term memory" because it is able to remember information for long periods of time and use it to make predictions or decisions later. For example, an LSTM can keep track of a whole sentence when predicting the next word, rather than relying on just the last word.
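Here is a minimal sketch of a single LSTM step in Python, using the commonly taught variant with a forget gate (which, as far as I know, was added a few years after the original LSTM design). Everything is a single number rather than a vector to keep it readable, and the parameters in `params` are hand-picked by me so that the cell stores a value when it sees a 1 and then holds on to it. The names `lstm_step` and `run_lstm` are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One LSTM step with scalar input, hidden state h and cell state c."""
    f = sigmoid(p["wf"] * x + p["uf"] * h_prev + p["bf"])    # forget gate: how much memory to keep
    i = sigmoid(p["wi"] * x + p["ui"] * h_prev + p["bi"])    # input gate: how much new info to store
    o = sigmoid(p["wo"] * x + p["uo"] * h_prev + p["bo"])    # output gate: how much memory to reveal
    g = math.tanh(p["wg"] * x + p["ug"] * h_prev + p["bg"])  # candidate value to store
    c = f * c_prev + i * g                                    # updated long-term memory
    h = o * math.tanh(c)                                      # updated output / hidden state
    return h, c

def run_lstm(sequence, p):
    h, c = 0.0, 0.0
    for x in sequence:
        h, c = lstm_step(x, h, c, p)
    return h, c

# Hand-picked parameters (illustrative assumption): keep almost all memory,
# and only open the input gate when the input is 1.
params = {
    "wf": 0.0, "uf": 0.0, "bf": 5.0,
    "wi": 20.0, "ui": 0.0, "bi": -10.0,
    "wo": 0.0, "uo": 0.0, "bo": 5.0,
    "wg": 0.0, "ug": 0.0, "bg": 2.0,
}

# A 1 at the start followed by nine 0s, versus ten 0s.
h_seen, c_seen = run_lstm([1.0] + [0.0] * 9, params)
h_blank, c_blank = run_lstm([0.0] * 10, params)

print(f"memory after seeing a 1 ten steps ago: {c_seen:.3f}")
print(f"memory after seeing only 0s:           {c_blank:.3f}")
```

The key line is `c = f * c_prev + i * g`: because the memory is carried forward by a gate that can stay close to 1, information can survive many steps instead of fading quickly as it does in a plain RNN.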
What are RNNs useful for?
RNNs are particularly well-suited for tasks such as language modeling, machine translation, and speech recognition, where the order and context of the input data are important. They have also been applied to a wide range of other tasks, including image and video analysis, music generation, and protein folding prediction.
Here’s a great article that explains RNNs in more detail: Introducing Recurrent Neural Networks.
To be continued…
We end Part 2 with deep learning making steady progress thanks to the dedicated work of a small group of researchers including Yann LeCun, Yoshua Bengio and Geoffrey Hinton. During this period, many of their academic papers were rejected by journals and conferences because they used neural networks, even when their results outperformed previous approaches.
In 2018, LeCun, Bengio and Hinton were awarded the Turing Award, the highest honor in Computer Science, for their dedication and persistence in developing neural networks despite skepticism from the academic world.
In Part 3, we will see how the work of these researchers laid the foundations for modern day AI…
Fun fact: I used ChatGPT for a lot of the research for this series of posts. Over the course of a week I asked ChatGPT dozens of questions about the history of deep learning and the concepts behind it. To make sure the article was accurate, I asked ChatGPT to provide citations to relevant scientific papers which I’ve included in the footnotes. If you find any errors, please reach out or comment below and I will fix them!
If you enjoy reading my posts and would like to receive more content about AI for builders, founders and product managers, please subscribe!
Thanks for reading The Hitchhikers Guide to AI! Subscribe for free to receive new posts and support my work.
[1] Rumelhart, D. E., Hinton, G. E., & McClelland, J. L. (1986). A general framework for parallel distributed processing. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition.
[2] LeCun, Y., & Bengio, Y. (1995). Convolutional networks for images, speech, and time series. In The Handbook of Brain Theory and Neural Networks.