Overview of Neural Networks: History and How They Work

If you’re learning about data science, you’ve almost certainly heard of neural networks, but you may not know much more than the name. In this post, I’ll cover a bit of the history of neural networks and how they work, as a brief introduction for anyone just starting to look into them.


It may be hard to believe, but neural networks were first proposed back in 1943 by Warren McCulloch and Walter Pitts, two researchers working in Chicago. What they showed was that a neural network could, in principle, compute anything a digital computer could. We will talk more about this later, but neural networks have ‘layers’ between the input data and the output data. Their model had none, and it came with no way of being trained.

In the late 1950s, there was finally a neural network that could be trained. It was called the Perceptron, and psychologist Frank Rosenblatt first described it in 1957. The Perceptron had a single layer of adjustable weights and thresholds between the input and output. That design limited what it could do in practice: some fairly common computations turned out to be impractically time consuming on it.

Neural network research and use declined until a resurgence in the 1980s, when researchers developed algorithms that made modifying the weights and thresholds efficient enough to train networks with more than one layer (typically 2–3). These networks, however, were hard to interpret: it was difficult to work out what a trained network was actually doing and so to gain insight from its results, and there were alternative machine learning approaches that did offer that insight.

The rise of neural networks in recent years is largely due to the development of GPUs (graphics processing units). The architecture of a GPU, with many simple processing cores working in parallel, is similar to that of a neural net, and the extra processing power allowed a big jump in the number of layers a neural network can have, from 2–3 to 10–20 layers. Hence the term deep learning, where ‘deep’ refers to how many layers a neural network has.


Neural networks are composed of layers of nodes, where a node is a point where some computation happens. The first layer is the input layer and, as the name suggests, its nodes hold the variables (inputs) from your data. All of the input data is then sent to every node in the next layer. The subsequent layers consist of a varying number of hidden layers followed by a final output layer. Each node in these layers takes in the data from the previous layer and assigns a weight to each value it receives. It then multiplies each value by its weight and sums the weighted values together. If this sum is above the node’s threshold, the node is activated and sends the sum along to the next layer of nodes; otherwise it passes nothing along. This repeats, layer by layer, until the data reaches the output layer.
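To make the weighted-sum-and-threshold idea concrete, here is a minimal sketch in plain Python. The function names (node_output, layer_output, forward) and all of the weights and thresholds are made up for illustration, and representing "passes nothing along" as 0.0 is my own simplifying assumption. Real libraries use smoother activation functions and matrix operations, so treat this as a toy, not as how any particular framework does it.

```python
def node_output(inputs, weights, threshold):
    """Multiply each input by its weight, sum the results, and pass the
    sum along only if it is above the node's threshold."""
    total = sum(x * w for x, w in zip(inputs, weights))
    # Assumption: an inactive node is modeled as sending 0.0 onward.
    return total if total > threshold else 0.0


def layer_output(inputs, layer):
    """Every node in a layer receives all of the values from the previous layer."""
    return [node_output(inputs, weights, threshold) for weights, threshold in layer]


def forward(inputs, layers):
    """Feed the data through each layer in turn until it reaches the output."""
    values = inputs
    for layer in layers:
        values = layer_output(values, layer)
    return values


# Hypothetical network: 3 inputs -> 2 hidden nodes -> 1 output node.
# Each node is written as (weights, threshold).
hidden_layer = [
    ([1.0, 0.5, -0.5], 0.0),   # weighted sum is 0.5, above 0.0, so it activates
    ([-1.0, 0.5, 0.0], 0.5),   # weighted sum is 0.0, not above 0.5, so it stays off
]
output_layer = [
    ([2.0, 1.0], 0.5),         # weighted sum is 1.0, above 0.5, so it activates
]

prediction = forward([1.0, 2.0, 3.0], [hidden_layer, output_layer])
print(prediction)  # [1.0]
```

Notice that the second hidden node never activates for this input, so it contributes nothing to the output node's sum.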

The weights and thresholds of each node are randomized in the beginning and are adjusted after calculating the error between the predicted output and the actual output. This is similar to how you would train any other model, but because of their many layers and overall complexity, neural networks are able to correct for their errors and find patterns and connections in the data that a human would likely miss, and so can often make better predictions through learning.
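Here is a rough sketch of that adjust-after-error loop for a single node. To keep it short I am using the classic perceptron learning rule on a tiny made-up dataset (the logical AND of two inputs). That is a deliberate simplification: multi-layer networks are normally trained with backpropagation, which this does not show. The names predict and train, the learning rate, the seed, and the epoch limit are all assumptions chosen for the example.

```python
import random


def predict(inputs, weights, threshold):
    """Output 1 if the weighted sum is above the threshold, otherwise 0."""
    total = sum(x * w for x, w in zip(inputs, weights))
    return 1 if total > threshold else 0


def train(samples, epochs=1000, learning_rate=0.1, seed=0):
    """Start from random weights and threshold, then nudge them after every
    wrong prediction (perceptron learning rule, used here for illustration)."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    n_inputs = len(samples[0][0])
    weights = [rng.uniform(-1.0, 1.0) for _ in range(n_inputs)]
    threshold = rng.uniform(-1.0, 1.0)

    for _ in range(epochs):
        mistakes = 0
        for inputs, target in samples:
            # error is +1 (predicted too low), -1 (too high), or 0 (correct)
            error = target - predict(inputs, weights, threshold)
            if error != 0:
                mistakes += 1
                # Shift each weight toward the correct answer...
                weights = [w + learning_rate * error * x
                           for w, x in zip(weights, inputs)]
                # ...and move the threshold the opposite way.
                threshold -= learning_rate * error
        if mistakes == 0:  # a full pass with no errors: stop early
            break
    return weights, threshold


# Toy labeled data: the output should be 1 only when both inputs are 1.
and_samples = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]

weights, threshold = train(and_samples)
predictions = [predict(inputs, weights, threshold) for inputs, _ in and_samples]
print(predictions)
```

Because this toy problem can be separated by a single node, the loop settles on weights and a threshold that classify all four samples correctly. A deep network repeats the same basic idea of measuring the error and adjusting, just across many nodes and layers at once.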

In each subsequent layer, the neural network learns more and more complex relationships and ideas, which factor into its predictive capability.


It’s fairly clear how neural networks help with classification when you give them labeled training data to learn from. However, even with unlabeled data, neural networks are capable of finding similarities and clustering the data into distinct groups.

For example, deep learning can take a million images, and cluster them according to their similarities: cats in one corner, ice breakers in another, and in a third all the photos of your grandmother. This is the basis of so-called smart photo albums.

Neural networks also have the advantage of automatic feature selection. As any data scientist knows, selecting features by hand is very time consuming and a big part of building models. Through successive layers and training runs, only the combinations of features that pass the nodes’ thresholds are used to determine the outcome.

In the process, these neural networks learn to recognize correlations between certain relevant features and optimal results. They draw connections between feature signals and what those features represent, whether through a full reconstruction of the input or with labeled data.


Data Science student at Flatiron School