If we think of a neural network as a graph, the perceptron is like a node, or vertex, of that graph. A perceptron is also the "neuron" in a neural network.
The purpose of a perceptron is to "learn" how to solve linearly separable classification problems.
We are given a binary classification problem. In other words, we are given a set of points that we want to split into 2 clean groups.
We say these points are linearly separable if we can draw a line that completely separates the 2 groups.
Since there are multiple ways a line could split the points, we usually give the perceptron a list of points whose groups are already known, so it can learn where the line should be drawn.
This list of already known groupings is called the training set.
A perceptron takes in three values and outputs one value.
The three values it takes in are a vector of inputs x, a vector of weights w, and a bias b.
The dimension of x matches the dimension of the problem: since the example above was 2-dimensional, x would contain both the x and y coordinates of a point on the graph.
The value outputted is $w_1 x_1 + w_2 x_2 + \dots + w_n x_n + b$, so oftentimes we include the bias as the last weight and a 1 as the last input in order to simplify everything into one vector.
Thus, using this augmented vector representation, the value outputted is simply the dot product $w \cdot x$.
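As a minimal sketch of that computation (using NumPy; the function name and numbers are made up for illustration):

```python
import numpy as np

def perceptron_output(w, x):
    """Weighted sum with the bias folded into the last weight.

    Expects x to already have a 1 appended as its last entry,
    so the dot product w . x computes w1*x1 + ... + wn*xn + b.
    """
    return np.dot(w, x)

# A made-up 2-dimensional point (2, 3) with the trailing 1 for the bias.
x = np.array([2.0, 3.0, 1.0])
w = np.array([0.5, -0.25, 0.1])    # last entry plays the role of the bias b
print(perceptron_output(w, x))     # 0.5*2 - 0.25*3 + 0.1 = 0.35
```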
The output then gets passed through an activation function.
The activation function is designed to imitate the action potential when a neuron fires in the brain.
Think of the activation function as a dam that only leaks water (overflows) past a certain threshold.
Here are some examples of activation functions:
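For instance, here is a sketch of three commonly used activation functions in Python (the step function is the classic perceptron choice; the sigmoid and ReLU are typical additional examples):

```python
import numpy as np

def step(z):
    """Heaviside step: outputs 1 once the input crosses the threshold 0."""
    return np.where(z >= 0, 1, 0)

def sigmoid(z):
    """Squashes any input into (0, 1); useful when we want a probability."""
    return 1 / (1 + np.exp(-z))

def relu(z):
    """Passes positive inputs through unchanged, blocks negative ones."""
    return np.maximum(0, z)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(step(z))     # [0 0 1 1 1]
print(sigmoid(z))  # [0.119 0.378 0.5   0.622 0.881]
print(relu(z))     # [0.  0.  0.  0.5 2. ]
```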
The activation function acts as a control-flow for the information in a neural network, only letting the output through past a certain threshold or condition.
The bias modifies that threshold, shifting the boundary line left or right based on its value.
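To see why, consider the 2-dimensional case with a step activation: the boundary is the set of points where the weighted sum is exactly zero. Writing that line out (a quick sketch using the notation above):

$$w_1 x_1 + w_2 x_2 + b = 0 \quad\Longleftrightarrow\quad x_2 = -\frac{w_1}{w_2}\,x_1 - \frac{b}{w_2} \qquad (\text{assuming } w_2 \neq 0)$$

Changing $b$ only moves the intercept $-b/w_2$, so the line slides without changing its slope.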
In the binary classification problem, the activation function is what determines if the neuron classifies the point in the data as A or B.
If we were using a nonlinear function like the sigmoid, the output instead represents the probability that the point belongs to group A rather than group B.
Training a perceptron is done through a process called gradient descent.
If you are familiar with calculus, in one dimension gradient descent looks like $x_{n+1} = x_n - \gamma\, f'(x_n)$, where $\gamma$ is a small step size called the learning rate.
Here's an animation of Gradient Descent (Source: Aleksandr Antonov):
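As a minimal sketch, here is one-dimensional gradient descent in Python (the function $f(x) = x^2$ and the learning rate are made-up example values):

```python
def gradient_descent(df, x0, lr=0.1, steps=50):
    """Repeatedly step opposite the derivative: x <- x - lr * f'(x)."""
    x = x0
    for _ in range(steps):
        x = x - lr * df(x)
    return x

# Minimize f(x) = x^2, whose derivative is f'(x) = 2x.
print(gradient_descent(df=lambda x: 2 * x, x0=5.0))  # approaches the minimum at 0
```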
So how do we do something similar but with our training data and weights?
Well, in this case, we can use the error/cost function, i.e. the difference between the expected output and our perceptron's output, in place of the derivative as our rate of change.
Plugging everything back in, the new formula for gradient descent would look like $w_{n+1} = w_n + \gamma\,(y - \hat{y})\,x$, where $y$ is the expected output, $\hat{y}$ is the perceptron's output, and $x$ is the input for that training point.
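Here is a sketch of that training loop in Python, with the bias folded in as before (the data, learning rate, and function names are made up for illustration):

```python
import numpy as np

def train_perceptron(X, y, lr=0.1, epochs=20):
    """Perceptron update rule: w <- w + lr * (expected - predicted) * x."""
    X = np.hstack([X, np.ones((len(X), 1))])  # append a 1 to each input for the bias
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            predicted = 1 if np.dot(w, x_i) >= 0 else 0
            w += lr * (y_i - predicted) * x_i  # no change when the label was correct
    return w

# Toy linearly separable data (made-up example): group 1 when the point lies above x + y = 3.
X = np.array([[1.0, 1.0], [2.0, 0.5], [3.0, 2.0], [4.0, 3.0]])
y = np.array([0, 0, 1, 1])
print(train_perceptron(X, y))
```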
Another way of thinking about the perceptron is to visualize the weight vector as perpendicular to the boundary line separating the two groups.
This is because another way of expressing $w \cdot x$ is $\lVert w\rVert\,\lVert x\rVert\cos\theta$, where $\theta$ is the angle between the two vectors.
Whenever the angle between the x vector and w is less than 90 degrees, $\cos\theta$ is positive, so $w \cdot x$ will be positive, which leads the activation function to classify x in group 1.
Whenever the angle between the x vector and w is greater than 90 degrees, $\cos\theta$ is negative, so $w \cdot x$ will be negative, which leads the activation function to classify x in group 0.
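A quick numeric check of this, using made-up vectors:

```python
import numpy as np

w = np.array([1.0, 1.0])

x_close = np.array([2.0, 1.0])    # points roughly the same way as w
x_far = np.array([-2.0, -1.0])    # points away from w

for x in (x_close, x_far):
    dot = np.dot(w, x)
    angle = np.degrees(np.arccos(dot / (np.linalg.norm(w) * np.linalg.norm(x))))
    print(f"angle = {angle:.1f} degrees, w . x = {dot}")
# angle = 18.4 degrees,  w . x = 3.0   -> positive, classified as group 1
# angle = 161.6 degrees, w . x = -3.0  -> negative, classified as group 0
```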
If a point x is mislabeled as positive, x is subtracted from w, so w rotates away from x and the boundary line eventually shifts until x falls on the 0 side.
If a point x is mislabeled as 0, x is added to w, so w rotates towards x and the boundary line eventually shifts until x falls on the positive side.
Here's an animation depicting the training process (adapted from https://github.com/ayusek/Perceptron-Animation/tree/master):
The left side shows all the data points, as well as the weight vector in yellow.
It also shows the current training point, either in the color it was mislabeled as, or in green if it was labeled correctly.
The right side shows all of the training points seen so far.
Unfortunately, not all problems are linearly separable. One classic example is an XOR gate:
As you can see from the image above, it's impossible to separate the 1s and 0s using only one line, so more than one line is needed.
For problems that aren't linearly separable, we instead use a Multi-Layer Perceptron.
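To make the contrast concrete, here is a sketch of a hand-wired two-layer network that computes XOR using step-activation perceptrons (the weights are chosen by hand rather than learned): the hidden layer computes OR and NAND, and the output neuron ANDs them together.

```python
def step(z):
    return 1 if z >= 0 else 0

def neuron(weights, bias, inputs):
    """One perceptron: step(w . x + b)."""
    return step(sum(w * x for w, x in zip(weights, inputs)) + bias)

def xor(x1, x2):
    or_out = neuron([1, 1], -0.5, [x1, x2])      # fires if x1 OR x2 is 1
    nand_out = neuron([-1, -1], 1.5, [x1, x2])   # fires unless both are 1
    return neuron([1, 1], -1.5, [or_out, nand_out])  # AND of the two hidden neurons

for a in (0, 1):
    for b in (0, 1):
        print(a, b, xor(a, b))  # prints the XOR truth table: 0, 1, 1, 0
```

Each hidden neuron draws one line, and the output neuron combines the two half-planes, which is exactly what a single perceptron cannot do.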