Perceptron

Diagram of Perceptron

    Introduction

  • If we think of a neural network as a graph, the perceptron is a node (vertex) of that graph. The perceptron is also the "neuron" in the neural network.

  • The purpose of a perceptron is to "learn" how to separate the classes in a linearly separable classification problem.

    Problem Statement

  • We are given a binary classification problem. In other words, we are given a series of points that we want to split into 2 clean groups.

  • Graph with unlabeled points
  • We say these points are linearly separable if we can draw a straight line that puts all points of one group on one side and all points of the other group on the other side.

  • Graph with labeled points and line separating them
  • There are multiple ways to split the points into two groups with a line, so oftentimes we give the perceptron a list of points whose groupings are already known so it can learn how the line should be drawn.

  • This list of already known groupings is called the training set.
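
  • For concreteness, a tiny training set might look like the sketch below (the specific points, labels, and the name training_set are made up for illustration):

```python
# Each entry pairs a 2D point (its x and y coordinates) with its known group.
training_set = [
    ((1.0, 2.0), 1),   # a point known to belong to group 1
    ((2.0, 3.5), 1),
    ((3.0, 1.0), 0),   # a point known to belong to group 0
    ((4.0, 2.0), 0),
]

for point, label in training_set:
    print(f"point {point} belongs to group {label}")
```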

    Inputs / Outputs

  • A perceptron takes in three quantities and outputs one value.

  • The three quantities it takes in are a vector of inputs x, a vector of weights w, and a bias b.

  • The dimension of x matches the dimension of the problem. Since the above example was 2-dimensional, x would include both the x and y coordinates of a point in the graph.

  • The value outputted is $w_0 x_0 + w_1 x_1 + \ldots + w_n x_n + b = w_0 x_0 + w_1 x_1 + \ldots + w_n x_n + b \cdot 1$, so oftentimes we include the bias as the last weight and 1 as the last input in order to simplify everything into one vector.

  • Thus, if we use a vector representation, the value outputted is $\mathbf{w}^T\mathbf{x}$.
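
  • As a quick sketch of this computation (the numbers are made up; only NumPy is assumed):

```python
import numpy as np

x = np.array([1.5, 2.0, 1.0])   # inputs [x0, x1] with a trailing 1 for the bias
w = np.array([0.4, -0.3, 0.1])  # weights [w0, w1] with the bias b as the last entry

# w^T x = w0*x0 + w1*x1 + b*1
output = w.T @ x
print(output)  # 0.4*1.5 - 0.3*2.0 + 0.1*1 = 0.1 (up to float rounding)
```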

    Activation Function

  • The output then gets passed through an activation function.

  • The activation function is designed to imitate the action potential when a neuron fires in the brain.

  • Think of the activation function as a dam that only lets water through (overflows) past a certain threshold.

  • Here are some examples of activation functions (a small code sketch of all three follows after this list):

    • Binary Step Function: If x is positive the function outputs a 1. If x is negative the function outputs a 0. If x is exactly 0 the function outputs 0.5.
    • Graph of Binary Step Function
    • ReLU: If x is positive the function outputs x, otherwise the function outputs a 0.
    • Graph of ReLU Function
    • Sigmoid: Outputs $\frac{1}{1+e^{-x}}$.
    • Graph of Sigmoid Function
  • The activation function acts as control flow for the information in a neural network, only letting the output through past a certain threshold or condition.

  • The bias modifies that threshold, shifting the boundary line left or right based on its value.

  • In the binary classification problem, the activation function is what determines if the neuron classifies the point in the data as A or B.

  • If we use a nonlinear function like the sigmoid, the output can be read as the probability that the point belongs to group A rather than group B.
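
  • Here is a small Python sketch of these three activation functions (the function names are just illustrative choices):

```python
import math

def binary_step(x):
    # 1 for positive inputs, 0 for negative inputs, 0.5 exactly at 0
    if x > 0:
        return 1.0
    if x < 0:
        return 0.0
    return 0.5

def relu(x):
    # passes positive inputs through unchanged, clips everything else to 0
    return max(0.0, x)

def sigmoid(x):
    # squashes any real number into the range (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

for x in (-2.0, 0.0, 3.0):
    print(x, binary_step(x), relu(x), sigmoid(x))
```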

    Training

  • Training a perceptron is done through a process called gradient descent.

  • If you are familiar with calculus, in one dimension gradient descent looks like $x_{new} = x_{old} - \text{learning rate} \cdot f'(x_{old})$.

  • Here's an animation of Gradient Descent (Source: Aleksandr Antonov):

  • A gif showing gradient descent down a parabola
  • So how do we do something similar but with our training data and weights?

  • Well, in this case, we can use the error (the difference between the expected output and our perceptron's output) in place of the derivative.

  • Plugging everything back in, the update rule for the weights looks like $\mathbf{w}_{new} = \mathbf{w}_{old} + \text{learning rate} \cdot (\text{expected} - \text{output}) \cdot \mathbf{x}$.
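
  • A single worked step of this update rule might look like the sketch below (the learning rate, weights, and input are made-up numbers, with the bias folded in as in the earlier section):

```python
import numpy as np

learning_rate = 0.1
w = np.array([0.4, -0.3, 0.1])   # weights with the bias as the last entry
x = np.array([0.5, 2.0, 1.0])    # input with a trailing 1 for the bias
expected = 1.0                   # the known label from the training set

output = 1.0 if w @ x > 0 else 0.0   # step activation applied to w^T x -> 0.0 here
w = w + learning_rate * (expected - output) * x
print(w)  # roughly [0.45, -0.1, 0.2]: w moved toward the misclassified point
```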

    Geometrical Intuition

  • Another way of thinking about the perceptron is to visualize the weight vector as perpendicular to the boundary line separating the two groups.

  • An image showing the perependicular weight vector and two different vectors x
  • This is because another way of expressing $\mathbf{w}^T\mathbf{x}$ is $||\mathbf{w}|| \cdot ||\mathbf{x}|| \cdot \cos(\theta)$ (a small numeric check of this appears after this list).

  • Whenever the angle between the vector $\mathbf{x}$ and $\mathbf{w}$ is less than 90 degrees, such as with $\mathbf{x_1}$, $\cos(\theta)$ will be positive, which will lead the activation function to classify $\mathbf{x}$ into group 1.

  • Whenever the angle between the vector $\mathbf{x}$ and $\mathbf{w}$ is greater than 90 degrees, such as with $\mathbf{x_2}$, $\cos(\theta)$ will be negative, which will lead the activation function to classify $\mathbf{x}$ into group 0.

  • If a vector is misclassified as positive (its true label is 0), it is subtracted from w, so w moves further away from the vector and the boundary line eventually shifts until x is classified as 0.

  • If a vector is misclassified as 0 (its true label is positive), it is added to w, so w moves toward the vector and the boundary line eventually shifts until x is classified as positive.

  • Here's an animation depicting the training process (adapted from https://github.com/ayusek/Perceptron-Animation/tree/master):

  • An animation showing how the perceptron learns a binary classification problem
  • The left side shows all the data points, as well as the weight vector in yellow.

  • The left side also shows the current training point, drawn in green if it was labeled correctly or in the color of the wrong label if it was misclassified.

  • The right side shows all of the training points seen so far.
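
  • To make the dot-product identity concrete, here is a small numeric check (the vectors are made up for illustration):

```python
import numpy as np

# Numeric check of the identity w^T x = ||w|| * ||x|| * cos(theta).
w  = np.array([1.0, 2.0])
x1 = np.array([2.0, 1.0])    # angle with w is less than 90 degrees
x2 = np.array([1.0, -3.0])   # angle with w is greater than 90 degrees

for x in (x1, x2):
    dot = w @ x
    cos_theta = dot / (np.linalg.norm(w) * np.linalg.norm(x))
    print(f"w.x = {dot:+.2f}, cos(theta) = {cos_theta:+.2f}")
# The sign of w.x matches the sign of cos(theta):
# positive for x1 (classified as group 1), negative for x2 (group 0).
```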

Perceptron Code:
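
The original code listing is not reproduced here; below is a minimal Python sketch of a perceptron with the binary step activation and the update rule described above (the class and parameter names, such as Perceptron, learning_rate, and epochs, are illustrative choices, and the training data is made up):

```python
import numpy as np

class Perceptron:
    """Minimal perceptron with a binary step activation."""

    def __init__(self, n_inputs, learning_rate=0.1):
        # The bias is folded in as the last weight, paired with a constant input of 1.
        self.w = np.zeros(n_inputs + 1)
        self.learning_rate = learning_rate

    def predict(self, x):
        x = np.append(x, 1.0)                 # append the constant 1 for the bias
        return 1.0 if self.w @ x > 0 else 0.0

    def train(self, training_set, epochs=10):
        for _ in range(epochs):
            for x, expected in training_set:
                output = self.predict(x)
                # Misclassified points pull the weight vector toward (or away
                # from) themselves; correctly classified points change nothing.
                self.w += self.learning_rate * (expected - output) * np.append(x, 1.0)

# Illustrative linearly separable data: group 1 lies above the line y = x.
training_set = [
    (np.array([1.0, 2.0]), 1.0),
    (np.array([2.0, 3.5]), 1.0),
    (np.array([3.0, 1.0]), 0.0),
    (np.array([4.0, 2.0]), 0.0),
]

p = Perceptron(n_inputs=2)
p.train(training_set, epochs=20)
print([p.predict(x) for x, _ in training_set])  # should match the labels above
```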

    Limitations

  • Unfortunately, not all problems are linearly separable. One classic example is an XOR gate:

  • An XOR truth table with 2 lines necessary to separate the 1s and 0s
  • As you can see from the above image, it's impossible to separate the 1s and 0s using only one line, so multiple lines are needed (a quick check in code follows after this list).

  • For problems that aren't linearly separable, we instead use a Multi-Layer Perceptron.
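
  • To see this concretely, the brute-force sketch below searches a coarse, arbitrarily chosen grid of weights and biases and finds no single line that reproduces the XOR truth table:

```python
import itertools
import numpy as np

# XOR truth table: output is 1 exactly when the two inputs differ.
xor_inputs  = [(0, 0), (0, 1), (1, 0), (1, 1)]
xor_targets = [0, 1, 1, 0]

# Try every combination of (w1, w2, b) on a coarse grid with a step activation.
grid = np.linspace(-2.0, 2.0, 41)
found = False
for w1, w2, b in itertools.product(grid, repeat=3):
    outputs = [1 if w1 * x1 + w2 * x2 + b > 0 else 0 for x1, x2 in xor_inputs]
    if outputs == xor_targets:
        found = True
        break

print("single-line solution found:", found)  # prints False
```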