Deep learning notes from siraj on youtube¶

18 December 2017 by Anil

Tutorial

1. Make a Prediction¶

Overview

Data parsing and processing using data frames
Splitting our data into a training and testing set
Simple linear regression modal to find our line of best fit
Slope formula - $y = mx + b$

Introduction¶

What is deep learning?¶

A neural network is a machine learning model, when we create neural networks that are many layers deep, that’s Deep Learning.

Deep learning is a subset of machine learning that’s outperformed every other type of model.

Task¶

This task uses a supervised approach (with labelled data)
The type of machine learning task we’ll perform is called regression.

Create a new virtualenv and switch to it.

# create a new virtual environment
mkvirtualenv deeplearning

# switch to it
workon deeplearning

# install dependencies - we must use version 1 of matplotlib
pip install -U numpy pandas scikit-learn scipy 'matplotlib==1.5.1'

Sample brain body data¶
Brain	Body
3.385	44.500
0.480	15.500
1.350	8.100
465.000	423.000

Here’s my jupyter notebook for this lesson.

In [1]:

import warnings
warnings.filterwarnings(action="ignore", module="scipy", message="^internal gelsd")

In [2]:

import pandas as pd
from sklearn import linear_model
import matplotlib.pyplot as plt

%matplotlib inline

# Read our data into a pandas data-frame object
# which is a 2d data structure of rows and columns
dataframe = pd.read_fwf('brain_body.txt')
x_values = dataframe[['Brain']]
y_values = dataframe[['Body']]

# linear regression helps find the relationship between our 2 vars to find the only line of best fit
# Use scikits linear_model object to init our linear regression model
body_reg = linear_model.LinearRegression()

# Fit our model on our x,y value pairs
body_reg.fit(x_values, y_values)

# Now we have the line of best fit, we can plot it on a graph

# Scatter plot
plt.scatter(x_values, y_values)

# Plot our regression line
# "for ever x value we have, predict the associated y value
#  and draw a line that intersects all those points"
plt.plot(x_values, body_reg.predict(x_values))

# Save it to a file
# plt.savefig('best-fit.png', bbox_inches='tight')

# Show it
plt.show()

The x axis represents brain weights.

The y axis represents body weights.

Challenge dataset¶

In [1]:

import pandas as pd
from sklearn import linear_model
import matplotlib.pyplot as plt
import numpy as np
import warnings
warnings.filterwarnings(action="ignore", module="scipy", message="^internal gelsd")
from IPython.display import display

Data cleaning¶

Our data is in a different format, we don't have heading columns, so we need to clean it up.

We also split our data into a training and testing set.

In [2]:

global df

# Read our data into a pandas data-frame object
df = pd.read_csv('challenge.csv', names=['X', 'Y'])

# Split our data into a test and training set (so we can calculate the error)
# http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = np.asarray(train_test_split(df['X'], df['Y'], test_size=0.1))

"Dataframe length: %s, X_train: %s, X_test: %s, y_train: %s, y_test: %s" % (len(df), len(X_train), len(X_test), len(y_train), len(y_test))

Out[2]:

'Dataframe length: 97, X_train: 87, X_test: 10, y_train: 87, y_test: 10'

X_train, X_test, y_train, y_test are all of type pandas.core.series.Series.

Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred to as the index.

Create our model¶

Create a LinearRegression model and fit our line.

We need to reshape our training data from (87,) to (87,1)

# (87,)
[1,2,3,86,87]

# (87,1)
[[1],[2],[3],[86],[87]]

# np.array([1,2,3,86,87]).shape           # (5,)
# np.array([[1],[2],[3],[86],[87]]).shape # (5,1)

http://cs231n.github.io/python-numpy-tutorial/#array-indexing

In [3]:

# linear regression helps find the relationship between our 2 vars to find the only line of best fit
# Use scikits linear_model object to init our linear regression model
reg = linear_model.LinearRegression()

# Fit our model on our x,y value pairs
# https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.reshape.html
x = reg.fit(X_train.values.reshape(-1, 1), y_train.values.reshape(-1, 1))

# Returns the mean accuracy on the given test data and labels.
score = reg.score(X_test.values.reshape(-1, 1), y_test.values.reshape(-1, 1))

score

Out[3]:

-0.08021882393371649

In [4]:

# Now we have the line of best fit, we can plot it on a graph

# Create our horizontal axis ranging from 5-25
x_line = np.arange(5, 25).reshape(-1,1)

# Scatter plot
plt.scatter(df['X'], df['Y'])

# Plot our regression line
# "for ever x value we have, predict the associated y value
#  and draw a line that intersects all those points"
plt.plot(x_line, reg.predict(x_line))

# Show it
plt.show()

Slope formula example¶

Here's a simple example using a small dataset showing the slope formula.

$$y = mx + b$$

This is a table of salary and years worked.

In [5]:

salary_data = {
    'year': [1,2,3,4,5,6,7,8,9,10],
    'salary': [40000,42000,43000,46000,48000,52000,58000,62000,65500,68000]
}
df = pd.DataFrame(data=salary_data)
df

Out[5]:

	year	salary
0	1	40000
1	2	42000
2	3	43000
3	4	46000
4	5	48000
5	6	52000
6	7	58000
7	8	62000
8	9	65500
9	10	68000

In [6]:

# No train/test splitting as we only have 10 values
X_train = df['year'].values.reshape(-1, 1)
y_train = df['salary'].values.reshape(-1, 1)

# Create our linear model
reg = linear_model.LinearRegression()

# Fit our model on our x,y value pairs
x = reg.fit(X_train, y_train)

b = int(round(reg.intercept_[0]))
m = int(round(reg.coef_[0][0]))

"y = {}x + {}".format(m, b)

Out[6]:

'y = 3342x + 34067'

The above should show:

$$y = 3342x + 34067$$

So if we were prediciting the salary for someone after 15 years, the formula would be.

$$y = (3342*15) + 34067$$$$y = 84197$$

In [ ]:

2. Linear Regression using Gradient Descent¶

Overview

Hyperparameters
Sum of squared errors
Local minima
Gradient descent
Partial derivative

Linear Regression with Gradient Descent¶

We use gradient descent to find the line of best fit in our data, which we can then use to make a prediction. The gradient is a direction of positive or negative which we update our current b and current m values with each time step until we find our local minima (lowest error).

Linear regression is plain machine learning, there is no neural network.

Linear regression using gradient descent is used everywhere in ML and DL, so it’s important to understand the core concept.

Linear regression is a linear approach for modeling the relationship between a scalar dependent variable y and one or more explanatory variables (or independent variables) denoted X.

The case of one explanatory variable is called simple linear regression. For more than one explanatory variable, the process is called multiple linear regression.

Wikipeda#Linear_regression

https://raw.githubusercontent.com/llSourcell/GradientDescentExample/master/gradient_descent_example.gif

Hyperparameters¶

In the context of machine learning, hyperparameters are parameters whose values are set prior to the commencement of the learning process. By contrast, the values of other parameters are derived via training.

Wikipeda#Hyperparameter_(machine_learning)

Examples of hyper-parameters include:

Learning rate (can be dynamic)
Number of hidden layers in a deep neural network
Read more link

Local Minima¶

If we find the smallest error rate (our local minima), that’ll also give us the y-intercept and slope.

We can use these in our $y=mx+b$ equation to get our line of best fit, which we can then use to make a prediction.

https://spin.atomicobject.com/wp-content/uploads/gradient_descent_error_surface.png

This graph shows 3 dimensions of all the possible values of the y intercept, the slope and error values.

We want to find the point where the error is smallest (our local minima). Looking at the graph, it’s the blue part at the bottom of the curve.

Why local as oppose to non local?: We have a simple graph with 1 local minima, sometimes you may have many minima, so you may want to find which minima (2nd order optimization).
Do we always have 1 local minima?: No, in complex cases there could be several local minima.

The gradient¶

https://www.youtube.com/watch?v=IHZwWFHWa-w

The way we get the smallest error, is by calculating the gradient.

The gradient gives us the direction to move the slope. It tells us to move positive (up) or negative (down), after completing our iterations, we should find our local minima - the point with the lowest errors.

If a function’s gradient can be computed using the partial derivative, it’s known as a differentiable function, and we can optimize it.

The gradient is a tangent line (a line that touches the function at one point only)

To calculate the gradient, we have compute the partial derivative with respect to our values b & m

Derivatives and partial derivatives¶

A derivative means the slope, let’s say we have the following function.

\[f(a) = 3a\]

If $a = 2$ then $f(a) = 6$.

If we nudge $a = 2.001$ then $f(a) = 6.003$ moves 3 times as much.

So the slope (derivative) of $f(a)$ at $a = 2$ is $3$.

The slope is defined as the $\frac{height}{width}$ of the triangle, so in our example it would be $\frac{0.003}{0.001} = 3$.

This is commonly written as $\frac{d\;f(a)}{da} = 3$

The formal definition of derivative says whenever you nudge a to the right by an infinitesimal amount (an infinitely tiny amount) if you do that does f(a) go up 3x as much as whatever was the tiny amount you nudged it by.

—Andrew Ng @ youtube.com/watch?v=GzphoJOVEcE

The partial derivative is when we change one variable and keep one variable constant to view the relationship between them.

Calculate the partial derivative of m:

\[\frac \partial{\partial m} = \frac 2N\sum_{i=1}^n -x_i(y_i - (mx_i + b))\]

Calculate the partial derivative of b:

\[\frac \partial{\partial b} = \frac 2N\sum_{i=1}^n -(y_i - (mx_i + b))\]

Sum of squared errors¶

We want to compute the error/loss for the current time step (error/loss means the same thing), we use the sum of squared errors to do this.

Sum of squared errors example

given some points, measure the distance from each of those points to the line we have drawn and then square them, sum them all together and then divide by the total number of points

—@sirajraval

\[Error(m,b) = \frac 1N\sum_{i=1}^n (y_i - (mx_i + b))^2\]

Sum of squared error algebra equation¶
Symbol	Description
$\Sigma$	Sigma notation - a set of values we are going to iterate over
$\,i\,$	Current index point in iteration
$N$	number of points
$\,y\,$	Current y point in our dataset
$\frac 1N$	Find the average of those
Measure the squared error values for all of N points
The value returned is the sum of the squared errors - our error value

Measure the squared error values for all of N points.

The value returned is the sum of the squared errors - our error value.

We want to minimize this value every time step using gradient descent.
We have a $y$ point from our dataset, and a $y$ point on our line, we want to minimize the distance between both of those points.

We subtract the point in the dataset from the point on our line and then we square it to ensure the value is positive.

We don’t need the actual value, we only care about the magnitude of this value, we use it to minimize the error.
We now have the difference between our “y intercepts”.

3. Make a neural network¶

Overview

The McCulloch-Pitts Model of Neuron (1943)
The Perceptron
Forward propagation
Backpropagate to update weights

Introduction¶

Let’s create our own neural network from scratch using just numpy.

The McCulloch-Pitts Model of Neuron (1943)¶

The McCulloch-Pitts Model of Neuron (1943)

Take some binary inputs
Sum them
If the sum exceeds a threshold output a 0 or 1

The Perceptron by Frank Rosenblatt¶

The Perceptron by Frank Rosenblatt

The perceptron built on the mcculloch-pitts model and is known today as a single layer feed forward neural network. The perceptron incorporated the idea of weights. Given a set of input output examples, it should learn a function from it by continuously increasing or decreasing the weights. These weights are mathematically applied so that after each iteration the output prediction gets more accurate.

Lesson #3 - Create a neural network¶

In [1]:

import numpy as np
from numpy import exp, array, random, dot
import matplotlib.pyplot as plt
from past.builtins import xrange
from IPython.display import display, Markdown

In [2]:

import numpy as np

# Custom logger to log one-line arrays
def logger(msg, log_array, nl=False):
    if not isinstance(log_array, np.ndarray):
        print('{message}'.format(message=msg))
    else:
        string_array = ', '.join(map(str, log_array))
        print('{message: <{fill}}: {array}'.format(message=msg, array=string_array, fill='35'))

    if nl is True:
        print('\n')

In [3]:

class NeuralNetwork():
    def __init__(self):
        # Seed the random generator so we get the same number each time
        np.random.seed(1)
        
        # We model a single neuron with 3 input connections 1 output connection
        # we assign random weights to a 3 x 1 matrix, with values in the range -1 to 1
        self.synaptic_weights = 2 * np.random.random((3,1)) -1
        display(Markdown('## Initializing Single Layer Feed Forward Network'))
        display(Markdown('Starting our network with the following random synaptic weights:'))
        display(Markdown('> ``%s``' % self.synaptic_weights))

    # The Sigmoid function, which describes an S shaped curve.
    # We pass the weighted sum of the inputs through this function to
    # normalise them between 0 and 1.
    # We can pass a np.array here to apply it to all values
    def __sigmoid(self, x):
        return 1 / (1 + np.exp(-x))
    
    # The derivative of the Sigmoid function.
    # This is the gradient of the Sigmoid curve. The slope.
    # It indicates how confident we are about the existing weight.
    def __sigmoid_derivative(self, x):
        return x * (1-x)
    
    # We train the neural network through a process of trial and error.
    # Adjusting the synaptic weights each time.
    def train(self, training_set_inputs, training_set_outputs, number_of_training_iterations):
        for iteration in xrange(number_of_training_iterations):
            if (iteration % 2000 == 0):
                logger('Iteration: %s' % iteration, False)
                logger('Current Synaptic Weights:', self.synaptic_weights)
          
            # Pass the training set through our neural network (a single neuron).
            output = self.predict(training_set_inputs)
            
            # Calculate the error (The difference between the desired output
            # and the predicted output).
            error = training_set_outputs - output

            # Calculate the adjustment by multiplying the error by the input and again by the gradient of the Sigmoid curve.
            # This means less confident weights are adjusted more.
            # This means inputs, which are zero, do not cause changes to the weights.
            adjustment = np.dot(training_set_inputs.T, error * self.__sigmoid_derivative(output))

            # Adjust the weights.
            self.synaptic_weights += adjustment

            if (iteration % 2000 == 0):
                logger('Sigmoid of dot inputs and weights:', output)
                logger('Error of our training to outputs', error, False)
                logger('adjustment for the weights', adjustment, False)
                logger('The new synaptic weights are:', self.synaptic_weights, True)
    
    def predict(self, inputs):
        # Pass inputs through our neural network (our single layer).
        return self.__sigmoid(np.dot(inputs, self.synaptic_weights))

    
if __name__ == '__main__':
    
    # Initialise our single neuron neural network
    neural_network = NeuralNetwork()
    
    # The training set
    # We have 4 examples, each consisting of 3 input values and 1 output value
    training_set_inputs = np.array([[0,0,1], [1,1,1], [1,0,1], [0,1,1]])
    training_set_outputs = np.array([[0,1,1,0]]).T

    # This is running in a ipython notebook, so we can use markdown
    display(Markdown('We have the following training set inputs:'))
    display(Markdown('> ``%s``' % training_set_inputs))
    display(Markdown('and the expected outputs:'))
    display(Markdown('> ``%s``' % training_set_outputs))
    display(Markdown("Let's step through our network to better understand what's going on."))
    
    display(Markdown('-------------'))
    
    # Train the network using a training set
    # Do it 10,000 times and make small adjustments each time
    neural_network.train(training_set_inputs, training_set_outputs, 10000)
    
    display(Markdown('The final Synaptic weights after training for 10,000 epochs:'))
    display(Markdown('> ``%s``' % neural_network.synaptic_weights))

    # Test the network with a new situation
    display(Markdown("Let's now make a prediction with [1, 0, 0]."))
    display(Markdown('> ``%s``' % neural_network.predict(np.array([1, 0, 0]))))

    display(Markdown("The script is complete"))
    display(Markdown('-------------'))

Initializing Single Layer Feed Forward Network¶

Starting our network with the following random synaptic weights:

[[-0.16595599] [ 0.44064899] [-0.99977125]]

We have the following training set inputs:

[[0 0 1] [1 1 1] [1 0 1] [0 1 1]]

and the expected outputs:

[[0] [1] [1] [0]]

Let's step through our network to better understand what's going on.

Iteration: 0
Current Synaptic Weights:          : [-0.16595599], [0.44064899], [-0.99977125]
Sigmoid of dot inputs and weights: : [0.2689864], [0.3262757], [0.23762817], [0.36375058]
Error of our training to outputs   : [-0.2689864], [0.6737243], [0.76237183], [-0.36375058]
adjustment for the weights         : [0.28621005], [0.06391297], [0.14913351]
The new synaptic weights are:      : [0.12025406], [0.50456196], [-0.85063774]


Iteration: 2000
Current Synaptic Weights:          : [8.0024254], [-0.21312605], [-3.78977343]
Sigmoid of dot inputs and weights: : [0.02210122], [0.98200541], [0.985409], [0.01793507]
Error of our training to outputs   : [-0.02210122], [0.01799459], [0.014591], [-0.01793507]
adjustment for the weights         : [0.00052777], [2.08078895e-06], [-0.0002658]
The new synaptic weights are:      : [8.00295317], [-0.21312397], [-3.79003922]


Iteration: 4000
Current Synaptic Weights:          : [8.72812088], [-0.21048991], [-4.15496406]
Sigmoid of dot inputs and weights: : [0.01544409], [0.98741602], [0.98978021], [0.0125494]
Error of our training to outputs   : [-0.01544409], [0.01258398], [0.01021979], [-0.0125494]
adjustment for the weights         : [0.00025974], [8.52828727e-07], [-0.00013061]
The new synaptic weights are:      : [8.72838062], [-0.21048906], [-4.15509467]


Iteration: 6000
Current Synaptic Weights:          : [9.14789684], [-0.20920251], [-4.36595143]
Sigmoid of dot inputs and weights: : [0.01254323], [0.98977602], [0.99168995], [0.01019961]
Error of our training to outputs   : [-0.01254323], [0.01022398], [0.00831005], [-0.01019961]
adjustment for the weights         : [0.00017194], [4.90093661e-07], [-8.63861358e-05]
The new synaptic weights are:      : [9.14806878], [-0.20920202], [-4.36603782]


Iteration: 8000
Current Synaptic Weights:          : [9.44407174], [-0.20840265], [-4.51471078]
Sigmoid of dot inputs and weights: : [0.01082824], [0.99117199], [0.99282079], [0.00880917]
Error of our training to outputs   : [-0.01082824], [0.00882801], [0.00717921], [-0.00880917]
adjustment for the weights         : [0.00012842], [3.27845256e-07], [-6.44822235e-05]
The new synaptic weights are:      : [9.44420015], [-0.20840232], [-4.51477526]

The final Synaptic weights after training for 10,000 epochs:

[[ 9.67299303] [-0.2078435 ] [-4.62963669]]

Let's now make a prediction with [1, 0, 0].

[0.99993704]

The script is complete

If everything went right, we should see:

Synaptic weights after training    : [ 9.67299303], [-0.2078435], [-4.62963669]
Making a prediction with [1, 0, 0] : [ 0.99993704]

Let's step through the very first iteration¶

We'll step through the very first iteration to see what's going on and when.

Iteration: 0

We start with random weights¶

We model a single neuron with 3 input connections and 1 output connection.

We assign random weights to a 3 x 1 matrix, with values in the range -1 to 1.

#
# Start with random 'self.synaptic_weights'
#
Current Synaptic Weights:          : [-0.16595599], [ 0.44064899], [-0.99977125]

Create our input dot product¶

Our output variable is the dot product of the inputs and weights (and usually a bias). We pass it through a sigmoid function as our activation function to squash it and return the product.

Every hidden layer in our network will perform a similar operation.

We have our input, we need to multiply it by our weights (random to begin with) and then pass it through our sigmoid activation to get it's value. This is our output variable.

#
# 'output' variable : 'self.__sigmoid(np.dot(inputs, self.synaptic_weights))'
#
Sigmoid of dot inputs and weights: : [ 0.2689864], [ 0.3262757], [ 0.23762817], [ 0.36375058]

Calculate the error¶

Now we have our output, we can compare it to our expected output from the training set to calculate the error. We want to minimize this loss (aka error, cost) over time to make our network more accurate.

#
# 'error' variable : training_set_outputs - output
#
Error of our training to outputs   : [-0.2689864], [ 0.6737243], [ 0.76237183], [-0.36375058]

Calculate the adjustment¶

We need to optimize our network by adjusting the weights.

We do this by calculating the adjustment required to our training inputs by using gradient descent.

# Creates a transposed matrix of our existing inputs
training_set_inputs.T

# Multiply the error by the gradient of the Sigmoid curve for the output
error * self.__sigmoid_derivative(output)

# We want the dot product of these
np.dot(training_set_inputs.T, error * self.__sigmoid_derivative(output))

The adjustment is the difference from our expected inputs and the current error, which takes into account the direction we need to move.

#
# 'adjustment' variable: np.dot(training_set_inputs.T, error * self.__sigmoid_derivative(output))
#
adjustment for the weights         : [ 0.28621005], [ 0.06391297], [ 0.14913351]

Update the weights for the next iteration¶

Finally, we update the weights with our adjustment for the next epoch.

# We simply add the adjustment to our weights
self.synaptic_weights += adjustment

# synaptic_weights variable: adjustment is added to the weights to move them for the next iteration
The new synaptic weights are:      : [ 0.12025406], [ 0.50456196], [-0.85063774]

Lesson 3 - Challenge¶

The challenge is to add 3 hidden layers to our NeuralNetwork class.

We need to do a few things to our existing neural network:.

We'll rename our synaptic_weights to simply weights to keep things simple
Add 2 new layers to our code - add a function to initialize a new hidden layer.
Refactor functions - forward prop
Store our adjustments from back prop to be used in gradient descent
Add back prop

I reversed engineered Ludo Bouan's workbook to figure this out.

Adding additional hidden layers¶

We want to add 2 new hidden layers to our network, we create function add_layer to help us do this.

We need to initialize random weights and an empty adjustments tensor for this layer.

#
# Create weights with shape specified + biases
#
# np.vstack = Stack arrays in sequence vertically (row wise).
#
# a (2, 9) dimension tensor would look like this (with random numbers)
#
# [
#     [1, 2, 3, 4, 5, 6, 7, 8, 9]
#     [1, 2, 3, 4, 5, 6, 7, 8, 9]
#     [1, 2, 3, 4, 5, 6, 7, 8, 9]
# ]
#
#
# a (9, 1) dimension tensor would look like this (with random numbers)
#
# [ [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] ]
# 
self.weights[self.num_layers] = np.vstack((2 * np.random.random(shape) - 1, 2 * np.random.random((1, shape[1])) - 1))

Next we need to initialize our adjustments for this layer, the adjustments are what we calculate in the back propagation step, we need to store these so we can apply them after using gradient descent.

# This creates the adjustments layer for this shape with all zeros
# ie: it's the same as above but with all zeros instead of randmly initialized weights.
self.adjustments[self.num_layers] = np.zeros(shape)

Increment our num_layers property by one to keep track of how many layers we have.

Here's the complete add_layer function.

def add_layer(self, shape):
    # Create weights with shape specified + biases
    self.weights[self.num_layers] = np.vstack((2 * np.random.random(shape) - 1, 2 * np.random.random((1, shape[1])) - 1))

    # Initialize the adjustments for these weights to zero
    self.adjustments[self.num_layers] = np.zeros(shape)

    # Increment our counter
    self.num_layers += 1

Forward propagation¶

We forward propagate our inputs through our network and return the activation value (the sigmoid of the dot product of our inputs and weights).

We already do this in our network above, but now we store it with the layer as it's index.

def __forward_propagate(self, data):
    # Progapagate through network and hold values for use in back-propagation
    activation_values = {}
    activation_values[1] = data
    for layer in range(2, self.num_layers+1):
        data = np.dot(data.T, self.weights[layer-1][:-1, :]) + self.weights[layer-1][-1, :].T # + self.biases[layer]
        data = self.__sigmoid(data).T
        activation_values[layer] = data
    return activation_values

Back propagation¶

After forward propagating our inputs, we need to calculate the error starting from the last hidden layer working backwards.

def __back_propagate(self, output, target):
    # Delta = change
    deltas = {}

    # Delta of output Layer
    deltas[self.num_layers] = output[self.num_layers] - target

    # Delta of hidden Layers
    # Start with the last layer and work backwards
    for layer in reversed(range(2, self.num_layers)):  # All layers except input/output
        a_val = output[layer]                          # Output product
        weights = self.weights[layer][:-1, :]          # Weights for this layer
        prev_deltas = deltas[layer+1]                  # Delta of the previous layer (working backwards)

        # This is our "error weighted derivative"
        # It gives us a direction which we'll save and use to later adjust our weights
        # We multiply the weights for this layer with the weights from the previous layer
        # and save it calculate the adjustment later
        deltas[layer] = np.multiply(np.dot(weights, prev_deltas), self.__sigmoid_derivative(a_val))

    # Calculate total adjustments based on deltas
    for layer in range(1, self.num_layers):
        # For this layer get the dot product of our deltas and the current inputs to calculate our adjustments
        self.adjustments[layer] += np.dot(deltas[layer+1], output[layer].T).T

Gradient descent¶

Finally, we calculate our partial derivative with respect to our adjustment for this layer and use it to move the weights in the direction closest to our local minima (lowest point of error).

# To calculate our partial derivative, we need our batch size
# We also apply a learning rate to our gradient descent step (default: 1)
def __gradient_descent(self, batch_size, learning_rate):
    # For each layer working forwards
    for layer in range(1, self.num_layers):
        # calculate our partial derivative for this layer
        partial_d = (1/batch_size) * self.adjustments[layer]

        # apply the adjustment to the first 2 dimensions of this layers weights
        self.weights[layer][:-1, :] += learning_rate * -partial_d

        # Update the last weight dimension with a 0.001 difference
        self.weights[layer][-1, :] += learning_rate*1e-3 * -partial_d[-1, :]

In [4]:

import numpy as np
np.seterr(over='ignore')

class NeuralNetwork():
    def __init__(self):
        np.random.seed(1)  # Seed the random number generator
        self.weights = {}  # Create dict to hold weights (renamed)
        self.num_layers = 1  # Set initial number of layer to one (input layer)
        self.adjustments = {}  # Create dict to hold adjustments
        self.log = False;

    def add_layer(self, shape):
        # Create weights with shape specified + biases
        self.weights[self.num_layers] = np.vstack((2 * np.random.random(shape) - 1, 2 * np.random.random((1, shape[1])) - 1))
        # Initialize the adjustments for these weights to zero
        self.adjustments[self.num_layers] = np.zeros(shape)
        self.num_layers += 1

    def __sigmoid(self, x):
        return 1 / (1 + np.exp(-x))

    def __sigmoid_derivative(self, x):
        return x * (1 - x)

    def predict(self, data):
        # Pass data through pretrained network
        for layer in range(1, self.num_layers+1):
            data = np.dot(data, self.weights[layer-1][:, :-1]) + self.weights[layer-1][:, -1] # + self.biases[layer]
            data = self.__sigmoid(data)
        return data

    def __forward_propagate(self, data):
        # Progapagate through network and hold values for use in back-propagation
        activation_values = {}
        activation_values[1] = data
        for layer in range(2, self.num_layers+1):
            data = np.dot(data.T, self.weights[layer-1][:-1, :]) + self.weights[layer-1][-1, :].T # + self.biases[layer]
            data = self.__sigmoid(data).T
            activation_values[layer] = data
        return activation_values

    def simple_error(self, outputs, targets):
        return targets - outputs

    def sum_squared_error(self, outputs, targets):
        return 0.5 * np.mean(np.sum(np.power(outputs - targets, 2), axis=1))

    def __back_propagate(self, output, target):
        deltas = {}
        # Delta of output Layer
        deltas[self.num_layers] = output[self.num_layers] - target

        # Delta of hidden Layers
        for layer in reversed(range(2, self.num_layers)):  # All layers except input/output
            a_val = output[layer]
            weights = self.weights[layer][:-1, :]
            prev_deltas = deltas[layer+1]
            deltas[layer] = np.multiply(np.dot(weights, prev_deltas), self.__sigmoid_derivative(a_val))

        # Caclculate total adjustments based on deltas
        for layer in range(1, self.num_layers):
            self.adjustments[layer] += np.dot(deltas[layer+1], output[layer].T).T

    def __gradient_descent(self, batch_size, learning_rate):
        # Calculate partial derivative and take a step in that direction
        for layer in range(1, self.num_layers):
            partial_d = (1/batch_size) * self.adjustments[layer]

            # Update the first 2 weight dimensions
            self.weights[layer][:-1, :] += learning_rate * -partial_d

            # Update the last weight dimension with a 0.001 difference
            self.weights[layer][-1, :] += learning_rate*1e-3 * -partial_d[-1, :]

    def train(self, inputs, targets, num_epochs, learning_rate=1, stop_accuracy=1e-5):
        error = []
        for iteration in range(num_epochs):
            for i in range(len(inputs)):
                x = inputs[i]
                y = targets[i]

                # Pass the training set through our neural network
                output = self.__forward_propagate(x)

                # Calculate the error
                loss = self.sum_squared_error(output[self.num_layers], y)
                error.append(loss)

                # Calculate Adjustments
                self.__back_propagate(output, y)

            self.__gradient_descent(i, learning_rate)
            
            # Check if accuracy criterion is satisfied
            if np.mean(error[-(i+1):]) < stop_accuracy and iteration > 0:
                break

        return(np.asarray(error), iteration+1)

if __name__ == "__main__":
    # ----------- XOR Function -----------------

    # Create instance of a neural network
    nn = NeuralNetwork()

    # Add Layers (Input layer is created by default)
    nn.add_layer((2, 9))
    nn.add_layer((9, 1))

    # XOR function
    training_data = np.asarray([[0, 0], [0, 1], [1, 0], [1, 1]]).reshape(4, 2, 1)
    training_labels = np.asarray([[0], [1], [1], [0]])

    error, iteration = nn.train(training_data, training_labels, 5000)
    # nn.predict(testing_data)
    print('Error: %s' % np.mean(error[-4:]))
    print('Epoches needed to train: %s' % iteration)
    print('Predicting 1,0,0: %s' % neural_network.predict(np.array([1, 0, 0])))
    print('Predicting 0,0,0: %s' % neural_network.predict(np.array([0, 0, 0])))
    print('Predicting 0,0,1: %s' % neural_network.predict(np.array([0, 0, 1])))

Error: 7.299670911230147e-06
Epoches needed to train: 62
Predicting 1,0,0: [0.99993704]
Predicting 0,0,0: [0.5]
Predicting 0,0,1: [0.009664]

Backpropogation and gradient descent¶

Backpropogation and gradient descent are two different things, I was a little confused by this when doing this and found this explanation which helped.

Backpropogation is an efficient technique in providing a computationally efficient method for evaluating of derivatives in a network. The term backpropagation is specifically used to describe the evaluation of derivatives. Now these derivatives are used to make adjustments to the weights, the simplest such technique is gradient descent.

It is important to recognize that the two stages are distinct. Thus, the first stage, namely the propagation of errors backwards through the network (i.e. backpropogation) in order to evaluate derivatives, can be applied to many other kinds of network and not just the multilayer perceptron. Similarly, the second stage of weight adjustment using the calculated derivatives can be tackled using a variety of optimization schemes, many of which are substantially more powerful than simple gradient descent like methods having adaptive learning rates eg techniques like Nestrov mommentum, Adam etc.

https://math.stackexchange.com/a/2721936

Backpropogation is the technique of evaluating derivatives with respect to weights
Gradient descent is a simple technique of updating our weights (but other more powerful methods exist)

Backpropogation¶

"Back Propagation is a way of computing, through recursive application of the chain rule in a computational graph, the influence of every single intermediate value in that graph on the final loss function."

https://www.youtube.com/watch?v=i94OvYb6noo&t=888s

The purpose of back propagation is to figure out the partial derivatives of our error function with respect to each individual weight in the network, so we can use those in gradient descent.

Back prop is the calculation of the deltas and storing them "the adjustments", working backwards from the last layer.

Gradient descent¶

In gradient descent we take the step and update our weights with our learning rate, this happens once at the end of each epoch.

https://stats.stackexchange.com/a/184497

In [ ]:

Deep learning glossary¶

Machine Learning¶

Supervised

Labelled data - Many images of labelled cars, to create a model to label cars (using a predefined set to test against and work towards).

Get’s feedback after every move.

Unsupervised

Unlabelled data - Data set with no labels, it gets no feedback as to what’s right or wrong. It has to learn the structure of the data to solve the task.

Never get’s feedback - even if it won.

Reinforcement

Interacting with environment, only get feedback when it achieves it’s goal (a game finishes).

Only gets feedback if it won the game, or scored points.

Perceptron

Perceptron’s fit a linear separable line, if the data is non-linearly separable this approach will fail and the learning algorithm will not converge. This is a simple supervised linear model and used in neural networks. It updates weights via a simple learning algorithm.

Start with a low weight $w$
Choose a point (randomly to start) $i$
Update w based on $y$ (output), $x$ (input) and the previous weight
Check the error, update the weights accordingly

An early classifier, neural networks today use more powerful classification

Neural network

A neural network has an input layer, multiple hidden layers, and an output layer.

Each node in the hidden or output layer has it’s own classifier, it passes it’s data to the next hidden layer until eventually it reaches the final output layer where the results are determined by the scores for each node.

Neural networks were born out of the need to address the inaccuracies of the perceptron, by using a layered web of perceptron’s the accuracy of predictions could be improved, this is also known as an Multi layered perceptron (MLP).

Feed forward neural network

Signals flow in one direction, from input to output, one layer at a time.

“The Perceptron by Frank Rosenblatt” - is a feed forward forward neural network.

Linear regression

Models relationship between independent & dependent variables to find the line of best fit.

Linear_regression

Forward Propagation

The series of events starting from the input where the activation is sent to the next layer until it reaches the output is know as forward propagation.

Back Propagation

Also known as back prop, this is the process of back tracking errors through the weights of the network after forward propagating inputs through the network.

This is used by applying the chain rule in calculus. (Source)

Sigmoid

A function used to activate weights in our network in the interval of [0, 1]. This function graphed out looks like an ‘S’ which is where this function gets is name, the s is sigma in greek. Also known as the logistic function.

https://hackernoon.com/deep-learning-cheat-sheet-25421411e460#2e97

Cost

$Cost = GeneratedOutput - ActualOutput$

Gradient

A gradient describes a slope, a direction we’re moving to reduce our error rate, it’s either positive or negative.

The gradient is the partial derivative of a function that takes in multiple vectors and outputs a single value (i.e. our cost functions in Neural Networks). The gradient tells us which direction to go on the graph to increase our output if we increase our variable input.

We use the gradient and go in the opposite direction since we want to decrease our loss.

Gradient Descent

The process when we descend our gradient to approach a zero error rate and update our weight values iteratively.

https://stats.stackexchange.com/questions/184448/difference-between-gradientdescentoptimizer-and-adamoptimizer-tensorflow

Normalization

Normalization is the process of normalizing our data to operate on a scale relative to the original set of values, this can allow our model to converge faster as all the values operate on the same scale.

A popular normalization function is min max scaling.

\[z = \frac {x-min(x)}{max(x)-min(x)}\]

Hyperparameters

Hyperparameters are parameters whose values are set prior to the commencement of the learning process. By contrast, the values of other parameters are derived via training.

Examples include batch size, learning rate, number of iterations, weight decay.

Time step

A training step is one gradient update. In one step batch_size many examples are processed.

Epoch

An epoch consists of one full cycle through the training data. This is usually many steps. As an example, if you have 2,000 images and use a batch size of 10 an epoch consists of 2,000 images / (10 images / step) = 200 steps.

Weights

Weights are the probabilities that affect how data flows in the graph, they will be updated continuously during training so our results get closer to the result with each iteration.

Bias

The bias lets us shift our regression line to better fit the data.

Dot Product

When we multiply 2 matrices together, like applying weight values to input data. The resulting scalar value is the dot product.

If a matrix is returned, it’s called a cross product.

Understanding Vectors - Practical Machine Learning Tutorial with Python p.21

Local Minima

The local minima is the point where the error rate is the lowest, finding the local minima will also give us the y-intercept and slope.

https://en.wikipedia.org/wiki/Local_optimum

YOLO

https://www.ted.com/talks/joseph_redmon_how_a_computer_learns_to_recognize_objects_instantly

Algebra¶

Slope Formula

$y = mx + b$

https://www.khanacademy.org/math/algebra/two-var-linear-equations/slope/v/introduction-to-slope

b is the y’s intercept and m measures how steep.

Tangent Line

A tangent line is a straight line that touches a function at only one point.

The tangent line represents the instantaneous rate of change of the function at that one point. The slope of the tangent line at a point on the function is equal to the derivative of the function at the same point.

Secant line

A secant line is a straight line joining two points on a function. It is also equivalent to the average rate of change, or simply the slope between two points.

https://www.khanacademy.org/math/calculus-home/taking-derivatives-calc/secant-lines-calc/v/slope-of-a-line-secant-to-a-curve

Linear Algebra¶

Linear algebra, the study of the properties of vector spaces and matrices.

Scalars

A single number

Vectors

A 1 dimensional array of numbers (rows or columns)

Matrix

An n dimensional array of numbers

Rows and columns, each one is a vector

Tensor

A tensor is a multi-dimensional array.

A first order tensor is a vector. A second order tensor is a matrix.

Tensors of order three or higher are called a higher order tensors.

Calculus¶

Calculus is a branch of mathematics that studies change. It focuses on limits, functions, derivatives, integrals and infinite series.

It is used to solve mathematical problems that cannot be solved by algebra and helps in determining the rate a variable will change in relation to others.

Symbol	Description
\(\Sigma\)	Sigma notation - a set of values we are going to iterate over
\(\,i\,\)	Current index point in iteration
\(N\)	number of points
\(\,y\,\)	Current y point in our dataset
\(\frac 1N\)	Find the average of those
Measure the squared error values for all of N points
The value returned is the sum of the squared errors - our error value