# Multivariable Dual Numbers & Automatic Differentiation

In a previous post I showed how to use dual numbers to get both the value and the derivative of a function at the same time:
Dual Numbers & Automatic Differentiation

That post mentions that you can extend it to multivariable functions but doesn’t explain how. This post is that explanation, including simple working C++ code!

Extending this to multivariable functions is useful for ray marching, calculating analytical surface normals and also likely useful for training a neural network if you want an alternative to back propagation. I’m not sure about the efficiency comparison of this versus back propagation but I intend on looking into it (:

# How Does it Work?

It turns out to be really simple to use dual numbers with multivariable functions. The end result is that you want a partial derivative for each variable in the equation, so to do that, you just have a dual number per variable, and process the entire equation for each of those dual numbers separately.

We’ll work through an example. Let’s find the partial derivatives with respect to x and y of the function $3x^2-2y^3$, at input (5,2).

We’ll start by finding the partial derivative with respect to x, and then the one with respect to y.

# Example: df/dx

We start by making a dual number for our x value, remembering that the real part is the actual value for x, and the dual part is the derivative of x, which is 1:

$5+1\epsilon$

or:

$5+\epsilon$

We multiply that value by itself to get the $x^2$ value, keeping in mind that $\epsilon^2$ is zero:
$(5+\epsilon)*(5+\epsilon)= \\ 25+10\epsilon+\epsilon^2= \\ 25+10\epsilon$

Next we need to multiply that by 3 to get the $3x^2$ term:

$3*(25+10\epsilon) = 75+30\epsilon$

Putting that aside for a moment, we need to make the $2y^3$ term. We start by making our y value:

$2+0\epsilon$

or:

$2$

If you are wondering why it has a zero for the epsilon term, it’s because when we are calculating the partial derivative of x, y is a constant, so has a derivative of zero.

Next, we multiply this y value by itself twice to get the $y^3$ value:

$2*2*2=8$

We then multiply it by 2 to get the $2y^3$ term:

$8*2=16$

Now that we have our two terms, we subtract the y term from the x term to get our final result:

$75+30\epsilon-16 = \\ 59+30\epsilon$

This result says that $3x^2-2y^3$ has a value of 59 at location (5,2), and that the derivative of x at that point is 30.

That checks out, let’s move on to the derivative of y!

# Example: df/dy

Calculating the derivative of y is very similar to calculating the derivative of x, except that this time it’s the x term that has an epsilon value (derivative) of 0, while the y term has an epsilon value of 1. We’ll work through it to see how it plays out.

First up, we need to make the value for x:

$5+0\epsilon$

or:

$5$

Next we square it and multiply it by 3 to get the $3x^2$ term:

$5*5*3=75$

Next we need to make the value for y, remembering that we use an epsilon value of 1 since the derivative of y is 1 this time around.

$2+\epsilon$

We cube that value and multiply by 2 to get the $2y^3$ term:
$2*(2+\epsilon)*(2+\epsilon)*(2+\epsilon)= \\ 2*(2+\epsilon)*(4+4\epsilon+\epsilon^2)= \\ 2*(2+\epsilon)*(4+4\epsilon)= \\ 2*(8+12\epsilon+4\epsilon^2)= \\ 2*(8+12\epsilon)= \\ 16+24\epsilon$

Now we subtract the y term from the x term to get the final result:

$75 - (16+24\epsilon)= \\ 59-24\epsilon$

This result says that $3x^2-2y^3$ has a value of 59 at location (5,2), and that the derivative of y at that point is -24.

That also checks out, so we got the correct value and partial derivatives for the equation.
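As a rough sketch, the two passes above could look like this in C++. The names `Dual`, `F`, `PartialX` and `PartialY` are just made up for illustration and aren't from the full listing later in this post; only multiplication and subtraction are implemented because that's all this function needs.

```cpp
// Minimal single-variable dual number: a real part plus one epsilon coefficient.
// "Dual" is an illustrative name, not the class used in the full listing.
struct Dual
{
    float real;
    float dual;
};

// (a + b*eps) * (c + d*eps) = ac + (ad + bc)*eps, since eps^2 = 0
Dual operator*(Dual a, Dual b) { return { a.real * b.real, a.real * b.dual + a.dual * b.real }; }
Dual operator-(Dual a, Dual b) { return { a.real - b.real, a.dual - b.dual }; }

// f(x,y) = 3x^2 - 2y^3
Dual F(Dual x, Dual y)
{
    const Dual three = { 3.0f, 0.0f };
    const Dual two = { 2.0f, 0.0f };
    return three * x * x - two * y * y * y;
}

// Pass 1: x gets an epsilon value of 1, y is treated as a constant.
Dual PartialX(float x, float y) { return F({ x, 1.0f }, { y, 0.0f }); }

// Pass 2: y gets an epsilon value of 1, x is treated as a constant.
Dual PartialY(float x, float y) { return F({ x, 0.0f }, { y, 1.0f }); }
```

Calling `PartialX(5.0f, 2.0f)` gives a real part of 59 and a dual part of 30, and `PartialY(5.0f, 2.0f)` gives the same real part with a dual part of -24, matching the hand calculations. Notice that the real part gets computed twice.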

# Reducing Redundancy

There was quite a bit of redundancy when working through the x and y derivatives, wasn’t there? Increasing the number of variables will increase the amount of redundancy too, so it doesn’t scale up well.

Luckily there is a way to address this. Instead of making two separate dual numbers with two components each, you make them share the real value (since it’s the same for both, as is the work to compute it) and append the dual values for x and y to it.

So instead of having two dual numbers:

$x'=(a+b\epsilon) \\ y'=(a+c\epsilon)$

You have:

$(a+b\epsilon_x+c\epsilon_y)$

Then, in your math or in your program, you treat it as if it’s two different dual numbers packed into one. This lets you do the work for the real number once instead of twice, but still lets you do your dual number work for each variable independently.

While it’s probably easiest to think of these as two dual numbers packed into one value, there is actually a mathematical basis for it as well, which may or may not surprise you.

Check out what happens when we multiply two of these together, keeping in mind that multiplying ANY two epsilon values together becomes zero, even if they are different epsilons:

$(a+b\epsilon_x+c\epsilon_y) * (d+e\epsilon_x+f\epsilon_y)= \\ ad + ae\epsilon_x + af\epsilon_y + bd\epsilon_x + be\epsilon_x^2 + bf\epsilon_x\epsilon_y + cd\epsilon_y + ce\epsilon_x\epsilon_y + cf\epsilon_y^2= \\ ad + ae\epsilon_x + af\epsilon_y + bd\epsilon_x + cd\epsilon_y= \\ ad + (ae+bd)\epsilon_x + (af+cd)\epsilon_y$

The interesting thing is that the above result gives you the same values as if you did the same work for two dual numbers individually.
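To make that concrete, here are the two products you would get by doing the work separately, once with the x derivatives ($b$ and $e$) and once with the y derivatives ($c$ and $f$) in the dual part:

$(a+b\epsilon)*(d+e\epsilon)= \\ ad+(ae+bd)\epsilon$

$(a+c\epsilon)*(d+f\epsilon)= \\ ad+(af+cd)\epsilon$

The real part $ad$ is the same in both, and the two dual parts are exactly the $\epsilon_x$ and $\epsilon_y$ coefficients of the combined result.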

Let’s see this three component dual number in action by redoing the example. Note that this pattern scales up to ANY number of variables!

# Example: Both Derivatives (Gradient Vector)

Our goal is to calculate the value and partial derivatives of the function $3x^2-2y^3$ at location (5,2).

First we make our x value:

$5 + 1\epsilon_x + 0\epsilon_y$

or:

$5 + \epsilon_x$

We square that and multiply it by 3 to get our $3x^2$ term:

$3*(5 + \epsilon_x)*(5 + \epsilon_x)= \\ 3*(25+10\epsilon_x+\epsilon_x^2)= \\ 3*(25+10\epsilon_x)= \\ 75+30\epsilon_x$

Next, we make our y value:

$2 + 0\epsilon_x + 1\epsilon_y$

or:

$2 + \epsilon_y$

We cube it and multiply it by 2 to get our $2y^3$ term:

$16+24\epsilon_y$

Lastly we subtract the y term from the x term to get our final answer:

$(75+30\epsilon_x) - (16+24\epsilon_y)= \\ 59+30\epsilon_x-24\epsilon_y$

The result says that $3x^2-2y^3$ has a value of 59 at location (5,2), and that the derivative of x at that point is 30, and the derivative of y at that point is -24.
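Here is a small sketch of that same calculation in C++, hard coded to two variables. `Dual2` and `Evaluate` are names I'm making up just for this snippet; the full listing below uses a templated class that handles any number of variables.

```cpp
#include <array>

// A value plus one epsilon coefficient per variable (index 0 = x, index 1 = y).
// "Dual2" is an illustrative name, not the class used in the full listing.
struct Dual2
{
    float real;
    std::array<float, 2> dual;
};

Dual2 operator*(const Dual2& a, const Dual2& b)
{
    Dual2 r;
    r.real = a.real * b.real;
    for (int i = 0; i < 2; ++i)
        r.dual[i] = a.real * b.dual[i] + a.dual[i] * b.real;
    return r;
}

Dual2 operator-(const Dual2& a, const Dual2& b)
{
    Dual2 r;
    r.real = a.real - b.real;
    for (int i = 0; i < 2; ++i)
        r.dual[i] = a.dual[i] - b.dual[i];
    return r;
}

// f(x,y) = 3x^2 - 2y^3, value and both partial derivatives in one pass
Dual2 Evaluate(float x, float y)
{
    const Dual2 X = { x, { 1.0f, 0.0f } };  // x + 1*eps_x + 0*eps_y
    const Dual2 Y = { y, { 0.0f, 1.0f } };  // y + 0*eps_x + 1*eps_y
    const Dual2 three = { 3.0f, { 0.0f, 0.0f } };
    const Dual2 two = { 2.0f, { 0.0f, 0.0f } };
    return three * X * X - two * Y * Y * Y;
}
```

`Evaluate(5.0f, 2.0f)` returns a real part of 59 with dual parts 30 and -24, and the real part is only computed once.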

Neat, right?!

# Example Code

Here is the example code output:

Here is the code that generated it:

#include <stdio.h>
#include <cmath>
#include <array>
#include <algorithm>

#define PI 3.14159265359f

#define EPSILON 0.001f  // for numeric derivatives calculation

template <size_t NUMVARIABLES>
class CDualNumber
{
public:

// constructor to make a constant
CDualNumber (float f = 0.0f) {
m_real = f;
std::fill(m_dual.begin(), m_dual.end(), 0.0f);
}

// constructor to make a variable value.  It sets the derivative to 1.0 for whichever variable this is a value for.
CDualNumber (float f, size_t variableIndex) {
m_real = f;
std::fill(m_dual.begin(), m_dual.end(), 0.0f);
m_dual[variableIndex] = 1.0f;
}

// storage for real and dual values
float							m_real;
std::array<float, NUMVARIABLES> m_dual;
};

//----------------------------------------------------------------------
// Math Operations
//----------------------------------------------------------------------
template <size_t NUMVARIABLES>
inline CDualNumber<NUMVARIABLES> operator + (const CDualNumber<NUMVARIABLES> &a, const CDualNumber<NUMVARIABLES> &b)
{
CDualNumber<NUMVARIABLES> ret;
ret.m_real = a.m_real + b.m_real;
for (size_t i = 0; i < NUMVARIABLES; ++i)
ret.m_dual[i] = a.m_dual[i] + b.m_dual[i];
return ret;
}

template <size_t NUMVARIABLES>
inline CDualNumber<NUMVARIABLES> operator - (const CDualNumber<NUMVARIABLES> &a, const CDualNumber<NUMVARIABLES> &b)
{
CDualNumber<NUMVARIABLES> ret;
ret.m_real = a.m_real - b.m_real;
for (size_t i = 0; i < NUMVARIABLES; ++i)
ret.m_dual[i] = a.m_dual[i] - b.m_dual[i];
return ret;
}

template <size_t NUMVARIABLES>
inline CDualNumber<NUMVARIABLES> operator * (const CDualNumber<NUMVARIABLES> &a, const CDualNumber<NUMVARIABLES> &b)
{
CDualNumber<NUMVARIABLES> ret;
ret.m_real = a.m_real * b.m_real;
for (size_t i = 0; i < NUMVARIABLES; ++i)
ret.m_dual[i] = a.m_real * b.m_dual[i] + a.m_dual[i] * b.m_real;
return ret;
}

template <size_t NUMVARIABLES>
inline CDualNumber<NUMVARIABLES> operator / (const CDualNumber<NUMVARIABLES> &a, const CDualNumber<NUMVARIABLES> &b)
{
CDualNumber<NUMVARIABLES> ret;
ret.m_real = a.m_real / b.m_real;
for (size_t i = 0; i < NUMVARIABLES; ++i)
ret.m_dual[i] = (a.m_dual[i] * b.m_real - a.m_real * b.m_dual[i]) / (b.m_real * b.m_real);
return ret;
}

template <size_t NUMVARIABLES>
inline CDualNumber<NUMVARIABLES> sqrt (const CDualNumber<NUMVARIABLES> &a)
{
CDualNumber<NUMVARIABLES> ret;
float sqrtReal = sqrt(a.m_real);
ret.m_real = sqrtReal;
for (size_t i = 0; i < NUMVARIABLES; ++i)
ret.m_dual[i] = 0.5f * a.m_dual[i] / sqrtReal;
return ret;
}

template <size_t NUMVARIABLES>
inline CDualNumber<NUMVARIABLES> pow (const CDualNumber<NUMVARIABLES> &a, float y)
{
CDualNumber<NUMVARIABLES> ret;
ret.m_real = pow(a.m_real, y);
for (size_t i = 0; i < NUMVARIABLES; ++i)
ret.m_dual[i] = y * a.m_dual[i] * pow(a.m_real, y - 1.0f);
return ret;
}

template <size_t NUMVARIABLES>
inline CDualNumber<NUMVARIABLES> sin (const CDualNumber<NUMVARIABLES> &a)
{
CDualNumber<NUMVARIABLES> ret;
ret.m_real = sin(a.m_real);
for (size_t i = 0; i < NUMVARIABLES; ++i)
ret.m_dual[i] = a.m_dual[i] * cos(a.m_real);
return ret;
}

template <size_t NUMVARIABLES>
inline CDualNumber<NUMVARIABLES> cos (const CDualNumber<NUMVARIABLES> &a)
{
CDualNumber<NUMVARIABLES> ret;
ret.m_real = cos(a.m_real);
for (size_t i = 0; i < NUMVARIABLES; ++i)
ret.m_dual[i] = -a.m_dual[i] * sin(a.m_real);
return ret;
}

template <size_t NUMVARIABLES>
inline CDualNumber<NUMVARIABLES> tan (const CDualNumber<NUMVARIABLES> &a)
{
CDualNumber<NUMVARIABLES> ret;
ret.m_real = tan(a.m_real);
for (size_t i = 0; i < NUMVARIABLES; ++i)
ret.m_dual[i] = a.m_dual[i] / (cos(a.m_real) * cos(a.m_real));
return ret;
}

template <size_t NUMVARIABLES>
inline CDualNumber<NUMVARIABLES> atan (const CDualNumber<NUMVARIABLES> &a)
{
CDualNumber<NUMVARIABLES> ret;
ret.m_real = atan(a.m_real);
for (size_t i = 0; i < NUMVARIABLES; ++i)
ret.m_dual[i] = a.m_dual[i] / (1.0f + a.m_real * a.m_real);
return ret;
}

// templated so it can work for both a CDualNumber<1> and a float
template <typename T>
inline T SmoothStep (const T& x)
{
return x * x * (T(3.0f) - T(2.0f) * x);
}

//----------------------------------------------------------------------
// Test Functions
//----------------------------------------------------------------------

void TestSmoothStep (float input)
{
// create a dual number as the value of x
CDualNumber<1> x(input, 0);

// calculate value and derivative using dual numbers
CDualNumber<1> y = SmoothStep(x);

// calculate numeric derivative using central differences
float derivNumeric = (SmoothStep(input + EPSILON) - SmoothStep(input - EPSILON)) / (2.0f * EPSILON);

// calculate actual derivative
float derivActual = 6.0f * input - 6.0f * input * input;

// show value and derivatives
printf("(smoothstep) y=3x^2-2x^3  (x=%0.4f)\n", input);
printf("  y = %0.4f\n", y.m_real);
printf("  dual# dy/dx = %0.4f\n", y.m_dual[0]);
printf("  actual dy/dx = %0.4f\n", derivActual);
printf("  numeric dy/dx = %0.4f\n\n", derivNumeric);
}

void TestTrig (float input)
{
// create a dual number as the value of x
CDualNumber<1> x(input, 0);

// sin
{
// calculate value and derivative using dual numbers
CDualNumber<1> y = sin(x);

// calculate numeric derivative using central differences
float derivNumeric = (sin(input + EPSILON) - sin(input - EPSILON)) / (2.0f * EPSILON);

// calculate actual derivative
float derivActual = cos(input);

// show value and derivatives
printf("sin(%0.4f) = %0.4f\n", input, y.m_real);
printf("  dual# dy/dx = %0.4f\n", y.m_dual[0]);
printf("  actual dy/dx = %0.4f\n", derivActual);
printf("  numeric dy/dx = %0.4f\n\n", derivNumeric);
}

// cos
{
// calculate value and derivative using dual numbers
CDualNumber<1> y = cos(x);

// calculate numeric derivative using central differences
float derivNumeric = (cos(input + EPSILON) - cos(input - EPSILON)) / (2.0f * EPSILON);

// calculate actual derivative
float derivActual = -sin(input);

// show value and derivatives
printf("cos(%0.4f) = %0.4f\n", input, y.m_real);
printf("  dual# dy/dx = %0.4f\n", y.m_dual[0]);
printf("  actual dy/dx = %0.4f\n", derivActual);
printf("  numeric dy/dx = %0.4f\n\n", derivNumeric);
}

// tan
{
// calculate value and derivative using dual numbers
CDualNumber<1> y = tan(x);

// calculate numeric derivative using central differences
float derivNumeric = (tan(input + EPSILON) - tan(input - EPSILON)) / (2.0f * EPSILON);

// calculate actual derivative
float derivActual = 1.0f / (cos(input)*cos(input));

// show value and derivatives
printf("tan(%0.4f) = %0.4f\n", input, y.m_real);
printf("  dual# dy/dx = %0.4f\n", y.m_dual[0]);
printf("  actual dy/dx = %0.4f\n", derivActual);
printf("  numeric dy/dx = %0.4f\n\n", derivNumeric);
}

// atan
{
// calculate value and derivative using dual numbers
CDualNumber<1> y = atan(x);

// calculate numeric derivative using central differences
float derivNumeric = (atan(input + EPSILON) - atan(input - EPSILON)) / (2.0f * EPSILON);

// calculate actual derivative
float derivActual = 1.0f / (1.0f + input * input);

// show value and derivatives
printf("atan(%0.4f) = %0.4f\n", input, y.m_real);
printf("  dual# dy/dx = %0.4f\n", y.m_dual[0]);
printf("  actual dy/dx = %0.4f\n", derivActual);
printf("  numeric dy/dx = %0.4f\n\n", derivNumeric);
}
}

void TestSimple (float input)
{
// create a dual number as the value of x
CDualNumber<1> x(input, 0);

// sqrt
{
// calculate value and derivative using dual numbers
CDualNumber<1> y = CDualNumber<1>(3.0f) / sqrt(x);

// calculate numeric derivative using central differences
float derivNumeric = ((3.0f / sqrt(input + EPSILON)) - (3.0f / sqrt(input - EPSILON))) / (2.0f * EPSILON);

// calculate actual derivative
float derivActual = -3.0f / (2.0f * pow(input, 3.0f / 2.0f));

// show value and derivatives
printf("3/sqrt(%0.4f) = %0.4f\n", input, y.m_real);
printf("  dual# dy/dx = %0.4f\n", y.m_dual[0]);
printf("  actual dy/dx = %0.4f\n", derivActual);
printf("  numeric dy/dx = %0.4f\n\n", derivNumeric);
}

// pow
{
// calculate value and derivative using dual numbers
CDualNumber<1> y = pow(x + CDualNumber<1>(1.0f), 1.337f);

// calculate numeric derivative using central differences
float derivNumeric = ((pow(input + 1.0f + EPSILON, 1.337f)) - (pow(input + 1.0f - EPSILON, 1.337f))) / (2.0f * EPSILON);

// calculate actual derivative
float derivActual = 1.337f * pow(input + 1.0f, 0.337f);

// show value and derivatives
printf("(%0.4f+1)^1.337 = %0.4f\n", input, y.m_real);
printf("  dual# dy/dx = %0.4f\n", y.m_dual[0]);
printf("  actual dy/dx = %0.4f\n", derivActual);
printf("  numeric dy/dx = %0.4f\n\n", derivNumeric);
}
}

void Test2D (float inputx, float inputy)
{
// create dual numbers as the value of x and y
CDualNumber<2> x(inputx, 0);
CDualNumber<2> y(inputy, 1);

// z = 3x^2 - 2y^3
{
// calculate value and partial derivatives using dual numbers
CDualNumber<2> z = CDualNumber<2>(3.0f) * x * x - CDualNumber<2>(2.0f) * y * y * y;

// calculate numeric partial derivatives using central differences
auto f = [] (float x, float y) {
return 3.0f * x * x - 2.0f * y * y * y;
};
float derivNumericX = (f(inputx + EPSILON, inputy) - f(inputx - EPSILON, inputy)) / (2.0f * EPSILON);
float derivNumericY = (f(inputx, inputy + EPSILON) - f(inputx, inputy - EPSILON)) / (2.0f * EPSILON);

// calculate actual partial derivatives
float derivActualX = 6.0f * inputx;
float derivActualY = -6.0f * inputy * inputy;

// show value and derivatives
printf("z=3x^2-2y^3 (x = %0.4f, y = %0.4f)\n", inputx, inputy);
printf("  z = %0.4f\n", z.m_real);
printf("  dual# dz/dx = %0.4f\n", z.m_dual[0]);
printf("  dual# dz/dy = %0.4f\n", z.m_dual[1]);
printf("  actual dz/dx = %0.4f\n", derivActualX);
printf("  actual dz/dy = %0.4f\n", derivActualY);
printf("  numeric dz/dx = %0.4f\n", derivNumericX);
printf("  numeric dz/dy = %0.4f\n\n", derivNumericY);
}
}

void Test3D (float inputx, float inputy, float inputz)
{
// create dual numbers as the value of x, y and z
CDualNumber<3> x(inputx, 0);
CDualNumber<3> y(inputy, 1);
CDualNumber<3> z(inputz, 2);

// w = sin(x*cos(2*y)) / tan(z)
{
// calculate value and partial derivatives using dual numbers
CDualNumber<3> w = sin(x * cos(CDualNumber<3>(2.0f)*y)) / tan(z);

// calculate numeric partial derivatives using central differences
auto f = [] (float x, float y, float z) {
return sin(x*cos(2.0f*y)) / tan(z);
};
float derivNumericX = (f(inputx + EPSILON, inputy, inputz) - f(inputx - EPSILON, inputy, inputz)) / (2.0f * EPSILON);
float derivNumericY = (f(inputx, inputy + EPSILON, inputz) - f(inputx, inputy - EPSILON, inputz)) / (2.0f * EPSILON);
float derivNumericZ = (f(inputx, inputy, inputz + EPSILON) - f(inputx, inputy, inputz - EPSILON)) / (2.0f * EPSILON);

// calculate actual partial derivatives
float derivActualX = cos(inputx*cos(2.0f*inputy))*cos(2.0f * inputy) / tan(inputz);
float derivActualY = cos(inputx*cos(2.0f*inputy)) *-2.0f*inputx*sin(2.0f*inputy) / tan(inputz);
float derivActualZ = sin(inputx * cos(2.0f * inputy)) / -(sin(inputz) * sin(inputz));

// show value and derivatives
printf("w=sin(x*cos(2*y))/tan(z) (x = %0.4f, y = %0.4f, z = %0.4f)\n", inputx, inputy, inputz);
printf("  w = %0.4f\n", w.m_real);
printf("  dual# dw/dx = %0.4f\n", w.m_dual[0]);
printf("  dual# dw/dy = %0.4f\n", w.m_dual[1]);
printf("  dual# dw/dz = %0.4f\n", w.m_dual[2]);
printf("  actual dw/dx = %0.4f\n", derivActualX);
printf("  actual dw/dy = %0.4f\n", derivActualY);
printf("  actual dw/dz = %0.4f\n", derivActualZ);
printf("  numeric dw/dx = %0.4f\n", derivNumericX);
printf("  numeric dw/dy = %0.4f\n", derivNumericY);
printf("  numeric dw/dz = %0.4f\n\n", derivNumericZ);
}
}

int main (int argc, char **argv)
{
TestSmoothStep(0.5f);
TestSmoothStep(0.75f);
TestTrig(PI * 0.25f);
TestSimple(3.0f);
Test2D(1.5f, 3.28f);
Test3D(7.12f, 8.93f, 12.01f);
return 0;
}


# Closing

One of the neatest things about dual numbers is that they give precise results. They are not approximations and they are not numerical methods, unlike the central differences method that I compared them to in the example program (More info on numerical derivatives here: Finite Differences). Using dual numbers gives you exact derivatives, within the limitations of (eg) floating point math.

It turns out that backpropagation (the method that is commonly used to train neural networks) is just steepest descent, with the gradient computed by automatic differentiation. You can read about that here: Backpropogation is Just Steepest Descent with Automatic Differentiation

That makes me wonder how dual numbers would do in run time speed compared to back propagation as well as numerical methods for getting the gradient to adjust a neural network during training.

If I had to guess, I’d say that dual numbers may be slightly slower than backpropagation, but not as slow as numerical methods which are going to be much, much slower. We’ll see though. It may be much easier to implement neural network learning using dual numbers compared to backpropagation, so may be worth an exploration and write up, even if only to make neural networks a little bit more accessible to people.

Comments, corrections, etc? Let me know in the comments below, or contact me on twitter at @Atrix256

# A Geometric Interpretation of Neural Networks

In the 90s before I was a professional programmer / game developer I looked at neural networks and found them interesting but got scared off by things like back propagation, which I wasn’t yet ready to understand.

With all the interesting machine learning things going on in modern times, I decided to have a look again and have been pleasantly surprised at how simple they are to understand. If you have knowledge of partial derivatives and gradients (like, if you’ve done any ray marching), you have the knowledge it takes to understand it.

Here are some really great resources I recommend for learning the nuts and bolts of how modern neural networks actually work:
Learn TensorFlow and deep learning, without a Ph.D.
Neural Networks and Deep Learning
A Neural Network Playground (Web Based NN sand box from Google)

This post doesn’t require any understanding of neural networks or partial derivatives, so don’t worry if you don’t have that knowledge. The harder math comes up when training a neural network, but we are only going to be dealing with evaluating neural networks, which is much simpler.

## A Geometric Interpretation of a Neuron

A neural network is made up of layers.

Each layer has some number of neurons in it.

Every neuron is connected to every neuron in the previous and next layer.

Below is a diagram of a neural network, courtesy of Wikipedia. Every circle is a neuron.

To calculate the output value of a neuron, you multiply every input into that neuron by a weight for that input, and add a bias. This value is fed into an activation function (more on activation functions shortly) and the result is the output value of that neuron. Here is a diagram for a single neuron:

A more formal definition of a neuron’s output, before the activation function is applied, is below. $b$ is the bias, $w_j$ is the j’th weight and $i_j$ is the j’th input.
$Output = b+\sum_{j=0}^nw_ji_j$

You might notice that the above is really just the dot product of the weight vector and the input vector, with a value added on the end. We could re-write it like that:
$Output = Dot(w,i)+b$
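In code, that could be sketched like the below. `NeuronOutput` is a name I made up for illustration, and it assumes the weight and input vectors are the same length. It doesn't apply an activation function.

```cpp
#include <cstddef>
#include <vector>

// Weighted sum of the inputs plus the bias, i.e. Dot(w, i) + b.
// Assumes weights and inputs have the same number of elements.
float NeuronOutput(const std::vector<float>& weights, const std::vector<float>& inputs, float bias)
{
    float sum = bias;
    for (std::size_t j = 0; j < weights.size(); ++j)
        sum += weights[j] * inputs[j];
    return sum;
}
```

For example, weights of (1, 2), inputs of (3, 4) and a bias of 0.5 give 1*3 + 2*4 + 0.5 = 11.5.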

Interestingly, that is the same equation that you use to find the distance of a point to a plane. Let’s say that we have a plane defined by a unit length normal N and a distance to the origin d, and we want to calculate the distance of a point P to that plane. We’d use this formula:
$Distance = Dot(N,P)+d$

This would give us a signed distance, where the value will be negative if we are in the negative half space defined by the plane, and positive otherwise.
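Here is the same idea as a small 3D sketch, so you can see it really is the same calculation as the neuron. `SignedDistanceToPlane` is just an illustrative name, and the function assumes N is already unit length.

```cpp
#include <array>

// Signed distance from point P to the plane with unit length normal N and
// distance term d, i.e. Dot(N, P) + d. Negative means P is in the negative half space.
float SignedDistanceToPlane(const std::array<float, 3>& N, const std::array<float, 3>& P, float d)
{
    return N[0] * P[0] + N[1] * P[1] + N[2] * P[2] + d;
}
```

With N = (0,1,0) and d = -2 (the plane y = 2), the point (5,3,1) comes out as 1, meaning one unit into the positive half space, and the origin comes out as -2.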

This equation works if you are working in 3 dimensional space, but also works in general for any N dimensional point and plane definition.

What does this mean? Well, this tells us that every neuron in a neural network is essentially deciding what side of a hyperplane a point is on. Each neuron is doing a linear classification, saying if something is on side A or side B, and giving a distance of how far it is into A or B.

This also means that when you combine multiple neurons into a network, that an output neuron of that neural network tells you whether the input point is inside or outside of some shape, and by how much.

I find this interesting for two reasons.

Firstly, it means that a neural network can be interpreted as encoding SHAPES, which means it could be used for modeling shapes. I’m interested in seeing what sort of shapes it’s capable of, and any sorts of behaviors this representation might have. I don’t expect it to be useful for, say, mainstream game development (bonus if it is useful!) but at minimum it ought to be an interesting investigation to help understand neural networks a bit better.

Secondly, there is another family of machine learning algorithms called Support Vector Machines, which are also based on being able to tell you which side of a separation a data point is on. However, unlike the above, SVM separations are not limited to plane tests and can use arbitrary shapes for separation. Does this mean that we are leaving something on the table for neural networks? Could we do better than we are to make networks with fewer layers and fewer neurons that do better classification by doing non linear separation tests?

Quick side note: besides classification, neural nets can help us with something called regression, which is where you fit a network to some analog data, instead of the more discrete problem of classification, which tells you what group something belongs to.

It turns out that the activation function of a neuron can have a pretty big influence on what sort of shapes are possible, which makes it so we aren’t strictly limited to shapes made up of planes and lines, and also means we aren’t necessarily leaving things on the table compared to SVMs.

This all sort of gives me the feeling though that modern neural networks may not be the best possible algorithm for the types of things we use them for. I feel like we may need to generalize them beyond biological limitations to allow things like multiplications between weighted inputs instead of only sums. I feel like that sort of setup will be closer to whatever the real ideal “neural computation” model might be. However, the modern mainstream neural models do have the benefit that they are evaluated very efficiently via dot products. They are particularly well suited for execution on GPUs which excel at performing homogeneous operations on lots and lots of data. So, a more powerful and more general “neuron” may come at the cost of increased computational costs, which may make them less desirable in the end.

As a quick tangent, here is a paper from 2011 which shows a neural network model which does in fact allow for multiplication between neuron inputs. You then will likely be wanting exponents and other things, so while it’s a step in the right direction perhaps, it doesn’t yet seem to be the end all be all!
Sum-Product Networks: A New Deep Architecture

It’s also worthwhile to note that there are other flavors of neural networks, such as convolutional neural networks, which work quite a bit differently.

Let’s investigate this geometric interpretation of neurons as binary classifiers a bit, focusing on some different activation functions!

## Step Activation Function

The Heaviside step function is very simple. If you give it a value greater than zero, it returns a 1, else it returns a 0. That makes our neuron just spit out binary: either a 0 or a 1. The output of a neuron using the step activation function is just the below:

$Output = Dot(w,i)+b > 0$

The output of a neuron using the step activation function is true if the input point is in the positive half space of the plane that this neuron describes, else it returns false.

Let’s think about this in 2d. Let’s make a neural network that takes x and y as input and spits out a value. We can make an image that visualizes the range from (-1,-1) to (1,1). Negative values can be shown in blue, zero in white, and positive values in orange.

To start out, we’ll make a 2d plane (aka a line) that runs vertically and passes through the origin. That means it is a 2d plane with a normal of (1,0) and a distance from the origin of 0. In other words, our network will have a single neuron that has weights of (1,0) and a bias of 0. This is what the visualization looks like:

You can actually play around with the experiments of this post and do your own using an interactive visualization I made for this post. Click here to see this first experiment: Experiment: Vertical Separation
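If you'd rather see it as code, a single neuron with two inputs and a step activation could be sketched like this. `Step` and `StepNeuron2D` are illustrative names, not code from the interactive visualization.

```cpp
// Heaviside step: 1 for values greater than zero, otherwise 0.
float Step(float v) { return v > 0.0f ? 1.0f : 0.0f; }

// A single neuron with two inputs (x, y), weights (wx, wy), a bias, and step activation.
float StepNeuron2D(float wx, float wy, float bias, float x, float y)
{
    return Step(wx * x + wy * y + bias);
}
```

With weights of (1,0) and a bias of 0, `StepNeuron2D(1.0f, 0.0f, 0.0f, x, y)` returns 1 for any point with a positive x and 0 everywhere else, which is the vertical separation described above.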

We can change the normal (weights) to change the angle of the line, and we can also change the bias to move the line to its relative left or right. Here is the same network that has its weights adjusted to (1,1) with a bias of 0.1.

Experiment: Diagonal Separation

The normal (1,1) isn’t normalized though, which makes it so the distance from origin (aka the bias) isn’t really 0.1 units. The bias has to be divided by the length of the normal to get the REAL distance to origin, so in the above image, where the length of the normal is about 1.414, the line is actually only about 0.07 units from the origin.

Below is the visualization if we normalize the weights to (0.707,0.707) but leave the bias at 0.1 units. The result is that the line is actually 0.1 units away from the origin.

Experiment: Normalized Diagonal Separation
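A quick sketch of that calculation is below. `LineDistanceFromOrigin` is a made up helper name, and it assumes the weights aren't both zero.

```cpp
#include <cmath>

// Distance from the origin to the line wx*x + wy*y + bias = 0.
// This is the bias magnitude divided by the length of the (possibly unnormalized) normal.
float LineDistanceFromOrigin(float wx, float wy, float bias)
{
    return std::fabs(bias) / std::sqrt(wx * wx + wy * wy);
}
```

Weights of (1,1) with a bias of 0.1 give roughly 0.0707, while the normalized weights (0.707,0.707) with the same bias give roughly 0.1.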

Recalling the description of our visualization, white is where the network outputs zero, while orange is where the network outputs a positive number (which is 1 in this case).

If we define three lines such that their negative half spaces don’t completely overlap, we can get a triangle where the network outputs a zero, while everywhere else it outputs a positive value. We do this by having three sibling neurons in the first layer which define three separate lines, and then in the output neuron we give them all a weight of 1. This makes it so the area outside the triangle is always a positive value, which step turns into 1, but inside the triangle, the value remains at 0.

We can turn this negative space triangle into a positive space triangle however by making the output neuron have a weight on the inputs of -1, and adding a bias of 0.1. This makes it so that pixels in the positive space of any of the lines will become a negative value. The negative space of those three lines get a small bias to make it be a positive value, resulting in the step function making the values be 0 outside of the triangle, and 1 inside the triangle. This gives us a positive space triangle:
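Here is a sketch of both triangle networks. The three lines in `kLines` are values I picked for illustration (they bound a triangle with corners at (1,-0.5), (-1,-0.5) and (0,0.5)), not the ones from the experiments, and the function names are made up as well.

```cpp
// Heaviside step: 1 for values greater than zero, otherwise 0.
float Step(float v) { return v > 0.0f ? 1.0f : 0.0f; }

// Hypothetical first layer: each row is (weight x, weight y, bias) for one neuron.
// A point inside the triangle is in the negative half space of all three lines.
const float kLines[3][3] = {
    {  0.0f, -1.0f, -0.5f },
    {  1.0f,  1.0f, -0.5f },
    { -1.0f,  1.0f, -0.5f },
};

// Output neuron weights of 1 and no bias: 0 inside the triangle, 1 outside.
float NegativeSpaceTriangle(float x, float y)
{
    float sum = 0.0f;
    for (const auto& l : kLines)
        sum += 1.0f * Step(l[0] * x + l[1] * y + l[2]);
    return Step(sum);
}

// Output neuron weights of -1 and a bias of 0.1: 1 inside the triangle, 0 outside.
float PositiveSpaceTriangle(float x, float y)
{
    float sum = 0.1f;
    for (const auto& l : kLines)
        sum += -1.0f * Step(l[0] * x + l[1] * y + l[2]);
    return Step(sum);
}
```

At the origin, which is inside this triangle, the negative space version returns 0 and the positive space version returns 1. At (0.9,0.9), which is outside, it's the other way around.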

Taking this a bit further, we can make the first layer define 6 lines, which make up two different triangles – a bigger one and a smaller one. We can then have a second layer which has two neurons – one which makes a positive space larger triangle, and one which makes a positive space smaller triangle. Then, in the output neuron we can give the larger triangle neuron a weight of 1, while giving the smaller triangle neuron a weight of -1. The end result is that we have subtracted the smaller triangle from the larger one:
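A sketch of that three layer setup could look like the below. The line values for the two triangles are ones I made up for illustration (the smaller triangle uses the same normals as the larger one with smaller biases), and `Line`, `TriangleNeuron` and `TriangleWithHole` are hypothetical names.

```cpp
// Heaviside step: 1 for values greater than zero, otherwise 0.
float Step(float v) { return v > 0.0f ? 1.0f : 0.0f; }

// One first layer neuron: a 2D line stored as weights (wx, wy) and a bias.
struct Line { float wx, wy, bias; };

// Second layer neuron: weights of -1 on three line neurons plus a bias of 0.1,
// which gives 1 inside the triangle bounded by those lines and 0 outside.
float TriangleNeuron(const Line (&lines)[3], float x, float y)
{
    float sum = 0.1f;
    for (const Line& l : lines)
        sum -= Step(l.wx * x + l.wy * y + l.bias);
    return Step(sum);
}

// Hypothetical line values for the larger and smaller triangles.
const Line kBig[3]   = { { 0.0f, -1.0f, -0.5f },  { 1.0f, 1.0f, -0.5f },  { -1.0f, 1.0f, -0.5f } };
const Line kSmall[3] = { { 0.0f, -1.0f, -0.25f }, { 1.0f, 1.0f, -0.25f }, { -1.0f, 1.0f, -0.25f } };

// Output neuron: weight 1 on the larger triangle, weight -1 on the smaller one.
float TriangleWithHole(float x, float y)
{
    return Step(1.0f * TriangleNeuron(kBig, x, y) - 1.0f * TriangleNeuron(kSmall, x, y));
}
```

The point (0.5,-0.4) is inside the larger triangle but outside the smaller one, so it returns 1. The origin is inside both, so the two cancel and it returns 0.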

Using the step function we have so far been limited to line based shapes. This has been due to the fact that we can only test our inputs against lines. There is a way around this though: Pass non linear input into the network!

Below is a circle with radius 0.5. The neural network has only a single input which is sqrt(x*x+y*y). The output neuron has a bias of -0.5 and a weight of 1. It’s the bias value which controls the radius of the circle.
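As a sketch, that one neuron network could be written like this, with `CircleNeuron` being a name made up for illustration:

```cpp
#include <cmath>

// Heaviside step: 1 for values greater than zero, otherwise 0.
float Step(float v) { return v > 0.0f ? 1.0f : 0.0f; }

// One neuron whose only input is the non linear value sqrt(x*x + y*y).
// Weight 1 and bias -0.5 put the boundary at a radius of 0.5.
float CircleNeuron(float x, float y)
{
    const float input = std::sqrt(x * x + y * y);
    const float weight = 1.0f;
    const float bias = -0.5f;
    return Step(weight * input + bias);
}
```

This returns 0 for points inside the circle and 1 for points outside of it. Changing the bias to -0.25 would shrink the radius to 0.25.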

You could pass other non linear inputs into the network to get a whole host of other shapes. You could pass sin(x) as an input for example, or x squared.

While the step function is inherently very much limited to linear tests, you can still do quite a lot of interesting non linear shapes (and data separations) by passing non linear input into the network.

Unfortunately though, you as a human would have to know the right non linear inputs to provide. The network can’t learn how to make optimal non linear separations when using the step function. That’s quite a limitation, but as I understand it, that’s how it works with support vector machines as well: a human can define non linear separations, but the human must decide the details of that separation.

BTW it seems like there could be a fun puzzle game here. Something like you have a fixed number of neurons that you can arrange in however many layers you want, and your goal is to separate blue data points from orange ones, or something like that. If you think that’d be a fun game, go make it with my blessing! I don’t have time to pursue it, so have at it (:

## Identity and Relu Activation Functions

The identity activation function doesn’t do anything. It’s the same as if no activation function is used. I’ve heard that it can be useful in regression, but it can also be useful for our geometric interpretation.

Below is the same circle experiment as before, but using the identity activation function instead of the step activation function:

Remembering that orange is positive, blue is negative, and white is zero, you can see that outside the circle is orange (positive) and inside the circle is blue (negative), while the outline of the circle itself is white. We are looking at a signed distance field of the circle! Every point on this image is a scalar value that says how far inside or outside that point is from the surface of the shape.
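In code, the identity version of the circle neuron is about as simple as it gets. This is only a sketch and `CircleSignedDistance` is a made up name:

```cpp
#include <cmath>

// The circle neuron with identity activation: the raw weighted sum is the output.
// Input sqrt(x*x + y*y), weight 1, bias -0.5, so this is the signed distance
// to a circle of radius 0.5 centered at the origin.
float CircleSignedDistance(float x, float y)
{
    return 1.0f * std::sqrt(x * x + y * y) - 0.5f;
}
```

The origin gives -0.5 (half a unit inside), the point (1,0) gives 0.5 (half a unit outside) and the point (0.5,0) gives 0 (right on the outline).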

Signed distance fields are a popular way of rendering vector graphics on the GPU. They are often approximated by storing the distance field in a texture and sampling that texture at runtime. Storing them in a texture only requires a single color channel for storage, and as you zoom in to the shape, they preserve their shape a lot better than regular images. You can read more about SDF textures on my post: Distance Field Textures.

Considering the machine learning perspective, having a signed distance field is also an interesting proposition. It would allow you to do classification of input, but also let you know how deeply that input point is classified within its group. This could be a confidence level maybe, or could be interpreted in some other way, but it gives a sort of analog value to classification, which definitely seems like it could come in handy sometimes.

If we take our negative space triangle example from the last section and change it from using step activation to identity activation, we find that our technique doesn’t generalize naively though, as we see below. (It doesn’t generalize for the positive space triangle either.)

The reason it doesn’t generalize is that the negatives and positives of pixel distances to each of the lines cancel out. Consider a pixel on the edge of the triangle: you are going to have a near zero value for the edge it’s on, and two larger magnitude negative values from the other two edges, since the pixel is in their negative half spaces. Adding those all together is going to be a negative value.
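The cancellation can be checked with a few lines of arithmetic. The triangle below is a hypothetical example (vertices at (0,0), (4,0) and (0,3)), not the one from the experiment. Each edge neuron is assumed to output the signed distance to its line, positive on the outside of the triangle.

```python
# Each edge neuron is (weight_x, weight_y, bias); its identity-activated
# output is weight_x * x + weight_y * y + bias. The weights are unit
# normals pointing away from the triangle, so the output is a signed
# distance that is positive outside that edge.
EDGES = [
    (0.0, -1.0, 0.0),   # bottom edge, the line y = 0
    (-1.0, 0.0, 0.0),   # left edge, the line x = 0
    (0.6, 0.8, -2.4),   # slanted edge, the line 3x + 4y = 12, normalized
]

def edge_distances(x, y):
    return [wx * x + wy * y + bias for wx, wy, bias in EDGES]

def identity_output(x, y):
    # Output neuron: a plain sum of the three edge neurons.
    return sum(edge_distances(x, y))

# A pixel sitting on the bottom edge: one zero and two negatives.
print(edge_distances(2.0, 0.0), identity_output(2.0, 0.0))

# A pixel just outside the bottom edge: the one positive distance is
# swamped by the negatives from the other two edges.
print(edge_distances(2.0, -1.0), identity_output(2.0, -1.0))
```

Both pixels end up with a negative sum, so the sign of the output no longer tells you which side of the shape you are on.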

To help sort this out we can use an activation function called “relu”, which returns 0 if the value it’s given is less than zero, otherwise it returns the value. This means that all our negative values become 0 and don’t affect the distance summation. If we switch all the neurons to using relu activation, we get this:

If you squint a bit, you can see a triangle in the white. If you open the experiment and turn on “discrete output” to force 0 to orange you get a nice image that shows you that the triangle is in fact still there.

Our result with relu is better than identity, but there are two problems with our resulting distance field.

Firstly it isn’t a signed distance field – there is no blue as you might notice. It only gives positive distances, for pixels that are outside the shape. This isn’t actually that big of an issue from a rendering perspective, as unsigned distance fields are still useful. It also doesn’t seem that big of an issue from a machine learning perspective, as it still gives some information about how deeply something is classified, even though it is only from one direction.

I think that with some cleverness you could probably create the internal negative distance using different operations, and then compose it with the external positive distance in the output neuron by adding them together.

The second problem is a bigger deal though: The distance field is no longer accurate!

By summing the distance values, the distance is incorrect for points where the closest feature of the triangle is a vertex, because multiple lines are contributing their distance to the final value.
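A small sketch makes the error concrete. It uses a hypothetical triangle with vertices at (0,0), (4,0) and (0,3), with each edge neuron assumed to output a signed distance that is positive outside the triangle, and compares the summed relu output against the true distance for a point whose closest feature is a vertex.

```python
import math

# Each edge neuron is (weight_x, weight_y, bias) with a unit-length
# normal pointing away from the triangle.
EDGES = [
    (0.0, -1.0, 0.0),   # bottom edge, the line y = 0
    (-1.0, 0.0, 0.0),   # left edge, the line x = 0
    (0.6, 0.8, -2.4),   # slanted edge, the line 3x + 4y = 12, normalized
]

def relu(value):
    # Negative values become 0; everything else passes through.
    return max(0.0, value)

def relu_output(x, y):
    # Output neuron: sum of the relu-activated edge neurons.
    return sum(relu(wx * x + wy * y + bias) for wx, wy, bias in EDGES)

# Inside the triangle every edge distance is negative, so the sum is 0.
print(relu_output(1.0, 1.0))

# Straight below the bottom edge only one edge contributes, and the
# result matches the true distance of 1.
print(relu_output(2.0, -1.0))

# Past the vertex at (0,0) two edges contribute: 4 + 3 = 7, but the
# true distance to that vertex is 5.
print(relu_output(-3.0, -4.0), math.hypot(-3.0, -4.0))
```

The overestimate only shows up in the regions beyond each vertex, where more than one edge has a positive distance.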

I can’t think of any good ways to correct that problem in the context of a neural network, but the distance is an approximation, and is correct for the edges, and also gets more correct the closer you get to the object, so this is still useful for rendering, and probably still useful for machine learning despite it being an imperfect measurement.

## Sigmoid and Hyperbolic Tangent Activation Functions

The sigmoid function is basically a softer version of the step function and gives output between 0 and 1. The hyperbolic tangent activation function is also a softer version of the step function but gives output between -1 and 1.

Sigmoid:

Hyperbolic Tangent:

(images from Wolfram Mathworld)

They have different uses in machine learning, but I’ve found them to be visually indistinguishable in my experiments after compensating for the different range of output values. It makes me think that smoothstep could probably be a decent activation function too, so long as your input was in the 0 to 1 range (maybe you could clamp input to between 0 and 1?).
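One likely reason the two look so alike is that tanh is exactly a rescaled sigmoid: tanh(x) = 2 * sigmoid(2x) - 1. The factor of 2 on the input can be absorbed by the weights, and the output rescaling is the range compensation. The sketch below checks that identity numerically and also shows the clamped smoothstep idea. The function names are just for illustration, and the smoothstep here is the usual 3t² - 2t³ polynomial.

```python
import math

def sigmoid(x):
    # Output is between 0 and 1.
    return 1.0 / (1.0 + math.exp(-x))

def tanh_from_sigmoid(x):
    # Rescale a sigmoid from the 0..1 range to the -1..1 range.
    return 2.0 * sigmoid(2.0 * x) - 1.0

def smoothstep_activation(x):
    # Clamp the input to the 0..1 range, then apply 3t^2 - 2t^3.
    t = min(max(x, 0.0), 1.0)
    return t * t * (3.0 - 2.0 * t)

for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    # The two tanh columns agree up to floating point rounding.
    print(x, math.tanh(x), tanh_from_sigmoid(x), smoothstep_activation(x))
```

Unlike sigmoid and tanh, the clamped smoothstep is completely flat outside of the 0 to 1 input range, so it behaves more like the step function far away from the transition.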

These activation functions let you get some non linearity in your neural network in a way that the learning algorithms can adjust, which is pretty neat. That puts us more into the realm where a neural net can find a good non linear separation for learned data. For the geometric perspective, this also lets us make more interesting non linear shapes.

Unfortunately, I haven’t been able to get a good understanding of how to use these functions to craft desired results.

It seems like if you add an even number of hyperbolic tangents together in a neural network, you end up getting a lot of white, like below:

However, if you add an odd number of them together, it starts to look a bit more interesting, like this:

Other than that, it’s been difficult seeing a pattern that I can use to craft things. The two examples above were made by modifying the negative space triangle to use tanh instead of step.

## Closing

We’ve wandered a bit through the idea of interpreting neural networks geometrically, but I feel like we’ve only barely scratched the surface. This also hasn’t been a very rigorous exploration; it has been more about exploring the problem space to get a feeling for what might be possible.

It would be interesting to look more deeply into some of these areas, particularly for the case of distance field generation of shapes, or combining activation functions to get more diverse results.

While stumbling around, it seems like we may have gained some intuition about how neural networks work as well.

It feels like whenever you add a layer, you are adding the ability for a “logical operation” to happen within the network.

For instance, in the triangle cutout experiment, the first layer after the inputs defines the 6 individual lines of the two triangles and classifies input accordingly. The next layer combines those values into the two different triangle shapes. The layer after that converts them from negative space triangles to positive space triangles. Lastly, the output layer subtracts the smaller triangle’s values from the larger triangle’s values to make the final triangle outline shape.

Each layer has a logical operation it performs, which is based on the steps previous to it.

Another piece of intuition I’ve found is that it seems like adding more neurons to a layer allows it to do more work in parallel.

For instance, in the triangle cutout experiment, we created those two triangles in parallel, reserving some neurons in each layer for each triangle. It was only in the final step that we combined the values into a single output.

Since neurons can only access data from the previous network layer, it seems as though adding more neurons to layers can help push data forward from previous layers, to later layers that might need the data. But, it seems like it is most efficient to process input data as early as possible, so that you don’t have to shuttle it forward and waste layers / neurons / memory and computing power.

Here is some info on other activation functions:
Wikipedia:Activation Function

Here’s a link that talks about how perceptrons (step activated neural networks) relate to SVMs:
Hyperplane based Classification: Perceptron and (Intro to) Support Vector Machines

By the way, did I mention you can visualize neural networks in three dimensions as well?

Experiment: 3d Visualization

Here are the two visualizers of neural networks I made for this post using WebGL2:
Neural Network Visualization 2D
Neural Network Visualization 3D

If you play around with this stuff and find anything interesting, please share!