FIR Audio & Data Filters

There is an interactive demo that goes with this post where you can play with filter parameters, hear the resulting filter applied to audio samples, and copy/paste the simple formula to use the filter in your own applications:

There is also some simple standalone C++ code (~170 lines) that shows how to apply filters, as well as calculate phase/frequency response and zero locations:

https://github.com/Atrix256/DSPFIR

The math behind frequency filters can look pretty daunting, but it’s actually not that bad!

This post is going to talk about a simple, real time friendly filter called a “Finite Impulse Response” or FIR filter. They are also known as feed forward filters.

An order 1 FIR filter looks like this:
$y_n = a_0x_n + a_1x_{n-1}$

$y_n$ is the filtered sample, $x_n$ is the current unfiltered sample, $x_{n-1}$ is the previous unfiltered sample, and $a_0$ and $a_1$ are values to multiply the samples by to control the filtering.

For instance, the below averages the samples, which is also called a box filter. This is in fact a low pass filter, meaning that it filters away higher frequencies while letting low frequencies pass through.
$y_n = 0.5*x_n + 0.5*x_{n-1}$

If you change the addition to a subtraction, you get a high pass filter, which filters out low frequencies, and lets high frequencies pass through.
$y_n = 0.5*x_n - 0.5*x_{n-1}$

An order 2 FIR filter looks like the below and the pattern continues for higher order filters.
$y_n = a_0x_n + a_1x_{n-1} + a_2x_{n-2}$

These equations are called difference equations and this post is going to show you how to analyze them to understand what they do, and how to craft your own that behave in specific ways.

Gain Parameter

If you made a machine that applied an order 1 FIR filter, where you gave a knob to control $a_0$ and $a_1$, you’d find that the knobs were pretty unintuitive.

$y_n = a_0x_n + a_1x_{n-1}$

Specifically, if you had the knobs set to a specific setting that you liked, but you wanted the overall filtered sound to be louder or quieter, it would be difficult to make that happen. (spoiler: the ratio between them would need to stay the same but if you made them larger magnitude numbers the result would get louder)

This problem only gets worse with higher order filters as you add more knobs.

To deal with this, you can factor out the $a_0$ parameter which makes the order 1 equation into the below:

$y_n = a_0(x_n + \frac{a_1}{a_0}x_{n-1})$

We can replace $\frac{a_1}{a_0}$ with $\alpha_1$ to get this equation:

$y_n = a_0(x_n + \alpha_1x_{n-1})$

Now, you can have two knobs for those two parameters and things are much easier to deal with. You can adjust $a_0$ to adjust the volume, and you can adjust $\alpha_1$ to adjust how frequencies are affected by the filter.

If you wanted to get $a_1$ out again, like for use in the difference equation, you would just calculate it like this:

$a_1 = \alpha_1 * a_0$

You can do the same process to the order 2 FIR filter to get this equation:

$y_n = a_0(x_n + \alpha_1x_{n-1} + \alpha_2x_{n-2})$

where

$\alpha_1 = \frac{a_1}{a_0}$

and

$\alpha_2 = \frac{a_2}{a_0}$

From Difference Equation to Transfer Function

The difference equation is how you filter data, but it doesn’t really help you understand what the filter is actually doing.

To understand what the filter does, and how it affects various frequencies, we need something called a transfer function. We can transform a difference equation into a transfer function with a handful of steps.

First, instead of putting a signal $x_n$ through our filter, we need to put through a complex sinusoid $e^{i \omega t}$, which will let us plug in different frequency values for omega ($\omega$) to see how the filter affects those frequencies.

While we can replace $x_n$ with $e^{i \omega t}$, what do we replace $x_{n-1}$ with? We replace it with $e^{i \omega (t-1)}$. Another way to write that is $e^{i \omega t}e^{-i \omega}$. That shows that we can multiply any point in our complex sinusoid signal by $e^{-i \omega}$ to get the previous sample, which is super handy, as it delays the signal by one sample.

So, let’s go back to our difference equation and plug in the complex sinusoid.

$y_n = a_0(x_n + \alpha_1x_{n-1}) \\ y(t) = a_0(e^{i \omega t} + \alpha_1e^{i \omega t}e^{-i \omega})$

Then let’s factor out the complex sinusoid.

$y(t) = a_0e^{i \omega t}(1 + \alpha_1e^{-i \omega})$

That sinusoid is what we input into the equation so we can replace it with $x(t)$.

$y(t) = a_0x(t)(1 + \alpha_1e^{-i \omega})$

The transfer function is $y(t)$ divided by $x(t)$ so let’s put it in that transform function form:

$H(\omega) = \frac{y(t)}{x(t)} = a_0 (1 + \alpha_1e^{-i \omega})$

The next step is to do a replacement in the function. We are going to define $z=e^{i \omega}$, which makes our transfer function look like the below.

$H(z) = a_0 (1 + \alpha_1z^{-1})$

We are done! That is the transfer function of our order 1 FIR filter which is controlled by two parameters:

• $a_0$ – This is the gain (volume) of the filter
• $\alpha_1$ – This controls how the frequencies are filtered. Multiply this by $a_0$ to get the $a_1$ parameter if you ever want to (like to use in the difference equation), but let the user control this value directly.

For order 2, if we do the same, we end up with this equation:

$H(z) = a_0 (1 + \alpha_1z^{-1} + \alpha_2z^{-2})$

The parameters are the same but order 2 adds in a $\alpha_2$ which also controls how the frequencies are filtered. You can also multiply $\alpha_2$ by $a_0$ if you want to calculate the $a_2$ value for use in the difference equation.

Complex Sinusoids, Complex Unit Circle, Phase and Frequency Response

There’s a bizarre looking equation called Euler’s identity which looks like this:

$e^{i\pi}=-1$

This is actually just a specific value (pi) plugged into the more general Euler’s formula:

$e^{ix}=\cos x+i\sin x$

In those formulas, you can see that x is an angle, so when you plug in pi, you are plugging in 180 degrees. The result you get out is a complex number where the real part is the cosine of the angle, and the imaginary part is the sine of the angle.

The cosine of pi is -1, and the sine of pi is 0, so that’s why Euler’s identity gives you -1 as a result. It’s really -1 + 0i.

Something interesting you may remember is that if you want to draw a point on a unit circle, x is cosine theta, and y is sine theta. (Theta being the angle from the origin to the point)

So, what this tells us is that Euler’s formula is a way of drawing a circle where the x axis is the real number axis, and the y axis is the imaginary number axis.

In the last section, we plugged a similar looking – but different – complex sinusoid into the equation that looked like this:

$e^{i \omega t}$

The difference is that we have omega times t as our x value.

t is the time variable, so is what we increase to move around the circle.

Omega says how many radians around the circle the point travels every time t increases by 1.

If omega is zero, that means the point is stationary and doesn’t move. It also means it has a frequency of zero. This is 0hz or DC.

If omega is pi/2, that means the point moves pi/2 radians (90 degrees) around the circle every time t increase by 1. If this were a cosine wave, and t was an integer index for samples, it would repeat: 1, 0, -1, 0. This is also 1/2 nyquist frequency.

if omega is pi, that means the point moves pi radians (180 degrees) around the circle every time t increase by 1. As a cosine wave it would repeat: 1, -1. This is nyquist frequency.

Anything higher frequency than nyquist has aliasing problems and can’t be reconstructed from the samples, so it makes sense to stop there.

Similarly, if at 0 radians, you go negative, such as -pi/2, that is the same as 3*pi/2 which is above nyquist, so we stay away from the negative frequencies as well.

In any case, we are plugging a circle into the transfer function where the angle on the circle just means “how many radians does this wave form advance each sample”, and as output, we get that circle but distorted. A filter where output = input without any filtering happening would leave the circle unaffected.

The output point for a given input point is another point on the complex plane (x = real, y = imaginary).

The distance of this point from the origin (or, the magnitude of this vector) tells us how much the amplitude (volume) would be altered for a signal with that frequency. A value less than 1 means that it gets smaller/quieter, while a value greater than 1 means that it gets larger/louder.

The angle that the output point is on the circle (use atan2 to get this value) tells you how much this signal will be “phase shifted” or how much it will be delayed. For instance, if the signal is advancing at 90 degrees per sample, and it has a phase shift of -90 degrees, that means the signal is effectively delayed by 1 sample. A positive phase shift means that the signal is moved forward in time, and is an anti-delay (it’s from the future kinda).

This stuff is all very interesting, because real signals / sounds are commonly made up of several different frequencies. The filter treats all those frequencies independently and does the things it says it will – as far as wave amplitude and phase shift – to each frequency that may be present, independent of what it does to the other frequencies.

Frankly, it feels a bit magic that it’s able to do that, but that’s what it does.

You can actually plug angles between 0 and pi into the transfer function, get complex numbers out, calculate the amplitude and phase response, and graph them. This gives you something where you can see how a filter actually behaves. We do 0 to pi, because that is 0hz to nyquist.

Below is the before mentioned “box filter” aka low pass filter, that you get when you average the current sample with the last. You can see that the low frequencies are left alone (0db) but that higher frequencies drop down in amplitude a lot as they approach nyquist. The x axis labels are % of nyquist, so at the very right is the nyquist frequency – the highest frequency that is able to be expressed.

Note: this and the other screenshots came from the interactive demo that goes with this post. You can find it at http://demofox.org/DSPFIR/FIR.html

Pole Zero Plot of Order 1 FIR Filter

Lets take a look at the transfer function of our order 1 FIR filter again:

$H(z) = a_0 (1 + \alpha_1z^{-1})$

Now i have a question for you… what value(s) of z make this function equal to zero? You can work through some simple algebra to find that it’s where $z = -\alpha_1$:

$a_0 (1 + \alpha_1z^{-1}) = 0\\ 1 + \alpha_1z^{-1} = 0\\ \alpha_1z^{-1} = -1\\ z^{-1} = -\frac{1}{\alpha_1}\\ z = -\alpha_1$

This just means that on the complex plane, the function equals zero at that location.

As points in the function get closer to there, they will approach zero as well.

What this means is that frequencies on the complex sinusoid unit circle that are nearer to this zero will be attenuated (shrunken, made smaller / quieter), while frequencies farther away won’t be, or won’t be attenuated as much.

So, you control what frequencies get filtered just by where the zero is!

If the zero is on the left, at angle pi (180 degrees), where the highest frequencies live (nyquist!), the high frequencies will get attentuated, making it a low pass filter. Below is the box filter again, which does this.

If the zero is on the right, at angle 0, where the low frequencies live, the low frequencies will get attenuated, making it a high pass filter.

If you put the zero in the middle, you get a filter that doesn’t do anything, other than possibly a0 adjusting the volume, like in this filter.

You don’t have to keep the zero inside the unit circle though. Here it is going outside of the unit circle to the right. You can see that low frequencies are still attenuated, so it’s a high pass filter, but they are not attenuated as sharply.

It’s also worth noting that the last filter was the first filter where the phase response was not a line. The other filters had “linear phase response”. Linear phase filters have some nice properties that are sometimes desired, such as not distorting the shape of a wave form, other than adjusting frequency amplitude.

You’ve now essentially seen everything that an order 1 FIR filter can do. There’s a zero you can move around on the real axis (the x axis) and any frequencies near the zero get quieter when passed through the filter.

It’s important to note that the zero can only make things quieter. This means that FIR filters are stable. They never suffer from feedback issues where they make something louder and louder and louder into infinity or start spitting out NaNs. That makes them pretty attractive if you really want a stable filter, but that stability is also tameness.

Pole Zero Plot of Order 2 FIR Filter

Let’s take a look at the transfer function of 2nd order filters.

$H(z) = a_0 (1 + \alpha_1z^{-1} + \alpha_2z^{-2})$

Just like with the order 1 filters, we need to find out where the zeroes of the function are. With some algebra we can get it to something more manageable:

$a_0 (1 + \alpha_1z^{-1} + \alpha_2z^{-2}) = 0 \\ 1 + \alpha_1z^{-1} + \alpha_2z^{-2} = 0 \\ \frac{z^2 + \alpha_1z^1 + \alpha_2}{z^2} = 0$

Looking at that, when z is 0 (the origin), it shoots to infinity. That’s called a “pole” which is the opposite of a zero. We’ll talk about them lots more in the next post, but for now, since it’s at the origin we can ignore it as a “trivial pole”.

So, we need to figure out when the top part of the fraction is equal to zero:

$z^2 + \alpha_1z^1 + \alpha_2 = 0$

Luckily this is a standard quadratic polynomial, just like if we had $Ax^2+Bx+C$. In our case, x is z, A is 1, B is $\alpha_1$ and C is $\alpha_2$.

To solve those equations we normally would use the quadratic equation:

$\frac{-B \pm \sqrt{B^2-4AC}}{2A}$

For our purposes, it is more convenient to re-arrange that a bit to be like the below:

$\frac{-B}{2A} \pm \frac{\sqrt{B^2-4AC}}{2A}$

The reason for this is because we are going to have two zeroes (roots) when we solve this thing. The left side is the x axis (real axis) middle point between the two roots. The right side tells us how much to move on each side of that middle point to get to the zeroes. If the number under the square root is positive, it will give us a real number answer, that we will use to move left and right on the x axis. If the number under the square root is negative, it will give us an imaginary number answer, that we will use to move up and down on the y axis.

We can plug in our values for A, B and C and get our roots:

$z = \frac{-\alpha_1}{2} \pm \frac{\sqrt{\alpha_1^2-4\alpha_2}}{2}$

It isn’t as clear where the roots are as it was for an order 1 filter, but it will tell us where they are if we plug values in.

Firstly, the center between the zeroes will be at $\frac{-\alpha_1}{2}$. So, if $\alpha_1$ is 1.0, the zeroes will be evenly spaced away from -0.5 on the real axis. If it’s -2.0, the zeroes will be evenly spaced away from 1.0 on the real axis.

From there, you just modify $\alpha_2$ to change if the zeroes move on the real axis (x axis) or the imaginary axis (y axis) and by how much. Pro Tip: If you give $\alpha_2$ a value of 1.0, the zeroes will be on the unit circle.

Here is an order 2 high pass filter:

Here is an order 2 low pass filter:

Lastly, here’s an order 2 notch filter, which means it cuts a “notch” out of the frequency spectrum. It can sound pretty cool to play a sound (in the acompanying demo) and slide the notch up and down the frequency range.

Estimating Frequency Response From a Pole Zero Plot

If you have a pole zero plot of an FIR, here is how you estimate how much a specific frequency will be attenuated by that filter.

Draw a line from that frequency on the unit circle to every zero on the plot and get the lengths of those lines.

Multiply all the length of the lines together.

That final result is an estimate of how much the amplitude of a sinusoid wave in that frequency will be attenuated.

A value smaller than 1 means it will get smaller (quieter) while a value greater than 1 means it will get bigger (louder).

I’m not sure why it’s only called an estimate though. When looking at order 1 and order 2 filters, I’ve been unable to find a situation where they don’t match visually. The demo has a checkbox at the bottom to draw the estimate in teal if you want to see for yourself.

Why is it Called Finite Impulse Response?

There is a concept of an “impulse” where you have a signal this is a bunch of zeroes, then a single 1, then a bunch of zeroes again. Putting an impulse through a filter is a way to analyze how a filter behaves, and in fact is a unique finger print for the filter which you can use to reconstruct the coefficients of the filter itself.

If it’s a finite impulse response filter, the impulse response is actually the coefficients of the filter itself, which is pretty nifty.

Let’s check out the difference function for an order 1 FIR:

$y_n = a_0x_n + a_1x_{n-1}$

In this situation we have only a couple cases to think about.

When we start in the “a bunch of zeroes” section of the signal, it’s going to be a0 * 0 + a1 * 0, which is always going to be zero.

So, for that part of the signal, we always have zeroes.

Then, we hit a point where $x_n$ is 1 and $x_{n-1}$ is 0. In this case, the output of the filter is going to be a0.

Next, we hit the point where $x_n$ is 0 again and $x_{n-1}$ is 1. In this case, the output of the filter is going to be a1.

Then, we have a bunch of zeroes again and the filter will keep putting out zeroes until the sun goes dark.

The fact that this filter had a “finite response” to the impulse, meaning it settled back down to zero and stayed zero, is why it’s named a finite response.

In contrast to this, if the filter had some kind of internal state that was updated as each sample went through, it could very well never go back to zero. In that case, we’d have an infinite impulse response or IIR. The next blog post is going to talk about that kind of a filter.

Crafting a Filter By Placing Zeroes

If you have a high order FIR – say an 8th order one – it’s very difficult to find the zeroes because you ultimately need to find the zeroes of an 8th order polynomial, which is not really possible analytically in the general case.

However, if you want to go the reverse way, where you have placed your zeroes in specific locations and want to make a filter from that, that is a lot easier.

First up, you need to decide where your zeroes are.

• You can have as many zeroes as you want.
• Duplicating zeroes make them have steeper attenuation, but also the effect is more localized and more distant frequencies are affected less than closer ones which are affected more.
• You can add as many real valued zeroes as you want.
• For every complex zero you add, you also need to add it’s complex conjugate. For instance, if you add 1+3i, you also need to add 1-3i. If you don’t do this, your filter can spit out complex numbers instead of only real numbers!

To work through an example, let’s say that we have 4 zeroes: 3, -2, 3i, -3i.

First we need to make a polynomial with those zeroes.

$0 = (x-3)(x+2)(x+3i)(x-3i) \\ 0 = (x^2-x-6)(x^2+9) \\ 0 = x^4-x^3+3x^2-9x-54$

Next, we could work through some simple algebra (the reverse of what we did in previous sections), but to jump straight to the punch line, our filter coefficients are the coefficients of this polynomial: 1, -1, 3, -9, -54. That gives us this difference equation:

$y(n) = 1*x_n - 1*x_{n-1} + 3*x_{n-2} - 9*x_{n-3} - 54*x_{n-4}$

Or simplified:

$y(n) = x_n - x_{n-1} + 3x_{n-2} - 9x_{n-3} - 54x_{n-4}$

If you want a $a_0$ gain parameter in your filter function, just multiply all the coefficients by that value.

You can also use those coefficients to go directly to the transfer function very similarly. The coefficients are inside the parens.

$H(z) = a_0(1*z^0 - 1*z^{-1} + 3*z^{-2} - 9*z^{-3} - 54*z^{-4})$

Or simplified:

$H(z) = a_0(1 - z^{-1} + 3z^{-2} - 9z^{-3} - 54z^{-4})$

Before moving on, it’s important to note that some coefficients may be zero and you need to watch out for that. For instance, let’s say you have zeroes at +/- i:

$0 = (x-i)(x+i) \\ 0 = x^2 + 1$

You would probably be tempted to say that the filter coefficients are “1, 1” (I made that mistake!) but in fact, they are “1, 0, 1”. The equation above is really this one below:

$0 = x^2 + 0x + 1$

Making Cool Sounds & Music

A really fun thing to do in the demo (http://demofox.org/DSPFIR/FIR.html) is to make it play white noise and then filter the white noise. You can make it sound like wind or the ocean, or other things. White noise contains all frequencies so filtering like this can make some really interesting sounds. Some other types of sounds (like a sine wave tone) have only a single or very few frequencies, so putting them through filters doesn’t make them sound that interesting.

Filtering sounds to create new sounds is called subtractive synthesis (since you are attenuating or removing frequencies). It’s not something I’m very experienced at musically, but I’m finding that I really like the sound of making some frequencies become very loud and clip (causing distortion), while other frequencies do not. I anticipate the next post on IIRs will have more of that. IIRs place infinities (poles) on the pole zero plot so can boost frequencies quite high, instead of how FIRs can only place zeroes that make some frequencies quieter.

Here’s something to try in the demo…

1. Have it play the harp sound clip
2. Set it to an order 2 filter
3. Turn on overdrive and set gain to 33
4. A0 = 33, alpha1 = -1.34, alpha2 = 1.
5. Bonus: slide alpha1 back and forth slowly. You could imagine putting that parameter on low frequency sine wave so that it did this part automatically.

You are on your way to making music! If that sounds awesome to you, you might like my other audio blog posts: https://blog.demofox.org/#Audio

Closing

Now that you understand how FIR’s work, you may be interested in a post i wrote about generating different distributions and colors of noise using dice. It turns out they are just FIRs in disguise. The dice are just a source of white noise and the operations with adding and subtracting is just having positive and negative coefficients in the filter 🙂

https://blog.demofox.org/2019/07/30/dice-distributions-noise-colors/

Also, something nice about FIR filters is that the filter is stateless. You just multiply the last N samples by N specific coefficients to get your result. This means you can apply a filter to a signal via convolution. Convolution isn’t a cheap operation, but it does a lot of simple branchless operations in parallel, which is something that things like GPUs and SIMD excel at. You can also use the convolution theorem by taking the signal and filter coefficients into frequency space (DFT via the FFT), multiplying them together, and taking them back into time space (IDFT via the FFT again). This can be more efficient for a high order filter. In fact, really nice reverb is made up of a very long impulse response, which makes it a very high order filter. It’s not uncommon to implement reverb via convolution (but doing it as a multiplication in frequency space).

I’d like to thanks these 3 peeps who helped me understand some DSP concepts. Thanks for your kindness!
@BartWronsk
@nickappleton
@rygorous

Also, this book is exceptional as an introduction to DSP, specifically meant for audio purposes. It has dsp theory but also information about how to program things like compressors and limiters, flange, chorus and reverb. It’s very cool and a lot of fun.

Granular Audio Synthesis

If you want to make a sound shorter, you can play it faster. Doing this also makes it higher pitch unfortunately.

If you want to make a sound longer, you can play it more slowly. This also makes it lower pitch though.

Sound length and pitch are tied together and there’s no way to change one without changing the other.

… Actually that’s a lie. Granular synthesis can be used to change playback speed and pitch independently!

This post talks about how granular synthesis works, gives some examples you can listen to, and also supplies simple standalone C++ code that does it. (680 lines of code, one source file, only standard C++ includes, no libraries used.)

By the end of this post, you should even be able to program your own “autotune” effect.

The code and audio samples are available on github here: https://github.com/Atrix256/GranularSynth

Granular Synthesis Basics

Granular synthesis is conceptually pretty simple. The first step is to break a sound file up into small sections of sounds called “grains” that are typically between 10 and 100 milliseconds long.

You don’t have to do anything special to make grains, you just literally cut the sound up into a bunch of pieces.

To make a sound twice as long, you then make a new sound where each grain is repeated twice. When you play it back, it will sound mostly the same and be the same pitch, but will be twice as long.

To make a sound half as long, you would just throw away every other grain. The result is a sound that mostly sounds the same, and is the same pitch, but is half as long.

You aren’t restricted to integers though. You could easily throw out every 3rd grain to make it 2/3 as long or repeat every 5th grain to make it 20% longer.

To make a sound that is the same length, but has twice as high pitch, you would double each grain, but play them back twice as fast. That would result in a sound that was the same length but had frequencies that were twice as high.

To make a sound that is the same length, but had half as high of a pitch, you would throw out every other grain, but play the ones you kept twice as slowly. That would result in a sound that was the same length but had frequencies that were half as high.

Congratulations, you now understand granular synthesis!

There are other usage cases of granular synthesis, but they are a lot more exotic – like making crazy cool sounds.

There are also variations of granular synthesis where grains overlap instead of being omitted, when dealing with sounds getting shorter, or grains being played a non integer number of times.

Check out this youtube video to see an insane usage case for real time granular synthesis. WTF!
Youtube: Drum Sound Experiment – Dynamic Granular Synthesis

Some Granular Synthesis Gotchas

If you just do the above, you are going to have some issues with clicking and popping. When you put grains next to each other that weren’t next to each other before, there is going to be a discontinuity in the audio wave form, which translates into very short, very high frequencies that make a popping noise.

What’s more is if your grain size is 20 milliseconds, that means you will get a pop 50 times a second, which means you’ll get a 50hz tone from the popping.

So, how do you fix this?

One way is to use envelopes to do cross fading.

If you wanted to put down grain A and then grain C that would make a pop because A and C expect B to be between them.

Using an envelop to fix this problem, you’d put down grain A, and then next to it you’d put down grain B but have it’s volume go from 1 to 0 over some length of time (like 2 milliseconds). You then ADD grain C on top of grain B, but have it’s volume from from 0 to 1 over the same length of time.

The result is that there is no immediate “pop” from discontinuous wave forms. Instead, it gently fades from one grain to the next over the length of the cross fade.

Note that the above is equivalent to linearly interpolating from grain B to grain C over the length of the envelope, it’s just done in two passes.

I’ve heard that another way to handle this problem is that instead of cutting grains perfectly at the time / length they should be at, you make the cuts at zero crossings that are closest to the desired cut position.

What this does is make it so you can put any grain next to any other grain, and they should fit together pretty decently. This gives you C0 continuity by the way, but higher order discontinuities still affect the quality of the result. So, while this method is fast, it isn’t the highest quality. I didn’t try it personally, so am unsure how it affects the quality in practice.

Another issue you will encounter when doing granular synthesis is wanting to play back a grain at a non integer playback speed. This means you may want to sample index 0, then index 1.1, then index 2.2 and so on. How do you sample a fractional index?

A quick and easy way is to use the fractional part of the index to linearly interpolate between the two samples it’s in between. That means that sampling index 2.2 would be a linear interpolation between index 2 and index 3, and would be 20% of the way from index 2 to index 3. AKA it would be value[2] * 0.8 + value[3] * 0.2;

Another way to sample fractionally is to use cubic hermite interpolation. It’s more expensive to compute but interpolates more smoothly, and perserves first order derivatives.

You can read more about that on my blog post here: Cubic Hermite Interpolation

Lastly, there are a couple parameters to these techniques that have to be hand tuned for the situation they are used in:

• Grain Size – Some usage cases want larger grain sizes, while others want smaller grain sizes. You’ll have to play with it and see what’s best for your usage case. Again, I’ve heard that typical grain sizes can range between 10 and 100 milliseconds.
• Envelope Size – The size of the envelope used for cross fading can affect the result as well. Too short an envelope will start to make popping happen, but too long an envelop will muddy your sound.

Drums and other percussion instruments have a particularly hard time with granular synthesis because they are usually made up of very short but noticeable sounds. If these sounds get (partially?) repeated, it will sound weird and wrong.

Check the links at the end of the post for deeper reads into these topics and more!

Experiments & Results

For all experiments, I used a sound clip from one of my favorite movies “Legend” where Tim Curry plays the devil, and Tom Cruise and Mia Sara are the main characters fighting him.

legend1.mp3:

The experiments use cubic hermite interpolation to sample fractionally, and they use crossfading to fight popping between grains. Everything uses a grain size of 20 milliseconds, and a cross fade of 2 milliseconds.

To start out, we can play the sound faster and slower to naively adjust pitch and length.

Here’s a fast / high pitched version (70% of time):

Here’s an even faster / higher pitched version (40% of time):

Here’s a slow / low pitched version (130% of time):

And here’s an even slower / lower pitched version (210% of time):

Here is a sound made shorter (70% of time) using granular synthesis, so is the same length as the fast/high version, but has the same pitch as the original. Pretty cool, right?

Here is the sound made even shorter (40% of time) so is the same length as the faster/higher version, but has the same pitch as the original again.

Here is the sound made longer (130% of time), but again has the same length as the original.

And here is the even longer version (210% of time).

If we want to adjust the pitch but leave the length alone, there are two ways you could do that.

The first way is to use granular synthesis to change the length of the sound (longer or shorter), keeping the pitch the same, then use the regular “naive” method to make that resulting sound be the original sound length again.

Here is a sound where that process is used to make the pitch about 1.43 times higher (1.0/0.7), but keeps the same sound length.

Another way to get a very similar result is to just change the playback rate of the grains themselves – INDEPENDENTLY of how many times you repeat the grains (0 to N times) which changes sound length.

Here is a sound made doing that. It plays each grain back ~1.43 times faster, but makes a sound that is the same length in the end. My ears can’t tell the difference, even though I do know there is one. This is the technique we’ll be using for the rest of the experiments.

Here is a higher pitched sound that plays back each grain 2.5 times faster (1.0 / 0.4).

You can use this to make the sound lower pitched as well. Here is a sound where we play the grains back at 0.77 speed (1.0 / 1.3).

Here they play back at 0.48 speed (1.0 / 2.1).

Granular Synth Pitch and Length Adjustment

To really drive home how pitch and length adjustments can be made independent with granular synthesis, here is a sound where it plays back more slowly (130% as much time), but it plays back at a higher pitch (~1.43 times as high).

And the opposite… here is the sound played back more quickly (70% as much time), but it plays back at a lower pitch (~0.77 times as high)

Dynamic Parameters

Something else fun is that the parameters don’t have to be fixed at runtime.

In the example code, I have a version of the granular synthesis function that calls back to a lambda for every grain to see what time and pitch multiplier it should use.

Here the pitch is on a 10 cycle sine wave going between 0.75 and 1.25.

Here the sound length is on a 13 cycle sine wave going between 0.5 and 2.5.

And lastly, here it combines the pitch and sound length parameters described above.

https://www.soundonsound.com/techniques/granular-synthesis

https://granularsynthesis.com/guide.php

https://theproaudiofiles.com/granular-synthesis/

I also want to mention that the basics of this technique was kindly described to me at lunch by a co-worker. His web page is at http://antonte.com/

Understanding The Discrete Fourier Transform

I’ve been working on getting a better understanding of the Discrete Fourier Transform. I’ve figured out some things which have really helped my intuition, and made it a lot simpler in my head, so I wanted to write these up for the benefit of other folks, as well as for my future self when I need a refresher.

The discrete Fourier transform takes in data and gives out the frequencies that the data contains. This is useful if you want to analyze data, but can also be useful if you want to modify the frequencies then use the inverse discrete Fourier transform to generate the frequency modified data.

Multiplying By Sinusoids (Sine / Cosine)

If you had a stream of data that had a single frequency in it at a constant amplitude, and you wanted to know the amplitude of that frequency, how could you figure that out?

One way would be to make a cosine wave of that frequency and multiply each sample of your wave by the corresponding samples.

For instance, let’s say that we have a data stream of four values that represent a 1hz cosine wave: 1, 0, -1, 0.

We could then multiply each point by a corresponding point on a cosine wave, add sum them together:

We get.. 1*1 + 0*0 + -1*-1 + 0*0 = 2.

The result we get is 2, which is twice the amplitude of the data points we had. We can divide by two to get the real amplitude.

To show that this works for any amplitude, here’s the same 1hz cosine wave data with an amplitude of five: 5, 0, -5, 0.

Multiplying by the 1hz cosine wave, we get… 5*1 + 0*0 + -5*-1 + 0*0 = 10.

The actual amplitude is 5, so we can see that our result was still twice the amplitude.

In general, you will need to divide by N / 2 where N is the number of samples you have.

What happens if our stream of data has something other than a cosine wave in it though?

Let’s try a sine wave: 0, 1, 0, -1

When we multiply those values by our cosine wave values we get: 0*1 + 1*0 + 0*-1 + -1*0 = 0.

We got zero. Our method broke!

In this case, if instead of multiplying by cosine, we multiply by sine, we get what we expect: 0*0 + 1*1 + 0*0 + -1*-1 = 2. We get results consistent with before, where our answer is the amplitude times two.

That’s not very useful if we have to know whether we are looking for a sine or cosine wave though. We might even have some other type of wave that is neither one!

The solution to this problem is actually to multiply the data points by both cosine and sine waves, and keep both results.

Let’s see how that works.

For this example We’ll take a cosine wave and shift it 0.25 radians to give us samples: 0.97, -0.25, -0.97, 0.25. (The formula is cos(x*2*pi/4+0.25) from 0 to 3)

When we multiply that by a cosine wave we get: 0.97*1 + -0.25*0 + -0.97*-1 + 0.25*0 = 1.94
When we multiply it by a sine wave we get: 0.97*0 + -0.25*1 + -0.97*0 + 0.25*-1 = -0.5

Since we know both of those numbers are twice the actual amplitude, we can divide them both by two to get:
cosine: 0.97
sine: -0.25

Those numbers kind of makes sense if you think about it. We took a cosine wave and shifted it over a little bit so our combined answer is that it’s mostly a cosine wave, but is heading towards being a sine wave (technically a sine wave that has an amplitude of -1, so is a negative sine wave, but is still a sine wave).

To get the amplitude of our phase shifted wave, we treat those values as a vector, and get the magnitude. sqrt((0.97*0.97)+(-0.25*-0.25)) = 1.00. That is correct! Our wave’s amplitude is 1.0.

To get the starting angle (phase) of the wave, we use inverse tangent of sine over cosine. We want to take atan2(sine, cosine), aka atan2(-0.25, 0.97) which gives us a result of -0.25 radians. That’s the amount that we shifted our wave, so it is also correct!

Formulas So Far

Let’s take a look at where we are at mathematically. For $N$ samples of data, we multiply each sample of data by a sample from a wave with the frequency we are looking for and sum up the results (Note that this is really just a dot product between N dimensional vectors!). We have to do this both with a cosine wave and a sine wave, and keep the two resulting values to be able to get our true amplitude and phase information.

For now, let’s drop the “divide by two” part and look at the equations we have.

$Value_{cosine} = \sum\limits_{n=0}^{N-1} Sample_n * Cos(2\pi*Frequency*n/N)$

$Value_{sine} = \sum\limits_{n=0}^{N-1} Sample_n * Sin(2\pi*Frequency*n/N)$

Those equations above do exactly what we worked through in the samples.

The part that might be confusing is the part inside of the Cos() and Sin() so I’ll explain that real quick.

Looking at the graph cos(x) and sin(x) you’ll see that they both make one full cycle between the x values of 0 and 2*pi (aprox. 6.28):

If instead we change it to a graph of cos(2*pi*x) and sin(2*pi*x) you’ll notice that they both make one full cycle between the x values of 0 and 1:

This means that we can think of the wave forms in terms of percent (0 to 1) instead of in radians (0 to 2pi).

Next, we multiply by the frequency we are looking for. We do this because that multiplication makes the wave form repeat that many times between 0 and 1. It makes us a wave of the desired frequency, still remaining in “percent space” where we can go from 0 to 1 to get samples on our cosine and sine waves.

Here’s a graph of cos(2*pi*x*2) and sin(2*pi*x*2), to show some 2hz frequency waves:

Lastly, we multiply by n/N. In our sum, n is our index variable and it goes from 0 to N-1. This is very much like a for loop. n/N is our current percent done we are in the for loop, so when we multiply by n/N, we are just sampling at a different location (by percentage done) on our cosine and sine waves that are at the specified frequencies.

Not too difficult right?

Imaginary Numbers

Wouldn’t it be neat if instead of having to do two separate calculations for our cosine and sine values, we could just do a single calculation that would give us both values?

Well, interestingly, we can! This is where imaginary numbers come in. Don’t get scared, they are here to make things simpler and more convenient, not to make things more complicated or harder to understand (:

There is something called Euler’s Identity which states the below:

$e^{ix} = cos(x) + i*sin(x)$

That looks pretty useful for our usage case doesn’t it? If you notice that our equations both had the same parameters for cos and sin, that means that we can just use this identity instead!

We can now take our two equations and combine them into a single equation:

$Value = \sum\limits_{n=0}^{N-1} Sample_n * e^{i *2\pi * Frequency*n/N}$

When we do the above calculation, we will get a complex number out with a real and imaginary part. The real part of the complex number is just the cosine value, while the imaginary part of the complex number is the sine value. Nothing scary has happened, we’ve just used/abused complex numbers a bit to give us what we want in a more compact form.

Multiple Frequencies

Now we know how to look for a single, specific frequency.

What if you want to look for multiple frequencies though? Further more, what if you don’t even know what frequencies to look for?

When doing the discrete Fourier transform on a stream of samples, there are a specific set of frequencies that it checks for. If you have N samples, it only checks for N different frequencies. You could ask about other frequencies, but these frequencies are enough to reconstruct the original signal from only the frequency data.

The first N/2 frequencies are going to be the frequencies 0 (DC) through N/2. N/2 is the Nyquist frequency and is the highest frequency that your signal is able of holding.

After N/2 comes the negative frequencies, counting from a large negative frequency (N/2-1) up to the last frequency which is -1.

As an example, if you had 8 samples and ran the DFT, you’d get 8 complex numbers as outputs, which represent these frequencies:

[0hz, 1hz, 2hz, 3hz, 4hz, -3hz, -2hz, -1hz]

The important take away from this section is that if you have N samples, the DFT asks only about N very specific frequencies, which will give enough information to go from frequency domain back to time domain and get the signal data back.

That then leads to this equation below, where we just put a subscript on Value, to signify which frequency we are probing for.

$Value_{Frequency} = \sum\limits_{n=0}^{N-1} Sample_n * e^{2\pi i*Frequency*n/N}$

Frequency is an integer, and is between 0 and N-1. We can say that mathematically by listing this information along with the equation:

$Frequency \in [0,N), Frequency \in \mathbb Z$

Final Form

Math folk use letters and symbols instead of words in their equations.

The letter $k$ is used instead of frequency.

Instead of $Value_{Frequency}$, the symbol is $X_{k}$.

Instead of $Sample_n$, the symbol is $x_n$.

This gives us a more formal version of the equation:

$X_k= \sum\limits_{n=0}^{N-1} x_n * e^{2\pi ikn/N}$

$k\in [0,N), k \in \mathbb Z$

We are close to the final form, but we aren’t quite done yet!

Remember the divide by two? I had us take out the divide by two so that I could re-introduce it now in it’s correct form. Basically, since we are querying for N frequencies, we need to divide each frequency’s result by N. I mentioned that we had to divide by N/2 to make the amplitude information correct, but since we are checking both positive AND negative frequencies, we have to divide by 2*(N/2), or just N. That makes our equation become this:

$X_k= 1/N \sum\limits_{n=0}^{N-1} x_n * e^{2\pi ikn/N}$

$k\in [0,N), k \in \mathbb Z$

Lastly, we actually want to make the waves go backwards, so we make the exponent negative. The reason for this is a deeper topic than I want to get into right now, but you can read more about why here if you are curious: DFT – Why are the definitions for inverse and forward commonly switched?

That brings us to our final form of the formula for the discrete Fourier transform:

$X_k= 1/N \sum\limits_{n=0}^{N-1} x_n * e^{-2\pi ikn/N}$

$k\in [0,N), k \in \mathbb Z$

Taking a Test Drive

If we use that formula to calculate the DFT of a simple, 4 sample, 1hz cosine wave (1, 0, -1, 0), we get as output [0, 0.5, 0, -0.5].

That means the following:

• 0hz : 0 amplitude
• 1hz : 0.5 amplitude
• 2hz : 0 amplitude
• -1hz : -0.5 amplitude

It is a bit strange that the simple 1hz, 1.0 amplitude cosine wave was split in half, and made into a 1hz cosine wave and -1hz cosine wave, each contributing half amplitude to the result, but that is “just how it works”. I don’t have a very good explanation for this though. My personal understanding just goes that if we are checking for both positive AND negative frequencies, that makes it ambiguous whether it’s a 1hz wave with 1 amplitude, or -1hz wave with -1 amplitude. Since it’s ambiguous, both must be true, and they get half the amplitude each. If I get a better understanding, I’ll update this paragraph, or make a new post on it.

Making The Inverse Discrete Fourier Transform

Making the formula for the Inverse DFT is really simple.

We start with our DFT formula, drop the 1/N from the front, and also make the exponent to e positive instead of negative. That is all! The intuition here for me is that we are just doing the reverse of the DFT process, to do the inverse DFT process.

$x_k= \sum\limits_{n=0}^{N-1} X_n * e^{2\pi ikn/N}$

$k\in [0,N), k \in \mathbb Z$

While the DFT takes in real valued signals and gives out complex valued frequency data, the IDFT takes in complex valued frequency data, and gives out real valued signals.

A fun thing to try is to take some data, DFT it, modify the frequency data, then IDFT it to see what comes out the other side.

Other Notes

Here are some other things of note about the discrete Fourier transform and it’s inverse.

Data Repeating Forever

When you run the DFT on a stream of data, the math is such that it assumes that the stream of data you gave it repeats forever both forwards and backwards in time. This is important because if you aren’t careful, you can add frequency content to your data that you didn’t intend to be there.

For instance, if you tile a 1hz sine wave, it’s continuous, and there is only the 1hz frequency present:

However, if you tile a 0.9hz sine wave, there is a discontinuity, which means that there will be other, unintended frequencies present when you do a DFT of the data, to be able to re-create the discontinuity:

Fast Fourier Transform

There are a group of algorithms called “Fast Fourier Transforms”. You might notice that if we have N samples, taking the DFT is an O(N^2) operation. Fast Fourier transforms can bring it down to O(N log N).

DFT / IDFT Formula Variations

The formula we came up with is one possible DFT formula, but there are a handful of variations that are acceptable, even though different variations come up with different values!

The first option is whether to make the e exponent negative or not. Check out these two formulas to see what I mean.

$X_k= \sum\limits_{n=0}^{N-1} x_n * e^{2\pi ikn/N}$
vs
$X_k= \sum\limits_{n=0}^{N-1} x_n * e^{-2\pi ikn/N}$

Either one is acceptable, but when providing DFT’d data, you should mention which one you did, and make sure and use the opposite one in your inverse DFT formula.

The next options is whether to divide by N or not, like the below:

$X_k= \sum\limits_{n=0}^{N-1} x_n * e^{2\pi ikn/N}$
vs
$X_k= 1/N \sum\limits_{n=0}^{N-1} x_n * e^{2\pi ikn/N}$

Again, either one is acceptable, but you need to make sure and let people know which one you did, and also make sure and use the opposite one in your inverse DFT formula.

Instead of dividing by N, some people actually divide by 1/sqrt(N) in both the DFT and inverse DFT. Wolfram alpha does this for instance!

One thing to note though is that if doing 1/N on the DFT (my personal preference), the 0hz (DC) frequency bin gives you the average value of the signal, and the amplitude data you get out of the other bins is actually correct (keeping in mind the amplitudes are split in half between positive and negative frequencies).

Why Calculate Negative Frequencies

The negative frequencies are able to be calculated on demand from the positive frequency information (eg complex conjugation), so why should we even bother calculating them? Sure it’d be more computationally efficient not to calculate them, especially when just doing frequency analysis, right?!

Here’s a discussion about that: Why calculate negative frequencies of DFT?

Higher Dimensions

It’s possible to DFT in 2d, 3d, and higher. The last blog post shows how to do this with 2d images, but I’d also like to write a blog post like this one specifically talking about the intuition behind multi dimensional DFTs.

Example Program Source Code

Here’s some simple C++ source code which calculates the DFT and inverse DFT, optionally showing work in case you want to try to work some out by hand to better understand this. Working a few out by hand really helped me get a better intuition for all this stuff.

Program Output:

Source Code:

#include <stdio.h>
#include <complex>
#include <vector>

#include <fcntl.h>
#include <io.h>

// set to 1 to have it show you the steps performed.  Set to 0 to hide the work.
// useful if checking work calculated by hand.
#define SHOW_WORK 1

#if SHOW_WORK
#define PRINT_WORK(...) wprintf(__VA_ARGS__)
#else
#define PRINT_WORK(...)
#endif

// Use UTF-16 encoding for Greek letters
static const wchar_t kPi = 0x03C0;
static const wchar_t kSigma = 0x03A3;

typedef float TRealType;
typedef std::complex<TRealType> TComplexType;

const TRealType c_pi = (TRealType)3.14159265359;
const TRealType c_twoPi = (TRealType)2.0 * c_pi;

//=================================================================================
TComplexType DFTSample (const std::vector<TRealType>& samples, int k)
{
size_t N = samples.size();
TComplexType ret;
for (size_t n = 0; n < N; ++n)
{
TComplexType calc = TComplexType(samples[n], 0.0f) * std::polar<TRealType>(1.0f, -c_twoPi * TRealType(k) * TRealType(n) / TRealType(N));
PRINT_WORK(L"    n = %i : (%f, %f)n", n, calc.real(), calc.imag());
ret += calc;
}
ret /= TRealType(N);
PRINT_WORK(L"    Sum the above and divide by %in", N);
return ret;
}

//=================================================================================
std::vector<TComplexType> DFTSamples (const std::vector<TRealType>& samples)
{
PRINT_WORK(L"DFT:  X_k = 1/N %cn[0,N) x_k * e^(-2%cikn/N)n", kSigma, kPi);

size_t N = samples.size();
std::vector<TComplexType> ret;
ret.resize(N);
for (size_t k = 0; k < N; ++k)
{
PRINT_WORK(L"  k = %in", k);
ret[k] = DFTSample(samples, k);
PRINT_WORK(L"  X_%i = (%f, %f)n", k, ret[k].real(), ret[k].imag());
}
PRINT_WORK(L"n");
return ret;
}

//=================================================================================
TRealType IDFTSample (const std::vector<TComplexType>& samples, int k)
{
size_t N = samples.size();
TComplexType ret;
for (size_t n = 0; n < N; ++n)
{
TComplexType calc = samples[n] * std::polar<TRealType>(1.0f, c_twoPi * TRealType(k) * TRealType(n) / TRealType(N));
PRINT_WORK(L"    n = %i : (%f, %f)n", n, calc.real(), calc.imag());
ret += calc;
}
PRINT_WORK(L"    Sum the above and take the real componentn");
return ret.real();
}

//=================================================================================
std::vector<TRealType> IDFTSamples (const std::vector<TComplexType>& samples)
{
PRINT_WORK(L"IDFT:  x_k = %cn[0,N) X_k * e^(2%cikn/N)n", kSigma, kPi);

size_t N = samples.size();
std::vector<TRealType> ret;
ret.resize(N);
for (size_t k = 0; k < N; ++k)
{
PRINT_WORK(L"  k = %in", k);
ret[k] = IDFTSample(samples, k);
PRINT_WORK(L"  x_%i = %fn", k, ret[k]);
}
PRINT_WORK(L"n");
return ret;
}

//=================================================================================
template<typename LAMBDA>
std::vector<TRealType> GenerateSamples (int numSamples, LAMBDA lambda)
{
std::vector<TRealType> ret;
ret.resize(numSamples);
for (int i = 0; i < numSamples; ++i)
{
TRealType percent = TRealType(i) / TRealType(numSamples);
ret[i] = lambda(percent);
}
return ret;
}

//=================================================================================
int main (int argc, char **argv)
{
// Enable Unicode UTF-16 output to console
_setmode(_fileno(stdout), _O_U16TEXT);

// You can test specific data samples like this:
//std::vector<TRealType> sourceData = { 1, 0, 1, 0 };
//std::vector<TRealType> sourceData = { 1, -1, 1, -1 };

// Or you can generate data samples from a function like this
std::vector<TRealType> sourceData = GenerateSamples(
4,
[] (TRealType percent)
{
const TRealType c_frequency = TRealType(1.0);
return cos(percent * c_twoPi * c_frequency);
}
);

// Show the source data
wprintf(L"nSource = [ ");
for (TRealType v : sourceData)
wprintf(L"%f ",v);
wprintf(L"]nn");

// Do a dft and show the results
std::vector<TComplexType> dft = DFTSamples(sourceData);
wprintf(L"dft = [ ");
for (TComplexType v : dft)
wprintf(L"(%f, %f) ", v.real(), v.imag());
wprintf(L"]nn");

// Do an inverse dft of the dft data, and show the results
std::vector<TRealType> idft = IDFTSamples(dft);
wprintf(L"idft = [ ");
for (TRealType v : idft)
wprintf(L"%f ", v);
wprintf(L"]n");

return 0;
}


Explaining how to calculate the frequencies represented by the bins of output of DFT:
How do I obtain the frequencies of each value in an FFT?

Another good explanation of the Fourier transform if it isn’t quite sinking in yet:
An Interactive Guide To The Fourier Transform

Some nice dft calculators that also have inverse dft equivelants:
DFT Calculator 1
IDFT Calculator 1
DFT Calculator 2
IDFT Calculator 2

Wolfram alpha can also do DFT and IDFT, but keep in mind that the formula used there is different and divides the results by 1/sqrt(N) in both DFT and IDFT so will be different values than you will get if you use a different formula.

Wolfram Alpha: Fourier[{1, 0, -1, 0}] = [0,1,0,1]

Wolfram Alpha: Inverse Fourier[{0, 1 , 0, 1}] = [1, 0, -1, 0]

Intro To Audio Synthesis For Music Presentation

Today I gave a presentation at work on the basics of audio synthesis for music. It seemed to go fairly well and I was surprised to hear that so many others also dabbled in audio synth and music.

The slide deck and example C++ program (uses portaudio) are both up on github here:
https://github.com/Atrix256/MusicSynth

Questions, feedback, etc? Drop me a line (:

The Beating Effect

This post is going to be a pretty short one, so here it is 😛

The Beating Effect

The beating effect occurs when you play two similar frequencies of sound at the same time.

Because playing two sounds at once means adding them together, and due to the fact that sound waves are made up of positive and negative values (aka positive and negative pressures), the sounds playing at different frequencies will sometimes add peaks and troughs together to get louder, and other times the peak of one wave will add to the valley of another wave and the result will be quieter. This is know as constructive and destructive interference respectively.

The end result is that the sound will have a pulsing quality to it, like a tremolo effect. If one sound is played at frequency F1 and the other sound is played at frequency F2, the pulsing sound will occur at 2*(F2-F1) times a second.

Here’s a demo of this in action where the first sound is 200hz, the second sound is 205hz, and the result of them played together has a 10hz tremolo effect!

monobeat.wav

Binaural Beat

The beating effect happens when sound waves physically mix together.

Believe it or not though, there is a part of your brain where it mixes (adds) the sounds from each ear together as well.

That means that if you play similar frequencies to different ears, they will mix INSIDE YOUR BRAIN, and you will perceive a different kind of beating effect.

Below is a demo of that. You might need a pair of headphones to get the full effect.

stereobeat.wav

If you think this is pretty neat, you might want to google “psycho-acoustics” for more stuff like this (:

Source Code

Here’s the C++ code that generated these sound files.

#include <stdio.h>
#include <memory.h>
#include <inttypes.h>
#include <vector>

// constants
const float c_pi = 3.14159265359f;
const float c_twoPi = 2.0f * c_pi;

// typedefs
typedef uint16_t    uint16;
typedef uint32_t    uint32;
typedef int16_t     int16;
typedef int32_t     int32;

//this struct is the minimal required header data for a wav file
{
//the main chunk
unsigned char m_chunkID[4];
uint32		  m_chunkSize;
unsigned char m_format[4];

//sub chunk 1 "fmt "
unsigned char m_subChunk1ID[4];
uint32		  m_subChunk1Size;
uint16		  m_audioFormat;
uint16		  m_numChannels;
uint32		  m_sampleRate;
uint32		  m_byteRate;
uint16		  m_blockAlign;
uint16		  m_bitsPerSample;

//sub chunk 2 "data"
unsigned char m_subChunk2ID[4];
uint32		  m_subChunk2Size;

//then comes the data!
};

//this writes
template <typename T>
bool WriteWaveFile(const char *fileName, std::vector<T> data, int16 numChannels, int32 sampleRate)
{
int32 dataSize = data.size() * sizeof(T);
int32 bitsPerSample = sizeof(T) * 8;

//open the file if we can
FILE *File = nullptr;
fopen_s(&File, fileName, "w+b");
if (!File)
return false;

//fill out the main chunk

//fill out sub chunk 1 "fmt "
waveHeader.m_byteRate = sampleRate * numChannels * bitsPerSample / 8;
waveHeader.m_blockAlign = numChannels * bitsPerSample / 8;

//fill out sub chunk 2 "data"

//write the wave data itself
fwrite(&data[0], dataSize, 1, File);

//close the file and return success
fclose(File);
return true;
}

template <typename T>
void ConvertFloatSamples (const std::vector<float>& in, std::vector<T>& out)
{
// make our out samples the right size
out.resize(in.size());

// convert in format to out format !
for (size_t i = 0, c = in.size(); i < c; ++i)
{
float v = in[i];
if (v < 0.0f)
v *= -float(std::numeric_limits<T>::lowest());
else
v *= float(std::numeric_limits<T>::max());
out[i] = T(v);
}
}

void GenerateMonoBeatingSamples (std::vector<float>& samples, int sampleRate)
{
int sectionLength = samples.size() / 3;
int envelopeLen = sampleRate / 20;

for (int index = 0, numSamples = samples.size(); index < numSamples; ++index)
{
samples[index] = 0.0f;
int section = index / sectionLength;
int sectionOffset = index % sectionLength;

// apply an envelope at front and back  of each section keep it from popping between sounds
float envelope = 1.0f;
if (sectionOffset < envelopeLen)
envelope = float(sectionOffset) / float(envelopeLen);
else if (sectionOffset > sectionLength - envelopeLen)
envelope = (float(sectionLength) - float(sectionOffset)) / float(envelopeLen);

// first sound
if (section == 0 || section == 2)
samples[index] += sin(float(index) * c_twoPi * 200.0f / float(sampleRate)) * envelope;

// second sound
if (section == 1 || section == 2)
samples[index] += sin(float(index) * c_twoPi * 205.0f / float(sampleRate)) * envelope;

// scale it to prevent clipping
if (section == 2)
samples[index] *= 0.5f;
samples[index] *= 0.95f;
}
}

void GenerateStereoBeatingSamples (std::vector<float>& samples, int sampleRate)
{
int sectionLength = (samples.size() / 2) / 3;
int envelopeLen = sampleRate / 20;

for (int index = 0, numSamples = samples.size() / 2; index < numSamples; ++index)
{
samples[index * 2] = 0.0f;
samples[index * 2 + 1] = 0.0f;

int section = index / sectionLength;
int sectionOffset = index % sectionLength;

// apply an envelope at front and back  of each section keep it from popping between sounds
float envelope = 1.0f;
if (sectionOffset < envelopeLen)
envelope = float(sectionOffset) / float(envelopeLen);
else if (sectionOffset > sectionLength - envelopeLen)
envelope = (float(sectionLength) - float(sectionOffset)) / float(envelopeLen);
envelope *= 0.95f;

// first sound
if (section == 0 || section == 2)
samples[index * 2] += sin(float(index) * c_twoPi * 200.0f / float(sampleRate)) * envelope;

// second sound
if (section == 1 || section == 2)
samples[index * 2 + 1] += sin(float(index) * c_twoPi * 205.0f / float(sampleRate)) * envelope;
}
}

//the entry point of our application
int main(int argc, char **argv)
{
// generate the mono beating effect
{
// sound format parameters
const int c_sampleRate = 44100;
const int c_numSeconds = 4;
const int c_numChannels = 1;
const int c_numSamples = c_sampleRate * c_numChannels * c_numSeconds;

// make space for our samples
std::vector<float> samples;
samples.resize(c_numSamples);

// generate samples
GenerateMonoBeatingSamples(samples, c_sampleRate);

// convert from float to the final format
std::vector<int32> samplesInt;
ConvertFloatSamples(samples, samplesInt);

// write our samples to a wave file
WriteWaveFile("monobeat.wav", samplesInt, c_numChannels, c_sampleRate);
}

// generate the stereo beating effect (binaural beat)
{
// sound format parameters
const int c_sampleRate = 44100;
const int c_numSeconds = 4;
const int c_numChannels = 2;
const int c_numSamples = c_sampleRate * c_numChannels * c_numSeconds;

// make space for our samples
std::vector<float> samples;
samples.resize(c_numSamples);

// generate samples
GenerateStereoBeatingSamples(samples, c_sampleRate);

// convert from float to the final format
std::vector<int32> samplesInt;
ConvertFloatSamples(samples, samplesInt);

// write our samples to a wave file
WriteWaveFile("stereobeat.wav", samplesInt, c_numChannels, c_sampleRate);
}
}


For a more mathematical explanation of these, check out wikipedia (:
Wikipedia: Beat (acoustics)

Wikipedia: Binaural beats

Synthesizing a Plucked String Sound With the Karplus-Strong Algorithm

If you are looking to synthesize the sound of a plucked string, there is an amazingly simple algorithm for doing so called the Karplus-Strong Algorithm.

Give it a listen: KarplusStrong.wav
Here it is with flange and reverb effects applied: KPFlangeReverb.wav

It works like this:

1. Fill a circular buffer with static (random numbers)
2. Play the contents of the circular buffer over and over
3. Each time you play a sample, replace that sample with the average of itself and the next sample in the buffer. Also multiplying that average by a feedback value (like say, 0.996)

Amazingly, that is all there is to it!

Why Does That Work?!

The reason this works is that it is actually very similar to how a real guitar string pluck works.

When you pluck a guitar string, if you had a perfect pluck at the perfect location with the perfect transfer of energy, you’d get a note that was “perfect”. It wouldn’t be a pure sine wave since strings have harmonics (integer multiple frequencies) beyond their basic tuning, but it would be a pure note.

In reality, that isn’t what happens, so immediately after plucking the string, there is a lot of vibrations in there that “don’t belong” due to the imperfect pluck. Since the string is tuned, it wants to be vibrating a specific way, so over time the vibrations evolve from the imperfect pluck vibrations to the tuning of the guitar string. As you average the samples together, you are removing the higher frequency noise/imperfections. Averaging is a crude low pass filter. This makes it converge to the right frequencies over time.

It’s also important to note that with a real stringed instrument, when you play a note, the high frequencies disappear before the low frequencies. This averaging / low pass filter makes that happen as well and is part of what helps it sound so realistic.

Also while all that is going on, the energy in the string is being diminished as it becomes heat and sound waves and such, so the noise gets quieter over time. When you multiply the values by a feedback value which is less than 1, you are simulating this loss of energy by making the values get smaller over time.

Tuning The Note

This wasn’t intuitive for me at first, but the frequency that the note plays at is determined ENTIRELY by the size of the circular buffer.

If your audio has a sample rate of 44100hz (44100 samples played a second), and you use this algorithm with a buffer size of 200 samples, that means that the note synthesized will be 220.5hz. This is because 44100/200 = 220.5.

Thinking about the math from another direction, we can figure out what our buffer size needs to be for a specific frequency. If our sample rate is 44100hz and we want to play a note at 440hz, that means we need a buffer size of 100.23 samples. This is because 44100/440 = 100.23. Since we can’t have a fractional number of samples, we can just round to 100.

You can actually deal with the fractional buffer size by stepping through the ring buffer in non integer steps and using the fraction to interpolate audio samples, but I’ll leave that as an exercise for you if you want that perfectly tuned note. IMO leaving it slightly off could actually be a good thing. What guitar is ever perfectly in tune, right?! With it being slightly out of tune, it’s more likely to make more realistic sounds and sound interactions when paired with other instruments.

You are probably wondering like I was, why the buffer size affects the frequency of the note. The reason for this is actually pretty simple and intuitive after all.

The reason is because the definition of frequency is just how many times a wave form repeats per second. The wave form could be a sine wave, a square wave, a triangle wave, or it could be something more complex, but frequency is always the number of repetitions per second. If you think about our ring buffer as being a wave form, you can now see that if we have a buffer size of 200 samples, and a sample rate of 44100hz, when we play that buffer continually, it’s going to play back 220.5 times every second, which means it will play with a frequency of 220.5!

Sure, we modify the buffer (and waveform) as we play it, but the modifications are small, so the waveform is similar from play to play.

Some More Details

I’ve found that this algorithm doesn’t work as well with low frequency notes as it does with high frequency notes.

They say you can prime the buffer with a saw tooth wave (or other wave forms) instead of static (noise). While it still “kind of works”, in my experimentation, it didn’t work out that well.

You could try using other low pass filters to see if that affects the quality of the note generated. The simple averaging method works so well, I didn’t explore alternative options very much.

Kmm on hacker news commented that averaging the current sample with the last and next, instead of just the next had the benefit that the wave form didn’t move forward half a step each play through and that there is an audible difference between the techniques. I gave it a try and sure enough, there is an audible difference, the sound is less harsh on the ears. I believe this is so because averaging 3 samples instead of 2 is a stronger low pass filter, so gets rid of higher frequencies faster.

Example Code

Here is the C++ code that generated the sample at the top of the post. Now that you can generate plucked string sounds, you can add some distortion, flange, reverb, etc and make some sweet (synthesized) metal without having to learn to play guitar and build up finger calluses 😛

#include <stdio.h>
#include <memory.h>
#include <inttypes.h>
#include <vector>

// constants
const float c_pi = 3.14159265359f;
const float c_twoPi = 2.0f * c_pi;

// typedefs
typedef uint16_t    uint16;
typedef uint32_t    uint32;
typedef int16_t     int16;
typedef int32_t     int32;

//this struct is the minimal required header data for a wav file
{
//the main chunk
unsigned char m_chunkID[4];
uint32		  m_chunkSize;
unsigned char m_format[4];

//sub chunk 1 "fmt "
unsigned char m_subChunk1ID[4];
uint32		  m_subChunk1Size;
uint16		  m_audioFormat;
uint16		  m_numChannels;
uint32		  m_sampleRate;
uint32		  m_byteRate;
uint16		  m_blockAlign;
uint16		  m_bitsPerSample;

//sub chunk 2 "data"
unsigned char m_subChunk2ID[4];
uint32		  m_subChunk2Size;

//then comes the data!
};

//this writes
template <typename T>
bool WriteWaveFile(const char *fileName, std::vector<T> data, int16 numChannels, int32 sampleRate)
{
int32 dataSize = data.size() * sizeof(T);
int32 bitsPerSample = sizeof(T) * 8;

//open the file if we can
FILE *File = nullptr;
fopen_s(&File, fileName, "w+b");
if (!File)
return false;

//fill out the main chunk

//fill out sub chunk 1 "fmt "
waveHeader.m_byteRate = sampleRate * numChannels * bitsPerSample / 8;
waveHeader.m_blockAlign = numChannels * bitsPerSample / 8;

//fill out sub chunk 2 "data"

//write the wave data itself
fwrite(&data[0], dataSize, 1, File);

//close the file and return success
fclose(File);
return true;
}

template <typename T>
void ConvertFloatSamples (const std::vector<float>& in, std::vector<T>& out)
{
// make our out samples the right size
out.resize(in.size());

// convert in format to out format !
for (size_t i = 0, c = in.size(); i < c; ++i)
{
float v = in[i];
if (v < 0.0f)
v *= -float(std::numeric_limits<T>::lowest());
else
v *= float(std::numeric_limits<T>::max());
out[i] = T(v);
}
}
//calculate the frequency of the specified note.
//fractional notes allowed!
float CalcFrequency(float octave, float note)
/*
Calculate the frequency of any note!
frequency = 440×(2^(n/12))

N=0 is A4
N=1 is A#4
etc...

notes go like so...
0  = A
1  = A#
2  = B
3  = C
4  = C#
5  = D
6  = D#
7  = E
8  = F
9  = F#
10 = G
11 = G#
*/
{
return (float)(440 * pow(2.0, ((double)((octave - 4) * 12 + note)) / 12.0));
}

class CKarplusStrongStringPluck
{
public:
CKarplusStrongStringPluck (float frequency, float sampleRate, float feedback)
{
m_buffer.resize(uint32(float(sampleRate) / frequency));
for (size_t i = 0, c = m_buffer.size(); i < c; ++i) {
m_buffer[i] = ((float)rand()) / ((float)RAND_MAX) * 2.0f - 1.0f;  // noise
//m_buffer[i] = float(i) / float(c); // saw wave
}
m_index = 0;
m_feedback = feedback;
}

float GenerateSample ()
{
// get our sample to return
float ret = m_buffer[m_index];

// low pass filter (average) some samples
float value = (m_buffer[m_index] + m_buffer[(m_index + 1) % m_buffer.size()]) * 0.5f * m_feedback;
m_buffer[m_index] = value;

// move to the next sample
m_index = (m_index + 1) % m_buffer.size();

// return the sample from the buffer
return ret;
}

private:
std::vector<float>  m_buffer;
size_t              m_index;
float               m_feedback;
};

void GenerateSamples (std::vector<float>& samples, int sampleRate)
{
std::vector<CKarplusStrongStringPluck> notes;

enum ESongMode {
e_twinkleTwinkle,
e_strum
};

int timeBegin = 0;
ESongMode mode = e_twinkleTwinkle;
for (int index = 0, numSamples = samples.size(); index < numSamples; ++index)
{
switch (mode) {
case e_twinkleTwinkle: {
const int c_noteTime = sampleRate / 2;
int time = index - timeBegin;
// if we should start a new note
if (time % c_noteTime == 0) {
int note = time / c_noteTime;
switch (note) {
case 0:
case 1: {
notes.push_back(CKarplusStrongStringPluck(CalcFrequency(3, 0), float(sampleRate), 0.996f));
break;
}
case 2:
case 3: {
notes.push_back(CKarplusStrongStringPluck(CalcFrequency(3, 7), float(sampleRate), 0.996f));
break;
}
case 4:
case 5: {
notes.push_back(CKarplusStrongStringPluck(CalcFrequency(3, 9), float(sampleRate), 0.996f));
break;
}
case 6: {
notes.push_back(CKarplusStrongStringPluck(CalcFrequency(3, 7), float(sampleRate), 0.996f));
break;
}
case 7: {
mode = e_strum;
timeBegin = index+1;
break;
}
}
}
break;
}
case e_strum: {
const int c_noteTime = sampleRate / 32;
int time = index - timeBegin - sampleRate;
// if we should start a new note
if (time % c_noteTime == 0) {
int note = time / c_noteTime;
switch (note) {
case 0: notes.push_back(CKarplusStrongStringPluck(55.0f, float(sampleRate), 0.996f)); break;
case 1: notes.push_back(CKarplusStrongStringPluck(55.0f + 110.0f, float(sampleRate), 0.996f)); break;
case 2: notes.push_back(CKarplusStrongStringPluck(55.0f + 220.0f, float(sampleRate), 0.996f)); break;
case 3: notes.push_back(CKarplusStrongStringPluck(55.0f + 330.0f, float(sampleRate), 0.996f)); break;
case 4: mode = e_strum; timeBegin = index + 1; break;
}
}
break;
}
}

// generate and mix our samples from our notes
samples[index] = 0;
for (CKarplusStrongStringPluck& note : notes)
samples[index] += note.GenerateSample();

// to keep from clipping
samples[index] *= 0.5f;
}
}

//the entry point of our application
int main(int argc, char **argv)
{
// sound format parameters
const int c_sampleRate = 44100;
const int c_numSeconds = 9;
const int c_numChannels = 1;
const int c_numSamples = c_sampleRate * c_numChannels * c_numSeconds;

// make space for our samples
std::vector<float> samples;
samples.resize(c_numSamples);

// generate samples
GenerateSamples(samples, c_sampleRate);

// convert from float to the final format
std::vector<int32> samplesInt;
ConvertFloatSamples(samples, samplesInt);

// write our samples to a wave file
WriteWaveFile("out.wav", samplesInt, c_numChannels, c_sampleRate);
}


Hacker News Discussion (This got up to topic #7, woo!)

Wikipedia: Karplus-Strong String Synthesis

Princeton COS 126: Plucking a Guitar String

Shadertoy: Karplus-Strong Variation (Audio) – I tried to make a bufferless Karplus-Strong implementation on shadertoy. It didn’t quite work out but is still a bit interesting.

Cubic Hermite Interpolation

It’s a big wide world of curves out there and I have to say that most of the time, I consider myself a Bezier man.

Well let me tell you… cubic Hermite splines are technically representable in Bezier form, but they have some really awesome properties that I never fully appreciated until recently.

Usefulness For Interpolation

If you have a set of data points on some fixed interval (like for audio data, but could be anything), you can use a cubic Hermite spline to interpolate between any two data points. It interpolates the value between those points (as in, it passes through both end points), but it also interpolates a derivative that is consistent if you approach the point from the left or the right.

In short, this means you can use cubic Hermite splines to interpolate data such that the result has $C1$ continuity everywhere!

Usefulness As Curves

If you have any number $N$ control points on a fixed interval, you can treat it as a bunch of piece wise cubic Hermite splines and evaluate it that way.

The end result is that you have a curve that is $C1$ continuous everywhere, it has local control (moving any control point only affects the two curve sections to the left and the two curve sections to the right), and best of all, the computational complexity doesn’t rise as you increase the number of control points!

The image below was taken as a screenshot from one of the HTML5 demos I made for you to play with. You can find links to them at the end of this post.

Cubic Hermite Splines

Cubic Hermite splines have four control points but how it uses the control points is a bit different than you’d expect.

The curve itself passes only through the middle two control points, and the end control points are there to help calculate the tangent at the middle control points.

Let’s say you have control points $P_{-1}, P_0, P_1, P_2$. The curve at time 0 will be at point $P_0$ and the slope will be the same slope as a line would have if going from $P_{-1}$ to $P_1$. The curve at time 1 will be at point $P_1$ and the slope will be the same slope as a line would have if going from $P_0$ to $P_2$.

Check out the picture below to see what I mean visually.

That sounds like a strange set of properties, but they are actually super useful.

What this means is that you can treat any group of 4 control points / data points as a separate cubic hermite spline, but when you put it all together, it is a single smooth curve.

Note that you can either interpolate 1d data, or you can interpolate 2d data points by doing this interpolation on each axis. You could also use this to make a surface, which will likely be the next blog post!

The Math

I won’t go into how the formula is derived, but if you are interested you should check out Signal Processing: Bicubic Interpolation.

The formula is:

$a*t^3+b*t^2+c*t+d$

Where…

$a = \frac{-P_{-1} + 3*P_0 - 3*P_1 + P_2}{2}$
$b = P_{-1} - \frac{5*P_0}{2} + 2*P_1 - \frac{P_2}{2}$
$c = \frac{-P_{-1} + P_1}{2}$
$d = P_0$

Note that t is a value that goes from 0 to 1. When t is 0, your curve will be at $P_1$ and when t is 1, your curve will be at $P_2$. $P_{-1}$ and $P_{2}$ are used to be able to make this interpolation $C1$ continuous.

Here it is in some simple C++:

// t is a value that goes from 0 to 1 to interpolate in a C1 continuous way across uniformly sampled data points.
// when t is 0, this will return B.  When t is 1, this will return C.
static float CubicHermite (float A, float B, float C, float D, float t)
{
float a = -A/2.0f + (3.0f*B)/2.0f - (3.0f*C)/2.0f + D/2.0f;
float b = A - (5.0f*B)/2.0f + 2.0f*C - D / 2.0f;
float c = -A/2.0f + C/2.0f;
float d = B;

return a*t*t*t + b*t*t + c*t + d;
}


Code

Here is an example C++ program that interpolates both 1D and 2D data.

#include <stdio.h>
#include <vector>
#include <array>

typedef std::vector<float> TPointList1D;
typedef std::vector<std::array<float,2>> TPointList2D;

void WaitForEnter ()
{
printf("Press Enter to quit");
fflush(stdin);
getchar();
}

// t is a value that goes from 0 to 1 to interpolate in a C1 continuous way across uniformly sampled data points.
// when t is 0, this will return B.  When t is 1, this will return C.
float CubicHermite (float A, float B, float C, float D, float t)
{
float a = -A/2.0f + (3.0f*B)/2.0f - (3.0f*C)/2.0f + D/2.0f;
float b = A - (5.0f*B)/2.0f + 2.0f*C - D / 2.0f;
float c = -A/2.0f + C/2.0f;
float d = B;

return a*t*t*t + b*t*t + c*t + d;
}

template <typename T>
inline T GetIndexClamped(const std::vector<T>& points, int index)
{
if (index < 0)
return points[0];
else if (index >= int(points.size()))
return points.back();
else
return points[index];
}

int main (int argc, char **argv)
{
const float c_numSamples = 13;

// show some 1d interpolated values
{
const TPointList1D points =
{
0.0f,
1.6f,
2.3f,
3.5f,
4.3f,
5.9f,
6.8f
};

printf("1d interpolated values.  y = f(t)n");
for (int i = 0; i < c_numSamples; ++i)
{
float percent = ((float)i) / (float(c_numSamples - 1));
float x = (points.size()-1) * percent;

int index = int(x);
float t = x - floor(x);
float A = GetIndexClamped(points, index - 1);
float B = GetIndexClamped(points, index + 0);
float C = GetIndexClamped(points, index + 1);
float D = GetIndexClamped(points, index + 2);

float y = CubicHermite(A, B, C, D, t);
printf("  Value at %0.2f = %0.2fn", x, y);
}
printf("n");
}

// show some 2d interpolated values
{
const TPointList2D points =
{
{ 0.0f, 1.1f },
{ 1.6f, 8.3f },
{ 2.3f, 6.5f },
{ 3.5f, 4.7f },
{ 4.3f, 3.1f },
{ 5.9f, 7.5f },
{ 6.8f, 0.0f }
};

printf("2d interpolated values.  x = f(t), y = f(t)n");
for (int i = 0; i < c_numSamples; ++i)
{
float percent = ((float)i) / (float(c_numSamples - 1));
float x = 0.0f;
float y = 0.0f;

float tx = (points.size() -1) * percent;
int index = int(tx);
float t = tx - floor(tx);

std::array<float, 2> A = GetIndexClamped(points, index - 1);
std::array<float, 2> B = GetIndexClamped(points, index + 0);
std::array<float, 2> C = GetIndexClamped(points, index + 1);
std::array<float, 2> D = GetIndexClamped(points, index + 2);
x = CubicHermite(A[0], B[0], C[0], D[0], t);
y = CubicHermite(A[1], B[1], C[1], D[1], t);

printf("  Value at %0.2f = (%0.2f, %0.2f)n", tx, x, y);
}
printf("n");
}

WaitForEnter();
return 0;
}


The output of the program is below:

Here are some interactive HTML5 demos i made:
1D cubic hermite interpolation
2D cubic hermite interpolation

Wikipedia: Cubic Hermite Spline

Closely related to cubic hermite splines, catmull-rom splines allow you to specify a “tension” parameter to make the result more or less curvy:
Catmull-Rom spline

Frequency Domain Audio Synthesis – With IFFT and Oscillators

One way to look at sounds is to think about what frequencies they contain, at which strengths (amplitudes), and how those amplitudes change over time.

For instance, if you remember the details of the post about how to synthesis a drum sound (DIY Synth: Basic Drum), a drum has low frequencies, but also, the frequencies involved get lower over the duration, which gives it that distinctive sound.

Being able to analyze the frequency components of a sound you want to mimic, and then being able to generate audio samples based on a description of frequency components over time is a powerful tool in audio synthesis.

This post will talk about exactly that, using both oscillators as well as the inverse fast Fourier transform (IFFT).

We’ve already gone over the basics of using oscillators to create sounds in a previous post so we’ll start with IFFT. For a refresher on those basics of additive synthesis check out this link: DIY Synth 3: Sampling, Mixing, and Band Limited Wave Forms

All sample sound files were generated with the sample code at the end of this post. The sample code is a single, standalone c++ file which only includes standard headers and doesn’t rely on any libraries. You should be able to compile it and start using the techniques from this post right away!

What is the IFFT?

Audio samples are considered samples in the “time domain”. If you instead describe sound based on what frequencies they contain, what amplitudes they are at, and what phase (angle offset) the sinusoid wave starts at, what you have is samples in the “frequency domain”.

The process for converting time domain to frequency domain is called the “Fourier Transform”. If you want to go the opposite way, and convert frequency domain into time domain, you use the “Inverse Fourier Transform”.

These transforms come in two flavors: continuous and discrete.

The continuous Fourier transform and continuous inverse Fourier transform work with continuous functions. That is, they will transform a function from time domain to frequency domain, or from frequency domain to time domain.

The discrete version of the fourier transform – often refered to as DFT, short for “discrete fourier transform” – and inverse fourier transform work with data samples instead of functions.

Naively computing the DFT or IDFT is a very processor intensive operation. Luckily for us, there is an algorithm called “Fast Fourier Transform” or FFT that can do it a lot faster, if you are ok with the constraints of the data it gives you. It also has an inverse, IFFT.

IFFT is what we are going to be focusing on in this post, to convert frequency domain samples into time domain samples. Also, like i said above, we are also going to be using oscillators to achieve the same result!

Using the IFFT

Here’s something interesting… where time domain samples are made up of “real numbers” (like 0.5 or -0.9 or even 1.5 etc), frequency domain samples are made up of complex numbers, which are made up of a real number and an imaginary number. An example of a complex number is “3.1 + 2i” where i is the square root of negative 1.

If you look at the complex number as a 2d vector, the magnitude of the vector is the amplitude of the frequency, and the angle of the vector is the phase (angle) that the sinusoid (cosine wave) starts at for that frequency component.

You can use these formulas to create the complex number values:

real = amplitude * cos(phase);
imaginary = amplitude * sin(phase);


Or you could use std::polar if you want to (:

BTW thinking of complex numbers as 2d vectors might be a little weird but there’s precedent. Check this out: Using Imaginary Numbers To Rotate 2D Vectors

When using the IFFT, the number of frequency domain samples you decide to use is what decides the frequencies of those samples. After preforming the IFFT, you will also have the same number of time domain samples as frequency domain samples you started with. That might sound like it’s going to be complex, but don’t worry, it’s actually pretty simple!

Let’s say that you have 4 buckets of frequency domain data, that means you will end up with 4 time domain audio samples after running IFFT. Here are the frequencies that each frequency domain sample equates to:

• index 0: 0hz. This is the data for the 0hz frequency, or DC… yes, as in DC current! DC represents a constant added to all samples in the time domain. If you put a value in only this bucket and run the IFFT, all samples will be the same constant value.
• index 1: 1hz. If you put a value in only this bucket and run the IFFT, you’ll end up with one full cosine wave (0-360 degrees aka 0-2pi radians), across the 4 time domain samples. 4 samples isn’t very many samples for a cosine wave though so it will look pretty blocky, but those 4 samples will be correct!
• index 2: 2hz. If you put a value in only this bucket and run the IFFT, you’ll end up with two full cosine waves (0-720 degrees aka 0-4pi radians) across the 4 time domain samples.
• index 3: -1hz.

You might wonder what that negative frequency at index 3 is all about. A negative frequency probably seems weird, but it’s really the same as it’s positive frequency, just with a negative phase. It’s basically a cosine wave that decreases in angle as it goes, instead of increasing.

Why is it there though? Well, there’s something called the Nyquist–Shannon Sampling Theorem which states that if you have N time domain audio samples, the maximum frequency you can store in those audio samples is N/2 cycles. That frequency is called the Nyquist frequency and any frequency above that results in “aliasing”. If you’ve ever seen a car tire spin in reverse on tv when the care was moving forward, that is an example of the same aliasing I’m talking about. In manifests in audio as high pitches and in general terrible sounding sound. Before I knew how to avoid aliasing in audio, my now wife used to complain that my sounds hurt her ears. Once i removed the aliasing, she says it no longer hurts her ears. Go back and check the early DIY synth posts for audible examples of aliasing (;

Since we have 4 frequency domain samples, which translates to 4 time domain audio samples, we can only store up to 2hz, and anything above that will alias and seem to go backwards. That’s why index 3 is -1hz instead of 3hz.

So basically, if you have N frequency domain samples (or bins as they are sometimes called), the first bin, at index 0, is always 0hz / DC. Then from 1 up to N/2, the frequency of the bin is equal to it’s index. Then, at N/2+1 up to N-1, there are negative frequencies (frequencies beyond the Nyquist frequency) reflected in that upper half of the bins, starting at N/2 and counting back up to -1.

In many common situations (including the ones we are facing in this post), you don’t need to worry about the negative frequency bins. You can leave them as zero if doing IFFT work, or ignore their values if doing FFT then IFFT work.

Ready for one last piece of complexity? I hope so… it’s the last before moving onto the next step! 😛

Sounds are obviously much longer than 4 samples, so a 4 bin (sample) IFFT just isn’t going to cut it. 1000 bins would to be too few, and Frankly, at 44,100 samples per second (a common audio sampling rate), 88,200 bins is only 2 seconds of audio which isn’t very much at all! Even with the “Fast Fourier Transform” (FFT), that is a huge number of bins and would take a long time to calculate.

One way to deal with this would be to have fewer bins than audio samples that you want to end up with, use IFFT to generate the audio samples, and then use interpolation to get the audio samples between the ones generated by IFFT. We aren’t going to be doing that today but if you like the option, you can most certainly use it!

One of the ways we are going to deal with using a small number of bins to generate a lot of noise is that we are going to use the IFFT to generate wave forms, and then we are going to repeat those wave forms several times.

Another one of the things we are going to do is use IFFT bins to generate a small number of samples, and then modify the IFFT bins to generate the next number of samples, repeating until we have all of our audio samples generated.

In both cases, the frequency buckets need some conversion from the IFFT bin frequencies to the frequencies as they will actually appear in the final audio samples.

To calculate the true frequency of an FFT bin, you can use this formula:

frequency = binNumber * sampleRate / numBins;

Where binNumber is the IFFT bin number, sampleRate is how many samples the resulting sound has per second, and numBins is the total number of IFFT bins.

Simple Wave Forms

The simplest of the wave forms you can create is a cosine wave. You just put the complex number “1 + 0i” into one of the IFFT bins, that represents an amplitude of 1.0 and a starting phase of 0 degrees. After you do that and run ifft, you’ll get a nice looking cosine wave:

Note that the wave repeats multiple times. This is because I repeat the IFFT data over and over. If I didn’t repeat the IFFT data, the number of cycles that would appear would depend completely on which IFFT bin I used. If I used bin 1, it would have only one cycle. If i used bin 2, it would have two cycles.

Also note that since the IFFT deals with INTEGER frequencies only, that means that the wave forms begin and end at the same phase (angle) and thus the same amplitude, which means that if you repeat them end to end, there will be no discontinuities or popping. Pretty cool huh?

If you instead put in the complex number “0 – 1i” into one of the IFFT bins, that represents an amplitude of 1.0 and a starting phase of 270 degrees (or -90 degrees), which results in a sine wave:

We don’t have to stop there though. Once again thinking back to the info in DIY Synth 3: Sampling, Mixing, and Band Limited Wave Forms, the frequency components of a saw wave are described as:

If you have a saw wave of frequency 100, that means it contains a sine wave of frequency 100 (1 * fundamental frequency), another of frequency 200 (2 * fundamental frequency), another of 300 (3 * fundamental frequency) and so on into infinity.

The amplitude (volume) of each sine wave (harmonic) is 1 over the harmonic number. So in our example, the sine wave at frequency 100 has an amplitude of 1 (1/1). The sine wave at frequency 200 has an amplitude of 0.5 (1/2), the sine wave at frequency 300 has an amplitude of 0.333 (1/3) and so on into infinity.

After that you’ll need to multiply your sample by 2 / PI to get back to a normalized amplitude.

We can describe the same thing in IFFT frequency bins believe it or not!

Let’s say that we have 50 bins and that we want a 5hz saw wave. The first harmonic is 5hz and should be full amplitude, so we put an entry in bin 5 for amplitude 1.0 * 2/pi and phase -90 degrees (to make a sine wave instead of a cosine wave).

The next harmonic should be double the frequency, and 1/2 the amplitude, so in bin 10 (double the frequency of bin 5), we put an entry for amplitude 0.5 * 2/pi, phase -90 degrees. Next should be triple the original frequency at 1/3 the amplitude, so at bin 15 we put amplitude 0.33 * 2/pi, phase -90 degrees. Then, bin 20 gets amplitude 0.25 * 2/pi, -90 degrees phase. Bin 25 is the Nyquist frequency so we should stop here, and leave the actual Nyquist frequency empty.

If you run the IFFT on that, you’ll get a bandlimited saw wave!

You can do the same for bandlimited triangle and square waves, and also you can use a random phase and amplitude for each bin to generate noise! The source code for this post generates those in fact (:

While we are dealing in frequency space, I want to run something kind of gnarly by you. Using the technique described above, you can add frequencies with specific amplitudes to make a band limited saw wave. But, believe it or not, the starting phase of the frequencies don’t have to be -90 degrees. If you use a different phase, you’ll get a resulting wave form that looks different but sounds the same. Check these out, it’s a trip!

Here are the sound files pictured above, so you can hear that they really do all sound the same!
270 degrees
0 degrees
60 degrees
120 degrees
180 degrees

Bins Changing Over Time

If you are like me, when you think about designing sound in the frequency domain you think everything must be rainbows and glitter, and that you have no limitations and everything is wonderful.

Well, as happens so many times when the rubber hits the road, that isn’t quite true. Let’s go through an example of using IFFT on some bins that change over time so I can show you a really big limitation with IFFT.

In our example, let’s say that we have a single frequency in our IFFT data (so we are generating a single sinusoid wave) but that we want it’s amplitude (volume) to change over time. Let’s say we want the amplitude of the wave to be controlled by another sinusoid, so that it smoothly goes up and down in amplitude over time.

We immediately hit two problems, but the first is more obvious if you listen:

IFFTTest1.wav

Hear all that popping? It’s basically making a discontinuity at every IFFT window because we are changing the amplitude at the edge of each window. We can’t change it during the window (caveat: without some pretty advanced math i won’t go into), so changing at the end of the window is our only option. Check out this image to see the problem visually:

We can fix those discontinuities. If we change the phase of the wave from cosine (0 degrees) to sine (270 degrees), we make it so that the edge of the window is always at amplitude 0 (sine(0) = 0, while cos(0) = 1!). This means that when we change the amplitude of the wave, since it happens when the wave is at zero, there is no discontinuity, and so no more pops:

Let’s have a listen:
IFFTTest2.wav

WHAT?! There is still some periodic noise… it is tamed a little bit but not fixed. The reason for this is that even though we are first order continuous we aren’t 2nd order continuous (etc). So, by having a jump in amplitude, even constraining it at zero crossings, we’ve essentially added some frequencies into our resulting sound wave that we didn’t want to add, which we can hear in the results.

So yeah… basically, if you want to change your IFFT bins over time, you are going to have some audio artifacts from doing that. Boo hoo!

There are ways around this. One way involves complex math to add frequencies to your bins that make it APPEAR as if you are adjusting the amplitude of the wave smoothly over the duration of the window. Another way involves doing things like overlapping the output of your IFFT windows and blending between them. There are other ways too, including some IDFT algorithms which do allow you to alter the amplitude over time in a window, but are costlier to compute.

Anyways, you can also just generate the sound descibed in your IFFT bins with oscillators, literally adding together all the frequencies with the specified phase offsets to make the correct resulting time domain samples. I’ve done just that to solve the popping problem as well as the issue where you can’t have smooth volume adjustments because you can only change amplitude at the end of each window:

IFFTTest3.wav

You can also see it in action:

Here’s another failure case to check out:

Convolution

If you read the post about convolution reverb (DIY Synth: Convolution Reverb & 1D Discrete Convolution of Audio Samples), you’ll recall that convolution is super slow, but that convolution in the time domain is equivelant to multiplication in the frequency domain.

We are in the frequency domain, so how about we try some convolution?!

Let’s multiply the bins of a 1hz saw wave and a 1hz square wave and see what we get if we IFFT the result:

That result is definitely not right. First of all, the convolution is the same length as the inputs, when it should be length(a)+length(b)-1 samples. Basically it should be twice as long as it is.

Secondly, that is not what the convolution looks like. Doing convolution in the time domain of those samples looks like this:

So what’s the deal? Well as it turns out, when you do multiplication in the frequency domain, you are really doing CIRCULAR convolution, which is a bit different than the convolution i described before which is LINEAR convolution. Circular convolution is essentially used for cyclical data (or functions) and basically makes it so that if you try to read “out of bounds”, it will wrap around and use an in bounds value instead. It’s kind of like “texture wrap” if you are a graphics person.

Normally how this is gotten around, when doing convolution in the frequency domain, is to put a bunch of zeros on the end of your time domain samples before you bring them into the frequency domain. You pad them with those zeros to be the correct size (length(a)+length(b)-1, or longer is fine too) and then when you end up doing the “circular convolution”, there are no “out of bounds” values looked at, and you end up with the linear convolution output, even though you technically did circular convolution.

Unfortunately, since we are STARTING in frequency domain and have no time domain samples to bad before going into frequency domain, we are basically out of luck. I’ve tried asking DSP experts and nobody I talked to knows of a way to start in frequency domain and zero pad the time domain so that you could do a linear convolution – at least nobody knows of GENERAL case solution.

Those same experts though say that circular convolution isn’t a bad thing, and is many times exactly what you want to do, or is a fine stand in for linear convolution. I’m sure we’ll explore that in a future post (:

In the example code, i also have the code generate the circular convolution in the time domain so that you can confirm that circular convolution is really what is going on in the above image, when working in frequency domain.

Strike Two IFFT!

Luckily using IDFT to generate sounds (sometimes called Fourier Synthesis) isn’t across the board a losing strategy. If you want a static, repeating wave form, it can be really nice. Or, for dynamic waave forms, if you change your amplitude only a very small amount across window samples, it won’t noticeably degrade the quality of your audio.

The neat thing about IFFT is that it’s a “constant time process”. When using oscillators, the more frequencies you want to appear, the more computationally expensive it gets. With IFFT, it doesn’t matter if all the bins are full or empty or somewhere inbetween, it has the same computational complexity.

Here’s a non trivial sound generated both with IFFT and Oscillators. The difference is pretty negligible right?

Sample Code

#define _CRT_SECURE_NO_WARNINGS

#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include

#define _USE_MATH_DEFINES
#include

//=====================================================================================
// SNumeric - uses phantom types to enforce type safety
//=====================================================================================
template
struct SNumeric
{
public:
explicit SNumeric(const T &value) : m_value(value) { }
SNumeric() : m_value() { }
inline T& Value() { return m_value; }
inline const T& Value() const { return m_value; }

typedef SNumeric TType;
typedef T TInnerType;

// Math Operations
TType operator+ (const TType &b) const
{
return TType(this->Value() + b.Value());
}

TType operator- (const TType &b) const
{
return TType(this->Value() - b.Value());
}

TType operator* (const TType &b) const
{
return TType(this->Value() * b.Value());
}

TType operator/ (const TType &b) const
{
return TType(this->Value() / b.Value());
}

TType operator% (const TType &b) const
{
return TType(this->Value() % b.Value());
}

TType& operator+= (const TType &b)
{
Value() += b.Value();
return *this;
}

TType& operator-= (const TType &b)
{
Value() -= b.Value();
return *this;
}

TType& operator*= (const TType &b)
{
Value() *= b.Value();
return *this;
}

TType& operator/= (const TType &b)
{
Value() /= b.Value();
return *this;
}

TType& operator++ ()
{
Value()++;
return *this;
}

TType& operator-- ()
{
Value()--;
return *this;
}

// Extended Math Operations
template
T Divide(const TType &b)
{
return ((T)this->Value()) / ((T)b.Value());
}

// Logic Operations
bool operatorValue() < b.Value();
}
bool operatorValue()  (const TType &b) const {
return this->Value() > b.Value();
}
bool operator>= (const TType &b) const {
return this->Value() >= b.Value();
}
bool operator== (const TType &b) const {
return this->Value() == b.Value();
}
bool operator!= (const TType &b) const {
return this->Value() != b.Value();
}

private:
T m_value;
};

//=====================================================================================
// Typedefs
//=====================================================================================

typedef uint8_t uint8;
typedef uint16_t uint16;
typedef uint32_t uint32;
typedef int16_t int16;
typedef int32_t int32;

// type safe types!
typedef SNumeric        TFrequency;
typedef SNumeric           TFFTBin;
typedef SNumeric          TTimeMs;
typedef SNumeric            TTimeS;
typedef SNumeric         TSamples;
typedef SNumeric     TFractionalSamples;
typedef SNumeric         TDecibels;
typedef SNumeric        TAmplitude;
typedef SNumeric          TDegrees;

//=====================================================================================
// Constants
//=====================================================================================

static const float c_pi = (float)M_PI;
static const float c_twoPi = c_pi * 2.0f;

//=====================================================================================
// Structs
//=====================================================================================

struct SSoundSettings
{
TSamples        m_sampleRate;
TSamples        m_sampleCount;
};

//=====================================================================================
// Conversion Functions
//=====================================================================================

inline TFFTBin FrequencyToFFTBin(TFrequency frequency, TFFTBin numBins, TSamples sampleRate)
{
// bin = frequency * numBin / sampleRate
return TFFTBin((uint32)(frequency.Value() * (float)numBins.Value() / (float)sampleRate.Value()));
}

inline TFrequency FFTBinToFrequency(TFFTBin bin, TFFTBin numBins, TSamples sampleRate)
{
// frequency = bin * SampleRate / numBins
return TFrequency((float)bin.Value() * (float)sampleRate.Value() / (float)numBins.Value());
}

{
return TRadians(degrees.Value() * c_pi / 180.0f);
}

{
return TDegrees(radians.Value() * 180.0f / c_pi);
}

inline TDecibels AmplitudeToDB(TAmplitude volume)
{
return TDecibels(20.0f * log10(volume.Value()));
}

inline TAmplitude DBToAmplitude(TDecibels dB)
{
return TAmplitude(pow(10.0f, dB.Value() / 20.0f));
}

TTimeS SamplesToSeconds(const SSoundSettings &s, TSamples samples)
{
return TTimeS(samples.Divide(s.m_sampleRate));
}

TSamples SecondsToSamples(const SSoundSettings &s, TTimeS seconds)
{
return TSamples((int)(seconds.Value() * (float)s.m_sampleRate.Value()));
}

TSamples MilliSecondsToSamples(const SSoundSettings &s, TTimeMs milliseconds)
{
return SecondsToSamples(s, TTimeS((float)milliseconds.Value() / 1000.0f));
}

TTimeMs SecondsToMilliseconds(TTimeS seconds)
{
return TTimeMs((uint32)(seconds.Value() * 1000.0f));
}

TFrequency Frequency(float octave, float note)
{
/* frequency = 440×(2^(n/12))
Notes:
0  = A
1  = A#
2  = B
3  = C
4  = C#
5  = D
6  = D#
7  = E
8  = F
9  = F#
10 = G
11 = G# */
return TFrequency((float)(440 * pow(2.0, ((double)((octave - 4) * 12 + note)) / 12.0)));
}

template
T AmplitudeToAudioSample(const TAmplitude& in)
{
const T c_min = std::numeric_limits::min();
const T c_max = std::numeric_limits::max();
const float c_minFloat = (float)c_min;
const float c_maxFloat = (float)c_max;

float ret = in.Value() * c_maxFloat;

if (ret  c_maxFloat)
return c_max;

return (T)ret;
}

//=====================================================================================
// Audio Utils
//=====================================================================================

void EnvelopeSamples(std::vector& samples, TSamples envelopeTimeFrontBack)
{
const TSamples c_frontEnvelopeEnd(envelopeTimeFrontBack);
const TSamples c_backEnvelopeStart(samples.size() - envelopeTimeFrontBack.Value());

for (TSamples index(0), numSamples(samples.size()); index < numSamples; ++index)
{
// calculate envelope
TAmplitude envelope(1.0f);
if (index < c_frontEnvelopeEnd)
envelope = TAmplitude(index.Divide(envelopeTimeFrontBack));
else if (index > c_backEnvelopeStart)
envelope = TAmplitude(1.0f) - TAmplitude((index - c_backEnvelopeStart).Divide(envelopeTimeFrontBack));

// apply envelope
samples[index.Value()] *= envelope;
}
}

void NormalizeSamples(std::vector& samples, TAmplitude maxAmplitude)
{
// nothing to do if no samples
if (samples.size() == 0)
return;

// 1) find the largest absolute value in the samples.
TAmplitude largestAbsVal = TAmplitude(abs(samples.front().Value()));
std::for_each(samples.begin() + 1, samples.end(), [&largestAbsVal](const TAmplitude &a)
{
TAmplitude absVal = TAmplitude(abs(a.Value()));
if (absVal > largestAbsVal)
largestAbsVal = absVal;
}
);

// 2) adjust largestAbsVal so that when we divide all samples, none will be bigger than maxAmplitude
// if the value we are going to divide by is <= 0, bail out
largestAbsVal /= maxAmplitude;
if (largestAbsVal <= TAmplitude(0.0f))
return;

// 3) divide all numbers by the largest absolute value seen so all samples are [-maxAmplitude,+maxAmplitude]
for (TSamples index(0), numSamples(samples.size()); index < numSamples; ++index)
samples[index.Value()] = samples[index.Value()] / largestAbsVal;
}

//=====================================================================================
// Wave File Writing Code
//=====================================================================================
{
//the main chunk
unsigned char m_szChunkID[4];      //0
uint32        m_nChunkSize;        //4
unsigned char m_szFormat[4];       //8

//sub chunk 1 "fmt "
unsigned char m_szSubChunk1ID[4];  //12
uint32        m_nSubChunk1Size;    //16
uint16        m_nAudioFormat;      //18
uint16        m_nNumChannels;      //20
uint32        m_nSampleRate;       //24
uint32        m_nByteRate;         //28
uint16        m_nBlockAlign;       //30
uint16        m_nBitsPerSample;    //32

//sub chunk 2 "data"
unsigned char m_szSubChunk2ID[4];  //36
uint32        m_nSubChunk2Size;    //40

//then comes the data!
};

//this writes a wave file
template
bool WriteWaveFile(const char *fileName, const std::vector &samples, const SSoundSettings &sound)
{
//open the file if we can
FILE *file = fopen(fileName, "w+b");
if (!file)
return false;

//calculate bits per sample and the data size
const int32 bitsPerSample = sizeof(T) * 8;
const int dataSize = samples.size() * sizeof(T);

//fill out the main chunk

//fill out sub chunk 1 "fmt "
waveHeader.m_nByteRate = sound.m_sampleRate.Value() * 1 * bitsPerSample / 8;
waveHeader.m_nBlockAlign = 1 * bitsPerSample / 8;

//fill out sub chunk 2 "data"

//write the wave data itself, converting it from float to the type specified
std::vector outSamples;
outSamples.resize(samples.size());
for (size_t index = 0; index < samples.size(); ++index)
outSamples[index] = AmplitudeToAudioSample(samples[index]);
fwrite(&outSamples[0], dataSize, 1, file);

//close the file and return success
fclose(file);
return true;
}

//=====================================================================================
// FFT / IFFT
//=====================================================================================

// Thanks RosettaCode.org!
// http://rosettacode.org/wiki/Fast_Fourier_transform#C.2B.2B
// In production you'd probably want a non recursive algorithm, but this works fine for us

// for use with FFT and IFFT
typedef std::complex Complex;
typedef std::valarray CArray;

// Cooley–Tukey FFT (in-place)
void fft(CArray& x)
{
const size_t N = x.size();
if (N <= 1) return;

// divide
CArray even = x[std::slice(0, N/2, 2)];
CArray  odd = x[std::slice(1, N/2, 2)];

// conquer
fft(even);
fft(odd);

// combine
for (size_t k = 0; k < N/2; ++k)
{
Complex t = std::polar(1.0f, -2 * c_pi * k / N) * odd[k];
x[k    ] = even[k] + t;
x[k+N/2] = even[k] - t;
}
}

// inverse fft (in-place)
void ifft(CArray& x)
{
// conjugate the complex numbers
x = x.apply(std::conj);

// forward fft
fft( x );

// conjugate the complex numbers again
x = x.apply(std::conj);

// scale the numbers
x /= (float)x.size();
}

//=====================================================================================
// Wave forms
//=====================================================================================

void SineWave(CArray &frequencies, TFFTBin bin, TRadians startingPhase)
{
// set up the single harmonic
frequencies[bin.Value()] = std::polar(1.0f, startingPhase.Value());
}

void SawWave(CArray &frequencies, TFFTBin bin, TRadians startingPhase)
{
// set up each harmonic
const float volumeAdjustment = 2.0f / c_pi;
const size_t bucketWalk = bin.Value();
for (size_t harmonic = 1, bucket = bin.Value(); bucket < frequencies.size() / 2; ++harmonic, bucket += bucketWalk)
frequencies[bucket] = std::polar(volumeAdjustment / (float)harmonic, startingPhase.Value());
}

void SquareWave(CArray &frequencies, TFFTBin bin, TRadians startingPhase)
{
// set up each harmonic
const float volumeAdjustment = 4.0f / c_pi;
const size_t bucketWalk = bin.Value() * 2;
for (size_t harmonic = 1, bucket = bin.Value(); bucket < frequencies.size() / 2; harmonic += 2, bucket += bucketWalk)
frequencies[bucket] = std::polar(volumeAdjustment / (float)harmonic, startingPhase.Value());
}

void TriangleWave(CArray &frequencies, TFFTBin bin, TRadians startingPhase)
{
// set up each harmonic
const float volumeAdjustment = 8.0f / (c_pi*c_pi);
const size_t bucketWalk = bin.Value() * 2;
for (size_t harmonic = 1, bucket = bin.Value(); bucket < frequencies.size() / 2; harmonic += 2, bucket += bucketWalk, startingPhase *= TRadians(-1.0f))
frequencies[bucket] = std::polar(volumeAdjustment / ((float)harmonic*(float)harmonic), startingPhase.Value());
}

void NoiseWave(CArray &frequencies, TFFTBin bin, TRadians startingPhase)
{
// give a random amplitude and phase to each frequency
for (size_t bucket = 0; bucket < frequencies.size() / 2; ++bucket)
{
float amplitude = static_cast  (rand()) / static_cast  (RAND_MAX);
float phase = 2.0f * c_pi * static_cast  (rand()) / static_cast  (RAND_MAX);
frequencies[bucket] = std::polar(amplitude, phase);
}
}

//=====================================================================================
// Tests
//=====================================================================================

template
void ConsantBins(
const W &waveForm,
TFrequency &frequency,
bool repeat,
const char *fileName,
bool normalize,
TAmplitude multiplier,
)
{
const TFFTBin c_numBins(4096);

//our desired sound parameters
SSoundSettings sound;
sound.m_sampleRate = TSamples(44100);
sound.m_sampleCount = MilliSecondsToSamples(sound, TTimeMs(500));

// allocate space for the output file and initialize it
std::vector samples;
samples.resize(sound.m_sampleCount.Value());
std::fill(samples.begin(), samples.end(), TAmplitude(0.0f));

// make test data
CArray data(c_numBins.Value());
waveForm(data, FrequencyToFFTBin(frequency, c_numBins, sound.m_sampleRate), startingPhase);

// inverse fft - convert from frequency domain (frequencies) to time domain (samples)
// need to scale up amplitude before fft
data *= (float)data.size();
ifft(data);

// convert to samples
if (repeat)
{
//repeat results in the output buffer
size_t dataSize = data.size();
for (size_t i = 0; i < samples.size(); ++i)
samples[i] = TAmplitude((float)data[i%dataSize].real());
}
else
{
//put results in the output buffer once.  Useful for debugging / visualization
for (size_t i = 0; i < samples.size() && i < data.size(); ++i)
samples[i] = TAmplitude((float)data[i].real());
}

// normalize our samples if we should
if (normalize)
NormalizeSamples(samples, DBToAmplitude(TDecibels(-3.0f)));

// apply the multiplier passed in
std::for_each(samples.begin(), samples.end(), [&](TAmplitude& amplitude) {
amplitude *= multiplier;
});

// write the wave file
WriteWaveFile(fileName, samples, sound);
}

void Convolve_Circular(const std::vector& a, const std::vector& b, std::vector& result)
{
// NOTE: Written for readability, not efficiency
TSamples ASize(a.size());
TSamples BSize(b.size());

// NOTE: the circular convolution result doesn't have to be the size of a, i just chose this size to match the ifft
// circular convolution output.
result.resize(ASize.Value());
std::fill(result.begin(), result.end(), TAmplitude(0.0f));

for (TSamples outputSampleIndex(0), numOutputSamples(ASize); outputSampleIndex < numOutputSamples; ++outputSampleIndex)
{
TAmplitude &outputSample = result[outputSampleIndex.Value()];
for (TSamples sampleIndex(0), numSamples(ASize); sampleIndex < numSamples; ++sampleIndex)
{
TSamples BIndex = (outputSampleIndex + ASize - sampleIndex) % ASize;
if (BIndex < BSize)
{
const TAmplitude &ASample = a[sampleIndex.Value()];
const TAmplitude &BSample = b[BIndex.Value()];
outputSample += BSample * ASample;
}
}
}
}

void Convolve_Linear(const std::vector& a, const std::vector& b, std::vector& result)
{
// NOTE: Written for readability, not efficiency
TSamples ASize(a.size());
TSamples BSize(b.size());

result.resize(ASize.Value() + BSize.Value() - 1);
std::fill(result.begin(), result.end(), TAmplitude(0.0f));

for (TSamples outputSampleIndex(0), numOutputSamples(result.size()); outputSampleIndex < numOutputSamples; ++outputSampleIndex)
{
TAmplitude &outputSample = result[outputSampleIndex.Value()];
for (TSamples sampleIndex(0), numSamples(ASize); sampleIndex = sampleIndex)
{
TSamples BIndex = outputSampleIndex - sampleIndex;
if (BIndex < BSize)
{
const TAmplitude &ASample = a[sampleIndex.Value()];
const TAmplitude &BSample = b[BIndex.Value()];
outputSample += BSample * ASample;
}
}
}
}
}

template
void DoConvolution(const W1 &waveForm1, const W2 &waveForm2)
{
const TFFTBin c_numBins(4096);

//our desired sound parameters
SSoundSettings sound;
sound.m_sampleRate = TSamples(44100);
sound.m_sampleCount = TSamples(c_numBins.Value());

// make the frequency data for wave form 1
CArray data1(c_numBins.Value());

// make the frequency data for wave form 2
CArray data2(c_numBins.Value());

// do circular convolution in time domain by doing multiplication in the frequency domain
CArray data3(c_numBins.Value());
data3 = data1 * data2;

// write out the first convolution input (in time domain samples)
std::vector samples1;
samples1.resize(sound.m_sampleCount.Value());
std::fill(samples1.begin(), samples1.end(), TAmplitude(0.0f));
{
data1 *= (float)data1.size();
ifft(data1);

// convert to samples
for (size_t i = 0; i < samples1.size() && i < data1.size(); ++i)
samples1[i] = TAmplitude((float)data1[i].real());

// write the wave file
WriteWaveFile("_convolution_A.wav", samples1, sound);
}

// write out the second convolution input (in time domain samples)
std::vector samples2;
samples2.resize(sound.m_sampleCount.Value());
std::fill(samples2.begin(), samples2.end(), TAmplitude(0.0f));
{
data2 *= (float)data2.size();
ifft(data2);

// convert to samples
for (size_t i = 0; i < samples2.size() && i < data2.size(); ++i)
samples2[i] = TAmplitude((float)data2[i].real());

// write the wave file
WriteWaveFile("_convolution_B.wav", samples2, sound);
}

// write the result of the convolution (in time domain samples)
{
data3 *= (float)data3.size();
ifft(data3);

// convert to samples
std::vector samples3;
samples3.resize(sound.m_sampleCount.Value());
for (size_t i = 0; i < samples3.size() && i < data3.size(); ++i)
samples3[i] = TAmplitude((float)data3[i].real());

// write the wave file
NormalizeSamples(samples3, TAmplitude(1.0f));
WriteWaveFile("_convolution_out_ifft.wav", samples3, sound);
}

// do linear convolution in the time domain and write out the wave file
{
std::vector samples4;
Convolve_Linear(samples1, samples2, samples4);
NormalizeSamples(samples4, TAmplitude(1.0f));
WriteWaveFile("_convolution_out_lin.wav", samples4, sound);
}

// do circular convolution in time domain and write out the wave file
{
std::vector samples4;
Convolve_Circular(samples1, samples2, samples4);
NormalizeSamples(samples4, TAmplitude(1.0f));
WriteWaveFile("_convolution_out_cir.wav", samples4, sound);
}
}

//=====================================================================================
// Frequency Over Time Track Structs
//=====================================================================================

struct SBinTrack
{
SBinTrack() { }

SBinTrack(
TFFTBin bin,
std::function amplitudeFunction,
)
: m_bin(bin)
, m_amplitudeFunction(amplitudeFunction)
, m_phase(phase) {}

TFFTBin                                     m_bin;
std::function   m_amplitudeFunction;
};

//=====================================================================================
// Frequency Amplitude Over Time Test
//=====================================================================================

void TracksToSamples_IFFT_Window(const std::vector &tracks, CArray &windowData, TTimeS time, TTimeS totalTime)
{
// clear out the window data
windowData = Complex(0.0f, 0.0f);

// gather the bin data
std::for_each(tracks.begin(), tracks.end(), [&](const SBinTrack &track) {
windowData[track.m_bin.Value()] = std::polar(track.m_amplitudeFunction(time, totalTime).Value(), track.m_phase.Value());
});

// convert it to time samples
windowData *= (float)windowData.size();
ifft(windowData);
}

void TracksToSamples_IFFT(const std::vector &tracks, std::vector &samples, TTimeS totalTime, TFFTBin numBins)
{
// convert the tracks to samples, one window of numBins at a time
CArray windowData(numBins.Value());
for (TSamples startSample(0), numSamples(samples.size()); startSample < numSamples; startSample += TSamples(numBins.Value()))
{
// Convert the tracks that we can into time samples using ifft
float percent = startSample.Divide(numSamples);
TTimeS time(totalTime.Value() * percent);
TracksToSamples_IFFT_Window(tracks, windowData, time, totalTime);

// convert window data to samples
const size_t numWindowSamples = std::min(numBins.Value(), (numSamples - startSample).Value());
for (size_t i = 0; i < numWindowSamples; ++i)
samples[startSample.Value() + i] = TAmplitude((float)windowData[i].real());
}
}

void TracksToSamples_Oscilators(const std::vector &tracks, std::vector &samples, TTimeS totalTime, TFFTBin numBins)
{
// Render each time/amplitude track in each frequency bin to actual cosine samples
float samplesPerSecond = (float)samples.size() / totalTime.Value();
float ratio = samplesPerSecond / (float)numBins.Value();
for (size_t i = 0, c = samples.size(); i < c; ++i)
{
float percent = (float)i / (float)c;
TTimeS time(totalTime.Value() * percent);
samples[i].Value() = 0.0f;
std::for_each(tracks.begin(), tracks.end(),
[&](const SBinTrack &track)
{
TAmplitude amplitude = track.m_amplitudeFunction(time, totalTime);
samples[i] += TAmplitude(cos(time.Value()*c_twoPi*ratio*(float)track.m_bin.Value() + track.m_phase.Value())) * amplitude;
}
);
}
}

{
TTimeS m_time;
TAmplitude m_amplitude;
};

{
// if no faid pairs, 0 amplitude always
{
return [](TTimeS time, TTimeS totalTime) -> TAmplitude
{
return TAmplitude(0.0f);
};
}

// otherwise, use the fade info to make an amplitude over time track
// NOTE: assume amplitude 0 at time 0 and totalTime
return [fadePairs](TTimeS time, TTimeS totalTime) -> TAmplitude
{

for (size_t i = 0; i < fadePairs.size(); ++i)
{
{
}
}
if (time < totalTime)
{
}

return TAmplitude(0.0f);
};
}

void DynamicBins(TFFTBin numBins, const std::vector& tracks, const char *fileNameFFT, const char * fileNameOsc)
{
//our desired sound parameters
SSoundSettings sound;
sound.m_sampleRate = TSamples(44100);
sound.m_sampleCount = MilliSecondsToSamples(sound, TTimeMs(2000));

// allocate space for the output file and initialize it
std::vector samples;
samples.resize(sound.m_sampleCount.Value());
std::fill(samples.begin(), samples.end(), TAmplitude(0.0f));

const TTimeS totalTime = SamplesToSeconds(sound, sound.m_sampleCount);

// convert our frequency over time descriptions to time domain samples using IFFT
if (fileNameFFT)
{
TracksToSamples_IFFT(tracks, samples, totalTime, numBins);
NormalizeSamples(samples, DBToAmplitude(TDecibels(-3.0f)));
EnvelopeSamples(samples, MilliSecondsToSamples(sound, TTimeMs(50)));
WriteWaveFile(fileNameFFT, samples, sound);
}

// convert our frequency over time descriptions to time domain samples using Oscillators
if (fileNameOsc)
{
TracksToSamples_Oscilators(tracks, samples, totalTime, numBins);
NormalizeSamples(samples, DBToAmplitude(TDecibels(-3.0f)));
EnvelopeSamples(samples, MilliSecondsToSamples(sound, TTimeMs(50)));
WriteWaveFile(fileNameOsc, samples, sound);
}
}

//=====================================================================================
// Main
//=====================================================================================
int main(int argc, char **argv)
{
// make some basic wave forms with IFFT
ConsantBins(NoiseWave, Frequency(3, 8), true, "_noise.wav", true, TAmplitude(1.0f));
ConsantBins(SquareWave, Frequency(3, 8), true, "_square.wav", true, TAmplitude(1.0f));
ConsantBins(TriangleWave, Frequency(3, 8), true, "_triangle.wav", true, TAmplitude(1.0f));
ConsantBins(SawWave, Frequency(3, 8), true, "_saw.wav", true, TAmplitude(1.0f));
ConsantBins(SineWave, Frequency(3, 8), true, "_cosine.wav", true, TAmplitude(1.0f), TRadians(0.0f));
ConsantBins(SineWave, Frequency(3, 8), true, "_sine.wav", true, TAmplitude(1.0f));

// show saw wave phase shifted.  Looks different but sounds the same!
// You can do the same with square, saw, triangle (and other more complex wave forms)
// We take the saw waves down 12 db though because some variations have large peaks so would clip otherwise.
// we don't normalize because we want you to hear them all at the same loudness to tell that they really do sound the same.
ConsantBins(SawWave, Frequency(3, 8), true, "_saw_0.wav"  , false, DBToAmplitude(TDecibels(-12.0f)), DegreesToRadians(TDegrees(0.0f)));
ConsantBins(SawWave, Frequency(3, 8), true, "_saw_15.wav" , false, DBToAmplitude(TDecibels(-12.0f)), DegreesToRadians(TDegrees(15.0f)));
ConsantBins(SawWave, Frequency(3, 8), true, "_saw_30.wav" , false, DBToAmplitude(TDecibels(-12.0f)), DegreesToRadians(TDegrees(30.0f)));
ConsantBins(SawWave, Frequency(3, 8), true, "_saw_45.wav" , false, DBToAmplitude(TDecibels(-12.0f)), DegreesToRadians(TDegrees(45.0f)));
ConsantBins(SawWave, Frequency(3, 8), true, "_saw_60.wav" , false, DBToAmplitude(TDecibels(-12.0f)), DegreesToRadians(TDegrees(60.0f)));
ConsantBins(SawWave, Frequency(3, 8), true, "_saw_75.wav" , false, DBToAmplitude(TDecibels(-12.0f)), DegreesToRadians(TDegrees(75.0f)));
ConsantBins(SawWave, Frequency(3, 8), true, "_saw_90.wav" , false, DBToAmplitude(TDecibels(-12.0f)), DegreesToRadians(TDegrees(90.0f)));
ConsantBins(SawWave, Frequency(3, 8), true, "_saw_105.wav", false, DBToAmplitude(TDecibels(-12.0f)), DegreesToRadians(TDegrees(105.0f)));
ConsantBins(SawWave, Frequency(3, 8), true, "_saw_120.wav", false, DBToAmplitude(TDecibels(-12.0f)), DegreesToRadians(TDegrees(120.0f)));
ConsantBins(SawWave, Frequency(3, 8), true, "_saw_135.wav", false, DBToAmplitude(TDecibels(-12.0f)), DegreesToRadians(TDegrees(135.0f)));
ConsantBins(SawWave, Frequency(3, 8), true, "_saw_150.wav", false, DBToAmplitude(TDecibels(-12.0f)), DegreesToRadians(TDegrees(150.0f)));
ConsantBins(SawWave, Frequency(3, 8), true, "_saw_165.wav", false, DBToAmplitude(TDecibels(-12.0f)), DegreesToRadians(TDegrees(165.0f)));
ConsantBins(SawWave, Frequency(3, 8), true, "_saw_180.wav", false, DBToAmplitude(TDecibels(-12.0f)), DegreesToRadians(TDegrees(180.0f)));
ConsantBins(SawWave, Frequency(3, 8), true, "_saw_270.wav", false, DBToAmplitude(TDecibels(-12.0f)), DegreesToRadians(TDegrees(270.0f)));

// show how IFFT can have popping at window edges
{
std::vector tracks;
tracks.emplace_back(SBinTrack(TFFTBin(10), [](TTimeS time, TTimeS totalTime) -> TAmplitude
{
return TAmplitude(cos(time.Value()*c_twoPi*4.0f) * 0.5f + 0.5f);
}));

// make a version that starts at a phase of 0 degrees and has popping at the
// edges of each IFFT window
DynamicBins(TFFTBin(1024), tracks, "_IFFTTest1.wav", nullptr);

// make a version that starts at a phase of 270 degrees and is smooth at the
// edges of each IFFT window but can only change amplitude at the edges of
// each window.
DynamicBins(TFFTBin(1024), tracks, "_IFFTTest2.wav", nullptr);

// make a version with oscillators and additive synthesis which has no
// popping and can also change amplitude anywhere in the wave form.
DynamicBins(TFFTBin(1024), tracks, nullptr, "_IFFTTest3.wav");
}

// make an alien sound using both IFFT and oscillators (additive synthesis)
{
std::vector tracks;
tracks.emplace_back(SBinTrack(TFFTBin(1), MakeFadeFunction({ { TTimeS(0.5f), TAmplitude(1.0f) }, { TTimeS(1.0f), TAmplitude(0.5f) } })));
tracks.emplace_back(SBinTrack(TFFTBin(2), MakeFadeFunction({ { TTimeS(1.0f), TAmplitude(1.0f) } })));
tracks.emplace_back(SBinTrack(TFFTBin(3), MakeFadeFunction({ { TTimeS(1.5f), TAmplitude(1.0f) } })));
tracks.emplace_back(SBinTrack(TFFTBin(5), MakeFadeFunction({ { TTimeS(1.25f), TAmplitude(1.0f) } })));
tracks.emplace_back(SBinTrack(TFFTBin(10), [](TTimeS time, TTimeS totalTime) -> TAmplitude
{
float value = (cos(time.Value()*c_twoPi*4.0f) * 0.5f + 0.5f) * 0.5f;
if (time < totalTime * TTimeS(0.5f))
value *= (time / (totalTime*TTimeS(0.5f))).Value();
else
value *= 1.0f - ((time - totalTime*TTimeS(0.5f)) / (totalTime*TTimeS(0.5f))).Value();
return TAmplitude(value);
}));
DynamicBins(TFFTBin(1024), tracks, "_alien_ifft.wav", "_alien_osc.wav");
}

// Make some drum beats
{
// frequency = bin * SampleRate / numBins
// frequency = bin * 44100 / 4096
// frequency ~= bin * 10.75
// up to 22050hz at bin 2048

TFFTBin c_numBins(4096);
TSamples c_sampleRate(44100);

std::vector tracks;

const TTimeS timeMultiplier(1.1f);

// base drum: 100-200hz every half second
{
const TFFTBin start(FrequencyToFFTBin(TFrequency(100.0f), c_numBins, c_sampleRate));
const TFFTBin end(FrequencyToFFTBin(TFrequency(200.0f), c_numBins, c_sampleRate));
const TFFTBin step(5);

auto beat = [&](TTimeS time, TTimeS totalTime)->TAmplitude
{
time *= timeMultiplier;
time = TTimeS(std::fmod(time.Value(), 0.5f));
const TTimeS attack(0.01f);
const TTimeS release(TTimeS(0.2f) - attack);
const TTimeS totalBeatTime(attack + release);

TAmplitude ret;
if (time < attack)
ret = TAmplitude(time.Divide(attack));
else if (time < totalBeatTime)
ret = TAmplitude(1.0f) - TAmplitude((time - attack).Divide(release));
else
ret = TAmplitude(0.0f);
return ret * TAmplitude(10.0f);
};
for (TFFTBin i = start; i TAmplitude
{
time *= timeMultiplier;
time = TTimeS(std::fmod(time.Value() + 0.25f, 1.0f));
const TTimeS attack(0.025f);
const TTimeS release(TTimeS(0.075f) - attack);
const TTimeS totalBeatTime(attack + release);

TAmplitude ret;
if (time < attack)
ret = TAmplitude(time.Divide(attack));
else if (time < totalBeatTime)
ret = TAmplitude(1.0f) - TAmplitude((time - attack).Divide(release));
else
ret = TAmplitude(0.0f);
return ret;
};
for (TFFTBin i = start; i TAmplitude
{
time *= timeMultiplier;
time = TTimeS(std::fmod(time.Value() + 0.75f, 1.0f));
const TTimeS attack(0.025f);
const TTimeS release(TTimeS(0.075f) - attack);
const TTimeS totalBeatTime(attack + release);

TAmplitude ret;
if (time < attack)
ret = TAmplitude(time.Divide(attack));
else if (time < totalBeatTime)
ret = TAmplitude(1.0f) - TAmplitude((time - attack).Divide(release));
else
ret = TAmplitude(0.0f);
return ret;
};
for (TFFTBin i = start; i <= end; i += step)
tracks.emplace_back(SBinTrack(i, beat));
}

// render the result with both ifft and oscillators
DynamicBins(c_numBins, tracks, "_drums_ifft.wav", "_drums_osc.wav");
}

// do our convolution tests
DoConvolution(SawWave, SquareWave);

return 0;
}


So, frequency domain audio synthesis with IFFT is kind of a mixed bag. It can be good so long as you are ok with it’s limitations, but if you aren’t, it’s better to stick to oscillators and do additive synthesis.

Working in the frequency domain in general is pretty cool though. If you bust out a spectrograph and analyze the frequency of a type of sound, once you understand the frequency domain makeup of that sound, you can go back and synthesize it yourself using either oscillators or IFFT. You can go down this route to make your own synth guitar sound, or trumpet, or you could try to mimic sound effects. The world is your oyster! Well, almost… things like “voice synthesis” are actually pretty complex, and this method for matching musical instruments will only get you so far. More to come in future posts!

Here are some links to dive a bit deeper into this stuff:

I think I might be burned out on audio for a while, my thoughts are turning towards graphics so look for some graphics posts soon 😛

Decibels (dB) and Amplitude

If you are a programmer, chances are that when you think of volume or volume adjustments of audio signals (or other streams of data), you are thinking in terms of amplitude.

For instance, to make an audio stream quieter, you are probably going to multiply all the samples by 0.5 to bring it down in volume. That 0.5 is a scalar value in amplitude space.

If you work with musicians or other audio folk, chances are they are not going to think in amplitude, may not be able to easily adjust to thinking in amplitude, and instead will talk to you in terms of decibels or dB, which is as foreign to you as amplitude is to them.

The main issue is that our ears do not hear linear adjustments in amplitude as linear adjustments in loudness, but linear adjustments in dB do sound like they are linear adjustments in loudness.

dB is a bit easier to understand as well. 0dB means full volume and positive numbers means a boost in volume, while negative numbers mean a decrease in volume.

dB combines linearly, unlike amplitudes which have to multiply together. -6db means that the volume has been cut in half, and -12db means that it has been cut in half again (so is 1/4 as loud) and -18db means that it has been cut in half yet again (now 1/8 as loud). Doing the same with amplitude, 0.5 means that the volume is cut in half, and then you multiply that by 0.5 to get 0.25 to make it 1/4 as loud, and multiply again to get 0.125 which is 1/8 as loud.

A fun byproduct of this is that using dB you can never describe zero (silence) exactly. People will usually adopt a convention of saying that below -60dB is silence, or below -96dB is silence. This has a nice side benefit that you can make the cutoff point of “assumed silence” be above the level that floating point denormals are (very small numbers that need special processing and are slower to work with than regular floating point numbers), so that in effect you can use this to remove denormals from your processing, which can boost the performance of your code.

Conversion Functions

To convert from amplitude to dB, the formula is:

dB = 20 * log10(amplitude)

To convert from dB to amplitude, the formula is:

amplitude = 10^(db/20)

Note that when converting audio samples to dB, you want to take the absolute value of the audio sample, since sign doesn’t matter for loudness. -1 and +1 have the same loudness (0dB).

Here’s some c++ code which does those two operations:

inline float AmplitudeTodB(float amplitude)
{
return 20.0f * log10(amplitude);
}

inline float dBToAmplitude(float dB)
{
return pow(10.0f, db/20.0f);
}


Conversion Table

Here are some dB values and corresponding amplitude values to help you better understand how dB and amplitude are related.

Decreasing Volume:

 dB Amplitude -1 0.891 -3 0.708 -6 0.501 -12 0.251 -18 0.126 -20 0.1 -40 0.01 -60 0.001 -96 0.00002

Increasing Volume:

 dB Amplitude 1 1.122 3 1.413 6 1.995 12 3.981 18 7.943 20 10 40 100 60 1000 96 63095.734

Next Up

I’m just about finished doing the research for a fourier synthesis post to show how to use the inverse fourier transform to turn frequency information into audio samples. Look for that in the next couple days!

DIY Synth: Convolution Reverb & 1D Discrete Convolution of Audio Samples

This is a part of the DIY Synthesizer series of posts where each post is roughly built upon the knowledge of the previous posts. If you are lost, check the earlier posts!

If there is a physical location that has the reverb you are looking for (a church, a cave, your bathroom, the subway, whatever), you can go there and record the sound of some short, loud sound (a clap, a starter pistol, whatever). Now that you have this sound file, you can use a mathematical technique called “Discrete Convolution” (not as difficult as it sounds) to apply that reverb to other sounds.

Not only that, you can use convolution to apply various properties of one sound to another sound. Like you could convolve a saw wave and a recording of a voice and make it sound robotic.

This is pretty cool right?

Convolution

There are two types of convolution out there… continuous convolution and discrete convolution.

Continuous convolution is a method where you can take two formulas and perform some operations (algebra, calculus, etc) to get a third formula.

In the cases that us programmers usually work with, however, you usually only have data, and not a mathematical formula describing that data.

When you have specific data points instead of an equation, you can use something called discrete convolution instead.

Discrete convolution is a method where you take two streams of data (IE two arrays of data!) and perform some operations (only looping, addition and multiplication!) to get a third stream of data.

Interestingly, this third stream of data will be the size of the first stream of data, plus the size of the second stream of data, minus one.

This link explains discrete convolution much better than I could, and even has an interactive demo, so check it out: Discrete Convolution

Here is some pseudo code that shows how discrete convolution works.

// We are calculating bufferA * bufferB (discrete convolution)
// initialize out buffer to empty, and make sure it's the right size
outBuffer[sizeof(bufferA)+sizeof(bufferB)-1] = 0

// reverse one of the buffers in preperation of the convolution
reverse(bufferB);

// Convolve!
for (int outBufferIndex = 0; outBufferIndex < sizeof(outBuffer); ++outBufferIndex)
{
bufferAIndex = 0;
bufferBIndex = 0;
if (outBufferIndex < sizeof(bufferB))
bufferBIndex = sizeof(bufferB) - outBufferIndex - 1;
else
bufferAIndex = outBufferIndex - sizeof(bufferB);

for (; bufferAIndex < sizeof(bufferA) && bufferBIndex < sizeof(bufferB); ++bufferAIndex, ++bufferBIndex)
{
outBuffer[outBufferIndex] += bufferA[bufferAIndex] * bufferB[bufferBIndex];
}
}


That's all there is to it!

Convolution Reverb

Like I mentioned above, to do convolution reverb you just convolve the audio file you want to add reverb to with the recorded sound from the place that has the reverb you want. That second sound is called the Impulse Response, or IR for short.

Besides letting you choose a reverb model (like “cavernous” or “small hallway” or “bright tiled room”), many reverbs will have a “wetness” parameter that controls how much reverb you hear versus how much of the origional noise you hear. A wetness of 1.0 means all you hear is the reverb (the convolution), while a wetness of 0.0 means all you hear is the origional sound (the dry sound). It’s just a percentage and literally is just used to lerp between the two signals.

Here is an example of how changes in wetness can sound, and also give you a first glimpse of just how good this technique works!

Dry Sound: SoundBite
Impulse Response: ReverbLarge

Convolved with 100% wetness: c_RL_SBite.wav
Convolved with 66% wetness: c_RL_SBite_66.wav
Convolved with 33% wetness: c_RL_SBite_33.wav

Sample Sound Files

Let’s convolve some sounds together and see what results we get!

Here are 11 sounds convoled against each other (and themselves) at 100% wetness so all you hear is the convolution. Convolution is commutative (A*B == B*A) so that cuts down on the number of combinations (66 instead of 121!)

There are 3 impulse responses, 4 short basic wave forms, 2 voice samples a cymbal drum pattern and a jaw harp twang.

Some of the results are pretty interesting, for instance check out SoundBite vs SoundCymbal. The beat of the cymbal pattern remains, but the actual symbal sound itself is replaced by the words!

Sample Code

Here is some sample code which will read in “in.wav” and “reverb.wav”, do convolution on them and save the fully wet result (convolution result only, no original in.wav mixed into the output) as “out.wav”.

Note that this code was used to process the sample sound files above. Usual caveat: The sound loading code isn’t bulletproof. I’ve found it works best with 16 bit mono sound files. If you need better sound loading, check out libsndfile!

#define _CRT_SECURE_NO_WARNINGS

#include
#include
#include
#include
#include
#include
#include
#include

#define _USE_MATH_DEFINES
#include

//=====================================================================================
// SNumeric - uses phantom types to enforce type safety
//=====================================================================================
template
struct SNumeric
{
public:
explicit SNumeric(const T &value) : m_value(value) { }
SNumeric() : m_value() { }
inline T& Value() { return m_value; }
inline const T& Value() const { return m_value; }

typedef SNumeric TType;
typedef T TInnerType;

// Math Operations
TType operator+ (const TType &b) const
{
return TType(this->Value() + b.Value());
}

TType operator- (const TType &b) const
{
return TType(this->Value() - b.Value());
}

TType operator* (const TType &b) const
{
return TType(this->Value() * b.Value());
}

TType operator/ (const TType &b) const
{
return TType(this->Value() / b.Value());
}

TType& operator+= (const TType &b)
{
Value() += b.Value();
return *this;
}

TType& operator-= (const TType &b)
{
Value() -= b.Value();
return *this;
}

TType& operator*= (const TType &b)
{
Value() *= b.Value();
return *this;
}

TType& operator/= (const TType &b)
{
Value() /= b.Value();
return *this;
}

TType& operator++ ()
{
Value()++;
return *this;
}

TType& operator-- ()
{
Value()--;
return *this;
}

// Extended Math Operations
template
T Divide(const TType &b)
{
return ((T)this->Value()) / ((T)b.Value());
}

// Logic Operations
bool operatorValue() < b.Value();
}
bool operatorValue()  (const TType &b) const {
return this->Value() > b.Value();
}
bool operator>= (const TType &b) const {
return this->Value() >= b.Value();
}
bool operator== (const TType &b) const {
return this->Value() == b.Value();
}
bool operator!= (const TType &b) const {
return this->Value() != b.Value();
}

private:
T m_value;
};

//=====================================================================================
// Typedefs
//=====================================================================================

typedef uint8_t uint8;
typedef uint16_t uint16;
typedef uint32_t uint32;
typedef int16_t int16;
typedef int32_t int32;

// type safe types!
typedef SNumeric      TFrequency;
typedef SNumeric        TTimeMs;
typedef SNumeric       TSamples;
typedef SNumeric   TFractionalSamples;
typedef SNumeric       TDecibels;
typedef SNumeric      TAmplitude;
typedef SNumeric          TPhase;

//=====================================================================================
// Constants
//=====================================================================================

static const float c_pi = (float)M_PI;
static const float c_twoPi = c_pi * 2.0f;

//=====================================================================================
// Structs
//=====================================================================================

struct SSoundSettings
{
TSamples        m_sampleRate;
};

//=====================================================================================
// Conversion Functions
//=====================================================================================
inline TDecibels AmplitudeToDB(TAmplitude volume)
{
return TDecibels(log10(volume.Value()));
}

inline TAmplitude DBToAmplitude(TDecibels dB)
{
return TAmplitude(pow(10.0f, dB.Value() / 20.0f));
}

TSamples SecondsToSamples(const SSoundSettings &s, float seconds)
{
return TSamples((int)(seconds * (float)s.m_sampleRate.Value()));
}

TSamples MilliSecondsToSamples(const SSoundSettings &s, float milliseconds)
{
return SecondsToSamples(s, milliseconds / 1000.0f);
}

TTimeMs SecondsToMilliseconds(float seconds)
{
return TTimeMs((uint32)(seconds * 1000.0f));
}

TFrequency Frequency(float octave, float note)
{
/* frequency = 440×(2^(n/12))
Notes:
0  = A
1  = A#
2  = B
3  = C
4  = C#
5  = D
6  = D#
7  = E
8  = F
9  = F#
10 = G
11 = G# */
return TFrequency((float)(440 * pow(2.0, ((double)((octave - 4) * 12 + note)) / 12.0)));
}

template
T AmplitudeToAudioSample(const TAmplitude& in)
{
const T c_min = std::numeric_limits::min();
const T c_max = std::numeric_limits::max();
const float c_minFloat = (float)c_min;
const float c_maxFloat = (float)c_max;

float ret = in.Value() * c_maxFloat;

if (ret  c_maxFloat)
return c_max;

return (T)ret;
}

TAmplitude GetLerpedAudioSample(const std::vector& samples, TFractionalSamples& index)
{
// get the index of each sample and the fractional blend amount
uint32 a = (uint32)floor(index.Value());
uint32 b = a + 1;
float fract = index.Value() - floor(index.Value());

// get our two amplitudes
float ampA = 0.0f;
if (a >= 0 && a = 0 && b < samples.size())
ampB = samples[b].Value();

// return the lerped result
return TAmplitude(fract * ampB + (1.0f - fract) * ampA);
}

void NormalizeSamples(std::vector& samples, TAmplitude maxAmplitude, TSamples envelopeTimeFrontBack)
{
// nothing to do if no samples
if (samples.size() == 0)
return;

// 1) find the largest absolute value in the samples.
TAmplitude largestAbsVal = TAmplitude(abs(samples.front().Value()));
std::for_each(samples.begin() + 1, samples.end(), [&largestAbsVal](const TAmplitude &a)
{
TAmplitude absVal = TAmplitude(abs(a.Value()));
if (absVal > largestAbsVal)
largestAbsVal = absVal;
}
);

// 2) adjust largestAbsVal so that when we divide all samples, none will be bigger than maxAmplitude
// if the value we are going to divide by is <= 0, bail out
largestAbsVal /= maxAmplitude;
if (largestAbsVal <= TAmplitude(0.0f))
return;

// 3) divide all numbers by the largest absolute value seen so all samples are [-maxAmplitude,+maxAmplitude]
// also apply front and back envelope
const TSamples c_frontEnvelopeEnd(envelopeTimeFrontBack);
const TSamples c_backEnvelopeStart(samples.size() - envelopeTimeFrontBack.Value());

for (TSamples index(0), numSamples(samples.size()); index < numSamples; ++index)
{
// calculate envelope
TAmplitude envelope(1.0f);
if (index < c_frontEnvelopeEnd)
envelope = TAmplitude(index.Divide(envelopeTimeFrontBack));
else if (index > c_backEnvelopeStart)
envelope = TAmplitude(1.0f) - TAmplitude((index - c_backEnvelopeStart).Divide(envelopeTimeFrontBack));

// apply envelope while normalizing
samples[index.Value()] = samples[index.Value()] * envelope / largestAbsVal;

}
}

void ResampleData(std::vector& samples, int srcSampleRate, int destSampleRate)
{
//if the requested sample rate is the sample rate it already is, bail out and do nothing
if (srcSampleRate == destSampleRate)
return;

//calculate the ratio of the old sample rate to the new
float fResampleRatio = (float)destSampleRate / (float)srcSampleRate;

//calculate how many samples the new data will have and allocate the new sample data
int nNewDataNumSamples = (int)((float)samples.size() * fResampleRatio);

std::vector newSamples;
newSamples.resize(nNewDataNumSamples);

//get each lerped output sample.  There are higher quality ways to resample
for (int nIndex = 0; nIndex < nNewDataNumSamples; ++nIndex)
newSamples[nIndex] = GetLerpedAudioSample(samples, TFractionalSamples((float)nIndex / fResampleRatio));

//free the old data and set the new data
std::swap(samples, newSamples);
}

void ChangeNumChannels(std::vector& samples, int nSrcChannels, int nDestChannels)
{
//if the number of channels requested is the number of channels already there, or either number of channels is not mono or stereo, return
if (nSrcChannels == nDestChannels ||
nSrcChannels  2 ||
nDestChannels  2)
{
return;
}

//if converting from mono to stereo, duplicate the mono channel to make stereo
if (nDestChannels == 2)
{
std::vector newSamples;
newSamples.resize(samples.size() * 2);
for (size_t index = 0; index < samples.size(); ++index)
{
newSamples[index * 2] = samples[index];
newSamples[index * 2 + 1] = samples[index];
}

std::swap(samples, newSamples);
}
//else converting from stereo to mono, mix the stereo channels together to make mono
else
{
std::vector newSamples;
newSamples.resize(samples.size() / 2);
for (size_t index = 0; index < samples.size() / 2; ++index)
newSamples[index] = samples[index * 2] + samples[index * 2 + 1];

std::swap(samples, newSamples);
}
}

float PCMToFloat(unsigned char *pPCMData, int nNumBytes)
{
switch (nNumBytes)
{
case 1:
{
uint8 data = pPCMData[0];
return (float)data / 255.0f;
}
case 2:
{
int16 data = pPCMData[1] << 8 | pPCMData[0];
return ((float)data) / ((float)0x00007fff);
}
case 3:
{
int32 data = pPCMData[2] << 16 | pPCMData[1] << 8 | pPCMData[0];
return ((float)data) / ((float)0x007fffff);
}
case 4:
{
int32 data = pPCMData[3] << 24 | pPCMData[2] << 16 | pPCMData[1] << 8 | pPCMData[0];
return ((float)data) / ((float)0x7fffffff);
}
default:
{
return 0.0f;
}
}
}

//=====================================================================================
// Wave File Writing Code
//=====================================================================================
{
//the main chunk
unsigned char m_szChunkID[4];      //0
uint32        m_nChunkSize;        //4
unsigned char m_szFormat[4];       //8

//sub chunk 1 "fmt "
unsigned char m_szSubChunk1ID[4];  //12
uint32        m_nSubChunk1Size;    //16
uint16        m_nAudioFormat;      //18
uint16        m_nNumChannels;      //20
uint32        m_nSampleRate;       //24
uint32        m_nByteRate;         //28
uint16        m_nBlockAlign;       //30
uint16        m_nBitsPerSample;    //32

//sub chunk 2 "data"
unsigned char m_szSubChunk2ID[4];  //36
uint32        m_nSubChunk2Size;    //40

//then comes the data!
};

//this writes a wave file
template
bool WriteWaveFile(const char *fileName, const std::vector &samples, const SSoundSettings &sound)
{
//open the file if we can
FILE *file = fopen(fileName, "w+b");
if (!file)
return false;

//calculate bits per sample and the data size
const int32 bitsPerSample = sizeof(T) * 8;
const int dataSize = samples.size() * sizeof(T);

//fill out the main chunk

//fill out sub chunk 1 "fmt "
waveHeader.m_nByteRate = sound.m_sampleRate.Value() * 1 * bitsPerSample / 8;
waveHeader.m_nBlockAlign = 1 * bitsPerSample / 8;

//fill out sub chunk 2 "data"

//write the wave data itself, converting it from float to the type specified
std::vector outSamples;
outSamples.resize(samples.size());
for (size_t index = 0; index < samples.size(); ++index)
outSamples[index] = AmplitudeToAudioSample(samples[index]);
fwrite(&outSamples[0], dataSize, 1, file);

//close the file and return success
fclose(file);
return true;
}

//loads a wave file in.  Converts from source format into the specified format
// TOTAL HONESTY: some wave files seem to have problems being loaded through this function and I don't have
// time to investigate why.  It seems to work best with 16 bit mono wave files.
// If you need more robust file loading, check out libsndfile at http://www.mega-nerd.com/libsndfile/
bool ReadWaveFile(const char *fileName, std::vector& samples, int32 sampleRate)
{
//open the file if we can
FILE *File = fopen(fileName, "rb");
if (!File)
{
return false;
}

//read the main chunk ID and make sure it's "RIFF"
char buffer[5];
buffer[4] = 0;
if (fread(buffer, 4, 1, File) != 1 || strcmp(buffer, "RIFF"))
{
fclose(File);
return false;
}

uint32 nChunkSize;
if (fread(&nChunkSize, 4, 1, File) != 1)
{
fclose(File);
return false;
}

//read the format and make sure it's "WAVE"
if (fread(buffer, 4, 1, File) != 1 || strcmp(buffer, "WAVE"))
{
fclose(File);
return false;
}

long chunkPosFmt = -1;
long chunkPosData = -1;

while (chunkPosFmt == -1 || chunkPosData == -1)
{
//read a sub chunk id and a chunk size if we can
if (fread(buffer, 4, 1, File) != 1 || fread(&nChunkSize, 4, 1, File) != 1)
{
fclose(File);
return false;
}

//if we hit a fmt
if (!strcmp(buffer, "fmt "))
{
chunkPosFmt = ftell(File) - 8;
}
//else if we hit a data
else if (!strcmp(buffer, "data"))
{
chunkPosData = ftell(File) - 8;
}

fseek(File, nChunkSize, SEEK_CUR);
}

//we'll use this handy struct to load in

//load the fmt part if we can
fseek(File, chunkPosFmt, SEEK_SET);
if (fread(&waveData.m_szSubChunk1ID, 24, 1, File) != 1)
{
fclose(File);
return false;
}

//load the data part if we can
fseek(File, chunkPosData, SEEK_SET);
if (fread(&waveData.m_szSubChunk2ID, 8, 1, File) != 1)
{
fclose(File);
return false;
}

//verify a couple things about the file data
if (waveData.m_nAudioFormat != 1 ||       //only pcm data
waveData.m_nNumChannels  2 ||        //must not have more than 2
waveData.m_nBitsPerSample > 32 ||     //32 bits per sample max
waveData.m_nBitsPerSample % 8 != 0 || //must be a multiple of 8 bites
waveData.m_nBlockAlign > 8)           //blocks must be 8 bytes or lower
{
fclose(File);
return false;
}

//figure out how many samples and blocks there are total in the source data
int nBytesPerBlock = waveData.m_nBlockAlign;
int nNumBlocks = waveData.m_nSubChunk2Size / nBytesPerBlock;
int nNumSourceSamples = nNumBlocks * waveData.m_nNumChannels;

//allocate space for the source samples
samples.resize(nNumSourceSamples);

//maximum size of a block is 8 bytes.  4 bytes per samples, 2 channels
unsigned char pBlockData[8];
memset(pBlockData, 0, 8);

//read in the source samples at whatever sample rate / number of channels it might be in
int nBytesPerSample = nBytesPerBlock / waveData.m_nNumChannels;
for (int nIndex = 0; nIndex = TPhase(1.0f))
phase -= TPhase(1.0f);
while (phase  0.5f ? 1.0f : -1.0f);
}

TAmplitude AdvanceOscilator_Triangle(TPhase &phase, TFrequency frequency, TSamples sampleRate)
{
if (phase > TPhase(0.5f))
return TAmplitude((((1.0f - phase.Value()) * 2.0f) * 2.0f) - 1.0f);
else
return TAmplitude(((phase.Value() * 2.0f) * 2.0f) - 1.0f);
}

TAmplitude AdvanceOscilator_Saw_BandLimited(TPhase &phase, TFrequency frequency, TSamples sampleRate)
{

// sum the harmonics
TAmplitude ret(0.0f);
for (int harmonicIndex = 1; harmonicIndex <= 4; ++harmonicIndex)
{
TPhase harmonicPhase = phase * TPhase((float)harmonicIndex);
ret += TAmplitude(sin(harmonicPhase.Value()*c_twoPi) / (float)harmonicIndex);
}

ret *= TAmplitude(2.0f / c_pi);

return ret;
}

TAmplitude AdvanceOscilator_Square_BandLimited(TPhase &phase, TFrequency frequency, TSamples sampleRate)
{

// sum the harmonics
TAmplitude ret(0.0f);
for (int harmonicIndex = 1; harmonicIndex <= 4; ++harmonicIndex)
{
float harmonicFactor = (float)harmonicIndex * 2.0f - 1.0f;
TPhase harmonicPhase = phase * TPhase(harmonicFactor);
ret += TAmplitude(sin(harmonicPhase.Value()*c_twoPi) / harmonicFactor);
}

ret *= TAmplitude(4.0f / c_pi);

return ret;
}

TAmplitude AdvanceOscilator_Triangle_BandLimited(TPhase &phase, TFrequency frequency, TSamples sampleRate)
{

// sum the harmonics
TAmplitude ret(0.0f);
TAmplitude signFlip(1.0f);
for (int harmonicIndex = 1; harmonicIndex <= 4; ++harmonicIndex)
{
float harmonicFactor = (float)harmonicIndex * 2.0f - 1.0f;
TPhase harmonicPhase = phase * TPhase(harmonicFactor);
ret += TAmplitude(sin(harmonicPhase.Value()*c_twoPi) / (harmonicFactor*harmonicFactor)) * signFlip;
signFlip *= TAmplitude(-1.0f);
}

ret *= TAmplitude(8.0f / (c_pi*c_pi));

return ret;
}

//=====================================================================================
// Main
//=====================================================================================
int main(int argc, char **argv)
{
// wetness parameter of reverb.  It's a percentage from 0 to 1.
TAmplitude c_reverbWetness(1.0f);

// keep track of when the process started so we can report how long it took
auto start = std::chrono::system_clock::now().time_since_epoch();

//our desired sound parameters
SSoundSettings sound;
sound.m_sampleRate = TSamples(44100);

// load the input wave file if we can and normalize it
std::vector inputData;
{
printf("could not load input wave file!");
return 0;
}
NormalizeSamples(inputData, TAmplitude(1.0f), TSamples(0));

// load the reverb wave file if we can and normalize it
std::vector reverbData;
{
printf("could not load reverb wave file!");
return 0;
}
NormalizeSamples(reverbData, TAmplitude(1.0f), TSamples(0));

// allocate space for the output file - which will be the number of samples in the two input files minus 1
// initialize it to silence!
std::vector samples;
samples.resize(inputData.size() + reverbData.size() - 1);
std::fill(samples.begin(), samples.end(), TAmplitude(0.0f));

// reverse the reverb data in preparation of convolution
std::reverse(reverbData.begin(), reverbData.end());

// report the number of samples of each file
printf("Input samples = %un", inputData.size());
printf("Reverb samples = %un", reverbData.size());

// apply the convolution for each output sample
float lastPercentageReported = 0.0f;
char percentageText[512];
percentageText[0] = 0;
printf("Progress: ");
for (TSamples sampleIndex(0), numSamples(samples.size()); sampleIndex < numSamples; ++sampleIndex)
{
// print some periodic output since this can take a while!
float percentage = sampleIndex.Divide(numSamples);
if (percentage >= lastPercentageReported + 0.01f)
{
// erase the last progress text
for (int i = 0, c = strlen(percentageText); i < c; ++i)
printf("%c %c", 8, 8);

// calculte and show progress text
lastPercentageReported = percentage;
auto duration = std::chrono::system_clock::now().time_since_epoch() - start;
auto millis = std::chrono::duration_cast(duration).count();
float expectedMillis = millis / percentage;
sprintf(percentageText, "%i%%  %0.2f / %0.2f seconds (%0.2f remaining)", int(percentage*100.0f), ((float)millis) / 1000.0f, expectedMillis / 1000.0f, (expectedMillis - millis) / 1000.0f);
printf("%s", percentageText);
}

// get a reference to our output sample
TAmplitude &outputSample = samples[sampleIndex.Value()];

// figure out the first reverb index and input index we should process.
TSamples startReverbIndex(0);
TSamples startInputIndex(0);
if (sampleIndex < TSamples(reverbData.size()))
startReverbIndex = TSamples(reverbData.size()) - sampleIndex - TSamples(1);
else
startInputIndex = sampleIndex - TSamples(reverbData.size());

// for each reverb sample
for (TSamples reverbIndex(startReverbIndex), numReverbSamples(reverbData.size()), inputIndex(startInputIndex), numInputSamples(inputData.size());
reverbIndex < numReverbSamples && inputIndex < numInputSamples;
++reverbIndex, ++inputIndex)
{
const TAmplitude &inputSample = inputData[inputIndex.Value()];
const TAmplitude &reverbSample = reverbData[reverbIndex.Value()];
outputSample += inputSample * reverbSample;
}
}

// normalize the reverb to the wetness volume level
NormalizeSamples(samples, c_reverbWetness, TSamples(0));

// apply the dry sound on top, per the wetness settings
for (TSamples inputIndex = TSamples(0), numInputSamples(inputData.size()); inputIndex < numInputSamples; ++inputIndex)
{
const TAmplitude &inputSample = inputData[inputIndex.Value()];
TAmplitude &outputSample = samples[inputIndex.Value()];
outputSample += inputSample * (TAmplitude(1.0f) - c_reverbWetness);
}

// normalize the amplitude of the samples to make sure they are as loud as possible without clipping.
// give 3db of headroom and also put an envelope on the front and back that is 50ms
NormalizeSamples(samples, DBToAmplitude(TDecibels(-3.0f)), MilliSecondsToSamples(sound, 50.0f));

// save as a wave file
WriteWaveFile("out.wav", samples, sound);
return 0;
}


Intuition

I had to think about it for a bit but the intuition for why this works is actually pretty straight forward!

The impulse response represents all the echos that occur over time if you make a single sample of 1.0 amplitude. This is your guide for how all samples in an impulse source (IS) should be transformed, where IS is the sound you want to add reverb to.

So, starting out, you put the entire IR into your output buffer starting at output[0] (the first sample in the output buffer), multiplied by IS[0] (the first sample of the sound you are reverbing).

This leaves you with the proper reverb of the first sample in your IS, and your total output buffer length is the same length as your IR.

We only have one sample reverbed though, so we move to the next…

Next, you put the entire IR into the output buffer again, but this time start at output[1] (the second sample in the output buffer) and those IR values you are writing need to be multiplied by IS[1] (the second sample of the sound you are reverbing).

You now have the proper reverb for the first two samples in your IS, and your total output buffer length is the length of your IR + 1.

Rinse and repeat for all samples in your IS, and at the end, the reverb of all samples in your IS will be accounted for in your output buffer, and the size of your output buffer will be the length of your IR plus the length of your IS, minus one.

Pretty simple and intuitive right?

You are essentially playing the IR in the output buffer length(IS) times, where IS[outputBufferIndex] determines the volume (amplitude) that you need to play the IR at.

Convolution does exactly this, which turns out to be pretty slow due to it being a lot of math to preform.

If you have a source file (IS) you are reverbing that is 4 seconds long, running at 44100 samples per second (176,400 samples total), and you have a reverb impulse response (IR) that is 2 seconds long at 44100 samples per second (88,200 samples total), that means that you are going to essentially be mixing the IR into an output buffer 176,400 times.

Each time you play the IR you need to do a multiplication per IR sample (to scale it to the IS[index] amplitude) and an addition to add it to the resulting output buffer.

At 88,200 IR samples with a multiply and add for each sample, and doing that 176,400 times, that means at the end you will need to do 15,558,480,000 (15.5 billion) multiplies and adds.

And remember… that is only for a 4 second sound that is receiving a 2 second reverb… those are pretty small sound files involved! And, that is only a single channel. That would double in stereo, and be even worse in 5.1 or 7.1 surround sound!

So, this method works, but unfortunately it’s pretty darn slow (it took like 5 minutes for me to convolve the legend quote with the large reverb). It could be made faster by introducing threading and SSE instructions, but there is a better way that gives us an even bigger speed up.

Having audio samples means you are working in the “time domain”. If you take a fourier transform of the audio samples, you’ll have the information in the “frequency domain”. As it turns out, if you do a multiplication in the frequency domain, that is equivalent to doing a convolution in the time domain. Using the fast fourier transform on your input sound and multiplying that by the pre-transformed FFT of the impulse response, and then doing an inverse FFT on the result to get back into the time domain (aka audio samples) is MUCH FASTER than doing actual convolution.

Another problem with convolution reverb is that you need the full sound you are convolving, which makes it basically impossible to use in a live setup. The solution to this is that people convolve windows of data at a time, instead of the whole sound. I think a common window size is 256 samples.

I’ll make a post in the future that addresses both those issues to allow for fast, real time convolution reverb via windowed FFT multiplication.

Also, I said in the last post that this technique is what the pros do. We’ll I should add that this is what they do SOMETIMES. Other times they use DSP algorithms involving comb filters and multi tap delays (and other things) to make fast reverb that sounds pretty decent without needing to do convolution or FFT/IFFT. Check the links section for a link to the details of one such algorithm.

Regarding IR samples (Impulse Response recordings), if you record your own, it should work as is. However, a common thing that people do is remove the loud sound from the impulse response (via deconvolution i think?) to get ONLY the impulse response. It basically makes it so that the IR really is just an IR of that room as if you had a single sample of a “1.0” echoing in the area. Without this step, your reverb convolution will still work, but you’ll get “smearing” of the sound, due to it not being a true 1.0 sample (or in other words, not a true dirac delta).

Luckily there are IR’s available for download online as well. Some are free, some are pay. Check the links section for a few links I found for that (:

Lastly, this post basically teaches you how to do 1 dimensional convolution, showing you an application for it. You can do convolution in higher dimensions as well though, and in fact if you have used photoshop and know about things like gaussian blur, sharpen mask, etc, those guys work by 2d convolution. Bokeh is even apparently convolution, as is a “Minkowski Sum” if you have done work in game physics.

Ill make a graphics related 2d convolution post in the future as well to delve into that stuff a bit deeper.

As one final sound sample, check out this CROSS CORRELATION (not convolution) of SoundBite.wav and ReverbLarge. Cross correlation is the same as convolution except you don’t reverse one of the samples. So, in effect, it’s like I convolved SoundBite.wav with ReverbLarge.wav played backwards.

c_ReverseRL_SBite.wav

More info on cross correlation coming soon. It has uses in audio, but more often it’s used for finding time delays of sounds compared to other sounds, recognizing patterns in sound data even with the presence of noise / distortion, versus having actual audible uses (although there are some of those too!).