# DIY Synth: Convolution Reverb & 1D Discrete Convolution of Audio Samples

This is a part of the DIY Synthesizer series of posts where each post is roughly built upon the knowledge of the previous posts. If you are lost, check the earlier posts!

If there is a physical location that has the reverb you are looking for (a church, a cave, your bathroom, the subway, whatever), you can go there and record the result of making some short, loud sound (a clap, a starter pistol, whatever). Once you have that sound file, you can use a mathematical technique called “Discrete Convolution” (not as difficult as it sounds) to apply that reverb to other sounds.

Not only that, you can use convolution to apply various properties of one sound to another sound. Like you could convolve a saw wave and a recording of a voice and make it sound robotic.

This is pretty cool right?

## Convolution

There are two types of convolution out there… continuous convolution and discrete convolution.

Continuous convolution is a method where you can take two formulas and perform some operations (algebra, calculus, etc) to get a third formula.

In the cases we programmers usually work with, however, you only have data, not a mathematical formula describing that data.

When you have specific data points instead of an equation, you can use something called discrete convolution instead.

Discrete convolution is a method where you take two streams of data (i.e. two arrays of data!) and perform some operations (only looping, addition and multiplication!) to get a third stream of data.

Interestingly, this third stream of data will be the size of the first stream of data, plus the size of the second stream of data, minus one.

This link explains discrete convolution much better than I could, and even has an interactive demo, so check it out: Discrete Convolution

Here is some pseudo code that shows how discrete convolution works.

```
// We are calculating bufferA * bufferB (discrete convolution)
// initialize out buffer to empty, and make sure it's the right size
outBuffer[sizeof(bufferA)+sizeof(bufferB)-1] = 0

// reverse one of the buffers in preparation of the convolution
reverse(bufferB);

// Convolve!
for (int outBufferIndex = 0; outBufferIndex < sizeof(outBuffer); ++outBufferIndex)
{
    bufferAIndex = 0;
    bufferBIndex = 0;
    if (outBufferIndex < sizeof(bufferB))
        bufferBIndex = sizeof(bufferB) - outBufferIndex - 1;
    else
        bufferAIndex = outBufferIndex - sizeof(bufferB) + 1;

    for (; bufferAIndex < sizeof(bufferA) && bufferBIndex < sizeof(bufferB); ++bufferAIndex, ++bufferBIndex)
    {
        outBuffer[outBufferIndex] += bufferA[bufferAIndex] * bufferB[bufferBIndex];
    }
}
```

That's all there is to it!
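If you'd rather see it as real, compilable code, here is a minimal C++ version written directly from the definition (out[n] is the sum of a[k] * b[n-k]), so no buffer reversal is needed. It convolves {1,2,3} with {4,5} and prints the 3+2-1 = 4 output samples:

```
#include <stdio.h>
#include <vector>

// discrete convolution: out[n] = sum over k of a[k] * b[n - k]
std::vector<float> Convolve(const std::vector<float>& a, const std::vector<float>& b)
{
    std::vector<float> out(a.size() + b.size() - 1, 0.0f);
    for (size_t n = 0; n < out.size(); ++n)
    {
        for (size_t k = 0; k < a.size(); ++k)
        {
            // the b index is n - k, which must be a valid index into b
            if (n >= k && n - k < b.size())
                out[n] += a[k] * b[n - k];
        }
    }
    return out;
}

int main()
{
    std::vector<float> result = Convolve({ 1.0f, 2.0f, 3.0f }, { 4.0f, 5.0f });
    for (float f : result)
        printf("%0.0f ", f); // prints: 4 13 22 15
    return 0;
}
```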

## Convolution Reverb

Like I mentioned above, to do convolution reverb you just convolve the audio file you want to add reverb to with the recorded sound from the place that has the reverb you want. That second sound is called the Impulse Response, or IR for short.

Besides letting you choose a reverb model (like “cavernous” or “small hallway” or “bright tiled room”), many reverbs will have a “wetness” parameter that controls how much reverb you hear versus how much of the original noise you hear. A wetness of 1.0 means all you hear is the reverb (the convolution), while a wetness of 0.0 means all you hear is the original sound (the dry sound). It’s just a percentage and literally is just used to lerp between the two signals.
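In code, that lerp might look something like this (a minimal sketch, with everything as plain floats):

```
// wetness is 0.0 (all dry) to 1.0 (all wet)
float MixDryWet(float drySample, float wetSample, float wetness)
{
    return wetSample * wetness + drySample * (1.0f - wetness);
}
```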

Here is an example of how changes in wetness can sound, which also gives you a first glimpse of just how well this technique works!

Dry Sound: SoundBite
Impulse Response: ReverbLarge

Convolved with 100% wetness: c_RL_SBite.wav
Convolved with 66% wetness: c_RL_SBite_66.wav
Convolved with 33% wetness: c_RL_SBite_33.wav

## Sample Sound Files

Let’s convolve some sounds together and see what results we get!

Here are 11 sounds convolved against each other (and themselves) at 100% wetness so all you hear is the convolution. Convolution is commutative (A*B == B*A) so that cuts down on the number of combinations (66 instead of 121!)

There are 3 impulse responses, 4 short basic wave forms, 2 voice samples, a cymbal drum pattern and a jaw harp twang.

Some of the results are pretty interesting, for instance check out SoundBite vs SoundCymbal. The beat of the cymbal pattern remains, but the actual cymbal sound itself is replaced by the words!

## Sample Code

Here is some sample code which will read in “in.wav” and “reverb.wav”, do convolution on them and save the fully wet result (convolution result only, no original in.wav mixed into the output) as “out.wav”.

Note that this code was used to process the sample sound files above. Usual caveat: The sound loading code isn’t bulletproof. I’ve found it works best with 16 bit mono sound files. If you need better sound loading, check out libsndfile!

```
#define _CRT_SECURE_NO_WARNINGS

#include <stdio.h>
#include <string.h>
#include <stdint.h>
#include <stdlib.h>
#include <vector>
#include <algorithm>
#include <chrono>
#include <limits>

#define _USE_MATH_DEFINES
#include <math.h>

//=====================================================================================
// SNumeric - uses phantom types to enforce type safety
//=====================================================================================
template <typename T, typename PHANTOM_TYPE>
struct SNumeric
{
public:
explicit SNumeric(const T &value) : m_value(value) { }
SNumeric() : m_value() { }
inline T& Value() { return m_value; }
inline const T& Value() const { return m_value; }

typedef SNumeric<T, PHANTOM_TYPE> TType;
typedef T TInnerType;

// Math Operations
TType operator+ (const TType &b) const
{
return TType(this->Value() + b.Value());
}

TType operator- (const TType &b) const
{
return TType(this->Value() - b.Value());
}

TType operator* (const TType &b) const
{
return TType(this->Value() * b.Value());
}

TType operator/ (const TType &b) const
{
return TType(this->Value() / b.Value());
}

TType& operator+= (const TType &b)
{
Value() += b.Value();
return *this;
}

TType& operator-= (const TType &b)
{
Value() -= b.Value();
return *this;
}

TType& operator*= (const TType &b)
{
Value() *= b.Value();
return *this;
}

TType& operator/= (const TType &b)
{
Value() /= b.Value();
return *this;
}

TType& operator++ ()
{
Value()++;
return *this;
}

TType& operator-- ()
{
Value()--;
return *this;
}

// Extended Math Operations
template <typename TResult>
TResult Divide(const TType &b)
{
return ((TResult)this->Value()) / ((TResult)b.Value());
}

// Logic Operations
bool operator< (const TType &b) const {
return this->Value() < b.Value();
}
bool operator<= (const TType &b) const {
return this->Value() <= b.Value();
}
bool operator> (const TType &b) const {
return this->Value() > b.Value();
}
bool operator>= (const TType &b) const {
return this->Value() >= b.Value();
}
bool operator== (const TType &b) const {
return this->Value() == b.Value();
}
bool operator!= (const TType &b) const {
return this->Value() != b.Value();
}

private:
T m_value;
};

//=====================================================================================
// Typedefs
//=====================================================================================

typedef uint8_t uint8;
typedef uint16_t uint16;
typedef uint32_t uint32;
typedef int16_t int16;
typedef int32_t int32;

// type safe types!
typedef SNumeric<float, struct S__Frequency>          TFrequency;
typedef SNumeric<uint32, struct S__TimeMs>            TTimeMs;
typedef SNumeric<uint32, struct S__Samples>           TSamples;
typedef SNumeric<float, struct S__FractionalSamples>  TFractionalSamples;
typedef SNumeric<float, struct S__Decibels>           TDecibels;
typedef SNumeric<float, struct S__Amplitude>          TAmplitude;
typedef SNumeric<float, struct S__Phase>              TPhase;

//=====================================================================================
// Constants
//=====================================================================================

static const float c_pi = (float)M_PI;
static const float c_twoPi = c_pi * 2.0f;

//=====================================================================================
// Structs
//=====================================================================================

struct SSoundSettings
{
TSamples        m_sampleRate;
};

//=====================================================================================
// Conversion Functions
//=====================================================================================
inline TDecibels AmplitudeToDB(TAmplitude volume)
{
return TDecibels(20.0f * log10(volume.Value()));
}

inline TAmplitude DBToAmplitude(TDecibels dB)
{
return TAmplitude(pow(10.0f, dB.Value() / 20.0f));
}

TSamples SecondsToSamples(const SSoundSettings &s, float seconds)
{
return TSamples((int)(seconds * (float)s.m_sampleRate.Value()));
}

TSamples MilliSecondsToSamples(const SSoundSettings &s, float milliseconds)
{
return SecondsToSamples(s, milliseconds / 1000.0f);
}

TTimeMs SecondsToMilliseconds(float seconds)
{
return TTimeMs((uint32)(seconds * 1000.0f));
}

TFrequency Frequency(float octave, float note)
{
/* frequency = 440×(2^(n/12))
Notes:
0  = A
1  = A#
2  = B
3  = C
4  = C#
5  = D
6  = D#
7  = E
8  = F
9  = F#
10 = G
11 = G# */
return TFrequency((float)(440 * pow(2.0, ((double)((octave - 4) * 12 + note)) / 12.0)));
}

template <typename T>
T AmplitudeToAudioSample(const TAmplitude& in)
{
const T c_min = std::numeric_limits<T>::min();
const T c_max = std::numeric_limits<T>::max();
const float c_minFloat = (float)c_min;
const float c_maxFloat = (float)c_max;

float ret = in.Value() * c_maxFloat;

if (ret < c_minFloat)
return c_min;

if (ret > c_maxFloat)
return c_max;

return (T)ret;
}

TAmplitude GetLerpedAudioSample(const std::vector<TAmplitude>& samples, const TFractionalSamples& index)
{
// get the index of each sample and the fractional blend amount
uint32 a = (uint32)floor(index.Value());
uint32 b = a + 1;
float fract = index.Value() - floor(index.Value());

// get our two amplitudes
float ampA = 0.0f;
if (a < samples.size())
ampA = samples[a].Value();

float ampB = 0.0f;
if (b < samples.size())
ampB = samples[b].Value();

// return the lerped result
return TAmplitude(fract * ampB + (1.0f - fract) * ampA);
}

void NormalizeSamples(std::vector<TAmplitude>& samples, TAmplitude maxAmplitude, TSamples envelopeTimeFrontBack)
{
// nothing to do if no samples
if (samples.size() == 0)
return;

// 1) find the largest absolute value in the samples.
TAmplitude largestAbsVal = TAmplitude(abs(samples.front().Value()));
std::for_each(samples.begin() + 1, samples.end(), [&largestAbsVal](const TAmplitude &a)
{
TAmplitude absVal = TAmplitude(abs(a.Value()));
if (absVal > largestAbsVal)
largestAbsVal = absVal;
}
);

// 2) adjust largestAbsVal so that when we divide all samples, none will be bigger than maxAmplitude
// if the value we are going to divide by is <= 0, bail out
largestAbsVal /= maxAmplitude;
if (largestAbsVal <= TAmplitude(0.0f))
return;

// 3) divide all numbers by the largest absolute value seen so all samples are [-maxAmplitude,+maxAmplitude]
// also apply front and back envelope
const TSamples c_frontEnvelopeEnd(envelopeTimeFrontBack);
const TSamples c_backEnvelopeStart(samples.size() - envelopeTimeFrontBack.Value());

for (TSamples index(0), numSamples(samples.size()); index < numSamples; ++index)
{
// calculate envelope
TAmplitude envelope(1.0f);
if (index < c_frontEnvelopeEnd)
envelope = TAmplitude(index.Divide<float>(envelopeTimeFrontBack));
else if (index > c_backEnvelopeStart)
envelope = TAmplitude(1.0f) - TAmplitude((index - c_backEnvelopeStart).Divide<float>(envelopeTimeFrontBack));

// apply envelope while normalizing
samples[index.Value()] = samples[index.Value()] * envelope / largestAbsVal;

}
}

void ResampleData(std::vector<TAmplitude>& samples, int srcSampleRate, int destSampleRate)
{
//if the requested sample rate is the sample rate it already is, bail out and do nothing
if (srcSampleRate == destSampleRate)
return;

//calculate the ratio of the old sample rate to the new
float fResampleRatio = (float)destSampleRate / (float)srcSampleRate;

//calculate how many samples the new data will have and allocate the new sample data
int nNewDataNumSamples = (int)((float)samples.size() * fResampleRatio);

std::vector<TAmplitude> newSamples;
newSamples.resize(nNewDataNumSamples);

//get each lerped output sample.  There are higher quality ways to resample
for (int nIndex = 0; nIndex < nNewDataNumSamples; ++nIndex)
newSamples[nIndex] = GetLerpedAudioSample(samples, TFractionalSamples((float)nIndex / fResampleRatio));

//free the old data and set the new data
std::swap(samples, newSamples);
}

void ChangeNumChannels(std::vector<TAmplitude>& samples, int nSrcChannels, int nDestChannels)
{
//if the number of channels requested is the number of channels already there, or either number of channels is not mono or stereo, return
if (nSrcChannels == nDestChannels ||
nSrcChannels < 1 || nSrcChannels > 2 ||
nDestChannels < 1 || nDestChannels > 2)
{
return;
}

//if converting from mono to stereo, duplicate the mono channel to make stereo
if (nDestChannels == 2)
{
std::vector<TAmplitude> newSamples;
newSamples.resize(samples.size() * 2);
for (size_t index = 0; index < samples.size(); ++index)
{
newSamples[index * 2] = samples[index];
newSamples[index * 2 + 1] = samples[index];
}

std::swap(samples, newSamples);
}
//else converting from stereo to mono, mix the stereo channels together to make mono
else
{
std::vector<TAmplitude> newSamples;
newSamples.resize(samples.size() / 2);
for (size_t index = 0; index < samples.size() / 2; ++index)
newSamples[index] = samples[index * 2] + samples[index * 2 + 1];

std::swap(samples, newSamples);
}
}

float PCMToFloat(unsigned char *pPCMData, int nNumBytes)
{
switch (nNumBytes)
{
case 1:
{
uint8 data = pPCMData[0];
return (float)data / 255.0f;
}
case 2:
{
int16 data = pPCMData[1] << 8 | pPCMData[0];
return ((float)data) / ((float)0x00007fff);
}
case 3:
{
int32 data = pPCMData[2] << 16 | pPCMData[1] << 8 | pPCMData[0];
return ((float)data) / ((float)0x007fffff);
}
case 4:
{
int32 data = pPCMData[3] << 24 | pPCMData[2] << 16 | pPCMData[1] << 8 | pPCMData[0];
return ((float)data) / ((float)0x7fffffff);
}
default:
{
return 0.0f;
}
}
}

//=====================================================================================
// Wave File Writing Code
//=====================================================================================
struct SMinimalWaveFileHeader
{
//the main chunk
unsigned char m_szChunkID[4];      //0
uint32        m_nChunkSize;        //4
unsigned char m_szFormat[4];       //8

//sub chunk 1 "fmt "
unsigned char m_szSubChunk1ID[4];  //12
uint32        m_nSubChunk1Size;    //16
uint16        m_nAudioFormat;      //20
uint16        m_nNumChannels;      //22
uint32        m_nSampleRate;       //24
uint32        m_nByteRate;         //28
uint16        m_nBlockAlign;       //32
uint16        m_nBitsPerSample;    //34

//sub chunk 2 "data"
unsigned char m_szSubChunk2ID[4];  //36
uint32        m_nSubChunk2Size;    //40

//then comes the data!
};

//this writes a wave file
template <typename T>
bool WriteWaveFile(const char *fileName, const std::vector<TAmplitude> &samples, const SSoundSettings &sound)
{
//open the file if we can
FILE *file = fopen(fileName, "w+b");
if (!file)
return false;

//calculate bits per sample and the data size
const int32 bitsPerSample = sizeof(T) * 8;
const int dataSize = samples.size() * sizeof(T);

//fill out the main chunk
SMinimalWaveFileHeader waveHeader;
memcpy(waveHeader.m_szChunkID, "RIFF", 4);
waveHeader.m_nChunkSize = dataSize + 36;
memcpy(waveHeader.m_szFormat, "WAVE", 4);

//fill out sub chunk 1 "fmt "
memcpy(waveHeader.m_szSubChunk1ID, "fmt ", 4);
waveHeader.m_nSubChunk1Size = 16;
waveHeader.m_nAudioFormat = 1;
waveHeader.m_nNumChannels = 1;
waveHeader.m_nSampleRate = sound.m_sampleRate.Value();
waveHeader.m_nByteRate = sound.m_sampleRate.Value() * 1 * bitsPerSample / 8;
waveHeader.m_nBlockAlign = 1 * bitsPerSample / 8;
waveHeader.m_nBitsPerSample = bitsPerSample;

//fill out sub chunk 2 "data"
memcpy(waveHeader.m_szSubChunk2ID, "data", 4);
waveHeader.m_nSubChunk2Size = dataSize;

//write the header
fwrite(&waveHeader, sizeof(SMinimalWaveFileHeader), 1, file);

//write the wave data itself, converting it from float to the type specified
std::vector<T> outSamples;
outSamples.resize(samples.size());
for (size_t index = 0; index < samples.size(); ++index)
outSamples[index] = AmplitudeToAudioSample<T>(samples[index]);
fwrite(&outSamples[0], dataSize, 1, file);

//close the file and return success
fclose(file);
return true;
}

//loads a wave file in.  Converts from source format into the specified format
// TOTAL HONESTY: some wave files seem to have problems being loaded through this function and I don't have
// time to investigate why.  It seems to work best with 16 bit mono wave files.
// If you need more robust file loading, check out libsndfile at http://www.mega-nerd.com/libsndfile/
bool ReadWaveFile(const char *fileName, std::vector<TAmplitude>& samples, int32 sampleRate)
{
//open the file if we can
FILE *File = fopen(fileName, "rb");
if (!File)
{
return false;
}

//read the main chunk ID and make sure it's "RIFF"
char buffer[5];
buffer[4] = 0;
if (fread(buffer, 4, 1, File) != 1 || strcmp(buffer, "RIFF"))
{
fclose(File);
return false;
}

uint32 nChunkSize;
if (fread(&nChunkSize, 4, 1, File) != 1)
{
fclose(File);
return false;
}

//read the format and make sure it's "WAVE"
if (fread(buffer, 4, 1, File) != 1 || strcmp(buffer, "WAVE"))
{
fclose(File);
return false;
}

long chunkPosFmt = -1;
long chunkPosData = -1;

while (chunkPosFmt == -1 || chunkPosData == -1)
{
//read a sub chunk id and a chunk size if we can
if (fread(buffer, 4, 1, File) != 1 || fread(&nChunkSize, 4, 1, File) != 1)
{
fclose(File);
return false;
}

//if we hit a fmt
if (!strcmp(buffer, "fmt "))
{
chunkPosFmt = ftell(File) - 8;
}
//else if we hit a data
else if (!strcmp(buffer, "data"))
{
chunkPosData = ftell(File) - 8;
}

fseek(File, nChunkSize, SEEK_CUR);
}

//we'll use this handy struct to load in
SMinimalWaveFileHeader waveData;

//load the fmt part if we can
fseek(File, chunkPosFmt, SEEK_SET);
if (fread(&waveData.m_szSubChunk1ID, 24, 1, File) != 1)
{
fclose(File);
return false;
}

//load the data part if we can
fseek(File, chunkPosData, SEEK_SET);
if (fread(&waveData.m_szSubChunk2ID, 8, 1, File) != 1)
{
fclose(File);
return false;
}

//verify a couple things about the file data
if (waveData.m_nAudioFormat != 1 ||       //only pcm data
waveData.m_nNumChannels < 1 ||            //must have at least one channel
waveData.m_nNumChannels > 2 ||            //must not have more than 2
waveData.m_nBitsPerSample > 32 ||         //32 bits per sample max
waveData.m_nBitsPerSample % 8 != 0 ||     //must be a multiple of 8 bits
waveData.m_nBlockAlign > 8)               //blocks must be 8 bytes or lower
{
fclose(File);
return false;
}

//figure out how many samples and blocks there are total in the source data
int nBytesPerBlock = waveData.m_nBlockAlign;
int nNumBlocks = waveData.m_nSubChunk2Size / nBytesPerBlock;
int nNumSourceSamples = nNumBlocks * waveData.m_nNumChannels;

//allocate space for the source samples
samples.resize(nNumSourceSamples);

//maximum size of a block is 8 bytes.  4 bytes per samples, 2 channels
unsigned char pBlockData[8];
memset(pBlockData, 0, 8);

//read in the source samples at whatever sample rate / number of channels it might be in
int nBytesPerSample = nBytesPerBlock / waveData.m_nNumChannels;
for (int nIndex = 0; nIndex < nNumSourceSamples; nIndex += waveData.m_nNumChannels)
{
//read the next block
if (fread(pBlockData, nBytesPerBlock, 1, File) != 1)
{
fclose(File);
return false;
}

//convert the block to floating point samples
samples[nIndex] = TAmplitude(PCMToFloat(pBlockData, nBytesPerSample));
if (waveData.m_nNumChannels == 2)
samples[nIndex + 1] = TAmplitude(PCMToFloat(&pBlockData[nBytesPerSample], nBytesPerSample));
}

//convert the data to mono and to the requested sample rate
ChangeNumChannels(samples, waveData.m_nNumChannels, 1);
ResampleData(samples, waveData.m_nSampleRate, sampleRate);

//close the file and return success
fclose(File);
return true;
}

//=====================================================================================
// Oscilators
//=====================================================================================
TAmplitude AdvanceOscilator_Square(TPhase &phase, TFrequency frequency, TSamples sampleRate)
{
phase += TPhase(frequency.Value() / (float)sampleRate.Value());
while (phase >= TPhase(1.0f))
phase -= TPhase(1.0f);
while (phase < TPhase(0.0f))
phase += TPhase(1.0f);
return TAmplitude(phase.Value() > 0.5f ? 1.0f : -1.0f);
}

TAmplitude AdvanceOscilator_Triangle(TPhase &phase, TFrequency frequency, TSamples sampleRate)
{
if (phase > TPhase(0.5f))
return TAmplitude((((1.0f - phase.Value()) * 2.0f) * 2.0f) - 1.0f);
else
return TAmplitude(((phase.Value() * 2.0f) * 2.0f) - 1.0f);
}

TAmplitude AdvanceOscilator_Saw_BandLimited(TPhase &phase, TFrequency frequency, TSamples sampleRate)
{

// sum the harmonics
TAmplitude ret(0.0f);
for (int harmonicIndex = 1; harmonicIndex <= 4; ++harmonicIndex)
{
TPhase harmonicPhase = phase * TPhase((float)harmonicIndex);
ret += TAmplitude(sin(harmonicPhase.Value()*c_twoPi) / (float)harmonicIndex);
}

ret *= TAmplitude(2.0f / c_pi);

return ret;
}

TAmplitude AdvanceOscilator_Square_BandLimited(TPhase &phase, TFrequency frequency, TSamples sampleRate)
{

// sum the harmonics
TAmplitude ret(0.0f);
for (int harmonicIndex = 1; harmonicIndex <= 4; ++harmonicIndex)
{
float harmonicFactor = (float)harmonicIndex * 2.0f - 1.0f;
TPhase harmonicPhase = phase * TPhase(harmonicFactor);
ret += TAmplitude(sin(harmonicPhase.Value()*c_twoPi) / harmonicFactor);
}

ret *= TAmplitude(4.0f / c_pi);

return ret;
}

TAmplitude AdvanceOscilator_Triangle_BandLimited(TPhase &phase, TFrequency frequency, TSamples sampleRate)
{

// sum the harmonics
TAmplitude ret(0.0f);
TAmplitude signFlip(1.0f);
for (int harmonicIndex = 1; harmonicIndex <= 4; ++harmonicIndex)
{
float harmonicFactor = (float)harmonicIndex * 2.0f - 1.0f;
TPhase harmonicPhase = phase * TPhase(harmonicFactor);
ret += TAmplitude(sin(harmonicPhase.Value()*c_twoPi) / (harmonicFactor*harmonicFactor)) * signFlip;
signFlip *= TAmplitude(-1.0f);
}

ret *= TAmplitude(8.0f / (c_pi*c_pi));

return ret;
}

//=====================================================================================
// Main
//=====================================================================================
int main(int argc, char **argv)
{
// wetness parameter of reverb.  It's a percentage from 0 to 1.
TAmplitude c_reverbWetness(1.0f);

// keep track of when the process started so we can report how long it took
auto start = std::chrono::system_clock::now().time_since_epoch();

//our desired sound parameters
SSoundSettings sound;
sound.m_sampleRate = TSamples(44100);

// load the input wave file if we can and normalize it
std::vector<TAmplitude> inputData;
if (!ReadWaveFile("in.wav", inputData, sound.m_sampleRate.Value()))
{
printf("could not load input wave file!\n");
return 0;
}
NormalizeSamples(inputData, TAmplitude(1.0f), TSamples(0));

// load the reverb wave file if we can and normalize it
std::vector<TAmplitude> reverbData;
if (!ReadWaveFile("reverb.wav", reverbData, sound.m_sampleRate.Value()))
{
printf("could not load reverb wave file!\n");
return 0;
}
NormalizeSamples(reverbData, TAmplitude(1.0f), TSamples(0));

// allocate space for the output file - which will be the number of samples in the two input files minus 1
// initialize it to silence!
std::vector<TAmplitude> samples;
samples.resize(inputData.size() + reverbData.size() - 1);
std::fill(samples.begin(), samples.end(), TAmplitude(0.0f));

// reverse the reverb data in preparation of convolution
std::reverse(reverbData.begin(), reverbData.end());

// report the number of samples of each file
printf("Input samples = %un", inputData.size());
printf("Reverb samples = %un", reverbData.size());

// apply the convolution for each output sample
float lastPercentageReported = 0.0f;
char percentageText[512];
percentageText[0] = 0;
printf("Progress: ");
for (TSamples sampleIndex(0), numSamples(samples.size()); sampleIndex < numSamples; ++sampleIndex)
{
// print some periodic output since this can take a while!
float percentage = sampleIndex.Divide<float>(numSamples);
if (percentage >= lastPercentageReported + 0.01f)
{
// erase the last progress text
for (int i = 0, c = strlen(percentageText); i < c; ++i)
printf("%c %c", 8, 8);

// calculate and show progress text
lastPercentageReported = percentage;
auto duration = std::chrono::system_clock::now().time_since_epoch() - start;
auto millis = std::chrono::duration_cast<std::chrono::milliseconds>(duration).count();
float expectedMillis = millis / percentage;
sprintf(percentageText, "%i%%  %0.2f / %0.2f seconds (%0.2f remaining)", int(percentage*100.0f), ((float)millis) / 1000.0f, expectedMillis / 1000.0f, (expectedMillis - millis) / 1000.0f);
printf("%s", percentageText);
}

// get a reference to our output sample
TAmplitude &outputSample = samples[sampleIndex.Value()];

// figure out the first reverb index and input index we should process.
TSamples startReverbIndex(0);
TSamples startInputIndex(0);
if (sampleIndex < TSamples(reverbData.size()))
startReverbIndex = TSamples(reverbData.size()) - sampleIndex - TSamples(1);
else
startInputIndex = sampleIndex - TSamples(reverbData.size()) + TSamples(1);

// for each reverb sample
for (TSamples reverbIndex(startReverbIndex), numReverbSamples(reverbData.size()), inputIndex(startInputIndex), numInputSamples(inputData.size());
reverbIndex < numReverbSamples && inputIndex < numInputSamples;
++reverbIndex, ++inputIndex)
{
const TAmplitude &inputSample = inputData[inputIndex.Value()];
const TAmplitude &reverbSample = reverbData[reverbIndex.Value()];
outputSample += inputSample * reverbSample;
}
}

// normalize the reverb to the wetness volume level
NormalizeSamples(samples, c_reverbWetness, TSamples(0));

// apply the dry sound on top, per the wetness settings
for (TSamples inputIndex = TSamples(0), numInputSamples(inputData.size()); inputIndex < numInputSamples; ++inputIndex)
{
const TAmplitude &inputSample = inputData[inputIndex.Value()];
TAmplitude &outputSample = samples[inputIndex.Value()];
outputSample += inputSample * (TAmplitude(1.0f) - c_reverbWetness);
}

// normalize the amplitude of the samples to make sure they are as loud as possible without clipping.
// give 3db of headroom and also put an envelope on the front and back that is 50ms
NormalizeSamples(samples, DBToAmplitude(TDecibels(-3.0f)), MilliSecondsToSamples(sound, 50.0f));

// save as a wave file
WriteWaveFile("out.wav", samples, sound);
return 0;
}
```

## Intuition

I had to think about it for a bit, but the intuition for why this works is actually pretty straightforward!

The impulse response represents all the echos that occur over time if you make a single sample of 1.0 amplitude. This is your guide for how all samples in an impulse source (IS) should be transformed, where IS is the sound you want to add reverb to.

So, starting out, you put the entire IR into your output buffer starting at output[0] (the first sample in the output buffer), multiplied by IS[0] (the first sample of the sound you are reverbing).

This leaves you with the proper reverb of the first sample in your IS, and your total output buffer length is the same length as your IR.

We only have one sample reverbed though, so we move to the next…

Next, you put the entire IR into the output buffer again, but this time start at output[1] (the second sample in the output buffer) and those IR values you are writing need to be multiplied by IS[1] (the second sample of the sound you are reverbing).

You now have the proper reverb for the first two samples in your IS, and your total output buffer length is the length of your IR + 1.

Rinse and repeat for all samples in your IS, and at the end, the reverb of all samples in your IS will be accounted for in your output buffer, and the size of your output buffer will be the length of your IR plus the length of your IS, minus one.

Pretty simple and intuitive right?

You are essentially playing the IR in the output buffer length(IS) times, where IS[outputBufferIndex] determines the volume (amplitude) that you need to play the IR at.
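That intuition translates directly into code. Here's a sketch of convolution written in that "scatter" style: the loops are organized around the intuition (play the whole IR into the output once per input sample, scaled by that input sample), but it computes the same result as the earlier pseudo code:

```
#include <vector>

std::vector<float> ConvolveScatter(const std::vector<float>& is, const std::vector<float>& ir)
{
    // output length is length(IS) + length(IR) - 1
    std::vector<float> out(is.size() + ir.size() - 1, 0.0f);

    // for each input sample, mix in the entire IR, scaled by that input sample
    for (size_t isIndex = 0; isIndex < is.size(); ++isIndex)
        for (size_t irIndex = 0; irIndex < ir.size(); ++irIndex)
            out[isIndex + irIndex] += is[isIndex] * ir[irIndex];

    return out;
}
```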

Convolution does exactly this, which turns out to be pretty slow due to it being a lot of math to perform.

If you have a source file (IS) you are reverbing that is 4 seconds long, running at 44100 samples per second (176,400 samples total), and you have a reverb impulse response (IR) that is 2 seconds long at 44100 samples per second (88,200 samples total), that means that you are going to essentially be mixing the IR into an output buffer 176,400 times.

Each time you play the IR you need to do a multiplication per IR sample (to scale it to the IS[index] amplitude) and an addition to add it to the resulting output buffer.

At 88,200 IR samples with a multiply and add for each sample, and doing that 176,400 times, that means at the end you will need to do 15,558,480,000 (15.5 billion) multiplies and adds.

And remember… that is only for a 4 second sound that is receiving a 2 second reverb… those are pretty small sound files involved! And, that is only a single channel. That would double in stereo, and be even worse in 5.1 or 7.1 surround sound!

So, this method works, but unfortunately it’s pretty darn slow (it took like 5 minutes for me to convolve the legend quote with the large reverb). It could be made faster by introducing threading and SSE instructions, but there is a better way that gives us an even bigger speed up.

Having audio samples means you are working in the “time domain”. If you take a Fourier transform of the audio samples, you’ll have the information in the “frequency domain”. As it turns out, doing a multiplication in the frequency domain is equivalent to doing a convolution in the time domain. Using the fast Fourier transform on your input sound, multiplying that by the pre-transformed FFT of the impulse response, and then doing an inverse FFT on the result to get back into the time domain (aka audio samples) is MUCH FASTER than doing actual convolution.
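To make that concrete, here is a minimal sketch of the frequency domain approach. It uses a textbook recursive radix-2 FFT (nothing optimized), zero pads both signals up to a power of 2 size, multiplies per frequency bin, and transforms back:

```
#include <complex>
#include <vector>

typedef std::complex<float> Complex;

// a minimal recursive radix-2 FFT (size must be a power of 2).
// invert = true gives the inverse transform, minus the 1/N scale.
void FFT(std::vector<Complex>& values, bool invert)
{
    size_t n = values.size();
    if (n == 1)
        return;

    // split into even and odd indexed halves and transform each
    std::vector<Complex> even(n / 2), odd(n / 2);
    for (size_t i = 0; i < n / 2; ++i)
    {
        even[i] = values[2 * i];
        odd[i] = values[2 * i + 1];
    }
    FFT(even, invert);
    FFT(odd, invert);

    // combine the halves
    float angle = 2.0f * 3.14159265f / float(n) * (invert ? 1.0f : -1.0f);
    for (size_t i = 0; i < n / 2; ++i)
    {
        Complex w = std::polar(1.0f, angle * float(i));
        values[i] = even[i] + w * odd[i];
        values[i + n / 2] = even[i] - w * odd[i];
    }
}

std::vector<float> ConvolveFFT(const std::vector<float>& a, const std::vector<float>& b)
{
    // zero pad both signals to a power of 2 at least as big as the output size
    size_t outputSize = a.size() + b.size() - 1;
    size_t paddedSize = 1;
    while (paddedSize < outputSize)
        paddedSize *= 2;

    std::vector<Complex> fa(a.begin(), a.end()), fb(b.begin(), b.end());
    fa.resize(paddedSize);
    fb.resize(paddedSize);

    // a per-bin multiply in the frequency domain == convolution in the time domain
    FFT(fa, false);
    FFT(fb, false);
    for (size_t i = 0; i < paddedSize; ++i)
        fa[i] *= fb[i];
    FFT(fa, true);

    // take the real part and apply the 1/N inverse FFT scale
    std::vector<float> out(outputSize);
    for (size_t i = 0; i < outputSize; ++i)
        out[i] = fa[i].real() / float(paddedSize);
    return out;
}
```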

Another problem with convolution reverb is that you need the full sound you are convolving, which makes it basically impossible to use in a live setup. The solution to this is that people convolve windows of data at a time, instead of the whole sound. I think a common window size is 256 samples.

I’ll make a post in the future that addresses both those issues to allow for fast, real time convolution reverb via windowed FFT multiplication.

Also, I said in the last post that this technique is what the pros do. Well, I should add that this is what they do SOMETIMES. Other times they use DSP algorithms involving comb filters and multi tap delays (and other things) to make fast reverb that sounds pretty decent without needing to do convolution or FFT/IFFT. Check the links section for a link to the details of one such algorithm.

Regarding IR samples (Impulse Response recordings), if you record your own, it should work as is. However, a common thing that people do is remove the loud sound from the impulse response (via deconvolution I think?) to get ONLY the impulse response. It basically makes it so that the IR really is just an IR of that room, as if you had a single sample of a “1.0” echoing in the area. Without this step, your reverb convolution will still work, but you’ll get “smearing” of the sound, due to it not being a true 1.0 sample (or in other words, not a true dirac delta).

Luckily there are IR’s available for download online as well. Some are free, some are pay. Check the links section for a few links I found for that (:

Lastly, this post basically teaches you how to do 1 dimensional convolution, showing you an application for it. You can do convolution in higher dimensions as well though, and in fact if you have used photoshop and know about things like gaussian blur, sharpen mask, etc, those guys work by 2d convolution. Bokeh is even apparently convolution, as is a “Minkowski Sum” if you have done work in game physics.

I’ll make a graphics related 2d convolution post in the future as well to delve into that stuff a bit deeper.

As one final sound sample, check out this CROSS CORRELATION (not convolution) of SoundBite.wav and ReverbLarge. Cross correlation is the same as convolution except you don’t reverse one of the samples. So, in effect, it’s like I convolved SoundBite.wav with ReverbLarge.wav played backwards.

c_ReverseRL_SBite.wav

More info on cross correlation coming soon. It has uses in audio, but more often it’s used for finding time delays of sounds compared to other sounds, recognizing patterns in sound data even with the presence of noise / distortion, versus having actual audible uses (although there are some of those too!).

# HyperLogLog: Estimate Unique Value Counts Like The Pros

This is the last of my planned posts on probabilistic algorithms. There may be more in the future, but the initial volley is finished 😛

Thanks again to Ben Deane for exposure to this really interesting area of computer science.

This post is about HyperLogLog, which is used to estimate a count of how many “uniques” there are in a stream of data. It can also do intersection and union between two HyperLogLog objects, allowing you to see how many items two HLL objects have in common. It was invented in 2007 and is currently used in many big data situations, including by Google, Redis, and Amazon.

This algorithm is probabilistic in that you can trade storage space for accuracy. You can calculate how much accuracy to expect for a given amount of storage, or calculate how much storage you’d need for a specific amount of accuracy.

If this sounds a lot like the first probabilistic algorithm post I made (Estimating Counts of Distinct Values with KMV) that means you have been paying attention. HyperLogLog is in the same family of algorithms, but it is way better at most things than KMV is, and seems to be the current standard for DV Sketches (distinct value estimators). The one thing KMV seems to be better at is calculating intersections between objects, which I’ll talk about more below.

To give you an idea of the power of HyperLogLog, here’s a quote from the paper it was first described in (link at end of post):

“…the new algorithm makes it possible to estimate cardinalities well beyond 10^9 with a typical accuracy of 2% while using a memory of only 1.5 kilobytes”

By the end of this post you should understand how that is possible, and also be able to start with the sample code and have HyperLogLog capabilities in your own C++ project immediately!

## A Usage Case: Upvoting

A usage case for this algorithm would be adding an “upvote” button to a webpage.

A naive solution to this would be to have an upvote button that when clicked would increment a counter on the server. That is problematic though because people can vote as many times as they want. You could hack together a solution to this by having some client side cookies and javascript limiting people from doing that, but all your security is out the window since it’s a client side fix, and you will soon have trolls spamming your vote counters by doing raw HTTP requests to your servers, to ruin your day, just because they can.

A less naive solution would be to have some unique identifier per user – whether that was something like a username, or just an IP address – and store that in a voting table, only allowing the counter to increment if the person wasn’t already in the table, and entering them into the table when doing the increment.

The problem with that solution is that the table of users who have already voted might get extremely huge, causing lots of memory usage to hold it, lots of processing time to see if a user exists in the table already (even with a hash based lookup), and it doesn’t parallelize very well to multiple servers. You also have to implement some protection against race conditions around the “look for user in table, increment vote counter, add user to table” work, which means some potentially costly synchronization logic.

A better solution would be to use HyperLogLog. Here are some reasons why:

• It has a fixed size you determine in advance. The bold quote from the paper would indicate that 1.5KB is likely enough for our needs, being able to count over one billion unique values. Bringing it up to 2KB would be enough to store a heck of a lot more, like somewhere up near 2^250 unique items.
• It automatically disallows the same item being accounted for more than once, so our multiple votes from same voter problem is gone with no need for costly synchronization work.
• It lends itself well to being parallelized across machines, or just using SIMD / GPGPU if you wanted to squeeze some more performance out of it.
• You can very quickly do set operations on multiple HLL objects.

The first three items solve the problems in the naive solutions, but the fourth adds some bonus functionality that is pretty cool.

Using set operations, you can do things like figure out if the same people are upvoting a dulce de leche flan dessert compared to a cheesy grits recipe, and you can calculate that VERY QUICKLY.

A possibly naive way to suggest a recipe to a user would be to show them recipes that a lot of people have upvoted.

A more custom tailored suggestion (and thus, hopefully a better suggestion) would be when a person votes on a recipe, you can tell them “hey, a lot of the same people that upvoted the recipe you just upvoted also upvoted this other recipe, why don’t you check it out!”

So now, you don’t just have a voting system, you now also have a CLASSIFICATION SYSTEM.

Yeah ok, so that’s a little hand wavy and making a real suggestion system has more details to solve than just those etc etc, but hopefully you can see how this algorithm can be a fundamental building block to many “big data” problems.

Onto the details of how it works!

## Basic Premise – Coin Flips

When you flip a coin, what are the odds that it will be heads? There’s an even chance of heads or tails, so the chance of a heads is 1 in 2, or 1/2, or 50%.

What are the odds that if you flip a coin twice in a row it will be heads both times? Well, on each flip there is a 50% chance, so you multiply .5 * .5 to get the answer which is .25 or 25%. There’s a 25% chance, or a 1 in 4 chance, that you will flip heads twice in a row with two coin flips. It’s the same chance that you will flip two tails in a row, or that you will flip heads then tails, or tails then heads. All possible outcomes have the same probability.

Another way to look at this, instead of probability, is to list out all of the possible combinations, like this:

| Sequence Number | Flip 1 | Flip 2 |
|---|---|---|
| 0 | heads | heads |
| 1 | heads | tails |
| 2 | tails | heads |
| 3 | tails | tails |

There’s a 100% chance that two coin flips will be somewhere in those four results, and all results have an even chance of happening, so result 0 “heads, heads” has a 1 in 4 chance of happening. All of the results above have a 1 in 4 chance. Interestingly, looking at the above table you can also see that there is a 2/4 chance that the first flip will be heads, which is also 1/2, 1 in 2 or 50% chance. That agrees with what we said before, that if only doing one coin flip (the first coin flip), there is a 50% chance of heads or tails.

If you switched to 3 coin flips, there would be a 1 in 8 chance of any specific sequence happening, since there are 8 possible outcomes:

| Sequence Number | Flip 1 | Flip 2 | Flip 3 |
|---|---|---|---|
| 0 | heads | heads | heads |
| 1 | heads | heads | tails |
| 2 | heads | tails | heads |
| 3 | heads | tails | tails |
| 4 | tails | heads | heads |
| 5 | tails | heads | tails |
| 6 | tails | tails | heads |
| 7 | tails | tails | tails |

It doesn’t matter which specific sequence you are looking at above, the chance that 3 coin flips will result in any of those specific sequences is 1 in 8.

In fact, for N coin flips, the probability that you will encounter any specific sequence (permutation) of heads and tails is 1 in 2^N, or 1/(2^N).

Now, let’s change it up a bit. What are the odds that you will have some number of heads in a row, and then a tails? In other words, what are the odds that you will have a “run” of heads of a specified length?

Well, a run of 0 would just be you flip the coin once and get tails. Since you are doing one coin flip and you are looking for a specific one of two possible outcomes to happen, the probability is 1 in 2, or 1/2, or 50%.

A run of 1 would mean you flipped the coin once, got heads, flipped it again and got tails. Since you are doing two coin flips and are looking for a specific one of four possible outcomes, the probability of that happening is 1/2*1/2 = 1/4, or 1 in 4, or 25%.

A run of 2 means heads, heads, tails. 3 coin flips = 1/2*1/2*1/2 = 1/8, or 1 in 8, or 12.5%.

By now you may have noticed the pattern: getting N heads and then a tail takes N+1 coin flips, and any specific sequence of N+1 coin flips has a 1 in 2^(N+1) chance of happening.

More formally, the odds of getting N heads and then a tail is 1/(2^(N+1)).
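If you want to sanity check that formula empirically, here's a quick simulation sketch (it uses `rand()`, which is a poor random number generator, but is plenty for this):

```
#include <stdio.h>
#include <stdlib.h>

int main()
{
    const int c_runLength = 2;      // N = 2: heads, heads, tails. expect 1/(2^3) = 12.5%
    const int c_numTests = 1000000;

    int matchCount = 0;
    for (int i = 0; i < c_numTests; ++i)
    {
        // do N+1 coin flips, looking for N heads and then a tail
        bool match = true;
        for (int flip = 0; flip <= c_runLength && match; ++flip)
        {
            bool heads = (rand() & 1) != 0;
            bool wantHeads = (flip < c_runLength);
            match = (heads == wantHeads);
        }
        if (match)
            ++matchCount;
    }

    printf("%0.2f%%\n", 100.0f * (float)matchCount / (float)c_numTests); // ~12.50%
    return 0;
}
```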

Let’s now swap out the idea of heads and tails with the idea of binary digits 0 and 1 in a sequence of random numbers. Think of “heads” as 0, and “tails” as 1.

If you generate a random bit (binary digit), there is a 50% chance it will be a 0, and a 50% chance it will be a 1.

If you generate two random bits, there is a 25% chance for each of the possible outcomes below:

| Sequence Number | Sequence |
|---|---|
| 0 | 00 |
| 1 | 01 |
| 2 | 10 |
| 3 | 11 |

Now, what are the odds that in a random binary number, we will have N zeros and then a 1?

Don’t let the binary digits scare you, it’s the same answer as the coin flip question: 1/(2^(N+1))

An interesting property of this is that if you ask for a random 8 bit binary number and get xxxxx100, using the formula above, you know there is a 1 in 8 chance that a random number would end in “100”.

Using this information you can say to yourself “I’ll bet I’ve seen about 8 different random numbers at this point”, and that’s a fairly decent guess, without actually having had to pay attention to any of the numbers that came before.

A better idea though would be to watch all the random numbers as they are generated, and keep track of the longest run of zeros you’ve seen on the right side of any number.

Using this “longest run seen” value, you can guess how many unique random numbers you’ve seen so far. If the longest run you’ve seen is N zeros and then a 1, the guess as to how many random numbers you’ve seen is 2^(N+1).

If you’ve seen a maximum of 4 zeros and a 1 (xxx10000), you’ve probably seen about 32 numbers on average. If you’ve seen at maximum 2 zeros and a 1 (xxxxx100), you’ve probably seen about 8 numbers on average.

Since by definition, randomness is RANDOM, there will be fluctuation and your guess will not always be so accurate. You might have only seen one random number, but its value may have been 10000000 (0x80), which would incorrectly cause you to estimate that 256 items have been seen (2^8), when in fact only a single item has been seen.

To combat this, HyperLogLog uses multiple registers (counters) to keep multiple counts, and then averages the estimates together. More info on that below, but for now, hopefully you can see how that would smooth things out towards a more average case. The more registers you use, the more “average case” your estimation should be, so the more accurate your estimate should be.

There’s an interesting twist here though… you aren’t actually estimating how many random numbers you’ve seen, but instead are estimating how many UNIQUE random numbers you’ve seen. Random numbers can repeat, but this count estimation will only count each unique value once, no matter how many times it appears.

To help visualize that, no matter how many times you see 10110100 – whether it’s only once, or ten thousand times – the longest run will still be 2. After you’ve seen ten thousand of those numbers, as soon as you see the next number 10011000, the longest run will then be 3.
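Here's a small sketch of that bookkeeping, using those two example numbers:

```
#include <stdio.h>
#include <stdint.h>

// count the zeros on the right (low) side of a value, looking at numBits bits
uint32_t CountRightZeros(uint32_t value, uint32_t numBits)
{
    uint32_t count = 0;
    while (count < numBits && (value & 1) == 0)
    {
        value >>= 1;
        ++count;
    }
    return count;
}

int main()
{
    uint32_t longestRun = 0;

    // seeing 10110100 (a run of 2) ten thousand times leaves the longest run at 2
    for (int i = 0; i < 10000; ++i)
    {
        uint32_t run = CountRightZeros(0xB4, 8); // 0xB4 = 10110100
        if (run > longestRun)
            longestRun = run;
    }
    printf("longest run = %u, estimate = %u uniques\n", longestRun, 1u << (longestRun + 1)); // 2, 8

    // a single 10011000 (a run of 3) bumps it to 3
    uint32_t run = CountRightZeros(0x98, 8); // 0x98 = 10011000
    if (run > longestRun)
        longestRun = run;
    printf("longest run = %u, estimate = %u uniques\n", longestRun, 1u << (longestRun + 1)); // 3, 16

    return 0;
}
```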

That may sound like a trivial difference that we are counting uniques, and not actual values, but as the next section will show, it’s actually a very important difference, and is where this technique derives its power.

Also, if we were counting non uniques, we could just use an integer and increment it for each item we saw (;

## Hashing Functions as Pseudorandom Number Generators

An interesting property of good hashing algorithms is that the output you get when you hash an object will be indistinguishable from random numbers. If you hash 3, you’ll get a random number, and if you hash 4, you’ll likely get a completely different random number. You can’t tell from the output what the nature of the input was, or how similar the input objects were.

But, of course, when you put the same input into a hash function, you will always get the same output.

These two properties are what makes hash functions so useful in what we are trying to do in HLL.

Quick Tangent: These same properties also apply to encryption by the way. The fact that the output looks random is why hashed and encrypted data doesn’t compress very well. There are no patterns in the data that can be exploited to express the data as anything shorter than the full length of data itself. You should not be able to gain any knowledge about the nature of the input by looking at the output, except perhaps the size of the source data in the case of encryption. Also, whereas hashes are not reversible at all, encryption is only reversible if you have the encryption key (password). HLL and similar algorithms use hashes, not encryption, because they want a fixed size output, and they don’t care about reversing the process.

The output of the hash function is the source of the pseudorandom numbers that we plug into the HyperLogLog algorithm, and is what allows us to count uniques of any type of thing, so long as that thing can be hashed.

So, to do HLL counting, you hash every object you see in a stream keeping track of the longest run of zeros (in binary) you’ve seen in the resulting hashes. You store “longest runs seen” in multiple registers which you can then later use to get an averaged estimate of unique items encountered. That’s all there is to it.

MAGIC!

That’s how things work from a high level, now let’s get into the nitty gritty a bit…

## Handling Multiple Registers

Let’s say you have a hash function that spits out a 32 bit hash, which is a pretty common thing for HLL implementations.

We talked about figuring out the length of the run of 0’s in the hash output, but if you had 16 registers to store run lengths in, how do you choose which register to store each run length in?

The common way to solve this is to use some of your hash bits for register selection. If you have 16 registers, you could use the lowest 4 bits of your hash as the register index to store the count in for instance.

There is a problem here though, that hopefully you can see already. The problem is that if we have a run of 3 with a hash that ends in the binary number 1000, we will only ever store that run length in register 8! By using the same bits for register selection as we do for counting numbers, we’ve biased our count and introduced inaccuracy (error) because certain numbers will only get accounted for by specific registers. The ideal situation is that every number is as likely to end up in any specific register versus another one. It should “randomly” choose what register to use, but be deterministic in how it chooses that register.

The bits you use for register selection cannot be reused for counting runs, or else you’ll fall into the trap of only storing certain numbers in specific registers.

You could perhaps hash your hash to get another pseudo random number to use for register selection, but a better option is to just throw out those register selection bits once you use them.
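Here's a sketch of that bit splitting, using a 32 bit hash and 16 registers (4 bits of the hash pick the register and get thrown away, and the remaining 28 bits are scanned for the run):

```
#include <stdint.h>

static const uint32_t c_numRegisterBits = 4;                 // 16 registers
static const uint32_t c_numRunBits = 32 - c_numRegisterBits; // 28 bits left over for runs

uint8_t g_registers[1 << c_numRegisterBits] = { 0 };

void AddHashToRegisters(uint32_t hash)
{
    // use the low bits to pick a register, then throw those bits away
    uint32_t registerIndex = hash & ((1 << c_numRegisterBits) - 1);
    hash >>= c_numRegisterBits;

    // find the position of the first 1 bit (the run of zeros, plus one).
    // if all of the remaining bits are zero, this maxes out at c_numRunBits + 1.
    uint8_t firstOnePosition = 1;
    while (firstOnePosition <= c_numRunBits && (hash & 1) == 0)
    {
        hash >>= 1;
        ++firstOnePosition;
    }

    // each register remembers the largest value it has seen
    if (firstOnePosition > g_registers[registerIndex])
        g_registers[registerIndex] = firstOnePosition;
}
```

Note that this stores the position of the first 1 bit rather than the raw run length, which matches the count estimation pseudo code later in this post.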

Reducing the number of bits you evaluate for runs of 0’s comes at a cost though. It means that your estimation of unique values seen is capped at a lower number. With 32 bits, you can estimate a count up to 2^32 (~4 billion), but at 28 bits, after using 4 bits for register selection, you can only estimate a count of up to 2^28 (~268 million).

I believe this is one of the reasons why Google invented “HyperLogLog++”, which uses a 64 bit hash and has some other improvements. Check the links at the bottom of this post for more information.

It’s a bit overkill, but in the sample code in this post, we create a 128 bit hash, use the top 32 bits for register selection, and the lower 96 bits for looking at runs. I say it’s overkill because at 96 bits, we can estimate counts up to 79 billion billion billion, which is way more than anyone (even Google!) is ever likely to need.

## Register Sizes

As I mentioned above, many people use 32 bit hashes, for estimating at most about 4 billion unique objects. Google bumps it up to 64 bits for up to 18 billion billion uniques, and our sample code uses 96 bits for run evaluation, letting us estimate counts up to 79 billion billion billion.

These numbers are beyond huge, but believe it or not, the size of the registers themselves used to track these things are pretty darn tiny.

Since we are looking for runs of zeros, if we have a 32 bit hash, we only need to be able to store the values 0 to 32. 0 to 31 can be stored in 5 bits, and chances are that people aren’t going to bump it up to 6 bits just to get that extra value – especially when in practice you are going to use a few bits of the hash as register selection.

So, for a 32 bit hash, you really only need 5 bits per register to keep track of longest runs seen.

For a 64 bit hash, you need to be able to store the values 0 to 64. Similar to the above, 0-63 can be stored in 6 bits, and we can ditch being able to store 64, so 6 bits per register is plenty.

For our 96 bit hash (since we use 32 bits for register selection), we’d only need to be able to store 0-96, which can fit entirely in 7 bits, since 7 bits can store 0-127.

In our example code, I’m an excessive glutton however, and store our longest run value in 8 bits, wasting an entire bit of memory per register.

Yep, in our excessively gigantic counting 128 bit hash HLL DV Sketch code, I use an ENTIRE BYTE of memory per register. The Horror!!!

With a 32 or 64 bit hash, you could drop that down to 5 or 6 bits per register, and either condense your registers in memory, or perhaps even use those extra bits for something else if you wanted (need some flags?!).

## Register Counts

So, our register size itself is fairly tiny, where my gluttonous, wasteful programming uses a single byte per register. How many registers do we need though?

The answer to that depends on how much accuracy you want.

To calculate how much error there is for M registers, the equation is: expectedError = 1.04 / sqrt(M)

To calculate how many registers you need to achieve a specific expected error, the equation is: M = 676 / (625 * expectedError^2)

In those equations, an expectedError of 0.03 would mean 3%.
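Here are those two formulas as code, along with a couple of example evaluations:

```
#include <math.h>
#include <stdio.h>

// expected error (as a fraction, so 0.03 means 3%) for a given number of registers
double ExpectedError(double numRegisters)
{
    return 1.04 / sqrt(numRegisters);
}

// number of registers needed to achieve a desired expected error
double RegistersNeeded(double expectedError)
{
    return 676.0 / (625.0 * expectedError * expectedError);
}

int main()
{
    printf("%0.1f%%\n", ExpectedError(2048.0) * 100.0); // ~2.3%
    printf("%0.0f\n", RegistersNeeded(0.02));           // 2704, which you'd round up to 4096 (a power of 2)
    return 0;
}
```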

Check out the table below to get an idea of accuracy vs size.

Note that since we use bits from our hash to do register selection, that our number of registers is a power of 2.

| Register Bits | Register Count | Expected Error |
|---|---|---|
| 0 | 1 | 104% |
| 1 | 2 | 73.5% |
| 2 | 4 | 52% |
| 3 | 8 | 36.7% |
| 4 | 16 | 26% |
| 5 | 32 | 18.3% |
| 6 | 64 | 13% |
| 7 | 128 | 9% |
| 8 | 256 | 6.5% |
| 9 | 512 | 4.6% |
| 10 | 1024 | 3.2% |
| 11 | 2048 | 2.2% |
| 12 | 4096 | 1.6% |

Here is a graph showing how number of bits used affects error. Bits used is the x axis, and expected error is the y axis.

From the table and the graph you can see that adding bits (registers) gives diminishing returns in error reduction. It’s especially diminishing because whenever we add another bit, we double our storage size (double the number of registers we use).

This shows us that this algorithm is great at counting a large number of uniques, since one byte per counter can count up to about 2^256 (2^2^8) uniques, but it isn’t super great at getting a low error rate. If you are ok with about 2% accuracy, the storage space needed is pretty small though!

Remember the claim at the top of this post?

“…the new algorithm makes it possible to estimate cardinalities well beyond 10^9 with a typical accuracy of 2% while using a memory of only 1.5 kilobytes”

Looking at the bits used / error table, you can see 11 bits, or 2048 registers, gives just a little over 2% accuracy.

If you are using 32 bits of hash to look for runs of zeros, you can use 6 bit registers to store the longest run seen, if you want to waste a bit to be able to store 0-32 instead of just 0-31.

So, 2048 registers * 6 bits = 12288 bits of register storage needed. That is 1536 bytes, or exactly 1.5KB.

You could count up to ~4 billion uniques (2^32) with that configuration, but error increases as you get closer to that limit, so I think that’s why they limited their statement to counting ~1 billion uniques (10^9).

## Estimating Counts

The math behind the count estimation is a bit complex (check the paper in the links section for more info!) but part of how it works is it uses the harmonic mean to average the data from the registers together. Since there is randomness involved, and differences in our run lengths means being off by exponential amounts, the harmonic mean is great at filtering out large outliers due to the random fluctuations. Fluctuation to the “too small” end won’t matter too much since it will be over-written by other values since we are storing the maximum value seen. Fluctuations to the “too large” end are mitigated with harmonic mean.

Here’s some pseudo code for estimating the count. Note that we are storing the position of the first 1 in the registers, not storing the run length of zeros. That’s an important distinction because it means the number is 1 higher than it would be otherwise, which if you get it wrong makes your estimation half as large as it should be. It also means that you know if you see a zero register, that it is uninitialized and hasn’t seen a value yet.

```
Alpha = 0.7213 / (1 + 1.079 / NumRegisters)
Sum = 0
for i = 0 to NumRegisters
Sum = Sum + pow(2, -Register[i])

Estimation = Alpha * NumRegisters^2 / Sum

// do small range correction
if Estimation < 5 / 2 * NumRegisters
{
if NumEmptyRegisters > 0
Estimation = NumRegisters * ln(NumRegisters / NumEmptyRegisters)
}
// do large range correction
else if Estimation > 2^32 / 30
{
Estimation = -2^32 * ln(1 - Estimation/ 2^32);
}
```

Small range correction is there because when not enough registers have been filled in (they haven’t gotten any data yet), the normal algorithm path has expected error greater than 1.04. Large range correction is there for the same reason, but on the high end side, when the registers are saturated.

## Set Operations

You can do set operations on hyperloglog objects so long as they use the same number of registers, same sized registers, and same hashing algorithm.

There’s a link in the links section at the end of this post that shows you how to resize the number of registers so that you can do set operations on HLL objects that have different numbers of registers.

### Union

Taking a union of two HLL objects is actually really simple.

If you have two HLL objects A and B, that you want to union to get C, all you do is take the maximum bucket value from A and B and store it in C. Check out this pseudocode to see what I mean:

```
for i = 0 to NumRegisters
C.Register[i] = Max(A.Register[i], B.Register[i])
```

The neat thing about doing this union is that it is LOSSLESS and doesn’t introduce any new error. Doing a union of two HLL objects is just the same as if you had a third HLL object that processed all the same objects that A and B both processed.

### Intersection

To do an intersection is a tiny bit trickier, but not by much. We have to use what is called the Inclusion-Exclusion Principle (check links section for more info).

Using that principle, we can estimate the count of how many items are in the intersection, but we can’t get a HLL object representing the intersection of the two objects unfortunately.

The formula is this:

Count(Intersection(A,B)) = Count(A) + Count(B) - Count(Union(A,B))

And here’s some more pseudocode to show you how to do it:

```
C = Union(A,B)
IntersectionCountEstimate = A.CountEstimate() + B.CountEstimate() - C.CountEstimate()
```

Pretty simple eh?

At the beginning I mentioned that KMV was actually better at intersections than HyperLogLog. The reason for that is because with KMV, you have a small, random sample range from both objects, and you can do an intersection between those two ranges and get your result.

KMV really starts to shine when you need to do an intersection between more than 2 or 3 lists, because using the inclusion-exclusion principle causes a combinatorial explosion, while the KMV intersection easily extends for N sets.

### Jaccard Index

There’s no special trick to calculating the Jaccard Index, as per usual it’s just:

JaccardIndex = Count(Intersection(A,B)) / Count(Union(A,B))

Which will give you a value from 0 to 1 indicating how similar the two sets A and B are, where 1 is totally the same, and 0 is completely different. In the case of HLL, this is an estimated Jaccard Index of course, not the actual value!

### Contains Percent

I was noticing in runs of the sample code below that while the union operation had pretty decent error levels, the intersection operation had not so great error levels and thus the Jaccard Index wasn’t very accurate either. This is mostly due to the fact that the intersection levels were pretty small, so if you had +1 or -1, that came out to be a large percentage of the actual value.

Despite having a reasonable explanation that diminished the actual impact of the “high error levels”, I wanted to see if I could come up with a different similarity metric and see how the error rate was on it.

What I came up with was a “Contains Percent”, which is the percentage of how many items in A are contained in B. You calculate it like this:

ContainsPercent = Count(Intersection(A,B)) / Count(A)

Since it uses the intersection value (that is not so accurate), it didn’t give great results, but I wanted to mention it because it actually is a completely different measurement than Jaccard Index with different meaning. All of the items in A could be in B, which would give it a “ContainsPercent” of 1.0, but B may have 10 times as many items that don’t appear in A, which would make the Jaccard Index very low.

In some cases, you may want to use the information that the Jaccard Index represents to make decisions, and in other cases you may want this “Contains Percent” metric, or maybe even something else.

It’s a bit subtle, but it’s good to think about what it is that you are actually looking for if using these things in actual production code (:

### Estimating Set Membership

So, HLL is NOT a bloom filter, but can it still be used like one?

The answer is yes, but I don’t have a whole lot of information about the formal accuracy of that.

Basically how you’d do this is create a temporary HLL object, make it store the single item you want to check the other HLL for set membership of, and then you’d do an estimated intersection count between the two HLL objects.

As crazy as it sounds, it looks like Redis exposes this functionality and says it is pretty accurate (for however many registers they used anyways), which is pretty neat:
Redis: PFCOUNT

### Dot Product

The dot product between two sets (or multi sets – where you have a count associated with each item) can be really useful to help gauge the similarity between the two sets.

You can do a dot product operation between two HLL objects too. If you think about it, getting the dot product between two HLL objects is the same as getting the estimated count of the intersection between those two objects.

## Example Code

Here is some example code in C++ that allows you to do HyperLogLog. It uses only standard include files (plus <intrin.h> for Microsoft’s _BitScanForward intrinsic), so it has no external dependencies, and it’s in a single file for convenience.

I use a hash called MurmurHash3 to generate 128 bits of hash, 32 bits of which are used to generate the register index, and the remaining 96 bits are used for looking at runs of zeros.

Below the code is the output of a run of this program

```#include <array>
#include <string>
#include <assert.h>
#include <unordered_set>
#include <stdint.h>
#include <memory>
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <cctype>

// microsoft only, for _BitScanForward to quickly find the index of the first 1 bit
// On gcc/clang, __builtin_ctz can be used instead
#include <intrin.h>

//=====================================================================================================
// MurmurHash3
//=====================================================================================================

// note that this 128 bit MurmurHash3 is optimized for x86.  There is a version in the original
// MurmurHash3 source optimized for x64 as well, but it gives different output for the same input.

#define ROTL32(x,y)     _rotl(x,y)

inline uint32_t getblock32 (const uint32_t * p, int i)
{
return p[i];
}

inline uint32_t fmix32 (uint32_t h)
{
h ^= h >> 16;
h *= 0x85ebca6b;
h ^= h >> 13;
h *= 0xc2b2ae35;
h ^= h >> 16;

return h;
}

void MurmurHash3_x86_128 (const void * key, const int len,
uint32_t seed, std::array<uint32_t, 4> & out)
{
const uint8_t * data = (const uint8_t*)key;
const int nblocks = len / 16;

uint32_t h1 = seed;
uint32_t h2 = seed;
uint32_t h3 = seed;
uint32_t h4 = seed;

const uint32_t c1 = 0x239b961b;
const uint32_t c2 = 0xab0e9789;
const uint32_t c3 = 0x38b34ae5;
const uint32_t c4 = 0xa1e38b93;

//----------
// body

const uint32_t * blocks = (const uint32_t *)(data + nblocks * 16);

for (int i = -nblocks; i; i++)
{
uint32_t k1 = getblock32(blocks, i * 4 + 0);
uint32_t k2 = getblock32(blocks, i * 4 + 1);
uint32_t k3 = getblock32(blocks, i * 4 + 2);
uint32_t k4 = getblock32(blocks, i * 4 + 3);

k1 *= c1; k1 = ROTL32(k1, 15); k1 *= c2; h1 ^= k1;

h1 = ROTL32(h1, 19); h1 += h2; h1 = h1 * 5 + 0x561ccd1b;

k2 *= c2; k2 = ROTL32(k2, 16); k2 *= c3; h2 ^= k2;

h2 = ROTL32(h2, 17); h2 += h3; h2 = h2 * 5 + 0x0bcaa747;

k3 *= c3; k3 = ROTL32(k3, 17); k3 *= c4; h3 ^= k3;

h3 = ROTL32(h3, 15); h3 += h4; h3 = h3 * 5 + 0x96cd1c35;

k4 *= c4; k4 = ROTL32(k4, 18); k4 *= c1; h4 ^= k4;

h4 = ROTL32(h4, 13); h4 += h1; h4 = h4 * 5 + 0x32ac3b17;
}

//----------
// tail

const uint8_t * tail = (const uint8_t*)(data + nblocks * 16);

uint32_t k1 = 0;
uint32_t k2 = 0;
uint32_t k3 = 0;
uint32_t k4 = 0;

switch (len & 15)
{
case 15: k4 ^= tail[14] << 16;
case 14: k4 ^= tail[13] << 8;
case 13: k4 ^= tail[12] << 0;
k4 *= c4; k4 = ROTL32(k4, 18); k4 *= c1; h4 ^= k4;

case 12: k3 ^= tail[11] << 24;
case 11: k3 ^= tail[10] << 16;
case 10: k3 ^= tail[9] << 8;
case  9: k3 ^= tail[8] << 0;
k3 *= c3; k3 = ROTL32(k3, 17); k3 *= c4; h3 ^= k3;

case  8: k2 ^= tail[7] << 24;
case  7: k2 ^= tail[6] << 16;
case  6: k2 ^= tail[5] << 8;
case  5: k2 ^= tail[4] << 0;
k2 *= c2; k2 = ROTL32(k2, 16); k2 *= c3; h2 ^= k2;

case  4: k1 ^= tail[3] << 24;
case  3: k1 ^= tail[2] << 16;
case  2: k1 ^= tail[1] << 8;
case  1: k1 ^= tail[0] << 0;
k1 *= c1; k1 = ROTL32(k1, 15); k1 *= c2; h1 ^= k1;
};

//----------
// finalization

h1 ^= len; h2 ^= len; h3 ^= len; h4 ^= len;

h1 += h2; h1 += h3; h1 += h4;
h2 += h1; h3 += h1; h4 += h1;

h1 = fmix32(h1);
h2 = fmix32(h2);
h3 = fmix32(h3);
h4 = fmix32(h4);

h1 += h2; h1 += h3; h1 += h4;
h2 += h1; h3 += h1; h4 += h1;

out[0] = h1;
out[1] = h2;
out[2] = h3;
out[3] = h4;
}

struct SMurmurHash3
{
std::array<uint32_t, 4> HashBytes (const void *key, size_t len)
{
// use a random 32 bit number as the seed (salt) of the hash.
// Gotten from random.org
// https://www.random.org/cgi-bin/randbyte?nbytes=4&format=h
static const uint32_t c_seed = 0x2e715b3d;

// MurmurHash3 doesn't do well with small input sizes, so if the input is too small,
// make it longer in a way that hopefully doesn't cause likely collisions.
// "the" hashed before the fix (notice the last 3 are the same)
//  0x45930d0e
//  0xfc76ee5b
//  0xfc76ee5b
//  0xfc76ee5b
// and after the fix:
//  0x70220da0
//  0xe7d0664a
//  0xb4e4d832
//  0x25940640
std::array<uint32_t, 4> ret;
static const size_t c_minLen = 16;
if (len < c_minLen)
{
unsigned char buffer[c_minLen];

for (size_t i = 0; i < len; ++i)
buffer[i] = ((unsigned char*)key)[i];

for (size_t i = len; i < c_minLen; ++i)
buffer[i] = buffer[i%len] + i;

MurmurHash3_x86_128(buffer, c_minLen, c_seed, ret);
}
else
{
MurmurHash3_x86_128(key, len, c_seed, ret);
}

return ret;
}

template <typename T>
std::array<uint32_t, 4> operator() (const T &object);

template <>
std::array<uint32_t, 4> operator() <std::string> (const std::string &object)
{
return HashBytes(object.c_str(), object.length());
}

// NOTE: if you need to hash other object types, just make your own template specialization here
};

//=====================================================================================================
// The CHyperLogLog class
//=====================================================================================================
//
// TKEY is the type of objects to keep track of
// NUMREGISTERBITS is how many bits of the hash are used to index into registers.  It also controls
//   how many registers there are since that count is 2^NUMREGISTERBITS
// HASHER controls how the keys are hashed
//
template <typename TKEY, size_t NUMREGISTERBITS, typename HASHER>
class CHyperLogLog
{
public:
CHyperLogLog()
: m_counts{} // init counts to zero
{ }

// friends
// note: the friend's template parameters are renamed so they don't shadow the class's own
template <typename TKEY2, size_t NUMREGISTERBITS2, typename HASHER2>
friend float UnionCountEstimate (const CHyperLogLog<TKEY2, NUMREGISTERBITS2, HASHER2>& A, const CHyperLogLog<TKEY2, NUMREGISTERBITS2, HASHER2>& B);

// constants
static const size_t c_numRegisters = 1 << NUMREGISTERBITS;
static const size_t c_registerMask = (c_numRegisters - 1);

// interface
void AddItem (const TKEY& item)
{
// because 2^32 does not fit in 32 bits
static_assert(NUMREGISTERBITS < 32, "CHyperLogLog must use fewer than 32 register bits");

// make as many hashed bits as we need
std::array<uint32_t, 4> hash = HASHER()(item);

// use the highest 32 bits for getting our register index
unsigned int registerIndex = hash[3] & c_registerMask;

// use the other 96 bits as our "unique number" corresponding to our object.  Note that we
// don't use the high 32 bits that we've already used for the register index because that
// would bias our results.  Certain values seen would ONLY get tracked by specific registers.
// That's ok though that we are only using 96 bits because that is still an astronomically large
// value.  Most HyperLogLog implementations use only 32 bits, while google uses 64 bits.  We are
// doing even larger with 96 bits.
// 2^(bits) = how many unique items you can track.
//
// 2^32  = 4 billion <--- what most people use
// 2^64  = 18 billion billion <--- what google uses
// 2^96  = 79 billion billion billion <--- what we are using
// 2^128 = 340 billion billion billion billion <--- way huger than anyone needs. Beyond astronomical.

// Figure out where the lowest 1 bit is (_BitScanForward scans up from the least significant bit)
unsigned long bitSet = 0;
if (_BitScanForward(&bitSet, hash[0]))
bitSet += 1;
else if (_BitScanForward(&bitSet, hash[1]))
bitSet += 32 + 1;
else if (_BitScanForward(&bitSet, hash[2]))
bitSet += 64 + 1;

// store the highest seen value for that register
assert(bitSet < 256);
unsigned char value = (unsigned char)bitSet;
if (m_counts[registerIndex] < value)
m_counts[registerIndex] = value;

}

unsigned int EmptyRegisterCount () const
{
unsigned int ret = 0;
std::for_each(m_counts.begin(), m_counts.end(), [&] (unsigned char count) {
if (count == 0)
ret++;
});
return ret;
}

float GetCountEstimation () const
{
// calculate dv estimate
const float c_alpha = 0.7213f / (1.0f + 1.079f / c_numRegisters);
float sum = 0.0f;
std::for_each(m_counts.begin(), m_counts.end(), [&](unsigned char count) {
sum += std::pow(2.0f, -(float)count);
});

float dv_est = c_alpha * ((float)c_numRegisters / sum) * (float)c_numRegisters;

// small range correction
if (dv_est < 5.0f / 2.0f * (float)c_numRegisters)
{
// if no empty registers, use the estimate we already have
unsigned int emptyRegisters = EmptyRegisterCount();
if (emptyRegisters == 0)
return dv_est;

// balls and bins correction
return (float)c_numRegisters * log((float)c_numRegisters / (float)emptyRegisters);
}

// large range correction
if (dv_est > 143165576.533f) // 2^32 / 30
return -pow(2.0f, 32.0f) * log(1.0f - dv_est / pow(2.0f, 32.0f));

return dv_est;
}

private:
std::array<unsigned char, c_numRegisters> m_counts;
};

// there seem to be numerical problems when using 26 bits or larger worth of registers
typedef CHyperLogLog<std::string, 10, SMurmurHash3> TCounterEstimated;
typedef std::unordered_set<std::string> TCounterActual;

//=====================================================================================================
// Set Operations
//=====================================================================================================

template <typename TKEY, size_t NUMREGISTERBITS, typename HASHER>
float UnionCountEstimate (
const CHyperLogLog<TKEY, NUMREGISTERBITS, HASHER>& A,
const CHyperLogLog<TKEY, NUMREGISTERBITS, HASHER>& B)
{
// dynamically allocate our hyperloglog object to not bust the stack if you are using a lot of registers.
std::unique_ptr<CHyperLogLog<TKEY, NUMREGISTERBITS, HASHER>> temp =
std::make_unique<CHyperLogLog<TKEY, NUMREGISTERBITS, HASHER>>();

// To make a union between two hyperloglog objects that have the same number of registers and use
// the same hash, just take the maximum of each register between the two objects.  This operation is
// "lossless" in that you end up with the same registers as if you actually had another object
// that tracked the items each individual object tracked.
for (size_t i = 0; i < (*temp).c_numRegisters; ++i)
(*temp).m_counts[i] = std::max(A.m_counts[i], B.m_counts[i]);
return (*temp).GetCountEstimation();
}

template <typename TKEY, size_t NUMREGISTERBITS, typename HASHER>
float IntersectionCountEstimate (
const CHyperLogLog<TKEY, NUMREGISTERBITS, HASHER>& A,
const CHyperLogLog<TKEY, NUMREGISTERBITS, HASHER>& B)
{
// We have to use the inclusion-exclusion principle to get an intersection estimate
// count(Intersection(A,B)) = (count(A) + count(B)) - count(Union(A,B))
// http://en.wikipedia.org/wiki/Inclusion%E2%80%93exclusion_principle

return (A.GetCountEstimation() + B.GetCountEstimation()) - UnionCountEstimate(A, B);
}

float UnionCountActual (const TCounterActual& A, const TCounterActual& B)
{
TCounterActual temp;
std::for_each(A.begin(), A.end(), [&] (const TCounterActual::key_type &s){temp.insert(s);});
std::for_each(B.begin(), B.end(), [&] (const TCounterActual::key_type &s){temp.insert(s);});
return (float)temp.size();
}

float IntersectionCountActual (const TCounterActual& A, const TCounterActual& B)
{
float ret = 0;
std::for_each(A.begin(), A.end(), [&](const TCounterActual::key_type &s)
{
if (B.find(s) != B.end())
++ret;
});
return ret;
}

//=====================================================================================================
// Error Calculations
//=====================================================================================================

float ExpectedError (size_t numRegisters)
{
return 1.04f / ((float)(sqrt((float)numRegisters)));
}

float IdealRegisterCount (float expectedError)
{
return 676.0f / (625.0f * expectedError * expectedError);
}

//=====================================================================================================
// Driver Program
//=====================================================================================================

template <typename L>
void ForEachWord(const std::string &source, L& lambda)
{
size_t prev = 0;
size_t next = 0;

while ((next = source.find_first_of(" ,.-\":\n", prev)) != std::string::npos)
{
if ((next - prev != 0))
{
std::string word = source.substr(prev, next - prev);
std::transform(word.begin(), word.end(), word.begin(), ::tolower);
lambda(word);
}
prev = next + 1;
}

if (prev < source.size())
{
std::string word = source.substr(prev);
std::transform(word.begin(), word.end(), word.begin(), ::tolower);
lambda(word);
}
}

template <typename T>
const char *CalculateError (const T&estimate, const T&actual)
{
float error = 100.0f * ((float)estimate - (float)actual) / (float)actual;
if (std::isnan(error) || std::isinf(error))
return "undef";

// bad practice to return a static local string, dont do this in production code!
static char ret[256];
sprintf_s(ret, sizeof(ret), "%0.2f%%", error);
return ret;
}

const char *BytesToHumanReadable (unsigned int bytes)
{
// bad practice to return a static local string, dont do this in production code!
static char ret[256];
if (bytes >= 1024 * 1024 * 1024)
sprintf_s(ret, sizeof(ret), "%0.2fGB", ((float)bytes) / (1024.0f*1024.0f*1024.0f));
else if (bytes >= 1024 * 1024)
sprintf_s(ret, sizeof(ret), "%0.2fMB", ((float)bytes) / (1024.0f*1024.0f));
else if (bytes >= 1024)
sprintf_s(ret, sizeof(ret), "%0.2fKB", ((float)bytes) / (1024.0f));
else
sprintf_s(ret, sizeof(ret), "%u Bytes", bytes);
return ret;
}

void WaitForEnter()
{
printf("nPress Enter to quit");
fflush(stdin);
getchar();
}

// These one paragraph stories are from http://birdisabsurd.blogspot.com/p/one-paragraph-stories.html

// The Dino Doc : http://birdisabsurd.blogspot.com/2011/11/dino-doc.html (97 words)
const char *g_storyA =
"The Dino Doc:n"
"Everything had gone according to plan, up 'til this moment. His design team "
"had done their job flawlessly, and the machine, still thrumming behind him, "
"a thing of another age, was settled on a bed of prehistoric moss. They'd "
"done it. But now, beyond the protection of the pod and facing an enormous "
"tyrannosaurus rex with dripping jaws, Professor Cho reflected that, had he "
"known of the dinosaur's presence, he wouldn't have left the Chronoculator - "
"and he certainly wouldn't have chosen "Stayin' Alive", by The Beegees, as "
"his dying soundtrack. Curse his MP3 player";

// The Robot: http://birdisabsurd.blogspot.com/2011/12/robot.html (121 words)
const char *g_storyB =
"The Robot:n"
"The engineer watched his robot working, admiring its sense of purpose.It knew "
"what it was, and what it had to do.It was designed to lift crates at one end "
"of the warehouse and take them to the opposite end.It would always do this, "
"never once complaining about its place in the world.It would never have to "
"agonize over its identity, never spend empty nights wondering if it had been "
"justified dropping a promising and soul - fulfilling music career just to "
"collect a bigger paycheck.And, watching his robot, the engineer decided that "
"the next big revolution in the robotics industry would be programming "
"automatons with a capacity for self - doubt.The engineer needed some company.";

// The Internet: http://birdisabsurd.blogspot.com/2011/11/internet.html (127 words)
const char *g_storyC =
"The Internet:n"
"One day, Sandra Krewsky lost her mind.Nobody now knows why, but it happened - "
"and when it did, Sandra decided to look at every page on the Internet, "
"insisting that she wouldn't eat, drink, sleep or even use the washroom until "
"the job was done. Traps set in her house stalled worried family members, and by "
"the time they trounced the alligator guarding her bedroom door - it managed to "
"snap her neighbour's finger clean off before going down - Sandra was already "
"lost… though the look of despair carved in her waxen features, and the cat "
"video running repeat on her flickering computer screen, told them everything "
"they needed to know.She'd seen too much. She'd learned that the Internet "
"played for keeps.";

int main (int argc, char **argv)
{
// show basic info regarding memory usage, precision, etc
printf("For %u registers at 1 byte per register, the expected error is %0.2f%%.n",
TCounterEstimated::c_numRegisters,
100.0f * ExpectedError(TCounterEstimated::c_numRegisters)
);
printf("Memory Usage = %s per HyperLogLog object.n", BytesToHumanReadable(TCounterEstimated::c_numRegisters));
static const float c_expectedError = 0.03f;
printf("For expected error of %0.2f%%, you should use %0.2f registers.nn",
100.0f*c_expectedError,
IdealRegisterCount(c_expectedError)
);

// populate our data structures
// dynamically allocate estimate so we don't bust the stack if we have a large number of registers
std::unique_ptr<TCounterEstimated> estimateTotal = std::make_unique<TCounterEstimated>();
std::unique_ptr<TCounterEstimated> estimateA = std::make_unique<TCounterEstimated>();
std::unique_ptr<TCounterEstimated> estimateB = std::make_unique<TCounterEstimated>();
std::unique_ptr<TCounterEstimated> estimateC = std::make_unique<TCounterEstimated>();
TCounterActual actualTotal;
TCounterActual actualA;
TCounterActual actualB;
TCounterActual actualC;
{
// add each word to both the estimated and actual counters
auto f = [&](const std::string &word)
{
estimateTotal->AddItem(word);
actualTotal.insert(word);
};

auto fA = [&](const std::string &word)
{
estimateA->AddItem(word);
actualA.insert(word);
f(word);
};

auto fB = [&](const std::string &word)
{
estimateB->AddItem(word);
actualB.insert(word);
f(word);
};

auto fC = [&](const std::string &word)
{
estimateC->AddItem(word);
actualC.insert(word);
f(word);
};

ForEachWord(g_storyA, fA);
ForEachWord(g_storyB, fB);
ForEachWord(g_storyC, fC);
}

// Show unique word counts for the three combined stories
{
float estimateCount = estimateTotal->GetCountEstimation();
float actualCount = (float)actualTotal.size();
printf("Unique words in the three stories combined:n");
printf("  %0.1f estimated, %.0f actual, Error = %snn",
estimateCount,
actualCount,
CalculateError(estimateCount, actualCount)
);
}

// show unique word counts per story
{
printf("Unique Word Count Per Story:n");
auto g = [](const char *name, const TCounterEstimated &estimate, const TCounterActual &actual)
{
float estimateCount = estimate.GetCountEstimation();
float actualCount = (float)actual.size();
printf("  %s = %0.1f estimated, %.0f actual, Error = %sn",
name,
estimateCount,
actualCount,
CalculateError(estimateCount, actualCount)
);
};

g("A", *estimateA, actualA);
g("B", *estimateB, actualB);
g("C", *estimateC, actualC);
}

// Set Operations
{
printf("nSet Operations:n");

auto f = [] (
const char *name1,
const TCounterEstimated &estimate1,
const TCounterActual &actual1,
const char *name2,
const TCounterEstimated &estimate2,
const TCounterActual &actual2
)
{
printf("  %s vs %s...n", name1, name2);

// union
float estimateUnionCount = UnionCountEstimate(estimate1, estimate2);
float actualUnionCount = UnionCountActual(actual1, actual2);
printf("    Union: %0.1f estimated, %.0f actual, Error = %sn",
estimateUnionCount,
actualUnionCount,
CalculateError(estimateUnionCount, actualUnionCount)
);

// intersection
float estimateIntersectionCount = IntersectionCountEstimate(estimate1, estimate2);
float actualIntersectionCount = IntersectionCountActual(actual1, actual2);
printf("    Intersection: %0.1f estimated, %.0f actual, Error = %sn",
estimateIntersectionCount,
actualIntersectionCount,
CalculateError(estimateIntersectionCount, actualIntersectionCount)
);

// jaccard index
float estimateJaccard = estimateIntersectionCount / estimateUnionCount;
float actualJaccard = actualIntersectionCount / actualUnionCount;
printf("    Jaccard Index: %0.4f estimated, %.4f actual, Error = %sn",
estimateJaccard,
actualJaccard,
CalculateError(estimateJaccard, actualJaccard)
);

// Contains Percent.
// What percentage of items in A are also in B?
float estimateSim = estimateIntersectionCount / estimate1.GetCountEstimation();
float actualSim = actualIntersectionCount / actual1.size();
printf("    Contains Percent: %0.2f%% estimated, %0.2f%% actual, Error = %sn",
100.0f*estimateSim,
100.0f*actualSim,
CalculateError(estimateSim, actualSim)
);
};

f("A", *estimateA, actualA, "B", *estimateB, actualB);
f("A", *estimateA, actualA, "C", *estimateC, actualC);
f("B", *estimateB, actualB, "C", *estimateC, actualC);
}

WaitForEnter();
return 0;
}
```

Here is the output of the above program:

## Want More?

As per usual, the rabbit hole goes far deeper than what I’ve shown. Check out the links below to go deeper!

# Count Min Sketch: A Probabilistic Histogram

Count min sketch is a probabilistic histogram that was invented in 2003 by Graham Cormode and S. Muthukrishnan.

It’s a histogram in that it can store objects (keys) and associated counts.

It’s probabilistic in that it lets you trade space and computation time for accuracy.

The count min sketch is pretty similar to a bloom filter, except that instead of storing a single bit to say whether an object is in the set, the count min sketch allows you to keep a count per object. You can read more about bloom filters here: Estimating Set Membership With a Bloom Filter.

It’s called a “sketch” because it’s a smaller summarization of a larger data set.

## Inserting an Item

The count min sketch is just a 2 dimensional array, with size of W x D. The actual data type in the array depends on how much storage space you want. It could be an unsigned char, it could be 4 bits, or it could be a uint64 (or larger!).

Each row (value of D) uses a different hash function to map objects to an index of W.

To insert an object, for each row you hash the object using that row’s hash function to get the W index for that object, then increment the count at that position.

In this blog post, I’m only going to be talking about being able to add items to the count min sketch. There are different rules / probabilities / etc for count min sketches that can have objects removed, but you can check out the links at the bottom of this post for more information about that!

## Getting an Item Count

When you want to get the count of how many times an item has been added to the count min sketch, you do a similar operation as when you insert.

For each row, you hash the object being asked about with that row’s hash function to get the W index and then get the value for that row at the W index.

This will give you D values and you just return the smallest one.

The reason for this is that, due to hash collisions, your count in a specific slot may have been incremented extra times by other objects hashing to the same slot. If you take the minimum value seen across all rows, you are taking the value that has suffered the fewest hash collisions, so it is the most accurate. It is also guaranteed to be greater than or equal to the actual answer, never lower.
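To make the insert and query rules concrete before the full program below, here's a minimal sketch of the structure. The way I derive a per-row hash here (mixing a row-dependent character into the input before hashing) is a quick illustration hack of my own; the full example code further down salts std::hash differently:

```#include <algorithm>
#include <array>
#include <cstdint>
#include <functional>
#include <string>

// A minimal count min sketch: D rows of W counters, one hash function per row.
template <size_t W, size_t D>
struct MinimalCMS
{
    std::array<std::array<uint32_t, W>, D> counts{};

    size_t Index (const std::string& item, size_t row) const
    {
        // derive a different hash per row by mixing the row index into the input
        return std::hash<std::string>()(item + char('A' + row)) % W;
    }

    void Insert (const std::string& item)
    {
        // increment one counter in every row
        for (size_t row = 0; row < D; ++row)
            counts[row][Index(item, row)]++;
    }

    uint32_t Count (const std::string& item) const
    {
        // the minimum across rows is the estimate with the fewest collisions
        uint32_t best = UINT32_MAX;
        for (size_t row = 0; row < D; ++row)
            best = std::min(best, counts[row][Index(item, row)]);
        return best;
    }
};
```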

## Dot Product (Inner Product)

If you read the last post on using dot product with histograms to gauge similarity, you might be wondering if you can do a dot product between two count min sketch objects.

Luckily yes, you can! They need to have the same W x D dimensions and they need to use the same hash functions per row, but if that’s true, you can calculate a dot product value very easily.

If you have two count min sketch objects A and B that you want to calculate the dot product for, you dot product each row (D index) of the two count min sketch objects. This leaves you with D dot products, and you just return the smallest one. The smallest value has the fewest hash collisions baked in (so is the most accurate), and the estimate is guaranteed to be greater than or equal to the actual answer, never lower.

## To Normalize Or Not To Normalize

There is a caveat here though with doing a dot product between two count min sketch objects. If you do a normalized dot product (normalize the vectors before doing a dot product, or dividing the answer by the length of the two vectors multiplied together), the guarantee that the dot product is greater than or equal to the true answer no longer holds!

The reason for this is that the formula for doing a normalized dot product is like this:
normalized dot product = dot(A,B) / (length(A)*length(B))

In a count min sketch, the dot(A,B) estimate is guaranteed to be greater than or equal to the true value.

The length of a vector is also guaranteed to be greater than or equal to the length of the true vector (the vector made from the actual histogram values).

This means that the numerator and the denominator BOTH have varying levels of overestimation in them. Overestimation in the numerator makes the normalized dot product estimate larger, while overestimation in the denominator makes the normalized dot product estimate smaller.

The result is that a normalized dot product estimate can make no guarantee about being greater than or equal to the true value!

This may or may not be a problem for your situation. Doing a dot product with unnormalized vectors still gives you a value that you can use to compare “similarity values” between histograms, but it has slightly different meaning than a dot product with normalized vectors.

Specifically, if the counts are much larger in one histogram versus another (such as when doing a dot product between multiple large text documents and a small search term string), the “weight” of the larger counts will count for more.

That means if you search for “apple pie”, a 100 page novel that mentions apples 10 times will be a better match than a 1/2 page recipe for apple pie!

When you normalize histograms, it makes it so the counts are “by percentage of the total document length”, which would help our search correctly find that the apple pie recipe is more relevant.

In other situations, you might want to let the higher count weigh stronger even though the occurrences are “less dense” in the document.

It really just depends on what your usage case is.

## Calculating W & D

There are two parameters (values) used when calculating the correct W and D dimensions of a count min sketch, for the desired accuracy levels. The parameters are ε (epsilon) and δ (delta).

ε (Epsilon) is “how much error is added to our counts with each item we add to the cm sketch”.

δ (Delta) is “with what probability do we want to allow the count estimate to be outside of our epsilon error rate”

To calculate W and D, you use these formulas:

W = ⌈e/ε⌉
D = ⌈ln (1/δ)⌉

Where ln is the natural log, and e is Euler’s number (approximately 2.718).
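For instance, asking for ε = 0.1 and δ = 0.01 gives W = ⌈2.718/0.1⌉ = 28 buckets and D = ⌈ln(1/0.01)⌉ = ⌈4.61⌉ = 5 hashes.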

## Accuracy Guarantees

When querying to get a count for a specific object (also called a “point query”) the accuracy guarantees are:

1. True Count <= Estimated Count
2. Estimated Count <= True Count + ε * Number Of Items Added
3. There is a δ chance that #2 is not true

When doing an unnormalized dot product, the accuracy guarantees are:

1. True Dot Product <= Estimated Dot Product
2. Estimated Dot Product <= True Dot Product + ε * Number Of Items Added To A * Number Of Items Added To B
3. There is a δ chance that #2 is not true

## Conservative Update

There is an alternate way to implement adding an item to the cm sketch, which results in provably less error. That technique is called a “Conservative Update”.

When doing a conservative update, you first look at the values in each row that you would normally increment and keep track of the smallest value that you’ve seen. You then only increment the counters that have that smallest value.

The reason this works is because we only look at the smallest value across all rows when doing a look up. So long as the smallest value across all rows increases when you insert an object, you’ve satisfied the requirements to make a look up return a value that is greater than or equal to the true value. The reason this conservative update results in less error is because you are writing to fewer values, which means that there are fewer hash collisions happening.

While this increases accuracy, it comes at the cost of extra logic and processing time needed when doing an update, which may or may not be appropriate for your needs.

## Example Runs

The example program is a lot like the program from the last post which implemented some search engine type functionality.

This program also does some count estimations, to show off that functionality as well.

The first run of the program is with normalized vectors, the second run of the program is with unnormalized vectors, and the third run of the program, which is most accurate, is with unnormalized vectors and conservative updates.

Third Run: Unnormalized Vectors, Conservative Updates

## Example Code

```#include <array>
#include <string>
#include <functional>
#include <unordered_map>
#include <set>
#include <vector>
#include <algorithm>
#include <memory>
#include <cmath>
#include <cstdio>
#include <cctype>

const float c_eulerConstant = (float)std::exp(1.0);

// The CCountMinSketch class
template <typename TKEY, typename TCOUNTTYPE, unsigned int NUMBUCKETS, unsigned int NUMHASHES, typename HASHER = std::hash<TKEY>>
class CCountMinSketch
{
public:
typedef CCountMinSketch TType;

CCountMinSketch ()
: m_countGrid { } // init counts to zero
, m_vectorLengthsDirty(true)
{ }

static const unsigned int c_numBuckets = NUMBUCKETS;
static const unsigned int c_numHashes = NUMHASHES;
typedef TCOUNTTYPE TCountType;

void AddItem (bool conservativeUpdate, const TKEY& item, const TCOUNTTYPE& count)
{
// this count min sketch is only supporting positive counts
if (count = 0!n");
return;
}

// remember that our vector lengths are inaccurate
m_vectorLengthsDirty = true;

// if doing a conservative update, only update the buckets that are necesary
if (conservativeUpdate)
{
// find what the lowest valued bucket is and calculate what our new lowest
// value should be
TCOUNTTYPE lowestValue = GetCount(item) + count;

// make sure every bucket has at least the lowest value it should have
size_t rawHash = HASHER()(item);
for (unsigned int i = 0; i < NUMHASHES; ++i)
{
size_t hash = std::hash<size_t>()(rawHash ^ std::hash<unsigned int>()(i));
TCOUNTTYPE value = m_countGrid[i][hash%NUMBUCKETS];
if (value < lowestValue)
m_countGrid[i][hash%NUMBUCKETS] = lowestValue;
}
}
// else do a normal update
else
{
// for each hash, find what bucket this item belongs in, and add the count to that bucket
size_t rawHash = HASHER()(item);
for (unsigned int i = 0; i < NUMHASHES; ++i)
{
size_t hash = std::hash<size_t>()(rawHash ^ std::hash<unsigned int>()(i));
m_countGrid[i][hash%NUMBUCKETS] += count;
}
}
}

TCOUNTTYPE GetCount (const TKEY& item)
{
// for each hash, get the value for this item, and return the smallest value seen
TCOUNTTYPE ret = 0;
size_t rawHash = HASHER()(item);
for (unsigned int i = 0; i < NUMHASHES; ++i)
{
size_t hash = std::hash<size_t>()(rawHash ^ std::hash<unsigned int>()(i));
if (i == 0 || ret > m_countGrid[i][hash%NUMBUCKETS])
ret = m_countGrid[i][hash%NUMBUCKETS];
}
return ret;
}

void CalculateVectorLengths ()
{
// if our vector lengths were previously calculated, no need to do anything
if (!m_vectorLengthsDirty)
return;

// calculate vector lengths of each hash
for (unsigned int hash = 0; hash < NUMHASHES; ++hash)
{
m_vectorLengths[hash] = 0.0f;
for (unsigned int bucket = 0; bucket < NUMBUCKETS; ++bucket)
m_vectorLengths[hash] += (float)m_countGrid[hash][bucket] * (float)m_countGrid[hash][bucket];
m_vectorLengths[hash] = sqrt(m_vectorLengths[hash]);
}

// remember that our vector lengths have been calculated
m_vectorLengthsDirty = false;
}

friend float HistogramDotProduct (TType& A, TType& B, bool normalize)
{
// make sure the vector lengths are accurate. No cost if they were previously calculated
A.CalculateVectorLengths();
B.CalculateVectorLengths();

// whatever hash has the smallest dot product is the most correct
float ret = 0.0f;
bool foundValidDP = false;
for (unsigned int hash = 0; hash < NUMHASHES; ++hash)
{
// if either vector length is zero, don't consider this dot product a valid result
// we cant normalize it, and it will be zero anyways
if (A.m_vectorLengths[hash] == 0.0f || B.m_vectorLengths[hash] == 0.0f)
continue;

// calculate the dot product of this row of the two count grids
float dp = 0.0f;
for (unsigned int bucket = 0; bucket < NUMBUCKETS; ++bucket)
dp += (float)A.m_countGrid[hash][bucket] * (float)B.m_countGrid[hash][bucket];

// normalize the dot product if we were asked to
if (normalize)
dp /= (A.m_vectorLengths[hash] * B.m_vectorLengths[hash]);

// keep the smallest dot product seen across all hashes
if (!foundValidDP || ret > dp)
{
ret = dp;
foundValidDP = true;
}
}
return ret;
}

private:
typedef std::array<TCOUNTTYPE, NUMBUCKETS> TBucketList;
typedef std::array<TBucketList, NUMHASHES> TTable;

TTable m_countGrid;
bool m_vectorLengthsDirty;
std::array<float, NUMHASHES> m_vectorLengths;
};

// Calculate ideal count min sketch parameters for your needs.
unsigned int CMSIdealNumBuckets (float error)
{
return (unsigned int)std::ceil((float)(c_eulerConstant / error));
}

unsigned int CMSIdealNumHashes (float probability)
{
return (unsigned int)std::ceil(log(1.0f / probability));
}

typedef std::string TKeyType;
typedef unsigned char TCountType;

// NOTE: the original template arguments were lost in formatting.  28 buckets and 5 hashes
// are plausible values, matching epsilon = 0.1 and delta = 0.01 per the formulas above.
typedef CCountMinSketch<TKeyType, TCountType, 28, 5> THistogramEstimate;
typedef std::unordered_map<TKeyType, TCountType> THistogramActual;

// These one paragraph stories are from http://birdisabsurd.blogspot.com/p/one-paragraph-stories.html

// The Dino Doc : http://birdisabsurd.blogspot.com/2011/11/dino-doc.html (97 words)
const char *g_storyA =
"The Dino Doc:n"
"Everything had gone according to plan, up 'til this moment. His design team "
"had done their job flawlessly, and the machine, still thrumming behind him, "
"a thing of another age, was settled on a bed of prehistoric moss. They'd "
"done it. But now, beyond the protection of the pod and facing an enormous "
"tyrannosaurus rex with dripping jaws, Professor Cho reflected that, had he "
"known of the dinosaur's presence, he wouldn't have left the Chronoculator - "
"and he certainly wouldn't have chosen "Stayin' Alive", by The Beegees, as "
"his dying soundtrack. Curse his MP3 player";

// The Robot: http://birdisabsurd.blogspot.com/2011/12/robot.html (121 words)
const char *g_storyB =
"The Robot:n"
"The engineer watched his robot working, admiring its sense of purpose.It knew "
"what it was, and what it had to do.It was designed to lift crates at one end "
"of the warehouse and take them to the opposite end.It would always do this, "
"never once complaining about its place in the world.It would never have to "
"agonize over its identity, never spend empty nights wondering if it had been "
"justified dropping a promising and soul - fulfilling music career just to "
"collect a bigger paycheck.And, watching his robot, the engineer decided that "
"the next big revolution in the robotics industry would be programming "
"automatons with a capacity for self - doubt.The engineer needed some company.";

// The Internet: http://birdisabsurd.blogspot.com/2011/11/internet.html (127 words)
const char *g_storyC =
"The Internet:n"
"One day, Sandra Krewsky lost her mind.Nobody now knows why, but it happened - "
"and when it did, Sandra decided to look at every page on the Internet, "
"insisting that she wouldn't eat, drink, sleep or even use the washroom until "
"the job was done. Traps set in her house stalled worried family members, and by "
"the time they trounced the alligator guarding her bedroom door - it managed to "
"snap her neighbour's finger clean off before going down - Sandra was already "
"lost… though the look of despair carved in her waxen features, and the cat "
"video running repeat on her flickering computer screen, told them everything "
"they needed to know.She'd seen too much. She'd learned that the Internet "
"played for keeps.";

void WaitForEnter ()
{
printf("nPress Enter to quit");
fflush(stdin);
getchar();
}

template <typename L>
void ForEachWord (const std::string &source, L& lambda)
{
size_t prev = 0;
size_t next = 0;

while ((next = source.find_first_of(" ,.-\":\n", prev)) != std::string::npos)
{
if ((next - prev != 0))
{
std::string word = source.substr(prev, next - prev);
std::transform(word.begin(), word.end(), word.begin(), ::tolower);
lambda(word);
}
prev = next + 1;
}

if (prev < source.size())
{
std::string word = source.substr(prev);
std::transform(word.begin(), word.end(), word.begin(), ::tolower);
lambda(word);
}
}

void PopulateHistogram (THistogramEstimate &histogram, const char *text, bool conservativeUpdate)
{
ForEachWord(text, [&](const std::string &word) {
histogram.AddItem(conservativeUpdate, word, 1);
});
}

void PopulateHistogram (THistogramActual &histogram, const char *text)
{
ForEachWord(text, [&histogram](const std::string &word) {
histogram[word]++;
});
}

float HistogramDotProduct (THistogramActual &A, THistogramActual &B, bool normalize)
{
// Get all the unique keys from both histograms
std::set<TKeyType> keysUnion;
std::for_each(A.cbegin(), A.cend(), [&keysUnion](const std::pair<const TKeyType, TCountType>& v)
{
keysUnion.insert(v.first);
});
std::for_each(B.cbegin(), B.cend(), [&keysUnion](const std::pair<const TKeyType, TCountType>& v)
{
keysUnion.insert(v.first);
});

// calculate and return the normalized dot product!
float dotProduct = 0.0f;
float lengthA = 0.0f;
float lengthB = 0.0f;
std::for_each(keysUnion.cbegin(), keysUnion.cend(),
[&A, &B, &dotProduct, &lengthA, &lengthB]
(const TKeyType& key)
{
// if the key isn't found in either histogram, ignore it, since 0 * x is
// always 0 anyhow.  Make sure to keep track of vector length though!
auto a = A.find(key);
auto b = B.find(key);

if (a != A.end())
lengthA += (float)(*a).second * (float)(*a).second;

if (b != B.end())
lengthB += (float)(*b).second * (float)(*b).second;

if (a == A.end())
return;

if (b == B.end())
return;

// calculate dot product
dotProduct += ((float)(*a).second * (float)(*b).second);
}
);

// if we don't need to normalize, return the unnormalized value we have right now
if (!normalize)
return dotProduct;

// normalize if we can
if (lengthA * lengthB <= 0.0f)
return 0.0f;

lengthA = sqrt(lengthA);
lengthB = sqrt(lengthB);
return dotProduct / (lengthA * lengthB);
}

template <typename T>
const char *CalculateError (const T&estimate, const T&actual)
{
float error = 100.0f * ((float)estimate - (float)actual) / (float)actual;
if (std::isnan(error) || std::isinf(error))
return "undef";

// bad practice to return a static local string, dont do this in production code!
static char ret[256];
sprintf(ret, "%0.2f%%", error);
return ret;
}

int main (int argc, char **argv)
{
// settings
const bool c_normalizeDotProducts = false;
const bool c_conservativeUpdate = true;

// show settings and implication
printf("Dot Products Normalized? %sn",
c_normalizeDotProducts
? "Yes! estimate could be  actual"
: "No! estimate <= actual");

c_conservativeUpdate
? "Yes! Reduced error"
: "No! normal error");

// populate our probabilistic histograms.
// Allocate memory for the objects so that we don't bust the stack for large histogram sizes!
std::unique_ptr<THistogramEstimate> TheDinoDocEstimate = std::make_unique<THistogramEstimate>();
std::unique_ptr<THistogramEstimate> TheRobotEstimate = std::make_unique<THistogramEstimate>();
std::unique_ptr<THistogramEstimate> TheInternetEstimate = std::make_unique<THistogramEstimate>();
PopulateHistogram(*TheDinoDocEstimate, g_storyA, c_conservativeUpdate);
PopulateHistogram(*TheRobotEstimate, g_storyB, c_conservativeUpdate);
PopulateHistogram(*TheInternetEstimate, g_storyC, c_conservativeUpdate);

// populate our actual count histograms for comparison
THistogramActual TheDinoDocActual;
THistogramActual TheRobotActual;
THistogramActual TheInternetActual;
PopulateHistogram(TheDinoDocActual, g_storyA);
PopulateHistogram(TheRobotActual, g_storyB);
PopulateHistogram(TheInternetActual, g_storyC);

// report whether B or C is a closer match for A
float dpABEstimate = HistogramDotProduct(*TheDinoDocEstimate, *TheRobotEstimate, c_normalizeDotProducts);
float dpACEstimate = HistogramDotProduct(*TheDinoDocEstimate, *TheInternetEstimate, c_normalizeDotProducts);
float dpABActual = HistogramDotProduct(TheDinoDocActual, TheRobotActual, c_normalizeDotProducts);
float dpACActual = HistogramDotProduct(TheDinoDocActual, TheInternetActual, c_normalizeDotProducts);
printf(""The Dino Doc" vs ...n");
printf("  "The Robot"    %0.4f (actual %0.4f) Error: %sn", dpABEstimate, dpABActual, CalculateError(dpABEstimate, dpABActual));
printf("  "The Internet" %0.4f (actual %0.4f) Error: %snn", dpACEstimate, dpACActual, CalculateError(dpACEstimate, dpACActual));
if (dpABEstimate > dpACEstimate)
printf("Estimate: "The Dino Doc" and "The Robot" are more similarn");
else
printf("Estimate: "The Dino Doc" and "The Internet" are more similarn");
if (dpABActual > dpACActual)
printf("Actual:   "The Dino Doc" and "The Robot" are more similarn");
else
printf("Actual:   "The Dino Doc" and "The Internet" are more similarn");

// let the user do a search engine style query for our stories!
char searchString[1024];
searchString[0] = 0;
scanf("%[^\n]", searchString);

struct SSearchResults
{
SSearchResults(const std::string& pageName, float rankingEstimated, float rankingActual)
: m_pageName(pageName)
, m_rankingEstimated(rankingEstimated)
, m_rankingActual(rankingActual)
{ }

bool operator < (const SSearchResults& other) const
{
return m_rankingEstimated > other.m_rankingEstimated;
}

std::string m_pageName;
float       m_rankingEstimated;
float       m_rankingActual;
};
std::vector<SSearchResults> results;

// perform our search and gather our results!
std::unique_ptr<THistogramEstimate> searchEstimate = std::make_unique<THistogramEstimate>();
THistogramActual searchActual;
PopulateHistogram(*searchEstimate, searchString, c_conservativeUpdate);
PopulateHistogram(searchActual, searchString);
results.push_back(
SSearchResults(
"The Dino Doc",
HistogramDotProduct(*TheDinoDocEstimate, *searchEstimate, c_normalizeDotProducts),
HistogramDotProduct(TheDinoDocActual, searchActual, c_normalizeDotProducts)
)
);
results.push_back(
SSearchResults(
"The Robot",
HistogramDotProduct(*TheRobotEstimate, *searchEstimate, c_normalizeDotProducts),
HistogramDotProduct(TheRobotActual, searchActual, c_normalizeDotProducts)
)
);
results.push_back(
SSearchResults(
"The Internet",
HistogramDotProduct(*TheInternetEstimate, *searchEstimate, c_normalizeDotProducts),
HistogramDotProduct(TheInternetActual, searchActual, c_normalizeDotProducts)
)
);
std::sort(results.begin(), results.end());

// show the search results
printf("nSearch results sorted by estimated relevance:n");
std::for_each(results.begin(), results.end(), [](const SSearchResults& result) {
printf("  "%s" : %0.4f (actual %0.4f) Error: %sn",
result.m_pageName.c_str(),
result.m_rankingEstimated,
result.m_rankingActual,
CalculateError(result.m_rankingEstimated, result.m_rankingActual)
);
});

// show counts of search terms in each story (estimated and actual)
printf("nEstimated counts of search terms in each story:n");
std::for_each(searchActual.cbegin(), searchActual.cend(), [&] (const std::pair& v)
{
// show key
printf(""%s"n", v.first.c_str());

// the dino doc
TCountType estimate = TheDinoDocEstimate->GetCount(v.first.c_str());
TCountType actual = 0;
auto it = TheDinoDocActual.find(v.first.c_str());
if (it != TheDinoDocActual.end())
actual = it->second;
printf("  "The Dino Doc" %u (actual %u) Error: %sn", estimate, actual, CalculateError(estimate, actual));

// the robot
estimate = TheRobotEstimate->GetCount(v.first.c_str());
actual = 0;
it = TheRobotActual.find(v.first.c_str());
if (it != TheRobotActual.end())
actual = it->second;
printf("  "The Robot"    %u (actual %u) Error: %sn", estimate, actual, CalculateError(estimate, actual));

// the internet
estimate = TheInternetEstimate->GetCount(v.first.c_str());
actual = 0;
it = TheInternetActual.find(v.first.c_str());
if (it != TheInternetActual.end())
actual = it->second;
printf("  "The Internet" %u (actual %u) Error: %sn", estimate, actual, CalculateError(estimate, actual));
});

// show memory use
printf("nThe above used %u buckets and %u hashes with %u bytes per countn",
THistogramEstimate::c_numBuckets, THistogramEstimate::c_numHashes, sizeof(THistogramEstimate::TCountType));
printf("Totaling %u bytes of storage for each histogramnn",
THistogramEstimate::c_numBuckets * THistogramEstimate::c_numHashes * sizeof(THistogramEstimate::TCountType));

// show a probabilistic suggestion
float error = 0.1f;
float probability = 0.01f;
printf("You should use %u buckets and %u hashes for...n", CMSIdealNumBuckets(error), CMSIdealNumHashes(probability));
printf("true count <= estimated count <= true count + %0.2f * Items ProcessednWith probability %0.2f%%n", error, (1.0f - probability)*100.0f);

WaitForEnter();
return 0;
}
```

If you use this in production code, you should probably use a better quality hash.

The rabbit hole on this stuff goes deeper, so if you want to know more, check out these links!

Next up I’ll be writing about hyperloglog, which does the same thing as KMV (K-Minimum Values) but is better at it!

# Writing a Basic Search Engine AKA Calculating Similarity of Histograms with Dot Product

I came across this technique while doing some research for the next post and found it so interesting that it seemed to warrant its own post.

## Histograms and Multisets

Firstly, “histogram” is just a fancy word for a list of items that have an associated count.

If I were to make a histogram of the contents of my office, I would end up with something like:

• Books = 20
• Phone = 1
• Sombrero = 1 (thanks to the roaming tequila cart, but that’s another story…)
• VideoGames = 15
• (and so on)

Another term for a histogram is multiset, so if you see that word, just think of it as being the same thing.

## Quick Dot Product Refresher

Just to make sure we are on the same page, using dot product to get the angle between two vectors is as follows:

cos(theta) = (A * B) / (||A||*||B||)

Or in coder-eese, if A and B are vectors of any dimension:

cosTheta = dot(A,B) / (length(A)*length(B))

To actually do the “dot” portion, you just multiply the X’s by the X’s, the Y’s by the Y’s, the Z’s by the Z’s etc, and add them all up to get a single scalar value. For a 3d vector it would look like this:

dot(A,B) = A.x*B.x + A.y*B.y + A.z*B.z

The value from the formulas above will tell you how similar the direction of the two vectors are.

If the value is 1, that means they are pointing the exact same way. If the value is 0, that means they are perpendicular. If the value is -1, that means they are pointing the exact opposite way.

Note that you can forgo the division by lengths in that formula, and just look at whether the result is positive, negative, or zero, if that’s enough information for your needs. We’ll be using the full on normalized vector version above in this post today though.
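As a quick sketch of those formulas in code (for the 3d case):

```#include <cmath>

struct Vec3 { float x, y, z; };

// multiply component by component and sum, per the formula above
float Dot (const Vec3& a, const Vec3& b)
{
    return a.x*b.x + a.y*b.y + a.z*b.z;
}

// cosTheta = dot(A,B) / (length(A)*length(B))
float CosTheta (const Vec3& a, const Vec3& b)
{
    return Dot(a, b) / (std::sqrt(Dot(a, a)) * std::sqrt(Dot(b, b)));
}
```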

For a deeper refresher on dot product check out this link:
Wikipedia: Dot Product

## Histogram Dot Products

Believe it or not, if you treat the counts in a histogram as an N dimensional vector – where N is the number of categories in the histogram – you can use the dot product to gauge similarity between the contents of two histograms by using it to get the angular difference between the vectors!

In the normal usage case, histograms have counts that are >= 0, which ends up meaning that two histogram vectors can only be up to 90 degrees apart. That ends up meaning that the result of the dot product of these normalized vectors is going to be from 0 to 1, where 0 means they have nothing in common, and 1 means they are completely the same.

This is similar to the Jaccard Index mentioned in previous posts, but is different. In fact, this value isn’t even linear (although maybe putting it through acos and dividing by pi/2 may make it suitably linear!), as it represents the cosine of an angle, not a percentage of similarity. It’s still useful though. If you have histogram A and are trying to see if histogram B or C is a closer match to A, you can calculate this value for the A/B pair and the A/C pair. The one with the higher value is the more similar pairing.

Another thing to note about this technique is that order doesn’t matter. If you are trying to compare two multisets where order matters, you are going to have to find a different algorithm or try to pull some shenanigans – such as perhaps weighting the items in the multiset based on the order they were in.

## Examples

Let’s say we have two bags of fruit and we want to know how similar the bags are. The bags in this case represent histograms / multisets.

In the first bag, we have 3 apples. In the second bag, we have 2 oranges.

If we have a 2 dimensional vector where the first component is apples and the second component is oranges, we can represent our bags of fruit with these vectors:

Bag1 = [3, 0]
Bag2 = [0, 2]

Now, let’s dot product the vectors:

Dot Product = Bag1.Apples * Bag2.Apples + Bag1.Oranges * Bag2.Oranges
= 3*0 + 0*2
= 0

We would normally divide our answer by the length of the Bag1 vector and then the length of the Bag2 vector, but since it’s zero we know that won’t change the value.

From this, we can see that Bag1 and Bag2 have nothing in common!

What if we added an apple to bag 2 though? Let’s do that and try the process again.

Bag1 = [3,0]
Bag2 = [1,2]
Dot Product = Bag1.Apples * Bag2.Apples + Bag1.Oranges * Bag2.Oranges
= 3*1 + 0*2
= 3

Next up, we need to divide the answer by the length of our Bag1 vector, which is 3, multiplied by the length of our Bag2 vector, which is the square root of 5.

Cosine Angle (Similarity Value) = 3 / (3 * sqrt(5))
= ~0.45

Let’s add an orange to bag 1. That ought to make it more similar to bag 2, so it should increase our similarity value of the two bags. Let’s find out if that’s true.

Bag1 = [3,1]
Bag2 = [1,2]
Dot Product = Bag1.Apples * Bag2.Apples + Bag1.Oranges * Bag2.Oranges
= 3*1 + 1*2
= 5

Next, we need to divide that answer by the length of the bag 1 vector which is the square root of 10, multiplied by the length of the bag 2 vector which is the square root of 5.

Cosine Angle (Similarity Value) = 5 / (sqrt(10) * sqrt(5))
= ~0.71

So yep, adding an orange to bag 1 made the two bags more similar!

## Example Code

Here’s a piece of code that uses the dot product technique to be able to gauge the similarity of text. It uses this to compare larger blocks of text, and also uses it to allow you to search the blocks of text like a search engine!

```#include <unordered_map>
#include <string>
#include <set>
#include <vector>
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <cctype>

// word -> count.  (The original template arguments were lost in formatting; unsigned int counts are a safe choice.)
typedef std::unordered_map<std::string, unsigned int> THistogram;

// These one paragraph stories are from http://birdisabsurd.blogspot.com/p/one-paragraph-stories.html

// The Dino Doc : http://birdisabsurd.blogspot.com/2011/11/dino-doc.html
const char *g_storyA =
"The Dino Doc:n"
"Everything had gone according to plan, up 'til this moment. His design team "
"had done their job flawlessly, and the machine, still thrumming behind him, "
"a thing of another age, was settled on a bed of prehistoric moss. They'd "
"done it. But now, beyond the protection of the pod and facing an enormous "
"tyrannosaurus rex with dripping jaws, Professor Cho reflected that, had he "
"known of the dinosaur's presence, he wouldn't have left the Chronoculator - "
"and he certainly wouldn't have chosen "Stayin' Alive", by The Beegees, as "
"his dying soundtrack. Curse his MP3 player";

// The Robot: http://birdisabsurd.blogspot.com/2011/12/robot.html
const char *g_storyB =
"The Robot:n"
"The engineer watched his robot working, admiring its sense of purpose.It knew "
"what it was, and what it had to do.It was designed to lift crates at one end "
"of the warehouse and take them to the opposite end.It would always do this, "
"never once complaining about its place in the world.It would never have to "
"agonize over its identity, never spend empty nights wondering if it had been "
"justified dropping a promising and soul - fulfilling music career just to "
"collect a bigger paycheck.And, watching his robot, the engineer decided that "
"the next big revolution in the robotics industry would be programming "
"automatons with a capacity for self - doubt.The engineer needed some company.";

// The Internet: http://birdisabsurd.blogspot.com/2011/11/internet.html
const char *g_storyC =
"The Internet:n"
"One day, Sandra Krewsky lost her mind.Nobody now knows why, but it happened - "
"and when it did, Sandra decided to look at every page on the Internet, "
"insisting that she wouldn't eat, drink, sleep or even use the washroom until "
"the job was done. Traps set in her house stalled worried family members, and by "
"the time they trounced the alligator guarding her bedroom door - it managed to "
"snap her neighbour's finger clean off before going down - Sandra was already "
"lost… though the look of despair carved in her waxen features, and the cat "
"video running repeat on her flickering computer screen, told them everything "
"they needed to know.She'd seen too much. She'd learned that the Internet "
"played for keeps.";

void WaitForEnter ()
{
printf("nPress Enter to quit");
fflush(stdin);
getchar();
}

template <typename L>
void ForEachWord (const std::string &source, L& lambda)
{
size_t prev = 0;
size_t next = 0;

while ((next = source.find_first_of(" ,.-\":\n", prev)) != std::string::npos)
{
if ((next - prev != 0))
{
std::string word = source.substr(prev, next - prev);
std::transform(word.begin(), word.end(), word.begin(), ::tolower);
lambda(word);
}
prev = next + 1;
}

if (prev < source.size())
{
std::string word = source.substr(prev);
std::transform(word.begin(), word.end(), word.begin(), ::tolower);
lambda(word);
}
}

void PopulateHistogram (THistogram &histogram, const char *text)
{
ForEachWord(text, [&histogram](const std::string &word) {
histogram[word] ++;
});
}

float HistogramDotProduct (THistogram &A, THistogram &B)
{
// Get all the unique keys from both histograms
std::set<std::string> keysUnion;
std::for_each(A.cbegin(), A.cend(), [&keysUnion](const std::pair<const std::string, unsigned int>& v)
{
keysUnion.insert(v.first);
});
std::for_each(B.cbegin(), B.cend(), [&keysUnion](const std::pair<const std::string, unsigned int>& v)
{
keysUnion.insert(v.first);
});

// calculate and return the normalized dot product!
float dotProduct = 0.0f;
float lengthA = 0.0f;
float lengthB = 0.0f;
std::for_each(keysUnion.cbegin(), keysUnion.cend(),
[&A, &B, &dotProduct, &lengthA, &lengthB]
(const std::string& key)
{
// if the key isn't found in either histogram, ignore it, since 0 * x is
// always 0 anyhow.  Make sure to keep track of vector length though!
auto a = A.find(key);
auto b = B.find(key);

if (a != A.end())
lengthA += (float)(*a).second * (float)(*a).second;

if (b != B.end())
lengthB += (float)(*b).second * (float)(*b).second;

if (a == A.end())
return;

if (b == B.end())
return;

// calculate dot product
dotProduct += ((float)(*a).second * (float)(*b).second);
}
);

// normalize if we can
if (lengthA * lengthB <= 0.0f)
return 0.0f;

lengthA = sqrt(lengthA);
lengthB = sqrt(lengthB);
return dotProduct / (lengthA * lengthB);
}

int main (int argc, char **argv)
{
// populate our histograms
THistogram TheDinoDoc;
THistogram TheRobot;
THistogram TheInternet;
PopulateHistogram(TheDinoDoc, g_storyA);
PopulateHistogram(TheRobot, g_storyB);
PopulateHistogram(TheInternet, g_storyC);

// report whether story B or story C is a closer match for story A
float dpAB = HistogramDotProduct(TheDinoDoc, TheRobot);
float dpAC = HistogramDotProduct(TheDinoDoc, TheInternet);
if (dpAB > dpAC)
printf("\"The Dino Doc\" and \"The Robot\" are more similar\n");
else
printf("\"The Dino Doc\" and \"The Internet\" are more similar\n");

// let the user do a search engine style query for our stories!
char searchString[1024];
scanf("%[^\n]", searchString);

// preform our search and gather our results!
THistogram search;
PopulateHistogram(search, searchString);

struct SSearchResults
{
SSearchResults (const std::string& pageName, float ranking, const char* pageContent)
: m_pageName(pageName)
, m_ranking(ranking)
, m_pageContent(pageContent)
{ }

bool operator < (const SSearchResults& other) const
{
return m_ranking > other.m_ranking;
}

std::string m_pageName;
float       m_ranking;
const char* m_pageContent;
};
std::vector<SSearchResults> results;

results.push_back(SSearchResults("The Dino Doc", HistogramDotProduct(TheDinoDoc, search), g_storyA));
results.push_back(SSearchResults("The Robot", HistogramDotProduct(TheRobot, search), g_storyB));
results.push_back(SSearchResults("The Internet", HistogramDotProduct(TheInternet, search), g_storyC));
std::sort(results.begin(), results.end());

// show the search results
printf("nSearch results sorted by relevance:n");
std::for_each(results.begin(), results.end(), [] (const SSearchResults& result) {
printf("  "%s" : %0.4fn", result.m_pageName.c_str(), result.m_ranking);
});

// show the most relevant result
printf("n-----Best Result-----n%sn",results[0].m_pageContent);

WaitForEnter();
return 0;
}
```

Here is an example run of this program:

## Improvements

Our “search engine” code does work but is pretty limited. It doesn’t know that “cat” and “Kitty” are basically the same, and also doesn’t know that the words “and”, “the” & “it” are not important.

I’m definitely not an expert in these matters, but to solve the first problem you might try making a “thesaurus” lookup table, where maybe whenever it sees “kitten”, “kitty”, “feline”, it translates it to “cat” before putting it into the histogram. That would make it smarter at realizing all those words mean the same thing.

For the second problem, there is a strange technique called tf–idf that seems to work pretty well, although people haven’t really been able to rigorously figure out why it works so well. Check out the link to it at the bottom of this post.

Besides just using this technique for text comparisons, I’ve read mentions that this histogram dot product technique is used in places like machine vision and object classification.

It is a pretty neat technique 😛

Wikipedia: Cosine Similarity – The common name for this technique
Easy Bar Chart Creator – Used to make the example graphs
Wikipedia: tf-idf – Used to automagically figure out which words in a document are important and which aren’t

# Estimating Set Membership With a Bloom Filter

Have you ever found yourself in a situation where you needed to keep track of whether or not you’ve seen an item, but the number of items you have to keep track of is either gigantic or completely unbounded?

This comes up a lot in “massive data” situations (like internet search engines), but it can also come up in games sometimes – like maybe having a monster in an MMO game remember which players have tried to attack it before so it can be hostile when it sees them again.

One solution to the game example could be to keep a list of all players the monster had been attacked by, but that will get huge and eat up a lot of memory and take time to search the list for a specific player.

You might instead decide you only want to keep the last 20 players the monster had been attacked by, but that still uses up a good chunk of memory if you have a lot of different monsters tracking this information for themselves, and 20 may be so low, that the monster will forget people who commonly attack it by the time it sees them again.

## Enter The Bloom Filter

There is actually an interesting solution to this problem, using a probabilistic data structure called a bloom filter – and before you ask, no, it has nothing to do with graphics. It was invented by a guy named Burton Howard Bloom in 1970.

A bloom filter basically has M number of bits as storage and K number of hashes.

To illustrate an example, let’s say that we have 10 bits and only 1 hash.

When we insert an item, what we want to do is hash that item and then modulus that hash value by 10 to get a pseudo-random (but deterministic) value 0-9. That value is the index of the bit we want to set to true. After we set that bit to true, we can consider that item inserted into the set.

When we want to ask if an item exists in a given bloom filter, what we do is again hash the item and modulus it against 10 to get the 0-9 value again. If that bit is set to false, we can say that item is NOT in the set with certainty, but if it’s true we can only say it MIGHT be in the set.

If the bit is true, we can’t say for sure that it’s part of the set, because something else may have hashed to the same bit and set it, so we have to consider it a maybe, not a definite yes. In fact, with only 10 bits and 1 hash, even a good hash function gives any two different items a 10% chance of landing on the same bit.
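
Here is a minimal sketch of that 10 bit, 1 hash example in code (the names are mine, and std::hash stands in for whatever hash function you'd actually use):

```
#include <bitset>
#include <functional>
#include <string>

// The 10 bit, 1 hash bloom filter described above.
struct TinyBloomFilter
{
    void Insert (const std::string& item)
    {
        m_bits[std::hash<std::string>()(item) % 10] = true;
    }

    // false means definitely NOT in the set; true only means MAYBE.
    bool MaybeContains (const std::string& item) const
    {
        return m_bits[std::hash<std::string>()(item) % 10];
    }

    std::bitset<10> m_bits;
};
```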

To be a little more sure of our “maybe” being a yes, and not a false positive, we can do two things… we can increase the number of bits we use for the set (which will reduce collisions), or we can increase the number of hashes we do (although there comes a point where adding more hashes starts decreasing accuracy again).

If we add a second hash, all that means is that we will do a second hash, get a second 0-9 value, and write a one to that bit AS WELL when we insert an item. When checking if an item exists in a set, we also READ that bit to see if it’s true.

For N hashes, when inserting an item into a bloom filter, you get (up to) N different bits that you need to set to 1.

When checking if an item exists in a bloom filter, you get (up to) N different bits, all of which need to be 1 to consider the item in the set.

There are a few mathematical formulas you can use to figure out how many bits and hashes you would want to use to get a desired reliability that your maybes are in fact yeses, and also there are some formulas for figuring out the probability that your maybe is in fact a yes for any given bloom filter in a runtime state.

Depending on your needs and usage case, you may treat your “maybe” as a “yes” and move on, or you could treat a “maybe” as a time when you need to do a more expensive test (like, a disk read?). In those situations, the “no” is the valuable answer, since it means you can skip the more expensive test.

## Estimating Item Count In Bloom Filters

Besides being able to test an item for membership, you can also estimate how many items are in any given bloom filter:

EstimatedCount = -(NumBits * ln(1 – BitsOn / NumBits)) / NumHashes

Where BitsOn is the count of how many bits are set to 1 and ln is natural log.
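
For example, a filter with 80 bits and 5 hashes that has 30 bits turned on would estimate -(80 * ln(1 - 30/80)) / 5, which comes out to about 7.5 items.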

## Set Operations On Bloom Filters

If you have two bloom filters, you can do some interesting set operations on them.

Union

One operation you can do with two bloom filters is union them, which means that if you have a bloom filter A and bloom filter B, you end up with a third bloom filter C that contains all the unique items from both A and B.

Besides being able to test this third bloom filter to see if it contains specific items, you can also ask it for an estimated count of objects it contains, which is useful for helping figure out how similar the sets are.

Essentially, if A estimates having 50 items and B estimates having 50 items, and their union, C, estimates having 51 items, that means that the items in A and B were almost all the same (probably).

To union two bloom filters, you just do a bitwise OR on every bit in A and B to get the bits for C. Simple and fast to calculate.

Intersection

Another operation you can do with two bloom filters is to calculate their intersection. The intersection of two sets is just the items that the two sets have in common.

Again, besides being able to test this third bloom filter for membership of items, you can also use it to get an estimated count of objects to help figure out how similar the sets are.

Similarly to the union, you can reason that if the intersection count is small compared to the counts of the individual bloom filters that went into it, the sets were probably not very similar.

There are two ways you can calculate the intersection of two bloom filters.

The first way is to do a bitwise AND on every bit in A and B to get the bits for C. Again, super simple and fast to calculate.

The other way just involves some simple math:

Count(Intersection(A,B)) = (Count(A) + Count(B)) – Count(Union(A,B))
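
Using the numbers from the union example above: if A estimates 50 items, B estimates 50 items, and their union estimates 51 items, then the estimated intersection is (50 + 50) - 51 = 49 items, which again says the two sets are nearly identical.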

Whichever method is better depends on your needs and your usage case.

Jaccard Index

Just like I talked about in the KMV post two posts ago, you can also calculate an estimated Jaccard Index for two bloom filters, which will give you a value between 0 and 1 that tells you how similar two sets are.

If the Jaccard index is 1, that means the sets are the same and contain all the same items. If the Jaccard index is 0, that means the sets are completely different. If the Jaccard index is 0.5 that means that they have half of their items in common.

To calculate the estimated Jaccard Index, you just use this simple formula:

Jaccard Index = Count(Intersection(A,B)) / Count(Union(A,B))

## Estimating False Positive Rate

The more items you insert into a bloom filter the higher the false positive rate gets, until it gets to 100% which means all the bits are set and if you ask if it contains an item, it will never say no.

To combat this, you may want to calculate your error rate at runtime and maybe spit out a warning if the error rate starts getting too high, so that you know you need to adjust the number of bits or number of hashes, or at very least you can alert the user that the reliability of the answers has degraded.

Here is the formula to calculate the false positive rate of a bloom filter at runtime:

ErrorRate = (1 – e^(-NumHashes * NumItems / NumBits))^NumHashes

NumItems is the number of unique items that have been inserted into the bloom filter. If you know that exact value somehow you can use that real value, but in the more likely case, you won’t know the exact value, so you can use the estimated unique item count formula described above to get an estimated unique count.
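
As a worked example, a filter with 80 bits and 5 hashes holding 10 items would have an error rate of (1 - e^(-5 * 10 / 80))^5, which is roughly 2.2%.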

## Managing the Error Rate of False Positives

As I mentioned above, there are formulas to figure out how many bits and how many hashes you need, to store a certain number of items with a specific maximum error rate.

You can work this out in advance by figuring out about how many items you expect to see.

First, you calculate the ideal bit count:

NumBits = – (NumItems * ln(DesiredFalsePositiveProbability)) / (ln(2)^2)

Where NumItems is the number of items you expect to put into the bloom filter, and DesiredFalsePositiveProbability is the error rate you want when the bloom filter has the expected number of items in it. The error rate will be lower until the item count reaches NumItems.

Next, you calculate the ideal number of hashes:

NumHashes = (NumBits / NumItems) * ln(2)

Then, you just create your bloom filter, using the specified number of bits and hashes.
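
As a worked example, for 10 expected items and a 5% desired false positive rate: NumBits = -(10 * ln(0.05)) / (ln(2)^2) ≈ 62.4 bits (about 8 bytes), and NumHashes = (62.4 / 10) * ln(2) ≈ 4.3 hashes.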

## Example Code

Here is some example bloom filter c++ code with all the bells and whistles. Instead of using multiple hashes, I just grabbed some random numbers and xor them against the hash of each item to make more deterministic but pseudo-random numbers (that’s what a hash does too after all). If you want to use a bloom filter in a more serious usage case, you may want to actually use multiple hashes, and you probably want to use a better hashing algorithm than std::hash.

```#include <stdio.h>
#include <stdint.h>
#include <cmath>
#include <array>
#include <set>
#include <string>
#include <algorithm>
#include <functional>

// If you get a compile error here, remove the "class" from this enum definition.
// It just means your compiler doesn't support enum classes yet.
enum class EHasItem
{
e_no,
e_maybe
};

// The CBloomFilter class
template <typename T, unsigned int NUMBYTES, unsigned int NUMHASHES, typename HASHER = std::hash<T>>
class CBloomFilter
{
public:
// constants
static const unsigned int c_numHashes = NUMHASHES;
static const unsigned int c_numBits = NUMBYTES*8;
static const unsigned int c_numBytes = NUMBYTES;

// friends
template <typename T2, unsigned int NUMBYTES2, unsigned int NUMHASHES2, typename HASHER2>
friend float UnionCountEstimate (const CBloomFilter<T2, NUMBYTES2, NUMHASHES2, HASHER2>& left, const CBloomFilter<T2, NUMBYTES2, NUMHASHES2, HASHER2>& right);
template <typename T2, unsigned int NUMBYTES2, unsigned int NUMHASHES2, typename HASHER2>
friend float IntersectionCountEstimate (const CBloomFilter<T2, NUMBYTES2, NUMHASHES2, HASHER2>& left, const CBloomFilter<T2, NUMBYTES2, NUMHASHES2, HASHER2>& right);

// constructor
CBloomFilter (const std::array<size_t, NUMHASHES>& randomSalt)
: m_storage()              // initialize bits to zero
, m_randomSalt(randomSalt) // store off the random salt to use for the hashes
{ }

// interface
void AddItem (const T& item)
{
const size_t rawItemHash = HASHER()(item);
for (unsigned int index = 0; index < c_numHashes; ++index)
{
const size_t bitIndex = (rawItemHash ^ m_randomSalt[index])%c_numBits;
SetBitOn(bitIndex);
}
}

EHasItem HasItem (const T& item) const
{
const size_t rawItemHash = HASHER()(item);
for (unsigned int index = 0; index < c_numHashes; ++index)
{
const size_t bitIndex = (rawItemHash ^ m_randomSalt[index])%c_numBits;
if (!IsBitOn(bitIndex))
return EHasItem::e_no;
}
return EHasItem::e_maybe;
}

// probabilistic interface
float CountEstimate () const
{
// estimates how many items are actually stored in this set, based on how many
// bits are set to true, how many bits there are total, and how many hashes
// are done for each item.
return -((float)c_numBits * std::log(1.0f - ((float)CountBitsOn() / (float)c_numBits))) / (float)c_numHashes;
}

float FalsePositiveProbability (size_t numItems = -1) const
{
// calculates the expected error.  Since this relies on how many items are in
// the set, you can either pass in the number of items, or use the default
// argument value, which means to use the estimated item count
float numItemsf = numItems == -1 ? CountEstimate() : (float)numItems;
return pow(1.0f - std::exp(-(float)c_numHashes * numItemsf / (float)c_numBits),(float)c_numHashes);
}

private:

inline void SetBitOn (size_t bitIndex)
{
const size_t byteIndex = bitIndex / 8;
const uint8_t byteValue = 1 << (bitIndex%8);
m_storage[byteIndex] |= byteValue;
}

inline bool IsBitOn (size_t bitIndex) const
{
const size_t byteIndex = bitIndex / 8;
const uint8_t byteValue = 1 << (bitIndex%8);
return ((m_storage[byteIndex] & byteValue) != 0);
}

size_t CountBitsOn () const
{
// likely a more efficient way to do this but ::shrug::
size_t count = 0;
for (size_t index = 0; index < c_numBits; ++index)
{
if (IsBitOn(index))
++count;
}
return count;
}

// storage of bits
std::array<uint8_t, NUMBYTES> m_storage;

// Storage of random salt values
// It could be neat to use constexpr and __TIME__ to make compile time random numbers.
// That isn't available til like c++17 or something though sadly.
const std::array<size_t, NUMHASHES>& m_randomSalt;
};

// helper functions
template <typename T, unsigned int NUMBYTES, unsigned int NUMHASHES, typename HASHER>
float UnionCountEstimate (const CBloomFilter<T, NUMBYTES, NUMHASHES, HASHER>& left, const CBloomFilter<T, NUMBYTES, NUMHASHES, HASHER>& right)
{
// returns an estimated count of the unique items if both lists were combined
// example: (1,2,3) union (2,3,4) = (1,2,3,4) which has a union count of 4
CBloomFilter<T, NUMBYTES, NUMHASHES, HASHER> temp(left.m_randomSalt);
for (unsigned int index = 0; index < left.c_numBytes; ++index)
temp.m_storage[index] = left.m_storage[index] | right.m_storage[index];

return temp.CountEstimate();
}

template <typename T, unsigned int NUMBYTES, unsigned int NUMHASHES, typename HASHER>
float IntersectionCountEstimate (const CBloomFilter<T, NUMBYTES, NUMHASHES, HASHER>& left, const CBloomFilter<T, NUMBYTES, NUMHASHES, HASHER>& right)
{
// returns an estimated count of the number of unique items that are shared in both sets
// example: (1,2,3) intersection (2,3,4) = (2,3) which has an intersection count of 2
CBloomFilter<T, NUMBYTES, NUMHASHES, HASHER> temp(left.m_randomSalt);
for (unsigned int index = 0; index < left.c_numBytes; ++index)
temp.m_storage[index] = left.m_storage[index] & right.m_storage[index];

return temp.CountEstimate();
}

float IdealBitCount (unsigned int numItemsInserted, float desiredFalsePositiveProbability)
{
// given how many items you plan to insert, and a target false positive probability at that count, this returns how many bits
// of flags you should use.
return (float)-(((float)numItemsInserted*log(desiredFalsePositiveProbability)) / (log(2)*log(2)));
}

float IdealHashCount (unsigned int numBits, unsigned int numItemsInserted)
{
// given how many bits you are using for storage, and how many items you plan to insert, this is the optimal number of hashes to use
return ((float)numBits / (float)numItemsInserted) * (float)log(2.0f);
}

// random numbers from random.org
// https://www.random.org/cgi-bin/randbyte?nbytes=40&format=h
// in 64 bit mode, size_t is 64 bits, not 32.  The random numbers below will be all zero in the upper 32 bits!
static const std::array<size_t, 10> s_randomSalt =
{
0x6ff3f8ef,
0x9b565007,
0xea709ce4,
0xf7d5cbc7,
0xcb7e38e1,
0xd54b5323,
0xbf679080,
0x7fb78dee,
0x540c9e8a,
0x89369800
};

// data for adding and testing in our list
static const char *s_dataList1[] =
{
"hello!",
"blah!",
"moo",
nullptr
};

static const char *s_dataList2[] =
{
"moo",
"hello!",
"mooz",
"kitty",
"here is a longer string just cause",
nullptr
};

static const char *s_askList[] =
{
"moo",
"hello!",
"unf",
"boom",
"kitty",
"mooz",
"blah!",
nullptr
};

// driver program
void WaitForEnter ()
{
printf("nPress Enter to quit");
fflush(stdin);
getchar();
}

void main(void)
{
CBloomFilter<std::string, 10, 10> set1(s_randomSalt); // sizes assumed: 10 bytes of flags; 10 hashes to match s_randomSalt's size
CBloomFilter<std::string, 10, 10> set2(s_randomSalt);
std::set<std::string> actualSet1;
std::set<std::string> actualSet2;

printf("Creating 2 bloom filter sets with %u bytes of flags (%u bits), and %u hashes.nn", set1.c_numBytes, set1.c_numBits, set1.c_numHashes);

// create data set 1
unsigned int index = 0;
while (s_dataList1[index] != nullptr)
{
printf("Adding to set 1: "%s"n", s_dataList1[index]);
actualSet1.insert(s_dataList1[index]);
index++;
}

// create data set 2
printf("n");
index = 0;
while (s_dataList2[index] != nullptr)
{
printf("Adding to set 2: "%s"n", s_dataList2[index]);
actualSet2.insert(s_dataList2[index]);
index++;
}

// query each set to see if they think that they contain various items
printf("n");
index = 0;
while (s_askList[index] != nullptr)
{
printf("Exists: \"%s\"? %s & %s (actually %s & %s)\n",
s_askList[index],
set1.HasItem(s_askList[index]) == EHasItem::e_maybe ? "maybe" : "no",
set2.HasItem(s_askList[index]) == EHasItem::e_maybe ? "maybe" : "no",
actualSet1.find(s_askList[index]) != actualSet1.end() ? "yes" : "no",
actualSet2.find(s_askList[index]) != actualSet2.end() ? "yes" : "no");
index++;
}

// show false positive rates
printf ("nFalse postive probability = %0.2f%% & %0.2f%%n", set1.FalsePositiveProbability()*100.0f, set2.FalsePositiveProbability()*100.0f);
printf ("False postive probability at 10 items = %0.2f%%n", set1.FalsePositiveProbability(10)*100.0f);
printf ("False postive probability at 25 items = %0.2f%%n", set1.FalsePositiveProbability(25)*100.0f);
printf ("False postive probability at 50 items = %0.2f%%n", set1.FalsePositiveProbability(50)*100.0f);
printf ("False postive probability at 100 items = %0.2f%%n", set1.FalsePositiveProbability(100)*100.0f);

// show ideal bit counts and hashes.
const unsigned int itemsInserted = 10;
const float desiredFalsePositiveProbability = 0.05f;
const float idealBitCount = IdealBitCount(itemsInserted, desiredFalsePositiveProbability);
const float idealHashCount = IdealHashCount((unsigned int)idealBitCount, itemsInserted);
printf("nFor %u items inserted and a desired false probability of %0.2f%%nYou should use %0.2f bits of storage and %0.2f hashesn",
itemsInserted, desiredFalsePositiveProbability*100.0f, idealBitCount, idealHashCount);

// get the actual union
std::set<std::string> actualUnion;
std::for_each(actualSet1.begin(), actualSet1.end(), [&actualUnion] (const std::string& s) {
actualUnion.insert(s);
});
std::for_each(actualSet2.begin(), actualSet2.end(), [&actualUnion] (const std::string& s) {
actualUnion.insert(s);
});

// get the actual intersection
std::set<std::string> actualIntersection;
std::for_each(actualSet1.begin(), actualSet1.end(), [&actualIntersection,&actualSet2] (const std::string& s) {
if (actualSet2.find(s) != actualSet2.end())
actualIntersection.insert(s);
});

// calculate actual jaccard index
float actualJaccardIndex = (float)actualIntersection.size() / (float)actualUnion.size();

// show the estimated and actual counts, and error of estimations
printf("nSet1: %0.2f estimated, %u actual.  Error: %0.2f%%n",
set1.CountEstimate(),
actualSet1.size(),
100.0f * ((float)set1.CountEstimate() - (float)actualSet1.size()) / (float)actualSet1.size()
);
printf("Set2: %0.2f estimated, %u actual.  Error: %0.2f%%n",
set2.CountEstimate(),
actualSet2.size(),
100.0f * ((float)set2.CountEstimate() - (float)actualSet2.size()) / (float)actualSet2.size()
);

float estimatedUnion = UnionCountEstimate(set1, set2);
float estimatedIntersection = IntersectionCountEstimate(set1, set2);
float estimatedJaccardIndex = estimatedIntersection / estimatedUnion;
printf("Union: %0.2f estimated, %u actual.  Error: %0.2f%%n",
estimatedUnion,
actualUnion.size(),
100.0f * (estimatedUnion - (float)actualUnion.size()) / (float)actualUnion.size()
);
printf("Intersection: %0.2f estimated, %u actual.  Error: %0.2f%%n",
estimatedIntersection,
actualIntersection.size(),
100.0f * (estimatedIntersection - (float)actualIntersection.size()) / (float)actualIntersection.size()
);
printf("Jaccard Index: %0.2f estimated, %0.2f actual.  Error: %0.2f%%n",
estimatedJaccardIndex,
actualJaccardIndex,
100.0f * (estimatedJaccardIndex - actualJaccardIndex) / actualJaccardIndex
);

WaitForEnter();
}
```

And here is the output of that program:

In the output, you can see that all the existence checks were correct. All the no’s were actually no’s like they should be, but also, all the maybe’s were actually present, so there were no false positives.

The estimated counts were a little off but were fairly close. The first list was estimated at 2.4 items, when it actually had 3. The second list was estimated at 4.44 items when it actually had 5 items.

It’s reporting a very low false positive rate, which falls in line with the fact that we didn’t see any false positives. The projected false positive rates at 10, 25, 50 and 100 items show us that the set doesn’t have a whole lot more capacity if we want to keep the error rate low.

The union, intersection and jaccard index error rate was pretty low, but the error rate was definitely larger than the false positive rate.

Interestingly, if you look at the part which reports the ideal bit and hash count for 10 items, it says that we should actually use FEWER hashes than we do, and a couple fewer bits. You can experiment by changing the number of hashes to 4 and seeing that the error rate goes down. In the example code we are actually using TOO MANY hashes for the number of items we plan on inserting, and it’s hurting our false positive rate.

## Interesting Idea

I was chatting with a friend Dave, who said he was using a bloom filter like structure to try and help make sure he didn’t try the same genetic algorithm permutation more than once. An issue with that is that hash collisions could thwart the ability for evolution to happen correctly, by incorrectly disallowing a wanted permutation just because it happened to hash to the same value as another permutation already tried. To help this situation, he just biased against permutations found in the data structure, instead of completely blocking them out. Basically, if the permutation was marked as “maybe seen” in the data structure, he’d give it some % chance of being allowed to happen “again” anyways.

Unfortunately, the idea in general turned out to be impractical. He had about 40 bits of genetic information, which is about 1 trillion unique items (2^40).

Even for storing only 1 billion items – which is 1000 times smaller – at a 5% false positive rate, a bloom filter requires about 750MB of storage for the bit flags.

Dropping the requirement to a 25% false positive rate, it still requires 350MB, and at 75% it requires 70MB. Even allowing a 99% false positive rate, it requires 2.5MB, and we are still 1000 times too small.

So, for 1 trillion permutations, the size of the bloom filter is unfortunately far too large and he ended up going a different route.

The technique of rolling a random number when you get a maybe is pretty neat though, so I wanted to mention it (:

## Next Up

We’ve now talked about a probabilistic unique value counter (KMV) that can count the number of unique objects seen.

We then talked about a probabilistic set structure (Bloom Filter) that can keep track of set membership of objects.

How about being able to probabilistically store a list of non unique items?

One way to do this would be to have a count with each item seen instead of only having a bit of whether it’s present or not. When you have a count per item, this is known as a multiset, or the more familiar term: histogram.

If you change a bloom filter to have a count instead of a single bit, that’s called a counting bloom filter, and it completely works.

There’s a better technique for doing this though called count min sketch. Look for a post soon!

Check this out, it’s a physical implementation of a bloom filter (WAT?!)
Wikipedia: Superimposed code

# Estimating Counts of Distinct Values with KMV

A few months ago I saw what I’m about to show you for the first time, and my jaw dropped. I’m hoping to share that experience with you right now, as I share a quick intro to probabilistic algorithms, starting with KMV, otherwise known as K-Minimum Values, which is a Distinct Value (DV) Sketch, or a DV estimator.

Thanks to Ben Deane for exposure to this really interesting area of computer science.

## The Setup

Let’s say that you needed a program to count the number of unique somethings that you’ve encountered. The unique somethings could be unique visitors on a web page, unique logins into a server within a given time period, unique credit card numbers, or anything else that has a way for you to get some unique identifier.

One way to do this would be to keep a list of all the things you’ve encountered before, only inserting a new item if it isn’t already in the list, and then count the items in the list when you want a count of unique items. The obvious downside here is that if the items are large and/or there are a lot of them, your list can get pretty large. In fact, it can take an unbounded amount of memory to store this list, which is no good.

A better solution may be instead of storing the entire item in the list, maybe instead you make a 32 bit hash of the item and then add that hash to a list if it isn’t already in the list, knowing that you have at most 2^32 (4,294,967,296) items in the list. It’s a bounded amount of memory, which is an improvement, but it’s still pretty large. The maximum memory requirement there is 2^32 * 4 which is 17,179,869,184 bytes or 16GB.

You may even do one step better and just decide to have 2^32 bits of storage, and when you encounter a specific 32 bit hash, use that as an index and set that specific bit to 1. You could then count the number of bits set and use that as your answer. That only takes 536,870,912 bytes, or 512MB. A pretty big improvement over 16GB but still pretty big. Also, try counting the number of bits set in a 512MB region of memory. That isn’t the fastest operation around 😛

We made some progress by moving to hashes, which put a maximum size amount of memory required, but that came at the cost of us introducing some error. Hashes can have collisions, and when that occurs in our scenarios above, we have no way of knowing that two items hashing to the same value are not the same item.

Sometimes it’s ok though that we only have an approximate answer, because getting an exact answer may require resources we just don’t have – like infinite memory to hold an infinitely large list, and then infinite computing time to count how many items there are. Also notice that the error rate is tuneable if we want to spend more memory. If we used a 64 bit hash instead of a 32 bit hash, our hash collisions would decrease, and our result would be more accurate, at the cost of more memory used.

## The Basic Idea

Let’s try something different, let’s hash every item we see, but only store the lowest valued hash we encounter. We can then use that smallest hash value to estimate how many unique items we saw. That’s not immediately intuitive, so here’s the how and the why…

When we put something into a hash function, what we are doing really is getting a deterministic pseudo random number to represent the item. If we put the same item in, we’ll get the same number out every time, and if we put in a different item, we should get a different number out, even if the second item is really similar to the first item. Interestingly the numbers we get out should have no bearing on the nature of the items we put into the hash. Even if the items we put in are similar, the output should be just as different (on average) as if we put in items that were completely different.

This is one of the properties of a good hash function: its output is (usually) evenly distributed, no matter the nature of the input. What that means is that if we put N items into a hash function, those items are on average going to be evenly spaced in the output of the hash, regardless of how similar or different they were before going into the hash function.

Using this property, the distance from zero to the smallest hash we’ve ever seen can be treated as a representative estimate of the average distance between each item that we hashed. So, to get the number of items in the hash, we convert the smallest hash into a percentage from 0 to 1 of the total hash space (for uint32, convert to float and divide by (float)2^32), and then we can use this formula:

numItems = (1 / percentage) – 1

To understand why we subtract the 1, imagine that our minimum hash value is 0.5. If that is an accurate representation of the space between values, that means that we only have 1 value, right in the middle. But, if we divide 1 by 0.5 we get 2. We have 2 REGIONS, but we have only 1 item in the list, so we need to subtract 1.

As another example imagine that our minimum hash is 0.3333 and that it is an accurate representation of the space between values. If we divide 1 by 0.3333, we get 3. We do have 3 regions if we cut the whole space into 3 parts, but we only have 2 items (we made 2 cuts to make 3 regions).
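
In code, the conversion is tiny. Here's a sketch assuming a 32 bit hash, where minHash is the smallest hash value seen so far (a minHash of 0 would need special casing):

```
#include <cstdint>

// Turn the smallest 32 bit hash seen into an estimated unique item count.
// If N items are evenly spaced in the hash space, the smallest one lands
// about 1/(N+1) of the way in, so (1 / percentage) - 1 recovers N.
float EstimateUniqueCount (uint32_t minHash)
{
    float percentage = (float)minHash / 4294967296.0f; // 2^32
    return (1.0f / percentage) - 1.0f;
}
```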

## Reducing Error

This technique doesn’t suffer too much from hash collisions (so long as your hash function is a decent hash function), but it does have a different sort of problem.

As you might guess, sometimes the hash function might not play nice, and you could hash only a single item, get a hash of 1 and ruin the count estimation. So, there is possibility for error in this technique.

To help reduce error, you ultimately need information about more ranges so that you can combine the multiple pieces of information together to get a more accurate estimate.

Here are some ways to gather information about more ranges:

• Keep the lowest and highest hash seen instead of just the lowest. This doubles the range information you have since you’d know the range 0-min and max-1.
• Keep the lowest N hashes seen instead of only the lowest. This multiplies the amount of range information you have by N.
• Instead of keeping the lowest value from a single hash, perform N hashes instead and keep the lowest value seen from each hash. Again, this multiplies the amount of range information you have by N.
• Alternately, just salt the same hash N different ways (with high quality random numbers) instead of using N different hash functions, but still keep the lowest seen from each “value channel”. Multiplies info by N again.
• Also a possibility, instead of doing N hashes or N salts, do a single hash, and xor that hash against N different high quality random numbers to come up with N deterministic pseudorandom numbers per item. Once again, still keep the lowest hash value seen from each “value channel”. Multiplies info by N again
• Mix the above however you see fit!

Whatever route you go, the ultimate goal is to just get information about more ranges to be able to combine them together.

In vanilla KMV, the second option is used, which is to keep the N lowest hashes seen, instead of just a single lowest hash seen. Thus the full name for KMV: K-Minimum Values.

## Combining Info From Multiple Ranges

When you have the info from multiple ranges and need to combine that info, it turns out that using the harmonic mean is the way to go because it’s great at filtering out large values in data that don’t fit with the rest of the data.

Since we are using division to turn the hash value into an estimate (1/percentValue-1), unusually small hashes will result in exponentially larger values, while unusually large hashes will not affect the math as much, but also likely will be thrown out before we ever see them since they will likely not be the minimum hash that we’ve seen.

I don’t have supporting info handy, but from what I’ve been told, the harmonic mean is provably better than both the geometric mean and the regular plain old vanilla average (arithmetic mean) in this situation.

So, to combine information from the multiple ranges you’ve gathered, you turn each range into a distinct value estimate (by calculating 1/percentValue-1) and then putting all those values through the mean equation of your choice (which ought to be harmonic mean, but doesn’t strictly have to be!). The result will be your final answer.
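
For reference, here's the harmonic mean as a small stand-alone sketch (the sample code below does the same computation inline):

```
#include <vector>

// Harmonic mean: the reciprocal of the average of reciprocals.
// Unusually large estimates (caused by unusually small hashes) get dampened.
// Assumes a non-empty list of positive estimates.
float HarmonicMean (const std::vector<float>& estimates)
{
    float sumOfReciprocals = 0.0f;
    for (float estimate : estimates)
        sumOfReciprocals += 1.0f / estimate;
    return (float)estimates.size() / sumOfReciprocals;
}
```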

## Set Operations

Even though KMV is just a distinct value estimator that estimates a count, there are some interesting probabilistic set operations that you can do with it as well. I’ll be talking about using the k min value technique for gathering information from multiple ranges, but if you use some logic you should be able to figure out how to make it work when you use any of the other techniques.

Jaccard Index

Talking about set operations, I want to start with a concept called the Jaccard index (sometimes called the Jaccard similarity coefficient).

If you have 2 sets, the Jaccard index is calculated by:

Jaccard Index = count(intersection(A,B)) / count(union(A,B))

Since the union of A and B is the combined list of all items in those sets, and the intersection of A and B is the items that they have in common, you can see that if the sets have all items in common, the Jaccard index will be 1 and if the sets have no items in common, the Jaccard index will be 0. If you have some items in common it will be somewhere between 0 and 1. So, the Jaccard Index is just a measurement of how similar two sets are.

Union

If you have the information for two KMV objects, you can get an estimate to the number of unique items there if you were to union them together, even though you don’t have much of the info about the items that are in the set.

To do a union, you just combine the minimum value lists, and remove the K largest ones, so that you are left with the K minimum values from both sets.

You then do business as usual to estimate how many items are in that resulting set.

If you think about it, this makes sense, because if you tracked the items from both lists in a third KMV object, you’d end up with the same K minimums as if you took the K smallest values from both sets individually.

Note that if the two KMV objects are of different size, due to K being different sizes, or because either one isn’t completely filled with K minimum values, you should use the smaller value of K as your union set K size.

Intersection

Finding an estimated count of intersections between two sets is a little bit different.

Basically, having the K minimum hashes seen from both lists, you can think of that as a kind of random sample of a range in both lists. You can then calculate the Jaccard index for that range you have for the two sets (by dividing the size of the intersection by the size of the union), and then use that Jaccard index to estimate an intersection count for the entire set based on the union count estimate.

You can do some algebra to the Jaccard index formula above to get this:

count(intersection(A,B)) = Jaccard Index * count(union(A,B))

Just like with union, if the two KMV objects are of different size, due to K being different sizes, or because either one isn’t completely filled with K minimum values, you should use the smaller value of K.

## Sample Code

```#include <stdio.h>
#include <cmath>
#include <array>
#include <vector>
#include <set>
#include <string>
#include <algorithm>
#include <functional>

// The CKMVCounter class
template <typename T, unsigned int NUMHASHES, typename HASHER = std::hash<T>>
class CKMVUniqueCounter
{
public:
// constants
static const unsigned int c_numHashes = NUMHASHES;
static const size_t c_invalidHash = (size_t)-1;

// constructor
CKMVUniqueCounter ()
{
// fill our minimum hash values with the maximum possible value
m_minHashes.fill(c_invalidHash);
m_largestHashIndex = 0;
}

// interface
void AddItem (const T& item)
{
// if the new hash isn't smaller than our current largest hash, do nothing
const size_t newHash = HASHER()(item);
if (m_minHashes[m_largestHashIndex] <= newHash)
return;

// if the new hash is already in the list, don't add it again
for (unsigned int index = 0; index < c_numHashes; ++index)
{
if (m_minHashes[index] == newHash)
return;
}

// otherwise, replace the largest hash
m_minHashes[m_largestHashIndex] = newHash;

// and find the new largest hash
m_largestHashIndex = 0;
for (unsigned int index = 1; index < c_numHashes; ++index)
{
if (m_minHashes[index] > m_minHashes[m_largestHashIndex])
m_largestHashIndex = index;
}
}

// probabilistic interface
void UniqueCountEstimates (float &arithmeticMeanCount, float &geometricMeanCount, float &harmonicMeanCount)
{
// calculate the means of the count estimates.  Note that if we didn't get enough items
// to fill our m_minHashes array, we are just ignoring the unfilled entries.  In production
// code, you would probably just want to return the number of items that were filled since that
// is likely to be a much better estimate.
// Also, we need to sort the hashes before calculating uniques so that we can get the ranges by
// using [i]-[i-1] instead of having to search for the next largest item to subtract out
SortHashes();
arithmeticMeanCount = 0.0f;
geometricMeanCount = 1.0f;
harmonicMeanCount = 0.0f;
int numHashes = 0;
for (unsigned int index = 0; index < c_numHashes; ++index)
{
if (m_minHashes[index] == c_invalidHash)
continue;
numHashes++;
float countEstimate = CountEstimate(index);
arithmeticMeanCount += countEstimate;
geometricMeanCount *= countEstimate;
harmonicMeanCount += 1.0f / countEstimate;
}
arithmeticMeanCount = arithmeticMeanCount / (float)numHashes;
geometricMeanCount = pow(geometricMeanCount, 1.0f / (float)numHashes);
harmonicMeanCount /= (float)numHashes;
harmonicMeanCount = 1.0f / harmonicMeanCount;
}

// friends
template <typename T2, unsigned int NUMHASHES2, typename HASHER2>
friend CKMVUniqueCounter<T2, NUMHASHES2, HASHER2> KMVUnion (
const CKMVUniqueCounter<T2, NUMHASHES2, HASHER2>& a,
const CKMVUniqueCounter<T2, NUMHASHES2, HASHER2>& b
);

template <typename T2, unsigned int NUMHASHES2, typename HASHER2>
friend float KMVJaccardIndex (
const CKMVUniqueCounter<T2, NUMHASHES2, HASHER2>& a,
const CKMVUniqueCounter<T2, NUMHASHES2, HASHER2>& b
);

private:

unsigned int NumHashesSet () const
{
unsigned int ret = 0;
for (unsigned int index = 0; index < c_numHashes; ++index)
{
if (m_minHashes[index] != c_invalidHash)
++ret;
}
return ret;
}

void SortHashes ()
{
std::sort(m_minHashes.begin(), m_minHashes.end());
}

float CountEstimate (unsigned int hashIndex) const
{
// treat the gap between this min hash and the previous one as representative
// of the average gap between items, and turn that gap into a count estimate
const size_t currentHash = m_minHashes[hashIndex];
const size_t lastHash = hashIndex > 0 ? m_minHashes[hashIndex - 1] : 0;
const float percent = (float)(currentHash - lastHash) / (float)((size_t)-1);
return (1.0f / percent) - 1.0f;
}

// the minimum hash values
std::array<size_t, NUMHASHES> m_minHashes;
size_t m_largestHashIndex;
};

// Set interface
template <typename T, unsigned int NUMHASHES, typename HASHER>
CKMVUniqueCounter<T, NUMHASHES, HASHER> KMVUnion (
const CKMVUniqueCounter<T, NUMHASHES, HASHER>& a,
const CKMVUniqueCounter<T, NUMHASHES, HASHER>& b
)
{
// gather the K smallest hashes seen, where K is the smaller, removing duplicates
std::set<size_t> setMinHashes;
std::for_each(a.m_minHashes.begin(), a.m_minHashes.end(), [&setMinHashes](size_t v) {setMinHashes.insert(v); });
std::for_each(b.m_minHashes.begin(), b.m_minHashes.end(), [&setMinHashes](size_t v) {setMinHashes.insert(v); });
std::vector<size_t> minHashes(setMinHashes.begin(), setMinHashes.end());
std::sort(minHashes.begin(),minHashes.end());
minHashes.resize(std::min(a.NumHashesSet(), b.NumHashesSet()));

// create and return the new KMV union object
CKMVUniqueCounter<T, NUMHASHES, HASHER> ret;
for (unsigned int index = 0; index < minHashes.size(); ++index)
ret.m_minHashes[index] = minHashes[index];
ret.m_largestHashIndex = ret.c_numHashes - 1;
return ret;
}

template <typename T, unsigned int NUMHASHES, typename HASHER>
float KMVJaccardIndex (
const CKMVUniqueCounter<T, NUMHASHES, HASHER>& a,
const CKMVUniqueCounter<T, NUMHASHES, HASHER>& b
)
{
size_t smallerK = std::min(a.NumHashesSet(), b.NumHashesSet());

size_t matches = 0;
for (size_t ia = 0; ia < smallerK; ++ia)
{
for (size_t ib = 0; ib < smallerK; ++ib)
{
if (a.m_minHashes[ia] == b.m_minHashes[ib])
{
matches++;
break;
}
}
}

return (float)matches / (float)smallerK;
}

// data to add to the lists
const char *s_boyNames[] =
{
"Loki",
"Alan",
"Paul",
"Stripes",
"Shelby",
"Ike",
"Rafael",
"Sonny",
"Luciano",
"Jason",
"Brent",
"Jed",
"Lesley",
"Randolph",
"Isreal",
"Charley",
"Valentin",
"Dewayne",
"Trent",
"Abdul",
"Craig",
"Andre",
"Markus",
"Randolph",
"Isreal",
"Charley",
"Brenton",
"Herbert",
"Rafael",
"Sonny",
"Luciano",
"Joshua",
"Ramiro",
"Osvaldo",
"Monty",
"Mckinley",
"Colin",
"Hyman",
"Scottie",
"Tommy",
"Modesto",
"Reginald",
"Lindsay",
"Alec",
"Marco",
"Dee",
"Randy",
"Arthur",
"Hosea",
"Laverne",
"Bobbie",
"Damon",
"Les",
"Cleo",
"Robt",
"Rick",
"Alonso",
"Teodoro",
"Rodolfo",
"Ryann",
"Miki",
"Astrid",
"Monty",
"Mckinley",
"Colin",
nullptr
};

const char *s_girlNames[] =
{
"Chanel",
"Colleen",
"Scorch",
"Grub",
"Anh",
"Kenya",
"Georgeann",
"Anne",
"Inge",
"Georgeann",
"Anne",
"Inge",
"Analisa",
"Ligia",
"Chasidy",
"Marylee",
"Lashandra",
"Frida",
"Katie",
"Alene",
"Brunilda",
"Zoe",
"Shavon",
"Anjanette",
"Daine",
"Sheron",
"Hilary",
"Felicitas",
"Cristin",
"Ressie",
"Tynisha",
"Annie",
"Sharilyn",
"Astrid",
"Charise",
"Gregoria",
"Angelic",
"Lesley",
"Mckinley",
"Lindsay",
"Shanelle",
"Karyl",
"Trudi",
"Shaniqua",
"Trinh",
"Ardell",
"Doreen",
"Leanna",
"Chrystal",
"Treasa",
"Dorris",
"Rosalind",
"Lenore",
"Mari",
"Kasie",
"Ann",
"Ryann",
"Miki",
"Lasonya",
"Olimpia",
"Shelby",
"Lesley",
"Mckinley",
"Lindsay",
"Dee",
"Bobbie",
"Cleo",
"Leanna",
"Chrystal",
"Treasa",
nullptr
};

// driver program
void WaitForEnter ()
{
printf("nPress Enter to quit");
fflush(stdin);
getchar();
}

void main(void)
{
// how many min values all these KMV objects keep around
static const int c_numMinValues = 15;

printf("Using %u minimum valuesnn", c_numMinValues);

// =====================  Boy Names  =====================
// put our data into the KMV counter
CKMVUniqueCounter<std::string, c_numMinValues> boyCounter;
unsigned int index = 0;
while (s_boyNames[index] != nullptr)
{
boyCounter.AddItem(s_boyNames[index]);
index++;
}

// get our count estimates
float arithmeticMeanCount, geometricMeanCount, harmonicMeanCount;
boyCounter.UniqueCountEstimates(arithmeticMeanCount, geometricMeanCount, harmonicMeanCount);

// get our actual unique count
std::set<std::string> actualBoyUniques;
index = 0;
while (s_boyNames[index] != nullptr)
{
actualBoyUniques.insert(s_boyNames[index]);
index++;
}

// print the results!
printf("Boy Names:n%u actual uniquesn", actualBoyUniques.size());
float actualCount = (float)actualBoyUniques.size();
printf("Estimated counts and percent error:n  Arithmetic Mean: %0.2ft%0.2f%%n"
"  Geometric Mean : %0.2ft%0.2f%%n  Harmonic Mean  : %0.2ft%0.2f%%n",
arithmeticMeanCount, 100.0f * (arithmeticMeanCount - actualCount) / actualCount,
geometricMeanCount, 100.0f * (geometricMeanCount - actualCount) / actualCount,
harmonicMeanCount, 100.0f * (harmonicMeanCount - actualCount) / actualCount);

// =====================  Girl Names  =====================
// put our data into the KMV counter
CKMVUniqueCounter<std::string, c_numMinValues> girlCounter;
index = 0;
while (s_girlNames[index] != nullptr)
{
girlCounter.AddItem(s_girlNames[index]);
index++;
}

// get our count estimates
girlCounter.UniqueCountEstimates(arithmeticMeanCount, geometricMeanCount, harmonicMeanCount);

// get our actual unique count
std::set<std::string> actualGirlUniques;
index = 0;
while (s_girlNames[index] != nullptr)
{
actualGirlUniques.insert(s_girlNames[index]);
index++;
}

// print the results!
printf("nGirl Names:n%u actual uniquesn", actualGirlUniques.size());
actualCount = (float)actualGirlUniques.size();
printf("Estimated counts and percent error:n  Arithmetic Mean: %0.2ft%0.2f%%n"
"  Geometric Mean : %0.2ft%0.2f%%n  Harmonic Mean  : %0.2ft%0.2f%%n",
arithmeticMeanCount, 100.0f * (arithmeticMeanCount - actualCount) / actualCount,
geometricMeanCount, 100.0f * (geometricMeanCount - actualCount) / actualCount,
harmonicMeanCount, 100.0f * (harmonicMeanCount - actualCount) / actualCount);

// =====================  Set Operations  =====================

// make the KMV union and get our count estimates
CKMVUniqueCounter<std::string, c_numMinValues> boyGirlUnion = KMVUnion(boyCounter, girlCounter);
boyGirlUnion.UniqueCountEstimates(arithmeticMeanCount, geometricMeanCount, harmonicMeanCount);

// make the actual union
std::set<std::string> actualBoyGirlUnion;
std::for_each(actualBoyUniques.begin(), actualBoyUniques.end(),
[&actualBoyGirlUnion](const std::string& s)
{
actualBoyGirlUnion.insert(s);
}
);
std::for_each(actualGirlUniques.begin(), actualGirlUniques.end(),
[&actualBoyGirlUnion](const std::string& s)
{
actualBoyGirlUnion.insert(s);
}
);

// print the results!
printf("nUnion:n%u actual uniques in unionn", actualBoyGirlUnion.size());
actualCount = (float)actualBoyGirlUnion.size();
printf("Estimated counts and percent error:n  Arithmetic Mean: %0.2ft%0.2f%%n"
"  Geometric Mean : %0.2ft%0.2f%%n  Harmonic Mean  : %0.2ft%0.2f%%n",
arithmeticMeanCount, 100.0f * (arithmeticMeanCount - actualCount) / actualCount,
geometricMeanCount, 100.0f * (geometricMeanCount - actualCount) / actualCount,
harmonicMeanCount, 100.0f * (harmonicMeanCount - actualCount) / actualCount);

// calculate estimated jaccard index
float estimatedJaccardIndex = KMVJaccardIndex(boyCounter, girlCounter);

// calculate actual jaccard index and actual intersection
size_t actualIntersection = 0;
std::for_each(actualBoyUniques.begin(), actualBoyUniques.end(),
[&actualGirlUniques, &actualIntersection] (const std::string &s)
{
if (actualGirlUniques.find(s) != actualGirlUniques.end())
actualIntersection++;
}
);
float actualJaccardIndex = (float)actualIntersection / (float)actualBoyGirlUnion.size();

// calculate estimated intersection
float estimatedIntersection = estimatedJaccardIndex * (float)actualBoyGirlUnion.size();

// print the intersection and jaccard index information
printf("nIntersection:n%0.2f estimated, %u actual.  Error: %0.2f%%n",
estimatedIntersection,
actualIntersection,
100.0f * (estimatedIntersection - (float)actualIntersection) / (float)actualIntersection);

printf("nJaccard Index:n%0.2f estimated, %0.2f actual.  Error: %0.2f%%n",
estimatedJaccardIndex,
actualJaccardIndex,
100.0f * (estimatedJaccardIndex-actualJaccardIndex) / actualJaccardIndex
);

WaitForEnter();
}
```

Here’s the output of the program:

## Upgrading from Set to Multi Set

Interestingly, if you keep a count of how many times you’ve seen each hash in the k min hashes, you can upgrade this algorithm from being a set algorithm to a multiset algorithm and get some other interesting information from it.

Where a set is basically a list of unique items, a multiset is a set of unique items that also have a count associated with them. In this way, you can think of a multiset as just a list of items which may appear more than once.

Upgrading KMV to a multiset algorithm lets you do some new and interesting things: instead of getting information only about unique counts, you can get information about non unique counts too. But to reiterate, you still keep the ability to get unique information as well, so it is kind of like an upgrade, if you are interested in multiset information.

Sketch of the Day: K-Minimum Values

Sketch of the Day: Interactive KMV web demo

K-Minimum Values: Sketching Error, Hash Functions, and You

Wikipedia: MinHash – Jaccard similarity and minimum hash values

## All Done

I wanted to mention that even though this algorithm is a great, intuitive introduction into probabilistic algorithms, there are actually much better distinct value estimators in use today, such as one called HyperLogLog which seems to be the current winner. Look for a post on HyperLogLog soon 😛

KMV is better than other algorithms at a few things though. Specifically, from what I’ve read, the fact that it can extend to multisets makes it very useful, and it is also much easier to calculate intersections with than other algorithms.

I also wanted to mention that there are interesting usage cases for this type of algorithm in game development, but where these probabilistic algorithms really shine is in massive data situations like google, amazon, netflix, etc. If you go out searching for more info on this stuff, you’ll probably be led down many big data / web dev rabbit holes, because that’s where the best information about this stuff resides.

Lastly, I wanted to mention that I’m using the C++ std::hash built in hash function. I haven’t done a lot of research to see how it compares to other hashes, but I’m sure, much like rand(), the built in functionality leaves much to be desired for “power user” situations. In other words, if you are going to use this in a realistic situation, you probably are better off using a better hashing algorithm. If you need a fast one, you might look into the latest murmurhash variant!

More probabilistic algorithm posts coming soon, so keep an eye out!

# Programmatically Calculating GCD and LCM

I recently came across a really interesting technique for calculating GCD (greatest common divisor) and then found out you can use that to calculate LCM (least common multiple).

## Greatest Common Divisor

The greatest common divisor of two numbers is the largest number that divides evenly into those numbers.

For instance the GCD of 12 and 15 is 3, the GCD of 30 and 20 is 10, and the GCD of 7 and 11 is 1.

You could calculate this with brute force – starting with 1 and counting up to the smaller number, keeping track of the largest number that divides evenly into both numbers – but for larger numbers, this technique could take a long time.

Luckily, Euclid came up with a better way around 300 BC!

Euclid’s algorithm to find the GCD of numbers A and B:

1. If A and B are the same number, that number is the GCD
2. Otherwise, subtract the smaller number from the larger number
3. Goto 1

Pretty simple right? It’s not immediately intuitive why that works, but as an example let’s say that there’s a number that goes into A five times, and goes into B three times. That same number must go into (A-B) two times.

Try it out on paper, think about it a bit, and check out the links at the end of this section (:

A refinement on that algorithm is to use remainder (modulus) instead of possibly having to do repeated subtraction to get the same result. For instance if you had the numbers 1015 and 2, you are going to have to subtract 2 from 1015 quite a few times before the 2 becomes the larger number.

Here’s the refined algorithm:

1. If A and B are the same number, that number is the GCD
2. Otherwise, set the larger number to be the remainder of the larger number divided by the smaller number
3. Goto 1
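
For example, to find GCD(1071, 462): 1071 % 462 = 147, then 462 % 147 = 21, then 147 % 21 = 0, so the GCD is 21.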

And here’s the C++ code:

```#include <stdio.h>
#include <algorithm>

unsigned int CalculateGCD (unsigned int smaller, unsigned int larger)
{
// make sure A <= B before starting
if (larger < smaller)
std::swap(smaller, larger);

// loop
while (1)
{
// if the remainder of larger / smaller is 0, they are the same
// so return smaller as the GCD
unsigned int remainder = larger % smaller;
if (remainder == 0)
return smaller;

// otherwise, the new larger number is the old smaller number, and
// the new smaller number is the remainder
larger = smaller;
smaller = remainder;
}
}

void WaitForEnter ()
{
printf("nPress Enter to quit");
fflush(stdin);
getchar();
}

void main (void)
{
// Get A
printf("Greatest Common Devisor Calculator, using Euclid's algorithm!n");
printf("nFirst number? ");
unsigned int A = 0;
if (scanf("%u",&A) == 0 || A == 0) {
printf("nMust input a positive integer greater than 0!");
WaitForEnter();
return;
}

// Get B
printf("Second number? ");
unsigned int B = 0;
if (scanf("%u",&B) == 0 || B == 0) {
printf("nMust input a positive integer greater than 0!");
WaitForEnter();
return;
}

// show the result
printf("nGCD(%u,%u) = %un", A, B, CalculateGCD(A,B));

// wait for user to press enter
WaitForEnter();
}
```

That book is filled with amazing treasures of knowledge and interesting stories to boot. I highly suggest flipping to a couple random chapters and reading it a bit. Very cool stuff in there (:

You might also find these links interesting or useful!
Wikipedia: Greatest Common Divisor
Wikipedia: Euclidean Algorithm

I’m sure there’s a way to extend this algorithm to work for N numbers at a time instead of only 2 numbers. I’ll leave that as a fun exercise for you if you want to play with that 😛
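
(If you want a head start on that: GCD(a,b,c) = GCD(GCD(a,b),c), so one sketch is to just fold the two number version across a list, reusing the CalculateGCD function from the code above.)

```
// One way to extend GCD to N numbers: GCD(a,b,c) = GCD(GCD(a,b),c),
// so fold the two number CalculateGCD across the whole list.
unsigned int CalculateGCDOfList (const unsigned int* values, int count)
{
    unsigned int result = values[0];
    for (int i = 1; i < count; ++i)
        result = CalculateGCD(result, values[i]);
    return result;
}
```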

## Least Common Multiple

The least common multiple of two numbers is the smallest number that is evenly divisible by those numbers.

Kind of a mouthful, so here are some examples: The LCM of 3 and 4 is 12, the LCM of 1 and 7 is 7, and the LCM of 20 and 35 is 140. Note that in the first two examples, the LCM is just the two numbers multiplied together, but in the 3rd example it isn’t (also interesting to note: the first 2 examples have a GCD of 1, while the 3rd example has a GCD of 5).

Well interestingly, calculating the LCM is super easy if you already know how to calculate the GCD. You just multiply the numbers together and divide by the GCD.

LCM(A,B) = (A*B) / GCD(A,B)

Interestingly, since GCD(A,B) divides evenly into both A and B, the division A / GCD(A,B) gives an exact integer result. That means we can do the division first and multiply by B afterwards and get the exact same answer. More importantly, doing it in that order helps protect you against integer overflow in the A*B calculation. Using that knowledge, the equation becomes this:

LCM(A,B) = (A / GCD(A,B))*B

Checking with the earlier example: (20 / 5) * 35 = 4 * 35 = 140, the same as 700 / 5. Pretty neat! Here’s some C++ code that calculates LCM.

```#include <stdio.h>
#include <algorithm>

unsigned int CalculateGCD (unsigned int smaller, unsigned int larger)
{
// make sure A <= B before starting
if (larger < smaller)
std::swap(smaller, larger);

// loop
while (1)
{
// if the remainder of larger / smaller is 0, they are the same
// so return smaller as the GCD
unsigned int remainder = larger % smaller;
if (remainder == 0)
return smaller;

// otherwise, the new larger number is the old smaller number, and
// the new smaller number is the remainder
larger = smaller;
smaller = remainder;
}
}

unsigned int CalculateLCM (unsigned int A, unsigned int B)
{
// LCM(A,B) = (A/GCD(A,B))*B
return (A/CalculateGCD(A,B))*B;
}

void WaitForEnter ()
{
printf("nPress Enter to quit");
fflush(stdin);
getchar();
}

void main (void)
{
// Get A
printf("Least Common Multiple Calculator, using Euclid's algorithm for GCD!n");
printf("nFirst number? ");
unsigned int A = 0;
if (scanf("%u",&A) == 0 || A == 0) {
printf("nMust input a positive integer greater than 0!");
WaitForEnter();
return;
}

// Get B
printf("Second number? ");
unsigned int B = 0;
if (scanf("%u",&B) == 0 || B == 0) {
printf("nMust input a positive integer greater than 0!");
WaitForEnter();
return;
}

// show the result
printf("nLCM(%u,%u) = %un", A, B, CalculateLCM(A,B));

// wait for user to press enter
WaitForEnter();
}
```

Extending this to N numbers could be an interesting thing to try too (:

## Compile Time GCD and LCM

I’ve just heard that a compile time GCD and LCM implementation has been recommended for the STL. Check out the link below, kinda neat!

Greatest Common Divisor and Least Common Multiple

TTFN.

# Bresenham’s Drawing Algorithms

If you were asked to name a line drawing algorithm, chances are you would say Bresenham. I recently needed to write my own software line drawing algorithm (CPU and regular ram, not GPU and VRAM) and Bresenham was the first to come to mind for me as well.

Believe it or not, Jack Bresenham actually came up with 2 famous line drawing algorithms. One is a run length algorithm, and the other is a run slice algorithm. People are most often familiar with the run length algorithm, which I mainly talk about in this post, but there’s some information about the run slice algorithm and links to other Bresenham primitive rendering algorithms below as well!

An interesting thing about both line algorithms is that they involve only integer mathematics, so there is no round off error, epsilons, numerical instability or anything like that. You use only integers and simple math (no divisions even!), and get the exact right answer out. I think that is pretty cool.

## Bresenham’s Run Length Line Algorithm Summarized

To help understand the code, I want to give a brief summarization of how the algorithm works at a high level.

The first step of the Bresenham line algorithm is to see if the line is longer on the X axis or Y axis. Whichever one it is longer on is the major axis, and the shorter one is the minor axis.

Then, starting at the starting point of the line, you loop across all pixels of the major axis, doing some math to decide for each pixel whether you need to move the pixel on the minor axis yet or not. Basically you keep track of the error – or the distance the pixel is away from the true position of the line on the minor axis – and if the error is greater than or equal to 1 pixel, you move the pixel on the minor axis and subtract one pixel from the error. If the error is smaller than 1 pixel, you keep it at the same value it was for the last pixel, and keep the error value the same, so that it can accumulate some more error for the next pixel and test again.

The left line is longer on the X axis, so the major axis is the X axis. The right line is longer on the Y axis, so the major axis is the Y axis. Notice that there is one pixel for each value along the major axis of each line, but repeated pixel values along the minor axis of each line.

If you want the deeper details about the algorithm or the math behind it, check out this link:
Wikipedia: Bresenham’s line algorithm

## One Octant Code

Wikipedia says that many implementations implement the algorithm for a single octant (half of a quadrant) and then use coordinate conversions (flipping the x and y’s and/or negating them) to make the same logic work for any quadrant.

In that vein, here is a simple implementation for the first octant – where the major axis is X, and both X and Y are increasing from the starting point to the ending point.

```	void DrawLine (int x1, int y1, int x2, int y2, unsigned int color)
{
const int dx = x2 - x1;
const int dy = y2 - y1;

int Error = 2 * dy - dx;
int y = y1;

DrawPixel(x1, y1, color);
for (int x = x1+1; x < x2; ++x)
{
if (Error > 0)
{
y++;
Error += 2 * dy - 2 * dx;
}
else
{
Error += 2 * dy;
}
DrawPixel(x,y,color);
}
}
```

With the point of Bresenham being that it is efficient, let’s sacrifice readability a bit and get rid of those DrawPixel calls and perhaps over-micro-optimize a bit and make some consts for frequently used values.

```	void DrawLine (unsigned int* pixels, int width, int x1, int y1, int x2, int y2, unsigned int color)
{
const int dx = x2 - x1;
const int dy = y2 - y1;
const int dx2 = dx * 2;
const int dy2 = dy * 2;
const int dy2Mindx2 = dy2 - dx2;

int Error = dy2 - dx;

unsigned int* pixel = &amp;pixels[y1*width+x1];
*pixel = color;
for (int x = x2 - x1 - 1; x > 0; --x)
{
if (Error > 0)
{
pixel += width + 1;
Error += dy2Mindx2;
}
else
{
pixel++;
Error += dy2;
}
*pixel = color;
}
}
```

Cool, now let’s make it work for any line.

## Final Code

```#include <algorithm> // for std::swap
#include <cstdlib>   // for abs

// Generic Line Drawing
// X and Y are flipped for Y major axis lines, but the pixel writes are handled correctly due to
// minor and major axis pixel movement
void DrawLineMajorAxis(
unsigned int* pixel,
int majorAxisPixelMovement,
int minorAxisPixelMovement,
int dx,
int dy,
unsigned int color
)
{
// calculate some constants
const int dx2 = dx * 2;
const int dy2 = dy * 2;
const int dy2Mindx2 = dy2 - dx2;

// calculate the starting error value
int Error = dy2 - dx;

// draw the first pixel
*pixel = color;

// loop across the major axis
while (dx--)
{
// move on major axis and minor axis
if (Error > 0)
{
pixel += majorAxisPixelMovement + minorAxisPixelMovement;
Error += dy2Mindx2;
}
// move on major axis only
else
{
pixel += majorAxisPixelMovement;
Error += dy2;
}

// draw the next pixel
*pixel = color;
}
}

// Specialized Line Drawing optimized for horizontal or vertical lines
// X and Y are flipped for Y maxor axis lines, but the pixel writes are handled correctly due to
// minor and major axis pixel movement
void DrawLineSingleAxis(unsigned int* pixel, int majorAxisPixelMovement, int dx, unsigned int color)
{
// draw the first pixel
*pixel = color;

// loop across the major axis and draw the rest of the pixels
while (dx--)
{
pixel += majorAxisPixelMovement;
*pixel = color;
};
}

// Draw an arbitrary line.  Assumes start and end point are within valid range
// pixels is a pointer to where the pixels you want to draw to start aka (0,0)
// pixelStride is the number of unsigned ints to get from one row of pixels to the next.
// Usually, that is the same as the width of the image you are drawing to, but sometimes is not.
void DrawLine(unsigned int* pixels, int pixelStride, int x1, int y1, int x2, int y2, unsigned int color)
{
// calculate our deltas
int dx = x2 - x1;
int dy = y2 - y1;

// if the X axis is the major axis
if (abs(dx) >= abs(dy))
{
// if x2 < x1, flip the points to have fewer special cases
if (dx < 0)
{
dx *= -1;
dy *= -1;
swap(x1, x2);
swap(y1, y2);
}

// get the address of the pixel at (x1,y1)
unsigned int* startPixel = &amp;pixels[y1 * pixelStride + x1];

// determine special cases
if (dy > 0)
DrawLineMajorAxis(startPixel, 1, pixelStride, dx, dy, color);
else if (dy < 0)
DrawLineMajorAxis(startPixel, 1, -pixelStride, dx, -dy, color);
else
DrawLineSingleAxis(startPixel, 1, dx, color);
}
// else the Y axis is the major axis
else
{
// if y2 < y1, flip the points to have fewer special cases
if (dy < 0)
{
dx *= -1;
dy *= -1;
swap(x1, x2);
swap(y1, y2);
}

// get the address of the pixel at (x1,y1)
unsigned int* startPixel = &amp;pixels[y1 * pixelStride + x1];

// determine special cases
if (dx > 0)
DrawLineMajorAxis(startPixel, pixelStride, 1, dy, dx, color);
else if (dx < 0)
DrawLineMajorAxis(startPixel, pixelStride, -1, dy, -dx, color);
else
DrawLineSingleAxis(startPixel, pixelStride, dy, color);
}
}
```

Using the above functions, I wrote a little test to draw lines from the center of an image towards the edges at 100 different angles, and also a horizontal and vertical line. Here are the results:

## Run Slice Algorithm

So, like I said above, there is another Bresenham line algorithm that is a run slice algorithm.

What that means is that instead of looping across the major axis figuring out if each pixel needs to move on the minor axis as well, instead it loops across the minor axis, figuring out how many pixels it needs to write for that row or column of pixels.

One benefit of that algorithm is that it needs fewer loop iterations, since it loops over the minor axis pixels instead of the major axis pixels.
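
Here’s a rough floating point sketch of that idea, just to illustrate; it is not Abrash’s integer-only version. It assumes x is the major axis, that 0 < dy <= dx, and the same pixels/pixelStride convention as the code above. For each row, it computes where the true line crosses the midpoint to the next row, and fills that whole horizontal run at once:

```
// run slice sketch: fill one horizontal run of pixels per row
// assumes x is the major axis and 0 < dy <= dx
void DrawLineRunSliceSketch (unsigned int* pixels, int pixelStride, int x1, int y1, int x2, int y2, unsigned int color)
{
    const int dx = x2 - x1;
    const int dy = y2 - y1;
    int xRunStart = x1;
    for (int y = y1; y <= y2; ++y)
    {
        // last x whose nearest row is still this y (clamped to the endpoint on the last row)
        int xRunEnd = (y == y2) ? x2 : x1 + int((float(y - y1) + 0.5f) * dx / dy);
        for (int x = xRunStart; x <= xRunEnd; ++x)
            pixels[y * pixelStride + x] = color;
        xRunStart = xRunEnd + 1;
    }
}
```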

Here’s an interesting read from Michael Abrash about his encounters with the run slice algorithm:

Michael Abrash’s Graphics Programming Black Book Special Edition: The Good, the Bad, and the Run-Sliced

## Other Bresenham Algorithms

This was originally just going to be about the line algorithms, and I was going to make up another post about the circle algorithm, but I changed my mind.

While doing some research to see if there was a Bresenham bezier curve algorithm I stumbled on the page below, which shows that yes, there is! There are quite a few other algorithms as well:

Many Bresenham algorithms with very short c++ implementations

# Dual Numbers & Automatic Differentiation

In the last post, I talked about imaginary numbers, complex numbers, and how to use them to rotate vectors in 2d.

In this post, I want to share another interesting type of number called a “Dual Number” that uses the symbol ε (epsilon) and has a neat trick of automatically calculating the derivative of a function while you calculate the value of the function at the same time.

Dual numbers are pretty similar to imaginary numbers but there is one important difference. With imaginary numbers, i^2 = -1, but with dual numbers, ε^2 = 0 (and ε is not 0!). That may seem like a small difference, but oddly, that opens up a whole interesting world of mathematical usefulness.

Before we dig into automatic differentiation, I want to go over the mathematical basics for how dual numbers behave.

## Basic Dual Number Math

Adding dual numbers is the same as adding complex numbers; you just add the real and dual parts separately:

(3 + 4ε) + (1 + 2ε) = 4 + 6ε

Subtraction works the same way as well:

(3 + 4ε) - (1 + 2ε) = 2 + 2ε

To multiply dual numbers, you use F.O.I.L. just like you do with complex numbers:

(3 + 4ε) * (1 + 2ε) =
3 + 6ε + 4ε + 8ε^2 =
3 + 10ε + 8ε^2

However, since ε^2 is zero, the last term 8ε^2 disappears:
3 + 10ε

It’s interesting to note that with complex numbers, the i^2 became -1, so the last term changed from imaginary to real, meaning that the imaginary numbers fed back into the real numbers during multiplication. With dual numbers, that isn’t the case, the dual numbers don’t feed back into the real numbers during multiplication.

In both complex and dual numbers, the real terms do affect the non-real terms during multiplication.
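
In general, if you multiply two dual numbers and drop the ε^2 term, you get this rule (it is also exactly what operator * implements in the sample code later in this post):

(a + bε) * (c + dε) = ac + (ad + bc)ε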

The division operator relates to the conjugate. I have source code for it below, and some of the links at the end of the post go into the details of that and other operations.

## Quick Review: Derivatives (Slope)

If you know the line formula y=mx+b, but you don’t know what a derivative is, you are in luck. Remember how “m” is the slope of the line, specifying how steep it is? That is all the derivative is too: it’s just the slope.

Below is a graph of y=2x+1. At every point on that line, the derivative (or slope) is 2. That means that for every step we make on the x axis to the right (positive direction), we make 2 steps up on the y axis (positive direction).

Now, check out this graph of y=x^2-0.2

The derivative (or slope) at every point on this graph is 2x. That means that the slope changes depending on where the x coordinate is!

So, when x=0, the slope is 0. You can see in the graph that the curve is horizontal where x=0, meaning that a step on the x axis results in no movement on the y axis (only at that point where x is 0, and only if you take an infinitely small step).

When x is 1, the slope is 2, when x is 2, the slope is 4, when x is 3, the slope is 6. Since the numbers increase as we increase x from 0, that tells us that the graph gets steeper as we go to the right, which you can see in the graph.

Alternately, when x is -1, the slope is -2, when x is -2, the slope is -4, and when x is -3, the slope is -6. This shows us that as we decrease x from 0, the graph gets steeper in the opposite direction, which you can see in the graph as well.

## What is Automatic Differentiation?

Let’s say you have a function (possibly a curve) describing the path of a rocket, and you want to make the rocket point down the path that it’s traveling.

One way you might do this is to evaluate your function f(T) to get the current location of your rocket (where T is how long the rocket has been flying), and then calculate the derivative f'(T) to find the slope of the graph at that point so that you can orient the rocket in that direction.

You could easily enough calculate the value and the slope of the function at time T independently, if you know how to take the derivative of a function (a calculus topic), or by using wolframalpha.com.

However, if you have a complex equation, or maybe if the equation is controlled by user input, or game data, it might not be so easy to figure out what the derivative is at run time.

For instance… imagine having a function that rolled random numbers to figure out what mathematical operation it should perform on a number next (if we roll a 0, add 3; if we roll a 1, multiply by 2; if we roll a 2, square the number… etc). It isn’t going to be simple to work out the derivative of that function ahead of time.

Enter automatic differentiation (or AD). AD lets you calculate the derivative WHILE you are calculating the value of the function.

That way, you can do whatever math operations you want on your number, and in the end you will have both the value of f(T) as well as the derivative f'(T).

## Using ε for Automatic Differentiation

You can use dual number operations on numbers to calculate the value of f(x) while also calculating f'(x) at the same time. I’ll show you how with a simple example using addition and multiplication like we went over above.

Our simple example function will be f(x) = 3x + 2, and we’ll evaluate it at x = 4. The first thing we do is convert our 4 into a dual number, using 1 for the dual component, since we are plugging it in for the value of x, and x has a derivative of 1.

4+1ε

Next, we want to multiply that by the constant 3, using 0 for the dual component since it is just a constant (and the derivative of a constant is 0):

(4+1ε) * (3 + 0ε) =
12 + 0ε + 3ε + 0ε^2 =
12 + 3ε

Lastly, we need to add the constant 2, using 0 again for the dual component since it’s just a constant.
(12 + 3ε) + (2 + 0ε) =
14 + 3ε

In our result, the real number component (14) is the value of f(4) and the dual component (3) is the derivative f'(4), which is correct if you work it out!

Let’s try f(5). First we convert 5 to a dual number, with the dual component being 1.

5 + 1ε

Next we need to multiply it by the constant 3 (which has a dual component of 0)

(5 + 1ε) * (3 + 0ε) =
15 + 0ε + 3ε + 0ε^2 =
15 + 3ε

Now, we add the constant 2 (which has a dual component of 0 again since it’s just a constant):
(15 + 3ε) + (2 + 0ε) =
17 + 3ε

So, our answer says that f(5) = 17, and f'(5) = 3, which again you can verify is true!
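
For reference, here is what those evaluations of f(x) = 3x + 2 look like in code, using the CDualNumber class from the sample code section later in this post:

```
// evaluate f(x) = 3x + 2 and f'(x) with dual numbers, using the
// CDualNumber class from the sample code below
CDualNumber x(4.0f, 1.0f); // dual component is 1 since x has a derivative of 1
CDualNumber y = CDualNumber(3.0f) * x + CDualNumber(2.0f);
// y.Real() is 14 = f(4), and y.Dual() is 3 = f'(4)

x = CDualNumber(5.0f, 1.0f);
y = CDualNumber(3.0f) * x + CDualNumber(2.0f);
// y.Real() is 17 = f(5), and y.Dual() is 3 = f'(5)
```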

The example above worked well but it was a linear function. What if we want to do a function like f(x) = 5x^2 + 4x + 1?

Let’s calculate f(2). We are going to first calculate the 5x^2 term, so we need to start by making a dual number for the function parameter x:
(2 + 1ε)

Next, we need to multiply it by itself to make x^2:
(2 + 1ε) * (2 + 1ε) =
4 + 2ε + 2ε + 1ε^2 =
4 + 4ε

(remember that ε^2 is 0, so the last term disappears)

Next, we multiply that by the constant 5 to finish making the 5x^2 term:
(4 + 4ε) * (5 + 0ε) =
20 + 0ε + 20ε + 0ε^2 =
20 + 20ε

Now, putting that number aside for a second, we need to calculate the 4x term by multiplying the value we plugged in for x by the constant 4:
(2 + 1ε) * (4 + 0ε) =
8 + 0ε + 4ε + 0ε^2 =
8 + 4ε

Next, we need to add the last 2 values together (the 5x^2 term and the 4x term):
(20 + 20ε) + (8 + 4ε) =
28 + 24ε

Lastly, we need to add in the last term, the constant 1:
(28 + 24ε) + (1 + 0ε) =
29 + 24ε

There is our answer! For the equation y = 5x^2 + 4x + 1, f(2) = 29 and f'(2) = 24. Check it, it’s correct (:

As one last example let’s calculate f(10) and f'(10) with the same function above y = 5x^2 + 4x + 1.

First, to start calculating the 5x^2 term, we need to make 10 into a dual number and multiply it by itself to make x^2:
(10 + 1ε) * (10 + 1ε) =
100 + 10ε + 10ε + 1ε^2 =
100 + 20ε

Next, we multiply by the constant 5 to finish making the 5x^2 term:
(100 + 20ε) * (5 + 0ε) =
500 + 0ε + 100ε + 0ε^2 =
500 + 100ε

Putting that aside, let’s calculate the 4x term by multiplying our x value by the constant 4:
(10 + 1ε) * (4 + 0ε) =
40 + 0ε + 4ε + 0ε^2 =
40 + 4ε

Lastly, let’s add our terms: 5x^2, 4x and the constant 1
(500 + 100ε) + (40 + 4ε) + (1 + 0ε) =
541 + 104ε

The answer tells us that for the equation y = 5x^2 + 4x + 1, f(10) = 541 and f'(10) = 104.
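
And again for reference, the quadratic examples above, written with the CDualNumber class from the sample code below:

```
// evaluate f(x) = 5x^2 + 4x + 1 and f'(x) with dual numbers
CDualNumber x(2.0f, 1.0f);
CDualNumber y = CDualNumber(5.0f) * x * x + CDualNumber(4.0f) * x + CDualNumber(1.0f);
// y.Real() is 29 = f(2), and y.Dual() is 24 = f'(2)

x = CDualNumber(10.0f, 1.0f);
y = CDualNumber(5.0f) * x * x + CDualNumber(4.0f) * x + CDualNumber(1.0f);
// y.Real() is 541 = f(10), and y.Dual() is 104 = f'(10)
```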

## Sample Code

There are lots of other mathematical operations that you can do with dual numbers. I’ve collected as many as I was able to find and made up some sample code that uses them. The sample code is below, as well as the program output.

```
#include <stdio.h>
#include <math.h>

#define PI 3.14159265359f

// In production code, this class should probably take a template parameter for
// its scalar type instead of hard coding to float
class CDualNumber
{
public:
    CDualNumber (float real = 0.0f, float dual = 0.0f)
        : m_real(real)
        , m_dual(dual)
    {
    }

    float Real () const { return m_real; }
    float Dual () const { return m_dual; }

private:
    float m_real;
    float m_dual;
};

//----------------------------------------------------------------------
// Math Operations
//----------------------------------------------------------------------
inline CDualNumber operator + (const CDualNumber &a, const CDualNumber &b)
{
    return CDualNumber(a.Real() + b.Real(), a.Dual() + b.Dual());
}

inline CDualNumber operator - (const CDualNumber &a, const CDualNumber &b)
{
    return CDualNumber(a.Real() - b.Real(), a.Dual() - b.Dual());
}

inline CDualNumber operator * (const CDualNumber &a, const CDualNumber &b)
{
    return CDualNumber(
        a.Real() * b.Real(),
        a.Real() * b.Dual() + a.Dual() * b.Real()
    );
}

inline CDualNumber operator / (const CDualNumber &a, const CDualNumber &b)
{
    return CDualNumber(
        a.Real() / b.Real(),
        (a.Dual() * b.Real() - a.Real() * b.Dual()) / (b.Real() * b.Real())
    );
}

inline CDualNumber sqrt (const CDualNumber &a)
{
    float sqrtReal = ::sqrt(a.Real());
    return CDualNumber(
        sqrtReal,
        0.5f * a.Dual() / sqrtReal
    );
}

inline CDualNumber pow (const CDualNumber &a, float y)
{
    return CDualNumber(
        ::pow(a.Real(), y),
        y * a.Dual() * ::pow(a.Real(), y - 1.0f)
    );
}

inline CDualNumber sin (const CDualNumber &a)
{
    return CDualNumber(
        ::sin(a.Real()),
        a.Dual() * ::cos(a.Real())
    );
}

inline CDualNumber cos (const CDualNumber &a)
{
    return CDualNumber(
        ::cos(a.Real()),
        -a.Dual() * ::sin(a.Real())
    );
}

inline CDualNumber tan (const CDualNumber &a)
{
    return CDualNumber(
        ::tan(a.Real()),
        a.Dual() / (::cos(a.Real()) * ::cos(a.Real()))
    );
}

inline CDualNumber atan (const CDualNumber &a)
{
    return CDualNumber(
        ::atan(a.Real()),
        a.Dual() / (1.0f + a.Real() * a.Real())
    );
}

inline CDualNumber SmoothStep (CDualNumber x)
{
    // f(x) = 3x^2 - 2x^3
    // f'(x) = 6x - 6x^2
    return x * x * (CDualNumber(3) - CDualNumber(2) * x);
}

//----------------------------------------------------------------------
// Test Functions
//----------------------------------------------------------------------

void TestSmoothStep (float x)
{
    CDualNumber y = SmoothStep(CDualNumber(x, 1.0f));
    printf("smoothstep 3x^2-2x^3(%0.4f) = %0.4f\n", x, y.Real());
    printf("smoothstep 3x^2-2x^3'(%0.4f) = %0.4f\n\n", x, y.Dual());
}

void TestTrig (float x)
{
    CDualNumber y = sin(CDualNumber(x, 1.0f));
    printf("sin(%0.4f) = %0.4f\n", x, y.Real());
    printf("sin'(%0.4f) = %0.4f\n\n", x, y.Dual());

    y = cos(CDualNumber(x, 1.0f));
    printf("cos(%0.4f) = %0.4f\n", x, y.Real());
    printf("cos'(%0.4f) = %0.4f\n\n", x, y.Dual());

    y = tan(CDualNumber(x, 1.0f));
    printf("tan(%0.4f) = %0.4f\n", x, y.Real());
    printf("tan'(%0.4f) = %0.4f\n\n", x, y.Dual());

    y = atan(CDualNumber(x, 1.0f));
    printf("atan(%0.4f) = %0.4f\n", x, y.Real());
    printf("atan'(%0.4f) = %0.4f\n\n", x, y.Dual());
}

void TestSimple (float x)
{
    CDualNumber y = CDualNumber(3.0f) / sqrt(CDualNumber(x, 1.0f));
    printf("3/sqrt(%0.4f) = %0.4f\n", x, y.Real());
    printf("3/sqrt(%0.4f)' = %0.4f\n\n", x, y.Dual());

    y = pow(CDualNumber(x, 1.0f) + CDualNumber(1.0f), 1.337f);
    printf("(%0.4f+1)^1.337 = %0.4f\n", x, y.Real());
    printf("(%0.4f+1)^1.337' = %0.4f\n\n", x, y.Dual());
}

int main (int argc, char **argv)
{
    TestSmoothStep(0.5f);
    TestSmoothStep(0.75f);
    TestTrig(PI * 0.25f);
    TestSimple(3.0f);
    return 0;
}
```

Here is the program output:

## Closing Info

When you are wondering what number ε could be so that ε^2 is 0 but ε is not 0, you may be tempted to think that it is an imaginary quantity that doesn’t actually exist, just like i (the square root of -1). That isn’t quite the case though… I’ve seen ε described in two ways.

One way I’ve seen it described is that it’s an infinitesimal number. That sort of makes sense to me, but not in a concrete and tangible way.

The way that makes more sense to me is to describe it as a matrix like this:
[0, 1]
[0, 0]

If you multiply that matrix by itself, you get the zero matrix as a result.

In fact, an alternate way to implement the dual numbers is to treat them like a matrix like that.
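
As a quick sanity check of that claim, here is the matrix version of a + bε multiplied out (same bracket notation as above):

```
a + bε  =  [a, b]        [a, b]   [c, d]     [a*c, a*d + b*c]
           [0, a]        [0, a] * [0, c]  =  [0,   a*c      ]
```

The result is the matrix form of ac + (ad + bc)ε, which matches the dual number multiplication rule from earlier.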

I also wanted to mention that it’s possible to modify this technique to get the 2nd derivative of a function or the 3rd, or the Nth. It isn’t only limited to the 1st derivative. Check the links at the bottom of this post for more info, but essentially, if you want 1st and 2nd derivative, you need to make it so that ε^3 = 0 instead of ε^2 = 0. There is a way to do that with matrices.
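
As a sketch of what that might look like (my own illustration, not taken from a specific reference): track three components and make ε^3 vanish in the multiply. If you start with x = (x0, 1, 0), the result carries f(x0), f'(x0), and f''(x0)/2:

```
// hypothetical sketch of dual number style second derivatives: numbers of
// the form a + bε + cε^2, where ε^3 = 0
struct SSecondOrderDual
{
    float v;  // value
    float d1; // ε coefficient (first derivative)
    float d2; // ε^2 coefficient (half the second derivative)
};

SSecondOrderDual Mul (const SSecondOrderDual &a, const SSecondOrderDual &b)
{
    return {
        a.v * b.v,
        a.v * b.d1 + a.d1 * b.v,
        a.v * b.d2 + a.d1 * b.d1 + a.d2 * b.v // ε^3 and higher terms vanish
    };
}

// example: with x = (x0, 1, 0), Mul(x, x) gives (x0^2, 2*x0, 1),
// so f''(x0) = 2 * 1 = 2, which is correct for f(x) = x^2
```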

Another neat thing is that you can also extend this into multiple dimensions. This is useful for situations like if you have some terrain described by mathematical functions, when you are walking the grid of terrain to make vertex information, you can get the slope / gradient / surface normal at the same time.
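
Here’s a sketch of the multi-dimensional version too (again, just an illustration): carry one dual component per input variable, and the dual parts become the partial derivatives, i.e. the gradient:

```
// hypothetical two variable dual number: v + (dx)εx + (dy)εy, where all
// products of epsilons are zero
struct SDual2D
{
    float v, dx, dy;
};

SDual2D Mul (const SDual2D &a, const SDual2D &b)
{
    return {
        a.v * b.v,
        a.v * b.dx + a.dx * b.v, // partial derivative in x, by the product rule
        a.v * b.dy + a.dy * b.v  // partial derivative in y, by the product rule
    };
}

// evaluate f(x,y) = x*y at (3,2): x = (3,1,0), y = (2,0,1)
// Mul(x, y) = (6, 2, 3) -> f = 6, df/dx = 2 (= y), df/dy = 3 (= x)
```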

Lastly, I wanted to mention a different kind of number called a hyperbolic number.

The imaginary number i has i^2 = -1, and we can use it to do 2d rotations.

The dual number ε has ε^2 = 0 (and ε is not 0), and we can use it to do automatic differentiation.

Hyperbolic numbers have j, where j^2 = 1 (and j is neither 1 nor -1). I’m not sure, but I bet they have some interesting usefulness to them too. It would be interesting to research that more sometime. If you know anything about them, please post a comment!

This shadertoy is what got me started looking into dual numbers. It’s a mandelbrot viewer done by iq using dual numbers to estimate a distance from a point to the mandelbrot set (as far as I understand it anyhow, ha!). He uses that estimated distance to color the pixels.

I didn’t get very much into the reasons why this works (it has to do with Taylor series terms disappearing when ε^2 is 0), or the rigorous math behind deriving the operators, but here are some great links I found while researching this stuff and putting this blog post together.

# Using Imaginary Numbers To Rotate 2D Vectors

I’m a big fan of “exotic” math and programming techniques. There’s nothing I like more than seeing something common used in an uncommon way to do something that I didn’t know was possible.

In this post I share a technique that lets you use imaginary numbers (complex numbers more specifically) to be able to rotate vectors in 2d. This technique is so simple that you can even use it to rotate vectors by hand on paper!

## Quick Review: Imaginary and Complex Numbers

The imaginary number “i” is the square root of -1. Without using i, you can’t take the square root of a negative number: if the answer were negative, multiplying that negative by itself would give a positive, and if the answer were positive, multiplying that positive by itself would also give a positive.

But, using i, you CAN take the square root of negative numbers.

The square root of 25 is 5.

The square root of -25 is 5*i, or just 5i, which means “5 times the square root of -1”.

You can combine a real and imaginary number into something called a complex number like this: 3 + 5i

## Quick Review: Multiplying Complex Numbers

We’ll need to be able to multiply complex numbers together to do our rotations.

(3+5i)*(2+3i) = ???

Luckily, multiplying complex numbers together is as simple as using F.O.I.L. if you remember that from high school math class, it stands for First, Outer, Inner, Last.

Using that, we can multiply and then combine terms, remembering that i^2 is -1:

(3+5i)*(2+3i) =
6 + 9i + 10i + 15i^2 =
6 + 19i + 15i^2 =
6 + 19i - 15 =
-9 + 19i

There we go, that’s all there is to multiplying complex numbers together!

## Getting Down To Business: Rotating

Let’s say that we want to rotate the vector (0,1) by the angle of the vector (1,1). To do that, we just convert the vectors to complex numbers, using the x axis as the real number component, and the y axis as the imaginary number component, then multiply them, and convert back to vectors.

That’s a mouthful, so here it is, step by step.

1) Convert Vectors to Complex Numbers

(0,1) = 0 + 1i = i
(1,1) = 1 + 1i = 1 + i

2) Multiply the Complex Numbers

i * (1 + i) =
i + i^2 =
i - 1 =
-1 + i

In the above we change i - 1 to -1 + i to make the next step easier. The real component of the complex number is the X axis and the imaginary component is the Y axis.

3) Convert from Complex Number to Vector

-1 + i =
-1 + 1i =
(-1, 1)

The diagram below shows this operation:

As you can see we rotated the purple vector (0,1) which has an angle of 90 degrees and length of 1, by the blue vector (1,1) which has an angle of 45 degrees and a length of sqrt(2), and as a result we got the tan vector (-1,1) which has an angle of 135 degrees and a length of sqrt(2). So, we essentially rotated the 90 degree angle by the 45 degree angle and ended up with a 135 degree angle.

Note that order of operations doesn’t matter, you could rotate the 90 degree angle by 45 degrees, or you could rotate the 45 degree angle by 90 degrees, either way you are going to end up with 135 degrees.

A caveat with this technique though is that when you do the rotation, the resulting vector’s length will be the product of the lengths of the two source vectors; you won’t get a normalized vector out as a result unless your source vectors are normalized, or you normalize after you are done with your operations.

As another example, let’s rotate the vector (4,3) by the vector (-12,5).

The first step is to convert them to complex numbers:
(4,3) = 4 + 3i
(-12,5) = -12 + 5i

Then we multiply them:
(4 + 3i) * (-12 + 5i) =
-48 + 20i - 36i + 15i^2 =
-48 - 16i - 15 =
-63 - 16i

Then convert back to a vector to get: (-63, -16)

In the image, you can see that we started with the blue vector (4,3) which is about 37 degrees and has a length of 5.

We rotated that vector by the purple vector (-12,5) which is about 157 degrees and has a length of 13.

That gave us the tan vector of (-63, -16) which is about 194 degrees and has a length of 65.

The resulting vector has the rotations added, and the lengths multiplied together.

## Unrotating

If we multiply the complex numbers together to add rotations, does that mean that we divide the complex numbers to subtract rotations? Sort of…

If you want to subtract a rotation, you multiply the imaginary part of the vector you want to subtract by -1 (or just flip the sign) and then multiply as normal.

When you flip the sign of the imaginary part, this is actually called the “complex conjugate” but if that term scares you, feel free to ignore it 😛

As an example of unrotation, let’s take the vector (5,1) and subtract the vector (2,2) from it to see what we get.

The first step is to convert them into complex numbers.
(5,1) = 5 + i
(2,2) = 2 + 2i

Next up, we are going to flip the imaginary part of the vector we want to subtract.
2 + 2i becomes 2 - 2i

Then, we multiply as per normal:
(5 + i) * (2 - 2i) =
10 - 10i + 2i - 2i^2 =
10 - 8i + 2 =
12 - 8i

Finally, we convert back to a vector to get (12, -8).

In the image above, we started with the blue vector (5,1) which is about 11 degrees and has a length of sqrt(26).

We then unrotated by the purple vector (2,2) which is 45 degrees and has a length of sqrt(8).

That gave us the tan vector as a result (12,-8) which is about -34 degrees and has a length of sqrt(208).

Our resulting vector’s angle is the second vector’s angle subtracted from the first vector’s angle, but the lengths of the vectors are still multiplied together, not divided.

## Unrotating to Get Vector Length

As a fun aside, if you unrotate a vector by itself, the imaginary component (the Y component) will disappear, and the real component (the X component) of the result will be the squared length of the vector.

It’s easy enough to calculate the squared length of a vector by adding x^2 and y^2 but this is an interesting property.
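
For example, unrotating (4,3) by itself means multiplying (4 + 3i) by its conjugate (4 - 3i):

(4 + 3i) * (4 - 3i) =
16 - 12i + 12i - 9i^2 =
16 + 9 =
25

The imaginary part cancels out, and 25 is the squared length of (4,3), whose length is 5.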

In fact, in the last post I mentioned that CORDIC math uses imaginary numbers to rotate vectors to find sine, cosine, etc. This is related to how that works. It basically rotates or unrotates a vector by smaller and smaller angles iteratively, based on the sign of the y (imaginary) component, trying to drive y to zero, which leaves the answer in the x component. It also has some other magic sprinkled in so that it only has to deal with integer math.

Hopefully before too long I’ll be able to make a CORDIC blog post and talk more about that in detail.

## Example Code

Ok so this theory stuff is all well and good but how about some code?

Before we get to that I want to give the naked formulas for rotating or unrotating vectors.

Given two vectors (A.x, A.y) and (B.x, B.y)…

Rotating A by B to get C:
C.x = A.x*B.x - A.y*B.y
C.y = A.x*B.y + A.y*B.x

Note: In the resulting vector C, the angles will be added, but the vector lengths will be multiplied.

Unrotating A by B to get C, we just flip the sign of any terms that contain B.y:
C.x = A.x*B.x + A.y*B.y
C.y = -A.x*B.y + A.y*B.x

Note: In the resulting vector C, the angles will be subtracted, but the vector lengths will be multiplied.

Below is some sample code, along with the output, showing our examples above working correctly.

```
#include <stdio.h>

struct SVector
{
    float x;
    float y;
};

void Rotate (const SVector &A, const SVector &B, SVector &C)
{
    C.x = A.x * B.x - A.y * B.y;
    C.y = A.x * B.y + A.y * B.x;
}

void Unrotate (const SVector &A, const SVector &B, SVector &C)
{
    C.x = A.x * B.x + A.y * B.y;
    C.y = -A.x * B.y + A.y * B.x;
}

void print (const SVector &V)
{
    printf("(%0.2f,%0.2f)", V.x, V.y);
}

int main(int argc, char **argv)
{
    {
        SVector testA = {0.0f, 1.0f};
        SVector testB = {1.0f, 1.0f};
        SVector testC = {0.0f, 0.0f};
        Rotate(testA, testB, testC);
        printf("Rotating ");
        print(testA);
        printf(" by ");
        print(testB);
        printf(" = ");
        print(testC);
        printf("\n");
    }
    {
        SVector testA = {4.0f, 3.0f};
        SVector testB = {-12.0f, 5.0f};
        SVector testC = {0.0f, 0.0f};
        Rotate(testA, testB, testC);
        printf("Rotating ");
        print(testA);
        printf(" by ");
        print(testB);
        printf(" = ");
        print(testC);
        printf("\n");
    }
    {
        SVector testA = {5.0f, 1.0f};
        SVector testB = {2.0f, 2.0f};
        SVector testC = {0.0f, 0.0f};
        Unrotate(testA, testB, testC);
        printf("Unrotating ");
        print(testA);
        printf(" by ");
        print(testB);
        printf(" = ");
        print(testC);
        printf("\n");
    }
    return 0;
}
```

Program output: