Over 4 years ago I wrote a short blog post on images in the frequency domain: https://blog.demofox.org/2016/07/28/fourier-transform-and-inverse-of-images/
It’s time to revisit the topic a bit and add some more things.
If you are curious about how the Fourier transform works, which can transform images or other data into the frequency domain, give this a read: https://blog.demofox.org/2016/08/11/understanding-the-discrete-fourier-transform/
The C++ code that goes with this blog post can be found at https://github.com/Atrix256/FrequencySpaceImages
When you transform an image into the frequency domain, you get a complex number (with a real and imaginary component) per pixel that you can use to get information about the frequencies (literal sine and cosine waves) that go into making the image. One piece of information is the “phase” or starting angle of that wave. You get the phase by using atan2(imaginary, real). The other piece of information is the “amplitude” of that wave, or how large the wave is in the image. The amplitude is the length of the 2d vector (real, imaginary).
A quick and easy way to do image compression then, is to convert an image to frequency space, find the lowest amplitude frequencies and throw them away – literally zero out the complex number. If you throw enough of them away, it’ll take less data to describe the frequency content of an image, than the pixels of the image, and you’ll have compressed the image.
The more aggressive you are at throwing away frequencies though, the more the image quality will degrade. This is “lossy” compression and is a simplified version of how jpg image compression works. Lossy compression is in contrast to lossless compression like you find in png files, which use something more like a .zip compression algorithm to perfectly encode all the source data.
In the code that goes with this post, the DoTestZeroing() function throws out the lowest 10% amplitude frequencies, then the lowest 20%, then 30% and so on up to 90%. At each stage, it writes all complex frequency values out into a binary file, which can then be compressed using .zip as a method for realizing the image compression. As the data gets more zeros, it gets more compressible.
The top row in the image below shows an original 512×1024 image, the DFT amplitude information, and the DFT phase information. The bottom row shows the same, but for an image which has had it’s lower 90% amplitude frequencies thrown away. The DFT data is 8MB for both (uncompressed), and compresses to 7.7MB for the top picture, but only 847KB for the bottom picture. The inverse DFT was used to turn the modified frequency data on the bottom back into an image.
Here is another image which is 512×512 and has DFT which is 4MB uncompressed. The top image’s DFT data compresses to 3.83MB, while the bottom compresses to 438KB.
While fairly effective, this is also a pretty naive way of doing frequency based image compression!
More sophisticated methods use the “Discrete Cosine Transform” or DCT instead of the DFT because it tends to make more of the frequency magnitudes zero, consolidating the data into fewer important frequencies, which means it’s already smaller before you start throwing away frequencies. DCT and DFT also pretend that the images go on forever, instead of just stopping at the edge. DFT acts as if those images repeat in a tiled fashion, while DCT acts as if they are mirrored at each repeat, which can also be a nice property for image quality.
Other methods break an image up into blocks before doing frequency based compression. Also, you can use wavelets to compress images, or principle component analysis or singular value decomposition. You can also fit your image with “whatever” basis functions you want, using L1 norm regularization to promote the coefficients of your fitting to be zero, to make the fit data be less sparse, just like DCT does compared to DFT.
Another thing you can do is use compressed sensing to skip a couple steps: You take a couple randomized but roughly evenly spaced samples from the image (blue noise or LDS are going to be good options here), and then you can eg find Fourier basis coefficients (DFT!) that match the sparse/irregular data samples you took. This is like throwing out low frequencies, but without having to DFT the whole data set, and then throw things out. It starts with sparse data and then fits it.
Bart Wronski has several write ups on his blog in this area, so give them a read if you are interested: https://bartwronski.com/2020/08/30/compressing-pbr-texture-sets-with-sparsity-and-dictionary-learning/
This is a great read showing how to fit data using L1 regularization and all the related information you might be interested in: https://www.analyticsvidhya.com/blog/2017/06/a-comprehensive-guide-for-linear-ridge-and-lasso-regression/
This video is a great overview of the random grab bag of other things I mentioned: https://www.youtube.com/watch?v=aHCyHbRIz44&feature=youtu.be
In my previous post on this topic I showed how you could throw away frequencies that were farther than a certain distance from the center to low pass filter an image, aka blur it. I also showed how if you threw away frequencies closer than a certain distance, it would high pass filter an image, aka sharpen it.
That throwing away of frequency data based on distance is the same as multiplying the frequency data by a mask which has a 1.0 in some places and a 0.0 in others. You can generalize this to multiply frequencies by any number. In the below I restrict the multiplications to be between 0 and 1, but you could definitely go to larger numbers or even go to negative numbers if you wanted.
The below shows the patterns that the images are multiplied by in this section. Top row left to right is a low pass filter, then a stronger low pass filter (gets rid of more high frequencies than the other) and lastly is a notch filter or “band stop” filter. The bottom row is the complement, such that you get the bottom by subtracting the image from white (1.0). Left to right, the bottom row is a high pass filter, then a weaker high pass filter (lets more low frequencies in) and then a band pass filter which only lets certain frequencies through.
First up is the “Loki and Alan” picture. Frequencies and actual picture values filtered from the pictures on the top are present in the pictures on the bottom and vice versa. In this way, blurring compared to sharpening (and edge detection) are two sides of the same coin. It just matters which part you throw away and which part you keep.
Here is what the frequency magnitudes look like. Note that each image has the magnitudes put through a log function, and also normalized to be 1.0 max. This is why even though the high pass filters (and band pass) darken the middle, it doesn’t seem like it. The renormalization obscures that fact a bit, and the middle is brightest (largest amplitudes) which we saw when throwing out the lowest amplitudes in the last section.
Here are the same filters applied to the scenery image. The top right image has some strange patterns in it if you look closely (click the image to view the full size in another tab).
In the last section, we made “images” by using a distance function, to make values to multiply the frequencies by to filter out certain frequencies.
In this section, we are going to take two images, put them into frequency space, multiply them together, take them out of frequency space, and see what kind of results come out.
There is something called the “convolution theorem” which tells us that multiplication in the frequency domain, is the same as convolution between the images. Convolution is an expensive operation, because you have to loop through all the pixels of one image, and at each pixel, loop through the pixels of the other image, and do some multiplications and additions. Convolution is so slow, that it can actually be quicker to take two images you want convolved to frequency domain, multiply them together, and then take them out of frequency space to be images again.
Convolution is used in graphics for things like blurs, sharpening, or applying bokeh for depth of field, so speeding it up can be a big help! Convolution is also used in audio for things like reverberation which makes audio sound like it was played inside of a cave or a big cathedral.
Technical note: the “kernel” image needs to be centered at pixel (0,0), not the center of the image. Also, the kernel image should be normalized so that summing up all of it’s pixels adds up to 1.0. You also need to zero pad (add a black pixel border to) both the source image and kernel image to be the size of source+kernel+1 on the x and y axis before DFT’ing so they are the same size, and to avoid wrapping problems. After you are done multiplying and inverse DFT’ing, you can remove the black border again.
Here are the 4 images we are going to use as kernel images: A star, a plus, a circle, and a blob.
Here are the DFT magnitudes of those images.
Here is the “Loki and Alan” picture convolved with those kernel images.
You can see that the images somehow take on the qualities of the kernel… the star one is very angular, the plus one is very “plus like” and the circular one is very circular. Note how the blob acts a lot like a low pass filter! In frequency space, it does actually look like one, so that makes sense:
Here is the scenery picture convolved by the same shapes.
If you think the above looks weird when doing convolution on images, you should give a listen to convolution being used in audio. When used for reverb it sounds good, and sounds correct, but if you use it to convolve arbitrary audio samples together, you can get some really interesting and bizarre sounds! You can hear that here: https://blog.demofox.org/2015/03/23/diy-synth-convolution-reverb-1d-discrete-convolution-of-audio-samples/
The dark border around the image is an artifact from adding a black border around the images to make them the right size (zero padding). If you instead just make the convolution kernel image as large as the image you are convolving (and that is already a power of 2, since this FFT requires that), you’d get the below, which has part of the image “wrapping” across from the other side.
If you used the DCT (discrete cosine transform) instead, it would MIRROR the texture instead of wrapping it, so you’d get more similar pixels to what should be there most of the time, compared to DFT which wraps. Another way to solve this problem though is if you are doing convolution in image space, instead of frequency space, is you can throw away any samples that go outside of the valid area of the images. You want to sum up the weight of the samples you actually took though in this case, and divide the final convolution sum by that weight, to normalize it. That will make pixels near the border have higher weights than they should, but it can be a less jarring artifact than the black border, wrapping, or mirroring artifacts.
Truth be told, many of the operations in this article can be done in a handful of lines of python. I find a lot of value in implementing things myself though, as it helps me internalize the ideas to better understand when and how to use them, and how to avoid problems/mysteries that come up when things are used as black boxes. I feel the tide turning though after a recent look at the sea of algorithms relating to SVD,PCA and finding eigenvectors. That is some crazy stuff, and way too much for a single person to deal with, while still trying to be competent in other topics 😛