This is the second article in a series on the basics of cryptography:
- Cryptography 101: Hashing
- Cryptography 101: Encryption – One Time Pad
- Cryptography 101: Encryption – Symmetric Keys
- Cryptography 101: Encryption – Asymmetric Keys
- Cryptography 101: Realistic Security
DISCLAIMER: These articles are meant for educational purposes only. The methods explained here are meant only to illustrate the basic concepts of cryptography and may or may not be suitable in real world applications. For serious applications such as financial transactions, I recommend hiring security professionals and also getting a lawyer involved. Use this info and code at your own risk, I claim no responsibility!
If you want more in depth information about cryptography than these introductory articles provide, I highly recommend a book called Applied Cryptography by Bruce Schneier. That book literally almost didnt get published because the NSA didnt want the info getting out into the public. Yay for the 1st amendment!
Plaintext, Ciphertext and Keys
When talking about encryption, you’ll often hear two terms: Plaintext and Ciphertext.
The plaintext is the unencrypted data and may be either text or binary data.
Ciphertext is the encrypted data.
Ideally, the ciphertext will give no information about the nature of the plaintext that created it, other than perhaps the size of the plaintext itself. Good ciphertext will look indistinguishable from random numbers, both by the human eye and mathematically.
This is because the point of encryption is to hide any patterns in the data, and good encryption will hide all discernible patterns. The only possible exception to this would be if the encryption process made ciphertext with misleading patterns that didn’t give any information about the plaintext. I’m not sure if this comes up in practice, but it definitely could.
Another term you’ll hear often is “Keys”. A key is just the data that you encrypt or unencrypt data with. You can think of it as the password.
The One Time Pad
The one time pad is an extremely simple, yet secure way of encrypting data.
It is so simple that it only uses the xor operation, and is so secure that the ciphertext is literally uncrackable if done correctly.
The downside is that it requires a lot of pre-shared data which gets used up as you encrypt data. When you run out, you have to share more of this data if you want to keep communicating with that person.
This pre-shared data is the key used for encryption and unencryption.
To use a one time pad, you first gather a large amount of random data and share that with the person you want to communicate securely with. This is the one time pad itself and you’ll want one byte of random data for each byte of information you want to send to that person. This step is also the crux of the security. You need to make sure that nobody else is able to get the one time pad except your intended target, and you also need to ensure that you have high quality random data (more on that later on).
To encrypt data, you take one byte from the one time pad for each byte of data you want to encrypt and XOR them together. When you are done, you throw away the used bytes of the one time pad and never use them again.
Then, you send the ciphertext to the person you already pre-shared the one time pad data with.
To decrypt the data, that person xors each byte of the encrypted data with a byte of the one time pad, and they also throw away each byte used of the one time pad just like you did.
When they are done decrypting, they will have the plaintext data, and their one time pad will be in the same state that yours is in (ie their next number will be your next number).
From here you can rinse and repeat until you run out of one time pad data.
Super simple, and as long as nobody else has your one time pad data, and your one time pad data is truly random, nobody will be able to crack your ciphertext and get the plaintext.
The Importance of Randomness
Besides securely transmitting the random data,the other crux of the security i mentioned was the quality of the random numbers in your one time pad.
The reason this is important is because if the numbers aren’t truly random, there will be patterns in the data. If there are patterns in the data, people can possibly discover those patterns, thus being able to separate the plaintext from the key and unencrypting some or all of your data.
Randomness comes up EVERYWHERE in cryptography, both in input and output to cryptographic algorithms. Because of this, truly random data is often somewhat of a commodity to cryptographers. Since re-using random data means that its slightly less secure (would be attackers have a pattern to gain knowledge with if you re-use your random numbers!), it’s also a consumable commodity!
In fact, there are famous books that are nothing but hundreds and hundreds of pages of random numbers generated from various real world sources – such as taking the wind speed over time in Juneau, Alaska and multiplying it by static gathered from a radio antenna which is tuned to dead air. Using real world data like that, people can be relatively sure that the data doesn’t have any discernible patterns. They just have to watch out for those pesky physicists unlocking the nature of the universe and finding the patterns in the background radiation 😛
I’m not even joking about these books by the way, check this out, here’s one such book!
A Million Random Digits with 100,000 Normal Deviates
Using random numbers from a published book makes your random numbers slightly less random (since other people have the book too, and attackers may notice it on your bookshelf or something), but so long as you don’t just use the numbers of the first or last pages (or anything else predictable), and the book actually contains high quality random numbers, it ought to be fine.
you can also BUY large amounts of high quality random data online from places like random.org.
The astute reader might ask “Why don’t i just use a pseudo random number generator on each side and never run out of one time pad data?”.
Well, if someone knows the PRNG you are using, and your seed, they would be able to unencrypt your data just like your intended target can.
HOWEVER, this kind of setup can be appropriate sometimes if you know the risks and are ok with them. Check out this wikipedia page for more information:
Cryptographically Secure Pseudorandom Number Generator
Specific Attack Against Randomness
As an extreme example, lets say that instead of random numbers, your one time pad data is all 0xFFFFFFFF and that you are using it to encrypt a text file (say, this article for instance).
When you encrypted your data by XORing each byte against 255 (0xFF), all the bits of each byte would be flipped from 0 to 1 or 1 to 0.
While it’s true that it would make the data un readable, and seemingly random, garbage data to the human eye, mathematically it’s a very different story.
If someone were analyzing your ciphertext, they would first notice that the byte value 154 (which looks like Ü and has a binary value of 10011010) occurs in the ciphertext roughly the same amount that the letter ‘e’ appears in the typical english language text document. This would be astute because that value of 154 is just the flipped bits of ‘e’ which has a byte value of 105 and a binary value of 1100101 (the binary bits are just flipped due to the XOR against 0xFF).
Then, they may notice the same for other letters… that some other value occurs as often as you’d expect an ‘o’ to appear in english, or an ‘m’ etc.
Pretty soon they have a clear picture that this is english plaintext, and they can start replacing letters with what they seem like they should be statistically (for the statistically significant letters).
After that, they have some of your plain text, and figuring out the rest is similar to playing sudoku… figuring out which letters fit where, based on how words are spelled, and then doing a find / replace in the entire document for each letter you figure out.
In the end, they have your plaintext and your encryption failed you.
This is an extreme case that is really simple to break, but hopefully you can see that if you even use slightly lower quality random numbers (such as the built in rand() function of C++, whether or not you use srand(time(0)) or not!) that you open yourself up to attack and it can compromise your whole communication stream.
Requiring Less Pre-Shared Data
You can modify the one time pad algorithm to use less pre-shared data if you are ok with the changes in your security profile (your data may be weaker against some attacks, stronger against others).
There are many ways to skin a cat but I’ll just talk about a couple.
One way would be to generate more random data from the random data you do have. For instance, if you and the person you are pre-sharing data with agree on a protocol of MD5 hashing every 100 bytes of one time pad data to generate more random bytes that you can interleave with your one time pad data, you would have a way of generating 16% more one time pad data than what you gathered or shared with the other person. (16% more because MD5 hashes of 100 byte blocks spit out 16 byte hashes of seemingly random numbers – see the previous article on hashing for more information!).
However, doing this obviously makes the “random” data *somewhat* lower quality since there is a pattern to some of the random data. As non obvious as that pattern may be, if someone were to do fancy mathematical analysis of the data, this sort of technique may cause patterns to crop up which lead to a “chink in the armor” giving the attacker a foothold in recovering all or some of the plaintext.
Another way of making your one time pad go farther is instead of XORing the one time pad data against the plaintext and ciphertext to encrypt and unencrypt, you can use the one time pad to give you the keys (passwords) to encrypt / decrypt each communication.
For instance, if you and the person you are communicating with agree in advance on a symmetric key encryption algorithm (more on this topic in the next article!) that takes a 16 byte encryption key, you could use every 16 byte block of one time pad data for an entire single message no matter how large the message is.
For instance, you could encrypt 2GB of data using the first 16 bytes of a one time pad, send that to the person, then you encrypt 500MB with the next 16 bytes and send that to the person.
You’ve effectively used 32 bytes of your one time pad to encrypt 2.5GB of data, which is a crazy good ratio compared to the traditional one time pad protocol which would have required 2.5GB of pre-shared one time pad random data.
If you go this route, your ciphertext now becomes vulnerable to whatever attacks your symmetric key encryption algorithm are vulnerable to though. If the algorithm you are using turns out to have a serious flaw that mathematicians find out about (such as there’s a really easy way to recover the plaintext – this happens fairly often believe it or not!), your whole communication channel is screwed, whereas with the one time pad, it’s just the quality of your random numbers, and the security of your pre-shared data that define the security. So, there are definitely pros and cons to weigh.
There are a lot of ways to attack each cryptographic technique, and if you are serious about cryptography you really need to read up on a lot of things and be extremely clever, thinking of every possible situation that anyone else might think of.
Security is hard because often times you have a limited amount of time to implement your security (because you need to ship your software or open your service to the public SOME DAY), and there are most certainly more attackers than there are security professionals on your team, and they have all the time in the world to search for what you’ve missed! Just as there is no rest for the wicked, the same too is true for security professionals.
I mentioned that the quality of your random numbers and the security of your pre-shared data was the lynchpin of protecting against people getting your plaintext from your cyphertext, but there is another way to attack the communication channel as well.
Namely, if someone were to intercept a message between you and your target person, they may not be able to get your plaintext out, but if they can keep that message from getting to your target, and do so in a way that you aren’t aware of this, they can completely break your communication channel.
The reason for this is that doing this makes the one time pads of you and your target person get out of sync when you throw away one time pad data that the target person did not throw away. This means that the random numbers you are using to encrypt your data is not the same numbers your target person is decrypting data with, so they will get garbage, random data as output and not be able to recover the plaintext.
A malicious person in the middle was able to thwart your ability to communicate securely!
Also, if a person was able to modify the ORDER that the target person got the encrypted messages in, they would be able to break the channel that way as well (at least temporarily) since it would make the recieving person unencrypt the messages with the wrong pieces of data. The next message the person got would be unencryptable in this case though, since the same number of bytes were used up by the out of order messages as if they had come in the right order.
This is not the traditional man in the middle attack, but it is definitely *A* man in the middle attack.
As with so many things, there are often strange, non obvious connections between different subjects. Case in point, one way to protect against these sort of attacks of lost or re-ordered messages would be to implement the sorts of algorithms used in network programming (like those used in TCP/IP) that ensure “guaranteed” and “in order” communication between two computers or individuals.
Going this route, just like how computers on the internet can know when they got message B but haven’t received message A yet, or that when they sent a message to another person that it never got there, you too would be able to know if a message got to the target, and they would be able to know if they have received messages out of order or not.
Until Next Time!
That’s the essence of the one time pad and I hope you found it interesting!
Next Up I’ll be talking about symmetric key algorithms which are the more traditional way of encrypting where you use a password to protect data.
For those interested in cracking encrypted data (which technically is against the DMCA these days, but used to be a common academic activity, and a way of weeding out insecure algorithms), here’s a nice morsel for you. It’s hexadecimal encoded encrypted data. Every 2 hex characters equals one byte of encrypted data. If you use the information from the article, you ought to be able to crack it (there’s an easy way and a hard way).
And no, cracking the encrypted data below is not even technically against the law, I’m giving you explicit permission to crack it if you can (: