Welcome to the first article in a series aimed to teach the basics of cryptography:
- Cryptography 101: Hashing
- Cryptography 101: Encryption – One Time Pad
- Cryptography 101: Encryption – Symmetric Keys
- Cryptography 101: Encryption – Asymmetric Keys
- Cryptography 101: Realistic Security
In this digital age, cryptography is more important than ever. It’s used to protect financial transactions, ensure the anonymity of political dissidents, protect private conversations, help prevent cheating in video games and many other things as well.
DISCLAIMER: These articles are meant for educational purposes only. The methods explained here are meant only to illustrate the basic concepts of cryptography and may or may not be suitable in real world applications. For serious applications such as financial transactions, I recommend hiring security professionals and also getting a lawyer involved. Use this info and code at your own risk, I claim no responsibility!
If you want more in depth information about cryptography than these introductory articles provide, I highly recommend a book called Applied Cryptography by Bruce Schneier. That book literally almost didn’t get published because the NSA didn’t want the info getting out into the public. Yay for the 1st amendment!
Hashing in Computer Science
You may be familiar with the term “hashing” from computer science. In computer science, a hash function is a function which takes input data, preforms some operations on it and spits out some (usually) smaller output data that can be used more or less as a unique identifier for the input data.
Common uses of hashing include:
- Hashing the contents of large files to be able to compare the hashes to quickly know if they are the same or different, instead of having to compare the files byte by byte. This is especially useful when comparing files over a network connection.
- Hashing peices of data which are difficult or time consuming to compare (such as strings) and using the hashed value as a “look up key” within a database or array or list, so that you can look up items very quickly by their hash, instead of having to do more expensive string compares (or whatever other more complex comparison and lookup methods).
More info is available here: http://en.wikipedia.org/wiki/Hash_function
Hashing in Cryptography
As you can probably guess, since hashing makes large pieces of data (such as entire files) into small pieces of data (often only a handful of bytes large), there are many pieces of larger source data that can result in the same smaller hashed data. When this happens, it’s called a hash collision and a good hash function will do it’s best to minimize collisions for optimal performance. The more output bits you have, the more “space” you have before a collision is unavoidable.
Often times, a good hashing algorithm will have 2 properties to minimize collisions…
- Small changes in input give large changes in output. In other words, it’s very sensitive to initial conditions and so is a chaotic function (http://en.wikipedia.org/wiki/Chaos_theory)
- If you give it a set of well distributed random inputs, it should give a well distributed set of random outputs. Heck, if you give it any set of (varying) inputs, it should give a well distributed set of random outputs ideally. By random i mean no discernible patterns.
If these things aren’t true, the hashed output can give clues as to the nature of the input, or, it can make it easier to provide input that hashes to the same output (which is the main way to attack hash based security).
For instance, if your hashing algorithm made an 8 bit hash (very small!) that always (or often) set the 7th and 8th bits to 1, that means effectively you really have a 6 bit hash, because 2 of the bits are almost always the same. In general, more bits means more security, since it’s harder to get a hash collision on purpose.
Quick aside, can you think of something else with these properties? Some deterministic algorithm that spits out chaotic, well distributed, seemingly random numbers based on (perhaps) non random input? How about a pseudo random number generator? There is a lot of crossover between these two types of algorithms and I find it pretty neat that they are working towards almost the same goals, but that they are used for such different things.
Also as you might guess, hashes are one way. If you are given hashed data, it’s difficult or impossible to work backwards and get the source data back again. In fact, you often hear hash functions referred to as “one way hash functions” because of this. This is important because in cryptographical uses you want a hash to reveal as little information about the source data as possible.
Example Uses of Cryptographic Hashing
Here’s two examples of places where hashing comes in handy. One is for protecting passwords, and the other is for protecting save game data of video games.
For protecting passwords, many times when you have a large online service such as facebook, youtube, etc, there will be a central database (cluster) storing everyone’s account information, including their passwords.
If someone were to hack a server and get access to the user database table, they would have everyone’s username and password and the users would be screwed.
A way that people address this is to store the HASH of a each password in the database table instead of the password itself. When people log in, the server takes the password that it received from the user, puts it through the hash algorithm, and compares it to the hash stored in the database. If the hashes match, they know (can assume with a good level of certainty) that the user is who they claim to be. However, if this server gets hacked and the database is compromised, the attacker won’t have the passwords, they will only have the hashed passwords. The attacker will have to try and brute force the hashes, which is essentially the same as having to brute force the password – except that they can do it on their own computer in their own time of course, which makes it easier and untraceable unfortunately. Hopefully if this happens, the service can tell their users to change their passwords before the attacker is able to crack many of the logins.
Protecting Save Game Data
For protecting save game data, hashes are used in conjunction with encryption to prevent both read and write access to the save game data.
To write this protected save game data, you first hash the unencrypted save game data, and write that to the front of the file. Next, you encrypt the save game data and write that after the hash.
When reading save game data, you read in both the hash and the encrypted save game data. Next you unencrypt the save game data and hash it. Then, you can compare the hash you made with the hash stored in the file and if they don’t match, you know that someone tried to tamper with the file and you can consider it invalid / corrupt.
Also, since the save game data is encrypted, it’s difficult for a user to read the data in your save game data. Thus you protect the file from both reading and writing.
It’s possible that the person could modify the data in such a way that it will unencrypt and then hash to the same hash value stored in the beginning of the file, but it’s extremely unlikely, and also even less unlikely that doing so will result in something favorable for the attacker. They can’t even be sure they are increasing a value thanks to the encryption function scrambling the data completely.
Hashing Algorithm Overview
In a nutshell, besides all the stuff we talked about above, a hashing algorithm is just a deterministic algorithm (meaning it acts the same way every time, no randomness) that takes some input, chews it up, and spits out some (often) smaller piece of data to represent it. When chewing it up, it can do things that are destructive to the data (such as integer division, which loses precision) and isn’t just limited to non destructive operations like encryption algorithms are (non destruction operations can be reversed, such as XOR, addition, subtraction, bit rotations).
As an extra piece of security, people often “SALT” their hashes which means they hash some constant before hashing whatever data they want to hash. This constant is called the salt and you can think of it kind of like a password. This way, even if someone knows what algorithm you are using to hash data (such as the popular MD5 or SHA-1 hash functions), you’d also have to know the salt used to more effectively attack the system. It’s a little extra bit of security, which is always nice.
Example Hash Function
Here’s an example hash function in C++. Again, note that this is not really fit for real world use or important situations, it’s just for educational purposes. You’d want to do more “chewing” and use different operations, bit rotations to make sure all the bits got “hit” by the xor’s, etc. Check out some more complex, real world hashing algorithms for more info!
//assuming sizeof(int) == 4
typedef unsigned int uint32;
//Takes a pointer and length so you can hash binary data as well as text
//note that this function as is won't give the same answers on machines with different endian-ness
uint32 Calculate4ByteHash(const unsigned char *pData, int nDataLength, const unsigned char *pSalt, int nSaltLength)
//setup some variables
uint32 nHash = 0;
unsigned char *pHashPointer = (unsigned char *)&nHash;
//salt the hash
for(int nIndex = 0; nIndex < nSaltLength; ++nIndex)
pHashPointer[nIndex%4] = pHashPointer[nIndex%4] ^ pSalt[nIndex];
//hash the data
for(int nIndex = 0; nIndex < nDataLength; ++nIndex)
pHashPointer[nIndex%4] = pHashPointer[nIndex%4] ^ pData[nIndex];
Assuming the algorithm meets the critera above, the only real way to attack something secured by hashing (besides asking the nice receptionist for the secret info while batting your eyelashes) is to brute force hash a bunch of values until you find something that gives the same hash as what you are looking for.
Unfortunately, there are something called “Rainbow Tables” where people have gone through and created tables of unique hashes and “source data” that results in those hash values for common algorithms. This way, if for instance, an attacker saw that a hashed value was “3”, and he knew you were using the MD5 algorithm, he could look at an MD5 rainbow table to find the value he could put into your system to result in a hash value of “3” and thus he’d gain some ground at attacking your security.
Of course, if you salt your hash, he would have to find out your salt value too and perhaps salting would invalidate the rainbow table entirely (depending on the algorithms used). Also, the more bits your output hash contains, the larger a rainbow table would have to be, so if you really wanna screw with would be attackers, make your output bit count larger – it makes their job exponentially harder! (: