Cryptography 101: What Is Hashing?
Hashing plays a crucial role in managing and protecting digital information. But what is it exactly, and how does it work?
In this blog post, we’ll explore the principles and practical applications of hashing.
What is hashing?
Hashing is the process of converting data—text, numbers, files, and practically anything else—into a fixed number of bytes (represented by a string of seemingly random letters and numbers). This is done through a specialized mathematical algorithm called a hash function.
The key feature of a hash function is its one-way nature. Once the data is hashed, it becomes computationally infeasible* to reverse the process and recover the original input. Because any change to the input produces a different hash, hashes are well suited to verifying data integrity.
Hashing also limits what an attacker can do with sensitive values such as passwords. Even if a malicious party gains access to the stored hashes, the original values are not directly readable.
* While it may be computationally infeasible to reverse the hashing operation, attackers can leverage precomputed tables to guess the original input. This vulnerability is commonly exploited in cracking password hashes. Therefore, it is imperative to employ hash functions that are considered secure and up-to-date (more on this below).
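To make the precomputed-table idea concrete, here is a minimal sketch in Python using the standard hashlib module. The word list and the `sha256_hex` and `crack` helpers are hypothetical names chosen for illustration; real attacks rely on far larger tables.

```python
import hashlib

# Hypothetical list of common passwords (an assumption for illustration).
COMMON_PASSWORDS = ["123456", "password", "qwerty", "letmein"]

def sha256_hex(text: str) -> str:
    """Return the SHA-256 digest of a UTF-8 string as a hex string."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

# Precompute digest -> password once; every later lookup is a dictionary hit.
LOOKUP_TABLE = {sha256_hex(p): p for p in COMMON_PASSWORDS}

def crack(digest_hex: str):
    """Return the matching password if its digest is in the table, else None."""
    return LOOKUP_TABLE.get(digest_hex)

leaked = sha256_hex("letmein")            # an unsalted hash, as an attacker might find it
print(crack(leaked))                      # found by lookup, not by reversing the hash
print(crack(sha256_hex("x9$Tq!v2#Lm")))   # not in the table, so nothing is found
```

The hash function is never inverted here. The attacker simply hashes likely inputs ahead of time and compares the results.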
A brief history of hashing: an algorithm and a data structure
We can trace the idea of hashing back to the 1950s and the work of IBM researcher Hans Peter Luhn, who also introduced the Key Word in Context (KWIC) indexing algorithm in 1958.
The impact of KWIC was nothing short of revolutionary: It transformed the indexing of textual information, making it possible to automatically create indexes from extensive text sources. In essence, KWIC stood as the era's equivalent of a search engine, helping users quickly access the information they needed.
Over time, hashing evolved and found many other applications.
While hashing remains relevant for efficient data indexing (see hash table data structure), today it’s primarily used to enhance security.
The principle is always the same: transforming data into a fixed-sized string of characters. However, the implementations are different.
Our focus in this article is mainly on “cryptographic hashing,” which refers to hashing functions designed to strengthen security measures.
Cryptographic hash function example
To get a better grasp of how hashing works, let's consider a scenario where we need to hash two different text strings:
Text string 1: "The quick brown fox jumps over the lazy dog."
MD5: e4d909c290d0fb1ca068ffaddf22cbd0
SHA-256: ef537f25c895bfa782526529a9b63d97aa631564d5d789c2b765448c8635fb6c
SHA3-256: a80f839cd4f83f6c3dafc87feae470045e4eb0d366397d5c6ce34ba1739f734d
Text string 2: "The lazy dog is jumped over by a quick brown fox."
MD5: 3278a6f1b9bdc8a0ff58f8bfc1158fb1
SHA-256: 5da0032e38cb7b00c9ff1c1b82b5167aee0cf3e031c16f1b65d61b189b1d4cb7
SHA3-256: 46fe653b6903bfa3397c4f095c034738673c45acfad1c782fc4b72e32da13304
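Digests like these can be computed with a few lines of Python. This is a minimal sketch using the standard hashlib module; the `digests` helper is a name introduced here for illustration. Note that the result depends on the exact bytes hashed, so punctuation, whitespace, and text encoding all matter.

```python
import hashlib

def digests(text: str) -> dict:
    """Hash a UTF-8 string with three algorithms and return hex digests."""
    data = text.encode("utf-8")
    return {
        "MD5": hashlib.md5(data).hexdigest(),
        "SHA-256": hashlib.sha256(data).hexdigest(),
        "SHA3-256": hashlib.sha3_256(data).hexdigest(),
    }

for sentence in (
    "The quick brown fox jumps over the lazy dog.",
    "The lazy dog is jumped over by a quick brown fox.",
):
    print(sentence)
    for name, value in digests(sentence).items():
        print(f"  {name}: {value} ({len(value)} hex characters)")
```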
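Notice that the resulting hashes look random and reveal nothing obvious about the original messages. It is the one-way design of the hash function, not the appearance of the output, that makes reverse-engineering the original message computationally infeasible.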
Also, the functions consistently generate fixed-length hashes, even though the original sentences have different lengths. Even if we hashed the content of an entire book, the length of the hash would stay the same.
How is this possible? The secret is dividing the data into equal-sized blocks.
Let’s see exactly how the hashing function works.
How does a hashing algorithm work?
At a high level, hashing algorithms follow these steps:
- Message input: The user selects data to be hashed.
- Algorithm selection: The hashing algorithm is chosen based on the specific use case and security requirements. Common choices include SHA-256 and SHA3-256.
- Hash function application: The data is processed through the chosen hash function, which takes the input data and transforms it into a fixed-size hash value.
For instance, SHA-256 processes data in 512-bit blocks, which is 64 bytes (64 ASCII characters). Every message is padded so that its total length is a multiple of the block size, and the padding includes the length of the original message. A short message therefore fits into a single block, while longer inputs are split into 512-bit blocks that are processed sequentially. Each block updates the algorithm's internal state, building on the result of the previous block, and the state after the last block becomes the final hash value.
- Storage or sharing: The resulting hash value, often called a "message digest," is sent to the recipient or stored to verify data integrity.
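The steps above can be sketched in Python with the standard hashlib module. The helper names and the chunk size are assumptions made for illustration. The first function feeds data to SHA-256 piece by piece, which is how large files are usually hashed. The second computes how many 512-bit blocks SHA-256 processes for a message of a given length, based on its padding rule (one 0x80 byte, zero bytes, and an 8-byte length field).

```python
import hashlib
import io

CHUNK_SIZE = 64 * 1024  # read size chosen for illustration; any size gives the same digest

def sha256_of_stream(stream) -> str:
    """Hash a binary stream incrementally, one chunk at a time."""
    hasher = hashlib.sha256()
    while True:
        chunk = stream.read(CHUNK_SIZE)
        if not chunk:
            break
        hasher.update(chunk)
    return hasher.hexdigest()

def sha256_block_count(message_length_bytes: int) -> int:
    """Number of 64-byte blocks SHA-256 processes, padding included.

    At least 9 bytes are appended (0x80 plus an 8-byte length field),
    then the total is rounded up to a multiple of 64 bytes.
    """
    return (message_length_bytes + 9 + 63) // 64

data = b"a" * 200_000
# Incremental hashing gives the same digest as hashing everything at once.
print(sha256_of_stream(io.BytesIO(data)) == hashlib.sha256(data).hexdigest())  # True
print(sha256_block_count(0), sha256_block_count(55), sha256_block_count(56))   # 1 1 2
```

To verify a downloaded file, the same function can be applied to `open(path, "rb")` and the result compared with the published checksum.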
It’s a very quick process: For short inputs, the hash is typically computed in microseconds. For larger inputs, throughput depends on the algorithm, the implementation, and the hardware it runs on.
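Speed is easy to check on your own machine. The following is a rough sketch rather than a rigorous benchmark: the `throughput_mb_per_s` helper and the 16 MB test size are assumptions, and the numbers will vary with hardware and the Python build.

```python
import hashlib
import time

def throughput_mb_per_s(algorithm: str, size_bytes: int = 16 * 1024 * 1024) -> float:
    """Hash size_bytes of zero bytes once and return megabytes per second."""
    data = bytes(size_bytes)
    start = time.perf_counter()
    hashlib.new(algorithm, data).digest()
    elapsed = time.perf_counter() - start
    return (size_bytes / 1_000_000) / elapsed

for name in ("sha256", "sha3_256"):
    print(f"{name}: {throughput_mb_per_s(name):.0f} MB/s")
```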
You can find many online tools designed for hashing data.
If you're a developer, you’re likely already incorporating cryptographic hashing into your applications. This may involve cryptographic libraries like hashlib or bcrypt, the native Node.js crypto module for hash generation, or JWT libraries such as jsonwebtoken that use hashing to sign and verify JSON Web Tokens.
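As an illustration of hash-based signing, here is a minimal sketch built on Python's standard hmac and hashlib modules. It is not a JWT implementation: a real token also has an encoded header and payload, which libraries such as jsonwebtoken handle for you. The key, the payload, and the `sign` and `verify` helpers are hypothetical.

```python
import hashlib
import hmac

SECRET_KEY = b"demo-secret-do-not-reuse"  # hypothetical key for illustration only

def sign(message: bytes, key: bytes = SECRET_KEY) -> str:
    """Return an HMAC-SHA256 tag for the message as a hex string."""
    return hmac.new(key, message, hashlib.sha256).hexdigest()

def verify(message: bytes, signature: str, key: bytes = SECRET_KEY) -> bool:
    """Recompute the tag and compare it in constant time."""
    return hmac.compare_digest(sign(message, key), signature)

payload = b'{"sub": "user-123", "admin": false}'
tag = sign(payload)
print(verify(payload, tag))                                # True
print(verify(b'{"sub": "user-123", "admin": true}', tag))  # False: payload was altered
```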
Properties of a cryptographic hash function
Cryptographic hash functions have several key properties that make them suitable for security-related applications. A robust cryptographic hash function is:
- Deterministic: A cryptographic hash function is deterministic, meaning it consistently produces the same output for a given input. The slightest alteration in the input will result in a completely different hash, as we’ve illustrated above.
- Irreversible (preimage resistance): It is computationally infeasible to reverse a hash and find its original input. This property, known as preimage resistance, safeguards hashed data.
- Collision-resistant: Cryptographic hash functions are designed to resist collision attacks. A collision happens when two different inputs produce the same hash value. Collisions are problematic as they can be exploited by attackers. A good cryptographic hash function makes finding collisions exceedingly difficult.
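The first property is easy to observe directly. The sketch below, which uses Python's hashlib and helper names introduced only for illustration, hashes the same input twice and then hashes an input that differs by one character, counting how many output bits change.

```python
import hashlib

def sha256_bytes(text: str) -> bytes:
    """Return the raw 32-byte SHA-256 digest of a UTF-8 string."""
    return hashlib.sha256(text.encode("utf-8")).digest()

def differing_bits(a: bytes, b: bytes) -> int:
    """Count the bit positions in which two equal-length digests differ."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))

first = sha256_bytes("hello world")
again = sha256_bytes("hello world")
changed = sha256_bytes("hello worle")  # one character changed

print(first == again)  # True: same input, same digest
print(differing_bits(first, changed), "of", len(first) * 8, "bits differ")
```

For a well-designed hash function, roughly half of the output bits are expected to change, even for a one-character edit.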
Collectively, the above properties make cryptographic hash functions especially well-suited for security applications, where data integrity and authenticity are essential.
In contrast, non-cryptographic hash functions have different requirements. They tend to prioritize efficiency and speed, with less emphasis on collision resistance. These functions do not need the complex security features of cryptographic hash functions.
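As a small illustration of the non-cryptographic case, the sketch below uses CRC-32 from Python's standard zlib module to assign keys to buckets, much like a hash table does. The bucket count, the sample keys, and the `bucket_for` helper are assumptions. CRC-32 is fast, but it was never designed to resist deliberate collisions.

```python
import zlib

NUM_BUCKETS = 8  # bucket count chosen for illustration

def bucket_for(key: str) -> int:
    """Map a key to a bucket index using a fast, non-cryptographic checksum."""
    return zlib.crc32(key.encode("utf-8")) % NUM_BUCKETS

buckets = {i: [] for i in range(NUM_BUCKETS)}
for key in ["alice", "bob", "carol", "dave", "erin"]:
    buckets[bucket_for(key)].append(key)

for index, keys in buckets.items():
    print(index, keys)
```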
What hashing algorithms are in use?
Message Digest 5 (MD5)
MD5 was one of the initial standards in hashing algorithms. It was widely used for file integrity verification (checksums) and storing hashed passwords in databases.
It’s a straightforward algorithm that outputs a fixed, 128-bit value for every input and computes it by applying a one-way operation across multiple rounds. However, its design weaknesses and short output length make MD5 vulnerable, and practical collision attacks have been demonstrated. Nowadays, MD5 is considered insecure and should no longer be used for security purposes.
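To see why output length matters, the sketch below searches for collisions in a deliberately shortened hash. It truncates SHA-256 to a few bits purely for illustration. This is not how MD5 was broken in practice, since those attacks exploit weaknesses in the algorithm itself. The helper names are assumptions.

```python
import hashlib
from itertools import count

def truncated_hash(data: bytes, bits: int) -> int:
    """Keep only the first `bits` bits of SHA-256 to simulate a short digest."""
    digest = hashlib.sha256(data).digest()
    return int.from_bytes(digest, "big") >> (256 - bits)

def find_collision(bits: int):
    """Return two different inputs with equal truncated hashes, plus the try count."""
    seen = {}
    for i in count():
        message = str(i).encode("ascii")
        value = truncated_hash(message, bits)
        if value in seen:
            return seen[value], message, i + 1
        seen[value] = message

for bits in (16, 24, 32):
    a, b, tries = find_collision(bits)
    print(f"{bits}-bit digest: {a!r} and {b!r} collide after {tries} inputs")
```

The number of attempts grows roughly with the square root of the number of possible outputs. A 32-bit digest falls in well under a second, while the same brute-force search against a full 256-bit digest is out of reach.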
Secure Hash Algorithm (SHA)
SHA is a family of hashing algorithms.
SHA1 was developed by the US National Security Agency (NSA) and is similar to MD5. It generates 160-bit hash values, represented by 40-digit long hexadecimal strings. SHA1 is also considered outdated and unreliable for security purposes. Instead, it is recommended to use SHA2 or SHA3.
The SHA2 family, also developed by the NSA, consists of six distinct hash functions producing hash values of varying lengths: 224, 256, 384, or 512 bits. SHA2 is the current secure standard for hashing sensitive data.
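SHA3 is based on the KECCAK family of algorithms (pronounced “ketch-ak”), which won a public competition run by the US National Institute of Standards and Technology (NIST) in 2012 and was standardized as SHA3 in 2015. It provides an alternative to SHA2 and can be used for secure hashing.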
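The output lengths mentioned above can be checked with Python's hashlib. This sketch lists only the SHA variants that hashlib always provides. The two truncated SHA-512 variants of the SHA2 family are left out because their availability depends on how Python was built. The `digest_bits` helper is a name introduced for illustration.

```python
import hashlib

ALGORITHMS = [
    "sha1",
    "sha224", "sha256", "sha384", "sha512",          # SHA2 family
    "sha3_224", "sha3_256", "sha3_384", "sha3_512",  # SHA3 family
]

def digest_bits(name: str) -> int:
    """Return the output length of a hashlib algorithm in bits."""
    return hashlib.new(name).digest_size * 8

for name in ALGORITHMS:
    bits = digest_bits(name)
    print(f"{name:9s} {bits:4d} bits  {bits // 4:3d} hex characters")
```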
Other hashing algorithms
There are many other hashing algorithms out there, including BLAKE (which is employed in Ethereum), as well as bcrypt and Argon2, which are deliberately slow functions designed specifically for password hashing.
Here’s a comprehensive comparison of hash functions.
Over time, hashing algorithms have become more advanced and secure, making it increasingly difficult for malicious actors to reverse-engineer hashed values. While hashes can still be broken, the complex mathematical operations behind them make doing so a daunting task without substantial computational power.
What is hashing used for?
- SHA-256 checksums: Used to verify the integrity of files and downloads to ensure they haven't been tampered with.
- Password hashing: Used for secure password storage and verification. Hashing is frequently paired with salting, which adds a unique random value to each password before hashing, without placing additional requirements on the users. In simple terms: If different random salts are added to two identical passwords, the resulting hashes will be different. This protects against rainbow table attacks and is strongly recommended because humans are exceptionally bad at coming up with secure passwords by themselves.
- SSL/TLS certificates: Used to authenticate a website's identity, establishing trust and secure connections for online transactions.
- Digital signatures: Used to validate the authenticity of digital messages or documents.
- Trusted timestamping: Used to establish the time when digital data was created or modified via a trusted timestamp, essential for legal and regulatory purposes.
- Signed JSON Web Tokens (JWT): Used for identity management and to safely transmit information between parties.
Each of these use cases relies on the core function of cryptographic hashing: to guarantee data integrity and prevent interference or tampering with information.
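Salted password hashing, for example, can be sketched with Python's standard library alone. This example swaps in PBKDF2 (available through hashlib) instead of bcrypt or Argon2, which require third-party packages. The iteration count, salt length, and helper names are assumptions made for illustration, not recommendations.

```python
import hashlib
import hmac
import os

ITERATIONS = 100_000  # work factor assumed for illustration; tune for your hardware

def hash_password(password: str, salt=None):
    """Return (salt, digest) using PBKDF2-HMAC-SHA256 with a random 16-byte salt."""
    if salt is None:
        salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return salt, digest

def verify_password(password: str, salt: bytes, expected: bytes) -> bool:
    """Re-hash the candidate with the stored salt and compare in constant time."""
    _, digest = hash_password(password, salt)
    return hmac.compare_digest(digest, expected)

salt_a, hash_a = hash_password("hunter2")
salt_b, hash_b = hash_password("hunter2")
print(hash_a == hash_b)                           # False: same password, different salts
print(verify_password("hunter2", salt_a, hash_a))  # True
print(verify_password("hunter3", salt_a, hash_a))  # False
```

The salt is stored alongside the hash. It does not need to be secret, because its job is to make every stored hash unique so that one precomputed table cannot be reused across accounts.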
Hashing vs. encryption
While the terms "hashing" and "encryption" are sometimes used interchangeably, they serve different purposes.
Hashing is inherently one-way: A hash is not meant to be decoded back into the original data. In contrast, encryption is designed to be reversible: Anyone holding the correct decryption key can recover the original data.
As a result, hashing is intended for integrity validation, while encryption ensures data confidentiality.
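The difference can be shown in a few lines of Python. As a stand-in for encryption, this sketch uses a toy XOR cipher with a random single-use key of the same length as the message. It is for illustration only, and real applications should use a vetted cipher from a maintained cryptographic library. The message and the `xor_bytes` helper are hypothetical.

```python
import hashlib
import secrets

def xor_bytes(data: bytes, key: bytes) -> bytes:
    """XOR each data byte with the matching key byte (toy cipher, illustration only)."""
    return bytes(d ^ k for d, k in zip(data, key))

message = b"transfer 100 to account 42"
key = secrets.token_bytes(len(message))  # random key, as long as the message

ciphertext = xor_bytes(message, key)     # encryption
recovered = xor_bytes(ciphertext, key)   # decryption with the same key
digest = hashlib.sha256(message).hexdigest()

print(recovered == message)  # True: encryption is reversible with the key
print(digest)                # fixed-length digest; no key turns it back into the message
```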