Overview

The hash table is a basic data structure that developers value highly and use constantly in practical development, and it is also one of the most frequently tested structures in interviews and algorithm assessments. It is a classic example of trading space for time.

If an array is akin to a pawn in chess, suitable for basic algorithms and steady progression, but always ready to make way for mid-to-high-level algorithms, then a hash table is like a knight in chess: when used at the right moment, this simple data structure can achieve effects comparable to any advanced algorithm.

The hash table is so highly regarded because it offers O(1) average-time lookups. Hash tables are therefore generally used to quickly check whether an element exists in a set, something that linked lists and arrays cannot easily accomplish.
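
For instance, here is a minimal Python sketch (using the built-in set, which is hash-based) of the membership check described above; the variable names are purely illustrative:

    nums = [3, 1, 4, 1, 5, 9, 2, 6]

    print(7 in nums)        # list: every lookup scans the elements, O(N)

    seen = set(nums)        # a set is backed by a hash table
    print(7 in seen)        # hash lookup, O(1) on average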

Theoretical Basis of Hash Tables

First, the official definition of a hash table is fairly straightforward: "A hash table is a data structure that accesses data directly based on the value of the key."

Based on this definition, we could say that arrays are also a type of hash table. However, to avoid ambiguity, it's necessary to clarify that when I refer to "array algorithms," I mean "arrays (as a data structure) + algorithms based on the structural properties of arrays (e.g., binary search, prefix sum, etc.)."

Of course, the popularity of hash tables is driven by reasons far more complex than just a simplistic O(1) search. While arrays can also achieve O(1) time complexity for searches, this is limited to index-based searches. If the search key cannot serve as an index in practical applications, this approach falls short.

The principle behind hash tables is the result of pioneers in computer science working through this issue, along with several historical challenges, before arriving at the solution we use today.

Imagine a time before hash tables were invented, when all constant-time searches relied on arrays. Naturally, the first historical problem that arose was "how to transform a non-integer key into an integer index that an array can use." Thus, when people attempted to solve the issue of "performing constant-time searches based on a string key," the most straightforward solution was: "simply find a way to convert the string into an integer to use as an index," allowing O(1) searches based on arrays.

From an engineering perspective, this solution is quite practical because it allows us to efficiently use existing resources (arrays) by developing a logic to convert strings to integers. While there are many ways to convert a string to an integer, the simplest method is using the ASCII decimal representation of the string characters. However, this approach is not adopted because the resulting array size would grow without bound as the string length increases, theoretically requiring larger arrays to accommodate the corresponding indices. We don't even need to discuss whether the indices are sparsely distributed; the fact that the upper bound of space complexity depends on the length of the longest string already makes this approach unacceptable for developers.
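
To see why, here is a minimal sketch of that naive conversion (the function name is purely illustrative): the index grows with the length of the string, so the backing array would have to grow without bound as well.

    def naive_index(s: str) -> int:
        # Concatenate the decimal ASCII code of every character into one integer.
        return int("".join(str(ord(ch)) for ch in s))

    print(naive_index("ab"))      # 9798
    print(naive_index("abcd"))    # 979899100 -- the index (and the array) keeps growing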

Therefore, from a performance perspective, an ideal conversion logic must meet the following criteria to make building an ideal hash table possible:

  1. "Strong Determinism": The generated integer must be reproducible. The same string must generate the same integer (otherwise, the search is meaningless).
  2. "Collision Resistance": The generated integer should be unique. Different strings should generate different integers (otherwise, information could be unintentionally overwritten).
  3. "Controlled Range": The generated integer should have a limited range. The range of generated integers should be predictable (otherwise, memory usage could become excessive).

Mathematicians developed such a conversion long ago; it is known as the Hash Function.

Hash Function

A hash function is a mathematical method, used heavily in cryptography, that converts data of other formats into a value (a hash code) through a defined process, allowing non-integer keys to be mapped to array indices.
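
As a rough sketch (illustrative code, not a standard library implementation), such a function can combine a string's characters into an integer and fold it into a fixed range with a modulo, so the result is always a valid index for an array of known size:

    def simple_hash(key: str, capacity: int = 16) -> int:
        # Combine the characters into one integer, then fold it into [0, capacity).
        code = 0
        for ch in key:
            code = code * 31 + ord(ch)   # weighting by position keeps "ab" and "ba" apart
        return code % capacity           # the modulo bounds the range of indices

    print(simple_hash("apple"))          # always the same index for the same key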

Beyond type conversion, the most remarkable feature of hash functions is their "irreversibility." Given A = hash(B), if I know B, I can always find A, but knowing A does not allow me to determine what B is. A simple example is:

    def hash(val):
        return "0" if val % 2 == 0 else "1"

In the above example, I can get the hash value "1" based on the oddness of val = 7, but from the hash value, I can at most infer that val is odd, without knowing the exact odd number.

This is a general characteristic of hash functions, which is why they are widely used in network security and blockchain: they allow users to hash sensitive data, passwords, or keys and share the result without fear of it being decoded (though in practice this is only one piece of a complete cryptographic protocol).

While the above discussion of cryptography and network security is tangential, it reveals a limitation of hash functions: they cannot guarantee uniqueness. There are cases where two different strings A and B produce the same hash result, hash(A) = hash(B). This phenomenon is called a "hash collision." The probability may be small, but because practical hash tables compress the hash space to save memory, it cannot be ignored in applications. (Note: the likelihood of hash collisions depends on the hash space, i.e., the range of integers after mapping; the smaller the hash space, the more likely collisions become.)
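
To make this concrete, here is a small illustrative snippet (not from the original article) showing that once the hash space shrinks to just ten buckets, two different single-character keys already land on the same index:

    def tiny_hash(key: str, capacity: int = 10) -> int:
        code = 0
        for ch in key:
            code = code * 31 + ord(ch)
        return code % capacity

    print(tiny_hash("a"))   # 7  (ord("a") = 97, and 97 % 10 == 7)
    print(tiny_hash("k"))   # 7  (ord("k") = 107, and 107 % 10 == 7) -- a hash collision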

At the other extreme, SHA-256 (a member of the SHA-2 family commonly used in blockchains) has an enormous hash space, making collisions virtually negligible. However, its output covers the entire 256-bit range, so using it directly as an array index would require an array of size 2**256, which is plainly impossible. Thus, a tradeoff is necessary: accept a higher collision probability in exchange for reduced memory usage.

This brings us to the second historical problem: "hash collisions." If hash collisions cannot be addressed, hash functions remain impractical.

Solutions to Hash Collisions

Hash collisions are inherently a probability issue, so preventing collisions entirely is an unsolvable problem. From the perspective of discrete mathematics, increasing the hash space can reduce the likelihood of collisions, but it can't eliminate the probability entirely. Since collisions are inevitable, let's focus on how to handle them when they occur.

The essence of the collision problem is a conflict problem. When a "hash collision" occurs, it means that at least two different keys point to the same memory block, and both stored values are equally meaningful, so neither can be discarded. Overwriting the existing entry with the new one could return unexpected values during searches, potentially leading to severe issues.

Thus, the only reasonable solution is to consider how to "store all collision results and distinguish them during lookup."

Clearly, "storing collision results" is much simpler than "avoiding collisions." The most straightforward approach is to "improve the hash table structure" to allow it to function correctly even with hash collisions.

In practice, there are two main structural improvements for hash tables: "Separate Chaining" and "Open Addressing."

"Separate Chaining"

In a basic hash table, each memory block (also called a "bucket") can store only one key-value pair. Separate chaining converts a single element into a linked list, with key-value pairs as nodes, storing all colliding key-value pairs in the same linked list.

From a learner's perspective, a question might arise (at least it puzzled me for a while when I was learning): since the hashed keys are already the same, how do we find the desired element in the linked list?

For a developer, this is quite simple to address. By retaining the original, pre-hashed key in the data structure, we can locate the specific element in the linked list when a hash collision occurs.

With this, the basic principle of "Separate Chaining" is clear, and we can design the following data structure.
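
Below is a minimal sketch of what such a structure could look like in Python (the names are illustrative, and resizing and load-factor handling are omitted): each bucket stores the head of a linked list, and every node keeps the original key so that colliding entries can still be told apart.

    class Node:
        # One key-value pair in a bucket's chain; the original key is kept for comparison.
        def __init__(self, key, value, nxt=None):
            self.key = key
            self.value = value
            self.next = nxt


    class ChainedHashMap:
        def __init__(self, capacity=16):
            self.capacity = capacity
            self.buckets = [None] * capacity      # each slot is the head of a linked list

        def _index(self, key):
            return hash(key) % self.capacity      # Python's built-in hash as the hash function

        def put(self, key, value):
            i = self._index(key)
            node = self.buckets[i]
            while node:                           # walk the chain: update if the key exists
                if node.key == key:
                    node.value = value
                    return
                node = node.next
            self.buckets[i] = Node(key, value, self.buckets[i])   # otherwise prepend a node

        def get(self, key, default=None):
            node = self.buckets[self._index(key)]
            while node:                           # colliding keys share a bucket,
                if node.key == key:               # so the original keys are compared
                    return node.value
                node = node.next
            return default


    m = ChainedHashMap()
    m.put("apple", 3)
    print(m.get("apple"))                         # 3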

Limitations of Separate Chaining

  1. Increased space usage: Adding a linked list structure to the original array requires node pointers, which consumes more memory.
  2. Reduced search efficiency: With the addition of linked lists, lookups within a bucket become linear. This only matters when hash collisions occur, but in the worst case a lookup degrades to scanning an entire linked list.

"Open Addressing"

While "Separate Chaining" solves the "hash collision" problem, it introduces additional data structures, which may not be optimal for performance. Particularly when developers consider the pessimistic scenario—where all elements collide and cluster in one bucket—the search efficiency would degrade to O(N) time complexity.

At this point, many developers notice that since hash tables don't require contiguous insertions, memory usage is actually sparse, scattered, and inefficient. Given that there's so much unused memory, why add linked lists to handle collisions?

This is where the concept of "Open Addressing" emerges.

Unlike separate chaining, which relies on extra data structures, the main difference with open addressing is that it doesn't "solve the problem in place." Personally, I find that the open addressing approach aligns more with real-world logic: if people find their seat occupied, their first instinct is to sit in the next available spot rather than bring a new chair. (This is actually how I grasped the concept of open addressing 🤦)

However, this raises a question: Separate chaining uses extra data structures to handle collisions "in place" for easy lookup, since the hash value directly points to a single bucket. But with open addressing, placing entries in adjacent buckets—how will we find them during lookups?

We can think a bit further: when a collision occurs, we place the new key-value pair in the next available bucket. But if the "primary owner" of this bucket, the initial hash value intended for that spot, arrives later, how will it know the spot is occupied due to another key's collision?

This requires the algorithm to actively probe for handling hash collisions, introducing another method called "Linear Probing."

The probing logic is straightforward: if the bucket pointed to by the hash value is not empty, a collision has occurred. I then use a fixed step size to probe forward from the hash bucket. When I find a bucket with a matching original key, I lock in that position. If I encounter an empty bucket during probing, it means the key doesn't exist, as by the insertion rule, an empty bucket wouldn't appear before locating the value.
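
In code, this lookup might look like the following sketch (illustrative only; the table is a flat Python list holding either None for an empty bucket or a (key, value) pair, deletions are not handled yet, and the table is assumed never to be completely full):

    def probe_find(table, key):
        capacity = len(table)
        i = hash(key) % capacity
        while table[i] is not None:          # an empty bucket means the key is absent
            stored_key, value = table[i]
            if stored_key == key:            # original keys are compared, not hashes
                return value
            i = (i + 1) % capacity           # fixed step of 1: try the next bucket
        return None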

If you're observant, you might notice an issue with this approach: if an insertion encounters a conflict and is placed in the nearest empty bucket, what happens if an element between the original bucket and the insertion bucket is deleted? According to the current "linear probing" logic, probing stops when it encounters an empty bucket, rendering the inserted value meaningless.

The solution to this problem is simpler and more straightforward than it seems: instead of removing elements from the bucket, we use a specific constant, "TOMBSTONE," to mark the bucket. This way, linear probing won't prematurely terminate due to an empty bucket, and new values can still be inserted.

This approach is called "Lazy Deletion."
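
Putting linear probing and lazy deletion together, a minimal sketch could look like this (names and details are illustrative; resizing and tombstone reuse are omitted, and the table is assumed never to fill up):

    EMPTY = None
    TOMBSTONE = object()    # sentinel: "deleted, but keep probing past me"


    class OpenAddressingMap:
        def __init__(self, capacity=16):
            self.capacity = capacity
            self.slots = [EMPTY] * capacity        # each slot: EMPTY, TOMBSTONE, or (key, value)

        def _index(self, key):
            return hash(key) % self.capacity

        def put(self, key, value):
            # Probe until we find the key (overwrite) or a truly empty bucket (insert).
            # A fuller version would also reuse the first TOMBSTONE it passed.
            i = self._index(key)
            while self.slots[i] is not EMPTY:
                slot = self.slots[i]
                if slot is not TOMBSTONE and slot[0] == key:
                    break
                i = (i + 1) % self.capacity
            self.slots[i] = (key, value)

        def get(self, key):
            i = self._index(key)
            while self.slots[i] is not EMPTY:      # only a truly empty bucket stops the probe
                slot = self.slots[i]
                if slot is not TOMBSTONE and slot[0] == key:
                    return slot[1]                 # TOMBSTONE is skipped, never a stop signal
                i = (i + 1) % self.capacity
            return None

        def delete(self, key):
            i = self._index(key)
            while self.slots[i] is not EMPTY:
                slot = self.slots[i]
                if slot is not TOMBSTONE and slot[0] == key:
                    self.slots[i] = TOMBSTONE      # lazy deletion: mark, do not empty the bucket
                    return True
                i = (i + 1) % self.capacity
            return False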

"Clustering" in Linear Probing

When hash collisions occur frequently, repeatedly spilling values into the nearest available bucket produces increasingly long occupied runs in the underlying array. This in turn raises the likelihood of further collisions, creating a vicious cycle of clustering. In such cases, each lookup incurs considerable overhead.

Besides "linear probing," there are other probing methods like "quadratic probing" and "double hashing," both commonly used in open addressing strategies. These two methods aim to mitigate the clustering problem in linear probing.

Additional Optimization: Red-Black Tree (Detailed in the Tree Chapter)

In principle, hash tables are already robust enough as standalone data structures. However, some cautious (in a non-pejorative sense 🤐) pioneers in computer science have raised new concerns about hash collisions, leading to our final historical issue: "How to handle the query performance problem of long linked lists when numerous hash collisions occur."

This issue is relatively recent and still somewhat debated, and most programming languages have not adjusted their implementations for it. The best-known exception is Java: since JDK 8, HashMap converts a bucket's linked list into a red-black tree once the chain grows past a threshold, improving lookups within a heavily colliding bucket from O(n) to O(log n).

Hash Table Performance Comparison

                         Array    Linked List    Hash Table
Search Element           O(1)     O(N)           O(1)
Add Element              O(N)     O(1)           O(1)
Delete Element           O(N)     O(1)           O(1)
Traverse All Elements    O(N)     O(N)           O(N)

Problem-Solving Experience Summary

Conclusion: When to Use Hash Tables?

When we need to check if an element has appeared or if an element is in a set, hash tables should be the first method to consider.

  1. Two Sum (Easy): The first problem in the LeetCode library is a classic hash table problem: while iterating, we check whether the complement of the current element has already been seen.
  202. Happy Number (Easy): As the problem states, the running sum may repeat during the process, so we can use a hash table to check whether a sum has appeared before.
  242. Valid Anagram (Easy): The brute-force solution needs two nested loops, but we can record character occurrences in s with a hash table in a single loop. (Since the input contains only the 26 lowercase letters, a length-26 array can replace the hash table.)
  349. Intersection of Two Arrays (Easy): Instead of repeatedly scanning array A for intersections, it is more efficient to store the elements of A in a hash table and check in O(1) time whether each element of B exists in A.
  383. Ransom Note (Easy): Since the problem contains only lowercase letters, we can trade space for time with a length-26 array that records the letter occurrences in magazine.
  454. 4Sum II (Medium): Unlike Two Sum, this problem involves more arrays, so a single pass is not enough. By building a hash table whose key is the sum a + b and whose value is its occurrence count, then iterating over pairs from C and D and looking up -(c + d), we reduce the time complexity from O(N^4) to O(N^2).
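
As one concrete illustration of the pattern, a sketch of the hash-based approach to Two Sum (problem 1 above) might look like this:

    def two_sum(nums, target):
        seen = {}                            # value -> index where it was seen
        for i, num in enumerate(nums):
            if target - num in seen:         # O(1) average lookup for the complement
                return [seen[target - num], i]
            seen[num] = i
        return []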
