Overview
Union-Find is an advanced algorithm in graph theory, commonly used to solve connectivity problems. However, since its implementation is based on a tree structure, many people categorize it under trees. On LeetCode there are no "easy" problems requiring Union-Find: most problems that use Union-Find alone are of "medium" to "hard" difficulty, and if a problem adds any traps or extra constraints on top of Union-Find, it is usually a typical "hard" problem.
To be frank, many working developers have never systematically studied Union-Find, as it has a relatively steep learning curve. Additionally, graph theory problems do not occupy a major portion of coding tests (their weight in tests correlates with how much a company actually uses graph algorithms), and there are many other types of graph questions worth prioritizing: DFS/BFS and Dijkstra appear far more often than Union-Find, so failing an interview because of Union-Find is relatively rare. Consequently, some believe Union-Find isn't essential to study (especially when preparation time is limited, making prioritization necessary).
Fortunately, one characteristic of advanced algorithms like Union-Find is that they usually solve specific types of problems. Therefore, when facing connectivity problems, it's easy to think of Union-Find. Furthermore, the implementation of Union-Find follows a familiar pattern; in the worst case, using a code template can usually solve "medium" difficulty Union-Find problems.
In summary, there are two key points to focus on with Union-Find:
- What problems Union-Find can solve (when to use Union-Find)
- How to implement Union-Find (why Union-Find can achieve this while other algorithms cannot)
Challenges of Union-Find ⭐️
There are two major misconceptions to avoid when learning Union-Find:
- Focusing too much on concepts: Some algorithm beginners like to study concepts directly without context, which can work for simpler algorithms, but this approach should be avoided for advanced algorithms. Union-Find is relatively complex compared to other advanced algorithms, so diving directly into code templates and underlying principles can be overwhelming and discouraging.
- Over-relying on templates: This point is often emphasized in notes for algorithms with general templates. Templates are designed based on the algorithm, not the problem itself. While using a template can be highly efficient for similar problems, if the problem is slightly modified, rigidly applying a template can lead to difficult and frustrating situations.
Basic Concept of Union-Find
As mentioned briefly, Union-Find is used to solve connectivity problems. Simply put, when we need to determine whether two elements belong to the same set, we should think of using Union-Find.
To determine if two elements are in the same set, the algorithm needs to be able to achieve at least two things:
- Add two elements to the same set.
- Check if two elements are in the same set.
At first glance, these tasks may seem trivial, but achieving them is actually not that easy.
Union-Find Implementation ⭐️
Trick to Code Implementation: "Data Structure for Representing 'Connectivity'"
From a coding perspective, how do we add two elements to the same set? The most straightforward approach might be: create an array, set, or map; placing two elements in the same array/set/map then represents that they are in the same set.
Logically, this approach does make sense. However, in algorithm problems we have to consider that there might be not just one set but hundreds or thousands of them. Should we define that many arrays?
Alternatively, the above problem could be managed with a 2D array to handle a large number of sets. But this is still a temporary solution, as it only solves the first point, "adding two elements to a set." If we use a 2D array to address the second point, "checking if two elements are in the same set," we'll find that there is no shortcut other than traversing the entire array. More importantly, each time we add an element to a set, we would have to traverse the 2D array to determine which set it belongs to.
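To make that cost concrete, here is a minimal sketch of the naive bookkeeping (the helper names `naive_add` and `naive_same` are my own): sets are kept in a plain list, and every operation degrades to a full scan over all of them.

```python
# Naive bookkeeping: a list of Python sets, scanned on every operation.
groups = []  # each entry is one set of connected elements

def naive_add(u, v):
    # Locate the sets containing u and v -- a full scan every time.
    gu = next((g for g in groups if u in g), None)
    gv = next((g for g in groups if v in g), None)
    if gu is None and gv is None:
        groups.append({u, v})   # neither seen before: start a new set
    elif gu is None:
        gv.add(u)
    elif gv is None:
        gu.add(v)
    elif gu is not gv:
        gu |= gv                # merge the two existing sets...
        groups.remove(gv)       # ...and discard the stale one

def naive_same(u, v):
    # Again a full scan over every set.
    return any(u in g and v in g for g in groups)
```

Both helpers are linear in the number of sets per call; Union-Find replaces these scans with near-constant-time root lookups.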
If we limit ourselves to using arrays or hash tables, the code implementation will certainly become complex, as managing sets requires a lot of control logic. Therefore, we clearly need a different approach: an approach based on connectivity.
If I want to represent three elements, A, B, and C, that are connected, a simple hash table or even a 1D array can achieve this, as long as we can establish the relationship father[A] = B, father[B] = C.
Those familiar with graph theory will recognize this as a directed connected graph.
Many algorithm learners may have thought of this representation but perhaps dismissed it. One of the challenging questions to overcome is: If father[A] = B can represent that A is connected to B, how do we represent that B is also connected to A?
This question is understandable, as in development, one might think: "When traversing the input array, if we first encounter B, won't we be unable to determine B's connectivity with A?"
The answer to this question is: we don't need to represent it bidirectionally.
Since our goal is to determine whether three elements are in the same set, establishing one-way connectivity is sufficient. The reasoning is simple: trace the root. Since starting from B doesn't determine connectivity with A, we can simply enforce that all elements move in the same direction to reach a common root.
We can see that with a directed graph defined by father[A] = B and father[B] = C, giving B allows us to see B is connected to C, and giving A shows A is connected to B, which in turn is connected to C. Therefore, we can refer to C as the "root" of both A and B. So, we can confidently build a directed graph. As long as any two elements share the same root, they are in the same set, and if they do not, they are not.
With this understanding, the code implementation begins to take shape. This recursive, nested tracing is well-suited for recursion, and the code template for this part is as follows.
```python
father = dict()
# ... construction of the directed connected graph omitted ...

def find(u: int) -> int:
    if u == father[u]:  # if the root is itself, return directly
        return u
    return find(father[u])  # otherwise, keep tracing up layer by layer
```
The code uses a simple parent mapping (a dict here; a plain 1D array works the same way) to implement the directed connected graph. To give the recursion a more elegant stopping condition, we make the root element point to itself. Therefore, during the actual initialization, don't forget to set `father[i] = i`, meaning each element points to itself by default, to be updated later based on connectivity. The initialization code is as follows.
```python
arr = [...]  # array of connected elements from the input

def init():
    for i in range(max(arr) + 1):  # +1 so the largest element is included
        father[i] = i
```
Of course, if using a hash table, we can determine if an element is a root by checking if father[u] is empty.
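A minimal sketch of that hash-table variant (the lazy-registration trick via `setdefault` is my own choice): elements absent from the table are treated as their own roots.

```python
father = {}  # hash table: element -> parent

def find(u):
    # setdefault registers an unseen element as its own root;
    # an element that points to itself is a root.
    if father.setdefault(u, u) == u:
        return u
    return find(father[u])

# Mirror the earlier A/B/C example: A -> B -> C
father["A"] = "B"
father["B"] = "C"
print(find("A"))  # C: traced via A -> B -> C
print(find("X"))  # X: unseen elements become their own root on first lookup
```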
With this setup, checking if two elements are in the same set becomes straightforward. Here is the code:
```python
def is_same(u: int, v: int) -> bool:
    u = find(u)
    v = find(v)
    return u == v
```
At this point, we have successfully implemented the second point, "checking if two elements are in the same set." As for the first point, "adding two elements to the same set," there is actually a smarter way than simply constructing a directed connected graph.
Optimization: "Path Compression"
The previous approach used recursion to implement the `find` function, repeatedly following `father` pointers until the root was found. However, recursion depth affects performance: if the directed graph is deep, `find` must recurse many times to locate the root, and repeated calls pay that cost again and again, as shown in the diagram.

Our goal is simply to identify that these nodes share the same root, so we can control the depth of this recursive tree by attaching all nodes directly to the root. See the diagram:

This way, root search becomes very quick, requiring only one step. To achieve this effect, we use "path compression," where all non-root nodes point directly to the root.
The implementation isn't complicated: during the recursive return, we propagate the root back up and assign it to `father[u]`. Here's the code:

```python
def find(u: int) -> int:
    if u == father[u]:
        return u
    father[u] = find(father[u])  # this line is the core of path compression
    return father[u]
```
With this, the basic implementation and principle of Union-Find are clear. The complete code template is shown below.
Union-Find Code Template
```python
n = 1005  # set n based on the number of nodes in the problem, typically slightly larger
father = [i for i in range(n)]  # Union-Find initialization

# Root-finding process of the Union-Find
def find(u: int) -> int:
    if u == father[u]:
        return u
    father[u] = find(father[u])  # path compression
    return father[u]

# Check whether u and v share the same root
def is_same(u: int, v: int) -> bool:
    u = find(u)
    v = find(v)
    return u == v

# Add edge v->u to the Union-Find
def join(u: int, v: int):
    u = find(u)  # find root of u
    v = find(v)  # find root of v
    if u == v:   # same root already: they are in the same set, nothing to connect
        return
    father[v] = u
```
This template shows that Union-Find has three main functions:
- Finding the root node: `find(u)`, which determines the ancestor node of a given node
- Connecting two nodes into the same set: `join(u, v)`, which links two nodes under the same root
- Checking if two nodes are in the same set: `is_same(u, v)`, which verifies whether two nodes share the same root
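To see the three functions working together, here is a minimal, self-contained run of the template (the value of `n` and the edges joined are arbitrary choices for illustration):

```python
n = 10
father = list(range(n))

def find(u: int) -> int:
    if u == father[u]:
        return u
    father[u] = find(father[u])  # path compression
    return father[u]

def is_same(u: int, v: int) -> bool:
    return find(u) == find(v)

def join(u: int, v: int) -> None:
    u, v = find(u), find(v)
    if u == v:
        return
    father[v] = u

join(1, 2)
join(2, 3)
print(is_same(1, 3))  # True: connectivity is transitive through the shared root
print(is_same(1, 4))  # False: 4 was never joined to anything
```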
Development Pitfall: "Equivalent Replacement"
There is a tricky pitfall in the implementation of the Union-Find algorithm that can easily lead to errors. Many developers who are not fully familiar with the principles of Union-Find will notice that parts of the `join` and `is_same` functions overlap:

```python
# Original version
def is_same(u, v):
    u = find(u)
    v = find(v)
    return u == v

def join(u, v):
    u = find(u)   # repeated
    v = find(v)   # repeated
    if u == v:    # repeated
        return
    father[v] = u
```
Following the principle of code reuse, many developers might attempt the following optimization:
```python
# "Optimized" version
def join(u, v):
    if is_same(u, v):
        return
    father[v] = u
```
If you think this optimization has no issues, take a moment to consider whether you fully understand the principles of Union-Find discussed above.
The root of the error isn't in the repeated parts but in the line `father[v] = u`. If you trace through an example, you'll see that in the "optimized" version of `join`, when `father[v] = u` executes, `u` and `v` are still the raw arguments rather than roots, so their values may differ from those in the original version.
For example, execute `join(1, 2)`, then `join(3, 2)` in order.

Original version:

```python
join(1, 2)  # father[2] = 1  (u = 1, v = 2)
join(3, 2)  # father[1] = 3  (u = 3, v = find(2) = 1)
```

"Optimized" version:

```python
join(1, 2)  # father[2] = 1  (u = 1, v = 2)
join(3, 2)  # father[2] = 3  (u = 3, v = 2), overwriting the existing link father[2] = 1
```

As we can see, the biggest problem with the "optimized" version is that `father[v] = u` links the raw nodes rather than their roots, so it can overwrite a parent pointer that `v` already has. After running `join(1, 2)` and `join(3, 2)`, calling `is_same(1, 2)` returns `False`, even though 1 and 2 were explicitly joined, which completely contradicts the purpose of connectivity. Therefore, this optimization should not be adopted.
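The failure is easy to reproduce. The sketch below runs the flawed "optimized" `join` (the edge sequence is chosen for illustration) and shows that linking raw nodes instead of roots silently destroys an earlier link:

```python
father = list(range(10))

def find(u: int) -> int:
    if u == father[u]:
        return u
    father[u] = find(father[u])
    return father[u]

def is_same(u: int, v: int) -> bool:
    return find(u) == find(v)

# The flawed "optimized" join: links the raw nodes, not their roots.
def join(u: int, v: int) -> None:
    if is_same(u, v):
        return
    father[v] = u  # if v already had a parent, that link is silently destroyed

join(1, 2)   # father[2] = 1
join(3, 2)   # father[2] = 3 -- the 2 -> 1 link is gone
print(is_same(1, 2))  # False, even though we explicitly joined 1 and 2
```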
Diagram: "Union-Find Workflow"
In the following illustration, the operations are executed in order:
- `join(1, 8)`
- `join(3, 8)`
- `join(1, 7)`
- `join(8, 5)` (path compression occurs here: 3 replaces 1 as the new root of 8)
- `join(2, 9)`
- `join(6, 9)`
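The same sequence can be traced in code using the template's functions (redefined here so the sketch is self-contained); the check on `father[8]` captures the path-compression step in the `join(8, 5)` call:

```python
father = list(range(10))

def find(u: int) -> int:
    if u == father[u]:
        return u
    father[u] = find(father[u])  # path compression
    return father[u]

def is_same(u: int, v: int) -> bool:
    return find(u) == find(v)

def join(u: int, v: int) -> None:
    u, v = find(u), find(v)
    if u != v:
        father[v] = u

for a, b in [(1, 8), (3, 8), (1, 7), (8, 5), (2, 9), (6, 9)]:
    join(a, b)

print(father[8])      # 3: compression re-pointed 8 from 1 directly to the root
print(is_same(1, 5))  # True: both reach root 3
print(is_same(8, 9))  # False: 9 lives in the other component
```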

Optimization: "Union by Rank"
In addition to "Path Compression," another optimization technique for Union-Find is "Union by Rank." When I first learned Union-Find, I went straight to the concepts, and seeing "Union by Rank" initially discouraged me: the term "rank" was unfamiliar, and such terse jargon made the idea sound more difficult than it really is.
But in reality, rank simply refers to the height of the tree, which is the distance from the root to the farthest node in the directed connected graph.
What does Union by Rank mean? Simply put, when connecting two sets (recursive trees) with an edge, there's a performance difference between merging A into B versus B into A.
Suppose the rank of A is 2 and the rank of B is 3. If A merges into B, the final connected graph will have a rank of 3. However, if B merges into A, the final connected graph will have a rank of 4. The merging process is illustrated in the diagram below.

Thus, we can conclude: the tree with the smaller rank should always merge into the tree with the larger rank, reducing the overall rank of the tree to minimize the path length during searches.
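A minimal sketch of `join` with union by rank (the `rank` array and variable names are my own): each root tracks the height of its tree, and the shorter tree is always hung under the taller root.

```python
n = 10
father = list(range(n))
rank = [1] * n  # height of the tree rooted at each element

def find(u: int) -> int:
    # Plain find here: union by rank keeps trees shallow
    # instead of compressing paths.
    if u == father[u]:
        return u
    return find(father[u])

def join(u: int, v: int) -> None:
    u, v = find(u), find(v)
    if u == v:
        return
    if rank[u] < rank[v]:  # make u the root of the taller tree
        u, v = v, u
    father[v] = u          # hang the shorter tree under the taller root
    if rank[u] == rank[v]:
        rank[u] += 1       # only merging equal heights grows the tree by one

join(1, 2)      # two height-1 trees -> tree of height 2 rooted at 1
join(3, 4)      # likewise, height 2 rooted at 3
join(1, 3)      # equal heights 2 and 2 -> height 3, rooted at 1
print(find(4))  # 1
```

Because the height only increases when two equal-height trees merge, the height stays logarithmic in the number of elements.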
However, on a side note, when optimizing Union-Find for query efficiency, "Path Compression" alone is usually sufficient. It's simpler to implement and provides excellent performance.
Union by Rank doesn't flatten the tree structure as much as Path Compression, so it's less efficient in organizing the structure.
Union-Find Basic Problems - Classic Collection
ID | Name | Difficulty |
---|---|---|
684 | Redundant Connection | Medium |
685 | Redundant Connection II | Hard |
128 | Longest Consecutive Sequence | Medium |
200 | Number of Islands | Medium |
305 | Number of Islands II | Hard |
765 | Couples Holding Hands | Hard |
547 | Number of Provinces | Medium |
803 | Bricks Falling When Hit | Hard |