03/05/2021

[Java] Tarjan's algorithm for finding strongly connected components in a graph

Last time we saw Kosaraju's algorithm for finding strongly connected components in a graph. This time we look at another approach with Tarjan's algorithm.

Both solutions run in O(V + E) time, however to be more precise, Kosaraju's approach runs in 3(V + E): first DFS, edge reversal, second DFS while Tarjan's approach runs in (V+E) only. Additionally, the required edge reversal step in Kosaraju's approach means we would either modify the graph or add an extra structure to track that information.

The basis of this approach revolves around the following concepts:

  • we keep a global timer which is used to set specific values for each node as the DFS visit happens
  • a forward link is a link between a two nodes that are not reachable with a direct connection from some other node
  • a back link is a link between a node and an already visited node which is part of the same SCC
  • a cross link is a link between a node and an already visited node which is part of a different SCC

For example doing a visit starting on node 1 on this graph:

 
1 <- 5  
|    ^  
v    |  
2 -> 4 -> 7  
|    |  ^  
|    | //  
v    v v  
3 <- 6  

we have the following SCCs:

  • 1, 2, 4, 5
  • 3
  • 6, 7

and we see that:

  • 1 - 2 is a forward link, since no other node reaches 2 without passing through this connection
  • 5 - 1 is a back link, since we are returning to an already visited node which is part of our SCC
  • 6 - 3 is a cross link, since we are returning to an already visited node which part of a different SCC

If we squash the graph saying SCC1 = 1, 2 , 4, 5, SCC2 = 3, SCC3 = 6, 7 we obtain:

 
SCC1 - > SCC3  
 |       /  
 v      v  
 SCC2  

where all remaining links are cross links.

 

We use the following structures to help us with the calculation:

  • a map tracking all visited nodes and their discovery time
  • a map tracking the lowest discovery time for a node, that is, the lowest discovery time among all nodes in the same SCC as this one
  • a stack tracking all nodes in the order we visit them
  • a set tracking all nodes currently in the stack - this is necessary since the stack does not offer a O(1) method to confirm whether an item is present inside it
The global timer is used to determine at which point in our visit did we first encounter a node. We increase this timer every time before moving on to an undiscovered node. This means, if we do a recursive approach, such value must live outside of our function, otherwise we wouldn't be able to correctly increase it with every new node.
 
The idea works as follows:
  • while doing a DFS visit of the graph, we encounter each node at a unique discovery time and track it in the map
  • while doing the DFS visit, we track all nodes in the order we visit them in a stack
  • whenever we have no more links to follow, before backtracking, we need to verify if we are the head of a SCC. 
We can have two possibilities here: we either are the head of a SCC or part of it. 
If we are the head of a SCC, our discovery time will be equal to the lowest discovery time for our SCC. 
Should we be the head of a SCC, in the stack there must be all nodes that are part of it immediately following us. We can therefore keep popping nodes out of the stack until it's either empty or we have found ourselves. All these nodes are one SCC
  • whenever we would follow a link to an already discovered node we can be in one of two cases: following a back link, or following a cross link

To determine whether a node is part of an SCC, the link must be a back link. Looking at the logic described in the previous point, we can see that everytime we discover a full SCC, we remove it from the stack. Therefore a cross link would find no match for the destination node in the stack, while a back link would, since the SCC has not been fully discovered yet.

If we are trying to follow a cross link, we simply skip that edge, otherwise we might need to update the lowest discovery time for the current node with the information of the discovery time of the destination node (NOT lowest discovery time), taking the minimum of the two values. This means that as soon as we discover a SCC, we can start propagating the information back to all the nodes that are part of it.

  • whenever we are backtracking from a DFS on one of our edges, we need to update if necessary our lowest discovery time with the lowest discovery time of the destination we just completed visiting, taking the minimum of the two values. This ensures be propagate back information discovered at the previous step. Note that differently from the previous step we ONLY look at the lowest discovery time data.
 
As a bonus of this algorithm, since NO SCC is identified before subsequent ones, the order in which SCCs are discovered is a reverse topological sort of the graph where all SCCs have been squashed.


You can find my implementation of findStronglyConnectedComponentsTarjan on my Gist along with some tests in TarjanJTests.

You can also check this tutorial video for a step by step explanation.


No comments:

Post a Comment

With great power comes great responsibility