Back to articles list Articles Cookbook
8 minutes read

Get to Know the Power of SQL Recursive Queries

Most commonly, the SQL queries we run on a database are quite simple. Well, that depends on your role, of course. Analysts in data warehouses retrieve completely different sorts of information using (very often) much more complicated queries than software engineers creating CRUD applications.

However, sometimes it's simpler or more elegant to run a query that is a little bit more sophisticated without needing further data processing in the code. One way to accomplish this is with a SQL feature called recursive queries.

Let's take a real-life example. Let's assume we've got a database with a list of nodes and a list of links between them (you can think of them as cities and roads). Our task is to find the shortest path from node 1 to node 6.

graph

Well, in fact, it's nothing more than graph traversal. The very first idea an average software engineer may have would be to get all rows from both tables and implement a DFS (Depth-First Search) or BFS (Breadth-First Search) algorithm in his/her favorite programming language. It's not a bad idea (if you like coding 😉 ) but you can do it with a single SQL query!

WITH Clause – Common Table Expressions to the Rescue!

Note: all examples are written for PostgreSQL 9.3; however, it shouldn't be hard to make them usable with a different RDBMS.

If your RDBMS is PostgreSQL, IBM DB2, MS SQL Server, Oracle (only from 11g release 2), or MySQL (only from release 8.0.1) you can use WITH queries, known as Common Table Expressions (CTEs). Generally speaking, they allow you to split complicated queries into a set of simpler ones which makes a query easier to read. The structure of a WITH clause is as follows:

WITH [cte_name] AS (
	[cte_term])
SELECT ... FROM [cte_name];

For example, we might want to get at most 3 nodes, whose total length of outgoing links is at least 100 and at least one single outgoing link has a length bigger than 50. You don't have to fully understand the following example, just look at the query structure. Instead of writing a query like this:

SELECT * FROM node 
WHERE id IN (
    SELECT distinct node_from_id 
    FROM link
    WHERE length > 50 
      AND node_from_id IN (
        SELECT node_from_id
        FROM link 
        GROUP BY node_from_id 
        HAVING SUM(length) >= 100 
        ORDER BY SUM(length) DESC
        LIMIT 3));

we can write it like this:

WITH
longest_outgoing_links AS (
    SELECT node_from_id
    FROM link 
    GROUP BY node_from_id 
    HAVING SUM(length) >= 100 
    ORDER BY SUM(length) DESC
    LIMIT 3),
nodes_from_longest_outgoing_links AS (
    SELECT distinct node_from_id 
    FROM link
    WHERE length > 50 
      AND node_from_id IN 
         (SELECT * FROM longest_outgoing_links)
)
SELECT * FROM node 
WHERE id IN (SELECT * FROM nodes_from_longest_outgoing_links);

The queries are defined separately, which makes it a lot easier to understand when implementing much more complicated examples.

An important point: CTEs may also have a recursive structure:

WITH RECURSIVE [cte_name] (column, ...) AS (
    [non-recursive_term]
    UNION ALL
    [recursive_term])
SELECT ... FROM [cte_name];

How does it work?

It's quite simple. In the first step a non-recursive term is evaluated. Next, for every result row of the previous evaluation, a recursive term is evaluated and its results are appended to the previous ones. The recursive term has access to results of the previously evaluated term.

An Easy Example #1

Let's take a look at a simple example – multiplication by 2:

WITH RECURSIVE x2 (result) AS ( 
    SELECT 1 
    UNION ALL 
    SELECT result*2 FROM x2) 
SELECT * FROM x2 LIMIT 10; 
 result 
-------- 
      1 
      2 
      4 
      8 
     16 
     32 
     64 
    128 
    256 
    512 
(10 rows)

In the first step, the only result row is "1." Then, there is UNION ALL with a recursive term. 1 is multiplied by 2, which results in one result row – "2". For now, there are two result rows: 1, 2. However, the last term evaluation produced only one row – "2" – and it will be passed to the next recursive step. 2 is multiplied by 2, which returns "4," and so on... In this example, recursion would be infinite if we didn't specify the LIMIT clause.

An Easy Example #2

Let's do another quick (typically academic) example – the Fibonacci sequence. It's defined as follows:

F(n) = 0 , n = 0
1 , n = 1
F(n-1) + F(n-2) , n > 1
F(0) F(1) F(2) F(3) F(4) F(5) F(6) F(7) F(8) ...
0 1 1 2 3 5 8 13 21 ...

Such a function can be defined in SQL using the WITH clause:

WITH RECURSIVE fib(f1, f2) AS ( 
    SELECT 0, 1 
    UNION ALL 
    SELECT f2, (f1+f2) FROM fib ) 
SELECT f1 FROM fib LIMIT 10;
 f1 
---- 
  0 
  1 
  1 
  2 
  3 
  5 
  8 
 13 
 21 
 34 
(10 rows)

I hope the concept is now clear.

Graph Traversal

Let's go back to our example with a graph traversal. What we want to do is to find the shortest path between two nodes. This is how DB structure looks like:

Just to make our SQL more readable, let's define a simple view node_links_view joining node with link and with node again:

CREATE VIEW node_links_view AS 
SELECT 
    n1.id AS node_id, 
    n1.name AS node_name, 
    n2.id AS destination_node_id, 
    n2.name AS destination_node_name, 
    l.length AS link_length 
FROM 
    node n1 
    JOIN link l ON (n1.id = l.node_from_id) 
    JOIN node n2 ON (l.node_to_id = n2.id);

Now, our model structure looks as follows:

What do we need as a result of the query? We want an exact path between the nodes and its entire length. In order to exclude any cycles in the graph, we also need a flag to identify if the last node was already visited. So, the first part of CTE definition will look like this:

WITH RECURSIVE search_path (path_ids, length, is_visited) AS 

In the first step we have to get all links from the beginning node:

   
SELECT 
    ARRAY[node_id, destination_node_id], 	-- path_ids
    link_length,  				-- length
    node_id = destination_node_id  		-- is_visited
FROM 
    node_links_view;

It's our non-recursive term.

Now, we'll go recursively starting from the last visited node, which is the last element in an array:

    SELECT 
        path_ids || d.destination_node_id,  	-- path_ids
        f.length + d.link_length, 		-- length
        d.destination_node_id = ANY(f.path_ids)	-- is_visited
    FROM 
        node_links_view d, 
        search_path f 
    WHERE 
        f.path_ids[array_length(path_ids, 1)] = d.node_id 
        AND NOT f.is_visited;

How does it work? Look at the FROM and WHERE clauses. The query gets the next rows from node_link_view which start at the last node of the previous evaluation that didn't finish with a cycle. It returns an array extended with a destination node of the link, a sum of lengths and a flag determining if this node was previously visited. This recursive part of the query will be executed as long as there are any links to non-visited nodes.

So, here is a complete SQL query retrieving all paths from the node with id=1 to the node with id=6:

WITH RECURSIVE search_path (path_ids, length, is_visited) AS 
( 
    SELECT 
        ARRAY[node_id, destination_node_id], 
        link_length, 
        node_id = destination_node_id 
    FROM 
        node_links_view 

    UNION ALL 
    SELECT 
        path_ids || d.destination_node_id, 
        f.length + d.link_length, 
        d.destination_node_id = ANY(f.path_ids) 
    FROM 
        node_links_view d, 
        search_path f 
    WHERE 
    f.path_ids[array_length(path_ids, 1)] = d.node_id 
    AND NOT f.is_visited 
) 
SELECT * FROM search_path 
WHERE path_ids[1] = 1 
AND path_ids[array_length(path_ids, 1)] = 6 
ORDER BY length;

As a result we get all paths from node 1 to node 6 ordered by total path length:

   path_ids    | length | is_visited 
---------------+--------+------------ 
 {1,3,2,5,6}   |    140 | f 
 {1,2,5,6}     |    150 | f 
 {1,3,4,5,6}   |    150 | f 
 {1,3,4,6}     |    190 | f 
 {1,2,3,4,5,6} |    200 | f 
 {1,2,3,4,6}   |    240 | f 
(6 rows) 

The shortest path is the first one, so we could add a LIMIT clause to get just one result.

Remember that we created the external view – node_links_view – to make the SQL easier to read? We may do the same with a CTE:

WITH RECURSIVE search_path (path_ids, length, is_visited) AS 
( 
    SELECT 
        ARRAY[node_id, destination_node_id], 
        link_length, 
        node_id = destination_node_id 
    FROM node_links_select

    UNION ALL 
    SELECT 
        path_ids || d.destination_node_id, 
        f.length + d.link_length, 
        d.destination_node_id = ANY(f.path_ids) 
    FROM node_links_select d, 
        search_path f 
    WHERE 
        f.path_ids[array_length(path_ids, 1)] = d.node_id 
        AND NOT f.is_visited 
), 

node_links_select AS ( 
    SELECT 
        n1.id AS node_id, 
        n1.name AS node_name, 
        n2.id AS destination_node_id, 
        n2.name AS destination_node_name, 
        l.length AS link_length 
    FROM 
        node n1 
        JOIN link l ON (n1.id = l.node_from_id) 
        JOIN node n2 ON (l.node_to_id = n2.id) 
)

SELECT * FROM search_path 
WHERE path_ids[1] = 1 
AND path_ids[array_length(path_ids, 1)] = 6 
ORDER BY length;

Note: this example is by no means optimized! Its purpose is just to show you how to use recursive CTEs.

Oracle (Prior to 11g Release 2) – Hierarchical Queries

Up to Oracle 11g release 2, Oracle databases didn't support recursive WITH queries. In Oracle SQL these kinds of queries are called hierarchical queries and they have completely different syntax, but the idea is quite the same. You can read more about hierarchical queries in the Oracle documentation.

MySQL – 33 Months of Silence...

Bad news for MySQL users. It doesn't support WITH clause though there were many feature requests asking for it. Probably the first one was this one which had been ignored for 33 months and hasn't been resolved since January 2006...

Update: Recursive WITH queries have been available in MySQL since release 8.0.1, published in April 2017.

I hope the idea of recursive queries is now clear to you. Enjoy recursively enjoying recursive queries!