20th Feb 2019 8 minutes read Get to Know the Power of SQL Recursive Queries Michał Kołodziejski SQL recursive queries learn sql Table of Contents WITH Clause – Common Table Expressions to the Rescue! An Easy Example #1 An Easy Example #2 Graph Traversal Oracle (Prior to 11g Release 2) – Hierarchical Queries MySQL – 33 Months of Silence... Most commonly, the SQL queries we run on a database are quite simple. Well, that depends on your role, of course. Analysts in data warehouses retrieve completely different sorts of information using (very often) much more complicated queries than software engineers creating CRUD applications. However, sometimes it's simpler or more elegant to run a query that is a little bit more sophisticated without needing further data processing in the code. One way to accomplish this is with a SQL feature called recursive queries. Let's take a real-life example. Let's assume we've got a database with a list of nodes and a list of links between them (you can think of them as cities and roads). Our task is to find the shortest path from node 1 to node 6. Well, in fact, it's nothing more than graph traversal. The very first idea an average software engineer may have would be to get all rows from both tables and implement a DFS (Depth-First Search) or BFS (Breadth-First Search) algorithm in his/her favorite programming language. It's not a bad idea (if you like coding 😉 ) but you can do it with a single SQL query! WITH Clause – Common Table Expressions to the Rescue! Note: all examples are written for PostgreSQL 9.3; however, it shouldn't be hard to make them usable with a different RDBMS. If your RDBMS is PostgreSQL, IBM DB2, MS SQL Server, Oracle (only from 11g release 2), or MySQL (only from release 8.0.1) you can use WITH queries, known as Common Table Expressions (CTEs). Generally speaking, they allow you to split complicated queries into a set of simpler ones which makes a query easier to read. The structure of a WITH clause is as follows: WITH [cte_name] AS ( [cte_term]) SELECT ... FROM [cte_name]; For example, we might want to get at most 3 nodes, whose total length of outgoing links is at least 100 and at least one single outgoing link has a length bigger than 50. You don't have to fully understand the following example, just look at the query structure. Instead of writing a query like this: SELECT * FROM node WHERE id IN ( SELECT distinct node_from_id FROM link WHERE length > 50 AND node_from_id IN ( SELECT node_from_id FROM link GROUP BY node_from_id HAVING SUM(length) >= 100 ORDER BY SUM(length) DESC LIMIT 3)); we can write it like this: WITH longest_outgoing_links AS ( SELECT node_from_id FROM link GROUP BY node_from_id HAVING SUM(length) >= 100 ORDER BY SUM(length) DESC LIMIT 3), nodes_from_longest_outgoing_links AS ( SELECT distinct node_from_id FROM link WHERE length > 50 AND node_from_id IN (SELECT * FROM longest_outgoing_links) ) SELECT * FROM node WHERE id IN (SELECT * FROM nodes_from_longest_outgoing_links); The queries are defined separately, which makes it a lot easier to understand when implementing much more complicated examples. An important point: CTEs may also have a recursive structure: WITH RECURSIVE [cte_name] (column, ...) AS ( [non-recursive_term] UNION ALL [recursive_term]) SELECT ... FROM [cte_name]; How does it work? It's quite simple. In the first step a non-recursive term is evaluated. Next, for every result row of the previous evaluation, a recursive term is evaluated and its results are appended to the previous ones. The recursive term has access to results of the previously evaluated term. An Easy Example #1 Let's take a look at a simple example – multiplication by 2: WITH RECURSIVE x2 (result) AS ( SELECT 1 UNION ALL SELECT result*2 FROM x2) SELECT * FROM x2 LIMIT 10; result -------- 1 2 4 8 16 32 64 128 256 512 (10 rows) In the first step, the only result row is "1." Then, there is UNION ALL with a recursive term. 1 is multiplied by 2, which results in one result row – "2". For now, there are two result rows: 1, 2. However, the last term evaluation produced only one row – "2" – and it will be passed to the next recursive step. 2 is multiplied by 2, which returns "4," and so on... In this example, recursion would be infinite if we didn't specify the LIMIT clause. An Easy Example #2 Let's do another quick (typically academic) example – the Fibonacci sequence. It's defined as follows: F(n) = 0 , n = 0 1 , n = 1 F(n-1) + F(n-2) , n > 1 F(0) F(1) F(2) F(3) F(4) F(5) F(6) F(7) F(8) ... 0 1 1 2 3 5 8 13 21 ... Such a function can be defined in SQL using the WITH clause: WITH RECURSIVE fib(f1, f2) AS ( SELECT 0, 1 UNION ALL SELECT f2, (f1+f2) FROM fib ) SELECT f1 FROM fib LIMIT 10; f1 ---- 0 1 1 2 3 5 8 13 21 34 (10 rows) I hope the concept is now clear. Graph Traversal Let's go back to our example with a graph traversal. What we want to do is to find the shortest path between two nodes. This is how DB structure looks like: Just to make our SQL more readable, let's define a simple view node_links_view joining node with link and with node again: CREATE VIEW node_links_view AS SELECT n1.id AS node_id, n1.name AS node_name, n2.id AS destination_node_id, n2.name AS destination_node_name, l.length AS link_length FROM node n1 JOIN link l ON (n1.id = l.node_from_id) JOIN node n2 ON (l.node_to_id = n2.id); Now, our model structure looks as follows: What do we need as a result of the query? We want an exact path between the nodes and its entire length. In order to exclude any cycles in the graph, we also need a flag to identify if the last node was already visited. So, the first part of CTE definition will look like this: WITH RECURSIVE search_path (path_ids, length, is_visited) AS In the first step we have to get all links from the beginning node: SELECT ARRAY[node_id, destination_node_id], -- path_ids link_length, -- length node_id = destination_node_id -- is_visited FROM node_links_view; It's our non-recursive term. Now, we'll go recursively starting from the last visited node, which is the last element in an array: SELECT path_ids || d.destination_node_id, -- path_ids f.length + d.link_length, -- length d.destination_node_id = ANY(f.path_ids) -- is_visited FROM node_links_view d, search_path f WHERE f.path_ids[array_length(path_ids, 1)] = d.node_id AND NOT f.is_visited; How does it work? Look at the FROM and WHERE clauses. The query gets the next rows from node_link_view which start at the last node of the previous evaluation that didn't finish with a cycle. It returns an array extended with a destination node of the link, a sum of lengths and a flag determining if this node was previously visited. This recursive part of the query will be executed as long as there are any links to non-visited nodes. So, here is a complete SQL query retrieving all paths from the node with id=1 to the node with id=6: WITH RECURSIVE search_path (path_ids, length, is_visited) AS ( SELECT ARRAY[node_id, destination_node_id], link_length, node_id = destination_node_id FROM node_links_view UNION ALL SELECT path_ids || d.destination_node_id, f.length + d.link_length, d.destination_node_id = ANY(f.path_ids) FROM node_links_view d, search_path f WHERE f.path_ids[array_length(path_ids, 1)] = d.node_id AND NOT f.is_visited ) SELECT * FROM search_path WHERE path_ids[1] = 1 AND path_ids[array_length(path_ids, 1)] = 6 ORDER BY length; As a result we get all paths from node 1 to node 6 ordered by total path length: path_ids | length | is_visited ---------------+--------+------------ {1,3,2,5,6} | 140 | f {1,2,5,6} | 150 | f {1,3,4,5,6} | 150 | f {1,3,4,6} | 190 | f {1,2,3,4,5,6} | 200 | f {1,2,3,4,6} | 240 | f (6 rows) The shortest path is the first one, so we could add a LIMIT clause to get just one result. Remember that we created the external view – node_links_view – to make the SQL easier to read? We may do the same with a CTE: WITH RECURSIVE search_path (path_ids, length, is_visited) AS ( SELECT ARRAY[node_id, destination_node_id], link_length, node_id = destination_node_id FROM node_links_select UNION ALL SELECT path_ids || d.destination_node_id, f.length + d.link_length, d.destination_node_id = ANY(f.path_ids) FROM node_links_select d, search_path f WHERE f.path_ids[array_length(path_ids, 1)] = d.node_id AND NOT f.is_visited ), node_links_select AS ( SELECT n1.id AS node_id, n1.name AS node_name, n2.id AS destination_node_id, n2.name AS destination_node_name, l.length AS link_length FROM node n1 JOIN link l ON (n1.id = l.node_from_id) JOIN node n2 ON (l.node_to_id = n2.id) ) SELECT * FROM search_path WHERE path_ids[1] = 1 AND path_ids[array_length(path_ids, 1)] = 6 ORDER BY length; Note: this example is by no means optimized! Its purpose is just to show you how to use recursive CTEs. Oracle (Prior to 11g Release 2) – Hierarchical Queries Up to Oracle 11g release 2, Oracle databases didn't support recursive WITH queries. In Oracle SQL these kinds of queries are called hierarchical queries and they have completely different syntax, but the idea is quite the same. You can read more about hierarchical queries in the Oracle documentation. MySQL – 33 Months of Silence... Bad news for MySQL users. It doesn't support WITH clause though there were many feature requests asking for it. Probably the first one was this one which had been ignored for 33 months and hasn't been resolved since January 2006... Update: Recursive WITH queries have been available in MySQL since release 8.0.1, published in April 2017. I hope the idea of recursive queries is now clear to you. Enjoy recursively enjoying recursive queries! Tags: SQL recursive queries learn sql