20th Apr 2021 13 minutes read

Hierarchical Data and How to Query It in SQL

Table of Contents

What Is Hierarchical Data?
Hierarchical Data Examples
Storing Hierarchical Data In a Database
- Querying Hierarchical Data Using a Self-Join
Using Recursive Queries on Deep Hierarchical Data
Querying Hierarchical Data is Easier Than Ever!

How do you recognize hierarchical data, and how do you query it? Two common SQL questions are about to be answered.

Querying hierarchical data is a rather common situation if you’re working with SQL and relational databases. Not that hierarchical data is something mystical and rare; on the contrary, it’s everywhere. So why is hierarchical data so challenging when it comes to relational databases? Mainly, the difficulties occur in translating hierarchical data to relational database principles.

When working with hierarchical data, the first step is to recognize it. I’ll start off by defining what hierarchical data is and giving you several everyday examples of it. Then I’ll move on to explaining how hierarchical data is usually stored in the databases. Finally, you’ll learn how to query such data using SQL.

What you learn in this article can be practiced in our interactive Recursive Queries course. It contains over 100 exercises that teach recursive queries starting with the basics and progressing to advanced topics like processing trees and graphs with SQL queries.

What Is Hierarchical Data?

Hierarchical data is a specific kind of data, characterized by a hierarchical relationship between the data sets. What do you think of when you think about hierarchy? Probably of different levels: something is above, below, or at the same level as something else. In relational databases, a hierarchical relationship is also called a parent-child relationship. This means the child data has only one parent, while the parent data has one or more ‘children’.

It’s common to say that hierarchical data is recognized by its tree-like structure. You’ll see why in a moment as we look at common examples of hierarchical data.

Hierarchical Data Examples

Employee Hierarchies

One of the typical examples used for explaining hierarchical data are employee hierarchies. They’re represented by organizational charts like this one:

As you can see, this structure is narrow at the top and becomes wider further down – much like a pine tree. The president is at the top. His subordinates are two board members. Those two board members also have their subordinates. In the case of Jacqueline Managerovicz, those are Diane Drinkalot, the HR manager, and Rashawn Mangarello, the accounting manager. But they are also the president’s subordinates, even though they are indirect subordinates.

The other board member’s direct subordinates are Tony Workaholio, the sales manager, and Cassandra Ninetofiver, the IT manager. Paul Bossenheim, the company president, is their indirect superior too.

Family Tree

A family tree is another common hierarchical data example. Its structure allows us to find ancestors and their descendants. A family tree could look something like this:

In this example, Mike Strongbow married Victoria Stromboli. They had two children, Florence and Claudio. Florence had one child (Valerie), and Claudio had two children (Art and Michelle). All three are children to their parents, but they are also grandchildren to Mike and Victoria. Mike and Victoria are also great-grandparents; their granddaughter had two children, Judy and James.

If you go to a bar, you’ve likely come across a hierarchical structure. I’m talking about the one you usually look at immediately after you take a seat: the drink menu, or card, or whatever you call it. For example, if you visit the Panthelya Bar, you’ll find it’s a very primitive bar that only offers beers and wines.

The bar offers two types of beers: ale and wheat beer. There are many beers you can order from each category. The wine category is divided into reds and whites, with four wines in each category.

Living Things Taxonomy

Even you, the person reading this article, are a record in hierarchical data. As a human being, you have a certain position in the hierarchy of the taxonomy of living things.

The diagram above shows the hierarchy for humans, or homo sapiens. It would be too complicated to show the whole living things taxonomy. However, the principle is the same as in all the above diagrams. All humans belong to the Sapiens species, which is part of the genus Homo. This genus is part of the Hominidae family, one of the family under the order of Primates. Primates belong to the class of Mammalia, which is subordinate to the phylum, the kingdom, and, finally, the domain.

Folders on Your Computer

If you’re reading this article, there’s a good chance you’re doing it on your computer. Taking this into account, it’s entirely possible that your computer folders look something like this:

All your folders in this example (Learning and Freelancing) are on your D: disc. The Learning folder has two sub-folders: SQL, where you put all the interesting SQL-related articles like this one, and Python. Your freelancing folder contains three sub-folders: Jobs, Invoices, and Other documents.

Now that you know how to recognize hierarchical data, let’s see how it’s stored in the database and how to query it.

Storing Hierarchical Data In a Database

A problem with hierarchical data usually surfaces when you try to save such data in a database. To do that, you need to cram all those multi-level data into a relatively flat format: a table. How do you convert hierarchical data to simple rows of data?

To store the hierarchical data in a database, there’s usually a column that refers to the same table. What does that mean? It’s probably best that I show you an example. The employee hierarchy seems very suitable for that!

Querying Hierarchical Data Using a Self-Join

I’ll show you how to query an employee hierarchy. Suppose we have a table named employee with the following data:

employee_id – The employee’s ID and the table’s primary key (PK).
first_name – The employee’s first name.
last_name – The employee’s last name.
reports_to – The ID of this employee’s immediate supervisor or manager.

The reports_to is nothing but the employee_id column serving as a tool to show which employee reports to which employee. If the employee’s ID appears in the column reports_to, this employee is the boss of (at least some) other employees. Let me show you how it works:

employee_id	first_name	last_name	reports_to
1	Sharon	Simon	6
6	Martina	Novak	NULL

We see that Sharon Simon reports to the employee that has employee_id = 6, Martina Novak. In her case, the reports_to value is NULL. This means Martina Novak does not report to anyone. From that, we can conclude she’s at the top of the employee hierarchy.

This is the part where you’ll be self-joining the table. Don’t know what a self-join is? You can easily learn by reading this article with seven self-join examples. Remember, I’ve already mentioned that hierarchical data in a database usually has a column that refers to the same table. This is such an example. To get the direct subordinates from the table employee, you’ll need to write this query:

SELECT	
sub.employee_id AS subordinate_id,
sub.first_name AS subordinate_first_name,
	sub.last_name AS subordinate_last_name,
	sup.employee_id AS superior_id,
	sup.first_name AS superior_first_name,
	sup.last_name AS superior_last_name
FROM employee sub 
JOIN employee sup 
ON sub.reports_to = sup.employee_id
ORDER BY superior_id;

This query joins the employee table with itself. Let me explain how it works. When joining a table with itself, you must use clear aliases so that SQL knows which data comes from which table – and that you know which data is from which table. In the above query, one table alias is sub. This means it’s the table with the subordinates’ data. The other alias is sup, meaning the table with the superiors’ data. Even though this is the same table, we’re treating it like it’s two different tables.

So the above query first selects the columns employee_id, first_name, and the last_name from the sub table. Then it takes the same data from the table sup. Thus, the table employee is then joined with itself using the two aliases. The self-join is made where the column reports_to from the table sub equals the column employee_id from the table sup. The data is finally ordered by the column superior_id.

Here’s the result:

subordinate_id	subordinate_first_name	subordinate_last_name	superior_id	superior_first_name	superior_last_name
4	Kim	Magnus	6	Martina	Novak
1	Sharon	Simon	6	Martina	Novak
5	Vincent	Trafalgar	6	Martina	Novak
7	Victor	Fonseca	2	Patricia	Rooney
2	Patricia	Rooney	1	Sharon	Simon
3	James	Palin	1	Sharon	Simon

The table shows that Kim Magnus, Sharon Simon, and Vincent Trafalgar are direct subordinates to Martina Novak. Victor Fonseca reports to Patricia Rooney. In turn, Patricia Rooney is a direct subordinate to Sharon Simon, as is James Palin.

Using Recursive Queries on Deep Hierarchical Data

In the above example, I’ve shown you how to find direct superiors/subordinates. This means you’ve learned how to look only one level above or below. While this is very useful, hierarchies can be very deep and have an immense number of levels. Before querying such data, you’ll need to learn your way around recursive queries. Let’s talk about recursive queries first; then I’ll show you how they work with an example or two.

What Are Recursive Queries?

If you want to know the recursive queries, you’ll first have to learn about Common Table Expressions, or CTEs.

A CTE is a temporary data set returned by a query, which is then used by another query. It’s temporary because the result is not stored anywhere; it exists only when the query is run. CTEs can be non-recursive and recursive. I’ve already written about (non-recursive) CTEs and when to use them; feel free to check that out later for more info.
A recursive query is a query that references itself. By doing so, they return the sub-result and repeat the process until they return the final result. Following this logic, a recursive CTE is a CTE that references itself.

The Recursive CTE Syntax

The general recursive CTE syntax looks like this and can be divided into three parts:

WITH RECURSIVE cte_name AS (
   cte_query_definition

   UNION ALL

   cte_query_definition
   )


SELECT *
FROM cte_name;

CTEs are also called “WITH queries”. If you take a look at the above syntax, you’ll see why; the CTE always starts with the WITH clause. If you want your CTE to be recursive, you have to follow it by the word RECURSIVE. After that, you define the CTE name.

Then you have to write the CTE query definition. This part of the query is called the anchor member. It is “connected” to the other CTE using the UNION ALL. This second CTE query definition is called the recursive member, and it references the CTE itself.

At the end comes the SELECT statement, which fetches the data from the CTE. This part of the query is called the invocation.

Syntax is always best learned when you see it in an example. So here’s your first recursive CTE example!

Querying the Employee Hierarchy

I want to build on the example where you learned how to self-join the employee table. Now I’m going to use the same table, but this time we’ll use a recursive query. The task is to find every employee’s direct and indirect boss. This relationship between the employees will be shown as a path that leads from the boss at the top (the owner) to every employee in the table.

WITH RECURSIVE employee_hierarchy AS (
  SELECT	employee_id,
    		first_name,
    		last_name,
    		reports_to,
    		'Owner' AS path
  FROM employee
  WHERE reports_to IS NULL

  UNION ALL 
  
SELECT
    e.employee_id,
    e.first_name,
    e.last_name,
    e.reports_to,
    employee_hierarchy.path || '->' || e.last_name
  FROM employee e, employee_hierarchy
  WHERE e.reports_to = employee_hierarchy.employee_id
)
SELECT *
FROM employee_hierarchy;

As you’ve already learned, you start writing a recursive CTE using WITH RECURSIVE. Then you name the CTE. It’s employee_hierarchy, in this case.

The anchor member of the CTE is the first SELECT statement. By doing this, you select the root of the hierarchy; it’s the basis on which the recursive query will work its magic and find all other levels of the hierarchy. This statement selects all the columns from the table employee. It also adds the new column path, with the value in it being 'Owner'. The WHERE clause means this will be done only for the rows where the value in the column reports_to is NULL. Why is that? If there is a NULL value in the reports_to column, the employee does not report to anyone. It means this is the owner of the company.

The next step is to “connect” the anchor member with the recursive member of the CTE with UNION ALL. The important thing about using UNION ALL is that the SELECT statements you’re “connecting” have to have the same number of columns. Otherwise, UNION ALL won’t work.

The recursive member is the second SELECT statement. This statement again selects all the columns from the table employee. It also takes the value (which is 'Owner') from the column path from the employee_hierarchy CTE. It adds '->' to it, followed by the value from the column last_name from the table employee. (The || is a concatenation operator; it combines two or more values into one value.) This will be a path leading from the owner to every employee.

The table employee and the CTE employee_hierarchy are joined like any other two tables. This is done where the column reports_to equals the column employee_id. The recursive member serves as an extension of the anchor member. This means it extends a result that’s already been found (by the anchor member) with new results. Thus, the recursive member will perform everything described until it reaches the last employee.

Finally, the simple invocation part selects all the data from the CTE employee_hierarchy. And, voila! The result is:

employee_id	first_name	last_name	reports_to	path
6	Martina	Novak	NULL	Owner
1	Sharon	Simon	6	Owner->Simon
4	Kim	Magnus	6	Owner->Magnus
5	Vincent	Trafalgar	6	Owner->Trafalgar
2	Patricia	Rooney	1	Owner->Simon->Rooney
3	James	Palin	1	Owner->Simon->Palin
7	Victor	Fonseca	2	Owner->Simon->Rooney->Fonseca

If you look at, say, Victor Fonseca, you can see that the path from the owner to him leads through Sharon Simon and Patricia Rooney.

Let’s practice recursive queries on another example!

Querying the Folder Hierarchy

Companies usually have network drives where the employees save all their work. This usually leads to a very branched tree structure of folders. The data about the folders is stored in the table folder. Its columns are:

id – The folder’s ID and the table’s primary key (PK).
name – The name of the folder.
subfolder_of – The name of the folder one level up.

To find the path for all the folders, you’ll need the following query:

WITH RECURSIVE folder_hierarchy AS (
  SELECT	id,
   	 	name,
    		subfolder_of,
    		CAST (name AS text) AS path
  FROM folder
  WHERE subfolder_of IS NULL
	
  UNION ALL 
	
  SELECT	folder.id,
    		folder.name,
    		folder.subfolder_of,
    		folder_hierarchy.path || '\' || folder.name
  FROM folder, folder_hierarchy
  WHERE folder.subfolder_of = folder_hierarchy.id
)
SELECT *
FROM folder_hierarchy;

The principle is the same as in the previous example. Again, you start with WITH RECURSIVE and the name: folder_hierarchy. The first SELECT statement selects all three columns from the table folder. The fourth column is path, which contains data from the column name cast as text values. The data is cast to match the data type from the recursive member of the CTE. Finally, the WHERE clause limits the data only to those with the NULL values in the column subfolder_of. Where there’s NULL, there’s the root folder (i.e. the one having no folders above it).

The UNION ALL is again used to “connect” the anchor and recursive members of the CTE. One additional thing to remember: the data types in both SELECT statements have to be the same for UNION ALL to work. Otherwise, the query will return an error.

The second SELECT statement again selects all the columns from the table folder. The value from the columns path and name are put together, with '\' separating the data.

Finally, all the data from the CTE is selected, which returns a nice table:

id	name	subfolder_of	path
1	F:	NULL	F:
2	Reporting	1	F:\Reporting
3	Administration	1	F:\Administration
4	Budget	2	F:\Reporting\Budget
5	KPI	2	F:\Reporting\KPI
6	Financial Reports	2	F:\Reporting\Financial Reports
7	Working Hours	3	F:\Administration\Working Hours
8	Holidays	3	F:\Administration\Holidays
9	Company Car Reservation	3	F:\Administration\Company Car Reservation
10	Tasks	3	F:\Administration\Tasks

Now every folder has its path and can be easily accessed. No more browsing through multiple levels of folders to get what you need!

I believe these are convincing examples of the power of recursive queries. If not, there’s always a chance you need one more article to see recursive queries’ true potential. And if you’re an Oracle user, here’s how you can use hierarchical queries in that particular DBMS.

Querying Hierarchical Data is Easier Than Ever!

Now that you know what hierarchical data is, you’ll start recognizing it all over the place. I’ve given you several everyday examples, but I’m sure you’ll find even more. Try to implement the self-joins, recursive CTEs, and everything else you’ve learned here. That’s the only way to keep your knowledge fresh!

If you lack practice exercises, you can find plenty of them in our Recursive Queries Course. What can you learn in this course? How can recursive queries help you? No need to wonder; our Chief Content Officer explains everything in detail in this article.

Tags: