21st Sep 2017 Updated: 5th Sep 2018 10 minutes read

An Illustrated Guide to the SQL Self Join

JOIN

Table of Contents

When Should We Use Self Join in SQL?
Using SQL Self Join to Find Duplicate Values
Learn More

What is an SQL self join and how does it work? When should it be used? We’ll provide answers to those questions!

In SQL, we can combine data from multiple tables by using a JOIN operator. JOIN has several variants; we’ve already discussed CROSS JOIN, INNER JOIN, and OUTER JOIN. Most of the time, these operators join data from two or more different tables. You can practice all the different types of JOINs in our interactive SQL JOINs course. In this article, however, we will explain how to join records from the same table.

A SQL self join joins data from the same table. In other words, it joins a table with itself. Records taken from the table are matched to other records from the same table. Why would you do this? You may need to compare a value with another value from the same row. You can’t do that unless you join the table to itself and compare the values as if they were in two separate records.

There is no dedicated operator for this type of query. Instead, the SQL self join uses the INNER JOIN operator, one of the OUTER JOIN operators, or a CROSS JOIN. The difference is that a single table is listed as both the left and right table in the join.

Let’s look at a SQL self join example based on the employee table, shown below:

id	first_name	last_name	salary	manager_id
1	John	Watson	7550	NULL
2	Anne	Brown	3500	1
3	James	Black	3000	1
4	Scarlett	Miller	2500	3
5	Ethan	Davis	1200	3
6	Jacob	Smith	2000	3

The employee table stores employees’ ID numbers, first names, last names, salaries, and their manager’s ID number. With the exception of the boss (whose ID = 1), everyone has a supervisor. Only the boss can have a NULL value in the manager_id column.

Let’s try a query that returns the immediate superior of each employee:

SELECT
  e.id,
  e.first_name,
  e.last_name,
  e.salary,
  m.first_name AS fname_boss,
  m.last_name AS lname_boss
FROM employee e
JOIN employee m
  ON e.manager_id = m.id;

Note that we used the employee table as both the left and right table in the join. To join data from the same table, we had to assign two aliases to the table name.

You must use aliases when performing self joins. Because you’re joining columns from the same table, they will have the same names. The alias renames the columns so the database engine can execute your query. In the query above, we used the letter e as the alias for the left table. In this case, we are using the employee table for its employee records. We used the letter m as the alias for the right table, which used the employee table for its manager records. Even though we are using one table, SQL treats it as two different tables.

We selected employee data by selecting the id, first_name and last_name columns in the e “table”. The managers’ data was selected by referring to the m “table”. Notice that the employee records were prefixed with an “e” and the manager records were prefixed with an “m”. We could now join them because we used an appropriate joining condition: the manager_id column from the e table was compared to the id column from the m. Records from both tables were matched using a JOIN operator (or INNER JOIN; remember, the “INNER” is implied).

Did you notice that the boss is not shown in the results? They have no match in the right “manager” table, so no data about the boss’s superior could be found.

The image below shows how records are matched during this query execution.

Let’s analyze the data in the output table:

id	first_name	last_name	salary	fname_boss	lname_boss
2	Anne	Brown	3500	John	Watson
3	James	Black	3000	John	Watson
4	Scarlett	Miller	2500	James	Black
5	Ethan	Davis	1200	James	Black
6	Jacob	Smith	2000	James	Black

In our table, the JOIN did not return a record for John Watson. He does not have an identifier pointing to his superior’s id, so he is the boss. We would have to use a LEFT JOIN instead of an INNER JOIN if we wanted to return data for all employees, including the boss. In that case, records from the left table with no match in the right table would be shown as well:

SELECT
  e.id,
  e.first_name,
  e.last_name,
  e.salary,
  m.first_name AS fname_boss,
  m.last_name  AS lname_boss
FROM employee e
LEFT JOIN employee m
  ON e.manager_id = m.id;

Our interactive SQL JOINs course has the whole section dedicated to practicing self joins.

When Should We Use Self Join in SQL?

Self joins are commonly used in the following areas:

Hierarchical relationships
Sequential relationships
Graph data

We’ll look at each of these individually.

Hierarchical Data

Processing hierarchical data is one of the most frequent applications of the SQL self join. This occurs when there is an additional column pointing to an identifier in the same table, such as in our employee table. In our case, the manager_id column refers to (has the same value as) the id column.

The example given above (the employee-manager relationship) is not the only situation where hierarchical data is used. A bill of materials for a car shares a similar structure. Each car consists of multiple components, such as the engine, the brake system, and the electrical system. Each of these components consists of smaller parts. A car’s electrical system can be broken into its components, such as the battery and the alternator (which can be broken down into even more car parts). This means that car parts constitute a group of hierarchical data. The car table presented below shows some car parts data.

id	name	element_id
1	car	NULL
2	electrical system	1
3	engine	1
4	battery	2
5	alternator	2

Yet another example of hierarchical data is the parent-child relationship. By storing this relationship, we can use a single table to house an entire family tree. We can then use a SQL self join to easily retrieve data about the ancestors of a given person.

The table below can help us quickly identify the eldest ancestor(s) in a family. These people do not have ancestor data in their records, which means they form the root of the family tree.

The person table presented below illustrates this:

id	first_name	last_name	birth	mother_id	father_id
1	John	Watson	1945	NULL	NULL
2	Anne	Brown	1950	NULL	NULL
6	Scarlett	Miller	1985	2	1
7	Jacob	Miller	1982	NULL	NULL
8	David	Miller	2015	6	7

The query below retrieves the first and last name of each person along with the first and last name of their mother and father.

SELECT
  c.first_name,
  c.last_name,
  m.first_name AS fname_mother,
  m.last_name  AS lname_mother,
  f.first_name AS fname_father,
  f.last_name  AS lname_father
FROM person c
LEFT JOIN person m
  ON  c.mather_id = m.id
LEFT JOIN person f
  ON  c.father_id = f.id;

Note that we had to use a self join three times (i.e. on three “tables” or aliases) to get the data about both the father and the mother. John Watson, Anne Brown, and Jacob Miller do not have data pointing to their ancestors.

The examples of hierarchical structures discussed above store data using a superior-inferior record approach. This enables us to present data as a tree structure. You’ll learn more about processing tree structures in SQL as you continue to learn about relational databases.

Sequential Data

Sequential data can also benefit from using a SQL self join. For example, suppose you have records that describe the consecutive steps required to prepare a dish. All of the steps can be placed in a single table. Their order is determined based on the columns that point to the IDs of the previous and next records in the same table.

Let’s illustrate this using the instruction table:

id	content	previous_id	next_id
1	Preheat an oven to 220 degrees C.	NULL	2
2	Peel four potatoes.	1	4
3	Toss sliced potatoes with oil.	4	6
4	Cut potatoes into slices.	2	3
5	Season the hot slices with salt and pepper.	6	NULL
6	Bake in the preheated oven for 20 minutes.	3	5

As you see, in the current order these instructions do not make sense. But when we use a SQL self join to show the sequential relationship for each step in the recipe, we get:

Preheat an oven to 220 degrees C.
Peel four potatoes.
Cut potatoes into slices.
Toss sliced potatoes with oil.
Bake in the preheated oven for 20 minutes.
Season the hot slices with salt and pepper.

We also get some tasty oven fries!

Graphs

Self join SQL can also be used to show the relationships needed for graphs. A graph is a structure consisting of nodes connected to each other with edges (relations). One example of a graph is the road network between multiple cities.

Take a look at the drawing below.

This graph represents five cities that are connected with each other. Each arrow shows a road from one city to another. In this case, the cities are the nodes and the roads between them are the edges. We’re using two tables to store this data. The city table stores the ID number and the name of each city. The route table contains the route ID number, the starting city (the from_city_id column) and the target city (the to_city_id column).

Here is the “city” table:

id	name
1	Laredo
2	San Antonio
3	Austin
4	Waco
5	Houston

And this is the “route” table:

id	from_city_id	to_city_id
1	4	1
2	4	3
3	4	2
4	1	4
5	2	3
6	2	5
7	5	3

We can use a SQL self join on the city table, along with an INNER JOIN of the two tables, to find out what routes exist between cities.

Take a look at the query.

SELECT
  c1.name AS from_city,
  c2.name AS to_city
FROM city c1
JOIN route r
  ON c1.id = r.from_city_id
JOIN city c2
  ON c2.id = r.to_city_id;

The city and route tables were joined using the id column from city and the from_city_id column from route. At this point, we could only retrieve the name of the start city. In order to retrieve the name of the target city, we used a self-join on the city table. This time, we compared the id from the aliased city table with the to_city_id column in the route table.

Here is the result:

from_city	to_city
Waco	Laredo
Waco	Austin
Waco	San Antonio
Laredo	Waco
San Antonio	Austin
San Antonio	Houston
Houston	Austin

You can also use a SQL self join in recursive subqueries that store graphs. More information about this topic can be found in our new course, “Recursive Queries”.

Using SQL Self Join to Find Duplicate Values

Self-joins can also be used to identify duplicate values in a table. Let’s introduce an example table called color:

id	name
1	blue
2	green
3	yellow
4	blue
5	yellow

Each record in the table is different because of the id column, which must always be unique. But this doesn’t prevent two rows from storing the same color name. We want to identify such cases and find the IDs of the duplicate color names. Let’s try this:

SELECT
  c1.id AS id1,
  c1.name AS color1,
  c2.id AS id2,
  c2.name AS color2
FROM color c1
JOIN color c2
  ON c1.name = c2.name AND c1.id < c2.id;

We were able to find duplicate color names because we self-joined records based on the color name. The second condition is used to skip identical records from both tables as well as the same pairs of records in reverse order.

Take a look at the query result:

id1	color1	id2	color2
1	blue	4	blue
3	yellow	5	yellow

It is now easy to see that there are duplicate values for blue and yellow.

Learn More

If you would like to learn more about JOINs than we covered in this illustrated SQL self join guide, check out our SQL JOINs course. Remember, the best way to learn SQL JOINs – or anything else – is to learn by doing!

Tags:

JOIN