Back to articles list Articles Cookbook
5 minutes read

Converting Subqueries to Joins

Not all queries are alike, especially in terms of performance. In this article, we'll look at how you can convert SQL subqueries to joins for improved efficiency.

When should I use SQL subqueries?

Great question! Unfortunately, there's no concrete answer. SQL beginners tend to overuse subqueries. Typically, once they find that SQL construction works in one situation, they try to apply that same approach to other situations. It's only natural.

However, in some cases a subquery can be replaced with a more efficient JOIN. If you can avoid a subquery and replace it with a JOIN clause, you should do so without hesitation. But of course, in some cases, using a subquery is the only way to solve a data question. In this article, I'll show examples of both cases: when a subquery is a must and when a subquery should be avoided and replaced by a JOIN.

The data

Before getting to the examples, let’s briefly look at the sample database we will use. The database has two tables representing the production statistics of a fictional apple farm company named EverRed. The company has three farms.

The first table is current_year_production, which contains information about the number of apples produced in the current year by each farm, as well as the area and number of trees on each farm. The second table, production_history, stores past production information for each farm. Below are some sample data from these two tables.

current_year_production
farm_idarea_m2farm_namenumber_of_treesproduction_in_kg
10010000The Paradise2404400
10115000Evergreen3006200
10220000Red Delicious5809300
production_history
farm_idyearproduction_in_kgprice_ton
100201741001200
101201758001200
102201794001200
100201639001300
101201664001300
102201691001300

Example: replacing a subquery with a JOIN

Suppose you're an SQL data analyst working at EverRed. The owner of the company wants you to obtain the names of the farms where the company is producing more apples in the current year than in the previous year (2017).

Solution using a subquery:

SELECT farm_name, 
FROM current_year_production CYP
WHERE production_in_kg > (
  SELECT production_in_kg 
  FROM production_history PH  
  WHERE PH.farm_id = CYP.farm_id
  AND year = 2017
) 

Solution using a JOIN clause:

SELECT farm_name, 
FROM current_year_production CYP 
JOIN production_history PH
  ON PH.farm_id = CYP.farm_id 
WHERE PH.year = 2017
  AND CYP.production_in_kg > PH.production_in_kg 

The difference between these two approaches is in performance. While the JOIN clause in the second example needs to be executed only once, the subquery in the first example will be executed once per farm. In this case, we only have three farms, so the difference is negligible. But what if you worked for a larger company that has 10,000 global farms? The subquery would need to be executed 10,000 times. Clearly, a subquery is inefficient for our purposes here.

Moreover, in a test database with only a few farms, both queries execute with an acceptable response time; however, when we move to a productive database, (where the data volume is usually much higher), the response time of the subquery approach will increase significantly, while response time of JOIN approach will remain stable.

The result of both previous equivalent queries is:

farm_name
The Paradise
Evergreen

Example: when subqueries are the only way to go

Let’s now suppose the owner of the company, after reading the results you delivered in the previous query, asks you to obtain the names of the farms that are producing more apples per square meter this year than the historical average. This sounds complex, but it's easier than it seems.

Solution using a subquery:

SELECT
  farm_name, 
  production_in_kg / area AS "production_per_meter"
FROM Current_year_production 
WHERE production_in_kg / area > (
  SELECT AVG(PH.production_in_kg / CYP.area) 
  FROM production_history PH 
  JOIN Current_year_production CYP  
    ON PH.farm_id = CYP.farm_id 
) 

We can't replace this subquery with a JOIN because we don’t have a table with the average previously calculated. In other words, we need to first calculate the historical average. And to do that, we need a GROUP BY, which can break the one-to-one relation needed for a JOIN. Another point to note is that the metric "apples per square meter" is obtained with the following expression:

	production_in_kg / area

We used the "apples per square meter" metric because we need some way to compare the productivity of the different farms and rank them. The total "production_in_kg" of a farm is not a comparable value—because it's likely, for instance, that the biggest farm will have a better production_in_kg. So we divide "production_in_kg" by the area of each farm to standardize the values and create a comparable metric.

We find that the historical average production per square meter is 0.42. Then the result of the previous query is:

farm_nameproduction_per_meter
The Paradise0.44
Red Delicious0.47

Example: when JOINs and subqueries are equally efficient

As a last data question, let’s try to obtain the years when the company produced fewer apples than in the current year. We can write this query using two different approaches.

Solution using a subquery:

SELECT year, sum(production_in_kg) 
FROM production_history PH
GROUP BY year
HAVING sum(production_in_kg) < (
  SELECT sum(production_in_kg)
  FROM current_year_production
)

Solution using a JOIN clause:

SELECT year, sum(PH.production_in_kg) 
FROM production_history PH 
JOIN current_year_production CYP
  ON PH.farm_id = CYP.farm_id
GROUP BY year
HAVING sum(PH.production_in_kg) < sum(CYP.production_in_kg)

You can see that both of these queries are really similar; the main difference is in the JOIN clause and the subquery. In this case, both queries are equally efficient—the subquery is executed one time in the HAVING clause, so there is no performance issue.

Try it yourself!

To wrap things up, it's important to note that subqueries and JOINs are both really important resources for an SQL developer. We saw examples where we can replace a subquery with a JOIN and examples where we cannot do such a replacement. And sometimes, subqueries and JOINs are equally efficient.

But how do you know when to use a subquery and when to use a JOIN? In all honesty, the only way to develop your intuition is to regularly solve SQL exercises. If you're looking to sharpen your SQL skills, our SQL Practice Set offers 88 practice problems for a comprehensive review.

Stay tuned for more articles!