Back to articles list May 8, 2018 - 5 minutes read Converting Subqueries to Joins Ignacio L. Bisso Ignacio is a database consultant from Buenos Aires, Argentina. He’s worked for 15 years as a database consultant for IT companies like Informix and IBM. These days, he teaches databases at Sarmiento University and works as a PostgreSQL independent SQL consultant. A proud father of four kids with 54 years in his backpack, Ignacio plays soccer every Saturday afternoon, enjoying every match as if it’s his last one. Tags: data analysis how to in sql join SQL basics SQL subquery Not all queries are alike, especially in terms of performance. In this article, we'll look at how you can convert SQL subqueries to joins for improved efficiency. When should I use SQL subqueries? Great question! Unfortunately, there's no concrete answer. SQL beginners tend to overuse subqueries. Typically, once they find that SQL construction works in one situation, they try to apply that same approach to other situations. It's only natural. However, in some cases a subquery can be replaced with a more efficient JOIN. If you can avoid a subquery and replace it with a JOIN clause, you should do so without hesitation. But of course, in some cases, using a subquery is the only way to solve a data question. In this article, I'll show examples of both cases: when a subquery is a must and when a subquery should be avoided and replaced by a JOIN. The data Before getting to the examples, let’s briefly look at the sample database we will use. The database has two tables representing the production statistics of a fictional apple farm company named EverRed. The company has three farms. The first table is current_year_production, which contains information about the number of apples produced in the current year by each farm, as well as the area and number of trees on each farm. The second table, production_history, stores past production information for each farm. Below are some sample data from these two tables. current_year_production farm_idarea_m2farm_namenumber_of_treesproduction_in_kg 10010000The Paradise2404400 10115000Evergreen3006200 10220000Red Delicious5809300 production_history farm_idyearproduction_in_kgprice_ton 100201741001200 101201758001200 102201794001200 100201639001300 101201664001300 102201691001300 Example: replacing a subquery with a JOIN Suppose you're an SQL data analyst working at EverRed. The owner of the company wants you to obtain the names of the farms where the company is producing more apples in the current year than in the previous year (2017). Solution using a subquery: SELECT farm_name, FROM current_year_production CYP WHERE production_in_kg > ( SELECT production_in_kg FROM production_history PH WHERE PH.farm_id = CYP.farm_id AND year = 2017 ) Solution using a JOIN clause: SELECT farm_name, FROM current_year_production CYP JOIN production_history PH ON PH.farm_id = CYP.farm_id WHERE PH.year = 2017 AND CYP.production_in_kg > PH.production_in_kg The difference between these two approaches is in performance. While the JOIN clause in the second example needs to be executed only once, the subquery in the first example will be executed once per farm. In this case, we only have three farms, so the difference is negligible. But what if you worked for a larger company that has 10,000 global farms? The subquery would need to be executed 10,000 times. Clearly, a subquery is inefficient for our purposes here. Moreover, in a test database with only a few farms, both queries execute with an acceptable response time; however, when we move to a productive database, (where the data volume is usually much higher), the response time of the subquery approach will increase significantly, while response time of JOIN approach will remain stable. The result of both previous equivalent queries is: farm_name The Paradise Evergreen Example: when subqueries are the only way to go Let’s now suppose the owner of the company, after reading the results you delivered in the previous query, asks you to obtain the names of the farms that are producing more apples per square meter this year than the historical average. This sounds complex, but it's easier than it seems. Solution using a subquery: SELECT farm_name, production_in_kg / area AS "production_per_meter" FROM Current_year_production WHERE production_in_kg / area > ( SELECT AVG(PH.production_in_kg / CYP.area) FROM production_history PH JOIN Current_year_production CYP ON PH.farm_id = CYP.farm_id ) We can't replace this subquery with a JOIN because we don’t have a table with the average previously calculated. In other words, we need to first calculate the historical average. And to do that, we need a GROUP BY, which can break the one-to-one relation needed for a JOIN. Another point to note is that the metric "apples per square meter" is obtained with the following expression: production_in_kg / area We used the "apples per square meter" metric because we need some way to compare the productivity of the different farms and rank them. The total "production_in_kg" of a farm is not a comparable value—because it's likely, for instance, that the biggest farm will have a better production_in_kg. So we divide "production_in_kg" by the area of each farm to standardize the values and create a comparable metric. We find that the historical average production per square meter is 0.42. Then the result of the previous query is: farm_nameproduction_per_meter The Paradise0.44 Red Delicious0.47 Example: when JOINs and subqueries are equally efficient As a last data question, let’s try to obtain the years when the company produced fewer apples than in the current year. We can write this query using two different approaches. Solution using a subquery: SELECT year, sum(production_in_kg) FROM production_history PH GROUP BY year HAVING sum(production_in_kg) < ( SELECT sum(production_in_kg) FROM current_year_production ) Solution using a JOIN clause: SELECT year, sum(PH.production_in_kg) FROM production_history PH JOIN current_year_production CYP ON PH.farm_id = CYP.farm_id GROUP BY year HAVING sum(PH.production_in_kg) < sum(CYP.production_in_kg) You can see that both of these queries are really similar; the main difference is in the JOIN clause and the subquery. In this case, both queries are equally efficient—the subquery is executed one time in the HAVING clause, so there is no performance issue. Try it yourself! To wrap things up, it's important to note that subqueries and JOINs are both really important resources for an SQL developer. We saw examples where we can replace a subquery with a JOIN and examples where we cannot do such a replacement. And sometimes, subqueries and JOINs are equally efficient. But how do you know when to use a subquery and when to use a JOIN? In all honesty, the only way to develop your intuition is to regularly solve SQL exercises. If you're looking to sharpen your SQL skills, our SQL Practice Set offers 88 practice problems for a comprehensive review. Stay tuned for more articles! Tags: data analysis how to in sql join SQL basics SQL subquery You may also like How to Learn SQL JOINs Learn how to use SQL JOINs to effectively combine data across multiple tables and analyze sophisticated data sets. Read more Subquery vs. JOIN What are the differences between a subquery and a JOIN in SQL? How do you choose one over the other? See examples of when to use which. Read more How to Join 3 Tables (or More) in SQL Have you ever wondered how to join three tables in SQL? It's easy when you know the basics. Joining three tables can be as easy as joining two tables. Read more How to Practice SQL JOINs Do you want to master SQL JOINs? The best way is through practice. Here’s how to practice SQL JOINs along with some examples. Read more How to LEFT JOIN Multiple Tables in SQL You will learn how to left join 3 tables in SQL while avoiding common mistakes in joining multiple tables. Examples included! Read more SQL INNER JOIN Explained in Simple Words Learn how to join tables in SQL using the most famous among SQL JOIN types–the INNER JOIN. Read more An Illustrated Guide to Multiple Join Did you wonder what is the most efficient way of making multiple joins in SQL? We did as well - come by and check out our illustrated guide! Read more An Illustrated Guide to the SQL Non Equi Join SQL join doesn’t have to be based on identical matches! In this post, we look at the non-equi join, which uses ‘non-equal’ operators to match records. Read more An Illustrated Guide to the SQL Self Join What is a SQL self join and how does it work? When should you use a self join in SQL? In this article, you’ll find answers to these questions! Read more An Illustrated Guide to the SQL OUTER JOIN An SQL JOIN clause links the data from two or more database tables. Let’s find out what is the OUTER JOIN and how does it work! Read more An Illustrated Guide to the SQL INNER JOIN INNER JOIN combines data from multiple tables by joining them based on a matching record. It requires a JOIN condition, which we'll explain in this article. Read more An Illustrated Guide to the SQL CROSS JOIN CROSS JOIN returns a Cartesian product, or all records joined to all records in all tables. Take a look at the example queries. Read more An Introduction to Using SQL Aggregate Functions with JOINs Aggregate functions. Powerful SQL tools. Let's see how they cooperate paired with LEFT JOIN, SUM and GROUP BY perform computations on multiple tables. Read more Learning JOINs With Real World SQL Examples JOIN statement lets you work with data stored in multiple tables. This article is a practical introduction to the SQL JOIN. Check out examples in real life. Read more SQL JOINs A SQL JOIN is a method to retrieve data from two or more database tables. Here we present a basic overview of what data from a particular SQL join will look Read more SQL JOINs for Beginners JOIN data from two or more tables with no effort. Read more Subscribe to our newsletter Join our weekly newsletter to be notified about the latest posts.