28th Oct 2020 16 minutes read

What are SQL CTE Best Practices?

Table of Contents

What Are CTEs?
CTEs Make Your Code More Organized and Readable
Avoid Repeating Subqueries by Using CTEs
Replace a Subquery in the FROM Clause
Best Practices for Naming CTEs
Give Meaningful Names to Your CTEs
Be Consistent When Naming CTEs
Apply the Same Naming Convention to Both Tables and CTEs
You Can’t Avoid CTEs When Using Recursive Queries
The Benefits of Using CTEs

Are you wondering what the best practices for using common table expressions are? This article will help you learn when to use a CTE and how to write it.

If you have heard about SQL CTEs, you have probably noticed they’re often mentioned together with subqueries. Sometimes, people think there’s no difference compared to the subqueries, and there’s often debate about whether a CTE or subquery should be used to get a particular result.

The truth is, most of the queries written with CTEs can also be written with subqueries. The keyword here is most! There are, of course, some situations in which only CTEs can help you. There are also situations in which you can choose between a CTE and subquery, but using a CTE is often more comfortable.

I’ll show you what those situations are and the CTE best practices you should apply when writing a CTE query.

What Are CTEs?

CTEs, or common table expressions, are a temporary named data set returned by a query. The syntax of this SQL feature is:

WITH expression_name AS (CTE definition)

CTEs are also called WITH queries, and if you take a look at the syntax above, it’s obvious why. If you need to refresh your memory, you can read this article, which nicely shows how this basic syntax works in practice.

Speaking of practice, why not warm your fingers by writing some CTE code in the Recursive Queries course or the Advanced SQL track.

It’s now time I lead you through some benefits of using CTEs.

CTEs Make Your Code More Organized and Readable

Here’s the scenario: you’re working for a company that employs freelancers for various jobs. The data is stored in three tables:

freelancer
hours_worked
job_position

The table freelancer contains the following data:

id: the ID of the freelancer.
first_name: the first name of the freelancer.
last_name: the last name of the freelancer.
country: the country the freelancer works from.
pay_rate: the pay rate per hour.
job_type_id: the type of the job (the FK from the table job_position).

The second table, hours_worked, has columns:

id: the ID of the hours worked.
date: the date of the hours worked.
hours: the number of hours worked.
freelancer_id: the ID of the freelancer (the FK from the table freelancer).

The last table is job_position:

id: the ID of the job position type.
position_type: the name of the job position type.

In this first example, I will show you how using CTEs can make your code more organized and readable. Using the above tables, we need to list all freelancers, their pay rate, the job type they are hired for, and the average pay rate for the particular job position type.

If you want to use a subquery, here’s what the code will look like:

SELECT	f.id,
		f.first_name,
		f.last_name,
		f.pay_rate,
		j.position_type,
		sq.average_job_rate
FROM freelancer f JOIN job_position j 
ON f.job_type_id = j.id JOIN (SELECT j.id,
					AVG (f.pay_rate) AS average_job_rate
					FROM freelancer f 
JOIN job_position j 
ON f.job_type_id = j.id
					GROUP BY j.id) AS sq 
ON j.id = sq.id;

The code takes the following columns from the table freelancer: id, first_name, last_name, and pay_rate. Then, it takes the column position_type from the table job_position. Those two tables are given their aliases, and they are joined on the columns f.job_type_id and j.id.

That’s fine, but I’m still missing the average pay rate per job. I’ve joined the table, which is not a table, but a subquery in this case. In this subquery, I’ve again joined the tables freelancer and job_position on the columns f.job_type_id and j.id. I’ve done that so I can select the column id from the table job_position.

After that, it calculates the average pay rate using the AVG() function, with the result appearing in the new column average_job_rate. The result is then grouped by the column id since I want the result to be per job type.

Finally, the subquery is named sq. After I’ve written the subquery, I’ve joined it with the other tables on the columns j.id and sq.id. Now that I have the subquery written, I’ve put the column sq.average_job_rate in the main SELECT statement.

When you run the code, you’ll get the following result:

id	first_name	last_name	pay_rate	position_type	average_job_rate
1	Carole	Queen	60	SQL Developer	58.9
2	Arthur	Strauss	42	SQL Developer	58.9
3	Cindy	Jones	75	SQL Developer	58.9
4	Paul	Hopkins	25.5	Content Writer	24.618181
5	Katherine	McLean	32	Content Writer	24.618181
6	Don	Jones	18.5	Content Writer	24.618181
7	Linda	Stones	40.5	Data Analyst	39.45
8	Trent	Arthurs	12.5	Data Analyst	39.45
9	Miriam	Jones	52	Data Analyst	39.45
10	Guy	Gudbois	36	SQL Developer	58.9
11	Michelle	Lebouef	40	SQL Developer	58.9
12	Jean-Paul	Lamontagne	26	Content Writer	24.618181
13	Camille	Barbet	16	Content Writer	24.618181
14	Jules	Fournier	24.8	Content Writer	24.618181
15	Paulette	Dubois	32	Content Writer	24.618181
16	Mohamed	Lambert	38	Content Writer	24.618181
17	Simone	Pomeroy	36.7	Data Analyst	39.45
18	Gaspard	Chevalier	51	Data Analyst	39.45
19	Anouk	De Jong	80	SQL Developer	58.9
20	Ambroos	Jansen	75	SQL Developer	58.9
21	Hendrika	De Vries	41	SQL Developer	58.9
22	Caetano	Van de Berg	84.5	SQL Developer	58.9
23	Grietje	Van Dijk	55.5	SQL Developer	58.9
24	Dael	Janssen	18	Content Writer	24.618181
25	Uday	Visser	28	Content Writer	24.618181
26	Hadwin	Bakker	12	Content Writer	24.618181
27	Liv	Meijer	44	Data Analyst	39.45

Everything’s okay with the result, but I’m not satisfied with the code. Doesn’t it seem a little unreadable? I have difficulty analyzing it even though I just wrote it. Imagine how difficult it would be if I returned to it in a few days and wanted to change something. Or even worse, what if I had to analyze code written by someone else in such a way?

I don’t know about you, but I don’t like having the SELECT statements in JOIN. It makes my brain work in the opposite way of what is logical for me. I also want to have the simple name of the tables when I’m joining them, not disturbing this flow with a chunk of another SELECT statement, i.e., subquery.

Is it possible to make this code easier to read? Of course, that’s the first reason I’d use a CTE instead. Have a look at some code written with CTE:

WITH average_rate_per_job AS (
SELECT	j.id,
		AVG (f.pay_rate) AS average_job_rate
		FROM freelancer f 
JOIN job_position j 
ON f.job_type_id = j.id
		GROUP BY j.id	
)

SELECT	f.id,
		f.first_name,
		f.last_name,
		f.pay_rate,
		j.position_type,
		a.average_job_rate
FROM freelancer f 
JOIN job_position j ON f.job_type_id = j.id 
JOIN average_rate_per_job a ON f.job_type_id = a.id;

Doesn’t that look a little nicer and easier to analyze? Let’s do exactly that! I’ve created a CTE using the WITH statement. The name I’ve given to the CTE is average_rate_per_job. The name is followed by the keyword AS, and the CTE is defined in the parentheses.

If you care to check, it’s entirely the same as the subquery in the previous code. However, for me, it organizes the code better. Logically, I would first calculate the average rate per job and try to “merge” this result with the table employee.

After I’ve calculated the average rate per job, it’s time to list all the employees and job position types they’re hired for. Once again, the SELECT statement is the same as in the first code, save for different aliases and, fortunately, no subquery following the JOIN clause.

As you can see, the code takes the columns id, first_name, last_name, and pay_rate from the table freelancer, and the column position_type from the table job_position. Finally, it selects the column average_job_rate from the CTE. All those tables, including the CTE, are joined together.

I’ve simply treated the CTE as a table when joining it, using its name average_rate_per_job as I would when joining any other table. When you run the code, you will get the same result as running the first messy code.

Avoid Repeating Subqueries by Using CTEs

Let’s try something else. Use the data to get the freelancers and their average daily hours worked. Also, calculate the total average daily hours worked. Then, find the difference between those two values. Select only those freelancers who worked on average at least one hour above the total daily average.

How would you do this using subqueries? Here’s how:

SELECT	f.id,
		f.first_name,
		f.last_name,
		AVG (h.hours) AS daily_avg_freelancer,
		(SELECT AVG (hours)
			FROM hours_worked) AS daily_avg_total,
		AVG (h.hours) - (SELECT AVG (hours)
FROM hours_worked) AS diff
FROM hours_worked h 
JOIN freelancer f ON h.freelancer_id = f.id
GROUP BY f.id, f.first_name, f.last_name
HAVING (AVG (h.hours) - (SELECT AVG (hours)
FROM hours_worked) > 1);

This flamboyant code selects the columns id, first_name, and last_name from the table freelancer. Then it calculates the average daily hours worked by the freelancer, with the values shown in the column daily_avg_freelancer.

After that comes the subquery, which calculates the total average daily hours worked. It again uses the AVG() function, with values in the column daily_avg_total.

Now, I have to calculate the difference between those two values, which means I have to write the same subquery again: AVG (h.hours) - (SELECT AVG (hours) FROM hours_worked) AS diff. Since I need the data by the freelancer, the result has to be grouped by the id, first_name, and last_name from the table freelancer.

Finally, I have to set up a condition that will return only freelancers who worked at least one hour above the total average. Following the HAVING clause, there it is, the same old subquery showing up for the third time.

It would be nice if, instead of repeating the same subquery, I could write something like AVG (h.hours) - daily_avg_freelancer, i.e., using the subquery’s name every time I need it. Unfortunately, SQL doesn’t work that way. But is there a solution? Yes, you’ve guessed it: CTE.

Take a look at this! It’s the same code as above. Except now, it’s written using CTEs:

WITH total_average AS (
SELECT	AVG (hours) AS daily_avg_total
FROM hours_worked
),

avg_per_freelancer AS (
SELECT	f.id,
		f.first_name,
		f.last_name,
		AVG (h.hours) AS daily_avg_freelancer
FROM hours_worked h 
JOIN freelancer f ON h.freelancer_id = f.id
GROUP BY f.id, f.first_name, f.last_name
)

SELECT	*,
		(daily_avg_freelancer - daily_avg_total) AS diff
FROM avg_per_freelancer, total_average
WHERE (daily_avg_freelancer - daily_avg_total) > 1;

Using the WITH statement, I’ve created the total_average CTE. It’s a simple SELECT statement that calculates the total daily average hours worked.

Then comes the second CTE, and there’s no WITH statement. Remember, when writing two or more CTEs, you only write WITH before the first one. All the subsequent CTEs are separated only by a comma. This second CTE is named avg_per_freelancer. It does exactly what it says. It calculates the average daily hours worked per freelancer.

Not much is different compared to the original code written using subqueries. What is different is the CTE does calculations, and the code looks tidier since the calculation steps are organized in separate parts of the code.

Now we only need to calculate the difference. This means subtracting one CTE from the other. Here comes the magical part! There’s no need to write the CTEs all over again like you had to in the case of the subqueries. You simply refer to the CTEs by their name and select the columns you want.

In this case, it is (daily_avg_freelancer - daily_avg_total) AS diff. The SELECT statement also takes all the values from both CTEs. You treat them like any other table in the FROM clause. The condition is stated using the WHERE clause.

Again, you don’t have to write the same CTEs you’ve already written. Just use the CTEs one more time: WHERE (daily_avg_freelancer - daily_avg_total) > 1.

Running the query will return the table below:

id	first_name	last_name	daily_avg_freelancer	daily_avg_total	diff
2	Arthur	Strauss	10.333333	6.633333	3.7
6	Don	Jones	8	6.633333	1.366667
8	Trent	Arthurs	12	6.633333	5.366667
12	Jean-Paul	Lamontagne	8	6.633333	1.366667
16	Mohamed	Lambert	8.666666	6.633333	2.033333
18	Gaspard	Chevalier	10	6.633333	3.366667
19	Anouk	De Jong	12	6.633333	5.366667
22	Caetano	Van de Berg	7.666666	6.633333	1.033333
26	Hadwin	Bakker	7.833333	6.633333	1.2
27	Liv	Meijer	8.5	6.633333	1.866667

To learn even more about how to organize your queries, you can read this article. If you’re looking for more examples for practice, you can find several here.

Replace a Subquery in the FROM Clause

You’ve probably seen it, and maybe you’ve used it. Writing a subquery in the FROM clause is also called a derived table in SQL Server and Oracle. When using a subquery in the FROM clause, it has to have a name. For instance, the code could look something like this:

SELECT	hs.first_name,
		hs.last_name,
		hs.pay_rate * hs.hours_per_freelancer AS monthly_pay
FROM (
SELECT	f.id,
			f.first_name,
			f.last_name,
			f.pay_rate,
			SUM (h.hours) AS hours_per_freelancer
	FROM freelancer f 
JOIN hours_worked h ON f.id = h.freelancer_id
	GROUP BY f.id, f.first_name, f.last_name, f.pay_rate
) AS hs;

This query calculates the total monthly pay for every freelancer based on their hourly rate and hours charged. It seems a rather complicated query for such a simple calculation. Let’s analyze it, but this time, let’s start from the subquery in the FROM clause.

The subquery selects the columns id, first_name, last_name, and pay_rate from the table freelancer. It also sums the hours worked in the new column hours_per_freelancer, taking data from the table hours_worked.

The tables are, of course, joined to get the result. This result has to be by the freelancer. That’s why it is grouped by the id, first_name, and last_name columns. I’ve named the subquery hs.

The outer query uses the subquery as a table. It selects the needed columns and multiplies the pay rate with the hours worked by the freelancer, with the result shown in the column monthly_pay.

If you use a subquery in the FROM clause, it’s good to replace it with a CTE. Your code will be much shorter, less complicated, and much more transparent. In that case, the code will be:

WITH hours_charged AS (
SELECT	f.id,
		SUM (h.hours) AS hours_per_freelancer
FROM freelancer f 
JOIN hours_worked h ON f.id = h.freelancer_id
GROUP BY f.id
)

SELECT	f.first_name,
		f.last_name,
		f.pay_rate * hc.hours_per_freelancer AS monthly_pay
FROM freelancer f 
JOIN hours_charged hc ON f.id = hc.id;

The CTE is named hours_charged. It selects the column id from the table freelancer. It also calculates the total hours worked by using the SUM() function. The result will show up in the column hours_per_freelancer. Since I want the result by the freelancer, it is grouped by the column id from the table freelancer.

Once the CTE is defined, I can use it in the SELECT statement. This one simply selects the first and last name of the freelancer. It also returns the monthly pay as a result of this code line: f.pay_rate * hc.hours_per_freelancer AS monthly_pay.

This is the result of both queries:

first_name	last_name	monthly_pay
Carole	Queen	1020
Arthur	Strauss	1302
Cindy	Jones	1425
Paul	Hopkins	102
Katherine	McLean	640
Don	Jones	444
Linda	Stones	121.5
Trent	Arthurs	300
Miriam	Jones	572
Guy	Gudbois	2052
Michelle	Lebouef	480
Jean-Paul	Lamontagne	624
Camille	Barbet	96
Jules	Fournier	446.4
Paulette	Dubois	128
Mohamed	Lambert	988
Simone	Pomeroy	440.4
Gaspard	Chevalier	1530
Anouk	De Jong	10560
Ambroos	Jansen	1275
Hendrika	De Vries	451
Caetano	Van de Berg	1943.5
Grietje	Van Dijk	166.5
Dael	Janssen	252
Uday	Visser	168
Hadwin	Bakker	282
Liv	Meijer	1122

Now that you have started to see the benefits of CTEs, here’s an article that analyzes the differences between CTEs and subqueries.

Best Practices for Naming CTEs

Even though one of the benefits of using CTEs is making your code more organized and readable, it’s also up to you to keep your code neat. The best way to do that is by paying attention to how you name your CTEs.

Here are some pieces of advice that I apply when naming CTEs. I recommend that you do the same.

Give Meaningful Names to Your CTEs

This is important because you’ll reference CTEs by their name. It is a nice feature that makes CTEs very helpful in writing code.

Don’t mess it up by giving them names that are not related to what you’re doing. If you do, you’ll have no idea what a particular CTE does. You’ll have to analyze it every time you want to reference it somewhere. I’m guessing you have many more exciting things to do in your life.

Take, for instance, one of the codes above, the one containing two CTEs. Just by looking at the names of the CTEs, you’ll know or assume, at least, what each of those CTEs does, even if you’re seeing this code for the first time.

WITH total_average AS (
SELECT 	AVG (hours) AS daily_avg_total
FROM hours_worked
),

avg_per_freelancer AS (
SELECT	f.id,
		f.first_name,
		f.last_name,
		AVG (h.hours) AS daily_avg_freelancer
FROM hours_worked h 
JOIN freelancer f ON h.freelancer_id = f.id
GROUP BY f.id, f.first_name, f.last_name
)

SELECT	*,
		(daily_avg_freelancer - daily_avg_total) AS diff
FROM avg_per_freelancer, total_average
WHERE (daily_avg_freelancer - daily_avg_total) > 1;

Imagine I’ve decided to name those CTEs as a and b. Can you imagine the mess? If you can’t, here, let me show you:

WITH a AS (
SELECT 	AVG (hours) AS daily_avg_total
FROM hours_worked
),

b AS (
SELECT	f.id,
		f.first_name,
		f.last_name,
		AVG (h.hours) AS daily_avg_freelancer
FROM hours_worked h 
JOIN freelancer f ON h.freelancer_id = f.id
GROUP BY f.id, f.first_name, f.last_name
)

Do you know what each CTE does? Me neither. Look at the query using those CTEs:

SELECT	*,
		(daily_avg_freelancer - daily_avg_total) AS diff
FROM b, a
WHERE (daily_avg_freelancer - daily_avg_total) > 1;

Such a simple SELECT statement made incomprehensible by choosing the wrong names. What is a, and what is b in this statement? Good luck with analyzing it!

Be Consistent When Naming CTEs

You’ve probably noticed that all my CTEs are written in lower case. If the name consists of two or more words, I separate them by using the underscore. It doesn’t mean you have to do the same.

You can use only uppercase if you wish to. You can separate the words by starting every word with the capital letter. The only thing I want you to remember is to choose one way of naming CTEs and stick to it. Please, please, please don’t do something like this:

WITH Total_AveRage AS (...)
)

AVGPERfreelancer AS (...)

SELECT	*,
		(daily_avg_freelancer - daily_avg_total) AS diff
FROM TOTAL_AVERAGE, avgperfreelancer
WHERE (daily_avg_freelancer - daily_avg_total) > 1;

It would hurt my brain if it were medically possible. Yours, too, trust me.

Apply the Same Naming Convention to Both Tables and CTEs

Several times in this article, I’ve mentioned you should treat CTEs as any other table. This also applies to naming them. Once you’ve chosen a naming convention and decided to stick to it, apply it to both CTEs and tables.

Lines in the code are like sentences. You want to be able to read your thoughts and allow others to do the same. This also means keeping your lines of code unified by sticking to the chosen naming convention. Otherwise, IT woulD be much_harder FORYOU to read YOUR_OWNCode.

For example, you’ll see that I write all the CTE and table names in lowercase. Also, all the words in the name are separated by an underscore.

avg_per_freelancer AS (
SELECT	f.id,
		f.first_name,
		f.last_name,
		AVG (h.hours) AS daily_avg_freelancer
FROM hours_worked h JOIN freelancer f ON h.freelancer_id = f.id
GROUP BY f.id, f.first_name, f.last_name

If you’re not consistent in naming, your code could look like this:

avgperfreelancer AS (
SELECT	f.id,
		f.first_name,
		f.last_name,
		AVG (h.hours) AS daily_avg_freelancer
FROM hours_worked h JOIN Freelancer f ON h.freelancer_id = f.id
GROUP BY f.id, f.first_name, f.last_name

Even though your WITH query has a meaningful name, it’s tough to read it and realize it’s meaningful. It also doesn’t look nice if all the other tables are named using different conventions, e.g., hours_worked.

You Can’t Avoid CTEs When Using Recursive Queries

Those of you who already use the recursive queries know that. You can’t write a recursive query without a CTE. I don’t mean you can’t, as in you shouldn’t. No, you can’t do that without a CTE.

If you’re not that advanced yet, you’re maybe wondering what a recursive query is. Recursive queries reference themselves. By doing so, they return the sub-result and repeat the process until they return the final result.

I just wanted to mention this and won’t go through recursive queries with you. I’ll gladly leave you to my colleagues though. They can unlock the power of recursive queries for you. If you’re using PostgreSQL or Oracle, this article will lead you through recursive queries.

The Benefits of Using CTEs

With several examples, I’ve tried to show you the differences between a CTE and a subquery. By doing so, I hope I was able to show you when it’s good to use CTEs instead.

Whenever you decide to use CTEs, it’s essential to have in mind several naming conventions that could make writing them even easier. While not minding those conventions won’t affect the performance of the code itself, it could diminish one of the benefits of using CTEs, improving the code organization and readability.

Last but not least, if you’re interested in learning recursive queries, knowing CTEs is a must.

If you liked what you read here, feel free to learn even more in our Recursive Queries course. This article will give you all the info you need to find out what you can learn there.

Feel free to use the comment section to let me know how this article helped you or how I could’ve done better.

Tags: