27th May 2020 13 minutes read

5 Advanced SQL Constructions Every Data Analyst Should Learn

Table of Contents

Query 1: Ranking the Data
Query 2: Calculating Delta Values
Query 3: Calculating Running Totals
Query 4: Creating a Report Based on Multiple Conditions
Query 5: Adding Subtotals to a Report
Did You Find These Advanced SQL Topics Helpful?

Does your job involve data analysis? Here are five examples of how advanced SQL can help you in your daily work.

I’m a data analyst and I have to say – data analysts can be quite strange. Other people are usually afraid of immense amounts of data; we enjoy it. The deeper we get into data, the happier we are. An important tool for getting into data (and thus being a happier data analyst) is SQL.

Remember when you first discovered simple SQL functions? The joy you felt when you could quickly select the data, group it, and order it? If you now have a firm grasp of those basic functions, you’re probably wondering what advanced SQL is and how it can help you reach new levels of handling data and creating cool reports. After all, all work and no play makes data analysts dull boys and girls.

Well, I won’t dwell on the definition. The best way to explain advanced SQL is to show you some advanced queries. Then you’ll see how helpful (and fun) they can be.

Since I’ll be using a lot of SQL window functions, it’s best if you already know what they are and how they work. This article will bring you up to speed if you’re not familiar with them.

Query 1: Ranking the Data

Management likes to see rankings for absolutely anything: products sold, salaries, employees per department, money earned per any imaginable segment – they’ll always ask to see it ranked. To show you an example of how to rank things in SQL, I’ll use the sales table. It has the following columns:

product - The name of the product.
product_price - The price of the product.
items_sold - The number of items sold.

The idea is to calculate the revenue for every product and rank it using the RANK() function. The code below will solve this task:

SELECT	product,
		product_price,
		items_sold,
		product_price * items_sold AS revenue,
		RANK() OVER (ORDER BY product_price * items_sold DESC) AS revenue_rank
FROM sales;

The above code selects all columns from the table. To get the revenue per product, you need to multiply the price by the items sold. This is exactly what the query will do, and the result will be shown in the new column revenue. Finally, there’s the RANK() function. Here, this function will rank all rows by the new column revenue (defined by product_price * items_sold). The rank will be shown in the new column revenue_rank.

And here’s the result:

product	product_price	items_sold	revenue	revenue_rank
14HA34OrtegaYGasset	44.12	6547	288,853.64	1
TAL578UmbertoEcoFriendly	100	547	54,700.00	2
44HRZ890Sartrade	52.87	800	42,296.00	3
FX312AdornoForHome	12	3254	39,048.00	4
H618T4DeBeauvoirForAll	47.59	813	38,690.67	5
H16GRSocratesYoghurt	1.5	14587	21,880.50	6
67GHZ4Marximum	9.99	1789	17,872.11	7
H618T4HobbesSolutions	7.77	147	1,142.19	8
14HX13Arendt	12.47	47	586.09	9
MT657GombrowiczExtra	4	12	48.00	10

RANK() is only one way to rank data in SQL. If you want to do a deep dive into the subject, here’s our complete guide on how to rank rows in SQL.

Query 2: Calculating Delta Values

Along with ranking, calculating delta values is probably one of data analysts’ most common tasks. This is usually required when calculating day-to-day, month-to-month, quarter-to-quarter, or year-to-year changes. Be it revenue, costs, price changes, volume changes, or anything else imaginable, you’ll need to calculate the difference between the numbers. To do that, an advanced SQL query with the LAG() function is what you need. This function is used to retrieve the data from a previous row. Let me show you how it works using the revenue table. The table consists of two rows:

month - The month of the year.
revenue - The revenue for that month.

Your task is to calculate the difference between each month’s revenue and the previous month (i.e. the monthly revenue delta). How would you do that? If you know the LAG() function, it’s very easy. Here’s the code:

SELECT	month,
		revenue,
		revenue - LAG(revenue, 1) OVER (ORDER BY month) AS monthly_delta
FROM revenue;

The principle for calculating the delta is to deduct the previous month from the current month. The code above does exactly that. First, it selects the columns month and revenue. After that, it deducts the amount of the current month’s revenue from the previous month. This is defined by the LAG() function. The values we put in the function’s parenthesis (revenue, 1) indicate the value in the column revenue will be deducted from the previous value in that column. That’s why there’s the number 1; it defines how many rows the function goes back to perform the operation.

In theory, LAG(revenue) will do the same, since going one row back is the default for the LAG() function. However, I wanted to show you this explicitly. It’s easier to understand and you’ll know what to do when you need to go back more than one row.

Deltas need to be calculated sequentially, not by random months; that’s why there’s ORDER BY month. The delta will be shown in the new column monthly_delta. Run the code and you’ll get the resulting table:

month	revenue	monthly_delta
01/2019	12587.14	NULL
02/2019	478456.88	465869.74
03/2019	312588	-165868.88
04/2019	518387.66	205799.66
05/2019	222222.22	-296165.44
06/2019	588954.48	366732.26
07/2019	358981	-229973.48
08/2019	678841.54	319860.54
09/2019	1547895.82	869054.28
10/2019	1647895.82	100000
11/2019	912541.26	-735354.56
12/2019	984784.52	72243.26

Query 3: Calculating Running Totals

Running totals (also known as cumulative sums) are extensively used in data analysis. They are usually used with time series data to see how certain performance indicators are (or will be) developing over time. Like other advanced SQL concepts, running totals have a very broad practical use. They’re used to monitor sales, revenues, costs, profit, and budgets.

For now, I’ll show you how a cumulative sum works for budgets. Let’s use a table very imaginatively named budget. It consists of these columns:

month - The month of the cash flow.
client - The client name.
cash_flow - The budgeted cash flow.

There are three clients. The budget contains monthly projections of the yearly cash flow that your company will collect from them. You need to calculate the cumulative cash flow for each client. You might already intuitively know you have to use the SUM() function, but with some kind of twist. Here’s how to calculate running totals:

SELECT	month,
		client,
		cash_flow,
		SUM (cash_flow) OVER (PARTITION BY client ORDER BY month) AS running_total

FROM budget;

This is a simple little query that does wonders! It selects the columns month, client, and cash_flow. To calculate the running total, you have to summarize the cash flows. This is defined by SUM (cash_flow).

However, you’re not interested in having running totals on a table level. You need to somehow make it summarize cash flows month by month for the first client, then reset and start again for the second client. To do that, you need OVER (PARTITION BY client ORDER BY month). Here the partition is defined by the column client, which means that every data set is defined by the different clients. Also, the operation will be performed only within the partition, not on the whole table. That way, you get a running total for every client separately.

Of course, the cash flows have to be summarized sequentially; that’s why it’s ordered by the month column. The running total will appear in the new column running_total.

Here’s the resulting table:

month	client	cash_flow	running_total
01/2020	Claudio Gaudio	75564.38	75564.38
02/2020	Claudio Gaudio	12894.45	88458.83
03/2020	Claudio Gaudio	75564.38	164023.21
04/2020	Claudio Gaudio	12894.45	176917.66
05/2020	Claudio Gaudio	743541.12	920458.78
06/2020	Claudio Gaudio	325558.45	1246017.23
07/2020	Claudio Gaudio	390278.63	1636295.86
08/2020	Claudio Gaudio	22008.12	1658303.98
09/2020	Claudio Gaudio	85000	1743303.98
10/2020	Claudio Gaudio	42840.55	1786144.53
11/2020	Claudio Gaudio	85612.34	1871756.87
12/2020	Claudio Gaudio	412000	2283756.87
01/2020	Gabriele Pappardelle	49000	49000
02/2020	Gabriele Pappardelle	18480.26	67480.26
03/2020	Gabriele Pappardelle	127850.5	195330.76
04/2020	Gabriele Pappardelle	327000.5	522331.26
05/2020	Gabriele Pappardelle	500000	1022331.26
06/2020	Gabriele Pappardelle	0	1022331.26
07/2020	Gabriele Pappardelle	0	1022331.26
08/2020	Gabriele Pappardelle	1000000	2643324.72
08/2020	Gabriele Pappardelle	620993.46	2643324.72
09/2020	Gabriele Pappardelle	0	2643324.72
10/2020	Gabriele Pappardelle	500000	3143324.72
11/2020	Gabriele Pappardelle	500000	3643324.72
12/2020	Gabriele Pappardelle	500000	4143324.72
01/2020	Tony Pepperoni	10000	10000
02/2020	Tony Pepperoni	10000	20000
03/2020	Tony Pepperoni	10000	30000
04/2020	Tony Pepperoni	0	30000
05/2020	Tony Pepperoni	0	30000
06/2020	Tony Pepperoni	25787	55787
07/2020	Tony Pepperoni	32000	87787
08/2020	Tony Pepperoni	25787	113574
09/2020	Tony Pepperoni	0	113574
10/2020	Tony Pepperoni	18000	131574
11/2020	Tony Pepperoni	67450.5	199024.5
12/2020	Tony Pepperoni	1000	200024.5

I’ve used window functions in the last three examples. If you want to learn more about this topic, a good way is the Window Functions course, one of our advanced SQL courses.

Something that could also be very helpful, especially if you’re new to window functions or use them only occasionally, is this SQL Window Functions Cheat Sheet. I’ll be using it next time I write about window functions, for sure!

Query 4: Creating a Report Based on Multiple Conditions

One of data analysts’ main tasks is making data more friendly for other users. By giving them data in a form that they can easily use, we make their jobs easier. To create useful reports, a data analyst has to combine business input with their knowledge of the data. One of the tools that can help you in achieving that is a CASE statement, which is another advanced SQL concept.

To give you an example, let’s imagine the following scenario. You’re working for a bank and you’re asked by your colleagues to create a report. There’s a table called debt that shows the bank’s clients and details about their debt. The table consists of the following columns:

client - The name of the client.
date_due - The day the debt became due.
amount_due - The amount of the debt that is due.

What you need to do is create a report as of 30.4.2020. You somehow need to calculate the number of days due as of the reporting date. Also, you need to allocate the client to a certain time bucket, according to the number of the days their account is due.

The query is just below. Don’t be afraid – I’ll analyze it for you. It’s not as scary as it looks!

SELECT	client,
		date_due,
		amount_due,
		DATEDIFF ('2020-04-30', date_due) AS days_due,
		CASE
	WHEN  DATEDIFF ('2020-04-30', date_due) <= 30 THEN '0-30 days'
	WHEN  DATEDIFF ('2020-04-30', date_due) > 30 AND DATEDIFF ('2020-04-30', date_due) <=90 THEN '31-90 days'
	WHEN  DATEDIFF ('2020-04-30', date_due) > 90 AND DATEDIFF ('2020-04-30', date_due) <=180 THEN '91-180 days'
	WHEN  DATEDIFF ('2020-04-30', date_due) > 180 AND DATEDIFF ('2020-04-30', date_due) <=365 THEN '181-365 days'
	ELSE '> 365 days'
END AS time_bucket

FROM debt;

First, you need to specify the SELECT part of the query. I’ve selected the existing columns client, date_due, and amount_due.

Next, you have to calculate the days due. You do that by subtracting the due date from the reporting date. This is exactly what I did with DATEDIFF ('2020-04-30', date_due) AS days_due. I’ve used the DATEDIFF() function to calculate the required difference. When using this function, you first have to specify which dates you want to subtract. In our case, it’s the reporting date and the date due. Next, you have to specify how you want the result to be shown, i.e. in years, months, or days. You need days in this case, so you put day as the last value in DATEDIFF().

Now comes the exciting part – creating the conditions I’ve used in the CASE statement. This statement opens with CASE and finishes with END. In between, you need to define the conditions that will create the report your colleagues want. For this, you’ll use WHEN and THEN.

Let’s say that the first bucket of the days due is 0-30 days. The first condition in the CASE statement is WHEN DATEDIFF ('2020-04-30', date_due) <= 30 THEN '0-30 days'. Since you need to allocate clients to a time bucket according to the days due, this part of the code does exactly that. It reads like this: if the difference between the reporting date and the due date is less than or equal to 30 days, then this client will be allocated to the time bucket 0-30 days.

The next time bucket is 31-90 days, and this is the part of the code that defines it:

WHEN DATEDIFF ('2020-04-30', date_due) > 30 AND DATEDIFF ('2020-04-30', date_due) <=90 THEN '31-90 days'

It’s not that complicated, right?

The same principle works for the remaining two time buckets: 91-180 days and 181-365 days. Every debt that has been due for more than 365 days belongs to the 365 days time bucket. This is defined by ELSE '> 365 days'. This simply defines the criteria for reporting: if the value is this, do this; if it’s not, do this. Essentially, it’s a more complex version of the IF statement.

Please note there’s a more elegant way to write this code: I could’ve declared a variable containing the value ‘2020-04-30’ instead of writing ‘2020-04-30’ manually everywhere in the code. However, I didn’t want to confuse you if you’re not familiar with variables.

Also, regarding the DATEDIFF() function, note that I’ve used the MySQL function and syntax. Depending on the database engine you’re using, it’s possible that you’ll have to adapt the syntax accordingly.

All those time buckets will be shown in the new column time_bucket. Since you want your data to look nice, you’ll order your table by days_due ascending. Run the code and you’ll get a nice table. And probably a free coffee from your colleagues!

client	date_due	amount_due	days_due	time_bucket
GreatCompany	2019-12-31	10000	121	91-180 days
WeAreTheBest	2020-04-15	2000	15	0-30 days
AlmostBankrupt	2019-06-30	150000	305	181-365 days
WeWontPay	2019-01-15	870000	471	> 365 days
AllAboutMoney	2020-01-15	5000	106	91-180 days
YouTalkinToMe	2019-08-31	78000	243	181-365 days
BigLebowski	2020-01-31	42000	90	31-90 days
MilesSmiles	2019-11-30	78000	152	91-180 days
PanthelyaSolutions	2019-10-31	7000	182	181-365 days
PurplePrince	2019-12-31	500	121	91-180 days

Exciting, isn’t it? If you want more excitement like this, Creating Basic SQL Reports is for you! There you’ll learn more about CASE WHEN and the nuances of GROUP BY.

Query 5: Adding Subtotals to a Report

A very common request is to show subtotals and totals in the same report. The ROLLUP clause makes this much easier. It’s an extension of a GROUP BY clause. It allows you to add subtotals and grand totals to your data.

Here’s how to use ROLLUP. You have the table warehouse with the following columns:

warehouse - The name of the warehouse.
brand - The product’s brand.
product - The product’s name.
quantity - The quantity of this product in the warehouse.

There are two different brands with five products between them. And there are two warehouses. Your task is to calculate the total product quantity for both brands in both warehouses. You also need the grand total of all the products in both warehouses. And finally, you need to do everything in one table with one query. How would you manage it? The code is:

SELECT	warehouse,
		brand,
		SUM (quantity) AS sum_product

FROM warehouse
GROUP BY ROLLUP (warehouse, brand);

First, you select the columns warehouse and brand from the table. You also want the sum of the column quantity, which will be shown in the new table sum_product. What’s the next step? This is when ROLLUP comes in! It’s used to get totals for multiple data grouping levels. The GROUP BY ROLLUP (warehouse, brand) part will do exactly that. It will group the data by the warehouse and brand columns. After that, it will sum the data according to each grouping. The result is:

warehouse	brand	sum_product
Amsterdam	Brando	1105
Amsterdam	Ostap	62934
Amsterdam	NULL	64039
Berlin	Brando	67356
Berlin	Ostap	13451
Berlin	NULL	80807
NULL	NULL	144846

The table contains totals for the Brando and Ostap brands in the Amsterdam and Berlin warehouses and a grand total. The subtotal for both products in the Amsterdam warehouse is shown in the first row with the NULL brand value. It amounts to 64 039, the sum of the two previous rows.

Next, you can see the totals for both brands in the Berlin warehouse. After that, there’s another line with a NULL brand value; this is actually the Berlin subtotal amounting to 80 807. The last row shows the grand total of all products in all warehouses, which is 144 846.

Why are there NULL values in some rows? Because SQL doesn’t know how to name brands and warehouses when they’re grouped and a subtotal or grand total is shown. To find more fun details about other GROUP BY extensions, check out our GROUP BY Extensions course.

This is the last advanced SQL query for now.

Did You Find These Advanced SQL Topics Helpful?

Advanced SQL as used by data scientists, data analysts, and others is a really broad topic. There are vast possibilities in the use of SQL for data analysis. This Advanced SQL track will give you an idea of what you can do with SQL as a data analyst. Read our guide on how to practice advanced SQL with our platform to discover more SQL practice opportunities at advanced level.

In this article, I’ve tried to show you some of the most common uses of advanced SQL. The examples are practical and drawn from my experience, so I hope they were helpful. I’ve tried to make the queries as easy (and understandable) as possible. You can adapt these queries to your particular reports and data sets. Feel free to use them!

If you have any questions or comments, let me know in the comments section!

Tags: