27th May 2020 13 minutes read 5 Advanced SQL Constructions Every Data Analyst Should Learn Tihomir Babic learn sql Advanced SQL Data Analysis Table of Contents Query 1: Ranking the Data Query 2: Calculating Delta Values Query 3: Calculating Running Totals Query 4: Creating a Report Based on Multiple Conditions Query 5: Adding Subtotals to a Report Did You Find These Advanced SQL Topics Helpful? Does your job involve data analysis? Here are five examples of how advanced SQL can help you in your daily work. I’m a data analyst and I have to say – data analysts can be quite strange. Other people are usually afraid of immense amounts of data; we enjoy it. The deeper we get into data, the happier we are. An important tool for getting into data (and thus being a happier data analyst) is SQL. Remember when you first discovered simple SQL functions? The joy you felt when you could quickly select the data, group it, and order it? If you now have a firm grasp of those basic functions, you’re probably wondering what advanced SQL is and how it can help you reach new levels of handling data and creating cool reports. After all, all work and no play makes data analysts dull boys and girls. Well, I won’t dwell on the definition. The best way to explain advanced SQL is to show you some advanced queries. Then you’ll see how helpful (and fun) they can be. Since I’ll be using a lot of SQL window functions, it’s best if you already know what they are and how they work. This article will bring you up to speed if you’re not familiar with them. Query 1: Ranking the Data Management likes to see rankings for absolutely anything: products sold, salaries, employees per department, money earned per any imaginable segment – they’ll always ask to see it ranked. To show you an example of how to rank things in SQL, I’ll use the sales table. It has the following columns: product - The name of the product. product_price - The price of the product. items_sold - The number of items sold. The idea is to calculate the revenue for every product and rank it using the RANK() function. The code below will solve this task: SELECT product, product_price, items_sold, product_price * items_sold AS revenue, RANK() OVER (ORDER BY product_price * items_sold DESC) AS revenue_rank FROM sales; The above code selects all columns from the table. To get the revenue per product, you need to multiply the price by the items sold. This is exactly what the query will do, and the result will be shown in the new column revenue. Finally, there’s the RANK() function. Here, this function will rank all rows by the new column revenue (defined by product_price * items_sold). The rank will be shown in the new column revenue_rank. And here’s the result: productproduct_priceitems_soldrevenuerevenue_rank 14HA34OrtegaYGasset44.126547288,853.641 TAL578UmbertoEcoFriendly10054754,700.002 44HRZ890Sartrade52.8780042,296.003 FX312AdornoForHome12325439,048.004 H618T4DeBeauvoirForAll47.5981338,690.675 H16GRSocratesYoghurt1.51458721,880.506 67GHZ4Marximum9.99178917,872.117 H618T4HobbesSolutions7.771471,142.198 14HX13Arendt12.4747586.099 MT657GombrowiczExtra41248.0010 RANK() is only one way to rank data in SQL. If you want to do a deep dive into the subject, here’s our complete guide on how to rank rows in SQL. Query 2: Calculating Delta Values Along with ranking, calculating delta values is probably one of data analysts’ most common tasks. This is usually required when calculating day-to-day, month-to-month, quarter-to-quarter, or year-to-year changes. Be it revenue, costs, price changes, volume changes, or anything else imaginable, you’ll need to calculate the difference between the numbers. To do that, an advanced SQL query with the LAG() function is what you need. This function is used to retrieve the data from a previous row. Let me show you how it works using the revenue table. The table consists of two rows: month - The month of the year. revenue - The revenue for that month. Your task is to calculate the difference between each month’s revenue and the previous month (i.e. the monthly revenue delta). How would you do that? If you know the LAG() function, it’s very easy. Here’s the code: SELECT month, revenue, revenue - LAG(revenue, 1) OVER (ORDER BY month) AS monthly_delta FROM revenue; The principle for calculating the delta is to deduct the previous month from the current month. The code above does exactly that. First, it selects the columns month and revenue. After that, it deducts the amount of the current month’s revenue from the previous month. This is defined by the LAG() function. The values we put in the function’s parenthesis (revenue, 1) indicate the value in the column revenue will be deducted from the previous value in that column. That’s why there’s the number 1; it defines how many rows the function goes back to perform the operation. In theory, LAG(revenue) will do the same, since going one row back is the default for the LAG() function. However, I wanted to show you this explicitly. It’s easier to understand and you’ll know what to do when you need to go back more than one row. Deltas need to be calculated sequentially, not by random months; that’s why there’s ORDER BY month. The delta will be shown in the new column monthly_delta. Run the code and you’ll get the resulting table: monthrevenuemonthly_delta 01/201912587.14NULL 02/2019478456.88465869.74 03/2019312588-165868.88 04/2019518387.66205799.66 05/2019222222.22-296165.44 06/2019588954.48366732.26 07/2019358981-229973.48 08/2019678841.54319860.54 09/20191547895.82869054.28 10/20191647895.82100000 11/2019912541.26-735354.56 12/2019984784.5272243.26 Query 3: Calculating Running Totals Running totals (also known as cumulative sums) are extensively used in data analysis. They are usually used with time series data to see how certain performance indicators are (or will be) developing over time. Like other advanced SQL concepts, running totals have a very broad practical use. They’re used to monitor sales, revenues, costs, profit, and budgets. For now, I’ll show you how a cumulative sum works for budgets. Let’s use a table very imaginatively named budget. It consists of these columns: month - The month of the cash flow. client - The client name. cash_flow - The budgeted cash flow. There are three clients. The budget contains monthly projections of the yearly cash flow that your company will collect from them. You need to calculate the cumulative cash flow for each client. You might already intuitively know you have to use the SUM() function, but with some kind of twist. Here’s how to calculate running totals: SELECT month, client, cash_flow, SUM (cash_flow) OVER (PARTITION BY client ORDER BY month) AS running_total FROM budget; This is a simple little query that does wonders! It selects the columns month, client, and cash_flow. To calculate the running total, you have to summarize the cash flows. This is defined by SUM (cash_flow). However, you’re not interested in having running totals on a table level. You need to somehow make it summarize cash flows month by month for the first client, then reset and start again for the second client. To do that, you need OVER (PARTITION BY client ORDER BY month). Here the partition is defined by the column client, which means that every data set is defined by the different clients. Also, the operation will be performed only within the partition, not on the whole table. That way, you get a running total for every client separately. Of course, the cash flows have to be summarized sequentially; that’s why it’s ordered by the month column. The running total will appear in the new column running_total. Here’s the resulting table: monthclientcash_flowrunning_total 01/2020Claudio Gaudio75564.3875564.38 02/2020Claudio Gaudio12894.4588458.83 03/2020Claudio Gaudio75564.38164023.21 04/2020Claudio Gaudio12894.45176917.66 05/2020Claudio Gaudio743541.12920458.78 06/2020Claudio Gaudio325558.451246017.23 07/2020Claudio Gaudio390278.631636295.86 08/2020Claudio Gaudio22008.121658303.98 09/2020Claudio Gaudio850001743303.98 10/2020Claudio Gaudio42840.551786144.53 11/2020Claudio Gaudio85612.341871756.87 12/2020Claudio Gaudio4120002283756.87 01/2020Gabriele Pappardelle4900049000 02/2020Gabriele Pappardelle18480.2667480.26 03/2020Gabriele Pappardelle127850.5195330.76 04/2020Gabriele Pappardelle327000.5522331.26 05/2020Gabriele Pappardelle5000001022331.26 06/2020Gabriele Pappardelle01022331.26 07/2020Gabriele Pappardelle01022331.26 08/2020Gabriele Pappardelle10000002643324.72 08/2020Gabriele Pappardelle620993.462643324.72 09/2020Gabriele Pappardelle02643324.72 10/2020Gabriele Pappardelle5000003143324.72 11/2020Gabriele Pappardelle5000003643324.72 12/2020Gabriele Pappardelle5000004143324.72 01/2020Tony Pepperoni1000010000 02/2020Tony Pepperoni1000020000 03/2020Tony Pepperoni1000030000 04/2020Tony Pepperoni030000 05/2020Tony Pepperoni030000 06/2020Tony Pepperoni2578755787 07/2020Tony Pepperoni3200087787 08/2020Tony Pepperoni25787113574 09/2020Tony Pepperoni0113574 10/2020Tony Pepperoni18000131574 11/2020Tony Pepperoni67450.5199024.5 12/2020Tony Pepperoni1000200024.5 I’ve used window functions in the last three examples. If you want to learn more about this topic, a good way is the Window Functions course, one of our advanced SQL courses. Something that could also be very helpful, especially if you’re new to window functions or use them only occasionally, is this SQL Window Functions Cheat Sheet. I’ll be using it next time I write about window functions, for sure! Query 4: Creating a Report Based on Multiple Conditions One of data analysts’ main tasks is making data more friendly for other users. By giving them data in a form that they can easily use, we make their jobs easier. To create useful reports, a data analyst has to combine business input with their knowledge of the data. One of the tools that can help you in achieving that is a CASE statement, which is another advanced SQL concept. To give you an example, let’s imagine the following scenario. You’re working for a bank and you’re asked by your colleagues to create a report. There’s a table called debt that shows the bank’s clients and details about their debt. The table consists of the following columns: client - The name of the client. date_due - The day the debt became due. amount_due - The amount of the debt that is due. What you need to do is create a report as of 30.4.2020. You somehow need to calculate the number of days due as of the reporting date. Also, you need to allocate the client to a certain time bucket, according to the number of the days their account is due. The query is just below. Don’t be afraid – I’ll analyze it for you. It’s not as scary as it looks! SELECT client, date_due, amount_due, DATEDIFF ('2020-04-30', date_due) AS days_due, CASE WHEN DATEDIFF ('2020-04-30', date_due) <= 30 THEN '0-30 days' WHEN DATEDIFF ('2020-04-30', date_due) > 30 AND DATEDIFF ('2020-04-30', date_due) <=90 THEN '31-90 days' WHEN DATEDIFF ('2020-04-30', date_due) > 90 AND DATEDIFF ('2020-04-30', date_due) <=180 THEN '91-180 days' WHEN DATEDIFF ('2020-04-30', date_due) > 180 AND DATEDIFF ('2020-04-30', date_due) <=365 THEN '181-365 days' ELSE '> 365 days' END AS time_bucket FROM debt; First, you need to specify the SELECT part of the query. I’ve selected the existing columns client, date_due, and amount_due. Next, you have to calculate the days due. You do that by subtracting the due date from the reporting date. This is exactly what I did with DATEDIFF ('2020-04-30', date_due) AS days_due. I’ve used the DATEDIFF() function to calculate the required difference. When using this function, you first have to specify which dates you want to subtract. In our case, it’s the reporting date and the date due. Next, you have to specify how you want the result to be shown, i.e. in years, months, or days. You need days in this case, so you put day as the last value in DATEDIFF(). Now comes the exciting part – creating the conditions I’ve used in the CASE statement. This statement opens with CASE and finishes with END. In between, you need to define the conditions that will create the report your colleagues want. For this, you’ll use WHEN and THEN. Let’s say that the first bucket of the days due is 0-30 days. The first condition in the CASE statement is WHEN DATEDIFF ('2020-04-30', date_due) <= 30 THEN '0-30 days'. Since you need to allocate clients to a time bucket according to the days due, this part of the code does exactly that. It reads like this: if the difference between the reporting date and the due date is less than or equal to 30 days, then this client will be allocated to the time bucket 0-30 days. The next time bucket is 31-90 days, and this is the part of the code that defines it: WHEN DATEDIFF ('2020-04-30', date_due) > 30 AND DATEDIFF ('2020-04-30', date_due) <=90 THEN '31-90 days' It’s not that complicated, right? The same principle works for the remaining two time buckets: 91-180 days and 181-365 days. Every debt that has been due for more than 365 days belongs to the 365 days time bucket. This is defined by ELSE '> 365 days'. This simply defines the criteria for reporting: if the value is this, do this; if it’s not, do this. Essentially, it’s a more complex version of the IF statement. Please note there’s a more elegant way to write this code: I could’ve declared a variable containing the value ‘2020-04-30’ instead of writing ‘2020-04-30’ manually everywhere in the code. However, I didn’t want to confuse you if you’re not familiar with variables. Also, regarding the DATEDIFF() function, note that I’ve used the MySQL function and syntax. Depending on the database engine you’re using, it’s possible that you’ll have to adapt the syntax accordingly. All those time buckets will be shown in the new column time_bucket. Since you want your data to look nice, you’ll order your table by days_due ascending. Run the code and you’ll get a nice table. And probably a free coffee from your colleagues! clientdate_dueamount_duedays_duetime_bucket GreatCompany2019-12-311000012191-180 days WeAreTheBest2020-04-152000150-30 days AlmostBankrupt2019-06-30150000305181-365 days WeWontPay2019-01-15870000471> 365 days AllAboutMoney2020-01-15500010691-180 days YouTalkinToMe2019-08-3178000243181-365 days BigLebowski2020-01-31420009031-90 days MilesSmiles2019-11-307800015291-180 days PanthelyaSolutions2019-10-317000182181-365 days PurplePrince2019-12-3150012191-180 days Exciting, isn’t it? If you want more excitement like this, Creating Basic SQL Reports is for you! There you’ll learn more about CASE WHEN and the nuances of GROUP BY. Query 5: Adding Subtotals to a Report A very common request is to show subtotals and totals in the same report. The ROLLUP clause makes this much easier. It’s an extension of a GROUP BY clause. It allows you to add subtotals and grand totals to your data. Here’s how to use ROLLUP. You have the table warehouse with the following columns: warehouse - The name of the warehouse. brand - The product’s brand. product - The product’s name. quantity - The quantity of this product in the warehouse. There are two different brands with five products between them. And there are two warehouses. Your task is to calculate the total product quantity for both brands in both warehouses. You also need the grand total of all the products in both warehouses. And finally, you need to do everything in one table with one query. How would you manage it? The code is: SELECT warehouse, brand, SUM (quantity) AS sum_product FROM warehouse GROUP BY ROLLUP (warehouse, brand); First, you select the columns warehouse and brand from the table. You also want the sum of the column quantity, which will be shown in the new table sum_product. What’s the next step? This is when ROLLUP comes in! It’s used to get totals for multiple data grouping levels. The GROUP BY ROLLUP (warehouse, brand) part will do exactly that. It will group the data by the warehouse and brand columns. After that, it will sum the data according to each grouping. The result is: warehousebrandsum_product AmsterdamBrando1105 AmsterdamOstap62934 AmsterdamNULL64039 BerlinBrando67356 BerlinOstap13451 BerlinNULL80807 NULLNULL144846 The table contains totals for the Brando and Ostap brands in the Amsterdam and Berlin warehouses and a grand total. The subtotal for both products in the Amsterdam warehouse is shown in the first row with the NULL brand value. It amounts to 64 039, the sum of the two previous rows. Next, you can see the totals for both brands in the Berlin warehouse. After that, there’s another line with a NULL brand value; this is actually the Berlin subtotal amounting to 80 807. The last row shows the grand total of all products in all warehouses, which is 144 846. Why are there NULL values in some rows? Because SQL doesn’t know how to name brands and warehouses when they’re grouped and a subtotal or grand total is shown. To find more fun details about other GROUP BY extensions, check out our GROUP BY Extensions course. This is the last advanced SQL query for now. Did You Find These Advanced SQL Topics Helpful? Advanced SQL as used by data scientists, data analysts, and others is a really broad topic. There are vast possibilities in the use of SQL for data analysis. This Advanced SQL track will give you an idea of what you can do with SQL as a data analyst. Read our guide on how to practice advanced SQL with our platform to discover more SQL practice opportunities at advanced level. In this article, I’ve tried to show you some of the most common uses of advanced SQL. The examples are practical and drawn from my experience, so I hope they were helpful. I’ve tried to make the queries as easy (and understandable) as possible. You can adapt these queries to your particular reports and data sets. Feel free to use them! If you have any questions or comments, let me know in the comments section! Tags: learn sql Advanced SQL Data Analysis