2nd May 2023 12 minutes read

Top 7 Advanced SQL Queries for Data Analysis

data analysis

Table of Contents

7 Advanced SQL Queries Data Analysts Should Know
Power Up Your Data Analysis with Advanced SQL Queries

Explore essential advanced SQL queries for data analysis.

Structured Query Language, or SQL, is an indispensable tool for data analysts. Most people pick up the language relatively quickly and can begin in-depth data analysis after just a few lessons. Because of this, many data analysts tend to stay in the beginner/intermediate level of using SQL. This level of understanding allows you to “get the job done”, but it might not be the most efficient way to write a query.

Mastering advanced SQL techniques can greatly enhance your data analysis capabilities, allowing you to delve deeper into your datasets and derive valuable insights. In this article, we will explore seven advanced SQL queries that can be useful for data analysis. For each example, we will present the query and its results, discuss the syntax used, and provide a brief explanation of how the database computes the result.

If you’d like to improve your SQL skills, be sure to check out our SQL Reporting track. It is a great way to learn and practice more advanced SQL functions.

7 Advanced SQL Queries Data Analysts Should Know

1. Grouping Data by Time Period

In this first example, let’s examine data in the table sales. This dataset includes an ID for each product, the date of the transaction, and the total amount of the transaction.

product_id	sale_date	amount
1	2021-01-01	100
2	2021-01-15	200
1	2021-02-01	300
2	2021-02-15	400
1	2022-01-10	200
1	2022-02-05	100
2	2022-01-27	200
2	2022-02-12	400

One very common data analysis scenario is to take raw data like this and aggregate it by specific date periods, such as month or year. Let’s go ahead and do just that by running the following query:

SELECT
  EXTRACT(YEAR FROM sale_date) AS year, 
  EXTRACT(MONTH FROM sale_date) AS month,
 SUM(amount) AS total_sales
FROM sales
GROUP BY year, month
ORDER BY year, month;

This query will return these results:

year	month	amount
2021	01	300
2021	02	700
2022	01	400
2022	02	500

In this query, we use the EXTRACT() function to pull the year and month from the sale_date field. We then group the data by the new year and month fields and calculate the total sales for each group using the SUM() function.

Note that we need to group by both year and month to get accurate results. If we were to only pull in the month, the results would combine the values for specific months across the years (so all Januaries from all years would be combined into one row, all Februaries would be combined together, etc). When we group the results by both year and month, the months from different years are put into separate rows.

We then order the rows by year and month to have ordered results.

Different database engines often have different and sometimes better functions to accomplish the same result, such as DATE_TRUNC() in PostgreSQL or TRUNC() in Oracle.

2. Create Multiple Grouping Levels Using ROLLUP

Sometimes you may want to group data by multiple levels in a single query. Let’s say you want to add the total sales (across all years) as well as the total sales in each year to the previous example. You can add additional rows for the grand total and totals for each year by using the ROLLUP keyword.

Using the same dataset, we’ll run this query:

SELECT
 EXTRACT(YEAR FROM sale_date) AS year,
 EXTRACT(MONTH FROM sale_date) AS month,
 SUM(amount) AS total_sales
FROM sales
GROUP BY ROLLUP(year,month)
ORDER BY year, month;

Result:

year	month	total_sales
2021	01	300
2021	02	700
2021	NULL	1000
2022	01	400
2022	02	500
2022	NULL	900
NULL	NULL	1900

In our example, ROLLUP adds three extra rows: one for the grand total of all sales and two for the total sales of each year. In our result set, the third row is the yearly total for 2021: the value in the year column is 2021 and the value in the month column is NULL. The sixth row is the total for 2022: the value in the year column is 2022 and the value in the month column is NULL. The final row is the total for both years: it has NULL in both the year and month columns. These rows were added by ROLLUP. The NULL values in the year and month columns indicate the aggregated rows.

3. Ranking Data Using Window Functions

Ranking data is a common requirement in advanced data analysis. A few business cases would be ranking your products by the highest sales to understand which products bring in the most revenue or ranking stores by the lowest sales to understand which stores are your lowest performers.

You can use window functions like RANK() or DENSE_RANK() to assign ranks to rows based on a specific column's value. First, let’s take a look at the RANK() function:

SELECT
  product_id,
  SUM(amount) AS total_sales,
  RANK() OVER(ORDER BY SUM(amount) DESC) AS rank
FROM sales
GROUP BY id;

Result:

product_id	total_sales	rank
2	1200	1
1	700	2

This query first calculates the total sales for each product_id and then ranks them in descending order. The RANK() function is a window function used to compute the ranking. The OVER() clause is the syntax used with window functions. The RANK() function assigns a unique rank to each row within a result set, based on specific ordering. The ORDER BY clause in the OVER() clause specifies the order in which the rank will be assigned, based on one or more columns. In our example we rank the rows by the total sales amount.

The same syntax can be used for the DENSE_RANK() function. The difference between the two is in the scenario where two values are equal. If two values tie for first place, RANK() will skip to 3 for the third row, while DENSE_RANK() would assign the third row as 2.

For more information about ranking functions, read our in-depth article Overview of Ranking Functions in SQL.

You can also filter the top X rows by using a common table expression (CTE) and a WHERE clause. For example, we can run the following to only pull our lowest-ranked row:

WITH sales_cte AS (
  SELECT
    product_id,
    SUM(amount) AS total_sales,
    RANK() OVER (ORDER BY SUM(amount) DESC) AS rank
  FROM sales
  GROUP BY id 
) 
SELECT * 
FROM sales_cte
WHERE rank = 1;

In this query, we’re defining a CTE called sales_cte, which gives the rank of each product_id based on its total sales. We then query this CTE, only pulling the rows where rank = 1. This allows us to only return the top row.

4. Computing the Difference (Delta) Between Rows

Sometimes you need to compare the values of consecutive rows. One example would be pulling a previous month’s sales to compare to the current month. Having both the current month’s sales and the previous month’s sales allows you to compute the difference in the two values.

You can use window functions like LEAD() and LAG() to achieve this. We’ll use a different table this time: sales_agg. This table has the aggregate sales for each year and month.

year	month	total_sales
2021	1	300
2021	2	700
2022	1	400
2022	2	500

Let's compute the month-over-month sales difference using LAG():

SELECT
  year, 
  month,
  total_sales,
  LAG(total_sales) OVER(ORDER BY year, month) AS previous_month,
  total_sales – LAG(total_sales) OVER(ORDER BY year, month) AS            sales_difference
FROM sales_agg
ORDER BY year, month;

This query uses the LAG() function to calculate the difference between the current month's sales and the previous month's sales:

year	month	total_sales	previous_month	sales_difference
2021	01	300	NULL	NULL
2021	02	700	300	400
2022	01	400	700	-300
2022	02	500	400	100

The LAG() function is a window function. It pulls the total_sales value from the previous row, ordered by year and month. We’re aliasing that column as previous_month.

Then the sales_difference column is calculated by taking the total_sales for each row and then subtracting the previous_month value. The first row has NULL values, since there is no previous row.

If you needed to pull the next month’s sales, you would use the window function LEAD() in place of LAG(). LAG() compares the current row to previous values; LEAD() compares the current row to subsequent values.

How to Calculate the Difference Between Two Rows in SQL gives multiple real-life examples of using this syntax to compute the difference between two rows.

5. Computing a Running Total

A running total calculates the sum of a sequence of numbers. It’s also known as a cumulative total or cumulative sum; it adds each new value to the previous total.

Running totals are useful in calculating the cumulative effect of data points over time. For example, you might want to calculate the cumulative number of users that have visited your website up until a certain time to understand your website’s user growth.

Let’s revisit the SUM() function and see how we could use it to calculate a running sum of sales. As with the previous example, we’ll need to use SUM() as a window function to achieve the desired results.

SELECT
  year,
  month,
  total_sales,
  SUM(total_sales) OVER(ORDER BY year, month) AS running_total
FROM sales_agg
ORDER BY year, month;

This running total is calculated using the SUM() function with the OVER() clause. This adds the total_sales from the current row to all previous rows in the specified order.

In the second row, the running total is calculated by aggregating the total_sales value from both the first and second row. In our example, the second row is for February of 2021. The running_total value is the sum of the first row (for January 2021) and the second row (for February 2021).

In the third row, the running total is calculated by aggregating the values from the first through third rows. This same pattern continues through each row.

year	month	total_sales	running_total
2021	01	300	NULL
2021	02	700	1000
2022	01	400	1400
2022	02	500	1900

For more information on computing running totals in SQL, check out our article What Is a SQL Running Total and How Do You Compute It?.

6. Computing a Moving Average

When looking at sales trends in data analysis, it is often helpful to use a moving average rather than each individual data point. A moving average (also known as a rolling average) calculates the average of the current value and a specified number of immediately preceding values.

This technique helps smooth out data and identify trends, especially when your data has high volatility. The main idea is to examine how these averages behave over time instead of examining the behavior of the original data points.

For example, you might need to analyze daily sales of a restaurant that has high sales on the weekend but low sales Monday through Wednesday. If you were to plot each individual data point, you would see very high values and very low values next to each other, making long-term trends more difficult to see. By using a 3-day moving average, you would take the average of the past 3 days, which evens out the highs and lows.

For this example, we’ll modify our sales_agg table.

year	month	total_sales
2021	01	300
2021	02	700
2021	03	500
2021	04	1000
2021	05	800
2021	06	600

Now let’s calculate a 3-month moving average of sales:

SELECT
  year, 
  month, 
  total_sales, 
  AVG(total_sales) OVER(ORDER BY year, month ROWS BETWEEN 2 PRECEDING and CURRENT ROW) AS moving_average
FROM aales_agg
ORDER BY year, month;

This query calculates the moving average using the AVG() function with the OVER() clause. The ROWS BETWEEN clause specifies the range of rows to include in the moving average calculation – in this case, the current row and the two preceding rows.

The result:

year	month	total_sales	moving_average
2021	01	300	300
2021	02	700	500
2021	03	500	500
2021	04	1000	733
2021	05	800	767
2021	06	600	800

Since there are no previous rows, the first row’s moving average is just the total sales value. The second row’s moving average is the average of 300 and 700. For the third row, we now have the 2 preceding rows, as defined in our query; the moving average is calculated by averaging 300, 700, and 500. This pattern continues for the rest of the rows.

What a Moving Average Is and How to Compute It in SQL is a great resource for more information on this topic.

7. Counting Elements in Custom Categories Using SUM() and CASE WHEN

You can count elements in custom categories by combining SUM() with CASE WHEN. You’d use this when you need to create business logic that doesn’t exist in your data. For example, you might want to group specific locations by custom regions and then compute metrics based on those regions.

CASE WHEN statements allow you to perform conditional logic in queries. The syntax is structured like this:

CASE 
  WHEN condition_1 THEN result_1
  WHEN condition_2 THEN result_2
END

The conditions are evaluated from top to bottom. The first condition that evaluates as true determines which result is returned.

In our example, let's create a sales category breakdown based on sales amounts:

SELECT
  SUM(CASE WHEN total_sales < 400 THEN 1 ELSE 0 END) AS low_sales,  
  SUM(CASE WHEN total_sales >= 400 AND total_sales < 700 THEN 1 ELSE 0 END) AS medium_sales,
  SUM(CASE WHEN total_sales >= 700 THEN 1 ELSE 0 END) AS high_sales
FROM sales_agg

In the above query, we use the CASE WHEN expression to categorize each month into low_sales, medium_sales, or high_sales categories. When the condition in the CASE WHEN expression is met, it returns a value of 1; otherwise, it returns 0.

The SUM() function is then used to add up the 1s and 0s for each category, effectively counting the number of months that fall into each category. The result is a single row with the counts of low_sales, medium_sales, and high_sales months.

low_sales	medium_sales	high_sales
1	2	3

This approach allows you to create custom categories and count the elements in each category with a single query. The combination of SUM() and CASE WHEN is versatile and can be adapted to different use cases.

If you want to use this construction in your query, How to Use CASE WHEN with SUM() in SQL will give you more details.

Power Up Your Data Analysis with Advanced SQL Queries

Mastering advanced SQL queries is essential for efficient and accurate data analysis. In this article, we've covered seven powerful SQL techniques that can help you gain deeper insights into your data.

By learning how to group data by time periods, use ROLLUP for multiple grouping levels, rank data with window functions, compute differences between rows, calculate running totals and moving averages, and count elements in custom categories, you'll be well-equipped to tackle complex data analysis tasks.

As you continue to hone your SQL skills, you'll find even more ways to leverage this powerful tool to unlock the full potential of your datasets and drive better decision-making in your organization.

For more practice and to gain an even deeper level of understanding on these topics, sign up for our SQL Reporting track. It is a great way to master these advanced SQL functions. Read our guide on how to practice advanced SQL online to discover more advanced SQL practice opportunities on our platform.

Tags:

data analysis