Back to articles list Articles Cookbook
12 minutes read

SQL Window Functions for Managers: Who Gets a Raise?

Window functions are one of the most powerful features in modern SQL. In this article, you will learn why window functions are so great. I will list you the most useful window functions and briefly explain when and how you can use them.

Simple SELECT statements, in combination with WHERE, GROUP BY, and HAVING, are sufficient for many analyses at work. I used standard SQL for many years before I realized there is something more.

A few years ago, I started learning about a great “modern” feature: window functions. And I have been using them ever since. Once you master the syntax and realize how great window functions are, you will use them often.

In this article, I'm going to show you a list of window functions, why they’re so great, and how you can use them. Afterward, I will dive deep into examples.

What Are Window Functions?

What Are Window Functions?

Window functions are a special feature in SQL that allows you to do calculations across a set of rows. They are also called OVER functions or analytics functions. You might be thinking, “Hey, this is similar to a GROUP BY statement.” Well, there is a huge difference between them.

When using a basic select-group, we do calculations and display them on a group level. The rows are collapsed and presented in the result set.

When using a window function, there is no collapsing. Each record gets its own calculation. For each row in a table, we define its window frame and do a specific calculation.

Let’s explain this with an example. Below is the kpi_new_clients table that contains information about salesmen’s KPI (key performance indicator), a measure for success, each quarter:

salesmanperiodnew_clients
Olivia SmithQ19
Olivia SmithQ27
Olivia SmithQ312
Olivia SmithQ410
Lily JonesQ17
Lily JonesQ25
Lily JonesQ38
Lily JonesQ46
Alfred BrownQ14
Alfred BrownQ26
Alfred BrownQ311
Alfred BrownQ49
Sonny LeeQ110
Sonny LeeQ210
Sonny LeeQ311
Sonny LeeQ411

The table contains three columns:

  • salesman — first and last name of the salesmen
  • period — Q1, Q2, Q3, and Q4 represent each quarter in the year
  • new_clients — number of new clients that each salesman acquired in one quarter

Now, imagine you are the boss and you want to see an additional column, benchmark, that shows the best result achieved by the salesmen in each quarter. Once you have a benchmark assigned to each record, you can easily see how close each salesman was. This calculation can be easily done using window functions:

SELECT *,
      MAX(new_clients) OVER (PARTITION BY period) AS benchmark 
FROM kpi_new_clients;

Once you run this query, the SQL engine will display an entire dataset with the additional column, benchmark. This column represents the maximum value of new clients for a specific quarter:

Table

Using a window function for the benchmark calculation.

Let’s briefly go through the syntax:

  • OVER denotes that this is a window function. Due to this keyword, sometimes window functions are also called OVER functions.
  • PARTITION BY tells us how the rows are grouped into logical chunks/groups. In our example, the rows are grouped on a period level, which means that for each quarter, we will separately calculate a maximum value.

Now that you are familiar with the syntax, I will briefly describe why window functions are so useful and go over the most commonly used ones.

Why Are Window Functions Useful?

By using window functions, you can perform certain complex business calculations with only a few lines of code. This is especially helpful with time series data analysis and calculations like moving averages, running totals, rankings, etc. This is nicely described in our articles When to use SQL window functions and SQL Course of the Month – Window Functions.

Also, by using window functions, you will have much cleaner and more readable code that is easier to maintain. I will show you what I mean with several examples later on.

Now, let's dive deep into the window functions list.

Window Functions List

Before I introduce the window functions list, I want to explain the main types of window functions:

  • Aggregate functions — These are regular aggregate functions that you have probably used with GROUP BY. However, they can also be used with OVER(). Unlike regular aggregations used in combination with GROUP BY, when they are used with OVER(), rows are not collapsed. Each record gets its own calculated values. This group of functions represents sum, avg, min, max, and count.
  • Ranking window functions — These are used to assign a rank or row number to each record inside a partition. The most famous functions in this group are rank(), dense_rank(), and row_number().
  • Positional window functions — Functions like first_value, last_value, lead, and lag return a single value from a particular row in each window frame (there are no aggregations). This “value” can be the value of the first/last record in each window frame, or it can return a value from the previous row or from the next row (lead/lag).
  • Distribution functions — In this group, there are two famous functions: cume_dist and percent_rank. Both calculate where each row value stands in a group of other values inside the same group/partition/window frame.

Okay, now that you are aware of the function types, it is time to introduce the most famous window functions. Below is the window function list:

Okay, now that you are aware of the function types, it is time to introduce the most famous window functions. Below is the window function list:

FunctionCategoryDescription
SUM()AGGREGATIONreturns total sum of all input values for each partition/window frame
AVG()AGGREGATIONreturns average value of all input values for each partition/window frame
MIN(), MAX()AGGREGATIONreturns min or max value among all input values for each partition/window frame
COUNT()AGGREGATIONreturns total number of input values for each partition/window frame
CUME_DIST()DISTRIBUTIONreturns cumulative distribution of input values for each partition/window frame (it is calculated by the formula: total number of rows that are less or equal to current value divided by the total number of rows in the partition/window frame)
PERCENT_RANK()DISTRIBUTIONreturns relative percent rank of a given row (similar to cume_dist, based on the number of rows in the group that have a lower value than the current row)
ROW_NUMBER()RANKINGassigns a sequential integer to each row within the partition of a result set
RANK()RANKINGassigns a rank to each value within the group/partition (values need to be ordered so that ORDER BY can define an order; if two values are the same, they receive the same rank)
DENSE_RANK()RANKINGAssigns the same rank to rows with equal values for the ranking criteria (very similar to rank function)
NTILE()RANKINGAssigns to each value in a list a number that represents a bucket or group (you specify a number of buckets, and ntile() determines what belongs to which bucket)
FIRST_VALUE()POSITIONALreturns a value of the expression for the first row according to the given order (it's different from max)
LAST_VALUE()POSITIONALreturns the value of the expression for the last row according to the given order
LEAD()POSITIONALassigns a value to each row in a table that is stored in a row after the current one (next row value)
LAG()POSITIONALassigns a value to each row in a table that is stored in a row before the current one (previous row value)

Now, it is time to see window functions in action. Let's go through several examples.

The Total Number and Mean of New Clients for Each Salesman

We could display the total number of new clients and mean value for each salesman in one year (Q1, Q2, Q3, and Q4) by using GROUP BY syntax:

SELECT salesman,
      avg(new_clients) AS average,
      sum(new_clients) AS total
FROM kpi_new_clients group by salesman;

You are probably now asking yourself, “Why should I use window functions if I can do the calculation by using GROUP BY? Why not stick with the regular syntax?” Well, in this case, window functions are great for adding averages and total sums to each input row as an additional attribute without collapsing.

With GROUP BY, you will get collapsed results:

salesmanaveragetotal
Alfred Brown7.530
Lily Jones6.526
Olivia Smith9.538
Sonny Lee10.542

You got four rows. If you would like to see the averages and totals as additional values, use window functions. See the query below:

SELECT *,
  new_clients,
  avg(new_clients) OVER (PARTITION BY salesman) AS average,
  sum(new_clients) OVER (PARTITION BY salesman) AS total
FROM kpi_new_clients;

Once you run this query, you will see the whole table with two additional columns:

Table

With a result presented like this, you can easily check in which quarter each salesman was below or above his/her average and answer similar questions. Our data set is now enriched with additional values that give us better insights.

Salesman Ranking per Each Quarter

We could analyze our dataset further by ranking our salesmen. Who was the best in each quarter? Who has acquired the most and who has not been successful in acquiring new clients?

To answer these questions, you can use window ranking functions—row_number, rank, and dense_rank. These analytics functions work similarly— rank is assigned to each record in a dataset by ordering records according to some value.

In this case, we are interested in who acquired the most clients. So, it is important that the column new_clients is ordered in descending order. Here is the SELECT statement:

SELECT *,
      ROW_NUMBER() OVER (PARTITION BY period ORDER BY new_clients desc) AS rownumber_salesman,
      RANK() OVER (PARTITION BY period ORDER BY new_clients desc) AS rank_salesman,
      dense_rank() OVER (PARTITION BY period ORDER BY new_clients desc) AS dense_rank_salesman
FROM kpi_new_clients;

Once you run this statement, the following output will be displayed on the screen:

salesmanperiodnew_clientsrownumber_salesmanrank_salesmandense_rank_salesman
Sonny LeeQ110111
Olivia SmithQ19222
Lily JonesQ17333
Alfred BrownQ14444
Sonny LeeQ210111
Olivia SmithQ27222
Alfred BrownQ26333
Lily JonesQ25444
Olivia SmithQ312111
Alfred BrownQ311222
Sonny LeeQ311322
Lily JonesQ38443
Sonny LeeQ411111
Olivia SmithQ410222
Alfred BrownQ49333
Lily JonesQ46444

Here is a short code explanation:

  • In this example, we created three additional columns for the results of the window functions: rownumber_salesman, rank_salesman, and dense_rank_salesman.
  • The rows are grouped into partitions by quarters so that each ranking will be done on each partition separately.
  • row_number, rank, and dense_rank assigned 1 for the salesman with the highest number of new clients in a given period, two for the second-highest, and so on.
  • We used OVER in a combination with ORDER BY. ORDER BY is used for partition ordering (we sorted the records inside each partition in a specified order). In our example, each partition was sorted by the column new_clients in descending order. After, row ordering ranking was applied.

Notice that all three analytics functions worked in almost the same way. The only difference lies in the records that have the same value in new_clients.

Let's take a look at the numbers from the third quarter:

Let's take a look at the numbers from the third quarter:

salesmanperiodnew_clientsrownumber_salesmanrank_salesmandense_rank_salesman
Olivia SmithQ312111
Alfred BrownQ311222
Sonny LeeQ311322
Lily JonesQ38443

In the third quarter, Alfred and Sonny acquired the same number of clients?: 11. Olivia did better, and Lily acquired the least. How do we want to rank if more records have the same value in new_clients?

If we want Alfred and Sonny to get the same rank, we should use rank or dense_rank. These two functions assign the same rank to equal values. If we want each salesman to get a different rank, we should use row_number.

What about Lily if Alfred and Sonny share second place? Do we want to give her third or fourth place? If third, we should use dense_rank. If fourth, we should use rank.

This is the main difference between rank and dense_rank. dense_rank does not skip any ranks if there is a tie in the preceding records.

More details about ranking functions can be found in our article: Overview of Ranking Functions in SQL. Take a look if you are keen to learn more about ranking.

Grouping Salesmen by Their Performance

Another useful function is NTILE. This window function breaks ordered records into a specified number of equal groups.

So, we could form groups or buckets of salesmen by their performance. We could do this as follows:

WITH total_by_quarter AS (
SELECT 
   salesman,
   sum(new_clients) AS new_clients
FROM kpi_new_clients 
GROUP BY salesman
)
SELECT 
  *,
  NTILE(2) OVER (ORDER BY new_clients desc) AS bucket_performance 
FROM total_by_quarter;

Here is the code explanation:

  • We used CTEs (common table expressions). total_by_quarter is the temporary result set that stores the total number of acquired clients in the year (Q1, Q2, Q3, and Q4 merged) for each salesman.
  • Each salesman belongs to one bucket/group. This is stored in the column bucket_performance which is created by using NTILE.
  • NTILE takes one argument?: ?the number of buckets. In our case, this argument is set to 2, which means that two salesmen groups are created.

The result of our query looks like this:

Table

Sonny and Olivia acquired more new clients, so they are in bucket 1. Alfred and Lily are in bucket 2. The groups are equal in size (they both have two salesmen), and they are formed according to the ordered new_clients variable.

Number of Acquired Clients From Previous Quarter

In the last example, I’m going to introduce a positional function. We are going to use lag to return the value of new_clients from the previous quarter for each salesman separately.

Here is the code:

SELECT 
  *,
  LAG(new_clients) OVER (PARTITION BY salesman ORDER BY period) 
     AS previous_period_new_clients 
FROM kpi_new_clients;

Keep this in mind:

  • Inside the OVER clause, we used PARTITION BY (we want to group the records for each salesman separately). For each salesman, the rows are ordered by period column.
  • To each current row, we assigned the value of the previous one (the lag function does this).
  • Some rows are populated with null. These are rows for which the previous row value is unknown.
Table

Similarly, the lead function can be used as well. lead takes values from the following row.

Summary

In this article, you learned about window functions — why they are useful and when you could use them. I gave you a list of window functions, and through several examples, I explained the syntax and how you can use them at work.

If you want to learn more, I suggest taking the Window Functions course available on our LearnSQL.com platform. This course is interactive, which means it contains a lot of examples for you to practice with.

If you are working with MS SQL or PostgreSQL, I suggest Window Functions in MS SQL Server or Window Functions in PostgreSQL. All three courses mentioned above contain the same content, except that they are for different SQL engines.

There is also a cool cheat sheet that you can use while practicing: SQL Window Functions Cheat Sheet. It is a good resource for learning window functions.