The SQL OVER() Clause Explained
This is a detailed guide to the OVER() clause in SQL window functions. I will focus specifically on this clause, so you should already have at least a general idea of how window functions work. To make the best of this article, take a look at the SQL Window Functions Cheat Sheet. Feel free to keep it by your side as we go.
What Is the SQL OVER() Clause?
Let’s start by answering the first question that springs to mind: what is an OVER() clause? It is what makes a function a window function.
The OVER() clause lets you define the windows for window functions. It also lets you define the order in which a given window function is executed.
The SQL OVER() Clause Syntax
The SQL syntax of this clause is as follows:
```sql
SELECT
  <column name>,
  <window function> OVER (
    [PARTITION BY <column names>]
    [ORDER BY <column names>]
    [<ROWS or RANGE clause>]
  )
FROM table;
```
The three distinct parts of the OVER() clause syntax are:
- PARTITION BY
- ORDER BY
- The window frame (ROWS or RANGE clause)
I’ll walk through each of these.
PARTITION BY
This SQL clause lets you define the window. It partitions a dataset into smaller segments called windows.
When PARTITION BY is omitted, the window function is applied to the whole dataset. Let’s use a small example dataset to see how it works:
id | date | sales | product |
---|---|---|---|
1 | 2021-12-06 | 1,357.22 | Ax |
2 | 2021-12-06 | 2,154.88 | Hammer |
3 | 2021-12-07 | 1,984.88 | Ax |
4 | 2021-12-07 | 3,147.11 | Hammer |
If I want to see total sales by product, using GROUP BY comes to mind. I can write a query like this:
```sql
SELECT
  product,
  SUM(sales) AS sales_per_product
FROM product_sales
GROUP BY product;
```
The result is:
product | sales_per_product |
---|---|
Ax | 3,342.10 |
Hammer | 5,301.99 |
GROUP BY works fine if I only want to see aggregate values for each group. However, if I want to see aggregate values for each group while preserving row-level data, I need to use PARTITION BY. For example:
```sql
SELECT
  product,
  date,
  sales,
  SUM(sales) OVER (PARTITION BY product) AS sales_per_product
FROM product_sales;
```
This query is just to show you what PARTITION BY does. Don’t worry; later on, I’ll show you example queries and explain them in detail. For now, take a look at the result of the query above:
product | date | sales | sales_per_product |
---|---|---|---|
Ax | 2021-12-06 | 1,357.22 | 3,342.10 |
Ax | 2021-12-07 | 1,984.88 | 3,342.10 |
Hammer | 2021-12-06 | 2,154.88 | 5,301.99 |
Hammer | 2021-12-07 | 3,147.11 | 5,301.99 |
Here, I have sales per product as in the previous example, but I also have sales data for every product and date. The partition in this example is the product. This means the ax is one window partition, while the hammer is another window partition. The window function sums all sales values for the ax; it then goes to the next window partition, resets, and sums all sales values for the hammer.
This illustrates the main difference between using GROUP BY and PARTITION BY: GROUP BY collapses rows and returns aggregate values in one row only, while PARTITION BY preserves all rows and returns aggregate values in as many rows as there are in the partition.
If you’d like to dig deeper into this topic, read our article dedicated to the difference between GROUP BY and PARTITION BY.
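As a side note, you can even recover the GROUP BY-style output from a window function. Here’s a minimal sketch, using the same product_sales table:

```sql
-- Deduplicating the partition-level values reproduces the GROUP BY result:
SELECT DISTINCT
  product,
  SUM(sales) OVER (PARTITION BY product) AS sales_per_product
FROM product_sales;
```

This works because DISTINCT is evaluated after the window functions.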
ORDER BY
You can probably guess what this SQL clause does. Yes, it orders something. But what?
Unlike the regular ORDER BY, the ORDER BY in an OVER() clause does not sort the query result. When used in window functions as part of an OVER() clause, ORDER BY defines how window functions are executed.
Both ASC and DESC may be used with it. ASC means the window function is executed in ascending order, e.g., A-Z, from the lowest to the highest number, from the oldest to the newest date, etc. If you use DESC, the window function is executed in the reverse order. If ORDER BY is omitted, the window function is executed in arbitrary order.
Whichever way you use it, keep in mind that ORDER BY in an OVER() clause has nothing to do with how the query output is sorted. You may still sort the result even when using window functions; you just have to write another ORDER BY clause at the end of your query, like you usually do.
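Here’s a minimal sketch of what that looks like, again using the product_sales table; note that the two ORDER BY clauses play completely different roles:

```sql
SELECT
  product,
  date,
  sales,
  -- This ORDER BY controls the order in which the window function
  -- is applied within each partition...
  SUM(sales) OVER (PARTITION BY product ORDER BY date) AS running_total
FROM product_sales
-- ...while this ORDER BY only sorts the final output.
ORDER BY sales DESC;
```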
The Window Frame
A window frame is a set of rows that are related to the current row. The current row is where the window function is being executed.
Window frames are defined using the ROWS or RANGE keyword. These keywords define the upper and lower bounds of a window frame, thus defining the frame itself. The difference between these two keywords is explained in our article about the RANGE clause.
The Default Window Frame
The default window frame depends on whether ORDER BY is omitted or not. There are two possible scenarios when ORDER BY is omitted:
- Using OVER (), omitting both PARTITION BY and ORDER BY.
- Using OVER (PARTITION BY …), omitting only ORDER BY.
In both cases, the default window frame is ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING. This means the window frame is all the rows of the partition (all rows before the current row, the current row, and all rows after the current row).
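In other words, these two window functions always return the same result; a sketch using the product_sales table from earlier:

```sql
-- Without ORDER BY, the frame defaults to the entire partition:
SELECT
  product,
  sales,
  SUM(sales) OVER (PARTITION BY product) AS implicit_frame,
  SUM(sales) OVER (
    PARTITION BY product
    ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
  ) AS explicit_frame  -- always equal to implicit_frame
FROM product_sales;
```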
In addition, there are two possible scenarios when ORDER BY is not omitted:
- OVER (ORDER BY …): PARTITION BY is omitted.
- OVER (PARTITION BY … ORDER BY …): neither PARTITION BY nor ORDER BY is omitted.
In these two cases, the default window frame is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. This means all values (not rows!) before the current row and the current row itself; rows that share the current row’s ORDER BY value (its peers) are also included in the frame.
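Again, spelling out the default makes the equivalence visible; a sketch with the product_sales table:

```sql
-- With ORDER BY, the frame defaults to RANGE ... AND CURRENT ROW:
SELECT
  product,
  date,
  sales,
  SUM(sales) OVER (PARTITION BY product ORDER BY date) AS implicit_frame,
  SUM(sales) OVER (
    PARTITION BY product
    ORDER BY date
    RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
  ) AS explicit_frame  -- always equal to implicit_frame
FROM product_sales;
```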
Meaning of Window Frame Abbreviations
While I’m at it, here’s a list of all possible upper and lower bounds for the window frame and what they mean:
Abbreviation | Meaning |
---|---|
UNBOUNDED PRECEDING | BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW |
n PRECEDING | BETWEEN n PRECEDING AND CURRENT ROW |
CURRENT ROW | BETWEEN CURRENT ROW AND CURRENT ROW |
n FOLLOWING | BETWEEN CURRENT ROW AND n FOLLOWING |
UNBOUNDED FOLLOWING | BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING |
If you want to see this in an example, here’s an article about defining a window frame in SQL window functions.
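To give you a quick taste here, the n PRECEDING bound is what a classic moving average is built on. A minimal sketch, assuming the product_sales table again:

```sql
-- Three-row moving average: the current row plus the two rows before it.
SELECT
  date,
  sales,
  AVG(sales) OVER (
    ORDER BY date
    ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
  ) AS moving_avg
FROM product_sales;
```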
Now that we have gone through the technical stuff, it’s time to explain the OVER() clause with examples.
The SQL OVER() Clause Examples
To understand the examples, you have to become familiar with the data first.
The Dataset Used
All four examples use the same dataset, which consists of three tables:
- stores
- products
- inventory
The table stores is a list of the stores selling tea. Here’s what it looks like:
id | store_name |
---|---|
1 | Zaandam |
2 | Amsterdam |
3 | Rotterdam |
4 | Utrecht |
The next table lists the tea producers and the products they make; it’s the products table:
id | product_brand | product_name |
---|---|---|
1 | Mr. Wellington | Earl Grey |
2 | Karnataka Tea Company | Kashmir Tchai |
3 | Fukuyama & Co. | Matcha |
4 | Fukuyama & Co. | Sencha |
The table inventory shows the quantities of tea available in every store. It has much more data than the first two tables, so here are just the first several rows:
id | inventory_date | opening_quantity | closing_quantity | product_id | store_id |
---|---|---|---|---|---|
1 | 2021-12-06 | 13 | 8 | 1 | 1 |
2 | 2021-12-07 | 8 | 1 | 1 | 1 |
3 | 2021-12-08 | 52 | 50 | 1 | 1 |
4 | 2021-12-09 | 50 | 44 | 1 | 1 |
5 | 2021-12-10 | 44 | 32 | 1 | 1 |
6 | 2021-12-06 | NULL | NULL | 2 | 1 |
7 | 2021-12-07 | NULL | NULL | 2 | 1 |
8 | 2021-12-08 | 28 | 20 | 2 | 1 |
9 | 2021-12-09 | 20 | 17 | 2 | 1 |
10 | 2021-12-10 | 17 | 17 | 2 | 1 |
As you can see, there are NULL values. They mean the tea was not available when the store opened or closed. In the data above, product_id = 2 in store_id = 1 wasn’t available for two days: 2021-12-06 and 2021-12-07.
Generally, the closing_quantity of a given day is the opening_quantity of the following day. If it’s not, it means a new order was delivered and accounted for before the store opened. This is the case, for example, between the rows with id = 2 and id = 3: a closing quantity of 1 on 2021-12-07 becomes an opening quantity of 52 on 2021-12-08.
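If you want to follow along in your own database, here’s a minimal sketch of the schema. The column types are my assumption; the dataset itself doesn’t specify them:

```sql
-- Hypothetical DDL for the example dataset; types are assumed.
CREATE TABLE stores (
  id         integer PRIMARY KEY,
  store_name varchar(50)
);

CREATE TABLE products (
  id            integer PRIMARY KEY,
  product_brand varchar(50),
  product_name  varchar(50)
);

CREATE TABLE inventory (
  id               integer PRIMARY KEY,
  inventory_date   date,
  opening_quantity integer,  -- NULL if unavailable at store open
  closing_quantity integer,  -- NULL if unavailable at store close
  product_id       integer REFERENCES products (id),
  store_id         integer REFERENCES stores (id)
);
```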
Ready for some OVER() clause examples? I bet you are!
Example 1: With PARTITION BY but Without ORDER BY
In this example, the task is to show the inventory date, the quantity at store open, and the quantity at store close. We also want the highest quantity at store close for each day regardless of the store or the product.
To do that, I’ve written this query:
```sql
SELECT
  inventory_date,
  opening_quantity,
  closing_quantity,
  MAX(closing_quantity) OVER (PARTITION BY inventory_date) AS daily_max
FROM inventory;
```
This query uses only PARTITION BY in the OVER() clause. What does this query do?
First, it selects three columns from the table inventory. Then I use the MAX() aggregate function to find the highest product quantity at store close. I use the OVER() clause to make this aggregate function a window function. In it, I partition the window by the column inventory_date. This means I get the highest product quantity at store close for each inventory date, across all stores and all products.
The table below shows all the data for the first date (December 6) and the first row of the second date (December 7) for simplicity. Of course, the query returns much more data.
inventory_date | opening_quantity | closing_quantity | daily_max |
---|---|---|---|
2021-12-06 | 128 | 51 | 397 |
2021-12-06 | NULL | NULL | 397 |
2021-12-06 | 87 | 87 | 397 |
2021-12-06 | 22 | 21 | 397 |
2021-12-06 | 13 | 8 | 397 |
2021-12-06 | 247 | 200 | 397 |
2021-12-06 | 42 | 39 | 397 |
2021-12-06 | 33 | 33 | 397 |
2021-12-06 | 44 | 41 | 397 |
2021-12-06 | 73 | 71 | 397 |
2021-12-06 | 97 | 94 | 397 |
2021-12-06 | 412 | 397 | 397 |
2021-12-06 | 39 | 33 | 397 |
2021-12-06 | 62 | 58 | 397 |
2021-12-06 | 49 | 44 | 397 |
2021-12-06 | 79 | NULL | 397 |
2021-12-07 | 58 | 54 | 327 |
The table lists the quantities at store open and store close for every store-product combination. The highest quantity at store close is in the column daily_max. As you can see, it is 397 for the date 2021-12-06. You can verify this is the highest closing value for that date; it appears in the row with an opening quantity of 412. The NULL values are shown, but they don’t impact the result.
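You can verify this NULL behavior with a quick standalone query (PostgreSQL syntax):

```sql
-- MAX() simply skips NULL inputs:
SELECT MAX(q) FROM (VALUES (8), (NULL), (397), (200)) AS t(q);
-- returns 397
```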
The data is presented the same way for every date in the dataset. You can see the next date is 2021-12-07, and the highest product quantity at store close for this date is 327. The same principle applies to the rest of the table.
Example 2: With Both PARTITION BY and ORDER BY
In this example, I’ll show you how to use both PARTITION BY and ORDER BY in an OVER() clause. To do that, I write a query that displays the product brand, the product name, and the inventory date. I also want to show the daily quantity sold. In addition, I want to see the daily cumulative quantity sold per brand across all stores.
You need to know how to calculate a running total for this example. Before showing you the query, let’s make sure you understand what a running total is. Here it is with the example data:
id | date | sales | running_total | product |
---|---|---|---|---|
1 | 2021-12-06 | 1,357.22 | 1,357.22 | Ax |
2 | 2021-12-06 | 2,154.88 | 3,512.10 | Hammer |
3 | 2021-12-07 | 1,984.88 | 5,496.98 | Ax |
4 | 2021-12-07 | 3,147.11 | 8,644.09 | Hammer |
A running total is the sum of the current row plus all the preceding rows. In the first row, it’s 1,357.22 + 0 = 1,357.22 because there are no preceding rows. For the next row, it is 2,154.88 + 1,357.22 = 3,512.10. The same logic applies to all remaining rows: 1,984.88 + 3,512.10 = 5,496.98 and 3,147.11 + 5,496.98 = 8,644.09.
Now let’s calculate the running total in SQL:
```sql
SELECT
  product_brand,
  product_name,
  inventory_date,
  opening_quantity - closing_quantity AS daily_sales,
  SUM(opening_quantity - closing_quantity) OVER (
    PARTITION BY product_brand
    ORDER BY inventory_date
  ) AS cumulative_sales_per_brand
FROM products p
JOIN inventory i ON p.id = i.product_id;
```
The query selects certain columns from the tables products and inventory. Then, I calculate the difference between the quantities at store open and store close as the column daily_sales. After that, this difference is used in the SUM() window function to get the sum of daily sales. The window is partitioned by the column product_brand because I want to see the sum of daily sales at the brand level. The ordinary SUM() aggregate function now becomes a window function that calculates the cumulative sum.
Finally, the window function is executed by inventory date in ascending order. Had I not used ORDER BY, the SUM() function would have been executed in arbitrary order, which is not much of a cumulative sum/running total.
Note: If neither ASC nor DESC is specified, ASC is the default.
The query returns the result below. Again, I’m only showing a handful of rows:
product_brand | product_name | inventory_date | daily_sales | cumulative_sales_per_brand |
---|---|---|---|---|
Fukuyama & Co. | Matcha | 2021-12-06 | 5 | 27 |
Fukuyama & Co. | Sencha | 2021-12-06 | 1 | 27 |
Fukuyama & Co. | Matcha | 2021-12-06 | 0 | 27 |
Fukuyama & Co. | Sencha | 2021-12-06 | 0 | 27 |
Fukuyama & Co. | Sencha | 2021-12-06 | NULL | 27 |
Fukuyama & Co. | Matcha | 2021-12-06 | 3 | 27 |
Fukuyama & Co. | Sencha | 2021-12-06 | 15 | 27 |
Fukuyama & Co. | Matcha | 2021-12-06 | 3 | 27 |
Fukuyama & Co. | Matcha | 2021-12-07 | 4 | 224 |
Fukuyama & Co. | Sencha | 2021-12-07 | 98 | 224 |
Fukuyama & Co. | Matcha | 2021-12-07 | 3 | 224 |
Fukuyama & Co. | Matcha | 2021-12-07 | 11 | 224 |
Fukuyama & Co. | Sencha | 2021-12-07 | 1 | 224 |
Fukuyama & Co. | Sencha | 2021-12-07 | 70 | 224 |
Fukuyama & Co. | Matcha | 2021-12-07 | 5 | 224 |
Fukuyama & Co. | Sencha | 2021-12-07 | 5 | 224 |
According to the result, the quantity sold of Fukuyama & Co. products is 27 for 2021-12-06. The total quantity sold on 2021-12-07 is 197. However, the quantity in the column cumulative_sales_per_brand is 224. Why is that? Because it’s a cumulative sum: it shows the total sold on that day plus the totals of all prior days. In other words, you get 224 if you sum all the daily_sales values for both dates (27 + 197 = 224).
This is done the same way for all other dates and brands.
Example 3: With PARTITION BY, ORDER BY, and a Window Frame
Now, say you want to show the store name, the inventory date, the brand, and the product name. Again, we will show the daily quantity sold, but this time with the average daily quantity sold by each store. This is a cumulative average that changes with every passing day and from the lowest to the highest daily quantity sold.
The general principles of calculating a cumulative average are shown in the table below:
id | date | sales | cumulative_average | product |
---|---|---|---|---|
1 | 2021-12-06 | 1,357.22 | 1,357.22 | Ax |
2 | 2021-12-06 | 2,154.88 | 1,756.05 | Hammer |
3 | 2021-12-07 | 1,984.88 | 1,832.33 | Ax |
4 | 2021-12-07 | 3,147.11 | 2,161.02 | Hammer |
A moving average is the average of a subset of rows. Generally, a moving average includes the current row, a certain number of rows or values preceding it, and/or a certain number of rows or values following it. In the example above, I calculate an average that includes the current row and all the rows preceding it. This is a type of moving average, but I’ll call it a cumulative average to make the distinction.
The cumulative average for the first row is 1,357.22. This is the same as the sales value of the row, because there are no rows before the first one. The calculation for the second row looks like this: (1,357.22 + 2,154.88)/2 = 1,756.05; for the next one: (1,357.22 + 2,154.88 + 1,984.88)/3 = 1,832.33. The last cumulative average is calculated by following the same principle: (1,357.22 + 2,154.88 + 1,984.88 + 3,147.11)/4 = 2,161.02.
Now, the cumulative average in SQL:
```sql
SELECT
  store_name,
  inventory_date,
  product_brand,
  product_name,
  opening_quantity - closing_quantity AS daily_sales,
  AVG(opening_quantity - closing_quantity) OVER (
    PARTITION BY store_name
    ORDER BY inventory_date, opening_quantity - closing_quantity
    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
  ) AS cumulative_avg_sales
FROM stores s
JOIN inventory i ON s.id = i.store_id
JOIN products p ON p.id = i.product_id;
```
I select several columns from all three tables. Then, I calculate the daily sales the same way as in Example 2. I use the AVG() aggregate function to calculate the average daily sales. I need the average by store, so I partition the window by the column store_name.
Then, I order the data in the partition by inventory date and daily sales using an ORDER BY clause. This means the window function calculates the average from the oldest to the newest date and, within a date, from the lowest to the highest daily sales. Why? Because I want to see my cumulative averages as they change.
Finally, the window frame is defined as ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. The average includes the current row and all the preceding rows (within the defined window partition, of course).
Take a look at the resulting table:
store_name | inventory_date | product_brand | product_name | daily_sales | cumulative_avg_sales |
---|---|---|---|---|---|
Zaandam | 2021-12-06 | Fukuyama & Co. | Matcha | 0 | 0.00 |
Zaandam | 2021-12-06 | Fukuyama & Co. | Sencha | 1 | 0.50 |
Zaandam | 2021-12-06 | Mr. Wellington | Earl Grey | 5 | 2.00 |
Zaandam | 2021-12-06 | Karnataka Tea Company | Kashmir Tchai | NULL | 2.00 |
Zaandam | 2021-12-07 | Fukuyama & Co. | Sencha | 1 | 1.75 |
Zaandam | 2021-12-07 | Mr. Wellington | Earl Grey | 7 | 2.80 |
Zaandam | 2021-12-07 | Fukuyama & Co. | Matcha | 11 | 4.17 |
Zaandam | 2021-12-07 | Karnataka Tea Company | Kashmir Tchai | NULL | 4.17 |
Zaandam | 2021-12-08 | Fukuyama & Co. | Matcha | 0 | 3.57 |
Zaandam | 2021-12-08 | Fukuyama & Co. | Sencha | 1 | 3.25 |
Zaandam | 2021-12-08 | Mr. Wellington | Earl Grey | 2 | 3.11 |
Zaandam | 2021-12-08 | Karnataka Tea Company | Kashmir Tchai | 8 | 3.60 |
Zaandam | 2021-12-09 | Fukuyama & Co. | Sencha | 2 | 3.45 |
Zaandam | 2021-12-09 | Karnataka Tea Company | Kashmir Tchai | 3 | 3.42 |
Zaandam | 2021-12-09 | Mr. Wellington | Earl Grey | 6 | 3.62 |
Zaandam | 2021-12-09 | Fukuyama & Co. | Matcha | 24 | 5.07 |
Zaandam | 2021-12-10 | Karnataka Tea Company | Kashmir Tchai | 0 | 4.73 |
Zaandam | 2021-12-10 | Fukuyama & Co. | Matcha | 0 | 4.44 |
Zaandam | 2021-12-10 | Fukuyama & Co. | Sencha | 1 | 4.24 |
Zaandam | 2021-12-10 | Mr. Wellington | Earl Grey | 12 | 4.67 |
Let me prove the calculation is correct by using the same principle you already know. The cumulative average for the first row is 0.00 because the value of this row is 0 and there are no preceding rows.
How about the second row? (0 + 1)/2 = 0.50. So far, so good! The third row: (0 + 1 + 5)/3 = 2.00.
That’s easy. It seems the same should work for the next row: treating the NULL as 0 gives (0 + 1 + 5 + 0)/4 = 1.5. But the table above says the cumulative average is 2.00, not 1.5.
Ha, SQL has made an error, right? No, you have made an error if you treated the NULL value as 0. When calculating averages, SQL simply ignores NULL values. So the calculation for this row is again (0 + 1 + 5)/3 = 2.00, which is exactly what the table shows.
How about the next row? You skip the row with the NULL value and calculate the cumulative average this way: (0 + 1 + 5 + 1)/4 = 1.75.
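If you want to double-check how AVG() treats NULLs, here’s a quick standalone sketch (PostgreSQL syntax):

```sql
-- AVG() ignores NULLs: it divides by the count of non-NULL values (3), not 4.
SELECT AVG(x) FROM (VALUES (0), (1), (5), (NULL)) AS t(x);
-- returns 2.00
```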
Example 4: Without PARTITION BY but With ORDER BY DESC
This time, I’ll show you how to use a window function without the PARTITION BY clause but with the data in the window sorted in descending order.
I want to rank the data by the quantity at store open in descending order. In addition to the rank, I want the result to show the following columns: inventory_date, product_brand, product_name, and opening_quantity.
Take a look:
```sql
SELECT
  inventory_date,
  product_brand,
  product_name,
  opening_quantity,
  DENSE_RANK() OVER (ORDER BY opening_quantity DESC) AS rank
FROM inventory i
JOIN products p ON i.product_id = p.id;
```
First, there are several columns from the tables inventory and products. I use the DENSE_RANK() function to rank the values. You could also use the RANK() or ROW_NUMBER() function, but the result may be slightly different.
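To see how they differ, here’s a standalone sketch (with hypothetical values) comparing the three ranking functions on tied values:

```sql
-- With two rows tied at 87:
--   ROW_NUMBER() numbers them differently (ties broken arbitrarily),
--   RANK() gives both the same rank but leaves a gap afterwards,
--   DENSE_RANK() gives both the same rank with no gap.
SELECT
  q,
  ROW_NUMBER() OVER (ORDER BY q DESC) AS row_num,
  RANK()       OVER (ORDER BY q DESC) AS rnk,
  DENSE_RANK() OVER (ORDER BY q DESC) AS dense_rnk
FROM (VALUES (412), (87), (87), (52)) AS t(q);
```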
In the OVER() clause, there’s an ORDER BY used with the DESC keyword, so the ranking goes from the highest to the lowest quantity at store open. No partition is defined, so the ranking is applied to the whole dataset.
Here are several rows from the data the query returns:
inventory_date | product_brand | product_name | opening_quantity | rank |
---|---|---|---|---|
2021-12-06 | Karnataka Tea Company | Kashmir Tchai | NULL | 1 |
2021-12-07 | Karnataka Tea Company | Kashmir Tchai | NULL | 1 |
2021-12-09 | Fukuyama & Co. | Sencha | NULL | 1 |
2021-12-08 | Karnataka Tea Company | Kashmir Tchai | 820 | 2 |
2021-12-09 | Karnataka Tea Company | Kashmir Tchai | 809 | 3 |
2021-12-09 | Karnataka Tea Company | Kashmir Tchai | 704 | 4 |
2021-12-06 | Fukuyama & Co. | Sencha | 412 | 5 |
2021-12-07 | Fukuyama & Co. | Sencha | 397 | 6 |
2021-12-08 | Fukuyama & Co. | Sencha | 327 | 7 |
As you can see, all NULL values are ranked 1. This is because I’m using PostgreSQL, which puts NULL values at the top when data is sorted in descending order. Since the first three rows have the same value, they share the same rank.
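If you’d rather have the NULLs at the bottom of the ranking, PostgreSQL lets you override this default explicitly:

```sql
-- NULLS LAST overrides the NULLS FIRST default implied by DESC:
SELECT
  inventory_date,
  product_brand,
  product_name,
  opening_quantity,
  DENSE_RANK() OVER (ORDER BY opening_quantity DESC NULLS LAST) AS rank
FROM inventory i
JOIN products p ON i.product_id = p.id;
```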
The quantity 820 is ranked 2, while the quantity 809 is ranked 3. This goes on in the same way throughout the dataset.
To show that the same rank is allocated not only when the values are NULL, here are two rows you don’t see above:
inventory_date | product_brand | product_name | opening_quantity | rank |
---|---|---|---|---|
2021-12-06 | Fukuyama & Co. | Matcha | 87 | 19 |
2021-12-07 | Fukuyama & Co. | Matcha | 87 | 19 |
As you see, the rows with duplicate values are ranked the same.
Ranking functions can be very useful. I recommend you learn how to use all the ranking functions with PARTITION BY, too.
Learning the SQL OVER() Clause Is Not Yet Over
My explanation of the OVER() clause is done, but that doesn’t mean your learning and practice are over. Now that you understand PARTITION BY, ORDER BY, and the window frame, continue to build on that knowledge.
If you like reading about window functions, here’s a list of the best articles about window functions. If practice is your thing, here’s an article showing you several window function examples.
The best combination of learning through reading and practicing is our Window Functions Course. It’s part of the Advanced SQL Track, which, aside from window functions, covers topics such as GROUP BY extensions and recursive queries. Check it out!