2nd Oct 2020 17 minutes read A Comprehensive Introduction to Window Functions in MySQL Martyna Sławińska window functions Table of Contents Overview of MySQL Window Functions MySQL Window Function Syntax Window Functions vs. GROUP BY Learn MySQL Window Functions by Example Example 1: The Ranking Window Functions Example 2: The Value Window Functions Example 4: NTILE() and CUME_DIST() Window Functions Going Deeper into MySQL Window Functions Do a deep dive into the working principles, syntax, and applications of various MySQL window functions. There are many open source databases you could choose; MySQL is consistently one of the most popular. But until 2018, there was no provision for including window functions in MySQL. Fortunately, all that has changed – starting from MySQL 8.0, SQL window functions were now available to MySQL users. Please note that window functions are available only in MySQL 8.0. As of the time of this writing, there is no plan to add SQL window functions to MySQL’s earlier releases. Why is the advent of window functions in MySQL so important? They are a powerful data analysis and reporting tool. It’s essential to learn window functions, especially if you work with earlier releases of MySQL and are planning to upgrade to MySQL 8.0. In this article, we’ll go through MySQL’s available window functions and explain their working principles. Next, we’ll compare window functions and aggregate functions. Finally, we’ll examine various sample queries (including their input and output tables) so you can see for yourself how window functions work. Overview of MySQL Window Functions The table presents the available window functions in MySQL along with a brief explanation of each: Window FunctionDescription RANK()Returns the rank of the current row within a defined partition. If one or more rows share the same ranking value, some rank numbers will be omitted from the sequence (e.g. if there are two rows tied for second rank, the rank sequence will be 1, 2, 2, 4…). DENSE_RANK()Returns the rank of the current row within a defined partition. The difference between DENSE_RANK() and RANK() is that DENSE_RANK() does not skip any sequence numbers. (e.g. if there are two rows tied for second rank, the rank sequence will be 1, 2, 2, 3, 4…). ROW_NUMBER()Returns the number of the current row within a defined partition. PERCENT_RANK()Returns a number between 0 and 1 (both inclusive) that indicates the percentage of the current partition values that are less than the current row value of that particular partition. LEAD()Looks up a subsequent row value for the specified column within the current partition. LAG()Looks up a previous row value for a specified column within the current partition. FIRST_VALUE()Returns the value of a specified column for the first row within the current partition. LAST_VALUE()Returns the value of a specified column for the last row within the current partition. NTH_VALUE()Returns the value of a specified column for the nth row within the current partition, where n is defined by the user. NTILE()Divides rows within the current partition into buckets. The number of buckets is specified by the user. The NTILE() function then assigns the number of the bucket to each row. CUME_DIST()Returns the cumulative distribution of a value within a defined partition. For a quick overview of SQL window functions, please see our Window Functions Cheat Sheet. MySQL Window Function Syntax Window function syntax is as follows: The window function (red box) is declared first, followed by the OVER clause (blue box). The OVER clause determines how the rows are arranged and then processed by the window function. Inside the OVER clause are the PARTITION BY and ORDER BY clauses (green boxes). The optional PARTITION BY clause divides window columns into groups (partitions), as you’ll see in subsequent examples. The ORDER BY clause – also optional – orders the resulting table by the values in a user-selected column. Depending on the particular window function you choose, you’ll use either PARTITION BY, ORDER BY, neither, or both. Finally, there’s the closing OVER bracket (blue box). Here’s an example of the above syntax at work: Next, we’ll compare the window functions with the GROUP BY clause. This will help you get a better general understanding of the concept behind window functions. Window Functions vs. GROUP BY To truly understand window functions, let’s briefly compare window functions and aggregate functions. In aggregate functions, input table rows are summarized, so several (or many) rows are collapsed into one summary row. Let’s see an example. To do this, we'll need the table CARS_SOLD: SaleIdCarCountrySold 1AudiGermany120 2AudiUSA110 3AudiJapan100 4BMWGermany250 5BMWUSA200 6BMWJapan200 7FordGermany260 8FordUSA300 9FordJapan200 When running this query, which uses GROUP BY with the aggregate function SUM(), the number of rows in the result table is less than in the original input table. This is because the rows are treated as the operands in the addition operation. It returns one summary row for each car type. SELECT Car, SUM(Sold) FROM CARS_SOLD GROUP BY Car; And the result table: CarSUM(Sold) Audi330 BMW650 Ford760 On the other hand, the number of rows in the input and output tables is the same when we use a window function: SELECT Car, Country, Sold, SUM(Sold) OVER (PARTITION BY Car) as SoldBrand FROM CARS_SOLD; The result is: CarCountrySoldSoldBrand AudiGermany120330 AudiUSA110330 AudiJapan100330 BMWGermany250650 BMWUSA200650 BMWJapan200650 FordGermany260760 FordUSA300760 FordJapan200760 The summary is returned in the SoldBrand column, where the total value of sold cars per brand is presented. Thus, the returned number of records equals the total number of rows in the input table. You may notice that the PARTITION BY clause in this example plays the same role as the GROUP BY clause in the previous query: it groups rows that have the same values in the Car column. For more information on the value of window functions, see our article Why Should I Learn SQL Window Functions? or check out our course on Window Functions. Be sure to read our article SQL Course of the Month - Window Functions first to find out what to expect from the course. Now that we’ve seen the difference between window and aggregate functions, let’s start exploring MySQL window functions. Learn MySQL Window Functions by Example This section will introduce you to the power of MySQL window functions. We’ll examine each function on its own and explain what it does. The tables used throughout the examples are presented below. Table CONTINENTS ContIdContinent 1America 2Europe 3Asia Table COUNTRIES CountryIdCountryNameContinentTotalCountrySales 1USA190 2Germany295 3France280 4Japan360 5Italy275 6Sweden285 Table CAR_MAKERS IdMakerProductionYearCountryPrice 2Volkswagen2015270000 3BMW20152100000 5Ford Motor Company2015180000 7Citroen2015350000 8Nissan Motors2017450000 9Fiat2017540000 11Honda2017440000 13Daimler Benz20172110000 14Opel2017265000 15Peugeot2019355000 16Renault2019360000 19Toyota2019445000 21Volvo2019675000 Example 1: The Ranking Window Functions Let’s look at some examples presenting the ranking window functions. RANK() Window Function The RANK() window function assigns a ranking value to each row within the defined partition. It reinitializes the rank to start from 1 when the partition is switched. RANK() skips sequence numbers if the row value is repeated, i.e. the same rank is given to rows with the same values. It’s easier to understand the working principle of RANK() when we look at the example below. SELECT Maker, Price, RANK() OVER (ORDER BY Price) as RankValue FROM CAR_MAKERS; MakerPriceRankValue Fiat400001 Honda400001 Toyota450003 Citroen500004 Nissan Motors500004 Peugeot550006 Renault600007 Opel650008 Volkswagen700009 Volvo7500010 Ford Motor Company8000011 BMW10000012 Daimler Benz11000013 In this example, we are ordering the result using the Price column and treating all table rows as one partition. The first row is assigned to the rank value 1. In the second row, the rank value is 1 again because the Price column value is the same for the first and second row. In the third row, the rank value is 3 (because of the skipped number due to the repeated Price column values); the value is 3 because this is the third row. In the fourth row, the rank value is 4 as expected. Once again, two rows share the same Price value, and thus the rank 4 is repeated. This principle is applied throughout all the rows in the example. Next, let’s look at an example of the RANK() window function that partitions the result by the ProductionYear column and then orders each partition according to the Country column: SELECT Maker, ProductionYear, Price, RANK() OVER (PARTITION BY ProductionYear ORDER BY Price) as RankValue FROM CAR_MAKERS; MakerProductionYearPriceRankValue Citroen2015500001 Volkswagen2015700002 Ford Motor Company2015800003 BMW20151000004 Fiat2017400001 Honda2017400001 Nissan Motors2017500003 Opel2017650004 Daimler Benz20171100005 Toyota2019450001 Peugeot2019550002 Renault2019600003 Volvo2019750004 We placed the PARTITION BY clause inside the OVER() clause. This divides the result into 3 groups: Group 1 with a ProductionYear of 2015. Group 2 with a ProductionYear of 2017. Group 3 with a ProductionYear of 2019. Similar partitioning by the ProductionYear column takes place in the upcoming examples of this article. In this example, each group (partition) has its rank values started from 1. The rank values are assigned using the same working principle as described in the previous example. DENSE_RANK() Window Function The DENSE_RANK() window function is very similar to the RANK() function. The only difference is that it does not skip any numbers in the rank sequence. Let’s examine the example below to see the difference: SELECT Maker, ProductionYear, Price, DENSE_RANK() OVER (PARTITION BY ProductionYear ORDER BY Price) as DenseRankValue FROM CAR_MAKERS; MakerProductionYearPriceDenseRankValue Citroen2015500001 Volkswagen2015700002 Ford Motor Company2015800003 BMW20151000004 Fiat2017400001 Honda2017400001 Nissan Motors2017500002 Opel2017650003 Daimler Benz20171100004 Toyota2019450001 Peugeot2019550002 Renault2019600003 Volvo2019750004 In the seventh row of the above output table, the DenseRankValue column value is 2 – although it is in the third row of its partition. The DENSE_RANK() function does not skip the rank value 2 , which was not used in the second row due to row value repetitions. On the other hand, RANK() would use the rank value 3 in the third row of the red partition. ROW_NUMBER() Window Function The ROW_NUMBER() window function does exactly what its name says. It assigns a row number to each record within the partition; it reinitializes row numbers to start from 1 when the partition is switched. First, let’s look at an example that does not use the PARTITION BY clause within the OVER() clause. Please note that although the OVER() clause is empty, you still have to define it with the window function: SELECT CountryName, Continent, ROW_NUMBER() OVER () as RowNumberValue FROM COUNTRIES; CountryNameContinentRowNumberValue USA11 Germany22 France23 Japan34 Italy25 Sweden26 As we do not partition the result set, the output is very straightforward. The rows are numbered from 1 to 6. Once we partition the output table, there will be more than one set of row numbers. Please note that the ORDER BY clause is optional in both of these examples. SELECT Maker, ProductionYear, Price, ROW_NUMBER() OVER (PARTITION BY ProductionYear ORDER BY Price) as RowNumberValue FROM CAR_MAKERS; MakerProductionYearPriceRowNumberValue Citroen2015500001 Volkswagen2015700002 Ford Motor Company2015800003 BMW20151000004 Fiat2017400001 Honda2017400002 Nissan Motors2017500003 Opel2017650004 Daimler Benz20171100005 Toyota2019450001 Peugeot2019550002 Renault2019600003 Volvo2019750004 Notice that rows are numbered from 1 for each of the partitioned groups. PERCENT_RANK() Window Function The PERCENT_RANK() window function returns the value from 0 to 1 (both inclusive), which indicates the percentage of current partition rows with a value less than the current row value. This will become clearer with an example. Have a look: SELECT Maker, ProductionYear, Price, PERCENT_RANK() OVER (PARTITION BY ProductionYear ORDER BY Price) as PercentValue FROM CAR_MAKERS; MakerProductionYearPricePercentValue Citroen2015500000 Volkswagen2015700000.3333333333333333 Ford Motor Company2015800000.6666666666666666 BMW20151000001 Fiat2017400000 Honda2017400000 Nissan Motors2017500000.5 Opel2017650000.75 Daimler Benz20171100001 Toyota2019450000 Peugeot2019550000.3333333333333333 Renault2019600000.6666666666666666 Volvo2019750001 Let’s go through the output table starting from the first row. In the first partition, there is no Price column value that is smaller than 50000. Hence, the PercentValue column value is 0, i.e. 0% of the current partition’s Price values are smaller than the current row’s Price value. In the case of the second row, 1/3 of the Price values in the current partition are smaller than 70000. So, the PercentValue value of this row is 0.333(or 33%). In the last row of this partition, we see that all the Price values are smaller than 100000. Thus, the PercentValue value is 1, i.e. 100% of the current partition’s Price values are smaller than the current Price value. You can apply the same working principle to go through the remaining two partitions of the above example. Want to know more about ranking window functions? The article Overview of Ranking Functions in SQL will help you out. Example 2: The Value Window Functions Below are some usage examples of MySQL’s value window functions, which are very useful for data analysis. LEAD() Window Function The LEAD() window function allows us to look up the values of subsequent rows in the current partition. It is commonly used to calculate the difference between current and following row values. First, let’s look at the example that simply uses the LEAD() function to output the value of the row after the current one: SELECT Maker, ProductionYear, Price, TotalCountrySales, LEAD(TotalCountrySales, 1, 0) OVER (PARTITION BY ProductionYear ORDER BY Price) as LeadValue FROM CAR_MAKERS cm JOIN COUNTRIES c ON cm.Country = c.CountryID; MakerProductionYearPriceTotalCountrySalesLeadValue Citroen2015500008095 Volkswagen2015700009590 Ford Motor Company2015800009095 BMW2015100000950 Honda2017400006075 Fiat2017400007560 Nissan Motors2017500006095 Opel2017650009595 Daimler Benz2017110000950 Toyota2019450006080 Peugeot2019550008080 Renault2019600008085 Volvo201975000850 The LEAD() window function takes three arguments: The column value to be returned – in this case, it is the TotalCountrySales column. The row number (relative to the current row) to be looked up – here, LEAD() looks up the value of the row immediately after the current row. A default value that’s returned if there is no subsequent row – in the current example, the LeadValue in the fourth row is 0 because there are no more rows in this partition. In the output, the first row has a LeadValue equal to 95 – the TotalCountrySales of the second row. The second row has a LeadValue of 90, the TotalCountrySales of the third row, and so on. As I already mentioned, the LeadValue of the fourth row is 0 (the default value we defined); although there are subsequent rows in the table, the fourth row is the last row in the current partition. This process is repeated for each partition of the resulted table. Now let’s try to calculate the TotalCountrySales difference between the current and subsequent rows. SELECT Maker, ProductionYear, Price, TotalCountrySales, LEAD(TotalCountrySales , 1, 0) OVER (PARTITION BY ProductionYear ORDER BY Price) as LeadValue, TotalCountrySales - (LEAD(TotalCountrySales , 1, 0) OVER (PARTITION BY ProductionYear ORDER BY Price)) as Difference FROM CAR_MAKERS cm JOIN COUNTRIES c ON cm.Country = c.CountryID; MakerProductionYearCountryTotalCountrySalesLeadValueDifference Citroen2015500008095-15 Volkswagen20157000095905 Ford Motor Company2015800009095-5 BMW201510000095095 Fiat201740000756015 Honda20174000060600 Nissan Motors2017500006095-35 Opel20176500095950 Daimler Benz201711000095095 Toyota2019450006080-20 Peugeot20195500080800 Renault2019600008085-5 Volvo20197500085085 Except for the Difference column, this example is similar to the previous one. This column stores the difference between the TotalCountrySales value of the current row (column TotalCountrySales) and the TotalCountrySales value of the next row (column LeadValue). Please note that the LEAD() window function is not limited to getting the value from the immediate next row. Let’s see an example where LEAD() is used to get a value from the third row after the current row. In this example, we do not partition the result table. SELECT Maker, ProductionYear, Price, CountrySalesIndicator, LEAD(TotalCountrySales, 3) OVER (ORDER BY Price) as LeadValue FROM CAR_MAKERS cm JOIN COUNTRIES c ON cm.Country = c.CountryID; MakerProductionYearPriceTotalCountrySalesLeadValue Fiat2017400007580 Honda2017400006060 Toyota2019450006080 Citroen2015500008080 Nissan Motors2017500006095 Peugeot2019550008095 Renault2019600008085 Opel2017650009590 Volkswagen2015700009595 Volvo2019750008595 Ford Motor Company2015800009080 BMW20151000009585 Daimler Benz2017110000950 Starting from the first row of the result table, each row gets a value of the 3rd row following it. Please note that the last 3 rows values are null because there is no value for the 3rd row following them. LAG() Window Function The LAG() window function is the opposite of the LEAD() function. It returns the value from a previous row. Let’s see it in action. The following example returns a LagValue and calculates the difference between the current row and the previous row: SELECT Maker, ProductionYear, Price, TotalCountrySales, LAG(TotalCountrySales, 1, 0) OVER (PARTITION BY ProductionYear ORDER BY Price) as LagValue, TotalCountrySales - (LAG(TotalCountrySales, 1, 0) OVER (PARTITION BY ProductionYear ORDER BY Price)) as Difference FROM CAR_MAKERS cm JOIN COUNTRIES c ON cm.Country = c.CountryID; MakerProductionYearPriceTotalCountrySalesLagValueDifference Citroen20155000080080 Volkswagen201570000958015 Ford Motor Company2015800009095-5 BMW201510000095905 Fiat20174000075075 Honda2017400006075-15 Nissan Motors20175000060600 Opel201765000956035 Daimler Benz201711000095950 Toyota20194500060060 Peugeot201955000806020 Renault20196000080800 Volvo20197500085805 The example should speak for itself, as it is analogous to the example used for LEAD(). Here the values are looked up from the previous row; this is opposite of the LEAD() function, where the values are looked up from the subsequent row. Like LEAD(), the LAG() window function can fetch the value of any row preceding it – not just the immediately preceding row. FIRST_VALUE() Window Functions This window function outputs the first value of the current partition. This will be clear after looking at the example below. SELECT Maker, ProductionYear, Price, TotalCountrySales, FIRST_VALUE(TotalCountrySales) OVER (PARTITION BY ProductionYear ORDER BY Price) as FirstValue FROM CAR_MAKERS cm JOIN COUNTRIES c ON cm.Country = c.CountryID; MakerProductionYearPriceTotalCountrySalesFirstValue Citroen2015500008080 Volkswagen2015700009580 Ford Motor Company2015800009080 BMW20151000009580 Fiat2017400007575 Honda2017400006075 Nissan Motors2017500006075 Opel2017650009575 Daimler Benz20171100009575 Toyota2019450006060 Peugeot2019550008060 Renault2019600008060 Volvo2019750008560 Let’s analyze the first partition in this output table. The FirstValue column contains an 80 for all rows; this is the TotalCountrySales column value of the first row in this partition. Example 3: Using Aggregate Functions with OVER() Our next example demonstrates the usage of the aggregate functions SUM(), COUNT(), and AVG() with the OVER() clause. Using these functions with OVER() turns them into window functions. Please note that the OVER() clause is mandatory for all window functions, regardless of whether it is empty or not. SELECT Maker, ProductionYear, Country, TotalCountrySales, SUM(TotalCountrySales) OVER (PARTITION BY ProductionYear) as SumValue, COUNT(TotalCountrySales) OVER (PARTITION BY ProductionYear) as CountValue, AVG(TotalCountrySales) OVER (PARTITION BY ProductionYear) as AvgValue FROM CAR_MAKERS cm JOIN COUNTRIES c ON cm.Country = c.CountryID; MakerProductionYearCountryTotalCountrySalesSumValueCountValueAvgValue Volkswagen2015295360490.0000 BMW2015295360490.0000 Ford Motor Company2015190360490.0000 Citroen2015380360490.0000 Nissan Motors2017460385577.0000 Fiat2017575385577.0000 Honda2017460385577.0000 Daimler Benz2017295385577.0000 Opel2017295385577.0000 Peugeot2019380305476.2500 Renault2019380305476.2500 Toyota2019460305476.2500 Volvo2019685305476.2500 In the above example, the sum, count, and average values are calculated using the values in the current partition. It is also worth mentioning that they are calculated for each partition separately. Let’s analyze the first partition to see what’s going on: The SumValue column stores the sum of all TotalCountrySales values in the current partition (i.e. 95+95+90+80=360). The CountValue column stores the number of records in the current partition (i.e. 4). The AvgValue column stores the average of TotalCountrySales column values in the current partition (i.e. (95+95+90+80)/4=90). A good example of ‘aggregate function + OVER() clause’ usage is the SQL running total. If you want to learn more, please see the article What Is a SQL Running Total and How Do You Compute It?. Example 4: NTILE() and CUME_DIST() Window Functions The NTILE() Window Function The NTILE() window function divides partition rows into buckets. It takes the parameter n that the user places in the function brackets and creates that number of buckets. Thus, n groups/buckets are created for each partition and the appropriate bucket number is assigned to each row. Like a few of the other concepts we’ve talked about, this will be clearer after you see an example. Suppose you want to divide car makers into two groups for each production year. This is the query you’d use: SELECT Maker, ProductionYear, NTILE(2) OVER (PARTITION BY ProductionYear ORDER BY Maker) as NtileValue FROM CAR_MAKERS cm JOIN COUNTRIES c ON cm.Country = c.CountryID; MakerProductionYearNtileValue BMW20151 Citroen20151 Ford Motor Company20152 Volkswagen20152 Daimler Benz20171 Fiat20171 Honda20171 Nissan Motors20172 Opel20172 Peugeot20191 Renault20191 Toyota20192 Volvo20192 The column NtileValue contains the bucket number – which is either 1 or 2 – for each row in each partition. The NTILE window function is used to divide the result table rows into groups. In the example above, we partition the result table into 3 partitions and each partition’s rows are divided into 2 groups because the query uses NTILE(2). The CUME_DIST() Window Function The CUME_DIST() window function – as its name indicates – calculates the cumulative distribution value for each row in a partition. Cumulative distribution is the number of rows that have a value that’s less than or equal to the current row value divided by the number of all rows within the partition. The example below will help clear up this concept: SELECT Maker, ProductionYear, Country, TotalCountrySales, CUME_DIST() OVER (PARTITION BY ProductionYear ORDER BY TotalCountrySales) as CumeDistValue FROM CAR_MAKERS cm JOIN COUNTRIES c ON cm.Country = c.CountryID; MakerProductionYearCountryTotalCountrySalesCumeDistValue Citroen20153800.25 Ford Motor Company20151900.5 Volkswagen20152951 BMW20152951 Nissan Motors20174600.4 Honda20174600.4 Fiat20175750.6 Daimler Benz20172951 Opel20172951 Toyota20194600.25 Peugeot20193800.75 Renault20193800.75 Volvo20196851 Let’s analyze the first partition as an example. It includes TotalCountrySales column values of 80, 90, 95, and 95. The CumeDistValue column values of the first four rows can be explained as follows: The first row has a TotalCountrySales column value of 80. This is the smallest TotalCountrySales value in the current partition; hence, 1/4=0.25. The second row has a TotalCountrySales value of 90. There are two values less than or equal to 90; hence, 2/4=0.5. The third and fourth rows have a TotalCountrySales value of 95. This is the highest in this partition, so 4/4=1. The CUME_DIST window function returns the cumulative distribution value. It is used mostly for data analysis. Going Deeper into MySQL Window Functions SQL window functions are similar to aggregate functions, but with a major difference. Aggregate functions create one summary row for a group of rows, while window functions assign summary values to each row. MySQL window functions are very useful for thorough data analysis: they allow the user to analyze each data group without losing the details in each row. This article was meant to be an introduction to MySQL window functions. To really learn them, you should check out the LearnSQL.com course on Window Functions. It’ll give you a detailed explanation of each function and lots of opportunities to practice all of them. Tags: window functions