5th Oct 2023 14 minutes read BigQuery Window Functions Explained Jill Thornhill window functions Google BigQuery Table of Contents What’s BigQuery? What Are SQL Window Functions? Why Are They Called Window Functions? Syntax of BigQuery Window Functions The OVER() Clause The PARTITION BY Clause The ORDER BY Clause Using ORDER BY with the PARTITION BY Clause SQL Window Functions Available in BigQuery RANK() DENSE_RANK() ROW_NUMBER() LAG() LEAD() Practical Uses of BigQuery Window Functions Next Steps with BigQuery Window Functions Keeping up with data analysis trends gives your organization – and your CV – the cutting edge. In this article, we'll look at BigQuery window functions and how you can use them to gain deeper insights into your data. Google's BigQuery, launched in 2010, is gaining traction as a popular choice with organizations needing to analyze large quantities of information quickly and to compare their own data against statistical data in the public domain. Since Google aligned BigQuery's data retrieval language to conform with standard SQL— and included advanced features such as SQL window functions—its popularity has increased. Many organizations now include BigQuery skills as a must-have, and this means that SQL skills are more in demand than ever. This article explains why knowledge of SQL is essential to working with BigQuery SQL continues to hold its place as the leading skill for anyone who needs to work with data. If you're not yet an SQL guru, you may be interested in our SQL from A to Z learning track. It includes 7 courses that take you from absolute beginner to SQL expert. The course has hundreds of real-world coding challenges and takes about 84 hours to complete. Since you can access the sample databases through your browser, you don't need to install any software to get started. What’s BigQuery? Google BigQuery is a high-speed data warehouse located in the Cloud. Designed specially to hold data used for analysis, it can process petabytes of data in minutes. If you haven't yet got your head around petabytes, a petabyte is one quadrillion bytes (or a million gigabytes). You pay for what you use in BigQuery, so the cost of storing and analyzing huge quantities of data usually works out much lower than investing in lots of hard disks. And since all kinds of useful statistical data from governments and world organizations are stored publicly in BigQuery, you can access them to gain insights into how your organization's performance could be enhanced. What Are SQL Window Functions? Window functions are also known as analytical functions or OVER functions. They were added to the SQL standard in 2003, and most major database vendors began implementing them from 2010 onwards. They are thus a fairly new addition to SQL. Before window functions were included in SQL, you could either list individual rows or calculate aggregates such as totals and averages. You couldn't easily do both in the same query unless you wrote complex—and probably slow and inefficient—subqueries. This means you could either have a list like this … Student IDSubjectScore 1Math63 1Science50 2Math59 … or you could show aggregates like this: SubjectClass Average Math52 Science61 English55 Overall Average56 In this example, if you wanted to know how an individual student's score compared to the class average, you'd have to look at the two reports side by side. It would be much nicer if you could see the class average on the same row as a student's score, like this: Student IDSubjectScoreClass Average 1Math6362 1Science5061 2Math5952 As I mentioned, you could achieve this using subqueries. But subqueries are notoriously slow to run and can make your query very complicated. This is the kind of thing window functions allow you to do easily and efficiently: include aggregates alongside details on the same row. Why Are They Called Window Functions? They're called window functions because, while looking at an individual row, you can also "look through the window" and extract information from the entire dataset or rows related to the current row. As an example, let's look at a table of students' grades: Student IDTeacher IDSubjectGrade 11Math63 21Math80 32Math60 42Math45 51Math52 61Math70 72Math65 12Science70 22Science62 32Science90 42Science30 52Science53 15English59 35English70 55English45 65English62 112History55 312History67 412History58 As per the previous example, we want to show the class average next to each student, so that a single row looks like this: Student IDSubjectGradeClass Average 4Math4562 To do this, we need to look at the average of all the other students when we're extracting this row and show the result alongside the other data. In SQL, window functions use a sliding window of rows to extract additional information from either the entire dataset, or a subset related to the current row. In the above diagram, the current row is a math result, and the window includes all the math grades. When we're extracting this row … Student IDSubjectGradeClass Average 1Science7062 … we'll need the window to "slide" so we're viewing all the science results in order to calculate the average for science. Window functions have many of the same capabilities as the GROUP BY clause, but the difference is that they allow us to view aggregates and details side by side. Syntax of BigQuery Window Functions The OVER() Clause The OVER() clause indicates that you're using a window function. You'd write your query as usual, and include the aggregates you want alongside the other column names. Each aggregate is identified with the OVER() clause. When you use this clause on its own, the 'window' is the entire dataset. I'll cover sliding windows a bit later in the article. For example, if you wanted to extract all the math results and show the class average and the highest and lowest grade against each student’s grade, your query would look like this: SELECT student_id, grade, AVG(grade) OVER() AS average, MIN(grade) OVER() AS lowest, MAX(grade) OVER() AS highest FROM exam_results WHERE subject = 'Math'; Your results would then look like this: student_idgradeaveragelowesthighest 163624580 280624580 360624580 445624580 552624580 670624580 765624580 The PARTITION BY Clause This clause uses a sliding window. Instead of a window containing the entire dataset, it includes only a partition (or part) of the set. In the previous example, I included only the math results by excluding all others using the WHERE clause. If you wanted a report that showed results for all subjects but calculated the average using only the rows where the subject matched the current row, you would use the PARTITION BY clause: SELECT student_id, subject, grade, AVG(grade) OVER(PARTITION BY subject) AS average FROM exam_results; Have a look at this color-coded copy of the students’ grades table to see how the partitions will work: Student IDTeacher IDSubjectGrade 11Math63 21Math80 32Math60 42Math45 51Math52 61Math70 72Math65 12Science70 22Science62 32Science90 42Science30 52Science53 15English59 35English70 55English45 65English62 112History55 312History67 412History58 When processing each row, the rows included in the window change based on the value in the subject column. This means the average is calculated only for the partition of the dataset where the subject matches the current row. You can visualize it like this: The results would look like this: Student IDSubjectGradeClass Average 1Math6362 2Math8062 3Math6062 4Math4562 5Math5262 6Math7062 7Math6562 1Science7061 2Science6261 3Science9061 4Science3061 5Science5361 1English5959 3English7059 5English4559 6English6259 1History5560 3History6760 4History5860 The ORDER BY Clause The ORDER BY clause within the OVER() function uses a different type of sliding window. When you use OVER(ORDER BY column_name), the window includes only those rows where the value of the specified column is less than or equal to that column’s value in the current row. The ORDER BY clause is useful for calculating running totals and moving averages. As an example, I'll use a table called monthly_transactions that contains bank account transactions: account_idtran_datetransactionvalue 12023-09-01Opening Balance500.00 12023-09-03Deposit137.45 12023-09-12Withdrawal-200.00 12023-09-18Withdrawal-250.00 22023-09-01Opening Balance1200.00 22023-09-14Deposit900.00 22023-09-20Purchase-318.90 The query below will list the transactions for Account ID 1, showing a running balance. SELECT account_id, tran_date, transaction, value, SUM(value) OVER(ORDER BY tran_date) AS balance FROM monthly_transactions WHERE account_id = 1; Including the ORDER BY clause within the OVER clause controls a sliding window. If you wanted to, you could also use the usual ORDER BY clause at the end of the query to control the final order of the rows on the report. They don’t have to be shown in the original order. By default, using the ORDER BY inside the OVER clause causes the window to slide so that it only views rows where the date is less than or equal to the date of the current row. There are other keywords that can change this default, but they’re a bit beyond the scope of this article. You can visualize it like this: The results would look like this: account_idtran_datetransactionvaluebalance 12023-09-01Opening Balance500.00500.00 12023-09-03Deposit137.45637.45 12023-09-12Withdrawal-200.00437.45 12023-09-18Withdrawal-250.00187.45 22023-09-01Opening Balance1200.001200.00 22023-09-14Deposit900.002100.00 22023-09-20Purchase-318.901781.10 Using ORDER BY with the PARTITION BY Clause If you want to show all accounts with their running balances, you can use PARTITION BY and ORDER BY together: SELECT account_id, tran_date, transaction, value, SUM(value) OVER(PARTITION BY account_id ORDER BY tran_date) AS balance FROM monthly_transactions ORDER BY account_id, tran_date The PARTITION BY clause will cause the sliding window to only include rows where the account_id matches the current row. The ORDER BY clause will cause it to only include rows within that partition where the date is less than or equal to the date of the current row. The results would be: account_idtran_datetransactionvaluebalance 12023-09-01Opening Balance500.00500.00 12023-09-03Deposit137.45637.45 12023-09-12Withdrawal-200.00437.45 12023-09-18Withdrawal-250.00187.45 22023-09-01Opening Balance1200.001200.00 22023-09-14Deposit900.002100.00 22023-09-20Purchase-318.901781.10 Each account has its own separate running balance. SQL Window Functions Available in BigQuery You've seen how common SQL aggregate functions such as SUM(), AVG(), MIN(), and MAX() can be used in conjunction with the OVER clause to extract aggregates from a window of data. Google BigQuery, in common with many other SQL dialects, has additional functions that can give deeper insights into the data. Here are some examples. RANK() This function ranks the dataset from highest to lowest on a specified column. It can answer questions like: Where did each student place in the class, based on exam results? Which products were most profitable? Which customers spent the most money? Which warehouse received the most complaints? Using the sample table we looked at earlier, let’s rank the students by exam results using this query: SELECT student_id, subject, grade, RANK() OVER(PARTITION BY subject ORDER BY grade desc) AS place FROM exam_results ORDER BY subject, grade DESC; The result would be: student_idsubjectgradeplace 3English701 6English622 1English593 5English454 3History671 4History582 1History553 2Math801 6Math702 7Math653 1Math634 3Math605 5Math526 4Math457 3Science901 1Science702 2Science623 5Science534 4Science305 When calculating these results, the PARTITION BY subject clause causes SQL to look at only the results for the same subject as the current row. The ORDER BY grade DESC clause sorts them in descending order of grade. The RANK() function then ranks the students in this order. Since student 3 has the highest grade in English, his rank is 1; student 6, the next highest, is ranked 2 in this subject. Student 2 has the highest grade in Math and is ranked as 1. DENSE_RANK() DENSE_RANK() is used for the same purpose as RANK. The difference between them can best be explained by looking at these query results, which represent scores in a shooting competition. Using RANK, the query would be: SELECT competitor_no, score, RANK() OVER(ORDER BY score desc) AS rank FROM match_results ORDER BY score DESC; The results are: Competitor NoScoreRank 4851 5832 10832 9814 2765 6765 7727 3708 8689 16210 Competitors 5 and 10 tied for second place, and both received a ranking of 2. Competitor 9 is the next highest, and ranks as 4. Third place is left out. Using DENSE_RANK, the query is: SELECT competitor_no, score, DENSE RANK() OVER(ORDER BY score desc) AS rank FROM match_results ORDER BY score DESC; The results are: Competitor NoScoreRank 4851 5832 10832 9813 2764 6764 7725 3706 8687 1628 Competitors 5 and 10 are still both ranked as 2, but third place is not left out: Competitor 9 now has a ranking of 3. Both functions have the same syntax. If we wanted to recode the previous students’ grades query using the DENSE_RANK() function, it would look like this: SELECT student_id, subject, grade, DENSE_RANK() OVER(PARTITION BY subject ORDER BY grade desc) AS class_place FROM exam_results ORDER BY subject, grade DESC But because there are no tie values, the result would be the same. ROW_NUMBER() The ROW_NUMBER() function works in a similar way to the previous two functions, but the rows are simply numbered in order. If rows have the same value, they will be numbered consecutively, depending on which was encountered first. Here’s the results of the shooting competition query using ROW_NUMBER() instead of RANK() or DENSE_RANK(): Competitor NoScoreRank 4851 5832 10833 9814 2765 6766 7727 3708 8689 16210 LAG() This function allows you to compare data from the previous row in the result set to data in the current row. It's ideal for year-on-year comparisons, allowing you to discover trends and identify business performance issues. LAG() and the related function LEAD() can only be used in conjunction with the OVER(ORDER BY) clause. As an example, take the following table, which holds sales data for a small company: yearsales_valuesales_quantityprofit 2019540009008000 202075000120011000 2021300004501000 202260000100007000 The sample query to compare year-on-year figures is: SELECT year, sales_value, sales_quantity, profit, LAG(sales_value) OVER(ORDER BY year) as ly_value, LAG(sales_quantity) OVER(ORDER BY year) as ly_qty, LAG(profit) OVER(ORDER BY year) as ly_profit, profit - LAG(profit) OVER(ORDER BY year) as inc_dec FROM annual_sales ORDER BY year; The results of this query are: yearsales_valuesales_quantityprofitly_valuely_qtyly_profitinc_dec 2019540009008000NULLNULLNULLNULL 2020750001200110005400090080003000 202130000450100075000120011000-10000 2022600001000070003000045010006000 Let's look at this line of the query and see what it actually did: LAG(sales_value) OVER(ORDER BY year) as ly_value On this row of the result … yearsales_valuesales_quantityprofitly_valuely_qtyly_profitinc_dec 2020750001200110005400090080003000 … our current row is for the year 2020. The LAG() function in conjunction with ORDER BY year causes SQL to look at the row for the previous year (2019) and extract the sales value from it under the heading ly_value. You will notice that in the first row, the columns calculated by the LAG() function contain a null value, since there is no previous record. LEAD() The LEAD() function is the inverse of LAG(): it obtains data from the row after the current row, instead of the one before it. To compare the profit between the current year, the previous year, and the subsequent year using the same sample table, the query would be: SELECT year, profit, LAG(profit) OVER(ORDER BY year) as ly_profit, profit - LAG(profit) OVER(ORDER BY year) as ly_inc_dec, LEAD(profit) OVER(ORDER BY year) as ny_profit, LEAD(profit) OVER(ORDER BY year) - profit as ny_inc_dec FROM annual_sales ORDER BY year; The results of this query would be: yearprofitly_profitly_inc_decny_profitny_inc_dec 20198000NULLNULL110003000 202011000800030001000-10000 2021100011000-1000070006000 2022700010006000NULLNULL The columns ny_profit and ny_inc_dec use the LEAD() function, which causes SQL to look at the next row in sequence to extract these fields. When the current row is for 2019, these two fields will be taken from the row for 2020. This should have given you an idea of some of the useful SQL window functions available in BigQuery. You'll find a full list of functions in the BigQuery SQL documentation. For more examples of window functions, refer to the article SQL Window Function Examples. You can also find a quick reference guide for the syntax of SQL window functions in our Window Functions Cheat Sheet. Practical Uses of BigQuery Window Functions In the real world, there are many ways that BigQuery window functions can give you insights to help your organization perform better. They are very powerful and allow you to produce complex reports very quickly. Here are a few ideas of how BigQuery window functions may be used: Ranking your employees by performance to give motivational rewards. Check out this article for more information on ranking rows. Finding how each product line performed compared to other similar items. Learning how individual products affect the moving average profit. You can read more about computing moving averages in SQL here. Comparing year-on-year data to discover trends. Find out more about preparing year-on-year comparisons in SQL in this article. Using running totals to be able to see exactly how many sales had been made at a given point in time. Learn about computing running totals in SQL here. Next Steps with BigQuery Window Functions Now that you've seen what SQL window functions can do in BigQuery and other database management systems, it's time to think about upping your skills in this important area. A good place to start is LearnSQL.com's SQL Window Functions course. You'll learn step by step how to make use of the techniques you've seen in this article, with a full explanation of each topic. You'll also get plenty of practice, with over 200 interactive exercises to make sure you know how to extract complex information in real-world situations. Help is available when you get stuck, and you will be able to access sample databases through your browser. The course takes about 20 hours to complete. If you really want to become an expert, you can get more practice in solving complex problems by working through our Window Functions Practice Set. You will tackle 100 examples of complex reports using three different databases. You can also read this article on how to practice window functions, and you can find out more about BigQuery syntax here. If you're job-hunting for a top-level job in data analysis or data science, it's very likely that you'll be asked to demonstrate your knowledge of SQL window functions at the interview. To get an idea of what kind of questions you might be asked (and how to answer them), here's an article that discusses top SQL window functions interview questions. Take the first steps today to move your data analysis skills to the next level! Tags: window functions Google BigQuery