23rd Jul 2020 13 minutes read How to Analyze a Time Series in SQL Marija Ilic sql learn sql window functions Table of Contents Examples of Time Series Data Running Totals with SQL Window Functions Percent Change in Daily Website Visits Using the LAG() function to retrieve a previous row’s value 1-day increase/decrease in total number of visits 7-day increase/decrease in total number of visits Simple Moving Averages: 7 Days Using RANK() to Find the Highest Number of Visits Learn More About Analyzing Time Series with SQL Values ordered by time are called a time series. In this article, I’ll show you the most frequent patterns and teach you how to write queries for time series in SQL with the help of window functions. Maybe you’ve had the opportunity to analyze some variables where each value was associated with a time value. Such data – where values are ordered by time – is called time series data. Because it’s so frequently used in business, it’s important to learn the common patterns used to analyze this type of data. Examples of Time Series Data Time series data are variables with an additional component – time. This means that each value of an attribute is associated with a date or time value. Here are some examples of time series data: The daily stock price for a particular corporation last year. (Each stock price is associated with a specific day.) The daily average of the Dow Jones index for the last 10 years. (Each index value is associated with a specific day.) Unique visits to a website over a month. (Each month, the number of visits differ.) A website’s registered users for each day. Weekly sales figures. Yearly company revenue and expenses over a decade. (The year is the time value.) Daily app logins over two months. (The day is the time value.) In this article, we will analyze the popularity of two fictional websites through a measure called the “total daily number of visits”. The SQL queries we’ll discuss can be utilized for other time series analysis; that is, they are applicable to other time series data sets. We will observe the period from 01 July 2019 until 31 December 2019. Here is the data: datevisitsweekendwebsite 2019-07-012805Nwww.sqlanalysts.com 2019-07-024398Nwww.sqlanalysts.com 2019-07-036744Nwww.sqlanalysts.com 2019-07-046925Nwww.sqlanalysts.com ............ ............ 2019-12-253591Nwww.sqlanalysts.com 2019-12-264988Nwww.sqlanalysts.com 2019-12-277061Nwww.sqlanalysts.com 2019-12-282286Ywww.sqlanalysts.com 2019-12-292462Ywww.sqlanalysts.com 2019-12-303216Nwww.sqlanalysts.com 2019-12-314752Nwww.sqlanalysts.com 2019-07-013087Nwww.sqldevelopers.com 2019-07-025157Nwww.sqldevelopers.com 2019-07-038207Nwww.sqldevelopers.com ............ ............ 2019-12-265924Nwww.sqldevelopers.com 2019-12-278619Nwww.sqldevelopers.com 2019-12-281730Ywww.sqldevelopers.com 2019-12-291913Ywww.sqldevelopers.com 2019-12-303621Nwww.sqldevelopers.com 2019-12-315618Nwww.sqldevelopers.com daily_visits table This table is named daily_visits and contains following attributes: date – Any day between 01 July and 31 December 2019. visits – The total number of website visits on a specific date. weekend – This value is ‘N’ if the date is a weekday and ‘Y’ if it is a Saturday or Sunday. website – The website’s domain name (‘www.sqlanalysts.com’ or ‘www.sqldevelopers.com’). As you may notice, each row in our daily_visits table (i.e. each value of the visits attribute) is associated with one day (the date attribute). This is an example of time series data. Running Totals with SQL Window Functions We will start our analysis using a pattern called the running total. A running total is the cumulative sum of all the previous numbers in a column. Below is a running total for the number of visits from 01 July to a specific day. Note that this is calculated for each site: In the image above, you can see how on 01 July the running total is 2,805. (It is equal to the number of visits on that day.) This is because there is no data for the dates before 01 July; we are starting the calculation from this date. On the next day (02 July) the value from the previous row is added to the current number of visits. The running total for this day is 7,203 – the number of visits on 01 July plus the number of visits on 02 July. On the next day, 03 July, we add that number (6,744) to the previous total (7,203) and get 13,947. And on it goes. This is just one business example where the running total pattern is used; some other examples include the: Total number of items sold from the first day of a month until the observed day in the same month. Running total (cumulative sum) of a bank account’s debit/credit transactions in the last quarter or year. Total revenue-to-date from January until the observed month in the same year Now that we know what a running total is, let's write an SQL query that calculates one. A running total (or cumulative sum) can be calculated in SQL by using the appropriate window functions. Window functions are special SQL functions that work across a set of rows. They are similar to the GROUP BY clause but the result is displayed differently at the end. In a simple GROUP BY, rows are collapsed (each group is shown as one row). With window functions, rows are not collapsed; each row is returned and the calculation value is assigned to each row in the table. This is exactly what we need when displaying running total – an additional value in an existing table. Below is an SQL query that calculates the running total on the daily_visit data : SELECT *, SUM(visits) OVER (PARTITION BY website ORDER BY date) AS running_total FROM daily_visits; When you run this query, all rows are displayed and an additional column, running_total, is created. Here’s a short explanation of what’s going on: The OVER clause, with PARTITION BY, tells SQL this will be a window function. PARTITION BY divides rows into logical groups. In our example, rows are grouped on the website level. ORDER BY sets the order of the rows. In our example, all rows are sorted by the date column. ROW/RANGE is not explicitly defined in this example. These optional keywords signify that each window frame inside one partition comprises all rows from the start of the partition to the current row. In other words, for each row, the total sum is calculated as the sum of the values from the first row in the partition until the current row. With this SQL statement (OVER in combination with PARTITION BY and ORDER BY), we set up a running total of visits on the website level. I'm not going to dive deep into this explanation and syntax here. If you want to learn window functions in a more detailed way, I recommend the LearnSQL.com course on window functions. Check out the companion article, SQL Course of the Month - Window Functions; it answers questions like why you should learn window functions, what they are, and why this course is a great choice. Window functions are used in other pattern calculations. In the next section, I’ll peek into the window functions cheat sheet and show you how to write SQL queries for percent change and moving averages. Percent Change in Daily Website Visits You will quite often need to describe how your business changes through time. There’s another very common pattern used in time series analysis called ‘percent of change’ (or or percent/percentage change). It answers questions like: How does this month’s revenue compare to last month’s? Has it increased or decreased? Has our website’s number of registered users increased or decreased this quarter? Did I sell more items today than I did one week ago? Are our sales increasing or decreasing compared to last year? Next, I’ll show you how to use SQL to calculate percentage changes. In the following example, we’ll find the percent of change in the total number of website visits (today compared to yesterday and today compared to one week ago). Once you learn how to do percentage changes with this example, you can apply it to any other business case. The query structure is the same; only the table and column names – and maybe a LAG() argument, which I’ll explain in a second – will differ. Using the LAG() function to retrieve a previous row’s value The first step in calculating the percent of change is to retrieve a value from a previous row. Why do we need this? Because percent change is calculated by the formula: (current_value - previous_value)/previous value * 100. This formula means that to calculate the increase or decrease in the number of visits, you need to have both values presented in the same row. So, our first task in calculating this pattern is to retrieve a value from a previous row. This task can be done with the help of the LAG() window function. It provides access to a value in a given row that comes before the current row. Below is a SELECT statement that assigns the previous row’s daily_visits value to a new column (previous_day_visits) in the current row: SELECT *, LAG(visits) OVER(PARTITION BY website ORDER BY date) AS previous_day_visits FROM daily_visits; This is a typical window function: inside the OVER clause, you define the desired partition and ordering. The LAG() function takes one argument (the name of the column holding the desired values) and assigns the previous row’s value to each row: LAG() can also be used to assign values from n rows back, not just the previous row. For example, suppose you want to calculate the percentage change for the same day last week. In that case, you’ll need to assign a value from seven days ago to each row. To do this, we use LAG’s optional offset parameter. Check out the following query: SELECT *, LAG(visits,7) OVER(PARTITION BY website ORDER BY date) AS previous_day_visits FROM daily_visits; LAG’s default offset value is 1 (the previous row), but you can change it to any other value. In our example, we used a value of 7, which assigns each row the number of visits from 7 days back: 1-day increase/decrease in total number of visits Now we can easily calculate a 1-day increase/decrease with this SQL code: WITH daily_visits_lag AS ( SELECT *, LAG(visits) OVER(PARTITION BY website ORDER BY date) AS previous_day_visits FROM daily_visits ) SELECT *, COALESCE(round((visits — previous_day_visits)/previous_day_visits *100),0) AS percent_change FROM daily_visits_lag; This statement seems complicated, but it really isn’t. Here’s what’s going on: The SELECT that assigns each row a value from the previous row is written inside a WITH statement. This means that we are using a common table expression or CTE (i.e. a named temporary result set). We then use this temporary result in the subsequent SELECT. The temporary result set is named daily_visits_lag. It contains values needed for the percentage change calculation (i.e. the number of visits for the current row and the number of visits from the previous day). daily_visits_lag is used in the main query. An additional column, percent_change, is calculated in COALESCE() with the formula (visits—previous_day_visits)/previous_day_visits * 100. After this statement is executed, the SQL engine drops the temporary result set; it cannot be used further in the code. (This is how a CTE works.) There’s not enough space to go into CTEs here, but our Recursive Queries course is a good resource for learning more about CTE usage and syntax. After running this query, the following values are displayed: 7-day increase/decrease in total number of visits Now that you know how to calculate a 1-day percentage change, a very similar statement can be used to calculate a 7-day decrease/increase in the total number of visits: WITH daily_visits_lag AS ( SELECT *, LAG(visits,7) OVER(PARTITION BY website ORDER BY date) AS previous_7day_visits FROM daily_visits ) SELECT *, COALESCE(round((visits — previous_7day_visits)/previous_7day_visits *100),0) AS percent_change FROM daily_visits_lag; The only difference here is that we used LAG() with an offset parameter of 7 – we are retrieving visit values from 7 days earlier (one week ago) and calculating the percent of change with those values (current day vs. one week earlier). In other words, we are calculating the 7-day increase or decrease in the number of visits: Keep in mind that the query we used here can be used on other business examples. Just adjust the table and column names; the rest can stay the same. Simple Moving Averages: 7 Days Another very frequently used pattern in time series analysis is called a simple moving average (SMA). An SMA is the unweighted mean of the previous n row values; it is calculated for each value in a given column. SMAs are often used when determining trends in stock price or cryptocurrency analysis, including crypto buying. This information helps us understand the behavior of our variable: instead of just one value, we get better estimations by using the average values of a specific measure. We are smoothing out the fluctuations to get an overall view. In our website example, each day will have two figures of interest: The number of visits that happened on that specific day The average number of visits for the last 7 days. Once again, we’ll use an SQL window function to calculate our SMA. Unlike the previous example, where we used LAG() to retrieve previous row values, here we’ll use the ROW/RANGE parameter inside the OVER clause: SELECT *, AVG(visits) OVER(PARTITION BY website ORDER BY date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) as SMA7 FROM daily_visits ; Earlier in the article, we said that ROW inside an OVER clause defines a window inside each partition. When we calculated the running total, we used default values for the row/range parameters. This means the starting point for each window inside the partition was the first row in that partition and the ending point was the current row. Now, each window is defined as 7 rows (the 6 prior row values + the current row value). We did that with ROWS BETWEEN 6 PRECEDING AND CURRENT ROW. In other words, the starting point is not the first row in the partition. Here, the average value (avg(visits)) is going to be calculated on the last 7 rows, including the current one. That’s the definition of an SMA. Once you run this statement, you’ll get an average value associated with each row. This will allow you to observe trends in visits in a more detailed way. Here’s the result: SQL window functions are a really powerful feature for trend analysis, and the SMA is just one of indicators that can be obtained with window functions. If you would like to see more examples related to trend analysis, try our Revenue Trend Analysis in SQL course. It discusses how you can use SQL to analyze any time series trend. Using RANK() to Find the Highest Number of Visits Our last analysis pattern is ranking. As you might guess, this orders results based on a given variable. Suppose we want to see which dates had the highest number of visits for each of our websites. To do this, we need to rank our daily visits for each website separately. We can do that using the RANK() window function: SELECT *, RANK() OVER (PARTITION by website ORDER BY visits DESC) AS rank FROM daily_visits; Here is what is happening in this code: The rank is determined by the column visits. This is defined in ORDER BY, which is in the OVER clause). The row with the highest visits value gets the highest rank. Rows (visits) are grouped separately for each website so the rank for each website is determined individually. This is defined in PARTITION BY in the OVER clause. Rows with equal ranking criteria (i.e. have the same visits value) receive the same rank. Once you run this SELECT, the SQL engine returns a result set with an additional column named rank. Now we can easily see which days had the highest number of visits. The great thing is that the ranking is defined for each site, so we’re not comparing the two sites together. datevisitsweekendwebsiterank 2019-08-2311993Nwww.sqldevelopers.com1 2019-08-2811334Nwww.sqldevelopers.com2 2019-10-0410998Nwww.sqldevelopers.com3 2019-09-2010812Nwww.sqldevelopers.com4 2019-10-2310737Nwww.sqldevelopers.com5 The highest rank for sqldevelopers.com datevisitsweekendwebsiterank 2019-10-1210895Nwww.sqlanalysts.com1 2019-07-0610595Nwww.sqlanalysts.com2 2019-07-1310558Nwww.sqlanalysts.com3 2019-12-2210327Nwww.sqlanalysts.com4 2019-10-2010290Nwww.sqlanalysts.com5 The highest rank for sqlanalysts.com Learn More About Analyzing Time Series with SQL So now you can use SQL window functions to do some basic time series analysis. You know what time series are and how you can use SQL window functions to get some pretty great insights. You’ve even been introduced to CTEs. What’s next? I definitely recommend taking an organized approach to learning more SQL time series analysis. The Window Functions and Revenue Trend Analysis in SQL courses I’ve already mentioned are good. Remember, it’s important to put your skills into practice, so make sure you’re getting some real-world exercises to solve! Tags: sql learn sql window functions