25th Aug 2020 8 minutes read

How to Calculate the Length of a Series with SQL

window functions

Table of Contents

What Is a Series?
Why Calculate a Series Length?
How to Calculate the Length of a Series in SQL
- Creating the CTE
- Selecting Data from the CTE
That’s One Way to Calculate a Series Length ...

What is a time series and why bother to calculate its length using SQL?

How do we calculate the length of a series in SQL? I can answer that in two words: window functions! Yes, you’ll need some knowledge of SQL window functions to calculate series length. But what is a time series, and why would you want to find out how long it is?

What Is a Series?

Although there is some variation in the definition of “time series”, it is basically a sequence of data listed in time order.

In a database, this is usually represented by events separated by equal time intervals. For example, user website logins on consecutive days are considered a series. The table below shows such login dates:

id	date	consecutive logins
1	2020-06-01	3 days
2	2020-06-02
3	2020-06-03
4	2020-06-06	6 days
5	2020-06-07
6	2020-06-08
7	2020-06-09
8	2020-06-10
9	2020-06-11
10	2020-06-13	4 days
11	2020-06-14
12	2020-06-15
13	2020-06-16
14	2020-06-22	2 days
15	2020-06-23

If I wanted to analyze the series’ lengths in this table, I would count the number of consecutive logins. The first series’ length is three days, since the user logged in on 2020-06-01, 2020-06-02, and 2020-06-03. The second series’ length is six days (the user logged in every day between 2020-06-06 and 2020-06-11). Following the same logic, the next two series’ lengths are four and two days, respectively.
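The manual counting described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the SQL solution this article builds; the function name streak_lengths is my own, and the dates are copied from the table.

```python
from datetime import date, timedelta

# Login dates copied from the table above (all in June 2020).
logins = [date(2020, 6, d) for d in (1, 2, 3, 6, 7, 8, 9, 10, 11, 13, 14, 15, 16, 22, 23)]


def streak_lengths(days):
    """Return the length of each run of consecutive days, in date order."""
    lengths = []
    previous = None
    for day in sorted(set(days)):
        if previous is not None and day - previous == timedelta(days=1):
            lengths[-1] += 1   # the current streak continues
        else:
            lengths.append(1)  # the first date, or a gap, starts a new streak
        previous = day
    return lengths


print(streak_lengths(logins))  # [3, 6, 4, 2]
```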

Why Calculate a Series Length?

Time series are extensively used, and there are many situations where you’d need to calculate series’ length. Some examples of calculating series’ lengths include:

Measuring a login streak on Stack Overflow.
Seeing your activity streak on Duolingo.
Tracking how many days you’ve used a fitness app.
Analyzing a sales streak on an e-commerce site.
Finding consecutive increases or decreases in a currency’s value.

Basically, anywhere you have a time series, you’ll probably need to calculate its length.

The main question remains: How do you calculate a time series’ length, and how do you do it in SQL? As the above table shows, a time series is not a difficult concept to grasp. The moment you saw this example, you intuitively knew how to get the series’ length.

It’s one thing to find a series length manually across several lines of data, but you can’t do that on a database with hundreds, thousands, or millions of rows. Luckily, SQL window functions are here to save the day!

If you need to refresh your understanding of window functions, try this window functions cheat sheet. It may be especially useful when we arrive at the calculation part.

If you don’t even know what window functions are, I recommend that you read Why Should I Learn SQL Window Functions? before you go any further. The best way to really get a comprehensive knowledge of window functions is the LearnSQL.com Window Functions course.

How to Calculate the Length of a Series in SQL

For this example, let’s imagine you’re learning High Valyrian on Duolingo. The Duolingo platform has a thing called a streak. As the site explains, a streak is “the number of days in a row you have completed a lesson. Once you complete a lesson in the app or website, your streak will increase by one day. You will receive your daily reward when you meet your XP goal.” (XP is ‘experience points’, by the way.) Monitoring your streak allows you to receive specific platform incentives. So how does Duolingo know how long your streak is?

Your lessons for July 2020 are stored in the table lesson_completed. It contains the following columns:

id: The ID of the lesson.
date_completed: The date you completed the lesson.

Now let’s write a query to find your streak. I’ll use a Common Table Expression (CTE) to help organize this query. For those of you who are not familiar, our article on CTEs is a good starting point. Running this query will calculate the length of the series:

WITH groups AS (
	SELECT	RANK() OVER (ORDER BY date_completed) AS row_number,
			date_completed,
			DATEADD (day, -RANK() OVER (ORDER BY date_completed), date_completed) AS date_group
	FROM lesson_completed
)

SELECT	COUNT(*) AS days_streak,
		MIN (date_completed) AS min_date,
		MAX (date_completed) AS max_date
FROM groups
GROUP BY date_group;
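If you don’t have SQL Server at hand, the same idea can be tried with Python’s built-in sqlite3 module. Treat this as a sketch under a few assumptions: SQLite has no DATEADD, so I’ve used its date() function with a '-N days' modifier as a stand-in; window functions need SQLite 3.25 or newer; I’ve renamed the CTE to date_groups to keep the sketch portable across dialects; and I’ve added ORDER BY so the output order is predictable. The dates are the July 2020 data used in this article.

```python
import sqlite3

# July 2020 completion dates, matching the example data in this article.
JULY_DAYS = [1, 2, 3, 4, 5, 8, 9, 10, 18, 19, 20, 21, 22, 23, 24, 25, 26, 28, 29, 30, 31]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE lesson_completed (id INTEGER PRIMARY KEY, date_completed TEXT)")
conn.executemany(
    "INSERT INTO lesson_completed (date_completed) VALUES (?)",
    [(f"2020-07-{d:02d}",) for d in JULY_DAYS],
)

STREAK_QUERY = """
WITH date_groups AS (
    SELECT date_completed,
           -- SQLite stand-in for DATEADD(day, -RANK() OVER (...), date_completed)
           date(date_completed,
                '-' || RANK() OVER (ORDER BY date_completed) || ' days') AS date_group
    FROM lesson_completed
)
SELECT COUNT(*)            AS days_streak,
       MIN(date_completed) AS min_date,
       MAX(date_completed) AS max_date
FROM date_groups
GROUP BY date_group
ORDER BY min_date;
"""

streaks = conn.execute(STREAK_QUERY).fetchall()
for days_streak, min_date, max_date in streaks:
    print(days_streak, min_date, max_date)
```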

Let me explain what I did. The query can be divided into two parts:

Creating the CTE.
Selecting data from the CTE.

Creating the CTE

The part of the query that creates the CTE is given again below:

WITH groups AS (
	SELECT	RANK() OVER (ORDER BY date_completed) AS row_number,
			date_completed,
			DATEADD (day, -RANK() OVER (ORDER BY date_completed), date_completed) AS date_group
	FROM lesson_completed
)

The CTE is defined by the WITH clause. I’ve decided the name of this CTE should be groups. Everything that is written in the parentheses after the AS keyword is just a fairly regular SELECT statement.

Now, what does this do? First, I’ve numbered the rows. To do that, I’ve used the RANK() function. This is a window function, which is why it’s followed by the OVER() clause. I want the ranks to be assigned sequentially according to the dates, so this function’s result is ordered by the column date_completed. Because every date appears only once in this table, the rank is the same as the row number.
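RANK() only behaves like a row number while the dates are unique. Here is a small sketch of what happens otherwise, again run through Python’s sqlite3 module (SQLite 3.25 or newer). The four rows are hypothetical, with two lessons completed on 2020-07-02; they are not part of the article’s data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE lesson_completed (id INTEGER PRIMARY KEY, date_completed TEXT)")
# Hypothetical data: two lessons were completed on 2020-07-02.
conn.executemany(
    "INSERT INTO lesson_completed (date_completed) VALUES (?)",
    [("2020-07-01",), ("2020-07-02",), ("2020-07-02",), ("2020-07-03",)],
)

rows = conn.execute(
    """
    SELECT date_completed,
           ROW_NUMBER() OVER (ORDER BY date_completed, id) AS row_num,
           RANK()       OVER (ORDER BY date_completed)     AS rnk,
           DENSE_RANK() OVER (ORDER BY date_completed)     AS dense_rnk
    FROM lesson_completed
    ORDER BY date_completed, id
    """
).fetchall()

for row in rows:
    print(row)
# ('2020-07-01', 1, 1, 1)
# ('2020-07-02', 2, 2, 2)
# ('2020-07-02', 3, 2, 2)
# ('2020-07-03', 4, 4, 3)
```

With the tie, RANK() jumps from 2 to 4, so deducting it from 2020-07-03 would give 2020-06-29 instead of 2020-06-30 and split the streak. DENSE_RANK() avoids that jump, although COUNT(*) would then count lessons rather than days.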

The CTE then selects the column date_completed. I’ve used the DATEADD function (SQL Server) to deduct the row number from date_completed. In this statement ...

DATEADD (day, -RANK() OVER (ORDER BY date_completed), date_completed) AS date_group

... I had to define the interval that will be added (or deducted), which is the day. How many days do I want to deduct? The number of days equal to the row number, which is why I’ve simply copied the RANK() window function I had already defined and put a negative sign in front of it. From what do I want this to be deducted? From the date_completed, of course!

This part of the query will give the following result:

row_number	date_completed	date_group
1	2020-07-01	2020-06-30
2	2020-07-02	2020-06-30
3	2020-07-03	2020-06-30
4	2020-07-04	2020-06-30
5	2020-07-05	2020-06-30
6	2020-07-08	2020-07-02
7	2020-07-09	2020-07-02
8	2020-07-10	2020-07-02
9	2020-07-18	2020-07-09
10	2020-07-19	2020-07-09
11	2020-07-20	2020-07-09
12	2020-07-21	2020-07-09
13	2020-07-22	2020-07-09
14	2020-07-23	2020-07-09
15	2020-07-24	2020-07-09
16	2020-07-25	2020-07-09
17	2020-07-26	2020-07-09
18	2020-07-28	2020-07-10
19	2020-07-29	2020-07-10
20	2020-07-30	2020-07-10
21	2020-07-31	2020-07-10

Why do I need this? It will help me calculate the series length. Look at the table. Did you notice that the consecutive days belong to the same date group? Why is that? From one consecutive day to the next, both the date and the row number increase by one, so deducting the row number from the date always gives the same result. Look at this:

row_number	date_completed	date_group
1	2020-07-01	2020-06-30

Deduct the row number (1) from the date (2020-07-01) and you’ll get 2020-06-30. That’s precisely what I have in the table.

row_number	date_completed	date_group
1	2020-07-01	2020-06-30
2	2020-07-02	2020-06-30

And then the next row. If you deduct the row number (2) from the date (2020-07-02), the result is 2020-06-30 again! Now let’s see what happens when the dates are not consecutive.

row_number	date_completed	date_group
5	2020-07-05	2020-06-30
6	2020-07-08	2020-07-02

For row 5, the result is still the same, 2020-06-30. That’s fine, but what about the next row? Deduct row number 6 from the date (2020-07-08), and what do you get? 2020-07-02. This is now a new date group, which is not surprising since 2020-07-08 isn’t consecutive to 2020-07-05 in any calendar!
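The date arithmetic from the rows above can be double-checked with Python’s datetime module. This is just a sanity check of the subtraction, using a helper name (date_group) I made up to mirror the column name.

```python
from datetime import date, timedelta


def date_group(completed, row_number):
    """Deduct the row number, in days, from the completion date."""
    return completed - timedelta(days=row_number)


# (row_number, date_completed) pairs taken from the rows discussed above.
rows = [
    (1, date(2020, 7, 1)),
    (2, date(2020, 7, 2)),
    (5, date(2020, 7, 5)),
    (6, date(2020, 7, 8)),
]

labels = {n: date_group(d, n).isoformat() for n, d in rows}
print(labels)
# {1: '2020-06-30', 2: '2020-06-30', 5: '2020-06-30', 6: '2020-07-02'}
```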

A nice little trick, right? It really doesn’t matter which dates you get in the date_group column. They serve only as labels for the groups, and the rows in each group will be counted in the second part of the query. Count how many rows every group has, and you’ll have your series length!

Selecting Data from the CTE

The second part of the query selects the data from the CTE I’ve defined above:

SELECT	COUNT(*) AS days_streak,
		MIN (date_completed) AS min_date,
		MAX (date_completed) AS max_date
FROM groups
GROUP BY date_group;

This simple SELECT statement counts the number of rows and shows the results in the new column days_streak. Then it selects the minimum and maximum date in the column date_completed with the results shown in the columns min_date and max_date, respectively. All this data will be selected from the CTE named groups.

Finally, the data has to be grouped by the date_group. Why? Because I don’t need the total number of rows; I want the number of rows for each date group.

days_streak	min_date	max_date
5	2020-07-01	2020-07-05
3	2020-07-08	2020-07-10
9	2020-07-18	2020-07-26
4	2020-07-28	2020-07-31
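To see the difference between the total row count and the count per date group outside the database, here is a rough Python equivalent. It assumes one row per date, so that numbering the sorted dates from 1 matches what RANK() returns; the variable names are mine.

```python
from collections import Counter
from datetime import date, timedelta

# July 2020 completion dates, matching the example data in this article.
JULY_DAYS = [1, 2, 3, 4, 5, 8, 9, 10, 18, 19, 20, 21, 22, 23, 24, 25, 26, 28, 29, 30, 31]
completed = [date(2020, 7, d) for d in JULY_DAYS]

# Row numbers start at 1, like RANK() over unique, ordered dates.
date_groups = [d - timedelta(days=n) for n, d in enumerate(sorted(completed), start=1)]

total_rows = len(date_groups)          # COUNT(*) without GROUP BY
rows_per_group = Counter(date_groups)  # COUNT(*) with GROUP BY date_group

print(total_rows)  # 21
for group, count in sorted(rows_per_group.items()):
    print(group.isoformat(), count)
# 2020-06-30 5
# 2020-07-02 3
# 2020-07-09 9
# 2020-07-10 4
```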

That’s One Way to Calculate a Series Length ...

This little example should give you an idea of how to calculate the length of a series with SQL. However, there’s no single way to approach it. And, as you’ve seen, there’s no simple SQL function that will give you what you want. Your data and what you need to calculate will determine your SQL code. It usually requires a little trick; your SQL experience and knowledge will help you decide which one!
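As one example of a different approach, here is a sketch that uses LAG() instead of deducting a rank. This is not the method shown in this article: it flags every row whose previous date is not exactly one day earlier, then uses a running SUM() of those flags as a streak number. It runs on SQLite (3.25 or newer) through Python’s sqlite3 module, and the names flagged, numbered, starts_streak, and streak_id are my own.

```python
import sqlite3

# July 2020 completion dates, matching the example data in this article.
JULY_DAYS = [1, 2, 3, 4, 5, 8, 9, 10, 18, 19, 20, 21, 22, 23, 24, 25, 26, 28, 29, 30, 31]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE lesson_completed (id INTEGER PRIMARY KEY, date_completed TEXT)")
conn.executemany(
    "INSERT INTO lesson_completed (date_completed) VALUES (?)",
    [(f"2020-07-{d:02d}",) for d in JULY_DAYS],
)

LAG_QUERY = """
WITH flagged AS (
    SELECT date_completed,
           -- 0 when the previous row is exactly one day earlier, otherwise 1
           -- (the first row has no previous row, so it also gets 1)
           CASE
               WHEN LAG(date_completed) OVER (ORDER BY date_completed)
                    = date(date_completed, '-1 day')
               THEN 0
               ELSE 1
           END AS starts_streak
    FROM lesson_completed
),
numbered AS (
    SELECT date_completed,
           -- the running total of streak starts gives each streak its own number
           SUM(starts_streak) OVER (ORDER BY date_completed) AS streak_id
    FROM flagged
)
SELECT COUNT(*)            AS days_streak,
       MIN(date_completed) AS min_date,
       MAX(date_completed) AS max_date
FROM numbered
GROUP BY streak_id
ORDER BY streak_id;
"""

lag_streaks = conn.execute(LAG_QUERY).fetchall()
for row in lag_streaks:
    print(row)
```

On this data it should return the same four streaks as the rank-based query. Like that query, it assumes one row per date.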

Whichever approach you take, you’ll almost certainly need SQL window functions to calculate the length of a series. If you’re still not convinced you should learn window functions, read this interview with the course creator – she explains all the benefits of window functions.

This article showed you only a glimpse of what time series are. If you’re interested in learning how to analyze this specific type of data, LearnSQL has some very detailed guidance.

If this article helped you solve some time series problems, let me know in the comments section.
