16th Apr 2020 8 minutes read

Differences Between GROUP BY and PARTITION BY

Table of Contents

PARTITION BY vs. GROUP BY
GROUP BY
PARTITION BY
Window Functions
PARTITION BY and GROUP BY: Similarities and Differences

Window functions are a great addition to SQL, and they can make your life much easier if you know how to use them properly. Today, we will address the differences between a GROUP BY and a PARTITION BY. We’ll start with the very basics and slowly get you to a point where you can keep researching on your own.

PARTITION BY vs. GROUP BY

The PARTITION BY and the GROUP BY clauses are used frequently in SQL when you need to create a complex report. While returning the data itself is useful (and even needed) in many cases, more complex calculations are often required. This is where GROUP BY and PARTITION BY come in. Although they are very similar in that they both do grouping, there are key differences. We will analyze these differences in this article.

GROUP BY

The GROUP BY clause is used in SQL queries to define groups based on some given criteria. These criteria are what we usually find as categories in reports. Examples of criteria for grouping are:

group all employees by their annual salary level
group all trains by their first station
group incomes and expenses by month
group students according to the class in which they are enrolled

Using the GROUP BY clause transforms data into a new result set in which the original records are placed in different groups using the criteria we provide. You can check out more details on the GROUP BY clause in this article.

We can perform some additional actions or calculations on these groups, most of which are closely related to aggregate functions. As a quick review, aggregate functions are used to aggregate our data, and therefore in the process, we lose the original details in the query result. There are many aggregate functions, but the ones most commonly used are COUNT, SUM, AVG, MIN, and MAX.

If you want to practice using the GROUP BY clause, we recommend our interactive course Creating Reports in SQL. Aggregate functions and the GROUP BY clause are essential to writing reports in SQL.

Let’s consider the following example. Here we have the train table with the information about the trains, the journey table with the information about the journeys taken by the trains, and the route table with the information about the routes for the journeys. The tables are related through the journey table: journey.train_id refers to train.id, and journey.route_id refers to route.id.

Let’s run the following query which returns the information about trains and related journeys using the train and the journey tables.

SELECT
        train.id,
        train.model,
        journey.*
FROM train
INNER JOIN journey ON journey.train_id = train.id
ORDER BY
        train.id ASC;

Here is the result:

id	model	id	train_id	route_id	date
1	InterCity 100	1	1	1	1/3/2016
1	InterCity 100	25	1	5	1/3/2016
1	InterCity 100	2	1	2	1/4/2016
1	InterCity 100	3	1	3	1/5/2016
1	InterCity 100	4	1	4	1/6/2016
2	InterCity 100	6	2	3	1/4/2016
2	InterCity 100	7	2	4	1/5/2016
2	InterCity 100	8	2	5	1/6/2016
2	InterCity 100	5	2	2	1/3/2016
3	InterCity 125	10	3	5	1/4/2016
3	InterCity 125	11	3	5	1/5/2016
3	InterCity 125	29	3	4	1/3/2016
3	InterCity 125	27	3	3	1/5/2016
3	InterCity 125	12	3	6	1/6/2016
3	InterCity 125	9	3	3	1/3/2016
4	Pendolino 390	16	4	7	1/6/2016
4	Pendolino 390	13	4	4	1/4/2016
4	Pendolino 390	14	4	5	1/4/2016
4	Pendolino 390	15	4	6	1/5/2016
4	Pendolino 390	28	4	6	1/6/2016

You can see that the train with id = 1 has 5 different rows, the train with id = 2 has 4 different rows, etc.

Now, let’s run a query with the same two tables using a GROUP BY.

SELECT
        train.id,
        train.model,
        COUNT(*) AS routes
FROM train
INNER JOIN journey ON journey.train_id = train.id
GROUP BY
        train.id,
        train.model
ORDER BY
        train.id ASC;

And the result is the following:

id	model	routes
1	InterCity 100	5
2	InterCity 100	4
3	InterCity 125	6
4	Pendolino 390	5

From the query result, you can see that we have aggregated information, telling us the number of routes for each train. In the process, we lost the row-level details from the journey table.

You can compare this result set to the prior one and check that the number of rows returned from the first query (number of routes) matches the sum of the numbers in the aggregated column (routes) of the second query result.

You can use aggregate functions in a query without a GROUP BY clause, in which case the whole result set is treated as a single group, but most reports need GROUP BY. Combined with GROUP BY, aggregate functions work like this:

You generate groups using a GROUP BY clause by specifying one or more columns that have the same value within each group.
The aggregate function calculates the result for each group.
The original rows are “collapsed.” You can access the columns in the GROUP BY clause and the values produced by the aggregate functions, but the original row-level details are no longer there.
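The three steps above can be sketched with a small, self-contained script. This is only an illustration under stated assumptions: it uses Python's built-in sqlite3 module with an in-memory database, and the miniature train and journey tables contain made-up rows, not the article's actual data.

```python
import sqlite3

# Hypothetical miniature tables; the rows are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE train (id INTEGER PRIMARY KEY, model TEXT);
    CREATE TABLE journey (id INTEGER PRIMARY KEY, train_id INTEGER, route_id INTEGER);
    INSERT INTO train VALUES (1, 'InterCity 100'), (2, 'InterCity 125');
    INSERT INTO journey VALUES (1, 1, 1), (2, 1, 2), (3, 1, 3), (4, 2, 3), (5, 2, 4);
""")

# Before grouping: one row per journey, with row-level details intact.
detail_rows = conn.execute("""
    SELECT train.id, train.model, journey.route_id
    FROM train
    INNER JOIN journey ON journey.train_id = train.id
    ORDER BY train.id, journey.id
""").fetchall()

# After grouping: one row per train; journey.route_id is no longer available.
grouped_rows = conn.execute("""
    SELECT train.id, train.model, COUNT(*) AS routes
    FROM train
    INNER JOIN journey ON journey.train_id = train.id
    GROUP BY train.id, train.model
    ORDER BY train.id
""").fetchall()

# With no GROUP BY at all, the whole result set forms a single group.
total = conn.execute("SELECT COUNT(*) FROM journey").fetchone()[0]

print(detail_rows)
print(grouped_rows)
print(total)
```

In this sketch, the per-train counts add up to the number of detail rows, which is the same check described earlier for the article's own result sets.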

“Collapsing” the rows is fine in most cases. Sometimes, however, you need to combine the original row-level details with the values returned by the aggregate functions. This can be done with subqueries by linking the rows in the original table with the resulting set from the query using aggregate functions. Or, you could try a different approach—we will see this next.
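As a rough sketch of the subquery approach, the script below joins each journey row back to a grouped subquery so that the per-train count appears next to the row-level details. It assumes Python's built-in sqlite3 module and a made-up journey table; the alias per_train is a name chosen for this illustration.

```python
import sqlite3

# Hypothetical miniature journey table; the rows are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE journey (id INTEGER PRIMARY KEY, train_id INTEGER, route_id INTEGER);
    INSERT INTO journey VALUES (1, 1, 1), (2, 1, 2), (3, 1, 3), (4, 2, 3), (5, 2, 4);
""")

# The subquery collapses journeys to one row per train; joining it back to
# journey re-attaches the aggregated count to every original row.
rows = conn.execute("""
    SELECT journey.id, journey.train_id, journey.route_id, per_train.routes
    FROM journey
    INNER JOIN (
        SELECT train_id, COUNT(*) AS routes
        FROM journey
        GROUP BY train_id
    ) AS per_train ON per_train.train_id = journey.train_id
    ORDER BY journey.id
""").fetchall()

for row in rows:
    print(row)
```

This works, but every additional grouping level needs another subquery and another join, which is why such queries grow complicated quickly.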

PARTITION BY

Depending on what you need to do, you can use a PARTITION BY in your queries to calculate aggregated values on the defined groups. The PARTITION BY is combined with OVER() and window functions to calculate aggregated values. This is very similar to GROUP BY and aggregate functions, but with one important difference: when you use a PARTITION BY, the row-level details are preserved and not collapsed. That is, you still have the original row-level details as well as the aggregated values at your disposal. Standard aggregate functions such as COUNT, SUM, AVG, MIN, and MAX can be used as window functions.

Let’s look at the following query. In addition to train and journey, we now incorporate the route table as well.

SELECT
        train.id,
        train.model,
        route.name,
        route.from_city,
        route.to_city,
        COUNT(*) OVER (PARTITION BY train.id ORDER BY train.id) AS routes,
        COUNT(*) OVER () AS routes_total
FROM train
INNER JOIN journey ON journey.train_id = train.id
INNER JOIN route ON journey.route_id = route.id;

Here is the result of the query (only the first rows are shown; as the routes_total column indicates, the full result has 30 rows):

id	model	name	from_city	to_city	routes	routes_total
1	InterCity 100	Manchester Express	Sheffield	Manchester	5	30
1	InterCity 100	BeatlesRoute	Liverpool	York	5	30
1	InterCity 100	GoToLeads	Manchester	Leeds	5	30
1	InterCity 100	StudentRoute	London	Oxford	5	30
1	InterCity 100	MiddleEnglandWay	London	Leicester	5	30
2	InterCity 100	StudentRoute	London	Oxford	4	30
2	InterCity 100	MiddleEnglandWay	London	Leicester	4	30
2	InterCity 100	BeatlesRoute	Liverpool	York	4	30
2	InterCity 100	GoToLeads	Manchester	Leeds	4	30
3	InterCity 125	BeatlesRoute	Liverpool	York	6	30
3	InterCity 125	BeatlesRoute	Liverpool	York	6	30
3	InterCity 125	MiddleEnglandWay	London	Leicester	6	30
3	InterCity 125	StudentRoute	London	Oxford	6	30
3	InterCity 125	NewcastleDaily	York	Newcastle	6	30
3	InterCity 125	StudentRoute	London	Oxford	6	30
4	Pendolino 390	ScotlandSpeed	Newcastle	Edinburgh	5	30
4	Pendolino 390	MiddleEnglandWay	London	Leicester	5	30
4	Pendolino 390	BeatlesRoute	Liverpool	York	5	30
4	Pendolino 390	NewcastleDaily	York	Newcastle	5	30
4	Pendolino 390	NewcastleDaily	York	Newcastle	5	30
5	Pendolino ETR310	StudentRoute	London	Oxford	5	30

From the result set, we note several important points:

We did not use a GROUP BY but still obtained aggregated values (routes and routes_total).
We have the same columns (id and model) from the GROUP BY in the previous query, but the original row-level details were preserved. The aggregated values are repeated in all rows with the same values of id and model. This is expected; as an example, we have 5 journey records for id = 1, all of which have identical values for these columns.
We also have values in the columns name, from_city, and to_city that are different within a given value of id. Had we used a GROUP BY on the columns id and model, these row-level details would be lost.
COUNT(*) OVER () AS routes_total produced the same overall count, 30, that a plain COUNT(*) without a GROUP BY would return. In this result set, however, this value is included in each row.
The part COUNT(*) OVER (PARTITION BY train.id ORDER BY train.id) AS routes is very interesting. We have defined the group over which this window function should be used with the PARTITION BY clause. Therefore, in the routes column, we have a count of rows for only that group. (The ORDER BY train.id inside OVER has no effect on the count here, because all rows in a partition share the same train.id.) Window functions are evaluated after the rows have been joined and filtered by WHERE, so they see only the rows that remain and keep the row-level details while still defining the groups through PARTITION BY.
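The last point, that window functions only see rows that survive filtering, can be checked with a small sketch. It assumes Python's built-in sqlite3 module linked against SQLite 3.25 or newer (the first version with window functions); the table contents and the WHERE condition are made up for illustration.

```python
import sqlite3

# Hypothetical miniature journey table; the rows are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE journey (id INTEGER PRIMARY KEY, train_id INTEGER, route_id INTEGER);
    INSERT INTO journey VALUES (1, 1, 1), (2, 1, 2), (3, 1, 3), (4, 2, 3), (5, 2, 4);
""")

# The WHERE clause removes journeys 3 and 4 first; both window functions
# then count only the three rows that remain.
rows = conn.execute("""
    SELECT
        journey.id,
        journey.train_id,
        COUNT(*) OVER (PARTITION BY journey.train_id) AS routes,
        COUNT(*) OVER () AS routes_total
    FROM journey
    WHERE journey.route_id <> 3
    ORDER BY journey.id
""").fetchall()

for row in rows:
    print(row)
```

Without the WHERE clause, train 1 would show a count of 3 and the total would be 5; with it, the counts drop to 2 and 3 because the filtered rows never reach the window.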

Using standard aggregate functions as window functions with the OVER clause allows us to combine aggregated values and keep the values from the original rows. We can accomplish the same with GROUP BY and aggregate functions, but that requires a separate subquery for each group or partition.

It is important to note that all standard aggregate functions can be used as window functions like this.
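To illustrate, here is a minimal sketch that uses COUNT, MIN, and MAX as window functions and compares the values with a plain GROUP BY. It assumes Python's built-in sqlite3 module with SQLite 3.25 or newer; the journey rows, the ISO-formatted dates, and the aliases first_date and last_date are made up for this example.

```python
import sqlite3

# Hypothetical miniature journey table; dates are stored as ISO-8601 text.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE journey (id INTEGER PRIMARY KEY, train_id INTEGER, date TEXT);
    INSERT INTO journey VALUES
        (1, 1, '2016-01-03'), (2, 1, '2016-01-04'), (3, 1, '2016-01-05'),
        (4, 2, '2016-01-03'), (5, 2, '2016-01-06');
""")

# Aggregates used as window functions: every journey row is kept.
windowed = conn.execute("""
    SELECT
        id,
        train_id,
        date,
        COUNT(*)  OVER (PARTITION BY train_id) AS routes,
        MIN(date) OVER (PARTITION BY train_id) AS first_date,
        MAX(date) OVER (PARTITION BY train_id) AS last_date
    FROM journey
    ORDER BY id
""").fetchall()

# The same aggregates with GROUP BY: one row per train.
grouped = conn.execute("""
    SELECT train_id, COUNT(*), MIN(date), MAX(date)
    FROM journey
    GROUP BY train_id
    ORDER BY train_id
""").fetchall()

# Collapsing the windowed rows by hand reproduces the GROUP BY result.
collapsed = sorted({(r[1],) + r[3:] for r in windowed})

print(windowed)
print(grouped)
```

The aggregated values are identical in both queries; the only difference is whether they are attached to each original row or returned once per group.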

Window Functions

Besides aggregate functions, there are some other important window functions, such as:

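ROW_NUMBER(). Returns the sequence number of the row within its partition (or within the whole result set if no PARTITION BY is given), following the ORDER BY in the OVER clause.
RANK(). Similar to ROW_NUMBER(), but rows that tie get the same rank. RANK() takes no arguments; the rank order is determined by the ORDER BY inside the OVER clause. If two or more rows have the same value in the ordering column, these rows all get the same rank, and the following rank skips ahead by the number of tied rows; for example, if two rows share a rank of 10, the next rank will be 12.
DENSE_RANK(). Very similar to RANK(), except it doesn’t have “gaps.” In the previous example, if two rows share a rank of 10, the next rank will be 11.
NTILE(n). Divides the ordered rows into n groups that are as equal in size as possible and returns the group number of each row. It is used to assign rows to quartiles, deciles, or any other percentile buckets.
LAG and LEAD. Used to pull values from a previous (LAG) or a following (LEAD) row in the same partition; by default, this is the row immediately before or after the current one.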
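The functions listed above can be compared side by side in a short sketch. It assumes Python's built-in sqlite3 module with SQLite 3.25 or newer, and a hypothetical train_stats table (train_id, routes) invented for this example, with a deliberate tie between trains 1 and 4.

```python
import sqlite3

# Hypothetical table of journey counts per train; trains 1 and 4 tie on 5.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE train_stats (train_id INTEGER PRIMARY KEY, routes INTEGER);
    INSERT INTO train_stats VALUES (1, 5), (2, 4), (3, 6), (4, 5);
""")

# train_id is added as a tie-breaker wherever a deterministic order matters.
rows = conn.execute("""
    SELECT
        train_id,
        routes,
        ROW_NUMBER() OVER (ORDER BY routes DESC, train_id) AS row_num,
        RANK()       OVER (ORDER BY routes DESC)           AS rnk,
        DENSE_RANK() OVER (ORDER BY routes DESC)           AS dense_rnk,
        NTILE(2)     OVER (ORDER BY routes DESC, train_id) AS half,
        LAG(routes)  OVER (ORDER BY routes DESC, train_id) AS prev_routes,
        LEAD(routes) OVER (ORDER BY routes DESC, train_id) AS next_routes
    FROM train_stats
    ORDER BY routes DESC, train_id
""").fetchall()

for row in rows:
    print(row)
```

The tied trains share rank 2 under both RANK() and DENSE_RANK(), but the last train gets rank 4 from RANK() and rank 3 from DENSE_RANK(). LAG and LEAD return NULL (None in Python) where there is no previous or following row.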

There is no general rule about when you should use window functions, but you can develop a feel for them. I definitely recommend going through the Window Functions course; there, you will find all the details you will want to know!

PARTITION BY and GROUP BY: Similarities and Differences

Although we use a GROUP BY most of the time, there are numerous cases when a PARTITION BY would be a better choice. In some cases, you could combine a GROUP BY with subqueries to simulate a PARTITION BY, but this can lead to very complex queries.

Let’s wrap everything up with the most important similarities and differences:

Similarity: Both are used to return aggregated values.
Difference: Using a GROUP BY clause collapses original rows; for that reason, you cannot access the original values later in the query. On the other hand, using a PARTITION BY clause keeps original values while also allowing us to produce aggregated values.
Difference: The PARTITION BY is combined with OVER() and window functions, which add functionality that GROUP BY does not offer, such as ranking rows and accessing neighboring rows.
