Back to articles list Articles Cookbook
8 minutes read

Differences Between GROUP BY and PARTITION BY

Window functions are a great addition to SQL, and they can make your life much easier if you know how to use them properly. Today, we will address the differences between a GROUP BY and a PARTITION BY. We’ll start with the very basics and slowly get you to a point where you can keep researching on your own.

PARTITION BY vs. GROUP BY

The PARTITION BY and the GROUP BY clauses are used frequently in SQL when you need to create a complex report. While returning the data itself is useful (and even needed) in many cases, more complex calculations are often required. This is where GROUP BY and PARTITION BY come in. Although they are very similar in that they both do grouping, there are key differences. We will analyze these differences in this article.

GROUP BY

The GROUP BY clause is used in SQL queries to define groups based on some given criteria. These criteria are what we usually find as categories in reports. Examples of criteria for grouping are:

  • group all employees by their annual salary level
  • group all trains by their first station
  • group incomes and expenses by month
  • group students according to the class in which they are enrolled

Using the GROUP BY clause transforms data into a new result set in which the original records are placed in different groups using the criteria we provide. You can check out more details on the GROUP BY clause in this article.

We can perform some additional actions or calculations on these groups, most of which are closely related to aggregate functions. As a quick review, aggregate functions are used to aggregate our data, and therefore in the process, we lose the original details in the query result. There are many aggregate functions, but the ones most commonly used are COUNT, SUM, AVG, MIN, and MAX.

If you want to practice using the GROUP BY clause, we recommend our interactive course Creating Reports in SQL. Aggregate functions and the GROUP BY clause are essential to writing reports in SQL.

Let’s consider the following example. Here we have the train table with the information about the trains, the journey table with the information about the journeys taken by the trains, and the route table with the information about the routes for the journeys. See below—take a look at the data and how the tables are related:

table_train table_journey table_route

Let’s run the following query which returns the information about trains and related journeys using the train and the journey tables.

SELECT
        train.id,
        train.model,
        journey.*
FROM train
INNER JOIN journey ON journey.train_id = train.id
ORDER BY
        train.id ASC;

Here is the result:

idmodelidtrain_idroute_iddate
1InterCity 1001111/3/2016
1InterCity 10025151/3/2016
1InterCity 1002121/4/2016
1InterCity 1003131/5/2016
1InterCity 1004141/6/2016
2InterCity 1006231/4/2016
2InterCity 1007241/5/2016
2InterCity 1008251/6/2016
2InterCity 1005221/3/2016
3InterCity 12510351/4/2016
3InterCity 12511351/5/2016
3InterCity 12529341/3/2016
3InterCity 12527331/5/2016
3InterCity 12512361/6/2016
3InterCity 1259331/3/2016
4Pendolino 39016471/6/2016
4Pendolino 39013441/4/2016
4Pendolino 39014451/4/2016
4Pendolino 39015461/5/2016
4Pendolino 39028461/6/2016

You can see that the train with id = 1 has 5 different rows, the train with id = 2 has 4 different rows, etc.

Now, let’s run a query with the same two tables using a GROUP BY.

SELECT
  	train.id,
	train.model,
	COUNT(*) AS routes
FROM train
INNER JOIN journey ON journey.train_id = train.id
GROUP BY
  	train.id,
	train.model
ORDER BY
  	train.id ASC;

And the result is the following:

idmodelroutes
1InterCity 1005
2InterCity 1004
3InterCity 1256
4Pendolino 3905

From the query result, you can see that we have aggregated information, telling us the number of routes for each train. In the process, we lost the row-level details from the journey table.

You can compare this result set to the prior one and check that the number of rows returned from the first query (number of routes) matches the sum of the numbers in the aggregated column (routes) of the second query result.

Although you can use aggregate functions in a query without a GROUP BY clause, it is necessary in most cases. Aggregate functions work like this:

  1. You generate groups using a GROUP BY statement by specifying one or more columns that have the same value within each group.
  2. The aggregate function calculates the result.
  3. The original rows are “collapsed.” You can access the columns in the GROUP BY statement and the values produced by the aggregate functions, but the original row-level details are no longer there.

“Collapsing” the rows is fine in most cases. Sometimes, however, you need to combine the original row-level details with the values returned by the aggregate functions. This can be done with subqueries by linking the rows in the original table with the resulting set from the query using aggregate functions. Or, you could try a different approach—we will see this next.

PARTITION BY

Depending on what you need to do, you can use a PARTITION BY in our queries to calculate aggregated values on the defined groups. The PARTITION BY is combined with OVER() and windows functions to calculate aggregated values. This is very similar to GROUP BY and aggregate functions, but with one important difference: when you use a PARTITION BY, the row-level details are preserved and not collapsed. That is, you still have the original row-level details as well as the aggregated values at your disposal. All aggregate functions can be used as window functions.

Let’s look at the following query. In addition to train and journey, we now incorporate the route table as well.

SELECT
  	train.id,
	train.model,
	route.name,
	route.from_city,
	route.to_city,
	COUNT(*) OVER (PARTITION BY train.id ORDER BY train.id) AS routes,
	COUNT(*) OVER () AS routes_total
FROM train
INNER JOIN journey ON journey.train_id = train.id
INNER JOIN route ON journey.route_id = route.id;

Here is the result of the query:

idmodelnamefrom_cityto_cityroutesroutes_total
1InterCity 100Manchester ExpressSheffieldManchester530
1InterCity 100BeatlesRouteLiverpoolYork530
1InterCity 100GoToLeadsManchesterLeeds530
1InterCity 100StudentRouteLondonOxford530
1InterCity 100MiddleEnglandWayLondonLeicester530
2InterCity 100StudentRouteLondonOxford430
2InterCity 100MiddleEnglandWayLondonLeicester430
2InterCity 100BeatlesRouteLiverpoolYork430
2InterCity 100GoToLeadsManchesterLeeds430
3InterCity 125BeatlesRouteLiverpoolYork630
3InterCity 125BeatlesRouteLiverpoolYork630
3InterCity 125MiddleEnglandWayLondonLeicester630
3InterCity 125StudentRouteLondonOxford630
3InterCity 125NewcastleDailyYorkNewcastle630
3InterCity 125StudentRouteLondonOxford630
4Pendolino 390ScotlandSpeedNewcastleEdinburgh530
4Pendolino 390MiddleEnglandWayLondonLeicester530
4Pendolino 390BeatlesRouteLiverpoolYork530
4Pendolino 390NewcastleDailyYorkNewcastle530
4Pendolino 390NewcastleDailyYorkNewcastle530
5Pendolino ETR310StudentRouteLondonOxford530

From the result set, we note several important points:

  • We did not use a GROUP BY but still obtained aggregated values (routes and routes_total).
  • We have the same columns (id and model) from the GROUP BY in the previous query, but the original row-level details were preserved. The aggregated values are repeated in all rows with the same values of id and model. This is expected; as an example, we have 5 journey records for id = 1, all of which have identical values for these columns.
  • We also have values in the columns name, from_city, and to_city that are different within a given value of id. Had we used a GROUP BY on the columns id and model, these row-level details would be lost.
  • COUNT(*) OVER () AS routes_total produced the same aggregate count, 30, as COUNT and GROUP BY would do. In this result set, however, this value is included in each row.
  • The part COUNT(*) OVER (PARTITION BY train.id ORDER BY train.id) AS routes is very interesting. We have defined the group over which this window function should be used with the PARTITION BY clause. Therefore, in the routes column, we have a count of rows for only that group. Window functions are applied after the rows are filtered, thereby keeping row-level details while still defining the groups through PARTITION BY.

Using standard aggregate functions as window functions with the OVER() keyword allows us to combine aggregated values and keep the values from the original rows. We can accomplish the same using aggregate functions, but that requires subqueries for each group or partition.

It is important to note that all standard aggregate functions can be used as window functions like this.

Window Functions

Besides aggregate functions, there are some other important window functions, such as:

  • ROW_NUMBER(). Returns the sequence number of the row in the result set.
  • RANK(). Similar to ROW_NUMBER(), but can take a column as an argument. The rank order is determined over the value of this column. If two or more rows have the same value in this column, these rows all get the same rank. The next rank will continue from the equivalent number of rows up; for example, if two rows share a rank of 10, the next rank will be 12.
  • DENSE_RANK(). Very similar to RANK(), except it doesn’t have “gaps.” In the previous example, if two rows share a rank of 10, the next rank will be 11.
  • NTILE. Used to calculate quartiles, deciles, or any other percentiles.
  • LAG & LEAD. Used to pull values from the previous (LAG) or the following (LEAD) row.

There is no general rule about when you should use window functions, but you can develop a feel for them. I definitely recommend going through the Window Functions course; there, you will find all the details you will want to know!

PARTITION BY and GROUP BY: Similarities and Differences

Although we use a GROUP BY most of the time, there are numerous cases when a PARTITION BY would be a better choice. In some cases, you could use a GROUP BY using subqueries to simulate a PARTITION BY, but these can end up with very complex queries.

Let’s wrap everything up with the most important similarities and differences:

  • Similarity: Both are used to return aggregated values.
  • Difference: Using a GROUP BY clause collapses original rows; for that reason, you cannot access the original values later in the query. On the other hand, using a PARTITION BY clause keeps original values while also allowing us to produce aggregated values.
  • Difference: The PARTITION BY is combined with OVER() and windows functions to add a lot more functionalities.