18th Apr 2024 Updated: 29th Oct 2024 17 minutes read

GROUP BY and Aggregate Functions: A Complete Overview

Table of Contents

Syntax of GROUP BY with Aggregate Functions
SQL Aggregate Functions
Using GROUP BY with SQL Aggregate Functions
Advanced GROUP BY Techniques
Common Mistakes When Using GROUP BY
Further GROUP BY Resources

SQL’s GROUP BY and aggregate functions are essential in data aggregation – and for analyzing data and creating reports. Let’s explore them together!

In SQL, GROUP BY and aggregate functions are one of the language’s most popular features. Data aggregation is critical for data analysis and reporting; to make sense of all the data in a dataset, it often needs to be aggregated.

But what is data aggregation? Put simply, it’s when you group data based on common values and perform a calculation for each data group. For example, you might group a list of dogs by their breed and then use a calculation to figure out the average height for each breed. This task is made possible by the GROUP BY clause and the aggregate function AVG().

If you need to refresh your knowledge about GROUP BY, our SQL Basics course is ideal. It has 129 interactive exercises that you can finish in about 10 hours. It covers topics such as filtering data, using logical operators, joining, and using set operators.

Syntax of GROUP BY with Aggregate Functions

GROUP BY is a clause in SQL that arranges data with the same values into groups. Grouping is done by column(s), where all the rows with the same value in that column belong to one group. You can then perform summary computations – such as counting, summing, or averaging values – for each group. For example, you can use GROUP BY to compute the number of employees in a department, the total salary or average salary per department, etc.

The syntax of GROUP BY with an aggregate function is:

SELECT 
  column1, 
  column2, 
  ..., 
  aggregate_function(column_name)
FROM table_name
WHERE filtering_condition
GROUP BY column1, column2, …
HAVING filtering_condition;

The aggregate function is written in the SELECT statement, and the function’s result will be shown as an additional column.

The GROUP BY clause contains the columns by which you want your output to be grouped. This clause is often used with the WHERE and HAVING clauses for filtering. In the syntax, WHERE comes before GROUP BY, while HAVING comes after it. (I’ll explain more about these two clauses later.) Also, your data will sometimes contain NULLs, so you’ll be interested to learn how GROUP BY treats NULLs.

An example of GROUP BY in action is shown in the following GIF.

This basic example groups data by the film director and shows the number of films each has made.

SQL Aggregate Functions

SQL’s aggregate functions are used when you want to do computations on data and return a single value.

The most common aggregate functions in SQL are:

COUNT() – Counts the rows in the stated column.
SUM() – Returns the sum of all values.
AVG() – Computes the average of a group of values.
MIN() – Returns the minimum (the lowest) of the given values.
MAX() – Returns the maximum (the highest) of the given values

See our guides on how to use these functions to compute aggregates for the whole table:

It’s important to note that the COUNT() function has several different use variations:

COUNT(*) counts all rows in a result set, including NULL values and duplicates.
COUNT(expression) counts non-NULL values in a result set.
COUNT(DISTINCT) counts distinct values and ignores duplicates.

There are more details in our guide What is the Difference Between COUNT(*), COUNT(1), COUNT(column), and COUNT(DISTINCT)?

Using GROUP BY with SQL Aggregate Functions

I’ll use the table albums to show how aggregate functions and GROUP BY work together. The script for creating the table is here.

This table contains data about musical albums. Here’s a snapshot:

id	artist	album	release_year	length	label
1	Grant Green	The Latin Bit	1963	0:38:56	Blue Note Records
2	AC/DC	High Voltage	1976	0:44:23	Atlantic Records
3	Brother Jack McDuff	To Seek a New Home	1970	0:33:12	Blue Note Records
4	Grant Green	Green Is Beautiful	1970	0:37:33	Blue Note Records
5	Wayne Shorter	Moto Grosso Feio	1974	0:42:22	Blue Note Records

Example 1: COUNT() with GROUP BY

Here’s an example code that counts the number of albums by label:

SELECT 
  label,
  COUNT(*) AS number_of_albums
FROM albums
GROUP BY label;

I select the label and use COUNT(*) to find the number of albums. The asterisk (*) is a shorthand for counting all rows. When I group the output by label, COUNT() will show the number of albums per label.

Here’s the output:

label	number_of_albums
Atlantic Records	13
Blue Note Records	12
Stax Records	14

Further reading:

Example 2: SUM() with GROUP BY

The following code sums the album lengths to show the total music available per artist:

SELECT 
  artist,
  SUM(length) AS total_music_available
FROM albums
GROUP BY artist;

The SUM() function is applied to the length column, and the output is then grouped by the artist.

This is the code result:

artist	total_music_available
Isaac Hayes	6:30:02
Otis Redding	1:34:09
Brother Jack McDuff	1:58:11
Aretha Franklin	1:47:07
Grant Green	3:10:11
John Prine	1:21:57
Led Zeppelin	3:32:07
Wayne Shorter	2:38:02
Albert King	3:08:28
AC/DC	2:05:23

Further reading:

Example 3: AVG() with GROUP BY

This code calculates each artist’s average album length:

SELECT 
  artist,
  AVG(length) AS average_album_length
FROM albums
GROUP BY artist;

To get the desired result, you need to use AVG() on the column length and group the output by artist.

Here’s what the output looks like:

artist	average_album_length
Isaac Hayes	1:05:00
Otis Redding	0:31:23
Brother Jack McDuff	0:39:24
Aretha Franklin	0:35:42
Grant Green	0:38:02
John Prine	0:40:59
Led Zeppelin	0:42:25
Wayne Shorter	0:39:31
Albert King	0:37:42
AC/DC	0:41:48

Further reading:

Example 4: MIN() and MAX() with GROUP BY

Let’s find the shortest and the longest album length for each year:

SELECT 
  release_year,
  MIN(length) AS minimum_album_length,
  MAX(length) AS maximum_album_length
FROM albums
GROUP BY release_year
ORDER BY release_year;

As in earlier examples, the aggregate functions are used on the length column; MIN() for the shortest and MAX() for the longest album.

The output is grouped by release year.

I also added the ORDER BY clause to sort the output from the earliest to the latest year so it would be more readable.

Here’s the code output:

release_year	minimum_album_length	maximum_album_length
1962	0:34:49	0:34:49
1963	0:38:56	0:38:56
1964	0:30:17	0:30:17
1965	0:32:22	0:42:45
1967	0:32:51	0:41:08
1968	0:29:30	0:29:30
1969	0:31:30	0:46:00
1970	0:33:12	0:43:04
1971	0:42:37	1:33:38
1972	0:37:50	0:44:46
1973	0:40:57	1:43:10
1974	0:42:22	0:46:00
1976	0:39:59	0:44:23
1977	0:41:01	0:41:01

Further reading:

Advanced GROUP BY Techniques

The previous section dealt with the basics of using SQL aggregate functions and GROUP BY. With these foundations set, we can now explore some advanced GROUP BY techniques.

HAVING

To filter aggregated results, you need to use the HAVING clause. Its syntax is shown here:

SELECT 
  column1, 
  column2, 
  ..., 
  aggregate_function(column_name)
FROM table_name
GROUP BY column1, column2, …
HAVING filtering_condition;

HAVING compares the aggregated values to the filtering condition and returns only the values that satisfy that condition. You can learn more in our article about the HAVING clause.

Let me show you an example. The following code calculates the average album length per artist, but it shows only those whose average album length is under 40 minutes.

SELECT 
  artist,
  AVG(length) AS average_album_length
FROM albums
GROUP BY artist
HAVING AVG(length) < '00:40:00';

I used the AVG() function just the way I did earlier. The HAVING clause is new. To filter data in the required way, I wrote in HAVING the same average calculation as the one appearing in SELECT. Then, I use the comparison operator ‘less than’ (<) to compare each grouped result with a value (in this case, 40 minutes). This value is enclosed in single quotes (''). The value in the quotes has to be written in the format HH:MM:SS (hours:minutes:seconds) because the values in the column length are of the INTERVAL data type.

Here’s the output:

artist	average_album_length
Otis Redding	0:31:23
Brother Jack McDuff	0:39:24
Aretha Franklin	0:35:42
Grant Green	0:38:02
Wayne Shorter	0:39:31
Albert King	0:37:42

The principle is the same no matter what aggregate function you use, but here are the cookbooks to help you with some other functions:

Grouping by Multiple Columns

So far, I have grouped data by one column. However, it’s also possible to group by two or more columns.

It’s not complicated: You just have to list all the grouping columns in GROUP BY and separate them with commas.

By doing that, you group by the column values but also by the combination of values across all the grouping columns. If you have value A in column_1 and value B in column_2, this is one group. If there are, for example, values A in column_1 with value C in column_2, this will form another group – despite the values in column_1 being the same as in the earlier example.

The code below shows the number of albums by the artist and the year of release:

SELECT 
  artist,
  release_year,
  COUNT(*) AS number_of_albums
FROM albums
GROUP BY artist, release_year
ORDER BY artist, release_year;

I’m using COUNT(*) to find the number of albums. The output is grouped by artist and release year. As you can see, those are exactly the columns in GROUP BY and they are separated by a comma.

To make the output more readable, I sorted data by artist alphabetically and from the earliest to the latest release year.

This is the output snapshot:

artist	release_year	number_of_albums
AC/DC	1976	2
AC/DC	1977	1
Albert King	1962	1
Albert King	1967	1
Albert King	1969	1
Albert King	1972	1
Albert King	1974	1
Aretha Franklin	1967	1
Aretha Franklin	1968	1
Aretha Franklin	1972	1
Brother Jack McDuff	1969	2
Brother Jack McDuff	1970	1

Two Aggregate Functions in One Query

It’s also possible to write a query that has two aggregate functions. No, it’s not what you think: you don’t write one aggregate function within another. That’s not possible in SQL.

Two aggregate functions in one query can mean two things:

Simple approach: Using two aggregate functions in two columns of the same SELECT.
Advanced approach: Using the result of one aggregate function in another aggregate function.

One Query, Two Aggregate Functions: The Simple Approach

So, the simple approach is when you have two aggregate functions in two different aggregate columns in one SELECT. They can be the same function applied on different columns, different functions used on the same column, or different functions used on different columns. It really doesn’t matter; the point is you have two aggregations in one query. These aggregations don’t ‘communicate’, i.e. they don’t depend on one another in any way.

In the example, I’ll use the table package_deliveries with the script linked here.

The table shows information about the number of packages delivered daily to several cities.

Here are the first six rows in the table:

id	date_of_delivery	number_of_packages	city_of_delivery	country_of_delivery
1	2024-03-01	49	Krakow	Poland
2	2024-03-01	528	Warsaw	Poland
3	2024-03-01	158	Gdansk	Poland
4	2024-03-01	28	Rijeka	Croatia
5	2024-03-01	97	Split	Croatia
6	2024-03-01	64	Zagreb	Croatia

The code below calculates the total and the average number of packages by city.

SELECT 
  city_of_delivery,
  SUM(number_of_packages) AS total_number_of_packages,
  AVG(number_of_packages) AS average_number_of_packages
FROM package_deliveries
GROUP BY city_of_delivery;

You can see the query has two aggregates: one with SUM() and one with AVG(). It really doesn’t matter how many aggregate functions you write in the query. The only important thing is that each aggregation is treated as a separate code line or output column.

Here’s the result:

city_of_delivery	total_number_of_packages	average_number_of_packages
Split	531	177.00
Izmir	5,936	1,978.67
Ankara	9,139	3,046.33
Gdansk	712	237.33
Rijeka	627	209.00
Zagreb	930	310.00
Istanbul	13,857	4,619.00
Krakow	673	224.33
Warsaw	2,358	786.00

One Query, Two Aggregate Functions: The Advanced Approach

The advanced version is when the aggregations ‘communicate’, meaning you first need to find one aggregated value and then use it to calculate another aggregated value. Typically, this is achieved either through subqueries or Common Table Expressions (CTEs).

I’ll show you subquery usage in the example below. Our guide on combining two aggregate functions in GROUP BY covers both approaches.

The following code returns the average daily number of packages delivered in each country:

SELECT 
  country_of_delivery,
  AVG(pd.daily_number_of_packages) AS average_daily_number_of_packages
FROM 
  (SELECT 
     date_of_delivery, 
     country_of_delivery,
     SUM(number_of_packages) AS daily_number_of_packages
  FROM package_deliveries
  GROUP BY date_of_delivery, country_of_delivery) AS pd
GROUP BY country_of_delivery;

I first need to find the number of packages delivered daily in each country, which is the first aggregation. The second aggregation uses this result and calculates the daily average by country.

The first aggregation is written in the subquery. It uses the SUM() function and lists the columns date_of_delivery and country_of_delivery in SELECT. The output is grouped by the same columns.

Now, I can write the main query, referencing in AVG() the column daily_number_of_packages from the subquery. I group the output by country and get the average daily number of packages delivered for each country.

country_of_delivery	average_daily_number_of_packages
Turkey	9,644.00
Croatia	696.00
Poland	1,247.67

GROUP BY with Conditional Aggregates

The CASE or CASE WHEN statement labels data based on its values. This statement can also be used with GROUP BY.

Grouping data by conditional aggregates means using CASE WHEN with GROUP BY. However – and this is important – the CASE WHEN you write in SELECT also has to appear in GROUP BY.

Using the table from the previous example, I can write a code that categorizes countries into EU and non-EU countries and shows the total number of packages delivered for each category:

SELECT
  CASE 
    WHEN country_of_delivery = 'Turkey' THEN 'Non-EU'
    ELSE 'EU'
  END,
  SUM(number_of_packages) AS total_number_of_packages
FROM package_deliveries
GROUP BY CASE 
   WHEN country_of_delivery = 'Turkey' THEN 'Non-EU'
   ELSE 'EU'
  END;

The CASE statement says that Turkey should be categorized as a non-EU country. This condition is stated in WHEN. The category for all other countries is EU, which is written in ELSE.

The CASE statement starts with CASE (hence the name) and ends with the keyword END.

I use the SUM() function to calculate the total number of packages. To show this value by category, I just copied the whole CASE WHEN statement in GROUP BY.

Here’s the output:

case	total_number_of_packages
Non-EU	28,932
EU	5,831

If you need more examples to understand this fully, read one of our guides:

GROUP BY Extensions: ROLLUP, CUBE, GROUPING SETS

The three SQL GROUP BY extensions are:

ROLLUP – Creates a summary row for the groups listed in GROUP BY.
CUBE – Creates subtotals for all the combinations of the groups in GROUP BY.
GROUPING SETS – Aggregates output on multiple levels.

In the example, I’ll show you how GROUPING SETS works. You can learn how the other two extensions work in our GROUP BY extensions guide.

The following code returns the number of packages delivered per country and per date:

SELECT 
  country_of_delivery,
  date_of_delivery,
  SUM(number_of_packages) AS total_number_of_packages
FROM package_deliveries
GROUP BY GROUPING SETS(country_of_delivery, date_of_delivery)
ORDER BY country_of_delivery, date_of_delivery;

I start by using SUM() to calculate the number of packages.

In GROUP BY, I write GROUPING SETS with all the aggregation levels inside the parentheses.

The output shows the number of delivered packages by country and by delivery date:

country_of_delivery	date_of_delivery	total_number_of_packages
Croatia	NULL	2,088
Poland	NULL	3,743
Turkey	NULL	28,932
NULL	2024-03-01	2,730
NULL	2024-03-02	11,208
NULL	2024-03-03	20,825

It can help to think of grouping sets as two separate queries. The first query groups by country …

SELECT 
  country_of_delivery,
  NULL AS date_of_delivery,
  SUM(number_of_packages) AS total_number_of_packages
FROM package_deliveries
GROUP BY country_of_delivery
ORDER BY country_of_delivery;

… and returns the first three rows of the previous output:

country_of_delivery	date_of_delivery	total_number_of_packages
Croatia	NULL	2,088
Poland	NULL	3,743
Turkey	NULL	28,932

The second query groups by delivery date …

SELECT 
  NULL AS country_of_delivery,
  date_of_delivery,
  SUM(number_of_packages) AS total_number_of_packages
FROM package_deliveries
GROUP BY date_of_delivery
ORDER BY date_of_delivery;

… and outputs the other three rows of the original result:

country_of_delivery	date_of_delivery	total_number_of_packages
NULL	2024-03-01	2,730
NULL	2024-03-02	11,208
NULL	2024-03-03	20,825

In other words, if you merge these two separate grouping queries using UNION …

SELECT 
  country_of_delivery,
  NULL AS date_of_delivery,
  SUM(number_of_packages) AS total_number_of_packages
FROM package_deliveries
GROUP BY country_of_delivery

UNION 

SELECT 
  NULL AS country_of_delivery,
  date_of_delivery,
  SUM(number_of_packages) AS total_number_of_packages
FROM package_deliveries
GROUP BY date_of_delivery
ORDER BY country_of_delivery, date_of_delivery;

… you get exactly the same result as with GROUPING SETS.

country_of_delivery	date_of_delivery	total_number_of_packages
Croatia	NULL	2,088
Poland	NULL	3,743
Turkey	NULL	28,932
NULL	2024-03-01	2,730
NULL	2024-03-02	11,208
NULL	2024-03-03	20,825

Common Mistakes When Using GROUP BY

Confusing ORDER BY and GROUP BY

This is often a matter of not knowing what GROUP BY does.

Let me demonstrate with an example.

Say you want to find the total number of packages by country of delivery.

If you’re not familiar with GROUP BY, you might write a naive solution using ORDER BY.

SELECT 
  date_of_delivery,
  city_of_delivery,
  country_of_delivery
FROM package_deliveries
ORDER BY country_of_delivery;

Now that you have the data sorted by country, as shown in the snapshot below, you decide to manually sum the values to get the result by country.

date_of_delivery	city_of_delivery	country_of_delivery
2024-03-01	Rijeka	Croatia
2024-03-01	Split	Croatia
2024-03-01	Zagreb	Croatia
2024-03-02	Rijeka	Croatia
2024-03-02	Split	Croatia
2024-03-02	Zagreb	Croatia
2024-03-03	Zagreb	Croatia
2024-03-03	Split	Croatia
2024-03-03	Rijeka	Croatia
2024-03-01	Krakow	Poland
2024-03-01	Warsaw	Poland

But why would you manually do something that GROUP BY does automatically? Instead of ordering data by country, you should group by it and use SUM() to sum the data:

SELECT 
  country_of_delivery,
  SUM(number_of_packages) AS total_number_of_packages
FROM package_deliveries
GROUP BY country_of_delivery
ORDER BY country_of_delivery;

You immediately get the output you need:

country_of_delivery	total_number_of_packages
Croatia	2,088
Poland	3,743
Turkey	28,932

Confusing WHERE and HAVING

Another common mistake is trying to filter aggregated values using WHERE. That is not possible — WHERE is used to filter individual rows before aggregation. On the other hand, HAVING is for filtering groups of rows after aggregation.

If you want to get the average number of packages delivered daily by city and show only those with more than 500 packages, you might decide to write this query:

SELECT 
  city_of_delivery,
  AVG(number_of_packages) AS average_number_of_packages
FROM package_deliveries
WHERE AVG(number_of_packages) > 500
GROUP BY city_of_delivery;

It will return an error because WHERE doesn’t accept an aggregate function as an argument. (This doesn’t mean you can't ever use WHERE with GROUP BY; you can, but not to filter groups.)

In this case, you should use HAVING:

SELECT 
  city_of_delivery,
  AVG(number_of_packages) AS average_number_of_packages
FROM package_deliveries
GROUP BY city_of_delivery
HAVING AVG(number_of_packages) > 500;

It’s the same query with AVG() and GROUP BY. The only difference is HAVING, where you compare the AVG() calculation with 500.

The output shows only cities whose daily average is above 500.

city_of_delivery	average_number_of_packages
Izmir	1,978.67
Ankara	3,046.33
Istanbul	4,619.00
Warsaw	786.00

Learn more about this in the article about HAVING vs. WHERE.

Listing Unaggregated Columns in SELECT but Not in WHERE

The simple rule you should remember is whatever unaggregated column you write in SELECT should also be included in GROUP BY. If you don’t do that, the database won’t know which value to show if several different values are within the same group.

For example, you write the query that is supposed to calculate the total number of packages by country and each city.

SELECT 
  country_of_delivery,
  city_of_delivery,
  SUM(number_of_packages) total_number_of_packages
FROM package_deliveries
GROUP BY country_of_delivery;

Sure, you have country and city in SELECT, but you forgot to put the city in GROUP BY. This query won’t run. It will return an error. While the messages are worded differently in PostgreSQL, Oracle, and other databases, their message is the same: the column city_of_delivery must appear in GROUP BY.

Even if it ran, the database would be confused. It would think something like this: “You want to show cities but don’t want to group by city? How am I supposed to know which city to show in the output when every country has three of them? Please put cities in GROUP BY so I can show each city as a separate group.”

So, you help it by including the city of delivery in GROUP BY:

SELECT 
  country_of_delivery,
  city_of_delivery,
  SUM(number_of_packages) total_number_of_packages
FROM package_deliveries
GROUP BY country_of_delivery, city_of_delivery;

Yes, it’s grouping by two columns and you’ve already learned how to do that. Here’s the query’s output:

country_of_delivery	city_of_delivery	total_number_of_packages
Croatia	Zagreb	930
Croatia	Rijeka	627
Croatia	Split	531
Turkey	Istanbul	13,857
Poland	Warsaw	2,358
Poland	Krakow	673
Turkey	Ankara	9,139
Poland	Gdansk	712
Turkey	Izmir	5,936

Other Errors

There are also some other errors, such as forgetting GROUP BY with aggregate functions or not grouping by a unique key. Those and several more are covered in this article about common GROUP BY errors.

Further GROUP BY Resources

After reading this overview of the SQL GROUP BY clause, you should have an idea of its uses and challenges. But if you want to master GROUP BY, you’ll need to go into more detail.

The best way to do that is by checking out some of our courses and cookbooks. Here are some of my suggestions for further learning:

Creating Basic SQL Reports – This intermediate-level course focuses on the nuances of using GROUP BY – grouping by multiple columns, using HAVING, combining GROUP BY with CASE WHEN, the difference between COUNT(*) and COUNT(id), etc.
GROUP BY Extensions in SQL – This course covers CUBE, ROLLUP, and GROUPING SETS.
SQL Practice – The courses in this track have sections dedicated to different SQL topics, including quite a lot of GROUP BY practice.

And don’t forget about our articles featuring GROUP BY examples. Two of my favorites are SQL Practice: 10 GROUP BY Practice Exercises with Detailed Solutions and Top 9 SQL GROUP BY Interview Questions.

In addition to that, there’s an SQL for Data Analysis Cheat Sheet, which has a section dedicated to GROUP BY.

Happy learning!

Tags:

Syntax of GROUP BY with Aggregate Functions

SQL Aggregate Functions

Using GROUP BY with SQL Aggregate Functions

Example 1: COUNT() with GROUP BY

Example 2: SUM() with GROUP BY

Example 3: AVG() with GROUP BY

Example 4: MIN() and MAX() with GROUP BY

Advanced GROUP BY Techniques

HAVING

Grouping by Multiple Columns

Two Aggregate Functions in One Query

One Query, Two Aggregate Functions: The Simple Approach

One Query, Two Aggregate Functions: The Advanced Approach

GROUP BY with Conditional Aggregates

GROUP BY Extensions: ROLLUP, CUBE, GROUPING SETS

Common Mistakes When Using GROUP BY

Confusing ORDER BY and GROUP BY

Confusing WHERE and HAVING

Listing Unaggregated Columns in SELECT but Not in WHERE

Other Errors

Further GROUP BY Resources

You may also like