18th Apr 2024 17 minutes read GROUP BY and Aggregate Functions: A Complete Overview Tihomir Babic aggregate functions GROUP BY Table of Contents The Basic Syntax of GROUP BY with Aggregates SQL Aggregate Functions Using GROUP BY with SQL Aggregate Functions Example 1: COUNT() with GROUP BY Example 2: SUM() with GROUP BY Example 3: AVG() with GROUP BY Example 4: MIN() and MAX() with GROUP BY Advanced GROUP BY Techniques Filtering Aggregated Results with HAVING Grouping by Multiple Columns Example Two Aggregate Functions in One Query One Query, Two Aggregate Functions: The Simple Approach Example One Query, Two Aggregate Functions: The Advanced Approach Example GROUP BY with Conditional Aggregates Example GROUP BY Extensions: ROLLUP, CUBE, GROUPING SETS Example Common Mistakes When Using GROUP BY Confusing ORDER BY and GROUP BY Confusing WHERE and HAVING Example Listing Unaggregated Columns in SELECT but Not in WHERE Example Other Errors Further SQL GROUP BY Resources SQL’s GROUP BY and aggregate functions are essential in data aggregation – and for analyzing data and creating reports. Let’s explore them together! In SQL, GROUP BY and aggregate functions are one of the language’s most popular features. Data aggregation is critical for data analysis and reporting; to make sense of all the data in a dataset, it often needs to be aggregated. But what is data aggregation? Put simply, it’s when you group data based on common values and perform a calculation for each data group. For example, you might group a list of dogs by their breed and then use a calculation to figure out the average height for each breed. This task is made possible by the GROUP BY clause and the aggregate function AVG(). If you need to refresh your knowledge about GROUP BY, our SQL Basics course is ideal. It has 129 interactive exercises that you can finish in about 10 hours. It covers topics such as filtering data, using logical operators, joining, and using set operators. The Basic Syntax of GROUP BY with Aggregates GROUP BY is a clause in SQL that arranges data with the same values into groups. Grouping is done by column(s), where all the rows with the same value in that column belong to one group. You can then perform summary computations – such as counting, summing, or averaging values – for each group. For example, you can use GROUP BY to compute the number of employees in a department, the total salary or average salary per department, etc. The syntax of GROUP BY with an aggregate function is: SELECT column1, column2, ..., aggregate_function(column_name) FROM table_name WHERE filtering_condition GROUP BY column1, column2, … HAVING filtering_condition; The aggregate function is written in the SELECT statement, and the function’s result will be shown as an additional column. The GROUP BY clause contains the columns by which you want your output to be grouped. This clause is often used with the WHERE and HAVING clauses for filtering. In the syntax, WHERE comes before GROUP BY, while HAVING comes after it. (I’ll explain more about these two clauses later.) Also, your data will sometimes contain NULLs, so you’ll be interested to learn how GROUP BY treats NULLs. An example of GROUP BY in action is shown in the following GIF. This basic example groups data by the film director and shows the number of films each has made. SQL Aggregate Functions SQL’s aggregate functions are used when you want to do computations on data and return a single value. The most common aggregate functions in SQL are: COUNT() – Counts the rows in the stated column. SUM() – Returns the sum of all values. AVG() – Computes the average of a group of values. MIN() – Returns the minimum (the lowest) of the given values. MAX() – Returns the maximum (the highest) of the given values See our guides on how to use these functions to compute aggregates for the whole table: How to Count the Number of Rows in a Table in SQL How to Count Distinct Values in SQL How to Sum Values of a Column in SQL? How to Find the Average of a Numeric Column in SQL How to Find the Maximum Value of a Column in SQL How to Find the Minimum Value of a Column in SQL It’s important to note that the COUNT() function has several different use variations: COUNT(*) counts all rows in a result set, including NULL values and duplicates. COUNT(expression) counts non-NULL values in a result set. COUNT(DISTINCT) counts distinct values and ignores duplicates. There are more details in our guide What is the Difference Between COUNT(*), COUNT(1), COUNT(column), and COUNT(DISTINCT)? Using GROUP BY with SQL Aggregate Functions I’ll use the table albums to show how aggregate functions and GROUP BY work together. The script for creating the table is here. This table contains data about musical albums. Here’s a snapshot: idartistalbumrelease_yearlengthlabel 1Grant GreenThe Latin Bit19630:38:56Blue Note Records 2AC/DCHigh Voltage19760:44:23Atlantic Records 3Brother Jack McDuffTo Seek a New Home19700:33:12Blue Note Records 4Grant GreenGreen Is Beautiful19700:37:33Blue Note Records 5Wayne ShorterMoto Grosso Feio19740:42:22Blue Note Records Example 1: COUNT() with GROUP BY Here’s an example code that counts the number of albums by label: SELECT label, COUNT(*) AS number_of_albums FROM albums GROUP BY label; I select the label and use COUNT(*) to find the number of albums. The asterisk (*) is a shorthand for counting all rows. When I group the output by label, COUNT() will show the number of albums per label. Here’s the output: labelnumber_of_albums Atlantic Records13 Blue Note Records12 Stax Records14 Further reading: A Detailed Guide to the SQL COUNT() Function The SQL Count Function Explained With 7 Examples How to Count the Number of Rows in a Table in SQL. What is the Difference Between COUNT(*), COUNT(1), COUNT(column), and COUNT(DISTINCT)? How to Count Distinct Values in SQL. Example 2: SUM() with GROUP BY The following code sums the album lengths to show the total music available per artist: SELECT artist, SUM(length) AS total_music_available FROM albums GROUP BY artist; The SUM() function is applied to the length column, and the output is then grouped by the artist. This is the code result: artisttotal_music_available Isaac Hayes6:30:02 Otis Redding1:34:09 Brother Jack McDuff1:58:11 Aretha Franklin1:47:07 Grant Green3:10:11 John Prine1:21:57 Led Zeppelin3:32:07 Wayne Shorter2:38:02 Albert King3:08:28 AC/DC2:05:23 Further reading: SQL SUM() Function Explained with 5 Practical Examples How to Sum Values of a Column in SQL Example 3: AVG() with GROUP BY This code calculates each artist’s average album length: SELECT artist, AVG(length) AS average_album_length FROM albums GROUP BY artist; To get the desired result, you need to use AVG() on the column length and group the output by artist. Here’s what the output looks like: artistaverage_album_length Isaac Hayes1:05:00 Otis Redding0:31:23 Brother Jack McDuff0:39:24 Aretha Franklin0:35:42 Grant Green0:38:02 John Prine0:40:59 Led Zeppelin0:42:25 Wayne Shorter0:39:31 Albert King0:37:42 AC/DC0:41:48 Further reading: The SQL AVG() Function Explained with Examples How to Find the Average of a Numeric Column in SQL Example 4: MIN() and MAX() with GROUP BY Let’s find the shortest and the longest album length for each year: SELECT release_year, MIN(length) AS minimum_album_length, MAX(length) AS maximum_album_length FROM albums GROUP BY release_year ORDER BY release_year; As in earlier examples, the aggregate functions are used on the length column; MIN() for the shortest and MAX() for the longest album. The output is grouped by release year. I also added the ORDER BY clause to sort the output from the earliest to the latest year so it would be more readable. Here’s the code output: release_yearminimum_album_lengthmaximum_album_length 19620:34:490:34:49 19630:38:560:38:56 19640:30:170:30:17 19650:32:220:42:45 19670:32:510:41:08 19680:29:300:29:30 19690:31:300:46:00 19700:33:120:43:04 19710:42:371:33:38 19720:37:500:44:46 19730:40:571:43:10 19740:42:220:46:00 19760:39:590:44:23 19770:41:010:41:01 Further reading: SQL MIN and MAX Functions Explained in 6 Examples How to Find the Maximum Value of a Numeric Column in SQL How to Find the Minimum Value of a Column in SQL Advanced GROUP BY Techniques The previous section dealt with the basics of using SQL aggregate functions and GROUP BY. With these foundations set, we can now explore some advanced GROUP BY techniques. Filtering Aggregated Results with HAVING To filter aggregated results, you need to use the HAVING clause. Its syntax is shown here: SELECT column1, column2, ..., aggregate_function(column_name) FROM table_name GROUP BY column1, column2, … HAVING filtering_condition; HAVING compares the aggregated values to the filtering condition and returns only the values that satisfy that condition. You can learn more in our article about the HAVING clause. Let me show you an example. The following code calculates the average album length per artist, but it shows only those whose average album length is under 40 minutes. SELECT artist, AVG(length) AS average_album_length FROM albums GROUP BY artist HAVING AVG(length) < '00:40:00'; I used the AVG() function just the way I did earlier. The HAVING clause is new. To filter data in the required way, I wrote in HAVING the same average calculation as the one appearing in SELECT. Then, I use the comparison operator ‘less than’ (<) to compare each grouped result with a value (in this case, 40 minutes). This value is enclosed in single quotes (''). The value in the quotes has to be written in the format HH:MM:SS (hours:minutes:seconds) because the values in the column length are of the INTERVAL data type. Here’s the output: artistaverage_album_length Otis Redding0:31:23 Brother Jack McDuff0:39:24 Aretha Franklin0:35:42 Grant Green0:38:02 Wayne Shorter0:39:31 Albert King0:37:42 The principle is the same no matter what aggregate function you use, but here are the cookbooks to help you with some other functions: How to Filter Records with the Aggregate Function COUNT How to Filter Records with the Aggregate Function SUM How to Filter Records with the Aggregate Function AVG Grouping by Multiple Columns So far, I have grouped data by one column. However, it’s also possible to group by two or more columns. It’s not complicated: You just have to list all the grouping columns in GROUP BY and separate them with commas. By doing that, you group by the column values but also by the combination of values across all the grouping columns. If you have value A in column_1 and value B in column_2, this is one group. If there are, for example, values A in column_1 with value C in column_2, this will form another group – despite the values in column_1 being the same as in the earlier example. Example The code below shows the number of albums by the artist and the year of release: SELECT artist, release_year, COUNT(*) AS number_of_albums FROM albums GROUP BY artist, release_year ORDER BY artist, release_year; I’m using COUNT(*) to find the number of albums. The output is grouped by artist and release year. As you can see, those are exactly the columns in GROUP BY and they are separated by a comma. To make the output more readable, I sorted data by artist alphabetically and from the earliest to the latest release year. This is the output snapshot: artistrelease_yearnumber_of_albums AC/DC19762 AC/DC19771 Albert King19621 Albert King19671 Albert King19691 Albert King19721 Albert King19741 Aretha Franklin19671 Aretha Franklin19681 Aretha Franklin19721 Brother Jack McDuff19692 Brother Jack McDuff19701 Two Aggregate Functions in One Query It’s also possible to write a query that has two aggregate functions. No, it’s not what you think: you don’t write one aggregate function within another. That’s not possible in SQL. Two aggregate functions in one query can mean two things: Simple approach: Using two aggregate functions in two columns of the same SELECT. Advanced approach: Using the result of one aggregate function in another aggregate function. One Query, Two Aggregate Functions: The Simple Approach So, the simple approach is when you have two aggregate functions in two different aggregate columns in one SELECT. They can be the same function applied on different columns, different functions used on the same column, or different functions used on different columns. It really doesn’t matter; the point is you have two aggregations in one query. These aggregations don’t ‘communicate’, i.e. they don’t depend on one another in any way. Example In the example, I’ll use the table package_deliveries with the script linked here. The table shows information about the number of packages delivered daily to several cities. Here are the first six rows in the table: iddate_of_deliverynumber_of_packagescity_of_deliverycountry_of_delivery 12024-03-0149KrakowPoland 22024-03-01528WarsawPoland 32024-03-01158GdanskPoland 42024-03-0128RijekaCroatia 52024-03-0197SplitCroatia 62024-03-0164ZagrebCroatia The code below calculates the total and the average number of packages by city. SELECT city_of_delivery, SUM(number_of_packages) AS total_number_of_packages, AVG(number_of_packages) AS average_number_of_packages FROM package_deliveries GROUP BY city_of_delivery; You can see the query has two aggregates: one with SUM() and one with AVG(). It really doesn’t matter how many aggregate functions you write in the query. The only important thing is that each aggregation is treated as a separate code line or output column. Here’s the result: city_of_deliverytotal_number_of_packagesaverage_number_of_packages Split531177.00 Izmir5,9361,978.67 Ankara9,1393,046.33 Gdansk712237.33 Rijeka627209.00 Zagreb930310.00 Istanbul13,8574,619.00 Krakow673224.33 Warsaw2,358786.00 One Query, Two Aggregate Functions: The Advanced Approach The advanced version is when the aggregations ‘communicate’, meaning you first need to find one aggregated value and then use it to calculate another aggregated value. Typically, this is achieved either through subqueries or Common Table Expressions (CTEs). I’ll show you subquery usage in the example below. Our guide on combining two aggregate functions in GROUP BY covers both approaches. Example The following code returns the average daily number of packages delivered in each country: SELECT country_of_delivery, AVG(pd.daily_number_of_packages) AS average_daily_number_of_packages FROM (SELECT date_of_delivery, country_of_delivery, SUM(number_of_packages) AS daily_number_of_packages FROM package_deliveries GROUP BY date_of_delivery, country_of_delivery) AS pd GROUP BY country_of_delivery; I first need to find the number of packages delivered daily in each country, which is the first aggregation. The second aggregation uses this result and calculates the daily average by country. The first aggregation is written in the subquery. It uses the SUM() function and lists the columns date_of_delivery and country_of_delivery in SELECT. The output is grouped by the same columns. Now, I can write the main query, referencing in AVG() the column daily_number_of_packages from the subquery. I group the output by country and get the average daily number of packages delivered for each country. country_of_deliveryaverage_daily_number_of_packages Turkey9,644.00 Croatia696.00 Poland1,247.67 GROUP BY with Conditional Aggregates The CASE or CASE WHEN statement labels data based on its values. This statement can also be used with GROUP BY. Grouping data by conditional aggregates means using CASE WHEN with GROUP BY. However – and this is important – the CASE WHEN you write in SELECT also has to appear in GROUP BY. Example Using the table from the previous example, I can write a code that categorizes countries into EU and non-EU countries and shows the total number of packages delivered for each category: SELECT CASE WHEN country_of_delivery = 'Turkey' THEN 'Non-EU' ELSE 'EU' END, SUM(number_of_packages) AS total_number_of_packages FROM package_deliveries GROUP BY CASE WHEN country_of_delivery = 'Turkey' THEN 'Non-EU' ELSE 'EU' END; The CASE statement says that Turkey should be categorized as a non-EU country. This condition is stated in WHEN. The category for all other countries is EU, which is written in ELSE. The CASE statement starts with CASE (hence the name) and ends with the keyword END. I use the SUM() function to calculate the total number of packages. To show this value by category, I just copied the whole CASE WHEN statement in GROUP BY. Here’s the output: casetotal_number_of_packages Non-EU28,932 EU5,831 If you need more examples to understand this fully, read one of our guides: How to Use CASE WHEN With SUM() in SQL How to Use CASE WHEN in GROUP BY GROUP BY Extensions: ROLLUP, CUBE, GROUPING SETS The three SQL GROUP BY extensions are: ROLLUP – Creates a summary row for the groups listed in GROUP BY. CUBE – Creates subtotals for all the combinations of the groups in GROUP BY. GROUPING SETS – Aggregates output on multiple levels. In the example, I’ll show you how GROUPING SETS works. You can learn how the other two extensions work in our GROUP BY extensions guide. Example The following code returns the number of packages delivered per country and per date: SELECT country_of_delivery, date_of_delivery, SUM(number_of_packages) AS total_number_of_packages FROM package_deliveries GROUP BY GROUPING SETS(country_of_delivery, date_of_delivery) ORDER BY country_of_delivery, date_of_delivery; I start by using SUM() to calculate the number of packages. In GROUP BY, I write GROUPING SETS with all the aggregation levels inside the parentheses. The output shows the number of delivered packages by country and by delivery date: country_of_deliverydate_of_deliverytotal_number_of_packages CroatiaNULL2,088 PolandNULL3,743 TurkeyNULL28,932 NULL2024-03-012,730 NULL2024-03-0211,208 NULL2024-03-0320,825 It can help to think of grouping sets as two separate queries. The first query groups by country … SELECT country_of_delivery, NULL AS date_of_delivery, SUM(number_of_packages) AS total_number_of_packages FROM package_deliveries GROUP BY country_of_delivery ORDER BY country_of_delivery; … and returns the first three rows of the previous output: country_of_deliverydate_of_deliverytotal_number_of_packages CroatiaNULL2,088 PolandNULL3,743 TurkeyNULL28,932 The second query groups by delivery date … SELECT NULL AS country_of_delivery, date_of_delivery, SUM(number_of_packages) AS total_number_of_packages FROM package_deliveries GROUP BY date_of_delivery ORDER BY date_of_delivery; … and outputs the other three rows of the original result: country_of_deliverydate_of_deliverytotal_number_of_packages NULL2024-03-012,730 NULL2024-03-0211,208 NULL2024-03-0320,825 In other words, if you merge these two separate grouping queries using UNION … SELECT country_of_delivery, NULL AS date_of_delivery, SUM(number_of_packages) AS total_number_of_packages FROM package_deliveries GROUP BY country_of_delivery UNION SELECT NULL AS country_of_delivery, date_of_delivery, SUM(number_of_packages) AS total_number_of_packages FROM package_deliveries GROUP BY date_of_delivery ORDER BY country_of_delivery, date_of_delivery; … you get exactly the same result as with GROUPING SETS. country_of_deliverydate_of_deliverytotal_number_of_packages CroatiaNULL2,088 PolandNULL3,743 TurkeyNULL28,932 NULL2024-03-012,730 NULL2024-03-0211,208 NULL2024-03-0320,825 Common Mistakes When Using GROUP BY Confusing ORDER BY and GROUP BY This is often a matter of not knowing what GROUP BY does. Let me demonstrate with an example. Say you want to find the total number of packages by country of delivery. If you’re not familiar with GROUP BY, you might write a naive solution using ORDER BY. SELECT date_of_delivery, city_of_delivery, country_of_delivery FROM package_deliveries ORDER BY country_of_delivery; Now that you have the data sorted by country, as shown in the snapshot below, you decide to manually sum the values to get the result by country. date_of_deliverycity_of_deliverycountry_of_delivery 2024-03-01RijekaCroatia 2024-03-01SplitCroatia 2024-03-01ZagrebCroatia 2024-03-02RijekaCroatia 2024-03-02SplitCroatia 2024-03-02ZagrebCroatia 2024-03-03ZagrebCroatia 2024-03-03SplitCroatia 2024-03-03RijekaCroatia 2024-03-01KrakowPoland 2024-03-01WarsawPoland But why would you manually do something that GROUP BY does automatically? Instead of ordering data by country, you should group by it and use SUM() to sum the data: SELECT country_of_delivery, SUM(number_of_packages) AS total_number_of_packages FROM package_deliveries GROUP BY country_of_delivery ORDER BY country_of_delivery; You immediately get the output you need: country_of_deliverytotal_number_of_packages Croatia2,088 Poland3,743 Turkey28,932 Read more about the differences between GROUP BY and ORDER BY and how to use GROUP BY and ORDER BY together. Confusing WHERE and HAVING Another common mistake is trying to filter aggregated values using WHERE. That is not possible — WHERE is used to filter individual rows before aggregation. On the other hand, HAVING is for filtering groups of rows after aggregation. Example If you want to get the average number of packages delivered daily by city and show only those with more than 500 packages, you might decide to write this query: SELECT city_of_delivery, AVG(number_of_packages) AS average_number_of_packages FROM package_deliveries WHERE AVG(number_of_packages) > 500 GROUP BY city_of_delivery; It will return an error because WHERE doesn’t accept an aggregate function as an argument. (This doesn’t mean you can't ever use WHERE with GROUP BY; you can, but not to filter groups.) In this case, you should use HAVING: SELECT city_of_delivery, AVG(number_of_packages) AS average_number_of_packages FROM package_deliveries GROUP BY city_of_delivery HAVING AVG(number_of_packages) > 500; It’s the same query with AVG() and GROUP BY. The only difference is HAVING, where you compare the AVG() calculation with 500. The output shows only cities whose daily average is above 500. city_of_deliveryaverage_number_of_packages Izmir1,978.67 Ankara3,046.33 Istanbul4,619.00 Warsaw786.00 Learn more about this in the article about HAVING vs. WHERE. Listing Unaggregated Columns in SELECT but Not in WHERE The simple rule you should remember is whatever unaggregated column you write in SELECT should also be included in GROUP BY. If you don’t do that, the database won’t know which value to show if several different values are within the same group. Example For example, you write the query that is supposed to calculate the total number of packages by country and each city. SELECT country_of_delivery, city_of_delivery, SUM(number_of_packages) total_number_of_packages FROM package_deliveries GROUP BY country_of_delivery; Sure, you have country and city in SELECT, but you forgot to put the city in GROUP BY. This query won’t run. It will return an error. While the messages are worded differently in PostgreSQL, Oracle, and other databases, their message is the same: the column city_of_delivery must appear in GROUP BY. Even if it ran, the database would be confused. It would think something like this: “You want to show cities but don’t want to group by city? How am I supposed to know which city to show in the output when every country has three of them? Please put cities in GROUP BY so I can show each city as a separate group.” So, you help it by including the city of delivery in GROUP BY: SELECT country_of_delivery, city_of_delivery, SUM(number_of_packages) total_number_of_packages FROM package_deliveries GROUP BY country_of_delivery, city_of_delivery; Yes, it’s grouping by two columns and you’ve already learned how to do that. Here’s the query’s output: country_of_deliverycity_of_deliverytotal_number_of_packages CroatiaZagreb930 CroatiaRijeka627 CroatiaSplit531 TurkeyIstanbul13,857 PolandWarsaw2,358 PolandKrakow673 TurkeyAnkara9,139 PolandGdansk712 TurkeyIzmir5,936 Other Errors There are also some other errors, such as forgetting GROUP BY with aggregate functions or not grouping by a unique key. Those and several more are covered in this article about common GROUP BY errors. Further SQL GROUP BY Resources After reading this overview of the SQL GROUP BY clause, you should have an idea of its uses and challenges. But if you want to master GROUP BY, you’ll need to go into more detail. The best way to do that is by checking out some of our courses and cookbooks. Here are some of my suggestions for further learning: Creating Basic SQL Reports – This intermediate-level course focuses on the nuances of using GROUP BY – grouping by multiple columns, using HAVING, combining GROUP BY with CASE WHEN, the difference between COUNT(*) and COUNT(id), etc. GROUP BY Extensions in SQL – This course covers CUBE, ROLLUP, and GROUPING SETS. SQL Practice – The courses in this track have sections dedicated to different SQL topics, including quite a lot of GROUP BY practice. And don’t forget about our articles featuring GROUP BY examples. Two of my favorites are SQL Practice: 10 GROUP BY Practice Exercises with Detailed Solutions and Top 9 SQL GROUP BY Interview Questions. In addition to that, there’s an SQL for Data Analysis Cheat Sheet, which has a section dedicated to GROUP BY. Happy learning! Tags: aggregate functions GROUP BY