Back to articles list Articles Cookbook
10 minutes read

How Does SQL GROUP BY Work?

Grouping results is a powerful SQL feature that allows you to compute key statistics for a group of records.

GROUP BY is one of SQL’s most powerful clauses. It allows you to see data in a new way and find key metrics (like the average, maximal, and minimal values in a group of records).

Without GROUP BY, all the results we obtain are oriented to records. With GROUP BY, we can create groups of records and calculate metrics on each group. In this article, you’ll learn how GROUP BY makes your SQL queries much more powerful and diverse.

GROUP BY, Part 1: Grouping Data

Let’s suppose we have a small hotel in Patagonia. We also have a database containing guests’ names, cities of origin, ages, check-in dates, check-out dates, and more. This data is in two tables called room_guest and guest. Have a look:

room_guest

guest_nameorigin_cityroom_numberday_inday_outageroom_levelamount_invoiced
Juan B.San Pedro10012012-12-282013-01-0732standard$9500
Mary J.San Francisco10022013-01-022013-01-1223standard$6700
Peter S.Dubai20022013-01-022013-01-2965premium$34000
Clair BGenova20012014-07-022014-08-0221standard$16000
Meiling Y.San Francisco20022014-11-022014-11-1252standard$9500
Olek V.Dubai20032015-01-022015-01-3137premium$28400
Benjamin L.San Pedro20022016-01-022016-01-1561premium$15400
Arnaldo V.Genova10012017-01-012017-01-0443standard$2500
Mary J.San Francisco10022017-01-022017-01-0723standard$4800
Wei W.Los Angeles20022018-01-022018-01-2231standard$12000
Meiling Y.San Francisco20012018-01-022018-01-2252premium$17500
Peter S.Dubai20022019-01-022019-02-2565premium$32000
Arnaldo V.Genova20032019-08-052019-08-1743standard$11200
Mary J.San Francisco10012019-01-022019-01-1223standard$8900

guest

guest_namepreferred_activitycity_namestatecountrycontinent
activityCity_nameStateCountryContinent32
Juan B.trekkingSan PedroAndaluciaSpainEurope
Mary J.trekkingSan FranciscoCaliforniaUnited StatesAmerica
Peter S.trekkingDubaiDubaiArabiaAsia
Chiara BskiingGenovaLiguriaItalyEurope
Meiling Y.trekkingSan FranciscoCaliforniaUnited StatesAmerica
Olek V.relaxingDubaiDubaiArabiaAsia
Benjamin L.skiingSan PedroBuenos AiresArgentinaAmerica
Wei W.trekkingLos AngelesCaliforniaUnited StatesAmerica
Arnaldo V.skiingGenovaLiguriaItalyEurope

We want to calculate some statistics so we can book more guests. The SQL GROUP BY clause lets us group records based on data in a given column (or columns). We can group records in the table room_guest based on the value of the column origin_city. Then all the records of guests from ‘Genova’ will belong to one group; all the records of guests from ‘Dubai’ will belong to another group, and so on. The following table shows each group of records in a different color.

guest_nameorigin_cityroom_numberday_inday_outageroom_levelamount_invoiced
Peter S.Dubai20022013-01-022013-01-2965premium$34000
Olek V.Dubai20032015-01-022015-01-3137premium$28400
Peter S.Dubai20022019-01-022019-02-2565premium$32000
Clair BGenova20012014-07-022014-08-0221standard$16000
Arnaldo V.Genova10012017-01-012017-01-0443standard$2500
Arnaldo V.Genova20032019-08-052019-08-1743standard$11200
Wei W.Los Angeles20022018-01-022018-01-2231standard$12000
Mary J.San Francisco10022013-01-022013-01-1223standard$6700
Mary J.San Francisco10022017-01-022017-01-0723standard$4800
Meiling Y.San Francisco20022014-11-022014-11-1252standard$9500
Meiling Y.San Francisco20012018-01-022018-01-2256premium$17500
Mary J.San Francisco10012019-01-022019-01-1223standard$8900
Benjamin L.San Pedro20022016-01-022016-01-1561premium$15400
Juan B.San Pedro10012012-12-282013-01-0732standard$9500

Now, suppose the hotel’s owner wants to know how many guests come from each city. To find out, we need to count the number of records in each group. In other words, we need the aggregate function COUNT(*), which returns the number of records in a group. COUNT() is a very common function; we’ll return to it later in this article.

So, we need a query to create groups of records with the same value in origin_city and then count the number of records in each group. The query would look like this:

SELECT 	origin_city,
COUNT(*) AS quantity_of_guests
FROM   	room_guest 
GROUP BY	origin_city

You can compare the number of guests from each city in the result table below against the colored table shown previously:

origin_cityquantity_of_guests
Dubai3
Genova3
Los Angeles1
San Francisco5
San Pedro2

Notice that the number of rows in the query results are the same as the quantity of groups created by the GROUP BY clause. One group for each city, one row for each city.

GROUP BY, Part 2: Aggregation Functions

While grouping by a value is handy, the real power of GROUP BY is when it’s used with aggregate functions. I’d go so far as to say that every SQL query using a GROUP BY clause should have at least one aggregate function. (But it’s not mandatory.)

In the previous section, we mentioned that GROUP BY is used to create groups and calculate metrics. Metrics are calculated by aggregation functions like COUNT(), SUM(), AVG(), MIN(), and MAX(). The values calculated by each of these functions are self-explanatory. However, all of them have something in common: all aggregate functions return a value based on all the records in the group.

Let’s consider an example. The hotel owner wants to know the maximum value invoiced for each room. Along with this, he wants to see the minimum and average invoiced value for each room. Here’s the query, followed by the results:

SELECT 	room_number,
MAX(amount_invoiced) AS max_amount_invoiced,
MIN(amount_invoiced) AS min_amount_invoiced,
AVG(amount_invoiced) AS average_amount_invoiced
FROM   	room_guest 
GROUP BY	room_number
room_numbermax_amount_invoicedmin_amount_invoicedaverage_amount_invoiced
10019500.002500.006966.66
10026700.004800.005750.00
200117500.0016000.0016750.00
200234000.009500.0020580.00
200328400.0011200.0019800.00

Grouping Records by Multiple Columns

In some cases, we may need to group by two or more columns. Can we do that with GROUP BY? We sure can!

In the previous query, we created a report analyzing how much money each room is generating. However, some rooms can be configured at a premium or standard level (see room number 2002) during different seasons; thus, to do a correct analysis, we need to group records using two columns: room_number and room_level.

Before going to the query, let's use colors to see how the records are grouped by the GROUP BY room_number, room_level clause. Remember that the records in each group must have exactly the same values in both room_number and room_level. For example, the first group is for room_number = 1001 and room_level = ‘standard’.

guest_nameorigin_cityroom_numberday_inday_outageroom_levelamount_invoiced
Mary J.San Francisco10012019-01-022019-01-1223standard$8900
Arnaldo V.Genova10012017-01-012017-01-0443standard$2500
Juan B.San Pedro10012012-12-282013-01-0732standard$9500
Mary J.San Francisco10022013-01-022013-01-1223standard$6700
Mary J.San Francisco10022017-01-022017-01-0723standard$4800
Meiling Y.San Francisco20012018-01-022018-01-2252premium$17500
Clair BGenova20012014-07-022014-08-0221standard$16000
Benjamin L.San Pedro20022016-01-022016-01-1561premium$15400
Peter S.Dubai20022013-01-022013-01-2965premium$34000
Peter S.Dubai20022019-01-022019-02-2565premium$32000
Meiling Y.San Francisco20022014-11-022014-11-1252standard$9500
Wei W.Los Angeles20022018-01-022018-01-2231standard$12000
Olek V.Dubai20032015-01-022015-01-3137premium$28400
Arnaldo V.Genova20032019-08-052019-08-1743standard$11200

The query is:

SELECT 	room_number,
		room_level,
MAX(amount_invoiced) AS max_amount_invoiced,
MIN(amount_invoiced) AS min_amount_invoiced,
AVG(amount_invoiced) AS average_amount_invoiced
FROM   	room_guest 
GROUP BY	room_number, room_level

The following table shows the results of this query. You can compare this table with the previous table to verify the results.

room_numberroom_levelmax_amount_invoicedmin_amount invoicedaverage_amoun_invoiced
1001standard9500.002500.006966.66
1002standard6700.004800.005750.00
2001premium17500.0017500.0017500.00
2001standard16000.0016000.0016000.00
2002premium34000.0015400.0027133.33
2002standard12000.009500.0010750.00
2003premium28400.0028400.0028400.00
2003standard11200.0011200.0011200.00

Grouping NULL Values

Like any other value, NULL values have their own group; if we have a NULL in any of the columns in GROUP BY, an extra group of records is created for those records. To demonstrate this, we need to insert a pair of records with NULL values in the column origin_city:

INSERT INTO into room_guest VALUES ('Kevin C.', NULL, 2001, '2019-07-25', '2019-08-07', NULL, 'standard', 10500);
INSERT INTO into room_guest VALUES  ('Karl J.', NULL, 1002, '2019-11-12', '2019-11-22', NULL, 'premium', 13900);

Then, this query ...

SELECT 	origin_city,
COUNT(*) AS quantity_of_guests
FROM   	room_guest 
GROUP BY	origin_city

… will show the following result. Notice the new group for NULL origin_city values in the first row:

origin_cityquantity_of_guests
NULL2
Dubai3
Genova3
Los Angeles1
San Francisco5
San Pedro2

Using WHERE with GROUP BY

The WHERE clause is frequently used in SQL queries, so it’s important to understand how it works when combined with GROUP BY.

The WHERE clause is applied before the GROUP BY. This means that all the records are filtered first by WHERE; then the records that match the WHERE condition are grouped using the GROUP BY criteria.

As an example, let's use the previous query, but this time we’ll filter for guests coming from the cities of San Francisco and Los Angeles. The query is:

SELECT 	room_number,
		room_level,
MAX(amount_invoiced) AS max_amount_invoiced,
MIN(amount_invoiced) AS min_amount_invoiced,
AVG(amount_invoiced) AS average_amount_invoiced
FROM   	room_guest
WHERE		origin_city IN (‘San Francisco’,’Los Angeles’ )
GROUP BY	room_number, room_level

As expected, this result set is shorter than the previous ones; the WHERE clause filtered out many guests, and only the records for rooms in San Francisco and Los Angeles were processed by the GROUP BY clause.

room_numberroom_levelmax_amount_invoicedmin_amount_ invoicedaverage_amount_invoiced
1001standard8900.008900.008900.00
1002standard6700.004800.005750.00
2001premium17500.0017500.0017500.00
2002standard12000.009500.0010750.00

Avoiding Problems with GROUP BY

When you’re getting started with GROUP BY, it’s common to run into the following problems. Here’s how to avoid them.

Counting Problems

Let’s look at a similar case where we need to add more than one extra column into the GROUP BY clause. In the first query, we grouped by origin_city. However, some cities share the same name (because they are in different states or countries). In our data set, we have two different cities named San Pedro, one in Argentina and the other in Spain. We don’t want to count them together, as they are two different places.

To count these cities separately, we need to group records using the columns city_origin, state, and country. Then we will repeat the first query but add the columns state and country to the GROUP BY clause. However, if we add columns to the GROUP BY, we should also add them to the SELECT.

Because the columns state and country are in the guest table, we have to JOIN the tables room_guest and guest. Here’s the query we have:

SELECT 	origin_city, state, country
COUNT(*) AS number_of_guests
FROM   	room_guest 
JOIN		guest ON guest.guest_name = room_guest.guest_name
GROUP BY	origin_city, state, country

The results show two different “San Pedro” cities because we’ve used state and country as additional columns in the GROUP BY clause.

origin_citystatecountrynumber_of_guests
DubaiDubaiUAE3
GenovaLiguriaItaly3
Los AngelesCaliforniaUnited States1
San FranciscoCaliforniaUnited States5
San PedroBuenos AiresArgentina1
San PedroAndaluciaSpain1

There is still an issue to fix in this query: if the same person visited the hotel two times, we are counting this person twice. This is not necessarily wrong, but what if we want to know the number of unique visitors to the hotel? We’d need to use COUNT(distinct guest_name). The grouping function COUNT(distinct column) returns the quantity of unique values for a given column in a group of records.

In the query below, we add the COUNT(distinct) function. We also maintain the original COUNT(*) so that the reader can compare both results:

SELECT 	origin_city, state, country
COUNT(distinct guest_name) AS number_of_unique_guests,
COUNT(*) AS number_of_guests
FROM   	room_guest 
JOIN		guest ON guest.guest_name = room_guest.guest_name
GROUP BY	origin_city, state, country

Now we can see the hotel received a total of three visits from a Dubai resident, but that these three visits were made by two distinct people (Peter S. and Olek V) .

origin_citystatecountrynumber_of_unique_guestsnumber_of_guests
DubaiDubaiUAE23
GenovaLiguriaItaly23
Los AngelesCaliforniaUnited States11
San FranciscoCaliforniaUnited States25
San PedroBuenos AiresArgentina11
San PedroAndaluciaSpain11

Before closing this section, I suggest you watch this 5-minute video on GROUP BY for beginners. It’s a super dynamic way to learn SQL.

Omitting Non-Aggregated Columns from GROUP BY

Another very common GROUP BY error is to add a non-aggregated column (i.e. a column that isn’t used in an aggregate function) in the SELECT that you don’t have in GROUP BY. To avoid this error, follow a very simple rule: All columns in SELECT should appear in the GROUP BY clause or be used in an aggregate function.

Let’s try an invalid query to see the error:

SELECT    room_number,
	    room_level,
	    origin_city, --This column is invalid, is not in the GROUP BY
    COUNT(*) AS quantity_of_visitors,
FROM      room_guest 
GROUP BY  room_number, room_level

If we execute this query, we’ll get the following error:

ERROR:  The column «room_guest.origin_city» must be in the GROUP BY clause
LINE 3:   guest_age,

We can fix the error by adding the column origin_city to the GROUP BY clause:

	SELECT 	room_number,
			room_level,
			Origin_city,
			COUNT(*) AS quantity_of_visitors
	FROM		room_gest
	GROUP BY	room_number, room_level, origin_city -- origin_city added

If you’re trying to figure out the difference between GROUP BY and ORDER BY, read the Difference Between GROUP BY and ORDER BY in Simple Words. It will help you sort it out.

There’s More to Do with GROUP BY

So, we’ve learned how to use GROUP BY to group records by common values. We know the aggregate functions MIN(), MAX(), AVG(), and SUM() compute various statistics. And the COUNT() function does a lot of things:

  • COUNT(*) counts all rows.
  • COUNT(guest_name) counts all non-NULL values in the guest_name column.
  • COUNT(distinct guest_name) counts all different non-NULL values in the guest_name column.

When grouping, NULL gets its own group. And all non-aggregated columns in SELECT must be present in GROUP BY.

Due to the length of the article, I didn’t cover the HAVING clause, which is a kind of WHERE clause used to filter groups instead of records. For those readers who want to go a step further, I’ll leave you a link to our SQL Basics course, which covers many interesting topics. It’s a great way to build your SQL skills!