Back to articles list January 26, 2021 - 13 minutes read 7 Common GROUP BY Errors Marija Ilic Marija works as a data scientist in the banking industry. She specializes in big data platforms (Cloudera and Hadoop) with software and technologies such as Hive/Impala, Python and PySpark, Kafka, and R. Marija has an extensive background in DWH/ETL development in the banking industry. Her main interests are predictive modeling, real-time decision-making, and social network analysis. Outside of work, Marija enjoys listening to her favorite LPs on her old gramophone—and never grows tired of its soothing crackle. Tags: sql learn sql GROUP BY Are you making these errors with GROUP BY in SQL? Find out what they are, how to avoid them, and how to fix them. SQL’s GROUP BY statement can quickly uncover powerful data insights. At first, using GROUP BY can seem easy – e.g. when creating basic SQL reports that you’ll present to business decision-makers. But while learning this powerful feature, you could become trapped in weird errors or get incorrect results caused by improperly written GROUP BY statements. If you’re feeling like things aren’t quite adding up with your use of GROUP BY, keep reading. In this article, I will explain the most common GROUP BY mistakes and how you can avoid them. Stop Making These 7 Common GROUP BY Mistakes 1. Forgetting GROUP BY with Aggregate Functions You use SELECT statements with the GROUP BY clause when you want to group and organize rows into specific groups and then perform a specific calculation of each group. The most common GROUP BY error is forgetting to write GROUP BY inside the SELECT statement. Here is one example. Imagine that you have the table recipes, which contains 100 records and six columns. This table stores the number of views (no_of_views) per each recipe published on a famous culinary website: meal_categorynameauthorno_of_viewsno_of_views_lst_mthauthor_id Cold appetizerMarinated CheeseMarta107104906211 SoupsPumpkin soupJohn68856693772 dessertsBanana CheesecakeAlly131944NULL3 drinksPaloma PicanteLuke72027713124 Bread and pastrySour Cream DoughnutsJohn50935527912 dessertsReal Strawberry CupcakesLisa17626811693911 Soupspotato soupMary64796643886 .................. .................. .................. Bread and pastryCider DoughnutsTim53896511608 recipe table Here is a short description of the table’s columns: meal_category – The recipe category (soup, drinks, desserts, etc.). name – The recipe’s name. author – The author’s name. no_of_views – The number of views (total pages/recipes viewed) in the current month. no_of_views_lst_mth – The number of views (total pages/recipes viewed) in the previous month. author_id – The author’s unique ID number. Let's say that you want to count the number of recipes in each meal category. If you write the statement like this (without GROUP BY at the end) ... SELECT meal_category, count(*) AS total_recipes FROM recipes ; ... most SQL engines will give you an error. Some engines, however, will output weird, unwanted results. I’m using MySQL and when I run this statement, I get this: meal_categorytotal_recipes Cold appetizer100 Result without GROUP BY 100 is the total count of all the recipes in the whole data set and the meal category ‘Cold appetizer’ is just one category out of ten. To correct this type of error, you need to add a GROUP BY meal_category at the end of the statement. (Otherwise, your result in MySQL just doesn't make sense.) The correct SELECT looks like this: SELECT meal_category, count(*) AS total_recipes FROM recipes GROUP BY meal_category ; Here is a short explanation of what’s going on: Records are merged based on meal category. For example, desserts are one group, soups another, main dishes yet another, etc. The column meal_category is specified after GROUP BY; it is also listed in SELECT. For each group, we are using COUNT(*) to count the total number of recipes in that group. I'm not going to dive deeply into the syntax here, but I would definitely suggest you read GROUP BY in SQL Explained or Using GROUP BY in SQL for more details. As you can see, the result is like we expected: meal_categorytotal_recipes Bread and pastry7 Cold appetizer6 desserts20 drinks7 Main dishes20 Salads8 Side dishes12 Soups17 Warm appetizer3 Valid GROUP BY result 2. Confusing WHERE and HAVING Maybe you’d like to see only those meal categories that have more than 10 recipes. A lot of beginners would write this query: SELECT meal_category, count(*) AS total_recipes FROM recipes WHERE count(*) > 10 GROUP BY meal_category ; This statement will return an error because you cannot use aggregate functions in a WHERE clause. WHERE is used with GROUP BY when you want to filter rows before grouping them. In our example, we want to filter rows after grouping; in cases like this, we need to use the HAVING clause: SELECT meal_category, count(*) AS total_recipes FROM recipes GROUP BY meal_category HAVING count(*) > 10 ; This misunderstanding about the difference between HAVING and WHERE is the second most common error with GROUP BY. Let's clarify this difference with two more examples. Example 1 - How to Display Meal Categories with 1M+ Views A statement that displays only categories with more than 1 million total page views can be written like this: SELECT meal_category, sum(no_of_views) AS total FROM recipes GROUP BY meal_category HAVING sum(no_of_views) >1000000; Here we are using HAVING because we want to filter records after they have been grouped. The result is presented below: meal_categorytotal desserts2969324 Main dishes1323981 Side dishes1662910 Soups1100911 Example with HAVING Example 2 – John’s Performance in Each Meal Category This query extracts only John's recipes and calculates his performance: SELECT meal_category, sum(no_of_views) AS total FROM recipes WHERE author = ‘John’ GROUP BY meal_category; We’re using WHERE because we need to filter records (so we only get John’s data) before we put the records into groups by meal category. Here is how the result looks: meal_categorytotal Bread and pastry50935 desserts301869 drinks147745 Main dishes279934 Salads88097 Side dishes415864 Soups393253 Warm appetizer85570 John’s KPIs HAVING and WHERE are nicely described in our articles What Is the Difference Between WHERE and HAVING Clauses in SQL? and 5 Examples of GROUP BY. If you would like to see more examples on this topic, I suggest starting there. 3. Listing a Column Inside SELECT but Not in GROUP BY Now suppose you want to see the total number of views per meal_category and author. We can do that? – ?we only need to add the author column to our previous query: SELECT meal_category, author, sum(no_of_views) AS total FROM recipes GROUP BY meal_category; Does this look okay to you? Of course not; it will throw an error in most SQL engines. For example, Oracle will tell you “error: Not a GROUP BY expression”. Why this confusing error? What is missing here? Well, the SQL engine doesn't know how to calculate the total for each author because we didn't include it in the GROUP BY clause; the attribute author is not listed inside the GROUP BY clause. This is another common error with GROUP BY. Let's fix this query and run it one more time: SELECT meal_category, author, sum(no_of_views) AS total FROM recipes GROUP BY meal_category, author; The result is: meal_categoryauthortotal Bread and pastryDino53789 Bread and pastryJohn50935 Bread and pastryMarta52998 Bread and pastryMary52904 Bread and pastryPatricia51451 Bread and pastryTim106226 ......... ......... ......... SoupsMary125731 SoupsMonte128356 SoupsPatricia255574 SoupsTim132532 Warm appetizerJohn85570 Warm appetizerLisa82960 Warm appetizerMary87560 Now this looks okay. Remember, unaggregated columns that are listed in SELECT must also be listed in GROUP BY. In our case, the unaggregated columns are meal_category and author, which are now in SELECT and GROUP BY. You don't list columns that are inside aggregate functions in GROUP BY. In our example, the column no_of_views is used in the aggregate function SUM() and thus is not listed in the GROUP BY clause. If you would like to learn more about this topic, check out our article GROUP BY Clause: How Well Do You Know It?. It explains why SELECTed columns need to appear in the GROUP BY clause. Also, How to Fix a ‘Not a GROUP BY Expression’ Error gives more examples related to this type of error. 4. Not Grouping by a Unique Key Now let’s try something else. Suppose we want to get the average number of page views for each recipe author. The following query calculates the average total number of page views for each author using the author name: SELECT author, avg(no_of_views) FROM recipes GROUP BY author; When you look at the result, you will notice that Lisa averages 116101.5 page views: authoravg(NO_OF_VIEWS) Ally106545 Dino94667.9091 John88163.35 Lisa116101.5 Luke104591 Marta119789.1667 Mary101040.0588 Monte84794 Patricia81911.1333 Tim76185.375 GROUP BY author – but names are not unique However, we actually have two authors named Lisa in our table. When we group the results by the author column, both Lisas are averaged together. Why? Because we are using a non-unique column in the GROUP BY. This means that not all the grouping values have to be unique. If we want to see each Lisa’s average separately, we should add author_id (a unique column) to the GROUP BY list: SELECT author, author_id avg(no_of_views) FROM recipes GROUP BY author, author_id; Now we see how recipes from Lisa (id=11) are much more viewed than recipes by Lisa (id=5): authorauthor_idavg(no_of_views) Ally3106545 Dino794667.9091 John288163.35 Lisa585798 Lisa11146405 Luke4104591 Marta1119789.1667 Mary6101040.0588 Monte984794 Patricia1081911.1333 Tim876185.375 GROUP BY with author and author_id It is important to always think about grouping keys. Grouping values should be unique and must represent each group in the desired way. Otherwise, you’ll get inaccurate, confusing results and possibly a GROUP BY error. 5. Confusing COUNT(distinct) and COUNT(*) If you’re curious to see the total number of authors for each meal category, you can write a GROUP BY statement to calculate that. Let's use COUNT(*) and retrieve the number of authors in each category: SELECT meal_category, count(*) FROM recipes GROUP BY meal_category; Here’s the result – but it’s not what you expected, is it? meal_categorycount(*) Bread and pastry7 Cold appetizer6 desserts20 drinks7 Main dishes20 Salads8 Side dishes12 Soups17 Warm appetizer3 This is the total number of recipes in each category, not the total number of authors. Why is that? Well, COUNT(*) counts all the rows in each group. The table recipe contains information on a recipe level - each record is one recipe. This query counts the recipes (rows) in each category, not the recipe authors. One author can have many recipes in each category, so to get the information you want, you must count distinct authors (using COUNT(distinct author) instead of COUNT(*)) inside each group. This is a very common GROUP BY error. So, when should you use COUNT(*), COUNT(expression) and COUNT(distinct expression)? Let's take a look at an example: SELECT meal_category, count(distinct author), count(author), count(*) FROM recipes GROUP BY meal_category; meal_categorycount(distinct author)count(author)count(*) Bread and pastry677 Cold appetizer266 desserts82020 drinks577 Main dishes92020 Salads688 Side dishes81212 Soups61717 Warm appetizer333 The difference between COUNT(*) and COUNT(expression) is visible if we are doing calculations on a column that has some missing values. When missing values are present, COUNT(*) will count all the records in a group and COUNT(expression) will count only non-null values. In the above example, COUNT(*) and COUNT(author) give the exact same result because the author column doesn’t have any NULL values. COUNT(distinct author) gives us the number of distinct authors for each category, which is not the same as COUNT(*). For example, the cold appetizer meal category contains six recipes from two distinct authors. COUNT(*) counts the number of recipes (records) in each category, while COUNT(distinct author) counts the number of distinct authors. So, if you would like to display the total number of distinct authors per each meal category, use COUNT(distinct author). Here is the correct query: SELECT meal_category, count(distinct author) FROM recipes GROUP BY meal_category; GROUP BY meal_category; For a more detailed explanation, see What is the Difference Between COUNT(*), COUNT(1), COUNT(column name), and COUNT(DISTINCT column name)? 6. Problems Using Aggregate Functions With NULLs This is another ‘missing value’ problem. Let's say that you want to calculate the average total number of views from the previous month for each category. Your colleague calculated those figures, but they’d like you to double check the result. Here is your query: SELECT meal_category, avg(no_of_views_lst_mth) as average, FROM recipes GROUP BY meal_category; And what you get is ... meal_categoryaverage Bread and pastry52274.8571 Cold appetizer95584.2 desserts144349.7222 drinks72551.7143 Main dishes61350.8889 Salads90798.875 Side dishes139765.25 Soups64978.8824 Warm appetizer78390.6667 The result looks okay and you are confident when it comes to the correctness of your query. However, your colleague got slightly different figures: meal_categoryaverage Bread and pastry52274.8571 Cold appetizer79653.5 desserts129914.75 drinks72551.7143 Main dishes55215.8 Salads90798.875 Side dishes139765.25 Soups64978.8824 Warm appetizer78390.6667 What just happened? Why the different results? In a nutshell, the differing results arise from different interpretations of missing values. The column no_of_views_lst_mth represents the number of total page views in the previous month. If a recipe was created in the current month, this column will be NULL for that row. For example, Ally’s Banana Cheesecake recipe was written in the current month, so there are no statistics for the previous month: meal_categorynameauthorno_of_viewsno_of_views_lst_mthauthor_id dessertsBanana CheesecakeAlly131944NULL3 Banana cheesecake was published in current month Now, let's get back to those averages and their different results. Averages are calculated as the total sum of no_of_views_lst_mth divided by the total number of records. If you use the AVG() function and NULLs are present, the engine just ignores the NULLs and does calculations without them. This is what happened when you ran your query – the NULLs were omitted. In some cases, you’ll want to replace NULLs with 0 (because business logic dictates); this is what your colleague did, which produced slightly different figures. Here is your colleague’s query: SELECT meal_category, avg(CASE WHEN no_of_views_lst_mth is null THEN 0 ELSE no_of_views_lst_mth END) AS average FROM recipes GROUP BY meal_category; Notice how the averages from these two queries treat NULLs differently. For example, the ‘desserts’ category contains NULLs. Thus, the first query omits these rows and does not count them towards the total number of rows; this gives the value 144349.72. The second query replaces all NULLs with zero and counts these rows in the average, giving a smaller value of 129914.75. I would say that both queries could be valid, depending on how you want to calculate averages. 7. Using COUNT(*) with GROUP BY and a LEFT JOIN Using GROUP BY with a LEFT JOIN statement can be quite confusing – especially with COUNT(). Let's see how COUNT(*) and COUNT(expression) function in a LEFT JOIN. Let’s suppose that someone in marketing has the following table, recipes_campaigns. It contains information about the number of campaigns run on each meal category in the current month: meal_categorycampaigns Bread and pastry2 Cold appetizer1 desserts3 drinks0 Main dishes3 Salads1 Side dishes2 Soups3 Warm appetizer0 brunch1 sandwiches0 recipes_campaign In addition to the data in recipes_campaigns, the marketer also wants to see the number of recipes for each meal category. For that, we’ll need information from the recipes table. So let’s left join these two tables and calculate the number of recipes using COUNT(*), like so: SELECT a.meal_category, count(*), FROM recipes_campaigns a LEFT JOIN recipes b ON a.meal_category=b.meal_category GROUP BY a.meal_category; Here’s the result: meal_categorycount(*) Bread and pastry7 brunch1 Cold appetizer6 desserts20 drinks7 Main dishes20 Salads8 sandwiches1 Side dishes12 Soups17 Warm appetizer3 This isn’t what we expected. The table recipe does not contain any recipes in the ‘brunch’ category, so why then did we get that 1 in the result? This happens because COUNT() is applied to the LEFT JOIN result! When you LEFT JOIN two tables, the ‘brunch’ category will be present in the output – even if there are no matching recipes or categories in the recipe table. How can we fix this? If we use COUNT(expression) instead of COUNT(*), we’ll get the result we want: SELECT a.meal_category, count(author_id), FROM recipes_campaigns a LEFT JOIN recipes b ON a.meal_category=b.meal_category GROUP BY a.meal_category; Once you run this, you get: meal_categorycount(author_id) Bread and pastry7 brunch0 Cold appetizer6 desserts20 drinks7 Main dishes20 Salads8 sandwiches0 Side dishes12 Soups17 Warm appetizer3 Here, COUNT(author_id) counts only the non-NULL values in author_id after the LEFT JOIN is performed. There is no author_id value for the ‘brunch’ category; in other words, it’s NULL and the result for that category is 0. You Can Solve GROUP BY Errors! Through several examples, we’ve explored GROUP BY and the most common errors that beginners often make. I hope that now you have a better sense of how GROUP BY works and what’s causing those weird errors or confusing results. GROUP BY is really very important in report creation. If you want to learn how to construct good reports, I recommend our Creating Basic SQL Reports course. It has lots of interactive exercises that let you gain experience writing GROUP BY queries for reports. And more experience certainly reduces the possibility of mistakes! Tags: sql learn sql GROUP BY You may also like Using GROUP BY in SQL Master the powerful SQL GROUP BY command. Learn to compute statistical summaries with aggregate functions to solve real-world Business Intelligence challenges. Read more GROUP BY Clause: How Well Do You Know It? The GROUP BY clause is the most basic way to compute statistics in SQL. It can be quite tough for beginners but it is really powerful. Read more GROUP BY in SQL Explained Need to refresh your knowledge of SQL GROUP BY? Learn how GROUP BY works and when it can be useful. Examples provided. Read more SQL Window Functions vs. GROUP BY: What’s the Difference? Window functions and GROUP BY may seem similar at first, but they’re quite different. Learn how window functions differ from GROUP BY and aggregate functions. Read more NULL Values and the GROUP BY Clause We've already covered how to use the GROUP BY clause but how does SQL's GROUP BY clause work when NULL values are involved? Find out! Read more Getting the Hang of the GROUP BY Clause GROUP BY is an important part of the SQL SELECT statement. But new SQL coders can run into some problems when this clause is used incorrectly. Here’s how to avoid those issues. Read more Subscribe to our newsletter Join our weekly newsletter to be notified about the latest posts.