Back to articles list September 2, 2020 - 5 minutes read How to Find Duplicate Values in SQL Zahin Rahman Nuclear Engineer by day and Data Scientist by night, Zahin is passionate about driving business with data. He has 5+ years of industry experience in energy and power generation and has a master’s degree in Engineering with a focus in Data Science. He loves learning, whether from an industry veteran or a high school science enthusiast. Tags: sql learn sql GROUP BY ORDER BY Duplicate records waste time, space, and money. Learn how to find and fix duplicate values using SQL’s GROUP BY and HAVING clauses. Database best practices usually dictate having unique constraints (such as the primary key) on a table to prevent the duplication of rows when data is extracted and consolidated. However, you may find yourself working on a dataset with duplicate rows. This could be because of human error, an application bug, or uncleaned data that’s been extracted and merged from external sources, among other things. Why fix duplicate values? They can mess up calculations. They can even cost a company money; for example, an e-commerce business might process duplicated customer orders multiple times, which can have a direct impact on the business’s bottom line. In this article, we will discuss how you can find those duplicates in SQL by using the GROUP BY and HAVING clauses. How to Find Duplicate Values in SQL First, you will need to define the criteria for detecting duplicate rows. Is it a combination of two or more columns where you want to detect duplicate values, or are you simply searching for duplicates within a single column? In the examples below, we will be exploring both these scenarios using a simple customer order database. In terms of the general approach for either scenario, finding duplicates values in SQL comprises two key steps: Using the GROUP BY clause to group all rows by the target column(s) – i.e. the column(s) you want to check for duplicate values on. Using the COUNT function in the HAVING clause to check if any of the groups have more than 1 entry; those would be the duplicate values. For a quick visual refresher on GROUP BY, check out our We Learn SQL Series’ SQL GROUP BY video. Our SQL Basics course also covers these concepts in great detail. Duplicate Values in One Column Here, we will be demonstrating how you can find duplicate values in a single column. For this example, we will be using the Orders table, a modified version of the table we used in my previous article on using GROUP BY in SQL. A sample of the table is shown below. OrderIDCustomerIDEmployeeIDOrderDateShipperID 102489051996-07-043 102498161996-07-051 102503441996-07-082 102518431996-07-081 102518431996-07-081 102527641996-07-092 …………… 104436681997-02-121 In this example, there are a few duplicates in the OrderID column. Ideally, each row should have a unique value for OrderID, since each individual order is assigned its own value. For some reason, that wasn’t implemented here. To find the duplicates, we can use the following query: SELECT OrderID, COUNT(OrderID) FROM Orders GROUP BY OrderID HAVING COUNT(OrderID)>1 RESULT Number of Records: 2 OrderIDCOUNT(OrderID) 102512 102762 As we can see, OrderID 10251 (which we saw in the table sample above) and OrderID 10276 have duplicates. Using the GROUP BY and HAVING clauses can neatly show the duplicates in your data. Once you have validated that the rows are the same, you may choose to remove the duplicate(s) using the DELETE statement. Duplicate Values in Multiple Columns Often, you’re interested in finding rows where a combination of a few columns match. For this example, we will be using the OrderDetails table, a sample of which is shown below. OrderDetailIDOrderIDProductIDQuantity 1102481112 2102484210 310248725 410249149 510249142 6102495140 ………… 520104432812 We want to find entries where the OrderID and ProductID columns are identical. This type of duplicate likely means there is a bug in the ordering system, since each order will process each product in that order only once in the cart. If multiple quantities of that product are ordered, the Quantity value would simply be increased; separate (duplicate) rows should not be created. A glitch of this type may impact business operations negatively if the orders are being fulfilled, packaged, and shipped automatically. To find duplicates in multiple column values, we can use the following query. It’s very similar to the one for a single column: SELECT OrderID, ProductID, COUNT(*) FROM OrderDetails GROUP BY OrderID, ProductID HAVING COUNT(*)>1 RESULT Number of Records: 2 Above, we can confirm that the ordering system does indeed have a bug. Like the first example using a single column, this second example similarly allows us to find errors in the ordering system. In this case, products are being registered as a new order even though they were added to the same cart by the same customer. Now you, as the business owner, can take proper corrective actions to rectify this bug in your order management system. Note that above, we used COUNT(*) and not a column-specific counter such as COUNT(OrderID). COUNT(*) counts all rows, whereas COUNT (Column) only counts non-null values in the specified column. However, in this example, it will not have made a difference either way – there were no null values in either of the two columns being grouped. Master Dealing with Duplicate Values Finding duplicates in SQL is mainly about quality/rationality checks and data validation. These checks are often applied to the day-to-day operations of a lot of small and medium businesses. Also, this is a very common interview question for data science/analyst roles! So, it’s great that you now know the basics of how you can approach this question. Still, it goes without saying that you will definitely need more practice to clearly see the nuances brought forward by the uniqueness of each dataset and what criteria you should be applying for those rationality and quality checks. To get a better handle on dealing with duplicate records, I would definitely recommend LearnSQL’s SQL Basics course, which covers these concepts holistically, with a comprehensive set of practical exercises. Tags: sql learn sql GROUP BY ORDER BY You may also like Difference between GROUP BY and ORDER BY in Simple Words For someone who's learning SQL, one of the most common concepts that they get stuck with is the difference between GROUP BY and ORDER BY. Read more Getting the Hang of the GROUP BY Clause GROUP BY is an important part of the SQL SELECT statement. But new SQL coders can run into some problems when this clause is used incorrectly. Here’s how to avoid those issues. Read more GROUP BY Clause: How Well Do You Know It? The GROUP BY clause is the most basic way to compute statistics in SQL. It can be quite tough for beginners but it is really powerful. Read more NULL Values and the GROUP BY Clause We've already covered how to use the GROUP BY clause but how does SQL's GROUP BY clause work when NULL values are involved? Find out! Read more How ORDER BY and NULL Work Together in SQL Learn how NULLs are sorted by the ORDER BY clause in different databases and how to change the default behavior. Read more What Is the Difference Between a GROUP BY and a PARTITION BY? What is the difference between a GROUP BY and a PARTITION BY in SQL queries? When should you use which? You can find the answers in today's article. Read more SQL Window Functions vs. GROUP BY: What’s the Difference? Window functions and GROUP BY may seem similar at first, but they’re quite different. Learn how window functions differ from GROUP BY and aggregate functions. Read more Subscribe to our newsletter Join our weekly newsletter to be notified about the latest posts.