2nd Sep 2020 Updated: 28th Nov 2023 6 minutes read

How to Find Duplicate Values in SQL — The Ultimate Guide


Table of Contents

Identify Duplicate Values in SQL
Duplicate Values in One Column
Duplicate Values in Multiple Columns
Why Duplicates in SQL Are Bad
Master Dealing with Duplicate Values

Find duplicate values in SQL efficiently and avoid wasting resources. This article demonstrates how to locate and address duplicate records using SQL's GROUP BY and HAVING clauses.

Database best practices usually dictate having unique constraints (such as the primary key) on a table to prevent the duplication of rows when data is extracted and consolidated. However, you may find yourself working on a dataset with duplicate rows. This could be because of human error, an application bug, or uncleaned data that’s been extracted and merged from external sources, among other things.
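To see how a uniqueness constraint keeps duplicates out in the first place, here is a minimal sketch using Python's built-in sqlite3 module with an in-memory database. The cut-down Orders table and its column types are assumptions made for illustration; other database systems report the violation with their own error types.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database

# Hypothetical, simplified Orders table: the primary key enforces uniqueness.
conn.execute("""
    CREATE TABLE Orders (
        OrderID    INTEGER PRIMARY KEY,
        CustomerID INTEGER,
        OrderDate  TEXT
    )
""")
conn.execute("INSERT INTO Orders VALUES (10251, 84, '1996-07-08')")

try:
    # A second row with the same OrderID violates the primary key.
    conn.execute("INSERT INTO Orders VALUES (10251, 84, '1996-07-08')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM Orders").fetchone()[0]
print(rejected, row_count)
```

With the constraint in place, the second insert is rejected and the table still holds a single row. The rest of this article deals with tables where no such constraint was defined.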

Why fix duplicate values? They can mess up calculations. They can even cost a company money; for example, an e-commerce business might process duplicated customer orders multiple times, which can have a direct impact on the business’s bottom line.

In this article, we will discuss how you can find those duplicates in SQL by using the GROUP BY and HAVING clauses.

Identify Duplicate Values in SQL

First, you will need to define the criteria for detecting duplicate rows. Is it a combination of two or more columns where you want to detect duplicate values, or are you simply searching for duplicates within a single column?

In the examples below, we will be exploring both these scenarios using a simple customer order database.

In terms of the general approach for either scenario, finding duplicate values in SQL comprises two key steps:

1. Use the GROUP BY clause to group all rows by the target column(s), i.e. the column(s) you want to check for duplicate values.
2. Use the COUNT function in the HAVING clause to check whether any group has more than one entry; those groups are the duplicate values.

For a quick visual refresher on GROUP BY, check out our We Learn SQL Series’ SQL GROUP BY video.

Duplicate Values in One Column

Here, we will be demonstrating how you can find duplicate values in a single column. For this example, we will be using the Orders table, a modified version of the table we used in my previous article on using GROUP BY in SQL. A sample of the table is shown below.

OrderID	CustomerID	EmployeeID	OrderDate	ShipperID
10248	90	5	1996-07-04	3
10249	81	6	1996-07-05	1
10250	34	4	1996-07-08	2
10251	84	3	1996-07-08	1
10251	84	3	1996-07-08	1
10252	76	4	1996-07-09	2
…	…	…	…	…
10443	66	8	1997-02-12	1

In this example, there are a few duplicates in the OrderID column. Ideally, each row should have a unique value for OrderID, since each individual order is assigned its own value. For some reason, that wasn’t implemented here. To find the duplicates, we can use the following query:

SELECT OrderID, COUNT(OrderID)
FROM Orders
GROUP BY OrderID
HAVING COUNT(OrderID) > 1

RESULT

Number of Records: 2

OrderID	COUNT(OrderID)
10251	2
10276	2

As we can see, OrderID 10251 (which we saw in the table sample above) and OrderID 10276 have duplicates.
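Before deciding what to do with the duplicates, it often helps to look at the complete rows rather than just the counts. The sketch below is one possible way to do that: it loads only the sample rows shown above into an in-memory SQLite database (so OrderID 10276 is not present) and uses the GROUP BY query as a subquery. The column types are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Orders (
        OrderID INTEGER, CustomerID INTEGER, EmployeeID INTEGER,
        OrderDate TEXT, ShipperID INTEGER
    )
""")
# Only the sample rows from the table above.
conn.executemany("INSERT INTO Orders VALUES (?, ?, ?, ?, ?)", [
    (10248, 90, 5, "1996-07-04", 3),
    (10249, 81, 6, "1996-07-05", 1),
    (10250, 34, 4, "1996-07-08", 2),
    (10251, 84, 3, "1996-07-08", 1),
    (10251, 84, 3, "1996-07-08", 1),
    (10252, 76, 4, "1996-07-09", 2),
])

# The duplicate-finding query becomes a subquery that selects
# which OrderID values to display in full.
duplicate_rows = conn.execute("""
    SELECT *
    FROM Orders
    WHERE OrderID IN (
        SELECT OrderID
        FROM Orders
        GROUP BY OrderID
        HAVING COUNT(OrderID) > 1
    )
    ORDER BY OrderID
""").fetchall()

for row in duplicate_rows:
    print(row)
```

On this sample, both copies of order 10251 are returned, which makes it easy to check whether the remaining columns match as well.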

Using the GROUP BY and HAVING clauses can neatly show the duplicates in your data. Once you have validated that the rows are the same, you may choose to remove the duplicate(s) using the DELETE statement.
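As a rough sketch of the DELETE step, the example below keeps one row per OrderID and removes the rest. It relies on SQLite's implicit rowid column to tell otherwise identical rows apart, which is an assumption that does not carry over to every database; other systems typically need a different technique. It also assumes you have already confirmed that the extra rows are true copies, because it keeps the first row per OrderID regardless of the other columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Orders (
        OrderID INTEGER, CustomerID INTEGER, EmployeeID INTEGER,
        OrderDate TEXT, ShipperID INTEGER
    )
""")
conn.executemany("INSERT INTO Orders VALUES (?, ?, ?, ?, ?)", [
    (10250, 34, 4, "1996-07-08", 2),
    (10251, 84, 3, "1996-07-08", 1),
    (10251, 84, 3, "1996-07-08", 1),  # exact copy of the previous row
    (10252, 76, 4, "1996-07-09", 2),
])

# Keep the row with the smallest rowid in each OrderID group; delete the others.
cursor = conn.execute("""
    DELETE FROM Orders
    WHERE rowid NOT IN (
        SELECT MIN(rowid)
        FROM Orders
        GROUP BY OrderID
    )
""")
deleted = cursor.rowcount

remaining = conn.execute("SELECT * FROM Orders ORDER BY OrderID").fetchall()
print(deleted, remaining)
```

Running the duplicate-finding query again after the DELETE is a simple way to confirm that no OrderID appears more than once.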

You can find more examples in our SQL Cookbook.

Duplicate Values in Multiple Columns

Often, you’re interested in finding rows where a combination of a few columns match. For this example, we will be using the OrderDetails table, a sample of which is shown below.

OrderDetailID	OrderID	ProductID	Quantity
1	10248	11	12
2	10248	42	10
3	10248	72	5
4	10249	14	9
5	10249	14	2
6	10249	51	40
…	…	…	…
520	10443	28	12

We want to find rows that share the same combination of OrderID and ProductID values. This type of duplicate likely means there is a bug in the ordering system, since each product should appear only once per order. If a customer orders several units of a product, the Quantity value should simply be increased; separate (duplicate) rows should not be created. A glitch of this type may hurt business operations if the orders are being fulfilled, packaged, and shipped automatically.

To find duplicates in multiple column values, we can use the following query. It’s very similar to the one for a single column:

SELECT OrderID, ProductID, COUNT(*)
FROM OrderDetails
GROUP BY OrderID, ProductID
HAVING COUNT(*) > 1

RESULT

Number of Records: 2

OrderID	ProductID	COUNT(*)
10249	14	2
10356	32	2

The result confirms that the ordering system does indeed have a bug. As in the single-column example, the query surfaces errors in the data: here, a product is being registered as a separate row even though it was added to the same order by the same customer. Now you, as the business owner, can take corrective action to fix this bug in your order management system.
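To try the multi-column query without a full database, the following sketch loads only the OrderDetails sample rows shown earlier into an in-memory SQLite database. The column aliases and the extra SUM(Quantity) column are additions for illustration: the sum shows what the combined quantity would be if the duplicate rows were merged into one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE OrderDetails (
        OrderDetailID INTEGER, OrderID INTEGER,
        ProductID INTEGER, Quantity INTEGER
    )
""")
# Only the sample rows from the OrderDetails table above.
conn.executemany("INSERT INTO OrderDetails VALUES (?, ?, ?, ?)", [
    (1, 10248, 11, 12),
    (2, 10248, 42, 10),
    (3, 10248, 72, 5),
    (4, 10249, 14, 9),
    (5, 10249, 14, 2),
    (6, 10249, 51, 40),
])

# Group by both columns; a group with more than one row is a duplicate pair.
duplicates = conn.execute("""
    SELECT OrderID,
           ProductID,
           COUNT(*)      AS copies,
           SUM(Quantity) AS total_quantity
    FROM OrderDetails
    GROUP BY OrderID, ProductID
    HAVING COUNT(*) > 1
""").fetchall()

print(duplicates)
```

On this sample, the only combination that occurs more than once is OrderID 10249 with ProductID 14, which appears twice with quantities 9 and 2.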

Note that above, we used COUNT(*) and not a column-specific counter such as COUNT(OrderID). COUNT(*) counts all rows, whereas COUNT(column) counts only the non-null values in the specified column. In this example, however, it would not have made a difference either way: there were no null values in either of the two columns being grouped.
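The difference does matter once null values appear. The sketch below uses a hypothetical two-column variant of OrderDetails in which ProductID is missing on two rows; the data is invented purely to illustrate the behavior, shown here in SQLite.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE OrderDetails (OrderID INTEGER, ProductID INTEGER)")
# Hypothetical data: two rows with a missing ProductID, two with ProductID 14.
conn.executemany("INSERT INTO OrderDetails VALUES (?, ?)", [
    (10249, None),
    (10249, None),
    (10249, 14),
    (10249, 14),
])

# COUNT(*) counts every row in the group, including rows where ProductID is NULL.
with_star = conn.execute("""
    SELECT OrderID, ProductID, COUNT(*)
    FROM OrderDetails
    GROUP BY OrderID, ProductID
    HAVING COUNT(*) > 1
    ORDER BY ProductID
""").fetchall()

# COUNT(ProductID) skips NULLs, so the NULL group counts as 0 and is filtered out.
with_column = conn.execute("""
    SELECT OrderID, ProductID, COUNT(ProductID)
    FROM OrderDetails
    GROUP BY OrderID, ProductID
    HAVING COUNT(ProductID) > 1
    ORDER BY ProductID
""").fetchall()

print(with_star)
print(with_column)
```

GROUP BY places the two NULL rows in a single group, so COUNT(*) reports that group as a duplicate, while COUNT(ProductID) does not. If the columns you group by can contain nulls, COUNT(*) is the safer choice for duplicate detection.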

Why Duplicates in SQL Are Bad

Duplicates in SQL databases are a crucial concern, particularly for quality control, rationality checks, and data validation. These checks are vital for the smooth operation of many small and medium-sized businesses.

When data is duplicated, it can lead to inaccurate results in analyses, skewed reports and, ultimately, misinformed business decisions. This can be especially critical in areas like inventory management, where duplicate entries might lead to overstocking or understocking.

Moreover, the ability to identify and handle duplicates in SQL is a common interview topic for data science and analyst roles.

Understanding the basics of handling duplicates is just the starting point. As you delve deeper into SQL, you'll discover that each dataset is unique, requiring specific criteria and approaches for quality and rationality checks. The uniqueness of each dataset means there's no one-size-fits-all solution; you'll need to practice and develop a keen eye for the details and nuances of different datasets.

For beginners, it's essential to recognize that duplicate data can compromise the integrity of your database. It can affect everything from basic queries to complex data analytics. Identifying and resolving duplicates is not just about cleaning up data; it's about ensuring the reliability and accuracy of the information that businesses and professionals rely on daily.

By mastering this skill, you ensure data quality and contribute to the overall efficiency and accuracy of business operations and decision-making processes.

Master Dealing with Duplicate Values

Once you know the basics, regular practice is crucial for mastering duplicate handling in SQL. Tackling various datasets will sharpen your skills, enhancing your grasp of database nuances. This consistent effort is key for all proficiency levels, ensuring effective management of duplicates.

If you are just starting out, in order to get a better handle on dealing with duplicate records, I would definitely recommend LearnSQL’s SQL Basics course, which covers these concepts holistically, with a comprehensive set of practical exercises.

Tags:

GROUP BY
