14th Feb 2023 9 minutes read

Data Cleaning in SQL


Table of Contents

What Is Data Cleaning?
Data Cleaning Techniques
- How and When to Delete Data
- How to Update Data
  - Example 1: Put a Meaningful Label for NULL Values
  - Example 2: Fix the Capitalization of Values
Always Remember to Clean Your Data

Data cleaning is an important part of any data analysis. Here we’ll discuss techniques you can use to do data cleaning in SQL.

I find it nearly impossible to focus on work when my desk is a mess. If it’s cluttered with paper, coffee mugs, or random toys my daughter has somehow snuck into my office, there is no chance I will be able to get anything done until my desk is back in order. For some reason, it’s like the clutter on my desk has somehow made its way into my mind.

This same idea is pertinent to data cleaning. Many times, I’ve spent hours on an analysis and drawn my conclusions only to learn of an inconsistency in the data that negates my whole report. As analysts, we typically jump straight into analyzing data without first taking the time to ensure our data is clean. This can lead to many hours of wasted time – or even worse, inaccurate reports.

What Is Data Cleaning?

The process of data cleaning (also called data cleansing) involves identifying any inaccuracies in a dataset and then fixing them. It’s the first step in any analysis and it includes deleting data, updating data, and finding inconsistencies or things that just don’t make sense.

You can learn all SQL features needed to clean data in SQL in our SQL from A to Z track. The track contains 7 interactive SQL courses that will teach you complete SQL, from the very basics through intermediate topics and up to advanced SQL concepts like window functions and recursive queries. It is the most complete set of SQL courses available on the Internet.

Data Cleaning Techniques

Now that you’ve got the idea, let’s go ahead and take a look at the SQL techniques you can use to cleanse data. For each example, we’ll be using the companies table shown below. It shows information about various companies:

id	name	industry	year_founded	employees	state	city
1	Over-Hex	Software	2006	25	TX	Franklin
2	Unimattax	IT Services	2009	36	TX	Newtown Square
3	Lexila	Real Estate	2032	38	IL	Tinley Park
4	Greenfax	Retail	2012	320	sc	Greenville
5	Saoace	Energy	2009	24	WI	New Holstein
6	Donplus	Advertising & Marketing	2009	26	ca	Los Angeles
7	Blacklane	IT Services	2011	9	CA	Orange
8	Toughtam	Logistics & Transportation	2011	20	AL	Birmingham
9	Toughtam	Logistics & Transportation	2011	20	AL	Birmingham
10	Quotelane	Advertising & Marketing	NULL	4	SC	Greenville
11	Ganzzap	Advertising & Marketing	2011	133	CA	San Francisco
12	Yearflex	NULL	2013	45	WI	Madison

How and When to Delete Data

Sometimes you will encounter scenarios in which you need to remove data from your dataset. This could be because the data is not pertinent to what you are analyzing or it is duplicated or inaccurate. In the next few examples, we’ll explore these different scenarios and how to approach them.

Example 1: Deleting Duplicate Data

The first thing we’ll do is look for duplicates. In this table, we can quickly see that the row for the Toughtam company is duplicated, but this won’t be so easy to spot in a large dataset. Before we delete that row, let’s discuss how we would find it.

In this dataset, each company should have only one row, so let’s use the GROUP BY and HAVING clauses to identify duplicate names. This query uses GROUP BY to count the number of times each name appears in the table. Then it uses the HAVING clause to keep only those names that appear more than once.

SELECT name,
       COUNT(name) AS count
FROM companies
GROUP BY name
HAVING COUNT(name) > 1;

This query will return the following result:

name	count
Toughtam	2
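Once a name shows up as duplicated, it helps to look at the full rows before deleting anything. The sketch below is one way to do that. It assumes Python's built-in sqlite3 module and an in-memory table holding a few of the rows above; the article doesn't name a database, so that setup is an assumption.

```python
import sqlite3

# Assumption: an in-memory SQLite table with a few rows from the article.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE companies (id INTEGER PRIMARY KEY, name TEXT, employees INTEGER)"
)
conn.executemany(
    "INSERT INTO companies (id, name, employees) VALUES (?, ?, ?)",
    [(7, "Blacklane", 9), (8, "Toughtam", 20), (9, "Toughtam", 20)],
)

# Return every row whose name occurs more than once in the table.
duplicate_rows = conn.execute(
    """
    SELECT id, name, employees
    FROM companies
    WHERE name IN (
        SELECT name
        FROM companies
        GROUP BY name
        HAVING COUNT(*) > 1
    )
    ORDER BY name, id
    """
).fetchall()

print(duplicate_rows)  # [(8, 'Toughtam', 20), (9, 'Toughtam', 20)]
```

Seeing both rows side by side makes it easier to decide whether they are true duplicates or two records that happen to share a name.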

Great! Now we know that the company named Toughtam is duplicated, but how do we delete one of the rows? We’ll use a combination of ROW_NUMBER() and DELETE. First, let’s add a row number for each row based on the name column:

SELECT *,
       ROW_NUMBER() OVER(PARTITION BY name ORDER BY id) AS rn
FROM companies
ORDER BY id;

id	name	industry	year_founded	employees	state	city	rn
1	Over-Hex	Software	2006	25	TX	Franklin	1
2	Unimattax	IT Services	2009	36	TX	Newtown Square	1
3	Lexila	Real Estate	2032	38	IL	Tinley Park	1
4	Greenfax	Retail	2012	320	sc	Greenville	1
5	Saoace	Energy	2009	24	WI	New Holstein	1
6	Donplus	Advertising & Marketing	2009	26	ca	Los Angeles	1
7	Blacklane	IT Services	2011	9	CA	Orange	1
8	Toughtam	Logistics & Transportation	2011	20	AL	Birmingham	1
9	Toughtam	Logistics & Transportation	2011	20	AL	Birmingham	2
10	Quotelane	Advertising & Marketing	NULL	4	SC	Greenville	1
11	Ganzzap	Advertising & Marketing	2011	133	CA	San Francisco	1
12	Yearflex	NULL	2013	45	WI	Madison	1

What we’ve done is add a new column, rn, that numbers the rows within each name. As you can see, the two Toughtam rows are numbered 1 and 2. Now we’ll run a DELETE statement to remove any rows where rn is greater than 1.

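-- Number the rows within each name, then delete every row numbered above 1.
DELETE FROM companies
WHERE id IN (
    SELECT id
    FROM (
        SELECT id,
               ROW_NUMBER() OVER(PARTITION BY name ORDER BY id) AS rn
        FROM companies
    ) AS numbered
    WHERE rn > 1
);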

Now our dataset looks like this:

id	name	industry	year_founded	employees	state	city
1	Over-Hex	Software	2006	25	TX	Franklin
2	Unimattax	IT Services	2009	36	TX	Newtown Square
3	Lexila	Real Estate	2032	38	IL	Tinley Park
4	Greenfax	Retail	2012	320	sc	Greenville
5	Saoace	Energy	2009	24	WI	New Holstein
6	Donplus	Advertising & Marketing	2009	26	ca	Los Angeles
7	Blacklane	IT Services	2011	9	CA	Orange
8	Toughtam	Logistics & Transportation	2011	20	AL	Birmingham
10	Quotelane	Advertising & Marketing	NULL	4	SC	Greenville
11	Ganzzap	Advertising & Marketing	2011	133	CA	San Francisco
12	Yearflex	NULL	2013	45	WI	Madison
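The deduplication step can be tried out on a small scale. This is a minimal sketch, assuming Python's sqlite3 module with a bundled SQLite of version 3.25 or later (needed for window functions) and a cut-down companies table. It deletes by id, which is one portable way to target the rows that ROW_NUMBER() numbers above 1.

```python
import sqlite3

# Assumption: a cut-down, in-memory copy of the companies table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE companies (id INTEGER PRIMARY KEY, name TEXT, employees INTEGER)"
)
conn.executemany(
    "INSERT INTO companies (id, name, employees) VALUES (?, ?, ?)",
    [(7, "Blacklane", 9), (8, "Toughtam", 20), (9, "Toughtam", 20)],
)

# "with conn" commits on success and rolls back if the statement raises.
with conn:
    cursor = conn.execute(
        """
        DELETE FROM companies
        WHERE id IN (
            SELECT id
            FROM (
                SELECT id,
                       ROW_NUMBER() OVER (PARTITION BY name ORDER BY id) AS rn
                FROM companies
            ) AS numbered
            WHERE rn > 1
        )
        """
    )
    deleted = cursor.rowcount  # number of rows the DELETE removed

remaining_ids = [row[0] for row in conn.execute("SELECT id FROM companies ORDER BY id")]
print(deleted, remaining_ids)  # 1 [7, 8]
```

Checking the deleted row count against the number of duplicates found earlier is a cheap way to confirm the statement removed only what was intended.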

Example 2: Ordering Data Before Deletion

In the previous example, the rows for Toughtam are the same (with the exception of id), so it doesn’t matter which one we keep. Often, though, you’ll find that rows are duplicated but the fields are not identical. In that case, you can add an ORDER BY clause after the PARTITION BY to control which row is numbered first.

For example, suppose the data looked like this:

id	name	industry	year_founded	employees	state	created
8	Toughtam	Logistics & Transportation	2011	20	AL	1/3/2023
9	Toughtam	Logistics & Transportation	2011	30	AL	1/10/2023

It looks like the record for this company was updated on 1/10/2023 and the employee count increased. If we wanted to keep the most recent record, we would run:

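-- Number the newest row in each name as 1, then delete the older ones.
DELETE FROM companies
WHERE id IN (
    SELECT id
    FROM (
        SELECT id,
               ROW_NUMBER() OVER(PARTITION BY name ORDER BY created DESC) AS rn
        FROM companies
    ) AS numbered
    WHERE rn > 1
);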

As you can see, we’re now ordering the rows inside ROW_NUMBER() by the created field in descending order, so the most recently created record is numbered first. We then delete the records that come after it.
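Here is a small sketch of keeping only the newest row per name, again assuming Python's sqlite3 module. One assumption matters in particular: created is stored as ISO text (YYYY-MM-DD) so that it sorts chronologically. Text dates written as 1/3/2023 and 1/10/2023 would sort in the wrong order. The extra id DESC tie-breaker is also an addition for illustration.

```python
import sqlite3

# Assumption: created is stored as ISO-8601 text, which sorts chronologically.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE companies ("
    "id INTEGER PRIMARY KEY, name TEXT, employees INTEGER, created TEXT)"
)
conn.executemany(
    "INSERT INTO companies (id, name, employees, created) VALUES (?, ?, ?, ?)",
    [
        (8, "Toughtam", 20, "2023-01-03"),
        (9, "Toughtam", 30, "2023-01-10"),
    ],
)

with conn:
    conn.execute(
        """
        DELETE FROM companies
        WHERE id IN (
            SELECT id
            FROM (
                SELECT id,
                       ROW_NUMBER() OVER (
                           PARTITION BY name
                           ORDER BY created DESC, id DESC
                       ) AS rn
                FROM companies
            ) AS numbered
            WHERE rn > 1
        )
        """
    )

kept = conn.execute("SELECT id, employees, created FROM companies").fetchall()
print(kept)  # [(9, 30, '2023-01-10')]
```

The tie-breaker only matters when two rows share the same created value; without it, which of them is numbered first is not guaranteed.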

For more information on finding duplicate values in SQL, check out our article How to Find Duplicate Values in SQL.

Example 3: Removing NULL Values

Now let’s take a look at the NULLs. NULL indicates a missing value. Depending on the implications of NULL values in your data, you can either remove those rows or update them. In our example, we see two NULL values: one row has a NULL value for industry and another for year_founded. We are going to handle each in a different way.

A company must have a founding year, so a missing value here looks like bad data. We’re going to remove that row with DELETE, but first let’s use SELECT to check which rows would be affected:

SELECT *
FROM companies
WHERE year_founded IS NULL;

id	name	industry	year_founded	employees	state	city
10	Quotelane	Advertising & Marketing	NULL	4	SC	Greenville

In the above query, we are using the IS NULL operator. It checks the year_founded column and returns any rows where the value is NULL. Once we verify this is the row we want to remove, we can delete it by running:

DELETE
FROM companies 
WHERE year_founded IS NULL
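A quick way to see why the filter is written with IS NULL rather than = NULL is to run both. In this sketch (assuming Python's sqlite3 module and three sample rows), the comparison with = NULL matches nothing because a comparison with NULL is never true, while IS NULL finds the row we want to remove.

```python
import sqlite3

# Assumption: three sample rows in an in-memory SQLite table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE companies (id INTEGER PRIMARY KEY, name TEXT, year_founded INTEGER)"
)
conn.executemany(
    "INSERT INTO companies (id, name, year_founded) VALUES (?, ?, ?)",
    [(10, "Quotelane", None), (11, "Ganzzap", 2011), (12, "Yearflex", 2013)],
)

# "= NULL" never evaluates to true, so this counts zero rows.
equals_null = conn.execute(
    "SELECT COUNT(*) FROM companies WHERE year_founded = NULL"
).fetchone()[0]

# "IS NULL" is the correct test for a missing value.
is_null = conn.execute(
    "SELECT COUNT(*) FROM companies WHERE year_founded IS NULL"
).fetchone()[0]

with conn:  # commit the delete once it succeeds
    conn.execute("DELETE FROM companies WHERE year_founded IS NULL")

remaining_ids = [row[0] for row in conn.execute("SELECT id FROM companies ORDER BY id")]
print(equals_null, is_null, remaining_ids)  # 0 1 [11, 12]
```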

At this point, we’re done deleting bad data and we’re ready to move on to the UPDATE statement. Let’s use this to fix our other NULL value.

How to Update Data

The UPDATE statement is used to modify existing data. You would use this data cleaning technique when correcting inaccurate data or formatting your data to make it more readable. Over the next few examples, we’ll walk through these types of scenarios to understand how to manipulate data using UPDATE.

Example 1: Put a Meaningful Label for NULL Values

As we’ve already seen, there is one company that has a NULL value for industry. We’re OK with this because, in our hypothetical situation, we know not all industries are available in our database. What we need to do in this situation is update the NULL to be “Other.” Another option would be to replace NULLs with “NA” or “Not Applicable.”

First, let’s use SELECT to pull the row with the NULL industry:

SELECT *
FROM companies 
WHERE industry IS NULL

id	name	industry	year_founded	employees	state	city
12	Yearflex	NULL	2013	45	WI	Madison

Now that we know we’ve pulled the correct row, we can UPDATE the industry column. We’ll do this by running:

UPDATE companies
SET industry = 'Other'
WHERE industry IS NULL;

When using UPDATE, the first thing we need to do is identify the table we want to modify. In our example, this table is companies. Next, we need to say which column we’re updating and what we’re updating it to. We identify the column by using SET [column name]. Then we define what we want to change the column to using = [ value ]. The WHERE clause is the same as if we were to write a SELECT statement. We only want to change the industry to ‘Other’ if the industry IS NULL.
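The same UPDATE can be run from application code. This is a rough sketch, assuming Python's sqlite3 module and two sample rows; the label is passed as a bound parameter rather than typed into the SQL string, which is a choice made here for illustration.

```python
import sqlite3

# Assumption: two sample rows, one of them with a missing industry.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE companies (id INTEGER PRIMARY KEY, name TEXT, industry TEXT)"
)
conn.executemany(
    "INSERT INTO companies (id, name, industry) VALUES (?, ?, ?)",
    [(11, "Ganzzap", "Advertising & Marketing"), (12, "Yearflex", None)],
)

with conn:
    cursor = conn.execute(
        "UPDATE companies SET industry = ? WHERE industry IS NULL",
        ("Other",),
    )
    updated = cursor.rowcount  # only rows matched by the WHERE clause

industries = dict(conn.execute("SELECT id, industry FROM companies ORDER BY id"))
print(updated, industries)  # 1 {11: 'Advertising & Marketing', 12: 'Other'}
```

The row that already had an industry is untouched, which is what the WHERE clause is there to guarantee.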

Example 2: Fix the Capitalization of Values

At this point our data is looking better, but the state column could use a little cleaning up. Some of the values are in uppercase and some are in lowercase. Typically, the state abbreviation is capitalized, so let’s update all the lowercase values to uppercase.

id	name	industry	year_founded	employees	state	city
1	Over-Hex	Software	2006	25	TX	Franklin
2	Unimattax	IT Services	2009	36	TX	Newtown Square
3	Lexila	Real Estate	2032	38	IL	Tinley Park
4	Greenfax	Retail	2012	320	sc	Greenville
5	Saoace	Energy	2009	24	WI	New Holstein
6	Donplus	Advertising & Marketing	2009	26	ca	Los Angeles
7	Blacklane	IT Services	2011	9	CA	Orange
8	Toughtam	Logistics & Transportation	2011	20	AL	Birmingham
11	Ganzzap	Advertising & Marketing	2011	133	CA	San Francisco
12	Yearflex	Other	2013	45	WI	Madison

Since we want to ensure that all state values are uppercase, we can run:

UPDATE companies 
SET state = UPPER(state)

When you wrap the UPPER() function around a column name, all the letters in its values are converted to uppercase. (The LOWER() function does the opposite: it changes all letters to lowercase.) Now our table looks like this:

id	name	industry	year_founded	employees	state	city
1	Over-Hex	Software	2006	25	TX	Franklin
2	Unimattax	IT Services	2009	36	TX	Newtown Square
3	Lexila	Real Estate	2032	38	IL	Tinley Park
4	Greenfax	Retail	2012	320	SC	Greenville
5	Saoace	Energy	2009	24	WI	New Holstein
6	Donplus	Advertising & Marketing	2009	26	CA	Los Angeles
7	Blacklane	IT Services	2011	9	CA	Orange
8	Toughtam	Logistics & Transportation	2011	20	AL	Birmingham
11	Ganzzap	Advertising & Marketing	2011	133	CA	San Francisco
12	Yearflex	Other	2013	45	WI	Madison
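To see how many values the capitalization fix actually touches, the update can be limited to rows that are not already uppercase. This sketch assumes Python's sqlite3 module, where text comparison is case-sensitive by default. In a database with a case-insensitive collation, the added WHERE filter may match nothing, so treat it as an optional refinement.

```python
import sqlite3

# Assumption: three sample rows with mixed-case state abbreviations.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE companies (id INTEGER PRIMARY KEY, name TEXT, state TEXT)"
)
conn.executemany(
    "INSERT INTO companies (id, name, state) VALUES (?, ?, ?)",
    [(4, "Greenfax", "sc"), (6, "Donplus", "ca"), (7, "Blacklane", "CA")],
)

with conn:
    cursor = conn.execute(
        # Skip rows that are already uppercase (case-sensitive comparison).
        "UPDATE companies SET state = UPPER(state) WHERE state <> UPPER(state)"
    )
    changed = cursor.rowcount

states = [row[0] for row in conn.execute("SELECT state FROM companies ORDER BY id")]
print(changed, states)  # 2 ['SC', 'CA', 'CA']
```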

Example 3: Correct Logical Errors

One last thing you’ll want to look for in your dataset is logical errors. In our data, we see that one company has a year_founded value of 2032. Well, that just isn’t possible, since a company can’t be founded in the future. We can identify future-dated records by running:

SELECT *
FROM companies
WHERE year_founded > EXTRACT(YEAR FROM CURRENT_DATE);

CURRENT_DATE returns the current date, and EXTRACT(YEAR FROM ...) pulls the year out of it so that it can be compared with the numeric year_founded column. In the above statement, we’re pulling all records where year_founded is later than the year in which the query is run. The exact date functions vary between databases (SQL Server, for example, uses YEAR(GETDATE())). You could also hard code a year, such as the current year.
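The future-year check can be tried out end to end. This is a rough sketch, assuming SQLite through Python's sqlite3 module, where the current year is obtained with strftime('%Y', 'now'). The mistyped year is generated relative to today (a choice made for this example) so that the check stays valid whenever it is run.

```python
import sqlite3
from datetime import date

this_year = date.today().year

# Assumption: two sample rows; Lexila's year is deliberately in the future.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE companies (id INTEGER PRIMARY KEY, name TEXT, year_founded INTEGER)"
)
conn.executemany(
    "INSERT INTO companies (id, name, year_founded) VALUES (?, ?, ?)",
    [(1, "Over-Hex", 2006), (3, "Lexila", this_year + 9)],
)

# Compare the numeric year column with the current year, not with a timestamp.
future_rows = conn.execute(
    """
    SELECT id, name
    FROM companies
    WHERE year_founded > CAST(strftime('%Y', 'now') AS INTEGER)
    ORDER BY id
    """
).fetchall()

print(future_rows)  # [(3, 'Lexila')]
```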

After a little digging, we see that this is a typo and that company was founded in 2012, so we’ll go ahead and modify that record:

UPDATE companies
SET year_founded = 2012
WHERE id = 3;

id	name	industry	year_founded	employees	state	city
1	Over-Hex	Software	2006	25	TX	Franklin
2	Unimattax	IT Services	2009	36	TX	Newtown Square
3	Lexila	Real Estate	2012	38	IL	Tinley Park
4	Greenfax	Retail	2012	320	SC	Greenville
5	Saoace	Energy	2009	24	WI	New Holstein
6	Donplus	Advertising & Marketing	2009	26	CA	Los Angeles
7	Blacklane	IT Services	2011	9	CA	Orange
8	Toughtam	Logistics & Transportation	2011	20	AL	Birmingham
11	Ganzzap	Advertising & Marketing	2011	133	CA	San Francisco
12	Yearflex	Other	2013	45	WI	Madison

In this scenario, we used a specific id in the WHERE clause. This is because you would most likely not want to set every future year_founded value to the same year. We know the correct founding year for this one company only, so we update just that record.
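When a fix is meant for exactly one record, it can be worth confirming that before the change is made permanent. The sketch below, assuming Python's sqlite3 module and two sample rows, commits the update only if exactly one row was affected and rolls it back otherwise. The row-count check is an illustrative safeguard, not something the steps above require.

```python
import sqlite3

# Assumption: two sample rows; id 3 carries the mistyped year 2032.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE companies (id INTEGER PRIMARY KEY, name TEXT, year_founded INTEGER)"
)
conn.executemany(
    "INSERT INTO companies (id, name, year_founded) VALUES (?, ?, ?)",
    [(1, "Over-Hex", 2006), (3, "Lexila", 2032)],
)
conn.commit()  # keep the sample data separate from the update below

cursor = conn.execute(
    "UPDATE companies SET year_founded = ? WHERE id = ?",
    (2012, 3),
)

# Commit only if the update touched exactly the one row we expected.
if cursor.rowcount == 1:
    conn.commit()
else:
    conn.rollback()

years = dict(conn.execute("SELECT id, year_founded FROM companies ORDER BY id"))
print(years)  # {1: 2006, 3: 2012}
```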

Always Remember to Clean Your Data

Our data now looks much cleaner than the original dataset. Data cleaning, while tedious, is an essential part of the data analysis process. Never assume the data that you are working with is clean. Explore the data by looking for duplicates, NULLs, and any logical errors.

You now understand several SQL techniques that you can use to modify your data, including DELETE and UPDATE. A great next step is to take the SQL from A to Z track. It contains 7 interactive SQL courses, including an entire course on the DELETE, UPDATE, and INSERT statements. It is the most complete set of SQL courses available on the Internet. This track will help reinforce everything that you’ve learned in this article and give you opportunities for additional practice. Happy data cleaning!
