14th Feb 2023 9 minutes read Data Cleaning in SQL Nicole Darnley data analysis Table of Contents What Is Data Cleaning? Data Cleaning Techniques How and When to Delete Data Example 1: Deleting Duplicate Data Example 2: Ordering Data Before Deletion Example 3: Removing NULL Values How to Update Data Example 1: Put a Meaningful Label for NULL Values Example 2: Fix the Capitalization of Values Always Remember to Clean Your Data Data cleaning is an important part of any data analysis. Here we’ll discuss techniques you can use to do data cleaning in SQL. I find it nearly impossible to focus on work when my desk is a mess. If it’s cluttered with paper, coffee mugs, or random toys my daughter has somehow snuck into my office, there is no chance I will be able to get anything done until my desk is back in order. For some reason, it’s like the clutter on my desk has somehow made its way into my mind. This same idea is pertinent to data cleaning. Many times, I’ve spent hours on an analysis and drawn my conclusions only to learn of an inconsistency in the data that negates my whole report. As analysts, we typically jump straight into analyzing data without first taking the time to ensure our data is clean. This can lead to many hours of wasted time – or even worse, inaccurate reports. What Is Data Cleaning? The process of data cleaning (also called data cleansing) involves identifying any inaccuracies in a dataset and then fixing them. It’s the first step in any analysis and it includes deleting data, updating data, and finding inconsistencies or things that just don’t make sense. You can learn all SQL features needed to clean data in SQL in our SQL from A to Z track. The track contains 7 interactive SQL courses that will teach you complete SQL, from the very basics through intermediate topics and up to advanced SQL concepts like window functions and recursive queries. It is the most complete set of SQL courses available on the Internet. Data Cleaning Techniques Now that you’ve got the idea, let’s go ahead and take a look at the SQL techniques you can use to cleanse data. For each example, we’ll be using the companies table shown below. It shows information about various companies: idnameindustryyear_foundedemployeesstatecity 1Over-HexSoftware200625TXFranklin 2UnimattaxIT Services200936TXNewtown Square 3LexilaReal Estate203238ILTinley Park 4GreenfaxRetail2012320scGreenville 5SaoaceEnergy200924WINew Holstein 6DonplusAdvertising & Marketing200926caLos Angeles 7BlacklaneIT Services20119CAOrange 8ToughtamLogistics & Transportation201120ALBirmingham 9ToughtamLogistics & Transportation201120ALBirmingham 10QuotelaneAdvertising & MarketingNULL4SCGreenville 11GanzzapAdvertising & Marketing2011133CASan Francisco 12YearflexNULL201345WIMadison How and When to Delete Data Sometimes you will encounter scenarios in which you need to remove data from your dataset. This could be because the data is not pertinent to what you are analyzing or it is duplicated or inaccurate. In the next few examples, we’ll explore these different scenarios and how to approach them. Example 1: Deleting Duplicate Data The first thing we’ll do is look for any data that needs to be deleted. This could be because of duplicates or because the data is not relevant. In this table, we can quickly see that the row for the Toughtam company is duplicated. This won’t be so easy to identify in a large dataset. Before we just go and delete that row, let’s discuss how we would find it. In this dataset, each company should have only one row, so let’s use the GROUP BY and HAVING clauses to identify duplicate names. This query is going to count the number of times each name exists in the database using GROUP BY. Then it uses the HAVING clause to filter the results for only those names that exist more than once. SELECT name, COUNT(name) as count FROM companies GROUP BY name HAVING(count > 1) This query will return the following result: namecount Toughtam2 Great! Now we know that the company named Toughtam is duplicated, but how do we delete one of the rows? We’ll use a combination of ROW_NUMBER() and DELETE. First, let’s add a row number for each row based on the name column: SELECT name, ROW_NUMBER() OVER(PARTITION BY name) AS rn FROM companies idnameindustryyear_foundedemployeesstatecityrn 1Over-HexSoftware200625TXFranklin1 2UnimattaxIT Services200936TXNewtown Square1 3LexilaReal Estate203238ILTinley Park1 4GreenfaxRetail2012320scGreenville1 5SaoaceEnergy200924WINew Holstein1 6DonplusAdvertising & Marketing200926caLos Angeles1 7BlacklaneIT Services20119CAOrange1 8ToughtamLogistics & Transportation201120ALBirmingham1 9ToughtamLogistics & Transportation201120ALBirmingham2 10QuotelaneAdvertising & MarketingNULL4SCGreenville1 11GanzzapAdvertising & Marketing2011133CASan Francisco1 12YearflexNULL201345WIMadison1 What we’ve done is add a new column that shows the row number for each name. As you can see, there is now a 1 and 2 for the rows for Toughtam. Now we’ll run a DELETE statement to remove any rows where the rn column is greater than 1. DELETE FROM ( SELECT name, ROW_NUMBER() OVER(PARTITION BY name) AS rn FROM companies ) WHERE rn > 1 Now our dataset looks like this: idnameindustryyear_foundedemployeesstateCity 1Over-HexSoftware200625TXFranklin 2UnimattaxIT Services200936TXNewtown Square 3LexilaReal Estate203238ILTinley Park 4GreenfaxRetail2012320scGreenville 5SaoaceEnergy200924WINew Holstein 6DonplusAdvertising & Marketing200926caLos Angeles 7BlacklaneIT Services20119CAOrange 8ToughtamLogistics & Transportation201120ALBirmingham 10QuotelaneAdvertising & MarketingNULL4SCGreenville 11GanzzapAdvertising & Marketing2011133CASan Francisco 12YearflexNULL201345WIMadison Example 2: Ordering Data Before Deletion In this example, the rows for Toughtam are the same (with the exception of id), so we are not ordering the rows by anything other than how they appear in the database. Many times, you’ll find that rows are duplicated, but maybe the fields are not identical. If this is the case, you can add an ORDER BY clause after the PARTITION BY. For example, suppose the data looked like this: idnameindustryyear_foundedemployeesstatecreated 8ToughtamLogistics & Transportation201120AL1/3/2023 9ToughtamLogistics & Transportation201130AL1/10/2023 It looks like this record was updated for this company on 1/20/2023 and the employee count increased. If we wanted to keep the most recent record, we would run: DELETE FROM ( SELECT name, ROW_NUMBER() OVER(PARTITION BY name ORDER BY created DESC) AS rn FROM companies ) WHERE rn > 1 As you can see, we’re now ordering the ROW_NUMBER() statement by the created field in descending order, pulling the most recently-created record first. We then delete the records that come after the first record. For more information on finding duplicate values in SQL, check out our article How to Find Duplicate Values in SQL. Example 3: Removing NULL Values Now let’s take a look at the NULLs. NULL indicates a missing value; you can read more about them here. Depending on the implications of NULL values in your data, you can either remove those rows or you can update them. In our example, we see two NULL values. One row has a NULL value for industry and the other for the year_founded. We are going to handle each in a different way. A company must have a year that they were founded. We’re going to use DELETE to remove that row where it is missing, since it seems to be bad data. SELECT FROM companies WHERE year_founded IS NULL idnameindustryyear_foundedemployeesstatecity 10QuotelaneAdvertising & MarketingNULL4SCGreenville In the above query, we are using the IS NULL clause. This clause is looking at the year_founded column and returning any rows where it IS NULL. Once we verify this is the row we want to remove, we can delete it by running: DELETE FROM companies WHERE year_founded IS NULL At this point, we’re done deleting bad data and we’re ready to move on to the UPDATE statement. Let’s use this to fix our other NULL value. How to Update Data The UPDATE statement is used to modify existing data. You would use this data cleaning technique when correcting inaccurate data or to format your data (making it more readable). Over the next few examples, we’ll walk through these types of scenarios to understand how to manipulate data using UPDATE. Example 1: Put a Meaningful Label for NULL Values As we’ve already seen, there is one company that has a NULL value for industry. We’re OK with this because, in our hypothetical situation, we know not all industries are available in our database. What we need to do in this situation is update the NULL to be “Other.” Another option would be to replace NULLs with “NA” or “Not Applicable.” First, let’s use SELECT to pull the row with the NULL industry: SELECT * FROM companies WHERE industry IS NULL idnameindustryyear_foundedemployeesstatecity 12YearflexNULL201345WIMadison Now that we know we’ve pulled the correct row, we can UPDATE the industry column. We’ll do this by running: UPDATE companies SET industry = ‘Other’ WHERE industry IS NULL When using UPDATE, the first thing we need to do is identify the table we want to modify. In our example, this table is companies. Next, we need to say which column we’re updating and what we’re updating it to. We identify the column by using SET [column name]. Then we define what we want to change the column to using = [ value ]. The WHERE clause is the same as if we were to write a SELECT statement. We only want to change the industry to ‘Other’ if the industry IS NULL. Example 2: Fix the Capitalization of Values At this point our data is looking better, but the state column could use a little cleaning up. Some of the values are in uppercase and some are in lowercase. Typically, the state abbreviation is capitalized, so let’s update all the lowercase values to uppercase. idnameindustryyear_foundedemployeesstatecity 1Over-HexSoftware200625TXFranklin 2UnimattaxIT Services200936TXNewtown Square 3LexilaReal Estate203238ILTinley Park 4GreenfaxRetail2012320scGreenville 5SaoaceEnergy200924WINew Holstein 6DonplusAdvertising & Marketing200926caLos Angeles 7BlacklaneIT Services20119CAOrange 8ToughtamLogistics & Transportation201120ALBirmingham 11GanzzapAdvertising & Marketing2011133CASan Francisco 12YearflexOther201345WIMadison Since we want to ensure that all state values are uppercase, we can run: UPDATE companies SET state = UPPER(state) When you wrap the UPPER() statement around a column name, you are changing all the letters to uppercase. (With the LOWER() statement, you do the opposite – change all letters to lowercase). Now our table looks like this: idnameindustryyear_foundedemployeesstatecity 1Over-HexSoftware200625TXFranklin 2UnimattaxIT Services200936TXNewtown Square 3LexilaReal Estate203238ILTinley Park 4GreenfaxRetail2012320SCGreenville 5SaoaceEnergy200924WINew Holstein 6DonplusAdvertising & Marketing200926CALos Angeles 7BlacklaneIT Services20119CAOrange 8ToughtamLogistics & Transportation201120ALBirmingham 11GanzzapAdvertising & Marketing2011133CASan Francisco 12Yearflex‘Other’201345WIMadison Example 3: Correct Logical Errors One last thing you’ll want to look for in your dataset is logical errors. In our data, we see that one company has a year_founded value of 2032. Well, that just isn’t possible, since a company can’t be founded in the future. We can identify future dated records by running: SELECT * FROM companies WHERE year_founded > CURRENT_TIMESTAMP() CURRENT_TIMESTAMP() returns the current date and time. In the above statement, we’re pulling all records where year_founded is after the time the query is run. You could also hard code a date, such as today’s date. After a little digging, we see that this is a typo and that company was founded in 2012, so we’ll go ahead and modify that record: UPDATE companies SET year_founded = 2012 FROM companies WHERE id = 3 idnameindustryyear_foundedemployeesstatecity 1Over-HexSoftware200625TXFranklin 2UnimattaxIT Services200936TXNewtown Square 3LexilaReal Estate201238ILTinley Park 4GreenfaxRetail2012320SCGreenville 5SaoaceEnergy200924WINew Holstein 6DonplusAdvertising & Marketing200926CALos Angeles 7BlacklaneIT Services20119CAOrange 8ToughtamLogistics & Transportation201120ALBirmingham 11GanzzapAdvertising & Marketing2011133CASan Francisco 12Yearflex‘Other’201345WIMadison In this scenario, we specified a specific id in the WHERE clause. This is because you would most likely not want to update all year_founded values to be the same year for any company that had a year founded in the future. We know the year founded for this specific company, so we’ll only update that record. Always Remember to Clean Your Data Our data now looks much clearer than the original dataset. Data cleaning, while tedious, is an imperative part of the data analysis process. Never assume the data that you are working with is clean. Explore the data by looking for duplicates, NULLs, and any logical fallacies. You now understand several SQL techniques that you can use to modify your data including DELETE and UPDATE. A great next step is to take the SQL from A to Z track. It contains 7 interactive SQL courses, including an entire course on the DELETE, UPDATE, and INSERT clauses. It is the most complete set of SQL courses available on the Internet. This track will help reinforce everything that you’ve learned in this article and give you opportunities for additional practice. Happy data cleaning! Tags: data analysis