Back to articles list Articles Cookbook
5 minutes read

How to Remove Junk Characters in SQL

Unwanted characters in text data can be a bit of a pain, but there’s an easy way to fix them. Scroll down to learn how to remove junk characters in SQL in the easiest way!

Sometimes, we’ll find unwanted characters inside our string data because our SQL queries didn’t work as expected. Moreover, these extra characters may sometimes be invisible, which really complicates things. In this article, we’ll examine some string-related SQL functions that can handle unwanted characters—visible or not!

The Importance of Standardizing Strings in Your Database

When we work with data populated or loaded as free text, the data quality is often quite poor. Unwanted characters can seriously hurt the quality of your data and make it more difficult to analyze the information you’ve collected. Thus, it’s important to understand how you can use SQL string functions to fix these common problems so you can clean up your database.

How to remove junk characters in SQL? Let’s start by exploring the SQL trim and length functions.

Problem-Solving with SQL String Functions

To explain how to solve problems with unwanted characters, we’ll work with a simple health care database. Imagine we have a doctor who stores all his patient data in a single table. The table contains the patient’s full name, the date of the visit, the doctor’s diagnosis, the suggested treatment, and any drugs that were prescribed.

Every time a patient visits his office, the doctor creates a new record. He manually types his notes into the database, so the data quality is occasionally poor.

Full_Name date diagnostic treatment Drugs_in_receipt
John Smith ’03/04/2016′ Flu rest “aspirin, paracetamol”
Mary Deep ’10/11/2016′ Food Poisoning Rest & diet “penicillin,paracetamol”
Agnes Jason ’03/12/2016′ flu rest “aspirin, paracetamol”
Johnny SMITH ’15/03/2017′ Food Poisoned Rest & diet “penicillin,paracetamol”
AGNES Jason ’19/072017′ angina Rest and don’t speak “amoxicillin”
Peter Duckerz ’10/10/2017′ Flu Rest “paracetamol”

SQL Functions for Removing Invisible and Unwanted Characters

In some cases, a text string can have unwanted characters, such as blank spaces, quotes, commas, or even “|” separators. These can be on either or both sides of the string. We can remove those unwanted characters by using the SQL TRIM, SQL LTRIM, and SQL RTRIM functions. They are very similar and are explained in the following table:

Function Parameters Behavior
LTRIM A string and a specific character Removes the specified character from the left side only
RTRIM A string and a specific character Removes the specified character from the right side only
TRIM A string and a specific character Removes the specified character from both sides

Let’s try these functions, starting with LENGTH. How to remove junk characters in SQL using them?

If you examine the original table, you’ll notice that some entries under the diagnostic column have some unwanted characters such as unnecessary spaces at the beginning. But there are also some hidden spaces after each entry. To fix this, we’ll start by counting the number of characters in the diagnostic strings using the LENGTH function. As blank spaces are not visible characters, we use angle brackets to show us where the extra spaces (if any) are.

Let’s see how the SQL LENGTH function works in this query:

SELECT 	diagnostic as real_diagnostic,
	 	length(diagnostic) as field_length,
		'<' || diagnostic || '>' as delimited_diagnostic
FROM patient_data 
real_diagnostic field_length delimited_diagnostic
Flu 3  <Flu>
Food Poisoning 14  <Food Poisoning>
flu 4 < flu>
Food Poisoned 15 < Food Poisoned>
angina 6  <angina>
Flu 4 < Flu>

We can see that several records have unwanted characters, i.e. unnecessary spaces. For “ flu”, the length is 4 instead of 3, and the delimited field shows the blank at the beginning. A proper query will fix this issue and remove the unnecessary space. Let’s go ahead and do that now!

SELECT	diagnostic as real_diagnostic, 	
		trim(diagnostic) as trimmed_diagnostic
FROM patient_data 
real_diagnostic trimmed_diagnostic
Flu Flu
Food Poisoning Food Poisoning
flu flu
Food Poisoned Food Poisoned
angina angina
Flu Flu

Power Strings: Nested SQL String Functions

Let’s suppose our doctor wants to know how many patients were diagnosed with each of the illnesses in the diagnostic column. We’d use the following query to get this information:

SELECT diagnostic, count(*)
FROM  patient_data
GROUP BY diagnostic
diagnostic count(*)
Flu 1
Food Poisoning 1
flu 1
food poisoning 1
angina 1
Flu 1

As we can see, the result is not what we expected. The same illness is showing up several times because the doctor was not consistent with his typing. A diagnosis of flu shows up as “Flu”, “flu”, and “ flu”. We know they are the same, but the database engine sees them as three different things.

We have the skills to fix this query and get the result we want. We can use the same nested expression to get rid of the unwanted characters (extra spaces) and eliminate the capitalization mistakes. Here’s how it looks:

SELECT lower(trim(diagnostic)), count(*)
FROM  patient_data
GROUP BY lower(trim(diagnostic))
diagnostic count(*)
flu 3
food poisoning 2
angina 1

In programming, it is really common to nest functions, or call a function from inside another function for use as a parameter. This is what we did in the previous example.

Learn More About SQL String Functions

In this article, we covered the important SQL string functions TRIM and LENGTH to learn how to remove junk characters in SQL. If you want to detect hidden or unwanted characters as part of an initial diagnosis, use LENGTH. Then, use TRIM to get rid of unwanted characters. We also discussed how you can nest SQL functions, which is a powerful technique in programming.

If you’re eager to learn more about SQL string functions, check out our Standard SQL Functions course. Try it for free today!