Back to articles list February 9, 2018 - 5 minutes read How to Remove Junk Characters in SQL Ignacio L. Bisso Ignacio is a database consultant from Buenos Aires, Argentina. He’s worked for 15 years as a database consultant for IT companies like Informix and IBM. These days, he teaches databases at Sarmiento University and works as a PostgreSQL independent SQL consultant. A proud father of four kids with 54 years in his backpack, Ignacio plays soccer every Saturday afternoon, enjoying every match as if it’s his last one. Tags: common sql mistakes data quality how to in sql SQL basics Unwanted characters in text data can be a bit of a pain, but there’s an easy way to fix them. Scroll down to learn how to remove junk characters in SQL in the easiest way! Sometimes, we’ll find unwanted characters inside our string data because our SQL queries didn’t work as expected. Moreover, these extra characters may sometimes be invisible, which really complicates things. In this article, we’ll examine some string-related SQL functions that can handle unwanted characters—visible or not! The Importance of Standardizing Strings in Your Database When we work with data populated or loaded as free text, the data quality is often quite poor. Unwanted characters can seriously hurt the quality of your data and make it more difficult to analyze the information you’ve collected. Thus, it’s important to understand how you can use SQL string functions to fix these common problems so you can clean up your database. How to remove junk characters in SQL? Let’s start by exploring the SQL trim and length functions. Problem-Solving with SQL String Functions To explain how to solve problems with unwanted characters, we’ll work with a simple health care database. Imagine we have a doctor who stores all his patient data in a single table. The table contains the patient’s full name, the date of the visit, the doctor’s diagnosis, the suggested treatment, and any drugs that were prescribed. Every time a patient visits his office, the doctor creates a new record. He manually types his notes into the database, so the data quality is occasionally poor. Full_Name date diagnostic treatment Drugs_in_receipt John Smith ’03/04/2016′ Flu rest “aspirin, paracetamol” Mary Deep ’10/11/2016′ Food Poisoning Rest & diet “penicillin,paracetamol” Agnes Jason ’03/12/2016′ flu rest “aspirin, paracetamol” Johnny SMITH ’15/03/2017′ Food Poisoned Rest & diet “penicillin,paracetamol” AGNES Jason ’19/072017′ angina Rest and don’t speak “amoxicillin” Peter Duckerz ’10/10/2017′ Flu Rest “paracetamol” SQL Functions for Removing Invisible and Unwanted Characters In some cases, a text string can have unwanted characters, such as blank spaces, quotes, commas, or even “|” separators. These can be on either or both sides of the string. We can remove those unwanted characters by using the SQL TRIM, SQL LTRIM, and SQL RTRIM functions. They are very similar and are explained in the following table: Function Parameters Behavior LTRIM A string and a specific character Removes the specified character from the left side only RTRIM A string and a specific character Removes the specified character from the right side only TRIM A string and a specific character Removes the specified character from both sides Let’s try these functions, starting with LENGTH. How to remove junk characters in SQL using them? If you examine the original table, you’ll notice that some entries under the diagnostic column have some unwanted characters such as unnecessary spaces at the beginning. But there are also some hidden spaces after each entry. To fix this, we’ll start by counting the number of characters in the diagnostic strings using the LENGTH function. As blank spaces are not visible characters, we use angle brackets to show us where the extra spaces (if any) are. Let’s see how the SQL LENGTH function works in this query: SELECT diagnostic as real_diagnostic, length(diagnostic) as field_length, '<' || diagnostic || '>' as delimited_diagnostic FROM patient_data real_diagnostic field_length delimited_diagnostic Flu 3 <Flu> Food Poisoning 14 <Food Poisoning> flu 4 < flu> Food Poisoned 15 < Food Poisoned> angina 6 <angina> Flu 4 < Flu> We can see that several records have unwanted characters, i.e. unnecessary spaces. For “ flu”, the length is 4 instead of 3, and the delimited field shows the blank at the beginning. A proper query will fix this issue and remove the unnecessary space. Let’s go ahead and do that now! SELECT diagnostic as real_diagnostic, trim(diagnostic) as trimmed_diagnostic FROM patient_data real_diagnostic trimmed_diagnostic Flu Flu Food Poisoning Food Poisoning flu flu Food Poisoned Food Poisoned angina angina Flu Flu Power Strings: Nested SQL String Functions Let’s suppose our doctor wants to know how many patients were diagnosed with each of the illnesses in the diagnostic column. We’d use the following query to get this information: SELECT diagnostic, count(*) FROM patient_data GROUP BY diagnostic diagnostic count(*) Flu 1 Food Poisoning 1 flu 1 food poisoning 1 angina 1 Flu 1 As we can see, the result is not what we expected. The same illness is showing up several times because the doctor was not consistent with his typing. A diagnosis of flu shows up as “Flu”, “flu”, and “ flu”. We know they are the same, but the database engine sees them as three different things. We have the skills to fix this query and get the result we want. We can use the same nested expression to get rid of the unwanted characters (extra spaces) and eliminate the capitalization mistakes. Here’s how it looks: SELECT lower(trim(diagnostic)), count(*) FROM patient_data GROUP BY lower(trim(diagnostic)) diagnostic count(*) flu 3 food poisoning 2 angina 1 In programming, it is really common to nest functions, or call a function from inside another function for use as a parameter. This is what we did in the previous example. Learn More About SQL String Functions In this article, we covered the important SQL string functions TRIM and LENGTH to learn how to remove junk characters in SQL. If you want to detect hidden or unwanted characters as part of an initial diagnosis, use LENGTH. Then, use TRIM to get rid of unwanted characters. We also discussed how you can nest SQL functions, which is a powerful technique in programming. If you’re eager to learn more about SQL string functions, check out our Standard SQL Functions course. Try it for free today! Tags: common sql mistakes data quality how to in sql SQL basics You may also like How to Solve Capitalization Data Quality Issues Misspelled names, typos, and text data quality issues in your database? No problem! We can fix it with SQL string functions. Read more 15 Best SQL Articles for Beginners Published in 2017 If you're looking for articles on SQL for beginners, take a look at my comprehensive list of best SQL articles from 2017! Read more The SQL Coalesce Function: Handling Null Values Though the SQL coalesce function may seem complex, it’s actually very straightforward. Let’s look at how it can be used to work with NULL values. Read more Subscribe to our newsletter Join our weekly newsletter to be notified about the latest posts.