16th Apr 2021 11 minutes read Analyze Small Datasets with SQL Himanshu Kathuria data analysis Table of Contents What Is a Small Dataset? How to Use SQL to Analyze a Small Dataset Customer Analysis using SQL Supplier Analysis with SQL Sample Use Cases for SQL Analysis in Business Why Use SQL For Such Analysis and Not a Spreadsheet? Ready to Learn SQL and Generate Valuable Insights? If you think SQL data analysis is useful only when a business is generating millions of rows of data – that SQL cannot help you if you are generating a limited dataset – think again and read on! There is so much talk about how SQL is useful in analyzing Big Data that sometimes the applications of SQL data analysis for small datasets can go unnoticed. Having worked with early-stage businesses – which have just started generating data through daily transactions – as well as big established clients generating terabytes of data, I can tell you for sure that SQL can be an extremely handy tool for just about any business. It can help you generate insights and grow your business through data-based decision-making instead of gut feel and guesses. In this article, I'll show how SQL can be used for small datasets. Do you want to make better decisions based on data analysis? Are you an entrepreneur who runs a small business and aspires to make it big? Or maybe you are a student looking to learn SQL and understand its applications. In any case, this article is for you. First, let’s explain what constitutes a small dataset and how it’s different from Big Data. What Is a Small Dataset? To put it simply, a small dataset constitutes a few hundred to a few thousand rows of limited features (i.e. data attributes). For instance, ‘customer address’ is a common feature, usually represented as a column in a table. Before we dive deeper into understanding small data sets, take a look at this example. Imagine that you just started a small business where your customers place orders through a phone call. You store the orders’ transactional data in a database table known as Orders. Here, line_number, quantity, order_status, etc. are features for the Order entity. Orders Order_IdLine_NumberCustomer_IDItem_IdQuantityAmountDiscount_PercentBrand_NameItem_TypeOrder_DateOrder_Status Assume that you have managed to get 1,000 customers who have placed around 4,000 orders. On average, a customer has 3 different items per order. This makes a total of 12,000 rows in your order details table. You also have separate tables named customer_details, item_details, and brand_details with a similar number of features linked to this table. This is a typical example of a small dataset. ‘Small’ data does not mean having only 2-3 rows and 3-4 features; it means you have just enough data to start generating insights. Notice that this data will have little to no noise (meaningless information) and is usually well structured. The storage required for this kind of dataset may range from tens or hundreds of kilobytes/megabytes to a few terabytes (1 terabyte = ~1,000 GB). A business usually operates at such volume when it’s just getting started or when its scale is limited. This also happens when data is collected from limited sources. On the other hand, Big Data can take up many terabytes, be quite unstructured, and come with a decent amount of noise. For example, imagine analyzing all the public Facebook comments related to US elections. They probably run into millions of words and will be quite unstructured (i.e. come in different forms, lengths, languages, etc.). Here is a quick summary of the differences between big and small data: Small DataBig Data Number of FeaturesLimited set of featuresLarge number of features Number of RowsUsually a few thousand rowsMillions of rows StructureMostly structured dataBoth structured and unstructured NoiseLessLots Speed of Data GenerationLowHigh Now that you understand what a small dataset looks like, let’s examine some real-time practical uses of SQL analysis or SQL reporting for such data. How to Use SQL to Analyze a Small Dataset In the first few days or months of starting a business, strategy is often shaped by understanding what kind of value proposition and product or service features are most preferred by different kinds of customers. Your goal is to meet these needs better than anyone else. To understand your customers better, it’s common to group customers into various cohorts or segments by their different shared characteristics (e.g. all married men who have 2-4 kids would be one group, married men with no children another group, single dads a third group, and so on). You then use this information to analyze each group’s interaction patterns with your business. Small data can serve as a great source of useful information here, and SQL can use this data to help you create valuable analyses. Customer Analysis using SQL Suppose it has been two months since you launched your e-commerce fashion jewelry business. Your products are mostly directed towards women. However, you aren’t certain which age demographic is most likely to buy your products. So, right now all your marketing efforts are directed towards women of all ages. As part of your customer registration process, you start collecting customers’ dates of birth, which is then updated as the Age column in the customer database. Even with this limited data, you can find out some great insights. Using SQL, a single query can help you identify who buys the most of one type of item – say, small earrings. Here’s a sample of your dataset: Customer_IdCustomer_NameAgeSexPhone_NumberOccupation 1234John35MaleXXXXXJob 1235Stacey29FemaleXXXXXHomemaker ….….….….….30 The Customers table Order_IdLine_NumberCustomer_IDItem_IdQuantityAmountDiscount_PercentBrand_NameItem_TypeOrder_DateOrder_Status The Orders table And here's the SQL query to help you find which age groups and occupations mostly purchase small earrings. (Don’t worry if you don’t understand everything in this query right now; this is just to show you what’s possible with a few lines of code.) SELECT Case When c.age>18 and c.age<=25 then ‘18-25’ When c.age>25 and c.age<=30 then ‘25-30’ When c.age>30 and c.age<=40 then ‘30-40’ When c.age>40 then ‘>40’ else ‘<18’ END "Age Group", c.occupation “Occupation”, SUM(o.amount) “Total Value Purchased”, SUM(o.quantity) “Total Quantity”, COUNT(distinct o.order_id) “Total Orders”, COUNT(distinct c.customer_id) “Total Customers” FROM customers c JOIN orders o ON c.customer_id = o.customer_id WHERE o.item_type = ‘Earrings - small’ and c.sex = ‘Female’ GROUP BY 1 ORDER BY SUM(o.amount) DESC; The results: Age GroupOccupationTotal Value PurchasedTotal QuantityTotal OrdersTotal Customers 25-30Job$10,0001203020 25-30Homemaker$2,00040108 18-25Own Business$100222 >40Homemaker$90222 It is clear that small earrings are mostly purchased by working women between 25-30 years of age. You could slightly alter this query and get similar data for different groups, items, or brands. Once you have your answer, you can tailor your marketing campaigns to target that demographic. You can also add or modify products to cater to an underserved segment. After you make the necessary changes in your catalog or products, you can again use SQL to see what improvements followed. In this case, we drew a practical and actionable insight from a relatively small data set. As little as a few hundred rows can be a good starting point; you can further refine your insights as the data grows. Not only that, you can easily link the database to Python and use SQL with it to make your analysis even more powerful by automating communications to customers. This analysis can therefore be a starting point to further understand the needs of your customer base and serve them better. But wait, there’s more! Once you see that a particular way of reporting or data aggregation is useful, you can create automated reports using SQL and learn how your customer base evolves as you add more products and designs. In fact, customer behavior analysis using SQL is highly utilized in many industries. So, we’ve looked at an example for the sales and marketing domain. Let’s consider another scenario. Supplier Analysis with SQL Say that your online jewelry business follows a drop ship model (where your products are shipped from suppliers to the customer and you act as an interface between the two). If the ‘average time to delivery’ is an important metric for customer satisfaction, then any delay in shipping from the suppliers’ end could negatively affect the quality of service at your end. To address this, incorporating multiple destination route planning can help optimize shipping routes and minimize delays, thus improving overall delivery efficiency. SQL can help you assess this metric. Below, we’ll look at a sample query that uses the Shipments table, which has the following self-explanatory columns: Order_Id Item_Id Supplier_Id Supplier_Name Shipment_Id Shipment_Status Order_Delivery_Duration Here’s the query. Again, don’t worry if you don’t understand how it works; we’re just looking at possible use cases. SELECT Supplier_Name “Supplier Name”, Supplier_id “Supplier Id”, COUNT(DISTINCT Order_id) “Total Orders”, AVG(Order_Delivery_Duration) “Average Delivery Duration” FROM shipments s GROUP BY Supplier_Name, Supplier_Id ORDER BY AVG(Order_Delivery_Duration) DESC; And this is what the results look like: Supplier NameSupplier IdTotal OrdersAverage Delivery Duration Necklace Supplier XX21214615 Earrings Supplier YY3434573 Bracelet Supplier4551005 (This example is just for demonstration. You may want to look closely at the distribution of deliveries across orders for better insights.) This output clearly shows how the supplier named ‘Necklace Supplier XX’ lags in terms of average delivery duration. This insight can help you resolve issues with problematic suppliers while also setting the right customer expectations by being clear about the average days to delivery for various suppliers. Sample Use Cases for SQL Analysis in Business Here are a few other SQL use cases for small datasets: Analyze customer purchase patterns (i.e. which items sell the most). Understand customers’ affinities for various brands. Measure and report outcomes of marketing campaigns. Analyze the performance of sales reps. Improve sales reps’ productivity using data insights. Analyze customer journeys on mobile applications and websites. Measure supply chain performance and improve turnaround time. Understand the major cost heads and improve margins. Filter your client email database and organize a B2B SaaS cross-promotion. This is by no means an exhaustive list; the possibilities are virtually endless. This versatility clearly makes SQL an important skill, regardless of your department. For someone who works in sales, marketing, supply chain, HR, or finance, SQL can be used as a quick way to check on important metrics. These metrics can help determine the health of the business and also help in identifying insights that aid in decision-making. For someone in IT, SQL can be a useful tool to automate data analysis or reporting for specific departments. If you are a management or technical consultant, SQL can help you unlock great value for your customers. On the other hand, if you are presently a student who is planning on a corporate or even a freelance career, SQL is a must-have tool in your arsenal. And, in the data world, SQL is essential. In fact, SQL is listed as the top skill for data analysts. Source: towardsdatascience.com If you are planning on using SQL to write your own reports, I would recommend the SQL Reporting course by LearnSQL.com. It will help you write advanced queries that can really unlock a lot of value for your business. If you’re a complete newcomer to SQL, you should probably complete the SQL Fundamentals track first, which will give you a good grounding in the basics before you move on to the more advanced concepts. Why Use SQL For Such Analysis and Not a Spreadsheet? You can argue that MS Excel or Google Sheets can also enable this kind of analysis. While that is partially right, there are certain challenges that come with spreadsheets. Here are the steps to perform a similar task in Excel: Download the data into a spreadsheet. Organize the data and do all sorts of VLOOKUPs to link sheets. Create pivot charts based on age buckets. On the other hand, here is the process using SQL: Write a query :) Once you know how to write a query, SQL is easier! Also, Excel is much slower than SQL. While it takes seconds or minutes to locate and retrieve data with SQL, Excel can take hours to process the same. And there’s another advantage to using SQL. In SQL, data is separate from analysis. Hence, there’s less risk of you changing the data while doing the analysis. If you want to know more about the advantages of SQL over Excel, check out this article. It may seem that learning SQL is an additional headache. But this language is so simple and powerful that it is undoubtedly worth learning. And SQL is based on English sentence structure, so it’s probably not as hard as you think. Ready to Learn SQL and Generate Valuable Insights? If you were to ask me when you should start learning SQL, I would unequivocally say “Right away!”. The benefits of knowing SQL for data analysis, generating insights, and general reporting are so great that it is a must-have skill – for not just analysts or programmers, but any businessperson who uses data. There are great use cases for consultants and freelancers as well. LearnSQL.com has many great courses that can assist you in your SQL journey. If you are completely new to SQL, start with the SQL Basics course. You’ll learn to work with relational databases, write SQL queries, join tables, and analyze data. All the courses on the platform are interactive and contain a lot of hands-on exercises. You’ll learn specific topics and then have a chance to use that knowledge by writing SQL queries in our online console. You don't need to install or set anything up. Once you know SQL’s basics, you can start on more advanced courses to become a SQL expert. It really pays off. Some time ago, my colleague Jakub wrote about different database jobs and what they earn. Check it out – maybe you’ll find something that’s right for you. With a few weeks to months of effort, you will have equipped yourself with enough SQL to help you earn more money, either directly or indirectly: directly by winning you more clients or a higher-paying job, indirectly by business insights that can improve your revenue or margins. Once you’ve mastered SQL, analysis will take less time, which you can then use somewhere else. In fact, SQL has helped me quicken my analyses and improve not just my turnaround time for clients but also my efficacy. So what are you waiting for? Get started today! Tags: data analysis