13th Feb 2024 15 minutes read

Integrating SQL with Python for Data Analysis

Integrating SQL with Python isn’t difficult. The two tools work together to combine the information-processing power of relational databases with the flexibility of a programming language. In this article, I will discuss the benefits of data analysis using SQL and Python, with real-world coding examples.

Why do most of the best data analysis tools on the market – such as Tableau and Power BI – include both SQL and Python in their toolboxes? Because these two programming languages complement each other to give you the power you need for deep and meaningful data analysis.

Since they’re two very different skills, there are not many tutorials on Python and SQL. To build your knowledge, you’ll need to learn Python and SQL separately. This article aims to bridge the gap by showing you how to put the two together so you have the best of both worlds.

If you don’t already know SQL, this article may help you get started. My recommendation would be LearnSQL.com’s SQL for Data Analysis learning track, which guides you from beginner-level to advanced data analysis skills. It includes four interactive courses that you’ll need about 48 hours to complete. You’ll be working in your browser, so you don’t need to install any software before you start. By the time you’ve completed over 40 practical exercises solving real-world data analysis problems, you’ll be an SQL guru!

And if you don’t know Python, I’d suggest you check out our sister site, LearnPython.com.

In this article, I’ll give you a taste of what you can do with data analysis using Python and SQL.

What Is SQL?

Let’s start by looking at SQL: what it is and what it can and can’t do.

SQL has been the industry-standard way of interacting with databases for more than 50 years. Here is more about the history of SQL. And no, it’s not out of date – it’s constantly revised to keep up with changing technology.

SQL is an English-like language that lets you manipulate, extract, analyze, and aggregate data from a database. It’s easy to learn, and it’s the perfect tool for working with very large datasets. It’s a specialized language – purely designed for working with data. There are lots of things it can’t do, such as complex mathematical functions, working with graphics, and complicated decision-making.

What Is Python?

Python is an all-purpose programming language that’s designed to be simple to learn and easy to read. This makes it a great choice for beginners, but it’s also enormously versatile and powerful. The real power of Python comes from its libraries, or collections of reusable code. The libraries give you a range of tools to help you easily program almost any kind of computing task, including:

Mathematical and statistical functions.
Image-processing tools.
Creating graphs and charts.
Tools for working with email and the Internet.
Working with various file formats.
Python SQL database manipulation.

Data Analysis with Python and SQL

Why do you need both tools? Your computer consists of many parts, but the most important are:

The CPU is the ‘brain’ of the computer and controls everything else.
RAM, which is the computer’s fast onboard memory. Ram is limited in size, and these days it’s generally measured in gigabytes. This is quite small compared to the amount of data stored by most organizations, which is why we need databases.
Storage devices, including hard drives and USB sticks. Storage devices can hold vast amounts of information, but they need a way to manage it and access it quickly. Relational database technology is one way of achieving this.
Various devices for communicating with the user, including the keyboard, display screen, camera, and speakers.
Networking devices for communicating with the larger world.

Python is the perfect tool for managing most of these resources, but it falls short when it comes to efficiently processing the enormous quantity of data stored locally and in the Cloud. This is where SQL comes in. SQL is designed to efficiently access and manipulate very large quantities of data.

As a data analyst, you’d use SQL to access and aggregate your organization’s data. You can then use Python for performing complicated mathematical calculations, preparing the data for presentations, circulating it to other people, and converting it into formats used by other programs. To do this efficiently, you need to integrate SQL with Python.

SQL Python Integration Examples

Let’s look at a few examples that execute SQL from Python to give you an idea of what you can do by making the two tools work together.

How Do You Connect Python to SQL?

To connect SQL to any programming language, you’ll need something called an ODBC driver. ODBC stands for Open Database Connectivity. Vendors of RDBMS (relational database management system) software make drivers available for their specific databases, and these are revised from time to time.

There are many different types of databases, each with its own set of drivers. In the examples below, I’ll be using MS SQL Server 2022 Express and MySQL Version 8.0.

If you’re using a different database, or even a different version of MySQL or SQL Server, you may need some help from your systems administrator to figure out which ODBC driver you should use. For example, Microsoft’s SQL server originally used a driver named “SQL Server Native Client”, but they’ve recently changed this to “ODBC Driver 17 for SQL Server”. You’ll also need to know which version of the driver you have.

In Windows, you can see what drivers are installed on your machine by typing “ODBC” in the search box on your taskbar. You should see a menu including these two choices, since there are two types of drivers: 32-bit and 64-bit.

Integrating SQL with Python for Data Analysis

Either of these choices show you a screen that looks like this:

In the Drivers tab, you’ll see a list of drivers installed. You should be able to identify which one best fits the type of database you’re using. If not, speak to your systems administrator. You may need to check both the 32- and 64-bit options to find what you’re looking for.

You’ll also need to import a Python SQL library, which allows you to work with SQL. In my examples, I’m using the pyodbc library for MS SQL Server. This library will work with almost any database. For MySQL, I’m using the mysql.connector library. It is a simpler connection, but it only works with MySQL.

If these libraries haven’t already been installed in your system, you can install them using pip from the command line, – or, if you’re working in an IDE such as PyCharm, follow your IDE’s instructions for installing packages. The pip commands are:

pip install mysql_connector

pip install pyodbc

You’ll also need the server name for MS SQL, or the URL of the machine where your database is located for MySQL. If MySQL is running on your own machine, the URL will be localhost.

Now that you have all the pieces you need, let’s look at the Python code to connect to the database.

Python and MYSQL

For MySQL, your Python code should begin like this. Important: You need to replace the text enclosed in <> with your own settings.

import mysql.connector
connector = mysql.connector.connect(
  host="<URL of the machine where the database is installed>",
  user="<your user name>",
  password="<your password>",
  database="<your database name>"
)

Python and SQL Server

For SQL Server, your Python code should begin like this. Again, you need to replace the text enclosed in <> with your own settings. This code will work for most other types of databases if you use the correct driver name.

import pyodbc
connector = pyodbc.connect("Driver={<driver name>};"
                      "Server=<your server name>;"
                      "Database=<your database name>;"
                      "Trusted_Connection=yes;")

If you usually log into your database with a username and password, you may have to remove this entry …

"Trusted_Connection=yes;"

… and replace it with the following:

user="lt;your username>",
password="lt;your password>"

That’s the difficult bit! The rest is easy. You can now use the connector object to execute SQL code from Python. The results are stored in a cursor. A cursor is an object that allows you to access the results of your query sequentially.

If you wanted to list everything from the department table in the sample MySQL database employees, your program would look like this:

import mysql.connector

# Connect to the database
connector = mysql.connector.connect(
  host="localhost",
  user="<your user name>",
  password="<your password>",
  database="employees"
)

# Create a cursor for this connection
department_cursor = connector.cursor()

# Execute the query
department_cursor.execute("SELECT  * from departments")

# Display the results
for x in department_cursor:
  print(x)

Your results should look like this:

('d009', 'Customer Service')
('d005', 'Development')
('d002', 'Finance')
('d003', 'Human Resources')
('d001', 'Marketing')
('d004', 'Production')
('d006', 'Quality Management')
('d008', 'Research')
('d007', 'Sales')

If you wanted to print only one of the result columns, you could specify the column by its index. Indices are column numbers starting from 0. To see only the department name, which is the second column, the print command would look like this:

print(x[1])

The results would then look like this:

Customer Service
Development
Finance
Human Resources
Marketing
Production
Quality Management
Research
Sales

Now you know how to use SQL with Python, let’s look at a few useful examples.

Using the Python Pandas Library for Further Analysis

If you haven’t already used the pandas library in Python, it’s well worth learning. The pandas website has a really good tutorial to get you started. You can install pandas using the following pip command if you’re working from the command line; if you’re working in an IDE like PyCharm, install it in your IDE.

pip install pandas

pandas allows you to store data in a DataFrame, which consists of columns and rows. From the DataFrame, you can carry out all kinds of useful statistical functions.

Let’s look at an example script.

I want to use the following query from MS SQL’s sample AdventureWorksDW database to do some further analysis in pandas:

SELECT MONTH(OrderDate) AS Month,
                        SUM(SalesAmount) AS MonthSales
                        FROM FactInternetSales
                        WHERE YEAR(OrderDate) = 2013
                        GROUP BY  MONTH(OrderDate)
                        ORDER BY  MONTH(OrderDate)

I want to place the query result in a DataFrame, print it, and then print some summary statistics.

First I need to import the pyodbc and pandas libraries and create the connection to MS SQL:

import pyodbc
import pandas as pd
connector = pyodbc.connect("Driver={ODBC Driver 17 for SQL Server};"
                      "Server=JILL\\SQLExpress;"
                      "Database=AdventureWorksDW;"
                      "Trusted_Connection=yes;")

Next, since my SQL command is fairly long, I’ll store it in a variable:

sql_string = """SELECT  MONTH(OrderDate) AS Month,
                        CONVERT(int,SUM(SalesAmount/1000)) AS MonthSales
                        FROM FactInternetSales
                        WHERE YEAR(OrderDate) = 2013
                        GROUP BY  MONTH(OrderDate)
                        ORDER BY  MONTH(OrderDate)"""

Now I want to put the results of that query into a pandas DataFrame. The method read_sql_query defined in pandas will do this:

salesdata=pd.read_sql_query(sql_string,connector)

I now have a DataFrame named salesdata, which contains the results of my query. I can print that data using the following command:

print(salesdata)

This gives me the same results I would have seen if I had run the query directly against the database, with the row number on the left:

    Month  MonthSales
0       1         858
1       2         771
2       3        1050
3       4        1046
4       5        1284
5       6        1643
6       7        1371
7       8        1551
8       9        1447
9      10        1673
10     11        1780
11     12        1874

Now let’s do something useful with this information. pandas has a method named describe(), which shows summary statistics. If you wanted to see the summary for MonthSales underneath your query results, you could use this code:

print("Summary")
print(salesdata["MonthSales"].describe())

This shows the summary like this:

Summary
count      12.000000
mean     1362.333333
std       364.586524
min       771.000000
25%      1049.000000
50%      1409.000000
75%      1650.500000
max      1874.000000
Name: MonthSales, dtype: float64

As you can see, it shows the count, the mean, the standard deviation, the minimum, the quartiles, and the maximum.

The whole program looks like this:

import pyodbc
import pandas as pd

# Connect to the database
connector = pyodbc.connect("Driver={ODBC Driver 17 for SQL Server};"
                      "Server=JILL\\SQLExpress;"
                      "Database=AdventureWorksDW;"
                      "Trusted_Connection=yes;")
# Define the query
sql_string = """SELECT  MONTH(OrderDate) AS Month,
                        CONVERT(int,SUM(SalesAmount/1000)) AS MonthSales
                        FROM FactInternetSales
                        WHERE YEAR(OrderDate) = 2013
                        GROUP BY  MONTH(OrderDate)
                        ORDER BY  MONTH(OrderDate)"""

# Place the results of the query in a DataFrame
salesdata=pd.read_sql_query(sql_string,connector)

# Print the results
print(salesdata)

#Print the summary
print("Summary")
print(salesdata["MonthSales"].describe())

pandas has many useful functions: this is just a sample of how it can be used with SQL.

Creating Charts from Your SQL Database

Often, you’ll want to display the results of your analysis visually for presentations. There are many ways you can do this; this article on the top Python data visualization libraries will give you some ideas on how to prepare visual data in Python.

As a very simple demonstration, I’m going to create a plot using pandas and the data from the previous example.

pandas uses the Python library Matplotlib to create graphs and charts, so the first step is to install matplotlib if you don’t already have it.

pip install matplotlib

To change the previous program to create a simple plot, you would:

Import matplotlib.pyplot at the beginning of the program.
Place the query results in a DataFrame.
Call the plot() method of the DataFrame to create the plot in matplotlib.
Call the show() method of matplotlib.pyplot to view the graph.

The whole script would look like this:

import pyodbc
import pandas as pd
import matplotlib.pyplot as plt

# Connect to the database
connector = pyodbc.connect("Driver={ODBC Driver 17 for SQL Server};"
                      "Server=JILL\\SQLExpress;"
                      "Database=AdventureWorksDW;"
                      "Trusted_Connection=yes;")

# Define the query
sql_string = """SELECT  MONTH(OrderDate) AS Month,
                        CONVERT(int,SUM(SalesAmount/1000)) AS MonthSales
                        FROM FactInternetSales
                        WHERE YEAR(OrderDate) = 2013
                        GROUP BY  MONTH(OrderDate)
                        ORDER BY  MONTH(OrderDate)"""

# Put the results of the query into a DataFrame
salesdata=pd.read_sql_query(sql_string,connector)


# Create the plot
salesdata["MonthSales"].plot()

# Display the plot
plt.show()

This displays a graph:

To find out more about how to customize your plots, see this pandas article on creating plots, and the Matplotlib documentation.

Emailing Your Query Results Using Python

Once you’ve created the statistics you want, you’ll probably want to share them with other people, perhaps via email. You can learn about sending emails in Python in this tutorial, so I’m not going to include many explanations here.

The example program below does the following:

Selects a list of employees due for retirement from MySQL’s sample employees database.
Adds each employee’s name to a string.
Sets up an SMTP mail connection.
Adds the string containing the list of employees to the body of the email.
Sends the email.

You’ll need to find the SMTP settings for your email server. Your systems administrator should be able to help you.

Here’s the code:

import mysql.connector
import smtplib
from email.mime.text import MIMEText

# Set up the database connection
connector = mysql.connector.connect(
  host="localhost",
  user="<your user>",
  password="<your password>",
  database="employees"
)

# Define the SQL query
sql_string = """SELECT CONCAT(first_name, ' ',  last_name) FROM employees
                    WHERE YEAR(curdate()) - YEAR(birth_date) > 65"""

# Initialize the message line. It will have data added to it later
msg_line = "\n"

#Retrieve the data from the database
employee_cursor = connector.cursor()
employee_cursor.execute(sql_string)

# Add each row to the message line
for x in employee_cursor:
      msg_line=msg_line+x[0]+"\n"

# Set up the parameters for emailing
subject = "Employees due for retirement"
body = msg_line
sender = "<your email address>"
recipients = "<recipient address>"
password = "<your email password>"

# Set up the message
msg = MIMEText(body)
msg['Subject'] = subject
msg['From'] = sender
msg['To'] = ', '.join(recipients)

# Forward the email to the mail server
with smtplib.SMTP_SSL('<your email server smtp url>', 
                       <your email server port>) as smtp_server:

       smtp_server.login(sender, password)
       smtp_server.sendmail(sender, recipients, msg.as_string())

#Done!
print("Message sent!")

Learning More About Python and SQL

These examples are just a taste of what you can do with an SQL query in Python. The next step is to grow your SQL and Python knowledge.

If you work in a data-driven career, SQL Python integration is a must-have skill. Database administrators can connect Python to SQL to automatically carry out data management and ETL (extract, transform, and load data) operations. Data analysts will find plenty of ways to combine these two tools to provide meaningful and well-presented information. For anyone who wants to automate SQL queries, Python is one of the best ways to achieve this.

It’s really worth investing time in learning how to integrate SQL with Python. LearnSQL.com and our sister site LearnPython.com have a range of training programs, articles, and resources to help you become an expert.

Here are some articles that may interest you, depending on what you would like to use SQL for:

And here are some articles that may help you in your Python learning journey:

SQL vs. Python for Data Analysis: This article explains the strengths and weaknesses of each, and how they complement each other.
7 Datasets to Practice Data Analysis in Python: This list tells you where to find some good practice datasets for analyzing data with Python.
Python Data Cleaning: A How-To Guide for Beginners This comprehensive discussion explains how you can use Python to manipulate and clean your data.

If you want to improve your skills in Python and SQL, you might want to consider one or more of our online training programs. You can browse our SQL courses here, and you’ll find Python courses here. Whatever your current level of knowledge and area of interest, you’ll find courses that suit you. Here are some recommendations:

SQL Basics: This is suitable for absolute beginners. It’s packed with interactive exercises that will soon build your confidence.
SQL from A to Z: This comprehensive learning track will develop your skills from beginner to advanced
SQL for Data Analysis: Starting at a beginner level, this learning track will teach you everything you need to know about SQL as a data analyst.
Creating Database Structure: This learning track is the perfect starting point for a career in database administration.
Learn Programming with Python: This learning track is great for approaching Python and computer programming as a beginner.

I hope this article has given you an idea of just how much you can do by integrating SQL with Python. Now it’s up to you. Learn all you can and practice, practice, practice!

Tags: