4th Aug 2021 7 minutes read

Use SQL on a Movie Database to Decide What to Watch

Table of Contents

Completing the SQL Movie Database Download
SQL Exercises on a Movie Database
Finding all the Movies for a Given Director
Using SQL on a Large Existing Movie Database

We’ll demonstrate how to use SQL to query large datasets and gain valuable insights. In this case, we’ll use an IMDb dataset to help you choose what movie to watch next.

Not sure what to watch tonight? Are you browsing Netflix endlessly? Decide what to watch using the power of SQL! In this article, we’ll download a set of data files from IMDb and load them into a SQL database. Then we’ll analyze the data in different ways, such as filtering movies by their rating, by the people involved in making them, or by other similar criteria.

As mentioned in this blog post on how to practice SQL, the best way to practice SQL is by gaining hands-on experience in solving real-world problems, which is exactly what we’ll be doing.

If you have a basic knowledge of SQL, you should be able to follow this article easily. If you have no IT experience whatsoever, consider starting with this SQL A to Z Learning Track designed for people who have no experience in IT and want to start their adventure with SQL.

Let’s get started by learning how to get the movie data into our SQL database.

Completing the SQL Movie Database Download

Let’s walk through the process of downloading our data and loading it into a database management system (DBMS), step by step. Common DBMSs include MySQL, Oracle DB, PostgreSQL, and SQL Server.

Although this article focuses on movie data, you can choose an entirely different dataset. Check out this list of free online datasets you can use and find the one you are interested in. The import of these datasets will be similar regardless of what dataset you use.

Open whatever variety of SQL you are using. For this example, I’ll be using SQL Server Management Studio, but the steps should be similar for all of the other varieties of SQL out there. Let’s get started:

The dataset files can be accessed and downloaded from https://datasets.imdbws.com/. The data is refreshed daily.
Download all of the listed files.
Extract the downloaded archives (IMDb distributes them as gzip-compressed .tsv.gz files). The end result will be a TSV (tab-separated values) file for each table.
Let’s clean up this data and convert it to CSV so it’s in a more workable state:
1. Open each file in a spreadsheet application like Google Sheets or Microsoft Excel.
2. Find and replace all occurrences of “\N” with an empty cell.
3. Save the file as a CSV file. This will make it easier to import into the DBMS of your choice.
Open your DBMS.
Create a new database by right-clicking “Databases” in the left pane and selecting “New Database.” I’ve named my new database “imdb.”
Right-click on the database → Tasks → Import Flat File and follow the Import Wizard to create a table for each file:
1. Set valid data types for each column you are importing. I recommend using nvarchar(MAX) for string columns, since you do not know how long the strings will be for each field. You can change the column datatype later if required.
2. Allow null values for all columns. This will prevent issues with the import.
Repeat this process for each of the files you have downloaded.
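If any “\N” markers survive the spreadsheet step (or if you skipped it), they can also be cleaned up after the import. The following is a minimal sketch, assuming the table is named title_basics, that the listed columns were imported as text columns such as nvarchar, and that the column names match the IMDb file headers. The final statement is optional and only succeeds once every remaining value in startYear is numeric.

```sql
-- Turn leftover '\N' placeholders into real NULLs.
-- Assumes these columns were imported as text (for example nvarchar).
UPDATE title_basics
SET startYear      = NULLIF(startYear, '\N'),
    endYear        = NULLIF(endYear, '\N'),
    runtimeMinutes = NULLIF(runtimeMinutes, '\N'),
    genres         = NULLIF(genres, '\N');

-- Optional: tighten the data type afterwards (SQL Server syntax).
ALTER TABLE title_basics
ALTER COLUMN startYear int NULL;
```

Storing the year as a number lets comparisons such as startYear > 1989 work without implicit conversions.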

After completing these steps, your SQL movie database will be in place! You are now ready to start analyzing and querying the data.

SQL Exercises on a Movie Database

Thankfully, this dataset came with some descriptive documentation. To get an even better idea of the data, you can quickly select the top 1000 rows from each table.
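A quick first look might be sketched as follows. TOP is SQL Server syntax (most other DBMSs use LIMIT instead), and the table name assumes the import was named title_basics, as in the rest of this article.

```sql
-- Preview the first 1000 rows of a table (SQL Server syntax).
SELECT TOP 1000 *
FROM title_basics;

-- Count how many titles of each type the table holds
-- (movies, TV episodes, video games, and so on).
SELECT titleType, COUNT(*) AS titleCount
FROM title_basics
GROUP BY titleType
ORDER BY titleCount DESC;
```

The second query is worth running early, since the dataset covers much more than feature films.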

Let’s start looking for our first movie. Imagine you want to watch a horror movie. How can we isolate only the horror movies? Fortunately, this task is frighteningly simple.

SELECT *
FROM title_basics
WHERE genres LIKE '%Horror%'

If this query causes any confusion, open this SQL cheat sheet to refresh your knowledge. Have this cheat sheet open for the rest of the tutorial to help you along!

What if we wanted to refine this horror movie list further? We could restrict the results to horror titles released in 1990 or later, with an average rating above 9.0 and more than 10,000 votes.

This will involve getting data from multiple tables. Opening each table and taking a look at the column headers, we can see the following tables will be involved:

title_basics: stores each title’s genres and release year (represented by the column startYear).
title_ratings: stores each title’s rating (averageRating) and number of votes (numVotes).

The two tables can be joined on the shared column, tconst. As explained in the IMDb documentation here, tconst is an alphanumeric unique identifier of the title. Let’s write our query:

SELECT titleType, primaryTitle, startYear, genres, averageRating, numVotes
FROM title_basics
INNER JOIN title_ratings ON title_basics.tconst = title_ratings.tconst
WHERE genres LIKE '%Horror%' AND startYear > 1989 AND averageRating > 9.0 AND numVotes > 10000

titleType	primaryTitle	startYear	genres	averageRating	numVotes
videoGame	Resident Evil 4	2005	Action,Adventure,Horror	9.2	11406

Executing this query returns a single result, but not the result we want! On closer inspection, we can see that this title is a video game, not a movie. Let’s alter our query to include only movies, and expand the search by reducing the minimum number of votes required to 1,000 and the minimum rating required to 8.0.

SELECT titleType, primaryTitle, startYear, genres, averageRating, numVotes
FROM title_basics
INNER JOIN title_ratings ON title_basics.tconst = title_ratings.tconst
WHERE genres LIKE '%Horror%' AND startYear > 1989 AND averageRating > 8.0 AND numVotes > 1000 AND titleType = 'movie'

titleType	primaryTitle	startYear	genres	averageRating	numVotes
movie	Manichitrathazhu	1993	Comedy,Horror,Music	8.7	9468

Executing this query also yields a single result! Looks like we won’t have to decide what to watch anymore, since there’s only one option that fits our criteria!
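If a single result feels too restrictive, another option is to drop the rating threshold and rank the candidates instead. The query below is a sketch under a few assumptions: averageRating and numVotes were imported as numeric columns (as text they would sort alphabetically), and the cutoff of 10 rows is an arbitrary choice for illustration.

```sql
-- Rank horror movies from 1990 onward by rating instead of
-- filtering on a fixed rating threshold (SQL Server syntax).
SELECT TOP 10
    tb.primaryTitle,
    tb.startYear,
    tb.genres,
    tr.averageRating,
    tr.numVotes
FROM title_basics AS tb
INNER JOIN title_ratings AS tr ON tb.tconst = tr.tconst
WHERE tb.genres LIKE '%Horror%'
  AND tb.titleType = 'movie'
  AND tb.startYear > 1989
  AND tr.numVotes > 1000
-- Highest rated first; ties are broken by the number of votes.
ORDER BY tr.averageRating DESC, tr.numVotes DESC;
```

Keeping the numVotes filter stops titles with only a handful of votes from crowding the top of the list.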

Finding all the Movies for a Given Director

Let’s run through another scenario. What if we want to see all of the movies Steven Spielberg has directed? How would this work?

By looking through the tables, we can determine the following:

name_basics: It contains the names of all actors, writers, directors, and others involved in the creation of film and TV titles.
title_crew: It acts as a linking table for titles, directors, and writers. We’ll use this table to connect Steven Spielberg to the titles he’s involved with.
title_basics: We have already used this table. It contains title information like name, type, release year, genres, etc.

Let’s get to work! Let’s write a query for the name_basics table to try and find the famous director Steven Spielberg.

SELECT nconst, primaryName, birthYear, deathYear, primaryProfession, knownForTitles
FROM name_basics
WHERE primaryName LIKE 'steven spielberg'

Executing this query yields a single result:

nconst	primaryName	birthYear	deathYear	primaryProfession	knownForTitles
nm0000229	Steven Spielberg	1946	NULL	producer,writer,director	tt0082971,tt0083866,tt0120815,tt0108052

This gives us the important value of nconst. From the documentation, we know that nconst is the alphanumeric unique identifier of the name/person.

We can feed this value into the title_crew table, which contains the director and writer information for all the titles in IMDb, and match Steven Spielberg to all the titles he’s involved with.

SELECT *
FROM title_crew
WHERE directors LIKE 'nm0000229'

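Executing this query results in a list of 45 titles. You can see from the value of the directors column that Steven Spielberg was the director of them all. Because the pattern contains no wildcards, it matches only titles where he is the sole credited director.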
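The directors column holds a comma-separated list of nconst values, so a pattern without wildcards only finds titles with exactly one director. To also pick up co-directed titles, one possible approach is sketched below. It assumes SQL Server, where + concatenates strings (other DBMSs typically use CONCAT or ||).

```sql
-- Wrap the list in commas so that only whole ids match,
-- wherever the id sits in the comma-separated directors list.
SELECT tconst, directors
FROM title_crew
WHERE ',' + directors + ',' LIKE '%,nm0000229,%';
```

Matching on the surrounding commas avoids accidental hits on a longer id that merely contains the same characters. Expect this version to return more rows than the exact match.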

We need a way of using this list of titles alongside the title_basics table to get the names of the movies instead of just the tconst values. Let’s use a subquery for this!

SELECT titleType, primaryTitle, startYear, genres
FROM title_basics
WHERE titleType LIKE 'movie'
AND tconst IN
(SELECT tconst FROM title_crew WHERE directors LIKE 'nm0000229')

Execute this query to see the result:

titleType	primaryTitle	startYear	genres
movie	Firelight	1964	Sci-Fi,Thriller
movie	The Sugarland Express	1974	Crime,Drama
movie	Jaws	1975	Adventure,Thriller
movie	Close Encounters of the Third Kind	1977	Drama,Sci-Fi
movie	1941	1979	Action,Comedy,War
movie	Indiana Jones and the Raiders of the Lost Ark	1981	Action,Adventure
movie	E.T. the Extra-Terrestrial	1982	Family,Sci-Fi
movie	Indiana Jones and the Temple of Doom	1984	Action,Adventure
movie	The Color Purple	1985	Drama
movie	Empire of the Sun	1987	Action,Drama,History
movie	Always	1989	Drama,Fantasy,Romance
movie	Indiana Jones and the Last Crusade	1989	Action,Adventure
movie	Hook	1991	Adventure,Comedy,Family
movie	Jurassic Park	1993	Action,Adventure,Sci-Fi
movie	Schindler's List	1993	Biography,Drama,History
movie	Amistad	1997	Biography,Drama,History
movie	The Lost World: Jurassic Park	1997	Action,Adventure,Sci-Fi
movie	Saving Private Ryan	1998	Drama,War
movie	Minority Report	2002	Action,Crime,Mystery
movie	A.I. Artificial Intelligence	2001	Drama,Sci-Fi
movie	Catch Me If You Can	2002	Biography,Crime,Drama
movie	The Terminal	2004	Comedy,Drama,Romance
movie	Indiana Jones and the Kingdom of the Crystal Skull	2008	Action,Adventure
movie	War of the Worlds	2005	Adventure,Sci-Fi,Thriller
movie	Munich	2005	Action,Drama,History
movie	Lincoln	2012	Biography,Drama,History
movie	The Adventures of Tintin	2011	Action,Adventure,Animation

There we have it, all of the Steven Spielberg movie titles from our database!
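The same list can also be produced with joins instead of a subquery, which makes it easy to pull in the rating as well. This is a sketch rather than the only way to do it. It looks the director up by name, so it assumes the name is unique in name_basics (as it was in our earlier query), and it keeps the exact match on directors, so co-directed titles are not included.

```sql
-- Join from the person to their titles, then add the ratings.
SELECT
    tb.primaryTitle,
    tb.startYear,
    tb.genres,
    tr.averageRating
FROM name_basics AS nb
INNER JOIN title_crew AS tc ON tc.directors = nb.nconst
INNER JOIN title_basics AS tb ON tb.tconst = tc.tconst
-- LEFT JOIN keeps movies that have no rating yet.
LEFT JOIN title_ratings AS tr ON tr.tconst = tb.tconst
WHERE nb.primaryName = 'Steven Spielberg'
  AND tb.titleType = 'movie'
ORDER BY tb.startYear;
```

Adding ORDER BY returns the movies in chronological order, which a query without it does not guarantee.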

Don’t stop here! Write your own custom queries to extract more insights from this large dataset. There are many ways to practice SQL. If you feel like you’ve had enough of working with this dataset, check out this post on 12 Ways to Learn SQL Online for more excellent learning resources.

Using SQL on a Large Existing Movie Database

You have learned how to import a large existing dataset into the DBMS of your choice and how to use SQL to analyze a movie database. This is a powerful tool in your SQL arsenal. Not to mention, you’ll never have to worry about not being able to choose a movie to watch again! Completing SQL exercises on movie databases is a helpful way to learn, but if you would like more structure, check out this SQL Practice Set from LearnSQL.com.
