Back to articles list Articles Cookbook
7 minutes read

Use SQL on a Movie Database to Decide What to Watch

We’ll demonstrate how to use SQL to parse large datasets and gain valuable insights, in this case, to help you choose what movie to watch next using an IMDb dataset.

In this article, we’ll be downloading a dataset directory from IMDb. Not sure what to watch tonight? Are you browsing Netflix endlessly? Decide what to watch using the power of SQL! We’ll be loading an existing movie IMDb dataset into SQL. We’ll analyze the data in different ways like sorting movies by their rating, by what actors star in the movie, or by other similar criteria.

As mentioned in this blog post on how to practice SQL, the best way to practice SQL is by gaining hands-on experience in solving real-world problems, which is exactly what we’ll be doing.

If you have a basic knowledge of SQL, you should be able to follow this article easily. If you have no IT experience whatsoever, consider starting with this SQL A to Z Learning Track designed for people who have no experience in IT and want to start their adventure with SQL.

Let’s get started by learning how to get the movie data into our SQL database.

Completing the SQL Movie Database Download

Let’s walk through the process of downloading our data and loading it into a database management system (DBMS), step by step. Common DBMSs include MySQL, Oracle DB, PostgreSQL, and SQL Server.

Although this article focuses on movie data, you can choose an entirely different dataset. Check out this list of free online datasets you can use and find the one you are interested in. The import of these datasets will be similar regardless of what dataset you use.

Open whatever variety of SQL you are using. For this example, I’ll be using SQL Server Management Studio, but the steps should be similar for all of the other varieties of SQL out there. Let’s get started:

  1. The dataset files can be accessed and downloaded from https://datasets.imdbws.com/. The data is refreshed daily.
  2. Download all of the listed files:
    1. basics.tsv.gz
    2. akas.tsv.gz
    3. basics.tsv.gz
    4. crew.tsv.gz
    5. episode.tsv.gz
    6. principals.tsv.gz
    7. ratings.tsv.gz
  3. Extract the downloaded zip files. The end result will be a TSV (tab-separated) file for each table.
  4. Let’s clean up this data and convert it to CSV so it’s in a more workable state:
    1. Open each file in a spreadsheet application like Google Sheets or Microsoft Excel.
    2. Find and replace all occurrences of “\N” with an empty cell.
    3. Save the file as a CSV file. This will make it easier to import into the DBMS of your choice.
  5. Open your DBMS.
  6. Create a new schema or table by right-clicking on the left pane and selecting “New Database.” I’ve named my new database “imdb.”
  7. Right-click on the database → Tasks → Import Flat File and follow the Import Wizard to create a table for each file: SQL movie database
    1. Set valid data types for each column you are importing. I recommend using nvarchar(MAX) for string columns, since you do not know how long the strings will be for each field. You can change the column datatype later if required.
    2. Allow null values for all rows. This will prevent issues with the import. SQL movie database
  8. Repeat this process for each of the files you have downloaded.

After completing these steps, your SQL movie database will be in place! You are now ready to start analyzing and querying the data.

SQL Exercises on a Movie Database

Thankfully, this dataset came with some descriptive documentation. To get an even better idea of the data, you can quickly select the top 1000 rows from each table.

Let’s start looking for our first movie. Imagine you want to watch a horror movie. How can we isolate only the horror movies? Fortunately, this task is frighteningly simple.

SELECT *
FROM title_basics
WHERE genres LIKE '%Horror%'

If this query causes any confusion, open this SQL cheat sheet to refresh your knowledge. Have this cheat sheet open for the rest of the tutorial to help you along!

What if we wanted to refine this horror movie list further? We could restrict the results to horror movies created after 1990, with an average rating above 9.0 and at least 10,000 votes.

This will involve getting data from multiple tables. Opening each table and taking a look at the column headers, we can see the following tables will be involved:

  • title_basics: handles the genre of movie and the release year (represented by the column startYear).
  • title_ratings: handles the rating (averageRating) and votes (numVotes).

The two tables can be joined on the shared column, tconst. As explained in the IMDb documentation here, tconst is an alphanumeric unique identifier of the title. Let’s write our query:

SELECT titleType, primaryTitle, startYear, genres, averageRating, numVotes
FROM title_basics
INNER JOIN title_ratings ON title_basics.tconst = title_ratings.tconst
WHERE genres LIKE '%Horror%' AND startYear > 1989 AND averageRating > 9.0 AND numVotes > 10000
titleTypeprimaryTitlestartYeargenresaverageRatingnumVotes
videoGameResident Evil 42005Action,Adventure,Horror9.211406

Executing this query returns a single result, but not the result we want! On closer inspection, we can see that this title is a video game, not a movie. Let’s alter our query to include only movies, and expand the search by reducing the minimum number of votes required to 1,000 and the minimum rating required to 8.0.

SELECT *
FROM title_basics
INNER JOIN title_ratings ON title_basics.tconst = title_ratings.tconst
WHERE genres LIKE '%Horror%' AND startYear > 1989 AND averageRating > 8.0 AND numVotes > 1000 AND titleType = 'movie'
titleTypeprimaryTitlestartYeargenresaverageRatingnumVotes
movieManichitrathazhu1993Comedy,Horror,Music8.79468

Executing this query also yields a single result! Looks like we won’t have to decide what to watch anymore, since there’s only one option that fits our criteria!

Finding all the Movies for a Given Director

Let’s run through another scenario. What if we want to see all of the movies Steven Spielberg has directed? How would this work?

By looking through the tables, we can determine the following:

  • name_basics: It contains the names of all actors, writers, directors, and others involved in the creation of film and TV titles.
  • title_crew: It acts as a linking table for titles, directors, and writers. We’ll use this table to connect Steven Spielberg to the titles he’s involved with.
  • title_basics: We have already used this table. It contains title information like name, release date, rating, etc.

Let’s get to work! Let’s write a query for the name_basics table to try and find the famous director Steven Spielberg.

SELECT nconst, primaryName, birthYear, deathYear, primaryProfession, knownForTitles
FROM name_basics
WHERE primaryName LIKE 'steven spielberg'

Executing this query yields a single result:

nconstprimaryNamebirthYeardeathYearprimaryProfessionknownForTitles
nm0000229Steven Spielberg1946NULLproducer,writer,directortt0082971,tt0083866,tt0120815,tt0108052

This gives us the important value of nconst. From the documentation, we know that nconst is the alphanumeric unique identifier of the name/person.

We can feed this value into the title_crew table, which contains the director and writer information for all the titles in IMDb, and match Steven Spielberg to all the titles he’s involved with.

    SELECT * from title_crew where directors LIKE 'nm0000229'
		

Executing this query results in a list of 45 titles. You can see from the value of the directors column that Steven Spielberg was the director of them all.

We need a way of using this list of titles alongside the title_basics table to get the name of the movies instead of just the tconst. Let’s use a subquery for this!

SELECT titleType, primaryTitle, startYear, genres
FROM title_basics
WHERE titleType LIKE 'movie'
AND tconst IN
(SELECT tconst FROM title_crew WHERE directors LIKE 'nm0000229')

Execute this query to see the result:

titleTypeprimaryTitlestartYeargenres
movieFirelight1964Sci-Fi,Thriller
movieThe Sugarland Express1974Crime,Drama
movieJaws1975Adventure,Thriller
movieClose Encounters of the Third Kind1977Drama,Sci-Fi
movie19411979Action,Comedy,War
movieIndiana Jones and the Raiders of the Lost Ark1981Action,Adventure
movieE.T. the Extra-Terrestrial1982Family,Sci-Fi
movieIndiana Jones and the Temple of Doom1984Action,Adventure
movieThe Color Purple1985Drama
movieEmpire of the Sun1987Action,Drama,History
movieAlways1989Drama,Fantasy,Romance
movieIndiana Jones and the Last Crusade1989Action,Adventure
movieHook1991Adventure,Comedy,Family
movieJurassic Park1993Action,Adventure,Sci-Fi
movieSchindler's List1993Biography,Drama,History
movieAmistad1997Biography,Drama,History
movieThe Lost World: Jurassic Park1997Action,Adventure,Sci-Fi
movieSaving Private Ryan1998Drama,War
movieMinority Report2002Action,Crime,Mystery
movieA.I. Artificial Intelligence2001Drama,Sci-Fi
movieCatch Me If You Can2002Biography,Crime,Drama
movieThe Terminal2004Comedy,Drama,Romance
movieIndiana Jones and the Kingdom of the Crystal Skull2008Action,Adventure
movieWar of the Worlds2005Adventure,Sci-Fi,Thriller
movieMunich2005Action,Drama,History
movieLincoln2012Biography,Drama,History
movieThe Adventures of Tintin2011Action,Adventure,Animation

There we have it, all of the Steven Spielberg movie titles from our database!

Don’t stop here! Write your own custom queries to extract more insights from this large dataset. There are many ways to practice SQL. If you feel like you’ve had enough of working with this dataset, check out this post on 12 Ways to Learn SQL Online for more excellent learning resources.

Using SQL on a Large Existing Movie Database

You have learned how to import and analyze large existing datasets into the DBMS of your choice and to use SQL to analyze a movie database. This is a powerful tool in your SQL arsenal. Not to mention, you’ll never have to worry about not being able to choose a movie to watch again! Completing SQL exercises on movie databases is a helpful way to learn, but if you would like more structure, check out this SQL Practice Set from LearnSQL.com.