Back to articles list Articles Cookbook
9 minutes read

SQL Project: Personal Reddit Recap

Does your Reddit Recap seem too short and meaningless? With the power of SQL, you can extract much deeper insight from your usage data! Find out how to build this interesting SQL project below.

Making a personal SQL project is a great way to practice your database building and querying skills, and it can also be a nice addition to your portfolio.

In this article, we will cover the ways to extract the most from your Reddit data and uncover insights about your Reddit activity. We will be relying heavily on advanced SQL data analysis topics in this SQL project. If you need a refresher on how to utilize SQL for a project like this, check out our SQL for Data Analysis learning track.

Want to create a similar recap for other platforms? Check out our articles on making your own Netflix Wrapped and Spotify Wrapped with SQL.

Ready? Let’s take a look at how you can acquire your Reddit data and get started with the project.

Setting Up Your Personal Reddit SQL Project

Step 1: Get the Data

Reddit allows you to request a copy of your data, including some interesting usage statistics. You can do so on their Data Request page. You need to be logged in to view it.

Your data request will take some time to process; you will receive a message with the download link in your Reddit inbox once it is complete.

Download and unpack the archive. You will see 35 CSV files containing your personal information, including your activity information like posts, comments, subscribed subreddits, votes, and other miscellaneous data.

Most of those files do not contain information that would be useful in an SQL project, so let’s focus on those that do:

  • posts.csv contains information about posts that you have created.
  • comments.csv contains information about your comments. Note: This file is more extensive than comment_headers.csv, which only contains the basic information about the comment.
  • subscribed_subreddits.csv contains the list of subreddits that you have joined. There is only one column of data, but it is useful when you need to filter by the subreddit name.
  • votes.csv contains your vote history as links to the post and the direction of the vote (either ‘up’ or ‘down’).

Step 2: Load Data into a Database

CSV files contain a lot of information, but it would be hard to extract any insight from them using just spreadsheets. Here’s an SQL database that can simplify the process:

Personal Reddit Recap

This database schema copies the structure of the CSV files from the archive. With a database, we will be able to retrieve and operate on data much faster! So, let’s build one using PostgreSQL.

Create a Database

Here is the script for creating all the tables shown above:

CREATE TABLE comments (
	id text  NOT NULL,
	permalink text  NOT NULL,
	date timestamptz  NOT NULL,
	ip text  NOT NULL,
	subreddit text  NOT NULL,
	gildings int  NOT NULL,
	link text  NOT NULL,
	parent text  NOT NULL,
	body text  NOT NULL,
	media text  NULL,
	CONSTRAINT comments_pk PRIMARY KEY (id)
);
CREATE TABLE posts (
	id text  NOT NULL,
	permalink text  NOT NULL,
	date timestamptz  NOT NULL,
	ip text  NOT NULL,
	subreddit text  NOT NULL,
	gildings int  NOT NULL,
	title text  NOT NULL,
	url text  NULL,
	body text  NULL,
	CONSTRAINT posts_pk PRIMARY KEY (id)
);
CREATE TABLE subreddits (
	subreddit text  NOT NULL,
	CONSTRAINT subreddits_pk PRIMARY KEY (subreddit)
);
CREATE TABLE votes (
	id text  NOT NULL,
	permalink text  NOT NULL,
	direction text  NOT NULL,
	CONSTRAINT votes_pk PRIMARY KEY (id)
);

You can run the above script in your psql console; it will create all the necessary tables for this project.

psql -U your_username -d your_database -f create-tables.sql

The tables are not connected using foreign keys. There might be some common columns between two tables (like posts and comments referencing the same post URL), but they are in no way connected.

Also, pay attention to the data type of the date columns. This schema uses timestamptz, which is a PostgreSQL alias for timestamp with time zone. Reddit stores time zones in timestamps, so to properly import the data you will have to use this specific data type.

Load the Data

The next step is importing the data into the database from the .csv files that you have downloaded from Reddit. Use the following command in your psql console:

\copy table_name FROM '\path\to\file.csv' CSV HEADER ENCODING ‘UTF8’

Here are all the table names and their respective CSV files:

  • posts - posts.csv
  • votes - post_votes.csv
  • subreddits - subscribed_subreddits.csv
  • comments - comments.csv

Don’t forget to include the absolute path to the file before the file name and wrap it in single quotes.

Analyzing Your Reddit Activity with SQL

Time to get querying! We will go through some example queries to show the potential that the data holds. Let’s start with something simple.

Where Do I Post the Most on Reddit?

Do you know how many times you have posted in each subreddit? Here’s a simple query to help you find out:

SELECT
	subreddit,
	COUNT(*) AS num_posts
FROM posts
GROUP BY subreddit
HAVING COUNT(*) >= 5
ORDER BY num_posts DESC;
subredditnum_posts
gaming17
memes11
funny9

This query uncovers your posting habits by listing the subreddits where you've shared at least five posts. For me, the results are pretty telling:

  • Gaming is my clear obsession with 17 posts.
  • Memes take a solid second with 11 (because scrolling wasn’t enough, I had to contribute).
  • Funny comes in third with 9 – apparently, I think I’m hilarious.

Talking to Myself: Comments on Your Own Posts

Have you ever done an AMA (ask me anything)? It might be interesting to get a full list of all the answers that you have given. Let’s find all the comments that you have left under your own posts!

SELECT
	comments.permalink,
	comments.body,
	media
FROM comments
JOIN posts
	ON comments.link = posts.permalink;

The comments table includes two columns with reddit URLs, link and permalink. The permalink column is the link to the comment itself, while link is the link to the post under which the comment was posted (i.e. the parent post).

We can use a JOIN to add information on the parent post to each comment, since each comment has a post link which can be used to identify the parent post.

Remember, the posts table only contains your own posts, so comments with a parent link not found in the posts table will be discarded by the JOIN. This effectively filters out the comments left on other people’s posts. You can read more about how JOINs work in our article on Joining Two Tables in SQL.

Take a look at the output of the query:

permalink

body

media

https://old.reddit.com/r/pics/comments/haucpf/ive_found_a_few_funny_memories_during_lockdown/fv6ejit/

Thanks all for the love, comments, DMs etc! And finally, u/theMalleableDuck I salute you! Rick x

null

https://old.reddit.com/r/pics/comments/haucpf/ive_found_a_few_funny_memories_during_lockdown/fv7vibi/

250k! Wtf ???? Edit: 300k!!

null

Here permalink is the link to the comment thread under the post. You can open the link from the result table and see how it looks for yourself. The body column is the main text of the comment and media is a URL to any attachment that may be included with the comment. A comment can be text-only, media-only, or both text and media.

The result is a full list of comments that you have left on your own posts.

Voting Activity by Subreddit

Do you know how you engage with the subreddits you have joined? This query finds the amount of upvotes and downvotes that you have given in each community:

SELECT
	subreddit,
	COUNT(CASE WHEN direction = 'up' THEN 1 ELSE 0 end) as upvoted,
	COUNT(CASE WHEN direction = 'down' THEN 1 ELSE 0 end) as downvoted
FROM votes
JOIN subreddits
	ON votes.permalink LIKE '%' || subreddits.subreddit || '%'
GROUP BY subreddit
ORDER BY upvoted DESC, downvoted ASC;

Since the subreddit name is not stored directly in the votes table, we can use a modified JOIN condition to check if the permalink of the vote contains a specific subreddit name.

The ORDER BY clause makes sure that subreddits with more positive engagements are displayed first.

subredditupvoteddownvoted
funny138210
memes122519
gaming847101

The final result reveals where you’ve been casting your votes on Reddit. It lists subreddits along with the total number of upvotes and downvotes you’ve given, showing which communities you support the most – and where you might have stirred up a little controversy.

The final result shows where I’ve been handing out votes on Reddit. Funny leads the pack with 1,382 upvotes and barely any downvotes, while memes follows closely with 1,225 upvotes. Gaming, on the other hand, got 847 upvotes and 101 downvotes, meaning I might have been a little more critical there.

Joined vs. Not Joined: Where Do You Actually Post?

Now let’s try a more complicated query. How often do you post in subreddits that you are a part of compared to subreddits you haven’t joined? Let’s find out.

WITH joined_status AS (
	SELECT
    	  subredit,
    	  CASE 
          WHEN EXISTS (
             SELECT 1 FROM subreddits s 
 WHERE s.subreddit = p.subreddit)
    	  THEN 'Joined' ELSE 'Not Joined' END AS joined
	FROM posts
)
SELECT
	joined,
	COUNT(*) AS num_posts
FROM joined_status
GROUP BY joined
ORDER BY num_posts DESC;

This query has to be built in two steps. First we classify all posts as either “Joined” or “Not Joined”, then we group the result to get the number of posts in each category.

The first step creates new information that we will be using in the grouping later, so it is a good practice to wrap this part in a CTE. We use CASE WHEN to determine the value of the new joined column; if the post’s subreddit (p.subreddit) is one that you are subscribed to (s.subreddit) then the post is classified as “Joined”. Otherwise, the post is “Not Joined”.

The second step happens in the outer query. We can group by the new joined column to find the number of each type of post. Here’s my result:

joinednum_posts
Joined41
Not Joined3

Your result will probably look similar to this in that most of the posts you’ve created are in subreddits that you’ve joined. If you do not see a result table with the rows like the one above (i.e. you only get a single result), that means you have only created posts of one type.

Will Reddit Analysis Be Your Next SQL Project?

Nicely done! We have acquired a much better understanding of the data than what the basic Reddit Recap provides – and practiced SQL while doing it! You can expand this project by adding queries about the number of posts that you made in a given period, posts from different devices, and gildings that you’ve received. You can also dive into other files that you’ve downloaded from Reddit, like friends, chat_history, comment_votes, and more!

 

 

If you’re looking for more hands-on SQL practice, check out our SQL Practice Track. With over 1,000 interactive exercises, it’s a great way to strengthen your SQL skills. Happy practicing!