15th Oct 2021 8 minutes read

How to Join Only the First Row in SQL

sql
JOIN

Table of Contents

The Problem
4 Ways to Join Only the Top Row in SQL
Let’s Practice SQL JOINs!

In your projects, you may have many orders corresponding to one customer or many temperature observations corresponding to the same location. Often, you only need to join the first row (the most recent order or the most recently observed temperature) to the corresponding record in another table. In this article, I’ll go through several ways to do this in SQL.

The best way to practice basic and advanced SQL is our interactive SQL Practice Set course. It contains 88 hands-on exercises to help you refresh your SQL skills, starting with the basics and going to challenging problems.

The Problem

There are many different scenarios where you have a one-to-many relationship between two tables and you need to join only the first match from one table to the corresponding record in another. For example, you may be looking for:

The most expensive item in each order.
The most recently observed temperature for each location.
The most experienced employee in each department.
The most recent order for each customer.

In all these cases, you can sort the table that holds the many corresponding records (e.g., by item price, observation date, etc.) and thereby turn your problem into selecting the first, or the top, row.

To demonstrate several possible solutions to this problem, we use the following tables that list the customers and their respective orders.

customers
id	first_name	last_name	phone	email
11	Kate	White	+1 (415) 000 0000	kate111111@gmail.com
12	Rose	Parker	+1 (415) 111 1111	rose111111@gmail.com
13	William	Spencer	+1 (220) 222 2222	bill111111@gmail.com
14	John	Smith	+1 (220) 333 3333	john111111@gmail.com

orders
id	order_date	customer_id	shipped_date	order_status
101	2021-10-01	14	2021-10-02	Completed
102	2021-10-01	11	2021-10-02	Completed
103	2021-10-02	12	2021-10-03	Completed
104	2021-10-02	11	2021-10-03	Completed
105	2021-10-02	13	NULL	Canceled
106	2021-10-03	13	2021-10-05	Completed
107	2021-10-04	12	2021-10-05	Completed
108	2021-10-04	14	2021-10-06	Completed
109	2021-10-04	13	2021-10-06	Completed
110	2021-10-04	11	2021-10-06	Completed
111	2021-10-05	11	NULL	Awaiting shipment
112	2021-10-05	12	NULL	Awaiting payment
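If you want to follow along, the sample data can be loaded into a throwaway database. Below is a minimal sketch using Python’s built-in sqlite3 module; the column types are my assumption, since the tables above only show the data.

```python
import sqlite3

# In-memory SQLite database; column types are assumed, not given in the article.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (
    id INTEGER PRIMARY KEY,
    first_name TEXT, last_name TEXT, phone TEXT, email TEXT
);
CREATE TABLE orders (
    id INTEGER PRIMARY KEY,
    order_date TEXT,      -- ISO dates (YYYY-MM-DD) sort correctly as text
    customer_id INTEGER REFERENCES customers(id),
    shipped_date TEXT,
    order_status TEXT
);
""")
conn.executemany("INSERT INTO customers VALUES (?, ?, ?, ?, ?)", [
    (11, "Kate", "White", "+1 (415) 000 0000", "kate111111@gmail.com"),
    (12, "Rose", "Parker", "+1 (415) 111 1111", "rose111111@gmail.com"),
    (13, "William", "Spencer", "+1 (220) 222 2222", "bill111111@gmail.com"),
    (14, "John", "Smith", "+1 (220) 333 3333", "john111111@gmail.com"),
])
conn.executemany("INSERT INTO orders VALUES (?, ?, ?, ?, ?)", [
    (101, "2021-10-01", 14, "2021-10-02", "Completed"),
    (102, "2021-10-01", 11, "2021-10-02", "Completed"),
    (103, "2021-10-02", 12, "2021-10-03", "Completed"),
    (104, "2021-10-02", 11, "2021-10-03", "Completed"),
    (105, "2021-10-02", 13, None, "Canceled"),
    (106, "2021-10-03", 13, "2021-10-05", "Completed"),
    (107, "2021-10-04", 12, "2021-10-05", "Completed"),
    (108, "2021-10-04", 14, "2021-10-06", "Completed"),
    (109, "2021-10-04", 13, "2021-10-06", "Completed"),
    (110, "2021-10-04", 11, "2021-10-06", "Completed"),
    (111, "2021-10-05", 11, None, "Awaiting shipment"),
    (112, "2021-10-05", 12, None, "Awaiting payment"),
])

# Count the orders per customer to confirm the one-to-many relationship.
orders_per_customer = conn.execute("""
    SELECT customer_id, COUNT(*)
    FROM orders
    GROUP BY customer_id
    ORDER BY customer_id
""").fetchall()
print(orders_per_customer)  # [(11, 4), (12, 3), (13, 3), (14, 2)]
```
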

As you can see, every customer has several orders at our store. Let’s say that for each customer, we want to know the date and the status of their most recent order. Here’s the output we are looking for:

id	first_name	last_name	order_date	order_status
11	Kate	White	2021-10-05	Awaiting shipment
12	Rose	Parker	2021-10-05	Awaiting payment
13	William	Spencer	2021-10-04	Completed
14	John	Smith	2021-10-04	Completed

The table lists the most recent order for each customer. No duplicates – each customer is mentioned only once, with the corresponding order that is the most recent according to the order date.

Now let’s go through several possible ways to get this output from our initial tables.

4 Ways to Join Only the Top Row in SQL

I’ll present four possible solutions to joining only the first row in SQL. Some of these solutions can be used with any database, while others work only with specific databases (e.g., PostgreSQL or MS SQL Server).

Solution 1

If we know that the orders in our table are numbered sequentially, with a greater value of ID indicating a more recent order, we can use this column to define the latest record for each customer. Our step-by-step solution is the following:

Define the greatest order ID for each customer.
Assuming these IDs correspond to the most recent order for each customer, create a table that lists only the most recent orders.
Join the customers table with this table of the most recent orders.

This solution can be implemented using common table expressions (CTEs).

WITH last_orders AS (
    SELECT *
    FROM orders
    WHERE id IN (
        SELECT MAX(id)
        FROM orders
        GROUP BY customer_id
    )
)
SELECT customers.id, customers.first_name, customers.last_name,
    last_orders.order_date, last_orders.order_status
FROM customers
JOIN last_orders
    ON customers.id = last_orders.customer_id
ORDER BY customer_id;

Alternatively, you can do the same using nested subqueries:

SELECT customers.id, customers.first_name, customers.last_name,
    last_orders.order_date, last_orders.order_status
FROM customers
JOIN (
    SELECT *
    FROM orders
    WHERE id IN (
        SELECT MAX(id)
        FROM orders
        GROUP BY customer_id
    )
) AS last_orders
    ON customers.id = last_orders.customer_id
ORDER BY customer_id;

In the queries above, we use one SELECT statement, or subquery, to find order IDs that correspond to the most recent order for each customer. We have another subquery to list these orders, and yet another query to join the table with the most recent orders with the table with customer information.

I prefer to use CTEs in cases like these because, in my opinion, they have better structure and readability. If you want to learn more about CTEs or WITH clauses, check out this introductory article and this interactive Recursive Queries course that covers all kinds of CTEs.

This solution gets us the output we need, but it relies on order IDs being assigned sequentially, in the same order in which the orders were created. This may not always be the case. So, let’s move to the next solution that gives us more control over the output.
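To see how this assumption can break, here is a minimal sketch in Python with the built-in sqlite3 module. The two rows are hypothetical and not part of the sample tables: I assume an order dated 2021-10-01 was entered into the system late, so it received a higher ID than an order dated 2021-10-04.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, order_date TEXT, customer_id INTEGER)"
)
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", [
    (101, "2021-10-04", 14),
    (102, "2021-10-01", 14),  # hypothetical backdated order, entered later
])

# Solution 1: treat the greatest ID as the most recent order.
by_max_id = conn.execute("""
    SELECT order_date
    FROM orders
    WHERE id IN (SELECT MAX(id) FROM orders GROUP BY customer_id)
""").fetchone()[0]

# The order date that is actually the most recent for this customer.
by_max_date = conn.execute(
    "SELECT MAX(order_date) FROM orders WHERE customer_id = 14"
).fetchone()[0]

print(by_max_id, by_max_date)  # 2021-10-01 2021-10-04
```

With data like this, the MAX(id) approach returns the older order, which is why sorting by the order date itself is the safer choice.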

Solution 2

If we cannot rely on the order ID to define the most recent order, we can add a column that does the job. Specifically, we can use a window function to number the rows of our orders table based on the order date, separately for each customer.

Note that in our example, we use the order date without the exact order time for simplicity. This works in our case because no customer places more than one order on the same day. If your customers can place several orders on the same day, you need the full timestamp to sort the orders reliably.
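The sketch below illustrates the same-day case in Python with the built-in sqlite3 module. It is only an illustration under assumptions: the ordered_at column and both rows are hypothetical, and window functions require SQLite 3.25 or later.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        ordered_at TEXT,       -- hypothetical full timestamp column
        customer_id INTEGER,
        order_status TEXT
    )
""")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)", [
    (201, "2021-10-05 17:40:00", 11, "Awaiting shipment"),
    (202, "2021-10-05 09:15:00", 11, "Completed"),
])

# Both orders share the same date, so only the time of day
# decides which row gets number 1.
latest = conn.execute("""
    SELECT id, order_status
    FROM (
        SELECT
            *,
            ROW_NUMBER() OVER (
                PARTITION BY customer_id
                ORDER BY ordered_at DESC
            ) AS rn
        FROM orders
    ) AS numbered_orders
    WHERE rn = 1
""").fetchall()

print(latest)  # [(201, 'Awaiting shipment')]
```

If only the date were stored, both rows would tie and the database would be free to number either one first.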

Our strategy in this solution is the following:

Number the rows in the orders table so that the most recent order for each customer gets number 1.
Select only the most recent order for each customer by filtering the records with row numbers equal to 1.
Join the customers table with the table containing only the most recent orders.

Again, we can implement the above strategy using CTEs:

WITH numbered_orders AS (
    SELECT
        *,
        -- The alias is rn rather than row_number because
        -- ROW_NUMBER is a reserved word in MySQL 8.0.
        ROW_NUMBER() OVER (
            PARTITION BY customer_id
            ORDER BY order_date DESC
        ) AS rn
    FROM orders
),
last_orders AS (
    SELECT *
    FROM numbered_orders
    WHERE numbered_orders.rn = 1
)
SELECT customers.id, customers.first_name, customers.last_name,
        last_orders.order_date, last_orders.order_status
FROM customers
JOIN last_orders
    ON customers.id = last_orders.customer_id
ORDER BY customer_id;

or using nested subqueries:

SELECT customers.id, customers.first_name, customers.last_name,
        last_orders.order_date, last_orders.order_status
FROM customers
JOIN (
    SELECT *
    FROM (
        SELECT
            *,
            -- The alias is rn rather than row_number because
            -- ROW_NUMBER is a reserved word in MySQL 8.0.
            ROW_NUMBER() OVER (
                PARTITION BY customer_id
                ORDER BY order_date DESC
            ) AS rn
        FROM orders
    ) AS numbered_orders
    WHERE numbered_orders.rn = 1
) AS last_orders
    ON customers.id = last_orders.customer_id
ORDER BY customer_id;

In the above SQL queries:

We use the ROW_NUMBER() function to number the rows in the orders table. Note that before numbering the rows, we group them by customer ID with PARTITION BY and sort them by date in descending order to get the most recent order in the top row. We save the output of this subquery as numbered_orders.
Next, we select the orders with the row number equal to 1 and save the result of this subquery as last_orders.
Finally, we join the customers table with last_orders to get the required output.

If you are new to window functions, learn more in this beginner-friendly guide and consider taking this interactive Window Functions course. For an overview of the syntax, check out the SQL Window Functions Cheat Sheet.

Both solutions so far use standard SQL and work in most relational databases, although Solution 2 needs a version that supports window functions (for example, MySQL 8.0 or later). Now, let’s move on to database-specific solutions.

Solution 3

PostgreSQL allows the DISTINCT ON clause that can be of great value when we need to join only the first match in SQL:

WITH last_orders AS (
    SELECT DISTINCT ON (customer_id)
        *
    FROM orders
    ORDER BY customer_id, order_date DESC
)
SELECT customers.id, customers.first_name, customers.last_name,
    last_orders.order_date, last_orders.order_status
FROM customers
JOIN last_orders
    ON customers.id = last_orders.customer_id
ORDER BY customer_id;

Instead of a separate subquery to number the rows or define the most recent order using order ID, we use DISTINCT ON (customer_id) to get only the first row corresponding to each customer. Also, in our CTE, we sort the rows by order date in descending order to ensure that the first row for each customer corresponds to the most recent order of this customer.

The DISTINCT ON () clause is very convenient for cases like this, but unfortunately, it is a PostgreSQL extension rather than standard SQL, so it is not available in databases such as MySQL or MS SQL Server.

Solution 4

We can also use the clauses that SQL provides for limiting the number of rows returned by a query. This option is available in most SQL dialects, but the syntax can be different.

Several SQL dialects (e.g., SQLite, MySQL, and PostgreSQL) use the LIMIT clause to specify the number of rows to be returned. You can use this option to select only the most recent order for each customer. You’ll need to sort each customer’s orders by order date in descending order, and then limit the output to only one row:

SELECT customers.id, customers.first_name, customers.last_name,
       orders.order_date, orders.order_status
FROM customers
JOIN orders
    ON orders.id = (
        SELECT id
        FROM orders
        WHERE customer_id = customers.id
        ORDER BY order_date DESC
        LIMIT 1
    )
ORDER BY customer_id;
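As a quick check, the LIMIT query can be run against the sample data in SQLite. The sketch below uses Python’s built-in sqlite3 module and, as a simplification of mine, loads only the columns the query needs (phone, email, and shipped_date are left out).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT);
CREATE TABLE orders (
    id INTEGER PRIMARY KEY,
    order_date TEXT,
    customer_id INTEGER,
    order_status TEXT
);
""")
conn.executemany("INSERT INTO customers VALUES (?, ?, ?)", [
    (11, "Kate", "White"), (12, "Rose", "Parker"),
    (13, "William", "Spencer"), (14, "John", "Smith"),
])
conn.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)", [
    (101, "2021-10-01", 14, "Completed"),
    (102, "2021-10-01", 11, "Completed"),
    (103, "2021-10-02", 12, "Completed"),
    (104, "2021-10-02", 11, "Completed"),
    (105, "2021-10-02", 13, "Canceled"),
    (106, "2021-10-03", 13, "Completed"),
    (107, "2021-10-04", 12, "Completed"),
    (108, "2021-10-04", 14, "Completed"),
    (109, "2021-10-04", 13, "Completed"),
    (110, "2021-10-04", 11, "Completed"),
    (111, "2021-10-05", 11, "Awaiting shipment"),
    (112, "2021-10-05", 12, "Awaiting payment"),
])

# The correlated subquery runs once per customer and returns
# the ID of that customer's most recent order.
rows = conn.execute("""
    SELECT customers.id, customers.first_name, customers.last_name,
           orders.order_date, orders.order_status
    FROM customers
    JOIN orders
        ON orders.id = (
            SELECT id
            FROM orders
            WHERE customer_id = customers.id
            ORDER BY order_date DESC
            LIMIT 1
        )
    ORDER BY customer_id
""").fetchall()

for row in rows:
    print(row)
```

The four rows printed match the expected output table shown in the problem description: one row per customer, each with the latest order date and status.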

MS SQL Server doesn’t support the LIMIT clause, but it has another solution to join only the top row in SQL. You can use the TOP 1 clause in MS SQL Server to get only the most recent orders joined with the customers table:

SELECT customers.id, customers.first_name, customers.last_name,
       orders.order_date, orders.order_status
FROM customers
JOIN orders
    ON orders.id = (
        SELECT TOP 1 id
        FROM orders
        WHERE customer_id = customers.id
        ORDER BY order_date DESC
    )
ORDER BY customer_id;

As in the LIMIT version, we order the rows by date in descending order to make sure that the TOP 1 clause selects the most recent order for each customer.

Let’s Practice SQL JOINs!

I hope that these solutions have shown you how powerful and flexible SQL can be with various tasks. You can see how SQL JOINs can be used to join only the first row when there is a one-to-many relationship between two tables. There are many more use cases where SQL JOINs can help address non-trivial problems.

To review and deepen your knowledge of SQL JOINs, I recommend this interactive course that includes 93 coding challenges. It covers INNER JOIN, LEFT JOIN, RIGHT JOIN, and FULL JOIN; filtering data with different kinds of JOINs; joining a table with itself; joining tables on non-key columns; and more.

If you want to master advanced tools for data analysis with SQL, consider taking the Advanced SQL track that covers Window Functions, GROUP BY Extensions in SQL, and common table expressions (CTEs).

Thanks for reading, and happy learning!
