25th Dec 2020 10 minutes read

What Is a Database Index?

data engineering

Table of Contents

What Is a Database Index?
How Does an Index Work?
How to Add an Index to a Table
How An Index Works Internally
Who Chooses Which Index to Use?
Multi-Column Key Indexes
The Cost of Having Indexes
What’s Next with Indexes?

Database indexes are critical to query speed and efficiency. Learn when, why, and how to use them – and when not to – in this post.

Indexes are very important database objects; they optimize data access and improve database performance by helping the database execute SQL queries faster. Why not take this opportunity to learn how indexes work and how to create them?

What Is a Database Index?

Indexes are data structures that can increase a database’s efficiency in accessing tables. Indexes are not required; the database can function properly without them, but query response time can be slower.

Every index is associated with a table and has a key, which is formed by one or more table columns. When a query needs to access a table that has an index, the database can decide to use the index to retrieve records faster.

Question: What is a SQL Index?

A SQL index is a database structure that speeds up data retrieval operations by providing quick access to table rows. It works like a book's index, allowing the database to find data without scanning the entire table.

How Does an Index Work?

If you’ve ever used a book index, you’ll understand immediately how a database index works. In a book index, we first search the index for the topic we want. We get the page number where that topic is found in the book, and we go to this page. It’s very simple.

When the database searches in an index, the first step is to find the key value in the index. Every key value is stored with a pointer to the record in the table associated with this key value. Then, when the key value is found, the database follows that pointer and directly reads the record(s) from the table.

Suppose we table with 40 million records and a schema like the following:

ssn	first_name	last_name	zip_code
5436782	John	Smith	ZZ123459
5123456	Igor	Chernosky	ZZ123459
4325834	Mary	Connery	AF432112
4748122	Sue	Leclerc	AW543112

We run this query:

SELECT ssn
FROM persons 
WHERE last_name = ‘Chernosky’

To get the results, the database must do a full sequential scan of the table persons, which means that the database must read every record on the table and verify if the WHERE condition last_name = ‘Chernosky’ for this record is true or false. As this table has 40 million rows, the database is carrying out more than a few operations. On my local machine, this query takes 46 seconds!

How to Add an Index to a Table

At this point, we have a really slow query – it takes 46 seconds to return just one record. We can try to add an index to the table persons using the column last_name as the index key. The idea is to improve query performance by avoiding a full sequential scan of the entire table.

Here’s how to create such an index:

CREATE INDEX ix1_test ON persons ( last_name )

The syntax for index creation is simple. We need a CREATE INDEX statement specifying:

The index name (here, ix1_test).
The associated table (persons).
The column to use as the index key (last_name).

After creating the index and executing the previous query a second time, the database will do an index search for the last_name = ‘Chernosky’ condition instead of a full sequential scan. Let’s see how much time it takes.

With the new index, the same query takes 15 milliseconds – 30,000 times faster! The reason behind this huge performance increase is using the index to look for last_name = ‘Chernosky’ instead of doing a full sequential scan. Remember, an index’s process is similar to the process of using a book index. Imagine how much time you’d need to find a specific topic if you decided to search for it by reading the entire book, page by page!

Before closing this section, I would like to suggest the course Understanding Indexes, which has hands-on exercises that show how indexes work. Also, the article What is a Primary Key in SQL? explains the role of primary keys in SQL databases; it’s a good read if you’re not already familiar with the concept.

How An Index Works Internally

In SQL databases, indexes are internally organized in the form of trees. Like actual trees, database indexes have many branch bifurcations, and individual records are represented (or pointed) by the leaves. A database index tree is composed of several nodes connected by pointers.

The index tree is created when the CREATE INDEX statement is executed. The database software has an algorithm that builds the index tree. The first step of the index creation algorithm is to sort the records based on the index key. Next, it builds the structure by creating and populating every node of the tree index. The creation of an index tree can take some time, especially when the table has many records.

If we use an index to search for a specific record in the table, we need to start searching from the root of the index tree. On every branch bifurcation, a decision must be made about which branch to choose. This means logically evaluating the search condition and the range of values in each branch. As a very simple example, imagine you want to find the number 12 in a tree that has two branches. Branch X has numbers from 0-10 and Branch Y has numbers from 11-20. As 12 is greater than 11, you’d search for the number in Branch Y.

One special kind of index tree is the B-tree. B-trees are Balanced trees, meaning that the quantity of nodes from the root to any leaf node in the B-tree is the same. We will show a graphical representation of a B-tree index. First, let’s look at the table we’ll associate with the index. It’s called part, and the index will be placed on the part_id column:

part_id	name	price	manufacturer
1	Touch screen m81	$90	Cbios Inc
2	Processor quad core I4	$92	Xcreators Inc
5	Keyboard 101K	$13	Xcreators Inc
7	Ram Memory 265Mb	$145	Cbios Inc
8	Ram Memory 512Mb	$211	Cbios Inc
11	Processor octo core I8	$178	Xcreators Inc

Next, let’s see the SQL that will create the index ix_part_id:

   CREATE INDEX ix_part_id ON part ( part_id )

At the time of CREATE INDEX execution, the complete index data structure is created and populated. In the following image, we can see the graphical representation of the B-tree index ix_part_id and the table part:

The B-tree index is represented as an inverted tree, where the root is on top and the leaves are on the bottom. Notice that the red boxes representing leaf nodes have a pointer that indicates one record in the table. Every leaf node is a pair consisting of a key value and a pointer to the record. The intermediate and root-level nodes can store four key values (although some of them are not in use). When new records are inserted into the table, these empty places will be used to store new key values.

This index can be used for searching records. Suppose the database has to execute a query to search for a specific part_id or a range of part_ids, e.g. WHERE part_id = 2.

To find the record with part_id = 2, we start searching in the root of the index, and comparing 2 with the root index value 7. If the result of the comparison is “<”, we take the pointer on the left side. If the result of the comparison is “>=” we take the pointer on the right side. As 2 is less than 7, we take the left pointer to the next node and repeat the same process. When the node has more than one value, we’ll compare until we find a range where the searched value belongs and take the pointer for that range. In our example, the range for 2 is (2,5]; we take the pointer to the leaf node and see that the leaf node value is 2, so we take the pointer to that record in the table.

In the image below, the green lines show how we traverse the B-tree to search part_id = 2:

An index is an ordered data structure where the index keys are ordered by some criteria; thus, we can take advantage of that and use indexes to avoid sorting records. Then if we have a query with an ORDER BY or a GROUP BY using the same key as an existing index, the database can use the index to avoid sorting the records – a performance improvement!

Who Chooses Which Index to Use?

We can have more than one index per table as long as each index uses a different column as its key. Depending on the query, perhaps more than one index can be used to improve its performance.

So, who decides the best index to use for a given SQL query ? This is selected by an internal database module called the “query planner”. The following example has a query on a table with two indexes. We will see how the database chooses the index that will execute the query faster.

First, let’s add a new index to the table persons:

CREATE INDEX ix3_test ON persons ( zip_code )

Here’s the query we will execute:

SELECT ssn
FROM   persons 
WHERE  last_name = ‘Smith’
  AND  zip_code = ‘12345’

In this query, the database has two indexes to choose from:

ix1_test based on the key last_name
ix3_test based on the key zip_code

The execution of the query takes 22 milliseconds, as we can see below:

We know the query has used an index, but we don’t know which index was chosen. SQL databases have an EXPLAIN statement that returns the query planner’s execution plan. The output includes several technical details which are outside the scope of this article, but it is simple to identify which index has been chosen. Let’s execute the same query, this time adding the EXPLAIN before the SELECT.

We can see the index chosen was idx1_test.

Multi-Column Key Indexes

Index keys can be formed for one column or several columns, much like primary keys. The syntax for a multi-column key index is pretty simple:

   CREATE INDEX ix4_test ON person ( first_name, last_name )

We can put any combination of columns in the key of a multi-column index; however, best practices suggest putting columns that are frequently used in queries’ WHERE clauses. For example, the previous index would be very helpful on the following query:

SELECT social_security_number
FROM   person 
WHERE  first_name = ‘Kevin’
  AND  last_name = ‘Smith’

There are other WHERE conditions that can take advantage of the previous multi-column index, e.g. a query with WHERE first_name = ‘Kevin’ can also use the index, even when last_name is not in the WHERE. Another interesting facet of this topic is redundant indexes; however, due to its length and complexity, we can’t cover it in this article.

For a complete review of SQL tables and the many different options available, see How to Create a Table in SQL.

The Cost of Having Indexes

Indexes – like most things in life – are not free. Every time a table with an index is modified by an INSERT, UPDATE, or DELETE, all the indexes associated with the table have to be refreshed. These changes take time, and the performance of a process that massively modifies tables can be affected when several indexes exist in the table.

Before creating an index, we need to evaluate the cost-benefit of having it. Do we expect noticeable performance improvements? Is this table statistical or heavily modified? Can we accept a degradation in performance when we modify the table? After doing this analysis, we can decide if the index needs to be created.

Another option is to create and drop indexes based on different needs; in fact, some databases do that automatically.

What’s Next with Indexes?

We’ve covered the basics of database indexes:

The syntax for creating an index.
When an index is used by a query.
How to know if an index is being used (the EXPLAIN statement).
When to put an index on a table.

If you want to really get familiar with indexes, I suggest the course Creating Database Structure. It will teach you the secrets of creating and managing the structure of a relational database. Finally, I would also like to suggest the e-book SQL Performance Explained by Markus Winand, where you can find a lot of technical details about SQL performance.

Tags:

data engineering

What Is a Database Index?

How Does an Index Work?

How to Add an Index to a Table

How An Index Works Internally

Who Chooses Which Index to Use?

Multi-Column Key Indexes

The Cost of Having Indexes

What’s Next with Indexes?

You may also like