Yesterday I had an interview with the question about indexing, I didn’t answer very well so I’d write a blog to record my study and understanding for future useage.
Understanding the Need for Database Indexes: A Simple Example
Through a straightforward example, this tutorial aims to elucidate the necessity of database indexes. Let’s consider we possess a database table named Employee, encompassing three columns: Employee_Name, Employee_Age, and Employee_Address. Imagine this Employee table hosts thousands of rows of data.
Suppose we wish to retrieve information on all employees named ‘Jesus’ from this table. We opt to utilize the following query statement:
1
SELECT*FROM Employee WHERE Employee_Name ='Jesus'
What transpires if the table lacks an index?
Upon executing this query, what unfolds as the database endeavors to locate employees named Jesus? The database is compelled to sift through each row within the Employee table to ascertain whether the employee’s name (Employee_Name) matches ‘Jesus’. Since our goal is to acquire information on every employee named Jesus, halting the search after finding the first matching row is not an option, as there could be additional rows meeting the criteria. Thus, the search must continue row by row until the final one is examined - indicating the database must inspect thousands of rows to identify all employees named Jesus. This process is known as a full table scan.
How Do Database Indexes Enhance Performance?
One might ponder if conducting a full table scan for such a rudimentary task seems inefficient - shouldn’t the database be more astute? This is akin to manually scanning an entire table from start to finish - slow and far from elegant (“not at all sleek”). However, as you might infer from the title, this is precisely where indexes come into play. The essence of using indexes is to expedite search operations by reducing the number of records/rows in a table that need to be examined.
What is an Index?
An index is a data structure that stores the values of a specific column in a table (most commonly, a B-Tree). Indexes are created on the columns of a table. Hence, a key takeaway is that an index contains the values of a column in a table, and these values are stored within a data structure. Remember, an index is a data structure.
Which Data Structures Can Serve as Indexes?
B-Trees are the most frequently utilized data structures for indexes because they offer low time complexity for search, deletion, and insertion operations, which can all be performed in logarithmic time. Another critical reason is that the data stored in B-Trees is ordered. The choice of data structure for indexes is usually made by the Relational Database Management System (RDBMS), but in some cases, you can specify the data structure to use when creating an index.
How Do Hash Table Indexes Work?
Hash tables are another type of data structure you might encounter as indexes - often referred to as hash indexes. The rationale for using hash indexes is their high efficiency in locating values. Hence, for queries that involve comparing strings for equality, such as the previously discussed query (SELECT * FROM Employee WHERE Employee_Name = ‘Jesus’), a hash index on the Employee_Name column can significantly speed up value retrieval. The operation of hash indexes involves using the column’s values as index keys, with the keys’ corresponding actual values being pointers to the relevant rows in the table. Since hash tables can essentially be viewed as associative arrays, a typical data item might resemble “Jesus => 0x28939”, where 0x28939 is a reference to the memory location of the row containing Jesus. Querying a value like “Jesus” in a hash index and obtaining a reference to the corresponding row in memory is evidently much faster than scanning the entire table to find rows with the value “Jesus”.
Drawbacks of Hash Indexes
Hash tables are unordered data structures, rendering them ineffective for many types of query statements. For example, if you sought to find all employees under 40 years of age, how would you use a hash index for the query? It’s unfeasible because hash tables are suited for key-value pair queries - i.e., queries for exact matches (such as “WHERE name = ‘Jesus’”). The key-value mapping of hash tables also implies that their storage of keys is unordered. This is why hash indexes are typically not the default data structure for database indexes - they lack the flexibility offered by B-Trees.
Are There Other Types of Indexes?
Indexes using R-Trees as their data structure are typically utilized for spatial queries. For instance, a query like “Find all Starbucks within 2 kilometers of my location” would benefit from a database table indexed with an R-Tree, enhancing query efficiency. Another type of index is the bitmap index, which is well-suited for columns containing boolean values (true and false), especially when many instances of these values (representing true or false) are present - essentially, columns with low selectivity.
How Do Indexes Improve Performance?
Since indexes essentially serve to store the values of a column in a data structure, they facilitate quicker retrieval of these column values. If the index uses the most common data structure, a B-Tree, then the data within it is ordered. Having ordered column values can significantly boost performance. Here’s why:
Assume we create a B-Tree index on the Employee_Name column. This means that when we perform the aforementioned SQL search for employees named ‘Jesus’, we no longer need to scan the entire table. Instead, we utilize the index to locate employees named ‘Jesus’ because the index has already sorted the names alphabetically. The sorted nature of the index implies that querying a name becomes much faster, as employees whose names start with ‘J’ are grouped together. Another crucial point is that indexes also store pointers to the rows in the table to access other column data.
What Exactly Is Stored in a Database Index?
You now understand that a database index is created on a table’s column and stores all the values of that column. However, it’s vital to grasp that the database index does not store values from other columns (fields) in the table. For instance, if we create an index on the Employee_Name column, the values of Employee_Age and Employee_Address are not stored in this index. Indeed, if we were to store all other field values in this index as well, it would essentially amount to copying the entire table as an index - occupying excessive space and being highly inefficient.
Indexes Store Pointers to Rows in the Table
If we find a record’s value in the index for a column, how do we access the other values of this record? It’s straightforward - the database index also stores pointers to the corresponding rows in the table. A pointer refers to a memory area, recording the reference to the corresponding row’s data on the disk. Thus, in addition to storing column values, the index stores an index to the row data. For example, a value (or node) in the index for the Employee_Name column might be described as (“Jesus”, 0x82829), where 0x82829 is the address of the row containing “Jesus” on the disk. Without this reference, you could only access a single value (“Jesus”), which would be meaningless since you wouldn’t be able to obtain other values for the employee - such as address and age.
How Does the Database Know When to Use an Index?
When the SQL (SELECT * FROM Employee WHERE Employee_Name = ‘Jesus’) is executed, the database checks whether there is an index on the column involved in the query. Assuming an index exists on the Employee_Name column, the database then determines whether using this index for the query is sensible - as in some scenarios, using an index might be less efficient than a full table scan. For more on these scenarios, consider reading the article: Selectivity in SQL
Can You Force the Database to Use an Index?
Typically, you don’t tell the database when to use an index - it decides on its own. However, it’s worth noting that in most databases (like Oracle and MYSQL), you can actually specify which index you want to use.
How to Create an Index Using SQL:
In our previous example, creating an index on the Employee_Name column would involve the following SQL:
1
CREATE INDEX name_index ON Employee (Employee_Name)
How to Create a Composite Index
We could create a composite index on two columns of the employee table with the following SQL:
1
CREATE INDEX name_age_index ON Employee (Employee_Name, Employee_Age)
What’s a Good Analogy for Database Indexes?
A fitting analogy for database indexes is the index of a book. If you have a book about dogs and you wish to find information on ‘Golden Retrievers’, why flip through the entire book when you could refer to the index at the back to see which pages discuss ‘Golden Retrievers’? Similarly, just as a book’s index contains page numbers, a database’s index contains pointers, directing you to the rows containing the values you’re querying in SQL.
What Are the Costs of Using Database Indexes?
So, what are the downsides to using database indexes? Firstly, indexes occupy space - the larger your table, the more space the index consumes. Secondly, there’s a performance hit (mainly for update operations) when adding, deleting, or updating row data in the table, as the same operations must occur in the index. Remember: creating an index on a column (or columns) requires keeping the column’s data up to date in the index.
The general principle is to create an index on a column only if it’s frequently used in queries.