Continuing with our last database index blog, in this part 2, we will briefly look at how B+ tree is used in the actual database system to indexing data, also we will take a brief comparison of the difference between the relational database and no-sql database. The widely used MySQL will be used as an example throughout this blog.
Since index in MySQL is tightly associated with the storage engine, we will focus on the 2 most popular ones of them - MyISAM and InnoDB, the reason is that these 2 storage engines are the default choice when using MySQL, MyISAM is default choice before 5.5, and then InnoDB became the default after it.
The accompanying table used in this blog will be a simplified pet table as described in the official MySQL documentation. It has the following fields:
Assuming part of the table data looks like below:
Now we will take a slightly deep dive into the actual usage of B+ tree to store the index in MySQL. We won’t go into the implementation details which is outside of the scope of this blog but will briefly describe the usage in abstraction.
In MyISAM storage engine, the data being indexed & stored at tree’s leaf nodes is the addresses of records. Let’s take a look at the graph below, which depicts the abstraction of the index of the example table above.
myisam
As we can see from the diagram, the primary index is built upon petNo column, the leaf nodes store the addresses of each record, as a result, the process of locating record would be locating the value of supplied key, then use the value which is address of data to locate the actual record. This type of index is referred to as non-clustered index.
The difference of primary index in InnoDB storage engine is that the data records are tightly connected to index structure, that is the value stored at each leaf node is the actual record. The diagram below depicts the difference.
innodb
As can be seen, each leaf node contains the entire data record associated with that key. As a result, once the key is located, the entire record (row) will also be located, there is no additional lookup based on the address as shown in MyISAM storage engine. Also, the entire table of data is the index itself organized & stored as a B+ tree structure, the key is the primary key of the table. Besides, another difference we can see is that the data is physically ordered when indexed using the InnoDB engine. So in InnoDB engine, a tale must have a primary key, which is another difference with MyISAM. This type of index is also referred to as clustered index.
So far we have only looked at the 2 popular storage engines when it comes to decision about which one to use, there isn’t a unified rule to tell you which one is better. From a pure index point of view, the performance doesn’t differ, MyISAM has a slight cost of additional lookup, but it doesn’t hurt performance that much. The choice of storage engine should vary case by case, business by business, etc., and shouldn’t be dominated by index design of the engine. E.g. InnoDB is a pretty new engine that provides a lot of new features such as transaction support, row-level locking, etc., users or organizations who needs these features will find InnoDB will be a more suitable choice over MyISAM. Or if you have a large volume of data but with limited server resources, you may find MyISAM would be a better choice, since the index structure of it takes less resource. In practice, users should carefully consider their situation such as data itself, server resource, what kind of query will be performed frequently, etc. to make a suitable choice.
In recent years, the rise of nosql databases like mongo, etc. attract tons of attention in the industry. Many organizations started adopting it due to its scalability, handling of large volumes of data, etc., which shines in the big data field compared to RDBMS. I started developing a similar interest when writing the blog about index on RDBMS. What surprises me is that nosql database like Mongo used a very similar data structure to implement index, which is B tree instead of B+ tree used by RDBMS.
This choice is largely due to its no-relation nature design. We all know that Mongo stores JSON document and collection schema is very loose. As we said before, the one big advantage of B+ tree as the index is its efficiency to do range query, while in nosql, this is not that obvious since range query is less frequent compared to the relational database. Also, since in B tree, the internal nodes can also store data, which can make query time as fast as O(1), as worst as O(logN), while when using B+ tree, you always have to go to leaf node to get data back, so time is always O(logN). As a result, B tree is chosen by Mongo to achieve better efficiency for querying, if you remember the content in Blog 1, this will also reduce disk I/O(s) time.
Now, you have seen the usage of B+ tree as the index for the relational database, as well as the different choice (B tree) for nosql database, hopefully, you will have a better understanding of this concept and can take this into account when you choose databases or storage engines after reading this blog.