Have you ever found yourself executing a database query and wondered how the database engine actually stores and fetches the data? Or maybe in a review someone pointed out, “This query is going to require a table scan because it’s not using an indexed column,” and you thought, “What does that even mean?” Well then, this post might be for you!
Author’s note: This is the result of my own curiosity and investigation, and likely doesn’t fully reflect the reality of many database storage engines. I’m mostly trying to solidify what I have been learning, so please point out anything that is egregiously wrong 🙂
The backend organizes the data you INSERT INTO a table in a structure called a B+Tree. Then, when you do a SELECT * FROM table, it can efficiently find that data. However, because tables can be huge, it’s often not practical to store the entire tree in memory. Instead, only a portion of the tree’s nodes will ever be in memory at a time. These nodes are often referred to as pages. A page is the smallest unit of data that a database engine reads and writes to disk at a time; one page equals one tree node. To make this a little more concrete, let’s work with an actual example. Here is a very small movie database table with 5 records.
Let’s imagine we just created our table, there is no data in it, and we begin inserting data. Let’s observe how the B+Tree grows.
sqlite> CREATE TABLE movies (
   ...>   id INTEGER PRIMARY KEY AUTOINCREMENT,
   ...>   title TEXT,
   ...>   rating REAL,
   ...>   release INTEGER
   ...> );
First we insert Toy Story:
sqlite> INSERT INTO movies (title, rating, release) VALUES ('Toy Story', 0.92, 1995);
This is the first record, so we insert it into the root node of the tree, which at this point is also a leaf node.
Next we insert Black Panther:
sqlite> INSERT INTO movies (title, rating, release) VALUES ('Black Panther', 0.96, 2018);
Again we insert the record into the root node.
A quick aside about B+Trees. B+Trees (and B-Trees) are trees whose nodes can hold up to N-1 keys, where N is the order of the tree. So if I had a B+Tree of order 3 (what I’m using in these examples), each node can contain a maximum of 2 keys and 3 links to children. When we need to insert a new key into a full node, the node splits and passes the middle key up to the parent. This happens recursively until there is room for the key in a node or we reach the root. If we reach the root and it’s full, then it also splits and a new root is created, which increases the overall height of the tree by 1. You’ll see this happen in our next insert.
The difference between a B+Tree and a B-Tree is that in a B+Tree non-leaf nodes (interior nodes) don’t contain values, so a key in an interior node will be replicated in a leaf where the data can be found. Also, leaf nodes in B+Trees are linked together to form a linked list. This allows linear traversal of all the data when we can’t search by key (hence a full table scan). Anyways, back to the show.
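To make the aside concrete, here is a minimal order-3 B+Tree insert in Python. This is my own sketch, not SQLite’s actual implementation: leaves hold the records and are chained together, a leaf split copies its middle key up, and an interior split pushes its middle key up.

```python
import bisect

MAX_KEYS = 2  # order 3: at most 2 keys and 3 child links per node

class Node:
    def __init__(self, leaf=True):
        self.leaf = leaf
        self.keys = []       # sorted keys
        self.children = []   # child nodes (interior nodes only)
        self.records = []    # row data (leaf nodes only)
        self.next = None     # sibling link for leaf-level scans

def insert(root, key, record):
    """Insert a record and return the (possibly new) root."""
    split = _insert(root, key, record)
    if split is None:
        return root
    mid_key, right = split
    new_root = Node(leaf=False)      # root split: tree grows one level
    new_root.keys = [mid_key]
    new_root.children = [root, right]
    return new_root

def _insert(node, key, record):
    if node.leaf:
        i = bisect.bisect_left(node.keys, key)
        node.keys.insert(i, key)
        node.records.insert(i, record)
        if len(node.keys) <= MAX_KEYS:
            return None
        # Leaf split: the middle key is *copied* up (B+Tree behavior),
        # so every key still appears in some leaf.
        mid = len(node.keys) // 2
        right = Node(leaf=True)
        right.keys, node.keys = node.keys[mid:], node.keys[:mid]
        right.records, node.records = node.records[mid:], node.records[:mid]
        right.next, node.next = node.next, right
        return right.keys[0], right
    # Interior node: descend into the child covering `key`.
    i = bisect.bisect_right(node.keys, key)
    split = _insert(node.children[i], key, record)
    if split is None:
        return None
    mid_key, right_child = split
    node.keys.insert(i, mid_key)
    node.children.insert(i + 1, right_child)
    if len(node.keys) <= MAX_KEYS:
        return None
    # Interior split: the middle key *moves* up and is not duplicated.
    mid = len(node.keys) // 2
    up = node.keys[mid]
    right = Node(leaf=False)
    right.keys, node.keys = node.keys[mid + 1:], node.keys[:mid]
    right.children, node.children = node.children[mid + 1:], node.children[:mid + 1]
    return up, right
```

Inserting keys 1 through 5 into this sketch reproduces the splits we’re about to walk through with the movies table.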
sqlite> INSERT INTO movies (title, rating, release) VALUES ('Alien', 0.98, 1979);
This insert resulted in a page split (remember a node is contained in a page of data) which increased the height of the tree by one. And because the root is now an interior node, it doesn’t contain data and you need to navigate the tree to get to the data.
sqlite> INSERT INTO movies (title, rating, release) VALUES ('Star Wars', 0.96, 1977);
Another split happened moving key 3 for Alien up to the root. Finally we insert the last record for our table.
sqlite> INSERT INTO movies (title, rating, release) VALUES ('The Incredibles', 0.75, 2004);
Here you can really begin to get a sense of the recursive, bottom-up page split. We first navigate down to the leaf node where key 5, The Incredibles, will land; realize it’s full; split it; and move the middle key (4) up to the parent node. The parent node is also full, so we split it again, and, not having a parent to pass the middle key into, create a new root, increasing the height of the tree by one.
We now have a complete B+Tree to represent our table and can find any record efficiently when searching by a primary key. This is what that table might look like laid out in a file.
Please note that this is just a visual representation and not very accurate, but it should give a general idea. In reality we could fit many more keys in a 4KB page, and there would be additional header information in the page, like the node type (interior/leaf/root), a parent pointer, the number of keys in the page, etc.
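As a toy illustration of such a header, you could pack those fields at the front of each page. The layout below is made up for this post (SQLite’s real page format is different):

```python
import struct

# Made-up header layout: node type (1 byte), number of keys (2 bytes),
# parent page number (4 bytes) -- 7 bytes total, big-endian.
HEADER = struct.Struct(">BHI")
LEAF, INTERIOR, ROOT = 0, 1, 2

def write_header(node_type, num_keys, parent_page):
    return HEADER.pack(node_type, num_keys, parent_page)

def read_header(page):
    # The header sits at the start of the page; keys and links would follow.
    return HEADER.unpack_from(page, 0)

# A 4KB leaf page with 2 keys whose parent lives in page 5.
page = write_header(LEAF, 2, 5) + b"\x00" * (4096 - HEADER.size)
```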
Let’s quickly walk through executing a simple query and figure out how expensive it is.
sqlite> SELECT * FROM movies WHERE id=3;
We start by loading the first page from disk: load_page(PAGE_SIZE * 0). This gives us the root page. We then compare the id we are searching for against the keys in the page. 3 is greater than or equal to 3, which gives us a link to the page starting at byte 0x2000, so we do load_page(PAGE_SIZE * 2). Again we compare the id against the keys in the page. 3 is less than 4, which gives us a link to 0x5000, so let’s fetch that page: load_page(0x5000). We look into this page, see that id 3 exists here and that it’s a leaf node, so we load the data and return it.
You can see how we navigate the tree through pages:
Each page load is referred to as a probe. It took us three probes (disk reads) to get to our record. However, in reality the root node is almost always going to be cached, since all searches pass through it, so only two disk reads would be needed. Not bad!
You can even verify we used a search instead of a scan by using EXPLAIN QUERY PLAN:
sqlite> EXPLAIN QUERY PLAN SELECT * FROM movies WHERE id=3;
QUERY PLAN
`--SEARCH TABLE movies USING INTEGER PRIMARY KEY (rowid=?)
Now what about a different search like this:
sqlite> SELECT * FROM movies WHERE release=1979;
This is going to return the same result, but unlike before we can’t efficiently navigate the tree. Instead we need to walk from the root, staying left all the way down until we reach the leftmost leaf, which holds the first record in the tree. From there we follow the linked list, checking each record along the way for a matching release value. This results in TREE_HEIGHT + (LEAF_NODES - 1) page probes (6 for our example): the descent to the leftmost leaf plus one probe per remaining leaf. That’s a full table scan. We can again verify it with EXPLAIN QUERY PLAN:
sqlite> EXPLAIN QUERY PLAN SELECT * FROM movies WHERE release=1979;
QUERY PLAN
`--SCAN TABLE movies
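Here is a rough sketch of that scan cost in Python (my own model, not engine code). The interior descent is folded into a fixed tree_height, each leaf stores full rows (id, title, rating, release), and "next" holds the index of the next leaf in the chain:

```python
# The 4 leaves of our 5-record tree, linked left to right.
leaves = [
    {"rows": [(1, "Toy Story", 0.92, 1995)], "next": 1},
    {"rows": [(2, "Black Panther", 0.96, 2018)], "next": 2},
    {"rows": [(3, "Alien", 0.98, 1979)], "next": 3},
    {"rows": [(4, "Star Wars", 0.96, 1977),
              (5, "The Incredibles", 0.75, 2004)], "next": None},
]

def scan_by_release(year, tree_height=3):
    probes = tree_height          # root + interior + leftmost leaf
    matches = []
    i = 0                         # start at the leftmost leaf
    while i is not None:
        for row in leaves[i]["rows"]:
            if row[3] == year:    # check every record's release column
                matches.append(row)
        i = leaves[i]["next"]
        if i is not None:
            probes += 1           # each sibling leaf is another page read
    return matches, probes
```

For release=1979 this touches all 4 leaves plus the 2 pages above them: 6 probes, no matter where the match sits.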
The solution to this, if we plan to query by
release date often, would be to create an index.
sqlite> CREATE INDEX IF NOT EXISTS movie_release ON movies (release);
This builds another tree like before, but the keys will be the release date and the values are going to be the primary id key. Also, each value is stored wherever its key lives, whether that’s an interior node or a leaf (making this a B-Tree rather than a B+Tree). Let’s take a look at what it would look like:
Our above query will now search the release date index to find the primary key, and then search the table tree by that key, without doing a full scan. Assuming 0 page caching this would take 5 probes. If we had more records in our table this makes a huge difference in search time: you can quickly see it goes from O(N) to O(2 * log(N)). Again, we can verify we are no longer using a full table scan by consulting EXPLAIN QUERY PLAN:
sqlite> EXPLAIN QUERY PLAN SELECT * FROM movies WHERE release=1979;
QUERY PLAN
`--SEARCH TABLE movies USING INDEX movie_release (release=?)
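Conceptually the indexed lookup is two tree searches chained together. In this toy sketch, dicts stand in for the two trees, and each dict lookup represents an O(log N) descent through pages:

```python
# The index tree: release year -> primary key (what CREATE INDEX built).
release_index = {1995: 1, 2018: 2, 1979: 3, 1977: 4, 2004: 5}

# The table tree: primary key -> row data.
table = {
    1: ("Toy Story", 0.92, 1995),
    2: ("Black Panther", 0.96, 2018),
    3: ("Alien", 0.98, 1979),
    4: ("Star Wars", 0.96, 1977),
    5: ("The Incredibles", 0.75, 2004),
}

def select_by_release(year):
    movie_id = release_index.get(year)    # search 1: the index tree
    if movie_id is None:
        return None
    return (movie_id,) + table[movie_id]  # search 2: the table tree
```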
Well I think this is about where I’ll stop. I’m just going to finally dump some last minute thoughts that I didn’t go into above that you might be wondering about.
- Locks may or may not be taken on entire pages when doing a write, so other rows in the same page can end up locked as well (I expect engines use some tricks to keep it limited to row-level only).
- Using an auto-incrementing primary key has the nice property of always writing to the right-most page in the tree, which can lead to better page cache hit rates when doing many INSERTs. Using a random primary key (like UUID4) can cause page cache thrashing and inefficiency, because inserts land in many different leaf nodes and miss the cache.
- Page size was historically chosen to match the default amount that a read head on a spinning disk would read for increased efficiency. With the advent of SSD I’m not sure if that still holds true.
- If a record is too big for a page then the node will hold a link to a page called an overflow page (in SQLite at least). The overflow page is itself part of a linked list of as many overflow pages as it takes to fit all the data.
- Even though we have possibly many keys to iterate over in each node we still measure performance by the number of page probes since that is by far the most expensive operation in terms of searching.
- You can use hash table indexes instead of trees, but you lose the ability to do range queries.
- The branching factor of the tree (how many keys will fit in a page) can be roughly estimated as (page_size - header_size) / (key_size + link_size). If you do the math you’ll realize that with a tree of depth 4 you could be storing TBs of row data. Here’s a chart of best-case page packing in InnoDB, given a 16KB page size (as of 2013, source)
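Doing that math with some assumed sizes (the header, key, and link sizes below are illustrative guesses, not real InnoDB numbers; only the 16KB page size matches InnoDB’s default):

```python
page_size = 16 * 1024          # InnoDB's default page size
header_size = 128              # assumed per-page header overhead
key_size, link_size = 8, 8     # e.g. a BIGINT key plus a child page pointer

# How many keys (and roughly how many child links) fit in one page.
branching = (page_size - header_size) // (key_size + link_size)   # 1016

# A depth-4 tree has 3 interior levels fanning out above the leaves.
leaf_pages = branching ** 3
data_bytes = leaf_pages * page_size   # total capacity of the leaf level
# data_bytes / 2**40 comes out to roughly 15.6 TiB of leaf pages
```

So with a fan-out of about a thousand, three interior levels address around a billion leaf pages, which is how a depth-4 tree reaches terabytes of row data.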
If you are curious about the actual file format for a page I recommend checking out the SQLite docs on them which are great.
Ok this is actually where I stop. Thanks for reading!