I’ve previously talked about B-trees and how databases store and retrieve data. However I kept it fairly high level and didn’t go into much depth beyond the fact that B-tree nodes are usually stored as pages and that pages represent an indivisible unit of disk. The reason I didn’t go into much more depth was that it was mostly where my knowledge ended on the subject.

I’ve finally got around to reading Database Internals, by Alex Petrov, and am nearing the end of the first half of the book. The book has been incredibly helpful in pulling back the veil to help me better understand the inner workings of database storage engines. I thought it would be helpful to myself (and hopefully others) to write a bit about what I learned rooted in some working, yet simple, code. I hope to make this a small series of posts and thought that I would kick things off with pages.

A page is a basic unit of I/O in the same way that a byte is a basic unit of memory. The page size is usually chosen to min/max physical disk characteristics. Like minimizing the movement of a hard disks physical arm movements, or to match the amount of read/write done by a SSD. A page size is usually going to be between 4-16KB.

#define PAGE_SIZE 4096

void* alloc_page() {
    return calloc(1, PAGE_SIZE);
}

This allocates a congruent, zeroed array of 4096 bytes that serves as a blank canvas. Later we will be able to write and read it from disk. But how do we structure our data in it? One approach would be to simply write to it from left to right like so:

+--------+--------+--------+--------+---------------+
| HEADER | CELL 1 | CELL 2 | CELL N | FREE SPACE... |
+--------+--------+--------+--------+---------------+

This works alright, but it causes some problems with free space management as records are deleted. If we want to remove an item and reclaim the space it occupied then we need reorder and repack everything. Another problem is the idea that we may have pointers to the cell records and reshuffling would cause them to be invalid. This calls for a better, more efficient page structure. Enter slotted pages.

Slotted pages work by splitting up the page into a few independent memory regions that grow into each other. It’s a technique that used by many big databases, like PostgreSQL.

The header region stores metadata about the page and always start at the beginning of the page. It holds information like the page id, B-tree node type (root, interior, leaf), and the start and end of the free space region. It’s also common to include things like a magic number, version number, a checksum to protect against data corruption, bit flags to indicate things like compression, sibling page ids, etc.

The item pointer region is simply offsets in the page to the cell data region which are the actual table rows (leaf nodes) or the search key + child node page id (interior nodes). The free space region is the remaining area of unoccupied space in the node. The pointer region and cell data region grow into each other.

enum PageType {
    ROOT = 1,
    INTERIOR,
    LEAF,
};

typedef struct PageHeader {
    uint32_t id;
    enum PageType type;   // ROOT, INTERIOR, or LEAF
    uint16_t free_start;  // offset to start of free space
    uint16_t free_end;    // offset to end of free space
    uint16_t total_free;  // free_end - free_start
    uint8_t flags;
} PageHeader;

#define PAGE_HEADER(page) \
    ((PageHeader*)page)

void* new_page(enum PageType type, uint32_t id) {
    void* page = alloc_page();
    PageHeader* header = PAGE_HEADER(page);
    header->id = id;
    header->type = type;
    header->free_start = sizeof(PageHeader);
    header->free_end = PAGE_SIZE - 1;
    header->total_free = header->free_end - header->free_start;

    return page;
}

Adding a cell to the page is as simple as checking if the page has enough room, getting the free_end - cell_size offset, copying the cell data into the page, adding a pointer to the newly inserted cell data offset at free_start, and updating the header boundaries. If the page was too full for a new item then this is where a page split would happen with the B-tree.

typedef struct CellPointer {
    uint16_t cell_location;
    uint16_t cell_size;
} CellPointer;

#define CELLPOINTER_OFFSET_TO_IDX(offset) \
    ((offset - sizeof(PageHeader)) / sizeof(CellPointer))

uint16_t add_cell(void* page, void* cell, uint16_t cell_size) {
    CellPointer cell_pointer;
    PageHeader* header = PAGE_HEADER(page);
    // We shouldn't be trying to add a cell to a page that doesn't have room for it!
    // The total space that'll need to be available is the size of the cell + the cell pointer.
    assert(header->total_free >= cell_size + sizeof(CellPointer));

    cell_pointer.cell_location = header->free_end - cell_size;
    cell_pointer.cell_size = cell_size;
    // Add the cell to the page
    memcpy((void*) page + cell_pointer.cell_location, cell, cell_size);

    // Add cell pointer to the page
    uint16_t pointer_offset = header->free_start;
    memcpy((void*) page + pointer_offset, &cell_pointer, sizeof(CellPointer));

    // Update the header with the new page bounds
    header->free_end -= cell_size;
    header->free_start += sizeof(CellPointer);
    header->total_free = header->free_end - header->free_start;

    return CELLPOINTER_OFFSET_TO_IDX(pointer_offset);
}

Removing an item also becomes quite simple as we can just null the pointer, essentially making the cell un-referenced.

#define CELLPOINTER_IDX_TO_OFFSET(idx) \
    (idx * sizeof(CellPointer) + sizeof(PageHeader))

void remove_cell(void* page, uint16_t idx) {
    uint16_t pointer_offset = CELLPOINTER_IDX_TO_OFFSET(idx);

    PAGE_HEADER(page)->flags |= CAN_COMPACT;
    ((CellPointer*)(page + pointer_offset))->cell_location = 0;
}

You might notice that we don’t reclaim space on removal, but instead set a flag, CAN_COMPACT. If we have a lot of INSERT and DELETE statements that target the same node we may find ourselves low on free space even though there are few referenced cells. To solve this we can perform a garbage collection job. This is equivalent to commands like VACUUM in PostgreSQL and OPTIMIZE TABLE in MySQL. A basic approach would be to walk over the pointer list copying non null pointers and their cells into a new in memory page while skipping over the null ones. Once that is complete we can re write everything back into the original page and update the free_start and free_end pointers.

typedef struct PointerList {
    CellPointer* start;
    uint16_t size;
} PointerList;

PointerList pointer_list(void* page) {
    PageHeader* header = PAGE_HEADER(page);
    PointerList list;
    list.start = page + sizeof(PageHeader);
    list.size = (header->free_start - sizeof(PageHeader)) / sizeof(CellPointer);

    return list;
}

void compact(void* page) {
    PageHeader* header = PAGE_HEADER(page);
    PointerList plist = pointer_list(page);

    if (!(header->flags & CAN_COMPACT))
        return;

    void* copy = new_page(ROOT, 0);
    CellPointer* cur_pointer;
    for (int i = 0; i < plist.size; i++) {
        cur_pointer = plist.start + i;

        // Only copy live cell pointers and cells
        if (cur_pointer->cell_location != 0) {
            add_cell(copy, page + cur_pointer->cell_location, cur_pointer->cell_size);
        }
    }

    header->free_start = PAGE_HEADER(copy)->free_start;
    header->free_end = PAGE_HEADER(copy)->free_end;
    header->total_free = header->free_end - header->free_start;
    header->flags &= ~CAN_COMPACT;
    memcpy(page + sizeof(PageHeader), copy + sizeof(PageHeader), PAGE_SIZE - sizeof(PageHeader));

    free(copy);
}

The result would look like this:

Finding a key in a page becomes a matter of doing a binary search over the pointers, dereferencing them to get the cell, and compare the key.

Finally if your schema contains something that goes beyond what is reasonable for a page to store then you can use overflow pages. Overflow pages are simply pages that contain spill over data for a record. For example if you use some_col VARCHAR(500) maybe only the first N bytes are stored in the leaf node page and the rest would go into an overflow page. This is what MySQL does with alot of it’s row formats, like DYNAMIC.

Hopefully this helps demystify how database storage engines physically layout and work with your data. If you want even more juicy details I recommend picking up Database Internals. All the example code provided here can be found in working form in this Github repo.