A bit set (also referred to as a bit array, bit map, or bit vector) is, quoting wikipedia,

an array data structure that compactly stores bits. It can be used to implement a simple set data structure.

Let’s take a closer look at what that means.

Imagine we have a single unsigned byte. That byte can represent the values 0-255, but it can also be treated as an array with 8 indexable positions (0 to 7) where the values of each position can be 0 or 1. We can use bitwise operators to quickly perform various set operations on those bits treating the index as the number we are adding to the set.

package main

import "fmt"

type SimpleBitset uint8

func (b *SimpleBitset) Add(index int) {
    *b |= 1 << index
}

func (b *SimpleBitset) Has(index int) bool {
    return *b&(1<<index) > 0
}

func (b *SimpleBitset) Remove(index int) {
    *b &= ^SimpleBitset(1 << 4)
}

func (b *SimpleBitset) Union(other SimpleBitset) {
    *b |= other
}

func (b *SimpleBitset) Intersection(other SimpleBitset) {
    *b &= other
}

func (b *SimpleBitset) Difference(other SimpleBitset) {
    *b &= ^other
}

func (b *SimpleBitset) Clear() {
    *b = 0
}

func main() {
    bitset := SimpleBitset(0)

    // Adding (setting a bit)
    bitset.Add(4)
    bitset.Add(6)
    fmt.Printf("%08b\n", bitset) // 01010000

    // Checking if a bit is set
    fmt.Println(bitset.Has(4)) // true
    fmt.Println(bitset.Has(5)) // false

    // Removing (unsetting a bit)
    bitset.Remove(4)
    fmt.Printf("%08b\n", bitset) // 01000000

    // Union
    other := SimpleBitset(0b00000011)
    bitset.Union(other)
    fmt.Printf("%08b\n", bitset) // 01000011

    // Intersection
    other = SimpleBitset(0b01000010)
    bitset.Intersection(other)
    fmt.Printf("%08b\n", bitset) // 01000010

    // Difference
    other = SimpleBitset(0b01000000)
    bitset.Difference(other)
    fmt.Printf("%08b\n", bitset) // 00000010

    // Clear
    bitset.Clear()
    fmt.Printf("%08b\n", bitset) // 00000000
}

This is cool, it’s basically a more efficient map[uint]bool. However, for it to be useful we need to make the bit set larger. Luckily it’s quite trivial to do so by representing the bit array as a []uint. The basic idea is when you want to check if some number exists you find which uint in the slice it belongs to and then check the bit offset in the uint:

word, offset := val / wordSize, val % wordSize
words[word] & (1 << offset) > 0

However there is a bit of a problem. What is the size of var wordSize uint? On some machines it might be 64 bits and on others it will be 32. There is a neat little trick you can do to check this.

var wordSize uint = 32 << (^uint(0) >> 63)

This will result in wordSize being 64 if uint is 64 bits and 32 otherwise. This works because a 32 bit system would shift the inverted uint(0) 63 positions to the right resulting in 32 << 0 which is simply 32. If it was a 64 bit system it would result in 32 << 1 moving it over by one bit to 64.

With this we can build a more useful bit set:

type BitSet struct {
    size     uint
    wordSize uint
    words    []uint
}

// NewBitSet creates a fixed size BitSet
func NewBitSet(size int) BitSet {
    var wordSize uint = 32 << (^uint(0) >> 63)
    nWords := math.Ceil(float64(size) / float64(wordSize))
    words := make([]uint, int(nWords))

    return BitSet{uint(size), wordSize, words}
}

func (b *BitSet) Add(val uint) error {
    if val >= b.size {
        return fmt.Errorf("Val (%d) is outside the BitSet bounds (0 - %d)", val, b.size)
    }

    word, offset := val / b.wordSize, val % b.wordSize
    b.words[word] |= 1 << offset
    return nil
}

func (b *BitSet) Has(val uint) bool {
    if val >= b.size {
        return false
    }

    word, offset := val / b.wordSize, val % b.wordSize
    return b.words[word]&(1<<offset) > 0
}

func (b *BitSet) Remove(val uint) error {
    if val >= b.size {
        return fmt.Errorf("Val (%d) is outside the BitSet bounds (0 - %d)", val, b.size)
    }

    word, offset := val / b.wordSize, val % b.wordSize
    b.words[word] &= ^(1 << offset)
    return nil
}

func (b *BitSet) Union(other *BitSet) error {
    if other.size > b.size {
        return fmt.Errorf("Unable to Union a larger BitSet (%d) into a smaller one (%d)", other.size, b.size)
    }

    for i, word := range other.words {
        b.words[i] |= word
    }

    return nil
}

func (b *BitSet) Intersection(other *BitSet) error {
    if other.size > b.size {
        return fmt.Errorf("Unable to Intersect a larger BitSet (%d) into a smaller one (%d)", other.size, b.size)
    }

    otherSize := len(other.words)
    for i := range b.words {
        if i < otherSize {
            b.words[i] &= other.words[i]
        } else {
            b.words[i] = 0
        }
    }

    return nil
}

func (b *BitSet) Clear() {
    for i := range b.words {
        b.words[i] &= 0
    }
}

func (b *BitSet) String() string {
    var buf strings.Builder

    buf.WriteByte('{')
    for i, word := range b.words {
        if word == 0 {
            continue
        }

        for offset := range b.wordSize {
            if word&(1<<offset) > 0 {
                if buf.Len() > 1 {
                    buf.WriteString(", ")
                }
                fmt.Fprintf(&buf, "%d", uint(i)*b.wordSize+offset)

            }
        }

    }
    buf.WriteByte('}')

    return buf.String()
}

It’s worth noting that the bit set doesn’t need to be bound to some size. It could of been dynamic, but having a known size is often useful. Also being fixed size doesn’t preclude us from simply growing the bit set via a new bit set like so:

if err := bitset.Add(tooBig); err != nil {
    biggerBitset := NewBitSet(int(tooBig * 2))
    biggerBitset.Union(&bitset)
    bitset = biggerBitset
}

Ok so now we can represent sets of small numbers in a space efficient manner. How is this useful for databases?

Bloom Filters

Bloom filters are a probabilistic data structure that will tell you if an element is maybe in a set or definitely not in it. A bloom filter uses a bit set as it’s base and then when you want to check if an element is a member of the set you run a number, K, different hash functions on the element. The output of those hash functions are mapped to slots in the bit set. If any of the values are not set then the element is definitely not in the set. If they are all set, then it might be in the set.

Databases can use bloom filters as a way to quickly check if it should bother reading data from disk, potentially saving an expensive I/O operation. An example where this is used is LSMTree style storage engines. When key/value pairs are written to disk, the key is also added to the bloom filter. When a GET query comes in, you check if it might exist in the bloom filter. If it doesn’t then we know for sure we shouldn’t bother checking the file.

Free Lists

The idea here is you use your bit set to track pages which are free. 0 means it’s being used while 1 means it’s free. When a page is no longer being used, you add it to the bit set. When you want a new page to hold some tuples, you consult with the bit set to see which page is free to use. If none are free then you need to allocate a new one. SQL server uses this to track free extents (groups of pages) in something called the Global Allocation Map (GAM) .

Null Bitmap

A physical tuple might consist of a bunch of columns, many of which might be nullable. Instead of storing the natural size of the item, like 8 bytes for a BIGINT, you set it’s column position in the bitmap to 1 which represents null. Then you skip storing it at all. When you read the tuple you consult the bit set which tells you which of the attributes are are present and which are null. This often requires reserving some space in the tuple header to store the bit set.

Compaction

Bit sets are also able to be compacted quite well if you have long runs of all 0s or all 1s. From simple run length encoding to something more involved like roaring bitmaps you can use these to save a significant amount of space.