Today’s paper is The Bw-Tree: A B-tree for New Hardware Platforms. The paper describes the Buzzword Tree, a new B-tree variant. B-trees and their modern successor the B+ trees are heavily used inside DBMSes for indexes, making them performance critical.
The Bw-tree focuses on improving performance on current and future hardware. For the Bw-tree this means optimizing for multi-core scalability, cpu cache usage, and flash storage. This paper in particular focuses on the novel techniques used to optimize the in-memory aspects of the Bw-tree, largely ignoring the flash storage layer.
I’m sure many people have heard of Moore’s Law, and the many pronouncements of its “death”. Well, Moore’s Law is still in effect, but CPU clock speeds aren’t getting much faster. Instead the transistors are being put to use in other ways, with more cores, larger cpu caches, and all other sorts of fun. This means that today, to reach peak performance, one’s code has to (1) take advantage of multiple cores and (2) be cache-efficient.
Let’s break down those two concepts. There are multiple ways to take advantage of multiple cores. For simplicity, let’s assume that our program is an already multi-core aware DBMS and is using multiple threads to work together. Let’s say we have one thread each handling one query. We’ll be able to handle more Queries Per Second (QPS) as we add more cores to run query threads. However, if the queries start to read or write the same data, then we’ll need a way to prevent two threads from stomping on each other, and so we introduce mutexes to guard our data and critical sections. The problem is now that the locks have introduced a bottleneck in our system. If lots of threads want to read or write the same data, they’ll need to acquire the same locks, introducing lock contention and lowering our QPS and throughput.
This way of using threads and locks is the most common pattern to scale with multiple cores. In very high performance systems though, the lock contention becomes too large of a problem to ignore. Some programs deal with this by introducing finer-grained locks, where instead of locking a whole datastructure, you lock just the piece of data you need. The linux kernel, for example, has mostly taken this approach.
There are still inherent limitations to this. We have to a priori know what level of lock granularity is best, and we still have to pay the overhead of having lots of locks even if the underlying data isn’t hot at all.
Another approach to scaling is called the “lock-free” approach. In place of using mutexes to guard our critical section, we create entirely new algorithms and datastructures that do not require sole access to the data. Instead the algorithms and datastructures use atomic CPU instructions like Compare-And-Swap to make “atomic” changes to the data that makes progress towards our goal, but is guaranteed to always leave the data in a correct state. Designing and implementing these lock-free algorithms and datastructures is very difficult, but the reward is, hopefully, the linear scaling of throughput with the number of cores in our computer.
This is the approach the Bw-tree takes, focusing on using lock-free algorithms and datastructures to improve its performance on new hardware. A quick note though: In DBMS land, mutexes and locks are called “latches” for historical reasons. This means that in the paper, the authors will talk about “latch-free” algorithms instead of “lock-free”.
To make things even more confusing though, in DBMS land the word “lock” is still used, but it means “logical locks”. An example of a logical lock in a DBMS is a query preventing any other query from changing some index it’s using, and that “logical lock” might be implemented by using latches. ¯\_(ツ)_/¯
The other tack that the designers use is to focus on the usage of the CPU caches. CPU caches today are multi-tier, data/instruction split, NUMA beasts. If you want to learn more about them, I highly suggest this article by Fabian Giesen, as well as the other articles in the series.
For the purpose of this post though, we’ll just note that when a CPU wants to read some memory, say the memory that describes our Bw-tree, it’s much faster to read that memory from one of the CPU caches than from main memory. How much faster is it to read the data from the CPU cache? Well it can be 10 to 100 times faster!
With that much of a performance difference, it would be nice if we could read all our memory from cache. Of course our caches can’t hold as much data as main memory, so this isn’t possible, but the Bw-tree does its best to keep the data it needs in the caches.
For example, one problem with CPU caches, is that if the cached memory gets written to, we have to invalidate our cache entry. Hence, the Bw-tree limits its cache invalidations by taking a write-once approach, and layering updates as deltas.
B+ Tree Background
For the following sections, it is helpful, but not necessary, to have a familiarity with B-Trees and B+ Trees. If you’re interested in more materials, a good resource is Andy Pavlo’s Advanced Database Systems (Spring 2016) Course Lectures. Lecture 6 provides a great in-depth description of how B+-Trees in DBMSes are implemented in practice.
The Bw-tree uses a page mapping table to abstract away where pages reside:
Our cache layer maintains a mapping table, that maps logical pages to physical pages, logical pages being identified by a logical “page identifier” or PID. The mapping table translates a PID into either (1) a flash offset, the address of a page on stable storage, or (2) a memory pointer, the address of the page in memory… We use PIDs in the Bw-tree to link the nodes of the tree.
The authors freely admit that the idea of a mapping table isn’t novel, but it unlocks a lot of their other advancements. For one, this indirection allows “pages” to be more elastic. Elastic both in terms of size, allowing them to grow beyond a single page, and also elastic in terms of representation and placement.
Because the Bw-tree uses logical page identifiers (PIDs), when a page is moved to disk, the internal links to that page don’t have to be updated. They will still have the correct PID, but instead an update to the mapping table will be issued, telling future readers that the page now resides somewhere on disk.
The same idea applies for allowing a page to grow to larger than the typical page size (4KiB). Instead of having to force a page split immediately, we could issue an update to the mapping table, marking the page as being larger than a single page.
Another way the Bw-tree uses the mapping table is with its delta updates.
Page state changes are done by creating a delta record (describing the change) and prepending it to an existing page state. We install the (new) memory address of the delta record into the page’s physical address slot in the mapping table using the atomic compare and swap (CAS) instruction. If successful, the delta record address becomes the new physical address for the page. This strategy is used both for data changes (e.g., inserting a record) and management changes (e.g., a page being split or flushing a page to stable storage).
The idea with delta updates is that instead of doing in-place updates of the page in memory, they prepend a delta record describing the change to the page state. This gives two benefits: (1) Since the underlying page state is not written to, it is not invalidated from the cache and (2) It allows for atomic updates to the entire page. Since each delta update to a page has to perform a CAS on the mapping table, it creates exactly one serialization point for updates. Otherwise if two queries wanted to update the same page, in two different places, with logically conflicting updates, there would be no way to catch that.
When the length of the delta chain becomes too long, a query thread that sees it, will finish its query, then come back to consolidate the updates into a new page. Finally that new page is installed back into the mapping table with the same CAS operation as before.
Structure Modifying Operations
Part of the problem of using lock-free datastructures, is that sometimes you need to update many pieces of it inside one logical change. Without a lock on the entire datastructure, it’s impossible to make those changes atomically. Structure Modifying Operations (SMOs), like page splits, are a common example of these multi-update logical changes in a B-Tree.
The authors write:
Latches do not protect parts of our index tree during structure modifictions (SMOs) such as page splits. This introduces a problem. For example, a page split introduces changes to more than one page: the original overly large page O, the new page N that will receive half of O’s contents, and the parent index page P that points down to O, and that must subsequently point to both O and N. Thus, we cannot install a page split with a single CAS.
Since the whole operation cannot be installed atomically, they break down the SMO into smaller pieces, each of which can be applied atomically. To do so they borrow some of their design from the B-link Tree.
We use a B-link design  to make this eaiser. With a side link in each page, we can decompose a node split into two “half split” atomic actions. In order to make sure that no thread has to wait for an SMO to complete, a thread that sees a partial SMO will complete it before proceeding with its own operation. This ensures that no thread needs to wait for an SMO to complete.
An interesting thing to note is that the Bw-tree’s lock-free, non-blocking approach permeates its whole design. A query won’t even block waiting for an in-progress SMO, but instead work on completing it as well!
The rest of the paper goes into the details of these algorithms and approaches used by the Bw-tree. The paper does a great job of explaining each algorithm, so I would recommend reading through it if you’re interested in how it works. I’ll just give a quick shoutout to a couple interesting details.
Epoch Garbage Collection
One problem with using lock-free datastructures is that there isn’t a way to guarantee that you have exclusive access to the datastructure. This can be a problem if we want to free the memory backing the datastructure, or repurpose it, since we don’t know if any other thread is using it.
To get around this problem, the Bw-tree uses an epoch based reclamation mechanism. Whenever a thread wants to access a page, it registers itself with the current epoch E. When the thread is finished with its operation, it exits the epoch. If a thread wishes to free a page, instead of freeing it immediately, it will issue an atomic CAS instruction to prevent future readers from accessing it, e.g. updating the mapping table. Then it will record the current epoch, and the page to be reclaimed.
This combination of rules ensures that once all threads have exited a given epoch E, it is safe to actually reclaim the memory of the pages that were marked for freeing in epoch E.
The authors explain:
Threads “enrolled” in epoch E might have seen earlier versions of objects that are being deallocated in epoch E. However, a thread enrolled in a epoch E cannot have seen objects deallocated in epoch E-1 because it had not yet started its dependency interval. Hence, once all threads enrolled in epoch E have completed and exited the epoch (“drained”), it is safe to recycle the objects deallocated in epoch E.
It seems that epoch based reclamation is a good match for lock-free structures, and I wonder if we’ll see heavier use of it in the future.
The Bw-tree uses the B-link approach of having each node contain a “side link” to its right sibling node. Having this extra link is very important, because it allows the Bw-tree to perform a node split atomically.
To split a node P, the Bw-tree first creates a new node Q. It then copies all the keys greater than the separation key Kp into Q, and sets the right sibling of Q to be R, where R is the current right sibling of P. Finally, we install Q into the mapping table.
At this point, Q is invisible to the rest of the index, and P still exists as before with all the same data and layout. To install Q into the index, we create and prepend a split delta record to P.
The authors explain:
We atomically install the split by prepending a split delta record to P. The split delta contains two pieces of information: (1) the separator key Kp used to invalidate all records within P greater than Kp, since Q now contains these records and (2) a logical side pointer to the new sibling Q. This installation completes the first “half split”.
This “half split” leaves the index in a completely cromulant state. When a query wishes to look up a key Kq, with Kp < Kq, then it will encounter the split delta record and know to proceed its lookup to Q. Even though the parent of P has no index term for Q, the index still behaves correctly.
Finally, to complete the node split, we prepend an index term delta record to the parent of P and Q.
This index delta contains (1) Kp, the separator key between P and Q, (2) a logical pointer to Q, and (3) Kq, the separator key for Q (formerly the separator directing searches to P).
This index installation is really just an optimization, but an important one. After the half split, the index is correct, but by installing the index delta we allow future readers to jump directly to Q instead of having to be indirected through P.
Of course, with all of these delta records around, the Bw-tree does eventually consolidate all the delta records into a new base page, when the delta chain grows too large.
The performance evaluation section of the paper is very thorough, and has interesting insights into the Bw-tree’s performance on both synthetic and real-world workloads.
The authors used three main workloads for their evaluations:
(1) Xbox LIVE. This workload contains 27 Million get-set operations obtained from Microsoft’s Xbox LIVE Primetime online multi-player game . Keys are alphanumeric strings averaging 94 bytes with payloads averaging 1200 bytes. The read-to-write ratio is approximately 7.5 to 1.
(2) Storage deduplication trace. This workload comes from a real enterprise deduplication trace used to generate a sequence of chunk hashes for a root file directory and compute the number of deduplicated chunks and storage bytes. This trace contains 27 Million total chunks and 12 Million unique chunks, and has a read to write ratio of 2.2 to 1. Keys are 20-byte SHA-1 hash values that uniquely identify a chunk, while the value payload contains a 44-byte metadata string.
(3) Synthetic. We also use a synthetic data set that generates 8-byte integer keys with a 8-byte integer payload. The workload begins with an index of 1M entries generated using a uniform random distribution. It performs 42 million operations with a read to write ratio of 5 to 1.
One of the worries of writing a lock-free implementation of a Bw-tree is that if you have operations that take varying amounts of time, racing to install their updates via CAS, that a slower operation might become starved. In the case of the Bw-tree we have single key updates that are quick to perform, and longer running operations like node splits, merges, and consolidations.
The authors found that the update failure rate was extremely low, below 0.02% across all workloads, while the split and consolidation operations were much higher, up to 1.25% for the Xbox and deduplication workloads, and 8.88% for the synthetic workload.
I don’t have a good intuition for those numbers, and admittedly the synthetic workload is a pathological case, but the authors remain undeterred.
This is expected, since splits and consolidates must compete with the faster record update operations. However, we believe these rates are still manageable.
B-Tree and Skip-List Perf Comparison
The paper compares Bw-tree performance against BerkeleyDB, a traditional B-Tree architecture, and a lock-free skip list implementation. Against BerkeleyDB, the Bw-tree has a speedup of 18.7x on the Xbox workload, 8.6x on the deduplication workload, and 5.8x on the synthetic workload. Overall, these are very impressive numbers!
Against the skip list, the Bw-tree has a speedup of 3.7x on the synthetic workload, and 4.4x on a read-only workload. Again, quite impressive!
The authors attributed their superior performance to their lock-free approach, and their CPU cache efficiency. To further investigate their CPU cache efficiency, they used Intel VTune to record and evaluate cache performance against the skip list.
As we expected, the Bw-tree exhibits very good cache efficiency compared to the skip list. Almost 90% of its memory reads come from either the L1 or L2 cache, compared to 75% for the skip list.
It’s good to see researchers and engineers rigorously hunting down and explaining the sources of their performance wins, rather than just throwing around big numbers.
The paper is very thorough and does a good job providing examples and intuition to the complex algorithms involved. Overall a great read, and very impressive research.
While the performance improvements of the Bw-tree are spectacular, I think the bigger take away from the paper is a new approach for multi-core scalability. Lock-free datastructures, where updates are encoded as deltas, seems to me like a very flexible way to compose and create multi-core scalable solutions.