This paper by Joy Arulraj, Matthew Perron, and Andy Pavlo describes a new way to rethink the DBMS Write Ahead Log (WAL) for NVM technology.
NVM stands for Non Volatile Memory which means that it won’t lose the data written to it when it loses power, unlike DRAM. The reason NVM is such a big deal is that it promises to provide read/write latency around 2-8x of DRAM, have good random read/write latency, and have much larger storage capacity than DRAM. Compared to NAND flash Intel claims their new 3D XPoint will be 10x lower latency, 4x write throughput, 3x read throughput.
For DBMSes, this means a rethinking of many core assumptions about the latency of storage. It’s plausible that in the future, smaller databases will be housed entirely on NVM storage.
The paper reconsiders how a DBMS should guarantee its integrity when operating on NVM storage. The authors write:
A DBMS guarantees the integrity of a database by ensuring (1) that all of the changes made by committed transactions are durable and (2) that none of the changes made by aborted transactions or transactions that were running at the point of a failure are visible after recovering from the failure. These two constraints are referred to as durability of updates and failure atomicity, respectively [5, 19].
Typically DBMSes use Write-ahead logging to ensure that updates are durable and failures are atomic. The Write-ahead log is append-only for performance concerns since HDDs, and even SSDs, perform better when doing consecutive writes. By writing modifications to the durable log before committing them to the database, it ensures that updates are durable. By including undo information in the log as well, it also ensures that if the database crashes in the middle of a transaction, it will be able to use the log to undo uncommitted transactions. Finally the DBMS occasionally writes a snapshot of the database to the log so that it can truncate the log before that. This prevents the log from growing unboundedly.
The downside of all of this is that the undo information written, as well as the snapshots, are a non trivial amount of extra writing. The ARIES protocol for recovering from the log also takes time proportional to the size of the log. This means that the most heavily used databases will take the longest time to recover, which is exactly what you don’t want.
The paper proposes a different approach called Write-Behind Logging.
Write-behind logging (WBL) leverages fast, byte-addressable NVM to reduce the amount of data that the DBMS records in the log when a transaction modifies the database. The reason why NVM enables a better logging protocol than WAL is three-fold. Foremost, the write throughput of NVM is more than an order of magnitude higher than that of an SSD or HDD. Second, the gap between sequential and random write throughput of NVM is smaller than that of older storage technologies. Finally, individual bytes in NVM can be accessed by the processor, and hence there is no need to organize tuples into pages or go through the I/O subsystem.
Write-behind logging is different in that it makes updates directly to the tuples in the database before writing to the log. Instead it keeps track of the “dirty” tuples that have been written, but not yet committed. The Dirty Tuple Table (DTT) itself is not even written to durable storage, but is kept in memory during operation. We’ll explain later how a change gets committed.
During operation, in addition to the Dirty Tuple Table, the DBMS maintains two important commit timestamps, Cp and Cd where Cp < Cd.
[The DBMS] records the timestamp of the latest committed transaction all of whose changes and updates of prior transactions are safely persisted on durable storage (Cp). Second, it records the commit timestamp (Cd, where Cp < Cd) that the DBMS promises to not assign to any transaction before the subsequent group commit finishes. This ensures that any dirty modifications that were flushed to durable storage will have only been made by transactions whose commit timestamp is earlier than Cd.
By keeping track of these commit timestamps, on recovery the DBMS can simply proceed as before, ignoring tuples with commit timestamps between (Cp, Cd). This is easy enough in an MVCC database, since each tuple already has a commit timestamp range in which it is valid. Write-behind logging just adds a DBMS-wide commit timestamp range in which all tuples that fall in it are invalid since they represent writes that weren’t committed before DBMS failure. To limit the growth of these invalidated commit timestamp ranges, background GC processing removes invalid tuples.
The group commit protocol is then as follows. The DBMS checks the DTT for changes and writes all of them to durable storage using the NVM devices
sync primitive. It then writes out a log entry containg Cp and Cd. For long running transactions that span multiple group commits, those commit timestamps are also recorded. After the log record is written to durable storage, the DTT table is cleared of entries that have been committed.
The paper includes an intuition for why Write-behind logging is better for NVM than HDDs or SSDs before them:
With WBL, the DBMS writes out the changes to locations spread across the durable storage device. For example, if a transaction updates tuples stored in two tables, then the DBMS must flush the updates to two locations in the durable storage device. This design works well for NVM as it supports fast random writes. But it is not a good choice for slower devices that incur expensive seeks to handle random writes.
The authors measure both DBMS throughput and recovery time. Both see improvements with Write-Behind Logging versus WAL on NVM.
The benefits of WBL are more prominent for the balanced and write-heavy workloads presented in Figures 14b and 14c. We observe that the NVM-WBL configuration delivers 1.2–1.3× higher throughput than the NVM-WAL configuration because of its lower logging overhead. That is, under WBL the DBMS does not construct as many log records as it does with WAL and therefore it writes less data to durable storage.
1.2-1.3× improvement on throughput is nothing to sneeze at, but improvement in recovery time is staggering. The recovery time for NVM-WAL for 1,000 or 10,000 uncommitted transactions before failure is about 7 seconds and 1 minute respectively. Meanwhile NVM-WBL are both on the order of 100 milliseconds.
This makes sense when we consider that recovery time with ARIES and Write-ahead logging is proportional to the size of the log. In contrast, recovery time for Write-behind logging is almost instant, just adding a small overhead to ignore invalidated tuples during runtime. The DBMS using Write-behind logging doesn’t need to load a snapshot, then redo and undo log records. The current durable state is always “valid”, so no snapshotting needs to be done. The only thing that changes is the state of what commit timestamp ranges are “invalid” and need to be ignored.
NVM is an exciting technology, and although it has been promised for quite some time, it seems that we might actually start seeing it in real datacenters within the next two years. Work like this paper is an exciting reconsideration of a field from first principles when basic building blocks change. I hope we see more of this not just in DBMS research but in other systems as well.