Common Systems Programming Optimizations & Tricks

Today’s blog post is an overview of some common optimization techniques and neat tricks for doing “systems programming” – whatever that means today. We’ll walk through some methods to make your code run faster, be more efficient, and to squeeze just a little more juice from whatever you got.

All of the examples we’ll go over are also on github at paulcavallaro/systems-programming.

False sharing is a fairly well-understood problem in optimizing multi-threaded code on modern SMP systems. The problem has been written about fairly extensively, but the basic idea is that physical memory on a machine isn’t infinitely granular, i.e. you can’t just read a byte.¹ Instead, when you want to read a single byte of memory, the processor will pull in and cache not just that byte, but the surrounding data as well, on the assumption that it will likely be used too. This unit of data that gets read and cached is called a “cache line”, and is essentially the smallest piece of memory that can be accessed.

As of 2019 cache lines are powers of 2, usually in the range of 32 to 256 bytes, with 64 bytes being the most common.

Now, to support multiple processors on a single machine reading and writing from the same memory in a coherent way, only one processor on a machine can have exclusive access to a given cache line.²

False sharing is when you accidentally put two unrelated pieces of data in the same cache line. Now having two processors update separate chunks of data, say counters, will interfere with one another, as each processor attempts to gain exclusive access to the cache line that houses both fields.

The explanation for the name false sharing is that even though these two counters shouldn’t be affecting one another from a correctness standpoint, they are – big air quotes incoming – “falsely sharing” a cache line for no good reason.

One solution is to force the data onto separate cache lines, which you can do in C/C++ by forcing the alignment of the members of a struct/class. In examples/cache-lines.cc we use absl’s ABSL_CACHELINE_ALIGNED macro to achieve this.

To show the effect in practice, we benchmark two different structs of std::atomic<int64> counters, NormalCounters and CacheLineAwareCounters.

// NormalCounters is straight forward naive implementation of a struct of
// counters.
// Note: We also use ABSL_CACHELINE_ALIGNED on the NormalCounters struct, but
// not its members, so that the entire struct will be aligned to a cache line.
// Otherwise the struct might be placed towards the end of a cache line,
// accidentally straddling two cache lines, thereby improving its performance.
struct ABSL_CACHELINE_ALIGNED NormalCounters {
  std::atomic<int64> success{0};
  std::atomic<int64> failure{0};
  std::atomic<int64> okay{0};
  std::atomic<int64> meh{0};
};

// CacheLineAwareCounters forces each counter onto a separate cache line to
// avoid any false sharing between the counters.
// Note: We must use ABSL_CACHELINE_ALIGNED for each member, since we want to
// pad every single counter so it will be forced onto its own separate cache
// line.
struct ABSL_CACHELINE_ALIGNED CacheLineAwareCounters {
  ABSL_CACHELINE_ALIGNED std::atomic<int64> success{0};
  ABSL_CACHELINE_ALIGNED std::atomic<int64> failure{0};
  ABSL_CACHELINE_ALIGNED std::atomic<int64> okay{0};
  ABSL_CACHELINE_ALIGNED std::atomic<int64> meh{0};
};

The benchmark uses either 1, 2, 3, or 4 threads, each bumping a separate atomic counter inside the struct 64K times. Here are the results on a 2013 MacBook Pro with a Haswell processor:

Executing tests from //examples:cache-lines
-----------------------------------------------------------------------------
Cache Line Size: 64
sizeof(NormalCounters) = 64
sizeof(CacheLineAwareCounters) = 256
2019-08-13 01:16:18
Run on (4 X 2800 MHz CPU s)
CPU Caches:
  L1 Data 32K (x2)
  L1 Instruction 32K (x2)
  L2 Unified 262K (x2)
  L3 Unified 4194K (x1)
---------------------------------------------------------------------------
Benchmark                                    Time           CPU Iterations
---------------------------------------------------------------------------
BM_NormalCounters/threads:1                389 us        387 us       1812
BM_NormalCounters/threads:2               1264 us       2517 us        234
BM_NormalCounters/threads:3               1286 us       3718 us        225
BM_NormalCounters/threads:4               1073 us       3660 us        204
BM_CacheLineAwareCounters/threads:1        386 us        385 us       1799
BM_CacheLineAwareCounters/threads:2        200 us        400 us       1658
BM_CacheLineAwareCounters/threads:3        208 us        581 us       1152
BM_CacheLineAwareCounters/threads:4        193 us        721 us       1008

A note on these results: Time measures wall clock time per-thread and CPU measures cpu time per-thread.

We can see that the sizes of the two structs are different, where sizeof(NormalCounters) = 64 and sizeof(CacheLineAwareCounters) = 256. This is because the alignment constraint we put on the individual fields, such that each member is on its own cache line. So instead of an int64 taking up 8 bytes like usual, it’s taking up a full cache line, which is 64 bytes on my machine.

We also see that for the single threaded case, the NormalCounters vs. CacheLineAwareCounters perform indistinguishably. But as we add more threads, the CacheLineAwareCounters perform much better than the naive normal counters that suffer from the false sharing.

Interestingly the wall clock time spent for CacheLineAwareCounters is higher for one thread than multiple threads, which could point to perhaps some subtle benchmarking problem, or maybe a fixed amount of delay that’s getting attributed across more threads now, and so is smaller per-thread.

The Magic Power of 2: Division Is Slowaloo

In current hardware, division is one of the most expensive operations, where expensive here means “longest latency”. Agner Fog’s listing of instruction latencies lists Intel Skylake’s DIV instruction operating on two 64 bit registers having a latency of 35-88 cycles, compared to an ADD instruction operating on the same two 64 bit registers having a latency of 1 cycle. So it would behoove us to avoid division where some other set of operations would work.

One place where division is commonly used besides for actually doing division, is in the modulo % operator. And a common place for the modulo operator is in hash tables: to go from a hash to a bucket we compute HASH % TABLE_SIZE. Modulo is used even more heavily in open-addressing schemes since we’re constantly remapping values back into the hash table bucket space.

So how do we go from using modulo to something else? Bit twiddling and the Magic Power of 2!

First, let me give away the answer: We’re going to force all of our hash tables to be a size that’s a power of 2.

We’ll be able to take advantage of this property for replacing division with faster bit twiddling. Also, this property is easy to maintain since we double the size of our hash table whenever we grow to amoritize the cost of rehashing, so the size will stay a power of 2 as we grow.

Now, we were using division – or modulo – for the purpose of mapping any hash value to a bucket index in our hash table. Bucket indices must be strictly less than our size, and this mapping should not lose any entropy from the hash value.

Instead of using division to do this, we’ll use a bitmask that will “mask” away all set bits except those that are strictly less than our power of 2. This way we keep all the entropy in the least significant bits, just like modulo, but it’s much faster – Agner Fog lists it at 1 cycle latency on the same Intel Skylake architecture.

As a quick refresher on bit twiddling and to explain how we’ll choose our bitmask, let’s look at some bit patterns.

Since numbers are represented in binary, we know that every power of 2 number N only has one bit set. For example:

Decimal |    Binary
    2   | 00 00 00 10
    8   | 00 00 10 00
   32   | 00 10 00 00
  128   | 10 00 00 00

And that means all N - 1’s have every less significant bit than log2(N) set. For example:

Decimal |    Binary
    1   | 00 00 00 01
    7   | 00 00 01 11
   31   | 00 01 11 11
  127   | 01 11 11 11

So in order to replace our modulo operator in the HASH % N calculation, we instead calculate HASH & (N - 1) using bitwise AND. This will only keep the set bits that are less significant than our log_2(N) bit, mapping any HASH value onto a number in the interval [0, N). We can even cache this bitmask if we want so we don’t have to recompute it in the future.

To show how much faster this bitmask trick is than using normal modulo operator, I’ve written a small benchmark to measure executing 1M modulo operations, versus executing 1M bitmasks.

Executing tests from //examples:power-of-two
-----------------------------------------------------------------------------
2019-08-13 02:24:03
Run on (4 X 2800 MHz CPU s)
CPU Caches:
  L1 Data 32K (x2)
  L1 Instruction 32K (x2)
  L2 Unified 262K (x2)
  L3 Unified 4194K (x1)
--------------------------------------------------------
Benchmark                 Time           CPU Iterations
--------------------------------------------------------
BM_Mod                 9261 us       9234 us         75
BM_BitMask              325 us        324 us       1984

Here we can see how the DIV instruction used by the modulo operator is on the order of 28x slower, near the 35x slow down predicted by Agner Fog’s latency numbers.

Since this trick is easy to do, and provides a nice win, it’s used by many high-performance hash-tables, like absl’s “Swiss Tables” flat_hash_set and flat_hash_map and ConcurrencyKit’s ck_ht_map.

Repurposing Top Bits

Oftentimes you want to store a bit or two of extra information along with a pointer – in fact it’s so common Wikipedia has an article about it. One way to do this is to take advantage that on many 64-bit systems, such as linux, virtual memory addresses are only 48 bits wide, even though we use 8 bytes to store them.

This means, we can just store any old crap we want to in the top 16 bits by just masking it out whenever we want to actually dereference it. Here’s some example C++ code of using the top bit of a pointer to store whether the underlying data is “dirty” or not.

constexpr uintptr_t kDirtyBit = 0x8000000000000000;
constexpr uintptr_t kPtrMask = 0x7fffffffffffffff;
static_assert(sizeof(uintptr_t) == 8, "Only works on 64-bit systems");

template <typename T>
uintptr_t MarkDirty(T* t) {
  return reinterpret_cast<uintptr_t>(t) | kDirtyBit;
}

template <typename T>
T* GetPointer(uintptr_t t) {
  return reinterpret_cast<T*>(t & kPtrMask);
}

Interestingly though, since this is a feature of Linux’s memory management/virtual address space, it is subject to change – and it actually has!

LWN reported on the patchset in 2017 implementing five-level page tables in order to support larger amounts of addressable memory. The change, if enabled, would move Linux from 48-bit virtual addresses to 57-bit virtual addresses, increasing virtual address space from 256 TiB to 128 PiB – which ought to be enough for everyone 😀.

Part of why the change couldn’t be enabled by default is because various high performance programs, notably various JavaScript engines and LuaJIT, use this repurposing trick to pack some extra data into pointers.

Lock Striping

Locks can be used for mutual exclusion when you want to have multiple threads access shared data exclusively. The downside though is if the shared data is frequently accessed and the critical section is non-trivial, your threads could spend most of their time contending on the lock, instead of actually doing work.

One common way of solving this problem, is introducing more locks. Wait – what?

Well, instead of one lock that guards all of the data, instead you have many locks responsible for just a piece of the data. In this way we shard the data into separate independent buckets that don’t contend with each other. Assuming the data access is uniform-ish, increasing sharding of the data reduces the number of threads contending on a lock proportionally.

Below is a small C++ example of two implementations of thread-safe hash set. The first implementation, ThreadSafeHashSet, just uses a single lock to guard a single underlying absl::flat_hash_set. The second implementation LockStripedHashSet has N separate locks, guarding N separate abs::flat_hash_sets.

// Simple thread-safe single-lock hash set
class ThreadSafeHashSet {
 public:
  void Insert(uint64 i) {
    absl::MutexLock lock(&mu_);
    hash_set_.insert(i);
  }

  bool Contains(uint64 i) {
    absl::MutexLock lock(&mu_);
    return hash_set_.contains(i);
  }

 private:
  absl::Mutex mu_;
  absl::flat_hash_set<uint64> hash_set_;
};

// Chunk the data into `kNumChunks` separate hash sets, guarded by separate
// locks.
template <size_t kNumChunks>
class LockStripedHashSet {
 public:
  void Insert(uint64 i) {
    // Mod the data to calculate which hash_set/lock to use
    const size_t idx = i % kNumChunks;
    absl::MutexLock lock(&mu_[idx]);
    hash_set_[idx].insert(i);
  }

  bool Contains(uint64 i) {
    const size_t idx = i % kNumChunks;
    absl::MutexLock lock(&mu_[idx]);
    return hash_set_[idx].contains(i);
  }

 private:
  std::array<absl::Mutex, kNumChunks> mu_;
  std::array<absl::flat_hash_set<uint64>, kNumChunks> hash_set_;
};

To illustrate the benefits of lock striping, we benchmark the performance of the two thread-safe hash sets in the presence of many threads, each inserting 1M items. For the LockStripedHashSet we try with 4 chunks and with 8 chunks. The results are below:

Executing tests from //examples:lock-striping
-----------------------------------------------------------------------------
2019-08-24 22:24:37
Run on (4 X 2800 MHz CPU s)
CPU Caches:
  L1 Data 32K (x2)
  L1 Instruction 32K (x2)
  L2 Unified 262K (x2)
  L3 Unified 4194K (x1)
--------------------------------------------------------------------------
Benchmark                                   Time           CPU Iterations
--------------------------------------------------------------------------
BM_SingleLock/threads:1                    65 ms         65 ms         11
BM_SingleLock/threads:2                   140 ms        254 ms          2
BM_SingleLock/threads:3                   140 ms        332 ms          3
BM_SingleLock/threads:4                   142 ms        405 ms          4
BM_LockStriping_4_Chunks/threads:1         71 ms         69 ms          9
BM_LockStriping_4_Chunks/threads:2         90 ms        178 ms          4
BM_LockStriping_4_Chunks/threads:3         89 ms        248 ms          3
BM_LockStriping_4_Chunks/threads:4         82 ms        299 ms          4
BM_LockStriping_8_Chunks/threads:1         70 ms         69 ms         10
BM_LockStriping_8_Chunks/threads:2         74 ms        143 ms          4
BM_LockStriping_8_Chunks/threads:3         71 ms        198 ms          3
BM_LockStriping_8_Chunks/threads:4         60 ms        200 ms          4

Again, Time measures wall clock time per-thread and CPU measures cpu time per-thread. Also note that since my machine only has 4 logical cores, this test only goes up to 4 threads, since anything beyond that doesn’t actually introduce any extra contention.

We can see that in the single-threaded case the LockStripedHashSet, no matter the chunking, performs slightly worse on both wall clock and CPU time than the simple ThreadSafeHashSet.

However, as more threads are added, increasing the contention on the locks, the LockStripedHashSet performs much better, and the 8 chunk version outperforms the 4 chunk version at higher thread counts.

While lock-striping can help alleviate contention on locks, it does have the downside of increasing storage overhead for the locks. In our example the overhead of 7 extra locks and extra absl::flat_hash_set bookkeeping is miniscule for a single instance in our benchmark, but if you replaced all the hash sets in an application with an 8-way striped thread-safe hash set, you could bloat its memory footprint by a significant amount.

Conclusion

While this is nowhere near an exhaustive list of the most common systems programming tricks, hopefully it whets your appetite to learn more, provides some tools for improving performance in your own applications, or at least make it easier to understand why performance sensitive code is doing what it’s doing.

Thanks to Mason Simon and Ray Yang for reviewing drafts of this post.

For the most part this is true, but there are instructions, like MOVNTDQA, to read from memory and skip caching it, but you’ll still end up reading the whole cache line and causing the same contention. ↩︎
If you’re interested in cache-coherence, most common multi-processor cache coherence protocols are based on the MESI protocol ↩︎