tcmalloc's Temeraire: A Hugepage-Aware Allocator

Today’s paper is Beyond malloc efficiency to fleet efficiency: a hugepage-aware memory allocator by A.H. Hunter, Chris Kennelly, Paul Turner, Darryl Gove, Tipp Moseley, and Parthasarathy Ranganathan from OSDI ‘21.

This paper is a great read, not so much for its key results – a hugepage-aware allocator that reduces TLB misses and memory usage – as for its meta-lessons: measure the entire system, invest in tooling, build a process to experiment confidently, and profile globally, act locally.

Of course the paper does contain lots of interesting tidbits about Temeraire, the hugepage-aware memory allocator added to tcmalloc, so we’ll cover those too. But my key takeaways are all the higher-order lessons from the paper.

Fantastic Huge Pages and Where To Find Them

All of that sweet, sweet memory your programs are reading and writing is virtual memory, backed by your operating system with actual physical memory. Your processor and operating system work together to ensure that when a process accesses this virtual memory, the access gets translated to an actual physical memory address, and these translations get cached in the translation lookaside buffer, or TLB. Virtual memory is broken up into pages, historically 4 KiB in size, and the TLB caches the translation from a given virtual page to a physical page.

Huge pages are a processor feature, requiring OS support, that allows physical and virtual memory pages to be larger than 4 KiB, with 2 MiB being the usual hugepage size. The advantage of the larger page size is that we now need only one page mapping, and thus one TLB entry, for 2 MiB of memory, where previously we needed 512 entries at a 4 KiB page size. Needing fewer entries to map the same amount of memory reduces cache pressure on the TLB, reducing TLB misses, and can provide large performance improvements – on the order of 10% for already highly optimized systems.1

Transparent hugepages, or THP, is a Linux kernel feature that builds on hugepage support and allows right-sized, i.e. 2 MiB-aligned, anonymous memory mappings to be automatically backed by hugepages. This is the kernel feature that Temeraire leverages: the allocator is designed to ensure most allocations can be backed by these hugepages. If Temeraire could accomplish that transparently for every application in the fleet, that would provide a few percent increase (or decrease) on a very large number – a worthy goal.

However it isn’t as easy as allocating aligned 2 MiB pages from the OS and calling it a day. The difficulty with using huge pages is that – well, they’re huge. If you have a single 8-byte allocation on a 2 MiB hugepage, then you would be wasting 99.999% of your memory. Potentially wasting lots of memory is one big problem, and there are others as well; we’ll walk through how Temeraire deals with them all.

Profiling a Datacenter

Before we jump into Temeraire’s solutions, let’s talk about evaluating success. Temeraire does not focus on winning allocator benchmarks by reducing the number of cycles spent in allocator code; it aims to improve overall application performance by backing more memory with hugepages. Since its aims are not well captured by allocator benchmarks, it would be useful to know how successful Temeraire is at its actual goal. You could imagine just measuring how much memory is backed by hugepages, but that is only a proxy for what we want to measure (application performance), and can lead to misleading results. Ideally, each application would have a high-quality benchmark that is highly representative of its performance in production; then you could run this suite of benchmarks, measure Temeraire’s performance uplift, and weigh that cost improvement against the cost of the extra memory Temeraire uses.

However, most applications don’t have these kinds of benchmarks, and creating one for every application is a massive amount of work that just isn’t feasible. Even if you created highly representative benchmarks (and kept them up to date!) for the top 10 applications, you’d still be missing the fat long tail. What you’d really like is accurate CPU profiles collected continuously from production; then you could just run a massive A/B experiment with Temeraire enabled/disabled and measure the changes against actual production workloads.

Well, Google-Wide Profiling to the rescue. Google-Wide Profiling, or GWP, automatically and continuously collects CPU profiles for every application in a Google datacenter, gathering detailed statistics like TLB misses and Instructions Per Cycle.

GWP allows Temeraire engineers to measure how changes to tcmalloc affect TLB misses and IPC of every application in the fleet2. Now Temeraire engineers can actually answer the question that really matters: “How does my patch change actual production performance?” Instead of just depending on microbenchmarks of the allocator itself to measure performance, engineers can measure overall fleet-wide application performance, which leads to different decisions and tradeoffs.

The authors found that changes that are perf wins or optimizations for overall production performance can show up as allocator perf losses in a microbenchmark setting. For example, tcmalloc could decide to spend more time in the allocation code trying harder to find huge pages to back memory with, which hurts microbenchmarks, but could improve overall application performance which microbenchmarks don’t capture.

This kind of mismatch between the measurement that’s easier to get versus the actual property you care about is always a problem in analysis. It reminds me of the old joke about a drunk searching for his keys under the lamp because the light’s better there.3

We’d all prefer not to be the fool looking for the keys where the light’s good, but sometimes it’s the best we have. We just have to keep in mind that that’s what we’re doing, and not lose sight of the true measurement we actually care about.

For Temeraire, this kind of mismatch between benchmark performance and overall application performance is extremely relevant. The point of Temeraire is to improve overall application performance by having more memory backed by hugepages, but backing more memory with hugepages may take more work in the allocator code itself which looks like perf losses in microbenchmarks.

Having GWP as a tool allowed Temeraire engineers to measure overall performance and not depend solely on microbenchmarks. The authors do point out that microbenchmarks are still very useful for iterating quickly and trying out new ideas. However, microbenchmark results alone can’t be trusted; rather, a combination of GWP and an A/B experiment framework the authors built for experimenting with allocator settings in production provides the best signal.

A/B Experimentation For System Software

One of the key tools in enabling the development and deployment of Temeraire was the large-scale A/B experiment framework the authors created. Temeraire is a system full of heuristics that require tuning on actual workloads, and being able to experiment on actual production workloads at large-scale helped to both iterate on changes and confirm improvements.

While the practice of A/B experimentation is certainly not new, its embrace within system software is, I think, really just beginning. I like to think of profile-guided optimization (PGO) as a kind of A/B experimentation for compilers, and of OtterTune by Andy Pavlo et al. as another example. The world of system software is increasingly driven by many different heuristics, and a large part of the job is tuning the system. I believe that automating, at scale, the tuning of the system to its workload is a trend that will keep growing.

The Temeraire authors built an A/B experiment framework to allow testing allocator parameter changes across the entire fleet, without other teams having to take action. While the authors point out how invaluable this experimentation tool was, they also emphasize that it was the last step in the development process. They built different tools at different points on the accuracy/speed spectrum to enable testing. On one end is the A/B experiment framework, which is heavy-weight but produces extremely trustworthy and accurate results. On the other end is an empirical driver4 that generates calls to malloc and free as a Poisson process. The process’s probabilities are based on telemetry collected by GWP on the size distributions of both live heap allocations and calls to malloc itself, so that the resulting empirical driver is statistically similar to the fleet.

The empirical driver allowed tcmalloc engineers to get feedback on potential changes quickly, on the order of minutes. While the results were less accurate than a full A/B experiment, it let engineers quickly validate promising ideas and changes, and iterate quickly to refine them. Much writing has been devoted to shortening feedback loops, but I find most of it is aimed at product engineers, while the system software community still idolizes the Feynman algorithm of problem solving. I’m hopeful that the system software community will continue to invest in tooling to shorten feedback loops.

Temeraire Allocator Heuristics

Temeraire is broken up into four components: HugeFiller, HugeCache, HugeAllocator, and HugeRegion. HugeFiller is responsible for handling all allocations less than 1 MiB and many under 2 MiB. HugeCache caches fully-empty hugepages and handles mid-sized allocations larger than one hugepage but smaller than 1 GiB. HugeAllocator is the simplest component: it is responsible for getting hugepage-aligned virtual addresses from the OS (via mmap w/ MAP_ANON) and for storing unused and unbacked virtual addresses (after a hugepage has been returned to the OS via madvise w/ MADV_DONTNEED). HugeRegion is a specialized component that deals with pathological cases where programs make 1–2 MiB allocations and generate too much “slack”5 for smaller allocations to use.

The pseudo-code algorithm they present is:


Span New(N) {
  // Slack is too small to matter
  if (N >= 1 GiB) return HugeCache.New(N);
  // Help bin-pack a single hugepage
  if (N <= 1 MiB) return HugeFiller.New(N);
  
  if (N < 2 MiB) {
    // If we can reuse empty space, do so
    Span s = HugeFiller.TryAllocate(N);
    if (s != NULL) return s;
  }
  
  // If we have a region, use it
  Span s = HugeRegion.TryAllocate(N);
  if (s != NULL) return s;

  // We need a new hugepage.
  s = HugeCache.New(N);
  HugeFiller.DonateTail(s);
  return s;
}

Focusing on the workhorse HugeFiller component, the authors’ goal was to keep each hugepage either maximally full, to waste as little space as possible, or as empty as possible, so that it remains a candidate to have all its allocations freed and release the entire hugepage. When choosing which hugepage to allocate from, the metrics they consider are: the longest free range L (measured in regular 4 KiB pages) in the hugepage, the number of allocations A in the page, and the total number of used 4 KiB pages U within the hugepage. The heuristic they settled on for an allocation of size K is: (1) consider only those hugepages that can fit the allocation (L >= K), (2) choose the smallest L, (3) break ties by choosing the page with the largest number of allocations A.

The reason for preferring A over U as a metric is based on the “Radioactive Decay” allocation model, where each allocation is equally likely to become free with some probability p.6 Under this model, the total number of used pages U is not nearly as important as the total number of allocations A if you want to eventually be able to free the whole hugepage. Any given hugepage with A allocations will become fully free with probability p ** A, which shrinks rapidly as A grows, so A quickly dominates any other consideration.

Results

The results in the paper are quite good: requests per second up 5%, RAM usage down 8%, and a 50% reduction in page table walks, on average across a dozen applications. These are great numbers! But I won’t talk about them any more because they’re not that interesting to me.

Telemetry Flywheel

I mentioned earlier that the authors built an empirical driver and tracing system for tcmalloc that would sample all allocations and frees globally to build a model on which to base a test harness that would statistically resemble the behavior of the fleet. This is probably one of the coolest pieces of software I’ve read about.

In addition to being extremely helpful for designing and iterating on Temeraire, the authors note that there were other unexpected benefits of the telemetry:

Beyond producing numbers motivating and evaluating our work, our fleetwide profiler is itself a powerful tool for designing allocators. It reveals patterns of allocation we can use to derive heuristics, it allows validation of hypotheses about typical (or even possible) behavior, it helps identify which patterns we can safely ignore as unimportant and which we must optimize.

One of my deeply held software engineering beliefs is that systematic investments in telemetry yield powerful results. It’s good to see this belief reinforced by these authors as well. Perhaps one final takeaway is that the quality of the telemetry and tooling itself matters greatly. A one-off hack to gather some empirical data may have allowed Temeraire to be built, but the ongoing effort to keep the machinery working and ensure constantly updated collections is what elevates it to an indispensable tool.

Final Remarks

This was a very fun paper to read. It was well written, technically interesting, and most importantly confirmed all my pre-existing beliefs 😄.


  1. 10% performance improvements don’t sound like much to most people when you read blog posts around getting 10x perf wins by rewriting in Rust/Go/C or whatever. But put in context of Google spending $18.3 billion in the first 9 months of 2021 on datacenters and offices, a 10% perf improvement seems pretty good! ↩︎

  2. Interestingly, the authors found that IPC can be too abstract a metric, and often does a poor job of predicting the actual useful throughput of an application. For Temeraire, they found that most systems had larger improvements in requests-per-second within SLO than just the IPC improvement would suggest. ↩︎

  3. The full joke is: A policeman is on patrol and walks by a drunk crouching underneath a lamp post and asks what he’s doing. The drunk replies that he’s looking for his keys, to which the policeman offers to help and asks him where he may have dropped them. The drunk points across the street and says he dropped them over there, but the light’s better here. ↩︎

  4. The authors have uploaded the empirical driver to GitHub so that it can be used by others. ↩︎

  5. “slack” is a term of art that the paper defines: “Slack is the gap between an allocation’s requested size and the next whole hugepage. Virtual address space allocated from the OS is unbacked without reserving physical memory. On use, it is backed, mapped by the OS with physical memory. We may release memory to the OS once again making it unbacked.” ↩︎

  6. The authors cite the paper Generational Garbage Collection and the Radioactive Decay Model by William D. Clinger and Lars T. Hansen as the genesis of this model, which was an interesting connection. I always thought of the “Lindy Effect” as the insight behind generational GC – the expected total lifetime of any allocation is proportional to its current lifespan, i.e., allocations that have lived a long time are more likely to continue living than younger allocations. The paper actually shows that while that is a useful heuristic that generational GCs take advantage of, even under the Radioactive Decay Model generational GC still has other advantages and outperforms non-generational GCs. ↩︎