Optimizing Function Placement with Call-Chain Clustering

Today’s paper is from Facebook’s HHVM team: Optimizing Function Placement for Large-Scale Data-Center Applications. The paper describes two ways to improve code layout, aimed specifically at very large applications.

The first method is to optimize function layout, i.e., where functions, after inlining is performed, are placed inside the binary. If function A calls function B many times, we’d want B to be placed near A so that it’s more likely that the instructions for B will already be in the instruction cache.

The second method is using huge pages to back hot functions. The goal here is to reduce the number of TLB misses by mapping more code to a single TLB entry.

The authors also describe the methodology for gathering profiles for large scale application deployments, with an emphasis on practicality.

Sampling at Scale

In order to lay out functions based on call frequency, you must first invent the universe… then you need to figure out which functions call each other. Traditionally this has been done by having the compiler build a special instrumented binary and then running a benchmark against it. The problem with this approach in most modern large-scale applications is that (1) the instrumented binaries are too slow to run against real-world traffic, (2) these applications depend on lots of external services, and (3) keeping a representative benchmark up to date is a lot of work. Problems 2 and 3 compound to make gathering call frequency counts from instrumented binaries quite a bit of drudgery.

What’s even worse is that most large-scale applications have wildly different performance characteristics based on the load applied. If your benchmark strays from what’s actually happening in production, you could easily hurt, rather than improve, your performance.

The approach taken by the paper is to sample call frequency information directly from production machines. They use two methods to sample call frequency. The first uses Intel’s Last Branch Record (LBR) performance counter. The second uses stack sampling, which is cheap as long as the program hasn’t omitted frame pointers. In the authors’ experience, both methods provide roughly the same accuracy, and both incur minimal overhead. The approach is similar to previous work done in Google’s AutoFDO.
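
Whichever mechanism is used, the raw samples ultimately boil down to (caller, callee) pairs that get aggregated into the weighted call graph the layout algorithms consume. Here is a minimal C++ sketch of that aggregation step; it is my own illustration, and the sample format is hypothetical:

```cpp
#include <cstdio>
#include <map>
#include <string>
#include <utility>
#include <vector>

// A single sampled call edge, e.g. decoded from an LBR record or from the
// top two frames of a sampled stack. The format here is hypothetical.
struct CallSample { std::string caller, callee; };

// Aggregate raw samples into a weighted call graph: edge -> sample count.
std::map<std::pair<std::string, std::string>, long>
buildCallGraph(const std::vector<CallSample>& samples) {
    std::map<std::pair<std::string, std::string>, long> edges;
    for (const CallSample& s : samples) edges[{s.caller, s.callee}]++;
    return edges;
}

int main() {
    std::vector<CallSample> samples = {{"A", "B"}, {"A", "B"}, {"C", "D"}};
    for (const auto& [edge, count] : buildCallGraph(samples))
        std::printf("%s -> %s: %ld\n", edge.first.c_str(), edge.second.c_str(), count);
}
```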

The authors stress the practicality of the approach. Developers do not need to change their deployments, nor maintain representative benchmarks. The optimizations also require minimal work in the compilation process, only needing relatively straightforward linker scripting to affect function layout.

Pettis-Hansen Clustering

The previous state-of-the-art algorithm for function placement was invented in 1990 by Pettis & Hansen. The Pettis-Hansen heuristic uses the weighted call graph, but the graph is undirected. The authors improve on it with the Call-Chain Clustering (C3) algorithm, which operates over the directed call graph.

To talk about the differences between the two algorithms, let’s consider this weighted, directed call graph from the paper:

Directed, weighted call graph with four nodes A, B, C, and D. A points to B with weight 100, A points to C with weight 40, B points to C with weight 30, and C points to D with weight 90

Each node represents a function, and each weighted, directed edge represents the number of calls made by one function to another. In the above graph, function A calls function B 100 times.

Both algorithms want to create clusters from this graph that represent groups of functions that we will place together in the binary.

Pettis-Hansen processes the edges of the graph in descending order of weight, greedily joining together the nodes/clusters connected by each edge. When nodes/clusters are merged, their external edges are merged and their weights summed. When merging two clusters A and B, the heuristic keeps the original graph’s weights and considers reversing the list of functions inside a cluster so that the two functions connected by the heaviest original edge between A and B end up as close together as possible. This continues until there are no edges left and the graph is one cluster with a corresponding list of functions.
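
To make this concrete, here is a minimal C++ sketch of the idea (my own illustration, not code from the paper). It simplifies one detail: it walks the original edges in descending weight order rather than re-sorting merged edge weights after each join, but it keeps the reversal check when two clusters are combined:

```cpp
#include <algorithm>
#include <cstdio>
#include <map>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

struct Edge { std::string from, to; long weight; };

// Simplified Pettis-Hansen sketch: merge the clusters connected by the
// heaviest remaining edge. When two clusters are joined, try both
// orientations of the first cluster and keep the one whose boundary pair is
// connected by the heavier original (undirected) edge.
std::vector<std::string> pettisHansen(std::vector<Edge> edges) {
    std::map<std::pair<std::string, std::string>, long> w;  // undirected original weights
    for (const Edge& e : edges) {
        w[{e.from, e.to}] += e.weight;
        w[{e.to, e.from}] += e.weight;
    }
    std::sort(edges.begin(), edges.end(),
              [](const Edge& a, const Edge& b) { return a.weight > b.weight; });

    std::unordered_map<std::string, int> clusterOf;   // function -> cluster id
    std::vector<std::vector<std::string>> clusters;   // cluster id -> ordered functions
    auto clusterFor = [&](const std::string& f) {
        if (!clusterOf.count(f)) { clusterOf[f] = (int)clusters.size(); clusters.push_back({f}); }
        return clusterOf[f];
    };

    for (const Edge& e : edges) {
        int a = clusterFor(e.from), b = clusterFor(e.to);
        if (a == b) continue;                          // already in the same cluster
        auto& ca = clusters[a];
        auto& cb = clusters[b];
        // Reverse cluster a if that puts a heavier original edge across the boundary.
        if (w[{ca.front(), cb.front()}] > w[{ca.back(), cb.front()}])
            std::reverse(ca.begin(), ca.end());
        for (const std::string& f : cb) { ca.push_back(f); clusterOf[f] = a; }
        cb.clear();
    }

    std::vector<std::string> layout;
    for (const auto& c : clusters) layout.insert(layout.end(), c.begin(), c.end());
    return layout;
}

int main() {
    // The example graph from the post: A->B (100), A->C (40), B->C (30), C->D (90).
    std::vector<Edge> edges = {{"A","B",100}, {"A","C",40}, {"B","C",30}, {"C","D",90}};
    for (const auto& f : pettisHansen(edges)) std::printf("%s ", f.c_str());  // prints: B A C D
    std::printf("\n");
}
```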

Running Pettis-Hansen against the example graph yields the following progression:

An undirected graph with three nodes A-B, C, and D. A-B connects to C with weight 70. C connects to D with weight 90. This is the resulting undirected graph after step 1. An undirected graph with two nodes A-B and C-D. A-B connects to C-D with weight 70. This is the resulting undirected graph after step 2. An undirected graph with one node B-A-C-D. This is the final undirected graph.

C3 – Call-Chain Clustering

C3 uses a directed call graph to make its clustering decisions. The reason direction matters is that if A calls B, and B never calls A, you would like to place A before B so that the “distance” of the call is shorter. Assuming calls on average sit in the middle of the function body, when A precedes B the call distance is |A| / 2, where |A| is the length of the code in function A. If A came after B, the distance would instead be |B| + |A| / 2, which is clearly worse. Using a directed call graph allows C3 to take the caller vs. callee relationship into account.

C3 processes the nodes of the graph in descending order of hotness, i.e. the sum of the weights of their incoming edges. It attempts to append each node to the end of the cluster of its most common caller. There is no special merging or combining of edges. The only thing that prevents appending a node is a size limit on the cluster, called the merging threshold, which is set to a single page: once a cluster grows beyond a page, its functions can no longer all share a cache line or a memory page anyway.

Once there are no more possible merges, C3 sorts the final clusters by their “density”, defined as density(c) = time(c) / size(c), where time(c) is approximated by the call frequency of c. By arranging the functions based on their density, it can pack more hot code into fewer pages and lower I-TLB pressure. Instead of doing any bin packing, C3 simply places all of the clusters contiguously in decreasing order of density.
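
Here is a minimal C++ sketch of the whole C3 pipeline described above, from hotness-ordered clustering with the single-page merging threshold to the final density sort. It is my own illustration: the data structures, tie-breaking, and the made-up function sizes in main are assumptions, not details from the paper:

```cpp
#include <algorithm>
#include <cstdio>
#include <string>
#include <unordered_map>
#include <vector>

struct Call { std::string caller, callee; long count; };

const long kPageSize = 4096;   // the merging threshold from the post: one page

struct Cluster {
    std::vector<std::string> funcs;
    long size = 0;     // total code bytes
    long samples = 0;  // approximates time(c): total calls into the cluster
    double density() const { return size ? double(samples) / size : 0.0; }
};

std::vector<Cluster> c3(const std::vector<Call>& calls,
                        const std::unordered_map<std::string, long>& funcSize) {
    // Hotness of a function = sum of the weights of its incoming edges.
    std::unordered_map<std::string, long> hotness;
    for (const Call& c : calls) hotness[c.callee] += c.count;

    // Start with one singleton cluster per function.
    std::vector<Cluster> clusters;
    std::unordered_map<std::string, int> clusterOf;
    for (const auto& [f, size] : funcSize) {
        clusterOf[f] = (int)clusters.size();
        clusters.push_back({{f}, size, hotness[f]});
    }

    // Process functions in descending order of hotness.
    std::vector<std::string> order;
    for (const auto& [f, h] : hotness) order.push_back(f);
    std::sort(order.begin(), order.end(),
              [&](const std::string& a, const std::string& b) { return hotness[a] > hotness[b]; });

    for (const std::string& f : order) {
        // Find f's most common caller.
        std::string best; long bestCount = 0;
        for (const Call& c : calls)
            if (c.callee == f && c.count > bestCount) { best = c.caller; bestCount = c.count; }
        if (best.empty()) continue;

        int src = clusterOf[f], dst = clusterOf[best];
        if (src == dst) continue;
        if (clusters[dst].size + clusters[src].size > kPageSize) continue;  // merging threshold

        // Append f's cluster to the end of its caller's cluster.
        for (const std::string& g : clusters[src].funcs) clusterOf[g] = dst;
        auto& d = clusters[dst];
        auto& s = clusters[src];
        d.funcs.insert(d.funcs.end(), s.funcs.begin(), s.funcs.end());
        d.size += s.size;
        d.samples += s.samples;
        s = Cluster{};
    }

    // Drop the emptied clusters and sort the rest by decreasing density.
    clusters.erase(std::remove_if(clusters.begin(), clusters.end(),
                                  [](const Cluster& c) { return c.funcs.empty(); }),
                   clusters.end());
    std::sort(clusters.begin(), clusters.end(),
              [](const Cluster& a, const Cluster& b) { return a.density() > b.density(); });
    return clusters;
}

int main() {
    // Example graph from the post, with made-up 1 KiB function sizes.
    std::vector<Call> calls = {{"A","B",100}, {"A","C",40}, {"B","C",30}, {"C","D",90}};
    std::unordered_map<std::string, long> sizes = {{"A",1024}, {"B",1024}, {"C",1024}, {"D",1024}};
    for (const Cluster& c : c3(calls, sizes)) {
        for (const auto& f : c.funcs) std::printf("%s ", f.c_str());  // prints: A B C D
        std::printf("| density %.3f\n", c.density());
    }
}
```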

Going back to our example graph:

Directed, weighted call graph with four nodes A, B, C, and D. A points to B with weight 100, A points to C with weight 40, B points to C with weight 30, and C points to D with weight 90

Running C3 yields the following:

A directed graph with three clusters, one consisting of node A and node B, one cluster of just node C, and one cluster of just node D. This is the result after the first cluster merge. A directed graph with two clusters of node A and node B, and a second cluster of node C and node D. This is the result after the second cluster merge. A directed graph with one cluster of node A, B, C, and D. This is the final result.

Comparing Pettis-Hansen against C3, the final layouts are B; A; C; D and A; B; C; D respectively. Assuming all functions have the same length |f|, the weighted call distances are:

Pettis-Hansen: 260 |f|
  100 * 1.5|f| + 40 * 0.5|f| + 30 * 1.5|f| + 90 * 0.5|f| = 260 |f|
C3: 170 |f|
  100 * 0.5|f| + 40 * 1.5|f| + 30 * 0.5|f| + 90 * 0.5|f| = 170 |f|

Hence C3 results in a 35% reduction in call distance, and a reduction in call distance means that we’ll be more likely to have the called function code in either the instruction caches, or in a page already mapped by an existing TLB entry.
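
To double-check the arithmetic, here is a tiny helper (my own, not from the paper) that computes the weighted call distance of a layout under the same assumptions: unit-length functions and call sites in the middle of the caller’s body:

```cpp
#include <cmath>
#include <cstdio>
#include <string>
#include <unordered_map>
#include <vector>

struct Call { std::string caller, callee; double count; };

// Weighted call distance of a layout, assuming every function has unit
// length |f| = 1 and call sites sit in the middle of the caller's body.
double weightedCallDistance(const std::vector<std::string>& layout,
                            const std::vector<Call>& calls) {
    std::unordered_map<std::string, double> start;   // offset of each function
    for (size_t i = 0; i < layout.size(); ++i) start[layout[i]] = double(i);
    double total = 0;
    for (const Call& c : calls) {
        double callSite = start[c.caller] + 0.5;      // middle of the caller
        double entry = start[c.callee];               // entry point of the callee
        total += c.count * std::abs(callSite - entry);
    }
    return total;
}

int main() {
    std::vector<Call> calls = {{"A","B",100}, {"A","C",40}, {"B","C",30}, {"C","D",90}};
    std::printf("PH: %.0f |f|\n", weightedCallDistance({"B","A","C","D"}, calls));  // 260 |f|
    std::printf("C3: %.0f |f|\n", weightedCallDistance({"A","B","C","D"}, calls));  // 170 |f|
}
```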

Huge Page Backed Code

Most modern CPUs and operating systems have support for huge pages: instead of a single memory page being 4 KiB, we can have a huge page that is 2 MiB or 1 GiB. The reason to make pages larger is to reduce the number of I-TLB misses. The CPU needs to translate the virtual memory addresses used by a process into physical memory addresses, and it caches those translations in the Translation Lookaside Buffer (TLB). With huge pages, we can have 2 MiB / 4 KiB = 512 times(!) more code mapped to a single I-TLB entry.

The paper points out that indiscriminately using huge pages for all program text can cause regressions, since the CPU has a limited number of huge-page I-TLB entries. For example, on Intel’s Ivy Bridge microarchitecture there are only 8 huge-page I-TLB entries. If we mapped the entire binary onto huge pages, we’d actually end up causing more I-TLB pressure than we relieve.

Since the authors wanted to judiciously back only the hottest code with huge pages, they could not use existing libraries such as libhugetlbfs, which uses huge pages for the entire binary. Instead they did some gnarly systems hacking to implement huge page mapping in userspace.

At startup, the application copies the hot function section to scratch space, and unmaps that address range. That range is then re-mapped using anonymous huge pages, and the text is copied back in place. This technique allows the application to map the hottest functions using a small number of huge pages.
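
Conceptually, the startup dance looks something like the sketch below. This is a simplified illustration of the technique, not Facebook’s actual implementation: error handling is omitted, the section-boundary symbols are hypothetical linker-script names, and the remapping code itself has to live outside the range it rewrites:

```cpp
#include <cstring>
#include <sys/mman.h>
#include <vector>

// Hypothetical symbols marking the hot-text range; the real names depend on
// the linker script used to group the hot functions together.
extern "C" char __hot_text_start[], __hot_text_end[];

constexpr size_t kHugePageSize = 2 * 1024 * 1024;  // 2 MiB

// Re-back the hot-text range with anonymous huge pages. Assumes the linker
// script aligns and pads the section to huge-page boundaries, that no other
// thread is running yet, and that this function itself is placed outside the
// range being remapped. Error handling is omitted.
void remapHotTextToHugePages() {
    char* start = __hot_text_start;
    size_t len = static_cast<size_t>(__hot_text_end - __hot_text_start);
    len = (len + kHugePageSize - 1) & ~(kHugePageSize - 1);  // whole huge pages

    // 1. Copy the hot text out to scratch space.
    std::vector<char> scratch(start, start + len);

    // 2. Drop the original file-backed mapping and put anonymous huge pages
    //    at the same address. (MAP_HUGETLB needs pre-reserved huge pages; a
    //    transparent-huge-page variant via madvise is another option.)
    munmap(start, len);
    mmap(start, len, PROT_READ | PROT_WRITE,
         MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED | MAP_HUGETLB, -1, 0);

    // 3. Copy the text back in place and make it executable again.
    std::memcpy(start, scratch.data(), len);
    mprotect(start, len, PROT_READ | PROT_EXEC);
}
```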

Performance & Evaluation

The paper compares the performance of C3 versus Pettis-Hansen, with and without huge pages, across four different applications: HHVM, Multifeed, TAO, and AdIndexer. C3 outperforms Pettis-Hansen across all binaries, with or without huge pages.

Performance graphs comparing performance improvement of Pettis-Hansen, C3, Pettis-Hansen with Huge Pages, and C3 with Huge Pages

Without huge pages C3 achieves an average IPC improvement of 5.46%, compared to 2.64% for PH. Mapping the hot functions onto huge pages boosts the average performance improvement with PH to 6.35%, while the performance with C3 goes up to 7.51% on average.

In terms of I-TLB performance measured by I-TLB misses per thousand instructions, C3 also outperforms Pettis-Hansen across the board:

Without huge pages, PH reduces the I-TLB misses by 32.4% on average, while C3 reduces this metric by 44.2% on average over the baseline… With huge pages, the gap between C3 and PH in I-TLB misses is smaller (67.4% versus 65.6%, respectively).

Huge Page Regressions

From the performance graphs one can see that huge pages usually provide a large improvement and bring the performance of C3 and Pettis-Hansen closer together. However, huge pages are not a miracle drug, and the authors demonstrate that used indiscriminately for an entire binary, huge pages on their own can cause regressions.

Using HHVM without C3 or Pettis-Hansen as an example, they found:

Our evaluation of this version of HHVM revealed a 1.15% performance regression running Facebook web traffic when compared to the baseline with the same function order and without mapping any function to huge pages.

Conclusion

It’s fun to read this paper because I remember the internal discussions around this while I was at Facebook. If I recall correctly, the insight to sort the functions by density of hotness was worth 2-3% of HHVM performance.

I also enjoy how practical the entire solution is, and the breakdown of the sources of the improved IPC. I can see CPU manufacturers and operating systems introducing more support for selectively huge-page-backed binaries, and larger huge-page I-TLB caches, in the near future as more applications grow in size. Or perhaps new thought leaders will emerge and push for smaller applications with more shared-memory architectures and IPC. Find out next time on: As The Tech World Turns.