Monday, 24 October 2011

Assault by GC

TL;DR

We had performance spikes, which we eased with some insane use of structs.

History

For a while now, we had been seeing a problem in the Stack Exchange engine where we would see regular and predictable stalls in performance. So much so that our sysadmins blogged about it back here. Annoyingly, we could only see this problem from the outside (i.e. haproxy logging, etc) – and to cut a very long story short, these stalls were due to garbage collection (GC).

We had a problem, you see, with some particularly large and long-lived sets of data (that we use for some crazy-complex performance-related code), that would hobble the server periodically.

Short aside on Server GC

The regular .NET framework has a generational GC – meaning: new objects are allocated in “generation 0” (GEN-0). When it chooses, the system scans GEN-0 and finds (by walking references from the so-called “roots”) which (if any) of the objects are still in use by your application. Most objects have very short lives, and it can very quickly and efficiently reclaim the space from GEN-0; any objects that survive are moved to GEN-1. GEN-1 works the same (more-or-less), but is swept less often – moving any survivors into GEN-2. GEN-2 is the final lurking place for all your long-lived data – it is swept least often, and is the most expensive to check (especially if it gets big).
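You can actually watch this promotion happen; a minimal console sketch (nothing to do with our engine, just the mechanics):

using System;

static class GenDemo
{
    static void Main()
    {
        var obj = new object();
        Console.WriteLine(GC.GetGeneration(obj)); // 0 - freshly allocated

        GC.Collect(); // survive one sweep...
        Console.WriteLine(GC.GetGeneration(obj)); // 1 - promoted

        GC.Collect(); // ...and another
        Console.WriteLine(GC.GetGeneration(obj)); // 2 - long-lived territory
    }
}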

Until .NET 4.5 rolls into town with a background server GC, checking GEN-2 is a “stop the world” event – it (usually briefly) pauses your code, and does what it needs to. Now imagine you have a huge set of objects which will never be available to collect (because you will always be using them) all sat in GEN-2. What does that look like? Well, using StackExchange Data Explorer to analyse our haproxy logs, it looks a bit like this:

[image: graph of haproxy response times – mostly flat, with regular tall spikes]

I’ve omitted the numbers, as they don’t matter; but interpretation: normally the server is ticking along with nice fast response times, then WHAM! A big spike (which we have correlated with GC) that just hammers the response-times.

So what to do?

We take performance very seriously at Stack Exchange, so as you can imagine, this got a bit of attention. The obvious answer of “don’t keep that data”, while a valid suggestion, would have hurt overall performance a lot, so we needed to find a way to remove or reduce the stalls while keeping the data.

Our initial efforts focused on removing things like unnecessary copy/extend operations on the data, which helped some, but didn’t really make a step-change. Eventually, we concluded…

Break all the rules

Important: the following is merely a discussion of what has helped us. This is not a magic bullet, and should only be applied to some very limited scenarios after you have profiled and you know what you are doing and why. And it helps if you are just a little bit crazy.

First, a brief overview – imagine you are holding Customer objects in memory for an extended period; you have a lot of them in a List<Customer>, and occasionally add more (with suitable threading protection, etc). Further, you have some pre-calculated subsets of the data (perhaps by region) – so a number of smaller List<Customer>. That is entirely inaccurate, but sets the scene ;p

After exhausting alternatives, what we did was:

  • change Customer from a class to a struct (only within this crazy code)
  • change the main store from a List<Customer> to a Customer[]
  • change the subsets from List<Customer> to List<int>, specifically the offset into the main Customer[]
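In code, the shape of that change looks something like this (a minimal sketch only – the Customer fields here are invented for illustration):

using System.Collections.Generic;

public struct Customer // was: public class Customer
{
    public int Id;
    public int RegionId;
    public long LastActivityTicks;
}

public class CustomerStore
{
    // one big array: a single object, instead of lots of small ones
    private Customer[] customers = new Customer[1000000];

    // a subset is just offsets into the array - no duplicated records,
    // and no references for the GC to walk
    private List<int> northRegion = new List<int>();
}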

eek; so what does that do for us?

  • the main data is now a single object (the array) on the "large object heap", rather than lots of small objects
  • by using direct access into an array (rather than a list indexer) you can access the data in-situ, so we are not constantly copying Customer values on the stack
  • the int offsets for subsets are essential to make sure we don't have multiple separate values of each record, but on x64 using int offsets rather than references also means our subsets suddenly take half the memory
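That last point is easy to check on x64 (illustrative fragment):

Console.WriteLine(IntPtr.Size); // 8 - the size of an object reference on x64
Console.WriteLine(sizeof(int)); // 4 - the size of an int offset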

Note that for some of the more complex code (applying predicates etc), we also had to switch to by-ref passing, i.e.

void SomethingComplex(ref Customer customer) {...} // works on the value in-place
...
int custIndex = ...
// ref to the array element itself - no copy of the (over-sized) struct
SomethingComplex(ref customers[custIndex]);

This again is accessing the Customer value in-situ inside the array, rather than copying it.

To finish off the bad-practice sheet, we also had some crazy changes to replace some reference data inside the record with fixed-size buffers inside the value (avoiding an array on the heap per item), and some corresponding unsafe code to query it (including the rarely-used stackalloc) – but that code is probably a bit too complex to cover properly here. Essentially: we removed all the objects/references from this data. And after removing the objects, there is nothing for GC to look at.
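For the curious, the fixed-buffer idea looks roughly like this (a sketch only – the Name field is invented, and it needs compiling with /unsafe):

using System;

public unsafe struct Customer
{
    public int Id;
    public fixed char Name[32]; // inline storage: no separate array object on the heap

    public void SetName(string value)
    {
        fixed (char* c = Name) // pin the buffer while we write to it
        {
            int len = Math.Min(value.Length, 31);
            for (int i = 0; i < len; i++) c[i] = value[i];
            c[len] = '\0'; // terminator for the unsafe query code
        }
    }
}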

Impact

It helped! Here’s the “after”, at the same scales:

[image: the same haproxy response-time graph after the changes – spikes much less dense, and less tall]

As you can see, there are still a few vertical spikes (which still tie into GC), but they are much less dense, and less tall. Basically, the server is no longer tied up in knots. The code is a bit less OOP than we might usually do, but it is a: constrained to this very specific scenario (this is absolutely not a carte-blanche “make everything a struct”), and b: for understood reasons.

Plus, it was pretty interesting (oh dear; I really, really need to get out more). Enjoy.

Disclaimers

  • You need to be really careful messing with this – especially with updates etc, since your records are certainly over-sized for atomic read/write, and you don’t want callers seeing torn data (see the sketch after this list).
  • This is not about “struct is faster” – please don’t let that be what you take away; this is a very specific scenario.
  • You need to be really careful to avoid copying fat values on the stack.
  • This code really is well outside of the normal bell-curve. Indeed, it is quite reminiscent of XNA-style C# (for games, where GC hurts the frame-rate).
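To illustrate that first point (hypothetical fields again): writing a multi-field struct is several memory writes, so without synchronisation a concurrent reader can observe a half-updated record:

struct Quote { public long Bid, Ask; }

class Board
{
    private readonly Quote[] quotes = new Quote[1];

    public void Update(long bid, long ask)
    {
        quotes[0] = new Quote { Bid = bid, Ask = ask }; // not atomic: multiple writes
    }

    public Quote Read()
    {
        return quotes[0]; // unsynchronised: may see the new Bid with the old Ask
    }
}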

Credit

A lot of input, data-analysis, scheming, etc here is also due to Sam Saffron and Kyle Brandt; any stupid/wrong bits are mine alone.

34 comments:

Noel Grandin said...

From a Java dev who has seen this before: your solution is one approach (which java does not completely support).
There are other approaches
- push it into an external cache engine like memcached
- push it into the process heap and outside the GC's heap
In general, server-side garbage-collected environments don't like large quantities of long-lived data.

Marc Gravell said...

@Noel indeed, and we looked at AllocHGlobal too (non-GC heap); what we have works well, though, and allows more direct access. Indeed, this would have been more painful in Java, due to the lack of user-defined value-types.
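For reference, that route looks something like the below (a sketch; the size and offsets are invented):

using System;
using System.Runtime.InteropServices;

IntPtr block = Marshal.AllocHGlobal(16 * 1024 * 1024); // memory the GC never scans
try
{
    Marshal.WriteInt32(block, 0, 42);        // manual layout...
    int value = Marshal.ReadInt32(block, 0); // ...manual access...
}
finally
{
    Marshal.FreeHGlobal(block); // ...and manual lifetime - no GC safety net
}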

Marc Gravell said...

@Noel as an aside, where appropriate we also use the external option (we opted for Redis, and ended up writing our own optimized bindings for it)

Noel Grandin said...

@Marc, yes, the lack of value types and in particular, arrays of value types, makes it a lot harder to reduce GC pressure in Java.

These are the trials of high-performance GC-language server projects - increased productivity vs. having to sometimes work around the GC :-)

Barry Kelly said...

There's another pattern that makes this approach easier to work with. Instead of making Customer a struct, make CustomerImpl (or whatever you want to call it) a struct, and turn Customer into a class that only has two fields, of CustomerImpl[] _array and int _index, and proxy all the methods of Customer to the underlying CustomerImpl structure. Override operator== and friends to make your Customer behave like you'd expect a reference type to behave. Finally, create some kind of WrappedList which is backed by T[], but rather than handing out T, hands out f(T) instead, where f = (a, i) => new Customer(a, i).

Or to describe it at a high level, you create a reference type wrapper around the tuple of array and index to avoid having to modify the whole rest of the codebase to take ref parameters and know that it's dealing with a value type (which is error prone, as it only takes one copy along the chain and you start losing updates).

These flyweight wrapper objects should not be stored in long-lived collections themselves (or otherwise you just recreate the original problem). They should only hang around for argument-passing and algorithmic convenience. They should die young and get swept up by gen0 collections.
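In sketch form (names invented), the wrapper might look like:

public struct CustomerImpl
{
    public int Id; // plus the real fields
}

// short-lived flyweight: reference-type surface, struct storage underneath
public sealed class Customer
{
    private readonly CustomerImpl[] _array;
    private readonly int _index;

    public Customer(CustomerImpl[] array, int index)
    {
        _array = array;
        _index = index;
    }

    public int Id
    {
        get { return _array[_index].Id; }
        set { _array[_index].Id = value; }
    }

    public override bool Equals(object obj)
    {
        var other = obj as Customer;
        // identity = same slot in the same array
        return other != null && other._array == _array && other._index == _index;
    }

    public override int GetHashCode() { return _index; }
}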

I've successfully used this pattern in the past, only it wasn't just an array and index being wrapped, but individual objects in a whole serialized object graph stuffed into a byte array. These byte arrays could have been unmanaged, but GC didn't get bad enough to require that. Instead, they were wrapped up in an object that implemented IDisposable, and they were pooled and reused; they were big enough that they went into the large object heap. The pooled arrays not in use were referenced via WeakReference, so spikes in usage could eventually clear them out.

This system didn't have particularly noticeable GC pauses, and total time spent in GC was less than 1%. Collecting objects in gen0 is amazingly cheap; it's definitely worthwhile creating them and throwing them away soon after, if it makes the rest of your code more manageable. And if the abstraction the flyweight objects are hiding is hardcore GC-tweaking stuff, all the better.

(Why were object graphs being stored in a serialized format, and proxied out, you ask? Because deserialization was too expensive. Only a handful of objects needed touching for every request, and copying an array of bytes around was much faster than almost any workable deserialization mechanism.)

Marc Gravell said...

@Barry it should be noted that the "object" here was already very specific to this scenario - so a flyweight, while nice, would not really have been needed (and would only have added GEN-0 work, albeit very cheap). Yours is a good approach though, that I might have considered if I had need of the data *other* than in this really, really constrained block.

Anonymous said...

I'd like to see a post about how you came to correlate these with the GC level 3 collections. That's the part we usually struggle with. Fixing problems is easy, identifying them is the hard part.

Anonymous said...

Why did you change from List to Customer[] ?

Anonymous said...

(blogger ate my brackets after List in the above post)

Anonymous said...

I ran into a similar problem 4 years ago with the .NET 2 GC. Had a Windows service with long-lived objects that measured in GBs.

After trying everything known to man and learning more about WinDbg than anyone should, I settled on a solution that is counter-intuitive but worked for my situation: I called GC.Collect at regular intervals from the app itself.

What this meant for the application was that the GC storms were predictable, smaller, and shorter in duration (because I control them).
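Roughly this shape (a reconstruction of the idea, not my actual code):

using System;
using System.Threading;

// keep a reference to the timer, or it can itself be collected
var timer = new Timer(_ =>
{
    GC.Collect();                  // full, blocking collection - on my schedule
    GC.WaitForPendingFinalizers();
}, null, TimeSpan.Zero, TimeSpan.FromMinutes(5));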

Not sure that it would work for a scaled app like StackOverflow, but worked for me.

Ironically, I am running into a similar situation with Java (http://stackoverflow.com/questions/7811798/why-does-the-java-garbage-collector-kick-in-when-there-is-plenty-of-available-ra) , where the solution of running GCs all the time is actually causing problems.

Livingston said...

Maybe you could have used protobuf to store the objects in byte arrays. I think I read once that Google invented protobufs so that they could store "The Index" in RAM!

Jon Scott Stevens said...

You really should look at externalizing your cache. In Java land, I've used ehcache quite successfully to run large porn sites. There is really no need for control over user-defined value-types in this case. Clearly you are over engineering a problem that is already solved.

Chris Smith said...

Your architecture is seriously broken if you have to resort to "hacks" like that.

JD Conley said...

I had to do this type of thing in the past as well. We ended up keeping optimized binary serialized representations on the LOH by allocating huge blocks of ByteArray via ArraySegment. Essentially managing our own heap. Works well if your objects are uniform in size. Sucks otherwise. With non-uniform objects you might as well go external with Redis. :)

Anonymous said...

I'm sorry but this is a pitiful solution! I can't believe you are admitting to coming up with this nonsense.

Marc Gravell said...

Re "Why did you change from List to Customer[]": a list indexer needs to copy the data each `get`, and these are over-sized. Elements of an array can be accessed directly without copy.

Re "I called GC.Collect at regular intervals from the app itself.": that wouldn't help here - nothing much would be reclaimed.

Re "Maybe you could have used protobuf to store the objects in byte arrays.": indeed, maybe if we had gone unmanaged-buffer (AllocHGlobal) that might have been our next step.

Re "You really should look at externalizing your cache.": we **do** have an externalized cache, in Redis. This is something different.

jwatte_food said...

So, if you had been programming the XNA Framework for Xbox 360 for the last 5 years, you would already have understood all of this and applied it by rote.
The GC is *not* your friend when you need to do heavy data processing, whether it be graph data sets, vertex buffers, or anything else.

Actually, after this learning, I highly suggest you try the XNA framework. You'll feel right at home :-)

Marc Gravell said...

@jwatte_food I noted the similarity to XNA code at the bottom of the post - it is not unfamiliar to me. My intention here was to share this story with regular .NET folks who more traditionally think in the "entity=class" mind-set - which is usually appropriate (especially in line-of-business), but not *always*.

Doug said...

I am a Java dev, not an expert in .NET, but I can see some similarities between the two VMs.
First: when people say creating objects nowadays is cheap, they lie. It may take less time to create an object, but GC time is still something we need to count as well.
Second: in both .NET and Java, Collection-type data structures are nice and convenient, but they cannot beat the performance of arrays, especially arrays of just primitives. I can prove this to you with Hibernate vs a straight JDBC/ODBC query with a large result set.

Third: something that I learned: mechanical sympathy also applies to both VMs. That is: an array is CPU-cache-friendlier than a list because the array access pattern is always predictable. The size of an array pointer is also small enough for the CPU to fetch in advance.

I think for big-data handling, using arrays like you did is always a winner. I don't think I can create a struct in Java, but if you can somehow use a facade to access a multi-dimensional-array type data structure, you could push your performance even further.
Cheers,
Doug

truewill said...

I hate it when someone posts a blog like this. Despite all your careful disclaimers, someone will use this to justify replacing classes with structs and arrays - without profiling. 99.99% of the time, the GC Just Works.

Marc Gravell said...

@truewill yes, if only I had qualified the scenario or noted repeatedly not to change to struct out of whimsy...

br1 said...

Marc, did you consider C++/CLI? Programming with value semantics is easier in C++.

Also, please write about the actual code that needed this hack.

Marc Gravell said...

@br1 that would have been another option, but it is nice that we can solve it without going that extra step. Re "the actual code" - that gets messy; I'd rather stay away from being *too* specific here, just for the usual "CYA" reasons.

knocte said...

Did you try using HashSet<> instead of List<>? People normally use the latter when it's not needed.

knocte said...

Also, whenever you test with 4.5's BackgroundGC (and remove these hacks), can you update the blogpost? (I guess you can already test, thanks to some CTP?)

Marc Gravell said...

The data is unique and order is important - HashSet-of-T would be a bad fit.

Anonymous said...

There sure are a lot of strangely negative comments (trolling? they don't really explain their position). This was an interesting post and something I'm happy to be aware of should I ever run into anything similarly unusual.

Marc Gravell said...

@Anon there's also a lot of supportive/corroborative ones, so I can live with that. For further info on this topic, see http://samsaffron.com/archive/2011/10/28/in-managed-code-we-trust-our-recent-battles-with-the-net-garbage-collector

CodeInChaos said...

Why do you prefer the Server GC in this context? My understanding is that the server GC is for situations where you care about GC throughput and can tolerate a short pause. Typically because you just move the load to a different server before the GC starts.

Whereas client GC has lower throughput, but doesn't block the program execution.

CodeInChaos said...

Or since you probably do use multiple servers, why don't you take a server where a Gen2 GC is occurring out of the load balancer?

Anonymous said...

A quick question: while I did read your article back in Oct. and found it interesting, I did some of my own trials with a similar problem domain. I found that, for me, LowLatency for the GC fixed the majority of my issues with Gen 2 collections. Did you try that approach? If so, were there any reasons you discarded that option?

Marc Gravell said...

AFAIK low-latency is intended for (limited to?) workstation, and workstation GC would **not** love our app. Low-latency is also intended for *short regions of time* (example: animations), not to be left on all the time. I don't claim to be a GC guru, but I don't believe that would be the best fit for our scenario.

Anonymous said...

FYI, .NET 4.5 has a SustainedLowLatency GC mode, designed for longer periods of time. We use it on our web servers to essentially have a pauseless GC.

Then we recycle the app pools during late-at-night maintenance hours, where we can take the servers out of the load balancer.

In .NET 4.5.1, there will be a LOH compactor that can be run on demand. Combined with SustainedLowLatency, I think this would fix your problem without any code changes.
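For reference, the shape of those APIs (the 4.5.1 one as announced):

using System;
using System.Runtime;

GCSettings.LatencyMode = GCLatencyMode.SustainedLowLatency;

// .NET 4.5.1: compact the large object heap on the next full collection
GCSettings.LargeObjectHeapCompactionMode = GCLargeObjectHeapCompactionMode.CompactOnce;
GC.Collect();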