During a recent project I encountered memory allocation failures in a LISP system, seemingly caused by poor garbage collector performance. More careful investigation, however, revealed a more fundamental problem, one that is completely unrelated to the choice of language or runtime.
Without going into unnecessary detail, I should give a brief overview of the system I have been working on. It is an import system for a search engine; in other words, it trawls large datasets that it downloads from an object store, processes those datasets, and uploads certain data in the form of XML documents to a search engine.
The source data loaded from the object store consists of binary-encoded "container" objects (similar to folders in a file system) that may refer to other objects, and of the actual documents in their native formats (for example raw .eml RFC 2822 files).
The importer recursively descends the tree structure of a dataset, comparing the previously imported version to the current version, determining the differences (which objects disappeared, were added or were updated), and conveys this difference to the search engine by means of an XML document.
For performance reasons, we want to run a number of import jobs concurrently; the individual recursive descent is not parallelised, but we like to process many trees concurrently. Much of this work is I/O bound, so the number of concurrent jobs we can run is limited by memory consumption, not by available processing power.
Adding this import functionality to an already existing LISP system allowed for very rapid development. A working version of this code, complete with job scheduling code and a test suite, was implemented over the course of just a few days. All was good, for a while. Then, we started encountering out-of-memory issues. It is simple to add more memory - but we quickly got to a point where we could only run two concurrent jobs with 8GB of memory allocated for the LISP system, and we could still not be certain we wouldn't occasionally encounter an allocation failure.
Dealing with bad allocations in C++ can be troubling enough; proper exception handling and code structure can make recovery from out of memory reliable - but still one area of the code may use all the memory causing an unrelated piece of code to actually fail its allocation. You can't know if the "memory pig" is the one that will fail its allocation, unless you start using pooled memory for each area of your code. With LISP, there is the STORAGE-CONDITION condition that can be raised, allowing you to deal gracefully with allocation problems, similar to std::bad_alloc of C++. In SBCL however, sometimes the condition is raised, sometimes the low-level debugger is entered because the system fails to raise the condition in time. In other words, for a real-world LISP system, I need to be sure that out of memory conditions are unlikely to occur; we cannot rely on error handling to recover reliably from allocation problems.
Soon after starting to experience these occasional out-of-memory problems, I started investigating which types of objects were allocated, using the (room) function. An example could look like:
The above is for a relatively small system; only 1G of dynamic space in use; but the pattern scales to larger dynamic spaces. There is an enormous amount of character strings and a large number of bytes allocated in arrays of unsigned bytes. The latter are the buffers I use when receiving the binary objects from the object store.
It surprised me that this many arrays were "alive" still; I never reference more than a handful of these - how can there be more than 8k of those alive? No wonder we're having memory problems. Running (gc :full t) will clean these up, so I'm right that they are not referenced. So how come we still run out of memory? Why does the garbage collector not collect these by itself? Clearly, this must be a garbage collector problem, right?
I consulted the SBCL-Help mailing list for advice on how best to deal with this. Very quickly I received insightful and helpful advice both on-list and off-list. Of the various suggestions I received, at the time it seemed most reasonable to accept that the memory allocation pattern of my application (few large objects followed by many small) simply caused the large objects to get promoted to a later generation in the garbage collector, causing the collector to miss opportunities to collect them (the SBCL memory collector uses 6 generations or so).
Working under the assumption above, there were several solutions possible:
- Call the garbage collector explicitly
- Tune the garbage collector; e.g. make it two-generational
- Pool the resources - re-use buffers
- Allocate outside of the dynamic space
I did two things: I took the static-vectors library and implemented a simple memory allocator on top of that. This way, I could now explicitly allocate and free memory objects, using a single underlying static-vector allocated outside of the dynamic space. My allocator would simply return displaced vectors on top of the one large vector initially allocated for the pool. This looked like a beautiful solution to the problem. All my buffers were now completely out of reach of the garbage collector and I knew exactly when and where buffers were allocated and freed again. The code wasn't even complicated by this; a simple (with-pooled-memory...) macro would neatly encapsulate allocation and guarantee release.
Running the code with this gave the following dynamic space breakdown:
There we go! My unsigned byte arrays are completely gone. But another problem surfaced...
Displaced array performance
It turns out that accessing displaced arrays comes at a significant cost - the compiler can no longer generate efficient code for traversing them. Take a look at the following comparison:
That's a significant penalty. The simple solution to this problem was to realize that I didn't need a memory pool. I allocate so few objects anyway, that I can easily allocate them all directly with the static-vectors library. Simply removing the pool (along with my beautiful little allocator that I was actually proud of) completely solved that problem.
This is a classical programmer error, and one that I am clearly not above: Imagining a solution in one's head without caring to investigate if it is even needed - spending time implementing a solution which in fact turns out to be a problem in itself. Anyway, with the pool gone and vectors allocated using the library that already allocates vectors, the performance problem was gone. But the string objects certainly weren't... How many of those did we have?
Half a gigabyte of strings?
Looking at the most recent dynamic space breakdown, we see we use 488MB for strings and 317MB for a whopping 19 million cons objects. Clearly there is more work to do - this system does a number of different jobs and there's nothing strange about it using some memory - but 19 million cons objects and almost three million strings, that's a lot. And this is on a "small" system; we can easily use 3-4G of dynamic space and the string and cons numbers scale.
This is actually where it gets interesting. I was forced to take a look at the actual data processing algorithm that I had implemented. The logic goes a bit like this:
So in order to compute the difference of one version of a folder to another version of a folder, we need to sort the children and iterate through them. Since we receive the data in these binary "blobs" and each actual entry in the blob is of variable length (depending, among other things, on the length of the name of the entry), I use a parser routine that generates an object (a struct) holding a more "palatable" representation of the very same data.
My struct would hold a list of child entries (again structs). Each child entry would have a name and a key/value store (a list of pairs of strings). So as an example, let's say I'm comparing two folders with 10,000 entries in each; each entry has five meta-data entries, so a child holds 5 (meta-data entries) * 2 (key and value) + 1 (entry name) strings. Times 10k that's 110,000 strings, which I allocate for both the old and the new version. The entry lists are then efficiently sorted, and I can traverse and find the difference.
While this was simple to implement, it is also clearly extremely wasteful. And this has nothing to do with the choice of language. Had I implemented this in C++, a profiler would reveal that a significant portion of the run-time is spent in string allocation and string de-allocation. While an efficient C++ solution would only use a few times more memory than the source binary data due to the precise deallocation, my LISP system hurts more because the garbage collector fails to collect many of these strings in a timely fashion. But the algorithm I implemented is equally wasteful; in other words: What I did was silly in any language.
It actually wasn't rocket science to change this. I now use a struct that holds these members:
The "buffer" is my binary blob from the object store. The "offset" is a list of integers (fixnums) that hold the offsets into the "buffer" at which the child entries start.
I can efficiently retrieve the name of an entry by supplying the "buffer" and the start offset of the entry; a simple macro to retrieve various fields of the binary data makes the code for this really elegant.
Sorting is simple too. I do not sort the entries in the buffer - that would be time consuming and very difficult (as the entries are of variable size). Instead, I sort the offset list; not by offset number of course, but by the name of the entry found at each offset. It's as simple as:
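The same idea carries over to other languages. What follows is a minimal C++ sketch, not the importer's actual code: the blob layout used here (one length byte followed by the entry name) and the helper names entry_name, entry_offsets and sort_offsets_by_name are assumptions made purely for illustration, since the real object store format is not described here.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <string_view>
#include <vector>

// Hypothetical entry layout: one length byte, then that many bytes of name.
// Returns a view into the blob; no string is allocated.
inline std::string_view entry_name(const std::vector<std::uint8_t>& buffer,
                                   std::size_t offset) {
    const std::size_t length = buffer[offset];
    const char* first =
        reinterpret_cast<const char*>(buffer.data() + offset + 1);
    return std::string_view(first, length);
}

// Walk the blob once and record the offset at which each entry starts.
inline std::vector<std::size_t> entry_offsets(
    const std::vector<std::uint8_t>& buffer) {
    std::vector<std::size_t> offsets;
    std::size_t pos = 0;
    while (pos < buffer.size()) {
        offsets.push_back(pos);
        pos += 1 + buffer[pos];  // skip the length byte and the name
    }
    return offsets;
}

// Sort the offsets by the name each one points at. The blob itself is
// never touched, and the comparison works directly on the raw bytes.
inline void sort_offsets_by_name(const std::vector<std::uint8_t>& buffer,
                                 std::vector<std::size_t>& offsets) {
    std::sort(offsets.begin(), offsets.end(),
              [&buffer](std::size_t a, std::size_t b) {
                  return entry_name(buffer, a) < entry_name(buffer, b);
              });
}
```

The only allocation here is the offset vector itself; the variable-length entries stay where they are, and only the small fixed-size offsets get moved around during the sort.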
With this change, I probably don't even need the static-vectors allocation. But now that I have it, I'll keep it - it's an improvement, even if it's not essential.
An essential tool for me in finding the cause of the problem, and for identifying a few other routines that were unnecessarily heavy on allocating objects, was the statistical sampling profiler built into SBCL. A simple session looks like:
Finding the culprit
My expensive object parsing routine showed up in a profile report like this:
It is visible here that 35.7% of all allocations are caused by make-output-stream; most of the calls to that are from stringify-object. That in turn is called by princ-to-string. That, again, by format-print-integer. And that, finally, gets called by the objseq->contentlist routine which was the name of my parsing routine. This trace is what led me to understand the cause of the actual problem.
For example, I had a (sexp->xml ...) routine which converts an S-expression into a textual XML document. This routine was some of the first LISP I ever wrote, so it was built using some "fancy" constructs (as a beginner learning the language would) to recurse down into the expression and construct the document. However, looking at a profile running my search engine import revealed:
In other words: 41.8% of all allocations in the system are done by concatenate, and 38.7% of the total allocations come from concatenate being called by reduce. Investigation quickly reveals that my fancy XML document routine makes heavy use of reduce and concatenate.
Changing the code to a simpler version that uses a string output stream instead completely removed this routine from the top of the profile. Just like that, about 40% of all allocations done in the code were gone.
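The difference between the two approaches is not specific to LISP. As a rough C++ analogy (this is not the sexp->xml code; both function names are made up for illustration), folding fragments together by concatenation builds a new, ever longer string at every step, whereas writing into one output stream appends to a single growing buffer:

```cpp
#include <numeric>
#include <sstream>
#include <string>
#include <vector>

// Fold-style joining, comparable in spirit to reduce + concatenate:
// each step produces a fresh string holding everything built so far.
inline std::string join_by_concatenation(
    const std::vector<std::string>& parts) {
    return std::accumulate(
        parts.begin(), parts.end(), std::string{},
        [](const std::string& acc, const std::string& part) {
            return acc + part;  // allocates a new, longer string
        });
}

// Stream-style joining: every fragment is appended to one buffer.
inline std::string join_by_stream(const std::vector<std::string>& parts) {
    std::ostringstream out;
    for (const std::string& part : parts) {
        out << part;
    }
    return out.str();
}
```

Both functions return the same text; they differ only in how many intermediate strings are created along the way.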
The sampling profiler has proven an excellent tool for finding the source of allocations. The ability to start such a profile on QA or even production systems (I have not needed this - but I could if I needed) is amazingly powerful. Profilers are not just for finding CPU bottlenecks, clearly.
It is common for me, and I know for many others, to want to "convert" data from one representation to another before we work on it. Like I did when I parsed the binary objects to a higher-level representation for sorting and comparison. Sometimes that makes sense - but not only does it take time to develop all this representation-conversion functionality (or you need to find and integrate libraries that do it for you); it takes time to run (every time), and your code may not even be simpler!
In this particular case, for comparing two 10k entry folders I would do 220k string allocations - my new implementation does 4 simple array allocations, and it is no harder to read and doesn't take up more lines of code than the old implementation.
I'm not saying abstraction is evil. I did indeed abstract things - for example, I developed a macro for efficiently extracting specific fields out of entries in the buffer. In this case, my abstraction is in code, instead of in the data structures. This is easier done in LISP with its powerful macro system than in other languages of course - but the same principle could be applied in any other language.
I'll try to think twice about where to put my abstractions the next time, that's for sure. Abstractions are good, to a point - but they don't always have to be in the data representation.
HTTP/2 is a symptom of a disease; it is a terribly ill-conceived deterioration of the otherwise pretty good HTTP/1.1 protocol.
Most really terrible ideas (like a computer in your toaster or a 3D television) usually linger for a while and then die and leave us alone. Of course I was expecting HTTP/2 to go this route, but yesterday was a shocker:
Looking at integrating with Apple's APNs (Apple Push Notification service) API, it became apparent they only support HTTP/2 access. Oh dear, I don't yet have an HTTP/2 client library (and frankly I was expecting never to need one).
Let's go over the main features of HTTP/2 one by one. I can start with the only one that is slightly justifiable:
Header compression is like Donald Trump. Both are probably very good solutions - to problems that you should not have had in the first place.
Staying on topic here, header compression allows legacy applications that have deteriorated into unmanageable piles of crap and therefore use excessively large headers on all their API requests, to gain some performance advantages by supporting compact and efficient transmission of the bloated crap that is their headers.
I have worked with software development for what is more than a lifetime for some of you who read this - I absolutely understand that you can be in a situation where this is useful. Therefore, I am not against the concept of header compression as such, but I will maintain that it is a solution to a problem that we should strive very hard not to have. There are no good excuses for having bloated headers.
Google with their SPDY extension attempted to add header compression to HTTP/1.1 - frankly I think this would have been a reasonable solution. It would be a hack to support legacy garbage that should really be rewritten - but I absolutely understand the business reasons that could justify improving support for (prolonging the useful life of) legacy garbage applications.
Now, as you can see from the Apple APNS documentation, you can't just use header compression. Apparently there are problems if you use it 'too much', so they warn you to send your headers in very special ways to avoid overflowing header compression tables at their server:
APNs requires the use of HPACK (header compression for HTTP/2), which prevents repeated header keys and values. APNs maintains a small dynamic table for HPACK. To help avoid filling up the APNs HPACK table and necessitating the discarding of table data, encode headers in the following way—especially when sending a large number of streams: ...
This really speaks for itself. So we now have a compression scheme (for no good reason) that is so fragile that we need to pussy-foot around it not to overflow the tables? Really?
In traditional "old style" networking (you know, the kind we use today because it actually works), we would use TCP to provide us with "connections". On top of the connection layer (possibly with layers in between), we would put our application layer - like HTTP for example. HTTP is an application protocol that allows us to perform more abstract transactions on top of our reliable connections as provided by the TCP layer.
These days, it's always the OS kernels that actually provide the TCP protocol for us. It was not always like that - applications did once implement their own TCP stacks (or they used libraries that did it for them - but still executed TCP in the application). Why would that be? Why does this code live better in the OS?
The reasons are many. As time went by and networks evolved, a lot of smart people learned a lot of lessons and refined TCP to the point where it is today. Conceptually there's nothing difficult about retransmissions, window sizes, or backing off transmissions in case of congestion. But pulling this off in the real world is difficult. Really really difficult. So difficult in fact, that Linux was the first kernel since BSD (that I know of anyway) to attempt to develop a TCP stack from scratch - everyone else, from Solaris and AIX to Windows and HPUX have refined theirs from the BSD stack.
Despite TCP's best efforts to provide reliable connections, someone has now decided not only that it's not good enough and not worth fixing at the connection layer; but they have even decided that "we" can do better at the application layer. "We" being any application client library and random teenager running cloud services out of his mother's basement. The HTTP/2 PING frame is introduced as a means of checking if the underlying TCP connection is still working; and even Apple encourages its use for this very purpose.
Really guys? TCP already has keep-alive functionality for this very purpose. And yes, TCP keep-alive is not trouble free, but that is not because of some flaw in TCP, it is because the problem is fundamentally hard in the real world. Moving the problem away from the TCP stack and re-implementing it at the application layer will indisputably make it more expensive traffic wise (TCP layer keep-alive is quite efficiently implemented, something that HTTP/2 fundamentally cannot do because it rides on top of TCP itself). It will also, almost definitely, cause all kinds of traffic problems now that we move the responsibility of sending TCP keep-alive away from the operating system TCP stack (which has decades of refinement in it) into the application layer where history already showed that such functionality doesn't belong (we moved our TCP stacks from the applications to the kernel - remember?).
HTTP/2 ping is a terrible idea. The best we can hope for is that nobody uses it. That boat has sailed already though, as Apple is encouraging the PING frame use for checking the state of TCP connections to their APNS service.
A wonderful and under-used feature of HTTP/1.1 is that of pipelining. You can "stream" any number of requests through your HTTP/1.1 connection without waiting for one request to complete before sending the next. This solves latency problems, allows for servers to concurrently process requests even if you use only a single connection, and it's standard in HTTP/1.1.
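To show how little there is to it on the client side, here is a minimal sketch that builds the bytes for a batch of pipelined GET requests. The function name is made up for illustration, and actually writing the bytes to a connection and reading the responses is left out:

```cpp
#include <string>
#include <vector>

// Build the wire bytes for several GET requests to be written
// back-to-back on one HTTP/1.1 connection, without waiting for any
// response in between.
inline std::string pipelined_get_requests(
    const std::string& host, const std::vector<std::string>& paths) {
    std::string wire;
    for (const std::string& path : paths) {
        wire += "GET " + path + " HTTP/1.1\r\n";
        wire += "Host: " + host + "\r\n";
        wire += "\r\n";  // empty line ends this request's header block
    }
    return wire;
}
```

The client writes the whole buffer in one go and then reads the responses, which arrive in the same order in which the requests were sent.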
People don't get this. And I don't get why. It's super simple. Even big-name applications sometimes use multi-part MIME documents to wrap multiple requests together so they can send them in a single POST - "to solve latency problems" (I kid you not).
Anyway... Some servers apparently do not support pipelining correctly. This is hard for me to believe (having implemented an HTTP/1.1 server that supports it just fine - and that was not hard), but at least that's the word on the street.
For this reason, and apparently for this reason alone, all popular browsers disable HTTP/1.1 pipelining. Yes I know, this is hard to believe but I promise you I'm not making this up.
The solution to this, you ask? If HTTP/1.1 pipelining is mis-implemented, you might think that someone would push to get it fixed. But no... it gets better and as you'll see, I couldn't even make this stuff up if I tried:
HTTP/2 … allows interleaving of request and response messages on the same connection … It also allows prioritization of requests
So... in other words: Basic HTTP/1.1 pipelining is too difficult to implement; therefore we now want to implement a considerably more complex scheme. Really?
I had honestly thought that HTTP/2 was so obviously ridiculous, and difficult to actually implement reasonably well, that no-one would ever actually employ it in production applications.
It appears that we may not be this lucky. At least Apple is requiring HTTP/2 for their APNS API even though their documentation needs to warn developers about how to work around obvious deficiencies in their implementation (the HPACK restrictions).
Since sanity or even just good taste is obviously not going to save us here, I am left hoping that plain simple laziness will keep a mass migration to the dangerous and broken HTTP/2 from ever happening.
Should we start to see HTTP/2 adoption, I can predict a couple of things:
- There will be unspeakable numbers of security issues - like use-after-free and out-of-bounds access, especially caused by HTTP/2 frame handling in HTTP/2 servers and client libraries
- Trivial DoS attacks will quickly cause servers to significantly limit both header-compression and frame multiplexing support, making actual HTTP/2 deployments more and more like HTTP/1.1 (only more complex and therefore more dangerous)
- We will start seeing significant network performance problems (hopefully only locally) triggered by indiscriminate use of PING frames for connection testing
- With the sentiment that 'bloated headers are fine', our performance problem moves into the applications that actually parse and generate the headers; transfer cost of bloated headers is only a symptom of a disease, removing that symptom does not cure the disease
Actually, these are not even predictions. They are obvious facts that anyone can see. Still, I'm going to pass them as predictions and claim clairvoyance.
It continues to amaze me when I look at other people's code, how many good developers think in signed integers. This is a short example of such code and a walk-through of why it is misguided.
Signedness in the world of computers
Almost nothing in the world of computers is signed; you cannot send a negative number of bytes, you cannot have a negative amount of memory, you cannot put a negative number of elements in a vector and so on and so forth. We will frequently subtract numbers, but we will usually never have negative results.
For example: I may want to send a buffer using a send routine that does not necessarily send all data at once - subtracting the sent number of bytes from the full number of bytes can never be negative (in other words, the routine can never send more than it was given).
Good code like the STL realizes this and typically uses the size_t type for sizes in general (basically it is used for everything you can count that you can keep in memory). It is of course unsigned (for the aforementioned reasons) and it is as large as it needs to be, for any count of "things you can have in memory".
The problem then arises when a signed-integer-loving developer mixes his int-using code with the better size_t-using code. Good compilers will complain about pitfalls in signed/unsigned arithmetic and the developer is forced to respond with an equal measure of type conversions or casts.
The result will usually be unnecessarily verbose code that is less capable of handling large amounts of data (since the sign bit can no longer be used for actually counting). Granted, the last bit is more a problem on 32-bit systems than it is on 64-bit systems, at least at the time of this writing - but you never know when someone decides to use your code on a 32-bit system. If your code is any good (and often even if it isn't), it will end up on all sorts of systems without your knowledge anyway...
Let's take a look at an example I came across today: This is an extract of a piece of production code that actually works, it's just unnecessarily ugly because the developer had an unhealthy affinity towards signed integers:
So we have a variable sent_bytes with a nice readable descriptive name, so far so good. It escapes me how that can be a signed integer - how exactly do we send a negative number of bytes? Right, that is difficult indeed. By making it int rather than size_t, we typically end up with 31 bits for the counter: int is 32 bits wide on most 32-bit platforms and on the common 64-bit platforms too, while size_t gives us the full 32 or 64 bits (so this may be a problem if we use large buffers). But regardless of whether the more limited positive range is a problem or not, we don't get anything in return. Nothing is gained by opening up for negative numbers; there is no plausible scenario in which we could ever have sent a negative number of bytes. So we get nothing for something and that's never a good trade. Now I would buy the argument for making this a signed integer if it somehow made the rest of the code more readable - let's proceed with the analysis then.
So in our loop we compare the number of sent bytes to the actual number of bytes we wish to send. Notice how a static_cast<int> is used to stop the compiler from complaining that it is dangerous to compare the two types (as they don't have the same range a conversion is made - and that is not necessarily what you want). Now consider for a moment what this cast accomplishes... It casts the value so that the compiler stops complaining - it does not actually solve the problem the compiler is complaining about. This is like shutting off the fire alarm instead of putting out the fire - it stops the noise but it doesn't really solve the problem. The problem would occur wherever int is 32 bits wide (as of this writing that includes most 64-bit systems as well as 32-bit ones): given a 2.5G buffer, the cast cannot represent the size, and were sent_bytes ever to count past 2G it would overflow, so depending on the rest of the logic we may have more or less luck with our algorithm when sent_bytes is -1.5G (or some other number - depending on the good will of the compiler; read on).
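A small sketch makes the effect of the cast visible. It assumes a 32-bit two's complement int, which is what common compilers provide today; the two function names are made up for illustration and are not taken from the production code discussed here:

```cpp
#include <cstddef>

// The pattern criticised above: the cast silences the signed/unsigned
// warning, but a size above INT_MAX cannot be represented in an int.
// Before C++20 the converted value is implementation-defined; with a
// 32-bit two's complement int, a size of 2.5G comes out negative.
inline bool has_more_to_send_signed(int sent_bytes,
                                    std::size_t buffer_size) {
    return sent_bytes < static_cast<int>(buffer_size);
}

// Same comparison with matching unsigned types: no cast, no surprise.
inline bool has_more_to_send(std::size_t sent_bytes,
                             std::size_t buffer_size) {
    return sent_bytes < buffer_size;
}
```

With a buffer of 2,684,354,560 bytes (2.5G) and nothing sent yet, the signed variant reports that there is nothing left to send under the stated assumption, while the unsigned variant gives the expected answer.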
And this is actually what makes it even more difficult for me to understand: Unsigned arithmetic in C++ is valid and well defined - it is integer arithmetic modulo 2^n where n is the bit-size of your integer. In other words, when you add one to your maximum integer value, you're back to zero. There's nothing magical about that, but it is extremely useful. In contrast, the standard does not define behaviour of signed overflow - if you rely on signed integers and they overflow, the behaviour is undefined by the standard. Why would anyone tend towards a type that has less standardized behaviour if it does not provide a benefit? The mind boggles.
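As a small illustration of the modulo behaviour (the function names are made up for this example), consider a fixed-width 32-bit counter:

```cpp
#include <cstdint>
#include <limits>

// Adding one to the largest 32-bit unsigned value wraps to zero.
inline std::uint32_t next_counter(std::uint32_t value) {
    return value + 1u;
}

// The difference between two counter readings stays correct even if
// the counter wrapped between them, because the subtraction is also
// carried out modulo 2^32.
inline std::uint32_t counter_delta(std::uint32_t newer,
                                   std::uint32_t older) {
    return newer - older;
}
```

For instance, a counter that read 4294967291 and later reads 5 has advanced by 10, and counter_delta returns exactly that. The signed equivalent of the first function would have undefined behaviour at the maximum value.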
We move on to the SSL_write line. Here we must subtract the number of sent bytes from the total size of our buffer so as to compensate for adding the number of sent bytes to the start offset of our data buffer. The developer has chosen an alternative means of getting around the compiler warning this time: Construct a temporary signed integer from the unsigned size of the buffer, then subtract the signed sent_bytes from there. Now, if we were to attempt to send 2.5G of data on a 32-bit platform this code would explode, but the compiler does not tell us about this because we skillfully mislead it by inserting valid constructs that hide our real crimes from the compiler.
Ultimately, we add our signed integer res to our number of sent bytes. Naturally, we expect that res cannot be negative when we use it in this manner - no assertion or otherwise is inserted for this, but that is something one could have done. Or we can trust the library to honour its contract and never return success and set res to a negative value. This concludes the use of sent_bytes, and in no situation do we gain anything from having it be a signed value. How about we consider a fixed version of the code:
Note the distinct lack of casts and explicit construction of temporaries to work around type mismatches. Other than that, the only functional difference between the two versions is that this shorter and more readable version will successfully send 2.5G of data on a 32-bit platform, rather than blowing up unpredictably.
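To make the shape of such a loop concrete, here is a minimal sketch built around size_t. It is not the production code discussed above: WriteSome is a hypothetical stand-in for a partial-write call such as SSL_write (whose real signature differs), and send_all is a name made up for this example. The one remaining conversion happens only after the result has been checked to be positive.

```cpp
#include <cstddef>
#include <functional>
#include <stdexcept>
#include <vector>

// Hypothetical partial-write call: it is handed a pointer and a byte
// count, and returns how many bytes it accepted, or a value <= 0 on
// failure.
using WriteSome =
    std::function<long(const unsigned char* data, std::size_t size)>;

// Keep calling write_some until the whole buffer has been handed over.
// Returns the number of bytes sent; throws if the writer reports failure.
inline std::size_t send_all(const std::vector<unsigned char>& buffer,
                            const WriteSome& write_some) {
    std::size_t sent_bytes = 0;
    while (sent_bytes < buffer.size()) {
        const long res = write_some(buffer.data() + sent_bytes,
                                    buffer.size() - sent_bytes);
        if (res <= 0) {
            throw std::runtime_error("write failed");
        }
        // res is known to be positive here, so the conversion is safe.
        sent_bytes += static_cast<std::size_t>(res);
    }
    return sent_bytes;
}
```

Because sent_bytes never exceeds buffer.size(), the subtraction in the call can never go below zero, which is exactly the property that makes an unsigned type the natural fit.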
I fear that the affinity for signed integers comes from the old "magic return value" where we use non-negative values for the real return value and assign special meaning to negative magic numbers that callers must remember to check for (like SSL_write in the example earlier). That, of course, is the worst excuse; especially when working in C++ that actually has means to deal with exceptional circumstances where less fortunate C developers might feel compelled to return negative magic numbers.
Old habits die hard I guess. But it troubles me that I see this from young developers - they did not code C back in the '90s (in fact they were born in the '90s). I don't have all the answers - this one is definitely a mystery to me.