Archive for December 2016

A new-years KISS

2016-12-30

Anyone who says "less is more" is of course high, stupid, or both. Less is not more - but the message intended in that common cliche is that "less is better", which is certainly the case more often than not.

The problem

Without going into details... A developer is faced with the problem of generating a string like foo=42&bar=more&baz, given the following set of mappings:

Key	Value
foo	42
bar	more
baz

The diligent reader will recognize the string as the options part of a URI, and can also safely infer that the table of key/value mappings is indeed held in an STL map<string,string> structure.

The solution

During some early spring cleaning here before the very end of 2016, I stumbled across the code for this solution - it works and the code did not cause problems; but it is heavily depending on the boost library which I am trying to get rid of to the extent possible (and that would probably be the subject of an article on itself - suffice to say I'm trying to cut down on dependencies).

namespace { struct to_string { typedef std::string result_type; const result_type operator()(const HTTPOptions::options_t::value_type &pair) const { if (pair.second.empty()) { return pair.first; } return pair.first + std::string("=") + pair.second; } }; } const std::string HTTPOptions::to_uri() const { return boost::algorithm::join( m_options | boost::adaptors::transformed(to_string()) , "&"); }

The above is the code as I found it. It uses a boost algorithm to "join" the options with ampersands, and declares a struct to be able to apply an operator() using a transformer, to convert the pairs in the map to strings of the "key=value" form.

It is time to take a step back... Why is it we need all this? What is the actual job that we are performing? Take a moment to think about it... Given the mapping of strings, how would you imagine that you could construct the resulting string?

Another solution

I deleted the includes of boost headers and deleted the code above. Instead, I wrote this little bit:

std::string HTTPOptions::to_uri() const { std::ostringstream str; for (auto i = m_options.begin(); i != m_options.end(); ++i) str << (i == m_options.begin() ? "" : "&") << i->first << (i->second.empty() ? "" : "=") << i->second; return str.str(); }

This is a pretty straight forward solution really. It iterates though the mapping and simply inserts the string pairs one by one. No operator|() to apply a transformer to a map, no structs with operators that construct new strings based on the old ones, and no high-level transformers to abstract away the nitty gritty details.

I am not against abstractions. But we also have to consider the reality of things here:

	Old solution	New solution
Size	20 lines	10 lines
Dependencies	Boost, STL	STL
Allocations	2n + log n	log n

The allocations metric is a relevant performance metric; basically, when we perform an operator+ on two strings, we construct a new string using the two old (without modifying the two old). This costs us an allocation. Therefore, for every call to to_string::operator() we perform two new allocations. The boost join method will use insert to append each pair to the resulting string, and I think it is fair to assume that std::string is smart enough to incrementally grow its underlying buffer thereby giving us around log(n) allocations for adding n pairs to the string.

The new solution however uses a ostringstream which, like the std::string will perform something like log(n) allocations in order to append n strings to the result. No temporary strings are created during the operation of the new solution - therefore the number of allocations is negligible compared to the original solution.

Last but not least, the shorter solution is trivial. It is obvious what it does and it is obvious how it does it. It is simply more readable, not just because it is fewer lines, but because of its flow.

Something to take with us into the new year...

The morale of the story is: When you are implementing a solution to a conceptually very simple problem, and you find yourself writing a lot of code using clever and complex libraries and constructs, you need to take a step back.

For every simple problem, there is a large, complicated and expensive technological solution.

But it doesn't have to be like that. We can do better. Let's make 2017 a year where we do better. Happy new year everyone!

LISP memory woes, but not

2016-12-13

During a recent project I encountered memory allocation failures in a LISP system, seemingly caused by poor garbage collector performance. More careful investigation revealed a more fundamental problem however, one that is even completely unrelated to the choice of language or runtime.

System Outline

Without going into unnecessary detail, I should give a brief overview of the system I have been working on. It is an import system for a search engine; in other words, it trawls large datasets that it downloads from an object store, processes those datasets, and uploads certain data in the form of XML documents to a search engine.

The source data that is loaded from the object store, are binary-encoded "container" objects (similar to folders in a file system) that may refer to other objects, and then actual documents in their native formats (for example raw .eml rfc2822 files).

The importer recursively descends the tree structure of a dataset, comparing the previously imported version to the current version, determining the differences (which objects that disappeared, were added or were updated), and conveys this difference to the search engine by means of an XML document.

For performance reasons, we want to run a number of import jobs concurrently; the individual recursive descent is not parallelised, but we like to process many trees concurrently. Much of this work is I/O bound, so the number of concurrent jobs we can run is limited by memory consumption, not by available processing power.

The problem

Adding this import functionality to an already existing LISP system allowed for very rapid development. A working version of this code, complete with job scheduling code and a test suite, was implemented over the course of just a few days. All was good, for a while. Then, we started encountering out of memory issues. It is simple to add more memory - but we quickly got to a point where we could only run two concurrent jobs, we allocated 8GB of memory for the LISP system, and we could still not be certain we wouldn't occationally encounter an allocation failure.

Dealing with bad allocations in C++ can be troubling enough; proper exception handling and code structure can make recovery from out of memory reliable - but still one area of the code may use all the memory causing an unrelated piece of code to actually fail its allocation. You can't know if the "memory pig" is the one that will fail it's allocation, unless you start using pooled memory for each area of your code. With LISP, there is the STORAGE-CONDITION condition that can be raised, allowing you to deal gracefully with allocation problems, similar to std::bad_alloc of C++. In SBCL however, sometimes the condition is raised, sometimes the low-level debugger is entered because the system fails to raise the condition in time. In other words, for a real-world LISP system, I need to be sure that out of memory conditions are unlikely to occur; we cannot rely on error handling to recover reliably from allocation problems.

Early symptoms

Soon after starting to experience these occational out-of-memory problems, I started investigating which types of objects were allocated, using the (room) function. An example could look like:

(room) Dynamic space usage is: 966,068,000 bytes. Read-only space usage is: 5,680 bytes. Static space usage is: 4,400 bytes. Control stack usage is: 1,912 bytes. Binding stack usage is: 800 bytes. Control and binding stack usage is for the current thread only. Garbage collection is currently enabled. Breakdown for dynamic space: 575,989,232 bytes for 3,263,700 simple-character-string objects. 258,397,040 bytes for 8,669 simple-array-unsigned-byte-8 objects. 137,638,992 bytes for 8,602,437 cons objects. 104,355,056 bytes for 987,245 other objects. 1,076,380,320 bytes for 12,862,051 dynamic objects (space total.)

The above is for a relatively small system; only 1G of dynamic space in use; but the pattern scales to larger dynamic spaces. There is an enormous amount of character strings and a large number of bytes allocated in arrays of unsigned bytes. The latter are the buffers I use when receiving the binary objects from the object store.

It surprised me that this many arrays were "alive" still; I never reference more than a handful of these - how can there be more than 8k of those alive? No wonder we're having memory problems. Running (gc :full t) will clean these up, so I'm right that they are not references. So how come do we still run out of memory? Why does the garbage collector not collect these by itself? Clearly, this must be a garbage collector problem, right?

I consulted the SBCL-Help mailing list for advise on how best to deal with this. Very quickly I received insightful and helpful advise both on-list and off-list. Of the various suggestions I received, at the time it seemed most reasonable to accept that the memory allocation pattern of my application (few large objects followed by many small) simply caused the large objects to get promoted to a later generation in the garbage collector, causing the collector to miss opportunities to collect them (the SBCL memory collector uses 6 generations or so).

Early solution

Working under the assumption above, there were several solutions possible:

Call the garbage collector explicitly
Tune the garbage collector; e.g. make it two-generational
Pool the resources - re-use buffers
Allocate outside of the dynamic space

Personally I think calling the gc explicitly would defy the purpose of having a garbage collector in the first place - there has to be a better way. I also don't like to have to tune it - my workload is not special, as I see it, it's just an application working with different types of data; tuning should not be necessary. After all, SBCL should be a quite mature system, and certainly LISP systems are not exactly new or immature technology.

I did two things: I took the static-vectors library and implemented a simple memory allocator on top of that. This way, I could not explicitly allocate and free memory objects, using a single underlying static-vector allocated outside of the dynamic space. My allocator would simply return displaced vectors on top of the one large vector initially allocated for the pool. This looked like a beautiful solution to the problem. All my buffers were now completely out of reach of the garbage collector and I knew exactly when and were buffers were allocated and freed again. The code wasn't even complicated by this, a simple (with-pooled-memory...) macro would neatly encapsulate allocation and guarantee release.

Running the code with this, gave the following dynamic space breakdown:

(room) Dynamic space usage is: 812,977,552 bytes. Read-only space usage is: 5,680 bytes. Static space usage is: 4,400 bytes. Control stack usage is: 1,912 bytes. Binding stack usage is: 800 bytes. Control and binding stack usage is for the current thread only. Garbage collection is currently enabled. Breakdown for dynamic space: 488,917,712 bytes for 2,875,896 simple-character-string objects. 317,265,168 bytes for 19,829,073 cons objects. 122,818,160 bytes for 1,267,261 other objects. 929,001,040 bytes for 23,972,230 dynamic objects (space total.)

There we go! My unsigned byte arrays are completely gone. But another problem surfaced...

Displaced array performance

It turns out that accessing displaced arrays comes at a significant cost - the compiler can no longer generate efficient code for traversing them. Take a look at the following comparison:

(with-memory-pool (100000000) (time (with-pool-buffer (a 100000000) (loop for x across a do (setf x 42)) (loop for x across a sum x)))) => 4.006 seconds (time (static-vectors:with-static-vector (a 100000000) (loop for x across a do (setf x 42)) (loop for x across a sum x))) => 0.683 seconds

That's a significant penalty. The simple solution to this problem was to realize that I didn't need a memory pool. I allocate so few objects anyway, that I can easily allocate them all directly with the static-vectors library. Simply removing the pool (along with my beautiful little allocator that I was actually proud of) completely solved that problem.

This is a classical programmer error, and one that I am clearly not above: Imagining a solution in ones head without caring to investigate if it is even needed - spending time implementing a solution which in fact turns out to be a problem in itself. Anyway, with the pool gone and vectors allocated using the library that already allocates vectors, the performance problem was gone. But the string objects certainly weren't... How many of those did we have?

Half a gigabyte of strings?

Looking at the most recent dynamic space breakdown, we see we use 488MB for strings and 317MB for a whopping 19 million cons objects. Clearly there is more work to do - this system does a number of different jobs and there's nothing strange about it using some memory - but 19 million cons objects and almost three million strings, that's a lot. And this is on a "small" system; we can easily use 3-4G of dynamic space and the string and cons numbers scale.

This is actually where it gets interesting. I was forced to take a look at the actual data processing algorithm that I had implemented. The logic goes a bit like this:

process_diff A B: ;; load and sort the entries of the two containers obj_a = load_and_parse(A) obj_b = load_and_parse(B) sort_entries_in(obj_a) sort_entries_in(obj_b) ;; now compare the two ia = start_of(obj_a) ib = start_of(obj_b) while ia || ib if name_of(ia) < name_of(ib) added(name_of(ib))... else if name_of(ia) > name_of(ib) disappeared(name_of(ia))... else ...

So in order to compute the difference of one version of a folder to another version of a folder, we need to sort the children and iterate through them. Since we receive the data in these binary "blobs" and each actual entry in the blob is of variable length (depending, among other things, on the length of the name of the entry), I use a parser routine that generates an object (a struct) holding a more "palatable" representation of the very same data.

My struct would hold a list of child entries (again structs). Each child entry would have a name and a key/value store (a list of pairs of strings). So as an example, let's say I'm comparing two folders with 10.000 entries in each; each entry has five meta-data entries, so a child holds 5 (meta-data entryes) * 2 (key and value) + 1 (entry name) strings. Times 10k that's 110.000 strings I allocate both for the old and the new version. The entry lists are then efficiently sorted, and I can traverse and find the difference.

While this was simple to implement, it is also clearly extremely wasteful. And this has nothing to do with the choice of language. Had I implemented this in C++, a profiler would reveal that a significant portion of the run-time is spent in string allocation and string de-allocation. While an efficient C++ solution would only use a few times more memory than the source binary data due to the precise deallocation, my LISP system hurts more because the garbage collector fails to collect many of these strings in a timely fashion. But the algorithm I implemented is equally wasteful; in other words: What I did was silly in any language.

It actually wasn't rocket science to change this. I now use a struct that holds these members:

(defstruct ofslist "..." (buffer nil :type (or null (simple-array (unsigned-byte 8)))) (offset nil :type (or null (array fixnum))) ...)

The "buffer" is my binary blob from the object store. The "offset" is a list of integers (fixnums) that hold the offsets into the "buffer" at which the child entries start.

I can efficiently retrieve the name of an entry by supplying the "buffer" and the start offset of the entry; a simple macro to retrieve various fields of the binary data makes the code for this really elegant.

Sorting is simple too. I do not sort the entries in the buffer - that would be time consuming and very difficult (as the entries are of variable size). Instead, I sort the offset list; not by offset number of course, but by entry name of offset. It's as simple as:

(sort (ofslist-offset ret) (lambda (a b) (ofsentry< buffer a buffer b)))

With this change, I probably don't even need the static-vectors allocation. But now that I have it, I'll keep it - it's an improvement, even if it's not essential.

Memory profiling

An essential tool for me in finding the cause of the problem, and for identifying a few other routines that were unnecessarily heavy on allocating objects, was the statistical sampling profiler built into SBCL. A simple session looks like:

(sb-sprof:start-profiling :mode :alloc) ;; trigger your workload - let it run for a few minutes (sb-sprof:stop-profiling) (with-open-file (f "/tmp/prof.out" :direction :output :if-exists :supersede) (sb-sprof:report :stream f))

Finding the culprit

My expensive object parsing routine showed up in a profile report like this:

Count % Count % Callees ------------------------------------------------------------------------ 47 0.4 COM.KEEPIT.BSEARCH::OBJSEQ->CONTENTLIST [12] 2 0.0 S-SQL:TO-SQL-NAME [176] 39 0.3 S-SQL:SQL-ESCAPE-STRING [155] 193 1.5 FORMAT [25] 4253 33.4 SB-IMPL::STRINGIFY-OBJECT [75] 4534 35.7 4534 35.7 MAKE-STRING-OUTPUT-STREAM [1] ------------------------------------------------------------------------ 4253 33.4 PRINC-TO-STRING [76] 0 0.0 4253 33.4 SB-IMPL::STRINGIFY-OBJECT [75] 4253 33.4 MAKE-STRING-OUTPUT-STREAM [1] ------------------------------------------------------------------------ 4489 35.3 SB-FORMAT::FORMAT-PRINT-INTEGER [77] 0 0.0 4489 35.3 PRINC-TO-STRING [76] 236 1.9 GET-OUTPUT-STREAM-STRING [5] 4253 33.4 SB-IMPL::STRINGIFY-OBJECT [75] ------------------------------------------------------------------------ 2 0.0 LOCAL-TIME::%CONSTRUCT-TIMESTRING [189] 41 0.3 COM.KEEPIT.QUOTA::COUNT-ACCOUNT-RESOURCE-USAGE [39] 49 0.4 (LAMBDA ...) [127] 4408 34.7 COM.KEEPIT.BSEARCH::OBJSEQ->CONTENTLIST [12] 0 0.0 4500 35.4 SB-FORMAT::FORMAT-PRINT-INTEGER [77] 11 0.1 SB-IMPL::%WRITE-STRING [144] 4489 35.3 PRINC-TO-STRING [76]

It is visible here that 35.7% of all allocations are caused by make-output-stream; most of the calls to that are from stringify-object. That in turn is called by princ-to-string. That, again, by format-print-integer. And that, finally, gets called by the objseq->contentlist routine which was the name of my parsing routine. This trace is what led me to understand the cause of the actual problem.

Other simplifications

For example; I had a (sexp->xml ...) routine which converts a S-expression into a textual XML document. This routine was some of the first LISP I ever wrote, so it was built using some "fancy" constructs (a beginner learning the language) to recurse down into the expression and construct the document. However, looking at a profile running my search engine import revealed:

Count % Count % Callees ------------------------------------------------------------------------ 593 3.1 (LABELS COM.KEEPIT.XML:ELEMENT :IN COM.KEEPIT.XML:SEXP->XML) [7] 7 0.0 COM.KEEPIT.API:API-POST [90] 3 0.0 COM.KEEPIT.BSEARCH::RFC2822-HEADERS [45] 7511 38.7 REDUCE [92] 8 0.0 COM.KEEPIT.XML:SEXP-MERGESTRINGS [74] 1 0.0 SB-KERNEL:%CONCATENATE-TO-STRING [1] 8053 41.5 8122 41.8 SB-KERNEL:%CONCATENATE-TO-STRING [1] 1 0.0 SB-KERNEL:%CONCATENATE-TO-STRING [1]

In other words; 41.8% of all allocations in the system are done by concatenate, and in 38.7% of the total allocations, it's from a call from reduce. Investigation quickly reveals that my fancy XML document routine makes heavy use of reduce and concatenate.

Changing the code to a simpler version that uses a string output stream instead completely removed this routine from the top of the profile. Just like that, about 40% of all allocations done in the code were gone.

In retrospect...

Using the sampling profiler as a tool to find the source of allocations has proven an excellent tool to me. The ability to start such a profile on QA or even production systems (I have not needed this - but I could if I needed) is amazingly powerful. Profilers are not just for finding CPU bottlenecks, clearly.

It is common for me, and I know for many others, to want to "convert" data from one representation to another before we work on it. Like I did when I parsed the binary objects to a higher-level representation for sorting and comparison. Sometimes that makes sense - but not only does it take time to develop all this representation-conversion functionality (or you need to find and integrate libraries that do it for you); it takes time to run (every time), and your code may not even be simpler!

In this particular case, for comparing two 10k entry folders I would do 220k string allocations - my new implementation does 4 simple array allocations, and it is no harder to read and doesn't take up more lines of code than the old implementation.

I'm not saying abstraction is evil. I did indeed abstract things - for example, I developed a macro for efficiently extracting specific fields out of entries in the buffer. In this case, my abstraction is in code, instead of in the data structures. This is easier done in LISP with its powerful macro system than in other languages of course - but the same principle could be applied in any other language.

I'll try to think twice about where to put my abstractions the next time, that's for sure. Abstractions are good, to a point - but they don't always have to be in the data representation.

Jakob Østergaard Hegelund

A new-years KISS

The problem

The solution

Another solution

Something to take with us into the new year...

LISP memory woes, but not

System Outline

The problem

Early symptoms

Early solution

Displaced array performance

Half a gigabyte of strings?

Memory profiling

Finding the culprit

Other simplifications

In retrospect...

Categories

Archive

Links