High performance JSON parsing

2017-03-29

Imagine having JSON documents of varying structure from which you need to efficiently extract a small specific selection of values. Now imagine having millions of those documents, imagine the documents growing over time in size and complexity, and imagine having to parse them and extract the necessary data on a tight time and memory budget.

In this article I describe a solution to this problem wherein a language construct is developed that allows for precise specification of the desired data subset that is to be extracted. This language construct allows for an efficient scanner to be generated, tailored for the extraction specification, and for an extraction to be performed where only the actual extracted values are ever allocated and copied - any unneeded document element is skipped without ever being copied.

The problem

JSON documents are being used as containers for keeping somewhat structured information in more and more places. As with other hierarchical formats (XML comes to mind), it would be prudent to expect that the objects will grow in size and complexity over time as features are added to systems.

I work on a system that extracts certain key data from such JSON documents. While conceptually trivial, practical efficient extraction of specific values from countless numbers of larger and larger documents is a challenge that I do not see is currently well addressed in existing parser libraries. Sure, any library out there can extract values - but if my JSON document contains, say, 1000 strings and I need only one of them, most libraries out there will allocate 1000 strings for me and then let me easily extract the one that I need (and deallocate the rest). This is just not good enough.

A brief word about JSON

JSON is simply a human-readable textual representation of a tree structure. It does nothing that wasn't done a decade earlier with XML, or four decades before that with S-expressions (or with BER, ASN.1 or countless other formats in between). It is not particularly compact, not particularly readable, not more flexible and all in all does not have any particular quality that earns it a place above its predecessors. JSON looks like it does, only because it is a valid subset of JavaScript and since most of the world is not JavaScript, this particular quality would seem of questionable importance.

I have nothing in particular against JSON; it is, inarguably, a fairly unspectacular format that solves a simple job not much worse than many other formats. While its popularity eludes me, I am fully aware of its success - and whether I think people are smart in using the format or not is irrelevant, because I didn't make the objects I need to parse and they are JSON regardless of whether I think that is clever or not. I will simply deal with this in the best way possible.

Value extraction

Let's imagine we have an object with a few essential elements we need to extract, and countless elements we couldn't care less about. Something like:

{ "a0": 42, "creator": { "name": "John Doe", "email": "john@do.e" }, "a1": 43, "date": "2017-03-01T00:01:23Z", "more": "<html><body>Yes this really happened</body></html>", "a2": "foo", ... "a238": { "stuff": [ 1, 2, 3, 4 ] } }

If I chose to parse the document as an object named o in JavaScript, I could now address two particular fields I want to extract as; o.creator.name and o.date. But of course, if we just did that, all the other fields - including the potentially hundreds of o.a2-o.a238 fields would all be instantiated as strings, numbers, arrays and sub-objects in memory. And this is where the challenge is: How do we instantiate just what we need and nothing more?

Extraction definition

What we really need is a way in our language to specify the values to extract, in such a way, that only those specified will actually be extracted.

Could we imagine a library that could implement this in a way where the access to extracted data would be relatively type safe and efficient, while also allowing for a relatively fast parsing of the actual JSON document during extraction?

Conceptually, a JSON-like structure could be used to both specify the path to or location of the values to extract, and, to name a local variable that would hold said value if it was found:

{ "creator": { "name": name }, "date": date }

This JSON-like structure directly specifies the path of the two desired elements, o.creator.name and o.date, and it names two local variables name and date that could hold the values after the match.

I feel this is a pretty neat idea - it is trivial for a developer to use - imagine passing such a description to a JSON extraction library and simply getting two local variables instantiated holding the extracted values, with no additional values ever having been parsed or allocated. The library could simply skip anything that didn't match the "path" of the desired elements. If we could pull this off, we'd be on to something!

The solution

Luckily I'm writing this solution in an extensible language; so adding a language construct for such alternative variable instantiation is straight forward. And, as for notation, we can almost use JSON, we just need to remove a few unnecessary characters and use only one type of parenthesis. Basically, my solution looks like this:

(with-json document (("creator" ("name" name)) ("date" date)) ;; ... here we can work on name and date as they are local ;; variables initialized with the extracted values

This is a LISP macro which creates a new lexical scope in which the two desired variables are instantiated (and reasonably strongly typed); a recursive descent JSON parser is generated which works its way through the byte array document. The parser tracks its current "path" through the document, allowing it to efficiently know if a value is to be extracted or not. Only in the (rare) case that we are on a path desired for extraction, will the actual byte-array data from document be extracted, converted to a UTF-8 string, un-escaped from it's JSON representation, and referenced by the desired local named variable.

Some cheap tricks can even be employed: At compile time, the strings used in the path specifications are converted to their raw byte-array equivalents - in the common case this allows the parser to do raw byte comparisons on names when it tracks the current path, instead of UTF-8 encoding and un-escaping every single name that is encountered.

Performance

Since this extraction tool only ever allocates the values it is asked to extract, it very simply will outperform JSON parsers that parse the full JSON object to some internal representation - there is nothing surprising or impressive about this - by doing less it performs better.

Any comparison to another parser that does more work is therefore also unfair, if used as a head-to-head comparison, because it is not. Still, to prove a point, that taking a more efficient approach can make sense, I ran a few tests where I pass a 2600 byte JSON document to my own parser, and to the established cl-json library (which is a fine library, please don't get me wrong here).

To get some sort of consistency, I time a loop where I parse the object 1000 times. I run these loops a few times and pick a representative result. What I get is:

(time (loop for i from 0 below 1000 summing (with-json doc ((("cannot-match-this" not-here) ("also-cannot-match" ("this" ("here" not-here-too))))) (+ (length not-here) (length not-here-too)))))) Evaluation took: 0.103 seconds of real time 0.103601 seconds of total run time (0.103006 user, 0.000595 system) 100.97% CPU 269,154,386 processor cycles 400 bytes consed

The 400 bytes allocated varies from 0 to about 90k, with the typical timing runs giving below a few thousand. The generic parser will of course give quite different allocation numbers:

(time (loop for i from 0 below 1000 summing (let ((j (json:decode-json-from-string doc))) (length j)))) Evaluation took: 0.314 seconds of real time 0.324294 seconds of total run time (0.298846 user, 0.025448 system) [ Run times consist of 0.007 seconds GC time, and 0.318 seconds non-GC time. ] 103.18% CPU 818,330,418 processor cycles 103,429,040 bytes consed

For this code, the allocation numbers are very consistent, and not surprisingly completely different. Changing the simpler scanner so it actually matches the two string values will increase the allocations:

(time (loop for i from 0 below 1000 summing (with-json doc ((("can-match-this" here) ("can-match" ("this" ("too" too))))) (+ (length here) (length too)))))) Evaluation took: 0.107 seconds of real time 0.107268 seconds of total run time (0.106739 user, 0.000529 system) 100.00% CPU 278,550,641 processor cycles 915,728 bytes consed

In this example I actually end up with two local variables that hold the values I was interested in. Using the generic library, I would still need to search the parsed representation to actually find the parsed values - so while this example uses only 30% of the CPU time of the more general library and does less than 1% of the allocations, it actually gets more done of what I need.

Again, please don't take this as a mocking of the excellent cl-json library; what I am doing here is very different and understandably shows very different characteristics.

The future

The library has a nice set of features, for a first version. It allows for specification of default values, type-specific parsers for values and other such features - but it also cuts some corners. There are types of data that cannot be extracted and there are names that cannot be matched. There are optimizations that were not done and there are compile-time sanity checks that could be added.

As time permits, I hope to iron out those wrinkles and work towards publishing this parsing tool under a reasonably free license. We shall see what happens...

Jakob Østergaard Hegelund