High performance JSON parsing
Imagine having JSON documents of varying structure from which you need to efficiently extract a small specific selection of values. Now imagine having millions of those documents, imagine the documents growing over time in size and complexity, and imagine having to parse them and extract the necessary data on a tight time and memory budget.
In this article I describe a solution to this problem wherein a language construct is developed that allows for precise specification of the desired data subset that is to be extracted. This language construct allows for an efficient scanner to be generated, tailored for the extraction specification, and for an extraction to be performed where only the actual extracted values are ever allocated and copied - any unneeded document element is skipped without ever being copied.
JSON documents are being used as containers for keeping somewhat structured information in more and more places. As with other hierarchical formats (XML comes to mind), it would be prudent to expect that the objects will grow in size and complexity over time as features are added to systems.
I work on a system that extracts certain key data from such JSON documents. While conceptually trivial, practical efficient extraction of specific values from countless numbers of larger and larger documents is a challenge that I do not see is currently well addressed in existing parser libraries. Sure, any library out there can extract values - but if my JSON document contains, say, 1000 strings and I need only one of them, most libraries out there will allocate 1000 strings for me and then let me easily extract the one that I need (and deallocate the rest). This is just not good enough.
A brief word about JSON
I have nothing in particular against JSON; it is, inarguably, a fairly unspectacular format that solves a simple job not much worse than many other formats. While its popularity eludes me, I am fully aware of its success - and whether I think people are smart in using the format or not is irrelevant, because I didn't make the objects I need to parse and they are JSON regardless of whether I think that is clever or not. I will simply deal with this in the best way possible.
Let's imagine we have an object with a few essential elements we need to extract, and countless elements we couldn't care less about. Something like:
What we really need is a way in our language to specify the values to extract, in such a way, that only those specified will actually be extracted.
Could we imagine a library that could implement this in a way where the access to extracted data would be relatively type safe and efficient, while also allowing for a relatively fast parsing of the actual JSON document during extraction?
Conceptually, a JSON-like structure could be used to both specify the path to or location of the values to extract, and, to name a local variable that would hold said value if it was found:
This JSON-like structure directly specifies the path of the two desired elements, o.creator.name and o.date, and it names two local variables name and date that could hold the values after the match.
I feel this is a pretty neat idea - it is trivial for a developer to use - imagine passing such a description to a JSON extraction library and simply getting two local variables instantiated holding the extracted values, with no additional values ever having been parsed or allocated. The library could simply skip anything that didn't match the "path" of the desired elements. If we could pull this off, we'd be on to something!
Luckily I'm writing this solution in an extensible language; so adding a language construct for such alternative variable instantiation is straight forward. And, as for notation, we can almost use JSON, we just need to remove a few unnecessary characters and use only one type of parenthesis. Basically, my solution looks like this:
This is a LISP macro which creates a new lexical scope in which the two desired variables are instantiated (and reasonably strongly typed); a recursive descent JSON parser is generated which works its way through the byte array document. The parser tracks its current "path" through the document, allowing it to efficiently know if a value is to be extracted or not. Only in the (rare) case that we are on a path desired for extraction, will the actual byte-array data from document be extracted, converted to a UTF-8 string, un-escaped from it's JSON representation, and referenced by the desired local named variable.
Some cheap tricks can even be employed: At compile time, the strings used in the path specifications are converted to their raw byte-array equivalents - in the common case this allows the parser to do raw byte comparisons on names when it tracks the current path, instead of UTF-8 encoding and un-escaping every single name that is encountered.
Since this extraction tool only ever allocates the values it is asked to extract, it very simply will outperform JSON parsers that parse the full JSON object to some internal representation - there is nothing surprising or impressive about this - by doing less it performs better.
Any comparison to another parser that does more work is therefore also unfair, if used as a head-to-head comparison, because it is not. Still, to prove a point, that taking a more efficient approach can make sense, I ran a few tests where I pass a 2600 byte JSON document to my own parser, and to the established cl-json library (which is a fine library, please don't get me wrong here).
To get some sort of consistency, I time a loop where I parse the object 1000 times. I run these loops a few times and pick a representative result. What I get is:
The 400 bytes allocated varies from 0 to about 90k, with the typical timing runs giving below a few thousand. The generic parser will of course give quite different allocation numbers:
For this code, the allocation numbers are very consistent, and not surprisingly completely different. Changing the simpler scanner so it actually matches the two string values will increase the allocations:
In this example I actually end up with two local variables that hold the values I was interested in. Using the generic library, I would still need to search the parsed representation to actually find the parsed values - so while this example uses only 30% of the CPU time of the more general library and does less than 1% of the allocations, it actually gets more done of what I need.
Again, please don't take this as a mocking of the excellent cl-json library; what I am doing here is very different and understandably shows very different characteristics.
The library has a nice set of features, for a first version. It allows for specification of default values, type-specific parsers for values and other such features - but it also cuts some corners. There are types of data that cannot be extracted and there are names that cannot be matched. There are optimizations that were not done and there are compile-time sanity checks that could be added.
As time permits, I hope to iron out those wrinkles and work towards publishing this parsing tool under a reasonably free license. We shall see what happens...