Jakob Østergaard Hegelund

Tech stuff of all kinds

High performance JSON parsing

2017-03-29

Imagine having JSON documents of varying structure from which you need to efficiently extract a small specific selection of values. Now imagine having millions of those documents, imagine the documents growing over time in size and complexity, and imagine having to parse them and extract the necessary data on a tight time and memory budget.

In this article I describe a solution to this problem wherein a language construct is developed that allows for precise specification of the desired data subset that is to be extracted. This language construct allows for an efficient scanner to be generated, tailored for the extraction specification, and for an extraction to be performed where only the actual extracted values are ever allocated and copied - any unneeded document element is skipped without ever being copied.

The problem

JSON documents are being used as containers for keeping somewhat structured information in more and more places. As with other hierarchical formats (XML comes to mind), it would be prudent to expect that the objects will grow in size and complexity over time as features are added to systems.

I work on a system that extracts certain key data from such JSON documents. While conceptually trivial, practical efficient extraction of specific values from countless numbers of larger and larger documents is a challenge that I do not see is currently well addressed in existing parser libraries. Sure, any library out there can extract values - but if my JSON document contains, say, 1000 strings and I need only one of them, most libraries out there will allocate 1000 strings for me and then let me easily extract the one that I need (and deallocate the rest). This is just not good enough.

A brief word about JSON

JSON is simply a human-readable textual representation of a tree structure. It does nothing that wasn't done a decade earlier with XML, or four decades before that with S-expressions (or with BER, ASN.1 or countless other formats in between). It is not particularly compact, not particularly readable, not more flexible and all in all does not have any particular quality that earns it a place above its predecessors. JSON looks like it does, only because it is a valid subset of JavaScript and since most of the world is not JavaScript, this particular quality would seem of questionable importance.

I have nothing in particular against JSON; it is, inarguably, a fairly unspectacular format that solves a simple job not much worse than many other formats. While its popularity eludes me, I am fully aware of its success - and whether I think people are smart in using the format or not is irrelevant, because I didn't make the objects I need to parse and they are JSON regardless of whether I think that is clever or not. I will simply deal with this in the best way possible.

Value extraction

Let's imagine we have an object with a few essential elements we need to extract, and countless elements we couldn't care less about. Something like:

{ "a0": 42, "creator": { "name": "John Doe", "email": "john@do.e" }, "a1": 43, "date": "2017-03-01T00:01:23Z", "more": "<html><body>Yes this really happened</body></html>", "a2": "foo", ... "a238": { "stuff": [ 1, 2, 3, 4 ] } }

If I chose to parse the document as an object named o in JavaScript, I could now address two particular fields I want to extract as; o.creator.name and o.date. But of course, if we just did that, all the other fields - including the potentially hundreds of o.a2-o.a238 fields would all be instantiated as strings, numbers, arrays and sub-objects in memory. And this is where the challenge is: How do we instantiate just what we need and nothing more?

Extraction definition

What we really need is a way in our language to specify the values to extract, in such a way, that only those specified will actually be extracted.

Could we imagine a library that could implement this in a way where the access to extracted data would be relatively type safe and efficient, while also allowing for a relatively fast parsing of the actual JSON document during extraction?

Conceptually, a JSON-like structure could be used to both specify the path to or location of the values to extract, and, to name a local variable that would hold said value if it was found:

{ "creator": { "name": name }, "date": date }

This JSON-like structure directly specifies the path of the two desired elements, o.creator.name and o.date, and it names two local variables name and date that could hold the values after the match.

I feel this is a pretty neat idea - it is trivial for a developer to use - imagine passing such a description to a JSON extraction library and simply getting two local variables instantiated holding the extracted values, with no additional values ever having been parsed or allocated. The library could simply skip anything that didn't match the "path" of the desired elements. If we could pull this off, we'd be on to something!

The solution

Luckily I'm writing this solution in an extensible language; so adding a language construct for such alternative variable instantiation is straight forward. And, as for notation, we can almost use JSON, we just need to remove a few unnecessary characters and use only one type of parenthesis. Basically, my solution looks like this:

(with-json document (("creator" ("name" name)) ("date" date)) ;; ... here we can work on name and date as they are local ;; variables initialized with the extracted values

This is a LISP macro which creates a new lexical scope in which the two desired variables are instantiated (and reasonably strongly typed); a recursive descent JSON parser is generated which works its way through the byte array document. The parser tracks its current "path" through the document, allowing it to efficiently know if a value is to be extracted or not. Only in the (rare) case that we are on a path desired for extraction, will the actual byte-array data from document be extracted, converted to a UTF-8 string, un-escaped from it's JSON representation, and referenced by the desired local named variable.

Some cheap tricks can even be employed: At compile time, the strings used in the path specifications are converted to their raw byte-array equivalents - in the common case this allows the parser to do raw byte comparisons on names when it tracks the current path, instead of UTF-8 encoding and un-escaping every single name that is encountered.

Performance

Since this extraction tool only ever allocates the values it is asked to extract, it very simply will outperform JSON parsers that parse the full JSON object to some internal representation - there is nothing surprising or impressive about this - by doing less it performs better.

Any comparison to another parser that does more work is therefore also unfair, if used as a head-to-head comparison, because it is not. Still, to prove a point, that taking a more efficient approach can make sense, I ran a few tests where I pass a 2600 byte JSON document to my own parser, and to the established cl-json library (which is a fine library, please don't get me wrong here).

To get some sort of consistency, I time a loop where I parse the object 1000 times. I run these loops a few times and pick a representative result. What I get is:

(time (loop for i from 0 below 1000 summing (with-json doc ((("cannot-match-this" not-here) ("also-cannot-match" ("this" ("here" not-here-too))))) (+ (length not-here) (length not-here-too)))))) Evaluation took: 0.103 seconds of real time 0.103601 seconds of total run time (0.103006 user, 0.000595 system) 100.97% CPU 269,154,386 processor cycles 400 bytes consed

The 400 bytes allocated varies from 0 to about 90k, with the typical timing runs giving below a few thousand. The generic parser will of course give quite different allocation numbers:

(time (loop for i from 0 below 1000 summing (let ((j (json:decode-json-from-string doc))) (length j)))) Evaluation took: 0.314 seconds of real time 0.324294 seconds of total run time (0.298846 user, 0.025448 system) [ Run times consist of 0.007 seconds GC time, and 0.318 seconds non-GC time. ] 103.18% CPU 818,330,418 processor cycles 103,429,040 bytes consed

For this code, the allocation numbers are very consistent, and not surprisingly completely different. Changing the simpler scanner so it actually matches the two string values will increase the allocations:

(time (loop for i from 0 below 1000 summing (with-json doc ((("can-match-this" here) ("can-match" ("this" ("too" too))))) (+ (length here) (length too)))))) Evaluation took: 0.107 seconds of real time 0.107268 seconds of total run time (0.106739 user, 0.000529 system) 100.00% CPU 278,550,641 processor cycles 915,728 bytes consed

In this example I actually end up with two local variables that hold the values I was interested in. Using the generic library, I would still need to search the parsed representation to actually find the parsed values - so while this example uses only 30% of the CPU time of the more general library and does less than 1% of the allocations, it actually gets more done of what I need.

Again, please don't take this as a mocking of the excellent cl-json library; what I am doing here is very different and understandably shows very different characteristics.

The future

The library has a nice set of features, for a first version. It allows for specification of default values, type-specific parsers for values and other such features - but it also cuts some corners. There are types of data that cannot be extracted and there are names that cannot be matched. There are optimizations that were not done and there are compile-time sanity checks that could be added.

As time permits, I hope to iron out those wrinkles and work towards publishing this parsing tool under a reasonably free license. We shall see what happens...

The rocket science of VPN

2017-03-16

The phasing out of older systems and introduction of new ones landed me in a situation where I needed to configure a VPN service for myself and a group of colleagues.

Requirements were something like:

You'd think this was pretty basic stuff right? After all, VPNs have been around for decades - the newest technologies mentioned here are iOS and Android, none of those are exactly experimental untried platforms either.

Well it turns out, this sort of technology is completely unheard of, unrealistic to implement and I am apparently decades ahead of my time, thinking something like this would be nice to have

Goodbye: most protocols

No single protocol, that I want to use, is supported natively (built-in) across all the devices and systems I need.

PlatformIPSecL2TP/IPSecPPTPIKEv1IKEv2
OSXYesYesNoNoYes
iOSYesYesNoNoYes
Ubuntu/DebianYesYesYesYesYes
AndroidNoYesYesNoNo
WindowsNoYesYesNoYes

I specifically do not want Layer-2 tunneling and therefore I will disregard L2TP/IPSec. This pretty much leaves me with IKEv2 as the only choice - it may require an app on Android devices, but I think we will survive that.

An alternative approach would be to accept that no built-in protocol is "good enough" and then go full third-party; OpenVPN or some other solution. I dislike this approach because I don't like to force installs of special software for basic networking functionality onto all clients - I don't have any hard evidence saying this would not work or this would be bad for some specific reason, I just don't like it. And since I'm the one tasked with this, what I like and don't like matters.

I'm going to set up an IKEv2 provider using StrongSwan running on Ubuntu, and that's that

Goodbye: Kerberos

Windows, in the old days before Kerberos (or "Active Directory" as the marketing term is for Kerberos and LDAP on Windows), invented various more or less safe password exchange schemes so that a client could authenticate against a server without transmitting too much information about the password onto the network for an attacker to be able to derive the password. One such protocol is MSCHAPv2. The MSCHAPv2 protocol requires that the server has a specially hashed version of the user's password on file, or simply the full plain text password.

After Windows adopted Kerberos in 1999 and released it with Windows 2000, there has not been a need for protocols like MSCHAPv2 since the Kerberos protocol not only solves the same problem as MSCHAPv2 better, but also provides functionality such as actual Single-Sign-On which is completely out of the scope of MSCHAPv2.

You might think, then, that 17 years after that, it would be possible to have a Windows VPN client authenticate against a Kerberos infrastructure - well if you thought so you'd be wrong. Windows servers only support MSCHAPv2 as user/password authentication protocol for IKEv2. And, to be consistent, OSX does exactly the same.

If you run your VPN server on Windows, your VPN clients can validate up against their "AD" passwords, because the AD will keep the necessary extra password material on file - this is not Kerberos authentication, it is not single-sign-on, it is a crude hack that allows the use of arcane authentication protocols against what could have been a modern authentication system. Well, I don't run my VPN servers on Windows, so this is not directly an option.

Goodbye: Workaround with KCRAP

The MIT Kerberos KDC actually supports the same hash function as is necessary for the MSCHAPv2 exchange; you could actually configure your KDC to store these hashes alongside the other information it stores.

On top of that, you would need the KCRAP service which allows the extraction of these hashes from your KDC database.

On top of that, you would need a FreeRADIUS server, which could support an MSCHAPv2 exchange using the KCRAP service as its back-end

And finally you would need the StrongSwan VPN provider to talk to FreeRADIUS which can talk to KCRAP and then all should work in theory...

This is too much. There is an excellent description of how to configure this here, but it is too much for me. Too many moving parts. Too much highly experimental software. Too many things that I will need to support entirely on my own, patch with every upgrade of other components, etc.

If StrongSwan had a KCRAP module and if the KCRAP server was an integrated part of the KDC, I might go this route. But with the current state of things, this is not something I want to bother with in a production setup - I have a job to do besides this.

Goodbye: Text files with user passwords in the clear

The simple way to get this working, is to set up every user with their clear text password in the /etc/ipsec.secrets file. Most guides on the interwebs describe this (you could use a database or other store, but it would still basically be plain text passwords on file) - and I got it working. But hang on... This won't fly for several reasons:

I need to find a way to keep user credentials on the server that does not involve storing them in clear text and requiring me to manage them manually.

Goodbye: TOTP and 2FA

One option I considered was using one-time passwords. If I can't use my Kerberos infrastructure, at least I could set up a TOTP infrastructure and the users would then have either a HW token or an app on their smartphone which could provide them with a password.

The obvious benefit here is that the server would always know the current clear text password (it is easy to generate), so MSCHAPv2 could work against this. And the OATH framework has been around for ages, so there must be good support for this, right? Yeah well you'd think that, but then you'd be wrong. Again.

First of all; in a TOTP setup it's nice to allow a little bit of slack so that the user can use an OTP that just expired - for usability and to account for user's device's clocks being a little off - this is completely incompatible with MSCHAPv2 since the hash of the "only correct" password is used in the challenge the server sends to the client. It just can't work.

Second, what I really wanted was proper two-factor authentication - I would like the user to log on with a TOTP, and then provide their username and password (or the other way around, I'm flexible). While IKEv2 does support multiple rounds of authentication, the Windows client does not. Well I guess we won't be going down this route either then.

Hello: Certificates

Left with the prospect of using "large machine-generated passwords" as the best authentication for my users, it suddenly becomes relevant to consider simple Certificate based authentication.

Passwords alone are bad because they get leaked or even guessed if they're really poor; large machine generated passwords don't have the problem of being guessed. But if they are large, they will certainly be written down (anyone saying "we have a policy that says users shouldn't do that" has completely lost the connection to the real world in my not so humble opinion) - so they will be leaked. This is a compromise.

Certificates is really a way of dealing with this: A certificate holds a key that is so large it will not be guessed, and the certificate file is the accepted method of transportation of this key. Therefore, while it may be copied around, it will not get written down or left on a sticky-note on someone's desk. Certificates also have a lifetime built right into them.

The only downside I really see with certificates is that I need to maintain a Certificate Revocation List (CRL) for certificates that have been used on devices that are lost. If we used passwords or pre-shared keys, it would be simple to delete the key from the VPN server. But really, mechanisms for this are established and - almost to my surprise at this point - well integrated.

So actually - at the start of this process, messing with certificates was the last thing I wanted to do - but as each desirable option proved unworkable, certificates started looking less unattractive. At this point, maintaining our own public key infrastructure does not seem all that unreasonable after all.

The PKI setup

Losely based on the guidelines at the excellent StrongSwan wiki, we set up a self-signed Certificate Authority certificate. From here on, it was pretty simple to create certificates for the VPN servers and individual certificates for all the users.

The StrongSwan server needs no reconfiguration as users are added - any user who can present a (non-revoked) certificate signed by our CA is allowed access.

Users receive a PKCS12 file containing their user specific certificate and the private key for that certificate. A machine generated password is set on the file (not stored with the file) and given to the user to use for certificate import.

Users then simply need to install our CA certificate and their own personal certificate - with these bits our VPN gateway can authenticate the user and the user can authenticate the VPN gateway.

Weaknesses

Clearly, if a certificate is leaked, just like when a password is leaked, access is granted to people you did not intend to grant access to. Certificates are no different from passwords in this regard.

The only real solution to this problem, as it stands today, seems to be two-factor authentication. While IKEv2 does support multiple authentication rounds, the Windows IKEv2 client does not support this part of the standard. There appears to be no way to implement 2FA on IKEv2, if it is a requirement to use the built-in VPN clients, at least on Windows today (and I simply didn't investigate the other platforms so Windows may not be alone in lacking this support for an otherwise standard feature in IKEv2).

To mitigate this, we use a relatively short (less than one year) lifetime on the certificates. While this is not much use if a device is stolen and abused right away, it will at least mitigate the problem with certificates on old or unused office computers that are later recycled internally or even discarded without being wipe.

A note on certificate file formats

While there are heaps of helpful pages on the interwebs detailing how to convert your DER format certificates to PEM or PFX or PKCS12, there is very little easily digestable information about why these formats exist and what they are all good for.

The one major pitfall to be aware of, is, that some of these files are "container" formats that are used to ship chains of certificates or even certificates along with private keys. Other formats will just hold your certificate.

While I may not want to admit this actually happened to me during the early experiments with the certificate infrastructure, imagine what happens when you convert your CA certificate into a PFX file (to make it easier to import on a Windows machine) and then just because the conversion example you found on the net told you to, you also provide the private key of the certificate... Well now you have a PFX file holding your CA Certificate and the private key for the certificate... Ouch! Not something you would want to distribute to your clients (as anyone with the private key of the CA can generate new certificates in your name).

A short rundown of the formats I encountered and their uses:

FormatHolds
PEMCertificate or key; text format
PKCS12Certificate(s) and private key(s) - a bundle for distributing a user certificate and its key to a user, for example
DERCertificate or key; binary format
PFXCertificate and optionally a key - also a bundle format

We use PEM format on the VPN server and distribute PKCS12 files to the users. The PKCS12 files have the extra little benefit of being able to require a password for import - so you can distribute the PKCS12 file by one mechanism and give the user the import password through another mechanism.

Just as you thought you were done...

So I have a Windows workstation, my Macbook and my iPhone connecting via VPN and it is all good! Except... nothing really works. DNS doesn't resolve any of the internal hosts.

While IKEv2 does indeed support split tunnelling, split horizon DNS was not part of the standard... It is being worked on, and it is an IETF draft as of this writing, but it is not in IKEv2. In an older IPSec setup (which could not be brought to work with Windows unfortunately) we used an internal DNS for the internal domains and the client would use "whatever other" DNS it was using for everything else. That is simply not a possibility with IKEv2 today.

Well, the quick solution to that is to use our internal DNS for everything - it can handle the load no worries, and it will forward requests to external DNS just fine. So that solves it right? Well you might think so, but if you thought so you'd be wrong, yet again.

It turns out that OSX and iOS both ignore the DNS information pushed from the IKEv2 server, if split networking is used. However, if you simply don't use split networking, your OSX and iOS devices will happily use the IKEv2 server provided DNS. Well, I suppose our gateways can forward traffic to the outside as well then...

Split DNS on iOS

When I say split DNS isn't possible on iOS it's not entirely true. Apple has a tool dubbed "Configurator 2" which can configure "Policies" for your iOS devices. If you generate such a policy, include your user certificate, define the IKEv2 VPN setup in the policy and then store the policy, this tool will generate a rather nasty looking XML file for you

What you do next, is edit this profile XML document and insert some extra DNS entries that you cannot insert using the Configurator tool. Now when you import this profile on your iOS device, it will import the certificate, set up the VPN, and supposedly correctly apply split DNS configuration as desired

This, of course, is completely unmanageable for me - all I need is VPN access - I don't want to have to hire a full time employee to configure our iOS devices in all eternity. I don't know how Windows worked with DNS and split tunnelling; it doesn't matter since iOS and OSX dictate that we won't be doing split tunnelling any time soon.

Debian and Ubuntu client support

Finally, as we got OSX, Windows and iOS users connecting, setting this up on Linux proved to be the next obstacle. Frankly I had delayed testing this because the servers run StrongSwan on Linux and I "knew" this would just work. Well, at least you would think so right, but then you would be... yes, wrong.

While it is of course possible to manually edit the config files and make StrongSwan connect as client up against the VPN servers, the graphical "Network Manager" tool that modern day Linux users have come to expect, simply doesn't work for configuring an IKEv2 VPN

On Debian 8, there's just no support. It simply isn't there. On Ubuntu 16.04LTS, it is there but it doesn't work. There is a bug in the network manager plugin version that is on Ubuntu 16.04 (Xenial) which is fixed in a later Ubuntu (Zesty) - but it appears it is not getting fixed in 16.04LTS. One can only hope for a fixed plugin in the next .minor release on the LTS.

In conclusion

I am absolutely amazed at how little integration there is in the products that make up a VPN, authentication and authorization infrastructure. While I was of course expecting a challenge or two, I am absolutely flabbergasted at how close to "impossible" something as simple as a relatively secure VPN could be to set up so that it works out of the box on a selection of common modern clients.

That we, 17 years after MSCHAPv2 ceased to be relevant, are still forced to base a mostly non-Windows infrastructure around it, is shocking.

Anyway I am moderately happy with having finally gotten a reasonably secure VPN going to take over from an older much less secure VPN which was not supporting the clients we needed. I would have liked 2FA and Kerberos, but that is apparently not realistic today and maybe never will be (if no-one has gotten it in the past 17 years, I am not holding my breath for the next 17).

Onwards, to glory! Back to real work. With some basic infrastructure in place, chances are we can now focus on real work.

A new-years KISS

2016-12-30

Anyone who says "less is more" is of course high, stupid, or both. Less is not more - but the message intended in that common cliche is that "less is better", which is certainly the case more often than not.

The problem

Without going into details... A developer is faced with the problem of generating a string like foo=42&bar=more&baz, given the following set of mappings:

KeyValue
foo42
barmore
baz

The diligent reader will recognize the string as the options part of a URI, and can also safely infer that the table of key/value mappings is indeed held in an STL map<string,string> structure.

The solution

During some early spring cleaning here before the very end of 2016, I stumbled across the code for this solution - it works and the code did not cause problems; but it is heavily depending on the boost library which I am trying to get rid of to the extent possible (and that would probably be the subject of an article on itself - suffice to say I'm trying to cut down on dependencies).

namespace { struct to_string { typedef std::string result_type; const result_type operator()(const HTTPOptions::options_t::value_type &pair) const { if (pair.second.empty()) { return pair.first; } return pair.first + std::string("=") + pair.second; } }; } const std::string HTTPOptions::to_uri() const { return boost::algorithm::join( m_options | boost::adaptors::transformed(to_string()) , "&"); }

The above is the code as I found it. It uses a boost algorithm to "join" the options with ampersands, and declares a struct to be able to apply an operator() using a transformer, to convert the pairs in the map to strings of the "key=value" form.

It is time to take a step back... Why is it we need all this? What is the actual job that we are performing? Take a moment to think about it... Given the mapping of strings, how would you imagine that you could construct the resulting string?

Another solution

I deleted the includes of boost headers and deleted the code above. Instead, I wrote this little bit:

std::string HTTPOptions::to_uri() const { std::ostringstream str; for (auto i = m_options.begin(); i != m_options.end(); ++i) str << (i == m_options.begin() ? "" : "&") << i->first << (i->second.empty() ? "" : "=") << i->second; return str.str(); }

This is a pretty straight forward solution really. It iterates though the mapping and simply inserts the string pairs one by one. No operator|() to apply a transformer to a map, no structs with operators that construct new strings based on the old ones, and no high-level transformers to abstract away the nitty gritty details.

I am not against abstractions. But we also have to consider the reality of things here:

Old solutionNew solution
Size20 lines10 lines
DependenciesBoost, STLSTL
Allocations2n + log nlog n

The allocations metric is a relevant performance metric; basically, when we perform an operator+ on two strings, we construct a new string using the two old (without modifying the two old). This costs us an allocation. Therefore, for every call to to_string::operator() we perform two new allocations. The boost join method will use insert to append each pair to the resulting string, and I think it is fair to assume that std::string is smart enough to incrementally grow its underlying buffer thereby giving us around log(n) allocations for adding n pairs to the string.

The new solution however uses a ostringstream which, like the std::string will perform something like log(n) allocations in order to append n strings to the result. No temporary strings are created during the operation of the new solution - therefore the number of allocations is negligible compared to the original solution.

Last but not least, the shorter solution is trivial. It is obvious what it does and it is obvious how it does it. It is simply more readable, not just because it is fewer lines, but because of its flow.

Something to take with us into the new year...

The morale of the story is: When you are implementing a solution to a conceptually very simple problem, and you find yourself writing a lot of code using clever and complex libraries and constructs, you need to take a step back.

For every simple problem, there is a large, complicated and expensive technological solution.

But it doesn't have to be like that. We can do better. Let's make 2017 a year where we do better. Happy new year everyone!