The phasing out of older systems and introduction of new ones landed me in a situation where I needed to configure a VPN service for myself and a group of colleagues.
Requirements were something like:
- Native client support on Windows, OSX, Linux and preferably iOS and Android
- Modern secure protocol
- Split tunnel support; so work-related traffic goes through the VPN and unrelated traffic goes directly through whatever network the user is on
- Can authenticate against existing Kerberos infrastructure
- Can authorize based on existing LDAP infrastructure
- Possibly two-factor authentication support using hardware tokens or an app
Well, it turns out this sort of technology is completely unheard of, unrealistic to implement, and I am apparently decades ahead of my time for thinking something like this would be nice to have.
Goodbye: most protocols
No single protocol that I want to use is supported natively (built-in) across all the devices and systems I need.
I specifically do not want Layer-2 tunneling and therefore I will disregard L2TP/IPSec. This pretty much leaves me with IKEv2 as the only choice - it may require an app on Android devices, but I think we will survive that.
An alternative approach would be to accept that no built-in protocol is "good enough" and then go full third-party; OpenVPN or some other solution. I dislike this approach because I don't like to force installs of special software for basic networking functionality onto all clients - I don't have any hard evidence saying this would not work or this would be bad for some specific reason, I just don't like it. And since I'm the one tasked with this, what I like and don't like matters.
I'm going to set up an IKEv2 provider using StrongSwan running on Ubuntu, and that's that.
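As a rough illustration, a minimal IKEv2 gateway in StrongSwan's classic ipsec.conf format looks something like the following sketch - all names and addresses here are made up, and the authentication settings are the part of the story that kept changing:

```
# /etc/ipsec.conf -- minimal IKEv2 gateway sketch (illustrative values only)
conn ikev2-vpn
    keyexchange=ikev2
    left=%any
    leftid=@vpn.example.com
    leftsubnet=10.0.0.0/8        # what we hope to route through the tunnel
    right=%any
    rightsourceip=10.99.0.0/24   # virtual addresses handed out to clients
    auto=add
```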
Windows, in the old days before Kerberos (or "Active Directory", as the marketing term goes for Kerberos and LDAP on Windows), invented various more or less safe password exchange schemes so that a client could authenticate against a server without transmitting so much information about the password onto the network that an attacker could derive the password. One such protocol is MSCHAPv2. The MSCHAPv2 protocol requires that the server has a specially hashed version of the user's password on file, or simply the full plain-text password.
After Windows adopted Kerberos in 1999 and released it with Windows 2000, there has been no need for protocols like MSCHAPv2, since the Kerberos protocol not only solves the same problem as MSCHAPv2 better, but also provides functionality such as actual single sign-on, which is completely out of the scope of MSCHAPv2.
You might think, then, that 17 years after that, it would be possible to have a Windows VPN client authenticate against a Kerberos infrastructure - well, if you thought so, you'd be wrong. Windows servers only support MSCHAPv2 as the user/password authentication protocol for IKEv2. And, to be consistent, OSX does exactly the same.
If you run your VPN server on Windows, your VPN clients can validate against their "AD" passwords, because the AD will keep the necessary extra password material on file - this is not Kerberos authentication, it is not single sign-on, it is a crude hack that allows the use of arcane authentication protocols against what could have been a modern authentication system. Well, I don't run my VPN servers on Windows, so this is not directly an option.
Goodbye: Workaround with KCRAP
The MIT Kerberos KDC actually supports the same hash function as is necessary for the MSCHAPv2 exchange; you could actually configure your KDC to store these hashes alongside the other information it stores.
On top of that, you would need the KCRAP service which allows the extraction of these hashes from your KDC database.
On top of that, you would need a FreeRADIUS server, which could support an MSCHAPv2 exchange using the KCRAP service as its back-end.
And finally you would need the StrongSwan VPN provider to talk to FreeRADIUS which can talk to KCRAP and then all should work in theory...
This is too much. There is an excellent description of how to configure this here, but it is too much for me. Too many moving parts. Too much highly experimental software. Too many things that I will need to support entirely on my own, patch with every upgrade of other components, etc.
If StrongSwan had a KCRAP module and if the KCRAP server was an integrated part of the KDC, I might go this route. But with the current state of things, this is not something I want to bother with in a production setup - I have a job to do besides this.
Goodbye: Text files with user passwords in the clear
The simple way to get this working is to set up every user with their clear text password in the /etc/ipsec.secrets file. Most guides on the interwebs describe this (you could use a database or other store, but it would still basically be plain text passwords on file) - and I got it working. But hang on... this won't fly, for several reasons:
- Users can't change credentials then - passwords will be forever valid
- It is truly nasty to have clear text passwords in a big flat file that will regularly need editing (users come and go)
- I don't want to manage this. It's unnecessary overhead - a waste of my time - Kerberos manages user credentials, I don't
- These passwords are essentially "pre-shared keys" then; only shorter. With that comes all the problems of pre-shared keys, only faster
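For reference, the approach those guides describe amounts to entries like these in /etc/ipsec.secrets - users and passwords here are hypothetical, and this is exactly what I want to avoid:

```
# /etc/ipsec.secrets -- plain text EAP secrets (hypothetical entries)
alice : EAP "s3cret-password"
bob   : EAP "other-password"
```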
I need to find a way to keep user credentials on the server that does not involve storing them in clear text and requiring me to manage them manually.
Goodbye: TOTP and 2FA
One option I considered was using one-time passwords. If I can't use my Kerberos infrastructure, at least I could set up a TOTP infrastructure and the users would then have either a HW token or an app on their smartphone which could provide them with a password.
The obvious benefit here is that the server would always know the current clear text password (it is easy to generate), so MSCHAPv2 could work against this. And the OATH framework has been around for ages, so there must be good support for this, right? Yeah well you'd think that, but then you'd be wrong. Again.
First of all, in a TOTP setup it's nice to allow a little bit of slack so that the user can use an OTP that just expired - for usability, and to account for users' device clocks being a little off. This is completely incompatible with MSCHAPv2, since the hash of the "only correct" password is used in the challenge the server sends to the client. It just can't work.
Second, what I really wanted was proper two-factor authentication - I would like the user to log on with a TOTP, and then provide their username and password (or the other way around, I'm flexible). While IKEv2 does support multiple rounds of authentication, the Windows client does not. Well I guess we won't be going down this route either then.
Left with the prospect of using "large machine-generated passwords" as the best authentication for my users, it suddenly becomes relevant to consider simple Certificate based authentication.
Passwords alone are bad because they get leaked, or even guessed if they're really poor; large machine-generated passwords don't have the problem of being guessed. But if they are large, they will certainly be written down (anyone saying "we have a policy that says users shouldn't do that" has completely lost the connection to the real world, in my not so humble opinion) - so they will be leaked. This is a compromise.
Certificates are really a way of dealing with this: a certificate holds a key that is so large it will not be guessed, and the certificate file is the accepted method of transporting this key. Therefore, while it may be copied around, it will not get written down or left on a sticky-note on someone's desk. Certificates also have a lifetime built right into them.
The only downside I really see with certificates is that I need to maintain a Certificate Revocation List (CRL) for certificates that have been used on devices that are lost. If we used passwords or pre-shared keys, it would be simple to delete the key from the VPN server. But really, mechanisms for this are established and - almost to my surprise at this point - well integrated.
So actually - at the start of this process, messing with certificates was the last thing I wanted to do - but as each desirable option proved unworkable, certificates started looking less unattractive. At this point, maintaining our own public key infrastructure does not seem all that unreasonable after all.
The PKI setup
Loosely based on the guidelines at the excellent StrongSwan wiki, we set up a self-signed Certificate Authority certificate. From here on, it was pretty simple to create certificates for the VPN servers and individual certificates for all the users.
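The StrongSwan wiki shows these steps with its ipsec pki tool; the same flow can be sketched with plain openssl, which is what the commands below do. All names and lifetimes are illustrative:

```shell
# Create a self-signed CA key and certificate (illustrative name/lifetime)
openssl req -x509 -newkey rsa:4096 -keyout caKey.pem -out caCert.pem \
    -days 3650 -nodes -subj "/O=Example/CN=Example VPN CA"

# Issue a server certificate signed by that CA
openssl req -newkey rsa:2048 -keyout serverKey.pem -out serverCsr.pem \
    -nodes -subj "/O=Example/CN=vpn.example.com"
openssl x509 -req -in serverCsr.pem -CA caCert.pem -CAkey caKey.pem \
    -CAcreateserial -days 364 -out serverCert.pem
```

Note that for the built-in clients (Windows in particular) the gateway certificate additionally needs a subjectAltName matching the server name the clients connect to, and the serverAuth extended key usage flag - details the one-liners above gloss over.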
The StrongSwan server needs no reconfiguration as users are added - any user who can present a (non-revoked) certificate signed by our CA is allowed access.
Users receive a PKCS12 file containing their user specific certificate and the private key for that certificate. A machine generated password is set on the file (not stored with the file) and given to the user to use for certificate import.
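Producing such a bundle with openssl looks roughly like this - a throwaway key and certificate are generated inline for illustration; in practice the user certificate would of course be issued by our CA, and the password machine-generated:

```shell
# Generate a throwaway user key and certificate (illustration only)
openssl req -x509 -newkey rsa:2048 -keyout userKey.pem -out userCert.pem \
    -days 365 -nodes -subj "/O=Example/CN=alice"

# Bundle certificate and key into a PKCS12 file, locked with an import password
openssl pkcs12 -export -in userCert.pem -inkey userKey.pem \
    -out alice.p12 -passout pass:machine-generated-password
```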
Users then simply need to install our CA certificate and their own personal certificate - with these bits our VPN gateway can authenticate the user and the user can authenticate the VPN gateway.
Clearly, if a certificate is leaked, just like when a password is leaked, access is granted to people you did not intend to grant access to. Certificates are no different from passwords in this regard.
The only real solution to this problem, as it stands today, seems to be two-factor authentication. While IKEv2 does support multiple authentication rounds, the Windows IKEv2 client does not support this part of the standard. There appears to be no way to implement 2FA on IKEv2, if it is a requirement to use the built-in VPN clients, at least on Windows today (and I simply didn't investigate the other platforms so Windows may not be alone in lacking this support for an otherwise standard feature in IKEv2).
To mitigate this, we use a relatively short (less than one year) lifetime on the certificates. While this is not much use if a device is stolen and abused right away, it will at least mitigate the problem with certificates on old or unused office computers that are later recycled internally or even discarded without being wiped.
A note on certificate file formats
While there are heaps of helpful pages on the interwebs detailing how to convert your DER format certificates to PEM or PFX or PKCS12, there is very little easily digestible information about why these formats exist and what they are all good for.
The one major pitfall to be aware of is that some of these files are "container" formats that are used to ship chains of certificates, or even certificates along with private keys. Other formats will just hold your certificate.
While I may not want to admit this actually happened to me during the early experiments with the certificate infrastructure, imagine what happens when you convert your CA certificate into a PFX file (to make it easier to import on a Windows machine) and then just because the conversion example you found on the net told you to, you also provide the private key of the certificate... Well now you have a PFX file holding your CA Certificate and the private key for the certificate... Ouch! Not something you would want to distribute to your clients (as anyone with the private key of the CA can generate new certificates in your name).
A short rundown of the formats I encountered and their uses:
|Format||Contents|
|PEM||Certificate or key; text format|
|DER||Certificate or key; binary format|
|PKCS12||Certificate(s) and private key(s) - a bundle for distributing a user certificate and its key to a user, for example|
|PFX||Certificate and optionally a key - also a bundle format|
We use PEM format on the VPN server and distribute PKCS12 files to the users. The PKCS12 files have the extra little benefit of being able to require a password for import - so you can distribute the PKCS12 file by one mechanism and give the user the import password through another mechanism.
Just as you thought you were done...
So I have a Windows workstation, my Macbook and my iPhone connecting via VPN and it is all good! Except... nothing really works. DNS doesn't resolve any of the internal hosts.
While IKEv2 does indeed support split tunnelling, split horizon DNS was not part of the standard... It is being worked on, and it is an IETF draft as of this writing, but it is not in IKEv2. In an older IPSec setup (which could not be brought to work with Windows unfortunately) we used an internal DNS for the internal domains and the client would use "whatever other" DNS it was using for everything else. That is simply not a possibility with IKEv2 today.
Well, the quick solution to that is to use our internal DNS for everything - it can handle the load no worries, and it will forward requests to external DNS just fine. So that solves it right? Well you might think so, but if you thought so you'd be wrong, yet again.
It turns out that OSX and iOS both ignore the DNS information pushed from the IKEv2 server, if split networking is used. However, if you simply don't use split networking, your OSX and iOS devices will happily use the IKEv2 server provided DNS. Well, I suppose our gateways can forward traffic to the outside as well then...
Split DNS on iOS
When I say split DNS isn't possible on iOS, it's not entirely true. Apple has a tool dubbed "Configurator 2" which can configure "Policies" for your iOS devices. If you generate such a policy, include your user certificate, define the IKEv2 VPN setup in the policy and then store the policy, this tool will generate a rather nasty-looking XML file for you.
What you do next is edit this profile XML document and insert some extra DNS entries that you cannot insert using the Configurator tool. Now when you import this profile on your iOS device, it will import the certificate, set up the VPN, and supposedly apply the split DNS configuration correctly, as desired.
This, of course, is completely unmanageable for me - all I need is VPN access - I don't want to have to hire a full-time employee to configure our iOS devices for all eternity. I don't know how Windows worked with DNS and split tunnelling; it doesn't matter, since iOS and OSX dictate that we won't be doing split tunnelling any time soon.
Debian and Ubuntu client support
Finally, as we got OSX, Windows and iOS users connecting, setting this up on Linux proved to be the next obstacle. Frankly I had delayed testing this because the servers run StrongSwan on Linux and I "knew" this would just work. Well, at least you would think so right, but then you would be... yes, wrong.
While it is of course possible to manually edit the config files and make StrongSwan connect as a client up against the VPN servers, the graphical "Network Manager" tool that modern-day Linux users have come to expect simply doesn't work for configuring an IKEv2 VPN.
On Debian 8, there's just no support. It simply isn't there. On Ubuntu 16.04LTS, it is there but it doesn't work. There is a bug in the network manager plugin version that is on Ubuntu 16.04 (Xenial) which is fixed in a later Ubuntu (Zesty) - but it appears it is not getting fixed in 16.04LTS. One can only hope for a fixed plugin in the next .minor release on the LTS.
I am absolutely amazed at how little integration there is in the products that make up a VPN, authentication and authorization infrastructure. While I was of course expecting a challenge or two, I am absolutely flabbergasted at how close to "impossible" something as simple as a relatively secure VPN could be to set up so that it works out of the box on a selection of common modern clients.
That we, 17 years after MSCHAPv2 ceased to be relevant, are still forced to base a mostly non-Windows infrastructure around it, is shocking.
Anyway I am moderately happy with having finally gotten a reasonably secure VPN going to take over from an older much less secure VPN which was not supporting the clients we needed. I would have liked 2FA and Kerberos, but that is apparently not realistic today and maybe never will be (if no-one has gotten it in the past 17 years, I am not holding my breath for the next 17).
Onwards, to glory! Back to real work. With some basic infrastructure in place, chances are we can now focus on real work.
Anyone who says "less is more" is of course high, stupid, or both. Less is not more - but the message intended in that common cliche is that "less is better", which is certainly the case more often than not.
Without going into details... A developer is faced with the problem of generating a string like foo=42&bar=more&baz, given the following set of mappings:
The diligent reader will recognize the string as the options part of a URI, and can also safely infer that the table of key/value mappings is indeed held in an STL map<string,string> structure.
During some early spring cleaning here before the very end of 2016, I stumbled across the code for this solution - it works and the code did not cause problems; but it is heavily dependent on the boost library, which I am trying to get rid of to the extent possible (that would probably be the subject of an article in itself - suffice to say I'm trying to cut down on dependencies).
The above is the code as I found it. It uses a boost algorithm to "join" the options with ampersands, and declares a struct to be able to apply an operator() using a transformer, to convert the pairs in the map to strings of the "key=value" form.
It is time to take a step back... Why is it we need all this? What is the actual job that we are performing? Take a moment to think about it... Given the mapping of strings, how would you imagine that you could construct the resulting string?
I deleted the includes of boost headers and deleted the code above. Instead, I wrote this little bit:
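Since the replacement listing is likewise not shown here, this is a sketch of its shape - a plain loop over the map, streaming each pair into an ostringstream:

```cpp
#include <map>
#include <sstream>
#include <string>

// Build "key=value&key=value&..." by streaming the pairs directly;
// no temporary strings, no transformers.
std::string make_options(const std::map<std::string, std::string>& m) {
    std::ostringstream os;
    for (std::map<std::string, std::string>::const_iterator i = m.begin();
         i != m.end(); ++i) {
        if (i != m.begin())
            os << '&';
        os << i->first;
        if (!i->second.empty())
            os << '=' << i->second;
    }
    return os.str();
}
```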
This is a pretty straightforward solution, really. It iterates through the mapping and simply inserts the string pairs one by one. No operator|() to apply a transformer to a map, no structs with operators that construct new strings based on the old ones, and no high-level transformers to abstract away the nitty-gritty details.
I am not against abstractions. But we also have to consider the reality of things here:
|Metric||Old solution||New solution|
|Size||20 lines||10 lines|
|Allocations||2n + log n||log n|
The allocations metric is a relevant performance metric; basically, when we perform an operator+ on two strings, we construct a new string using the two old (without modifying the two old). This costs us an allocation. Therefore, for every call to to_string::operator() we perform two new allocations. The boost join method will use insert to append each pair to the resulting string, and I think it is fair to assume that std::string is smart enough to incrementally grow its underlying buffer thereby giving us around log(n) allocations for adding n pairs to the string.
The new solution, however, uses an ostringstream which, like std::string, will perform something like log(n) allocations in order to append n strings to the result. No temporary strings are created during the operation of the new solution - therefore the number of allocations is negligible compared to the original solution.
Last but not least, the shorter solution is trivial. It is obvious what it does and it is obvious how it does it. It is simply more readable, not just because it is fewer lines, but because of its flow.
Something to take with us into the new year...
The moral of the story is: When you are implementing a solution to a conceptually very simple problem, and you find yourself writing a lot of code using clever and complex libraries and constructs, you need to take a step back.
For every simple problem, there is a large, complicated and expensive technological solution.
But it doesn't have to be like that. We can do better. Let's make 2017 a year where we do better. Happy new year everyone!
During a recent project I encountered memory allocation failures in a LISP system, seemingly caused by poor garbage collector performance. More careful investigation revealed a more fundamental problem however, one that is even completely unrelated to the choice of language or runtime.
Without going into unnecessary detail, I should give a brief overview of the system I have been working on. It is an import system for a search engine; in other words, it trawls large datasets that it downloads from an object store, processes those datasets, and uploads certain data in the form of XML documents to a search engine.
The source data that is loaded from the object store, are binary-encoded "container" objects (similar to folders in a file system) that may refer to other objects, and then actual documents in their native formats (for example raw .eml rfc2822 files).
The importer recursively descends the tree structure of a dataset, comparing the previously imported version to the current version, determining the differences (which objects that disappeared, were added or were updated), and conveys this difference to the search engine by means of an XML document.
For performance reasons, we want to run a number of import jobs concurrently; the individual recursive descent is not parallelised, but we like to process many trees concurrently. Much of this work is I/O bound, so the number of concurrent jobs we can run is limited by memory consumption, not by available processing power.
Adding this import functionality to an already existing LISP system allowed for very rapid development. A working version of this code, complete with job scheduling code and a test suite, was implemented over the course of just a few days. All was good, for a while. Then, we started encountering out of memory issues. It is simple to add more memory - but we quickly got to a point where we could only run two concurrent jobs, we allocated 8GB of memory for the LISP system, and we could still not be certain we wouldn't occasionally encounter an allocation failure.
Dealing with bad allocations in C++ can be troubling enough; proper exception handling and code structure can make recovery from out of memory reliable - but still, one area of the code may use all the memory, causing an unrelated piece of code to actually fail its allocation. You can't know if the "memory pig" is the one that will fail its allocation, unless you start using pooled memory for each area of your code. With LISP, there is the STORAGE-CONDITION condition that can be raised, allowing you to deal gracefully with allocation problems, similar to std::bad_alloc of C++. In SBCL, however, sometimes the condition is raised, and sometimes the low-level debugger is entered because the system fails to raise the condition in time. In other words, for a real-world LISP system, I need to be sure that out of memory conditions are unlikely to occur; we cannot rely on error handling to recover reliably from allocation problems.
Soon after starting to experience these occasional out-of-memory problems, I started investigating which types of objects were allocated, using the (room) function. An example could look like:
The above is for a relatively small system; only 1G of dynamic space in use; but the pattern scales to larger dynamic spaces. There is an enormous number of character strings and a large number of bytes allocated in arrays of unsigned bytes. The latter are the buffers I use when receiving the binary objects from the object store.
It surprised me that this many arrays were still "alive"; I never reference more than a handful of these - how can there be more than 8k of them alive? No wonder we're having memory problems. Running (gc :full t) will clean these up, so I'm right that they are no longer referenced. So how come we still run out of memory? Why does the garbage collector not collect these by itself? Clearly, this must be a garbage collector problem, right?
I consulted the SBCL-Help mailing list for advice on how best to deal with this. Very quickly I received insightful and helpful advice, both on-list and off-list. Of the various suggestions I received, at the time it seemed most reasonable to accept that the memory allocation pattern of my application (few large objects followed by many small) simply caused the large objects to get promoted to a later generation in the garbage collector, causing the collector to miss opportunities to collect them (the SBCL memory collector uses 6 generations or so).
Working under the assumption above, there were several solutions possible:
- Call the garbage collector explicitly
- Tune the garbage collector; e.g. make it two-generational
- Pool the resources - re-use buffers
- Allocate outside of the dynamic space
I did two things: I took the static-vectors library and implemented a simple memory allocator on top of that. This way, I could now explicitly allocate and free memory objects, using a single underlying static-vector allocated outside of the dynamic space. My allocator would simply return displaced vectors on top of the one large vector initially allocated for the pool. This looked like a beautiful solution to the problem. All my buffers were now completely out of reach of the garbage collector, and I knew exactly when and where buffers were allocated and freed again. The code wasn't even complicated by this; a simple (with-pooled-memory...) macro would neatly encapsulate allocation and guarantee release.
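The macro itself can be sketched like this - POOL-ALLOCATE and POOL-FREE are hypothetical names standing in for the little allocator described, not functions from any library:

```lisp
;; Sketch only: POOL-ALLOCATE hands out a displaced vector over the one
;; big statically allocated vector; POOL-FREE returns the region to the pool.
(defmacro with-pooled-memory ((var size pool) &body body)
  `(let ((,var (pool-allocate ,pool ,size)))
     (unwind-protect
          (progn ,@body)
       (pool-free ,pool ,var))))
```

The unwind-protect is what gives the "guarantee release" property: the buffer goes back to the pool whether the body exits normally or via a condition.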
Running the code with this, gave the following dynamic space breakdown:
There we go! My unsigned byte arrays are completely gone. But another problem surfaced...
Displaced array performance
It turns out that accessing displaced arrays comes at a significant cost - the compiler can no longer generate efficient code for traversing them. Take a look at the following comparison:
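The comparison itself is not reproduced here, but the effect is easy to demonstrate with a sketch like the following: access through a displaced vector cannot be open-coded the way access to a simple vector can.

```lisp
;; Illustrative micro-benchmark: summing octets through a simple vector
;; versus through a vector displaced onto it.
(let* ((base (make-array 1000000 :element-type '(unsigned-byte 8)
                         :initial-element 1))
       (disp (make-array 1000000 :element-type '(unsigned-byte 8)
                         :displaced-to base :displaced-index-offset 0)))
  ;; BASE is a simple-array; AREF compiles to a direct memory access.
  (time (loop for i below 1000000 sum (aref base i)))
  ;; DISP goes through the displacement indirection on every AREF.
  (time (loop for i below 1000000 sum (aref disp i))))
```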
That's a significant penalty. The simple solution to this problem was to realize that I didn't need a memory pool. I allocate so few objects anyway, that I can easily allocate them all directly with the static-vectors library. Simply removing the pool (along with my beautiful little allocator that I was actually proud of) completely solved that problem.
This is a classical programmer error, and one that I am clearly not above: imagining a solution in one's head without caring to investigate whether it is even needed - spending time implementing a solution which in fact turns out to be a problem in itself. Anyway, with the pool gone and vectors allocated using the library that already allocates vectors, the performance problem was gone. But the string objects certainly weren't... How many of those did we have?
Half a gigabyte of strings?
Looking at the most recent dynamic space breakdown, we see we use 488MB for strings and 317MB for a whopping 19 million cons objects. Clearly there is more work to do - this system does a number of different jobs and there's nothing strange about it using some memory - but 19 million cons objects and almost three million strings, that's a lot. And this is on a "small" system; we can easily use 3-4G of dynamic space and the string and cons numbers scale.
This is actually where it gets interesting. I was forced to take a look at the actual data processing algorithm that I had implemented. The logic goes a bit like this:
So in order to compute the difference of one version of a folder to another version of a folder, we need to sort the children and iterate through them. Since we receive the data in these binary "blobs" and each actual entry in the blob is of variable length (depending, among other things, on the length of the name of the entry), I use a parser routine that generates an object (a struct) holding a more "palatable" representation of the very same data.
My struct would hold a list of child entries (again structs). Each child entry would have a name and a key/value store (a list of pairs of strings). So as an example, let's say I'm comparing two folders with 10.000 entries in each; each entry has five meta-data entries, so a child holds 5 (meta-data entries) * 2 (key and value) + 1 (entry name) strings. Times 10k, that's 110.000 strings I allocate both for the old and the new version. The entry lists are then efficiently sorted, and I can traverse and find the difference.
While this was simple to implement, it is also clearly extremely wasteful. And this has nothing to do with the choice of language. Had I implemented this in C++, a profiler would reveal that a significant portion of the run-time is spent in string allocation and string de-allocation. While an efficient C++ solution would only use a few times more memory than the source binary data due to the precise deallocation, my LISP system hurts more because the garbage collector fails to collect many of these strings in a timely fashion. But the algorithm I implemented is equally wasteful; in other words: What I did was silly in any language.
It actually wasn't rocket science to change this. I now use a struct that holds these members:
The "buffer" is my binary blob from the object store. The "offset" is a list of integers (fixnums) that hold the offsets into the "buffer" at which the child entries start.
I can efficiently retrieve the name of an entry by supplying the "buffer" and the start offset of the entry; a simple macro to retrieve various fields of the binary data makes the code for this really elegant.
Sorting is simple too. I do not sort the entries in the buffer - that would be time consuming and very difficult (as the entries are of variable size). Instead, I sort the offset list; not by offset number of course, but by the name of the entry each offset points at. It's as simple as:
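A sketch of that sort - ENTRY-NAME is an illustrative name for the field-accessor macro mentioned above, not the actual one:

```lisp
;; Sort the offsets by the name of the entry each offset points at.
(setf offsets
      (sort offsets #'string<
            :key (lambda (offset) (entry-name buffer offset))))
```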
With this change, I probably don't even need the static-vectors allocation. But now that I have it, I'll keep it - it's an improvement, even if it's not essential.
An essential tool for me in finding the cause of the problem, and for identifying a few other routines that were unnecessarily heavy on allocating objects, was the statistical sampling profiler built into SBCL. A simple session looks like:
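A minimal allocation-profiling session with sb-sprof looks roughly like this - RUN-IMPORT-JOB is a stand-in for the actual workload:

```lisp
(require :sb-sprof)
;; Sample on allocation rather than CPU time, and print a call-graph
;; report afterwards showing who allocates.
(sb-sprof:with-profiling (:mode :alloc :report :graph)
  (run-import-job))   ; stand-in for the actual entry point
```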
Finding the culprit
My expensive object parsing routine showed up in a profile report like this:
It is visible here that 35.7% of all allocations are caused by make-output-stream; most of the calls to that are from stringify-object. That in turn is called by princ-to-string. That, again, by format-print-integer. And that, finally, gets called by the objseq->contentlist routine which was the name of my parsing routine. This trace is what led me to understand the cause of the actual problem.
For example; I had a (sexp->xml ...) routine which converts a S-expression into a textual XML document. This routine was some of the first LISP I ever wrote, so it was built using some "fancy" constructs (a beginner learning the language) to recurse down into the expression and construct the document. However, looking at a profile running my search engine import revealed:
In other words; 41.8% of all allocations in the system are done by concatenate, and in 38.7% of the total allocations, it's from a call from reduce. Investigation quickly reveals that my fancy XML document routine makes heavy use of reduce and concatenate.
Changing the code to a simpler version that uses a string output stream instead completely removed this routine from the top of the profile. Just like that, about 40% of all allocations done in the code were gone.
Using the sampling profiler as a tool to find the source of allocations has proven an excellent tool to me. The ability to start such a profile on QA or even production systems (I have not needed this - but I could if I needed) is amazingly powerful. Profilers are not just for finding CPU bottlenecks, clearly.
It is common for me, and I know for many others, to want to "convert" data from one representation to another before we work on it. Like I did when I parsed the binary objects to a higher-level representation for sorting and comparison. Sometimes that makes sense - but not only does it take time to develop all this representation-conversion functionality (or you need to find and integrate libraries that do it for you); it takes time to run (every time), and your code may not even be simpler!
In this particular case, for comparing two 10k entry folders I would do 220k string allocations - my new implementation does 4 simple array allocations, and it is no harder to read and doesn't take up more lines of code than the old implementation.
I'm not saying abstraction is evil. I did indeed abstract things - for example, I developed a macro for efficiently extracting specific fields out of entries in the buffer. In this case, my abstraction is in code, instead of in the data structures. This is easier done in LISP with its powerful macro system than in other languages of course - but the same principle could be applied in any other language.
I'll try to think twice about where to put my abstractions the next time, that's for sure. Abstractions are good, to a point - but they don't always have to be in the data representation.