Jakob Østergaard Hegelund

Tech stuff of all kinds

Unrestful XMLHTTPRequest

2014-09-02

The RESTful model for web service APIs preaches simplicity and brings sanity back into web service design. It has, once again, become acceptable to solve simple problems with simple solutions. I have on several occasions been the principal designer behind an internet-facing RESTful API, and I truly enjoy the power and expressiveness of plain simple HTTP/1.1 as the vehicle that transports requests and responses to and from the APIs.

A little background

I need to point out some obvious things to set the scene for this post. HTTP defines status codes. Every response has one. The ones we all know are "200 OK", "404 Not Found" and perhaps "500 Internal Server Error". But HTTP defines many other status codes - in fact, with a little creative thinking, HTTP will have error codes for pretty much any situation you will run into when building an API. Don't believe me? Take a look for yourself - it is RFC 2616 (yes, I know it has been split now, but 2616 is still a good document). The most important aspect is that when designing an API, you can actually return a real error code in the HTTP response, and you can put a descriptive error message in the body of the response document.

GET /our/stuff HTTP/1.1
host: mine.local

-------------------------
HTTP/1.1 403 Forbidden
content-length: ...
content-type: text/plain

Hi there. We simply won't allow people from your
side of the internet access to OUR stuff.

Sincerely,
 US

The second killer feature of HTTP is its brutal simplicity and consistency. Every request consists of a request line (GET /our/stuff HTTP/1.1 in the above example), some headers and a body. Every response consists of a status line (HTTP/1.1 403 Forbidden in the above example), some headers and a body. There are very few exceptions to this rule. One such exception is that a response to a HEAD request must not include a body - because the whole point of the HEAD request is to do a GET without actually getting the body data. HTTP does, however, mandate some specific semantics for the methods - a GET request, for example, must be idempotent (meaning that multiple identical invocations of the method must have the same effect as one invocation), whereas a POST doesn't have to be. This of course makes perfect sense - multiple requests to retrieve a particular document should all yield the same result, whereas it would only be natural if the initial creation of a document succeeds while subsequent creations of the same document fail (with a 409 Conflict error of course).

Fail 1: error codes

When issuing API requests from JavaScript in a browser, using XMLHTTPRequest directly or indirectly through the myriad of JS libraries out there, one would like to issue a request and receive the response. This is what the method facilitates - except if the response status code is 401 Unauthorized. If that status code is sent, the browser will intercept the response and pop up a horrific-looking dialog, prompting the user to authenticate. What's the point?!? Sure, if the user himself entered the URI in the browser address bar, it makes sense. But the user should not be troubled with return codes from the internal workings of a JS application running in his browser. If the application wishes to authenticate, it could ask the user directly, or the browser could expose a method for this purpose.

The workaround I devised for this was to accept a request header "authentication-failure-code", which can be set to any integer from 400 to 499. If the API wishes to return a 401 status and this header is set, the API will return the given integer instead of 401. This is a simple way to keep the API clean while providing a workaround for the misguided implementations of XMLHTTPRequest out there.
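For illustration, here is a minimal sketch in C++ of the server-side half of that workaround. Only the header name and the 400 to 499 range come from the description above. The function name statusForAuthFailure, the Headers type, and the decision to silently ignore malformed or out-of-range values are assumptions made for the example, not the actual API code.

```cpp
#include <cassert>
#include <map>
#include <string>

// Request headers keyed by lower-cased header name. This is a hypothetical
// representation; a real server framework will have its own type.
using Headers = std::map<std::string, std::string>;

// Returns the status code to send when authentication fails.
// Defaults to 401. If the client sent "authentication-failure-code" with
// an integer from 400 to 499, that integer is returned instead.
// Malformed or out-of-range values are ignored (an assumption).
int statusForAuthFailure(const Headers& headers) {
    int code = 401;
    const auto it = headers.find("authentication-failure-code");
    if (it != headers.end() && it->second.size() == 3) {
        int parsed = 0;
        bool digitsOnly = true;
        for (char c : it->second) {
            if (c < '0' || c > '9') { digitsOnly = false; break; }
            parsed = parsed * 10 + (c - '0');
        }
        if (digitsOnly && parsed >= 400 && parsed <= 499) {
            code = parsed;
        }
    }
    assert(code >= 400 && code <= 499);  // the result never leaves the 4xx range
    return code;
}
```

A browser client would then send, say, "authentication-failure-code: 499" with every request and treat a 499 response as "please authenticate", while every other client keeps seeing a plain 401.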

Fail 2: GET requests

Let us assume that I have an API which can transform a document from a simple mark-down style text to a full XHTML document with fancy formatting and graphics according to some theme configured in the system. When implementing an editor, I want to execute this API request whenever the user has edited his document, so that we can present an up-to-date view of what the finalized message will look like. Which method would we use for that?

SOME-METHOD /render/welcome-message HTTP/1.1
host: api.local
content-length: ...
content-type: text/xml

<render>
 <lang>en-GB</lang>
 <content>Dear ${fullname}

Once upon a time there was an RFC, but nobody could
be bothered to read it.

The end.
 </content>
</render>
-------------------------
HTTP/1.1 200 OK
content-length: ...
content-type: text/html

<!DOCTYPE...>
<html>
 ...
<body>
<h1>Dear John D. Anyuser</h1>

<p>Once upon a time...

Well, this method would not change anything on the server, so it is definitely idempotent. First off, neither POST nor PUT would be suitable. The request returns a body, so HEAD is also a no-go. In fact, the only reasonable method to use is GET. It even makes perfect sense - we execute a "static" method on the server (a pre-configured rendering routine) with no side effects, just like when we request a static document, or a search result. Instead of encoding the document we wish to have transformed in the path of the URI (which would be inconvenient and even impossible with larger texts), we simply supply it in the body of the request - which is perfectly valid and well defined by the HTTP 1.1 RFC. So what is all the fuss about, you may ask? Why not just go ahead and do this and be done with it? Well, I did, and as it turns out, XMLHTTPRequest will ignore the body in a request if the method is GET. No, I am not kidding and this is not a joke (or at least it is not a very funny one). It is right here.

Yet again I was forced to implement a workaround in an otherwise fairly clean API, to allow for something for which I, this time, completely fail to see the explanation. I mean, they specifically went to all the trouble of special casing GET so that XMLHTTPRequest specifically would not allow a standard HTTP request - what for? To help us? Please, if that is the case, stop helping. Please just remove the special case from the standard, remove the code necessary to implement this breakage of HTTP support, and thereby allow plain simple RESTful APIs to be used from the browsers that people have. Anyway, the workaround was simple: add another handler so that users can use the "PUT" method (even though that makes NO sense, as nothing gets updated on the server) instead of "GET", thereby bypassing the special case in XMLHTTPRequest that breaks protocol.

Ah... glad I got all this off my chest. I hope you find the workarounds useful, and if you are a browser vendor or otherwise have leverage to influence things, please consider if it would be possible to work towards supporting HTTP in all its beautiful simplicity in the browsers of the future. Thank you.

Pick the pointer

2014-04-11

This is probably not the best way to go about things, from a readability perspective... But it just occurred to me today that there's yet another use for std::max. Consider:

    fsal::Entity *se = ptrl
      ? static_cast<fsal::Entity*>(ptrl.ptr())
      : static_cast<fsal::Entity*>(ptrc.ptr());

Since ptrl.ptr() and ptrc.ptr() have different types, I need to cast them both separately. This is a lot of typing and a lot of reading. The shorter way? How about:

    fsal::Entity *se = std::max<fsal::Entity*>(ptrl.ptr(), ptrc.ptr());

I will make the argument that this is elegant. It is short and concise and does exactly what I need (it picks out the one pointer that is not 0). As for readability, this is so far from "normal" that it is probably not a good idea.
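One caveat with the std::max trick: the result of comparing a null pointer and a non-null pointer with < is unspecified by the C++ standard, so it only works because null compares lowest on the usual platforms. A more explicit alternative that is just as short at the call site is a tiny helper, sketched here. The name pickNonNull is made up for this example, and the "at most one pointer is set" precondition is my assumption about the situation described above.

```cpp
#include <cassert>

// Returns whichever of the two pointers is non-null, converted to Base*.
// Precondition (assumed from the post): at most one of them is non-null.
// Returns nullptr if both are null. No pointer ordering is involved.
template <typename Base, typename A, typename B>
Base* pickNonNull(A* a, B* b) {
    assert(!(a && b));  // both set would make the choice ambiguous
    return a ? static_cast<Base*>(a) : static_cast<Base*>(b);
}
```

The call site then reads: fsal::Entity *se = pickNonNull<fsal::Entity>(ptrl.ptr(), ptrc.ptr());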

A couple of years down the road

2014-02-15

Having run a number of Oracle ZFS Appliances for a couple of years now, I guess it is fair to take a moment to sum up where they delivered on their promises and where they fell short. The shortcomings first - I would say that this is a comprehensive list of where I feel let down by the system. Now, I would expect to be negatively surprised sooner or later by any system, if I use it intensively for critical jobs over a long period of time. Having this list does not mean it is a bad system - but knowing this list before placing your order can help you make sure you get the system sized right, and that you match your expectations to what is actually possible to do well in the real world.

Ungraceful degradation on overload

When running too many concurrent replications, having too many clients reading and writing too much data at the same time, you would expect your storage system to start responding more slowly to individual requests - so even though the system is processing more IOPS than ever before, the individual clients begin to see the system as slowing down (taking longer to serve an individual IO).

Well, if you happen to be overloading this appliance with the right combination of small writes to otherwise cold data (as for example a vSphere environment would), then you can end up in a situation where the system "pauses" all IO for a few seconds, then serves a spike of IO for a few seconds, pauses again and so forth. This was terrible when it happened to us at first, but I believe we know the cause and the resolution by now - in short, you need enough spindles for your IOPS and you need to use the right recordsize for your shares.

Any storage system will misbehave if underpowered and mis-configured, and this system is no different in that respect - the behaviour in that situation, however, is not what I would have expected. This is good to know I guess :)

Too slow cluster failover

Consider this: You run a virtual server - the guest OS thinks it has a physical SCSI disk to talk to. In reality, that is a fake, provided by ESXi - it is actually a VMDK file on an NFS export. Now, if the NFS export is unavailable for 90 seconds, NFS will simply block and wait - as soon as it is available again, NFS will resume where it left off with no data loss at all. So no problem, right? Wrong! The guest OS will have sent a SCSI command to its (fake) SCSI disk, and it will expect that command to complete within the SCSI timeout, which is 60 seconds. When that does not happen, the guest OS will either retry, reboot, hang or what have you... All depending on the OS and OS revision in question. Yes, you can often tune this timeout to suit your needs - but in a hosting environment where customers may administer their own servers, this is not necessarily easy to pull off. What this boils down to is that you need whatever downtime you have on your NFS storage to be less than 60 seconds.

When upgrading firmware on the Oracle appliances, you upgrade the passive head. Then, you fail over so that the passive becomes the active, and then you can upgrade the new passive head. Et voilà, both your heads are upgraded and the only downtime you had was the time taken by the cluster failover. This works for minor and major firmware upgrades and it is great!

...except for one little detail: In the older firmwares the failover could cause almost 180 seconds of NFS downtime. In the newer firmwares it seems to be down to around 60 seconds. So things have improved massively and we are almost there - but the bottom line is that either failover time is comfortably below 60 seconds, or it is not - and if it is not, then users will notice.

Deduplication

This was one of the cost-reducing features we bought into. Oracle ZFS Appliances employ in-line de-duplication, which means that data is de-duplicated as soon as it enters the appliance. There is no batch job (as on NetApp and Storwize) that has to run nightly. This is great for a hosting environment, because we really don't have off-hours. Many systems are busier at night than during the day, and customers are international. You just cannot take out 8 hours during the night for batch processing...

It turns out that de-duplication results in a de-duplication table (a DDT) which holds the hashes of the data blocks that are potentially de-duplicated. If this DDT fits in RAM, all is well. If it does not fit in RAM, then performance of the appliance will deteriorate massively to the point where it very quickly becomes completely unable to serve anything to anyone.
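To get a feeling for how quickly the DDT can outgrow RAM, a back-of-the-envelope estimate helps. The sketch below assumes one table entry per unique block and a fixed in-memory cost per entry. The function name and, in particular, the bytes-per-entry figure are assumptions for illustration, not numbers from Oracle's manual - check the documentation for your firmware before sizing anything from this.

```cpp
#include <cassert>
#include <cstdint>

// Rough estimate of the in-memory size of a de-duplication table.
// Assumes one entry per unique block of recordSizeBytes, and that each
// entry costs bytesPerEntry in RAM (caller-supplied, an assumption).
std::uint64_t estimateDdtBytes(std::uint64_t uniqueDataBytes,
                               std::uint64_t recordSizeBytes,
                               std::uint64_t bytesPerEntry) {
    assert(recordSizeBytes > 0);
    // Round up: a partially filled block still needs an entry.
    const std::uint64_t blocks =
        (uniqueDataBytes + recordSizeBytes - 1) / recordSizeBytes;
    return blocks * bytesPerEntry;
}
```

With a placeholder of 320 bytes per entry, 10 TiB of unique data at a 128 KiB recordsize comes out at 25 GiB of table. The same data at an 8 KiB recordsize needs sixteen times that, which is the kind of number that no longer fits in RAM.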

This is actually in the manual. Oracle does not recommend that you use this feature unless you absolutely understand your dataset and how the de-duplication works. Reading that does not mean you believe it, though - but mark my words, they advise against it for a reason. Please don't just enable it to play with it, as it has system-wide impact. The full system deteriorates, not just the share where you enable it. And yes, Oracle say this, they are very honest about this, but it is tempting to not believe them and go ahead and try it anyway. Don't.

That was actually the list of gripes. Not too bad I guess, all things considered. The important part is, that most of these are not an issue anymore at all - not when you know them.

Happy times!

Now for the next list. There are a few points where the system has delivered above my expectations. We had high expectations, so that is saying something.

Expansion

Need more space or more IOPS? Simple - you go ahead and buy another shelf of disks like the ones you have in your system already. While the system is on-line you cable up the shelf (the SAS links are redundant so you can safely cable in a new shelf while the system is operating). Once you confirm that you have two paths to all shelves in the system (using the simple overview in the web UI), you tell it to extend your storage pool with the new disks. This is just a couple of clicks - it takes a few minutes and it causes NO downtime or interruption or degradation of any kind for your users. The system just "magically" adds the disks to your storage pool.

Since ZFS is a copy-on-write system, it gets to choose where it writes new data. It will decide on which disks to write to based on how full they are and how busy they are - therefore, when you add a full shelf of "virgin" drives, a higher ratio of writes will go to this shelf until your storage has been balanced out. So, there is no batch-job like re-balance process to run - the system will automatically and all by itself even out the data over your newly installed shelf of disks.
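To make the effect concrete, here is a toy model in C++ where each device receives new writes in proportion to its free space. This is emphatically not the appliance's actual allocator (which, as described above, also looks at how busy the disks are) - the function name writeShares and the proportional rule are assumptions chosen only to show why an empty shelf attracts most of the new writes without any rebalance job.

```cpp
#include <cassert>
#include <vector>

// Toy model: share of new writes per device, proportional to free space.
// freeSpace holds the free capacity of each device in any consistent unit.
// Returns one fraction per device; the fractions sum to 1.
std::vector<double> writeShares(const std::vector<double>& freeSpace) {
    double total = 0.0;
    for (double f : freeSpace) {
        assert(f >= 0.0);
        total += f;
    }
    assert(total > 0.0);  // at least one device must have room
    std::vector<double> shares;
    shares.reserve(freeSpace.size());
    for (double f : freeSpace) {
        shares.push_back(f / total);
    }
    return shares;
}
```

In this model, three equally sized shelves that are 80% full plus one empty shelf of the same size (free space 0.2, 0.2, 0.2 and 1.0) send 62.5% of new writes to the new shelf, and that share shrinks by itself as the shelf fills up.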

We have done this a couple of times, and it really is as painless and simple as I make it sound. I am impressed.

Second point: You just buy the shelf. You do not buy extra licenses for replication, snapshots, flash cache, iSCSI, NFS, analytics, ... It is simple. You buy the shelf, you use the shelf. No licensing nightmare. No features that stop working because they are not licensed for all your storage... No. Simple. Nice!

Capacity

As I have covered already, we do not use de-duplication. But we enable compression on everything - the system includes various levels of compression (a trade-off between CPU consumption and compression ratio), but the "cheapest" compression is so CPU efficient that it does not cost you performance - and yet it gives us above 1.4x compression on average. Enabling the fast compression is a no-brainer - there is simply no downside unless your workload contains only highly compressed data already.

The system performs very well even using 7200 rpm high capacity disks for primary storage. Remember, there is only about a factor of 2 in performance between these "slow" drives and the fastest 10k or 15k drives money can buy. Compare that to the difference proper use of flash cache or RAM cache can make, and the speed of the mechanical drives will seem nearly irrelevant. Of course it isn't, but this system is very good at delivering performance even from high capacity drives.

What this means is that you get a system where you can actually get a lot of proper storage space (which includes replication, snapshots, NFS and iSCSI and all that jazz) for a relatively small amount of money... compared to much else in the market at least. It is not like these appliances come for free - but I genuinely feel that you get a lot for your money.

Performance

You should see the ZFS Appliance as consisting of several layers. There is a "data management" layer which takes care of writing "objects" on disks. And then they built two things on top of that - they built the "zvol", a volume, which can be exported via iSCSI (or other block protocols if you need them), and they built the "zpl", the ZFS POSIX Layer, a file system which can be exported via NFS. So, whether you choose to create volumes and export them via iSCSI or if you create file systems and export them via NFS, you are working with a "first class" member. Some systems (cough.. NetApp.. cough..) will create a file-system file and export that via iSCSI, which may not be optimal from a performance point of view. Well, the ZFS Appliance does both block and file well. Quite well...

The system does copy-on-write, which means it decides for itself where to write new data. What that means is that random write workloads become sequential-write workloads for your disks. That is a brilliant way of improving write performance. What copy-on-write also means is that the file system is always consistent - there is no file system checker for ZFS because there are no pathologies for a checker to repair.

To serve synchronous writes quickly, the system employs a couple of interesting SAS devices: "LogZillas". As I understand it, these are basically RAM disks that include a super capacitor and some flash to survive a power loss. They are used as NVRAM to allow synchronous write requests from clients to be served very quickly.

Tiering

The system does not do "tiering" in the old fashioned traditional sense of the word - but it does something better. Let me explain.

All data will go to the disks (but this is fast - the writes are sequential), and will stay there. So the mirroring (or whichever redundancy you choose) is taken care of at that level.

The system MAY then choose to cache some of that data either in a read-optimized flash cache (which is around 100 times faster than mechanical disk), or it may choose to cache the data in DRAM (which is 100 times faster than flash).

Since the data is already redundant on your mechanical disks, the system does not need to keep redundancy on flash - so no mirroring means you get twice the effective flash for cache! If a cache device fails, the system continues working with the remaining devices, no worries.

The system is extremely good at choosing what data to keep in RAM and flash. I typically see more than 80% of all read IOs that hit the system being served directly from RAM. That impresses me too.
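The rough ratios mentioned above (flash around 100 times faster than disk, DRAM around 100 times faster than flash) can be combined with the hit rates into an average read cost. The sketch below is only that arithmetic, with a disk read normalised to 1.0. The function name and any flash hit rate you plug in are assumptions - I have only quoted the RAM figure.

```cpp
#include <cassert>

// Average cost of a read relative to a mechanical disk read (= 1.0),
// using the rough ratios from the text: flash is 100x faster than disk
// and DRAM is 100x faster than flash. ramHit and flashHit are the
// fractions of reads served from each cache; the rest goes to disk.
double relativeReadCost(double ramHit, double flashHit) {
    assert(ramHit >= 0.0 && flashHit >= 0.0 && ramHit + flashHit <= 1.0);
    const double diskCost = 1.0;
    const double flashCost = diskCost / 100.0;
    const double ramCost = flashCost / 100.0;
    const double diskShare = 1.0 - ramHit - flashHit;
    return ramHit * ramCost + flashHit * flashCost + diskShare * diskCost;
}
```

With 80% of reads from RAM and, purely as an assumed example, another 10% from flash, the average read costs about a tenth of a disk read - and nearly all of that remaining cost comes from the 10% of reads that still have to go to the spindles.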

This "tiering" between RAM, flash and mechanical disks is a continuous process - it is not a scheduled job that runs once every now and then. It does not ask of you to configure rules for which data-sets to put where. It just very simply does what is best for your system so you get the best performance possible all the time. No administrative hassle, no rules to get wrong. And it really really works.