Jakob Østergaard Hegelund

A couple of years down the road

2014-02-15

Having run a number of Oracle ZFS Appliances for a couple of years now, I guess it is fair to take a moment to sum up where they delivered on their promises and where they fell short. The shortcomings first: what follows is, I would say, a comprehensive list of the places where the system has let me down. Now, I would expect to be negatively surprised sooner or later by any system I use intensively for critical jobs over a long period of time, so having this list does not make it a bad system - but knowing the list before placing your order can help you get the system sized right, and match your expectations to what is actually possible to do well in the real world.

Ungraceful degradation on overload

When you run too many concurrent replications, or have too many clients reading and writing too much data at the same time, you would expect your storage system to start responding more slowly to individual requests - so even though the system is processing more IOPS than ever before, the individual clients see it as slowing down (each IO takes longer to serve).

Well, if you happen to overload this appliance with the right combination of small writes to otherwise cold data (as, for example, a vSphere environment would produce), you can end up in a situation where the system "pauses" all IO for a few seconds, serves a spike of IO for a few seconds, pauses again, and so forth. This was terrible when it first happened to us, but I believe we know the cause and the resolution by now - in short, you need enough spindles for your IOPS, and you need the right recordsize on your shares.
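
To make the recordsize point concrete, here is a toy sketch in Python - nothing to do with the appliance's actual code, and the sizes are made-up examples. With copy-on-write, a small write landing inside a larger record forces the whole record to be read, modified and rewritten:

    # Toy model: a small write into a larger record rewrites the full
    # record (read-modify-write). All sizes are hypothetical examples.

    def write_amplification(io_size_kib, recordsize_kib):
        """Bytes physically written per byte the client asked to write."""
        return max(recordsize_kib, io_size_kib) / io_size_kib

    # A vSphere-style workload issuing 8 KiB writes:
    for rs in (8, 32, 128):
        amp = write_amplification(8, rs)
        print(f"recordsize {rs:>3} KiB -> {amp:4.1f}x write amplification")

    # recordsize   8 KiB ->  1.0x write amplification
    # recordsize  32 KiB ->  4.0x write amplification
    # recordsize 128 KiB -> 16.0x write amplification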

Any storage system will misbehave if it is underpowered and mis-configured, and this system is no different in that respect - the behaviour in that situation, however, is not what I would have expected. Good to know, I guess :)

Too slow cluster failover

Consider this: you run a virtual server - the guest OS thinks it has a physical SCSI disk to talk to. In reality that disk is a fake provided by ESXi - it is a VMDK file on an NFS export. Now, if the NFS export is unavailable for 90 seconds, NFS will simply block and wait - as soon as the export is available again, NFS resumes where it left off with no data loss at all. So no problem, right? Wrong! The guest OS will have sent a SCSI command to its (fake) SCSI disk, and it expects that command to complete within the SCSI timeout, which is 60 seconds. When that does not happen, the guest OS will retry, reboot, hang or what have you - all depending on the OS and OS revision in question. Yes, you can often tune this timeout to suit your needs - but in a hosting environment where customers may administer their own servers, that is not necessarily easy to pull off. What this boils down to is that whatever downtime you have on your NFS storage must be less than 60 seconds.
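
The deadline logic is simple, but worth spelling out in a tiny Python sketch (the 60 second SCSI timeout is the figure from above; the outage durations are made up):

    # The guest rides out an NFS outage only if every in-flight SCSI
    # command completes within the guest's own timeout.
    GUEST_SCSI_TIMEOUT_S = 60   # common guest default, per the text above

    def guest_survives(nfs_outage_s):
        # NFS itself resumes with no data loss; the question is whether
        # an in-flight SCSI command outlived the guest's timeout.
        return nfs_outage_s < GUEST_SCSI_TIMEOUT_S

    for outage in (30, 59, 90, 180):
        verdict = "fine" if guest_survives(outage) else "guest retries/hangs/reboots"
        print(f"{outage:3d} s NFS outage -> {verdict}")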

When upgrading firmware on the Oracle appliances, you upgrade the passive head. Then you fail over, so that the passive head becomes the active one, and upgrade the new passive head. Et voilà, both heads are upgraded and the only downtime you had was the time taken by the cluster failover. This works for minor and major firmware upgrades alike, and it is great!

...except for one little detail: with the older firmware revisions, a failover could cause almost 180 seconds of NFS downtime. With the newer ones it seems to be down to around 60 seconds. So things have improved massively and we are almost there - but the bottom line is that either your failover time is comfortably below 60 seconds or it is not, and if it is not, users will notice.

Deduplication

This was one of the cost-reducing features we bought into. Oracle ZFS Appliances employ in-line de-duplication, meaning data is de-duplicated as soon as it enters the appliance. There is no batch job (as on NetApp or Storwize) that has to run nightly. This is great for a hosting environment, because we really don't have off-hours: many systems are busier at night than during the day, and the customers are international. You just cannot take out 8 hours during the night for batch processing...

It turns out that de-duplication builds a de-duplication table (a DDT) holding the hashes of the data blocks that are potentially de-duplicated. If this DDT fits in RAM, all is well. If it does not, the performance of the appliance deteriorates massively, to the point where it very quickly becomes completely unable to serve anything to anyone.
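
To see why this bites so hard, here is a back-of-the-envelope sketch in Python. The roughly 320 bytes per in-core DDT entry is a commonly cited ZFS approximation - an assumption of mine, not an Oracle-published figure - and the workload numbers are made up:

    # Rough RAM needed to keep the whole DDT in core. The ~320 bytes
    # per entry is a commonly cited ZFS approximation (an assumption).
    DDT_ENTRY_BYTES = 320

    def ddt_ram_gib(unique_data_tib, avg_block_kib):
        unique_blocks = unique_data_tib * 2**40 / (avg_block_kib * 2**10)
        return unique_blocks * DDT_ENTRY_BYTES / 2**30

    # 20 TiB of unique data in 8 KiB blocks (a VM-style workload):
    print(f"{ddt_ram_gib(20, 8):.0f} GiB of RAM just for the DDT")  # ~800 GiB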

This is actually in the manual: Oracle does not recommend using this feature unless you absolutely understand your dataset and how the de-duplication works. That does not mean you will believe it when you read it - but mark my words, they advise against it for a reason. Please don't enable it just to play with it, as it has system-wide impact: the full system deteriorates, not just the share where you enable it. And yes, Oracle say this, they are very honest about it, but it is tempting not to believe them and go ahead and try it anyway. Don't.

That was actually the full list of gripes. Not too bad, I guess, all things considered. The important part is that most of these are not an issue anymore at all - not once you know about them.

Happy times!

Now for the next list. There are a few points where the system has delivered above my expectations. We had high expectations, so that is saying something.

Expansion

Need more space or more IOPS? Simple - you buy another shelf of disks like the ones already in your system. While the system is on-line, you cable up the shelf (the SAS links are redundant, so you can safely cable in a new shelf while the system is operating). Once you have confirmed that there are two paths to every shelf in the system (using the simple overview in the web UI), you tell it to extend your storage pool with the new disks. This is just a couple of clicks - it takes a few minutes and causes NO downtime, interruption or degradation of any kind for your users. The system just "magically" adds the disks to your storage pool.

Since ZFS is a copy-on-write system, it gets to choose where it writes new data. It decides which disks to write to based on how full they are and how busy they are - so when you add a full shelf of "virgin" drives, a higher ratio of writes goes to that shelf until your storage has evened out. There is no batch-style re-balance process to run - the system automatically, all by itself, evens the data out across your newly installed shelf of disks.
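
The real ZFS allocator is far more sophisticated than this, but a toy Python sketch shows the idea of biasing new writes towards the emptier vdevs (the shelf names and sizes are made up):

    import random

    # Toy model of free-space-biased write allocation. Each "shelf" is
    # a vdev; new writes favour the vdevs with the most free space.
    class Vdev:
        def __init__(self, name, capacity_tb, used_tb):
            self.name, self.capacity, self.used = name, capacity_tb, used_tb

        @property
        def free(self):
            return self.capacity - self.used

    def pick_vdev(vdevs):
        # Weight the choice by free space, so a freshly added, empty
        # shelf soaks up proportionally more of the new writes.
        return random.choices(vdevs, weights=[v.free for v in vdevs])[0]

    pool = [Vdev("old-shelf-1", 50, 40), Vdev("old-shelf-2", 50, 40),
            Vdev("new-shelf", 50, 0)]
    hits = {v.name: 0 for v in pool}
    for _ in range(10_000):
        hits[pick_vdev(pool).name] += 1
    print(hits)   # most new writes land on the empty shelf (~70% here)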

We have done this a couple of times, and it really is as painless and simple as I make it sound. I am impressed.

Second point: You just buy the shelf. You do not buy extra licenses for replication, snapshots, flash cache, iSCSI, NFS, analytics, ... It is simple. You buy the shelf, you use the shelf. No licensing nightmare. No features that stop working because they are not licensed for all your storage... No. Simple. Nice!

Capacity

As I have covered already, we do not use de-duplication. But we enable compression on everything - the system offers various levels of compression (a trade-off between CPU consumption and compression ratio), but the "cheapest" compression is so CPU-efficient that it does not cost you performance - and yet it gives us above 1.4x compression on average. Enabling the fast compression is a no-brainer - there is simply no downside, unless your workload consists only of already highly compressed data.
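
A quick bit of arithmetic shows what that ratio buys you - the pool size and price below are hypothetical, only the 1.4x is our measured average:

    # What a 1.4x average compression ratio means in practice.
    raw_tb, ratio = 100, 1.4          # hypothetical pool, our avg ratio
    print(f"{raw_tb} TB raw stores ~{raw_tb * ratio:.0f} TB of client data")

    price_per_raw_tb = 1000           # hypothetical currency units
    print(f"effective cost: {price_per_raw_tb / ratio:.0f} per stored TB")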

The system performs very well even with 7200 rpm high-capacity disks as primary storage. Remember, there is only about a factor of 2 in performance between these "slow" drives and the fastest 10k or 15k rpm drives money can buy. Compare that to the difference proper use of flash cache or RAM cache can make, and the speed of the mechanical drives seems nearly irrelevant. Of course it isn't, but this system is very good at delivering performance even from high-capacity drives.
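
Some arithmetic with assumed service times makes the point (all numbers below are my own assumptions, not measurements):

    # Hypothetical per-read service times, in milliseconds:
    ram_ms, flash_ms, disk_7k2_ms, disk_15k_ms = 0.01, 0.2, 10.0, 5.0

    def avg_read_ms(disk_ms, ram_hit=0.80, flash_hit=0.15):
        # Read mix: RAM hits, flash hits, and the remainder from disk.
        disk_share = 1.0 - ram_hit - flash_hit
        return ram_hit * ram_ms + flash_hit * flash_ms + disk_share * disk_ms

    print(f"7200 rpm pool: {avg_read_ms(disk_7k2_ms):.3f} ms avg read")  # ~0.54 ms
    print(f"15k rpm pool:  {avg_read_ms(disk_15k_ms):.3f} ms avg read")  # ~0.29 ms
    # Both are an order of magnitude below raw disk latency - the cache
    # hit rate matters far more than the speed of the spindles behind it.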

What this means is that you get a system offering a lot of proper storage space (including replication, snapshots, NFS, iSCSI and all that jazz) for a relatively small amount of money... compared to much else in the market, at least. It is not as if these appliances come for free - but I genuinely feel that you get a lot for your money.

Performance

You should see the ZFS Appliance as consisting of several layers. There is a "data management" layer which takes care of writing "objects" on disks. On top of that they built two things: the "zvol", a volume which can be exported via iSCSI (or other block protocols if you need them), and the "ZPL", the ZFS POSIX Layer, a file system which can be exported via NFS. So whether you create volumes and export them via iSCSI, or create file systems and export them via NFS, you are working with a "first class" member of the system. Some systems (cough... NetApp... cough...) will create a file inside a file system and export that via iSCSI, which may not be optimal from a performance point of view. Well, the ZFS Appliance does both block and file well. Quite well...

The system does copy-on-write, which means it decides for itself where to write new data. What that means is that random-write workloads become sequential-write workloads for your disks - a brilliant way of improving write performance. What copy-on-write also means is that the file system is always consistent: there is no file system checker for ZFS, because there are no pathologies for a checker to repair.
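
Here is a toy sketch of the idea in Python - not ZFS's actual on-disk logic, just an illustration of how an allocator that never overwrites in place turns random logical writes into sequential physical ones:

    # Toy copy-on-write allocator: random logical writes land at the
    # sequential write frontier; a block map records the new locations.
    class CowDevice:
        def __init__(self):
            self.next_free = 0          # sequential write frontier
            self.block_map = {}         # logical block -> physical block

        def write(self, logical_block):
            # Never overwrite in place: append at the frontier, then
            # repoint the logical block at its new physical location.
            physical = self.next_free
            self.next_free += 1
            self.block_map[logical_block] = physical
            return physical

    dev = CowDevice()
    random_workload = [7, 3, 7, 42, 3, 99]      # "random" logical blocks
    print([dev.write(b) for b in random_workload])
    # [0, 1, 2, 3, 4, 5] - strictly sequential on the physical disk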

To serve synchronous writes quickly, the system employs a couple of interesting SAS devices, the "LogZillas" - as I understand it, these are basically RAM disks with a supercapacitor and some flash to survive a power loss. They are used as NVRAM, allowing synchronous write requests from clients to be acknowledged very quickly.
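
A crude Python sketch of the separate-log idea as I understand it (an illustration of the concept, not the actual ZIL implementation):

    # Sync writes are acknowledged once stable on the fast log device;
    # the slow main pool is updated later, in bulk.
    log_device = []     # fast: RAM plus supercap-backed flash
    main_pool = {}      # slow: mechanical disks

    def sync_write(block, data):
        log_device.append((block, data))  # stable here -> safe to ack
        # ...acknowledge the client now, long before disks are touched

    def flush_txg():
        # Periodic transaction-group flush to the main pool.
        for block, data in log_device:
            main_pool[block] = data
        log_device.clear()

    sync_write(7, b"journal entry")
    flush_txg()
    print(main_pool)    # {7: b'journal entry'}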

Tiering

The system does not do "tiering" in the old-fashioned, traditional sense of the word - but it does something better. Let me explain.

All data goes to the disks (but this is fast - the writes are sequential) and stays there. So the mirroring (or whichever redundancy you choose) is taken care of at that level.

The system MAY then choose to cache some of that data in a read-optimized flash cache (which is around 100 times faster than mechanical disk), or in DRAM (which in turn is 100 times faster than flash).

Since the data is already redundant on your mechanical disks, the system does not need redundancy on flash - and no mirroring means you get twice the effective flash for cache! If a cache device fails, the system simply continues with the remaining devices - no worries.

The system is extremely good at choosing which data to keep in RAM and flash. I typically see more than 80% of all read IOs hitting the system being served directly from RAM. That impresses me too.

This "tiering" between RAM, flash and mechanical disks is a continuous process - it is not a scheduled job that runs once every now and then. It does not ask of you to configure rules for which data-sets to put where. It just very simply does what is best for your system so you get the best performance possible all the time. No administrative hassle, no rules to get wrong. And it really really works.