Having run a number of Oracle ZFS Appliances for a couple of years now, I guess it is fair to take a moment to sum up where they delivered on their promises and where they fell short. The shortcomings first - I would say that this is a comprehensive list of where I feel let down by the system. Now, I would expect to be negatively surprised sooner or later by any system, if I use it intensively for critical jobs over a long period of time. Having this list does not mean it is a bad system - but knowing this list before placing your order can help you make sure you get the system sized right, and that you match your expectations to what is actually possible to do well in the real world.
Ungraceful degradation on overload
When running too many concurrent replications, having too many clients reading and writing too much data at the same time, you would expect your storage system to start responding more slowly to individual requests - so even though the system is processing more IOPS than ever before, the individual clients begin to see the system as slowing down (taking longer to serve an individual IO).
Well, if you happen to be overloading this appliance with the right combination of small writes to otherwise cold data (as for example a vSphere environment would), then you can end up in a situation where the system "pauses" all IO for a few seconds, then serves a spike of IO for a few seconds, pauses again and so forth. This was terrible when it happened to us at first, but I believe we know the cause and the resolution by now - in short, you need enough spindles for your IOPS and you need to use the right recordsize for your shares.
Any storage system will misbehave if underpowered and mis-configured, and this system is no different in that respect - the behaviour in that situation, however, is not what I would have expected. Good to know, I guess :)
Too slow cluster failover
Consider this: You run a virtual server - the guest OS thinks it has a physical SCSI disk to talk to. In reality, that is a fake, provided by ESXi - it is actually a VMDK file on an NFS export. Now, if the NFS export is unavailable for 90 seconds, NFS will simply block and wait - as soon as it is available again, NFS will resume where it left off with no data loss at all. So no problem, right? Wrong! The guest OS will have sent a SCSI command to its (fake) SCSI disk, and it will expect that command to complete within the SCSI timeout, which is 60 seconds. When that does not happen, the guest OS will either retry, reboot, hang or what have you... all depending on the OS and OS revision in question. Yes, you can often tune this timeout to suit your needs - but in a hosting environment where customers may administer their own servers, this is not necessarily easy to pull off. What this boils down to is that whatever downtime you have on your NFS storage needs to stay under 60 seconds.
When upgrading firmware on the Oracle appliances, you upgrade the passive head. Then, you fail over so that the passive becomes the active, and then you can upgrade the new passive head. Et Voila, both your heads are upgraded and the only downtime you had was the time taken by the cluster failover. This works for minor and major firmware upgrades and it is great!
...except for one little detail: In the older firmwares the failover could cause almost 180 seconds of NFS downtime. In the newer firmwares it seems to be down to around 60 seconds. So things have improved massively and we are almost there - but the bottom line is that either failover time is comfortably below 60 seconds, or it is not - and if it is not, then users will notice.
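The arithmetic here is simple but worth making explicit. A minimal sketch, assuming the common 60-second guest disk timeout (your guests and firmware revisions may differ):

```python
# Sketch: does an NFS outage stay within the guests' SCSI timeout budget?
# The 60 s default and the failover durations below are from our own
# observations - treat them as assumptions, not vendor specifications.

GUEST_SCSI_TIMEOUT_S = 60  # common guest OS disk command timeout

def failover_is_safe(failover_time_s: float, margin_s: float = 5.0) -> bool:
    """A failover is only 'invisible' to guests if the NFS outage
    (plus a safety margin) completes before the SCSI timeout fires."""
    return failover_time_s + margin_s < GUEST_SCSI_TIMEOUT_S

print(failover_is_safe(180))  # older firmware: guests see I/O errors
print(failover_is_safe(45))   # newer firmware: guests just see a stall
```

The margin matters: a failover that takes exactly 60 seconds is still a failure from the guest's point of view, because the timeout fires before the storage comes back.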
De-duplication is dangerous
This was one of the cost-reducing features we bought into. Oracle ZFS Appliances employ in-line de-duplication, which means that data is de-duplicated as soon as it enters the appliance. There is no batch job (as on NetApp and Storwize) that has to run nightly. This is great for a hosting environment, because we really don't have off-hours. Many systems are busier at night than during the day, and customers are international. You just cannot take out 8 hours during the night for batch processing...
It turns out that de-duplication results in a de-duplication table (a DDT) which holds the hashes of the data blocks that are potentially de-duplicated. If this DDT fits in RAM, all is well. If it does not fit in RAM, then performance of the appliance will deteriorate massively to the point where it very quickly becomes completely unable to serve anything to anyone.
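To get a feel for how big a DDT can get, here is a back-of-the-envelope sketch. The roughly 320 bytes per unique block is a commonly cited ZFS rule of thumb, not an official Oracle figure - treat the exact constant as an assumption:

```python
# Rough DDT sizing sketch: every unique block in the pool needs an
# in-core table entry. The ~320 bytes/entry figure is a community
# rule of thumb, not a vendor specification.

DDT_BYTES_PER_ENTRY = 320

def ddt_size_bytes(pool_used_bytes: float, avg_block_bytes: float) -> float:
    """Estimate in-core DDT size for a fully de-duplicated pool."""
    unique_blocks = pool_used_bytes / avg_block_bytes
    return unique_blocks * DDT_BYTES_PER_ENTRY

# Example: 20 TB of data at an average 32 KB block size
tb = 1024 ** 4
size = ddt_size_bytes(20 * tb, 32 * 1024)
print(f"{size / 1024 ** 3:.0f} GiB of RAM just for the DDT")
```

Note how sensitive this is to block size: the same 20 TB at an 8 KB average block needs four times the table. Small-block workloads are exactly where the DDT outgrows RAM first.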
This is actually in the manual. Oracle does not recommend using this feature unless you thoroughly understand your dataset and how the de-duplication works. Reading that is one thing - believing it is another. But mark my words, they advise against it for a reason: please don't just enable it to play with it, as it has system-wide impact. The full system deteriorates, not just the share where you enable it. Oracle say this, and they are very honest about it, but it is tempting not to believe them and go ahead and try it anyway. Don't.
That was actually the list of gripes. Not too bad I guess, all things considered. The important part is, that most of these are not an issue anymore at all - not when you know them.
Now for the next list. There are a few points where the system has delivered above my expectations. We had high expectations, so that is saying something.
Need more space or more IOPS? Simple - you go ahead and buy another shelf of disks like the ones you have in your system already. While the system is on-line you cable up the shelf (the SAS links are redundant so you can safely cable in a new shelf while the system is operating). Once you confirm that you have two paths to all shelves in the system (using the simple overview in the web UI), you tell it to extend your storage pool with the new disks. This is just a couple of clicks - it takes a few minutes and it causes NO downtime or interruption or degradation of any kind for your users. The system just "magically" adds the disks to your storage pool.
Since ZFS is a copy-on-write system, it gets to choose where it writes new data. It will decide which disks to write to based on how full they are and how busy they are - therefore, when you add a full shelf of "virgin" drives, a higher ratio of writes will go to this shelf until your storage has been balanced out. So, there is no batch-job-like re-balance process to run - the system will automatically, all by itself, even out the data over your newly installed shelf of disks.
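The principle is easy to sketch: bias new writes toward emptier devices, and a freshly added shelf soaks up a disproportionate share of writes until the pool levels out. This is an illustration of the idea only, not the actual ZFS metaslab allocator:

```python
# Toy copy-on-write allocator: pick a target vdev with probability
# proportional to its free space, so emptier vdevs fill up faster.
# Illustrative only - the real ZFS allocator is far more involved.
import random

def pick_vdev(free_bytes_per_vdev: list) -> int:
    """Pick a vdev index weighted by remaining free space."""
    return random.choices(range(len(free_bytes_per_vdev)),
                          weights=free_bytes_per_vdev)[0]

random.seed(42)  # deterministic demo
# An old, mostly-full shelf next to a brand-new empty one:
free = [2_000, 10_000]  # arbitrary units of free space
picks = [pick_vdev(free) for _ in range(10_000)]
print(f"share of writes to the new shelf: {picks.count(1) / len(picks):.0%}")
```

Run this and roughly five of every six writes land on the new shelf - no explicit re-balance job, just the allocation policy doing its thing on every write.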
We have done this a couple of times, and it really is as painless and simple as I make it sound. I am impressed.
Second point: You just buy the shelf. You do not buy extra licenses for replication, snapshots, flash cache, iSCSI, NFS, analytics, ... It is simple. You buy the shelf, you use the shelf. No licensing nightmare. No features that stop working because they are not licensed for all your storage... No. Simple. Nice!
As I have covered already, we do not use de-duplication. But we enable compression on everything - the system includes various levels of compression (a trade-off between CPU consumption and compression ratio), but the "cheapest" compression is so CPU efficient that it does not cost you performance - and yet it gives us above 1.4x compression on average. Enabling the fast compression is a no-brainer - there is simply no downside unless your workload consists only of already highly compressed data.
The system performs very well even using 7200 rpm high capacity disks for primary storage. Remember, there is only about a factor 2 in performance between these "slow" drives, and the fastest 10k or 15k drives money can buy. Compare that to the difference proper use of flash cache or RAM cache can make, and the speed of the mechanical drives will seem nearly irrelevant. Of course it isn't, but this system is very good at delivering performance even from high capacity drives.
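Some back-of-the-envelope arithmetic backs up that "factor 2" claim. The seek and rotational latency figures below are typical datasheet-style assumptions, not measurements from our appliances:

```python
# Rough random-read IOPS estimate for mechanical drives: one random
# read costs an average seek plus half a platter rotation.
# Latency figures are ballpark datasheet-style assumptions.

def random_read_iops(rpm: float, avg_seek_ms: float) -> float:
    half_rotation_ms = (60_000 / rpm) / 2
    return 1000 / (avg_seek_ms + half_rotation_ms)

slow = random_read_iops(7200, avg_seek_ms=8.5)   # high-capacity drive
fast = random_read_iops(15000, avg_seek_ms=3.5)  # 15k performance drive
print(f"7200 rpm: ~{slow:.0f} IOPS, 15k rpm: ~{fast:.0f} IOPS, "
      f"ratio {fast / slow:.1f}x")
```

Both numbers land in the low hundreds of IOPS, a couple of times apart - while a flash read takes on the order of 0.1 ms. That is why the cache hierarchy dwarfs the choice of spindle speed.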
What this means is that you get a system where you can actually get a lot of proper storage space (which includes replication, snapshots, NFS and iSCSI and all that jazz) for a relatively small amount of money... compared to much else in the market, at least. It is not like these appliances come for free - but I genuinely feel that you get a lot for your money.
You should see the ZFS Appliance as consisting of several layers. There is a "data management" layer which takes care of writing "objects" on disks. And then they built two things on top of that - they built the "zvol", a volume, which can be exported via iSCSI (or other block protocols if you need them), and they built the "zpl", the ZFS POSIX Layer, a file system which can be exported via NFS. So, whether you choose to create volumes and export them via iSCSI or if you create file systems and export them via NFS, you are working with a "first class" member. Some systems (cough.. NetApp.. cough..) will create a file-system file and export that via iSCSI, which may not be optimal from a performance point of view. Well, the ZFS Appliance does both block and file well. Quite well...
The system does copy-on-write, which means it decides for itself where to write new data. What that means is that random-write workloads become sequential-write workloads for your disks. That is a brilliant way of improving write performance. What copy-on-write also means is that the file system is always consistent - there is no file system checker for ZFS because there are no pathologies for a checker to repair.
To serve synchronous writes quickly, the system employs a couple of interesting SAS devices; "LogZillas" - these are basically, as I understand it, RAM disks that include a super capacitor and some flash to survive a power loss. They are used as NVRAM to allow synchronous write requests from clients to be served very quickly.
The system does not do "tiering" in the old fashioned traditional sense of the word - but it does something better. Let me explain.
All data will go to the disks (but this is fast - the writes are sequential), and will stay there. So the mirroring (or whichever redundancy you choose) is taken care of at that level.
The system MAY then choose to cache some of that data either in a read-optimized flash cache (which is around 100 times faster than mechanical disk), or it may choose to cache the data in DRAM (which is 100 times faster than flash).
Since the data is already redundant on your mechanical disks, the system does not need to keep redundancy on flash - so no mirroring means you get twice the effective flash for cache! If a cache device fails, the system continues working with the remaining devices, no worries.
The system is extremely good at choosing what data to keep in RAM and flash. I typically see more than 80% of all read IOs that hit the system being served directly from RAM. That impresses me too.
This "tiering" between RAM, flash and mechanical disks is a continuous process - it is not a scheduled job that runs once every now and then. It does not ask of you to configure rules for which data-sets to put where. It just very simply does what is best for your system so you get the best performance possible all the time. No administrative hassle, no rules to get wrong. And it really really works.
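The shape of this continuous tiering can be sketched with a toy two-tier read cache: hot blocks are served from a small fast tier (RAM), blocks demoted from there land in a larger second tier (flash), and everything else comes off disk. The real ARC/L2ARC logic is far more sophisticated; this is just the idea:

```python
# Toy two-tier LRU read cache illustrating RAM -> flash -> disk tiering.
# Not the real ARC/L2ARC algorithm - just the shape of it.
import random
from collections import OrderedDict

class TwoTierCache:
    def __init__(self, ram_slots: int, flash_slots: int):
        self.ram = OrderedDict()    # small, fastest tier
        self.flash = OrderedDict()  # larger, second tier
        self.ram_slots, self.flash_slots = ram_slots, flash_slots

    def read(self, block: int) -> str:
        """Serve a block, promote it to RAM, return which tier had it."""
        if block in self.ram:
            self.ram.move_to_end(block)
            return "ram"
        tier = "flash" if block in self.flash else "disk"
        if tier == "flash":
            del self.flash[block]
        self.ram[block] = True
        if len(self.ram) > self.ram_slots:      # demote coldest RAM block
            evicted, _ = self.ram.popitem(last=False)
            self.flash[evicted] = True
            if len(self.flash) > self.flash_slots:
                self.flash.popitem(last=False)  # falls back to disk only

        return tier

random.seed(1)
cache = TwoTierCache(ram_slots=100, flash_slots=400)
blocks = range(2000)
weights = [1 / (b + 1) for b in blocks]  # heavily skewed access pattern
hits = [cache.read(b) for b in random.choices(blocks, weights, k=20_000)]
print({t: round(hits.count(t) / len(hits), 2) for t in ("ram", "flash", "disk")})
```

Even this crude version serves the majority of a skewed workload from the fast tier, with no rules configured and no scheduled job - which is exactly the behaviour the appliance exhibits, only far better tuned.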
As a first post in the series, it makes sense to briefly set the scene for what led up to us using this particular system. Maybe some of this can be useful to others in the same situation as we were.
Before actually acquiring these systems you need to consider whether they are the right option of course. What we needed and what we felt that the Oracle ZFS Appliances could deliver, are:
- Proper iSCSI storage delivered over Ethernet to Hyper-V or SQL Server systems
- High performance NFS suitable for both mail and web hosting as well as vSphere
- Easy and simple administration with little to no daily chores
- Simple pricing model on acquisition as well as expansion
- Low cost per TB while maintaining sufficient performance
- Snapshots on filesystems and volumes
- Asynchronous replication across sites
NetApp
Everyone knows NetApp - I don't know if anyone ever got fired for buying NetApp... First of all, the web user interface of the NetApp is not very impressive - it is very rudimentary, and you have nearly no insight into what the appliance is doing. Which client is doing how many IOs, what are the latencies, how many writes versus reads are going to the bottom shelf, etc. etc. These are all questions you cannot answer by looking at the user interface. So that's a downer, if your job is to figure out why your storage infrastructure isn't delivering... Second, and this was probably what did it for us, was that we could simply not get decent NFS performance from it when doing IO on huge numbers of small files (a typical Maildir setup). We were testing in a vendor lab with a vendor-provided engineer, and he ended up concluding that the performance was "just fine". We could see that if this was "just fine", then "just fine" would not cut it for us. So basically we decided against it because of manageability and performance.
IBM Storwize Unified
The new kid on the block (at that time). The Storwize is a block storage system (and probably a rather good one at that) that does iSCSI (and other block protocols if you need them), coupled with a clustered pair of GPFS serving front ends for your NFS (and CIFS) needs. Now, using GPFS in a cluster is (at least in theory) brilliant, because this is a properly clustered file system, quite unlike both NetApp WAFL and Oracle ZFS. What this means is that the clustered heads run active-active, and if one fails, you have no downtime at all. There is no fail-over delay in an active-active setup. The UI was modern - I think it is the UI team from the XIV which got to do the UI for the Storwize too - pretty and functional. It had many more data points to inspect than the NetApp (but it doesn't come near the Oracle). Performance on the small-file NFS workload was much better than on the NetApp - it was, as I remember it, acceptable. In the end, it was a closer race between the Storwize and the Oracle - we needed site replication of data, and the Storwize could not do that at the time. This was one of the major factors in us choosing Oracle over Storwize. Everyone on the team would have their own angle on these systems of course, but a few points - downsides and upsides - that I noted on the Storwize were:
- The pricing model - complicated set of licenses
- Upgrades - extra hardware and extra licenses required
- An architectural kludge; isolated block storage solution plus NFS heads hanging on top, managed by a UI that attempts to span both worlds - when you delete a file on the GPFS, that does not necessarily free space on the block volume, for example
And on the plus side:
- One of the few systems at the time to allow tiering between flash, fast disk and slow disk
- Proper cluster file system (GPFS) on NFS heads for zero-downtime node failure
- Vendor provided lab time, equipment and personnel for testing - very nice!
Oracle ZFS Appliance
So what happened when we tested the Oracle appliance? One was provided to us for testing and we ran the same workload on it as we did the others. A few things stood out when testing this system:
- File-systems and volumes provision instantly - there is no waiting for volume provisioning and formatting. A LUN or share of any size is instantly available when created.
- One large pool of storage is shared among all file systems and LUNs - this allows all-out thin provisioning. You can limit the maximum size of a file system by setting a quota (and you should). So basically, instead of re-sizing file systems the old fashioned way (which is often time consuming for the admin and resource intensive on the storage system, causing slowdowns for clients), you simply change the quota on the file system. Again, the change is instant with no impact on performance.
- Analytics. From the web user interface one can drill down into incredible detail on the system and workload. From starting points like Network IO, Disk IO, NFS operations etc. you can drill down "by client", "by latency", "by type of operation" and so forth - and you can drill down again and again. So displaying the latency of read operations to a particular drive, for example, is a straightforward operation. I do not doubt that experts can do that using a command line or a third party tool on some other systems too, but on this system it is straightforward and intuitive.
- The magic numbers are gone. There are no practical limits. You can have any number of files of any size in any number of directories with any number of snapshots in any number of file systems. Most competing systems seem to have reachable limits on snapshot counts, file system sizes, LUN sizes and sometimes file sizes.
The 7320 system runs primary storage for client systems - it provides iSCSI and NFS storage to a number of systems. It also replicates its data periodically to the remote 7120 system. The idea is that in case the primary datacenter is lost, a very recent copy of all data will be on-line on the secondary system. Performance wise it will be slower than the primary, but the data is ready on-line immediately. No restoring from tapes for several days...
The primary system is sized to be able to run de-duplication of selected shares. De-duplication on ZFS is block-level pool-wide in-line deduplication - this means, all blocks that are written to the storage pool will cause a lookup in a rather large hash table. If this hash table can fit in RAM or flash, then all is well. If not, then the system will slow to a crawl. However, our primary system has a terabyte of flash so it will be perfectly able to dedup anything we throw at it.
The secondary system is not sized to run deduplication. The theory at the time we configured it was that since the system does not run as primary storage but mostly just manages a replica, it did not need to be sized to run dedup. It would just receive the replicas as serialized ZFS snapshot differences, and that would be it. If the secondary system was to take over the primary role (temporarily, until a new primary could be established), we could just disable deduplication or live with less optimal performance.
In the real world though, this does not work as intended. When the secondary system receives a replica update, it will still need to access the de-duplication hash table (DDT) to write the data to the pool. In our case, this system with 24GB memory had to do lookups in a 50GB DDT which means it took a disk read for every single write. Needless to say, this is not fast. In fact, both replication completions and dataset deletions cause the secondary system to "freeze" - well, the management interface fails and cannot restart until the operation completes. I had a single dataset deletion take more than five days(!) to complete. During that time, the system served NFS just fine, but the management interface was down and no replica updates were received. It took an Oracle product expert with a shared shell to diagnose the problem and give an ETA of the completion of the operation.
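A quick sketch shows why the undersized secondary crawled. The 50 GB DDT and 24 GB of RAM are from our case; the fraction of RAM actually available for the DDT is a guess, as is the disk latency:

```python
# Why an undersized dedup box crawls: with the DDT much larger than RAM,
# most DDT lookups miss memory and cost a random disk read.
# RAM_FOR_DDT_GB and DISK_READ_MS are assumptions, not measurements.

DDT_GB = 50
RAM_FOR_DDT_GB = 12   # assumption: roughly half of 24 GB usable for DDT
DISK_READ_MS = 10     # typical 7200 rpm random read latency

miss_ratio = max(0.0, 1 - RAM_FOR_DDT_GB / DDT_GB)
avg_lookup_ms = miss_ratio * DISK_READ_MS
writes_per_s = 1000 / avg_lookup_ms
print(f"{miss_ratio:.0%} of DDT lookups hit disk -> "
      f"~{writes_per_s:.0f} de-duplicated writes/second, per stream")
```

A low three-digit number of writes per second is floppy-disk territory for a storage appliance, and dataset deletions are even worse, since freeing every block requires the same DDT round-trip - which is how a single deletion can run for days.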
So right now I am disabling deduplication on the primary system, to prevent deduplication from being activated on the secondary. This is not as terrible as it sounds, as the data we have doesn't really deduplicate all that well. Compression still works well and we use it on everything.
The lesson here is, that for a replicated setup, the secondary system really needs to be specced to run dedup if you want the primary system to. Since the 7120 does not support flash read cache, this means you need at least a single 7320 node on the secondary site.