Jakob Østergaard Hegelund

Tech stuff of all kinds
Archive for November 2012

Sun ZFS, replication and de-duplication


I have a storage setup consisting of a 7320 cluster in one datacenter and a much slower but higher-capacity 7120 system in a secondary datacenter.

The 7320 system runs primary storage for client systems - it provides iSCSI and NFS storage to a number of systems. It also replicates its data periodically to the remote 7120 system. The idea is that in case the primary datacenter is lost, a very recent copy of all data will be on-line on the secondary system. Performance wise it will be slower than the primary, but the data is ready on-line immediately. No restoring from tapes for several days...

The primary system is sized to be able to run de-duplication of selected shares. De-duplication on ZFS is block-level, in-line deduplication with a pool-wide table - this means that every block written to a share with dedup enabled causes a lookup in a rather large hash table. If this hash table can fit in RAM or flash, then all is well. If not, then the system will slow to a crawl. However, our primary system has a terabyte of flash so it will be perfectly able to dedup anything we throw at it.
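To make the lookup-per-block idea concrete, here is a minimal toy sketch in Python. It is not how ZFS is implemented; the class name, the tiny block size and the in-memory dictionary are all assumptions made for illustration. It only shows the principle: every written block is checksummed and looked up in a table, and a duplicate costs a reference count bump instead of new storage.

```python
import hashlib

BLOCK_SIZE = 4  # tiny, made-up block size so the example is easy to follow


class DedupPool:
    """Toy pool: stores each unique block once, keyed by its checksum."""

    def __init__(self):
        self.ddt = {}      # checksum -> [block data, reference count]
        self.lookups = 0   # every written block costs one table lookup

    def write(self, data):
        """Split data into blocks, dedup in-line, return the block pointers."""
        pointers = []
        for off in range(0, len(data), BLOCK_SIZE):
            block = data[off:off + BLOCK_SIZE]
            key = hashlib.sha256(block).hexdigest()
            self.lookups += 1
            entry = self.ddt.get(key)
            if entry is None:
                self.ddt[key] = [block, 1]   # new block: store it
            else:
                entry[1] += 1                # duplicate: bump the refcount only
            pointers.append(key)
        return pointers

    def read(self, pointers):
        """Reassemble data from its block pointers."""
        return b"".join(self.ddt[k][0] for k in pointers)


pool = DedupPool()
a = pool.write(b"AAAABBBBAAAA")
b = pool.write(b"BBBBCCCC")
print(len(pool.ddt), pool.lookups)  # 3 unique blocks stored, 5 lookups done
```

Five blocks were written but only three are stored. The price is the five lookups, and that is the part that hurts once the table no longer fits in fast memory.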

The secondary system is not sized to run deduplication. The theory at the time we configured it was that since the system does not run as primary storage but mostly just manages a replica, it did not need to be sized to run dedup. It would just receive the replicas as serialized ZFS snapshot differences, and that would be it. If the secondary system were to take over the primary role (temporarily until a new primary could be established), we could just disable deduplication or live with less optimal performance.

In the real world though, this does not work as intended. When the secondary system receives a replica update, it will still need to access the de-duplication hash table (DDT) to write the data to the pool. In our case, this system with 24GB memory had to do lookups in a 50GB DDT which means it took a disk read for every single write. Needless to say, this is not fast. In fact, both replication completions and dataset deletions cause the secondary system to "freeze" - well, the management interface fails and cannot restart until the operation completes. I had a single dataset deletion take more than five days(!) to complete. During that time, the system served NFS just fine, but the management interface was down and no replica updates were received. It took an Oracle product expert with a shared shell to diagnose the problem and give an ETA of the completion of the operation.
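A rough back-of-the-envelope sketch of why this hurts so much. The 24 GB and 50 GB figures are from above; everything else is an assumption: the model treats lookups as uniformly random over the table (checksums are effectively random, so there is little locality to exploit), and the lookup count and disk latency in the second call are hypothetical inputs, not measurements from our system.

```python
def ddt_miss_fraction(ddt_gb, cache_gb):
    """Fraction of DDT lookups that must go to disk, assuming lookups are
    spread uniformly over the table."""
    if ddt_gb <= 0:
        return 0.0
    return max(0.0, 1.0 - cache_gb / ddt_gb)


def hours_for_lookups(n_lookups, miss_fraction, disk_read_ms):
    """Time spent waiting on DDT disk reads, issued one at a time."""
    return n_lookups * miss_fraction * disk_read_ms / 1000 / 3600


# Even if ALL 24 GB of memory held DDT entries (it cannot, the system
# needs memory for everything else too), about half the lookups would miss.
best_case = ddt_miss_fraction(ddt_gb=50, cache_gb=24)
print(f"best-case miss fraction: {best_case:.2f}")  # 0.52

# Hypothetical inputs: 100 million block frees, 10 ms per random disk read.
hours = hours_for_lookups(100_000_000, best_case, disk_read_ms=10)
print(f"illustrative wait: {hours:.0f} hours")
```

The exact numbers do not matter. The point is that once misses are measured in disk seeks, any operation touching a large share of the table is counted in days, not minutes.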

So right now I am disabling deduplication on the primary system, to prevent deduplication from being activated on the secondary. This is not as terrible as it sounds, as the data we have doesn't really deduplicate all that well. Compression still works well and we use it on everything.

The lesson here is that for a replicated setup, the secondary system really needs to be specced to run dedup if you want the primary system to. Since the 7120 does not support flash read cache, this means you need at least a single 7320 node on the secondary site.

Hyper-V 2012 versus VMWare vSphere 5

Tasked with investigating whether a big P2V project should go the Hyper-V 3 (Server 2012) route or the VMWare (vSphere 5.1) route, I spent a few weeks testing both solutions. I also spent some time googling the subject - and that is actually why I am writing about it here. The classic "VMWare versus Hyper-V" search yielded absolutely no useful information. In this post I will add my findings in the hope that this will be useful to someone.

The history and now

I have not worked with vSphere or Hyper-V in the past, so as a newcomer to both systems I have had to do some reading on both. Among the readings were best practices from both VMWare and Microsoft.

VMWare is by far the market leader, and has been since they started the whole PC virtualization market with VMWare Workstation. Microsoft is playing catch-up with Hyper-V, which has just been released in its third major version with Windows 8 and Server 2012. However, as people say, "Microsoft usually gets it right the third time" and that is now. Server 2012 with Hyper-V 3 looks promising from a virtualization point of view.

Feature wise, if you look at just what the hypervisor can provide for your guest systems, I will make the claim that they are both just fine. Both will offer perfectly fine solutions for a wide range of guest applications. If you look at Hyper-V marketing material you will see how Hyper-V supports bigger disks than vSphere and if you look at vSphere material you will see how VMWare has much better memory management allowing for better memory overcommit on a broad range of guest systems. There are probably hundreds of such points where one solution offers something better than the other - but all in all I find this largely irrelevant. You need a solution that works, that is manageable, reliable and affordable. If one solution costs you 10% more RAM than the other, that is really just a rounding error on the total cost of running a large virtualization setup. You can buy a lot of RAM for what an hour of downtime will cost you in personnel expenses and lost business.

So, even though both camps will hate me for saying that there is no big difference in basic hypervisor features, I am going to make the claim that there isn't. Sure, point for point there are big differences, but if you look at what the differences really mean, on the bigger scale, the differences are insignificant.

The big differences lie elsewhere. The "big three" that I found are: Complexity, Reliability and Storage.

1: Complexity

I want a clustered setup so that if the hardware of a hypervisor fails, my guest systems will start up on another hypervisor. Also, I want the software to place the guests on the available hardware so that I don't have to think about which guests run on which hypervisors. All in all, I want to decouple my systems from the problems of physical hardware, and I don't want a ton of new problems introduced by the virtualization solution. In other words; I want life to be simpler, not harder. Oh, and I don't want to have to care about failing hardware.

My setup will consist of:

Have you ever cold booted a data center? Have you ever had to cold boot systems with circular dependencies?

If you can answer yes to one or more of the above, try to imagine cold booting your virtualization setup that runs all your internal and customer systems. Your entire shop is down and you need to get things back up. Now try to think about what kind of dependencies you can accept. Sure, the storage needs power before your hypervisors can use it. Simple enough to understand - hopefully your storage does not depend on an AD, and hopefully you didn't virtualize the AD... Or did you? If not, how many of the old stand-alone servers need to run before you start the storage, and how many need to run before you start the hypervisors?

Complexity kills.

I installed a vSphere 5.0 cluster complete with virtualized vCenter server, shared NFS storage, failover and everything, in less than two days. From scratch. Having never even worked with a vCenter client before. Having never installed an ESXi hypervisor before. Not even knowing exactly what each term covered; vCenter, vSphere, vCloud, vWhathaveyou.

I installed a Hyper-V 3 cluster complete with iSCSI shared storage, non-virtualized AD and non-virtualized System Center in slightly more than two weeks. I had considerably more Windows skills than vSphere skills when I set out to do this, so the massive effort needed to just get the basic setup up and running was disappointing to me.

The difference in complexity between the two solutions is massive. Let us first look at vSphere: Each ESXi hypervisor can be seen as a fairly autonomous unit - it knows which guests it should run, and it knows that it may need to take over guests from other hypervisors in case they fail. The vCenter server is the high level logic that does all this configuration for you, so that all the little ESXis know their place in the cluster. However, the vCenter server is not needed for failover to occur - each little ESXi will do exactly what it has been configured to do, take over guests from others that fail, without intervention from the vCenter. It is very nice that the ESXi hypervisors do not need to depend on vCenter for such basic operations, but it also opens up the interesting opportunity to virtualize your vCenter - in the very cluster that it manages! This is absolutely brilliant because it both isolates your vCenter from failure of physical hardware (it will fail over just like any other guest) and frees you from having an external system that must also be running before your hypervisor cluster can run. In fact, once your shared storage is up, you can power on the cluster. There are no other dependencies. Cold-boot of a data center is a two step process. Simple.

Enter Hyper-V. First of all, instead of the 32 MB ESXi image that is easily installed on internal flash storage on your blade server, you need a full Windows 2012 Server install - that means disk drives and an array controller to do your RAID-1, even though you will have all guest storage on your SAN. This is not really a problem of course - but the two base hypervisor installs provide an interesting contrast. In order to run a cluster so that your guest systems can fail over in case of physical (or other) hypervisor malfunction, you must use a cluster-shared-volume (CSV) over iSCSI. For this to fly, you must have an AD. Since Hyper-V is run as a role in Windows Failover Clustering, your hypervisor is not going to function until the AD is up. This means you must install an AD (which is at least two servers - for failover) as physical servers (or virtualized on another solution - however ironic that would be).

During my testing it quickly became clear that the resilience of a Hyper-V cluster alone is lacking - in case of network connectivity problems that cause even minor iSCSI connectivity problems, guest systems are readily powered off by the failover cluster manager. Once the network is back to normal, it is not possible to see which guest systems were switched off due to the network problems, and which have been switched off for administrative reasons (customer didn't pay, system is hacked, or whatever other reasons you could have to switch off a guest system). My hope was that System Center could rectify this by applying some higher level logic and automatically re-start the guest systems that fail. That means, you also need your System Center server running before your virtualization setup will function - and it also means that you cannot virtualize your SC server (at least not on your Hyper-V cluster).
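The higher level logic I was hoping for is simple to state. Below is a minimal sketch of it in Python. This is not a System Center or Failover Clustering API, and all guest names are made up. It only shows the idea of keeping a record of the intended state of each guest and comparing it to what the cluster actually reports.

```python
def guests_to_restart(desired, actual):
    """Return guests that should be running but are currently off.

    desired: guest name -> "on" or "off" (what the administrator intends)
    actual:  guest name -> "on" or "off" (what the cluster reports)
    """
    return sorted(
        name for name, want in desired.items()
        if want == "on" and actual.get(name, "off") == "off"
    )


# Hypothetical guests: web01 was knocked off by a network glitch,
# unpaid-customer was switched off on purpose and must stay off.
desired = {"web01": "on", "db01": "on", "unpaid-customer": "off"}
actual = {"web01": "off", "db01": "on", "unpaid-customer": "off"}
print(guests_to_restart(desired, actual))  # ['web01']
```

Without a record of the desired state, a guest that was powered off by a glitch and a guest that was powered off on purpose look exactly the same.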

Cold booting our Hyper-V based datacenter is now a four step process. Boot the storage. Boot the AD. Boot the SC. Boot the hypervisors. Sure this is still manageable, if these are the only dependencies you have... That said, during my weeks with Hyper-V, I never actually got SC to apply such policy to my guest systems. I don't know if it is possible - I assumed it would be, but I have not actually seen it work in real life.

What I have seen, however, is my cluster-shared-volume being dropped by hypervisors even during minor network problems. The iSCSI standard dictates a 60s timeout, but my CSV got dropped (causing guest systems to power off) much faster than that. I even once got the rather disturbing log message that my CSV was lost and "might be corrupted". Imagine that... That your multi-terabyte LUN that holds hundreds of guest systems gets corrupted simply due to a network glitch. My CSV was fine though, but the fact that the error message even exists is worrying. I mean, is it really a possibility or is the log message simply written by someone who doesn't know the system? I will get back to the storage discussion shortly.

Considering the complexity of the two setups - both the direct complexity measured in the number of discrete non-virtualized systems you need to run aside from your storage and the hypervisors themselves (vSphere: 0, Hyper-V: 3 (2 AD + SC)) and the steps in a cold boot (vSphere: 2, Hyper-V: 4), but also the functional complexity of the individual components, the difference is huge.
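The cold boot question is really a dependency graph question, and it can be sketched in a few lines of Python using the standard library's graphlib module (Python 3.9 or later). The graphs below are my reading of the two setups described above. The edge from SC to AD is an assumption on my part, and the circular example is made up to show what the bad case looks like.

```python
from graphlib import TopologicalSorter, CycleError  # Python 3.9+

# component -> set of components that must be up first
VSPHERE = {
    "storage": set(),
    "hypervisors": {"storage"},  # vCenter runs as a guest, so no extra step
}
HYPERV = {
    "storage": set(),
    "AD": set(),
    "SC": {"AD"},                            # assumption: SC needs the AD
    "hypervisors": {"storage", "AD", "SC"},
}
# Made-up circular setup: storage authenticates against an AD that is a guest.
CIRCULAR = {
    "storage": {"AD"},
    "AD": {"hypervisors"},
    "hypervisors": {"storage"},
}


def cold_boot_order(deps):
    """Return a valid boot order, or None if the dependencies are circular."""
    try:
        return list(TopologicalSorter(deps).static_order())
    except CycleError:
        return None


for name, deps in [("vSphere", VSPHERE), ("Hyper-V", HYPERV), ("circular", CIRCULAR)]:
    print(name, cold_boot_order(deps))
```

The vSphere graph yields two steps and the Hyper-V graph four. The circular graph yields no order at all, which is exactly the situation you do not want to discover when the whole shop is down.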

2: Reliability

You can build a system so simple that it obviously has no deficiencies. Or you can build a system so complicated that it has no obvious deficiencies. Very closely related to the Complexity difference is the reliability difference as I observed it during my testing.

Over the course of about two months, I ran a few handfuls of guest systems on my vSphere cluster. During that time there was a serious switch problem in the blade enclosure of one of the hypervisor blades - the switch was overloaded and dropped a significant amount of its traffic. There have been many many smaller glitches - reboots of switches that cause spanning tree convergence (30 second complete network blackouts) and so on. During this time I have not had a single guest system power off or fail. It seems the systems "freeze" when ESXi loses connectivity to its storage, only to "thaw" when the network normalizes.

In the three weeks I worked with Hyper-V, I ran fewer guest systems on that cluster. I experienced a handful of seemingly spontaneous power-offs - upon investigation it was clear that each and every one of them was caused by the CSV having problems due to network glitches. What a pity that the Hyper-V storage binding is so fragile. It quickly became routine that every morning I would check on the guest systems and power them on if they had been powered off during the night. This was what I hoped that SC would help me do - I certainly wouldn't consider running hundreds or thousands of guests like this, having to remember which to manually power on every morning...

So why this difference? Well, consider again the complexity of the two solutions:

On vSphere, each ESXi speaks NFS to the storage. Any consistency concerns are handled solely by the storage system, the hypervisor is just a client.

On Hyper-V, each hypervisor is a clustered Windows server that, with the help of a set of AD servers, coordinates access to a shared NTFS-derived (CSV) file-system in order to provide access to storage for itself in a way that does not conflict with other hypervisors using the same underlying storage. Any single hypervisor can corrupt the full volume.

Each ESXi knows which guests it may need to take over for other hypervisors that fail, and it can do so on its own. I was pleasantly surprised by the simplicity of this part of the vSphere design.

Each Hyper-V is run as a role under the failover cluster manager. If a guest system fails on one Hyper-V, the cluster manager can re-start it on another Hyper-V - in real life, however, the CSV will typically be lost first (due to a small network glitch), causing the failure of the actual guest systems which can then not fail over, because the CSV has failed. This in turn causes the power-off of all affected guest systems.
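The difference in behaviour I observed boils down to what the storage client does when the storage goes away for a while. The sketch below is a deliberately crude model in Python, not a description of ESXi or CSV internals. The function name is made up and the 10 second timeout is a made-up value, since I only know that my CSV was dropped much faster than 60 seconds.

```python
def guest_state_after_outage(outage_s, io_timeout_s=None):
    """State of a guest after its storage was unreachable for outage_s seconds.

    io_timeout_s=None models a client that stalls I/O until storage returns.
    A number models a client that gives up on the volume after that long.
    """
    if io_timeout_s is None or outage_s < io_timeout_s:
        return "running"      # guest froze during the outage, then thawed
    return "powered off"      # volume was dropped, guest went down with it


# A 30 second spanning tree blackout, as mentioned earlier.
print(guest_state_after_outage(30))                   # running
print(guest_state_after_outage(30, io_timeout_s=10))  # powered off
```

With the stalling client, an outage costs you the length of the outage. With the impatient client, it costs you the outage plus however long it takes someone to notice and power the guests back on.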

In the rare case where you do a controlled shutdown of a hypervisor (using, for example, cluster-aware updating - a brilliant feature in Server 2012 by the way!), the guest systems are live migrated to other hypervisors and each hypervisor can then, in turn, be updated without any downtime of the guests. So, on paper, the Hyper-V solution is great - but in the real world where networks can fail, I find the resilience of clustered Hyper-V abysmal and the complexity of the setup disturbing.

3: Storage

With a decent NFS filer (we use Oracle ZFS appliances), you can have filer-based snapshots done automatically with just a small storage overhead. You can access these snapshots directly over NFS and they can be on-line always. You can even replicate your storage to a secondary location where you can store snapshots further back in time to allow for quick restores of week old guest systems.

All the snapshot and replication logic is possible with an iSCSI LUN too of course, but all (or many of) your guest systems will reside on a single LUN. That means - if you want to copy over a single guest system from your week old snapshot, you need to mount the full LUN snapshot to extract the data. You need to bring the snapshot on-line and you need some client system to mount it. Then you can start extracting the data. It may seem like a small difference, and conceptually it is, but on a day-to-day basis I believe that the ability to work with your guest system files as just that - files - and copy them around as you please, will make life simpler. You will never need to interact with your filer to restore a snapshot - they are always accessible, always on-line, and can be copied around with a simple copy command (or file manager - pick your poison).
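As an illustration of how little is involved, here is a sketch of such a restore in Python. It assumes the usual ZFS layout where snapshots show up under a .zfs/snapshot directory inside the share. The function name, the ".restored" suffix and any share, snapshot or guest names you pass in are hypothetical.

```python
import shutil
from pathlib import Path


def restore_guest(share, snapshot, guest, suffix=".restored"):
    """Copy one guest's directory out of a snapshot back into the live share.

    Assumes snapshots appear under <share>/.zfs/snapshot/<snapshot>/.
    The copy lands next to the live guest so nothing is overwritten.
    """
    src = Path(share) / ".zfs" / "snapshot" / snapshot / guest
    dst = Path(share) / (guest + suffix)
    shutil.copytree(src, dst)  # raises if src is missing or dst already exists
    return dst
```

A call such as restore_guest("/mnt/vmstore", "weekly", "web01") would then leave a copy of the week old guest in /mnt/vmstore/web01.restored, with no LUN to bring on-line and no mount client involved.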

Then we're back to the good old NFS versus iSCSI performance arguments. Well guess what - there are none. You can get the performance you need over any protocol - sure, a given NetApp filer may do NFS better than iSCSI and a Storwize may do iSCSI better than NFS, but you can easily get whatever performance you need using either protocol. What you really need to decide is, do you want iSCSI or NFS? For virtualization in a setup like I describe here, there is no contest. Do you want easily accessible files and always on-line snapshots, or do you want multi-terabyte LUNs with dedicated "mount clients" to extract your data? NFS is simple, the alternative is not.

All in all

There are tons of small and not-so-small annoyances and shortcomings of both solutions. Like why so many VMWare alerts are nearly useless because of their lack of detail or contextual information. Or why you cannot hot-add a NIC in Hyper-V. Or why VMWare converter does not properly support more LVM based Linux installations. Or why Hyper-V has no VMWare converter equivalent.

There is the thing with licensing also. VMWare is outrageously expensive. Running a cluster over three years, you will end up paying more for the vSphere licenses than for your hardware. Considering that vSphere is just the thin layer between the hardware and the actual guest systems that are really the reason we do any of this, it seems wrong to pay so disproportionately much for such a little layer.

All in all though, a vSphere cluster with failover and load balancing will take away a lot of pain. The vSphere software may be a thin layer, but it is such an important layer to get right - and if you want this layer to actually work, it seems that there is very little competition. And that is why the outrageous pricing from VMWare is not really outrageous.