Jakob Østergaard Hegelund

Tech stuff of all kinds

Sun ZFS, replication and de-duplication

2012-11-21

I have a storage setup consisting of a 7320 cluster in one datacenter and a much slower but higher-capacity 7120 system in a secondary datacenter.

The 7320 system runs primary storage for client systems - it provides iSCSI and NFS storage to a number of hosts. It also replicates its data periodically to the remote 7120 system. The idea is that if the primary datacenter is lost, a very recent copy of all data will be on-line on the secondary system. Performance-wise it will be slower than the primary, but the data is ready on-line immediately. No restoring from tapes for several days...

The primary system is sized to be able to run de-duplication on selected shares. De-duplication on ZFS is block-level, pool-wide and in-line - every block written to the storage pool causes a lookup in a rather large hash table, the de-duplication table (DDT). If this table fits in RAM or flash, all is well. If not, the system slows to a crawl. Our primary system has a terabyte of flash, so it is perfectly able to dedup anything we throw at it.
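On a generic ZFS system you can get an idea of how large the DDT is - and therefore how much RAM or flash dedup will want - with zdb. This is just a sketch assuming a plain Solaris box with a pool named tank; the appliances do not normally give you a shell, so there you would need Oracle support to run anything like this:

    # Simulate dedup on a pool that does not have it enabled:
    # prints a DDT histogram and the dedup ratio you would get.
    zdb -S tank

    # On a pool with dedup already enabled, print DDT statistics,
    # including entry counts and in-core/on-disk entry sizes.
    zdb -DD tank

From the entry count you can estimate the memory demand, since each in-core DDT entry costs a few hundred bytes.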

The secondary system is not sized to run deduplication. The theory, at the time we configured it, was that since the system does not run as primary storage but mostly just manages a replica, it would not need to be sized for dedup. It would simply receive the replicas as serialized ZFS snapshot differences, and that would be it. If the secondary system had to take over the primary role (temporarily, until a new primary could be established), we could just disable deduplication or live with less than optimal performance.
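Under the hood this is ordinary ZFS snapshot replication. A minimal hand-rolled sketch of the same mechanism - the appliances drive all of this through their replication service, and the names tank/data and backuphost are made up for illustration:

    # On the primary: take a new snapshot of the share
    zfs snapshot tank/data@2012-11-21

    # Send only the changes since the last replicated snapshot,
    # and apply them to the replica on the secondary
    zfs send -i tank/data@2012-11-14 tank/data@2012-11-21 | \
        ssh backuphost zfs receive backup/data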

In the real world, though, this does not work as intended. When the secondary system receives a replica update, it still needs to consult the DDT to write the data to the pool. In our case, a system with 24GB of memory had to do lookups in a 50GB DDT, which means practically every single write required a disk read first. Needless to say, this is not fast. In fact, both replication completions and dataset deletions cause the secondary system to "freeze" - the management interface fails and cannot be restarted until the operation completes. I had a single dataset deletion take more than five days(!) to complete. During that time the system served NFS just fine, but the management interface was down and no replica updates were received. It took an Oracle product expert with a shared shell to diagnose the problem and give an ETA for the completion of the operation.
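Some back-of-envelope arithmetic shows the scale of the problem, using the commonly quoted rule of thumb of roughly 320 bytes per in-core DDT entry (the exact size varies):

    50 GB / 320 bytes per entry  ~  160 million unique blocks
    160 million blocks * 128 KB  ~  20 TB of unique data (at the default recordsize)

With 24GB of RAM, most of which is needed for the normal ARC, the bulk of those 160 million entries can only live on disk - so nearly every incoming block costs a random disk read before it can be written.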

So right now I am disabling deduplication on the primary system to prevent it from being activated on the secondary (replicated shares carry their settings with them). This is not as terrible as it sounds, as the data we have doesn't really deduplicate all that well anyway. Compression, on the other hand, works well, and we use it on everything.
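On a generic ZFS system the equivalent steps would look like the following (on the appliance it is a per-share setting; tank/data is again a made-up name). Note that turning dedup off only affects new writes - already deduplicated blocks stay referenced in the DDT until they are rewritten or deleted:

    # Stop deduplicating new writes to the share
    zfs set dedup=off tank/data

    # Compression stays on; it is cheap and works well for our data
    zfs set compression=on tank/data

    # Check what compression actually buys us
    zfs get compressratio tank/data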

The lesson here is that, in a replicated setup, the secondary system really needs to be specced to run dedup if you want the primary system to use it. Since the 7120 does not support flash read cache, this means you need at least a single 7320 node at the secondary site.