We started with, and still have, a lot of customers on our bare metal servers, each in an A/B hot spare configuration. There are some advantages for sure with bare metal, but virtualization adds so much more flexibility. That said, we recently had a bad experience with Proxmox running KVM machines where a bunch of processes went to max CPU (we suspect a ZFS I/O storm) and killed the whole machine. Or should I say, left it running just enough that HA wouldn't kick in, and when we failed over manually to other machines in the cluster it was a huge nightmare. Some of this was ultimately human error and bad configuration choices, but it caused us to take a bit of a hybrid approach.
We're rebuilding the whole cluster from the ground up, again with Proxmox but with a much different approach. Here are the bullet points:
- No ZFS. It used a lot of memory that it often failed to relinquish, and we suspect it was the root of our CPU issue. We're back to ext4 thin volumes on caching hardware RAID controllers and a series of three mirrors over six drives. Not the fastest, but the best data integrity in our opinion. FYI, we only ever use local storage.
- No more clustering. This was a tough call, as live migration saved us a lot of work many times, but clustering also caused a lot of heartache and never really worked in emergencies the way we anticipated.
- All VMs will have a hot spare running on a different host at all times, and they'll be striped across machines. We'll handle replication the old-fashioned way with rsync scripts, etc.
- We're also going to incorporate the new Proxmox Backup Server.
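To make the "striped hot spares plus rsync" idea a bit more concrete, here is a rough Python sketch, not our actual scripts. It assumes a simple round-robin placement (primary on one host, spare on the next one) and file-level rsync of a data directory. The names `place_vms` and `rsync_command`, the host names, and the paths are all made up for illustration.

```python
def place_vms(vms, hosts):
    """Stripe VMs across hosts: primary on host i, hot spare on host i + 1.

    Assumption: plain round-robin, so a VM's spare never shares a host
    with its primary as long as there are at least two hosts.
    """
    if len(hosts) < 2:
        raise ValueError("need at least two hosts so the spare lands elsewhere")
    n = len(hosts)
    return {
        vm: (hosts[i % n], hosts[(i + 1) % n])
        for i, vm in enumerate(vms)
    }


def rsync_command(src_dir, spare_host, dest_dir):
    """Build (but do not run) an rsync call that mirrors src_dir to the spare.

    -a        archive mode (recursive, keeps permissions, times, links)
    --delete  remove files on the spare that no longer exist on the primary
    The trailing slashes make rsync copy the directory contents, not the
    directory itself.
    """
    return [
        "rsync",
        "-a",
        "--delete",
        f"{src_dir.rstrip('/')}/",
        f"{spare_host}:{dest_dir.rstrip('/')}/",
    ]


if __name__ == "__main__":
    # Hypothetical VM and host names, purely for illustration.
    placement = place_vms(["web1", "db1", "mail1"], ["pve-a", "pve-b", "pve-c"])
    for vm, (primary, spare) in placement.items():
        print(vm, "primary:", primary, "spare:", spare)
        print(" ".join(rsync_command("/srv/data", spare, "/srv/data")))
```

The command is only built, not executed, so it can be reviewed or logged before being handed to cron or `subprocess.run`. Losing one host under this layout still leaves every VM with either its primary or its spare on a surviving machine.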
This sounds like a regression, and it probably is, but we can still do things like snapshot machines, back up machines, and migrate them manually. It sort of mimics our bare metal hot A/B approach, but in a virtualized environment.
As Scotty once said: "Sometimes the more you overdo the plumbing, the easier it is to stop up the drain."
Oh and by the way, Chevy...