I recently had a conversation with an associate about a Windows Failover Cluster that failed over whenever a Veeam backup job was running. The situation involves two virtual machines running Windows Server 2012, configured as nodes in a Failover Cluster. The cluster supports SQL Server 2012 high availability in a two-node AlwaysOn Availability Group. The DBA noticed that the network connection (NIC) was momentarily being dropped; the cluster saw the drop and failed over. This was repeatable and always happened while the backup job was running. What could be the issue?

I wasn’t sure whether Veeam was causing the problem, so I started doing some research to identify what was going on at the time. First, the Windows Event Logs revealed a network interruption, followed by the cluster chatter as the nodes worked out what to do about it, and then the failover itself. The SQL Server log in Management Studio showed the node change as well. All of this activity fell within the time period in which the job ran.
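To make that kind of correlation less of an eyeball exercise, here is a minimal Python sketch that filters a list of log entries down to those that fall inside a backup job's run window. The timestamps and event descriptions are made-up placeholders, not entries from the actual logs; in practice you would export the real times from the Event Viewer and the Veeam job history.

```python
from datetime import datetime

def events_in_window(events, start, end):
    """Return the (timestamp, description) pairs that fall inside [start, end]."""
    return [event for event in events if start <= event[0] <= end]

# Hypothetical backup job window (placeholder values).
job_start = datetime(2014, 1, 15, 22, 0)
job_end = datetime(2014, 1, 15, 23, 30)

# Hypothetical log entries (placeholder values, not real log output).
events = [
    (datetime(2014, 1, 15, 9, 12), "unrelated daytime event"),
    (datetime(2014, 1, 15, 23, 4), "network interruption"),
    (datetime(2014, 1, 15, 23, 5), "availability group failover"),
]

suspects = events_in_window(events, job_start, job_end)
for when, what in suspects:
    print(when.isoformat(), what)
```

If the interruption and the failover keep landing inside the job window on every run, the backup job is a reasonable suspect.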
The VMs are hosted on a VMware ESXi 5.1 platform. Veeam Backup & Replication relies on VMware to do much of the heavy lifting in preparing the VM disk files (.vmdk) for copying. Like other backup solutions, Veeam uses a scheduler to execute jobs and a database to keep track of the results of each job.
To back up a virtual machine, you create a backup job on the backup server. There are a number of guides available; this one from Veeam covers the fundamentals: http://www.veeam.com/kb1521. The backup server relies on another server, called a backup proxy, to move the backup files from the active datastore, where the virtual disks are located, to a repository datastore. The repository datastore is connected to the backup proxy.
Conceptually, Veeam calls VMware to perform the following steps:
- Veeam contacts vCenter to create a snapshot of the VM(s) that comprise the job. The snapshot is taken so that the running machine is in an operationally consistent state.
- Veeam instructs VMware to reconfigure the Veeam backup proxy to mount the datastore containing the VM disk files.
- The backup proxy transfers the files to a backup repository connected to it.
- When the backup is complete, the proxy is reconfigured to dismount the datastore.
- VMware then proceeds to delete the snapshot.
This deletion of the snapshot can take some time with an active virtual machine. During the last step, the final merge of 16MB or less, the VM is placed in a “stun” mode, effectively stopping the VM and any activity associated with it. When the merge is completed, the VM is then “unstunned”. This should last an instant and not have a noticeable impact on the running VM.
trigger. The following article, “Fine tuning heartbeats for aggressive/passive configurations” (http://blogs.msdn.com/b/clustering/archive/2012/11/21/10370765.aspx), provides some guidelines to help you weigh the trade-off between polling and timeout values. After making the change to a more passive configuration, the problem appears to have vanished. There is no absolute setting. You will have to experiment with the settings in your environment to find what’s appropriate.
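To get a feel for the trade-off, here is a rough Python sketch of the arithmetic. It assumes the settings in question are the cluster's SameSubnetDelay (milliseconds between heartbeats) and SameSubnetThreshold (missed heartbeats tolerated) properties, and it uses a deliberately simplified model in which a node is declared down once it has been silent for delay × threshold. The function names and the 6-second stun are hypothetical and only for illustration; 1000 ms and 5 are the Windows Server 2012 defaults, and the threshold of 10 is just one example of a more passive setting.

```python
def heartbeat_tolerance_ms(delay_ms, threshold):
    """Approximate time a node can stay silent before it is declared down."""
    return delay_ms * threshold

def stun_triggers_failover(stun_ms, delay_ms, threshold):
    """True if a VM stun of stun_ms would outlast the heartbeat tolerance.

    Simplified model: ignores network latency and where in the heartbeat
    interval the stun happens to begin.
    """
    return stun_ms >= heartbeat_tolerance_ms(delay_ms, threshold)

# Windows Server 2012 defaults: 1000 ms delay, threshold of 5.
default_tolerance = heartbeat_tolerance_ms(1000, 5)

# Illustrative "more passive" setting: same delay, threshold of 10.
relaxed_tolerance = heartbeat_tolerance_ms(1000, 10)

# Hypothetical 6-second stun during snapshot removal.
stun_ms = 6000
print("default tolerance (ms):", default_tolerance)
print("relaxed tolerance (ms):", relaxed_tolerance)
print("failover with defaults:", stun_triggers_failover(stun_ms, 1000, 5))
print("failover when relaxed: ", stun_triggers_failover(stun_ms, 1000, 10))
```

The flip side is that a larger tolerance also means the cluster takes longer to react to a genuine node failure, which is why the values need to be tested in your own environment.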
Hope this helps.
This posting is provided “as is” with no warranties, guarantees or rights whatsoever.