Hello all.
Sorry up front for the wall of text, but I have spent a few days on a post-mortem into why a very important customer's CommCell broke during a very minor update to 11.22.18, which left it crippled for about 4-5 hours. This is an advisory that the latest (April 2020) Maintenance Pack of 11.22 does not perform CommServe LiveSync Production Maintenance failovers correctly. The issue is in Hotfix #2340 in FR 11.22.17. For those who were briefly able to get 11.22.18, the issue is not addressed in that release.
I have confirmed that 11.22.0, .3, .9 and .13 are unaffected. I had a very lengthy case with support (#210425-83) one week ago, after I was seriously challenged trying to salvage the CommCell: both the Production and Passive Instance001 instances had been put into a disabled state, and the documented forced failback features do not work when both are disabled at the same time. Commvault Support were fantastic and very tenacious in solving what was a very difficult case. Since then, I have reproduced the root cause in my own lab twice.
What I have observed is that, once 11.22.17 is applied, the ‘CommServeLiveSyncMonitoring’ process, which polls every 5 minutes, has a logic bug that incorrectly shuts down and disables the Passive CommServe Instance001 without bringing up the Production CommServe.
Production CommServe CommServeLiveSyncMonitoring.log
CCommServeFailover::ConfirmActiveNode() - The current node name is [cs1_sql]
CCommServeFailover::ConfirmActiveNode() - The node [cs2_sql] is NOT active based on current failover time [1619843178] of this node.
Passive CommServe CommServeLiveSyncMonitoring.log
CMonitor::SendConfirmActiveNodeReq() - Node [cs1_sql] confirmed that THIS NODE [cs2_sql] IS NOT SUPPOSED TO BE ACTIVE !!. Check the CommServeLiveSyncMonitoring.log on the node [cs1_sql] for details
CMonitor::SendConfirmActiveNodeReq() - Refreshing the failover configuration on this node with the one received from node [cs1_sql]
CPassiveOperations::DoWork()() - Performing operations to make the node passive
CCommonOperations::DisableAllActivities() - Disabling all activities
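If you want to check your own nodes for the same signature, here is a rough Python sketch that scans a copy of CommServeLiveSyncMonitoring.log for the messages quoted above. It is not a Commvault tool. The function names are my own, the message fragments are copied from the excerpts in this post and may be worded differently in other releases, and treating the failover time as Unix epoch seconds is an assumption on my part.

```python
import re
from datetime import datetime, timezone

# Message fragments copied from the log excerpts in this post.
# Assumption: other releases may word these lines differently.
SIGNATURES = (
    "IS NOT SUPPOSED TO BE ACTIVE",
    "Performing operations to make the node passive",
    "Disabling all activities",
)

# Captures the number after "current failover time"; any non-digit
# prefix such as an opening bracket is skipped.
FAILOVER_TIME = re.compile(r"current failover time\s+\D*(\d+)")


def scan_lines(lines):
    """Return (line_number, text) pairs for lines containing a known signature."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if any(signature in line for signature in SIGNATURES):
            hits.append((number, line.strip()))
    return hits


def failover_times(lines):
    """Return failover times as UTC datetimes, assuming Unix epoch seconds."""
    times = []
    for line in lines:
        match = FAILOVER_TIME.search(line)
        if match:
            seconds = int(match.group(1))
            times.append(datetime.fromtimestamp(seconds, tz=timezone.utc))
    return times


def scan_file(path):
    """Read a log file (hypothetical path supplied by the caller) and scan it."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        lines = handle.readlines()
    return scan_lines(lines), failover_times(lines)
```

Call scan_file() with the path to a copy of the log from each node. A hit on the "NOT SUPPOSED TO BE ACTIVE" line followed by "Disabling all activities" is the pattern I saw; a hit alone does not prove you have the same root cause.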
It should be noted that this does not affect a Production Failover to the Passive CommServe.
If you must update to Feature Release 11.22 and you have a highly available CommServe, I strongly recommend breaking CommServe LiveSync before patching.