Commserve LiveSync issue with 'Production Maintenance Failover' in 11.22.17 #2340

  • 2 May 2021

Hello all.

Sorry up front for the wall of text, but I have spent a few days on a post-mortem into why a very important customer's CommCell broke when a very minor update to 11.22.18 took place, leaving the CommCell crippled for about 4-5 hours.  This is an advisory that the latest (April 2021) Maintenance Pack of 11.22 does not perform CommServe LiveSync Production Maintenance failovers correctly.  The issue is in Hotfix #2340 in FR 11.22.17.  For those who were briefly able to get 11.22.18, the issue was not addressed in that release.

I have confirmed that 11.22.0, .3, .9, and .13 are unaffected.  I had a very lengthy case with support (#210425-83) one week ago, after I was seriously challenged trying to salvage the CommCell: both the Production and Passive Instance001 instances had been put into a disabled state, and the documented forced failback features do not work when both are disabled at the same time.  Commvault Support were fantastic and very tenacious in solving what was a very difficult case.  Since then, I have reproduced the root cause of the issue in my own lab twice.
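
For anyone scripting a pre-patch check, the version findings above can be captured as a small lookup. This is only a sketch in Python: the function name and the three labels are made up for illustration, 11.22.18 is listed as affected because the post says the issue was not addressed there, and any release not mentioned in the post is reported as unknown rather than guessed.

```python
# Releases named in the post. Anything else is deliberately left unclassified.
KNOWN_UNAFFECTED = {"11.22.0", "11.22.3", "11.22.9", "11.22.13"}
KNOWN_AFFECTED = {"11.22.17", "11.22.18"}


def classify_release(version):
    """Return 'affected', 'unaffected', or 'unknown' for a release string.

    Hypothetical helper: it only encodes what the post reports and does
    not query Commvault or any installed software.
    """
    version = version.strip()
    if version in KNOWN_AFFECTED:
        return "affected"
    if version in KNOWN_UNAFFECTED:
        return "unaffected"
    return "unknown"


if __name__ == "__main__":
    for release in ("11.22.13", "11.22.17", "11.22.5"):
        print(release, classify_release(release))
```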

What I have observed is that when 11.22.17 is applied, the ‘CommServeLiveSyncMonitoring’ process, which polls every 5 minutes, has a logic bug that incorrectly shuts down and disables the Passive CommServe Instance001 without bringing up the Production CommServe.

Production CommServe CommServeLiveSyncMonitoring.log

CCommServeFailover::ConfirmActiveNode() - The current node name is [cs1_sql]

CCommServeFailover::ConfirmActiveNode() - The node [cs2_sql] is NOT active based on current failover time [1619843178] of this node.

Passive CommServe CommServeLiveSyncMonitoring.log

CMonitor::SendConfirmActiveNodeReq() - Node [cs1_sql] confirmed that THIS NODE [cs2_sql] IS NOT SUPPOSED TO BE ACTIVE !!. Check the CommServeLiveSyncMonitoring.log on the node [cs1_sql] for details

CMonitor::SendConfirmActiveNodeReq() - Refreshing the failover configuration on this node with the one received from node [cs1_sql]

CPassiveOperations::DoWork()() - Performing operations to make the node passive

CCommonOperations::DisableAllActivities() - Disabling all activities
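
A minimal sketch, in Python, of how one might scan a CommServeLiveSyncMonitoring.log for the lines quoted above. The patterns are copied from these excerpts only and are not an exhaustive or official list; the function names and whatever log path is passed to scan_log are hypothetical.

```python
import re

# Patterns taken from the log excerpts in this post (illustrative only).
SIGNATURES = [
    re.compile(r"ConfirmActiveNode\(\) - The node \[[^\]]+\] is NOT active"),
    re.compile(r"THIS NODE \[[^\]]+\] IS NOT SUPPOSED TO BE ACTIVE"),
    re.compile(r"Performing operations to make the node passive"),
    re.compile(r"DisableAllActivities\(\) - Disabling all activities"),
]


def find_signature_lines(lines):
    """Return (line_number, text) pairs for lines matching any pattern."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if any(pattern.search(line) for pattern in SIGNATURES):
            hits.append((number, line.strip()))
    return hits


def scan_log(path):
    """Scan a log file on disk; the path is supplied by the caller."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return find_signature_lines(handle)
```

A match only shows that a node logged the same sequence as in the excerpts; it does not by itself confirm the bug.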

It should be noted that this does not affect a Production Failover to the Passive CommServe.

 

If you must update to Feature Release 11.22 and you have a highly available CommServe, I strongly recommend breaking CommServe LiveSync before patching.

 


1 reply


Hi @Anthony.Hodges 

Thank you for sharing these detailed troubleshooting steps. I’m sorry you have had a difficult experience here.

I will follow up and check the support case so this gets fed back internally to avoid any recurrences elsewhere.

Thanks,

Stuart
