Solved

CommServe LiveSync issue with 'Production Maintenance Failover' in 11.22.17 #2340


Userlevel 5
Badge +10

Hello all.

Sorry up front for the wall of text, but I have spent a few days on a post-mortem into why a very important customer's CommCell broke during a very minor update to 11.22.18, which left the CommCell crippled for about 4-5 hours.  This is an advisory that the latest (April 2020) Maintenance Pack of 11.22 does not perform CommServe LiveSync Production Maintenance failovers correctly.  The issue is in Hotfix #2340 in FR 11.22.17.  For those who were briefly able to get 11.22.18, the issue was not addressed in that release.

I have confirmed that 11.22.0, .3, .9 and .13 are unaffected.  I had a very lengthy case with support (#210425-83) one week ago, after I was seriously challenged trying to salvage the CommCell: both the Production and Passive Instance001 instances were put into a disabled state, and the documented forced failback features do not work when both are disabled at the same time.  Commvault Support were fantastic and very tenacious in solving what was a very difficult case.  Since then, I have reproduced the root cause of the issue in my own lab twice.

What I have observed is that when 11.22.17 is applied, the ‘CommServeLiveSyncMonitoring’ process, which polls every 5 minutes, has a logic bug that incorrectly shuts down and disables the Passive CommServe Instance001 without bringing up the Production CommServe.

Production CommServe CommServeLiveSyncMonitoring.log

CCommServeFailover::ConfirmActiveNode() - The current node name is [cs1_sql]

CCommServeFailover::ConfirmActiveNode() - The node [cs2_sql] is NOT active based on current failover time [1619843178] of this node.

Passive CommServe CommServeLiveSyncMonitoring.log

CMonitor::SendConfirmActiveNodeReq() - Node [cs1_sql] confirmed that THIS NODE [cs2_sql] IS NOT SUPPOSED TO BE ACTIVE !!. Check the CommServeLiveSyncMonitoring.log on the node [cs1_sql] for details

CMonitor::SendConfirmActiveNodeReq() - Refreshing the failover configuration on this node with the one received from node [cs1_sql]

CPassiveOperations::DoWork()() - Performing operations to make the node passive

CCommonOperations::DisableAllActivities() - Disabling all activities
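
For anyone who wants to check their own CommServeLiveSyncMonitoring.log for the same signature, here is a minimal sketch in Python. The patterns are copied from the excerpts above and may be worded differently in other releases. The function name `scan_log` and the finding labels are made up for illustration, and treating the failover time as Unix epoch seconds is an assumption on my part.

```python
import re
from datetime import datetime, timezone

# Patterns taken from the log excerpts in this post; wording may vary by release.
NOT_ACTIVE = re.compile(
    r"ConfirmActiveNode\(\) - The node \[(?P<node>[^\]]+)\] is NOT active "
    r"based on current failover time \[(?P<epoch>\d+)\]"
)
MADE_PASSIVE = re.compile(r"Performing operations to make the node passive")
DISABLED = re.compile(r"DisableAllActivities\(\) - Disabling all activities")


def scan_log(lines):
    """Return (kind, detail) tuples for lines matching the failure signature.

    `scan_log` and the kind labels are hypothetical helpers, not Commvault tooling.
    """
    findings = []
    for line in lines:
        match = NOT_ACTIVE.search(line)
        if match:
            # Assumption: the failover time is Unix epoch seconds.
            when = datetime.fromtimestamp(int(match.group("epoch")), tz=timezone.utc)
            findings.append(("not_active", f"{match.group('node')} @ {when.isoformat()}"))
        elif MADE_PASSIVE.search(line):
            findings.append(("made_passive", line.strip()))
        elif DISABLED.search(line):
            findings.append(("activities_disabled", line.strip()))
    return findings


if __name__ == "__main__":
    sample = [
        "CCommServeFailover::ConfirmActiveNode() - The node [cs2_sql] is NOT active "
        "based on current failover time [1619843178] of this node.",
        "CPassiveOperations::DoWork()() - Performing operations to make the node passive",
        "CCommonOperations::DisableAllActivities() - Disabling all activities",
    ]
    for kind, detail in scan_log(sample):
        print(kind, "->", detail)
```

To run it against a real log, pass an open file handle instead of the sample list, since the function only iterates over lines.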

It should be noted that this does not affect a Production Failover to the Passive CommServe.

 

If you must update to Feature Release 11.22 and you have a highly available CommServe, I strongly recommend breaking CommServe LiveSync before patching.

 


Best answer by Mike Struening RETIRED 16 June 2021, 15:38


7 replies

Userlevel 7
Badge +15

Hi @Anthony.Hodges 

I’m afraid I don’t have any update for you, but I have been chatting with the engineer assigned to the case, who you have also been speaking to.

Rather than chase this informally, we have decided to formally escalate the case to Development for better tracking and traction.

Hopefully we will have some news for you soon once we engage these internal processes and raise visibility of the issue.

Thanks,

Stuart

Userlevel 5
Badge +10

@Stuart Painter After a lengthy follow-up with support I have been informed this was one of the unlisted fixes in Maintenance Release 11.22.25.
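
Pulling the release numbers from this thread together, a rough pre-patch check could look like the Python sketch below. The function name `livesync_failover_status` is hypothetical. Treating every 11.22 release from .25 onward as fixed is an assumption (that later releases keep the fix), and any release not mentioned in this thread is reported as unknown rather than guessed.

```python
# Release numbers as reported in this thread (Feature Release 11.22 only).
UNAFFECTED = {0, 3, 9, 13}   # confirmed unaffected
AFFECTED = {17, 18}          # maintenance failover bug observed
FIXED_FROM = 25              # fix reported in 11.22.25; later releases assumed to keep it


def livesync_failover_status(version: str) -> str:
    """Classify an 11.22.x version string; hypothetical helper for illustration."""
    parts = version.strip().split(".")
    if len(parts) != 3 or parts[:2] != ["11", "22"] or not parts[2].isdigit():
        return "unknown"
    release = int(parts[2])
    if release in AFFECTED:
        return "affected"
    if release in UNAFFECTED:
        return "unaffected"
    if release >= FIXED_FROM:
        return "fixed"
    # Releases between the known points were not tested in this thread.
    return "unknown"


if __name__ == "__main__":
    for v in ("11.22.13", "11.22.17", "11.22.20", "11.22.25"):
        print(v, livesync_failover_status(v))
```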

Userlevel 7
Badge +23

@Anthony.Hodges , if you go to the reply with the most helpful solution, you can click Best Answer. This will then show at the top for future visitors. 
 

I can also do it for you, though I don’t want to rob you of the honor!

 

edit: Yup, this was a Conversation, not a Q&A. Only the latter can be marked as solved. Fixed it and will mark solved!

Userlevel 7
Badge +15

Hi @Anthony.Hodges 

Thank you for sharing these detailed troubleshooting steps. I’m sorry you have had a difficult experience here.

I will follow up and check the support case so this gets fed back internally to avoid any recurrences elsewhere.

Thanks,

Stuart

Userlevel 5
Badge +10

@Stuart Painter Do you know if there has been any update on this?  My case “210504-63” has no answer yet.

Userlevel 7
Badge +23

Sharing the Hotfix numbers as well:

The Development team confirmed that Hotfixes 2770 and 2771 are part of FR22.25, which resolves the issue where a maintenance failover can cause the upgraded node to take up the active instance.

Let me know if you’re good to mark this as solved :nerd:

Userlevel 5
Badge +10

Hi @Mike Struening, I cannot locate the mark as solved button.  We’re good to mark it as solved.
