Solved

Index Restore Filling up Index Disk / Cleanup Temp

  • 26 August 2021
  • 20 replies
  • 248 views

Hey,

This is not the first time this has happened to me, and every time it is an issue: it fills up the disk and won't release the space until (at least) I either reboot the media agent or add an additional setting key (CLEANUP_TEMP_DB_DAYS = 0). In this particular case, I have done both and it still won't release the space.

Long story short, any time I request a browse from a job that is available on disk but is fairly old, the media agent kicks off an Index Restore, since those indexes are no longer on the media agent. The index restore kicks in, and that's when the operation starts to fill up the disk (always with NDMP (V2) clients, which are massive).

So I then cancel my browse request, but by then it is too late: the background tasks continue until they fill up the disk.

When I look at the tree size, I can see the client GUID, which gives a little more info and confirms which client caused the problem.
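For anyone without a tree-size tool on the media agent, here is a minimal Python sketch that does the same kind of check: it totals the size of each immediate subfolder (for example, the per-GUID folders) and lists the largest first. The function names are my own, and the folder you point it at is an assumption; use whatever your index cache path actually is.

```python
import os


def dir_size(path):
    """Total size in bytes of all files under path (recursive)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                total += os.path.getsize(os.path.join(root, name))
            except OSError:
                # File vanished or is unreadable while scanning; skip it.
                pass
    return total


def largest_subdirs(parent, top=10):
    """Return (folder name, bytes) for immediate subfolders, largest first."""
    sizes = []
    with os.scandir(parent) as entries:
        for entry in entries:
            if entry.is_dir(follow_symlinks=False):
                sizes.append((entry.name, dir_size(entry.path)))
    sizes.sort(key=lambda item: item[1], reverse=True)
    return sizes[:top]
```

Calling `largest_subdirs()` on the index cache folder returns the biggest folders, which is usually enough to spot the GUID that is eating the disk.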

 

So I currently have a media agent that is out of space because of a Browse that was initiated days ago and didn't even let me browse, since it was still restoring the index. I closed that browse operation, yet the Index Restore kept going. I used the key mentioned above and restarted the services, and still nothing. I tried the load balance workflow, which was not helpful.

So I'm looking for suggestions, experiences with the same issue, and whatever is helpful.

Best answer by Mike Struening 7 October 2021, 19:42

20 replies

Userlevel 7
Badge +23

@dude , I’ll get some of our top people on this.  It’s the least we could do for someone who is always helping everyone else!!

In case it becomes important, can you share the CS and MA versions?

Appreciate that @Mike Struening. Running V11.20.64 (Latest hotfix pack available).

Userlevel 1
Badge +1

Hi dude,

Full index caches are always tricky, as there is no way to reduce the size of an index (besides backing up fewer files). Since you’re on V2 you’re already taking advantage of space optimizations.

One option that comes to mind: if you can identify the 600E client, you could force backing it up using a different MA if you have one available. This would allow you to free up 300 GB or so from this index cache. Of course we don’t know exactly how big the index is for B9B0, as it hasn’t finished restoring, so it’s possible that the 300 GB would be eaten too.

I will have to say the best option here is to increase the space available for the index cache, possibly to 2.5-3TB.

Thanks for the reply @Jon Kuzmick. I'm not quite sure I follow. To my understanding, the index restore was only kicked off because of an attempt to restore a file from an NDMP client. I did not actually restore anything (in fact I cancelled the browse attempt), yet the index restore kept going until it filled up the disk. So this is not a backup operation that is filling up the disk; I had 900GB free before the restore attempt. I find it hard to accept that, within a few hours, a cancelled restore operation filled almost 1TB of my index cache and there is no way to claim that space back. Again, I've had this issue in the past, as I mentioned, and the key helped then, but this time it didn't.

I would understand what you are saying IF the index had been filled by backup operations with actual valid files from backup jobs. In this case, it is a browse restore that did not happen, yet the software keeps building that index.

Userlevel 1
Badge +1

Hi dude,

I thought you were asking for a method of getting the index to “fit”, my mistake. The index restore will continue even if the browse is cancelled. The index clean up process happens every 24 hours. There may be a key (I don’t think it’s the one you listed) to clean it up faster, I will need to look it up. I just did a quick search but came up short. I will look tomorrow morning, but by then the cleanup may have triggered. 

It may be best to create a support ticket for us to investigate this. You could even request me in the ticket directly if you’d like.

Userlevel 1
Badge +1

Hi Dude,

I looked through our database of keys and there does not appear to be a way to reduce the period of the index cache cleanup. As of this morning, did you see any space clear up?

Hi @Jon Kuzmick, thanks for checking on that. My index cache is now completely full. I looked at some old cases, which is where I found the key, so it was given to me in the past by support:

  • Name
    • CLEANUP_TEMP_DB_DAYS
  • Category
    • Indexing
  • Type
    • Integer
  • Value
    • 0 (number of days to be cleaned - I believe the default is 7)

Unless this key is no longer valid in the current CV version (v11.20.64), perhaps discontinued in the code. Is this something you can validate with Dev?

Userlevel 1
Badge +1

@dude,

Key is valid, but description leads me to believe it may only work on V1 indexes.

Do you have a support contract? Could you create a ticket? It’s fine if you can’t, but would be nice to have a database/logs to look at. 

 

Userlevel 7
Badge +23

Hey @dude , following up to see if you ended up opening an incident for this issue (and if you did, what the case number is).

Thanks!

I'm working through this and will post the solution once I get to it. Thanks.

Userlevel 7
Badge +15

V2 should be keeping the full index on disk for the entire dataset. It should not be doing index restores unless it is damaged. The initial implementation always kept the index for jobs still in retention - I do not believe that changed, but I have been out of the game for a bit so it is possible.

There was a proposed change (I think it happened already) where the index was going to be subclient based for NAS to make it more granular, and allow it to load balance better across index space on multiple MAs (which became a problem for HyperScale since one large NAS client could exhaust the space on 1 node in the grid). Typically index databases are built at the backupset level, but with NAS that can grow too big.

I had a quick look at your commcell health / index report and see three possibly broken indexes for NAS clients. I'm not sure if it's the same clients you are looking at, but you should be able to access the report here. This could be why the index restores are kicking off, as the index on the MA is possibly damaged.

I did try the subclient-level index for the client in question: the index was moved to the subclient level, but the issue persists.

Userlevel 7
Badge +23

@dude , let me know if you end up opening a case for this one.  Very curious as to the solution, once discovered!

Sent in a private message.

Userlevel 7
Badge +23

You’re the man, my friend!!

Userlevel 7
Badge +23

Sharing update (and likely solution) here:

We found the subclient index for backupset GUID “<GUID>” failed due to low space on the Media Agent index drive.

1.    The backupset GUID was consuming 1.3TB of space.
-    With Dev assistance we deleted the backupset GUID “<GUID>”.
-    In this case any index restore for the subclient index will reference the last cycle only.

2.    Confirm subclient browse is now working.

edit: removed script details as dev requires escalation to ensure no adverse effects will occur.

Yes. The actual problem is a lot deeper than that, and I actually think it should be reviewed by Dev. Basically, when a Browse/Restore is kicked off (in my case on an NDMP client), since my client has a huge index, the browse/restore has to reconstruct the index for that client, which was well over 1.4TB (1.4TB being the free space on my index cache), so there was not enough room. This caused the media agent to fill up the entire index cache disk.

With the index disk out of space, the Browse/Restore never completed or failed. As a result, browse operations and even some other backup jobs for that MA were failing.

The folder for that particular index agent had to be deleted (after pausing the MA services first), and restarting the services on the MA was necessary as well.

In addition to that, considering the size of my backups for that NDMP client, I was then recommended to run the workflow “Enable Subclient Index”, which basically converts the indexing from the backupset level to the subclient level, so that browse/restore requests only rebuild the index (if necessary) for the particular subclient.

The workflow also requires rebuilding a certain number of cycles, which led me to fill up the disk once again, and the workflow never completed. So I drilled down again to the folder that was filling up, paused all services on the MA, deleted the folder again, ran the Qcommand mentioned by @Mike Struening and re-ran the workflow so it could complete the remaining subclients.

I hope this can help other users, but I also hope Commvault improves the way Index Reconstruction works, because this is a complicated way to deal with a problem that could have been solved by the software recognizing there is no more disk space, aborting the task, erroring out the browse/restore and deleting the temp recon DB it created to regain the space used. Perhaps with an error message linking to an article.
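To make the suggestion concrete, here is a rough Python sketch of the kind of pre-flight check I have in mind. This is only an illustration, not Commvault code: the function names, the reserve value and the idea that the index size can be estimated up front are all my own assumptions.

```python
import shutil


def can_start_restore(estimated_bytes, free_bytes, reserve_bytes=0):
    """True only if the restore fits while leaving reserve_bytes untouched."""
    return estimated_bytes + reserve_bytes <= free_bytes


def check_index_restore(cache_path, estimated_bytes, reserve_bytes=50 * 1024**3):
    """Raise before starting if the index cache disk cannot hold the restore.

    cache_path, estimated_bytes and the 50 GiB default reserve are
    hypothetical inputs chosen for this example.
    """
    free = shutil.disk_usage(cache_path).free
    if not can_start_restore(estimated_bytes, free, reserve_bytes):
        raise RuntimeError(
            f"Index restore needs about {estimated_bytes} bytes "
            f"but only {free} bytes are free on {cache_path}"
        )
    return free
```

In my case the index was well over 1.4TB with only 1.4TB free, so a check like this would have failed the browse right away instead of filling the disk.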

I appreciate the support here.

Userlevel 7
Badge +23

Agreed.  I can see that the dev team has been involved with the case owner, so I’ll keep tracking this one.

Are you not getting a change fix?  I can follow up on this if you’re not.

So far I have not had any additional issues. Like I said, the behavior during the Browse/Restore operation was not expected. One would think that it would eventually fail due to insufficient space and the software would smartly cancel the rebuild operation, reclaiming the space used by the initial B/R request. That is not what happened.

The workaround worked but involved manual steps and qcommands, as you referenced. My suggestion was to have this behavior reviewed, and perhaps a code fix to avoid such issues in the future. For now, I'm good.

Userlevel 7
Badge +23

Let me see what I can do.
