Solved

Best way to back up a directory with (only) image files. Possibly TBs of image files

  • 22 January 2024
  • 1 reply
  • 99 views

I have a request to back up a new file share that holds image files (not sure of the type or format). Currently the size is about 500 GB (number of files is unknown at this time), but there is some talk of it growing to possibly > 10 TB (or ultimately “many new files being constantly moved in and out of this folder at regular times”, as opposed to a large total folder size). We are backing up a lot more than this in “unique data” across our systems, so 10-20 TB is not a huge amount of “new unique data” to back up, but it’s enough to make me reconsider just adding it without some thought.

Currently all of my backup storage policies use DDBs (we have a few). I did not set up any of our storage policies; I inherited it all.

If I used our “standard” storage policy for file shares (they all use DDBs), would > 10 TB of image files blow up my DDBs or cause unintended consequences for them (as the data will likely not dedupe well)? Or do DDBs only contain references/indexes to the data on storage, meaning the type of file being backed up is inconsequential to the DDB and only the volume of data (files/folders/objects to track) causes it to grow? I know this will use up storage space, and likely little or no compression will occur, but if the image files are all “similar”, maybe we would get some block-level dedupe on them.

Anyway, I’m wondering whether TBs of likely unique image files call for a special setup, or whether I can just hook this up to our standard DDB-enabled storage policies and the DDB will handle it without ballooning in size or losing responsiveness. Or is there a rule of thumb to precalculate something like this and the impact it could have on the DDB before we back it up? (I know storage space and licensing will be impacted, and can simply estimate “if we back up 10 TB, it can be 10 TB of storage or licensing needed”.)

1 reply

Damian Andre
  • Vaulter
  • 1235 replies
  • Answer
  • January 23, 2024

I think you are right that there will be little savings, if any, provided the images are truly unique and the user accurately described how the data will be used. If there are duplicates then obviously there will be some savings, but perhaps not enough to justify using up fast disk storage to host a DDB. Keep in mind the savings between full backups as well - even if the data itself is unique, between backup jobs (versions) the data sets could be almost identical.
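One rough way to check for duplicates before committing to a design is to hash a sample of the share in fixed-size blocks and see how many blocks repeat. The following is a minimal Python sketch of that idea, not how Commvault itself computes signatures; the function name `estimate_dedup_savings` and the 128 KiB block size are assumptions for illustration, and a real deduplication engine may align or compress blocks differently, so treat the result as a ballpark only.

```python
import hashlib
import os

# Assumed block size for this sketch; adjust it to match your own dedupe setting.
BLOCK_SIZE = 128 * 1024


def estimate_dedup_savings(root, block_size=BLOCK_SIZE):
    """Hash every fixed-size block under `root` and compare unique vs total bytes."""
    seen = set()          # digests of blocks already encountered
    total_bytes = 0
    unique_bytes = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, "rb") as handle:
                    while True:
                        block = handle.read(block_size)
                        if not block:
                            break
                        total_bytes += len(block)
                        digest = hashlib.sha256(block).digest()
                        if digest not in seen:
                            seen.add(digest)
                            unique_bytes += len(block)
            except OSError:
                continue  # skip files that cannot be opened or read
    savings = 0.0 if total_bytes == 0 else 1 - unique_bytes / total_bytes
    return {
        "total_bytes": total_bytes,
        "unique_bytes": unique_bytes,
        "unique_blocks": len(seen),
        "savings": savings,
    }
```

Calling `estimate_dedup_savings("/path/to/sample")` on a representative subfolder returns a `savings` fraction between 0 and 1; a value near 0 suggests the images really are unique at the block level.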

That being said, 500 GB right now is not a big deal, so maybe you can run some tests over a few weeks, see what type of ratio you are getting between full backups, and monitor the ‘real world’ profile of this data. A lot of this hinges on your retention as well. If you have shorter retention, it’s probably not worth deduplicating. If longer, then it may absolutely be worth it to get savings between full jobs.
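To make the retention point concrete, here is a small worked estimate in Python. It is a deliberately simplified model and only a sketch: it assumes the data set stays the same size, that a fixed fraction of it (`change_rate`, a made-up parameter) is new between consecutive fulls, and it ignores incrementals, compression, and any duplicates inside a single backup. The function name `retained_storage_tb` is hypothetical.

```python
def retained_storage_tb(frontend_tb, retained_fulls, change_rate):
    """Return (without_dedupe_tb, with_dedupe_tb) for the retained full backups.

    frontend_tb    -- size of the source data in TB
    retained_fulls -- number of full backups kept by retention (>= 1)
    change_rate    -- fraction of the data that is new between two fulls (0..1)
    """
    if retained_fulls < 1:
        raise ValueError("retained_fulls must be at least 1")
    if not 0.0 <= change_rate <= 1.0:
        raise ValueError("change_rate must be between 0 and 1")
    # Without dedupe, every retained full is stored in its entirety.
    without_dedupe = frontend_tb * retained_fulls
    # With dedupe, the first full is stored once; later fulls add only changed data.
    with_dedupe = frontend_tb + frontend_tb * change_rate * (retained_fulls - 1)
    return without_dedupe, with_dedupe


# Example: 10 TB of images, 4 retained fulls, 10% new data between fulls.
print(retained_storage_tb(10, 4, 0.1))
# Same data with heavy churn (90% replaced between fulls).
print(retained_storage_tb(10, 4, 0.9))
```

With a single retained full the two numbers are equal, which is the "shorter retention" case; the gap only opens up as more fulls are retained and the change rate stays low. If files really are constantly moved in and out, the change rate is high and the gap narrows again.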

Newer versions of Commvault have a feature called horizontal DDB scaling, which means if performance starts to suffer, it will start a new ‘mini-DDB’, so to speak. So it balances performance against space savings really well - I am not so worried about bloating the DDB, especially since 10 TB is very small in the scheme of things.
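For a back-of-the-envelope feel for how a dedupe index grows, the number of signatures it has to track is roughly the amount of unique data divided by the block size, regardless of file type. The Python sketch below just does that division; the 128 KiB block size, the `unique_fraction` parameter, and the function name are assumptions for illustration, and it does not try to estimate the on-disk DDB size because the per-record overhead is not stated in this thread.

```python
def estimate_signature_count(data_bytes, block_size=128 * 1024, unique_fraction=1.0):
    """Estimate how many unique block signatures a dedupe index would track."""
    if data_bytes < 0 or block_size <= 0:
        raise ValueError("data_bytes must be >= 0 and block_size must be > 0")
    if not 0.0 <= unique_fraction <= 1.0:
        raise ValueError("unique_fraction must be between 0 and 1")
    # Ceiling division: a partial trailing block still needs its own signature.
    total_blocks = -(-data_bytes // block_size)
    return round(total_blocks * unique_fraction)


TIB = 1024 ** 4
# Worst case for 10 TiB: every block is unique.
print(estimate_signature_count(10 * TIB))        # prints 83886080
# Same data if half of the blocks turn out to be duplicates.
print(estimate_signature_count(10 * TIB, unique_fraction=0.5))  # prints 41943040
```

The point of the sketch is only that the count scales with unique data volume and block size, so poorly deduplicating images cost index entries in proportion to their size rather than because they are images.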

The number of files and directories may start to become a bottleneck over time, at which point you may decide to use block-level file system backups instead, where blocks on the disks are backed up rather than opening/reading/closing individual files on the file system. The upside is faster backups and whole-volume restores; the downside is slower individual restores or restores of only a few items. So think about how the data is typically going to be recovered if needed.
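Since the file count is currently unknown, it may be worth profiling the share before deciding between file-level and block-level backups. This is a minimal Python sketch that walks a directory tree and reports file count, directory count, total size, and average file size; the function name `profile_tree` is hypothetical and the walk itself can take a while on a very large share.

```python
import os


def profile_tree(root):
    """Count files and directories under `root` and sum the file sizes."""
    file_count = 0
    dir_count = 0
    total_bytes = 0
    for dirpath, dirnames, filenames in os.walk(root):
        dir_count += len(dirnames)
        for name in filenames:
            try:
                total_bytes += os.path.getsize(os.path.join(dirpath, name))
            except OSError:
                continue  # file vanished or is unreadable; leave it out
            file_count += 1
    average = total_bytes / file_count if file_count else 0.0
    return {
        "files": file_count,
        "directories": dir_count,
        "total_bytes": total_bytes,
        "average_file_bytes": average,
    }
```

Running `profile_tree` on the share every week or so gives a trend: millions of small files point toward the per-file overhead described above, while a modest number of large images is less of a concern for file-level backups.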

Licensing is based on front-end usage - i.e., if you back up 10 TB, with or without dedupe, it’s still 10 TB of data to process, and that is what you pay for with capacity-based licensing - not how much is actually stored on disk at the end of the dedupe process. There are other license models, though, which could be better suited.

 

Anyway, a few things to think about, but I hope that helps provide some guidance.

 

