Question

DDB horizontal scaling - how do hash references work?

  • 24 July 2024
  • 8 replies
  • 48 views

Hello,

Can someone explain how DDBs are queried when new DDBs for a specific purpose are created?

Looking at the figures below, the VMs are distributed across both DDBs, while only the initially used DDB has a significant number of primary hash references.

I know that a new DDB is created and subclients are reassociated if a DDB reaches its limits (Q&I time and/or number of hashes), but I always assumed that, after the reassociation, some of the data is again identified as unique.

The picture below leads me to assume that old references are still honored and only new hashes are added to the currently associated DDB. If this were true, a new horizontal DDB would not automatically require additional space in the backend.

 

 

thanks
Klaus

8 replies

Hello @johanningk 

 

Thanks for the great question and details. 
When a horizontally scaled DDB reaches its thresholds, it spawns a new DDB partition to work from. The effect is that NEW subclients, whether created on an existing client or a new client, will use this new DDB, but the old subclients will keep using the old DDB.

If we were to move the old subclients onto the new DDB, it would just shift the problem while also increasing storage usage, and we don’t want that.

So, depending on what caused the DDB to scale out and when it occurred, you can find a large number of clients on one DDB and a small number on the other. The original DDB will also continue to grow as the old clients change and continue to be protected, but the idea of scaling out is to limit that growth and minimise the impact on new clients.
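The behaviour described above can be illustrated with a small toy model. This is only a sketch under stated assumptions: `ToyPool`, `max_hashes_per_ddb`, and the single hash-count threshold are hypothetical names and simplifications made for illustration, not Commvault code or configuration.

```python
from dataclasses import dataclass, field


@dataclass
class ToyPool:
    """Toy model of a storage pool whose DDBs scale out horizontally."""
    max_hashes_per_ddb: int                               # hypothetical threshold
    ddb_sizes: list = field(default_factory=lambda: [0])  # hash count per DDB
    assignment: dict = field(default_factory=dict)        # subclient -> DDB index

    def ddb_for(self, subclient: str) -> int:
        # A subclient that already has a DDB keeps it (no reassociation).
        if subclient not in self.assignment:
            # New subclients always land on the newest DDB.
            self.assignment[subclient] = len(self.ddb_sizes) - 1
        return self.assignment[subclient]

    def backup(self, subclient: str, new_hashes: int) -> int:
        ddb = self.ddb_for(subclient)
        self.ddb_sizes[ddb] += new_hashes
        # When the newest DDB crosses the threshold, spawn another one.
        if self.ddb_sizes[-1] >= self.max_hashes_per_ddb:
            self.ddb_sizes.append(0)
        return ddb


pool = ToyPool(max_hashes_per_ddb=100)
pool.backup("vm-old", 100)            # fills DDB 0, which triggers DDB 1
old_ddb = pool.backup("vm-old", 10)   # existing subclient stays on DDB 0
new_ddb = pool.backup("vm-new", 10)   # new subclient uses DDB 1
print(old_ddb, new_ddb, pool.ddb_sizes)
```

In this model the original DDB keeps growing past the threshold (110 hashes here) because its subclients never move, which matches the point that scaling out limits growth rather than stopping it.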

Kind regards
Albert Williams

Hi @Albert Williams ,

Thanks for the reply, but it does not really answer my question.

  • Are existing hashes on the original / other DDBs (of a pool) taken into account even if a subclient / client has been assigned / moved to another DDB within the pool?

I’d like to know whether the introduction of a new DDB (within the same pool) will lead to increased capacity requirement/consumption in the backend (due to duplicate hashes).

I understand that subclients are reassociated once a DDB reaches its limits. As long as this is only done to limit the size / growth of a DDB (to keep it responding quickly), that’s perfect.

The picture in my original question leads me to assume that all existing hashes in any DDB (of a specific type) within the pool are taken into account to determine unique hashes/blocks, regardless of which DDB the subclient is associated with.

Thanks, 
Klaus

Hello @johanningk 

 

The wording you use makes me think you may have it backwards.


You said the following: 

“I understand, that subclients are reassociated, once a DDB reaches its limits. “

There is no reassociating of any subclients. Only new subclients will use the new DDB partition and the old subclients will stay put. 


As for whether we reference hashes across DDB partitions: we would never do this. You are correct that by not doing this we cause an increase in disk usage, because duplicate hashes are present in both DDBs, but in return there is no dependency between the two DDBs.

The entire goal of horizontal scaling is to improve backup performance by reducing the number of hashes a query has to run through to determine whether a hash is present or not. By reducing the size of each DDB and increasing the value of each hash, we cause an increase in disk usage, but it is very low compared to the performance improvement you gain.
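A minimal sketch of that trade-off, assuming each DDB can be modelled as a plain set of block digests and that SHA-256 stands in for whatever signature the product actually uses (both are assumptions made for illustration; `backup_blocks` is a hypothetical helper):

```python
import hashlib


def backup_blocks(blocks, ddb):
    """Write only blocks whose hash is missing from this one DDB.

    `ddb` is a set of digests standing in for a deduplication database;
    no other DDB is consulted. Returns the number of blocks written.
    """
    written = 0
    for block in blocks:
        digest = hashlib.sha256(block).hexdigest()
        if digest not in ddb:
            ddb.add(digest)
            written += 1
    return written


ddb_old, ddb_new = set(), set()
shared = [b"os-image-block-1", b"os-image-block-2"]

written_old = backup_blocks(shared, ddb_old)    # first copy: both blocks written
written_again = backup_blocks(shared, ddb_old)  # same DDB: fully deduplicated
written_new = backup_blocks(shared, ddb_new)    # other DDB: both written again
duplicates = len(ddb_old & ddb_new)             # hashes present in both DDBs
print(written_old, written_again, written_new, duplicates)
```

Each lookup only searches its own, smaller set, at the cost of storing the shared blocks once per DDB.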

 

Dev feel so strongly that this trade-off is worth it that horizontal scaling is now enabled by default for any new DDB created after FR24.

Kind regards

Albert Williams

Hi @Albert Williams 

According to the Deduplication Building Block Guide (commvault.com), the software does this periodically.
 

My environment always had around 830 active VMs in the backup.
Looking at the VM count of the DDBs, these seem to be more or less equally distributed between a DDB started in March and a second one that was automatically created in May due to the DDB thresholds.

I never manually started the mentioned workflow to move clients from one DDB to another.

If cross-references are not made between DDBs within the same pool (and type), I wonder why the number of unique hashes varies so much between the DDBs.

Until now I understood it the same way you state above:
- only the associated DDB within a pool is queried, which will lead to duplicate hashes within one pool.

But the reported figures suggest otherwise.

That’s why I was asking this question to get a better understanding.

Thanks,
Klaus

Hello @johanningk 


The documentation you have linked seems to have a typo. 

The word “few” should be “new”. I have submitted a change request to the documentation.

 

Kind regards

Albert Williams

Thanks @Albert Williams ,

According to the report, the available subclients are distributed among the two DDBs of the pool.
The vast majority of the VMs (approx. 800 of 870) were already there in March, when the pool was initially started.

In May, a second DDB was created and started to be used.

  • How did the reassociation take place, if it was not done manually?
  • If the subclients are equally assigned to both DDBs, why is the number of unique hashes for the second DDB only about 3% of that of the first DDB?

The JavaGUI reports at least the same number of unique hashes for the DDBs. (I am not sure whether I can see the subclient/client association per DDB in the JavaGUI.)

rgds
Klaus

Hello @johanningk 
 

As I explained before, new subclients will use the new DDB. The old DDB will keep running with the old subclients. No reassociation will take place. No VMs will be moved.

Kind regards

Albert Williams

ok,

what I understand from what you say:

Since May, only about 20-30 new VMs have been added to the backup, while 800+ VMs had already been backed up back in March.
Since I did not manually initiate a reassociation of clients/subclients using the DDB Seeding Workflow
https://store.commvault.com/webconsole/softwarestore/store.do#!/136/672/14345,
the first DDB still has some 800 active VMs assigned and the second one only about 30.
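As a rough plausibility check, the VM counts above can be compared with the "about 3%" unique-hash ratio mentioned earlier in the thread. The sketch assumes that the number of unique hashes scales roughly with the number of VMs on a DDB, which is a crude assumption and not something the thread confirms:

```python
vms_first_ddb = 800                   # approximate figure from the thread
new_vms_low, new_vms_high = 20, 30    # VMs added since May, per the thread

# Ratio of VMs on the second DDB to VMs on the first DDB.
ratio_low = new_vms_low / vms_first_ddb
ratio_high = new_vms_high / vms_first_ddb
print(f"second DDB vs first DDB: {ratio_low:.2%} to {ratio_high:.2%}")
```

Under that assumption the second DDB would hold roughly 2.5% to 3.75% as many VMs as the first, which is in line with the reported unique-hash ratio of about 3%.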

kind regards
Klaus
