Solved

What type of hardware and architecture would be required for sensitive data governance on 4PB of data?


Hemant (Commvault Certified Expert):

Hi, 

I have a customer who would like us to detail the hardware and architecture required to use Activate (Sensitive Data Governance) on the existing backup job data within their Commvault solution.

The solution needs to be able to index and analyse 4 PB of file system application data stored within the archive and backup deduplicated storage pool. It also needs to scale to support a live crawl of the file system data on the servers in the future.

The current architecture guideline has a limit of 160 TB of source data per node. Following that guideline would result in a large Index Server hardware footprint (25 servers for indexing alone).
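For context, this is the back-of-the-envelope maths behind the 25-server figure, as a rough sketch only: it simply divides the 4 PB application size by the 160 TB per-node guideline from the specifications table below, and the exact count depends on whether you treat 1 PB as 1000 TB or 1024 TB.

```python
import math

# Rough Index Server node count for SDG, driven purely by source data size.
# Figures taken from the "Large" dedicated-server specification below.
SOURCE_DATA_PB = 4            # total application size to index and analyse
TB_PER_PB = 1000              # decimal units; use 1024 for binary (gives 26 nodes)
LARGE_NODE_SOURCE_TB = 160    # source data size per Large node

source_data_tb = SOURCE_DATA_PB * TB_PER_PB
nodes_by_capacity = math.ceil(source_data_tb / LARGE_NODE_SOURCE_TB)
print(f"Large nodes needed by data size: {nodes_by_capacity}")  # -> 25
```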

Is there a different architecture guideline to follow for large data sizes?

 

Reference:
https://documentation.commvault.com/11.22/expert/120371_sensitive_data_governance_hardware_specifications_01.html

 

Specifications for Dedicated Servers for File Data

Component                                      | Large      | Medium     | Small
-----------------------------------------------|------------|------------|-----------
Source data size per node*                     | 160 TB     | 80 TB      | 40 TB
Objects per node (estimated)                   | 80 million | 40 million | 20 million
CPU or vCPU                                    | 32 cores   | 16 cores   | 8 cores
RAM                                            | 64 GB      | 32 GB      | 16 GB
Index disk space (SSD class disk recommended)  | 12 TB      | 6 TB       | 3 TB

 


Best answer by Blaine Williams 8 July 2021, 10:41


10 replies

Blaine Williams:

Hi Hemant, 

I will discuss this internally and find out for you.

Hemant:

Thanks Blaine

Blaine Williams:

Hi Hemant, 

Can you give approximate answers to the following:

1 – Do they need to do sensitive data analysis on the complete 4 PB?
2 – What kind of files do they have?
3 – What is the approximate average file size?
 

Hemant:

Hi Blaine,

 

There is archive and backup data in a Commvault deduplicated storage pool. 

It is made up of File System data.

The application size of all the jobs totals over 4 PB.

The files are office-based extensions, with an average size of just over 1 MB per file.

Please let me know if you need anything else; I appreciate the assistance.

Blaine Williams:

Hi Hemant, 

You didn't answer the following, so I have elaborated further; can you please advise?

1 – Do you need to do sensitive data analysis on the complete 4 PB of data?
2 – What kind of files do they have in there, e.g. all documents, email, etc.?


 

Hemant:

Hi Blaine, 

 

1 – Do you need to do sensitive data analysis on the complete 4 PB of data? Yes.

2 – What kind of files do they have in there, e.g. all documents, email, etc.? Office-based extensions, e.g. .doc, .pdf, .docx, etc.

 

Blaine Williams:

@Hemant 

Roughly, there are 400 million objects to index, based on the average file size of about 1 MB.

We would require at most 5 large access nodes (Index Server, Content Analyzer (CA), and gateway/web server). We have a lot of optimizations that help in picking the correct documents for SDG, so we can expect this count to be lower.

Also, we recommend that large environments be configured incrementally so that you can scale accordingly.
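As rough working, a sketch only: it uses the 400 million object estimate above and the 80-million-objects-per-node limit from the Large specification below, and the optimisations mentioned can bring the real count down further.

```python
import math

# Rough access-node estimate for SDG, driven by object count rather than raw data size.
ESTIMATED_OBJECTS = 400_000_000        # ~400 million file objects (estimate above)
OBJECTS_PER_LARGE_NODE = 80_000_000    # file objects per Large node (estimated, table below)

nodes_by_objects = math.ceil(ESTIMATED_OBJECTS / OBJECTS_PER_LARGE_NODE)
print(f"Large access nodes (by object count): {nodes_by_objects}")  # -> 5
```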

Component                                      | Large
-----------------------------------------------|------------
File source data size per node*                | 160 TB
Email source application size                  | 25 TB
File objects per node (estimated)              | 80 million
Email objects per node (estimated)**           | 250 million
CPU                                            | 32 cores
RAM                                            | 64 GB
Index disk space (SSD class disk recommended)  | 12 TB

https://documentation.commvault.com/commvault/v11_sp20/article?p=95225.htm

I hope this helps your planning. 

Hemant:

Thanks @Blaine Williams, much appreciated.

Are the optimisations available on BOL to review?

Mike:

@Hemant, I believe @Blaine Williams is referring to internal code that is optimized for these tasks.

I’ll defer to him if I misunderstood.

Hemant:

Thanks Mike.
