Question

REST API and job queuing


Badge +1

Hello everyone,

Recently, we started testing REST APIs with our Commvault test environment. The tests were successful with fewer than 5 subclients running through the REST API.
However, as soon as I increase the load to 50, 100, or 200 simultaneous subclients, I notice that the CommServe is very slow to register the jobs. The subclient jobs appear 6 by 6, and it takes the CommServe 15 minutes to pick up 100 jobs.

I have done a lot of research, but I have not found anything in the Commvault documentation that could explain this anomaly.
Has anyone using the REST API on their Commvault environment noticed this type of behavior?

For information:

  • CommServe version: 11.26.18
  • Python script to call the REST API
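For what it's worth, here is a minimal sketch of how such a load test can be structured in Python so that the requests are fired concurrently and client-side serialization is ruled out. The `trigger_backup` helper is a hypothetical stand-in for the real authenticated REST call (the actual endpoint and payload are not shown in this thread):

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

def trigger_backup(subclient_id: int) -> dict:
    """Hypothetical stand-in for the POST that starts a subclient
    backup job; a real call would send the authenticated REST request
    and return the job ID parsed from the response."""
    time.sleep(0.01)  # simulate the network round-trip
    return {"subclient": subclient_id, "jobId": 1000 + subclient_id}

def launch_all(subclient_ids, max_workers=20):
    """Fire all backup requests concurrently and time how long it
    takes until every job has been accepted."""
    start = time.monotonic()
    jobs = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(trigger_backup, s) for s in subclient_ids]
        for fut in as_completed(futures):
            jobs.append(fut.result())
    return jobs, time.monotonic() - start
```

If the requests are fired this way and the jobs still trickle in 6 by 6, the bottleneck is on the CommServe side rather than in the client script.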


6 replies

Userlevel 7
Badge +23

@sauvegarde, do you mind sharing the hardware specs for the CommServe? Also, how many other jobs are running at that time?

This will help us narrow things down.

Badge +1

Hello Mike,

Below are the hardware specifications of the CommServe:
OS: Windows Server 2012 R2
CPU: 4 CPUs
RAM: 24 GB

To describe my complete test environment: I also use a VSA proxy and 4 MediaAgents.
Proxy VSA: Windows Server 2012 R2

MA: Red Hat 7.6
 

During my test, no other Commvault job is running.
This behavior with the REST API is strange, because triggering 50 subclients simultaneously should not be a problem.

Userlevel 6
Badge +15

Hi @sauvegarde 😉

Just to make sure we all understand (or mostly myself), can you confirm or deny each statement:

  • You use Python REST API scripts to initiate/control backup jobs?
  • Do you initiate this from the CommServe itself, from each client, or from elsewhere?
  • When you have 5 jobs in parallel, it works fine?
  • When you try to run many more jobs in parallel, it becomes sluggish?

 

Badge +1

Hello @Laurent,

I will try to answer each point as best I can.

  • Are you using Python REST API scripts to launch/control backup tasks?

To clarify this point, the script used is based on the following examples: https://documentation.commvault.com/v11/essential/45532_samples_for_developer_sdk_for_python.html#polling-job-status

The script triggers a job according to the parameters given; to perform my tests, it is called X times with different parameters to run the backups.

  • Do you initiate this from the CommServe itself, from each client, or from elsewhere?

The script is triggered from an external server (a scheduler server). The goal of my test is to replace the QCommand backups with REST API backups.
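One thing that may be worth checking with this setup: if the scheduler executes the script once per subclient, every invocation performs its own login round-trip before it can submit the job, which adds up at 50+ invocations. A sketch of the difference, with a fake session class standing in for the real authenticated REST session (all names here are hypothetical):

```python
class FakeCommcellSession:
    """Hypothetical stand-in for an authenticated REST session; the
    class-level counter tracks how many logins were performed."""
    login_count = 0

    def __init__(self):
        FakeCommcellSession.login_count += 1  # one login per session

    def start_backup(self, subclient):
        return {"subclient": subclient, "jobId": hash(subclient) % 10_000}

def run_per_invocation(subclients):
    """Pattern A: the script is executed once per subclient, so each
    run performs its own login."""
    return [FakeCommcellSession().start_backup(s) for s in subclients]

def run_shared_session(subclients):
    """Pattern B: one process logs in once and reuses the token for
    the whole batch."""
    session = FakeCommcellSession()
    return [session.start_backup(s) for s in subclients]
```

Pattern B keeps one authentication token for the whole batch; whether this explains the intake rate depends on how the scheduler actually invokes the script.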
 

  • When you have 5 jobs in parallel, it works fine?

It works great !

  • When you try to run many more jobs in parallel, it becomes sluggish?

I defined 3 stages to validate the correct operation of the REST API with Python:

  • 50 concurrent subclients
  • 100 concurrent subclients
  • 200 concurrent subclients

When I trigger 50 concurrent subclients with the REST API, I notice from my CommServe that the jobs appear 6 by 6, and it takes 7 minutes for all 50 backups to be running on the CommServe.
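A quick back-of-envelope check on the numbers reported in this thread (50 jobs in 7 minutes here, 100 jobs in 15 minutes in the opening post) suggests the intake rate is roughly constant at about 7 jobs per minute, which would point at a serialization point on the CommServe side rather than load that grows with batch size:

```python
def jobs_per_minute(jobs: int, minutes: float) -> float:
    """Observed intake rate: jobs registered per minute."""
    return jobs / minutes

# Figures reported in this thread:
rate_50 = jobs_per_minute(50, 7)     # 50 subclients took ~7 min
rate_100 = jobs_per_minute(100, 15)  # 100 subclients took ~15 min
```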


That's why I'd like some feedback from anyone using the REST API with Python who has seen a similar anomaly. :)

Userlevel 6
Badge +15

Thanks for your explanations. I now understand your concern a bit better.

In my environment, we also have an external scheduler, so we needed to interact with the CommServe to initiate and control jobs.

Our scripting experts had created custom Perl scripts to generate REST API queries and interact properly: initiating backup jobs for subclients, polling their status, and so on.

As most of the backups were performed during the night, we began to suffer from issues in high-load timeslots, with many backups still running (and so being checked for their status every 60s) and new jobs being added. We had a lot of HTTP 500 and 503 errors…

The webserver logs were full of such queries at peak activity times.
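A pattern that can help in that situation (sketched here from the description above, not from the actual Perl scripts): add jitter to the 60-second polls so the pollers don't all hit the webserver at the same instant, and back off exponentially when it returns 500/503. The `check_status` callable is hypothetical; a real one would issue the authenticated job-status request and raise on an error response:

```python
import random
import time

def poll_with_backoff(check_status, job_id, base=60, cap=600,
                      max_errors=6, sleep=time.sleep):
    """Poll a job until it reaches a final state, spreading the load:
    each wait gets up to 10% random jitter, and the interval doubles
    (capped) after every webserver error."""
    errors, delay = 0, base
    while True:
        try:
            state = check_status(job_id)
            errors, delay = 0, base  # healthy again: reset the interval
        except ConnectionError:
            errors += 1
            if errors > max_errors:
                raise  # webserver persistently unavailable
            delay = min(cap, delay * 2)  # exponential backoff
            state = None
        if state in ("Completed", "Failed", "Killed"):
            return state
        sleep(delay + random.uniform(0, delay * 0.1))  # jitter
```

The `sleep` parameter is injectable only to make the sketch easy to exercise; in production it would stay as `time.sleep`.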

We then tried to use Commvault’s internal scheduler to offload the webserver, and make sure we really schedule our backups (and all servers _are_ protected, otherwise they’ll be reported in SLAs/reports..).

 

The strange detail you're pointing out is that the jobs appear '6 by 6', and not 5 or 10…

Well, I'll follow this topic to see where this could be coming from…

Userlevel 7
Badge +23

Appreciate the details (and the help from @Laurent as always)!

Might be worth seeing if anyone else has input, though I'm leaning towards creating a support case. I would not expect such a severe slowdown. Granted, there is more activity from the overhead, but enough to cause the slowness you are seeing? I don't believe so.

If you end up creating a support case, share the incident number here so I can track it.
