Hi Marcel,
Can you get the performance statistics from db2sbt.log on the DB2 server? This log will be in the Commvault log directory.
Thanks,
Sunil
Hi Sunil,
See the attached log file.
Thanks in advance,
Marcel Huisman
@Marcel Huisman It appears 30% of the backup time was spent waiting on a free buffer (MsgQ).
You may need to increase the number of backup streams and see whether that helps.
You could double the streams from 9 to 18, as the parallelism and buffers are already set to a higher value (40).
Ref link - http://www.redbooks.ibm.com/abstracts/tips1344.html?Open
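For reference, each Commvault backup stream corresponds to one SBT session that DB2 opens, so the equivalent of the above in plain DB2 terms would look roughly like the sketch below. Commvault builds this command itself from the subclient settings; the library path is only a placeholder, and the values simply mirror what was mentioned above:

    db2 "BACKUP DATABASE PRD ONLINE \
         LOAD '/path/to/libDb2Sbt.so' \
         OPEN 18 SESSIONS \
         WITH 40 BUFFERS \
         PARALLELISM 40 \
         INCLUDE LOGS"

OPEN ... SESSIONS maps to the number of backup streams on the Commvault side, while WITH ... BUFFERS and PARALLELISM control how many backup buffers and db2bm readers the database manager uses.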
Thanks Mahender, I just started a new backup with 18 streams.
I also see the following in db2diag.log:
2021-03-14-17.52.37.444843+060 I1331304A1841 LEVEL: Error (OS)
PID : 15598158 TID : 1 PROC : db2exfmt
INSTANCE: db2prd NODE : 000 DB : PRD
APPID : *LOCAL.db2prd.210314165405
HOSTNAME: lpdd001
EDUID : 1
FUNCTION: DB2 Common, OSSe, ossErrorIOAnalysis, probe:100
CALLED : OS, -, open
OSERR : EACCES (13) "The file access permissions do not allow the specified action."
DATA #1 : String, 111 bytes
A total of 2 analysis will be performed :
- User info
- Path access permission
Target file = /db2exfmt.msg
DATA #2 : String, 188 bytes
Real user ID of current process = 1003
Effective user ID of current process = 1003
Real group ID of current process = 2012
Effective group ID of current process = 2012
DATA #3 : String, 41 bytes
current sbrk(0) value: 0x0000000113e01900
DATA #4 : String, 209 bytes
Information of each subdirectory leading up to the first inaccessible one is shown in the format below :
<UID>:<GID>:<permissions> (subdirectories)
Cannot descend any further than the above subdirectory.
CALLSTCK: (Static functions may not be resolved correctly, as they are resolved to the nearest symbol)
[0] 0x090000001F1DBB9C pdOSSeLoggingCallback + 0x448
[1] 0x09000000096772FC oss_log__FP9OSSLogFacUiN32UlN26iPPc + 0x1BC
[2] 0x0900000009677770 ossLogSysRC + 0x70
[3] 0x090000000969FC3C ossErrorIOAnalysis__FCPC21OSSErrorAnalysisParam + 0xD7C
[4] 0x090000001ED256A4 sqloSystemErrorHandler + 0x514
[5] 0x090000001F200864 sqloopenp + 0xE48
[6] 0x090000001E89AAFC sqlagf_open_handle__FP12SQLO_FHANDLEPc + 0x84
[7] 0x090000001E5B9854 sqlaeror__FP5sqlca + 0x108
[8] 0x090000001E5B837C sqlabndx + 0x208
[9] 0x090000001EAF54D4 sqlabndx + 0x2E0
[10] 0x000000010000CD28 main + 0x1CE8
[11] 0x00000001000002F8 __start + 0x70
I'm not sure which directory is inaccessible.
KR,
Marcel Huisman
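A side note on the EACCES above, in case it helps: the target file is /db2exfmt.msg, and sqlabndx in the call stack suggests db2exfmt tried to create its bind message file in the current working directory, which here was / itself, a directory that uid 1003 cannot write to. A quick way to confirm this, as the instance owner, would be something like:

    ls -ld /                          # owner/permissions of the directory the file would land in
    touch /db2exfmt.msg               # should reproduce the EACCES if / is not writable
    cd /tmp && db2exfmt -d PRD -1 -o /tmp/db2exfmt.out    # sketch: run the tool from a writable directory

The db2exfmt options shown are only illustrative; the point is the working directory, not the flags.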
I see the message below in db2diag.log; does it mean my UTIL_HEAP_SZ is too low?
Reserved heap size exceeded for Backup/Restore/Util Heap (UTIL_HEAP_SZ) on node 0,
allocating additional unreserved memory.
Requested block size : 4198399 bytes.
Physical heap size : 205783040 bytes.
Configured heap size : 204800000 bytes.
Unreserved memory used by heap : 983040 bytes.
Unreserved memory left in set : 1972568064 bytes.
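For context on the message above: the configured 204800000 bytes correspond to UTIL_HEAP_SZ = 50000 (4 KB pages), a ~4 MB allocation pushed the heap about 1 MB past that (205783040 vs 204800000 bytes), and DB2 borrowed 983040 bytes of unreserved memory from the set, which still had ~1.97 GB free. So it is informational rather than fatal, but it does suggest the utility heap is sized right at the edge during backups. A sketch of how to check and raise it, assuming the PRD alias from the log and that your level accepts AUTOMATIC for this parameter (Marcel sets it to AUTOMATIC later in the thread):

    db2 get db cfg for PRD | grep -i util_heap                  # current value, in 4 KB pages
    db2 update db cfg for PRD using UTIL_HEAP_SZ 100000         # e.g. double the 50000 pages configured above
    db2 update db cfg for PRD using UTIL_HEAP_SZ AUTOMATIC      # let DB2 grow it on demand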
Hello Marcel,
Can you let us know the status of the backup? Could you also tell us which DB2 version we are working with here?
Thanks,
Raunak Jalan
Hi Raunak,
The backup is still running. I will post the statistics when it has finished. We use DB2 v10.5.
KR,
Marcel Huisman
Statistics from the last backup. I updated UTIL_HEAP_SZ to AUTOMATIC and will monitor the backup today.
2021-03-17-14.05.22.899305+060 E421942A5549 LEVEL: Info
PID : 4718886 TID : 43695 PROC : db2sysc 0
INSTANCE: db2prd NODE : 000 DB : PRD
APPHDL : 0-5590 APPID: *LOCAL.db2prd.210316101954
AUTHID : DB2PRD HOSTNAME: lpdd001
EDUID : 43695 EDUNAME: db2agent (PRD) 0
FUNCTION: DB2 UDB, database utilities, sqluxLogDataStats, probe:395
MESSAGE : Performance statistics
DATA #1 : String, 5052 bytes
Parallelism = 48
Number of buffers = 48
Buffer size = 2068480 (505 4kB pages)
BM# Total I/O MsgQ WaitQ Buffers MBytes
--- -------- -------- -------- -------- -------- --------
000 96316.56 56139.15 38903.69 110.11 1246896 2454309
001 96312.92 1969.70 17285.08 76178.38 268557 528335
002 96305.68 1661.15 13537.22 80280.31 70578 138629
003 96312.73 284.80 7605.03 87609.69 17117 33311
004 96252.02 6151.72 16703.25 72481.56 373321 734569
005 96236.86 647.76 13641.52 81122.37 65934 129421
006 96314.81 196.36 6453.44 88852.73 12940 25088
007 96310.49 119.03 4316.01 91064.60 7484 14348
008 96306.78 48.82 3130.93 92318.12 5670 10776
009 96284.55 112.27 2197.86 93166.51 3106 5717
010 96303.68 667.74 10757.38 84062.47 33911 66344
011 96310.38 15.23 539.28 94948.18 581 756
012 96245.17 44.58 2725.72 92667.31 1967 3486
013 96249.93 471.50 10453.35 84508.70 30106 58874
014 96255.10 24.00 2561.92 92860.24 1802 3161
015 96252.45 13.41 1608.19 93823.33 1066 1711
016 96247.74 17.15 1008.69 94414.66 679 947
017 96252.49 2.14 389.37 95053.65 287 178
018 96225.68 0.54 176.29 95241.77 209 24
019 96206.24 0.59 136.06 95262.47 204 14
020 95823.56 149.12 6710.29 88415.38 10872 21401
021 95853.67 0.53 12.34 95264.89 6 8
022 96315.45 0.68 241.94 95264.86 211 27
023 95975.21 0.62 12.34 95264.82 6 8
024 96316.15 0.64 240.94 95266.42 207 21
025 95904.91 91.46 5084.96 90099.86 5555 10934
026 96228.61 40.88 3456.83 91922.99 3103 5723
027 95950.10 35.75 2496.52 92745.05 1699 3342
028 96313.34 0.50 234.11 95270.38 201 9
029 96007.35 24.94 1828.59 93423.95 1094 2150
030 96306.57 0.58 224.77 95273.88 201 8
031 96316.56 0.55 233.54 95274.02 201 8
032 96308.62 0.53 226.80 95273.84 201 8
033 96306.44 0.28 225.13 95273.72 201 8
034 96308.25 0.43 226.58 95273.68 201 8
035 96301.46 0.27 220.84 95273.31 201 8
036 96312.55 0.00 227.14 95277.79 201 8
037 96307.82 0.38 227.93 95272.00 201 8
038 96252.11 22.94 1327.33 94094.32 887 1357
039 95880.80 0.56 3.84 95273.40 6 8
040 95926.00 0.33 5.18 95272.28 6 8
041 96253.65 4.66 488.09 94953.53 339 280
042 96031.55 2.24 222.66 95052.84 88 170
043 96059.78 0.33 16.26 95261.17 6 8
044 96084.04 0.98 26.12 95250.66 9 15
045 96252.50 0.24 183.68 95261.33 201 8
046 96248.62 0.34 174.69 95266.44 201 8
047 95801.45 0.39 4.25 95273.11 6 8
--- -------- -------- -------- -------- -------- --------
TOT 4617449.62 68969.02 178714.19 4332117.30 2168726 4255570
MC# Total I/O MsgQ WaitQ Buffers MBytes
--- -------- -------- -------- -------- -------- --------
000 95277.43 64668.67 30589.76 0.04 120080 236874
001 95277.43 64659.83 30595.47 3.67 116919 230639
002 95277.32 64529.22 30720.80 3.63 117463 231712
003 95277.41 68329.22 26917.89 3.60 108866 214753
004 95277.95 64615.43 30632.83 3.64 122058 240776
005 95277.49 64674.19 30570.68 3.63 121181 239046
006 95277.45 67338.96 27904.16 3.67 139619 275418
007 95277.39 68281.24 26960.68 3.60 112152 221235
008 95277.40 68443.99 26794.69 3.60 107810 212670
009 95277.58 67631.46 27608.40 3.61 126191 248929
010 95277.70 64729.29 30524.33 3.60 119820 236361
011 95277.40 67777.43 27455.47 3.61 123743 244100
012 95277.95 64955.30 30296.97 3.63 105628 208365
013 95277.70 67547.67 27683.27 3.60 127330 251176
014 95277.79 67298.99 27929.84 3.60 138073 272368
015 95277.42 64641.54 30595.10 3.63 110127 217240
016 95277.79 64658.73 30576.14 3.61 108454 213940
017 96326.16 67573.98 28691.59 4.13 143231 282539
--- -------- -------- -------- -------- -------- --------
TOT 1716044.85 1192355.24 523048.17 62.20 2168745 4278149
2021-03-17-14.05.22.900337+060 E427492A530 LEVEL: Info
PID : 4718886 TID : 43695 PROC : db2sysc 0
INSTANCE: db2prd NODE : 000 DB : PRD
APPHDL : 0-5590 APPID: *LOCAL.db2prd.210316101954
AUTHID : DB2PRD HOSTNAME: lpdd001
EDUID : 43695 EDUNAME: db2agent (PRD) 0
FUNCTION: DB2 UDB, database utilities, sqluxLogDataStats, probe:421
MESSAGE : Estimated size of backup in bytes:
DATA #1 : unsigned integer, 8 bytes
4462864068608
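A quick note for anyone reading these tables: the BM rows are the db2bm buffer manipulators that read the tablespaces, the MC rows are the db2med media controllers that feed the (here Commvault) sessions, and Total / I/O / MsgQ / WaitQ are roughly elapsed seconds, seconds doing I/O, seconds waiting for a free buffer, and seconds waiting for work. A small sketch (assuming the stats are saved to a text file in exactly the layout above) to turn the BM section into percentages:

    awk '/^BM#/ { in_bm = 1; next }                 # start of the BM table
         /^TOT/ { in_bm = 0 }                       # totals line ends it
         in_bm && NF == 7 && $1 != "---" {          # one data row per buffer manipulator
             printf "BM %s: I/O %.0f%%  MsgQ %.0f%%  WaitQ %.0f%%\n", $1, 100*$3/$2, 100*$4/$2, 100*$5/$2
         }' backup_stats.txt

In the run above, for example, BM 000 spent most of its time on I/O and waiting for buffers while most other readers mostly sat in WaitQ, which is what you would expect when one tablespace dominates the data volume.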
Hi All,
In db2sbt.log I also see a lot of messages about waiting for a pipeline buffer; this occurs every few minutes:
15859976 1 03/18 09:06:38 4936695 [PIPELAYER ] ERROR: Waiting for pipeline buf. SndIdx [0/2], RCId [46325766]
8979020 1 03/18 09:07:14 4936695 [PIPELAYER ] ERROR: Waiting for pipeline buf. SndIdx [1/2], RCId [46325764]
20578730 1 03/18 09:08:59 4936695 [PIPELAYER ] ERROR: Waiting for pipeline buf. SndIdx [0/2], RCId [46325763]
11010616 1 03/18 09:09:47 4936695 [PIPELAYER ] ERROR: Waiting for pipeline buf. SndIdx [1/2], RCId [46325760]
14942968 1 03/18 09:09:50 4936695 [PIPELAYER ] ERROR: Waiting for pipeline buf. SndIdx [0/2], RCId [46325756]
20185514 1 03/18 09:10:43 4936695 [PIPELAYER ] ERROR: Waiting for pipeline buf. SndIdx [0/2], RCId [46325757]
See attached db2sbt.log
KR,
Marcel Huisman
@Marcel Huisman , following up on this to ensure you’re getting this resolved. Might be best at this point to open an incident (unless anyone else has some input).
@Marcel Huisman , following up to see how this is looking. I checked your e-mail address for incidents and am not seeing anything new.
Thanks!
Thanks for your reply Mike,
My colleague opened a case through our Commvault reseller ProAct. The backup has now been running for over 30 hours. IBM is also working on the case from the DB2 perspective. I'll try to keep you posted.
Marcel
Thanks, @Marcel Huisman! Can you PM me the case number?
It should not be closed; I received an update today. I think you are referring to a case for another company...
Possibly! I had to search for the incident and might have the wrong one. Do you have the case number to share with me?
@Marcel Huisman looks like I had the right case but the wrong status: 210408-208.
I’ll keep checking.
@Marcel Huisman , I see the case was archived after you set additional key nNumPipelineBuffers on MA to 500.
Did that resolve the issue?
Hi Mike,
Thanks for monitoring the case… And no, the issue is still there, but I think it's more of a DB2 issue because one tablespace contains 75% of the data. We will try to resolve this first.
Marcel
My pleasure! Hoping you get this resolved soon. Keep us posted (though I’ll be checking the case periodically as well)!
@Marcel Huisman , looks like we have a potential solution in the case. Can you confirm that this worked?
-> DB2 backups are still very slow.
-> MA I/O seems to be very slow.
-> Ran the CVNetwork test tool while DB2 backup job 5008559 was running.
-> Client-end total throughput translates to 857 Gb/Hr.
-> Backup job ID 5008559: waited for it to finish, and it seems the buffer allocation is slow.
Solution:
-> Set additional key nNumPipelineBuffers on MA to 500
-> We ran another backup job 5014204.
-> As discussed, please observe the performance now and let us know.
-> The next backup job is still taking the same time.
Looking at the backup logs, it seems the overall DB2 read rate is a bit on the lower end,
which could play a part in the slow network throughput even with the additional settings to increase buffers already in place.
CVPerfMgr.log
=======================================
|*46788258*|*Perf*|5014204| Head duration (Local): [21,April,21 11:54:40 ~ 22,April,21 18:28:45] 30:34:05 (110045)
|*46788258*|*Perf*|5014204| Tail duration (Local): [21,April,21 11:54:40 ~ 22,April,21 18:28:45] 30:34:05 (110045)
|*46788258*|*Perf*|5014204| ---------------------------------------------------------------------------------------------------------------
|*46788258*|*Perf*|5014204| Perf-Counter Time(seconds) Size
|*46788258*|*Perf*|5014204| ---------------------------------------------------------------------------------------------------------------
|*46788258*|*Perf*|5014204|
|*46788258*|*Perf*|5014204| DB2 Backup Agent
|*46788258*|*Perf*|5014204| |_Buffer allocation...................................... 72424 222533656208 [207.25 GB] [10.30 GBPH]
|*46788258*|*Perf*|5014204| |_Pipeline write......................................... 13732 220179357696 [205.06 GB] [53.76 GBPH]
|*46788258*|*Perf*|5014204| |_DB Backup: Read data................................... 23817 220177289216 [205.06 GB] [30.99 GBPH]
|*46788258*|*Perf*|5014204| |_Total Throughput....................................... 110039 220177289216 [205.06 GB] [6.71 GBPH]
|*46788258*|*Perf*|5014204|
|*46788258*|*Perf*|5014204| Reader Pipeline Modules[Client]
|*46788258*|*Perf*|5014204| |_CVA Wait to received data from reader.................. 96310
|*46788258*|*Perf*|5014204| |_CVA Buffer allocation.................................. 13608
|*46788258*|*Perf*|5014204| |_SDT: Receive Data...................................... 109 220893371544 [205.72 GB] [Samples - 5189333] [Avg - 0.000021] [6794.52 GBPH]
|*46788258*|*Perf*|5014204| |_SDT-Head: CRC32 update................................. 230 220893314008 [205.72 GB] [Samples - 5189332] [Avg - 0.000044] [3220.01 GBPH]
|*46788258*|*Perf*|5014204| |_SDT-Head: Network transfer............................. 88094 220893314008 [205.72 GB] [Samples - 5189332] [Avg - 0.016976] [8.41 GBPH]
=======================================
-> You can refer to the DB2 Agent documentation (BOL) for performance tuning: https://documentation.commvault.com/commvault/v11/article?p=14912.htm
-> Please have the DBA verify whether the database can be optimized.
-> Customer informed us that the DBA is still working on optimizing the DB reads.
Hello,
We had a similar situation, with one tablespace containing most of the data, and the backup spent a long time finishing with a single stream on the biggest tablespace.
We solved it by unchecking the option “Use Dedup Device” in the DB2 sub-client properties.
This way the tablespaces are not saved sequentially but are distributed among all the Data Streams used, and all the streams finish at the same time.
Even though the option is meant to be used when the backup target is a deduplicated disk, we didn't notice any drop in the dedup ratio after disabling it.
Now the backup throughput is much better, and we can improve the speed in a predictable way by adding more Data Streams.
The only limiting factor now is the CPU consumed on the client, but we are able to back up at up to 3 TB/h with the right CPU resource allocation.
Lucio
I'm attaching the stats from “db2diag.log”:
MESSAGE : Performance statistics
DATA #1 : String, 2999 bytes
Parallelism = 30
Number of buffers = 30
Buffer size = 4194304 (1024 4kB pages)
BM# Total I/O MsgQ WaitQ Buffers GBytes
--- -------- -------- -------- -------- -------- --------
000 12611.54 8078.80 4083.16 47.60 1425381 5544
001 12565.74 3557.97 4596.61 4207.57 632655 2460
002 12567.96 609.22 1548.78 10326.93 102323 396
003 12600.68 186.98 790.61 11547.07 32466 124
004 12611.54 8.96 389.42 12142.21 2054 6
005 12609.72 1327.96 3494.67 7661.68 237529 922
006 12605.50 15.02 303.52 12215.47 3166 10
007 12578.83 72.39 516.95 11924.86 12229 46
008 12601.70 497.00 1431.25 10582.69 96174 372
009 12604.83 1.73 213.90 12318.10 775 1
010 12572.67 1.77 194.47 12318.37 656 1
011 12603.13 93.46 479.63 11957.43 10027 37
012 12576.57 17.28 331.44 12165.34 4554 16
013 12568.59 0.13 179.66 12331.42 421 0
014 12575.02 71.97 598.77 11840.90 16900 64
015 12581.14 1.96 200.48 12318.11 626 0
016 12599.52 1.63 209.84 12321.15 701 0
017 12602.35 0.19 206.95 12328.84 559 0
018 12562.40 30.10 240.90 12232.47 2116 6
019 12606.03 72.92 424.88 12034.54 7893 28
020 12573.94 0.09 181.74 12331.47 412 0
021 12572.99 12.24 368.63 12135.83 1934 6
022 12571.21 1.20 215.10 12298.86 488 0
023 12575.57 21.28 454.27 12042.06 3257 11
024 12610.96 54.14 773.70 11712.27 10698 40
025 12604.11 4.09 237.60 12293.09 1273 2
026 12610.33 16.20 497.98 12025.13 3575 12
027 12608.78 5.38 297.36 12237.07 1073 2
028 12579.08 23.23 508.31 11988.64 4086 14
029 12570.30 43.02 333.60 12134.39 4737 16
--- -------- -------- -------- -------- -------- --------
TOT 377682.87 14828.45 24304.34 336021.71 2620738 10149
MC# Total I/O MsgQ WaitQ Buffers GBytes
--- -------- -------- -------- -------- -------- --------
000 12335.87 8848.73 3465.52 0.00 480222 1875
001 12620.29 9813.40 2761.61 3.57 563171 2199
002 12336.78 9618.84 2688.78 3.57 556182 2172
003 12336.11 9497.61 2807.67 3.57 543133 2121
004 12335.51 8839.22 3471.95 3.57 478036 1867
--- -------- -------- -------- -------- -------- --------
TOT 61964.58 46617.83 15195.55 14.28 2620744 10237
For this system (10 TB) we use 5 Data Streams.
Lucio
@Marcel Huisman, hope this finds you well! I saw the case was archived after noting you set additional key nNumPipelineBuffers on MA to 500 and did some optimization on the database itself.
Can you share what those db changes were so we can mark this resolved with greater detail?
Thanks!
The issue is still not solved, but I told Commvault we are going to work on the DB, as one of the tablespaces contains about 75% of the data. We'll see if we can rebalance the data. I will also check Lucio's comment and let you know the results.
Marcel
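For anyone following Marcel's rebalancing idea, a small sketch of how to see how the data is spread over the tablespaces before deciding what to move (assuming a plain DB2 10.5 setup and the PRD alias from the logs):

    db2 connect to PRD
    db2 "SELECT VARCHAR(TBSP_NAME, 30) AS TBSP_NAME,
                TBSP_USED_PAGES * TBSP_PAGE_SIZE / 1024 / 1024 AS USED_MB
         FROM TABLE(MON_GET_TABLESPACE('', -2))
         ORDER BY USED_MB DESC"

Since DB2 backup parallelism works at the tablespace level, spreading the largest tables over additional tablespaces (or distributing the work over more data streams as Lucio describes) is what lets the extra streams actually help.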