
We are running DB2 on AIX for our SAP AFS application: an approximately 4 TB database split over 4 logical volumes in an IBM Virtual I/O environment on IBM Power. The backup takes 20 to 26 hours. Below are the statistics from db2diag.log after the backup. I'm not sure how to read them, but I think the WaitQ values are quite high.
Hope someone can shed some light on this.

 

2021-03-13-01.02.01.804357+060 E411673A4491         LEVEL: Info
PID     : 4718886              TID : 61753          PROC : db2sysc 0
INSTANCE: db2prd               NODE : 000           DB   : PRD
APPHDL  : 0-12866              APPID: *LOCAL.db2prd.210311220047
AUTHID  : DB2PRD               HOSTNAME: lpdd001
EDUID   : 61753                EDUNAME: db2agent (PRD) 0
FUNCTION: DB2 UDB, database utilities, sqluxLogDataStats, probe:395
MESSAGE : Performance statistics
DATA #1 : String, 3994 bytes

Parallelism       = 40
Number of buffers = 40
Buffer size       = 2494464 (609 4kB pages)

BM#    Total      I/O      MsgQ      WaitQ      Buffers   MBytes
---  --------  --------  --------  --------    --------  --------
000  93654.87  47851.71  43854.53    455.97     1032152   2450793
001  93638.40   5314.21  28579.66  58494.24      309031    733532
002  93627.64    912.52  23500.10  68044.89       58527    138565
003  93503.11   1678.18  28495.65  62110.43      221485    525641
004  93644.75    622.59  23105.48  68747.61       54465    128969
005  93165.99    346.84  12501.31  79156.65       28099     66315
006  93654.74    283.14  14433.68  77780.60       14188     33310
007  93244.06     36.63   2850.77  89202.64        2564      5703
008  93531.91    446.75  17849.08  74075.73       24954     58870
009  93545.34    101.93   3432.56  88856.77        2574      5715
010  93521.20    206.49  12001.90  80156.74       10726     25088
011  93240.49     87.94   4393.93  87603.56        6192     14320
012  93651.29      3.49   1221.64  91272.79         234       170
013  93124.41    113.58   5153.03  86702.00        9173     21398
014  92908.78     70.68   4522.99  87160.84        4633     10934
015  93645.90      0.69   1099.65  91392.15         173        24
016  93597.10     39.78   2567.56  89835.86        1630      3485
017  93239.47     46.48   3816.44  88221.91        4700     10776
018  93503.39     27.49   2128.51  90193.68        1068      2150
019  93653.45      1.24   1100.73  91398.14         176        31
020  92254.68      0.71     16.21  91395.10           7        14
021  92275.78     16.97    701.34  90693.58         401       947
022  92348.10     18.72   1493.37  89899.50        1333      3161
023  93546.50     14.29   1454.45  90924.33         481       756
024  92418.47      0.56     13.68  91397.77           5         8
025  93625.70     39.40   2874.08  89558.39        1570      3342
026  93492.63      0.48    941.87  91396.94         166         8
027  93614.09      3.70   1198.32  91258.53         281       280
028  93176.60     23.49   1422.08  90576.69         734      1357
029  93611.89      0.95   1057.11  91400.47         170        16
030  93613.95      2.82   1189.21  91268.55         238       178
031  93436.97      1.16    883.56  91398.95         173        24
032  93618.84      0.81   1068.24  91396.41         174        27
033  93236.32      0.52    686.91  91395.71         166         8
034  92442.61      1.19     20.41  91390.43          10        21
035  93626.35      0.46   1069.34  91403.11         166         9
036  93573.80      1.55   1020.01  91398.75         173        24
037  92463.75      0.85      9.93  91401.25           5         8
038  92487.93      1.04     15.74  91395.23           8        16
039  92564.58     12.59   1011.56  90387.66         717      1698
---  --------  --------  --------  --------    --------  --------
TOT  3730726.03  58334.80  254756.87  3372200.75     1793722   4247710

MC#    Total      I/O      MsgQ      WaitQ      Buffers   MBytes
---  --------  --------  --------  --------    --------  --------
000  91412.17  62751.83  28626.82      0.00      178543    424734
001  91407.44  64251.43  27114.16      6.90      202906    482691
002  91407.65  64279.45  27074.99      6.90      205911    489840
003  91408.45  64356.74  27015.30      6.89      206245    490635
004  93677.88  65274.64  28331.31     22.06      205430    488691
005  91407.98  62846.17  28522.14      6.88      154640    367871
006  91410.40  62892.63  28471.70      6.88      148478    353212
007  91397.05  62804.50  28565.73      6.90      162703    387052
008  91411.33  62795.12  28563.84      6.90      173397    412492
009  91407.81  62879.88  28481.98      6.90      155480    369869
---  --------  --------  --------  --------    --------  --------
TOT  916348.20  635132.41  280768.01     77.27     1793733   4267094

Thanks in advance..

 

Hi Marcel,

Can you get the performance statistics from db2sbt.log on the DB2 server? This log will be in the Commvault log directory.

 

Thanks,

Sunil


Hi Sunil,

See attached log file,

 

Thanks in advance,

 

Marcel Huisman

 

 


@Marcel Huisman It appears about 30% of the backup time was spent waiting on a free buffer (MsgQ): on the media-controller side, roughly 280,768 s of the 916,348 s total across the output streams. (In these statistics each BM row is a buffer manipulator reading data, each MC row a media controller writing a stream, and the Total, I/O, MsgQ and WaitQ columns are in seconds.)
You may need to increase the backup streams and see how that helps.
You could double the streams from 9 to 18, as the parallelism and buffers are already set to a higher value (40).

Ref link - http://www.redbooks.ibm.com/abstracts/tips1344.html?Open
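
For reference, if the backup is ever driven from the DB2 side rather than the Commvault scheduler, the stream count corresponds to the open ... sessions clause of the BACKUP command. A minimal sketch, with a placeholder for the Commvault SBT library path and reusing the buffer/parallelism values from the statistics above (buffer is in 4 kB pages, so 609 matches the 2494464-byte buffers shown):

db2 "backup db PRD online load '<path-to-commvault-sbt-library>' open 18 sessions with 40 buffers buffer 609 parallelism 40"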


Thanks Mahender, just started a new backup with 18 streams..

I also see the following in db2diag.log:
 

2021-03-14-17.52.37.444843+060 I1331304A1841        LEVEL: Error (OS)
PID     : 15598158             TID : 1              PROC : db2exfmt
INSTANCE: db2prd               NODE : 000           DB   : PRD
APPID   : *LOCAL.db2prd.210314165405
HOSTNAME: lpdd001
EDUID   : 1
FUNCTION: DB2 Common, OSSe, ossErrorIOAnalysis, probe:100
CALLED  : OS, -, open
OSERR   : EACCES (13) "The file access permissions do not allow the specified action."
DATA #1 : String, 111 bytes
A total of 2 analysis will be performed :
 - User info
 - Path access permission

 Target file = /db2exfmt.msg
DATA #2 : String, 188 bytes
  Real user ID of current process       = 1003
  Effective user ID of current process  = 1003
  Real group ID of current process      = 2012
  Effective group ID of current process = 2012
DATA #3 : String, 41 bytes
current sbrk(0) value: 0x0000000113e01900
DATA #4 : String, 209 bytes
Information of each subdirectory leading up to the first inaccessible one is shown in the format below :
   <UID>:<GID>:<permissions> (subdirectories)


 Cannot descend any further than the above subdirectory.
CALLSTCK: (Static functions may not be resolved correctly, as they are resolved to the nearest symbol)
  [0] 0x090000001F1DBB9C pdOSSeLoggingCallback + 0x448
  [1] 0x09000000096772FC oss_log__FP9OSSLogFacUiN32UlN26iPPc + 0x1BC
  [2] 0x0900000009677770 ossLogSysRC + 0x70
  [3] 0x090000000969FC3C ossErrorIOAnalysis__FCPC21OSSErrorAnalysisParam + 0xD7C
  [4] 0x090000001ED256A4 sqloSystemErrorHandler + 0x514
  [5] 0x090000001F200864 sqloopenp + 0xE48
  [6] 0x090000001E89AAFC sqlagf_open_handle__FP12SQLO_FHANDLEPc + 0x84
  [7] 0x090000001E5B9854 sqlaeror__FP5sqlca + 0x108
  [8] 0x090000001E5B837C sqlabndx + 0x208
  [9] 0x090000001EAF54D4 sqlabndx + 0x2E0
  [10] 0x000000010000CD28 main + 0x1CE8
  [11] 0x00000001000002F8 __start + 0x70
 

Not sure which directory is inaccessible..
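
Maybe the hint is the target file, /db2exfmt.msg: it looks like db2exfmt was started from / and tried to create its message file there, where uid 1003 has no write permission. A quick sketch of what I'll try (output path is just an example):

ls -ld /                                # confirm / is not writable for uid 1003
cd /tmp && db2exfmt -d PRD -1 -o /tmp/db2exfmt.out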

 

KR,

 

Marcel Huisman


I see the message below in db2diag.log; does it mean my UTIL_HEAP_SZ is too low?


Reserved heap size exceeded for Backup/Restore/Util Heap (UTIL_HEAP_SZ) on node 0,
allocating additional unreserved memory.
Requested block size           : 4198399 bytes.
Physical heap size             : 205783040 bytes.
Configured heap size           : 204800000 bytes.
Unreserved memory used by heap : 983040 bytes.
Unreserved memory left in set  : 1972568064 bytes.
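
For reference, the configured heap size in that message, 204800000 bytes, is exactly 204800000 / 4096 = 50000 4 kB pages, so the current setting can be cross-checked with (database name taken from the logs):

db2 get db cfg for PRD | grep -i util_heap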
 


Hello Marcel,

Can you inform the status of the backup? And also could you provide information about the DB2 version we are working with here?

 

Thanks,

Raunak Jalan


Hi Raunak,

The backup is still running; I will post the statistics when it has finished. We use DB2 v10.5.

KR,

Marcel Huisman


Marcel, once the backup is done, can you update the parameter UTIL_HEAP_SZ to AUTOMATIC?

Ref link:   https://www.ibm.com/support/knowledgecenter/SSEPGG_11.1.0/com.ibm.db2.luw.admin.config.doc/doc/r0000330.html
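
A minimal sketch of the change, using the database name from the logs above (util_heap_sz is updatable online, so no instance restart should be needed):

db2 connect to PRD
db2 update db cfg for PRD using UTIL_HEAP_SZ AUTOMATIC
db2 get db cfg for PRD | grep -i util_heap    # verify the new value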


Here are the last backup statistics. I updated util_heap_sz to AUTOMATIC and will monitor the backup today..

 

2021-03-17-14.05.22.899305+060 E421942A5549         LEVEL: Info
PID     : 4718886              TID : 43695          PROC : db2sysc 0
INSTANCE: db2prd               NODE : 000           DB   : PRD
APPHDL  : 0-5590               APPID: *LOCAL.db2prd.210316101954
AUTHID  : DB2PRD               HOSTNAME: lpdd001
EDUID   : 43695                EDUNAME: db2agent (PRD) 0
FUNCTION: DB2 UDB, database utilities, sqluxLogDataStats, probe:395
MESSAGE : Performance statistics
DATA #1 : String, 5052 bytes

Parallelism       = 48
Number of buffers = 48
Buffer size       = 2068480 (505 4kB pages)

BM#    Total      I/O      MsgQ      WaitQ      Buffers   MBytes
---  --------  --------  --------  --------    --------  --------
000  96316.56  56139.15  38903.69    110.11     1246896   2454309
001  96312.92   1969.70  17285.08  76178.38      268557    528335
002  96305.68   1661.15  13537.22  80280.31       70578    138629
003  96312.73    284.80   7605.03  87609.69       17117     33311
004  96252.02   6151.72  16703.25  72481.56      373321    734569
005  96236.86    647.76  13641.52  81122.37       65934    129421
006  96314.81    196.36   6453.44  88852.73       12940     25088
007  96310.49    119.03   4316.01  91064.60        7484     14348
008  96306.78     48.82   3130.93  92318.12        5670     10776
009  96284.55    112.27   2197.86  93166.51        3106      5717
010  96303.68    667.74  10757.38  84062.47       33911     66344
011  96310.38     15.23    539.28  94948.18         581       756
012  96245.17     44.58   2725.72  92667.31        1967      3486
013  96249.93    471.50  10453.35  84508.70       30106     58874
014  96255.10     24.00   2561.92  92860.24        1802      3161
015  96252.45     13.41   1608.19  93823.33        1066      1711
016  96247.74     17.15   1008.69  94414.66         679       947
017  96252.49      2.14    389.37  95053.65         287       178
018  96225.68      0.54    176.29  95241.77         209        24
019  96206.24      0.59    136.06  95262.47         204        14
020  95823.56    149.12   6710.29  88415.38       10872     21401
021  95853.67      0.53     12.34  95264.89           6         8
022  96315.45      0.68    241.94  95264.86         211        27
023  95975.21      0.62     12.34  95264.82           6         8
024  96316.15      0.64    240.94  95266.42         207        21
025  95904.91     91.46   5084.96  90099.86        5555     10934
026  96228.61     40.88   3456.83  91922.99        3103      5723
027  95950.10     35.75   2496.52  92745.05        1699      3342
028  96313.34      0.50    234.11  95270.38         201         9
029  96007.35     24.94   1828.59  93423.95        1094      2150
030  96306.57      0.58    224.77  95273.88         201         8
031  96316.56      0.55    233.54  95274.02         201         8
032  96308.62      0.53    226.80  95273.84         201         8
033  96306.44      0.28    225.13  95273.72         201         8
034  96308.25      0.43    226.58  95273.68         201         8
035  96301.46      0.27    220.84  95273.31         201         8
036  96312.55      0.00    227.14  95277.79         201         8
037  96307.82      0.38    227.93  95272.00         201         8
038  96252.11     22.94   1327.33  94094.32         887      1357
039  95880.80      0.56      3.84  95273.40           6         8
040  95926.00      0.33      5.18  95272.28           6         8
041  96253.65      4.66    488.09  94953.53         339       280
042  96031.55      2.24    222.66  95052.84          88       170
043  96059.78      0.33     16.26  95261.17           6         8
044  96084.04      0.98     26.12  95250.66           9        15
045  96252.50      0.24    183.68  95261.33         201         8
046  96248.62      0.34    174.69  95266.44         201         8
047  95801.45      0.39      4.25  95273.11           6         8
---  --------  --------  --------  --------    --------  --------
TOT  4617449.62  68969.02  178714.19  4332117.30     2168726   4255570

MC#    Total      I/O      MsgQ      WaitQ      Buffers   MBytes
---  --------  --------  --------  --------    --------  --------
000  95277.43  64668.67  30589.76      0.04      120080    236874
001  95277.43  64659.83  30595.47      3.67      116919    230639
002  95277.32  64529.22  30720.80      3.63      117463    231712
003  95277.41  68329.22  26917.89      3.60      108866    214753
004  95277.95  64615.43  30632.83      3.64      122058    240776
005  95277.49  64674.19  30570.68      3.63      121181    239046
006  95277.45  67338.96  27904.16      3.67      139619    275418
007  95277.39  68281.24  26960.68      3.60      112152    221235
008  95277.40  68443.99  26794.69      3.60      107810    212670
009  95277.58  67631.46  27608.40      3.61      126191    248929
010  95277.70  64729.29  30524.33      3.60      119820    236361
011  95277.40  67777.43  27455.47      3.61      123743    244100
012  95277.95  64955.30  30296.97      3.63      105628    208365
013  95277.70  67547.67  27683.27      3.60      127330    251176
014  95277.79  67298.99  27929.84      3.60      138073    272368
015  95277.42  64641.54  30595.10      3.63      110127    217240
016  95277.79  64658.73  30576.14      3.61      108454    213940
017  96326.16  67573.98  28691.59      4.13      143231    282539
---  --------  --------  --------  --------    --------  --------
TOT  1716044.85  1192355.24  523048.17     62.20     2168745   4278149

2021-03-17-14.05.22.900337+060 E427492A530          LEVEL: Info
PID     : 4718886              TID : 43695          PROC : db2sysc 0
INSTANCE: db2prd               NODE : 000           DB   : PRD
APPHDL  : 0-5590               APPID: *LOCAL.db2prd.210316101954
AUTHID  : DB2PRD               HOSTNAME: lpdd001
EDUID   : 43695                EDUNAME: db2agent (PRD) 0
FUNCTION: DB2 UDB, database utilities, sqluxLogDataStats, probe:421
MESSAGE : Estimated size of backup in bytes:
DATA #1 : unsigned integer, 8 bytes
4462864068608
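
(That works out to 4462864068608 / 1024^4 ≈ 4.06 TiB, consistent with the ~4 TB database size mentioned at the start of the thread.)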
 


Hi All,

 

In the db2sbt.log, I also see a lot of messages about waiting for a pipeline buffer; this occurs every few minutes:
15859976 1   03/18 09:06:38 4936695 [PIPELAYER  ] ERROR: Waiting for pipeline buf. SndIdx [0/2], RCId [46325766]
8979020 1   03/18 09:07:14 4936695 [PIPELAYER  ] ERROR: Waiting for pipeline buf. SndIdx [1/2], RCId [46325764]
20578730 1   03/18 09:08:59 4936695 [PIPELAYER  ] ERROR: Waiting for pipeline buf. SndIdx [0/2], RCId [46325763]
11010616 1   03/18 09:09:47 4936695 [PIPELAYER  ] ERROR: Waiting for pipeline buf. SndIdx [1/2], RCId [46325760]
14942968 1   03/18 09:09:50 4936695 [PIPELAYER  ] ERROR: Waiting for pipeline buf. SndIdx [0/2], RCId [46325756]
20185514 1   03/18 09:10:43 4936695 [PIPELAYER  ] ERROR: Waiting for pipeline buf. SndIdx [0/2], RCId [46325757]
 

See attached db2sbt.log

KR,

Marcel Huisman


@Marcel Huisman , following up on this to ensure you’re getting this resolved.  Might be best at this point to open an incident (unless anyone else has some input).


@Marcel Huisman , following up to see how this is looking.  I checked your e-mail address for incidents and am not seeing anything new.

Thanks!


Thanks for your reply Mike,

My colleague opened a case through our Commvault reseller ProAct. The backup now runs over 30 hours. IBM is also working on the case from the DB2 perspective.. I'll try to keep you posted..

Marcel

 


Thanks, @Marcel Huisman! Can you PM me the case number?


It should not be closed; I received an update today.. I think you are referring to a case for another company…


Possibly!  I had to search for the incident and might have the wrong one.  Do you have the case number to share with me?


@Marcel Huisman looks like I had the right case but the wrong status: 210408-208.

I’ll keep checking.


@Marcel Huisman , I see the case was archived after you set additional key nNumPipelineBuffers on MA to 500.

Did that resolve the issue?


Hi Mike,

Thanks for monitoring the case… And no, the issue is still there, but I think it’s more a db2 issue because one tablespace contains 75% of the data. We will try to resolve this first.

Marcel


My pleasure!  Hoping you get this resolved soon.  Keep us posted (though I’ll be checking the case periodically as well)!


@Marcel Huisman , looks like we have a potential solution in the case.  Can you confirm that this worked?

 

Finding Details:
-> DB2 backups are still very slow.
-> MA I/O seems to be very slow.
-> Ran the CVNetwork test tool while DB2 backup job (5008559) was running.
-> Client-end total throughput translates to 857 Gb/hr.
-> Backup job ID 5008559: waited for it to finish, and it seems the buffer allocation is slow.
Solution:
-> Set additional key nNumPipelineBuffers on the MA to 500.
-> We ran another backup job, 5014204.
-> As discussed, please observe the performance now and let us know.
-> The next backup job is still taking the same time.

Looking at the backup logs, it seems the DB2 overall read rate is a bit on the low end, which could play a part in the slow network throughput even with the additional buffer settings already in place.

CVPerfMgr.log
=======================================
|*46788258*|*Perf*|5014204| Head duration (Local): [21,April,21 11:54:40 ~ 22,April,21 18:28:45] 30:34:05 (110045)
|*46788258*|*Perf*|5014204| Tail duration (Local): [21,April,21 11:54:40 ~ 22,April,21 18:28:45] 30:34:05 (110045)
|*46788258*|*Perf*|5014204| ---------------------------------------------------------------------------------------------------------------
|*46788258*|*Perf*|5014204| Perf-Counter                                              Time(seconds)  Size
|*46788258*|*Perf*|5014204| ---------------------------------------------------------------------------------------------------------------
|*46788258*|*Perf*|5014204|
|*46788258*|*Perf*|5014204| DB2 Backup Agent
|*46788258*|*Perf*|5014204| |_Buffer allocation...................................... 72424 222533656208 [207.25 GB] [10.30 GBPH]
|*46788258*|*Perf*|5014204| |_Pipeline write......................................... 13732 220179357696 [205.06 GB] [53.76 GBPH]
|*46788258*|*Perf*|5014204| |_DB Backup: Read data................................... 23817 220177289216 [205.06 GB] [30.99 GBPH]
|*46788258*|*Perf*|5014204| |_Total Throughput....................................... 110039 220177289216 [205.06 GB] [6.71 GBPH]
|*46788258*|*Perf*|5014204|
|*46788258*|*Perf*|5014204| Reader Pipeline Modules[Client]
|*46788258*|*Perf*|5014204| |_CVA Wait to received data from reader.................. 96310
|*46788258*|*Perf*|5014204| |_CVA Buffer allocation.................................. 13608
|*46788258*|*Perf*|5014204| |_SDT: Receive Data...................................... 109 220893371544 [205.72 GB] [Samples - 5189333] [Avg - 0.000021] [6794.52 GBPH]
|*46788258*|*Perf*|5014204| |_SDT-Head: CRC32 update................................. 230 220893314008 [205.72 GB] [Samples - 5189332] [Avg - 0.000044] [3220.01 GBPH]
|*46788258*|*Perf*|5014204| |_SDT-Head: Network transfer............................. 88094 220893314008 [205.72 GB] [Samples - 5189332] [Avg - 0.016976] [8.41 GBPH]
=======================================

-> You can refer to the DB2 Agent BOL for performance tuning: https://documentation.commvault.com/commvault/v11/article?p=14912.htm
-> Please have the DBA verify whether the database can be optimized.
-> The customer informed us the DBA is still working on getting the DB reads optimized.

 

Hello,

We had a similar situation, with one tablespace containing the major part of the data; the backup spent a long time finishing with a single stream on the biggest tablespace.

We solved it by unticking the option “Use Dedup Device” in the DB2 subclient properties.

This way the tablespaces are not saved sequentially but are distributed among all the data streams used, and all the streams finish at the same time.

Even though the option is meant to be used when the backup target is a deduplicated disk, we didn't notice any drop in the dedup ratio after disabling it.

Now the backup throughput is much better, and we can improve the speed in a predictable way by adding more data streams.

The only limit now is the CPU consumed on the client, but we are able to back up at up to 3 TB/h with the right CPU resource allocation.

 

Lucio


I attach the stats from “db2diag.log”:

 

MESSAGE : Performance statistics
DATA #1 : String, 2999 bytes

Parallelism       = 30
Number of buffers = 30
Buffer size       = 4194304 (1024 4kB pages)
                                                                 
BM#    Total      I/O      MsgQ      WaitQ      Buffers   GBytes 
---  --------  --------  --------  --------    --------  --------
000  12611.54   8078.80   4083.16     47.60     1425381      5544
001  12565.74   3557.97   4596.61   4207.57      632655      2460
002  12567.96    609.22   1548.78  10326.93      102323       396
003  12600.68    186.98    790.61  11547.07       32466       124
004  12611.54      8.96    389.42  12142.21        2054         6
005  12609.72   1327.96   3494.67   7661.68      237529       922
006  12605.50     15.02    303.52  12215.47        3166        10
007  12578.83     72.39    516.95  11924.86       12229        46
008  12601.70    497.00   1431.25  10582.69       96174       372
009  12604.83      1.73    213.90  12318.10         775         1
010  12572.67      1.77    194.47  12318.37         656         1
011  12603.13     93.46    479.63  11957.43       10027        37
012  12576.57     17.28    331.44  12165.34        4554        16
013  12568.59      0.13    179.66  12331.42         421         0
014  12575.02     71.97    598.77  11840.90       16900        64
015  12581.14      1.96    200.48  12318.11         626         0
016  12599.52      1.63    209.84  12321.15         701         0
017  12602.35      0.19    206.95  12328.84         559         0
018  12562.40     30.10    240.90  12232.47        2116         6
019  12606.03     72.92    424.88  12034.54        7893        28
020  12573.94      0.09    181.74  12331.47         412         0
021  12572.99     12.24    368.63  12135.83        1934         6
022  12571.21      1.20    215.10  12298.86         488         0
023  12575.57     21.28    454.27  12042.06        3257        11
024  12610.96     54.14    773.70  11712.27       10698        40
025  12604.11      4.09    237.60  12293.09        1273         2
026  12610.33     16.20    497.98  12025.13        3575        12
027  12608.78      5.38    297.36  12237.07        1073         2
028  12579.08     23.23    508.31  11988.64        4086        14
029  12570.30     43.02    333.60  12134.39        4737        16
---  --------  --------  --------  --------    --------  --------
TOT  377682.87  14828.45  24304.34  336021.71     2620738     10149

MC#    Total      I/O      MsgQ      WaitQ      Buffers   GBytes 
---  --------  --------  --------  --------    --------  --------
000  12335.87   8848.73   3465.52      0.00      480222      1875
001  12620.29   9813.40   2761.61      3.57      563171      2199
002  12336.78   9618.84   2688.78      3.57      556182      2172
003  12336.11   9497.61   2807.67      3.57      543133      2121
004  12335.51   8839.22   3471.95      3.57      478036      1867
---  --------  --------  --------  --------    --------  --------
TOT  61964.58  46617.83  15195.55     14.28     2620744     10237

 

For this system (10 TB) we use 5 Data Streams.

 

Lucio

 


@Marcel Huisman, hope this finds you well!  I saw the case was archived after noting you set the additional key nNumPipelineBuffers on the MA to 500 and did some optimization on the database itself.

Can you share what those db changes were so we can mark this resolved with greater detail?

Thanks!


The issue is still not solved, but I told Commvault we are going to work on the DB first, as one of the tablespaces contains about 75% of the data. We'll see if we can rebalance it. I will also check Lucio's suggestion; there is a sketch below of how I plan to measure the skew. I'll let you know the results.
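
(A sketch against MON_GET_TABLESPACE, which is available on v10.5; the GB figure just multiplies used pages by each tablespace's page size.)

db2 "SELECT SUBSTR(TBSP_NAME, 1, 30) AS TBSP_NAME,
            (TBSP_USED_PAGES * TBSP_PAGE_SIZE) / 1024 / 1024 / 1024 AS USED_GB
      FROM TABLE(MON_GET_TABLESPACE('', -2))
      ORDER BY USED_GB DESC"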

Marcel

